Report from the workshop entitled “Information Technology for the Decade of Synthesis: Data Synthesis in the Present and Future”.
2003 All Scientists Meeting.
Todd Ackerman
Thirty-four scientists, students, and data managers participated in this workshop. Of those in attendance, 50% were LTER IMs or GIS Coordinators, 10% were ILTER, 10% were students, 10% were collaborators, 15% were LTER PIs, and the remaining 5% were none of the above. The workshop focused on tools currently being developed to integrate diverse data sets from individual site-based research programs in order to foster cross-site studies. Four approaches to software tools for data integration were presented, with discussion following. The presentations were: “Hand-crafted Data Management: IT tools built to last”, by Greg Newman, Natural Resource Ecology Laboratory at Colorado State University; “Software tools for automated metadata creation, metadata-mediated data processing and quality control analysis – real-time processing solutions for real-time data”, by Wade Sheldon, Georgia Coastal Ecosystem LTER; “Tools for creating and executing scientific workflows”, by Chad Berkley, NCEAS; and “Southwest Environmental Information Network: Using EML to mediate data discovery, access, and visualization”, by Peter McCartney, Central Arizona Phoenix LTER.
Each presenter described their approach to developing new tools for scientific synthesis and addressed the following four questions about the development process:
1) What resources were available before development of the tool?
2) What was the need for the tool?
3) How much time was invested in order to develop the tool?
4) What level of scalability/portability does it have?
Greg Newman focused on the design of a tool that was general and simple enough to be used with disparate datasets, with particular attention to the end-user experience.
Wade Sheldon demonstrated a tool he developed from scratch, which automatically generates metadata documenting every data processing step. The tool was developed in MATLAB, which is proprietary software and is required to run the tool.
Chad Berkley discussed the history of NCEAS’s tool, Monarch, and explained how and why the group came to use Ptolemy, a program developed elsewhere. They have developed components for the already mature Ptolemy that allow it to ingest a data set whose metadata is provided as a well-formed EML document. Ptolemy uses a workflow model in which “actors”, or workflow steps, process the data while documenting each processing step, thus generating valuable metadata.
Peter McCartney presented the search application included in SEINet, which locates data and literature resources that have been published as Ecological Metadata Language (EML) data sources using open protocols. In this case, the metadata is what allows the tool to function.
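The idea of workflow steps that document themselves can be illustrated with a minimal Python sketch. This is not Ptolemy's or Monarch's API; the Actor and Workflow classes, the two example steps, and the fields recorded in the log are all hypothetical names chosen for illustration. Each step processes the data and appends a record of what it did, so the processing history accumulates as metadata without extra effort from the user.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Actor:
    """One workflow step: a name plus a function applied to the data (hypothetical)."""
    name: str
    func: Callable[[list], list]


@dataclass
class Workflow:
    """Runs actors in order and logs each step as processing metadata (hypothetical)."""
    actors: List[Actor]
    log: List[dict] = field(default_factory=list)

    def run(self, data: list) -> list:
        for step, actor in enumerate(self.actors, start=1):
            result = actor.func(data)
            # The log entry is written by the workflow, not by the user.
            self.log.append({
                "step": step,
                "actor": actor.name,
                "rows_in": len(data),
                "rows_out": len(result),
            })
            data = result
        return data


def drop_missing(values: List[Optional[float]]) -> List[float]:
    """Remove missing observations."""
    return [v for v in values if v is not None]


def fahrenheit_to_celsius(values: List[float]) -> List[float]:
    """Convert temperatures from Fahrenheit to Celsius."""
    return [round((v - 32) * 5 / 9, 2) for v in values]


workflow = Workflow([
    Actor("drop_missing", drop_missing),
    Actor("fahrenheit_to_celsius", fahrenheit_to_celsius),
])
cleaned = workflow.run([32.0, None, 212.0])

for entry in workflow.log:
    print(entry)
```

After the run, workflow.log holds one entry per step, which could in principle be written out alongside the data as part of its metadata.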
The purpose of the workshop was to foster interaction between the Information Managers and the Investigators in order to uncover what needs to go into the development of tools for scientific synthesis, as well as what has gone into tools that have already been developed.
Metadata appeared to be the foundation of each of these tools. For any tool to work on a data set, the data set must include the ever-elusive data about the data.
The discussion segment of the workshop centered largely on metadata. It began with the efforts the network is making to improve its ability to perform network synthesis. The tiered information management framework was discussed, which led to metadata and the importance of machine-parsable metadata for aggregating data from individual network sites. Participants also addressed the need to develop a plan that allows sites to scale up from the current structure to the tiered model. Once all of the sites have complete, parsable metadata, tools for network data aggregation can be written more easily. The metadata must be able to describe the data to the tools before any processing can be performed.
We then touched on where priorities lie for metadata development. Quality metadata is the basis of all cross-site analysis. Each of the tools presented had some form of automatic logging of processing steps as metadata. It was suggested that NSF needs to be made aware that creating metadata is a painstaking and laborious task that needs funding.
When the few researchers in attendance were asked, “What do researchers need?”, it was suggested that researchers need to be empowered to manage metadata in a time-efficient manner (as seen in Wade Sheldon’s GCE Data Toolbox). An automated log for data processing, i.e., metadata that writes itself, was suggested, and most of the tools presented already provided one.
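As a rough illustration of why machine-parsable metadata matters for aggregation, the Python sketch below combines air temperature records from two sites that use different column names and units. The XML shown is a deliberately simplified, made-up schema, not real EML, and the site names, column names, and the aggregate function are assumptions for illustration only. The point is that the tool never hard-codes a site's layout; it reads the column name and unit from each site's metadata.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical metadata documents (NOT real EML).
SITE_A_META = """<dataset site="A">
  <attribute column="temp_f" variable="air_temperature" unit="fahrenheit"/>
</dataset>"""

SITE_B_META = """<dataset site="B">
  <attribute column="t_air" variable="air_temperature" unit="celsius"/>
</dataset>"""

# Unit conversions to a common unit, keyed by the unit named in the metadata.
TO_CELSIUS = {
    "celsius": lambda v: v,
    "fahrenheit": lambda v: (v - 32) * 5 / 9,
}


def aggregate(variable, sources):
    """Combine one variable across sites, using metadata to find and convert it."""
    combined = []
    for meta_xml, rows in sources:
        root = ET.fromstring(meta_xml)
        site = root.get("site")
        for attr in root.findall("attribute"):
            if attr.get("variable") != variable:
                continue
            column = attr.get("column")
            convert = TO_CELSIUS[attr.get("unit")]
            combined.extend((site, convert(row[column])) for row in rows)
    return combined


site_a_rows = [{"temp_f": 50.0}]
site_b_rows = [{"t_air": 12.5}]

merged = aggregate(
    "air_temperature",
    [(SITE_A_META, site_a_rows), (SITE_B_META, site_b_rows)],
)
print(merged)
```

A site whose metadata is missing or incomplete cannot be included by such a tool, which is the practical reason complete, parsable metadata has to come first.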
The discussion then started to head in the direction of ontologies, brought up by Michael Mirtl from
Participants:
Greg Newman, John Anderson, Michael Mirtl, Barbara Benson, Emilio Mayorga, Dave Balsiger, Suzanne Sippel, Scott Miller, Jamie Hollingsworth, Karl Kaufmann, Xiaoming Qin, Geoff Poole, Doug Moore, David White, Michael Right, Phil Bayer, Mike Rugge, Don Henshaw, Corinna Gries, Suzanne Remillard, Dylan Keon, Barbara Nolen, Kirsten Schwarz, Chen Chaur-Tauhn, Jeff Hepinsfall, Karen Baker, Chad Delany, Steven Paton, Norbert Kraeuchi, Hsing-Juh Lin, Peter McCartney, Wade Sheldon, Chad Berkley, Todd Ackerman