Most ecologists agree that synthesis is necessary for addressing new ecological questions, yet producing the desired data products from a diverse array of complex datasets in a robust and reproducible way is challenging. Teams of researchers from the Harvard Forest Long-Term Ecological Research site (HFR) and the LTER Network Office (LNO) have now advanced the design and construction of scientifically rigorous online information systems that will directly and significantly enhance ecological synthesis.
Figure 1 - Provenance Aware SynThesis Architecture (PASTA) diagram.
The LNO team responsible for the design and development of the Network Information
System (NIS) has designed and prototyped a data warehouse framework to support
ecological synthesis, building on the successful deployment of the Ecological
Metadata Language (EML), the Metacat repository, and Metacat Harvester. This framework,
code-named PASTA for Provenance Aware SynThesis Architecture (see Figure 1),
is (1) efficient because it builds on existing investments and experiences,
(2) integrative because it adopts standard interfaces and approaches, and (3)
innovative because it incorporates data provenance and data quality into the
design. The PASTA data warehousing architecture has been prototyped against
the dynamic part of the Trends project as a case study and demonstrated to
scientists on the Trends editorial committee. PASTA has received positive reviews
from the Network Information System Advisory Committee (NISAC), members of the
Science Environment for Ecological Knowledge (SEEK) development team, the Trends
technical committee, and the LTER IM committee. According to Mark Servilla,
lead NIS developer, “The project draws upon current and advanced computing
science in the management of data provenance and data quality metrics….
Early prototyping will pay off and accelerate development by giving us material
with which to solicit partners and proposals.”
Figure 2 - Data flow graph of the processes used for the analysis
of eddy covariance and carbon flux.
While there is much to be done to bring PASTA into production, a major milestone
was reached recently in developing and testing the EML Parser/Loader. The
EML Parser/Loader, developed in partnership with SEEK and the National
Center for Ecological Analysis and Synthesis (NCEAS), reads an EML document
and uses the information there to retrieve and load a dataset into a relational
database management system. In early tests, datasets from the Georgia Coastal
Ecosystem (GCE) LTER site have been successfully extracted, loaded, and
queried. This success is a major step toward automating part of the
synthesis process.
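The parse-then-load idea can be illustrated with a minimal sketch. It is not the EML Parser/Loader itself: the EML fragment below is heavily simplified (real EML documents are namespaced and far richer), the table name, column names, and data values are invented for illustration, the data are supplied inline rather than retrieved from the location named in the metadata, and SQLite stands in for whatever relational database the real tool targets.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified EML fragment (illustration only).
EML_DOC = """
<eml>
  <dataset>
    <dataTable>
      <entityName>marsh_temperature</entityName>
      <attributeList>
        <attribute><attributeName>site</attributeName><storageType>string</storageType></attribute>
        <attribute><attributeName>year</attributeName><storageType>integer</storageType></attribute>
        <attribute><attributeName>temp_c</attributeName><storageType>float</storageType></attribute>
      </attributeList>
    </dataTable>
  </dataset>
</eml>
"""

# Invented stand-in for the data file that the metadata would describe.
DATA_CSV = "site,year,temp_c\nGCE1,2005,21.5\nGCE2,2005,22.1\nGCE1,2006,20.9\n"

# Assumed mapping from the storage types used above to SQLite column types.
SQL_TYPES = {"string": "TEXT", "integer": "INTEGER", "float": "REAL"}
CASTS = {"TEXT": str, "INTEGER": int, "REAL": float}


def parse_eml(eml_text):
    """Extract the table name and (column, SQL type) pairs from the metadata."""
    root = ET.fromstring(eml_text)
    table = root.find("./dataset/dataTable")
    name = table.findtext("entityName")
    columns = [
        (a.findtext("attributeName"), SQL_TYPES.get(a.findtext("storageType"), "TEXT"))
        for a in table.findall("./attributeList/attribute")
    ]
    return name, columns


def load_table(conn, name, columns, csv_text):
    """Create a table from the parsed metadata and load the CSV rows into it."""
    col_defs = ", ".join(f'"{col}" {sql_type}' for col, sql_type in columns)
    conn.execute(f'CREATE TABLE "{name}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [tuple(CASTS[t](rec[c]) for c, t in columns) for rec in reader]
    conn.executemany(f'INSERT INTO "{name}" VALUES ({placeholders})', rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
table_name, table_columns = parse_eml(EML_DOC)
loaded = load_table(conn, table_name, table_columns, DATA_CSV)

# Once loaded, the dataset can be queried like any other relational table.
mean_temp = conn.execute(
    f'SELECT AVG(temp_c) FROM "{table_name}" WHERE site = ?', ("GCE1",)
).fetchone()[0]
print(table_name, loaded, mean_temp)
```

The point of the design is that the metadata, not hand-written code, determines the table structure, so a new dataset with a valid description can be loaded without writing a new loader.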
One major hurdle in the deployment of PASTA or any architecture that recognizes
provenance is defining the mechanism for representing data lineage with complete
and precise definitions of the scientific processes that are used to produce
scientific datasets. Enter the researchers from Harvard Forest LTER and their
partners at the University of Massachusetts. Through a concept called “analytic
webs” (first reported as an update to Network News in July, www.lternet.edu/news/Article98.html),
analytic and synthetic processes can be described accurately through a coordinated
set of directed graphs describing data flow, dataset derivation, and data processes
(Figure 2). The precise and formal definitions of these graphs present a promising
development in describing data provenance in a robust and reproducible way
that can work in harmony with the LTER Network investments in EML. The team
comprising A.M. Ellison, L.J. Osterweil, L. Clarke, J.L. Hadley, A. Wise,
E. Boose, D.R. Foster, A. Hanson, D. Jensen, P. Kuzeja, E. Riseman, and H.
Schultz, whose work was supported by the National Science Foundation, also
developed a prototype software tool called SciWalker that is used to create
the analytic webs and synthesize the data. The researchers successfully applied
analytic webs to the analysis and synthesis of forest carbon-dioxide exchange
data from eddy flux towers located at Harvard Forest’s Prospect Hill.
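To make the idea of a dataset-derivation graph concrete, here is a small sketch of how lineage could be recorded and queried. It is an assumption-laden illustration, not the analytic web formalism or SciWalker's internal representation: the dataset and process names are hypothetical, and a plain dictionary stands in for the formally defined graphs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical derivation graph: derived dataset -> (process, input datasets).
DERIVATIONS = {
    "filtered_flux": ("quality_filter", ["raw_flux"]),
    "gap_filled_flux": ("gap_fill", ["filtered_flux", "meteorology"]),
    "annual_carbon_exchange": ("aggregate_by_year", ["gap_filled_flux"]),
}


def lineage(dataset, derivations):
    """Return every upstream dataset that contributed to `dataset`."""
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        _, inputs = derivations.get(current, (None, []))
        for parent in inputs:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def execution_order(derivations):
    """Order the derived datasets so that each one follows all of its inputs."""
    graph = {out: set(inputs) for out, (_, inputs) in derivations.items()}
    order = TopologicalSorter(graph).static_order()
    return [name for name in order if name in derivations]


upstream = lineage("annual_carbon_exchange", DERIVATIONS)
steps = execution_order(DERIVATIONS)
print(sorted(upstream))
print(steps)
```

Because every derived dataset names the process and inputs that produced it, the provenance of a final product can be traced back to the raw observations, and the whole chain can be rerun in a valid order.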
These independent developments by researchers in the LTER Network and their
partners fit together like the pieces of a puzzle to form a promising picture
of the future. Watch this space for continued reporting on advances in ecological
informatics.
Further reading
Ellison, A.M., L.J. Osterweil, L. Clarke, J.L. Hadley, A. Wise, E. Boose,
D.R. Foster, A. Hanson, D. Jensen, P. Kuzeja, E. Riseman, and H. Schultz. 2006.
Analytic Webs Support the Synthesis of Ecological Data Sets. Ecology, 87(6):
1345-1358.
Servilla, M.S., J.W. Brunt, I. Sangil, and D. Costa. 2006. PASTA: A Network-level
Architecture Design for Generating Synthetic Data Products in the LTER Network.
Databits – Fall 2006. Long Term Ecological Research Network.
By James W. Brunt, Associate Director for Information Management, LNO