Mining Spatio-Temporal Data
Earth Systems Science
Tony Fountain, San Diego Supercomputer Center
Last modified: 30-AUG-04


Contact Information:
Tony Fountain
Phone: 858.534.8374
Fax: 858.534.5152
fountain@sdsc.edu


The NPACI ESS Mining Spatio-Temporal Data project develops cyberinfrastructure for data management and analytical services for ecological data. The long-term goal is an integrated knowledge management system that uses extensible and scalable infrastructure to provide data and analytical services to the broad LTER Network community of field biologists and associated collaborators working at these sites (over 1500 individuals). The tools give researchers access to complex datasets, in particular the LTER Network hyperspectral remote sensing data, that they would otherwise not be able to use. The hyperspectral data collection is large and complex, and it raises challenges for traditional management and analysis approaches. The current system consists of several components: an SRB-based collection management system for remote sensing imagery; analytical pipelines (workflows) for high-throughput processing of remote sensing data; mining algorithms for analysis, band selection and classification; and dynamic internet mapping of analysis results.

The main goal for FY2004 is to transition the tools and infrastructure so that resources are available to related projects and the broader scientific community. This includes enhancing the routines for hyperspectral data analysis, installing them in the on-line Spatial Data Workbench for experimentation and testing, and migrating the analysis pipelines to the Ptolemy/Kepler workflow system (used by the SEEK and GEON projects). During FY04 we have extended our analytical toolkit (SKIDLkit, http://daks.sdsc.edu/skidl/skidldownloads.html), which was developed for analysis of the hyperspectral data, and automated key processing tasks by embedding SKIDLkit routines in the Kepler workflow system (http://kepler.ecoinformatics.org/).

The analysis of hyperspectral data is a complex task, requiring compute-intensive statistical and machine-learning approaches to pattern discovery and model development. We have focused on algorithms that extend traditional statistical modeling techniques by shifting the focus from the correct parameterization of statistical models with well-known analytical properties to the generation and tuning of models closely adapted to the data. These models are then checked for overfitting using compute-intensive cross-validation techniques.
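As an illustration of how cross-validation guards against overfitting, the sketch below estimates accuracy by holding out each sample in turn and testing a model trained on the rest. The nearest-centroid classifier, the function names, and the toy two-band data are assumptions chosen for brevity; they are not SKIDLkit code.

```python
# Leave-one-out cross-validation with a nearest-centroid classifier.
# Classifier, names, and data are illustrative assumptions, not SKIDLkit code.

def centroid(rows):
    """Per-feature mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit(samples, labels):
    """Return one centroid per class label."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return {y: centroid(rows) for y, rows in by_class.items()}

def predict(model, x):
    """Assign x to the class whose centroid is nearest (squared Euclidean)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda y: sq_dist(model[y]))

def leave_one_out_accuracy(samples, labels):
    """Hold out each sample in turn, train on the rest, test on the held-out one."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

# Toy data: two well-separated classes in a two-band feature space.
samples = [[0.10, 0.80], [0.15, 0.75], [0.12, 0.82],
           [0.70, 0.20], [0.65, 0.25], [0.72, 0.18]]
labels = ["vegetation", "vegetation", "vegetation", "soil", "soil", "soil"]

loo_accuracy = leave_one_out_accuracy(samples, labels)
```

Because every prediction is made on a sample the model never saw during training, a model that merely memorizes its training data scores poorly on this measure. The cost is one training run per sample, which is why the approach is compute-intensive on large collections.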

The development of this toolkit is motivated by hyperspectral image analysis. A hyperspectral image is an image of the earth's surface consisting of more than a hundred layers (high-dimensional features), each layer measuring the intensity of the image at a certain wavelength. Thus, each pixel (location) has more than a hundred values, which together form a spectral signature describing the spectral intensity of the location from infrared to ultraviolet. To classify new pixels, the spectral signature is fed into classification methods. The whole spectrum is often unnecessary, and choosing several key bands, i.e. reducing the dimensionality, has positive effects: computational advantage, ease of analysis, and stability. Removing unnecessary and redundant features leads to simpler and more intuitive analysis, and a smaller set of features takes less time to compute. In addition, feature selection helps avoid overfitting, a situation where the classification works well on the data on which the model was built but not on a new image, because the model is too specifically tailored to the original data.
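To make the data layout concrete, the following sketch stores a very small hyperspectral cube as nested lists, extracts the spectral signature of one pixel, and reduces it to a few chosen bands. The cube[band][row][col] layout, the function names, and the three-band toy values are assumptions for illustration only; real images have more than a hundred bands.

```python
# A tiny hyperspectral cube stored as cube[band][row][col].
# Layout, names, and values are illustrative assumptions.

cube = [
    [[0.10, 0.12], [0.11, 0.70]],   # band 0
    [[0.40, 0.42], [0.41, 0.30]],   # band 1
    [[0.80, 0.78], [0.79, 0.20]],   # band 2
]

def spectral_signature(cube, row, col):
    """Return the intensity of one pixel in every band."""
    return [band[row][col] for band in cube]

def select_bands(signature, band_indices):
    """Keep only the chosen bands, reducing the dimensionality of the signature."""
    return [signature[i] for i in band_indices]

signature = spectral_signature(cube, 1, 1)   # one value per band
reduced = select_bands(signature, [0, 2])    # keep bands 0 and 2 only
```

A classifier trained on the reduced signatures sees fewer features per pixel, which is the source of the computational and stability benefits described above.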

SKIDLkit implements two ways to select important bands. The first is the filter method. In filter methods, all the bands are first ranked according to a distance measure (the ability to discriminate the target material from the others), and only the high-ranked bands become input to the classification algorithms. SKIDLkit implements three distance measures: t-test, prediction strength, and Bhattacharyya distance. The selected features can then be used with either a Support Vector Machine (SVM) or a Naive Bayesian classifier (NBC).

The alternative is the wrapper method, which includes the induction algorithm itself as part of the feature selection process. The induction algorithm (such as SVM or NBC) is trained on different subsets of features, and a good subset is chosen according to the accuracy of the resulting classifiers. SKIDLkit implements three wrapper methods: SVM wrapped in a genetic algorithm (GA + SVM), NBC wrapped in a GA (GA + NBC), and recursive feature elimination (RFE) with a Support Vector Machine. Both GA + SVM and GA + NBC evolve a subset of features that serves as good input to the induction algorithm (SVM or NBC); for these two methods, the n-fold cross-validation (specifically, jackknife) accuracy measure is implemented. RFE, on the other hand, starts with the whole feature set and recursively eliminates a small set of features in a greedy way.

After the features are selected, a model is created, and new, unseen samples can be classified using the model. In addition, domain scientists can gain insight into the domain by studying the selected features. SKIDLkit is built on existing open software when possible, including the Naive and Full Bayesian Classifier by Christian Borgelt, svmLight by Thorsten Joachims, and the netLab toolkit for Matlab by Chris Bishop. The SKIDLkit routines automate much of this analysis process. This automation is further enhanced by combining multiple processing steps into analysis workflows implemented in the Kepler system.
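The filter method described above can be sketched as follows: score each band by how well it separates target pixels from background pixels, then keep the top-ranked bands. This is a minimal illustration, not the SKIDLkit implementation. The function names and toy data are hypothetical, the t-test is taken to be the absolute Welch t-statistic, and the Bhattacharyya distance assumes a univariate Gaussian fitted to each class in each band. Prediction strength is not shown.

```python
import math
from statistics import mean, variance

def t_statistic(a, b):
    """Absolute Welch t-statistic between two samples of one band."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / se

def bhattacharyya(a, b):
    """Bhattacharyya distance between univariate Gaussians fitted to a and b."""
    va, vb = variance(a), variance(b)
    return (0.25 * (mean(a) - mean(b)) ** 2 / (va + vb)
            + 0.5 * math.log((va + vb) / (2.0 * math.sqrt(va * vb))))

def rank_bands(target, background, score=t_statistic):
    """Return band indices sorted from most to least discriminative."""
    n_bands = len(target[0])
    scores = []
    for band in range(n_bands):
        a = [pixel[band] for pixel in target]
        b = [pixel[band] for pixel in background]
        scores.append(score(a, b))
    return sorted(range(n_bands), key=lambda band: scores[band], reverse=True)

# Toy pixels with three bands each: band 1 separates the classes strongly,
# band 2 moderately, and band 0 hardly at all.
target = [[0.50, 0.90, 0.40], [0.52, 0.88, 0.45],
          [0.48, 0.92, 0.35], [0.51, 0.91, 0.42]]
background = [[0.49, 0.10, 0.30], [0.53, 0.12, 0.36],
              [0.50, 0.08, 0.28], [0.47, 0.11, 0.33]]

ranking_t = rank_bands(target, background)
ranking_b = rank_bands(target, background, score=bhattacharyya)
top_two = ranking_t[:2]   # bands passed on to the classifier
```

A filter of this kind scores each band independently of the classifier, so it is cheap to compute. A wrapper method instead retrains the classifier on candidate subsets, which costs more but accounts for interactions between bands.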

Publications/Presentations:

Supercomputing 2002 Demonstration – Managing and Mining Large Geospatial Collections for Ecology, November 2002.

LTER Network All Scientists Conference Presentation – Data Mining and Machine Learning in Ecology, September 2003.

Hector Jasso, Peter Shin, Tony Fountain, Deana Pennington: Using Wavelets for the Classification of Hyperspectral Images. The Fourth International Workshop on Environmental Applications of Machine Learning/ Fourth European Conference on Ecological Modeling (EAML/ECEM, Sep 27-Oct 1, 2004)

Fountain, Tony, Vande Castle, John, Asner, Greg, Moore, Reagan, and Rajasekar, Arcot: The LTER Hyper SRB System: A Collections Management System for LTER Hyperspectral Remote Sensing Data. LTER All Scientists Meeting 2000, Snowbird, UT, 8/2/2000-8/4/2000.

Pennington, Deana, Fountain, Tony, Wang, Guillan (Jenny), and Vande Castle, John: Spatio-temporal Data Mining of Remotely Sensed Imagery for Ecology. GIScience 2002, Second International Conference on Geographic Information Science, Boulder, Colorado, USA, September 25-28, 2002.

Pennington, Deana, Jasso, Hector, Shin, Peter, and Fountain, Tony: The Effect of Landscape Heterogeneity on Classification Accuracy: A Comparison of Classifier Prediction in Sub-optimal Sampling Conditions. SIAM International Conference on Data Mining, April 22-24, Lake Buena Vista, Florida.




Image 1: hyperspectral-result.ppt
Caption: Hyperspectral image analysis results

Image 2: hyperspectral-workflow.ppt
Caption: A screen capture of the hyperspectral analysis workflow in the Kepler system