Mining Spatio-Temporal Data
Earth Systems Science
Tony Fountain
Last modified: 30-AUG-04
Contact Information:
Tony Fountain
Phone: 858.534.8374
Fax: 858.534.5152
fountain@sdsc.edu
The NPACI ESS Mining Spatio-Temporal Data project focuses on the development of
cyberinfrastructure for data management and analytical services for ecological
data. The long-term goal is an integrated knowledge management system that uses
extensible and scalable infrastructure to provide data and analytical services
to the broad LTER Network community of field biologists and associated
collaborators working at these sites (over 1500 individuals). The tools provide
access to complex datasets, in particular the LTER Network hyperspectral remote
sensing data, to researchers who would otherwise not be able to use the data.
The hyperspectral data collection is large and complex and raises challenges
for traditional management and analysis approaches. The current system consists
of several components, including an SRB-based collection management system for
remote sensing imagery, analytical pipelines (workflows) for high-throughput
processing of remote sensing data, mining algorithms for analysis, band
selection and classification, and dynamic internet mapping of analysis results.
For FY2004 the main goal is to transition the tools and infrastructure so that
resources are available to related projects and the broader scientific
community. This includes enhancing the routines for hyperspectral data
analysis, installing them in the on-line Spatial Data Workbench for
experimentation and testing, and migrating the analysis pipelines to the
Ptolemy/Kepler workflow system (used by the SEEK and GEON projects). During
FY04 we extended our analytical toolkit, SKIDLkit
(http://daks.sdsc.edu/skidl/skidldownloads.html), which was developed for analysis
of the hyperspectral data, and we automated key processing tasks by embedding
SKIDLkit routines in the Kepler workflow system (http://kepler.ecoinformatics.org/).
The analysis of hyperspectral data is a complex task, requiring
compute-intensive statistical and machine-learning approaches to pattern
discovery and model development.
We have focused on algorithms that extend traditional statistical modeling techniques
by shifting the focus from the correct parameterization of statistical models
with well-known analytical properties to the generation and tuning of models
closely adapted to the data. These models are then checked for overfitting
using compute-intensive cross-validation techniques.
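The overfitting check described above can be sketched as follows. This is a minimal illustration and not SKIDLkit code: the 1-nearest-neighbour classifier, the function names, and the toy samples are all assumptions chosen to keep the example short. It compares accuracy on the training data with leave-one-out (jack-knife style) accuracy.

```python
# Illustrative sketch only: classifier, names, and data are hypothetical.

def nearest_neighbour_predict(train, query):
    """Return the label of the training sample closest to `query`."""
    closest = min(
        train,
        key=lambda sample: sum((a - b) ** 2 for a, b in zip(sample[0], query)),
    )
    return closest[1]


def training_accuracy(samples):
    """Accuracy when the model is scored on the data it was built from."""
    hits = sum(nearest_neighbour_predict(samples, x) == y for x, y in samples)
    return hits / len(samples)


def leave_one_out_accuracy(samples):
    """Accuracy when each sample is held out in turn and predicted from the rest."""
    hits = 0
    for i, (x, y) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        hits += nearest_neighbour_predict(rest, x) == y
    return hits / len(samples)


# Toy two-band samples; the last one carries a deliberately noisy label.
samples = [
    ((0.10, 0.20), "soil"),
    ((0.15, 0.25), "soil"),
    ((0.20, 0.22), "soil"),
    ((0.60, 0.70), "vegetation"),
    ((0.65, 0.75), "vegetation"),
    ((0.18, 0.24), "vegetation"),
]

print("training accuracy:     ", training_accuracy(samples))
print("leave-one-out accuracy:", leave_one_out_accuracy(samples))
```

A nearest-neighbour model scores perfectly on its own training data by construction, so a clearly lower held-out score is the signal that the model is adapted too closely to the data.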
The development of this toolkit is motivated by hyperspectral image analysis. A
hyperspectral image is an image of the surface of the Earth consisting of
more than a hundred layers (high-dimensional features), each layer measuring
the intensity of the image at a certain wavelength. Thus, each pixel (location)
has more than a hundred values, which together form a spectral signature describing
the spectral intensity of the location, from infrared to ultraviolet. To
classify a new pixel, its spectral signature is fed into a classification method.
The whole spectrum is often unnecessary, and choosing several key bands, i.e.
reducing the dimensionality, brings positive effects: computational advantage,
ease of analysis, and stability. Taking out unnecessary and redundant features
also leads to simpler and more intuitive analysis, and a smaller set of
features translates to less time to compute. In addition, feature selection is
a way to avoid overfitting, a situation where the identification works well on
the data on which it is built, but not on a new image, because the
model is tailored too closely to the data it was built on.
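To make the data layout concrete, the sketch below represents a tiny image as a nested list indexed by row, column, and band, and then keeps only a few key bands per pixel. The cube values, the function names, and the chosen band indices are hypothetical; a real hyperspectral image has more than a hundred bands and the key bands would come from a band-selection step.

```python
# Illustrative sketch only: all names and values here are hypothetical.

# A 2 x 2 image with 6 bands per pixel; cube[row][col] is one spectral signature.
cube = [
    [[0.11, 0.14, 0.30, 0.45, 0.47, 0.20], [0.10, 0.13, 0.28, 0.44, 0.46, 0.19]],
    [[0.31, 0.33, 0.35, 0.36, 0.38, 0.40], [0.30, 0.32, 0.34, 0.37, 0.39, 0.41]],
]


def spectral_signature(cube, row, col):
    """Return a copy of the per-band intensities at one pixel."""
    return list(cube[row][col])


def select_bands(signature, band_indices):
    """Keep only the chosen bands, reducing the dimensionality of a signature."""
    return [signature[i] for i in band_indices]


def flatten_pixels(cube, band_indices):
    """Turn the image into a list of reduced feature vectors, one per pixel."""
    return [select_bands(pixel, band_indices) for row in cube for pixel in row]


key_bands = [0, 3, 5]  # assumed output of a band-selection step
reduced = flatten_pixels(cube, key_bands)
print(reduced)
```

Each reduced vector is what a classifier would receive in place of the full signature, which is where the savings in computation come from.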
Two ways to select important bands are implemented in SKIDLkit. The first is the
filter method. In filter methods, all the bands are first ranked according to
a distance measure (the ability to discriminate the target material from
the others), and then only the high-ranked bands are passed as input to the
classification algorithms. SKIDLkit implements three types of distance
measures: t-test, prediction strength, and Bhattacharyya distance. SKIDLkit also
enables the selected features to be used with either a Support Vector Machine
(SVM) or a Naive Bayesian classifier (NBC). An alternative to this method is the
wrapper method, which includes the induction algorithm itself as a part of the
feature selection process. The induction algorithm (such as SVM or NBC) is
trained on different subsets of features, and a good subset is chosen according
to the accuracy of the resulting classifiers. SKIDLkit implements three kinds of wrapper
methods: SVM wrapped in a genetic algorithm (GA + SVM), NBC wrapped in a GA (GA
+ NBC), and recursive feature elimination (RFE) with an SVM. Both GA + SVM and
GA + NBC evolve a subset of features to be
a good input to the induction algorithms (SVM and NBC), and for these two
methods, an n-fold cross-validation (specifically, jack-knife) accuracy
measure is implemented. RFE, on the other hand, starts with the whole feature
set, and recursively eliminates a small set of features in a greedy way. After
the features are selected, a model is created, and new, unseen samples can be
classified using the model. In addition, domain scientists can gain insight
into the domain by studying the selected features. SKIDLkit is built on
existing open software when possible. This includes the following: Naïve and
Full Bayesian Classifier by Christian Borgelt, svmLight by Thorsten Joachims, and the
netLab toolkit for Matlab by Chris Bishop. The SKIDLkit routines automate much
of this analysis process. This automation is further enhanced by combining
multiple processing steps into analysis workflows implemented in the Kepler system.
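The filter method described above can be sketched as a band-ranking step. This is a minimal illustration and not the SKIDLkit implementation: it assumes a Welch-style t-statistic as the distance measure, and the function names and the four-band toy pixels are hypothetical.

```python
# Illustrative sketch only: the scoring formula, names, and data are assumptions.
from math import sqrt
from statistics import mean, variance


def t_score(target_values, other_values):
    """Absolute Welch t-statistic between two groups of intensities for one band."""
    standard_error = sqrt(
        variance(target_values) / len(target_values)
        + variance(other_values) / len(other_values)
    )
    return abs(mean(target_values) - mean(other_values)) / standard_error


def rank_bands(target_pixels, other_pixels):
    """Return band indices ordered from most to least discriminating."""
    n_bands = len(target_pixels[0])
    scored = []
    for band in range(n_bands):
        score = t_score(
            [pixel[band] for pixel in target_pixels],
            [pixel[band] for pixel in other_pixels],
        )
        scored.append((score, band))
    return [band for score, band in sorted(scored, reverse=True)]


# Toy pixels with four bands: target material versus everything else.
target_pixels = [
    [0.80, 0.30, 0.52, 0.10],
    [0.82, 0.35, 0.48, 0.12],
    [0.78, 0.28, 0.50, 0.11],
]
other_pixels = [
    [0.20, 0.32, 0.51, 0.60],
    [0.22, 0.29, 0.49, 0.62],
    [0.18, 0.34, 0.50, 0.58],
]

ranking = rank_bands(target_pixels, other_pixels)
top_bands = ranking[:2]  # only the high-ranked bands go on to the classifier
print("ranking:", ranking, "selected:", top_bands)
```

A wrapper method would replace the fixed score with the cross-validated accuracy of a classifier trained on each candidate subset, which costs more computation but evaluates bands in combination rather than one at a time.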
Publications/Presentations:
Supercomputing 2002 Demonstration – Managing and Mining Large Geospatial
Collections for Ecology, November 2002.
LTER Network All Scientists Conference Presentation – Data Mining and Machine
Learning in Ecology, September 2003.
Hector Jasso, Peter Shin, Tony Fountain, Deana Pennington: Using Wavelets for
the Classification of Hyperspectral Images. The Fourth International Workshop
on Environmental Applications of Machine Learning / Fourth European Conference
on Ecological Modeling (EAML/ECEM), Sep 27-Oct 1, 2004.
Fountain, Tony, Vande Castle, John, Asner, Greg, Moore, Reagan, and Rajasekar,
Arcot: The LTER Hyper SRB System: A Collections Management System for LTER
Hyperspectral Remote Sensing Data. LTER All Scientists Meeting 2000, Snowbird,
UT, 8/2/2000-8/4/2000.
Pennington, Deana, Fountain, Tony, Wang, Guillan (Jenny), and Vande Castle,
John: Spatio-temporal Data Mining of Remotely Sensed Imagery for Ecology.
GIScience 2002, Second International Conference on Geographic Information
Science, Boulder, Colorado, USA, September 25-28, 2002.
Pennington, Deana, Jasso, Hector, Shin, Peter, and Fountain, Tony: The Effect
of Landscape Heterogeneity on Classification Accuracy: A Comparison of
Classifier Prediction in Sub-optimal Sampling Conditions.
Image 1: hyperspectral-result.ppt. Caption: Hyperspectral image analysis results

Image 2: hyperspectral-workflow.ppt. Caption: A screen capture of the
hyperspectral analysis workflow in the Kepler system
