Clustering in Massive Data Sets
- Handbook of Massive Data Sets, 1999
Cited by 8 (0 self)
We review the time and storage costs of search and clustering algorithms. We illustrate these with case studies in astronomy, information retrieval, visual user interfaces, chemical databases, and other areas. Sections 2 to 6 relate to nearest neighbor searching, an elemental form of clustering, and a basis for the clustering algorithms that follow. Sections 7 to 11 review a number of families of clustering algorithms. Sections 12 to 14 relate to visual or image representations of data sets, from which a number of interesting algorithmic developments arise.
The challenges of clustering high-dimensional data
- In New Vistas in Statistical Physics: Applications in Econophysics, Bioinformatics, and Pattern Recognition, 2003
Cited by 8 (0 self)
Cluster analysis divides data into groups (clusters) for the purposes of summarization or improved understanding. For example, cluster analysis has been used to group related documents for browsing, to find genes and proteins that have similar functionality, or as a means of data compression. While clustering has a long history and a large number of clustering techniques have been developed in statistics, pattern recognition, data mining, and other fields, significant challenges still remain. In this chapter we provide a short introduction to cluster analysis, and then focus on the challenge of clustering high-dimensional data. We present a brief overview of several recent techniques, including a more detailed description of recent work of our own which uses a concept-based clustering approach.
Treelets — An Adaptive Multi-Scale Basis for Sparse Unordered Data
Cited by 6 (2 self)
In many modern applications, including analysis of gene expression and text documents, the data are noisy, high-dimensional, and unordered, with no particular meaning to the given order of the variables. Yet, successful learning is often possible due to sparsity: the fact that the data are typically redundant with underlying structures that can be represented by only a few features. In this paper, we present treelets, a novel construction of multi-scale bases that extends wavelets to non-smooth signals. The method is fully adaptive, as it returns a hierarchical tree and an orthonormal basis which both reflect the internal structure of the data. Treelets are especially well-suited as a dimensionality reduction and feature selection tool prior to regression and classification, in situations where sample sizes are small and the data are sparse with unknown groupings of correlated or collinear variables. The method is also simple to implement and analyze theoretically. Here we describe a variety of situations where treelets perform better than principal component analysis as well as some common variable selection and cluster averaging schemes. We illustrate treelets on a blocked covariance model and on several data sets (hyperspectral image data, DNA microarray data, and internet advertisements) with highly complex dependencies between variables.
Visual feature extraction via PCA-based
- International Symposium on Robots and Automation, 2002
This paper describes the use of a novel method of feature extraction for visual sensor processing. An application to crack detection by a sewer maintenance robot equipped with an infra-red camera is described. A space-frequency distribution 'signature' is generated via a three-step process involving space-frequency decomposition, density function estimation, and parameter extraction. These steps are achieved, respectively, via the wavelet packet transform, empirical cumulative density functions, and principal components analysis. The 'wavelet signature' method is shown to be superior to conventional methods as a feature extractor for a logistic regression model in a crack discrimination task.
Aide-Memoire. High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality
- 2000
The coming century is surely the century of data. A combination of blind faith and serious purpose makes our society invest massively in the collection and processing of data of all kinds, on scales unimaginable until recently. Hyperspectral Imagery, Internet Portals, Financial tick-by-tick data, and DNA Microarrays are just a few of the better-known sources, feeding data in torrential streams into scientific and business databases worldwide. In traditional statistical data analysis, we think of observations of instances of particular phenomena (e.g. instance ↔ human being), these observations being a vector of values we measured on several variables (e.g. blood pressure, weight, height, ...). In traditional statistical methodology, we assumed many observations and a few, well-chosen variables. The trend today is towards more observations but even more so, to radically larger numbers of variables – voracious, automatic, systematic collection of hyper-informative detail about each observed instance. We are seeing examples where the observations gathered on individual instances are curves, or spectra, or images, or
Scale-Based Gaussian Coverings: Combining Intra and Inter Mixture Models in Image Segmentation
- 2009
By a “covering” we mean a Gaussian mixture model fit to observed data. Approximations of the Bayes factor can be used to judge model fit to the data within a given Gaussian mixture model. Between families of Gaussian mixture models, we propose the Rényi quadratic entropy as an excellent and tractable model comparison framework. We exemplify this using the segmentation of an MRI image volume, based (1) on a direct Gaussian mixture model applied to the marginal distribution function, and (2) on a Gaussian model fit through k-means applied to the 4D multivalued image volume furnished by the wavelet transform. Visual preference for one model over another is not immediate. The Rényi quadratic entropy allows us to show clearly that one of these modelings is superior to the other.
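One reason the Rényi quadratic entropy is tractable for Gaussian mixtures is that it has a closed form: the integral of the squared density reduces to a double sum over component pairs, each term being a Gaussian evaluated at the difference of two means with the two variances added. The sketch below illustrates this for a one-dimensional mixture only. The function names and the example parameters are hypothetical, and the abstract does not say how the paper computes the entropy or uses it to rank models, so this should be read as an illustration of the standard identity rather than the authors' procedure.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def renyi_quadratic_entropy(weights, means, variances):
    """Closed-form H2 = -log( integral of p(x)^2 dx ) for a 1-D Gaussian mixture.

    Uses the identity
        integral N(x; m_i, v_i) * N(x; m_j, v_j) dx = N(m_i; m_j, v_i + v_j),
    so the integral of p^2 is a weighted double sum over component pairs.
    """
    total = 0.0
    for w_i, m_i, v_i in zip(weights, means, variances):
        for w_j, m_j, v_j in zip(weights, means, variances):
            total += w_i * w_j * gaussian_pdf(m_i, m_j, v_i + v_j)
    return -math.log(total)


# Hypothetical two-component mixture, chosen only for illustration.
example_entropy = renyi_quadratic_entropy(
    weights=[0.3, 0.7], means=[-1.0, 2.0], variances=[0.5, 1.5]
)
```

For a single unit-variance Gaussian the formula gives 0.5 * log(4π), which is a convenient sanity check. The multivariate case has the same structure, with covariance matrices summed in place of variances.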

