X-means: Extending K-means with Efficient Estimation of the Number of Clusters
In Proceedings of the 17th International Conf. on Machine Learning, 2000
Cited by 196 (5 self)
Despite its popularity for general clustering, K-means suffers from three major shortcomings: it scales poorly computationally, the number of clusters K has to be supplied by the user, and the search is prone to local minima. We propose solutions for the first two problems, and a partial remedy for the third. Building on prior work for algorithmic acceleration that is not based on approximation, we introduce a new algorithm that efficiently searches the space of cluster locations and number of clusters to optimize the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) measure. The innovations include two new ways of exploiting cached sufficient statistics and a new, very efficient test that, in one K-means sweep, selects the most promising subset of classes for refinement. This gives rise to a fast, statistically founded algorithm that outputs both the number of classes and their parameters. Experiments show this technique reveals the true number of classes in the underlying distribution, and that it is much faster than repeatedly using accelerated K-means for different values of K.
Model-Based Clustering, Discriminant Analysis, and Density Estimation
Journal of the American Statistical Association, 2000
Cited by 171 (23 self)
Cluster analysis is the automated search for groups of related observations in a data set. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures, and most clustering methods available in commercial software are also of this type. However, there is little systematic guidance associated with these methods for solving important practical questions that arise in cluster analysis, such as "How many clusters are there?", "Which clustering method should be used?" and "How should outliers be handled?". We outline a general methodology for model-based clustering that provides a principled statistical approach to these issues. We also show that this can be useful for other problems in multivariate analysis, such as discriminant analysis and multivariate density estimation. We give examples from medical diagnosis, minefield detection, cluster recovery from noisy data, and spatial density estimation. Finally, we mention limitations of the methodology, a...
'N-Body' Problems in Statistical Learning
2001
Cited by 70 (12 self)
We present efficient algorithms for all-point-pairs problems, or 'N-body'-like problems, which are ubiquitous in statistical learning. We focus on six examples, including nearest-neighbor classification, kernel density estimation, outlier detection, and the two-point correlation.
The Anchors Hierarchy: Using the Triangle Inequality to Survive High Dimensional Data
In Twelfth Conference on Uncertainty in Artificial Intelligence, 2000
Cited by 65 (9 self)
This paper is about metric data structures in high-dimensional or non-Euclidean space that permit cached sufficient statistics accelerations of learning algorithms.
Fast and robust short video clip search using an index structure
In ACM Multimedia’s Multimedia Information Retrieval Workshop, 2004
Cited by 17 (2 self)
In this paper, we present an index structure-based method for fast and robust search of short video clips in large video collections. First we temporally segment a given long video stream into overlapping matching windows, then map features extracted from the windows into points in a high-dimensional feature space, and construct index structures over these feature points for the querying process. Unlike linear-scan similarity matching methods, the querying process can be accelerated by the spatial pruning that an index structure provides. A multi-resolution kd-tree (mrkd-tree) is employed to perform exact K-NN queries and range queries, with the aim of quickly and precisely finding all short video segments that have the same content as the query. In terms of feature representation, rather than selecting representative key frames, we develop a set of spatial-temporal features to globally capture the pattern of a short video clip (e.g. a commercial clip or a lead-in/lead-out clip) and combine it with the color range feature to form video signatures. Our experiments demonstrate the efficiency and effectiveness of the proposed method: the very first instance of a given 10-second query clip can be identified in a 10.5-hour video collection in tens of milliseconds. The proposed method has also been compared with the fast sequential search algorithm.
Repairing Faulty Mixture Models using Density Estimation
In Proceedings of the 18th International Conf. on Machine Learning, 2001
Cited by 14 (1 self)
Previous work in mixture model clustering has focused primarily on the issue of model selection. Model scoring functions (including penalized likelihood and Bayesian approximations) can guide a search of the model parameter and structure space. Relatively little research has addressed the issue of how to move through this space. Local optimization techniques, such as expectation maximization, solve only part of the problem; we still need to move between different local optima.
Summary of biosurveillance-relevant technologies
2003
Cited by 3 (1 self)
This short report, compiled upon request from Dave Siegrist and Ted Senator, surveys the spectrum of technologies that can help with Biosurveillance. We indicate which we have chosen, so far, to use in our development of analysis methods and our reasons.
1 Time-weighted averaging
This is directly applicable to a scalar signal (such as “number of respiratory cases today”). This method, more commonly used in computational finance, simply compares the count during the current time period with the weighted average of the counts of recent days. Exponential weighting is typically used, where the half-life is known as the “time window” parameter. This time-window parameter is typically chosen by hand. We prefer the Serfling and Univariate HMM methods described below.
2 Serfling method
This method (Serfling, 1963) is a cyclic regression model, and is the standard CDC algorithm for flu detection. It is, again, applicable to scalar signals. It assumes that the signal follows a sinusoid with a period of one year, and thus finds the four parameters of the model [parameter symbols and model equation lost in extraction], where the parameters are chosen to minimize the sum of squares of residuals. It is an easy matter of regression analysis to determine, on any date, whether
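The time-weighted averaging described in section 1 of the report above can be illustrated with a minimal Python sketch. It is not taken from the report: the function names `ewma_baseline` and `excess_over_baseline`, the default half-life of 7 days, and the choice to normalize the weights are all assumptions made for illustration. The only ideas drawn from the abstract are the exponential weighting with a half-life and the comparison of the current count against the weighted average of recent days.

```python
def ewma_baseline(counts, half_life_days):
    """Exponentially weighted average of past daily counts.

    `counts` is ordered oldest to newest. A count that is `half_life_days`
    old receives half the weight of the most recent count.
    (Hypothetical helper, not from the report.)
    """
    if not counts:
        raise ValueError("need at least one past count")
    decay = 0.5 ** (1.0 / half_life_days)  # per-day weight multiplier
    weighted_sum = 0.0
    weight_total = 0.0
    for age, count in enumerate(reversed(counts)):
        weight = decay ** age  # age 0 is the most recent day
        weighted_sum += weight * count
        weight_total += weight
    return weighted_sum / weight_total


def excess_over_baseline(history, today_count, half_life_days=7.0):
    """Difference between today's count and the weighted recent average."""
    return today_count - ewma_baseline(history, half_life_days)
```

A large positive value from `excess_over_baseline` would flag a day whose count stands out from the recent past; how large counts as alarming is left open here, just as the abstract leaves the time-window parameter to be chosen by hand.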
Cached Sufficient Statistics for Automated Mining and Discovery from Massive Data Sources
2000
Cited by 1 (0 self)
... into a new, fundamentally impossible realm where the data sources are just too large to assimilate by humans. This situation is ironic given the large investment the US has put into gathering scientific data. The only alternative is automated discovery. It is our thesis that the emerging technology of cached sufficient statistics will be critical to developing automated discovery on massive data. A cached sufficient statistics representation is a data structure that summarizes statistical information in a database. For example, human users, or statistical programs, often need to query some quantity (such as a mean or variance) about some subset of the attributes (such as size, position and shape) over some subset of the records. When this happens, we want the cached sufficient statistic representation to intercept the request and, instead of answering it slowly by database accesses over billions of records, answer it immediately. The interesting technical challenge is: given that there
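The idea of answering a mean or variance query from a cached summary rather than from the raw records can be sketched as follows. This is a deliberately simple illustration and not the data structure from the paper: it caches only a count, a sum, and a sum of squares per value of one grouping attribute. The names `Stats`, `build_cache`, and `query`, and the dictionary-based record layout, are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Stats:
    """Sufficient statistics for the mean and (population) variance."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def add(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def merge(self, other):
        # Statistics of a union of disjoint record sets are just sums.
        return Stats(self.n + other.n,
                     self.total + other.total,
                     self.total_sq + other.total_sq)

    def mean(self):
        return self.total / self.n

    def variance(self):
        m = self.mean()
        return self.total_sq / self.n - m * m


def build_cache(records, key, value):
    """One pass over the records; afterwards queries never touch them."""
    cache = defaultdict(Stats)
    for record in records:
        cache[record[key]].add(record[value])
    return dict(cache)


def query(cache, keys):
    """Combine cached statistics for the selected groups."""
    combined = Stats()
    for k in keys:
        if k in cache:
            combined = combined.merge(cache[k])
    return combined
```

For example, after `cache = build_cache(records, "shape", "size")`, calling `query(cache, ["round", "square"]).mean()` costs time proportional to the number of selected groups rather than the number of records. Supporting arbitrary subsets of attributes, as the abstract describes, needs a richer structure than this single-attribute cache.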
Nonparametric Optimization and Galactic Morphology
Cited by 1 (1 self)
We also introduce a new algorithm for optimization of similarity-based data. In a problem where only the similarity metric is defined, a gradient is rarely possible. The essence of MBR is the similarity metric among stored examples. The new algorithm, Pairwise Bisection, uses all pairs of stored examples to divide the space into many smaller spaces and uses a nonparametric statistic to decide on their promise. The nonparametric statistic is Kendall's tau, which is used to measure the probability that a given point is at an optimum. Because it is fundamentally nonparametric, the algorithm is also robust to non-Gaussian noise and outliers. To my mother and father, to whom I owe just about everything. Acknowledgements: I would like to thank my advisor, Andrew Moore, who provided such an inspirational model throughout my graduate studies. Andrew was perpetually supportive and enthusiastic, and could be counted on to have a clever idea up each sleeve at any given time.

