Results 1–10 of 11
X-means: Extending K-means with Efficient Estimation of the Number of Clusters
In Proceedings of the 17th International Conf. on Machine Learning, 2000
Abstract

Cited by 267 (5 self)
Despite its popularity for general clustering, K-means suffers three major shortcomings: it scales poorly computationally, the number of clusters K has to be supplied by the user, and the search is prone to local minima. We propose solutions for the first two problems, and a partial remedy for the third. Building on prior work for algorithmic acceleration that is not based on approximation, we introduce a new algorithm that efficiently searches the space of cluster locations and number of clusters to optimize the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) measure. The innovations include two new ways of exploiting cached sufficient statistics and a new, very efficient test that in one K-means sweep selects the most promising subset of classes for refinement. This gives rise to a fast, statistically founded algorithm that outputs both the number of classes and their parameters. Experiments show this technique reveals the true number of classes in the underlying distribution, and that it is much faster than repeatedly using accelerated K-means for different values of K.
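The BIC-guided choice of K that this abstract describes can be sketched in a few lines. This is a minimal illustration on synthetic 1-D data, not the paper's X-means implementation (which also accelerates the K-means sweeps themselves): run K-means for each candidate K, score each clustering with a spherical-Gaussian BIC, and keep the best-scoring K. The quantile initialization and the exact parameter count are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated 1-D clusters (synthetic data; illustration only).
X = np.concatenate([rng.normal(0.0, 0.3, 30), rng.normal(10.0, 0.3, 30)])[:, None]
n, d = X.shape

def kmeans(X, k, iters=20):
    # Deterministic quantile init keeps the sketch reproducible.
    centers = np.quantile(X, (np.arange(k) + 0.5) / k, axis=0)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return centers, labels

def bic(X, centers, labels):
    # Hard-assignment log-likelihood of a spherical Gaussian mixture,
    # penalized by (p/2) log n; larger is better.
    n, d = X.shape
    k = len(centers)
    ss = ((X - centers[labels]) ** 2).sum()
    var = ss / max(n - k, 1)
    counts = np.bincount(labels, minlength=k)
    ll = (counts[counts > 0] * np.log(counts[counts > 0] / n)).sum()
    ll += -0.5 * n * d * np.log(2 * np.pi * var) - ss / (2 * var)
    p = k * (d + 1)  # k means plus per-cluster weight terms (simplified count)
    return ll - 0.5 * p * np.log(n)

scores = {k: bic(X, *kmeans(X, k)) for k in range(1, 5)}
best_k = max(scores, key=scores.get)
```

On this data the penalty term outweighs the marginal likelihood gain from splitting a tight cluster, so the score peaks at the true number of clusters.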
Model-Based Clustering, Discriminant Analysis, and Density Estimation
Journal of the American Statistical Association, 2000
Abstract

Cited by 260 (24 self)
Cluster analysis is the automated search for groups of related observations in a data set. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures, and most clustering methods available in commercial software are also of this type. However, there is little systematic guidance associated with these methods for solving important practical questions that arise in cluster analysis, such as "How many clusters are there?", "Which clustering method should be used?" and "How should outliers be handled?". We outline a general methodology for model-based clustering that provides a principled statistical approach to these issues. We also show that this can be useful for other problems in multivariate analysis, such as discriminant analysis and multivariate density estimation. We give examples from medical diagnosis, minefield detection, cluster recovery from noisy data, and spatial density estimation. Finally, we mention limitations of the methodology, a...
'N-Body' Problems in Statistical Learning
2001
Abstract

Cited by 90 (12 self)
We present efficient algorithms for all-point-pairs problems, or 'N-body'-like problems, which are ubiquitous in statistical learning. We focus on six examples, including nearest-neighbor classification, kernel density estimation, outlier detection, and the two-point correlation.
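As a concrete instance of an all-point-pairs problem, the raw pair count behind the two-point correlation can be computed by a naive O(n²) scan; the tree-based algorithms the abstract refers to return the same count far faster. A small sketch on synthetic data (the point set and radius are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
pts = rng.random((200, 2))   # synthetic 2-D point set
r = 0.1                      # correlation radius

# Naive all-point-pairs count: how many distinct pairs lie within distance r.
# 'N-body'-style dual-tree algorithms compute the same quantity without the
# explicit O(n^2) distance matrix built here.
d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
pairs = int((d2[np.triu_indices(len(pts), k=1)] <= r * r).sum())
```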
The Anchors Hierarchy: Using the Triangle Inequality to Survive High Dimensional Data
In Twelfth Conference on Uncertainty in Artificial Intelligence, 2000
Abstract

Cited by 75 (8 self)
This paper is about metric data structures in high-dimensional or non-Euclidean space that permit cached-sufficient-statistics accelerations of learning algorithms.
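The core trick can be illustrated in a few lines: cache each point's distance to an anchor once, then use the triangle inequality d(q, p) ≥ |d(q, a) − d(p, a)| to skip exact distance computations during nearest-neighbor search. This is only a single-anchor sketch, not the paper's full anchors hierarchy:

```python
import numpy as np

rng = np.random.default_rng(2)
pts = rng.random((500, 3))
anchor = pts.mean(axis=0)
# Precompute each point's distance to the anchor once (the cached statistic).
d_anchor = np.linalg.norm(pts - anchor, axis=1)

def nn_with_pruning(q):
    dq = np.linalg.norm(q - anchor)
    best, best_i, evals = np.inf, -1, 0
    # Visit points whose anchor distance is closest to dq first,
    # so 'best' tightens early and the bound kicks in sooner.
    for i in np.argsort(np.abs(d_anchor - dq)):
        # Triangle inequality: d(q, p) >= |d(q, anchor) - d(p, anchor)|.
        if abs(dq - d_anchor[i]) >= best:
            break  # every remaining point is at least this far away
        evals += 1
        dist = np.linalg.norm(q - pts[i])
        if dist < best:
            best, best_i = dist, i
    return best_i, best, evals

q = np.array([0.5, 0.5, 0.5])
idx, dist, evals = nn_with_pruning(q)
```

Because candidates are sorted by the lower bound, the first bound that exceeds the current best rules out all remaining points at once.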
Fast and robust short video clip search using an index structure
In ACM Multimedia's Multimedia Information Retrieval Workshop, 2004
Abstract

Cited by 30 (3 self)
In this paper, we present an index-structure-based method for fast and robust search of short video clips in large video collections. First we temporally segment a given long video stream into overlapping matching windows, then map features extracted from the windows to points in a high-dimensional feature space, and construct index structures over these feature points for the querying process. Unlike linear-scan similarity matching methods, the querying process can be accelerated by the spatial pruning an index structure provides. A multi-resolution kd-tree (mr-kd-tree) is employed to perform exact k-NN and range queries, with the aim of quickly and precisely finding all short video segments that have the same content as the query. For feature representation, rather than selecting representative key frames, we develop a set of spatial-temporal features that globally capture the pattern of a short video clip (e.g. a commercial clip or a lead-in/out clip) and combine them with a color range feature to form video signatures. Our experiments have shown the efficiency and effectiveness of the proposed method: the very first instance of a given 10-second query clip can be identified in a 10.5-hour video collection in tens of milliseconds. The proposed method has also been compared with the fast sequential search algorithm.
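A minimal kd-tree range query conveys the pruning idea behind this kind of index. The 4-D "signature" vectors below are hypothetical stand-ins for the paper's video signatures, and the multi-resolution machinery of the mr-kd-tree is not reproduced; this is only a sketch of a plain kd-tree range query:

```python
import numpy as np

def build_kdtree(points, idx=None, depth=0):
    # Minimal kd-tree: split on alternating axes at the median (sketch only).
    if idx is None:
        idx = list(range(len(points)))
    if not idx:
        return None
    axis = depth % points.shape[1]
    idx = sorted(idx, key=lambda i: points[i][axis])
    mid = len(idx) // 2
    return {
        "i": idx[mid], "axis": axis,
        "left": build_kdtree(points, idx[:mid], depth + 1),
        "right": build_kdtree(points, idx[mid + 1:], depth + 1),
    }

def range_query(node, points, q, r, out):
    # Collect all indexed points within distance r of query q, descending
    # into the far subtree only when the splitting plane is within r of q.
    if node is None:
        return
    p = points[node["i"]]
    if np.linalg.norm(p - q) <= r:
        out.append(node["i"])
    diff = q[node["axis"]] - p[node["axis"]]
    near, far = ("left", "right") if diff < 0 else ("right", "left")
    range_query(node[near], points, q, r, out)
    if abs(diff) <= r:
        range_query(node[far], points, q, r, out)

rng = np.random.default_rng(3)
# Hypothetical window signatures: one feature vector per matching window.
sigs = rng.random((300, 4))
tree = build_kdtree(sigs)
hits = []
range_query(tree, sigs, sigs[0], 0.3, hits)
```

The pruning condition `abs(diff) <= r` is what replaces the linear scan: subtrees entirely beyond the query radius are never visited.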
Repairing Faulty Mixture Models using Density Estimation
In Proceedings of the 18th International Conf. on Machine Learning, 2001
Abstract

Cited by 15 (1 self)
Previous work in mixture model clustering has focused primarily on the issue of model selection. Model scoring functions (including penalized likelihood and Bayesian approximations) can guide a search of the model parameter and structure space. Relatively little research has addressed the issue of how to move through this space. Local optimization techniques, such as expectation-maximization, solve only part of the problem; we still need to move between different local optima.
Summary of biosurveillance-relevant technologies
2003
Abstract

Cited by 4 (2 self)
This short report, compiled upon request from Dave Siegrist and Ted Senator, surveys the spectrum of technologies that can help with biosurveillance. We indicate which we have chosen, so far, to use in our development of analysis methods, and our reasons.
1. Time-weighted averaging. This is directly applicable to a scalar signal (such as "number of respiratory cases today"). This method, more commonly used in computational finance, simply compares the count during the current time period with a weighted average of the counts of recent days. Exponential weighting is typically used, where the half-life is known as the "time window" parameter. This time-window parameter is typically chosen by hand. We prefer the Serfling and univariate HMM methods described below.
2. Serfling method. This method (Serfling, 1963) is a cyclic regression model, and is the standard CDC algorithm for flu detection. It is, again, applicable to scalar signals. It assumes that the signal follows a sinusoid with a period of one year, and thus finds the four parameters α, β, γ, and δ in y(t) = α + βt + γ sin(2πt/365) + δ cos(2πt/365), where the parameters are chosen to minimize the sum of squares of residuals. It is an easy matter of regression analysis to determine, on any date, whether ...
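Both detectors described in this summary reduce to a few lines of linear algebra. Below is a hedged sketch on synthetic daily counts: an exponentially weighted moving average with a hand-chosen half-life, and a Serfling-style cyclic regression fit by ordinary least squares. The 365-day period and the synthetic signal are illustrative assumptions, not data from the report.

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic daily counts with a one-year sinusoidal (flu-like) cycle.
t = np.arange(3 * 365)
counts = 50 + 10 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 2, len(t))

# 1. Time-weighted averaging: exponential weights with a chosen half-life.
half_life = 14.0                       # the hand-chosen "time window" parameter
alpha = 1 - 0.5 ** (1 / half_life)
ewma = np.empty_like(counts)
ewma[0] = counts[0]
for i in range(1, len(counts)):
    ewma[i] = alpha * counts[i] + (1 - alpha) * ewma[i - 1]
excess_today = counts[-1] - ewma[-2]   # today's count vs. the recent weighted average

# 2. Serfling-style cyclic regression: fit a + b*t + c*sin + d*cos by least squares.
A = np.column_stack([np.ones_like(t, dtype=float), t,
                     np.sin(2 * np.pi * t / 365), np.cos(2 * np.pi * t / 365)])
params, *_ = np.linalg.lstsq(A, counts, rcond=None)
residuals = counts - A @ params
```

An alarm rule would then flag any day whose residual (or `excess_today`) exceeds some multiple of the residual standard deviation.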
Cached Sufficient Statistics for Automated Mining and Discovery from Massive Data Sources
2000
Abstract

Cited by 1 (0 self)
... into a new, fundamentally impossible realm where the data sources are just too large to assimilate by humans. This situation is ironic given the large investment the US has put into gathering scientific data. The only alternative is automated discovery. It is our thesis that the emerging technology of cached sufficient statistics will be critical to developing automated discovery on massive data. A cached sufficient statistics representation is a data structure that summarizes statistical information in a database. For example, human users, or statistical programs, often need to query some quantity (such as a mean or variance) about some subset of the attributes (such as size, position and shape) over some subset of the records. When this happens, we want the cached sufficient statistics representation to intercept the request and, instead of answering it slowly by database accesses over billions of records, answer it immediately. The interesting technical challenge is: given that there ...
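The mean-or-variance query the abstract uses as its example can be intercepted by a tiny cache of (count, sum, sum-of-squares) per group, so the answer never touches the underlying records again. The class and the synthetic grouped data below are illustrative, not structures from the thesis:

```python
import numpy as np

class SufficientStatsCache:
    """Cache (count, sum, sum of squares) per group so that mean/variance
    queries are answered from the summary, not by rescanning the records."""

    def __init__(self, keys, values):
        self.stats = {}
        for k, v in zip(keys, values):
            n, s, s2 = self.stats.get(k, (0, 0.0, 0.0))
            self.stats[k] = (n + 1, s + v, s2 + v * v)

    def mean(self, key):
        n, s, _ = self.stats[key]
        return s / n

    def variance(self, key):
        # Population variance from the cached moments: E[x^2] - E[x]^2.
        n, s, s2 = self.stats[key]
        return s2 / n - (s / n) ** 2

rng = np.random.default_rng(5)
keys = rng.integers(0, 3, 10_000)        # hypothetical grouping attribute
values = rng.normal(0.0, 1.0, 10_000)    # hypothetical numeric attribute
cache = SufficientStatsCache(keys, values)
```

One linear pass builds the cache; every subsequent query is O(1) regardless of how many records fall in the group.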
Nonparametric Optimization and Galactic Morphology
Abstract

Cited by 1 (1 self)
We also introduce a new algorithm for optimization of similarity-based data. In a problem where only the similarity metric is defined, a gradient is rarely available. The essence of MBR is the similarity metric among stored examples. The new algorithm, Pairwise Bisection, uses all pairs of stored examples to divide the space into many smaller spaces and uses a nonparametric statistic to decide on their promise. The nonparametric statistic is Kendall's tau, which is used to measure the probability that a given point is at an optimum. Because it is fundamentally nonparametric, the algorithm is also robust to non-Gaussian noise and outliers.
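Kendall's tau itself is easy to compute from scratch as a normalized count of concordant minus discordant pairs, which is what makes it rank-based and hence robust to outliers. The toy inputs below are illustrative (the thesis applies the statistic inside Pairwise Bisection, which is not reproduced here):

```python
import numpy as np

def kendall_tau(x, y):
    # tau = (concordant - discordant) / (n choose 2), with no tie correction.
    # Uses only the ordering of values, never their magnitudes.
    n = len(x)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign((x[i] - x[j]) * (y[i] - y[j]))
    return 2.0 * s / (n * (n - 1))

x = np.arange(20, dtype=float)
tau_mono = kendall_tau(x, x ** 3)   # strictly monotone map: all pairs concordant
tau_rev = kendall_tau(x, -x)        # reversed order: all pairs discordant
```

Any strictly monotone transformation of either argument leaves tau unchanged, which is the robustness property the abstract appeals to.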