Results 1–10 of 13
Model-Based Clustering, Discriminant Analysis, and Density Estimation
 Journal of the American Statistical Association, 2000
Abstract

Cited by 402 (28 self)
Cluster analysis is the automated search for groups of related observations in a data set. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures, and most clustering methods available in commercial software are also of this type. However, there is little systematic guidance associated with these methods for solving important practical questions that arise in cluster analysis, such as "How many clusters are there?", "Which clustering method should be used?", and "How should outliers be handled?". We outline a general methodology for model-based clustering that provides a principled statistical approach to these issues. We also show that this can be useful for other problems in multivariate analysis, such as discriminant analysis and multivariate density estimation. We give examples from medical diagnosis, minefield detection, cluster recovery from noisy data, and spatial density estimation. Finally, we mention limitations of the methodology, a...
X-means: Extending K-means with Efficient Estimation of the Number of Clusters
 In Proceedings of the 17th International Conf. on Machine Learning, 2000
Abstract

Cited by 368 (5 self)
Despite its popularity for general clustering, K-means suffers three major shortcomings: it scales poorly computationally, the number of clusters K has to be supplied by the user, and the search is prone to local minima. We propose solutions for the first two problems, and a partial remedy for the third. Building on prior work for algorithmic acceleration that is not based on approximation, we introduce a new algorithm that efficiently searches the space of cluster locations and number of clusters to optimize the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) measure. The innovations include two new ways of exploiting cached sufficient statistics and a new, very efficient test that in one K-means sweep selects the most promising subset of classes for refinement. This gives rise to a fast, statistically founded algorithm that outputs both the number of classes and their parameters. Experiments show this technique reveals the true number of classes in the underlying distribution, and that it is much faster than repeatedly using accelerated K-means for different values of K.
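The core of the approach can be sketched as scoring plain K-means runs with a spherical-Gaussian BIC and keeping the K with the best score. This is a simplified illustration, not the paper's accelerated algorithm; the function names, synthetic data, and exact penalty form are assumptions:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's algorithm (no acceleration)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    return centers, labels

def bic(X, centers, labels):
    """Spherical-Gaussian BIC: log-likelihood minus a complexity penalty."""
    n, d = X.shape
    k = len(centers)
    sigma2 = ((X - centers[labels]) ** 2).sum() / max(n - k, 1)
    ll = 0.0
    for j in range(k):
        nj = int((labels == j).sum())
        if nj > 0:
            ll += (nj * np.log(nj / n)
                   - nj * d / 2.0 * np.log(2 * np.pi * sigma2)
                   - (nj - 1) * d / 2.0)
    p = k * (d + 1)  # k centers plus a variance term per cluster
    return ll - p / 2.0 * np.log(n)

# Score several values of K on data with two well-separated clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(5, 0.3, (100, 2))])
best_k = max((1, 2, 3, 4), key=lambda k: bic(X, *kmeans(X, k)))
```

The likelihood term rewards tighter clusters while the `p/2 · log n` penalty discourages adding clusters that explain little, so the score peaks at the true K.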
'N-Body' Problems in Statistical Learning
2001
Abstract

Cited by 113 (14 self)
We present efficient algorithms for all-point-pairs problems, or 'N-body'-like problems, which are ubiquitous in statistical learning. We focus on six examples, including nearest-neighbor classification, kernel density estimation, outlier detection, and the two-point correlation.
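Kernel density estimation is the archetypal all-point-pairs sum such methods accelerate. A brute-force version (the Gaussian kernel, bandwidth, and names here are illustrative, not the paper's algorithms) makes the O(N·M) cost explicit:

```python
import numpy as np

def kde_naive(query, data, h):
    """Brute-force Gaussian KDE: sums a kernel over every (query, data) pair.
    Tree-based 'N-body' methods approximate this sum in far less time."""
    # All pairwise squared distances between query and data points.
    d2 = ((query[:, None, :] - data[None, :, :]) ** 2).sum(-1)
    dim = data.shape[1]
    norm = len(data) * (2 * np.pi * h * h) ** (dim / 2)
    return np.exp(-d2 / (2 * h * h)).sum(axis=1) / norm

# Density is high near the data's mode and essentially zero far away.
data = np.random.default_rng(0).normal(0.0, 1.0, (500, 1))
dens = kde_naive(np.array([[0.0], [10.0]]), data, h=0.5)
```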
The Anchors Hierarchy: Using the Triangle Inequality to Survive High Dimensional Data
 In Twelfth Conference on Uncertainty in Artificial Intelligence, 2000
Abstract

Cited by 79 (9 self)
This paper is about metric data structures in high-dimensional or non-Euclidean space that permit cached-sufficient-statistics accelerations of learning algorithms.
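The underlying trick can be sketched with a single anchor point and cached distances (the paper's full hierarchy is more elaborate; the names here are illustrative). For any anchor a, the triangle inequality gives |d(q, a) − d(a, p)| ≤ d(q, p), so a candidate p can be rejected without ever computing d(q, p):

```python
import math
import random

def nn_with_anchor(query, points, anchor):
    """Nearest-neighbor search pruned by the triangle inequality:
    if |d(q, anchor) - d(anchor, p)| >= best-so-far, then d(q, p) >= best,
    so the expensive distance computation is skipped."""
    dq = math.dist(query, anchor)
    # Distances to the anchor are computed once and reused for every query.
    cached = sorted((math.dist(anchor, p), p) for p in points)
    best, best_p, skipped = float("inf"), None, 0
    for dap, p in cached:
        if abs(dq - dap) >= best:
            skipped += 1
            continue
        dp = math.dist(query, p)
        if dp < best:
            best, best_p = dp, p
    return best_p, skipped

random.seed(0)
pts = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
nearest, skipped = nn_with_anchor((0.3, -0.2), pts, anchor=pts[0])
```

Only the anchor-to-point distances need storage; the pruning works in any metric space, which is why the approach survives high-dimensional and non-Euclidean data.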
Fast and robust short video clip search using an index structure
 In 6th ACM SIGMM International Workshop on Multimedia Information Retrieval (MIR ’04), 2004
Abstract

Cited by 41 (6 self)
Query by video clip (QVC) has attracted wide research interest in multimedia information retrieval. In general, QVC may include feature extraction, similarity measurement, database organization, and a search or query scheme. Towards an effective and efficient solution, diverse applications have different considerations and challenges in the above-mentioned phases. In this paper, we first attempt to broadly categorize most existing QVC work into three levels: concept-based video retrieval, video title identification, and video copy detection. This three-level categorization is expected to explicitly identify typical applications, robustness requirements, likely features, and the main challenges between mature techniques and hard performance requirements. A brief survey is presented to concretize the QVC categorization. Under this categorization, we focus on the copy detection task, where the challenges lie mainly in the design of compact and robust low-level features (i.e., an effective signature) and a fast searching mechanism. In order to effectively and robustly characterize video segments of variable lengths, we design a novel global visual feature (a fixed-size 144-d sig...
Repairing Faulty Mixture Models using Density Estimation
 In Proceedings of the 18th International Conf. on Machine Learning, 2001
Abstract

Cited by 15 (1 self)
Previous work in mixture model clustering has focused primarily on the issue of model selection. Model scoring functions (including penalized likelihood and Bayesian approximations) can guide a search of the model parameter and structure space. Relatively little research has addressed the issue of how to move through this space. Local optimization techniques, such as expectation maximization, solve only part of the problem; we still need to move between different local optima.
Summary of biosurveillance-relevant technologies
2003
Abstract

Cited by 4 (2 self)
This short report, compiled upon request from Dave Siegrist and Ted Senator, surveys the spectrum of technologies that can help with biosurveillance. We indicate which we have chosen, so far, to use in our development of analysis methods, and our reasons.

1. Time-weighted averaging. This is directly applicable to a scalar signal (such as "number of respiratory cases today"). This method, more commonly used in computational finance, simply compares the count during the current time period with the weighted average of the counts of recent days. Exponential weighting is typically used, where the half-life is known as the "time window" parameter. This time-window parameter is typically chosen by hand. We prefer the Serfling and univariate HMM methods described below.

2. Serfling method. This method (Serfling, 1963) is a cyclic regression model, and is the standard CDC algorithm for flu detection. It is, again, applicable to scalar signals. It assumes that the signal follows a sinusoid with a period of one year, and thus finds the four parameters α, β, γ, and δ in y(t) = α + βt + γ sin(2πt/365.25) + δ cos(2πt/365.25), where the parameters are chosen to minimize the sum of squares of residuals. It is an easy matter of regression analysis to determine, on any date, whether...
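The cyclic-regression fit described above reduces to an ordinary least-squares problem. A minimal sketch (the 365.25-day period, column layout, and function name are assumptions for illustration, not the report's code):

```python
import numpy as np

def fit_serfling(t, y, period=365.25):
    """Least-squares fit of y ≈ a + b·t + c·sin(2πt/period) + d·cos(2πt/period)."""
    X = np.column_stack([np.ones_like(t), t,
                         np.sin(2 * np.pi * t / period),
                         np.cos(2 * np.pi * t / period)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, X @ coef  # parameters and fitted seasonal baseline

# Recover known parameters from a noiseless two-year synthetic signal.
t = np.arange(730, dtype=float)
y = 10 + 0.01 * t + 3 * np.sin(2 * np.pi * t / 365.25) + 1.5 * np.cos(2 * np.pi * t / 365.25)
coef, baseline = fit_serfling(t, y)
```

On real counts, an observed value far above the fitted baseline (relative to the residual spread) would flag a potential outbreak for that date.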
Summary of Biosurveillance-relevant statistical and data mining technologies
2002
Abstract

Cited by 3 (0 self)
...this document (or elsewhere) construct a detailed spatio-temporal probabilistic causal model of the population and use that model to infer the disease status of (1) the population and (2) each member of the population...
Nonparametric Optimization and Galactic Morphology
Abstract

Cited by 1 (1 self)
We also introduce a new algorithm for optimization of similarity-based data. In a problem where only the similarity metric is defined, a gradient is rarely possible. The essence of MBR is the similarity metric among stored examples. The new algorithm, Pairwise Bisection, uses all pairs of stored examples to divide the space into many smaller spaces and uses a nonparametric statistic to decide on their promise. The nonparametric statistic is Kendall's tau, which is used to measure the probability that a given point is at an optimum. Because it is fundamentally nonparametric, the algorithm is also robust to non-Gaussian noise and outliers.
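Kendall's tau itself is a simple all-pairs rank statistic. As a reference point (an O(n²) textbook version; the thesis's Pairwise Bisection search is not reproduced here):

```python
def kendall_tau(x, y):
    """Kendall's rank correlation: (concordant - discordant) pairs
    divided by the total number of pairs."""
    n = len(x)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                conc += 1   # pair ordered the same way in x and y
            elif s < 0:
                disc += 1   # pair ordered oppositely
    return (conc - disc) / (n * (n - 1) / 2)
```

Because it depends only on orderings, not magnitudes, the statistic is unchanged by any monotone transform of the data, which is what buys robustness to non-Gaussian noise and outliers.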
Cached Sufficient Statistics for Automated Mining and Discovery from Massive Data Sources
2000
Abstract

Cited by 1 (0 self)
...into a new, fundamentally impossible realm where the data sources are just too large to assimilate by humans. This situation is ironic given the large investment the US has put into gathering scientific data. The only alternative is automated discovery. It is our thesis that the emerging technology of cached sufficient statistics will be critical to developing automated discovery on massive data. A cached sufficient statistics representation is a data structure that summarizes statistical information in a database. For example, human users, or statistical programs, often need to query some quantity (such as a mean or variance) about some subset of the attributes (such as size, position, and shape) over some subset of the records. When this happens, we want the cached sufficient statistics representation to intercept the request and, instead of answering it slowly by database accesses over billions of records, answer it immediately. The interesting technical challenge is: given that there...
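A toy version of the idea (the class name and per-key layout are illustrative, not the proposal's data structure): caching (n, Σx, Σx²) per group lets mean and variance queries be answered from the cache alone, without touching the records again.

```python
from collections import defaultdict

class SuffStats:
    """Cached sufficient statistics: per key, store (count, sum, sum of
    squares) so mean/variance queries never rescan the underlying records."""
    def __init__(self):
        self._s = defaultdict(lambda: [0, 0.0, 0.0])

    def add(self, key, x):
        s = self._s[key]
        s[0] += 1
        s[1] += x
        s[2] += x * x

    def mean(self, key):
        n, sx, _ = self._s[key]
        return sx / n

    def variance(self, key):
        n, sx, sxx = self._s[key]
        m = sx / n
        return sxx / n - m * m  # population variance from cached sums

stats = SuffStats()
for size in (1.0, 2.0, 3.0):
    stats.add("galaxy", size)
```

Each record is touched once at ingest; every later query is O(1), which is the point when the records number in the billions.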