Results 1 - 9 of 9
Learning the k in k-means
In Proc. 17th NIPS, 2003
Cited by 85 (6 self)
When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does
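The split-until-Gaussian idea described in this abstract can be illustrated with a minimal sketch. Several things here are assumptions rather than details from the abstract: the Gaussianity test is taken to be the Anderson-Darling statistic (the abstract does not name the test), the data are 1-D so no projection is needed, the threshold 1.8692 is assumed to correspond to a very strict significance level, and clusters are split recursively instead of re-running k-means over all centers after each round. The function names are hypothetical.

```python
from math import log
from statistics import NormalDist, mean, stdev


def anderson_darling(xs):
    """Corrected A^2 normality statistic with mean and variance estimated from xs."""
    n = len(xs)
    mu, sigma = mean(xs), stdev(xs)
    std_normal = NormalDist()
    # CDF of the standardized, sorted sample, clamped to avoid log(0).
    z = [min(max(std_normal.cdf((x - mu) / sigma), 1e-12), 1 - 1e-12)
         for x in sorted(xs)]
    s = sum((2 * i + 1) * (log(z[i]) + log(1 - z[n - 1 - i])) for i in range(n))
    a2 = -n - s / n
    # ASSUMPTION: small-sample correction for the estimated-parameters case.
    return a2 * (1 + 4 / n - 25 / n ** 2)


def two_means(xs, iters=20):
    """Split 1-D points into two groups with a short k-means run (k = 2)."""
    c0, c1 = min(xs), max(xs)
    left, right = list(xs), []
    for _ in range(iters):
        left = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        right = [x for x in xs if abs(x - c0) > abs(x - c1)]
        if not left or not right:
            break
        c0, c1 = mean(left), mean(right)
    return left, right


def g_means(xs, critical=1.8692, min_size=8):
    """Return 1-D clusters, splitting every cluster that fails the Gaussian test.

    ASSUMPTION: `critical` and `min_size` are illustrative defaults.
    """
    pending, accepted = [list(xs)], []
    while pending:
        cluster = pending.pop()
        if (len(cluster) < min_size or min(cluster) == max(cluster)
                or anderson_darling(cluster) <= critical):
            accepted.append(cluster)            # Gaussian hypothesis not rejected
        else:
            pending.extend(two_means(cluster))  # rejected: replace by two clusters
    return accepted
```

Calling `g_means(values)` on a list of floats returns a list of clusters; the number of clusters found is `len(g_means(values))`. A lower `critical` value makes the test reject more readily and therefore tends to produce more clusters.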
Simultaneous feature selection and clustering using mixture models
IEEE Trans. Pattern Anal. Mach. Intell., 2004
Cited by 73 (1 self)
Clustering is a common unsupervised learning technique used to discover group structure in a set of data. While there exist many algorithms for clustering, the important issue of feature selection, that is, what attributes of the data should be used by the clustering algorithms, is rarely touched upon. Feature selection for clustering is difficult because, unlike in supervised learning, there are no class labels for the data and, thus, no obvious criteria to guide the search. Another important problem in clustering is the determination of the number of clusters, which clearly impacts and is influenced by the feature selection issue. In this paper, we propose the concept of feature saliency and introduce an expectation-maximization (EM) algorithm to estimate it, in the context of mixture-based clustering. Due to the introduction of a minimum message length model selection criterion, the saliency of irrelevant features is driven toward zero, which corresponds to performing feature selection. The criterion and algorithm are then extended to simultaneously estimate the feature saliencies and the number of clusters.
Alternatives to the k-Means Algorithm That Find Better Clusterings
Cited by 42 (5 self)
We investigate here the behavior of the standard k-means clustering algorithm and several alternatives to it: the k-harmonic means algorithm due to Zhang and colleagues, fuzzy k-means, Gaussian expectation-maximization, and two new variants of k-harmonic means. Our aim is to find which aspects of these algorithms contribute to finding good clusterings, as opposed to converging to a low-quality local optimum. We describe each algorithm in a unified framework that introduces separate cluster membership and data weight functions.
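One way to picture a framework with separate membership and weight functions is a center update in which each point contributes in proportion to membership times weight. The abstract does not give the update rule, so the form below is an assumption, as are all function names; only the k-means instantiation (hard membership, unit weight) is shown, on 1-D data for brevity.

```python
def center_update(points, centers, membership, weight):
    """One iteration: each center becomes a membership- and weight-averaged mean.

    ASSUMPTION: the coefficient of point x for center j is
    membership(x, j, centers) * weight(x, centers).
    """
    new_centers = []
    for j in range(len(centers)):
        coeffs = [membership(x, j, centers) * weight(x, centers) for x in points]
        total = sum(coeffs)
        if total:
            new_centers.append(sum(c * x for c, x in zip(coeffs, points)) / total)
        else:
            new_centers.append(centers[j])  # no contributing points: keep center
    return new_centers


def hard_membership(x, j, centers):
    """k-means style membership: 1 for the closest center, 0 otherwise."""
    nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
    return 1.0 if j == nearest else 0.0


def unit_weight(x, centers):
    """k-means style weight: every point counts equally."""
    return 1.0
```

Swapping in a soft membership function or a distance-dependent weight function changes the algorithm without touching `center_update`, which is the appeal of keeping the two roles separate.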
Accelerated EM-based clustering of large data sets
Data Mining and Knowledge Discovery, 2006
Cited by 18 (2 self)
Motivated by the poor performance (linear complexity) of the EM algorithm in clustering large data sets, and inspired by the successful accelerated versions of related algorithms like k-means, we derive an accelerated variant of the EM algorithm for Gaussian mixtures that: (1) offers speedups that are at least linear in the number of data points, (2) ensures convergence by strictly increasing a lower bound on the data log-likelihood in each learning step, and (3) allows ample freedom in the design of other accelerated variants. We also derive a similar accelerated algorithm for greedy mixture learning, where very satisfactory results are obtained. The core idea is to define a lower bound on the data log-likelihood based on a grouping of data points. The bound is maximized by computing in turn (i) optimal assignments of groups of data points to the mixture components, and (ii) optimal re-estimation of the model parameters based on average sufficient statistics computed over groups of data points. The proposed method naturally generalizes to mixtures of other members of the exponential family. Experimental results show the potential of the proposed method over other state-of-the-art acceleration techniques.
Very Fast Outlier Detection in Large Multidimensional Data Sets
In DMKD'02, 2002
Cited by 15 (0 self)
Outliers are objects that do not comply with the general behavior of the data. Applications such as exploration in science databases need fast interactive tools for outlier detection in data sets that have unknown distributions, are large in size, and are in high-dimensional space. Existing algorithms for outlier detection are too slow for such applications. We present an algorithm based on an innovative use of kd-trees that doesn't assume any probability model and is linear in the number of objects and in the number of dimensions. We also provide experimental results that show that this is indeed a practical solution to the above problem.
PG-means: learning the number of clusters in data
Advances in Neural Information Processing Systems 19, 2007
Cited by 8 (0 self)
We present a novel algorithm called PG-means which is able to learn the number of clusters in a classical Gaussian mixture model. Our method is robust and efficient; it uses statistical hypothesis tests on one-dimensional projections of the data and model to determine if the examples are well represented by the model. In so doing, we are applying a statistical test for the entire model at once, not just on a per-cluster basis. We show that our method works well in difficult cases such as non-Gaussian data, overlapping clusters, eccentric clusters, high dimension, and many true clusters. Further, our new method provides a much more stable estimate of the number of clusters than existing methods.
Clustering, Dimensionality Reduction, and Side Information
2006
Cited by 4 (0 self)
Recent advances in sensing and storage technology have created many high-volume, high-dimensional data sets in pattern recognition, machine learning, and data mining. Unsupervised learning can provide generic tools for analyzing and summarizing these data sets when there is no well-defined notion of classes. The purpose of this thesis is to study some of the open problems in two main areas of unsupervised learning, namely clustering and (unsupervised) dimensionality reduction. Instance-level constraints on objects, an example of side-information, are also considered to improve the clustering results. Our first contribution is a modification to the isometric feature mapping (ISOMAP) algorithm for the case where the input data, instead of being all available simultaneously, arrive sequentially from a data stream. ISOMAP is representative of a class of nonlinear dimensionality reduction algorithms that are based on the notion of a manifold. Both the standard ISOMAP and the landmark version of ISOMAP are considered. Experimental results on synthetic data as well as real-world images demonstrate that the modified algorithm can maintain an accurate low-dimensional representation of the data in an efficient manner. We study the problem of feature selection in model-based clustering when the number of clusters
Weighted k-means for density-biased clustering
In DaWaK, 2005
Cited by 3 (0 self)
Clustering is the task of grouping data based on similarity. The popular k-means algorithm groups data by first assigning all data points to the closest clusters, then determining the cluster means. The algorithm repeats these two steps until it has converged. We propose a variation called weighted k-means to improve the clustering scalability. To speed up the clustering process, we develop reservoir-biased sampling as an efficient data reduction technique, since it performs a single scan over a data set. Our algorithm has been designed to group data of mixture models. We present an experimental evaluation of the proposed method.
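The two alternating steps described in this abstract (assign points to the closest center, then recompute the means) extend naturally to points that carry weights. The sketch below is an illustration under assumptions, not the authors' method: it assumes each point has a non-negative weight and that a center is the weighted mean of its assigned points, and it leaves out the sampling step entirely. The function name and parameters are hypothetical.

```python
from math import dist


def weighted_kmeans(points, weights, centers, max_iter=100):
    """Alternate assignment and weighted-mean steps until no center moves.

    ASSUMPTION: weights are non-negative and act as point multiplicities.
    """
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Step 1: assign every point to its closest center.
        labels = [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
                  for p in points]
        # Step 2: move each center to the weighted mean of its points.
        new_centers = []
        for j, old in enumerate(centers):
            members = [(p, w) for p, w, l in zip(points, weights, labels) if l == j]
            total = sum(w for _, w in members)
            if total == 0:
                new_centers.append(old)  # empty cluster: keep its center in place
                continue
            new_centers.append(tuple(
                sum(w * p[d] for p, w in members) / total
                for d in range(len(old))))
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers, labels
```

With all weights equal to 1 this reduces to ordinary k-means; a weight of 3 pulls a center toward that point as if the point appeared three times.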
Learning the k in k-means
When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does