Results 1 - 10 of 81
How many clusters? Which clustering method? Answers via model-based cluster analysis
 THE COMPUTER JOURNAL
, 1998
Model-Based Clustering, Discriminant Analysis, and Density Estimation
 JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
, 2000
Cited by 259 (24 self)
Cluster analysis is the automated search for groups of related observations in a data set. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures, and most clustering methods available in commercial software are also of this type. However, there is little systematic guidance associated with these methods for solving important practical questions that arise in cluster analysis, such as "How many clusters are there?", "Which clustering method should be used?" and "How should outliers be handled?". We outline a general methodology for model-based clustering that provides a principled statistical approach to these issues. We also show that this can be useful for other problems in multivariate analysis, such as discriminant analysis and multivariate density estimation. We give examples from medical diagnosis, minefield detection, cluster recovery from noisy data, and spatial density estimation. Finally, we mention limitations of the methodology, a...
Model-Based Clustering and Data Transformations for Gene Expression Data
, 2001
Cited by 124 (8 self)
Motivation: Clustering is a useful exploratory technique for the analysis of gene expression data. Many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. In particular, model-based clustering assumes that the data are generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. The issues of selecting a 'good' clustering method and determining the 'correct' number of clusters are reduced to model selection problems in the probability framework. Gaussian mixture models have been shown to be a powerful tool for clustering in many applications.
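As a reading aid, the finite Gaussian mixture that this abstract refers to can be written out as below, together with the Bayesian Information Criterion (BIC) that other entries in this list use for model selection. The symbols (G components, mixing proportions \pi_k, parameter count p_M, sample size n) and the "larger is better" sign convention for BIC are assumptions made here for illustration; the abstract itself does not fix any notation.

```latex
% Density of a finite mixture of G multivariate normal components
f(x) \;=\; \sum_{k=1}^{G} \pi_k \, \phi(x \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{G} \pi_k = 1

% BIC for a candidate model M (a choice of G and of covariance structure),
% with maximized likelihood \hat{L}_M, p_M free parameters and n observations
\mathrm{BIC}_M \;=\; 2 \log \hat{L}_M \;-\; p_M \log n
```

Under this convention, choosing the clustering method and the number of clusters amounts to comparing \mathrm{BIC}_M across candidate models and preferring the largest value.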
An Analysis of Recent Work on Clustering Algorithms
, 1999
Cited by 73 (0 self)
This paper describes four recent papers on clustering, each of which approaches the clustering problem from a different perspective and with different goals. It analyzes the strengths and weaknesses of each approach and describes how a user could decide which algorithm to use for a given clustering application. Finally, it concludes with ideas that could make the selection and use of clustering algorithms for data analysis less difficult.
Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood
 Statistics and Computing
, 1998
Cited by 65 (4 self)
Cross-validated likelihood is investigated as a tool for automatically determining the appropriate number of components (given the data) in finite mixture modelling, particularly in the context of model-based probabilistic clustering. The conceptual framework for the cross-validation approach to model selection is direct in the sense that models are judged directly on their out-of-sample predictive performance. The method is applied to a well-known clustering problem in the atmospheric science literature using historical records of upper atmosphere geopotential height in the Northern hemisphere. Cross-validated likelihood provides strong evidence for three clusters in the data set, providing an objective confirmation of earlier results derived using non-probabilistic clustering techniques. 1 Introduction Cross-validation is a well-known technique in supervised learning to select a model from a family of candidate models. Examples include selecting the best classification tree using cr...
Clustering using Monte Carlo Cross-Validation
, 1996
Cited by 64 (0 self)
Finding the "right" number of clusters, k, for a data set is a difficult, and often ill-posed, problem. In a probabilistic clustering context, likelihood ratios, penalized likelihoods, and Bayesian techniques are among the more popular techniques. In this paper a new cross-validated likelihood criterion is investigated for determining cluster structure. A practical clustering algorithm based on Monte Carlo cross-validation (MCCV) is introduced. The algorithm permits the data analyst to judge if there is strong evidence for a particular k, or perhaps weaker evidence over a sub-range of k values. Experimental results with Gaussian mixtures on real and simulated data suggest that MCCV provides genuine insight into cluster structure. v-fold cross-validation appears inferior to the penalized likelihood method (BIC), a Bayesian algorithm (AutoClass v2.0), and the new MCCV algorithm. Overall, MCCV and AutoClass appear the most reliable of the methods. MCCV provides the data miner with a usefu...
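The general idea of Monte Carlo cross-validation for choosing k can be sketched as follows: repeatedly split the data at random into training and test parts, fit a k-component mixture to the training part, and average the held-out log-likelihood per k. This is a minimal illustration under stated assumptions, not the paper's algorithm: it is restricted to one-dimensional Gaussian mixtures, and the function names (`fit_gmm_1d`, `mccv_scores`), the quantile-based EM start, the variance floor, the number of splits, and the 50% test fraction are all hypothetical choices made here.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of log values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def fit_gmm_1d(data, k, iters=100, min_var=1e-3):
    """Fit a k-component 1-D Gaussian mixture by EM (illustrative only)."""
    xs = sorted(data)
    n = len(xs)
    # Assumed deterministic start: means at evenly spaced quantiles.
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    centre = sum(xs) / n
    var0 = max(sum((x - centre) ** 2 for x in xs) / n, min_var)
    variances = [var0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in xs:
            logs = [math.log(weights[j]) + log_normal_pdf(x, means[j], variances[j])
                    for j in range(k)]
            total = log_sum_exp(logs)
            resp.append([math.exp(v - total) for v in logs])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12  # guard against empty components
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(spread, min_var)  # variance floor (assumption)
    return weights, means, variances


def mean_log_likelihood(data, params):
    """Average log-likelihood per point of data under a fitted mixture."""
    weights, means, variances = params
    total = 0.0
    for x in data:
        total += log_sum_exp([math.log(w) + log_normal_pdf(x, m, v)
                              for w, m, v in zip(weights, means, variances)])
    return total / len(data)


def mccv_scores(data, k_values, splits=20, test_fraction=0.5, seed=0):
    """Average held-out log-likelihood for each k over random splits."""
    rng = random.Random(seed)
    n_test = int(len(data) * test_fraction)
    scores = {k: 0.0 for k in k_values}
    for _ in range(splits):
        shuffled = list(data)
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        for k in k_values:
            scores[k] += mean_log_likelihood(test, fit_gmm_1d(train, k)) / splits
    return scores


if __name__ == "__main__":
    # Synthetic data from two well-separated groups (purely illustrative).
    demo_rng = random.Random(1)
    demo = ([demo_rng.gauss(0.0, 1.0) for _ in range(60)]
            + [demo_rng.gauss(8.0, 1.0) for _ in range(60)])
    for k, score in sorted(mccv_scores(demo, [1, 2, 3], splits=5).items()):
        print(k, round(score, 3))
```

A clearly higher score for one k suggests strong evidence for that number of clusters, while near-equal scores across several k correspond to the weaker evidence over a range of values that the abstract mentions.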
MCLUST: Software for Model-based Cluster Analysis
 Journal of Classification
, 1999
Cited by 52 (16 self)
MCLUST is a software package for cluster analysis written in Fortran and interfaced to the S-PLUS commercial software package. It implements parameterized Gaussian hierarchical clustering algorithms [16, 1, 7] and the EM algorithm for parameterized Gaussian mixture models [5, 13, 3, 14] with the possible addition of a Poisson noise term. MCLUST also includes functions that combine hierarchical clustering, EM and the Bayesian Information Criterion (BIC) in a comprehensive clustering strategy [4, 8]. Methods of this type have shown promise in a number of practical applications, including character recognition [16], tissue segmentation [1], minefield and seismic fault detection [4], identification of textile flaws from images [2], and classification of astronomical data [3, 15]. A web page with related links can be found at
Variable Selection for Model-Based Clustering
 Journal of the American Statistical Association
, 2006
Cited by 46 (4 self)
We consider the problem of variable or feature selection for model-based clustering. We recast the problem of comparing two nested subsets of variables as a model comparison problem, and address it using approximate Bayes factors. We develop a greedy search algorithm for finding a local optimum in model space. The resulting method selects variables (or features), the number of clusters, and the clustering model simultaneously. We applied the method to several simulated and real examples, and found that removing irrelevant variables often improved performance. Compared to methods based on all the variables, our variable selection method consistently yielded more accurate estimates of the number of clusters, and lower classification error rates, as well as more parsimonious clustering models and easier visualization of results.
MCLUST Version 3 for R: Normal Mixture Modeling and Model-Based Clustering
 Department of Statistics, University of Washington
, 2006
Cited by 44 (1 self)
MCLUST is a contributed R package for normal mixture modeling and model-based clustering. It provides functions for parameter estimation via the EM algorithm for normal mixture models with a variety of covariance structures, and functions for simulation from these models. Also included are functions that combine model-based hierarchical clustering, EM for mixture estimation and the Bayesian Information Criterion (BIC) in comprehensive strategies for clustering, density estimation and discriminant analysis. There is additional functionality for displaying and visualizing the models along with clustering and classification results. A number of features of the software have been changed in this version, and the functionality has been expanded to include regularization for normal mixture models via a Bayesian prior. MCLUST is licensed by the University of Washington and distributed through
Clustering for sparsely sampled functional data
 Journal of the American Statistical Association
, 2003
Cited by 43 (6 self)
We develop a flexible model-based procedure for clustering functional data. The technique can be applied to all types of curve data but is particularly useful when individuals are observed at a sparse set of time points. In addition to producing final cluster assignments, the procedure generates predictions and confidence intervals for missing portions of curves. Our approach also provides many useful tools for evaluating the resulting models. Clustering can be assessed visually via low-dimensional representations of the curves, and the regions of greatest separation between clusters can be determined using a discriminant function. Finally, we extend the model to handle multiple functional and finite-dimensional covariates and show how it can be applied to standard finite-dimensional clustering problems involving missing data.