How many clusters? Which clustering method? Answers via model-based cluster analysis
- THE COMPUTER JOURNAL
, 1998
Model-Based Clustering, Discriminant Analysis, and Density Estimation
- JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
, 2000
Cited by 172 (23 self)
Cluster analysis is the automated search for groups of related observations in a data set. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures, and most clustering methods available in commercial software are also of this type. However, there is little systematic guidance associated with these methods for solving important practical questions that arise in cluster analysis, such as "How many clusters are there?", "Which clustering method should be used?" and "How should outliers be handled?". We outline a general methodology for model-based clustering that provides a principled statistical approach to these issues. We also show that this can be useful for other problems in multivariate analysis, such as discriminant analysis and multivariate density estimation. We give examples from medical diagnosis, minefield detection, cluster recovery from noisy data, and spatial density estimation. Finally, we mention limitations of the methodology, a...
Model-Based Clustering and Data Transformations for Gene Expression Data
, 2001
Cited by 88 (8 self)
Motivation: Clustering is a useful exploratory technique for the analysis of gene expression data. Many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. In particular, model-based clustering assumes that the data is generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. The issues of selecting a 'good' clustering method and determining the 'correct' number of clusters are reduced to model selection problems in the probability framework. Gaussian mixture models have been shown to be a powerful tool for clustering in many applications.
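For reference, the finite Gaussian mixture assumption in this abstract is commonly written as below. The notation ($G$ components, mixing proportions $\tau_k$) is a conventional choice assumed here, not taken from the abstract. The second line gives the Bayesian Information Criterion as one widely used example of a model selection score, in the sign convention where larger values are preferred.

```latex
f(x_i) = \sum_{k=1}^{G} \tau_k \, \phi(x_i \mid \mu_k, \Sigma_k),
\qquad \tau_k \ge 0, \quad \sum_{k=1}^{G} \tau_k = 1

\mathrm{BIC} = 2 \log \hat{L} - p \log n
```

Here $\phi$ is the multivariate normal density, $\hat{L}$ is the maximized likelihood of the mixture, $p$ is the number of free parameters, and $n$ is the number of observations. Comparing this score across candidate values of $G$ and covariance structures is one way the choice of clustering method and number of clusters becomes a model selection problem.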
An Analysis of Recent Work on Clustering Algorithms
, 1999
Cited by 61 (0 self)
This paper describes four recent papers on clustering, each of which approaches the clustering problem from a different perspective and with different goals. It analyzes the strengths and weaknesses of each approach and describes how a user could decide which algorithm to use for a given clustering application. Finally, it concludes with ideas that could make the selection and use of clustering algorithms for data analysis less difficult.
Clustering using Monte Carlo Cross-Validation
, 1996
Cited by 56 (0 self)
Finding the "right" number of clusters, k, for a data set is a difficult, and often ill-posed, problem. In a probabilistic clustering context, likelihood ratios, penalized likelihoods, and Bayesian techniques are among the more popular approaches. In this paper a new cross-validated likelihood criterion is investigated for determining cluster structure. A practical clustering algorithm based on Monte Carlo cross-validation (MCCV) is introduced. The algorithm permits the data analyst to judge if there is strong evidence for a particular k, or perhaps weaker evidence over a sub-range of k values. Experimental results with Gaussian mixtures on real and simulated data suggest that MCCV provides genuine insight into cluster structure. v-fold cross-validation appears inferior to the penalized likelihood method (BIC), a Bayesian algorithm (AutoClass v2.0), and the new MCCV algorithm. Overall, MCCV and AutoClass appear the most reliable of the methods. MCCV provides the data-miner with a usefu...
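The idea of scoring each candidate k by held-out likelihood over repeated random splits can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes one-dimensional data, a plain EM fit written from scratch, and hypothetical function names and defaults (`n_splits`, `test_fraction`).

```python
import numpy as np

def log_components(x, weights, means, variances):
    """Log of weight_k * N(x_i | mean_k, var_k) for every point i and component k."""
    return (np.log(weights) - 0.5 * np.log(2 * np.pi * variances)
            - 0.5 * (x[:, None] - means) ** 2 / variances)

def fit_gmm_1d(x, k, n_iter=100, seed=0):
    """Fit a k-component 1-D Gaussian mixture by EM (illustrative, not tuned)."""
    rng = np.random.default_rng(seed)
    means = rng.choice(x, size=k, replace=False)   # initialise means at data points
    variances = np.full(k, x.var() + 1e-6)
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each point
        log_p = log_components(x, weights, means, variances)
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        resp = np.exp(log_p - log_norm)
        # M-step: re-estimate weights, means and variances
        nk = resp.sum(axis=0) + 1e-10
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    return weights, means, variances

def mean_log_likelihood(x, params):
    """Average log-likelihood per point of x under the fitted mixture."""
    log_p = log_components(x, *params)
    return float(np.logaddexp.reduce(log_p, axis=1).mean())

def mccv_scores(x, k_values, n_splits=20, test_fraction=0.5, seed=0):
    """Average held-out log-likelihood for each k over random train/test splits."""
    rng = np.random.default_rng(seed)
    n_test = int(len(x) * test_fraction)
    scores = {k: [] for k in k_values}
    for _ in range(n_splits):
        perm = rng.permutation(len(x))
        test, train = x[perm[:n_test]], x[perm[n_test:]]
        for k in k_values:
            params = fit_gmm_1d(train, k, seed=int(rng.integers(1 << 31)))
            scores[k].append(mean_log_likelihood(test, params))
    return {k: float(np.mean(v)) for k, v in scores.items()}
```

Calling `mccv_scores(x, [1, 2, 3, 4])` returns one averaged score per k. A clear peak suggests strong evidence for that k, while a plateau suggests weaker evidence over a range of values.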
Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood
- Statistics and Computing
, 1998
Cited by 46 (3 self)
Cross-validated likelihood is investigated as a tool for automatically determining the appropriate number of components (given the data) in finite mixture modelling, particularly in the context of model-based probabilistic clustering. The conceptual framework for the cross-validation approach to model selection is direct in the sense that models are judged directly on their out-of-sample predictive performance. The method is applied to a well-known clustering problem in the atmospheric science literature using historical records of upper atmosphere geopotential height in the Northern Hemisphere. Cross-validated likelihood provides strong evidence for three clusters in the data set, providing an objective confirmation of earlier results derived using non-probabilistic clustering techniques.
1 Introduction
Cross-validation is a well-known technique in supervised learning to select a model from a family of candidate models. Examples include selecting the best classification tree using cr...
MCLUST: Software for Model-based Cluster Analysis
- Journal of Classification
, 1999
Cited by 39 (16 self)
MCLUST is a software package for cluster analysis written in Fortran and interfaced to the S-PLUS commercial software package. It implements parameterized Gaussian hierarchical clustering algorithms [16, 1, 7] and the EM algorithm for parameterized Gaussian mixture models [5, 13, 3, 14] with the possible addition of a Poisson noise term. MCLUST also includes functions that combine hierarchical clustering, EM and the Bayesian Information Criterion (BIC) in a comprehensive clustering strategy [4, 8]. Methods of this type have shown promise in a number of practical applications, including character recognition [16], tissue segmentation [1], minefield and seismic fault detection [4], identification of textile flaws from images [2], and classification of astronomical data [3, 15]. A web page with related links can be found at
k-Plane Clustering
- Journal of Global Optimization
, 2000
Cited by 32 (3 self)
A finite new algorithm is proposed for clustering m given points in n-dimensional real space into k clusters by generating k planes that constitute a local solution to the nonconvex problem of minimizing the sum of squares of the 2-norm distances between each point and a nearest plane. The key to the algorithm lies in a formulation that generates a plane in n-dimensional space that minimizes the sum of the squares of the 2-norm distances to each of m1 given points in the space. The plane is generated by an eigenvector corresponding to a smallest eigenvalue of an n × n simple matrix derived from the m1 points. The algorithm was tested on the publicly available Wisconsin Breast Prognosis Cancer database to generate well-separated patient survival curves. In contrast, the k-mean algorithm did not generate such well-separated survival curves.
1 Introduction
There are many approaches to clustering such as statistical [2, 9, 6], machine learning [7, 8] and mathematical programming [15...
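The two alternating steps described in this abstract (fit a plane to each cluster, then reassign each point to its nearest plane) can be sketched as follows. This is a rough illustration under stated assumptions: the "simple matrix" is taken here to be the centered scatter matrix of the cluster's points, and the random initialisation, the empty-cluster handling, and the function names are hypothetical choices, not the paper's.

```python
import numpy as np

def fit_plane(points):
    """Plane w.x = gamma (with ||w|| = 1) minimising summed squared distances to points."""
    centroid = points.mean(axis=0)
    centered = points - centroid
    scatter = centered.T @ centered              # n x n symmetric matrix (assumed form)
    eigvals, eigvecs = np.linalg.eigh(scatter)   # eigenvalues in ascending order
    w = eigvecs[:, 0]                            # eigenvector of a smallest eigenvalue
    return w, float(w @ centroid)                # the plane passes through the centroid

def k_plane(points, k, n_iter=50, seed=0):
    """Alternate plane fitting and nearest-plane assignment until labels stop changing."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(points))    # random initial assignment
    planes = []
    for _ in range(n_iter):
        planes = []
        for j in range(k):
            members = points[labels == j]
            if len(members) == 0:
                # Hypothetical fallback: re-seed an empty cluster with one random point
                members = points[rng.integers(0, len(points), size=1)]
            planes.append(fit_plane(members))
        W = np.array([w for w, _ in planes])
        gammas = np.array([g for _, g in planes])
        distances = np.abs(points @ W.T - gammas)    # point-to-plane distances
        new_labels = distances.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, planes
```

As with k-means, a procedure of this kind reaches only a local solution, so in practice one would likely rerun it from several random starts and keep the assignment with the smallest total squared distance.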
Health status monitoring through analysis of behavioral patterns
- 8th congress of the Italian Association for Artificial Intelligence (AI*IA) on Ambient Intelligence
, 2003
Cited by 29 (4 self)
With the rapid growth of the elderly population, there is a need to assess the ability of elders to maintain an independent and healthy lifestyle. One possible method is to employ the concepts of ambient intelligence to remotely monitor an elder’s activity. The SmartHouse project uses a system of basic sensors to monitor a person’s in-home activity, and a prototype of the system is being tested within a subject’s home. We examine whether the system can be used to detect behavioral patterns. Mixture models are used to develop a probabilistic model of behavioral patterns. The results of the mixture model analysis are then compared to a log of events kept by the user.
Regularized Gaussian Discriminant Analysis Through Eigenvalue Decomposition
- Journal of the American Statistical Association
, 1996
Cited by 29 (4 self)
Friedman (1989) has proposed a regularization technique (RDA) of discriminant analysis in the Gaussian framework. RDA makes use of two regularization parameters to design an intermediate classification rule between linear and quadratic discriminant analysis. In this paper, we propose an alternative approach to design classification rules which have also a median position between linear and quadratic discriminant analysis. Our approach is based on the reparametrization of the covariance matrix $\Sigma_k$ of a group $G_k$ in terms of its eigenvalue decomposition, $\Sigma_k = \lambda_k D_k A_k D_k'$, where $\lambda_k$ specifies the volume of $G_k$, $A_k$ its shape, and $D_k$ its orientation. Variations on constraints concerning $\lambda_k$, $A_k$ and $D_k$ lead to 14 discrimination models of interest. For each model, we derived the maximum likelihood parameter estimates and our approach consists in selecting the model among the 14 possible models by minimizing the sample-based estimate of future misclassification risk by cross-validation. Numerical experiments show favorable behavior of this approach as compared to RDA.
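The volume, shape, and orientation factors of a group covariance matrix can be computed numerically as in the sketch below. The normalisation is an assumption, not stated in the abstract: the volume is taken as the d-th root of the determinant, so that the diagonal shape matrix has determinant 1. The function name and the example matrix are illustrative only.

```python
import numpy as np

def decompose_covariance(sigma):
    """Split a positive-definite covariance into volume, orientation and shape."""
    eigvals, D = np.linalg.eigh(sigma)            # columns of D are eigenvectors
    eigvals, D = eigvals[::-1], D[:, ::-1]        # sort eigenvalues in descending order
    d = len(eigvals)
    lam = float(np.prod(eigvals) ** (1.0 / d))    # volume: det(sigma) ** (1/d) (assumed)
    A = np.diag(eigvals / lam)                    # shape: diagonal with det(A) = 1
    return lam, D, A

# Illustrative 2 x 2 covariance matrix
sigma = np.array([[4.0, 1.0],
                  [1.0, 2.0]])
lam, D, A = decompose_covariance(sigma)
reconstructed = lam * D @ A @ D.T                 # rebuilds sigma from its three factors
```

Constraining one or more of the three factors to be equal across groups, while letting the others vary, is what generates a family of intermediate models between a single shared covariance (linear) and fully free covariances (quadratic).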

