Results 1 - 10 of 29
Model-Based Clustering, Discriminant Analysis, and Density Estimation
Journal of the American Statistical Association, 2000
Cited by 171 (23 self)
Cluster analysis is the automated search for groups of related observations in a data set. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures and most clustering methods available in commercial software are also of this type. However, there is little systematic guidance associated with these methods for solving important practical questions that arise in cluster analysis, such as "How many clusters are there?", "Which clustering method should be used?" and "How should outliers be handled?". We outline a general methodology for model-based clustering that provides a principled statistical approach to these issues. We also show that this can be useful for other problems in multivariate analysis, such as discriminant analysis and multivariate density estimation. We give examples from medical diagnosis, minefield detection, cluster recovery from noisy data, and spatial density estimation. Finally, we mention limitations of the methodology, a...
Model-Based Clustering and Data Transformations for Gene Expression Data
2001
Cited by 88 (8 self)
Motivation: Clustering is a useful exploratory technique for the analysis of gene expression data. Many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. In particular, model-based clustering assumes that the data is generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. The issues of selecting a 'good' clustering method and determining the 'correct' number of clusters are reduced to model selection problems in the probability framework. Gaussian mixture models have been shown to be a powerful tool for clustering in many applications.
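The reduction of "how many clusters?" to model selection can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes one-dimensional data, univariate normal components with unequal variances, a deterministic quantile-based start, a hypothetical variance floor to avoid collapse, and the BIC sign convention in which larger values are better. All function names are illustrative.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def fit_loglik(data, k, iters=200, var_floor=1e-3):
    """Run EM for a k-component 1-D normal mixture; return the final log-likelihood."""
    xs = sorted(data)
    n = len(xs)
    # Deterministic start (an assumption): means at evenly spaced quantiles,
    # every component begins with the overall sample variance.
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    center = sum(xs) / n
    var0 = max(sum((x - center) ** 2 for x in xs) / n, var_floor)
    variances = [var0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E step: posterior probability of each component for each point.
        resp = []
        for x in xs:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M step: weighted updates of mixing proportions, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(spread, var_floor)
    return sum(
        math.log(sum(weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)))
        for x in xs
    )

def bic(data, k):
    """BIC = 2 log L - p log n, with p = 3k - 1 free parameters (larger is better)."""
    return 2.0 * fit_loglik(data, k) - (3 * k - 1) * math.log(len(data))

# Two well-separated groups, centered near 0 and near 10.
data = [0.1 * i for i in range(-5, 5)] + [10 + 0.1 * i for i in range(-5, 5)]
best_k = max([1, 2], key=lambda k: bic(data, k))
print(best_k)
```

On this toy data set the two-component model has the higher BIC, so the comparison picks two clusters without any heuristic threshold.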
AE: MCLUST Version 3 for R: Normal Mixture Modeling and Model-Based Clustering
Department of Statistics, University of Washington, 2006
Cited by 25 (1 self)
MCLUST is a contributed R package for normal mixture modeling and model-based clustering. It provides functions for parameter estimation via the EM algorithm for normal mixture models with a variety of covariance structures, and functions for simulation from these models. Also included are functions that combine model-based hierarchical clustering, EM for mixture estimation and the Bayesian Information Criterion (BIC) in comprehensive strategies for clustering, density estimation and discriminant analysis. There is additional functionality for displaying and visualizing the models along with clustering and classification results. A number of features of the software have been changed in this version, and the functionality has been expanded to include regularization for normal mixture models via a Bayesian prior. MCLUST is licensed by the University of Washington and distributed through
MCLUST: Software for Model-Based Clustering, Density Estimation and Discriminant Analysis
Journal of Classification, 2002
Cited by 18 (6 self)
Contents
1 Models
2 Obtaining and Installing MCLUST
  2.1 Using MCLUST with S-PLUS 6 for UNIX/Linux
  2.2 Using MCLUST with S-PLUS 6 for Windows
3 Hierarchical Clustering
4 EM for Mixture Models
  4.1 Uncertainty
  4.2 Individual E and M Steps
5 Bayesian Information Criterion
6 Cluster Analysis
  6.1 Mclust
  6.2 EMclust
  6.3 Clustering with Noise and Outliers
7 Simulation from Mixture Densities
8 Density Estimation
9 Displays
Bayesian regularization for normal mixture estimation and model-based clustering
2005
Cited by 13 (3 self)
Normal mixture models are widely used for statistical modeling of data, including cluster analysis. However, maximum likelihood estimation (MLE) for normal mixtures using the EM algorithm may fail as the result of singularities or degeneracies. To avoid this, we propose replacing the MLE by a maximum a posteriori (MAP) estimator, also found by the EM algorithm. For choosing the number of components and the model parameterization, we propose a modified version of BIC, where the likelihood is evaluated at the MAP instead of the MLE. We use a highly dispersed proper conjugate prior, containing a small fraction of one observation’s worth of information. The resulting method avoids degeneracies and singularities, but when these are not present it gives similar results to the standard method using MLE, EM and BIC. Key words: BIC; EM algorithm; mixture models; model-based clustering; conjugate prior; posterior mode.
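Why a conjugate prior removes the degeneracy can be seen in a simplified, hedged example. The sketch below is not the paper's prior or its hyperparameters: it assumes a single univariate normal component with known mean and a generic inverse-gamma(shape, scale) prior on the variance, for which the posterior mode has a closed form. The function names and the hyperparameter values are illustrative.

```python
def mle_variance(xs, mean):
    """Maximum likelihood variance estimate with the mean treated as known."""
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def map_variance(xs, mean, shape, scale):
    """Posterior mode of the variance under an inverse-gamma(shape, scale) prior.

    The posterior is inverse-gamma(shape + n/2, scale + SS/2), whose mode is
    (scale + SS/2) / (shape + n/2 + 1).
    """
    n = len(xs)
    ss = sum((x - mean) ** 2 for x in xs)
    return (scale + 0.5 * ss) / (shape + 0.5 * n + 1.0)

# A component that has collapsed onto a single observation:
# the MLE variance is zero (a singularity), the MAP variance is not.
collapsed = [3.0]
print(mle_variance(collapsed, 3.0))
print(map_variance(collapsed, 3.0, shape=1.0, scale=0.01))

# With plenty of data the weak prior barely changes the estimate.
plenty = [0.0, 1.0] * 500
print(mle_variance(plenty, 0.5), map_variance(plenty, 0.5, shape=1.0, scale=0.01))
```

The same pattern matches the abstract's claim: the prior matters only where the likelihood alone would degenerate, and is nearly invisible otherwise.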
Functional Bioinformatics of Microarray Data: From Expression to Regulation
2002
Cited by 11 (3 self)
Microarrays are a powerful technique to monitor the expression of thousands of genes in a single experiment. From series of such experiments, it is possible to identify the mechanisms that govern the activation of genes in an organism. Short DNA patterns (called binding sites) in or around the genes serve as switches that control gene expression. As a result, similar patterns of expression can correspond to similar binding site patterns. We integrate clustering of coexpressed genes with the discovery of binding motifs. We overview several important clustering techniques and present a clustering algorithm (called adaptive quality-based clustering), which we have developed to address several shortcomings of existing methods. We overview the different techniques for motif finding, in particular the technique of Gibbs sampling, and we present several extensions of this technique in our Motif Sampler. Finally, we present an integrated web tool called INCLUSive (http://www.esat.kuleuven.ac.be/~dna/BioI/Software.html) that allows the easy analysis of microarray data for motif finding.
Donuts, Scratches and Blanks: Robust Model-Based Segmentation of Microarray Images
Bioinformatics, 2005
Cited by 10 (0 self)
Inner holes, artifacts and blank spots are common in microarray images, but current image analysis methods do not pay them enough attention. We propose a new robust model-based method for processing microarray images so as to estimate foreground and background intensities. The method starts with a very simple but effective automatic gridding method, and then proceeds in two steps. The first step applies model-based clustering to the distribution of pixel intensities, using the Bayesian Information Criterion (BIC) to choose the number of groups up to a maximum of three. The second step is spatial, finding the large spatially connected components in each cluster of pixels. The method thus combines the strengths of histogram-based and spatial approaches. It deals effectively with inner holes in spots and artifacts. It also provides a formal inferential basis for deciding when the spot is blank, namely when the BIC favors one group over two or three. In experiments, our method had better stability across replicates than a fixed-circle segmentation method or the seeded region growing method in the SPOT software, without introducing noticeable bias when estimating the intensities of differentially expressed genes. An R software package called spotSegmentation implementing the method is being
Incremental Model-Based Clustering for Large Datasets with Small Clusters
Journal of Computational and Graphical Statistics, 2003
Cited by 10 (5 self)
Clustering is often useful for analyzing and summarizing information within large datasets. Model-based clustering methods have been found to be effective for determining the number of clusters, dealing with outliers, and selecting the best clustering method in datasets that are small to moderate in size. For large datasets, current model-based clustering methods tend to be limited by memory and time requirements and the increasing difficulty of maximum likelihood estimation. They may fit too many clusters in some portions of the data and/or miss clusters containing relatively few observations.
Model-based clustering for image segmentation and large datasets via sampling
Journal of Classification
Cited by 10 (5 self)
Abstract: The rapid increase in the size of data sets makes clustering all the more important to capture and summarize the information, at the same time making clustering more difficult to accomplish. If model-based clustering is applied directly to a large data set, it can be too slow for practical application. A simple and common approach is to first cluster a random sample of moderate size, and then use the clustering model found in this way to classify the remainder of the objects. We show that, in its simplest form, this method may lead to unstable results. Our experiments suggest that a stable method with better performance can be obtained with two straightforward modifications to the simple sampling method: several tentative models are identified from the sample instead of just one, and several EM steps are used rather than just one E step to classify the full data set. We find that there are significant gains from increasing the size of the sample up to about 2,000, but not from further increases. These conclusions are based on the application of several alternative strategies to the segmentation of three different multispectral images, and to several simulated data sets.
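The simple sampling method that the abstract starts from (fit a mixture on a sample, then classify the remaining objects with one E step) might look roughly like the sketch below. It shows only the classification step and assumes the mixture parameters have already been estimated from the sample; the one-dimensional normal components, the parameter values, and the function name are all illustrative, and this is the baseline variant that the abstract reports can be unstable, not the authors' improved procedure.

```python
import math

def e_step_classify(points, weights, means, variances):
    """Assign each point to its most probable component with a single E step.

    Returns (labels, uncertainties), where uncertainty is 1 minus the largest
    posterior probability for that point.
    """
    labels, uncertainties = [], []
    for x in points:
        # Log of (mixing weight * normal density) for each component.
        log_post = [
            math.log(w) - 0.5 * math.log(2.0 * math.pi * v) - 0.5 * (x - m) ** 2 / v
            for w, m, v in zip(weights, means, variances)
        ]
        # Normalize in log space to avoid underflow for distant points.
        top = max(log_post)
        probs = [math.exp(lp - top) for lp in log_post]
        total = sum(probs)
        probs = [p / total for p in probs]
        best = max(range(len(probs)), key=probs.__getitem__)
        labels.append(best)
        uncertainties.append(1.0 - probs[best])
    return labels, uncertainties

# Hypothetical parameters, standing in for a model fitted on a random sample.
weights, means, variances = [0.5, 0.5], [0.0, 10.0], [1.0, 1.0]
remainder = [-1.0, 0.5, 9.0, 12.0]
labels, uncertainties = e_step_classify(remainder, weights, means, variances)
print(labels)
```

The modification described in the abstract would replace this single pass with several full EM steps on the complete data set, starting from more than one tentative model.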
Iterative denoising using Jensen-Rényi divergences with an application to unsupervised document categorization
Proc. IEEE ICASSP, 2007
Cited by 4 (1 self)
Iterative denoising trees were used by Karakos et al. [1] for unsupervised hierarchical clustering. The tree construction involves projecting the data onto low-dimensional spaces, as a means of smoothing their empirical distributions, as well as splitting each node based on an information-theoretic maximization objective. In this paper, we improve upon the work of [1] in two ways: (i) the amount of computation spent searching for a good projection at each node now adapts to the intrinsic dimensionality of the data observed at that node; (ii) the objective at each node is to find a split which maximizes a generalized form of mutual information, the Jensen-Rényi divergence; this is followed by an iterative Naïve Bayes classification. The single parameter α of the Jensen-Rényi divergence is chosen based on the “strapping” methodology [2], which learns a meta-classifier on a related task. Compared with the sequential Information Bottleneck method [3], our procedure produces state-of-the-art results on an unsupervised categorization task of documents from the “20 Newsgroups” dataset.
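For readers unfamiliar with the quantity being maximized, the sketch below computes a Jensen-Rényi divergence for discrete distributions. It assumes the commonly used definition, the Rényi entropy of the weighted mixture minus the weighted average of the individual Rényi entropies; the abstract does not spell out the exact form used in the paper, so treat this as illustrative. Function names are hypothetical.

```python
import math

def renyi_entropy(p, alpha):
    """Rényi entropy of order alpha (alpha > 0, alpha != 1) of a discrete distribution."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    return math.log(sum(pi ** alpha for pi in p if pi > 0)) / (1.0 - alpha)

def jensen_renyi(dists, weights, alpha):
    """Entropy of the weighted mixture minus the weighted mean of the entropies."""
    size = len(dists[0])
    mixture = [sum(w * d[i] for w, d in zip(weights, dists)) for i in range(size)]
    mean_entropy = sum(w * renyi_entropy(d, alpha) for w, d in zip(weights, dists))
    return renyi_entropy(mixture, alpha) - mean_entropy

# Two disjoint distributions are maximally separated (log 2 for two equal weights);
# two identical distributions have zero divergence.
disjoint = jensen_renyi([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5], alpha=0.5)
identical = jensen_renyi([[0.3, 0.7], [0.3, 0.7]], [0.5, 0.5], alpha=0.5)
print(disjoint, identical)
```

A split that separates the word distributions of two child nodes well yields a large value, which is why the divergence can serve as a splitting objective.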

