Model-Based Clustering, Discriminant Analysis, and Density Estimation
- JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
, 2000
Cited by 171 (23 self)
Cluster analysis is the automated search for groups of related observations in a data set. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures and most clustering methods available in commercial software are also of this type. However, there is little systematic guidance associated with these methods for solving important practical questions that arise in cluster analysis, such as "How many clusters are there?", "Which clustering method should be used?" and "How should outliers be handled?". We outline a general methodology for model-based clustering that provides a principled statistical approach to these issues. We also show that this can be useful for other problems in multivariate analysis, such as discriminant analysis and multivariate density estimation. We give examples from medical diagnosis, minefield detection, cluster recovery from noisy data, and spatial density estimation. Finally, we mention limitations of the methodology, a...
Incremental Model-Based Clustering for Large Datasets with Small Clusters
- Journal of Computational and Graphical Statistics
, 2003
Cited by 10 (5 self)
Clustering is often useful for analyzing and summarizing information within large datasets. Model-based clustering methods have been found to be effective for determining the number of clusters, dealing with outliers, and selecting the best clustering method in datasets that are small to moderate in size. For large datasets, current model-based clustering methods tend to be limited by memory and time requirements and the increasing difficulty of maximum likelihood estimation. They may fit too many clusters in some portions of the data and/or miss clusters containing relatively few observations.
Model-based clustering for image segmentation and large datasets via sampling
- Journal of Classification
Cited by 10 (5 self)
The rapid increase in the size of data sets makes clustering all the more important to capture and summarize the information, at the same time making clustering more difficult to accomplish. If model-based clustering is applied directly to a large data set, it can be too slow for practical application. A simple and common approach is to first cluster a random sample of moderate size, and then use the clustering model found in this way to classify the remainder of the objects. We show that, in its simplest form, this method may lead to unstable results. Our experiments suggest that a stable method with better performance can be obtained with two straightforward modifications to the simple sampling method: several tentative models are identified from the sample instead of just one, and several EM steps are used rather than just one E step to classify the full data set. We find that there are significant gains from increasing the size of the sample up to about 2,000, but not from further increases. These conclusions are based on the application of several alternative strategies to the segmentation of three different multispectral images, and to several simulated data sets.
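The sampling strategy described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes one-dimensional data, a Gaussian mixture with a fixed number of components, and a crude quantile-based initialisation; all function names (`cluster_via_sample`, `em`, and so on) are hypothetical. Only the second modification is shown (several EM steps on the full data rather than a single E step); the first modification (keeping several tentative models from the sample) is omitted for brevity.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def e_step(data, weights, means, variances):
    """Return responsibilities resp[i][j] and the mixture log-likelihood."""
    resp, loglik = [], 0.0
    for x in data:
        dens = [w * normal_pdf(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        total = sum(dens) + 1e-300  # guard against underflow to zero
        resp.append([d / total for d in dens])
        loglik += math.log(total)
    return resp, loglik

def m_step(data, resp, k, min_var=1e-6):
    """Re-estimate mixing weights, means and variances from responsibilities."""
    n = len(data)
    weights, means, variances = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp) + 1e-12
        mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
        weights.append(nj / n)
        means.append(mean)
        variances.append(max(var, min_var))
    return weights, means, variances

def em(data, weights, means, variances, steps):
    """Run a fixed number of EM iterations from the given parameters."""
    for _ in range(steps):
        resp, _ = e_step(data, weights, means, variances)
        weights, means, variances = m_step(data, resp, len(means))
    return weights, means, variances

def cluster_via_sample(data, k, sample_size,
                       sample_steps=50, full_steps=5, seed=0):
    """Fit on a random sample, then refine on the full data with a few EM steps."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(data, min(sample_size, len(data))))
    # Assumed initialisation: component means at evenly spaced sample quantiles.
    means = [sample[(2 * j + 1) * len(sample) // (2 * k)] for j in range(k)]
    centre = sum(sample) / len(sample)
    var = sum((x - centre) ** 2 for x in sample) / len(sample)
    params = ([1.0 / k] * k, means, [var] * k)
    params = em(sample, *params, steps=sample_steps)  # model from the sample
    params = em(data, *params, steps=full_steps)      # several EM steps, full data
    resp, _ = e_step(data, *params)
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return params, labels
```

Setting `full_steps=0` reduces the sketch to the simple sampling method the abstract criticises, in which the sample model classifies the full data with a single E step.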
Clustering Massive Datasets With Applications in Software Metrics and Tomography
- Technometrics
, 1998
Cited by 7 (1 self)
Clustering datasets is not an easy problem in general, and the difficulty is compounded for a massive dataset. Restricting attention to a sample from the data ignores minority groups and hence compromises on the available riches. This paper develops, under Gaussian assumptions, a multi-stage clustering algorithm. After clustering an initial sample, observations that can be reasonably classified in the identified groups are filtered out using a series of likelihood ratio tests. The remainder are again sampled, clustered and the procedure iterated until all cases have either been clustered or classified. Final estimates of the class probabilities and the dispersions are obtained after an initial classification of the complete dataset into the identified clusters. Class membership of the observations in the dataset is finally assigned using these estimated probabilities and dispersions. Results on several test experiments indicated good performance. The procedure was also implemented on t...
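The filtering stage of the multi-stage procedure can be sketched as follows. This is an illustration only: the paper uses a series of likelihood ratio tests, whereas this sketch substitutes a simple cutoff on the standardised distance to the nearest group. The one-dimensional setting, the function name `split_classified`, and the default cutoff `z_cut=3.0` are all assumptions made for the example.

```python
import math

def split_classified(data, means, variances, z_cut=3.0):
    """Split data into points close to an identified group and a remainder.

    A point counts as "reasonably classified" here if it lies within z_cut
    standard deviations of some group mean (an assumed stand-in for the
    likelihood ratio tests mentioned in the abstract).
    """
    assigned, remainder = [], []
    for x in data:
        # Standardised distance to each group; keep the closest one.
        z, group = min(
            (abs(x - m) / math.sqrt(v), j)
            for j, (m, v) in enumerate(zip(means, variances))
        )
        if z <= z_cut:
            assigned.append((x, group))
        else:
            remainder.append(x)
    return assigned, remainder
```

In a multi-stage loop, the `remainder` list would be sampled and clustered again, and the filter repeated with the enlarged set of groups until the remainder is empty.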

