## A Unified Framework for Model-based Clustering (2003)

### Download Links

- [www.jmlr.org]
- [www.cse.fau.edu]
- [jmlr.csail.mit.edu]
- [www.ai.mit.edu]
- DBLP

### Other Repositories/Bibliography

Venue: Journal of Machine Learning Research

Citations: 57 (6 self)

### BibTeX

```bibtex
@ARTICLE{Zhong03aunified,
  author  = {Shi Zhong and Joydeep Ghosh},
  title   = {A Unified Framework for Model-based Clustering},
  journal = {Journal of Machine Learning Research},
  year    = {2003},
  volume  = {4},
  pages   = {1001--1037}
}
```


### Abstract

Model-based clustering techniques have been widely used and have shown promising results in many applications involving complex data. This paper presents a unified framework for probabilistic model-based clustering based on a bipartite graph view of data and models that highlights the commonalities and differences among existing model-based clustering algorithms. In this view, clusters are represented as probabilistic models in a model space that is conceptually separate from the data space. For partitional clustering, the view is conceptually similar to the Expectation-Maximization (EM) algorithm. For hierarchical clustering, the graph-based view helps to visualize critical distinctions between similarity-based approaches and model-based approaches.

### Citations

8983 | The nature of statistical learning theory
- Vapnik
- 1995
Citation Context: ...Dubes, 1988; Jain et al., 1999; Ghosh, 2003). In this paper we make a fundamental distinction between discriminative (or distance/similarity-based) approaches (Indyk, 1999; Scholkopf and Smola, 2001; Vapnik, 1998) and generative (or model-based) approaches (Bilmes, 1998; Rose, 1998; Smyth, 1997) to clustering. With a few exceptions (Vapnik, 1998; Jaakkola and Haussler, 1999), this is not considered the primar...

8092 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...The model structure (e.g., the number of hidden states in an HMM) can be determined by model selection techniques and parameters estimated using maximum likelihood algorithms, e.g., the EM algorithm (Dempster et al., 1977). Probabilistic model-based clustering techniques have shown promising results in a corpus of applications. Gaussian mixture models are the most popular models used for vector data (Symons, 1981; McL...

4829 | Neural Networks for Pattern Recognition
- Bishop
- 1995
Citation Context: ...1986); (b) using maximum a posteriori estimation with an appropriate prior (Gauvain and Lee, 1994); (c) using constrained ML estimation, e.g., lower-bound the variance for spherical Gaussian models (Bishop, 1995). Algorithm: model-based HAC. Input: a set of N data objects $X = \{x_1, \ldots, x_N\}$ and model structure $\lambda$. Output: an N-level cluster (model) hierarchy and hierarchical partition ...
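Option (c) in the context above, constrained ML estimation, can be illustrated by clipping the variance estimate of a spherical Gaussian to a floor so that a cluster collapsing onto a single point cannot drive the likelihood to a singularity. This is a minimal sketch, not the paper's code; `var_floor` is a hypothetical hyperparameter.

```python
import numpy as np

# Constrained ML fit of a spherical Gaussian: the ML variance estimate is
# lower-bounded by `var_floor` (a hypothetical value, chosen for illustration).
def fit_spherical_gaussian(X, var_floor=1e-3):
    mu = X.mean(axis=0)
    var = ((X - mu) ** 2).mean()        # unconstrained ML variance
    return mu, max(var, var_floor)      # clip to the floor to avoid singularity

# Degenerate cluster: both points identical, ML variance would be 0
mu, var = fit_spherical_gaussian(np.array([[1.0, 1.0], [1.0, 1.0]]))
```

With spread-out data the floor is inactive and the usual ML estimate is returned.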

3719 | Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images
- Geman, Geman
- 1984
Citation Context: ...struct the Lagrangian $L = L_1 + \sum_x \xi_x \left( \sum_y P(y|x) - 1 \right)$, where the $\xi_x$ are Lagrange multipliers, and then let the partial derivative $\partial L / \partial P(y|x) = 0$. The resulting $P(y|x)$ is the well-known Gibbs distribution (Geman and Geman, 1984) given by $P(y|x) = \frac{P(y)\, p(x|\lambda_y)^{1/T}}{\sum_{y'} P(y')\, p(x|\lambda_{y'})^{1/T}}$ (5). If $P(y)$ is not known a priori, we can estimate it from the data as $P(y) = \sum_x P(x) P(y|x)$. Now we get a model-based clustering algori...
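The Gibbs posterior of Eq. (5) is straightforward to compute stably in log space. A minimal sketch, not from the paper; `log_lik`, `prior`, and `T` are hypothetical names. It also shows that as the temperature T goes to 0 the soft posterior approaches a hard assignment.

```python
import numpy as np

# Annealed posterior P(y|x) ∝ P(y) p(x|λ_y)^{1/T}, computed in log space.
# log_lik[i, y] = log p(x_i | λ_y); prior[y] = P(y); T is the temperature.
def gibbs_posterior(log_lik, prior, T):
    scores = np.log(prior) + log_lik / T          # log P(y) + (1/T) log p(x|λ_y)
    scores -= scores.max(axis=1, keepdims=True)   # stabilize before exponentiating
    post = np.exp(scores)
    return post / post.sum(axis=1, keepdims=True)

log_lik = np.array([[-1.0, -3.0], [-2.0, -0.5]])
prior = np.array([0.5, 0.5])
soft = gibbs_posterior(log_lik, prior, T=1.0)
hard = gibbs_posterior(log_lik, prior, T=1e-3)    # T → 0 recovers hard assignment
```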

3243 | Self-Organizing Maps
- Kohonen
- 1997
Citation Context: ...ariable-length sequences. Their work, however, does not address model-based hierarchical clustering or specialized model-based partitional clustering algorithms such as the Self-Organizing Map (SOM) (Kohonen, 1997) and the Neural-Gas algorithm (Martinetz et al., 1993), both of which use a varying neighborhood function to control the assignment of data objects to different clusters. This paper provides a charac...

2307 | Estimating the dimension of a model
- Schwarz
- 1978
Citation Context: ...ther the partitional or the hierarchical/hybrid procedures. This is an old yet important problem for which a universally satisfactory answer is yet to be obtained. Bayesian model selection techniques (Schwarz, 1978; Banfield and Raftery, 1993; Fraley and Raftery, 1998) have been investigated extensively. Most simple criteria such as BIC (Bayesian Information Criterion) or AIC (Akaike Information Criterion) eith...

2152 | Algorithms for Clustering Data - Jain, Dubes - 1988

2029 | Learning with Kernels
- Schölkopf, Smola
- 2002
Citation Context: ...(Hartigan, 1975; Jain and Dubes, 1988; Jain et al., 1999; Ghosh, 2003). In this paper we make a fundamental distinction between discriminative (or distance/similarity-based) approaches (Indyk, 1999; Scholkopf and Smola, 2001; Vapnik, 1998) and generative (or model-based) approaches (Bilmes, 1998; Rose, 1998; Smyth, 1997) to clustering. With a few exceptions (Vapnik, 1998; Jaakkola and Haussler, 1999), this is not conside...

1502 | Clustering Algorithms
- Hartigan
- 1975
Citation Context: ...Clustering, Deterministic Annealing. 1. Introduction. Clustering or segmentation of data is a fundamental data analysis step that has been widely studied across multiple disciplines for over 40 years (Hartigan, 1975; Jain and Dubes, 1988; Jain et al., 1999; Ghosh, 2003). In this paper we make a fundamental distinction between discriminative (or distance/similarity-based) approaches (Indyk, 1999; Scholkopf and Sm...

1300 | Data clustering: A review
- Jain, Murty, et al.
- 1999
Citation Context: ...Introduction. Clustering or segmentation of data is a fundamental data analysis step that has been widely studied across multiple disciplines for over 40 years (Hartigan, 1975; Jain and Dubes, 1988; Jain et al., 1999; Ghosh, 2003). In this paper we make a fundamental distinction between discriminative (or distance/similarity-based) approaches (Indyk, 1999; Scholkopf and Smola, 2001; Vapnik, 1998) and generative (...

1097 | On spectral clustering: Analysis and an algorithm - Ng, Jordan, et al. - 2001 |

1055 | Instance-based learning algorithms
- Aha, Kibler, et al.
- 1991
Citation Context: ...inative approaches, the most commonly used distance measures are Euclidean distance and Mahalanobis distance for data that can be represented in a vector space. The instance-based learning literature (Aha et al., 1991) provides several examples of scenarios where customized distance measures perform better than such generic ones. For high-dimensional text clustering, Strehl et al. (2000) studied the impact of diff...

842 | Least squares quantization in PCM
- Lloyd
- 1982
Citation Context: ...97; Li and Biswas, 2002) that we call model-based k-means (mk-means). It is a generalized version of the standard k-means algorithm (MacQueen, 1967; Lloyd, 1982) and iterates between the following two steps: $P(y|x) = 1$ if $y = \arg\max_{y'} \log p(x|\lambda_{y'})$ and $0$ otherwise (2), and $\lambda_y = \arg\max_{\lambda} \sum_x P(y|x) \log p(x|\lambda)$ (3). The posterior probability $P(y|x)$ in Equation...
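The two-step mk-means iteration of Eqs. (2)-(3) can be sketched in a few lines. This is a minimal illustration, not the paper's code; it assumes spherical unit-variance Gaussian cluster models, for which the hard-assignment step (2) picks the nearest mean and the ML re-estimation step (3) reduces to the cluster mean, i.e. standard k-means.

```python
import numpy as np

# mk-means sketch: spherical unit-variance Gaussians, so log p(x|λ_y) is
# -||x - μ_y||²/2 up to a constant, and ML re-estimation is the cluster mean.
def mk_means(X, mu, n_iter=10):
    for _ in range(n_iter):
        # Step (2): hard-assign each x to the model maximizing log p(x|λ_y)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        y = d2.argmin(axis=1)
        # Step (3): re-estimate each λ_y by maximum likelihood on its cluster
        for k in range(len(mu)):
            if (y == k).any():
                mu[k] = X[y == k].mean(axis=0)
    return y, mu

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y, mu = mk_means(X, mu=np.array([[0.0, 0.1], [4.0, 4.0]]))
```

Swapping in a different model family only changes how log p(x|λ) and the ML re-fit are computed; the two-step structure stays the same.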

758 | A comparison of event models for naive Bayes text classification - McCallum, Nigam |

622 | Scatter/Gather: a cluster-based approach to browsing large document collections
- Cutting, Karger, et al.
- 1992
Citation Context: ...ybrid Model-based Clustering. This section presents a hybrid methodology that combines the advantages of partitional and hierarchical methods. The idea is a "reverse" of the "Scatter/Gather" approach (Cutting et al., 1992) and has been used by Vaithyanathan and Dom (2000) and Karypis et al. (1999). We shall first analyze the advantages of this hybrid approach for model-based clustering, and then present a new variatio...

501 | Hierarchical grouping to optimize an objective function
- Ward
- 1963
Citation Context: ...$\lambda_{after}$ are the set of all parameters before and after merging two models ($\lambda_k$ and $\lambda_j$), respectively. We call this measure (generalized) Ward's distance since this is exactly Ward's algorithm (Ward, 1963) when equi-variant Gaussian models are used. The above method is not efficient, however, since to find the closest pair one needs to train a merged model for every pair of clusters and then evaluate ...

490 | Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains
- Gauvain, Lee
- 1994
Citation Context: ...e of the following three ways: (a) restarting the whole clustering algorithm with a different initialization (Juang et al., 1986); (b) using maximum a posteriori estimation with an appropriate prior (Gauvain and Lee, 1994); (c) using constrained ML estimation, e.g., lower-bound the variance for spherical Gaussian models (Bishop, 1995). Algorithm: model-based HAC. Input: a set of N data objects X = ...

466 | Mixture Models: Inference and Applications to Clustering
- McLachlan, Basford
- 1988
Citation Context: ...977). Probabilistic model-based clustering techniques have shown promising results in a corpus of applications. Gaussian mixture models are the most popular models used for vector data (Symons, 1981; McLachlan and Basford, 1988; Banfield and Raftery, 1993; Fraley, 1999; Yeung et al., 2001); multinomial models have been shown to be effective for high dimensional text clustering (Vaithyanathan and Dom, 2000; Meila and Heckerm...

455 | Objective criteria for the evaluation of clustering methods
- Rand
- 1971
Citation Context: ...accuracy, F1 measure, average purity, average entropy, and mutual information (Ghosh, 2003). There are also several other ways to compare two partitions of the same data set, such as the Rand index (Rand, 1971) and the Fowlkes-Mallows measure (Fowlkes and Mallows, 1983) from the statistics community. The F1 measure is often used in information retrieval, where clustering serves as a way of improving the quality an...
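The Rand index mentioned in the context above is just the fraction of object pairs on which two partitions agree (both in the same cluster, or both in different clusters). A minimal sketch, not from the cited paper:

```python
from itertools import combinations

# Rand index between two labelings a and b of the same objects:
# count pairs where "same cluster in a" matches "same cluster in b".
def rand_index(a, b):
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)
```

Note the index is 1.0 for identical partitions regardless of how the cluster labels are named, since only co-membership matters.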

426 | A comparison of document clustering techniques - Steinbach, Karypis, et al. - 2000 |

411 | Divergence measures based on the Shannon entropy
- Lin
- 1991
Citation Context: ...r k. This distance can be made symmetric by defining (Juang and Rabiner, 1985) $D_K^s(\lambda_k, \lambda_j) = \frac{D_K(\lambda_k, \lambda_j) + D_K(\lambda_j, \lambda_k)}{2}$, or using the Jensen-Shannon divergence with $\pi_1 = \pi_2 = \frac{1}{2}$ (Lin, 1991): $D_{JS}(\lambda_k, \lambda_j) = \frac{1}{2} D_K\left(\lambda_k, \frac{\lambda_k + \lambda_j}{2}\right) + \frac{1}{2} D_K\left(\lambda_j, \frac{\lambda_k + \lambda_j}{2}\right)$. Compared to classical hierarchical agglomerative clustering (HAC) algorithms, KL divergence is analogous to the centr...
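For discrete distributions, the symmetrized KL and Jensen-Shannon divergences above take a few lines of NumPy. A sketch under the assumption that the two models are represented by strictly positive probability vectors `p` and `q` (the paper applies these measures to cluster models λ_k, λ_j):

```python
import numpy as np

# KL divergence between discrete distributions (assumes p, q > 0 elementwise)
def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Symmetrized KL: average of the two directed divergences
def sym_kl(p, q):
    return 0.5 * (kl(p, q) + kl(q, p))

# Jensen-Shannon divergence with π1 = π2 = 1/2: KL to the midpoint mixture
def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.3])
q = np.array([0.3, 0.7])
```

Unlike raw KL, both `sym_kl` and `js` are symmetric, and `js` is additionally bounded by log 2.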

397 | Exploiting generative models in discriminative classifiers
- Jaakkola, Haussler
- 1998
Citation Context: ...pproaches (Indyk, 1999; Scholkopf and Smola, 2001; Vapnik, 1998) and generative (or model-based) approaches (Bilmes, 1998; Rose, 1998; Smyth, 1997) to clustering. With a few exceptions (Vapnik, 1998; Jaakkola and Haussler, 1999), this is not considered the primary dichotomy in the vast clustering literature---partitional vs. hierarchical is a more popular choice by far. We shall show that the discriminative vs. generative d...

315 | Co-clustering documents and words using bipartite spectral graph partitioning
- Dhillon
- 2001
Citation Context: ...computationally inefficient, requiring a complexity of $O(N^2)$. Despite this disadvantage, discriminative methods such as graph partitioning and spectral clustering algorithms (Karypis et al., 1999; Dhillon, 2001; Meila and Shi, 2001; Ng et al., 2002; Strehl and Ghosh, 2002) have gained recent popularity due to their ability to produce desirable clustering results. For model-based clustering approaches, the m...

312 | Model-based Gaussian and non-Gaussian clustering
- Banfield, Raftery
- 1993
Citation Context: ...ed clustering techniques have shown promising results in a corpus of applications. Gaussian mixture models are the most popular models used for vector data (Symons, 1981; McLachlan and Basford, 1988; Banfield and Raftery, 1993; Fraley, 1999; Yeung et al., 2001); multinomial models have been shown to be effective for high dimensional text clustering (Vaithyanathan and Dom, 2000; Meila and Heckerman, 2001). By deriving a bij...

310 | Clustering with Bregman divergences - Banerjee, Merugu, et al. - 2005

278 | How many clusters? Which clustering method? Answers via model-based cluster analysis
- Fraley, Raftery
- 1998
Citation Context: ...rid procedures. This is an old yet important problem for which a universally satisfactory answer is yet to be obtained. Bayesian model selection techniques (Schwarz, 1978; Banfield and Raftery, 1993; Fraley and Raftery, 1998) have been investigated extensively. Most simple criteria such as BIC (Bayesian Information Criterion) or AIC (Akaike Information Criterion) either overestimate or underestimate the number of cluster...

258 | On clusterings: Good, bad and spectral
- Kannan, Vempala, et al.
- 2000
Citation Context: ...to graph partitioning approaches to get good clustering results. Indeed, both traditional k-means and HAC algorithms fail miserably on these two datasets, whereas the spectral clustering algorithms (Kannan et al., 2000; Ng et al., 2002) identify the four natural clusters in d4, and a hybrid graph partitioning approach (Karypis et al., 1999; Karypis, 2002) produces all six natural clusters in t4. The hybrid graph pa...

250 | "Neural-Gas" Network for Vector Quantization and its Application to Time-Series Prediction
- Martinetz, Berkovich, et al.
- 1993
Citation Context: ...does not address model-based hierarchical clustering or specialized model-based partitional clustering algorithms such as the Self-Organizing Map (SOM) (Kohonen, 1997) and the Neural-Gas algorithm (Martinetz et al., 1993), both of which use a varying neighborhood function to control the assignment of data objects to different clusters. This paper provides a characterization of all existing model-based clustering algo...

247 | Deterministic annealing for clustering, compression, classification, regression, and related optimization problems
- Rose
- 1998
Citation Context: ...ndamental distinction between discriminative (or distance/similarity-based) approaches (Indyk, 1999; Scholkopf and Smola, 2001; Vapnik, 1998) and generative (or model-based) approaches (Bilmes, 1998; Rose, 1998; Smyth, 1997) to clustering. With a few exceptions (Vapnik, 1998; Jaakkola and Haussler, 1999), this is not considered the primary dichotomy in the vast clustering literature---partitional vs. hierar...

226 | Statistics of Directional Data
- Mardia
- 1972
Citation Context: ...e Gaussian distribution for directional data in the sense that it is the unique distribution of $L_2$-normalized data that maximizes the entropy given the first and second moments of the distribution (Mardia, 1975). There is a long-time folklore in the information retrieval community that the direction of a text vector is more important than its magnitude, leading to the practices of using cosine similarity, a...

154 | Impact of similarity measures on web-page clustering
- Strehl, Ghosh, et al.
- 2000
Citation Context: ...d and the accuracy difficult (or sometimes impossible) to calculate. In such situations, the NMI measure is better than purity and entropy measures, both of which are biased towards high-k solutions (Strehl et al., 2000; Strehl and Ghosh, 2002). In this paper, several different measures are used, and we explain in each case study what measures we use and why. 4. A Case Study on Document Clustering. We recently perfor...

142 | Chameleon: hierarchical clustering using dynamic modeling
- Karypis, Han, et al.
- 1999
Citation Context: ...all show that the discriminative vs. generative distinction leads to a useful understanding of existing clustering algorithms. In discriminative approaches, such as clustering via graph partitioning (Karypis et al., 1999), one determines a distance or similarity function between pairs of data objects, and then groups similar objects together into clusters. Parametric, model-based approaches, on the other hand, attempt...

140 | The UCI KDD Archive [http://kdd.ics.uci.edu]
- Hettich, Bay
- 1999
Citation Context: ...econd synthetic dataset, syn3-50, is simply a subset of the first one, containing only the first 50 time points of each sequence in syn3. The EEG dataset, EEG2, is extracted from the UCI KDD Archive (Hettich and Bay, 1999) and contains measurements from an electrode (F4) on the scalp. There are 20 measurements from two subjects, a control subject and an alcoholic subject, 10 from each. Each measurement is sampled at 2...

139 | Clustering sequences with hidden Markov models
- Smyth
- 1997
Citation Context: ...stinction between discriminative (or distance/similarity-based) approaches (Indyk, 1999; Scholkopf and Smola, 2001; Vapnik, 1998) and generative (or model-based) approaches (Bilmes, 1998; Rose, 1998; Smyth, 1997) to clustering. With a few exceptions (Vapnik, 1998; Jaakkola and Haussler, 1999), this is not considered the primary dichotomy in the vast clustering literature---partitional vs. hierarchical is a m...

134 | Numerical continuation methods: an introduction
- Allgower, Georg
- 1990
Citation Context: ...inistic annealing algorithms for vector quantization applications (Gersho and Gray, 1992). They showed that the three types of algorithms are three different implementations of a continuation method (Allgower and Georg, 1990) for vector quantization, with different competitive learning rules. None of these works, however, has analyzed probabilistic model-based clustering or demonstrated the relationship between model-bas...

127 | Model-based clustering and data transformations for gene expression data
- Yeung, Fraley, et al.
- 2001
Citation Context: ...ing results in a corpus of applications. Gaussian mixture models are the most popular models used for vector data (Symons, 1981; McLachlan and Basford, 1988; Banfield and Raftery, 1993; Fraley, 1999; Yeung et al., 2001); multinomial models have been shown to be effective for high dimensional text clustering (Vaithyanathan and Dom, 2000; Meila and Heckerman, 2001). By deriving a bijection between Bregman divergences...

117 | A method for comparing two hierarchical clusterings
- Fowlkes, Mallows
- 1983
Citation Context: ...erage entropy, and mutual information (Ghosh, 2003). There are also several other ways to compare two partitions of the same data set, such as the Rand index (Rand, 1971) and the Fowlkes-Mallows measure (Fowlkes and Mallows, 1983) from the statistics community. The F1 measure is often used in information retrieval, where clustering serves as a way of improving the quality and accelerating the speed of search. The purity of a clus...

109 | Learning segmentation by random walks
- Meila, Shi
- 2000
Citation Context: ...y inefficient, requiring a complexity of $O(N^2)$. Despite this disadvantage, discriminative methods such as graph partitioning and spectral clustering algorithms (Karypis et al., 1999; Dhillon, 2001; Meila and Shi, 2001; Ng et al., 2002; Strehl and Ghosh, 2002) have gained recent popularity due to their ability to produce desirable clustering results. For model-based clustering approaches, the model type is often sp...

109 | Interactively exploring hierarchical clustering results - Seo, Shneiderman

108 | A probabilistic distance measure for hidden Markov models
- Juang, Rabiner
- 1985
Citation Context: ...t involve re-estimating models has been commonly used (Sinkkonen and Kaski, 2001; Ramoni et al., 2002). Exact KL divergence is difficult to calculate for complex models. An empirical KL divergence (Juang and Rabiner, 1985) between two models $\lambda_k$ and $\lambda_j$ can be defined as $D_K(\lambda_k, \lambda_j) = \frac{1}{|X_k|} \sum_{x \in X_k} \left( \log p(x|\lambda_k) - \log p(x|\lambda_j) \right)$ (8), where $X_k$ is the set of data objects being grouped into cluster k. This dista...
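The empirical KL divergence of Eq. (8) is just the average log-likelihood gap over the data assigned to cluster k. A minimal sketch; the log-density callables below are hypothetical stand-ins for the models λ_k, λ_j (any object exposing log p(x|λ) would do):

```python
import numpy as np

# Empirical KL of Eq. (8): mean of log p(x|λ_k) - log p(x|λ_j) over X_k,
# the data objects assigned to cluster k.
def empirical_kl(X_k, log_p_k, log_p_j):
    return float(np.mean([log_p_k(x) - log_p_j(x) for x in X_k]))

# Two unit-variance Gaussian log-densities (shared constants omitted,
# since they cancel in the difference)
log_p_k = lambda x: -0.5 * (x - 0.0) ** 2
log_p_j = lambda x: -0.5 * (x - 3.0) ** 2
X_k = np.array([0.1, -0.2, 0.05])
d = empirical_kl(X_k, log_p_k, log_p_j)  # positive: X_k fits λ_k better than λ_j
```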

108 | CLUTO: A Clustering Toolkit
- Karypis
- 2002
Citation Context: ...ectional distribution defined on a unit hypersphere and does not capture any magnitude information. 4.2 Datasets. We used the 20-newsgroups dataset and a number of datasets from the CLUTO toolkit (Karypis, 2002). These datasets provide a good representation of different characteristics: the number of documents ranges from 204 to 19949, the number of terms from 5832 to 43586, the number of classes from 6 to ...

101 | Information geometry of the EM and em algorithms for neural networks
- Amari
- 1995
Citation Context: ...tering methods with discriminative ones. There has already been some preliminary work in this direction. For example, researchers have started constructing similarity measures from generative models (Amari, 1995; Jaakkola and Haussler, 1999; Tipping, 1999; Tsuda et al., 2003), but their impact on clustering performance is yet to be fully understood. A possible approach is to combine bipartite graph partition...

88 | An information-theoretic analysis of hard and soft assignment methods for clustering
- Kearns, Mansour, et al.
- 1997
Citation Context: ...l connection weights (log-likelihoods) weighted by the association probabilities, which is to be maximized. Indeed, maximizing this objective function leads to a well-known hard clustering algorithm (Kearns et al., 1997; Li and Biswas, 2002; Banerjee et al., 2003b). We will show in the next section that soft model-based clustering can be obtained by adding entropy constraints to the objective function. Similar to det...

86 | A Clustering Technique for Summarizing Multivariate Data
- Ball, Hall
- 1967
Citation Context: ...r hierarchy. Their method is basically the generic algorithm in Figure 9 instantiated with multinomial models and Ward's inter-cluster distance. Another related work is the classic ISODATA algorithm (Ball and Hall, 1967), which performs a further refinement by splitting and merging the clusters obtained using the standard k-means algorithm. Clusters are merged if either the number of members in a cluster is less tha...

71 | A general probabilistic framework for clustering individuals
- Cadez, Gaffney, et al.
- 2000
Citation Context: ...ased on a mixture of components from any member of this vast family can be done in an efficient manner. For clustering more complex data such as time sequences, the dominant models are Markov chains (Cadez et al., 2000; Ramoni et al., 2002) and HMMs (Dermatas and Kokkinakis, 1996; Smyth, 1997; Oates et al., 1999; Law and Kwok, 2000; Li and Biswas, 2002). Compared to similarity-based methods, model-based methods off...

65 | Beyond synexpression relationships: local clustering of time-shifted and inverted gene expression profiles identifies new, biologically relevant interactions - Qian, Dolled-Filhart, et al. |

63 | Maximum likelihood estimation for multivariate mixture observations of Markov chains
- Juang, Levinson, et al.
- 1986
Citation Context: ...since $p(x|\lambda)$ is then upper-bounded by 1. The singularity problem is often dealt with in one of the following three ways: (a) restarting the whole clustering algorithm with a different initialization (Juang et al., 1986); (b) using maximum a posteriori estimation with an appropriate prior (Gauvain and Lee, 1994); (c) using constrained ML estimation, e.g., lower-bound the variance for spherical Gaussian models (Bisho...

61 | An information-theoretic external cluster-validity measure
- Dom
- 2001
Citation Context: ...is 1. It has been argued that the mutual information $I(Y; \hat{Y})$ between a r.v. $Y$, governing the cluster labels, and a r.v. $\hat{Y}$, governing the class labels, is a superior measure to purity or entropy (Dom, 2001; Strehl and Ghosh, 2002). Moreover, by normalizing this measure to lie in the range [0, 1], it becomes relatively impartial to K. There are several choices for normalization based on the entropies H(Y...
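Normalized mutual information between cluster labels and class labels can be sketched in a few lines. As the context notes, there are several normalizations; the geometric-mean form I(Y; Ŷ)/sqrt(H(Y) H(Ŷ)) is one common choice and is the one assumed here (not necessarily the paper's exact variant).

```python
import numpy as np

# NMI between cluster labels y and class labels yhat, normalized by
# the geometric mean of the two entropies (one common convention).
def nmi(y, yhat):
    y, yhat = np.asarray(y), np.asarray(yhat)
    ks, cs = np.unique(y), np.unique(yhat)
    # Joint distribution P(y, yhat) estimated from label co-occurrence counts
    joint = np.array([[np.mean((y == a) & (yhat == b)) for b in cs] for a in ks])
    py, pc = joint.sum(1), joint.sum(0)
    nz = joint > 0                       # skip zero cells (0 log 0 = 0)
    mi = (joint[nz] * np.log(joint[nz] / np.outer(py, pc)[nz])).sum()
    hy = -(py * np.log(py)).sum()
    hc = -(pc * np.log(pc)).sum()
    return mi / np.sqrt(hy * hc)
```

Like the Rand index, NMI is invariant to renaming the cluster labels; it is 1 for a perfect match and 0 when the two labelings are independent.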

58 | Distance Measures for Effective Clustering of ARIMA Time-Series
- Kalpakis, Gada, et al.
- 2001
Citation Context: ...ity measure is very much data dependent and often requires expert domain knowledge. For example, a wide variety of distance measures have been proposed for clustering sequences (Geva and Kerem, 1998; Kalpakis et al., 2001; Qian et al., 2001). Another disadvantage of similarity-based approaches is that calculating the similarities between all pairs of data objects is computationally inefficient, requiring a complexity ...

56 | Algorithms for model-based Gaussian hierarchical clustering
- Fraley
- 1997
Citation Context: ...e shown promising results in a corpus of applications. Gaussian mixture models are the most popular models used for vector data (Symons, 1981; McLachlan and Basford, 1988; Banfield and Raftery, 1993; Fraley, 1999; Yeung et al., 2001); multinomial models have been shown to be effective for high dimensional text clustering (Vaithyanathan and Dom, 2000; Meila and Heckerman, 2001). By deriving a bijection between...