## A Comparative Study of Generative Models for Document Clustering (2003)

Venue: SIAM Int. Conf. Data Mining, Workshop on Clustering High Dimensional Data and Its Applications

Citations: 38 (4 self)

### BibTeX

    @INPROCEEDINGS{Zhong03acomparative,
      author    = {Shi Zhong and Joydeep Ghosh},
      title     = {A Comparative Study of Generative Models for Document Clustering},
      booktitle = {SIAM Int. Conf. Data Mining Workshop on Clustering High Dimensional Data and Its Applications},
      year      = {2003}
    }

### Abstract

Generative models based on the multivariate Bernoulli and multinomial distributions have been widely used for text classification. Recently, the spherical k-means algorithm, which has desirable properties for text clustering, has been shown to be a special case of a generative model based on a mixture of von Mises-Fisher (vMF) distributions. This paper compares these three probabilistic models for text clustering, both theoretically and empirically, using a general model-based clustering framework. For each model, we investigate three strategies for assigning documents to models: maximum likelihood (k-means) assignment, stochastic assignment, and soft assignment. Our experimental results over a large number of datasets show that, in terms of clustering quality, (a) the Bernoulli model is the worst for text clustering; (b) the vMF model produces better clustering results than both the Bernoulli and multinomial models; (c) soft assignment leads to comparable or slightly better results than hard assignment. We also use deterministic annealing (DA) to improve the vMF-based soft clustering and compare all the model-based algorithms with the state-of-the-art discriminative approach to document clustering based on graph partitioning (CLUTO) and a spectral co-clustering method. Overall, CLUTO and DA perform the best but are also the most computationally expensive; the spectral co-clustering algorithm fares worse than the vMF-based methods.
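As a rough illustration of the hard (k-means) assignment strategy under the vMF model, the sketch below implements spherical k-means with NumPy: documents and centroids live on the unit hypersphere, and assignment maximizes cosine similarity. This is not the authors' code; the initialization scheme and iteration count are my own choices.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Hard-assignment clustering on the unit sphere: each document goes
    to the centroid with the highest cosine similarity, and centroids are
    re-normalized means of their members. Illustrative sketch only; the
    paper also studies stochastic and soft assignment, not shown here."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-length documents
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = (X @ centroids.T).argmax(axis=1)      # cosine = dot product here
        for j in range(k):
            members = X[labels == j]
            if len(members):
                m = members.sum(axis=0)
                centroids[j] = m / np.linalg.norm(m)   # project mean back to sphere
    return labels, centroids
```

Because every vector is unit length, the dot product equals the cosine similarity, which is what makes this the "k-means" special case of the vMF mixture discussed in the abstract.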

### Citations

9946 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context ...lon, 2001). Most clustering methods proposed for data mining (Berkhin, 2002) can be divided into two categories: discriminative (or similarity-based) approaches (Indyk, 1999; Scholkopf & Smola, 2001; Vapnik, 1998) and generative (or model-based) approaches (Blimes, 1998; Rose, 1998; Cadez et al., 2000). In similarity-based approaches, one optimizes an objective function involving the pairwise document similar... |

2238 | Learning with Kernels
- Schölkopf, Smola
- 2002
Citation Context ... graph partitioning (Dhillon, 2001). Most clustering methods proposed for data mining (Berkhin, 2002) can be divided into two categories: discriminative (or similarity-based) approaches (Indyk, 1999; Scholkopf & Smola, 2001; Vapnik, 1998) and generative (or model-based) approaches (Blimes, 1998; Rose, 1998; Cadez et al., 2000). In similarity-based approaches, one optimizes an objective function involving the pairwise do... |

1186 | On spectral clustering: Analysis and an algorithm
- Ng, Jordan, et al.
- 2002
Citation Context ...itioning. We use vcluster in the toolkit with the default setting. The other one is a modification of the bipartite spectral co-clustering algorithm (Dhillon, 2001). The modification is according to (Ng et al., 2002) and generates slightly better results than the original bipartite clustering algorithm. Both graph partitioning algorithms use fast heuristics and thus are dependent on the order of nodes from the ... |

864 | A fast and high quality multilevel scheme for partitioning irregular graphs
- Karypis, Kumar
- 1998
Citation Context ...state-of-the-art graph-based clustering algorithms are also included in our experiments. The first one is CLUTO (Karypis, 2002), a clustering toolkit based on the Metis graph partitioning algorithms (Karypis & Kumar, 1998). It is worth mentioning that CLUTO is positioned for clustering and drops the strong balance constraints in the original Metis partitioning. We use vcluster in the toolkit with the default setting. ... |

835 | A comparison of event models for naive bayes text classification
- McCallum, Nigam
- 1998
Citation Context ...als the number of words in the vocabulary used. Next we briefly introduce the three generative models studied in our experiments. 3.1 Multivariate Bernoulli model In the multivariate Bernoulli model (McCallum & Nigam, 1998), a document is represented as a binary vector over the space of words. The l-th dimension of the vector representing document di is denoted by bil, and is either 1 or 0, indicating whether word wl o... |
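The binary representation described in this context can be made concrete with a small sketch (illustrative only; the function and argument names are my own): under the multivariate Bernoulli model, every vocabulary position contributes to the log-likelihood, whether the word is present or absent.

```python
import numpy as np

def bernoulli_loglik(b, p):
    """Log-likelihood of a binary document vector b under one cluster's
    Bernoulli parameters p, where p[l] is the probability that word w_l
    occurs in a document of that cluster. Sketch under assumed names."""
    b = np.asarray(b, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)  # avoid log(0)
    # present words contribute log p[l]; absent words contribute log(1 - p[l])
    return float(np.sum(b * np.log(p) + (1 - b) * np.log(1 - p)))
```

Note how absence is informative here, which distinguishes the Bernoulli model from the multinomial model, where only observed word counts enter the likelihood.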

680 | Scatter/gather: A cluster-based approach to browsing large document collections
- Cutting, Pedersen, et al.
- 1992
Citation Context ...ll the mid-nineties, hierarchical agglomerative clustering using a suitable similarity measure such as cosine, Dice or Jaccard, formed the dominant paradigm for clustering documents (Rasmussen, 1992; Cutting et al., 1992). The increasing interest in processing larger collections of documents has led to a new emphasis on designing more efficient and effective techniques, leading to an explosion of diverse approaches t... |

465 | A Comparison of Document Clustering Techniques
- Steinbach, Karypis, et al.
- 2000
Citation Context ... clustering problem, including the (multilevel) self-organizing map (Kohonen et al., 2000), mixture of Gaussians (Tantrum et al., 2002), spherical k-means (Dhillon & Modha, 2001), bi-secting k-means (Steinbach et al., 2000), mixture of multinomials (Vaithyanathan & Dom, 2000; Meila & Heckerman, 2001), multi-level graph partitioning (Karypis, 2002), and co-clustering using bipartite spectral graph partitioning (Dhillon,... |

429 | Cluster ensembles: A knowledge Reuse Framework for Combining Partitionings
- Strehl, Ghosh
- 2004
Citation Context ...culate. It has been argued that the mutual information I(X;Y) between a r.v. X, governing the cluster labels, and a r.v. Y, governing the class labels, is a superior measure to purity or entropy (Strehl & Ghosh, 2002). Moreover, by normalizing this measure to lie in the range [0,1], it becomes quite impartial to k. There are several choices for normalization based on the entropies H(X) and H(Y). We shall follow ... |
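The normalized mutual information measure discussed in this context can be sketched as follows, assuming the geometric-mean normalization I(X;Y)/sqrt(H(X)H(Y)) so the result lies in [0, 1]; the function name and layout are my own, not code from the cited work.

```python
import numpy as np

def nmi(labels_x, labels_y):
    """Normalized mutual information between cluster labels X and class
    labels Y. Uses natural logs throughout; the base cancels in the ratio.
    Sketch only; other normalizations based on H(X) and H(Y) exist."""
    x, y = np.asarray(labels_x), np.asarray(labels_y)
    n = len(x)
    xs, ys = np.unique(x), np.unique(y)
    # joint distribution from the contingency table, then marginals
    pxy = np.array([[np.sum((x == a) & (y == b)) for b in ys] for a in xs]) / n
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    return float(mi / np.sqrt(hx * hy)) if hx > 0 and hy > 0 else 0.0
```

A perfect one-to-one match between clusters and classes yields 1 regardless of how the labels are named, while independent labelings yield 0, which is what makes the measure "impartial to k".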

335 | Co-clustering documents and words using bipartite spectral graph partitioning
- Dhillon
- 2001
Citation Context ...l., 2000), mixtures of multinomials (Vaithyanathan & Dom, 2000; Meila & Heckerman, 2001), multi-level graph partitioning (Karypis, 2002), and co-clustering using bipartite spectral graph partitioning (Dhillon, 2001). Most clustering methods proposed for data mining (Berkhin, 2002) can be divided into two categories: discriminative (or similarity-based) approaches (Indyk, 1999; Scholkopf & Smola, 2001; Vapnik, 1... |

327 | Concept decompositions for large sparse text data using clustering
- Dhillon, Modha
- 2001

283 | Statistics of Directional Data
- Mardia
- 1975
Citation Context ...the Gaussian distribution for directional data in the sense that it is the unique distribution of L2-normalized data that maximizes the entropy given the first and second moments of the distribution (Mardia, 1975). It has recently been shown that the spherical k-means algorithm that uses the cosine similarity metric (to measure the closeness of a data point to its cluster’s centroid) can be derived from a gen... |

274 | OHSUMED: an interactive retrieval evaluation and new large text collection for research
- Hersh, Buckley, et al.
- 1994
Citation Context ...ined by combining the CACM, CISI, CRANFIELD, and MEDLINE abstracts that were used in the past to evaluate various information retrieval systems. The ohscal dataset was from the OHSUMED collection (Hersh et al., 1994). It contains 11,162 documents from the following ten categories: antibodies, carcinoma, DNA, in-vitro, molecular sequence data, pregnancy, prognosis, receptors, risk factors, and tomography. The k1b... |

266 | On clusterings: Good, bad and spectral
- Kannan, Vempala, et al.
- 2000

266 | Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering
- McCallum
- 1996
Citation Context ... The NG20 dataset is a collection of 20,000 messages, collected from 20 different usenet newsgroups, 1,000 messages from each. We preprocessed the raw dataset using the Bow toolkit (McCallum, 1996), including chopping off headers and removing stop words as well as words that occur in less than three documents. In the resulting dataset, each document is represented by a 43,586-dimensional sparse... |
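The preprocessing described in this context (stop-word removal plus a document-frequency cutoff) can be sketched as below. This is an illustration under my own names, not the Bow toolkit's actual implementation.

```python
from collections import Counter

def prune_vocabulary(docs, stopwords, min_df=3):
    """Keep only words that are not stopwords and that appear in at least
    min_df distinct documents, mirroring the cutoff described above.
    `docs` is a list of token lists; returns the filtered documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each word once per document
    vocab = {w for w, c in df.items() if c >= min_df and w not in stopwords}
    return [[w for w in doc if w in vocab] for doc in docs]
```

Counting each word once per document (document frequency) rather than total occurrences is what makes a "words that occur in less than three documents" cutoff well defined.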

259 | Deterministic annealing for clustering, compression, classification, regression, and related optimization problems
- Rose
- 1998
Citation Context ...002) can be divided into two categories: discriminative (or similarity-based) approaches (Indyk, 1999; Scholkopf & Smola, 2001; Vapnik, 1998) and generative (or model-based) approaches (Blimes, 1998; Rose, 1998; Cadez et al., 2000). In similarity-based approaches, one optimizes an objective function involving the pairwise document similarities, aiming to maximize the average similarities within clusters and... |

250 | Refining initial points for k-means clustering
- Bradley, Fayyad
- 1998

234 | Self organization of a massive document collection. Neural Networks
- Kohonen, Kaski, et al.
- 2000
Citation Context ... a new emphasis on designing more efficient and effective techniques, leading to an explosion of diverse approaches to the document clustering problem, including the (multilevel) self-organizing map (Kohonen et al., 2000), mixture of Gaussians (Tantrum et al., 2002), spherical k-means (Dhillon & Modha, 2001), bi-secting k-means (Steinbach et al., 2000), mixtures of multinomials (Vaithyanathan & Dom, 2000; Meila & Heck... |

166 | Criterion functions for document clustering: Experiments and analysis
- Zhao, Karypis
- 2001
Citation Context ...the same preprocessing step, the resulting dataset consists of 2,998 documents in a 15,810-dimensional vector space. All the datasets associated with the CLUTO toolkit have already been preprocessed (Zhao & Karypis, 2001) and we further removed those words that appear in two or fewer documents. The classic dataset was obtained by combining the CACM, CISI, CRANFIELD, and MEDLINE abstracts that were used in the past to... |

163 | Impact of Similarity Measures on Web-page Clustering
- Strehl, Ghosh, et al.
- 2002
Citation Context ...irectional (i.e. only the vectors’ directions are important as they are typically normalized to unit length). In contrast, Gaussian-based models such as k-means perform very poorly for such datasets (Strehl et al., 2000). All nine instantiated algorithms are compared on a number of document datasets derived from the TREC collections and internet newsgroups. Our goal is to empirically investigate the suitability of e... |

143 | Clustering Algorithms
- Rasmussen
- 1992
Citation Context ... to the query. Till the mid-nineties, hierarchical agglomerative clustering using a suitable similarity measure such as cosine, Dice or Jaccard, formed the dominant paradigm for clustering documents (Rasmussen, 1992; Cutting et al., 1992). The increasing interest in processing larger collections of documents has led to a new emphasis on designing more efficient and effective techniques, leading to an explosion o... |

121 | CLUTO - A clustering toolkit
- Karypis
- 2002
Citation Context ..., spherical k-means (Dhillon & Modha, 2001), bi-secting k-means (Steinbach et al., 2000), mixtures of multinomials (Vaithyanathan & Dom, 2000; Meila & Heckerman, 2001), multi-level graph partitioning (Karypis, 2002), and co-clustering using bipartite spectral graph partitioning (Dhillon, 2001). Most clustering methods proposed for data mining (Berkhin, 2002) can be divided into two categories: discriminative (o... |

96 | Efficient Clustering of Very Large Document Collections (in Data Mining for Scientific and Engineering Applications)
- Dhillon, Fan, et al.
- 2001
Citation Context ...is, the more peaked the distribution is. For the vMF-based k-means algorithm, we assume κ is the same for all clusters, i.e., κj = κ, ∀j. This results in the spherical k-means (Dhillon & Modha, 2001; Dhillon et al., 2001). The model estimation in this case simply amounts to µj = (1/nj) ∑i:yi=j di, where nj is the number of documents in cluster j. The estimation for κ in the mixture-of-vMFs clustering algorithm, however... |

91 | An information-theoretic analysis of hard and soft assignment methods for clustering
- Kearns, Mansour, et al.
- 1997

80 | Clustering based on conditional distribution in an auxiliary space
- Sinkkonen, Kaski

76 | WebACE: A web agent for document categorization and exploration
- Han, Boley, et al.
- 1998
Citation Context ...the following ten categories: antibodies, carcinoma, DNA, in-vitro, molecular sequence data, pregnancy, prognosis, receptors, risk factors, and tomography. The k1b dataset is from the WebACE project (Han et al., 1998). Each document corresponds to a web page listed in the subject hierarchy of Yahoo! (http://www.yahoo.com). The other datasets are from TREC collections (http://trec.nist.gov). In particular, the hit... |

75 | A general probabilistic framework for clustering individuals
- Cadez, Gaffney, et al.
- 2000
Citation Context ...divided into two categories: discriminative (or similarity-based) approaches (Indyk, 1999; Scholkopf & Smola, 2001; Vapnik, 1998) and generative (or model-based) approaches (Blimes, 1998; Rose, 1998; Cadez et al., 2000). In similarity-based approaches, one optimizes an objective function involving the pairwise document similarities, aiming to maximize the average similarities within clusters and minimize the averag... |

54 | Using Unlabeled Data to Improve Text Classification
- Nigam
- 2001
Citation Context ...Pj(wl) = 1. They are different from the Pj(wl)'s in (5) and can be estimated by counting the number of documents in each cluster and the number of times wl occurs in all documents in the cluster j (Nigam, 2001). With Laplacian smoothing, the parameter estimation of multinomial models amounts to Pj(wl) = (1 + ∑i P(j|di, Λ) nil) / (|V| + ∑l′ ∑i P(j|di, Λ) nil′)... |
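The Laplacian-smoothing update in this context can be sketched for a single cluster as follows. The names are illustrative: `counts` holds the word counts n_il and `resp` holds the posteriors P(j|di, Λ) for one cluster j.

```python
import numpy as np

def multinomial_params(counts, resp):
    """Laplace-smoothed estimate of P_j(w_l) for one cluster: add 1 to
    every responsibility-weighted word count and |V| to the normalizer,
    so no word ever gets zero probability. counts: (n_docs, |V|) array
    of n_il; resp: length-n_docs array of P(j|d_i, Lambda)."""
    counts = np.asarray(counts, dtype=float)
    resp = np.asarray(resp, dtype=float)
    weighted = resp @ counts                       # sum_i P(j|d_i) * n_il per word
    return (1.0 + weighted) / (counts.shape[1] + weighted.sum())
```

With hard (k-means) assignment the responsibilities are 0/1 indicators, so the same formula covers both the hard and soft variants studied in the paper.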

50 | Iterative clustering of high dimensional text data augmented by local search
- Dhillon, Guan, et al.
- 2002

46 | An experimental comparison of model-based clustering methods
- Meila, Heckerman
- 2001
Citation Context ...et al., 2000), mixture of Gaussians (Tantrum et al., 2002), spherical k-means (Dhillon & Modha, 2001), bi-secting k-means (Steinbach et al., 2000), mixture of multinomials (Vaithyanathan & Dom, 2000; Meila & Heckerman, 2001), multi-level graph partitioning (Karypis, 2002), and co-clustering using bipartite spectral graph partitioning (Dhillon, 2001). Most clustering methods proposed for data mining (Berkhin, 2002) can b... |

26 | A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models
- Blimes
- 1997
Citation Context ...ng (Berkhin, 2002) can be divided into two categories: discriminative (or similarity-based) approaches (Indyk, 1999; Scholkopf & Smola, 2001; Vapnik, 1998) and generative (or model-based) approaches (Blimes, 1998; Rose, 1998; Cadez et al., 2000). In similarity-based approaches, one optimizes an objective function involving the pairwise document similarities, aiming to maximize the average similarities within ... |

23 | A Sublinear-time Approximation Scheme for Clustering in Metric Spaces
- Indyk
- 1999
Citation Context ...tite spectral graph partitioning (Dhillon, 2001). Most clustering methods proposed for data mining (Berkhin, 2002) can be divided into two categories: discriminative (or similarity-based) approaches (Indyk, 1999; Scholkopf & Smola, 2001; Vapnik, 1998) and generative (or model-based) approaches (Blimes, 1998; Rose, 1998; Cadez et al., 2000). In similarity-based approaches, one optimizes an objective function ... |

23 | Model-based hierarchical clustering
- Vaithyanathan, Dom
- 2000
Citation Context ...lf-organizing map (Kohonen et al., 2000), mixture of Gaussians (Tantrum et al., 2002), spherical k-means (Dhillon & Modha, 2001), bi-secting k-means (Steinbach et al., 2000), mixture of multinomials (Vaithyanathan & Dom, 2000; Meila & Heckerman, 2001), multi-level graph partitioning (Karypis, 2002), and co-clustering using bipartite spectral graph partitioning (Dhillon, 2001). Most clustering methods proposed for data min... |

22 | Scalable clustering methods for data mining
- Ghosh
- 2003
Citation Context ...obabilities of component models, are introduced. One can generalize these weight parameters to construct a general objective function (to be maximized) for model-based partitional clustering (Zhong & Ghosh, 2003): log P(O|Λ) = ∑i=1..n log ( ∑j=1..k αij P(oi|λj) ), (1) where Λ = {λj, αij}i=1,...,n,j=1,...,k is the set of all model parameters to be estimated, and the αij's are the model mixture weights that are sub... |

18 | Hierarchical model-based clustering of large datasets through fractionation and refractionation
- Tantrum, Murua, et al.
- 2002
Citation Context ...nd effective techniques, leading to an explosion of diverse approaches to the document clustering problem, including the (multilevel) self-organizing map (Kohonen et al., 2000), mixture of Gaussians (Tantrum et al., 2002), spherical k-means (Dhillon & Modha, 2001), bi-secting k-means (Steinbach et al., 2000), mixture of multinomials (Vaithyanathan & Dom, 2000; Meila & Heckerman, 2001), multi-level graph partitioning ... |

16 | Frequency sensitive competitive learning for clustering on high-dimensional hyperspheres
- Banerjee, Ghosh
Citation Context ...osine similarity metric (to measure the closeness of a data point to its cluster’s centroid) can be derived from a generative model based on the vMF distribution under certain restrictive conditions (Banerjee & Ghosh, 2002; Banerjee et al., 2003). The vMF distribution for cluster j can be written as P(di|λj) = (1/Z(κj)) exp(κj diT µj / ‖µj‖), (9) where di is a normalized (unit-length in L2 norm) document vector and th... |
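Up to the normalizer Z(κ), the vMF density in equation (9) reduces to an exponentiated, scaled cosine similarity, which the following sketch (with hypothetical names of my own) makes explicit:

```python
import numpy as np

def vmf_logpdf_unnorm(d, mu, kappa):
    """Log of the vMF density for a unit-length document vector d, up to
    the normalizing term log Z(kappa), which involves a Bessel function
    and is omitted here: kappa * d^T mu / ||mu||. For a shared kappa,
    ranking clusters by this quantity is equivalent to ranking by cosine
    similarity, which is why hard assignment under this model reduces
    to spherical k-means."""
    d = np.asarray(d, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return float(kappa * (d @ mu) / np.linalg.norm(mu))
```

The omitted Z(κ) matters only when κ differs across clusters or must itself be estimated, which is the harder case the context goes on to discuss.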

6 | Clustering on hyperspheres using Expectation Maximization
- Banerjee, Dhillon, et al.
- 2003
Citation Context ...(to measure the closeness of a data point to its cluster’s centroid) can be derived from a generative model based on the vMF distribution under certain restrictive conditions (Banerjee & Ghosh, 2002; Banerjee et al., 2003). The vMF distribution for cluster j can be written as P(di|λj) = (1/Z(κj)) exp(κj diT µj / ‖µj‖), (9) where di is a normalized (unit-length in L2 norm) document vector and the Bessel function Z(κj... |

6 | A unified framework for model-based clustering and its applications to clustering time sequences
- Zhong, Ghosh
- 2002
Citation Context ...re useful for clustering a stream of documents such as news feeds, as well as for incremental learning situations. We recently introduced a unified framework for probabilistic model-based clustering (Zhong & Ghosh, 2002), which includes a generic treatment of model-based partitional clustering methods. Basically, a generic model-based partitional clustering algorithm centers around two steps—a model re-estimation st... |

5 | Survey of clustering data mining techniques
- Berkhin
- 2002
Citation Context ...la & Heckerman, 2001), multi-level graph partitioning (Karypis, 2002), and co-clustering using bipartite spectral graph partitioning (Dhillon, 2001). Most clustering methods proposed for data mining (Berkhin, 2002) can be divided into two categories: discriminative (or similarity-based) approaches (Indyk, 1999; Scholkopf & Smola, 2001; Vapnik, 1998) and generative (or model-based) approaches (Blimes, 1998; Ros... |