## A Comparative Study of Generative Models for Document Clustering (2003)

Venue: SIAM Int. Conf. Data Mining Workshop on Clustering High Dimensional Data and Its Applications

Citations: 38 (4 self)

### BibTeX

@INPROCEEDINGS{Zhong03acomparative,

author = {Shi Zhong and Joydeep Ghosh},

title = {A Comparative Study of Generative Models for Document Clustering},

booktitle = {SIAM Int. Conf. Data Mining Workshop on Clustering High Dimensional Data and Its Applications},

year = {2003}

}

### Abstract

Generative models based on the multivariate Bernoulli and multinomial distributions have been widely used for text classification. Recently, the spherical k-means algorithm, which has desirable properties for text clustering, has been shown to be a special case of a generative model based on a mixture of von Mises-Fisher (vMF) distributions. This paper compares these three probabilistic models for text clustering, both theoretically and empirically, using a general model-based clustering framework. For each model, we investigate three strategies for assigning documents to models: maximum likelihood (k-means) assignment, stochastic assignment, and soft assignment. Our experimental results over a large number of datasets show that, in terms of clustering quality, (a) the Bernoulli model is the worst for text clustering; (b) the vMF model produces better clustering results than both the Bernoulli and multinomial models; and (c) soft assignment leads to comparable or slightly better results than hard assignment. We also use deterministic annealing (DA) to improve the vMF-based soft clustering and compare all the model-based algorithms with the state-of-the-art discriminative approach to document clustering based on graph partitioning (CLUTO) and a spectral co-clustering method. Overall, CLUTO and DA perform the best but are also the most computationally expensive; the spectral co-clustering algorithm fares worse than the vMF-based methods.
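The three assignment strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, and the sketch assumes unit-length document vectors, equal cluster priors, and a single shared vMF concentration `kappa`, so that the vMF normalizing constant cancels and each cluster's log-likelihood reduces to `kappa` times the cosine similarity with the cluster mean.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 length (vMF models live on the unit sphere)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_scores(doc, means, kappa):
    """Unnormalized vMF log-likelihood of `doc` under each cluster mean.

    Assumption: equal priors and a shared kappa, so only kappa * (mu . x)
    matters when comparing clusters.
    """
    return [kappa * sum(d * m for d, m in zip(doc, mu)) for mu in means]


def posteriors(scores):
    """Softmax over log scores, subtracting the max for numerical stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hard_assign(scores):
    """Maximum likelihood (k-means style): pick the single best cluster."""
    return max(range(len(scores)), key=lambda k: scores[k])


def stochastic_assign(scores, rng):
    """Sample one cluster index according to the posterior probabilities."""
    return rng.choices(range(len(scores)), weights=posteriors(scores), k=1)[0]


def soft_assign(scores):
    """Keep the full posterior: the document belongs fractionally to all clusters."""
    return posteriors(scores)


# Toy example: one document and two cluster means in a 3-term vocabulary.
doc = normalize([3.0, 1.0, 0.0])
means = [normalize([1.0, 0.0, 0.0]), normalize([0.0, 1.0, 0.0])]
scores = log_scores(doc, means, kappa=5.0)

hard = hard_assign(scores)
sampled = stochastic_assign(scores, random.Random(0))
soft = soft_assign(scores)
```

Under these assumptions, hard assignment with the cosine score is the assignment step of spherical k-means, while a larger `kappa` pushes the soft posteriors closer to the hard assignment.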

