A Comparison of Document Clustering Techniques (2000) [221 citations — 16 self]

by Michael Steinbach, George Karypis, Vipin Kumar

Abstract:

This paper presents the results of an experimental study of some common document clustering techniques. In particular, we compare the two main approaches to document clustering, agglomerative hierarchical clustering and K-means. (For K-means we used a "standard" K-means algorithm and a variant of K-means, "bisecting" K-means.) Hierarchical clustering is often portrayed as the better quality clustering approach, but is limited because of its quadratic time complexity. In contrast, K-means and its variants have a time complexity which is linear in the number of documents, but are thought to produce inferior clusters. Sometimes K-means and agglomerative hierarchical approaches are combined so as to "get the best of both worlds." However, our results indicate that the bisecting K-means technique is better than the standard K-means approach and as good or better than the hierarchical approaches that we tested for a variety of cluster evaluation metrics. We propose an explanation for these r...
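The abstract names bisecting K-means but does not spell out the procedure. As a rough illustration only, the sketch below follows the commonly described form of the technique: start with all points in one cluster and repeatedly split one cluster in two with a basic 2-means step until K clusters exist. Everything here is an assumption rather than the authors' implementation: the function names (`two_means`, `bisecting_kmeans`), the choice to split the largest cluster, the fixed iteration cap, and the use of squared Euclidean distance on plain coordinate tuples (document clustering would more typically use a similarity measure on term vectors).

```python
import random

def _mean(points):
    # Component-wise mean of a non-empty list of equal-length points.
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

def _sqdist(a, b):
    # Squared Euclidean distance (an assumption; see the lead-in).
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_means(points, rng, iters=20):
    """Split points into two groups with a basic K-means step (K = 2)."""
    centers = rng.sample(points, 2)  # two randomly chosen starting centers
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            nearer = 0 if _sqdist(p, centers[0]) <= _sqdist(p, centers[1]) else 1
            groups[nearer].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split, e.g. all points identical
        new_centers = [_mean(groups[0]), _mean(groups[1])]
        if new_centers == centers:
            break  # assignments have stabilized
        centers = new_centers
    return groups

def bisecting_kmeans(points, k, seed=0):
    """Hypothetical sketch: split the largest cluster until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        target = clusters.pop()  # largest cluster (assumed selection rule)
        if len(target) < 2:
            clusters.append(target)
            break  # nothing left to split
        left, right = two_means(target, rng)
        if not left or not right:
            clusters.append(target)
            break  # could not split further
        clusters.extend([left, right])
    return clusters
```

Calling `bisecting_kmeans(points, k)` with a list of coordinate tuples returns a list of clusters, each a list of the original points. Each split touches only the points of one cluster, which is in keeping with the linear-time behaviour the abstract attributes to K-means and its variants.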

Citations

3170 The mathematical theory of communication – Shannon - 1962
1478 Algorithms for Clustering Data – Jain, Dubes - 1988
728 Finding Groups in Data: An Introduction to Cluster Analysis – Kaufman, Rousseeuw - 1990
430 Scatter/gather: a cluster-based approach to browsing large document collections – Cutting, Karger, et al. - 1992
301 Hierarchically classifying documents using very few words – Koller, Sahami - 1997
209 ROCK: A Robust Clustering Algorithm for Categorical Attributes – Guha, Rastogi, et al.
133 Refining initial points for K-Means clustering – Bradley, Fayyad - 1998
86 Information Retrieval Systems - Theory and Implementation – Kowalski - 1996
83 Fast and effective text mining using linear-time document clustering – Larsen, Aone - 1999
77 Optimization of inverted vector searches – Buckley, Lewit - 1985
71 Fast and intuitive clustering of web documents – Zamir, Etzioni, et al. - 1997
54 WebACE: A web agent for document categorization and exploration – Han, Boley, et al. - 1998
31 On the merits of building categorization systems by supervised clustering – Aggarwal, Gates, et al. - 1999
25 Comparison of hierarchic agglomerative clustering methods for document retrieval – El-Hamdouchi, Willett - 1989
24 Fast and effective text mining using linear-time document clustering – Larsen, Aone - 1999
13 A practical Clustering Algorithm for Static and Dynamic Information Organization – Aslam, Pelekhov, et al. - 1999
10 A Comparison of Document Clustering – Steinbach, Karypis, et al. - 2000
8 Reuters-21578 text categorization text collection 1.0 – Lewis - 1997
4 Algorithms for Clustering Data – Jain, Dubes - 1988