Abstract:
This paper presents the results of an experimental study of some common document clustering techniques. In particular, we compare the two main approaches to document clustering: agglomerative hierarchical clustering and K-means. (For K-means we used a "standard" K-means algorithm and a variant, "bisecting" K-means.) Hierarchical clustering is often portrayed as the higher-quality clustering approach, but it is limited by its quadratic time complexity. In contrast, K-means and its variants have a time complexity that is linear in the number of documents, but are thought to produce inferior clusters. Sometimes K-means and agglomerative hierarchical approaches are combined so as to "get the best of both worlds." However, our results indicate that the bisecting K-means technique is better than the standard K-means approach and as good as or better than the hierarchical approaches that we tested, across a variety of cluster evaluation metrics. We propose an explanation for these r...
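The abstract names bisecting K-means without describing it. As a rough illustration only, the sketch below shows the commonly described form of the idea: start with one cluster holding every document, then repeatedly split one cluster in two with ordinary K-means (K = 2) until the desired number of clusters is reached. Several details here are assumptions and are not taken from the abstract: documents are plain numeric vectors, distance is squared Euclidean (document clustering more often uses cosine similarity), the largest cluster is the one chosen for splitting, and the best of a few trial splits is kept. The function names (`two_means`, `bisecting_kmeans`) are hypothetical.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(dist2(p, c) for p in points)


def two_means(points, rng, iters=20):
    """Split points into two groups with plain K-means (K = 2)."""
    centers = rng.sample(points, 2)  # two random points as initial centers
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            nearer_first = dist2(p, centers[0]) <= dist2(p, centers[1])
            groups[0 if nearer_first else 1].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split; the caller discards it
        new_centers = [centroid(groups[0]), centroid(groups[1])]
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return groups


def bisecting_kmeans(points, k, trials=5, seed=0):
    """Return up to k clusters (lists of points) by repeated bisection."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        # Assumed selection rule: split the largest remaining cluster.
        clusters.sort(key=len)
        target = clusters.pop()
        if len(target) < 2:
            clusters.append(target)
            break
        best = None
        for _ in range(trials):
            left, right = two_means(target, rng)
            if not left or not right:
                continue
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, left, right)
        if best is None:  # no usable split found; stop early
            clusters.append(target)
            break
        clusters.extend(best[1:])
    return clusters
```

Each bisection touches only the points of the cluster being split, and each K-means pass is linear in that cluster's size, which is consistent with the linear-time behaviour the abstract attributes to K-means and its variants.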