Hierarchical Document Clustering Using Frequent Itemsets (2003)
| Venue: | IN PROC. SIAM INTERNATIONAL CONFERENCE ON DATA MINING 2003 (SDM 2003) |
| Citations: | 55 - 1 self |
BibTeX
@INPROCEEDINGS{Fung03hierarchicaldocument,
author = {Benjamin C.M. Fung and Ke Wang and Martin Ester},
title = {Hierarchical Document Clustering Using Frequent Itemsets},
booktitle = {IN PROC. SIAM INTERNATIONAL CONFERENCE ON DATA MINING 2003 (SDM 2003)},
year = {2003},
publisher = {}
}
Abstract
A major challenge in document clustering is extremely high dimensionality. For example, the vocabulary of a document set can easily contain thousands of words. On the other hand, each document often contains only a small fraction of the words in the vocabulary. These features require special handling. Another requirement is hierarchical clustering, where clustered documents can be browsed according to the increasing specificity of topics. In this paper, we propose to use the notion of frequent itemsets, which comes from association rule mining, for document clustering. The intuition behind our clustering criterion is that each cluster is identified by some common words, called frequent itemsets, shared by the documents in the cluster. Frequent itemsets are also used to produce a hierarchical topic tree for clusters. By focusing on frequent items, the dimensionality of the document set is drastically reduced. We show that this method outperforms the best existing methods in terms of both clustering accuracy and scalability.
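The intuition in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it only mines word sets that occur together in enough documents (Apriori-style) and treats the documents covered by each frequent itemset as a candidate cluster. The toy documents, the `min_docs` threshold, the `max_size` cap, and both function names are assumptions made for illustration.

```python
from itertools import combinations


def frequent_itemsets(docs, min_docs, max_size=2):
    """Map each frequent word set to the indices of the documents containing it.

    min_docs is a hypothetical absolute support threshold (number of documents).
    """
    word_sets = [set(d.lower().split()) for d in docs]

    # Level 1: documents covered by each single word.
    cover = {}
    for i, words in enumerate(word_sets):
        for w in words:
            cover.setdefault(frozenset([w]), set()).add(i)
    current = {s: ids for s, ids in cover.items() if len(ids) >= min_docs}
    result = dict(current)

    # Level k: join two frequent (k-1)-itemsets whose union has k words.
    for size in range(2, max_size + 1):
        candidates = {}
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) == size and union not in candidates:
                ids = current[a] & current[b]  # documents containing every word
                if len(ids) >= min_docs:
                    candidates[union] = ids
        if not candidates:
            break
        current = candidates
        result.update(current)
    return result


def candidate_parents(child, itemsets):
    """List frequent itemsets one word smaller than child (possible tree parents)."""
    return [s for s in itemsets if len(s) == len(child) - 1 and s < child]


# Toy corpus (assumed data, not from the paper).
docs = [
    "apple banana fruit",
    "apple fruit juice",
    "banana fruit",
    "car engine wheel",
    "car engine",
]
itemsets = frequent_itemsets(docs, min_docs=2)
for words, ids in sorted(itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(words), "covers documents", sorted(ids))
```

Each frequent itemset acts as a cluster label, and only the frequent words need to be kept as features, which is where the dimensionality reduction comes from. `candidate_parents` hints at how a topic tree could be formed by linking a cluster to a more general one labeled by a subset of its words; how the paper chooses among such candidates is not stated in the abstract.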







