Concept Decompositions for Large Sparse Text Data using Clustering (2000)
| Venue: | Machine Learning |
| Citations: | 231 - 23 self |
BibTeX
@ARTICLE{Dhillon00conceptdecompositions,
author = {Inderjit S. Dhillon and Dharmendra S. Modha},
title = {Concept Decompositions for Large Sparse Text Data using Clustering},
journal = {Machine Learning},
year = {2000},
pages = {143--175}
}
Abstract
Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors: a few thousand dimensions and a sparsity of 95 to 99% are typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters, each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain "fractal-like" and "self-similar" behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned...
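The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `spherical_kmeans` and `concept_decomposition`, the random initialization from k documents, the fixed iteration count, and the dense toy matrix are all assumptions made for illustration. Because the abstract is truncated, the sketch also assumes that the subspace used for the least-squares approximation is the one spanned by the concept vectors.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Cluster unit-norm rows of X (n_docs x n_terms) by cosine similarity.

    Returns (labels, concepts); each row of `concepts` is the centroid of
    its cluster normalized to unit Euclidean norm.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Assumption: start from k distinct documents chosen at random.
    concepts = X[rng.choice(n, size=k, replace=False)].copy()
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # For unit vectors, cosine similarity is just the dot product.
        labels = np.argmax(X @ concepts.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # empty cluster: keep the previous concept vector
            centroid = members.sum(axis=0)
            norm = np.linalg.norm(centroid)
            if norm > 0:
                concepts[j] = centroid / norm
    return labels, concepts

def concept_decomposition(X, concepts):
    """Least-squares approximation of X within the span of the concept vectors.

    Assumption: the subspace is the one spanned by the rows of `concepts`.
    """
    # Solve min_Z || X.T - concepts.T @ Z ||_F column by column.
    Z, *_ = np.linalg.lstsq(concepts.T, X.T, rcond=None)
    return (concepts.T @ Z).T

# Toy document-term matrix (hypothetical data), rows scaled to unit norm.
X = np.array([[1.0, 0.0, 0.0],
              [0.9, 0.1, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.9, 0.1]])
X = X / np.linalg.norm(X, axis=1, keepdims=True)

labels, concepts = spherical_kmeans(X, k=2)
approx = concept_decomposition(X, concepts)
print("labels:", labels)
print("approximation error:", np.linalg.norm(X - approx))
```

For the sparse, high-dimensional collections the abstract describes, a sparse matrix type would be the natural replacement for the dense array used here; the dense form only keeps the sketch short.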







