## Concept Decompositions for Large Sparse Text Data using Clustering (2000)

### Download Links

- [www.cs.utexas.edu]
- [www.almaden.ibm.com]
- [www.public.asu.edu]
- DBLP

### Other Repositories/Bibliography

Venue: Machine Learning

Citations: 302 (27 self)

### BibTeX

@ARTICLE{Dhillon00conceptdecompositions,
    author = {Inderjit S. Dhillon and Dharmendra S. Modha},
    title = {Concept Decompositions for Large Sparse Text Data using Clustering},
    journal = {Machine Learning},
    year = {2000},
    pages = {143--175}
}

### Abstract

Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors: a few thousand dimensions and a sparsity of 95 to 99% are typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters, each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain "fractal-like" and "self-similar" behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned...
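The two ideas in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes unit-norm document vectors stored as a small dense NumPy array (real text data would be sparse) and a random initialisation. It also assumes the subspace in question is the one spanned by the concept vectors, which the truncated abstract does not confirm. The names `spherical_kmeans` and `concept_decomposition` and the parameters `n_iter` and `seed` are hypothetical.

```python
import numpy as np


def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Cluster unit-norm rows of X by cosine similarity.

    X is an (n_docs, n_terms) array whose rows have unit Euclidean norm.
    Returns (labels, concepts); concepts is (k, n_terms) with unit-norm rows.
    """
    rng = np.random.default_rng(seed)
    # Assumption: initialise concept vectors from k distinct documents.
    concepts = X[rng.choice(X.shape[0], size=k, replace=False)].copy()
    for _ in range(n_iter):
        # For unit vectors, the dot product equals the cosine similarity.
        labels = np.argmax(X @ concepts.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # keep the previous concept vector for an empty cluster
            centroid = members.mean(axis=0)
            norm = np.linalg.norm(centroid)
            if norm > 0:
                # Concept vector: the centroid scaled to unit Euclidean norm.
                concepts[j] = centroid / norm
    # Final assignment so the labels match the returned concept vectors.
    labels = np.argmax(X @ concepts.T, axis=1)
    return labels, concepts


def concept_decomposition(X, concepts):
    """Least-squares approximation of X within the span of the concept vectors."""
    # Solve min_Z || concepts.T @ Z - X.T ||_F, then map back to document rows.
    Z, *_ = np.linalg.lstsq(concepts.T, X.T, rcond=None)
    return (concepts.T @ Z).T


# Toy data (made up for illustration): two documents per topic.
X = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.8, 0.2],
])
X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm rows

labels, concepts = spherical_kmeans(X, k=2)
X_hat = concept_decomposition(X, concepts)
residual = np.linalg.norm(X - X_hat)  # Frobenius norm of the approximation error
```

Because `X_hat` is an orthogonal projection of the document vectors onto a subspace, `residual` cannot exceed the Frobenius norm of `X`; how quickly it shrinks as `k` grows is the kind of question the paper studies.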