## Concept Decompositions for Large Sparse Text Data using Clustering (2000)

Venue: | Machine Learning |

Citations: | 341 - 27 self |

### BibTeX

@ARTICLE{Dhillon00conceptdecompositions,

author = {Inderjit S. Dhillon and Dharmendra S. Modha},

title = {Concept Decompositions for Large Sparse Text Data using Clustering},

journal = {Machine Learning},

year = {2000},

pages = {143--175}

}

### Abstract

Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors--a few thousand dimensions and a sparsity of 95 to 99% is typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high-dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain "fractal-like" and "self-similar" behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned...
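
The two ideas named in the abstract, spherical k-means and the least-squares concept decomposition, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a small dense matrix `X` with one unit-norm document per row (the paper works with large sparse data), and the function names, the random initialization, the fixed iteration count, and the empty-cluster handling are all hypothetical choices made for illustration.

```python
import numpy as np


def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Cluster unit-norm rows of X by cosine similarity.

    Returns (labels, concepts); each row of `concepts` is a cluster
    centroid rescaled to unit Euclidean norm (a "concept vector").
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Assumption: start from k distinct documents chosen at random.
    concepts = X[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(n_iter):
        # For unit vectors the inner product equals the cosine similarity.
        labels = np.argmax(X @ concepts.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # assumption: keep the old concept vector
            centroid = members.mean(axis=0)
            norm = np.linalg.norm(centroid)
            if norm > 0:
                concepts[j] = centroid / norm
    # Final assignment against the last set of concept vectors.
    labels = np.argmax(X @ concepts.T, axis=1)
    return labels, concepts


def concept_decomposition(X, concepts):
    """Least-squares approximation of each row of X within the linear
    subspace spanned by the concept vectors."""
    # Solve min_Z ||concepts.T @ Z - X.T||_F, one column per document.
    Z, *_ = np.linalg.lstsq(concepts.T, X.T, rcond=None)
    return (concepts.T @ Z).T


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    docs = rng.random((100, 20))  # toy nonnegative "term weights"
    docs /= np.linalg.norm(docs, axis=1, keepdims=True)
    labels, concepts = spherical_kmeans(docs, k=4)
    approx = concept_decomposition(docs, concepts)
    print("relative error:", np.linalg.norm(docs - approx) / np.linalg.norm(docs))
```

Because the approximation is an orthogonal projection onto the span of the concept vectors, its Frobenius-norm error can never exceed the norm of `X` itself; how close it comes to a truncated SVD of the same rank is the kind of question the paper studies empirically.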

### Citations

4259 | Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context ...e "similarity" between such vectors by their inner product, known as cosine similarity (Salton and McGill, 1983). In this paper, we will use a variant of the well known "Euclidean" k-means algorithm (Duda and Hart, 1973; Hartigan, 1975) that uses cosine similarity (Rasmussen, 1992). We shall show that this algorithm partitions the high-dimensional unit sphere using a collection of great hypercircles, and hence we sha... |

3086 | Indexing by latent semantic analysis
- Deerwester, Dumais, et al.
- 1990
Citation Context ...rthermore, matrix approximations are often used in practice for feature selection and dimensionality reduction prior to building a learning model such as a classifier. In a search/retrieval context, (Deerwester et al., 1990; Berry et al., 1995) have proposed latent semantic indexing (LSI) that uses truncated singular value decomposition (SVD) or principal component analysis to discover "latent" relationships between cor... |

2162 | Matrix computations - Golub, Van Loan - 1993 |

2102 | The Fractal Geometry of Nature
- Mandelbrot, B
- 1983
Citation Context ...r structures to be similar at various resolutions. The only essential difference is the progressive separation of the intra- from inter-cluster structure. This prompts an obvious analogy with fractals (Mandelbrot, 1988). Thus, any proposed statistical model for text data should be consistent with this fractal behavior. In fact, it might be meaningful to seek maximum entropy distributions subject to such empirical c... |

2013 | On the self-similar nature of Ethernet traffic (extended version)
- Leland, Taqqu, et al.
- 1994
Citation Context ...while claiming no such breakthrough, we would like to point out that the discovery of fractal nature of ethernet traffic has greatly impacted the design, control, and analysis of high-speed networks (Leland et al., 1994). In Section 4, we propose a new matrix approximation scheme--concept decomposition--that solves a least-squares problem after clustering, namely, computes the least-squares approximation onto the l... |

1759 | Term-weighting approaches in automatic text retrieval
- Salton, Buckley
- 1988
Citation Context ...rall importance of a word in the entire set of documents. The objective of such weighting schemes is to enhance discrimination between various document vectors and to enhance retrieval effectiveness (Salton and Buckley, 1988). There are many schemes for selecting the term, global, and normalization components, for example, (Kolda, 1997) presents 5, 5, and 2 schemes, respectively, for the term, global, and normalization c... |

1756 | Clustering Algorithms
- Hartigan
- 1975
Citation Context ...n such vectors by their inner product, known as cosine similarity (Salton and McGill, 1983). In this paper, we will use a variant of the well known "Euclidean" k-means algorithm (Duda and Hart, 1973; Hartigan, 1975) that uses cosine similarity (Rasmussen, 1992). We shall show that this algorithm partitions the high-dimensional unit sphere using a collection of great hypercircles, and hence we shall refer to this... |

901 | Probabilistic latent semantic indexing
- Hofmann
- 1999
Citation Context ...iderations, we have removed some experimental results from this paper; complete details appear in our IBM Technical Report (Dhillon and Modha, 1999). Probabilistic latent semantic analysis (PLSA) of (Hofmann, 1999) views the word-by-document matrix as a co-occurrence table describing the probability that a word is related to a document, and approximates this matrix using the aspect model (Saul and Pereira, 1997... |

696 | Scatter/gather: a cluster-based approach to browsing large document collections
- Cutting, Karger, et al.
- 1992
Citation Context ...thods have been explored in the text mining literature; for detailed reviews, see (Rasmussen, 1992; Willet, 1988). Recently, there has been a flurry of activity in this area, see (Boley et al., 1998; Cutting et al., 1992; Hearst and Pedersen, 1996; Sahami et al., 1999; Schütze and Silverstein, 1997; Silverstein and Pedersen, 1997; Vaithyanathan and Dom, 1999; Zamir and Etzioni, 1998). A starting point for applying clu... |

592 | Using linear algebra for intelligent information retrieval
- Berry, Dumais, et al.
- 1995
Citation Context ...mations are often used in practice for feature selection and dimensionality reduction prior to building a learning model such as a classifier. In a search/retrieval context, (Deerwester et al., 1990; Berry et al., 1995) have proposed latent semantic indexing (LSI) that uses truncated singular value decomposition (SVD) or principal component analysis to discover "latent" relationships between correlated words and do... |

515 | Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Chapter 3: New indices for text: Pat trees and Pat arrays
- Gonnet, Baeza-Yates, et al.
- 1992
Citation Context ...ROCESSING 1. Ignoring case, extract all unique words from the entire set of documents. 2. Eliminate non-content-bearing "stopwords" such as "a", "and", "the", etc. For sample lists of stopwords, see (Frakes and Baeza-Yates, 1992, Chapter 7). 3. For each document, count the number of occurrences of each word. 4. Using heuristic or information-theoretic criteria, elimin... |

455 | Syntactic clustering of the web
- Broder, Glassman, et al.
- 1997
Citation Context ...for routing new information such as that arriving from newsfeeds and new scientific publications. For experiments describing a certain syntactic clustering of the whole web and its applications, see (Broder et al., 1997). We have used clustering for visualizing and navigating collections of documents in (Dhillon et al., 1998). Various classical clustering algorithms such as the k-means algorithm and its variants, hi... |

426 | Reexamining the cluster hypothesis: Scatter/Gather on retrieval results
- Hearst, Pedersen
- 1996
Citation Context ...ed in the text mining literature; for detailed reviews, see (Rasmussen, 1992; Willet, 1988). Recently, there has been a flurry of activity in this area, see (Boley et al., 1998; Cutting et al., 1992; Hearst and Pedersen, 1996; Sahami et al., 1999; Schütze and Silverstein, 1997; Silverstein and Pedersen, 1997; Vaithyanathan and Dom, 1999; Zamir and Etzioni, 1998). A starting point for applying clustering algorithms to unstr... |

399 | Pivoted document length normalization
- Singhal, Buckley, et al.
- 1996
Citation Context ...ors have been normalized to have unit L2 norm, that is, they can be thought of as points on a high-dimensional unit sphere. Such normalization mitigates the effect of differing lengths of documents (Singhal et al., 1996). It is natural to measure "similarity" between such vectors by their inner product, known as cosine similarity (Salton and McGill, 1983). In this paper, we will use a variant of the well known "Eucl... |

369 | Web Document Clustering: A Feasibility Demonstration
- Zamir, Etzioni
- 1998
Citation Context ...in this area, see (Boley et al., 1998; Cutting et al., 1992; Hearst and Pedersen, 1996; Sahami et al., 1999; Schütze and Silverstein, 1997; Silverstein and Pedersen, 1997; Vaithyanathan and Dom, 1999; Zamir and Etzioni, 1998). A starting point for applying clustering algorithms to unstructured text data is to create a vector space model for text data (Salton and McGill, 1983). The basic idea is (a) to extract unique cont... |

269 | Latent semantic indexing: A probabilistic analysis
- Papadimitriou, Raghavan, et al.
- 1998
Citation Context ...ry efficient matrix approximation scheme known as semidiscrete decomposition. (Gallant, 1994; Caid and Oing, 1997) have used an "implicit" matrix approximation scheme based on their context vectors. (Papadimitriou et al., 1998) have proposed computationally efficient matrix approximations based on random projections. Finally, (Isbell and Viola, 1998) have used independent component analysis for identifying directions repre... |

143 | Clustering Algorithms
- Rasmussen
- 1992
Citation Context ...orithms such as the k-means algorithm and its variants, hierarchical agglomerative clustering, and graph-theoretic methods have been explored in the text mining literature; for detailed reviews, see (Rasmussen, 1992; Willet, 1988). Recently, there has been a flurry of activity in this area, see (Boley et al., 1998; Cutting et al., 1992; Hearst and Pedersen, 1996; Sahami et al., 1999; Schütze and Silverstein, 1997... |

124 | Numerical Methods for Computing Angles between Linear Subspaces
- Björck, Golub
- 1973
Citation Context ...ubspace spanned by the leading k singular vectors, that is, $S_k = \mathrm{span}\{u_1, u_2, \ldots, u_k\}$. The closeness of the subspaces $C^{+}_k$ and $S_k$ can be measured by quantities known as principal angles (Björck and Golub, 1973; Golub and Van Loan, 1996). Intuitively, principal angles generalize the notion of an angle between two lines to higher-dimensional subspaces of $\mathbb{R}^d$. Let F and G be given subspaces of $\mathbb{R}^d$. Assume t... |

108 | Projections for efficient document clustering
- Schütze, Silverstein
- 1997
Citation Context ...eviews, see (Rasmussen, 1992; Willet, 1988). Recently, there has been a flurry of activity in this area, see (Boley et al., 1998; Cutting et al., 1992; Hearst and Pedersen, 1996; Sahami et al., 1999; Schütze and Silverstein, 1997; Silverstein and Pedersen, 1997; Vaithyanathan and Dom, 1999; Zamir and Etzioni, 1998). A starting point for applying clustering algorithms to unstructured text data is to create a vector space model... |

107 | A data-clustering algorithm on distributed memory multiprocessors - Dhillon, Modha - 2000 |

91 | Document categorization and query generation on the world wide web using WebACE. AI Review (accepted for publication)
- Boley, Gini, et al.
- 1999
Citation Context ...d graph-theoretic methods have been explored in the text mining literature; for detailed reviews, see (Rasmussen, 1992; Willet, 1988). Recently, there has been a flurry of activity in this area, see (Boley et al., 1998; Cutting et al., 1992; Hearst and Pedersen, 1996; Sahami et al., 1999; Schütze and Silverstein, 1997; Silverstein and Pedersen, 1997; Vaithyanathan and Dom, 1999; Zamir and Etzioni, 1998). A starting ... |

82 | Aggregate and mixed-order Markov models for statistical language processing
- Saul, Pereira
- 1997
Citation Context ...LSA) of (Hofmann, 1999) views the word-by-document matrix as a co-occurrence table describing the probability that a word is related to a document, and approximates this matrix using the aspect model (Saul and Pereira, 1997). While similar in spirit, concept decompositions are distinct from PLSA. Our framework is geometric and is concerned with orthonormal L2 projections, while PLSA is probabilistic and is concerned wi... |

82 | Recent trends in hierarchic document clustering: A critical review
- Willet
- 1988
Citation Context ...he k-means algorithm and its variants, hierarchical agglomerative clustering, and graph-theoretic methods have been explored in the text mining literature; for detailed reviews, see (Rasmussen, 1992; Willet, 1988). Recently, there has been a flurry of activity in this area, see (Boley et al., 1998; Cutting et al., 1992; Hearst and Pedersen, 1996; Sahami et al., 1999; Schütze and Silverstein, 1997; Silverstein ... |

53 | Introduction to Modern Information Retrieval
- Salton, McGill
- 1983
Citation Context ...d Pedersen, 1997; Vaithyanathan and Dom, 1999; Zamir and Etzioni, 1998). A starting point for applying clustering algorithms to unstructured text data is to create a vector space model for text data (Salton and McGill, 1983). The basic idea is (a) to extract unique content-bearing words from the set of documents and treat these words as features and (b) to represent each document as a vector of certain weighted word fre... |

51 | Density estimation by stochastic complexity
- Rissanen, Speed, et al.
- 1992
Citation Context ...t of data points. Also, a pdf is a nonnegative function with unit area. In this paper, we estimate the pdf of a set of one-dimensional data points using the unweighted mixture of histogram method of (Rissanen et al., 1992). Example 2.D (Average intra-cluster structure) We clustered the NSF data set into k = 8, 64, and 512 clusters. In Figure 2, we plot the estimated pdfs of the n = 13297 numbers in (9) for these three... |

47 | A Microeconomic View of Data Mining
- Kleinberg, Papadimitriou, et al.
- 1998
Citation Context ...e objective function measures the combined coherence of all the k clusters. Such an objective function has also been proposed and studied theoretically in the context of market segmentation problems (Kleinberg et al., 1998). 3.4. SPHERICAL k-MEANS We seek a partitioning of the document vectors $x_1, x_2, \ldots, x_n$ into k disjoint clusters $\pi^{\star}_1, \pi^{\star}_2, \ldots, \pi^{\star}_k$ that maximizes the objective function in (5), th... |

45 | SONIA: A Service for Organizing Networked Information Autonomously
- Sahami, Yusufali, et al.
- 1998
Citation Context ...ature; for detailed reviews, see (Rasmussen, 1992; Willet, 1988). Recently, there has been a flurry of activity in this area, see (Boley et al., 1998; Cutting et al., 1992; Hearst and Pedersen, 1996; Sahami et al., 1999; Schütze and Silverstein, 1997; Silverstein and Pedersen, 1997; Vaithyanathan and Dom, 1999; Zamir and Etzioni, 1998). A starting point for applying clustering algorithms to unstructured text data is ... |

42 | The complexity of the generalized Lloyd–Max problem - Garey, Johnson, et al. - 1982 |

36 | Global convergence and empirical consistency of the generalized Lloyd algorithm - Sabin, Gray - 1986 |

35 | Digital image compression by outer product expansion
- O'Leary, Peleg
- 1983
Citation Context ...scover "latent" relationships between correlated words and documents. Truncated SVD is a popular and well studied matrix approximation scheme (Golub and Van Loan, 1996). Based on the earlier work of (O'Leary and Peleg, 1983) for image compression, (Kolda, 1997) has developed a memory efficient matrix approximation scheme known as semidiscrete decomposition. (Gallant, 1994; Caid and Oing, 1997) have used an "implicit" ma... |

33 | Quantization and the method of k-means
- Pollard
- 1982
Citation Context ...wever, it is important to realize that the corollary does not imply that the underlying partitioning $\{\pi^{(t)}_j\}_{j=1}^{k}$ converges. We refer the reader interested in more general convergence results to (Pollard, 1982; Sabin and Gray, 1986). The spherical k-means algorithm (like other gradient ascent schemes) is prone to local maxima. Nonetheless, the algorithm yielded reasonable results for the experimental resu... |

31 | Visualizing Class Structure of Multidimensional Data
- Dhillon, Spangler
- 1998
Citation Context ...riments describing a certain syntactic clustering of the whole web and its applications, see (Broder et al., 1997). We have used clustering for visualizing and navigating collections of documents in (Dhillon et al., 1998). Various classical clustering algorithms such as the k-means algorithm and its variants, hierarchical agglomerative clustering, and graph-theoretic methods have been explored in the text mining lite... |

30 | Limited-memory matrix methods with applications. Dissertation, Applied Mathematics
- Kolda
- 1997
Citation Context ...words and documents. Truncated SVD is a popular and well studied matrix approximation scheme (Golub and Van Loan, 1996). Based on the earlier work of (O'Leary and Peleg, 1983) for image compression, (Kolda, 1997) has developed a memory efficient matrix approximation scheme known as semidiscrete decomposition. (Gallant, 1994; Caid and Oing, 1997) have used an "implicit" matrix approximation scheme based on th... |

29 | Restructuring Sparse High Dimensional Data for Effective Retrieval, AIM1636
- Isbell, Viola
- 1998
Citation Context ...implicit" matrix approximation scheme based on their context vectors. (Papadimitriou et al., 1998) have proposed computationally efficient matrix approximations based on random projections. Finally, (Isbell and Viola, 1998) have used independent component analysis for identifying directions representing sets of highly correlated words, and have used these directions for an "implicit" matrix approximation scheme. As our... |

28 | Scatter/Gather: a cluster-based approach to browsing large document collections - Cutting, Karger, et al. |

24 | Model Selection in Unsupervised Learning with Applications to Document Clustering
- Vaithyanathan, Dom
- 1999
Citation Context ...as been a flurry of activity in this area, see (Boley et al., 1998; Cutting et al., 1992; Hearst and Pedersen, 1996; Sahami et al., 1999; Schütze and Silverstein, 1997; Silverstein and Pedersen, 1997; Vaithyanathan and Dom, 1999; Zamir and Etzioni, 1998). A starting point for applying clustering algorithms to unstructured text data is to create a vector space model for text data (Salton and McGill, 1983). The basic idea is (... |

21 | Almost-Constant-Time Clustering of Arbitrary Corpus Subsets
- Silverstein, Pedersen
- 1997
Citation Context ...Willet, 1988). Recently, there has been a flurry of activity in this area, see (Boley et al., 1998; Cutting et al., 1992; Hearst and Pedersen, 1996; Sahami et al., 1999; Schütze and Silverstein, 1997; Silverstein and Pedersen, 1997; Vaithyanathan and Dom, 1999; Zamir and Etzioni, 1998). A starting point for applying clustering algorithms to unstructured text data is to create a vector space model for text data (Salton and McGil... |

13 | A microeconomic view of data mining. Data mining and knowledge discovery - Kleinberg, Papadimitriou, et al. - 1998 |

3 | System and Method of Context Vector Generation and Retrieval
- Caid, Oing
- 1997
Citation Context ...on the earlier work of (O'Leary and Peleg, 1983) for image compression, (Kolda, 1997) has developed a memory efficient matrix approximation scheme known as semidiscrete decomposition. (Gallant, 1994; Caid and Oing, 1997) have used an "implicit" matrix approximation scheme based on their context vectors. (Papadimitriou et al., 1998) have proposed computationally efficient matrix approximations based on random project... |

2 | Methods for generating or revising context vectors for a plurality of word stems
- Gallant
- 1994
Citation Context ..., 1996). Based on the earlier work of (O'Leary and Peleg, 1983) for image compression, (Kolda, 1997) has developed a memory efficient matrix approximation scheme known as semidiscrete decomposition. (Gallant, 1994; Caid and Oing, 1997) have used an "implicit" matrix approximation scheme based on their context vectors. (Papadimitriou et al., 1998) have proposed computationally efficient matrix approximations ba... |

1 | A parallel data-clustering algorithm for distributed memory multiprocessors - Dhillon, Modha - 2000 |

1 | The complexity of the generalized Lloyd-Max problem - Garey, Johnson, Witsenhausen - 1982 |