Concept Decompositions for Large Sparse Text Data using Clustering
- Machine Learning
, 2000
Cited by 231 (23 self)
Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors: a few thousand dimensions and a sparsity of 95 to 99% is typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters, each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain "fractal-like" and "self-similar" behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned...
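The clustering step described in this abstract can be illustrated with a minimal sketch. It assumes dense Python lists rather than sparse storage, a fixed iteration count, and initialization from randomly sampled documents; none of these choices come from the paper, and the names `spherical_kmeans`, `normalize`, and `dot` are hypothetical.

```python
import math
import random


def normalize(v):
    """Scale v to unit Euclidean norm (an all-zero vector is returned as-is)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(u, v):
    """Inner product; for unit vectors this is the cosine similarity."""
    return sum(a * b for a, b in zip(u, v))


def spherical_kmeans(docs, k, n_iter=20, seed=0):
    """Cluster document vectors by cosine similarity.

    Returns (labels, concepts). Each concept vector is the centroid of its
    cluster normalized to unit Euclidean norm. The labels come from the
    last assignment step.
    """
    rng = random.Random(seed)
    docs = [normalize(d) for d in docs]
    # Assumption: start from k distinct documents chosen at random.
    concepts = [list(d) for d in rng.sample(docs, k)]
    labels = [0] * len(docs)
    for _ in range(n_iter):
        # Assignment step: each document joins its most similar concept.
        labels = [max(range(k), key=lambda j: dot(d, concepts[j])) for d in docs]
        # Update step: recompute each concept as the normalized centroid.
        for j in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == j]
            if members:  # an empty cluster keeps its previous concept vector
                centroid = [sum(col) for col in zip(*members)]
                concepts[j] = normalize(centroid)
    return labels, concepts


docs = [[2, 1, 0, 0], [3, 1, 0, 0], [0, 0, 1, 2], [0, 0, 1, 3]]
labels, concepts = spherical_kmeans(docs, k=2)
```

In the toy input, the first two documents share one set of terms and the last two share another, so they end up in separate clusters.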
Efficient Clustering Of Very Large Document Collections
, 2001
Cited by 74 (9 self)
An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical data vectors. It is a contemporary challenge to efficiently preprocess and cluster very large document collections. In this paper we present a time- and memory-efficient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-efficient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented; a highlight of our results is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.
Computation and Uses of the Semidiscrete Matrix Decomposition
, 1999
Cited by 19 (1 self)
T. G. Kolda and D. P. O'Leary. 1. INTRODUCTION A semidiscrete decomposition (SDD) approximates a matrix as a weighted sum of outer products formed by vectors with entries constrained to be in the set S = {-1, 0, 1}. O'Leary and Peleg [1983] introduced the SDD in the context of image compression, and Kolda and O'Leary [1998, 1999] used the SDD for latent semantic indexing (LSI) in information retrieval; these applications are discussed in §5. The primary advantage of the SDD over other types of matrix approximations such as the truncated singular value decomposition (SVD) is that, as we will demonstrate with numeric...
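To make the form of the approximation concrete, the sketch below evaluates a weighted sum of outer products whose vectors take entries only from {-1, 0, 1}. It does not compute an SDD; it only rebuilds the matrix from given terms. The function name `sdd_reconstruct` and its argument names are hypothetical, and the input values are made up for illustration.

```python
def sdd_reconstruct(weights, xs, ys):
    """Return the matrix sum_i weights[i] * outer(xs[i], ys[i]).

    Every xs[i] and ys[i] must have entries in {-1, 0, 1}, matching the
    constraint in the definition of a semidiscrete decomposition.
    """
    allowed = {-1, 0, 1}
    m, n = len(xs[0]), len(ys[0])
    approx = [[0.0] * n for _ in range(m)]
    for d, x, y in zip(weights, xs, ys):
        if not (set(x) <= allowed and set(y) <= allowed):
            raise ValueError("vector entries must be -1, 0, or 1")
        # Add the weighted outer product d * x * y^T to the running sum.
        for i in range(m):
            for j in range(n):
                approx[i][j] += d * x[i] * y[j]
    return approx


# Two terms approximating a 2 x 3 matrix (illustrative values only).
approx = sdd_reconstruct(
    weights=[2.0, 0.5],
    xs=[[1, 0], [1, -1]],
    ys=[[1, 1, 0], [0, 1, -1]],
)
```

Because each vector entry needs only a sign or zero, a term can be stored far more compactly than a pair of floating-point vectors; only the weights are real numbers.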
New Term Weighting Formulas For The Vector Space Method In Information Retrieval
, 1999
Cited by 15 (0 self)
The goal in information retrieval is to enable users to automatically and accurately find data relevant to their queries. One possible approach to this problem is to use the vector space model, which models documents and queries as vectors in the term space. The components of the vectors are determined by the term weighting scheme, a function of the frequencies of the terms in the document or query as well as throughout the collection. We discuss popular term weighting schemes and present several new schemes that offer improved performance. 1. Introduction Automatic information retrieval is needed because of the volume of information available today --- there is too much information to be indexed manually. Most people have used some type of information retrieval system in the form of Internet search engines. Search engines are based on information retrieval models such as the Boolean system, the probabilistic model, or the vector space model [7]. We focus on the vector space model, de...
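As an illustration of a term weighting scheme in the vector space model, the sketch below uses the classic tf-idf weighting (raw term frequency times the log of inverse document frequency). This is a widely known baseline chosen here as an assumption; it is not one of the new schemes the paper proposes. The name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map tokenized documents to tf-idf vectors over a sorted vocabulary.

    weight(term, doc) = tf(term, doc) * log(N / df(term)), where N is the
    number of documents and df counts the documents containing the term.
    """
    n_docs = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # missing terms count as 0
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


docs = [["sparse", "matrix", "matrix"], ["sparse", "text"]]
vocab, vectors = tfidf_vectors(docs)
```

A term that occurs in every document, such as "sparse" here, receives weight zero because it does not help distinguish documents.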
TMG: A MATLAB Toolbox for Generating Term-Document Matrices from Text Collections
, 2005
Cited by 15 (1 self)
A wide range of computational kernels in data mining and information retrieval from text collections involve techniques from linear algebra. These kernels typically operate on data that is presented in the form of large sparse term-document matrices (tdm). We present TMG, a research and teaching toolbox for the generation of sparse tdm’s from text collections and for the incremental modification of these tdm’s by means of additions or deletions. The toolbox is written entirely in MATLAB, a popular problem solving environment that is powerful in computational linear algebra, in order to streamline document preprocessing and prototyping of algorithms for information retrieval. Several design issues that concern the use of MATLAB sparse infrastructure and data structures are addressed. We illustrate the use of the tool in numerical explorations of the effect of stemming and different term-weighting policies on the performance of querying and clustering tasks.
Partitioning Rectangular And Structurally Nonsymmetric Sparse Matrices For Parallel Processing
- SIAM J. Sci. Comput
, 1998
Cited by 15 (0 self)
A common operation in scientific computing is the multiplication of a sparse, rectangular or structurally nonsymmetric matrix and a vector. In many applications the matrix-transpose-vector product is also required. This paper addresses the efficient parallelization of these operations. We show that the problem can be expressed in terms of partitioning bipartite graphs. We then introduce several algorithms for this partitioning problem and compare their performance on a set of test matrices. Key words. matrix partitioning, iterative method, parallel computing, rectangular matrix, structurally nonsymmetric matrix, bipartite graph AMS subject classifications. 05C50, 65F10, 65F50, 65Y05 1. Introduction. Matrix-vector and matrix-transpose-vector products that repeatedly involve the same large, sparse, structurally nonsymmetric or rectangular matrix arise in many iterative algorithms. Examples include algorithms for solving linear systems, least squares problems, and linear programs. To e...
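One common way to view a sparse rectangular matrix as a bipartite graph is to create a vertex per row and per column and an edge per nonzero entry. The sketch below assumes that representation and uses the number of cut edges as a rough proxy for communication between two processors; the abstract does not spell out either detail, so both are assumptions. The names `bipartite_edges` and `cut_size` are hypothetical.

```python
def bipartite_edges(nonzeros):
    """Build bipartite edges from (row, col) positions of nonzero entries.

    Row i becomes vertex ("r", i) and column j becomes vertex ("c", j);
    each nonzero a[i][j] contributes one edge between them.
    """
    return {(("r", i), ("c", j)) for i, j in nonzeros}


def cut_size(edges, part):
    """Count edges whose endpoints are assigned to different parts."""
    return sum(1 for u, v in edges if part[u] != part[v])


# Nonzero pattern of a 3 x 3 matrix with two independent blocks.
edges = bipartite_edges([(0, 0), (0, 1), (1, 1), (2, 2)])

# A partition that follows the block structure cuts no edges.
good = {("r", 0): 0, ("r", 1): 0, ("c", 0): 0, ("c", 1): 0,
        ("r", 2): 1, ("c", 2): 1}

# Moving column 1 to the other part cuts both nonzeros in that column.
bad = dict(good)
bad[("c", 1)] = 1
```

Comparing `cut_size(edges, good)` with `cut_size(edges, bad)` shows how a partition that ignores the block structure forces more entries to cross between parts.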
Methodological Approaches to Online Scoring of Essays
- ERIC DOCUMENT REPRODUCTION SERVICE, NO. ED
, 1997
Seeding non-negative matrix factorization with the spherical k-means clustering
, 2003
Cited by 14 (1 self)
The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.
Partitioning Sparse Rectangular Matrices for Parallel Processing
- Lecture Notes in Computer Science
, 1998
Cited by 11 (3 self)
We are interested in partitioning sparse rectangular matrices for parallel processing. The partitioning problem has been well studied in the square symmetric case, but the rectangular problem has received very little attention. We will formalize the rectangular matrix partitioning problem and discuss several methods for solving it. We will extend the spectral partitioning method for symmetric matrices to the rectangular case and compare this method to three new methods: the alternating partitioning method and two hybrid methods. The hybrid methods will be shown to be best.

