Results 1-10 of 27
Concept Decompositions for Large Sparse Text Data using Clustering
 Machine Learning
, 2000
Cited by 303 (28 self)
Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors: a few thousand dimensions and a sparsity of 95 to 99% is typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high-dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain "fractal-like" and "self-similar" behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned...
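The clustering step and the least-squares approximation described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes documents are the rows of a dense NumPy array with non-negative entries and no all-zero rows, that the truncated sentence refers to the subspace spanned by the concept vectors, and the function names `spherical_kmeans` and `concept_decomposition` are hypothetical.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Cluster the rows of X by cosine similarity.

    Returns (labels, concepts). Each row of `concepts` is a cluster centroid
    rescaled to unit Euclidean norm, i.e. a concept vector.
    """
    rng = np.random.default_rng(seed)
    # Put every document on the unit sphere (assumes no all-zero rows).
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Assumption: initialise with k distinct documents chosen at random.
    concepts = X[rng.choice(X.shape[0], size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each document to the concept vector with the largest cosine.
        labels = np.argmax(X @ concepts.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # empty cluster: keep the previous concept vector
            centroid_sum = members.sum(axis=0)
            concepts[j] = centroid_sum / np.linalg.norm(centroid_sum)
    # Recompute labels so they match the final concept vectors.
    labels = np.argmax(X @ concepts.T, axis=1)
    return labels, concepts

def concept_decomposition(X, concepts):
    """Least-squares approximation of the rows of X in span(concepts)."""
    # Solve min_Z ||X - Z @ concepts||_F through its transposed form.
    Z_T, *_ = np.linalg.lstsq(concepts.T, X.T, rcond=None)
    return Z_T.T @ concepts
```

On real text data the matrix would be stored in a sparse format; the dense array here only keeps the sketch short.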
Efficient Clustering Of Very Large Document Collections
, 2001
Cited by 92 (11 self)
An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical data vectors. It is a contemporary challenge to efficiently preprocess and cluster very large document collections. In this paper we present a time and memory efficient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-efficient multithreaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented; a highlight of our results is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.
Lower Dimensional Representation of Text Data based on Centroids and Least Squares
 BIT
, 2003
Cited by 39 (14 self)
... improving computational efficiency in handling massive data. In this paper, we propose a mathematical framework for lower dimensional representation of text data in vector space based information retrieval using minimization and matrix rank reduction formula. We illustrate how the commonly used Latent Semantic Indexing based on Singular Value Decomposition (LSI/SVD) can be derived as a method for dimension reduction from our mathematical framework. Then we propose a new approach which is more efficient and effective than LSI/SVD when we have a priori information on the cluster structure of the data. Several advantages of the new methods are discussed over the LSI/SVD in terms of computational efficiency and data representation in the reduced dimensional space.
TMG: A MATLAB Toolbox for Generating Term-Document Matrices from Text Collections
, 2005
Cited by 30 (2 self)
A wide range of computational kernels in data mining and information retrieval from text collections involve techniques from linear algebra. These kernels typically operate on data that is presented in the form of large sparse term-document matrices (tdm). We present TMG, a research and teaching toolbox for the generation of sparse tdm's from text collections and for the incremental modification of these tdm's by means of additions or deletions. The toolbox is written entirely in MATLAB, a popular problem solving environment that is powerful in computational linear algebra, in order to streamline document preprocessing and prototyping of algorithms for information retrieval. Several design issues that concern the use of MATLAB sparse infrastructure and data structures are addressed. We illustrate the use of the tool in numerical explorations of the effect of stemming and different term-weighting policies on the performance of querying and clustering tasks.
Computation and Uses of the Semidiscrete Matrix Decomposition
, 1999
Cited by 20 (1 self)
T. G. Kolda and D. P. O'Leary. 1. INTRODUCTION. A semidiscrete decomposition (SDD) approximates a matrix as a weighted sum of outer products formed by vectors with entries constrained to be in the set S = {-1, 0, 1}. O'Leary and Peleg [1983] introduced the SDD in the context of image compression, and Kolda and O'Leary [1998, 1999] used the SDD for latent semantic indexing (LSI) in information retrieval; these applications are discussed in Section 5. The primary advantage of the SDD over other types of matrix approximations such as the truncated singular value decomposition (SVD) is that, as we will demonstrate with numeric...
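Written out, the "weighted sum of outer products" in this excerpt takes roughly the following form for an m-by-n matrix A. The symbols d_i, x_i, y_i and the term count k are notation assumed here for illustration; the excerpt itself does not name them.

```latex
\[
A \;\approx\; A_k \;=\; \sum_{i=1}^{k} d_i \, x_i \, y_i^{T},
\qquad x_i \in S^{m}, \quad y_i \in S^{n}, \quad d_i \in \mathbb{R},
\quad S = \{-1, 0, 1\}.
\]
```

Each term stores only one real weight plus two vectors whose entries take one of three values.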
New Term Weighting Formulas For The Vector Space Method In Information Retrieval
, 1999
Cited by 17 (0 self)
The goal in information retrieval is to enable users to automatically and accurately find data relevant to their queries. One possible approach to this problem is to use the vector space model, which models documents and queries as vectors in the term space. The components of the vectors are determined by the term weighting scheme, a function of the frequencies of the terms in the document or query as well as throughout the collection. We discuss popular term weighting schemes and present several new schemes that offer improved performance. 1. Introduction. Automatic information retrieval is needed because of the volume of information available today; there is too much information to be indexed manually. Most people have used some type of information retrieval system in the form of Internet search engines. Search engines are based on information retrieval models such as the Boolean system, the probabilistic model, or the vector space model [7]. We focus on the vector space model, de...
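As a concrete instance of a term weighting scheme in the vector space model, the sketch below uses the classic tf-idf weighting with cosine normalization. It is one of the popular schemes of the kind the abstract mentions, not one of the new formulas the paper proposes, and the whitespace tokenizer and the function names `tfidf_vectors` and `cosine` are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document, scaled to unit length."""
    # Assumption: lowercase whitespace tokenization, no stemming or stop list.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        vec = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors

def cosine(u, v):
    """Cosine similarity of two vectors already scaled to unit length."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())
```

A query can be weighted the same way and ranked against the documents with `cosine`; a term that occurs in every document receives weight zero under this scheme.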
Seeding nonnegative matrix factorization with the spherical k-means clustering
, 2003
Cited by 17 (1 self)
The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.
Partitioning Rectangular And Structurally Nonsymmetric Sparse Matrices For Parallel Processing
 SIAM J. Sci. Comput
, 1998
Cited by 17 (0 self)
A common operation in scientific computing is the multiplication of a sparse, rectangular or structurally nonsymmetric matrix and a vector. In many applications the matrix-transpose-vector product is also required. This paper addresses the efficient parallelization of these operations. We show that the problem can be expressed in terms of partitioning bipartite graphs. We then introduce several algorithms for this partitioning problem and compare their performance on a set of test matrices. Key words. matrix partitioning, iterative method, parallel computing, rectangular matrix, structurally nonsymmetric matrix, bipartite graph. AMS subject classifications. 05C50, 65F10, 65F50, 65Y05. 1. Introduction. Matrix-vector and matrix-transpose-vector products that repeatedly involve the same large, sparse, structurally nonsymmetric or rectangular matrix arise in many iterative algorithms. Examples include algorithms for solving linear systems, least squares problems, and linear programs. To e...
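The two products named in this abstract, and the bipartite view of the matrix, can be illustrated with a small serial sketch. It assumes the matrix is stored as a list of (row, column, value) triples. Counting cut edges is only a rough proxy for communication cost, chosen here for illustration rather than taken from the paper, and the function names are hypothetical.

```python
def matvec_pair(entries, n_rows, n_cols, x, w):
    """Compute y = A x and z = A^T w in one pass over the nonzeros of A.

    `entries` is a list of (row, col, value) triples (coordinate format).
    """
    y = [0.0] * n_rows
    z = [0.0] * n_cols
    for i, j, a in entries:
        y[i] += a * x[j]  # contribution of a_ij to row i of A x
        z[j] += a * w[i]  # contribution of a_ij to row j of A^T w
    return y, z

def cut_edges(entries, row_part, col_part):
    """Count nonzeros whose row and column are assigned to different parts.

    Bipartite view: each row and each column is a vertex, and each nonzero
    a_ij is an edge joining row vertex i to column vertex j.
    """
    return sum(1 for i, j, _ in entries if row_part[i] != col_part[j])
```

Here `row_part[i]` and `col_part[j]` give the processor assigned to row i and column j; fewer cut edges suggests less data exchanged between processors.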
Methodological Approaches to Online Scoring of Essays
 ERIC DOCUMENT REPRODUCTION SERVICE, NO. ED
, 1997