Results 1–10 of 285
Using Linear Algebra for Intelligent Information Retrieval
SIAM Review, 1995
Abstract

Cited by 676 (18 self)
Currently, most approaches to retrieving textual materials from scientific databases depend on a lexical match between words in users' requests and those in or assigned to documents in a database. Because of the tremendous diversity in the words people use to describe the same document, lexical methods are necessarily incomplete and imprecise. Using the singular value decomposition (SVD), one can take advantage of the implicit higher-order structure in the association of terms with documents by determining the SVD of large sparse term-by-document matrices. Terms and documents represented by the 200-300 largest singular vectors are then matched against user queries. We call this retrieval method Latent Semantic Indexing (LSI) because the subspace represents important associative relationships between terms and documents that are not evident in individual documents. LSI is a completely automatic yet intelligent indexing method, widely applicable, and a promising way to improve users...
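The SVD-based indexing this abstract describes can be sketched in a few lines of NumPy. The tiny term-document matrix, the rank k = 2, and the `fold_in` query helper below are illustrative assumptions, not details taken from the paper:

```python
# Minimal Latent Semantic Indexing (LSI) sketch using a truncated SVD.
import numpy as np

# Rows = terms, columns = documents (raw term counts; toy data).
A = np.array([
    [2, 0, 1, 0],   # "matrix"
    [1, 1, 0, 0],   # "retrieval"
    [0, 2, 0, 1],   # "query"
    [0, 0, 1, 2],   # "neural"
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                    # keep the k largest singular triplets
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

def fold_in(query_counts):
    """Project a query's term-count vector into the k-dim latent space."""
    return np.diag(1.0 / sk) @ Uk.T @ query_counts

q = fold_in(np.array([1, 1, 0, 0], dtype=float))   # query: "matrix retrieval"
docs = Vtk.T                                        # documents in latent space
sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
print(sims)                                         # cosine score per document
```

Queries and documents are compared in the k-dimensional subspace rather than by exact word overlap, which is how LSI can match documents that share no terms with the query.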
Latent semantic indexing: A probabilistic analysis
1998
Abstract

Cited by 323 (7 self)
Latent semantic indexing (LSI) is an information retrieval technique based on the spectral analysis of the term-document matrix, whose empirical success had heretofore been without rigorous prediction and explanation. We prove that, under certain conditions, LSI does succeed in capturing the underlying semantics of the corpus and achieves improved retrieval performance. We also propose the technique of random projection as a way of speeding up LSI. We complement our theorems with encouraging experimental results. We also argue that our results may be viewed in a more general framework, as a theoretical basis for the use of spectral methods in a wider class of applications such as collaborative filtering.
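The random-projection speedup proposed here can be sketched as follows: multiply the term-document matrix by a random matrix to shrink the term dimension before running the SVD. All sizes below, and the Gaussian choice of projection, are illustrative assumptions:

```python
# Random projection before SVD: the expensive decomposition is run on a
# d x n_docs matrix instead of the full n_terms x n_docs matrix.
import numpy as np

rng = np.random.default_rng(0)
n_terms, n_docs, d = 1000, 200, 50       # projected dimension d (assumed)

A = rng.random((n_terms, n_docs))        # stand-in term-document matrix
R = rng.standard_normal((d, n_terms)) / np.sqrt(d)   # random projection

B = R @ A                                # much smaller input for the SVD
U, s, Vt = np.linalg.svd(B, full_matrices=False)

# Johnson-Lindenstrauss-style projections approximately preserve inner
# products, so B's spectral structure approximates A's.
orig = A[:, 0] @ A[:, 1]
proj = B[:, 0] @ B[:, 1]
print(abs(orig - proj) / abs(orig))      # relative error, typically small
```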
Fast Monte-Carlo algorithms for finding low-rank approximations
In Proceedings of the 39th Annual IEEE Symposium on Foundations of Computer Science, 1998
Abstract

Cited by 237 (16 self)
We consider the problem of approximating a given m × n matrix A by another matrix of specified rank k, which is smaller than m and n. The Singular Value Decomposition (SVD) can be used to find the "best" such approximation. However, it takes time polynomial in m and n, which is prohibitive for some modern applications. In this paper, we develop an algorithm which is qualitatively faster, provided we may sample the entries of the matrix according to a natural probability distribution. In many applications such sampling can be done efficiently. Our main result is a randomized algorithm to find the description of a matrix D* of rank at most k so that ‖A − D*‖_F² ≤ min_{D, rank(D)≤k} ‖A − D‖_F² + ε‖A‖_F²
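The column-sampling idea behind this result can be sketched directly: draw columns of A with probability proportional to their squared norms, rescale, and work with the small sample. The sizes, the exactly-rank-k test matrix, and the final projection step are illustrative assumptions, not the paper's exact algorithm:

```python
# Length-squared column sampling for fast low-rank approximation.
import numpy as np

rng = np.random.default_rng(1)
m, n, k, c = 500, 400, 5, 60            # c = number of sampled columns (assumed)

A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))  # exactly rank k

col_norms = np.sum(A**2, axis=0)
p = col_norms / col_norms.sum()         # length-squared sampling distribution
idx = rng.choice(n, size=c, p=p)
C = A[:, idx] / np.sqrt(c * p[idx])     # rescaled sample, m x c

# Left singular vectors of the small matrix C approximate those of A;
# projecting A onto the top k of them gives a rank-k approximation.
U, _, _ = np.linalg.svd(C, full_matrices=False)
Uk = U[:, :k]
D_star = Uk @ (Uk.T @ A)

err = np.linalg.norm(A - D_star, "fro") / np.linalg.norm(A, "fro")
print(err)
```

Because the test matrix here is exactly rank k, the sampled columns almost surely span its column space and the error is essentially zero; on noisy matrices the bound quoted in the abstract applies instead.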
Recovering Documentation-to-Source-Code Traceability Links using Latent Semantic Indexing
Abstract

Cited by 237 (13 self)
An information retrieval technique, latent semantic indexing, is used to automatically identify traceability links from system documentation to program source code. The results of two experiments to identify links in existing software systems (i.e., the LEDA library and Albergate) are presented. These results are compared with those of similar experiments on traceability-link identification using different types of information retrieval techniques. The method presented proves to give good results by comparison; additionally, it is a low-cost, highly flexible method to apply with regard to preprocessing and/or parsing of the source code and documentation.
A Neural Network Approach to Topic Spotting
1995
Abstract

Cited by 188 (1 self)
This paper presents an application of nonlinear neural networks to topic spotting. Neural networks allow us to model higher-order interaction between document terms and to simultaneously predict multiple topics using shared hidden features. In the context of this model, we compare two approaches to dimensionality reduction in representation: one based on term selection and another based on Latent Semantic Indexing (LSI). Two different methods are proposed for improving LSI representations for the topic spotting task. We find that term selection and our modified LSI representations lead to similar topic spotting performance, and that this performance is equal to or better than other published results on the same corpus.

1 Introduction

Topic spotting is the problem of identifying which of a set of predefined topics are present in a natural language document. More formally, given a set of n topics and a document, the task is to output for each topic the probability that the topic is prese...
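The architecture the abstract describes, a shared hidden layer feeding independent per-topic sigmoid outputs, can be sketched with plain NumPy. The synthetic data, layer sizes, learning rate, and training loop below are all illustrative assumptions, not the paper's setup:

```python
# Multi-topic spotting: one shared tanh hidden layer, one sigmoid output
# per topic, trained by gradient descent on cross-entropy loss.
import numpy as np

rng = np.random.default_rng(2)
n_docs, n_terms, n_hidden, n_topics = 200, 30, 8, 3

X = rng.random((n_docs, n_terms))                    # document-term features
Y = (X @ rng.standard_normal((n_terms, n_topics)) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, W2):
    H = np.tanh(X @ W1)                              # shared hidden features
    return H, sigmoid(H @ W2)                        # per-topic probabilities

def bce(P):
    return -np.mean(Y * np.log(P + 1e-9) + (1 - Y) * np.log(1 - P + 1e-9))

W1 = rng.standard_normal((n_terms, n_hidden)) * 0.1
W2 = rng.standard_normal((n_hidden, n_topics)) * 0.1
loss0 = bce(forward(W1, W2)[1])

lr = 0.1
for _ in range(300):                                 # plain gradient descent
    H, P = forward(W1, W2)
    G = (P - Y) / n_docs                             # output-layer gradient
    W1 -= lr * X.T @ ((G @ W2.T) * (1 - H**2))       # backprop through tanh
    W2 -= lr * H.T @ G

H, P = forward(W1, W2)
print(loss0, bce(P))                                 # loss before vs. after
```

Because the hidden layer is shared across all topic outputs, correlations between topics can be exploited, which is the point the abstract makes about predicting multiple topics simultaneously.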
Hipikat: Recommending pertinent software development artifacts
 In ICSE’03
Abstract

Cited by 185 (5 self)
A newcomer to a software project must typically come up to speed on a large, varied amount of information about the project before becoming productive. Assimilating this information in the open-source context is difficult because a newcomer cannot rely on the mentoring approach that is commonly used in traditional software development. To help a newcomer to an open-source project become productive faster, we propose Hipikat, a tool that forms an implicit group memory from the information stored in a project's archives, and that recommends artifacts from the archives that are relevant to a task that a newcomer is trying to perform. To investigate this approach, we have instantiated the Hipikat tool for the Eclipse open-source project. In this paper, we describe the Hipikat tool, we report on a qualitative study conducted with a Hipikat mockup on a medium-sized in-house project, and we report on a case study in which Hipikat recommendations were evaluated for a task on Eclipse.
Matrices, vector spaces, and information retrieval
SIAM Review, 1999
Abstract

Cited by 143 (3 self)
The evolution of digital libraries and the Internet has dramatically transformed the processing, storage, and retrieval of information. Efforts to digitize text, images, video, and audio now consume a substantial portion of both academic and industrial activity. Even when there is no shortage of textual materials on a particular topic, procedures for indexing or extracting the knowledge or conceptual information contained in them can be lacking. Recently developed information retrieval technologies are based on the concept of a vector space. Data are modeled as a matrix, and a user's query of the database is represented as a vector. Relevant documents in the database are then identified via simple vector operations. Orthogonal factorizations of the matrix provide mechanisms for handling uncertainty in the database itself. The purpose of this paper is to show how such fundamental mathematical concepts from linear algebra can be used to manage and index large text collections.

Key words: information retrieval, linear algebra, QR factorization, singular value decomposition, vector spaces
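The "simple vector operations" the abstract refers to reduce, in the basic vector space model, to building a term-document matrix and scoring documents by cosine similarity with the query vector. The vocabulary and documents below are illustrative:

```python
# Basic vector space retrieval: documents as columns of a term-document
# matrix, a query as a vector, relevance scored by cosine similarity.
import numpy as np

vocab = ["linear", "algebra", "retrieval", "digital", "library"]
docs = [
    "linear algebra linear",
    "digital library retrieval",
    "algebra retrieval",
]

def to_vec(text):
    counts = np.zeros(len(vocab))
    for w in text.split():
        if w in vocab:
            counts[vocab.index(w)] += 1
    return counts

A = np.column_stack([to_vec(d) for d in docs])   # 5 x 3 term-document matrix
q = to_vec("algebra retrieval")

cos = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
print(cos.argmax())                              # index of best-matching doc
```

Here the third document matches the query exactly, so its cosine score is 1; the orthogonal factorizations (QR, SVD) discussed in the paper refine this same matrix-vector picture.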
How Well Can Passage Meaning be Derived without Using Word Order? A Comparison of Latent Semantic Analysis and Humans
1997
Abstract

Cited by 130 (4 self)
How much of the meaning of a naturally occurring English passage is derivable from its combination of words without considering their order? An exploratory approach to this question was provided by asking humans to judge the quality and quantity of knowledge conveyed by short student essays on scientific topics and comparing the inter-rater reliability and predictive accuracy of their estimates with the performance of a corpus-based statistical model that takes no account of word order within an essay. There was surprisingly little difference between the human judges and the model.
Projections for Efficient Document Clustering
1997
Abstract

Cited by 122 (0 self)
Clustering is increasing in importance, but linear-time and even constant-time clustering algorithms are often too slow for real-time applications. A simple way to speed up clustering is to speed up the distance calculations at the heart of clustering routines. We study two techniques for improving the cost of distance calculations, LSI and truncation, and determine both how much these techniques speed up clustering and how much they affect the quality of the resulting clusters. We find that the speed increase is significant while, surprisingly, the quality of clustering is not adversely affected. We conclude that truncation yields clusters as good as those produced by full-profile clustering while offering a significant speed advantage.
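One way to read "truncation" here is: keep only the largest-magnitude coordinates of each document profile, so each distance computation touches far fewer entries. The cutoff t, the random data, and this particular truncation rule are illustrative assumptions:

```python
# Truncation sketch: zero all but the t largest-magnitude entries of each
# profile vector, making dot products effectively O(t) with sparse storage.
import numpy as np

rng = np.random.default_rng(3)
n_docs, n_dims, t = 50, 300, 20

X = rng.random((n_docs, n_dims))        # stand-in document profiles

def truncate(v, t):
    out = np.zeros_like(v)
    top = np.argsort(np.abs(v))[-t:]    # indices of the t largest entries
    out[top] = v[top]
    return out

Xt = np.array([truncate(v, t) for v in X])

full = X[0] @ X[1]                      # distance-calculation ingredient
trunc = Xt[0] @ Xt[1]                   # cheap approximation of the same
print(full, trunc)
```

With nonnegative profiles the truncated dot product only drops terms, so it lower-bounds the full one; the paper's empirical finding is that clustering quality nevertheless holds up.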
Latent Semantic Indexing (LSI) and TREC-2
The Second Text REtrieval Conference (TREC-2), 1994
Abstract

Cited by 117 (3 self)
... this paper. The "ltc" weights were computed on this matrix. 3.2 SVD analysis ...
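The "ltc" label in this fragment refers to a SMART-style term weighting scheme: logarithmic term frequency (l), idf weighting (t), and cosine normalization (c). A minimal sketch, with a toy count matrix as an assumed stand-in for the TREC data:

```python
# SMART "ltc" weighting: (1 + log tf) * idf, then cosine-normalize each
# document column so every document vector has unit length.
import numpy as np

# Rows = terms, columns = documents (raw counts; toy data).
tf = np.array([
    [3, 0, 1],
    [0, 2, 2],
    [1, 1, 0],
], dtype=float)

n_docs = tf.shape[1]
df = np.count_nonzero(tf, axis=1)                # document frequency per term
idf = np.log(n_docs / df)                        # "t": inverse document frequency

ltf = np.where(tf > 0, 1.0 + np.log(np.maximum(tf, 1.0)), 0.0)  # "l"

W = ltf * idf[:, None]
W /= np.linalg.norm(W, axis=0, keepdims=True)    # "c": unit-length columns
print(W)
```

Weighted matrices like W are the usual input to the SVD step that the surrounding text goes on to describe.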