## Matrices, vector spaces, and information retrieval (1999)

Venue: SIAM Review

Citations: 120 (2 self)

### BibTeX

@ARTICLE{Berry99matrices,vector,

author = {Michael W. Berry and Zlatko Drmač and Elizabeth R. Jessup},

title = {Matrices, vector spaces, and information retrieval},

journal = {SIAM Review},

year = {1999},

volume = {41},

pages = {335--362}

}

### Abstract

The evolution of digital libraries and the Internet has dramatically transformed the processing, storage, and retrieval of information. Efforts to digitize text, images, video, and audio now consume a substantial portion of both academic and industrial activity. Even when there is no shortage of textual materials on a particular topic, procedures for indexing or extracting the knowledge or conceptual information contained in them can be lacking. Recently developed information retrieval technologies are based on the concept of a vector space. Data are modeled as a matrix, and a user’s query of the database is represented as a vector. Relevant documents in the database are then identified via simple vector operations. Orthogonal factorizations of the matrix provide mechanisms for handling uncertainty in the database itself. The purpose of this paper is to show how such fundamental mathematical concepts from linear algebra can be used to manage and index large text collections.

Key words. information retrieval, linear algebra, QR factorization, singular value decomposition, vector spaces
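
The abstract's central idea, documents as matrix columns and a query as a vector matched by simple vector operations, can be illustrated with a small sketch. This is not the authors' code: the toy collection, the three terms (borrowed from the "applied, linear, algebra" example quoted in one of the citation contexts), the function names `cosine` and `match_query`, and the 0.5 cutoff are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_query(columns, query, threshold=0.5):
    """Return (document index, cosine) pairs whose cosine meets the threshold."""
    scores = [(j, cosine(col, query)) for j, col in enumerate(columns)]
    return [(j, s) for j, s in scores if s >= threshold]


# Hypothetical collection: each inner list is one document column of the
# term-by-document matrix, with terms ordered as (applied, linear, algebra).
documents = [
    [1, 1, 1],  # mentions all three terms
    [0, 1, 1],  # linear, algebra
    [1, 0, 0],  # applied only
    [0, 0, 1],  # algebra only
]
query = [0, 1, 1]  # a query for "linear algebra"

hits = match_query(documents, query)  # threshold 0.5 is an assumed cutoff
```

Because only the angle matters, rescaling a document column or the query by a positive constant leaves the ranking unchanged; the cutoff trades recall against precision and would normally be tuned per collection.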

### Citations

3423 | Introduction to Modern Information Retrieval
- Salton, McGill
- 1983
Citation Context ...tion retrieval using a consistent interface. The purpose of this paper is to show how linear algebra can be used in automated information retrieval. The most basic mechanism is the vector space model =-=[52, 18]-=- of IR in which each document is encoded as a vector, where each vector component reflects the importance of a particular term in representing the semantics or meaning of that document. The vectors fo... |

3025 | Indexing by latent semantic analysis
- Deerwester, Dumais, et al.
- 1990
Citation Context ...atent Semantic Indexing (LSI) or Latent Semantic Analysis (LSA) is a variant of the vector space model in which a low-rank approximation to the vector space representation of the database is employed =-=[9, 19]-=-. That is, we replace the original matrix by another matrix that is as close as possible to the original matrix but whose column space is only a subspace of the column space of the original matrix. Re... |

819 | The Symmetric Eigenvalue Problem
- Parlett
- 1997
Citation Context ...rage formats (e.g., Harwell-Boeing) have been developed for this purpose (see [3]). Special techniques for computing the SVD of a sparse matrix include iterative methods such as Arnoldi [41], Lanczos =-=[38, 47]-=-, subspace iteration [49, 47], and trace minimization [53]. All of these methods reference the sparse matrix A only through matrix-vector multiplication operations, and all can be implemented in terms... |

666 | Numerical Methods for Least Squares Problems
- Björck
- 1996
Citation Context ...Reducing the rank of the matrix is a means of removing extraneous information or noise from the database it represents. Rank reduction is used in various applications of linear algebra and statistics =-=[14, 28, 35]-=- as well as in image processing [2], data compression [48], cryptography [45], and seismic tomography [17, 54]. LSI has achieved average or above average performance for several TREC collections [21, ... |

657 | Improving retrieval performance by relevance feedback
- Salton, Buckley
- 1990
Citation Context ...d in Section 1), a list of documents retrieved for a given query is almost never perfect, and the user has to ignore some of the items. In practice, precision can be improved using relevance feedback =-=[51]-=-, i.e., specifying which documents from a returned set are most relevant to the information sought and using those documents to clarify the intent of the original query. The term-term comparison proce... |

574 | Using linear algebra for intelligent information retrieval
- Berry, Dumais, et al.
- 1996
Citation Context ...atent Semantic Indexing (LSI) or Latent Semantic Analysis (LSA) is a variant of the vector space model in which a low-rank approximation to the vector space representation of the database is employed =-=[9, 19]-=-. That is, we replace the original matrix by another matrix that is as close as possible to the original matrix but whose column space is only a subspace of the column space of the original matrix. Re... |

529 | Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, Society for Industrial and Applied Mathematics
- Barrett, Berry, et al.
- 1994
Citation Context ...se term-by-document matrices, it is important to store and use only the nonzero elements of the matrix. Special matrix storage formats (e.g., Harwell-Boeing) have been developed for this purpose (see =-=[3]-=-). Special techniques for computing the SVD of a sparse matrix include iterative methods such as Arnoldi [41], Lanczos [38, 47], subspace iteration [49, 47], and trace minimization [53]. All of these ... |

409 | A statistical interpretation of term specificity and its application in retrieval
- Jones
- 1972
Citation Context ...ance of the term in representing the semantics of the document. Typically, the value is a function of the frequency with which the term occurs in the document or in the document collection as a whole =-=[20, 56]-=-. Suppose a document is described for indexing purposes by the three terms applied, linear, and algebra. It can then be represented by a vector in the three corresponding dimensions. Figure 1 depicts ... |

408 | An iteration method for the solution of the eigenvalue problem of linear differential and integral operators
- Lanczos
- 1950
Citation Context ...rage formats (e.g., Harwell-Boeing) have been developed for this purpose (see [3]). Special techniques for computing the SVD of a sparse matrix include iterative methods such as Arnoldi [41], Lanczos =-=[38, 47]-=-, subspace iteration [49, 47], and trace minimization [53]. All of these methods reference the sparse matrix A only through matrix-vector multiplication operations, and all can be implemented in terms... |

269 | The approximation of one matrix by another of lower rank, Psychometrika 1
- Eckart, Young
- 1936
Citation Context ...lar values of A equal to zero. A fundamental difference between the two factorizations is in the theoretical underpinnings of that approximation. More precisely, a classic theorem by Eckart and Young =-=[23, 44]-=- states that the distance between A and its rank-k approximations is minimized by the approximation Ak. The theorem further shows how the norm of ... |

268 | Automatic Information Organization and Retrieval
- Salton
- 1968
Citation Context ... the word baking counts as the term bake. The use of stemming in information retrieval dates back to the 1960s [43]. Stemming reduces storage requirements by decreasing the number of words maintained =-=[50]-=-. 2.3. Query matching. Using the small collection of titles from Figure 2, we can illustrate query matching based on angles in a 6-dimensional vector space. Suppose that a user in search of cooking in... |

247 | Personalized information delivery: An analysis of information filtering methods
- Foltz, Dumais
- 1992
Citation Context ...ating. Using SVD downdating, the LSI model can be modified to reflect the removal of terms and documents and any subsequent changes to term weights. Downdating can be useful for information filtering =-=[24]-=- (e.g., parental screening of Internet sites) and evaluating the importance of a term or document with respect to forming or breaking clusters of semantically related information. See [12] and [60] fo... |

245 | Improving the retrieval of information from external sources
- Dumais
- 1991
Citation Context ...ance of the term in representing the semantics of the document. Typically, the value is a function of the frequency with which the term occurs in the document or in the document collection as a whole =-=[20, 56]-=-. Suppose a document is described for indexing purposes by the three terms applied, linear, and algebra. It can then be represented by a vector in the three corresponding dimensions. Figure 1 depicts ... |

240 | Development of a Stemming Algorithm
- Lovins
- 1968
Citation Context ...y their word stems [37]. In our example, the word pastries counts as the term pastry, and the word baking counts as the term bake. The use of stemming in information retrieval dates back to the 1960s =-=[43]-=-. Stemming reduces storage requirements by decreasing the number of words maintained [50]. 2.3. Query matching. Using the small collection of titles from Figure 2, we can illustrate query matching bas... |

198 | Overview of the third Text REtrieval Conference (TREC-3), in Overview of the Third Text REtrieval Conference (TREC-3)
- Harman
- 1994
Citation Context ... the main purpose of this paper. Experiments have shown that there is a 20% disparity on average in the terms chosen as appropriate to describe a given document by two different professional indexers =-=[29]-=-. These problems of scale and consistency have fueled the development of automated information retrieval (IR) techniques. When implemented on high-performance computer systems, such methods can be app... |

173 | Searching the World Wide Web
- Lawrence, Giles
- 1998
Citation Context ... of 7,000 per working day [31]. While these numbers imply daunting indexing problems, the scale is even greater in the digital domain. There are currently about 300 million web pages on the Internet =-=[13, 39]-=-, and a typical search engine updates or acquires pointers to as many as ten million web pages in a single day [32]. Because the pages are indexed at ... |

130 | Information Retrieval Systems - Theory and Implementation
- Kowalski
- 1997
Citation Context ...action, zoning (indexing of parts of a document like its title, abstract or first few paragraphs as opposed to the entire document), and term or phrase weighting may also affect retrieval performance =-=[37]-=-. Indexing approaches (automated and otherwise) are generally judged in terms of their recall and precision ratings. Recall is the ratio of the number of relevant documents retrieved to the total numb... |

105 | Overview of the fifth text retrieval conference
- Voorhees, Harman
- 1996
Citation Context ...992 with the initiation of the annual Text REtrieval Conference (TREC) sponsored by the Defense Advanced Research Projects Agency (DARPA) and the National Institute of Standards and Technology (NIST) =-=[30]-=-. TREC participants competitively index a large text collection (gigabytes in size) and are provided search statements and relevancy judgments in order to judge the success of their approaches. Anothe... |

100 | Large Scale Singular Value Computations
- Berry
- 1992
Citation Context ...tterns and topic domain associated with the document collection. The more new terms each document brings to the global dictionary, the sparser is the matrix overall. The sample IR matrices studied in =-=[5]-=- are typically no more than 1% dense, i.e., the ratio of nonzeros to the product of the row and column dimensions is barely 0.01. Experience has shown that these matrices typically lack any regular no... |

88 | Pictures of relevance: a geometric analysis of similarity measures
- Jones, Furnas
- 1987
Citation Context ...t change the cosine value. Thus, we may scale the document vectors or queries by any convenient value. Other similarity measures are reviewed in =-=[34]-=-. 2.2. An example. Figure 2 demonstrates how a simple collection of five titles described by six terms leads to a 6×5 term-by-document matrix. Because the content of a document is determined by the re... |

82 | A semidiscrete matrix decomposition for latent semantic indexing in information retrieval
- Kolda, O'Leary
- 1998
Citation Context ...uires 0.4 Mbytes to store the original matrix, whereas the storage needed for the corresponding single precision matrices Uk,Σk,Vk is 2.6 Mbytes when k = 100. The Semi-Discrete Decomposition (or SDD) =-=[36]-=- provides one means of reducing the storage requirements of LSI. In SDD, only the three values −1, 0, 1 (represented by two bits each) are used to define the elements of Uk and Vk, and an integer prog... |

76 | Deflation techniques for an implicitly restarted Arnoldi iteration, SJMAA 17
- Lehoucq, Sorensen
- 1996
Citation Context ...ial matrix storage formats (e.g., Harwell-Boeing) have been developed for this purpose (see [3]). Special techniques for computing the SVD of a sparse matrix include iterative methods such as Arnoldi =-=[41]-=-, Lanczos [38, 47], subspace iteration [49, 47], and trace minimization [53]. All of these methods reference the sparse matrix A only through matrix-vector multiplication operations, and all can be im... |

70 | Svdpackc (version 1.0) user’s guide
- Berry, Do, et al.
- 1993
Citation Context ... discussed in [40, 41] and implementations of Lanczos, subspace iteration, and trace minimization (SVDPACK (Fortran 77) [6] and SVDPACKC (ANSI C) =-=[8]-=-) as discussed in [5]. Simple descriptions of Lanczos-based methods with Matlab examples are available in [3], and a good survey of public-domain software for Lanczos-type methods is available in [7].... |

68 | ARPACK: An implementation of an implicitly restarted Arnoldi method for computing some of the eigenvalues and eigenvectors of a large sparse matrix
- Lehoucq, Sorensen, et al.
- 1996
Citation Context ...mented in terms of the sparse storage formats. Implementations of the aforementioned methods are available at www.netlib.org. These include software for Arnoldi-based methods (ARPACK) as discussed in =-=[40, 41]-=- and implementations of Lanczos, subspace iteration, and trace minimization (SVDPACK (Fortran 77) [6] and SVDPACKC (ANSI C) [8]) as discussed in [... |

59 | Symmetric gauge functions and unitarily invariant norms
- Mirsky
Citation Context ...lar values of A equal to zero. A fundamental difference between the two factorizations is in the theoretical underpinnings of that approximation. More precisely, a classic theorem by Eckart and Young =-=[23, 44]-=- states that the distance between A and its rank-k approximations is minimized by the approximation Ak. The theorem further shows how the norm of ... |

57 | Large-scale information retrieval with latent semantic indexing
- Letsche, Berry
- 1997
Citation Context ...ce of LSI for any given database remains an open question and is normally decided via empirical testing [9]. For very large databases, the number of dimensions used usually ranges between 100 and 300 =-=[42]-=-, a choice made for computational feasibility as opposed to accuracy. Using ... [Figure: the original term-by-document matrix A] ... |

53 | On Updating Problems in Latent Semantic Indexing
- Zha, Simon
- 1999
Citation Context ... the SVD of the new term-by-document matrix, but, for large databases, this procedure is very costly in time and space. Less expensive alternatives, folding-in and SVD-updating, have been examined in =-=[9, 46, 55]-=-. The first of these procedures is very inexpensive computationally but results in an inexact representation of the database. It is generally appropriate to fold documents in only occasionally. Updati... |

51 | Low-rank orthogonal decompositions for information retrieval applications, Numerical Linear Algebra with Applications
- Berry, Fierro
- 1996
Citation Context ...,k). The sparse SVD function SVDS is based on the Arnoldi methods described in [40]. Note that, for practical purposes, less expensive factorizations such as QR or ULV may suffice in place of the SVD =-=[10]-=-. Presently, no effort is made to preserve sparsity in the SVD of the sparse term-by-document matrices. Since the singular vector matrices are often dense, the storage requirements for Uk, Σk and Vk ca... |

45 | A trace minimization algorithm for the generalized eigenvalue problem
- Sameh, Wisniewski
- 1982
Citation Context ...is purpose (see [3]). Special techniques for computing the SVD of a sparse matrix include iterative methods such as Arnoldi [41], Lanczos [38, 47], subspace iteration [49, 47], and trace minimization =-=[53]-=-. All of these methods reference the sparse matrix A only through matrix-vector multiplication operations, and all can be implemented in terms of the sparse storage formats. Implementations of the afo... |

43 | Rank degeneracy and least squares problems
- Golub, Klema, et al.
- 1976
Citation Context ...Reducing the rank of the matrix is a means of removing extraneous information or noise from the database it represents. Rank reduction is used in various applications of linear algebra and statistics =-=[14, 28, 35]-=- as well as in image processing [2], data compression [48], cryptography [45], and seismic tomography [17, 54]. LSI has achieved average or above average performance for several TREC collections [21, ... |

28 | Information management tools for updating an SVD-encoded indexing scheme
- O'Brien
- 1994
Citation Context ... the SVD of the new term-by-document matrix, but, for large databases, this procedure is very costly in time and space. Less expensive alternatives, folding-in and SVD-updating, have been examined in =-=[9, 46, 55]-=-. The first of these procedures is very inexpensive computationally but results in an inexact representation of the database. It is generally appropriate to fold documents in only occasionally. Updati... |

25 | Sparse Matrix Reordering Schemes for Browsing Hypertext, in The Mathematics of Numerical Analysis
- Berry, Hendrickson, et al.
- 1996
Citation Context ...ome recent efforts in the use of both spectral (based on the eigendecomposition or SVD) and non-spectral (usually graph theoretic) approaches to generate banded or envelope matrix forms are promising =-=[11]-=-. In order to compute the SVD of sparse term-by-document matrices, it is important to store and use only the nonzero elements of the matrix. Special matrix storage formats (e.g., Harwell-Boeing) have ... |

23 | Automatic query expansion using SMART
- Buckley, Salton, et al.
- 1995
Citation Context ...tion retrieval using a consistent interface. The purpose of this paper is to show how linear algebra can be used in automated information retrieval. The most basic mechanism is the vector space model =-=[52, 18]-=- of IR in which each document is encoded as a vector, where each vector component reflects the importance of a particular term in representing the semantics or meaning of that document. The vectors fo... |

21 | Simultaneous iteration method for symmetric matrices, Numerische Mathematik 16
- Rutishauser
- 1970
Citation Context ...oeing) have been developed for this purpose (see [3]). Special techniques for computing the SVD of a sparse matrix include iterative methods such as Arnoldi [41], Lanczos [38, 47], subspace iteration =-=[49, 47]-=-, and trace minimization [53]. All of these methods reference the sparse matrix A only through matrix-vector multiplication operations, and all can be implemented in terms of the sparse storage format... |

21 | Rank degeneracy
- Stewart
- 1984
Citation Context ...at our matrix A is only one representative of a whole family of relatively close matrices representing the database, it is reasonable to ask if it makes sense to attempt to determine its rank exactly =-=[57]-=-. For instance, if we find the rank rA and, using linear algebra, conclude that changing A by adding a small change E would result in a matrix A + E of lesser rank k, then we may as well argue that ou... |

14 | Regularization of nonlinear inverse problems: Imaging the near-surface weathering layer, Inverse Problems
- Scales, Docherty, et al.
- 1990
Citation Context ...ents. Rank reduction is used in various applications of linear algebra and statistics [14, 28, 35] as well as in image processing [2], data compression [48], cryptography [45], and seismic tomography =-=[17, 54]-=-. LSI has achieved average or above average performance for several TREC collections [21, 22]. In this paper, we do not review LSI but rather show how to apply the vector space model directly to a low... |

12 | Conserving confluence curbs ill-condition
- Kahan
- 1972
Citation Context ...Reducing the rank of the matrix is a means of removing extraneous information or noise from the database it represents. Rank reduction is used in various applications of linear algebra and statistics =-=[14, 28, 35]-=- as well as in image processing [2], data compression [48], cryptography [45], and seismic tomography [17, 54]. LSI has achieved average or above average performance for several TREC collections [21, ... |

11 | Applications of seismic traveltime tomography
- Bording, Gersztenkorn, et al.
- 1987
Citation Context ...ents. Rank reduction is used in various applications of linear algebra and statistics [14, 28, 35] as well as in image processing [2], data compression [48], cryptography [45], and seismic tomography =-=[17, 54]-=-. LSI has achieved average or above average performance for several TREC collections [21, 22]. In this paper, we do not review LSI but rather show how to apply the vector space model directly to a low... |

11 | Singular value analysis of cryptograms
- Moler, Morrison
- 1983
Citation Context ...e from the database it represents. Rank reduction is used in various applications of linear algebra and statistics [14, 28, 35] as well as in image processing [2], data compression [48], cryptography =-=[45]-=-, and seismic tomography [17, 54]. LSI has achieved average or above average performance for several TREC collections [21, 22]. In this paper, we do not review LSI but rather show how to apply the vec... |

10 | Estimating the relative size and overlap of public web search engines
- Bharat, Broder
- 1998
Citation Context ... of 7,000 per working day [31]. While these numbers imply daunting indexing problems, the scale is even greater in the digital domain. There are currently about 300 million web pages on the Internet =-=[13, 39]-=-, and a typical search engine updates or acquires pointers to as many as ten million web pages in a single day [32]. Because the pages are indexed at ... |

9 | The Bowker Annual Library and Book Trade Almanac. 51st ed
- Bogart, ed
- 2006
Citation Context ...d in print worldwide with roughly 12,000 additions each year [58]. There are nearly 1.4 million books in print in the United States alone with approximately 60,000 new titles appearing there annually =-=[15, 16]-=-. The Library of Congress maintains a collection of more than 17 million books and receives new items at a rate of 7,000 per working day [31]. While these numbers imply daunting indexing problems, th... |

6 | Outer product expansions and their uses in digital image processing
- Andrews, Patterson
- 1975
Citation Context ...f removing extraneous information or noise from the database it represents. Rank reduction is used in various applications of linear algebra and statistics [14, 28, 35] as well as in image processing =-=[2]-=-, data compression [48], cryptography [45], and seismic tomography [17, 54]. LSI has achieved average or above average performance for several TREC collections [21, 22]. In this paper, we do not revie... |

5 | A Guide To The Oxford English Dictionary
- Berg
- 1993
Citation Context ...ituation is reversed. A term-by-document matrix using the content of the largest English language dictionary as terms and the set of all web pages as documents would be about 300,000 × 300,000,000 =-=[4, 13, 39]-=-. As a document generally uses only a small subset of the entire dictionary of terms generated for a given database, most of the elements of a term-by-document matrix are zero. In a vector space IR sc... |

5 | Downdating the Latent Semantic Indexing Model for Information Retrieval
- Witter
- 1997
Citation Context ...ng [24] (e.g., parental screening of Internet sites) and evaluating the importance of a term or document with respect to forming or breaking clusters of semantically related information. See [12] and =-=[60]-=- for more details on the effects of downdating and how it can be implemented. 8.3. Sparsity. The sparsity of a term-by-document matrix is a function of the word usage patterns and topic domain associ... |

4 | Approximating dominant singular triplets of large sparse matrices via modified moments
- Varadhan, Berry, et al.
- 1996
Citation Context ...s mentioned thus far are serial in nature, an interesting asynchronous technique for computing several of the largest singular triplets of a sparse matrix on a network of workstations is described in =-=[59]-=-. For relatively small order term-by-document matrices, it may be most convenient to ignore sparsity altogether and consider the matrix A as dense. One Fortran library including the SVD of dense matri... |

3 | Intelligent information management using latent semantic indexing
- Berry, Witter
- 1997
Citation Context ...n filtering [24] (e.g., parental screening of Internet sites) and evaluating the importance of a term or document with respect to forming or breaking clusters of semantically related information. See =-=[12]-=- and [60] for more details on the effects of downdating and how it can be implemented. 8.3. Sparsity. The sparsity of a term-by-document matrix is a function of the word usage patterns and topic doma... |

3 | LSI meets TREC: A status report, in The First Text REtrieval Conference
- Dumais
- 1993
Citation Context ..., 35] as well as in image processing [2], data compression [48], cryptography [45], and seismic tomography [17, 54]. LSI has achieved average or above average performance for several TREC collections =-=[21, 22]-=-. In this paper, we do not review LSI but rather show how to apply the vector space model directly to a low-rank approximation of the database matrix. The operations performed in this version of the v... |

1 | SVDPACK: A Fortran 77 software library for the sparse singular value decomposition
- Berry
- 1992
Citation Context ...-based methods (ARPACK) as discussed in [40, 41] and implementations of Lanczos, subspace iteration, and trace minimization (SVDPACK (Fortran 77) =-=[6]-=- and SVDPACKC (ANSI C) [8]) as discussed in [5]. Simple descriptions of Lanczos-based methods with Matlab examples are available in [3], and a good survey of public-domain software for Lanczos-type me... |

1 | Survey of public-domain Lanczos-based software
- Berry
- 1997
Citation Context ... [8]) as discussed in [5]. Simple descriptions of Lanczos-based methods with Matlab examples are available in [3], and a good survey of public-domain software for Lanczos-type methods is available in =-=[7]-=-. Whereas most of the iterative methods mentioned thus far are serial in nature, an interesting asynchronous technique for computing several of the largest singular triplets of a sparse matrix on a ne... |
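
The Eckart and Young result quoted in the citation contexts above breaks off before giving the norm of the approximation error. As a hedged sketch in notation assumed here rather than taken from the excerpts ($\sigma_1 \ge \sigma_2 \ge \cdots$ for the singular values of $A$, $u_i$ and $v_i$ for the corresponding singular vectors, $r_A$ for the rank of $A$, and $\|\cdot\|_F$ for the Frobenius norm), the standard statement reads:

$$
A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^{T},
\qquad
\min_{\operatorname{rank}(B) \le k} \|A - B\|_F
= \|A - A_k\|_F
= \sqrt{\sigma_{k+1}^{2} + \cdots + \sigma_{r_A}^{2}} .
$$

The error of the best rank-$k$ approximation therefore depends only on the discarded singular values, which is why truncating small singular values perturbs the term-by-document matrix only slightly.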