## A Fast Approximate Algorithm for Large-Scale Latent Semantic Indexing

### BibTeX

```bibtex
@MISC{Zhang_afast,
  author = {Dell Zhang and Zheng Zhu},
  title  = {A Fast Approximate Algorithm for Large-Scale Latent Semantic Indexing},
  year   = {}
}
```

### Abstract

Latent Semantic Indexing (LSI) is an effective method for discovering the underlying semantic structure of data. It has numerous applications in information retrieval and data mining. However, the computational complexity of LSI may be prohibitively high when it is applied to very large datasets. In this paper, we present a fast approximate algorithm for large-scale LSI that is conceptually simple and theoretically justified. Our main contribution is to show that the proposed algorithm has a provable error bound and linear computational complexity.

### Citations

8543 | Introduction to Algorithms
- Cormen, Leiserson, et al.
- 1990

Citation Context: ... for j = 1, . . . , n, can be done in one pass over A and requires only O(n) additional time and space. Picking the s largest columns from the n columns of A can be done using selection algorithms [4], which typically have O(n + s log s) computational complexity, but as we do not need those largest s columns A^(j1), . . . , A^(js) to be themselves ordered, the complexity can be further reduced to O...
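The unordered selection of the s largest columns described in this context can be sketched with NumPy's `argpartition`, which performs a linear-time partial selection without ordering the chosen columns (an illustrative sketch on a toy random matrix, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 100))  # toy m x n matrix standing in for the document-term matrix

s = 10
norms = np.linalg.norm(A, axis=0) ** 2   # squared column norms, computable in one pass over A
idx = np.argpartition(norms, -s)[-s:]    # indices of the s largest columns, unordered: O(n)
sketch = A[:, idx]                       # the s picked columns A^(j1), ..., A^(js)

assert sketch.shape == (5, s)
```

Because the selected columns need not be mutually ordered, `argpartition` suffices and the O(s log s) sorting term of a full selection-and-sort approach is avoided.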

4688 | Matrix Analysis
- Horn, Johnson
- 1999

Citation Context: ...ntain n distinctive terms can be represented as an m × n document-term matrix [17]. The technique of Latent Semantic Indexing (LSI) [6, 3, 20, 14] employs truncated Singular Value Decomposition (SVD) [13, 11] (see Section 2) to find the best low-rank description of A ∈ R^(m×n), i.e., the matrix D ∈ R^(m×n) of rank k (k ≪ m, n) with minimum error ‖A − D‖_F, where ‖·‖_F denotes the Frobenius norm. Since the int...
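The truncated SVD mentioned in this context can be illustrated in a few lines of NumPy: by the Eckart-Young theorem, keeping the k largest singular triplets yields the rank-k matrix D minimising ‖A − D‖_F (a toy sketch on a random matrix, not the paper's algorithm):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((8, 6))  # toy stand-in for the m x n matrix

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# best rank-k approximation: keep the k largest singular triplets
D = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

# the Frobenius error equals the energy of the discarded singular values
err = np.linalg.norm(A - D, "fro")
assert np.isclose(err, np.sqrt(np.sum(sigma[k:] ** 2)))
```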

2800 | Eigenfaces for recognition
- Turk, Pentland
- 1991

Citation Context: ...al and data mining, including ad hoc text retrieval [6, 3, 20, 14], cross-language retrieval [10], distributed retrieval [23], text categorisation [5], Web search [15, 26], face or object recognition [25, 19], and DNA microarray data analysis [2, 21, 24]. However, the computational complexity of LSI is superlinear (in m and n) [13, 11], which may be prohibitive when A is very large. In this paper, we pres...

2730 | Indexing by Latent Semantic Analysis
- Deerwester, Dumais, et al.
- 1990

Citation Context: ...the feature space. For example, a collection of m documents that contain n distinctive terms can be represented as an m × n document-term matrix [17]. The technique of Latent Semantic Indexing (LSI) [6, 3, 20, 14] employs truncated Singular Value Decomposition (SVD) [13, 11] (see Section 2) to find the best low-rank description of A ∈ R^(m×n), i.e., the matrix D ∈ R^(m×n) of rank k (k ≪ m, n) with minimum error ‖...

2717 | Authoritative Sources in a Hyperlinked Environment
- Kleinberg
- 1999

Citation Context: ...cations of LSI in information retrieval and data mining, including ad hoc text retrieval [6, 3, 20, 14], cross-language retrieval [10], distributed retrieval [23], text categorisation [5], Web search [15, 26], face or object recognition [25, 19], and DNA microarray data analysis [2, 21, 24]. However, the computational complexity of LSI is superlinear (in m and n) [13, 11], which may be prohibitive when A ...

959 | Visual learning and recognition of 3D objects from appearance
- Murase, Nayar
- 1995

Citation Context: ...al and data mining, including ad hoc text retrieval [6, 3, 20, 14], cross-language retrieval [10], distributed retrieval [23], text categorisation [5], Web search [15, 26], face or object recognition [25, 19], and DNA microarray data analysis [2, 21, 24]. However, the computational complexity of LSI is superlinear (in m and n) [13, 11], which may be prohibitive when A is very large. In this paper, we pres...

839 | Introduction to information retrieval
- Manning, Raghavan, et al.
- 2008

Citation Context: ...ach row describes an instance as a point or vector in the feature space. For example, a collection of m documents that contain n distinctive terms can be represented as an m × n document-term matrix [17]. The technique of Latent Semantic Indexing (LSI) [6, 3, 20, 14] employs truncated Singular Value Decomposition (SVD) [13, 11] (see Section 2) to find the best low-rank description of A ∈ R^(m×n), i.e...

615 | Matrix perturbation theory
- Stewart, Sun
- 1990

Citation Context: ...e k largest singular triplets of A is the optimal rank-k approximation to A with respect to ‖·‖_F. It can be shown that ‖A‖²_F = Σ_{t=1}^{r} σ²_t(A). (12) According to matrix perturbation theory [22], the size of the difference between two matrices can be used to bound the difference between their singular value spectra. In particular, the Hoffman-Wielandt inequality states tha...
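Both identity (12) and the perturbation bound quoted in this context are easy to check numerically; the Mirsky/Hoffman-Wielandt-type bound for singular values says the ℓ2 distance between the spectra of A and A + E is at most ‖E‖_F (an illustrative verification on a random matrix, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((6, 4))

# identity (12): the squared Frobenius norm equals the sum of squared singular values
sigma = np.linalg.svd(A, compute_uv=False)
frob_sq = np.linalg.norm(A, "fro") ** 2
assert np.isclose(frob_sq, np.sum(sigma ** 2))

# perturbation bound: a small perturbation E moves the singular value
# spectrum by at most ||E||_F in the l2 sense
E = 1e-3 * rng.random((6, 4))
sigma_pert = np.linalg.svd(A + E, compute_uv=False)
assert np.linalg.norm(sigma - sigma_pert) <= np.linalg.norm(E, "fro") + 1e-12
```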

533 | Using linear algebra for intelligent information retrieval
- Berry, Dumais, et al.
- 1995

Citation Context: ...the feature space. For example, a collection of m documents that contain n distinctive terms can be represented as an m × n document-term matrix [17]. The technique of Latent Semantic Indexing (LSI) [6, 3, 20, 14] employs truncated Singular Value Decomposition (SVD) [13, 11] (see Section 2) to find the best low-rank description of A ∈ R^(m×n), i.e., the matrix D ∈ R^(m×n) of rank k (k ≪ m, n) with minimum error ‖...

438 | Newsweeder: Learning to filter netnews
- Lang
- 1995

Citation Context: ...mputational complexity of our algorithm is only O(m + n). 5 Experiments We have implemented our algorithm in Matlab, and performed preliminary experiments on a real-world text dataset, 20-newsgroups [16] (http://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/datasets/multiclass.html#news20). There are 19928 documents and 62061 terms in total, so the document-term matrix A has m = 19928 rows and n = 62061 columns. Assuming the number of latent semantic concepts in this corpus to be...

389 | Singular value decomposition for genome-wide expression data processing and modelling
- Alter, Brown, et al.
- 2000

Citation Context: ...etrieval [6, 3, 20, 14], cross-language retrieval [10], distributed retrieval [23], text categorisation [5], Web search [15, 26], face or object recognition [25, 19], and DNA microarray data analysis [2, 21, 24]. However, the computational complexity of LSI is superlinear (in m and n) [13, 11], which may be prohibitive when A is very large. In this paper, we present a fast approximate algorithm for large-sca...

358 | Matrix analysis and applied linear algebra
- Meyer
- 2000

Citation Context: ...thm often performs much better than the theoretical error bound suggests. 6 Related Work The exact solutions of truncated SVD are typically computed using iterative algorithms like the Lanczos method [18], but the computational complexity of such algorithms is too high to be practical on very large datasets. Gorrell proposed an incremental algorithm for approximate truncated SVD which works in a neura...
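The Lanczos-type iterative computation of a truncated SVD mentioned in this context is available off the shelf, e.g. via SciPy's `svds` (a small illustration on a random matrix, not the paper's experimental setup):

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(4)
A = rng.random((50, 40))  # toy matrix; svds also accepts large sparse matrices

k = 5
# Lanczos-type iterative solver: computes only the k largest singular triplets
U, sigma, Vt = svds(A, k=k)

# compare against the k largest values from a full dense SVD
sigma_full = np.linalg.svd(A, compute_uv=False)
assert np.allclose(np.sort(sigma), np.sort(sigma_full[:k]))
```

For sparse document-term matrices, such iterative solvers avoid ever forming a dense factorisation, but their cost still grows superlinearly with the matrix dimensions, which is the bottleneck the paper targets.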

280 | Missing value estimation methods for DNA microarrays
- Troyanskaya, Cantor, et al.
- 2001

Citation Context: ...etrieval [6, 3, 20, 14], cross-language retrieval [10], distributed retrieval [23], text categorisation [5], Web search [15, 26], face or object recognition [25, 19], and DNA microarray data analysis [2, 21, 24]. However, the computational complexity of LSI is superlinear (in m and n) [13, 11], which may be prohibitive when A is very large. In this paper, we present a fast approximate algorithm for large-sca...

249 | Latent semantic indexing: A probabilistic analysis
- Papadimitriou, Tamaki, et al.
- 1998

Citation Context: ...the feature space. For example, a collection of m documents that contain n distinctive terms can be represented as an m × n document-term matrix [17]. The technique of Latent Semantic Indexing (LSI) [6, 3, 20, 14] employs truncated Singular Value Decomposition (SVD) [13, 11] (see Section 2) to find the best low-rank description of A ∈ R^(m×n), i.e., the matrix D ∈ R^(m×n) of rank k (k ≪ m, n) with minimum error ‖...

141 | Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix
- Drineas, Kannan, et al.

Citation Context: ...on a small sketch matrix consisting of randomly sampled columns from A according to a certain probability distribution [7, 8, 9], but their sketch matrix, unlike ours, contains many duplicate columns and consequently impairs the algorithm's time and space efficiency. Achlioptas and McSherry proposed an alternative entry...
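The column-sampling sketch of [7, 8, 9] can be sketched as follows, using the standard norm-squared sampling probabilities with rescaling; this is a simplified illustration (not the paper's own method), and, as the context notes, the i.i.d. sampling may pick duplicate columns:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((5, 200))  # toy m x n matrix

s = 20
probs = np.linalg.norm(A, axis=0) ** 2
probs /= probs.sum()                     # p_j proportional to |A^(j)|^2

# sample s column indices i.i.d. from p (duplicates possible)
idx = rng.choice(A.shape[1], size=s, p=probs)

# rescale each sampled column so that E[C @ C.T] = A @ A.T
C = A[:, idx] / np.sqrt(s * probs[idx])

assert C.shape == (5, s)
```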

140 | Principal Components Analysis to Summarize Microarray Experiments: Application to Sporulation Time Series. Pacific Symposium on Biocomputing
- Raychaudhuri, Stuart, et al.

Citation Context: ...etrieval [6, 3, 20, 14], cross-language retrieval [10], distributed retrieval [23], text categorisation [5], Web search [15, 26], face or object recognition [25, 19], and DNA microarray data analysis [2, 21, 24]. However, the computational complexity of LSI is superlinear (in m and n) [13, 11], which may be prohibitive when A is very large. In this paper, we present a fast approximate algorithm for large-sca...

122 | Fast computation of low rank matrix approximations
- Achlioptas, McSherry
- 2001

Citation Context: ...pairs the algorithm's time and space efficiency. Achlioptas and McSherry proposed an alternative entry-wise randomised algorithm for approximate truncated SVD based on the theory of random matrices [1]. 7 Conclusions We have presented a fast approximate algorithm for large-scale LSI that is conceptually simple and theoretically justified. Our main contribution is to show that the proposed algorithm...
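The entry-wise randomisation attributed to Achlioptas and McSherry can be sketched as keeping each entry independently with probability p and rescaling survivors by 1/p, which leaves the sketch unbiased (a minimal illustration of the idea, not their exact scheme):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.random((30, 30))  # toy dense matrix

p = 0.5
mask = rng.random(A.shape) < p       # keep each entry independently with probability p
S = np.where(mask, A / p, 0.0)       # rescale survivors so that E[S] = A

# kept entries are exact up to the 1/p rescaling; dropped entries are zero
assert np.allclose(S[mask] * p, A[mask])
assert np.all(S[~mask] == 0.0)
```

The resulting sparse random matrix S concentrates around A, so the top singular triplets of S approximate those of A while being much cheaper to compute.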

88 | Latent semantic kernels
- Cristianini, Shawe-Taylor, et al.
- 2002

Citation Context: ...e numerous applications of LSI in information retrieval and data mining, including ad hoc text retrieval [6, 3, 20, 14], cross-language retrieval [10], distributed retrieval [23], text categorisation [5], Web search [15, 26], face or object recognition [25, 19], and DNA microarray data analysis [2, 21, 24]. However, the computational complexity of LSI is superlinear (in m and n) [13, 11], which may b...

62 | Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication
- Drineas, Kannan, et al.
- 2006

Citation Context: ...on a small sketch matrix consisting of randomly sampled columns from A according to a certain probability distribution [7, 8, 9], but their sketch matrix, unlike ours, contains many duplicate columns and consequently impairs the algorithm's time and space efficiency. Achlioptas and McSherry proposed an alternative entry...

58 | Automatic cross-linguistic information retrieval using latent semantic indexing
- Dumais, Landauer, et al.
- 1996

Citation Context: ...e tough problem of synonymy and polysemy [6]. There are numerous applications of LSI in information retrieval and data mining, including ad hoc text retrieval [6, 3, 20, 14], cross-language retrieval [10], distributed retrieval [23], text categorisation [5], Web search [15, 26], face or object recognition [25, 19], and DNA microarray data analysis [2, 21, 24]. However, the computational complexity of ...

33 | On scaling latent semantic indexing for large peer-to-peer systems
- Tang, Dwarkadas, et al.
- 2004

Citation Context: ...and polysemy [6]. There are numerous applications of LSI in information retrieval and data mining, including ad hoc text retrieval [6, 3, 20, 14], cross-language retrieval [10], distributed retrieval [23], text categorisation [5], Web search [15, 26], face or object recognition [25, 19], and DNA microarray data analysis [2, 21, 24]. However, the computational complexity of LSI is superlinear (in m and...

29 | On the use of singular value decomposition for text retrieval
- Husbands, Simon, et al.
- 2001

18 | Semantic, Hierarchical, Online Clustering of Web Search Results
- Zhang, Dong
- 2004

Citation Context: ...cations of LSI in information retrieval and data mining, including ad hoc text retrieval [6, 3, 20, 14], cross-language retrieval [10], distributed retrieval [23], text categorisation [5], Web search [15, 26], face or object recognition [25, 19], and DNA microarray data analysis [2, 21, 24]. However, the computational complexity of LSI is superlinear (in m and n) [13, 11], which may be prohibitive when A ...

16 | Generalized Hebbian algorithm for incremental Singular Value Decomposition in natural language processing
- Gorrell
- 2006

Citation Context: ...too high to be practical on very large datasets. Gorrell proposed an incremental algorithm for approximate truncated SVD which works in a neural-network-like fashion and requires far fewer resources [12]. Tang et al. proposed to reduce the cost of truncated SVD through document clustering and term selection [23]. However, those approximate algorithms do not come with a theoretical guarantee of error...