Latent semantic indexing: A probabilistic analysis
- 1998
Cited by 210 (8 self)
Latent semantic indexing (LSI) is an information retrieval technique based on the spectral analysis of the term-document matrix, whose empirical success had heretofore been without rigorous prediction and explanation. We prove that, under certain conditions, LSI does succeed in capturing the underlying semantics of the corpus and achieves improved retrieval performance. We also propose the technique of random projection as a way of speeding up LSI. We complement our theorems with encouraging experimental results. We also argue that our results may be viewed in a more general framework, as a theoretical basis for the use of spectral methods in a wider class of applications such as collaborative filtering.
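The spectral step this abstract refers to can be illustrated on a toy corpus. The following is a minimal sketch, not the authors' procedure: the five-term, four-document matrix, the function names, and the use of power iteration with deflation (in place of a library SVD) are all assumptions made for illustration. It shows two documents that share only one of their two terms becoming indistinguishable in a rank-2 latent space.

```python
import math

def matvec(M, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def normalize(x):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def top_eigenpairs(M, k, iters=500):
    """Top-k eigenpairs of a symmetric PSD matrix via power iteration with deflation."""
    M = [row[:] for row in M]
    n = len(M)
    pairs = []
    for _component in range(k):
        v = normalize([1.0] * n)
        for _step in range(iters):
            v = normalize(matvec(M, v))
        lam = sum(a * b for a, b in zip(v, matvec(M, v)))  # Rayleigh quotient
        pairs.append((lam, v))
        # Deflate: remove the component just found so the next pass finds the next one.
        for i in range(n):
            for j in range(n):
                M[i][j] -= lam * v[i] * v[j]
    return pairs

def lsi_document_vectors(A, k):
    """Rank-k latent coordinates of each document (column of A): sigma_i * v_i[j]."""
    docs = list(zip(*A))
    gram = [[sum(a * b for a, b in zip(di, dj)) for dj in docs] for di in docs]
    pairs = top_eigenpairs(gram, k)
    return [[math.sqrt(max(lam, 0.0)) * v[j] for lam, v in pairs]
            for j in range(len(docs))]

# Hypothetical term-document matrix (rows: car, auto, engine, flower, petal).
A = [
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # auto
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 0],  # petal
]
raw_docs = list(zip(*A))
latent = lsi_document_vectors(A, k=2)

print("raw cosine d0,d1:   ", round(cosine(raw_docs[0], raw_docs[1]), 3))
print("latent cosine d0,d1:", round(cosine(latent[0], latent[1]), 3))
print("latent cosine d0,d2:", round(cosine(latent[0], latent[2]), 3))
```

In this toy case the two "vehicle" documents collapse onto the same latent direction while remaining orthogonal to the "flower" documents; real corpora are noisier, and a production system would use a sparse SVD routine rather than dense power iteration.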
Fast Monte-Carlo algorithms for finding low-rank approximations
- In Proceedings of the 39th Annual IEEE Symposium on Foundations of Computer Science, 1998
Cited by 137 (14 self)
We consider the problem of approximating a given $m \times n$ matrix $A$ by another matrix of specified rank $k$, which is smaller than $m$ and $n$. The Singular Value Decomposition (SVD) can be used to find the "best" such approximation. However, it takes time polynomial in $m$, $n$, which is prohibitive for some modern applications. In this paper, we develop an algorithm which is qualitatively faster, provided we may sample the entries of the matrix according to a natural probability distribution. In many applications such sampling can be done efficiently. Our main result is a randomized algorithm to find the description of a matrix $D^*$ of rank at most $k$ so that $\|A - D^*\|_F^2 \le \min_{D,\ \mathrm{rank}(D) \le k} \|A - D\|$ ...
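The abstract does not spell out the sampling distribution. A minimal sketch of one common choice, assumed here purely for illustration, is length-squared row sampling: row $i$ is picked with probability proportional to its squared norm and rescaled so that the sampled matrix $S$ satisfies $E[S^T S] = A^T A$. The matrix `A`, the sample count, and all function names below are hypothetical.

```python
import math
import random

def length_squared_sample(A, s, rng):
    """Draw s rows of A with probability proportional to squared row norm, rescaled."""
    sq_norms = [sum(x * x for x in row) for row in A]
    total = sum(sq_norms)
    probs = [n / total for n in sq_norms]
    picks = rng.choices(range(len(A)), weights=probs, k=s)
    # Dividing by sqrt(s * p_i) makes S^T S an unbiased estimate of A^T A.
    return [[x / math.sqrt(s * probs[i]) for x in A[i]] for i in picks]

def gram(M):
    """Compute M^T M for a dense matrix given as a list of rows."""
    cols = len(M[0])
    return [[sum(row[i] * row[j] for row in M) for j in range(cols)]
            for i in range(cols)]

def frobenius(M):
    return math.sqrt(sum(x * x for row in M for x in row))

A = [
    [4.0, 0.0, 1.0],
    [3.0, 1.0, 0.0],
    [0.0, 2.0, 2.0],
    [1.0, 0.0, 0.5],
    [0.0, 0.1, 0.0],
    [2.0, 2.0, 2.0],
]
rng = random.Random(0)
S = length_squared_sample(A, s=5000, rng=rng)

exact, approx = gram(A), gram(S)
diff = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(exact, approx)]
rel_err = frobenius(diff) / frobenius(A) ** 2
print("relative error of sampled Gram matrix:", rel_err)
```

A low-rank approximation would then be obtained from the top singular vectors of the small matrix `S`; that step is omitted here to keep the sketch short.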
From Words to Understanding
- Computing with Large Random Patterns
Cited by 38 (13 self)
As was discussed in section 22, language is central to a correct understanding of the mind. Compositional analytic models perform well in the domain and subject area they are developed for, but any extension is difficult and the models have incomplete psychological veracity. Here we explore how to compute representations of meaning based on a lower level of abstraction and how to use the models for tasks that require some form of language understanding.
Enhancing Performance in Latent Semantic Indexing (LSI) Retrieval
- 1992
Cited by 37 (0 self)
We have previously described an extension of the vector retrieval method called "Latent Semantic Indexing" (LSI) (Deerwester, et al., 1990; Dumais, et al., 1988; Furnas, et al., 1988). The LSI approach partially overcomes the problem of variability in human word choice by automatically organizing objects into a "semantic" structure more appropriate for information retrieval. This is done by modeling the implicit higher-order structure in the association of terms with objects. Initial tests find this completely automatic method to be a promising way to improve users' access to many kinds of textual materials or to objects for which textual descriptions are available. This paper describes some enhancements to the basic LSI method, including differential term weighting and relevance feedback. Appropriate term weighting improves performance by an average of 40%, and feedback based on 3 relevant documents improves performance by an average of 67%.
Polynomial Time Approximation Schemes for Geometric k-Clustering
- J. of the ACM, 2001
Cited by 28 (5 self)
The Johnson-Lindenstrauss lemma states that n points in a high dimensional Hilbert space can be embedded with small distortion of the distances into an O(log n) dimensional space by applying a random linear transformation. We show that similar (though weaker) properties hold for certain random linear transformations over the Hamming cube. We use these transformations to solve NP-hard clustering problems in the cube as well as in geometric settings. More specifically, we address the following clustering problem. Given n points in a larger set (for example, R^d) endowed with a distance function (for example, L² distance), we would like to partition the data set into k disjoint clusters, each with a "cluster center", so as to minimize the sum over all data points of the distance between the point and the center of the cluster containing the point. The problem is provably NP-hard in some high dimensional geometric settings, even for k = 2. We give polynomial time approximation schemes for this problem in several settings, including the binary cube {0, 1}^d with Hamming distance, and R^d either with L¹ distance, or with L² distance, or with the square of L² distance. In all these settings, the best previous results were constant factor approximation guarantees. We note that our problem is similar in flavor to the k-median problem (and the related facility location problem), which has been considered in graph-theoretic and fixed dimensional geometric settings, where it becomes hard when k is part of the input. In contrast, we study the problem when k is fixed, but the dimension is part of the input.
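The Johnson-Lindenstrauss statement that opens this abstract can be checked empirically. The sketch below is an illustration only, not the paper's Hamming-cube construction: it assumes a dense Gaussian projection matrix scaled by $1/\sqrt{k}$, and the dimensions, point count, and names are arbitrary choices.

```python
import math
import random

def random_projection_matrix(d, k, rng):
    """k x d matrix of i.i.d. N(0, 1/k) entries, so squared lengths are preserved in expectation."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]

def project(R, x):
    return [sum(r * xi for r, xi in zip(row, x)) for row in R]

def distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

rng = random.Random(0)
d, k, n = 600, 200, 8          # original dimension, target dimension, number of points
points = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
R = random_projection_matrix(d, k, rng)
projected = [project(R, p) for p in points]

# Ratio of projected to original distance for every pair of points.
ratios = [
    distance(projected[i], projected[j]) / distance(points[i], points[j])
    for i in range(n)
    for j in range(i + 1, n)
]
print("min ratio:", min(ratios), "max ratio:", max(ratios))
```

The ratios cluster around 1, with spread shrinking roughly like $1/\sqrt{k}$ as the target dimension grows.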
Automatic Bilingual Lexicon Acquisition Using Random Indexing
- Journal of Natural Language Engineering, Special Issue on Parallel Texts, 2004
Cited by 21 (6 self)
This paper presents a very simple and effective approach to automatic bilingual lexicon acquisition. The approach is cooccurrence-based, and uses the Random Indexing vector space methodology applied to aligned bilingual data. The approach is simple, efficient and scalable, and generates promising results when compared to a manually compiled lexicon. The paper also discusses some of the methodological problems with the preferred evaluation procedure.
SimFusion: measuring similarity using unified relationship matrix
- In SIGIR, 2005
Cited by 20 (2 self)
In this paper we use a Unified Relationship Matrix (URM) to represent a set of heterogeneous data objects (e.g., web pages, queries) and their interrelationships (e.g., hyperlinks, user clickthrough sequences). We claim that iterative computations over the URM can help overcome the data sparseness problem and detect latent relationships among heterogeneous data objects, and thus can improve the quality of information applications that require combination of information from heterogeneous sources. To support our claim, we present a unified similarity-calculating algorithm, SimFusion. By iteratively computing over the URM, SimFusion can effectively integrate relationships from heterogeneous sources when measuring the similarity of two data objects. Experiments based on a web search engine query log and a web page collection demonstrate that SimFusion can improve similarity measurement of web objects over both traditional content-based algorithms and the cutting-edge SimRank algorithm.
Iterative Searching In An Online Database
- In Proceedings of Human Factors Society 35th Annual Meeting, 1991
Cited by 11 (3 self)
An experiment examined how people use an online retrieval system. Subjects solved general topical search problems using a database containing the full text of news articles (e.g., find articles about the "Background of the new prime minister of Great Britain"). Time, accuracy and content of the searches were recorded. Of particular interest was the use of two iterative search methods available in the interface - a Lookup function that allowed users to explicitly specify an alternative query; and a LikeThese function that could be used to automatically generate a new query using articles the user marked as relevant. Results showed that subjects could easily use both query reformulation methods. Subjects generated much more effective LikeThese searches than Lookup searches. An analysis of individual subject differences suggests that the LikeThese method is more accessible to a wide range of users. Figure 1. Example of InfoSearch interface. Response of the system to the search problem "...
Vector-Based Semantic Analysis: Representing Word Meanings Based On Random Labels
- In ESSLLI Workshop on Semantic Knowledge Acquisition and Categorization, 2001
Cited by 11 (1 self)
Vector-based semantic analysis is the practice of using co-occurrence statistics to construct vectors that represent word meanings by virtue of their direction in multi-dimensional semantic space. This paper discusses the theoretical presumptions behind this practice, and a representational scheme based on the Distributional Hypothesis is identified as the rationale for vector-based semantic analysis. A new method for calculating semantic word vectors is then described. The method uses random labeling of words in narrow context windows to calculate semantic context vectors for each word type in the text data. The method is evaluated with a standardized synonym test, and it is shown that incorporating linguistic information in the context vectors can enhance the results.
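The random-labeling idea described in this abstract can be sketched roughly as follows. This is an assumed, simplified reading rather than the paper's exact method: each word type receives a sparse random ternary label, and a word's context vector is the sum of the labels of its neighbours within a narrow window. The dimensionality, number of nonzeros, window size, toy sentences, and function names are all hypothetical, and no linguistic information is incorporated.

```python
import math
import random
from collections import defaultdict

def index_vector(dim, nonzeros, rng):
    """Sparse random label: a few +1/-1 entries, zeros elsewhere."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1, 1))
    return vec

def random_indexing(sentences, dim=512, nonzeros=8, window=2, seed=0):
    """Build context vectors by summing the random labels of nearby words."""
    rng = random.Random(seed)
    labels = {}
    for sentence in sentences:
        for word in sentence:
            if word not in labels:
                labels[word] = index_vector(dim, nonzeros, rng)
    contexts = defaultdict(lambda: [0] * dim)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    neighbour = labels[sentence[j]]
                    contexts[word] = [c + v for c, v in zip(contexts[word], neighbour)]
    return contexts

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

sentences = [s.split() for s in (
    "the cat eats fish",
    "the dog eats fish",
    "a car needs fuel",
)]
contexts = random_indexing(sentences)
print("cat vs dog:", cosine(contexts["cat"], contexts["dog"]))
print("cat vs car:", cosine(contexts["cat"], contexts["car"]))
```

Because "cat" and "dog" occur in identical contexts here, their context vectors coincide, while "car" shares no neighbours with either and ends up nearly orthogonal. The co-occurrence matrix is never built explicitly, which is the main practical appeal of the approach.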
Variable latent semantic indexing
- In Proc. of the 11th KDD, 2005
Cited by 7 (0 self)
Latent Semantic Indexing is a classical method to produce optimal low-rank approximations of a term-document matrix. However, in the context of a particular query distribution, the approximation thus produced need not be optimal. We propose VLSI, a new query-dependent (or “variable”) low-rank approximation that minimizes approximation error for any specified query distribution. With this tool, it is possible to tailor the LSI technique to particular settings, often resulting in vastly improved approximations at much lower dimensionality. We validate this method via a series of experiments on classical corpora, showing that VLSI typically performs similarly to LSI with an order of magnitude fewer dimensions.

