Results 1–10 of 35
Latent semantic indexing: A probabilistic analysis
, 1998
Abstract

Cited by 249 (8 self)
Latent semantic indexing (LSI) is an information retrieval technique based on the spectral analysis of the term-document matrix, whose empirical success had heretofore been without rigorous prediction and explanation. We prove that, under certain conditions, LSI does succeed in capturing the underlying semantics of the corpus and achieves improved retrieval performance. We also propose the technique of random projection as a way of speeding up LSI. We complement our theorems with encouraging experimental results. We also argue that our results may be viewed in a more general framework, as a theoretical basis for the use of spectral methods in a wider class of applications, such as collaborative filtering.
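The two ingredients of the abstract above can be sketched in a few lines of numpy: LSI as a rank-k truncated SVD of the term-document matrix, and a random projection applied first so the SVD runs on a smaller matrix. The matrix sizes, rank, and projection dimension below are illustrative, not taken from the paper.

```python
import numpy as np

# Hypothetical toy term-document matrix A (terms x documents);
# all sizes here are illustrative.
rng = np.random.default_rng(0)
A = rng.random((100, 30))          # 100 terms, 30 documents

# LSI: rank-k truncated SVD of the term-document matrix.
k = 5
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation

# Random projection as a speedup: map the 100-dim term space down to
# d << 100 dimensions, then run the (now much cheaper) SVD on the
# projected matrix; its right singular vectors approximate those of A.
d = 20
R = rng.standard_normal((d, 100)) / np.sqrt(d)
A_proj = R @ A                      # d x 30 instead of 100 x 30
U2, s2, Vt2 = np.linalg.svd(A_proj, full_matrices=False)

err_k = np.linalg.norm(A - A_k)     # Frobenius error of the rank-k LSI
```

Retrieval then proceeds by comparing queries and documents in the k-dimensional latent space rather than in the raw term space.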
Fast Monte-Carlo algorithms for finding low-rank approximations
 In Proceedings of the 39th Annual IEEE Symposium on Foundations of Computer Science
, 1998
Abstract

Cited by 179 (15 self)
We consider the problem of approximating a given m × n matrix A by another matrix of specified rank k, which is smaller than m and n. The Singular Value Decomposition (SVD) can be used to find the "best" such approximation. However, it takes time polynomial in m and n, which is prohibitive for some modern applications. In this paper, we develop an algorithm which is qualitatively faster, provided we may sample the entries of the matrix according to a natural probability distribution. In many applications such sampling can be done efficiently. Our main result is a randomized algorithm to find the description of a matrix D* of rank at most k so that ||A − D*||²_F ≤ min_{D, rank(D)≤k} ||A − D||²_F + ε||A||²_F.
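The "natural probability distribution" in the abstract above is length-squared row sampling. A minimal sketch, with illustrative sizes and with the top right singular vectors of the small sampled matrix standing in for those of A:

```python
import numpy as np

# Toy instance; m, n, k, and the sample size are illustrative.
rng = np.random.default_rng(1)
m, n, k, s_rows = 500, 80, 5, 60
A = rng.random((m, n))

# Sample rows with probability proportional to their squared norms,
# rescaling each sampled row so expectations work out.
row_norms_sq = (A ** 2).sum(axis=1)
p = row_norms_sq / row_norms_sq.sum()
idx = rng.choice(m, size=s_rows, p=p)
S = A[idx] / np.sqrt(s_rows * p[idx])[:, None]

# The top-k right singular vectors of the small matrix S approximate
# those of A; projecting A onto their span gives a rank-(<= k)
# approximation D* without an SVD of the full matrix.
_, _, Vt = np.linalg.svd(S, full_matrices=False)
V_k = Vt[:k].T
D_star = A @ V_k @ V_k.T

err = np.linalg.norm(A - D_star)
```

The expensive SVD now runs on an s_rows × n matrix whose size is independent of m, which is the source of the speedup the abstract claims.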
From Words to Understanding
 Computing with Large Random Patterns
Abstract

Cited by 48 (14 self)
As was discussed in section 22, language is central to a correct understanding of the mind. Compositional analytic models perform well in the domain and subject area they are developed for, but any extension is difficult and the models have incomplete psychological veracity. Here we explore how to compute representations of meaning based on a lower level of abstraction and how to use the models for tasks that require some form of language understanding.
Enhancing Performance in Latent Semantic Indexing (LSI) Retrieval
, 1992
Abstract

Cited by 43 (0 self)
We have previously described an extension of the vector retrieval method called "Latent Semantic Indexing" (LSI) (Deerwester et al., 1990; Dumais et al., 1988; Furnas et al., 1988). The LSI approach partially overcomes the problem of variability in human word choice by automatically organizing objects into a "semantic" structure more appropriate for information retrieval. This is done by modeling the implicit higher-order structure in the association of terms with objects. Initial tests find this completely automatic method to be a promising way to improve users' access to many kinds of textual materials, or to objects for which textual descriptions are available. This paper describes some enhancements to the basic LSI method, including differential term weighting and relevance feedback. Appropriate term weighting improves performance by an average of 40%, and feedback based on 3 relevant documents improves performance by an average of 67%.
Automatic Bilingual Lexicon Acquisition Using Random Indexing
 Journal of Natural Language Engineering, Special Issue on Parallel Texts
, 2004
Abstract

Cited by 33 (6 self)
This paper presents a very simple and effective approach to automatic bilingual lexicon acquisition. The approach is co-occurrence-based, and uses the Random Indexing vector space methodology applied to aligned bilingual data. The approach is simple, efficient and scalable, and generates promising results when compared to a manually compiled lexicon. The paper also discusses some of the methodological problems with the preferred evaluation procedure.
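The co-occurrence idea in the abstract above can be sketched as follows: each target-language word gets a sparse random index vector, each source word accumulates the index vectors of target words it is aligned with, and translation candidates are retrieved by cosine similarity. The toy data, dimensionality, and helper names are all hypothetical, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(3)
d, nnz = 64, 4   # vector dimensionality and number of nonzero entries

def index_vector(rng, d, nnz):
    """Sparse ternary random index vector: nnz entries of +/-1."""
    v = np.zeros(d)
    pos = rng.choice(d, size=nnz, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=nnz)
    return v

aligned = [  # toy aligned sentence pairs (source, target)
    (["the", "dog", "barks"], ["der", "hund", "bellt"]),
    (["the", "cat", "sleeps"], ["die", "katze", "schlaeft"]),
    (["a", "dog", "sleeps"], ["ein", "hund", "schlaeft"]),
]

# Accumulate, for each source word, the index vectors of all target
# words occurring in its aligned sentences.
target_index = {}
context = {}
for src, tgt in aligned:
    for t in tgt:
        if t not in target_index:
            target_index[t] = index_vector(rng, d, nnz)
    for w in src:
        vec = context.setdefault(w, np.zeros(d))
        for t in tgt:
            vec += target_index[t]

def translate(word):
    """Best translation candidate by cosine similarity."""
    c = context[word]
    return max(target_index, key=lambda t: float(
        target_index[t] @ c /
        (np.linalg.norm(target_index[t]) * np.linalg.norm(c) + 1e-12)))
```

With enough aligned data, the frequent co-aligned target word dominates the source word's context vector, which is what makes nearest-neighbor lookup a plausible lexicon extractor.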
Polynomial Time Approximation Schemes for Geometric k-Clustering
 Journal of the ACM
, 2001
Abstract

Cited by 31 (5 self)
The Johnson-Lindenstrauss lemma states that n points in a high-dimensional Hilbert space can be embedded with small distortion of the distances into an O(log n)-dimensional space by applying a random linear transformation. We show that similar (though weaker) properties hold for certain random linear transformations over the Hamming cube. We use these transformations to solve NP-hard clustering problems in the cube as well as in geometric settings. More specifically, we address the following clustering problem. Given n points in a larger set (for example, R^d) endowed with a distance function (for example, L² distance), we would like to partition the data set into k disjoint clusters, each with a "cluster center", so as to minimize the sum over all data points of the distance between the point and the center of the cluster containing the point. The problem is provably NP-hard in some high-dimensional geometric settings, even for k = 2. We give polynomial time approximation schemes for this problem in several settings, including the binary cube {0, 1}^d with Hamming distance, and R^d either with L¹ distance, or with L² distance, or with the square of L² distance. In all these settings, the best previous results were constant-factor approximation guarantees. We note that our problem is similar in flavor to the k-median problem (and the related facility location problem), which has been considered in graph-theoretic and fixed-dimensional geometric settings, where it becomes hard when k is part of the input. In contrast, we study the problem when k is fixed, but the dimension is part of the input.
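The Euclidean version of the lemma the abstract starts from is easy to demonstrate numerically: a scaled random Gaussian map to t dimensions preserves pairwise distances up to small distortion. The target dimension t below is chosen for illustration, not by the lemma's exact constants.

```python
import numpy as np

# Toy point set in high dimension; sizes are illustrative.
rng = np.random.default_rng(4)
n, dim, t = 50, 1000, 200
X = rng.standard_normal((n, dim))

# Random linear transformation, scaled so squared lengths are
# preserved in expectation.
R = rng.standard_normal((dim, t)) / np.sqrt(t)
Y = X @ R

# Compare one pairwise distance before and after projection; the
# ratio concentrates near 1 as t grows.
d_orig = np.linalg.norm(X[0] - X[1])
d_proj = np.linalg.norm(Y[0] - Y[1])
distortion = d_proj / d_orig
```

The paper's contribution is showing that analogous (weaker) guarantees survive when the transformation must map the Hamming cube to itself, where a Gaussian map is unavailable.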
SimFusion: measuring similarity using unified relationship matrix
 In SIGIR
, 2005
Abstract

Cited by 24 (2 self)
In this paper we use a Unified Relationship Matrix (URM) to represent a set of heterogeneous data objects (e.g., web pages, queries) and their interrelationships (e.g., hyperlinks, user click-through sequences). We claim that iterative computations over the URM can help overcome the data sparseness problem and detect latent relationships among heterogeneous data objects, and thus can improve the quality of information applications that require combining information from heterogeneous sources. To support our claim, we present a unified similarity-calculating algorithm, SimFusion. By iteratively computing over the URM, SimFusion can effectively integrate relationships from heterogeneous sources when measuring the similarity of two data objects. Experiments based on a web search engine query log and a web page collection demonstrate that SimFusion can improve similarity measurement of web objects over both traditional content-based algorithms and the cutting-edge SimRank algorithm.
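A rough sketch of the "iterative computation over the URM" idea, as a simplified reading of the abstract rather than the paper's exact SimFusion update: similarities are repeatedly propagated through a row-normalized relationship matrix L, with each object kept fully similar to itself.

```python
import numpy as np

# Toy symmetric relationship matrix over n heterogeneous objects;
# the graph and sizes are illustrative.
rng = np.random.default_rng(5)
n = 6
adj = (rng.random((n, n)) < 0.4).astype(float)
np.fill_diagonal(adj, 0.0)
adj = np.maximum(adj, adj.T)

# Row-normalize to get the propagation matrix L.
row_sums = np.maximum(adj.sum(axis=1, keepdims=True), 1e-12)
L = adj / row_sums

# Iterate: two objects become similar when their relationship
# neighborhoods are similar.
S = np.eye(n)                    # start: each object similar only to itself
for _ in range(10):
    S = L @ S @ L.T
    np.fill_diagonal(S, 1.0)     # an object stays fully similar to itself
```

This SimRank-style fixed-point iteration is what lets similarity flow across heterogeneous relationship types once they share one matrix.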
Vector-Based Semantic Analysis: Representing Word Meanings Based on Random Labels
 In ESSLLI Workshop on Semantic Knowledge Acquisition and Categorization
, 2001
Abstract

Cited by 16 (1 self)
Vector-based semantic analysis is the practice of using co-occurrence statistics to construct vectors that represent word meanings by virtue of their direction in multi-dimensional semantic space. This paper discusses the theoretical presumptions behind this practice, and a representational scheme based on the Distributional Hypothesis is identified as the rationale for vector-based semantic analysis. A new method for calculating semantic word vectors is then described. The method uses random labeling of words in narrow context windows to calculate semantic context vectors for each word type in the text data. The method is evaluated with a standardized synonym test, and it is shown that incorporating linguistic information in the context vectors can enhance the results.
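The random-labeling method described above can be sketched as: assign each word type a sparse random label, then build each word's semantic context vector by summing the labels of words in a narrow window around its occurrences. Corpus, dimensionality, and window size below are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
d, nnz, win = 128, 4, 2   # dimensionality, label sparsity, window radius
corpus = "the dog chased the cat the cat chased the mouse".split()

def random_label(rng, d, nnz):
    """Sparse ternary random label: nnz entries of +/-1."""
    v = np.zeros(d)
    pos = rng.choice(d, size=nnz, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=nnz)
    return v

labels = {w: random_label(rng, d, nnz) for w in set(corpus)}
context = {w: np.zeros(d) for w in set(corpus)}

# Sum the labels of words within +/- win positions of each occurrence.
for i, w in enumerate(corpus):
    for j in range(max(0, i - win), min(len(corpus), i + win + 1)):
        if j != i:
            context[w] += labels[corpus[j]]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

Words occurring in similar windows accumulate similar label sums, so direction in this space reflects distributional similarity, which is what the synonym test measures.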
Iterative Searching In An Online Database
 In Proceedings of Human Factors Society 35th Annual Meeting
, 1991
Abstract

Cited by 11 (3 self)
An experiment examined how people use an online retrieval system. Subjects solved general topical search problems using a database containing the full text of news articles (e.g., find articles about the "Background of the new prime minister of Great Britain"). Time, accuracy and content of the searches were recorded. Of particular interest was the use of two iterative search methods available in the interface: a Lookup function that allowed users to explicitly specify an alternative query, and a LikeThese function that could be used to automatically generate a new query using articles the user marked as relevant. Results showed that subjects could easily use both query reformulation methods. Subjects generated much more effective LikeThese searches than Lookup searches. An analysis of individual subject differences suggests that the LikeThese method is more accessible to a wide range of users. [Figure 1: Example of the InfoSearch interface.]
Visual islands: intuitive browsing of visual search results
 In CIVR '08
, 2008
Abstract

Cited by 9 (1 self)
The amount of available digital multimedia has seen exponential growth in recent years. While advances have been made in the indexing and searching of images and videos, less focus has been given to aiding users in the interactive exploration of large datasets. In this paper a new framework, called visual islands, is proposed that reorganizes image query results from an initial search, or even a general photo collection, using a fast, non-global feature projection to compute 2D display coordinates. A prototype system is implemented and evaluated with three core goals: fast browsing, intuitive display, and non-linear exploration. Using the TRECVID 2005 [15] dataset, 10 users evaluated the goals over 24 topics. Experiments show that users experience improved comprehensibility and achieve a significant page-level precision improvement with the visual islands framework over traditional paged browsing.
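One way to read "fast, non-global feature projection to compute 2D display coordinates" is a per-page projection: project only the current page of results onto its own top two principal directions, rather than fitting a layout to the whole collection. This is an illustrative guess at the idea, not the paper's actual projection.

```python
import numpy as np

# One page of 20 result images, each with a 64-dim feature vector;
# sizes and features are toy assumptions.
rng = np.random.default_rng(7)
page = rng.random((20, 64))

# Non-global: center and decompose only this page's features.
page_c = page - page.mean(axis=0)
_, _, Vt = np.linalg.svd(page_c, full_matrices=False)

# 2D display coordinates for arranging thumbnails on screen.
coords = page_c @ Vt[:2].T
```

Because only one page is decomposed at a time, the layout cost stays constant as the underlying collection grows, which fits the framework's fast-browsing goal.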