Latent Semantic Space: Iterative Scaling Improves Precision of Inter-document Similarity Measurement
- In Proceedings of the SIGIR, 2000
Cited by 33 (3 self)
We present a novel algorithm that creates document vectors with reduced dimensionality. This work was motivated by an application characterizing relationships among documents in a collection. Our algorithm yielded inter-document similarities with an average precision up to 17.8% higher than that of singular value decomposition (SVD) used for Latent Semantic Indexing. The best performance was achieved with dimensional reduction rates that were 43% higher than SVD on average. Our algorithm creates basis vectors for a reduced space by iteratively "scaling" vectors and computing eigenvectors. Unlike SVD, it breaks the symmetry of documents and terms to capture information more evenly across documents. We also discuss correlation with a probabilistic model and evaluate a method for selecting the dimensionality using log-likelihood estimation.
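The abstract describes building basis vectors by iteratively "scaling" vectors and computing eigenvectors. The following is only a minimal sketch of one way to read that description, not the authors' implementation: the function name `rescaled_basis`, the exponent `q`, and the choice to scale each residual document vector by its length raised to `q` are all assumptions made for illustration.

```python
import numpy as np

def rescaled_basis(D, k, q=2.0):
    """Return a (terms x k) matrix of orthonormal basis vectors.

    D is a terms x documents matrix. The name, the exponent q, and the
    scaling rule are illustrative assumptions, not taken from the paper.
    """
    R = np.asarray(D, dtype=float).copy()    # residual document vectors
    basis = []
    for _ in range(k):
        norms = np.linalg.norm(R, axis=0)    # length of each residual column
        Rs = R * norms ** q                  # scale each column by its length ** q
        # The first left singular vector of Rs is the top eigenvector of Rs @ Rs.T.
        U, _, _ = np.linalg.svd(Rs, full_matrices=False)
        b = U[:, 0]
        basis.append(b)
        R = R - np.outer(b, b @ R)           # remove the component along b
    return np.column_stack(basis)
```

Under these assumptions, `q = 0` applies no scaling and the basis coincides (up to sign) with the leading left singular vectors of `D`, i.e. the plain SVD used for LSI; larger `q` gives more weight to documents that are still poorly represented.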
Iterative Residual Rescaling: An Analysis and Generalization of LSI
- IN PROC. OF THE 24TH INTERNATIONAL ACM SIGIR, 2001
Cited by 18 (1 self)
We consider the problem of creating document representations in which inter-document similarity measurements correspond to semantic similarity. We first present a novel subspace-based framework for formalizing this task. Using this framework, we derive a new analysis of Latent Semantic Indexing (LSI), showing a precise relationship between its performance and the uniformity of the underlying distribution of documents over topics. This analysis helps explain the improvements gained by Ando’s (2000) Iterative Residual Rescaling (IRR) algorithm: IRR can compensate for distributional non-uniformity. A further benefit of our framework is that it provides a well-motivated, effective method for automatically determining the rescaling factor IRR depends on, leading to further improvements. A series of experiments over various settings and with several evaluation metrics validates our claims.
A Framework for Understanding Latent Semantic Indexing (LSI) Performance
- INFORMATION PROCESSING AND MANAGEMENT, 2006
Cited by 17 (1 self)
In this paper we present a theoretical model for understanding the performance of Latent Semantic Indexing (LSI) search and retrieval applications. Many models for understanding LSI have been proposed. Ours is the first to study the values produced by LSI in the term dimension vectors. The framework presented here is based on term co-occurrence data. We show a strong correlation between second order term co-occurrence and the values produced by the Singular Value Decomposition (SVD) algorithm that forms the foundation for LSI. We also present a mathematical proof that the SVD algorithm encapsulates term co-occurrence information.
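To make "second order term co-occurrence" concrete, here is a small sketch that derives first- and second-order co-occurrence from a term-document matrix. The function name `cooccurrence_orders` and the working definition (two terms are second-order co-occurring when they never share a document but both share a document with some third term) are assumptions for illustration; the paper may define the notion differently.

```python
import numpy as np

def cooccurrence_orders(A):
    """A is a terms x documents matrix; any positive entry counts as an occurrence.

    Returns (first, second):
      first[i, j]  = number of documents containing both term i and term j
      second[i, j] = True if i and j never co-occur directly but are linked
                     through at least one intermediate term (assumed definition).
    """
    A = (np.asarray(A) > 0).astype(int)
    first = A @ A.T                      # shared-document counts
    np.fill_diagonal(first, 0)           # ignore a term paired with itself
    adj = (first > 0).astype(int)        # direct co-occurrence graph
    paths2 = adj @ adj                   # number of two-step paths between terms
    second = (paths2 > 0) & (adj == 0)   # reachable in two steps, not in one
    np.fill_diagonal(second, False)
    return first, second

# Toy collection: doc 0 contains terms a, b; doc 1 contains terms b, c.
A = [[1, 0],   # a
     [1, 1],   # b
     [0, 1]]   # c
first, second = cooccurrence_orders(A)
```

In the toy collection, `a` and `c` never appear together, yet both appear with `b`, so they form a second-order pair.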
Order-Theoretical Ranking
- JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCES (JASIS), 2000
Cited by 15 (3 self)
Current best-match ranking (BMR) systems perform well but cannot handle word mismatch between a query and a document. The best known alternative ranking method, hierarchical clustering-based ranking (HCR), seems to be more robust than BMR with respect to this problem, but it is hampered by theoretical and practical limitations. We present an approach to document ranking that explicitly addresses the word mismatch problem by exploiting interdocument similarity information in a novel way. Document ranking is seen as a query-document transformation driven by a conceptual representation of the whole document collection, into which the query is merged. Our approach is based on the theory of concept (or Galois) lattices, which, we argue, provides a powerful, well-founded, and computationally tractable framework to model the space in which documents and query are represented and to compute such a transformation. We compared information retrieval using concept lattice-based ranking (CLR) to BMR and HCR. The results showed that HCR was outperformed by CLR as well as by BMR, and suggested that, of the two best methods, BMR achieved better performance than CLR on the whole document set while CLR compared more favorably when only the first retrieved documents were used for evaluation. We also evaluated the three methods' specific ability to rank documents that did not match the query, in which case the superiority of CLR over BMR and HCR (and that of HCR over BMR) was apparent.
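For readers unfamiliar with concept (Galois) lattices, the sketch below enumerates the formal concepts of a tiny document-term context by brute force. It illustrates only the standard definition of a formal concept (a set of documents paired with exactly the terms they all share); it does not reproduce the paper's ranking method, and the function name `formal_concepts` and the toy data are hypothetical.

```python
from itertools import combinations

def formal_concepts(doc_terms):
    """Enumerate all (extent, intent) pairs for a small document-term context.

    doc_terms maps a document id to the set of terms it contains.
    Brute force over term subsets, so only suitable for toy inputs.
    """
    docs = list(doc_terms)
    all_terms = sorted(set().union(*doc_terms.values()))

    def extent(terms):
        # Documents that contain every term in `terms`.
        return frozenset(d for d in docs if terms <= doc_terms[d])

    def intent(doc_ids):
        # Terms shared by every document in `doc_ids`.
        sets = [doc_terms[d] for d in doc_ids]
        return frozenset(set.intersection(*sets)) if sets else frozenset(all_terms)

    concepts = set()
    for r in range(len(all_terms) + 1):
        for combo in combinations(all_terms, r):
            e = extent(set(combo))
            concepts.add((e, intent(e)))   # closing the extent yields a concept
    return concepts

context = {"d1": {"a", "b"}, "d2": {"b", "c"}, "d3": {"b"}}
concepts = formal_concepts(context)
```

Ordering these concepts by inclusion of their extents gives the lattice; a query can be treated as one more object in the context, which is the sense in which the abstract speaks of merging the query into the representation.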
A Probabilistic Model for Latent Semantic Indexing
- JASIST, 2005
Cited by 11 (1 self)
Dimension reduction methods, such as Latent Semantic Indexing (LSI), when applied to semantic space built upon text collections, improve information retrieval, information filtering and word sense disambiguation. A new dual probability model based on the similarity concepts is introduced to provide deeper understanding of LSI. Semantic associations can be quantitatively characterized by their statistical significance, the likelihood. Semantic dimensions containing redundant and noisy information can be separated out and should be ignored because their contribution to the overall statistical significance is negative. LSI is the optimal solution of the model. The peak in the likelihood curve indicates the existence of an intrinsic semantic dimension. The importance of LSI dimensions follows the Zipf distribution, indicating that LSI dimensions represent the latent concepts. The document frequency of words follows the Zipf distribution, and the number of distinct words follows a log-normal distribution. Experiments on five standard document collections confirm and illustrate the analysis.
A Framework for Understanding LSI Performance
- PROCEEDINGS OF ACM SIGIR WORKSHOP ON MATHEMATICAL/FORMAL METHODS IN INFORMATION RETRIEVAL (ACMSIGIR MF/IR), 2003
Cited by 10 (7 self)
In this paper we present a theoretical model for understanding the performance of LSI search and retrieval applications. Many models for understanding LSI have been proposed. Ours is the first to study the values produced by LSI in the term dimension vectors. The framework presented here is based on term co-occurrence data. We show a strong correlation between second order term co-occurrence and the values produced by the SVD algorithm that forms the foundation for LSI. We also present a mathematical proof that the SVD algorithm encapsulates term co-occurrence information.
Detecting Patterns in the LSI Term-Term Matrix
- In Proceedings ICDM’02 Workshop on Foundations of Data Mining and Discovery, 2002
Cited by 8 (1 self)
... applications use techniques that explicitly or implicitly employ a limited degree of transitivity in the co-occurrence relation. In this work we show the use of higher orders of co-occurrence in the Singular Value Decomposition (SVD) algorithm and, by inference, in the systems that rely on SVD, such as LSI. Our empirical and mathematical studies prove that term co-occurrence plays a crucial role in LSI.
A Mathematical View of Latent Semantic Indexing: Tracing Term Co-Occurrences
- 2002
Cited by 4 (2 self)
Current research in Latent Semantic Indexing (LSI) shows improvements in performance for a wide variety of information retrieval systems. We propose the development of a theoretical foundation for understanding the values produced in the reduced form of the term-term matrix. We assert that LSI's use of higher orders of co-occurrence is a critical component of this study. In this work we present experiments that precisely determine the degree of co-occurrence used in LSI. We empirically demonstrate that LSI uses up to fifth order term co-occurrence. We also prove mathematically that a connectivity path exists for every nonzero element in the truncated term-term matrix computed by LSI. A complete understanding of this term transitivity is key to understanding LSI.
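The "truncated term-term matrix" can be illustrated with a few lines of NumPy. This is a sketch under the usual LSI notation, assuming the matrix in question is T_k S_k^2 T_k^T built from the rank-k SVD of the term-document matrix; the function name `truncated_term_term` and the toy data are hypothetical.

```python
import numpy as np

def truncated_term_term(A, k):
    """Rank-k term-term matrix T_k S_k^2 T_k^T from a terms x documents matrix A."""
    T, s, _ = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    Tk, sk = T[:, :k], s[:k]
    return (Tk * sk ** 2) @ Tk.T   # scale columns of T_k by s_k^2, then multiply by T_k^T

# Toy collection: doc 0 contains terms a, b; doc 1 contains terms b, c.
A = np.array([[1, 0],    # a
              [1, 1],    # b
              [0, 1]])   # c

full = truncated_term_term(A, 2)   # no truncation: equals A @ A.T
rank1 = truncated_term_term(A, 1)  # keep only the largest singular value
```

Without truncation the entry for the pair (a, c) is zero because the two terms never share a document. After truncating to rank 1 the same entry becomes nonzero, and a connectivity path a, b, c does exist in the original data, which is consistent with the path property the abstract states.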
Identification of Critical Values in Latent Semantic Indexing
- Foundations of Data Mining and Knowledge Discovery, 2005
Cited by 4 (1 self)
This paper reports the results of a study to determine the most critical elements of the T_k and S_k D_k matrices, which are input to LSI. We are interested in the impact, both in terms of retrieval quality and query run time performance, of the removal (zeroing) of a large portion of the entries in these matrices.
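As a rough illustration of "removal (zeroing)" of matrix entries, the sketch below zeroes a chosen fraction of the smallest-magnitude entries of a matrix. The abstract does not say how entries are selected, so the magnitude criterion, the function name `zero_smallest`, and the `fraction` parameter are all assumptions.

```python
import numpy as np

def zero_smallest(M, fraction):
    """Return a copy of M with the smallest-magnitude `fraction` of entries set to 0.

    Selecting entries by absolute value is an assumed criterion for illustration.
    """
    M = np.asarray(M, dtype=float)
    out = M.copy()
    n_zero = int(fraction * M.size)                     # how many entries to remove
    if n_zero == 0:
        return out
    idx = np.argsort(np.abs(M), axis=None)[:n_zero]     # flat indices of smallest |entries|
    out.flat[idx] = 0.0
    return out

M = np.array([[0.1, -2.0],
              [3.0, -0.05]])
sparse = zero_smallest(M, 0.5)
```

Applied to T_k and S_k D_k, a matrix treated this way could be stored in a sparse format, which is where any saving in query run time would come from; the effect on retrieval quality is what the study measures.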
Is it all About Connections? Factors Affecting the Performance of a Link-Based Recommender System
- Proceedings of the SIGIR 2001 Workshop on Recommender Systems, 2001
Cited by 2 (1 self)
This study reports on a recent evaluation of the similarity model used by Recommendation Explorer, an automatic recommender system. In particular, we consider the role of several system-internal factors in determining the quality of recommendation. More generally, we discuss factors in the recommendation task itself that complicate the construction and evaluation of recommender systems, and reflect on the implications of our findings for research in this area.
1 Introduction
Evaluating information retrieval systems is a notoriously steep challenge. The subjectivity of such crucial variables as the relevance and quality of information complicates the matter of evaluation. Automatic recommender systems are similar to IR systems in many ways [10, 11]. Among these similarities, the problem of evaluation is particularly vexing [14]. In this paper we focus on the evaluation problem by analyzing the factors that affect the performance of Recommendation Explorer, an automatic recommender system.

