A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval
Document Language Models, Query Models, and Risk Minimization for Information Retrieval
 In Proceedings of SIGIR’01
, 2001
Latent semantic indexing: A probabilistic analysis
, 1998
Cited by 295 (7 self)
Latent semantic indexing (LSI) is an information retrieval technique based on the spectral analysis of the term-document matrix, whose empirical success had heretofore been without rigorous prediction and explanation. We prove that, under certain conditions, LSI does succeed in capturing the underlying semantics of the corpus and achieves improved retrieval performance. We also propose the technique of random projection as a way of speeding up LSI. We complement our theorems with encouraging experimental results. We also argue that our results may be viewed in a more general framework, as a theoretical basis for the use of spectral methods in a wider class of applications such as collaborative filtering.
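The spectral analysis mentioned in this abstract is commonly realized as a truncated singular value decomposition of the term-document matrix. The following is a minimal sketch under that assumption, not the paper's own procedure. The function names (`lsi_embed`, `fold_in_query`, `cosine`) and the toy five-term, four-document matrix are hypothetical, and the random-projection speed-up is not shown.

```python
import numpy as np

def lsi_embed(term_doc, k):
    """Rank-k LSI via truncated SVD of a (n_terms, n_docs) matrix."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k, s_k = U[:, :k], s[:k]
    # Document j in the latent space is U_k^T a_j = diag(s_k) @ Vt[:k, j].
    doc_vecs = (s_k[:, None] * Vt[:k, :]).T  # shape (n_docs, k)
    return U_k, doc_vecs

def fold_in_query(query_vec, U_k):
    """Project a term-space query vector into the same latent space."""
    return U_k.T @ query_vec

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical toy corpus. Rows: car, automobile, engine, flower, petal.
# Columns: d0 = "car engine", d1 = "automobile engine",
#          d2 = d3 = "flower petal".
A = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

U_k, doc_vecs = lsi_embed(A, k=2)
query = np.array([1, 0, 0, 0, 0], dtype=float)  # the single term "car"
q_latent = fold_in_query(query, U_k)
sims = [cosine(q_latent, d) for d in doc_vecs]
```

In this toy setting the query "car" shares no term with d1, so its cosine with d1 in the raw term space is zero; in the two-dimensional latent space, the shared term "engine" places d0 and d1 on the same direction, so `sims[1]` is high while `sims[2]` and `sims[3]` stay near zero.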
A Probabilistic Relational Algebra for the Integration of Information Retrieval and Database Systems
 ACM Transactions on Information Systems
, 1994
Cited by 209 (33 self)
We present a probabilistic relational algebra (PRA) which is a generalization of standard relational algebra. Here tuples are assigned probabilistic weights giving the probability that a tuple belongs to a relation. Based on intensional semantics, the tuple weights of the result of a PRA expression always conform to the underlying probabilistic model. We also show for which expressions extensional semantics yields the same results. Furthermore, we discuss complexity issues and indicate possibilities for optimization. With regard to databases, the approach allows for representing imprecise attribute values, whereas for information retrieval, probabilistic document indexing and probabilistic search term weighting can be modelled. As an important extension, we introduce the concept of vague predicates which yields a probabilistic weight instead of a Boolean value, thus allowing for queries with vague selection conditions. So PRA implements uncertainty and vagueness in combination with the...
Combining Approaches to Information Retrieval
Cited by 111 (3 self)
The combination of different text representations and search strategies has become a standard technique for improving the effectiveness of information retrieval. Combination, for example, has been studied extensively in the TREC evaluations and is the basis of the “metasearch” engines used on the Web. This paper examines the development of this technique, including both experimental results and the retrieval models that have been proposed as formal frameworks for combination. We show that combining approaches for information retrieval can be modeled as combining the outputs of multiple classifiers based on one or more representations, and that this simple model can provide explanations for many of the experimental results. We also show that this view of combination is very similar to the inference net model, and that a new approach to retrieval based on language models supports combination and can be integrated with the inference net model.
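As a rough illustration of combining the outputs of several retrieval runs, the sketch below normalizes each run's scores to [0, 1] and sums them per document. This is a simple score-fusion scheme chosen for illustration; the abstract does not say which combination rule the paper analyzes, and the names `min_max_normalize` and `combine_sum` as well as the toy scores are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale one run's scores to [0, 1] so runs are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def combine_sum(runs):
    """Fuse runs by summing normalized scores per document.

    A document missing from a run contributes 0 for that run.
    Returns (doc, fused_score) pairs, best first, ties broken by doc id.
    """
    fused = {}
    for run in runs:
        if not run:
            continue
        for doc, s in min_max_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + s
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical scores from two retrieval strategies on different scales.
run_a = {"d1": 2.0, "d2": 1.0, "d3": 0.0}
run_b = {"d2": 10.0, "d3": 5.0, "d4": 0.0}
ranking = combine_sum([run_a, run_b])
```

Here d2 is ranked second by one run and first by the other, so summing the normalized scores moves it ahead of d1, which only one run retrieved.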
A Formal Study of Information Retrieval Heuristics
 SIGIR '04
, 2004
Cited by 96 (18 self)
Empirical studies of information retrieval methods show that good retrieval performance is closely related to the use of various retrieval heuristics, such as TF-IDF weighting. One basic research question is thus what exactly are these "necessary" heuristics that seem to cause good retrieval performance. In this paper, we present a formal study of retrieval heuristics. We formally define a set of basic desirable constraints that any reasonable retrieval function should satisfy, and check these constraints on a variety of representative retrieval functions. We find that none of these retrieval functions satisfies all the constraints unconditionally. Empirical results show that when a constraint is not satisfied, it often indicates non-optimality of the method, and when a constraint is satisfied only for a certain range of parameter values, the method's performance tends to be poor when the parameter is out of the range. In general, we find that the empirical performance of a retrieval formula is tightly related to how well it satisfies these constraints. Thus the proposed constraints provide a good explanation of many empirical observations and make it possible to evaluate any existing or new retrieval formula analytically.
Probabilistic Relevance Models Based on Document and Query Generation
 Language Modeling and Information Retrieval
, 2002
Cited by 86 (14 self)
We give a unified account of the probabilistic semantics underlying the language modeling approach and the traditional probabilistic model for information retrieval, showing that the two approaches can be viewed as being equivalent probabilistically, since they are based on different factorizations of the same generative relevance model. We also discuss how the two approaches lead to different retrieval frameworks in practice, since they involve component models that are estimated quite differently.
"Is This Document Relevant? ...Probably": A Survey of Probabilistic Models in Information Retrieval
, 2001
Cited by 64 (14 self)
This article surveys probabilistic approaches to modeling information retrieval. The basic concepts of probabilistic approaches to information retrieval are outlined and the principles and assumptions upon which the approaches are based are presented. The various models proposed in the development of IR are described, classified, and compared using a common formalism. New approaches that constitute the basis of future research are described.
A Risk Minimization Framework for Information Retrieval
 In Proceedings of the ACM SIGIR 2003 Workshop on Mathematical/Formal Methods in IR. ACM
, 2003
Cited by 61 (1 self)
This paper presents a novel probabilistic information retrieval framework in which the retrieval problem is formally treated as a statistical decision problem. In this framework, queries and documents are modeled using statistical language models (i.e., probabilistic models of text), user preferences are modeled through loss functions, and retrieval is cast as a risk minimization problem. We discuss how this framework can unify existing retrieval models and accommodate the systematic development of new retrieval models. As an example of using the framework to model nontraditional retrieval problems, we derive new retrieval models for subtopic retrieval, which is concerned with retrieving documents to cover many different subtopics of a general query topic. These new models differ from traditional retrieval models in that they go beyond independent topical relevance.
An exploration of proximity measures in information retrieval
 In Proceedings of the 30th ACM conference on research and development in information retrieval (ACM SIGIR 2007)
, 2007
Cited by 58 (6 self)
In most existing retrieval models, documents are scored primarily based on various kinds of term statistics such as within-document frequencies, inverse document frequencies, and document lengths. Intuitively, the proximity of matched query terms in a document can also be exploited to promote scores of documents in which the matched query terms are close to each other. Such a proximity heuristic, however, has been largely underexplored in the literature; it is unclear how we can model proximity and incorporate a proximity measure into an existing retrieval model. In this paper, we systematically explore the query term proximity heuristic. Specifically, we propose and study the effectiveness of five different proximity measures, each modeling proximity from a different perspective. We then design two heuristic constraints and use them to guide us in incorporating the proposed proximity measures into an existing retrieval model. Experiments on five standard TREC test collections show that one of the proposed proximity measures is indeed highly correlated with document relevance, and by incorporating it into the KL-divergence language model and the Okapi BM25 model, we can significantly improve retrieval performance.
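To make the idea of a query term proximity measure concrete, the sketch below computes the smallest positional distance between occurrences of any two distinct matched query terms. It is an illustrative distance-based measure only; the abstract does not define the five measures studied, and the names `term_positions` and `min_pair_distance` and the example sentence are hypothetical.

```python
from itertools import combinations

def term_positions(doc_tokens, query_terms):
    """Map each matched query term to the token positions where it occurs."""
    positions = {}
    for i, tok in enumerate(doc_tokens):
        if tok in query_terms:
            positions.setdefault(tok, []).append(i)
    return positions

def min_pair_distance(doc_tokens, query_terms):
    """Smallest distance between occurrences of two distinct query terms.

    Returns None when fewer than two distinct query terms are matched,
    since no pairwise distance is defined in that case.
    """
    positions = term_positions(doc_tokens, set(query_terms))
    if len(positions) < 2:
        return None
    best = None
    for a, b in combinations(positions, 2):
        d = min(abs(i - j) for i in positions[a] for j in positions[b])
        best = d if best is None else min(best, d)
    return best

doc = "the quick brown fox jumps over the lazy dog near the fox".split()
distance = min_pair_distance(doc, ["fox", "dog"])
```

A smaller distance would then be turned into a larger score adjustment when the measure is added to a base retrieval function; how that transformation is done is left open here.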