Results 11–20 of 229
Statistical Language Models for Information Retrieval
Tutorial presentation at the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006), 2006
Abstract

Cited by 68 (6 self)
Statistical language models have recently been successfully applied to many information retrieval problems. A great deal of recent work has shown that statistical language models not only lead to superior empirical performance, but also facilitate parameter tuning and open up possibilities for modeling nontraditional retrieval problems. In general, statistical language models provide a principled way of modeling various kinds of retrieval problems. The purpose of this survey is to systematically and critically review the existing work in applying statistical language models to information retrieval, summarize their contributions, and point out outstanding challenges.
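The query-likelihood recipe at the core of this family of models (rank each document by the probability of the query under a smoothed document language model) can be sketched as follows. This is a minimal illustration, not code from the survey; Dirichlet-prior smoothing is one common choice, and the function name and the mu value are assumptions.

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, mu=2000.0):
    """Score a document by the log-likelihood of the query under a
    Dirichlet-smoothed document model:
    p(w|d) = (c(w;d) + mu * p(w|C)) / (|d| + mu)."""
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    coll_len = len(collection)
    score = 0.0
    for w in query:
        p_wc = coll_tf[w] / coll_len                      # background model
        p_wd = (doc_tf[w] + mu * p_wc) / (len(doc) + mu)  # smoothed doc model
        score += math.log(p_wd)
    return score
```

With Dirichlet smoothing the amount of smoothing shrinks for longer documents, which is one reason this simple estimator behaves well empirically.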
Applying Discrete PCA in Data Analysis, 2004
Abstract

Cited by 58 (11 self)
Methods for analysis of principal components in discrete data have existed for some time under various names such as grade of membership modelling, probabilistic latent semantic analysis, and genotype inference with admixture. In this paper we explore a number of extensions to the common theory, and present some applications of these methods to common statistical tasks. We show that these methods can be interpreted as a discrete version of ICA. We develop a hierarchical version yielding components at different levels of detail, and additional techniques for Gibbs sampling. We compare the algorithms on a text prediction task against support vector machines, and on an information retrieval task.
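As a concrete illustration of the Gibbs-sampling side of such discrete component models, here is a toy collapsed Gibbs sampler for an LDA-style model. It is a sketch under the usual Dirichlet-multinomial assumptions; the function name, hyperparameters, and data layout are illustrative, not taken from the paper.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for an LDA-style discrete component model.
    Returns doc-topic counts, topic-word counts, and the vocabulary."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    wid = {w: i for i, w in enumerate(vocab)}
    # z[d][n]: current topic assignment of token n in document d
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    ndk = [[0] * n_topics for _ in docs]      # doc-topic counts
    nkw = [[0] * V for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                       # topic totals
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d][k] += 1; nkw[k][wid[w]] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k, v = z[d][n], wid[w]
                # remove the token, resample its topic, add it back
                ndk[d][k] -= 1; nkw[k][v] -= 1; nk[k] -= 1
                weights = [(ndk[d][j] + alpha) * (nkw[j][v] + beta) / (nk[j] + V * beta)
                           for j in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][n] = k
                ndk[d][k] += 1; nkw[k][v] += 1; nk[k] += 1
    return ndk, nkw, vocab
```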
Linear discriminant model for information retrieval
In Proceedings of SIGIR 2005, 2005
Abstract

Cited by 56 (16 self)
This paper presents a new discriminative model for information retrieval (IR), referred to as linear discriminant model (LDM), which provides a flexible framework to incorporate arbitrary features. LDM is different from most existing models in that it takes into account a variety of linguistic features that are derived from the component models of the HMM that is widely used in language-modeling approaches to IR. Therefore, LDM is a means of melding discriminative and generative models for IR. We present two parameter-learning algorithms for LDM. One optimizes the average precision (AP) directly using an iterative procedure. The other is a perceptron-based algorithm that minimizes the number of discordant document pairs in a ranked list. The effectiveness of our approach has been evaluated on the task of ad hoc retrieval using six English and Chinese TREC test sets. Results show that (1) in most test sets, LDM significantly outperforms the state-of-the-art language-modeling approaches and the classical probabilistic retrieval model; (2) it is more appropriate to train LDM using a measure of AP rather than likelihood if the IR system is graded on AP; and (3) linguistic features (e.g. phrases and dependencies) are effective for IR if they are incorporated properly.
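The second algorithm, a perceptron driven by discordant pairs, might be sketched like this. It is an illustrative reconstruction, not the authors' code; the feature representation, learning rate, and stopping rule are assumptions.

```python
def rank_perceptron(pairs, n_features, epochs=50, lr=0.1):
    """Pairwise ranking perceptron: for each (preferred, nonpreferred)
    feature-vector pair, update the weights whenever the pair is
    discordant (scored in the wrong order) under the current weights."""
    w = [0.0] * n_features
    for _ in range(epochs):
        updated = False
        for pos, neg in pairs:
            # margin on the difference vector; <= 0 means discordant
            if sum(wi * (p - n) for wi, p, n in zip(w, pos, neg)) <= 0:
                for i in range(n_features):
                    w[i] += lr * (pos[i] - neg[i])
                updated = True
        if not updated:  # all pairs concordant: converged
            break
    return w

def score(w, x):
    """Linear score of a feature vector under the learned weights."""
    return sum(wi * xi for wi, xi in zip(w, x))
```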
Corpus Structure, Language Models, and Ad Hoc Information Retrieval
Abstract

Cited by 56 (15 self)
Most previous work on the recently developed language-modeling approach to information retrieval focuses on document-specific characteristics, and therefore does not take into account the structure of the surrounding corpus. We propose a novel algorithmic framework in which information provided by document-based language models is enhanced by the incorporation of information drawn from clusters of similar documents. Using this framework, we develop a suite of new algorithms. Even the simplest typically outperforms the standard language-modeling approach in precision and recall, and our new interpolation algorithm posts statistically significant improvements for both metrics over all three corpora tested.
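A minimal sketch of the cluster-based enhancement idea, assuming a simple linear interpolation of document, cluster, and corpus models; the mixture form and weights are illustrative, not the paper's exact algorithm.

```python
from collections import Counter

def interpolated_lm(doc, cluster, collection, lam=0.5, mu=0.3):
    """Return a word-probability function mixing three maximum-likelihood
    models: p(w) = lam*p(w|doc) + mu*p(w|cluster) + (1-lam-mu)*p(w|corpus).
    The cluster model smooths the document model toward similar documents."""
    d, c, C = Counter(doc), Counter(cluster), Counter(collection)
    ld, lc, lC = len(doc), len(cluster), len(collection)

    def p(w):
        return lam * d[w] / ld + mu * c[w] / lc + (1 - lam - mu) * C[w] / lC

    return p
```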
Contextual Search and Name Disambiguation in Email using Graphs
SIGIR, 2006
Abstract

Cited by 52 (10 self)
Similarity measures for text have historically been an important tool for solving information retrieval problems. In many interesting settings, however, documents are often closely connected to other documents, as well as other non-textual objects: for instance, email messages are connected to other messages via header information. In this paper we consider extended similarity metrics for documents and other objects embedded in graphs, facilitated via a lazy graph walk. We provide a detailed instantiation of this framework for email data, where content, social networks and a timeline are integrated in a structural graph. The suggested framework is evaluated for two email-related problems: disambiguating names in email documents, and threading. We show that reranking schemes based on the graph-walk similarity measures often outperform baseline methods, and that further improvements can be obtained by use of appropriate learning methods.
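A lazy graph walk of the kind described can be sketched as follows. This is an illustrative reconstruction: with probability `stay` the walker remains in place, otherwise it moves to a uniformly chosen neighbour; the parameter values and the uniform transition model are assumptions.

```python
def lazy_walk(adj, start, stay=0.5, steps=10):
    """Propagate a probability distribution from `start` over an
    adjacency-list graph for `steps` iterations of a lazy walk.
    The resulting node probabilities serve as similarities to `start`."""
    dist = {start: 1.0}
    for _ in range(steps):
        nxt = {}
        for node, p in dist.items():
            nxt[node] = nxt.get(node, 0.0) + stay * p   # stay put
            nbrs = adj.get(node, [])
            if nbrs:
                share = (1 - stay) * p / len(nbrs)      # move uniformly
                for m in nbrs:
                    nxt[m] = nxt.get(m, 0.0) + share
            else:
                nxt[node] += (1 - stay) * p             # dead end: stay
        dist = nxt
    return dist
```

Nodes reachable by short paths from the start node accumulate more probability mass, which is what makes the walk usable as a similarity measure.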
A Risk Minimization Framework for Information Retrieval
In Proceedings of the ACM SIGIR 2003 Workshop on Mathematical/Formal Methods in IR. ACM, 2003
Abstract

Cited by 50 (1 self)
This paper presents a novel probabilistic information retrieval framework in which the retrieval problem is formally treated as a statistical decision problem. In this framework, queries and documents are modeled using statistical language models (i.e., probabilistic models of text), user preferences are modeled through loss functions, and retrieval is cast as a risk minimization problem. We discuss how this framework can unify existing retrieval models and accommodate the systematic development of new retrieval models. As an example of using the framework to model nontraditional retrieval problems, we derive new retrieval models for subtopic retrieval, which is concerned with retrieving documents to cover many different subtopics of a general query topic. These new models differ from traditional retrieval models in that they go beyond independent topical relevance.
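The decision-theoretic core of such a framework can be written as ranking documents by expected loss (risk) under the posterior query and document models. A hedged sketch of the general form, with L the loss function encoding user preferences:

```latex
% Rank document d for query q by ascending risk:
R(d; q) = \int_{\Theta_Q} \int_{\Theta_D}
    L(\theta_Q, \theta_D)\,
    p(\theta_Q \mid q)\, p(\theta_D \mid d)\,
    \mathrm{d}\theta_D \, \mathrm{d}\theta_Q
```

Under point estimates of the two models and a KL-divergence loss, a risk of this form reduces to ranking documents by the divergence between the query model and the document model, recovering a standard language-modeling ranking as a special case.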
An exploration of proximity measures in information retrieval
In Proceedings of the 30th ACM Conference on Research and Development in Information Retrieval (ACM SIGIR 2007), 2007
Abstract

Cited by 48 (5 self)
In most existing retrieval models, documents are scored primarily based on various kinds of term statistics such as within-document frequencies, inverse document frequencies, and document lengths. Intuitively, the proximity of matched query terms in a document can also be exploited to promote scores of documents in which the matched query terms are close to each other. Such a proximity heuristic, however, has been largely underexplored in the literature; it is unclear how we can model proximity and incorporate a proximity measure into an existing retrieval model. In this paper, we systematically explore the query term proximity heuristic. Specifically, we propose and study the effectiveness of five different proximity measures, each modeling proximity from a different perspective. We then design two heuristic constraints and use them to guide us in incorporating the proposed proximity measures into an existing retrieval model. Experiments on five standard TREC test collections show that one of the proposed proximity measures is indeed highly correlated with document relevance, and by incorporating it into the KL-divergence language model and the Okapi BM25 model, we can significantly improve retrieval performance.
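One simple instance of such a proximity measure, the minimum distance between the positions of any two distinct matched query terms, can be sketched as follows (illustrative only; the paper's five measures are not reproduced here):

```python
def min_pair_distance(query_terms, doc_tokens):
    """Smallest positional distance in the document between occurrences of
    any two *distinct* matched query terms; smaller means tighter proximity.
    Quadratic in the number of matched positions (fine for a sketch)."""
    positions = {}
    for i, tok in enumerate(doc_tokens):
        if tok in query_terms:
            positions.setdefault(tok, []).append(i)
    matched = list(positions)
    if len(matched) < 2:
        return len(doc_tokens)  # fewer than two distinct terms: worst case
    best = len(doc_tokens)
    for a in range(len(matched)):
        for b in range(a + 1, len(matched)):
            for i in positions[matched[a]]:
                for j in positions[matched[b]]:
                    best = min(best, abs(i - j))
    return best
```

A score bonus that decays with this distance is one natural way to fold such a measure into an existing retrieval function.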
Using Temporal Profiles of Queries for Precision Prediction, 2004
Abstract

Cited by 48 (5 self)
A key missing component in information retrieval systems is self-diagnostic tests to establish whether the system can provide reasonable results for a given query on a document collection. If we can measure properties of a retrieved set of documents which allow us to predict average precision, we can automate the decision of whether to elicit relevance feedback, or modify the retrieval system in other ways. We use metadata attached to documents in the form of time stamps to measure the distribution of documents retrieved in response to a query, over the time domain, to create a temporal profile for a query. We define some useful features over this temporal profile. We find that using these temporal features, together with the content of the documents retrieved, we can improve the prediction of average precision for a query.
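A sketch of building such a temporal profile from document timestamps, plus an entropy feature measuring how concentrated the profile is. The names, the bucket size, and the choice of entropy as the feature are assumptions for illustration, not the paper's exact feature set.

```python
import math
from collections import Counter

def temporal_profile(timestamps, bucket=86400):
    """Bucket document timestamps (seconds; default bucket = one day) and
    normalise the counts into a probability distribution over time."""
    counts = Counter(t // bucket for t in timestamps)
    total = sum(counts.values())
    return {b: c / total for b, c in sorted(counts.items())}

def profile_entropy(profile):
    """Shannon entropy of the profile: low entropy means the retrieved set
    is temporally concentrated, which can signal a focused result set."""
    return -sum(p * math.log(p) for p in profile.values() if p > 0)
```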