## Statistical Language Models for Information Retrieval. Tutorial Presentation at the 29th Annual International ACM SIGIR Conference (2006)

Venue: 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006)

Citations: 59 (5 self)

### BibTeX

@INPROCEEDINGS{Zhai06statisticallanguage,
  author    = {Chengxiang Zhai},
  title     = {Statistical Language Models for Information Retrieval. Tutorial Presentation},
  booktitle = {29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006)},
  year      = {2006}
}


### Abstract

Statistical language models have recently been successfully applied to many information retrieval problems. A great deal of recent work has shown that statistical language models not only lead to superior empirical performance, but also facilitate parameter tuning and open up possibilities for modeling nontraditional retrieval problems. In general, statistical language models provide a principled way of modeling various kinds of retrieval problems. The purpose of this survey is to systematically and critically review the existing work in applying statistical language models to information retrieval, summarize their contributions, and point out outstanding challenges.

### Citations

703 | A Study of Smoothing Methods for Language Models Applied to Information Retrieval
- Lafferty, Zhai
- 2004
Citation Context: ...ge modeling approaches to retrieval have led to effective retrieval functions without much heuristic design. In particular, the query likelihood retrieval function [80] with Dirichlet prior smoothing [124] has comparable performance to the most effective TF-IDF weighting retrieval functions including BM25 [24]. Due to their good empirical performance and great potential of leveraging statistical estima...
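The Dirichlet prior smoothing mentioned in this excerpt interpolates document counts with a collection language model using a document-length-dependent weight. A minimal sketch of query likelihood scoring with this smoothing (function names and the toy corpus are illustrative, not from the survey; every query word is assumed to occur in the collection):

```python
import math
from collections import Counter

def query_likelihood_dirichlet(query, doc, collection, mu=2000.0):
    """Log query likelihood with Dirichlet prior smoothing:
    p(w|D) = (c(w, D) + mu * p(w|C)) / (|D| + mu)."""
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    doc_len, coll_len = len(doc), len(collection)
    score = 0.0
    for w in query:
        p_wc = coll_counts[w] / coll_len          # background model p(w|C)
        p_wd = (doc_counts[w] + mu * p_wc) / (doc_len + mu)
        score += math.log(p_wd)
    return score
```

A document containing the query term scores higher than one that only matches through the background model, which is the behavior the excerpt compares against TF-IDF/BM25.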

467 | Query expansion using local and global document analysis
- Xu, Croft
- 1996
Citation Context: ... feedback with language models. The observation that local co-occurrence analysis is more effective than global co-occurrence analysis is also consistent with a study of a traditional retrieval model [116]. Intuitively, this is because the local documents (i.e., documents close to the query) can prevent noisy words from being picked from feedback due to distracting co-occurrences. In Collins-Thompson a...

230 | Evaluation of an inference network-based retrieval model - Turtle, Croft - 1991 |

202 | Language modeling for information retrieval
- Croft, Lafferty
- 2003
Citation Context: ...hrough smoothing, we implicitly penalize words that are common in the collection (with high p(qi|C)). This also explains why we can model the noise in the query through more aggressive smoothing. See [125] for more discussion about this. The equation above also shows that computing the query likelihood scoring function using any smoothing method based on a collection language model is as efficient as c...
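The "equation above" that the excerpt refers to is the standard rearrangement of the smoothed query likelihood. For any smoothing method that uses p_s(w|D) for words seen in the document and α_D · p(w|C) for unseen words, the decomposition can be sketched as:

```latex
\log p(Q|D)
  = \sum_{\substack{w:\, c(w,Q)>0 \\ c(w,D)>0}} c(w,Q)\,\log\frac{p_s(w|D)}{\alpha_D\, p(w|C)}
  \;+\; |Q|\,\log \alpha_D
  \;+\; \sum_{w:\, c(w,Q)>0} c(w,Q)\,\log p(w|C)
```

The last sum is independent of the document, so scoring only touches query terms matched in the document (as efficient as TF-IDF scoring over an inverted index), and the denominator α_D · p(w|C) in the first sum is exactly the implicit penalty on collection-common words that the excerpt describes.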

174 | Model-based feedback in the language modeling approach to information retrieval - Zhai, Lafferty |

164 | Cluster-based language models for distributed retrieval
- Xu, Croft
- 1999
Citation Context: ...o the document language model) and models the retrieval problem as a statistical decision problem [50, 121, 127]. However, KL-divergence had previously been used for distributed information retrieval [117]. By truncating the query model θQ to keep only high probability words and renormalizing it, we can score a KL-divergence model efficiently. Indeed, we may rewrite the scoring function in the same way...
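The truncate-and-renormalize step described here can be sketched as follows. Names are illustrative, and the document model θD is assumed to be already smoothed so every scored word has nonzero probability:

```python
import math

def truncate_query_model(theta_q, k=20):
    """Keep the k highest-probability words of the query model and
    renormalize so the kept probabilities sum to one."""
    top = sorted(theta_q.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {w: p / total for w, p in top}

def kl_rank_score(theta_q, theta_d):
    """Rank-equivalent part of -KL(theta_q || theta_d):
    sum_w p(w|theta_q) * log p(w|theta_d).  With a truncated query
    model the sum touches only a handful of terms."""
    return sum(p * math.log(theta_d[w]) for w, p in theta_q.items())
```

The document-entropy term of the KL-divergence does not affect ranking, which is why only the cross term needs to be computed.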

146 | Beyond independent relevance: methods and evaluation metrics for subtopic retrieval
- Zhai, Cohen, et al.
- 2003
Citation Context: ...cy has been observed [21]. 6.8 Subtopic Retrieval The subtopic retrieval task represents an interesting retrieval task because it requires modeling the dependency of relevance of individual documents [122]. Given a topic query, the task of subtopic retrieval is to retrieve documents that can cover as many subtopics of the topic as possible. If we are to solve the problem with a traditional retrieval mo...

144 | LDA-based document models for ad-hoc retrieval
- Wei, Croft
- 2006
Citation Context: ...y given the clusters containing the document. A soft clustering strategy has been adopted to smooth document language models through using the Latent Dirichlet Allocation (LDA) model to do clustering [113]. With this model, we allow a document to be in multiple topics (roughly like document clusters, but characterized by unigram language models) with some uncertainties. Thus smoothing of a document can...
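The soft-clustering smoothing described here can be sketched as interpolating the maximum-likelihood document model with a topic-based model. The LDA inference itself is assumed to have already produced topic-word and document-topic distributions; all names and parameters are illustrative:

```python
from collections import Counter

def lda_smoothed_doc_model(doc, topic_word, doc_topic, lam=0.7):
    """Soft-clustering smoothing sketch:
    p(w|D) = lam * p_ml(w|D) + (1 - lam) * sum_k p(w|z_k) * p(z_k|D),
    where topic_word[k] is p(w|z_k) and doc_topic[k] is p(z_k|D)."""
    counts = Counter(doc)
    n = len(doc)
    vocab = set(counts) | {w for t in topic_word for w in t}
    model = {}
    for w in vocab:
        p_ml = counts[w] / n
        p_topic = sum(topic_word[k].get(w, 0.0) * doc_topic[k]
                      for k in range(len(topic_word)))
        model[w] = lam * p_ml + (1 - lam) * p_topic
    return model
```

Because a document mixes several topics with some weight each, words it never contains can still receive probability through the topics it belongs to, which is the "soft" part of the smoothing.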

110 | Novelty and redundancy detection in Adaptive Filtering
- Zhang, Callan, et al.
- 2002
Citation Context: ...vely captures the redundancy. λ∗ can be computed using the EM algorithm. The novelty can be defined as 1 − λ∗. A similar but slightly more sophisticated three-component mixture model was proposed in [131] in order to capture novelty in information filtering. Note that the redundancy/novelty captured in this way is asymmetric in the sense that if we switch the roles of D1 and D2, the redundancy value w...
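A minimal EM sketch for the two-component mixture weight λ with the component word distributions held fixed, following the idea in the excerpt (all names are illustrative, and the component models are simplified placeholders, not the exact models of the cited work):

```python
from collections import Counter

def mixture_weight_em(doc_words, p_old, p_bg, iters=50):
    """EM for lambda in p(w) = lambda*p_old(w) + (1-lambda)*p_bg(w).
    In the redundancy/novelty setting, p_old would model the earlier
    document D1, p_bg a background model, and the novelty of the new
    document D2 is then 1 - lambda*."""
    counts = Counter(doc_words)
    total = sum(counts.values())
    lam = 0.5
    for _ in range(iters):
        acc = 0.0
        for w, c in counts.items():
            po, pb = p_old.get(w, 0.0), p_bg.get(w, 0.0)
            denom = lam * po + (1 - lam) * pb
            # E-step: posterior probability the word came from p_old
            acc += c * (lam * po / denom if denom > 0 else 0.0)
        # M-step: new lambda = expected fraction of "old" words
        lam = acc / total
    return lam
```

If the new document is fully explained by the old-document model, λ∗ approaches 1 and the novelty 1 − λ∗ approaches 0, matching the excerpt's definition.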

100 | On modeling information retrieval with probabilistic inference
- Wong, Yao
- 1995
Citation Context: ...The field has progressed in two different ways. On the one hand, theoretical models have been proposed often to model relevance through inferences; representative models include the logic models [27, 111, 115] and the inference network model [109]. However, these models, while theoretically interesting, have not been able to directly lead to empirically effective models, even though heuristic instantiation...

58 | A cross-collection mixture model for comparative text mining
- Zhai, Velivelli, et al.
- 2004
Citation Context: ...criminative. PLSA has also been extended in several studies, mostly to accommodate a topic hierarchy [36], incorporate context variables such as time and location [66], and analyze sentiments [65]. In [130], a background topic is introduced to PLSA to make the extracted topic models focus more on the content words rather than the common words in the collection. 6.10 Summary In this section, we review...

49 | Evaluating a probabilistic model for cross-lingual information retrieval
- Xu, Weischedel, et al.
- 2001
Citation Context: ...kes an important contribution in extending the basic query likelihood retrieval model. Such a model has later been used successfully in applying language models to cross-lingual information retrieval [118]. The cluster-based query likelihood method proposed in [48] can also be regarded as a form of a translation model where the whole document is “translated” into a query as a single unit through a set ...

47 | A risk minimization framework for information retrieval
- Zhai, Lafferty
- 2006
Citation Context: ...inimization retrieval framework, which introduces the concept of query language model (in addition to the document language model) and models the retrieval problem as a statistical decision problem [50, 121, 127]. However, KL-divergence had previously been used for distributed information retrieval [117]. By truncating the query model θQ to keep only high probability words and renormalizing it, we can score a...

39 | Regularized estimation of mixture models for robust pseudo-relevance feedback
- Tao, Zhai
Citation Context: ...methods (especially divergence minimization) are also shown to be sensitive to parameter settings. There has been some follow-up work on improving the robustness of the mixture model feedback method [107, 108]. In Tao and Zhai [107], the mixture model was extended to better integrate the original query model with the feedback documents and to allow each feed...

36 | Risk Minimization and Language Modeling in Text Retrieval
- Zhai
- 2002
Citation Context: ...oring is effective as a retrieval method, Zhai and Lafferty studied the robustness of query likelihood scoring and examined how retrieval performance is affected by different strategies for smoothing [121, 124, 126]. Through comparing several different smoothing methods, they have observed: (1) retrieval performance is sensitive to the setting of smoothing parameters and the choice of smoothing methods; (2) the ...
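Two of the most common choices in such comparisons are Jelinek-Mercer (fixed-weight) and Dirichlet-prior (length-dependent) interpolation with the collection model. A sketch of the two estimators (function names and parameter defaults are illustrative):

```python
def p_jm(count_w, doc_len, p_c, lam=0.1):
    """Jelinek-Mercer: interpolate the ML estimate with the collection
    model p(w|C) using a fixed coefficient lambda."""
    return (1 - lam) * (count_w / doc_len) + lam * p_c

def p_dirichlet(count_w, doc_len, p_c, mu=2000.0):
    """Dirichlet prior: mu * p(w|C) pseudo-counts; the effective
    interpolation weight mu / (doc_len + mu) shrinks as the document
    gets longer, unlike the fixed Jelinek-Mercer weight."""
    return (count_w + mu * p_c) / (doc_len + mu)
```

The sensitivity finding in the excerpt is about exactly such choices: the two formulas smooth quite differently as λ, μ, and document length vary.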

34 | Fast statistical parsing of noun phrases for document indexing
- Zhai
- 1997
Citation Context: ...bservation on these models is consistent with what researchers have observed on some early effort on applying natural language processing techniques to improve indexing, notably phrase-based indexing [23, 56, 104, 120]. A more successful retrieval model that can capture limited dependencies is the Markov Random Field model proposed in [68]. This model is a general discriminative model where arbitrary features can b...

33 | Bayesian extension to the language model for ad hoc information retrieval
- Zaragoza, Hiemstra, et al.
- 2003
Citation Context: ...s to consider this uncertainty and use the posterior distribution of θD (i.e., p(θD|D)) to compute the query likelihood. Such a full Bayesian treatment was proposed and studied in Zaragoza and others [119]. Their new scoring function is p(Q|D) = ∫ p(Q|θD)p(θD|D)dθD. The regular query likelihood scoring formula can be seen as a special case of this more general query likelihood when we assume that p(θD|...
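The excerpt is cut off mid-sentence; the special case it describes can be sketched as follows: when the posterior over document models is a point mass at a single estimate, the Bayesian score collapses to the regular query likelihood.

```latex
p(Q|D) = \int p(Q|\theta_D)\, p(\theta_D|D)\, d\theta_D,
\qquad
p(\theta_D|D) = \delta(\theta_D - \hat{\theta}_D)
\;\Longrightarrow\;
p(Q|D) = p(Q|\hat{\theta}_D)
```

Here \(\hat{\theta}_D\) stands for whatever single estimate (e.g., a smoothed ML estimate) the regular query likelihood method uses.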

23 | A probability distribution model for information retrieval
- Wong, Yao
- 1989
Citation Context: ...sentation problem. Such a connection was actually recognized in some early work, but no previous work has looked into the problem of how to estimate such a model accurately. For example, Wong and Yao [114] proposed to use a multinomial model to represent a document, but they just used the ML estimator and did not further study the estimation problem. 2.2 BBN and Twenty-One in TREC-7 At about the same t...

23 | Document quality models for Web ad hoc retrieval
- Zhou, Croft
Citation Context: ...y queries.” In a study by Kurland and Lee [49], a PageRank score computed using induced links between documents based on document similarity has been used as a prior to improve retrieval accuracy. In [132], priors to capture document quality are shown to be effective for improving the accuracy of the top-ranked documents in ad hoc web search. 2.4 Summary In this ...

10 | A theoretical basis for the use of co-occurrence data in information retrieval
- van Rijsbergen
- 1977
Citation Context: ...stantiations of them can be effective. On the other hand, there have been many empirical studies of models, including many variants of the vector space model [89, 90, 91, 96] and probabilistic models [26, 51, 80, 83, 110, 109]. The vector-space model with heuristic TF-IDF weighting and document length normalization has traditionally been one of the most effective retrieval models, and it remains quite competitive as a stat...

9 | A non-classical logic for information retrieval
- van Rijsbergen
- 1986
Citation Context: ...The field has progressed in two different ways. On the one hand, theoretical models have been proposed often to model relevance through inferences; representative models include the logic models [27, 111, 115] and the inference network model [109]. However, these models, while theoretically interesting, have not been able to directly lead to empirically effective models, even though heuristic instantiation...

8 | A mixture clustering model for pseudo feedback in information retrieval
- Tao, Zhai
- 2004
Citation Context: ...methods (especially divergence minimization) are also shown to be sensitive to parameter settings. There has been some follow-up work on improving the robustness of the mixture model feedback method [107, 108]. In Tao and Zhai [107], the mixture model was extended to better integrate the original query model with the feedback documents and to allow each feed...

5 | Improve retrieval accuracy for difficult queries using negative feedback
- Wang, Fang, et al.
- 2007
Citation Context: ...ck; in particular, feedback based on only negative information (i.e., nonrelevant information) remains challenging with the KL-divergence retrieval model, and additional heuristics may need to be used [112]. This is in contrast to the document-generation probabilistic models such as the Robertson–Sparck Jones model [83], which can naturally use negative examples to improve the estimate of the nonrelevant ...

1 | Improving the robustness of language models
- Zhai, Tao, et al.
- 2003
Citation Context: ...models have now been applied to multiple retrieval tasks such as cross-lingual retrieval [54], distributed IR [95], expert finding [25], passage retrieval [59], web search [47, 76], genomics retrieval [129], topic tracking [41, 53, 99], and subtopic retrieval [122]. This survey systematically reviews this development of the language modeling approaches. We will survey a wide range ...