## Multi-Style Language Model for Web Scale Information Retrieval


Citations: 6 (3 self)

### BibTeX

```bibtex
@MISC{Wang_multi-stylelanguage,
  author = {Kuansan Wang and Xiaolong Li and Jianfeng Gao},
  title  = {Multi-Style Language Model for Web Scale Information Retrieval},
  year   = {}
}
```


### Abstract

Web documents are typically associated with many text streams, including the body, the title, and the URL that are determined by the authors, and the anchor text or search queries used by others to refer to the documents. Through a systematic large scale analysis on their cross entropy, we show that these text streams appear to be composed in different language styles, and hence warrant respective language models to properly describe their properties. We propose a language modeling approach to Web document retrieval in which each document is characterized by a mixture model with components corresponding to the various text streams associated with the document. Immediate issues arise for such a mixture model because not all text streams are present for every document, and the streams do not share the same lexicon, making it challenging to properly combine the statistics from the mixture components. To address these issues, we introduce an "open-vocabulary" smoothing technique so that all the component language models have the same cardinality and their scores can simply be linearly combined. To ensure that the approach can cope with Web scale applications, the model training algorithm is designed to require no labeled data and can be fully automated with few heuristics and no empirical parameter tuning. The evaluation on Web document ranking tasks shows that the component language models indeed have varying degrees of capabilities as predicted by the cross-entropy analysis, and the combined mixture model outperforms the state-of-the-art BM25F based system.
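The cross-entropy comparison described in the abstract can be illustrated with a minimal sketch (toy data; the add-alpha smoothing and token streams are invented for illustration): estimate a unigram model from one stream and measure how well it predicts another. A stream composed in a different style yields a higher cross entropy.

```python
from collections import Counter
import math

def unigram_model(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}

def cross_entropy(tokens, model):
    """Average negative log2-probability of tokens under the model (bits/token)."""
    return -sum(math.log2(model[t]) for t in tokens) / len(tokens)

body = "the cat sat on the mat near the door".split()
title = "cat mat door".split()            # composed in a style close to the body
query = "buy cheap mat online".split()    # composed in a different style
vocab = set(body) | set(title) | set(query)

m_body = unigram_model(body, vocab)
# The stream closer in style to the body scores a lower cross entropy.
ce_title, ce_query = cross_entropy(title, m_body), cross_entropy(query, m_body)
```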

### Citations

4178 |
Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context: ...LM for IR can be formulated as a Bayesian risk minimization problem [16], for which the optimal performance can be achieved by following the maximum a posteriori decision rule that was first shown in [7] and reiterated for IR by Zhai and Lafferty [33]. Specifically, given a query Q, a minimum risk retrieval system should rank the document D based on the product of the likelihood of the query under the do... |
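The MAP decision rule quoted here, ranking documents by the product of the query likelihood under the document LM and the document prior, can be sketched as follows (toy unigram models and priors, all invented):

```python
import math

def map_score(query_tokens, doc_lm, doc_prior, floor=1e-9):
    """log P_D(Q) + log P(D): query likelihood under the document LM
    plus the log document prior, per the MAP decision rule."""
    return sum(math.log(doc_lm.get(t, floor)) for t in query_tokens) + math.log(doc_prior)

# Toy unigram document LMs paired with document priors.
docs = {
    "d1": ({"language": 0.3, "model": 0.3, "web": 0.1}, 0.6),
    "d2": ({"language": 0.1, "model": 0.1, "spam": 0.5}, 0.4),
}
query = ["language", "model"]
ranking = sorted(docs, key=lambda d: map_score(query, *docs[d]), reverse=True)
```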

965 | Introduction to Information Retrieval
- Manning, Raghavan, et al.
- 2008
Citation Context: ...erlying statistical framework provides mathematically sound explanations to why many proven heuristics, such as TF/IDF weightings and document length normalization, have been working so well [1][6][9][18][31]. As is in the case of speech recognition, LM for IR can be formulated as a Bayesian risk minimization problem [16], for which the optimal performance can be achieved by following the maximum a po... |

951 | A language modeling approach to information retrieval - Ponte, Croft - 1998 |

755 | A study of smoothing methods for language models applied to information retrieval
- Zhai, Lafferty
Citation Context: ...dilemma that some streams are sporadic and sparse for many Web documents. Although the interpolation coefficient α_Di in (7) can in practice be kept as a free parameter to be empirically tuned (e.g., [32]), a major objective of this work is to explore alternatives that are tuning-free and thus more desirable when an IR system leaves a lab environment. In addition to the methods described in the next s... |

751 |
Information theory and statistical mechanics
- Jaynes
- 1957
Citation Context: ...mass for OOVs: p_Unk = 1 − Σ_{t∈V} P_C(t) > 0. When an open vocabulary model is used to evaluate a text corpus and encounters additional k distinct OOV tokens, the maximum entropy principle [13] is often applied to evenly distribute p_Unk among these newly discovered OOVs, i.e., P_C(t) = p_Unk / k for t ∉ V. The key question is how much mass one should take away from V and assign it to p_Unk. Th... |
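The open-vocabulary treatment in this excerpt holds back mass p_Unk for unseen tokens and, per the maximum entropy principle, splits it evenly over the k OOVs actually encountered. A minimal sketch (the fixed-discount scheme here is a simplification for illustration, not the paper's method):

```python
from collections import Counter

def open_vocab_unigram(tokens, discount=0.1):
    """Hold back a fixed fraction of probability mass for OOVs
    (a simplified discount, not the paper's CALM-based scheme)."""
    counts = Counter(tokens)
    n = len(tokens)
    model = {t: (1 - discount) * c / n for t, c in counts.items()}
    p_unk = 1.0 - sum(model.values())   # reserved OOV mass
    return model, p_unk

def prob(token, model, p_unk, k_oov):
    """Maximum-entropy assignment: split p_unk evenly over the
    k distinct OOVs discovered while evaluating the new corpus."""
    return model.get(token, p_unk / k_oov)

model, p_unk = open_vocab_unigram("a b a c".split())
oovs = {"x", "y"}                         # distinct OOV tokens encountered
p_x = prob("x", model, p_unk, len(oovs))  # each OOV receives p_unk / 2
```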

499 | Okapi at TREC-3
- Robertson, Walker, et al.
- 1995
Citation Context: ...y multiple institutions, comes with a few well-established retrieval results. In the following, we report two pertinent experimental data as baselines for comparison. The first is based on Okapi BM25 [25] and its multi-field extension [26] (referred to as BM25F below), whose parameters are taken from the published results in [28]. We note that neither set of the Okapi experiments takes into acco... |
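For reference, BM25F combines per-field, length-normalized term frequencies with field weights before applying the BM25 saturation. A simplified sketch (the field weights, b values, and idf table are illustrative, not the tuned parameters from [28]):

```python
import math

def bm25f_score(query, doc_fields, field_weights, field_b, avg_len, idf, k1=1.2):
    """Simplified BM25F: per-field length-normalized term frequencies are
    combined with field weights, then passed through BM25 saturation."""
    score = 0.0
    for term in query:
        tf_combined = 0.0
        for field, tokens in doc_fields.items():
            tf = tokens.count(term)
            norm = (1 - field_b[field]) + field_b[field] * len(tokens) / avg_len[field]
            tf_combined += field_weights[field] * tf / norm
        score += idf.get(term, 0.0) * tf_combined / (k1 + tf_combined)
    return score

# Toy two-field document; all parameters invented for illustration.
doc = {"title": "language model".split(),
       "body": "a language model for web retrieval".split()}
weights = {"title": 2.0, "body": 1.0}   # title evidence weighted higher
b = {"title": 0.5, "body": 0.75}
avg_len = {"title": 2.0, "body": 6.0}
idf = {"language": 1.5, "model": 1.2}
s = bm25f_score(["language", "model"], doc, weights, b, avg_len, idf)
```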

339 | Relevance-Based Language Models - Lavrenko, Croft - 2001 |

321 | Document language models, query models, and risk minimization for information retrieval
- Lafferty, Zhai
Citation Context: ...weightings and document length normalization, have been working so well [1][6][9][18][31]. As is in the case of speech recognition, LM for IR can be formulated as a Bayesian risk minimization problem [16], for which the optimal performance can be achieved by following the maximum a posteriori decision rule that was first shown in [7] and reiterated for IR by Zhai and Lafferty [33]. Specifically, given... |

284 | Information Retrieval as Statistical Translation
- Berger, Lafferty
- 1999
Citation Context: ...t its underlying statistical framework provides mathematically sound explanations to why many proven heuristics, such as TF/IDF weightings and document length normalization, have been working so well [1][6][9][18][31]. As is in the case of speech recognition, LM for IR can be formulated as a Bayesian risk minimization problem [16], for which the optimal performance can be achieved by following the ma... |

219 | Two-Stage language models for information retrieval
- Zhai, Lafferty
- 2002
Citation Context: ...minimization problem [16], for which the optimal performance can be achieved by following the maximum a posteriori decision rule that was first shown in [7] and reiterated for IR by Zhai and Lafferty [33]. Specifically, given a query Q, a minimum risk retrieval system should rank the document D based on the product of the likelihood of the query under the document language model, P_D(Q), and the prior of ... |

204 |
On Relevance, Probabilistic Indexing and Information Retrieval
- Maron, Kuhns
- 1960
Citation Context: ...in the query. Over a decade of studies on this topic, it has now been widely understood that LM is a principled realization of the statistical approach envisioned by Maron and Kuhns at the dawn of IR [19], and that its underlying statistical framework provides mathematically sound explanations to why many proven heuristics, such as TF/IDF weightings and document length normalization, have been working... |

203 |
A hidden Markov model information retrieval system
- Miller, Leek, et al.
- 1999
Citation Context: ...es are often composed in a different language style than the document body, and a poor query likelihood can thus occur for relevant documents because of the style mismatch. To this end, Miller et al. [21] have proposed a hidden Markov model in which an additional latent state is included to model the query generation process. Lafferty et al. have argued for an explicit model of the query language itsel... |

173 | Simple bm25 extension to multiple weighted fields
- Robertson, Zaragoza, et al.
- 2004
Citation Context: ...s viewed the multiple-field document retrieval as a structured document retrieval problem, and some established retrieval models, such as BM25, have been generalized to multi-field document retrieval [26]. A straightforward generalization for LM is to view the document as being described by multiple text streams. As shown in Sec. 2, a quantitative analysis on the documents indexed by a commercial Web ... |

135 | The importance of prior probabilities for entry page search
- Kraaij, Westerveld, et al.
- 2002
Citation Context: ...prior, which we have found to be critical in downplaying the roles of undesirable contents such as spam. Similar observations on the importance of document prior have been made for other applications [15]. As such, we adopt the machine learning technique described in [28] and train a neural network ranker based system that uses NDCG@10 as the objective function to combine BM25/F with the document prio... |
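NDCG@10, the objective function mentioned in this excerpt, can be computed as follows (the graded relevance labels are hypothetical):

```python
import math

def dcg_at_k(rels, k=10):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k=10):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Graded relevance of the top results in ranked order (labels hypothetical).
ndcg = ndcg_at_k([3, 2, 0, 1])  # imperfect ordering: below 1.0
```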

128 | Entropy-based Pruning of Backoff Language models
- Stolcke
- 1998
Citation Context: ...fore use (4) to compute the discount factor of an N-gram by choosing the corresponding (N-1)-gram as the background. For N > 1, (4) coincides with the formulation of the well-known Stolcke heuristics [27] that has been widely used in N-gram LM pruning: N-grams that can be reasonably predicted by the (N-1)-gram can be pruned out of the model. For the purpose of this work, we further extend the idea dow... |
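The Stolcke-style pruning referenced here drops N-grams whose probability is already well predicted by the (N-1)-gram backoff. It can be sketched by thresholding a log-probability ratio (the threshold and probabilities are illustrative):

```python
import math

def prune_bigrams(bigram_p, unigram_p, threshold=0.5):
    """Keep only bigrams whose log-probability ratio against the unigram
    backoff exceeds the threshold; the rest are reasonably predicted by
    the (N-1)-gram and pruned (a simplified Stolcke-style heuristic)."""
    kept = {}
    for (w1, w2), p in bigram_p.items():
        if abs(math.log(p / unigram_p[w2])) > threshold:
            kept[(w1, w2)] = p
    return kept

# Toy probabilities, invented for illustration.
unigram_p = {"model": 0.1, "retrieval": 0.05}
bigram_p = {("language", "model"): 0.4,   # far from the backoff: keep
            ("the", "retrieval"): 0.055}  # close to the backoff: prune
kept = prune_bigrams(bigram_p, unigram_p)
```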

91 | Combining document representations for known-item search
- Ogilvie, Callan
- 2003
Citation Context: ...ponding component LM for the stream, we have P_D = Σ_i P(D_i | D) P_{D_i} = Σ_i w_i P_{D_i} (2). Such a mixture distribution has been widely used for LM in speech and language processing [11] as well as in IR (e.g., [22]). However, beneath the simple linear interpolation form of (2) lies the serious question of the conditions under which the component LMs can be combined properly. Since the foundation of mixture mode... |
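Equation (2) in this excerpt is a plain linear interpolation of component stream LMs. A minimal sketch (weights and component models invented; the components are assumed to share a vocabulary, which the paper's open-vocabulary smoothing is designed to ensure):

```python
def mixture_prob(token, components, weights):
    """P_D(t) = sum_i w_i * P_{D_i}(t): a linear interpolation of the
    component stream language models, as in equation (2)."""
    assert abs(sum(weights) - 1.0) < 1e-9, "mixture weights must sum to 1"
    return sum(w * lm.get(token, 0.0) for w, lm in zip(weights, components))

# Toy component LMs for two streams of the same document.
body_lm  = {"model": 0.2, "web": 0.1}
title_lm = {"model": 0.5, "web": 0.0}
p = mixture_prob("model", [body_lm, title_lm], [0.7, 0.3])  # 0.7*0.2 + 0.3*0.5
```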

80 | Dependence language model for information retrieval
- Gao, Nie, et al.
- 2004
Citation Context: ...underlying statistical framework provides mathematically sound explanations to why many proven heuristics, such as TF/IDF weightings and document length normalization, have been working so well [1][6][9][18][31]. As is in the case of speech recognition, LM for IR can be formulated as a Bayesian risk minimization problem [16], for which the optimal performance can be achieved by following the maximum ... |

76 | A formal study of information retrieval heuristics - Fang, Tao, et al. - 2004 |

69 |
An estimate of an upper bound for the entropy of English
- Brown, Pietra, et al.
- 1992
Citation Context: ...ly. Empirical studies on applying the LMs for different task domains have also confirmed that mixing textual sources with different language styles can significantly degrade the quality of LMs (e.g., [2]). When a probability space is divided into disjoint partitions, the probability of an event can be evaluated as the sum of the conditional probability of the event occurring in each partition, weight... |

68 | Statistical Language Models for Information Retrieval
- Zhai
- 2008
Citation Context: ...ing statistical framework provides mathematically sound explanations to why many proven heuristics, such as TF/IDF weightings and document length normalization, have been working so well [1][6][9][18][31]. As is in the case of speech recognition, LM for IR can be formulated as a Bayesian risk minimization problem [16], for which the optimal performance can be achieved by following the maximum a poster... |

45 |
Always Good Turing: asymptotically optimal probability estimation, Science 302 (5644)
- Orlitsky, Santhanam, et al.
- 2003
Citation Context: ...mission and/or a fee. SIGIR’10, July 19–23, 2010, Geneva, Switzerland. Copyright 2010 ACM 978-1-60558-896-4/10/07…$10.00. 1. INTRODUCTION Inspired by the success in speech recognition, Ponte and Croft [23] introduced the language modeling (LM) techniques to information retrieval (IR) that have since become an important research area. The motivation is very simple: just as we would like a speech recogni... |

42 | Title language model for information retrieval
- Jin, Hauptmann, et al.
Citation Context: ...Lafferty et al. have argued for an explicit model of the query language itself [16], and proposed the machine translation techniques to bridge the gap between the body and the query [1]. Jin et al. [14], for example, used the title and the body of a document as the target and the source languages, respectively, and demonstrated that the “translated” title LM can be a viable choice as the P_D for IR.... |

39 |
Overview of the TREC 2009 Web Track
- Clarke, Craswell, et al.
- 2009
Citation Context: ...s conducted on the test collection with all the relevance judgments. There is no reason to believe the results reported here cannot be reproduced elsewhere, such as the recent TREC Web Track data set [4], provided that the document prior can be computed with methods that effectively confront the prevalent spamming activities on the Web. Specifically, the test collection used in this paper includes ... |

39 |
Search Engines: Information Retrieval in Practice
- Croft, Metzler, et al.
- 2009
Citation Context: ...ts underlying statistical framework provides mathematically sound explanations to why many proven heuristics, such as TF/IDF weightings and document length normalization, have been working so well [1][6][9][18][31]. As is in the case of speech recognition, LM for IR can be formulated as a Bayesian risk minimization problem [16], for which the optimal performance can be achieved by following the maxim... |

38 | Spam double-funnel: Connecting web spammers with advertisers
- Wang, Ma, et al.
- 2007
Citation Context: ...ent prior can be computed with methods that effectively confront the prevalent spamming activities on the Web. Specifically, the test collection used in this paper includes a technique described in [30] that identifies spammers based on the HTTP redirection patterns. We have found such crawling-time features critical; they can augment other link graph analysis and content-based methods and lead to an ... |

23 | Exploring Web Scale Language Models for Search Query - Huang |

15 | Web resources for language modeling in conversational speech recognition - Bulyko, Ostendorf, et al. - 2007 |

7 | A machine learning approach for improved bm25 retrieval
- Svore, Burges
- 2009
Citation Context: ...euristics or are incongruent to the properties of the tuning data. The scale of the Web typically amplifies the difficulty of these issues, as demonstrated in the machine learning results reported in [28] that show the retrieval performance can be highly volatile depending on how the parameters in BM25F are acquired. In this paper, we propose an information theoretically motivated method towards open ... |

5 | Efficacy of a constantly adaptive language modeling technique for webscale applications
- Wang, Li
- 2009
Citation Context: ...ALM In this paper, we adopt a more analytically tractable approach to open-vocabulary discount. The key element in our method is a model adaptation algorithm called CALM, first proposed by Wang and Li [29]. A close examination of the original presentation reveals that the adaptation framework in CALM can be explained in an alternative manner using the widely known vector space paradigm. As the origina... |

1 | 21 language models at TREC: A language modeling approach to the text retrieval conference - Hiemstra, Kraaij |