## A Tutorial on Information Retrieval Modelling

Citations: 3 - 0 self

### BibTeX

@MISC{Hiemstra_atutorial,
    author = {Djoerd Hiemstra},
    title = {A Tutorial on Information Retrieval Modelling},
    year = {}
}

### Abstract

Many applications that handle information on the internet would be completely

### Citations

8090 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...nfortunately, it is unknown to which subset each document belongs. The estimation of the three parameters should therefore be done iteratively by applying e.g. the expectation maximisation algorithm (Dempster et al. 1977) or alternatively by the method of moments as done by Harter (1975). If a document is taken at random from subset one, then the probability of relevance of this document is assumed to be equal to, or...

4273 | A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition
- Rabiner
- 1989
Citation Context ...998; Hiemstra and Kraaij 1998; Miller et al. 1999). They originate from probabilistic models of language generation developed for automatic speech recognition systems in the early 1980’s (see e.g. Rabiner 1990). Automatic speech recognition systems combine probabilities of two distinct models: the acoustic model, and the language model. The acoustic model might for instance produce the following candidate ...

3249 | The anatomy of a large-scale hypertextual web search engine
- Brin, Page
- 1998
Citation Context ...any links pointing to them are more likely to be relevant, as shown in the next section. 4.6 Google’s page rank model When Sergey Brin and Lawrence Page launched the web search engine Google in 1998 (Brin and Page 1998), it had two features that distinguished it from other web search engines: It had a simple no-nonsense search interface, and, it used a radically different approach to rank the search results. Instea...

3124 | Introduction to Modern Information Retrieval
- Salton, McGill
- 1983
Citation Context ...he components of the vectors $\vec{d}$ and $\vec{q}$. 3.1 The vector space model Gerard Salton and his colleagues suggested a model based on Luhn’s similarity criterion that has a stronger theoretical motivation (Salton and McGill 1983). They considered the index representations and the query as vectors embedded in a high dimensional Euclidean space, where each term is assigned a separate dimension. The similarity measure is usuall...

882 | A language modeling approach to information retrieval
- Ponte, Croft
- 1998
Citation Context ...onable if the approach still deserves to be called a ‘Bayesian network model’. 4.5 Language models Language models were applied to information retrieval by a number of researchers in the late 1990’s (Ponte and Croft 1998; Hiemstra and Kraaij 1998; Miller et al. 1999). They originate from probabilistic models of language generation developed for automatic speech recognition systems in the early 1980’s (see e.g. Rab...

851 | An Empirical Study of Smoothing Techniques for Language Modeling
- Chen, Goodman
- 1998
Citation Context ...$\cdot \frac{\sum_t cf(t)}{cf(k)} \cdot \frac{\lambda}{1-\lambda}\Big)$ (16) Here, $P(T_i = t_i) = cf(t_i) / \sum_t cf(t)$, and $cf(t) = \sum_d tf(t, d)$. There are many approaches to smoothing, most pioneered for automatic speech recognition (Chen and Goodman 1996). Another approach to smoothing that is often used for information retrieval is so-called Dirichlet smoothing, which is defined as (Zhai and Lafferty 2004): $P(T_1 = t_1, \cdots, T_n = t_n \mid D = d) = \prod_{i=1}^{n}$ tf (ti...

850 | Relevance feedback in information retrieval
- Rocchio
- 1971
Citation Context ...ad-hoc, but quite successful retrieval algorithms are nicely grounded in the vector space model if the vector lengths are normalised. An example is the relevance feedback algorithm by Joseph Rocchio (Rocchio 1971). Rocchio suggested the following algorithm for relevance feedback, where $\vec{q}_{old}$ is the original query, $\vec{q}_{new}$ is the revised query, $\vec{d}^{(i)}_{rel}$ ($1 \le i \le r$) is one of the r documents the user selected ...

701 | A study of smoothing methods for language models applied to information retrieval - Zhai, Lafferty

597 | Relevance weighting of search terms
- Robertson, Jones
- 1976
Citation Context ...Figure 5: Venn diagram of the collection given the query term social. Stephen Robertson and Karen Spärck-Jones based their probabilistic retrieval model on this line of reasoning (Robertson and Spärck-Jones 1976). They suggested to rank documents by P(R|D), that is the probability of relevance R given the document’s content description D. Note that D is here a vector of binary components, each component typi...

353 | Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval
- Robertson, Walker
- 1994
Citation Context ... this model is cumbersome, it inspired Stephen Robertson and Stephen Walker in developing the Okapi BM25 term weighting algorithm, which is still one of the best performing term weighting algorithms (Robertson and Walker 1994; Sparck-Jones et al. 2000). 4.4 Bayesian network models In 1991, Howard Turtle proposed the inference network model (Turtle and Croft 1991) which is formal in the sense that it is based on the Bayesi...

297 | Accurately interpreting clickthrough data as implicit feedback
- Joachims, Granka, et al.
- 2005
Citation Context ...at “such a procedure would of course be extremely impractical”, but in fact, such techniques – rank optimization using so-called click-through rates – are now common in web search engines as well (Joachims et al. 2005). Probabilistic indexing models were also studied by Fuhr (1989). 4.2 The probabilistic retrieval model Whereas Maron and Kuhns introduced ranking by the probability of relevance, it was Stephen Robe...

264 | A probabilistic model of information retrieval: development and status: Part 1 and Part 2. Information Processing and Management
- Sparck-Jones, Walker, et al.
- 2000
Citation Context ...it inspired Stephen Robertson and Stephen Walker in developing the Okapi BM25 term weighting algorithm, which is still one of the best performing term weighting algorithms (Robertson and Walker 1994; Sparck-Jones et al. 2000). 4.4 Bayesian network models In 1991, Howard Turtle proposed the inference network model (Turtle and Croft 1991) which is formal in the sense that it is based on the Bayesian network mechanism (Metz...

234 | The Probability Ranking Principle in IR
- Robertson
- 1977
Citation Context ...ing by the probability of relevance, it was Stephen Robertson who turned the idea into a principle. He formulated the probability ranking principle, which he attributed to William Cooper, as follows (Robertson 1977). If a reference retrieval system’s response to each request is a ranking of the documents in the collections in order of decreasing probability of usefulness to the user who submitted the request, w...

229 | Evaluation of an inference network-based retrieval model
- Turtle, Croft
- 1991
Citation Context ...ll one of the best performing term weighting algorithms (Robertson and Walker 1994; Sparck-Jones et al. 2000). 4.4 Bayesian network models In 1991, Howard Turtle proposed the inference network model (Turtle and Croft 1991) which is formal in the sense that it is based on the Bayesian network mechanism (Metzler and Croft 2004). A Bayesian network is an acyclic directed graph (a directed graph is acyclic if there is no ...

227 | The SMART Retrieval System: Experiments in Automatic Document Processing, chapter Computer evaluation of indexing and text processing - Salton, Lesk - 1971

191 | Extended Boolean information retrieval - Salton, Fox - 1983

189 | A hidden markov model information retrieval system
- Miller, Leek, et al.
- 1999
Citation Context ...ed a ‘Bayesian network model’. 4.5 Language models Language models were applied to information retrieval by a number of researchers in the late 1990’s (Ponte and Croft 1998; Hiemstra and Kraaij 1998; Miller et al. 1999). They originate from probabilistic models of language generation developed for automatic speech recognition systems in the early 1980’s (see e.g. Rabiner 1990). Automatic speech recognition syste...

188 | On relevance, probabilistic indexing and information retrieval
- Maron, Kuhns
- 1960
Citation Context ...which in this case are vectors with binary components dk that denote whether a document is indexed by term k or not. 4.1 The probabilistic indexing model As early as 1960, Bill Maron and Larry Kuhns (Maron and Kuhns 1960) defined their probabilistic indexing model. Unlike Luhn, they did not target automatic indexing by information retrieval systems. Manual indexing was still guiding the field, so they suggested that ...

128 | The importance of prior probabilities for entry page search
- Kraaij, Westerveld, et al.
- 2002
Citation Context ...Document priors can be easily combined with standard language modelling probabilities and are as such powerful means to improve the effectiveness of for instance queries for home pages in web search (Kraaij et al. 2002). 5 Conclusion There is no such thing as a dominating model or theory of information retrieval, unlike the situation in for instance the area of databases where the relational model is the dominating...

110 | Matrix, vector space, and information retrieval - Berry, Drmac, et al. - 1999

108 | An algebra for structured text search and a framework for its implementation
- Clarke, Cormack, et al.
- 1995
Citation Context ...ks the term union. Clearly, it is likely that the latter document is more useful than the former, but the model has no means to make the distinction. 2.2 Region models Regions models (Burkowski 1992; Clarke et al. 1995; Navarro and Baeza-Yates 1997; Jaakkola and Kilpelainen 1999) are extensions of the Boolean model that reason about arbitrary parts of textual data, called segments, extents or regions. Region models...

107 | Twenty-One at TREC-7: Ad-hoc and cross-language track
- Hiemstra, Kraaij
- 1998
Citation Context ... still deserves to be called a ‘Bayesian network model’. 4.5 Language models Language models were applied to information retrieval by a number of researchers in the late 1990’s (Ponte and Croft 1998; Hiemstra and Kraaij 1998; Miller et al. 1999). They originate from probabilistic models of language generation developed for automatic speech recognition systems in the early 1980’s (see e.g. Rabiner 1990). Automatic spee...

106 | Combining the language model and inference network approaches to retrieval
- Metzler, Croft
Citation Context ...2000). 4.4 Bayesian network models In 1991, Howard Turtle proposed the inference network model (Turtle and Croft 1991) which is formal in the sense that it is based on the Bayesian network mechanism (Metzler and Croft 2004). A Bayesian network is an acyclic directed graph (a directed graph is acyclic if there is no directed path A → · · · → Z such that A = Z) that encodes probabilistic dependency relationships between ...

93 | On the specification of term values in automatic indexing - Salton, Yang - 1973

87 | Models for Retrieval with Probabilistic Indexing. Information Processing and Management - Fuhr - 1989

69 | Probabilistic models for automatic indexing - Bookstein, Swanson - 1976

69 | Proximal Nodes: A Model to Query Document Databases by Content and Structure
- Baeza-Yates, Navarro
- 1997
Citation Context ...learly, it is likely that the latter document is more useful than the former, but the model has no means to make the distinction. 2.2 Region models Regions models (Burkowski 1992; Clarke et al. 1995; Navarro and Baeza-Yates 1997; Jaakkola and Kilpelainen 1999) are extensions of the Boolean model that reason about arbitrary parts of textual data, called segments, extents or regions. Region models model a document collection a...

66 | Hyperlink analysis for the Web - Henzinger - 2001

49 | Probabilistic models of indexing and searching - Robertson, Rijsbergen, et al. - 1981

44 | Beyond pagerank: machine learning for static ranking
- Richardson, Prakash, et al.
- 2006
Citation Context ... a Boolean AND query) and rank those documents by their page rank. Interestingly, a simple algorithm like this would not only be effective for web search, it can also be implemented very efficiently (Richardson et al. 2006). In practice, web search engines like Google use many more factors in their ranking than just page rank alone. In terms of the probabilistic indexing model and the language modelling approaches, sta...

32 | Retrieval activities in a database consisting of heterogeneous collections of structured text
- Burkowski
- 1992
Citation Context ... worker that lacks the term union. Clearly, it is likely that the latter document is more useful than the former, but the model has no means to make the distinction. 2.2 Region models Regions models (Burkowski 1992; Clarke et al. 1995; Navarro and Baeza-Yates 1997; Jaakkola and Kilpelainen 1999) are extensions of the Boolean model that reason about arbitrary parts of textual data, called segments, extents or re...

18 | Modelling documents with multiple poisson distributions
- Margulis
- 1993
Citation Context ...ual data if the term frequencies differ very much per document. Some studies therefore examine the use of more than two Poisson functions, but this makes the estimation problem even more intractable (Margulis 1993). Robertson, van Rijsbergen, and Porter (1981) proposed to use the 2-Poisson model to include the frequency of terms within documents in the probabilistic model. Although the actual implementation of...

16 | Nested text-region algebra
- Jaakkola, Kilpeläinen
- 1999
Citation Context ...latter document is more useful than the former, but the model has no means to make the distinction. 2.2 Region models Regions models (Burkowski 1992; Clarke et al. 1995; Navarro and Baeza-Yates 1997; Jaakkola and Kilpelainen 1999) are extensions of the Boolean model that reason about arbitrary parts of textual data, called segments, extents or regions. Region models model a document collection as a linearized string of words....

5 | A statistical approach to mechanised encoding and searching of literary information
- Luhn
- 1957
Citation Context ... point. The remaining sections of this chapter discuss these models of ranked retrieval. 3 Vector space approaches Peter Luhn was the first to suggest a statistical approach to searching information (Luhn 1957). He suggested that in order to search a document collection, the user should first prepare a document that is similar to the documents needed. The degree of similarity between the representation of ...

4 | Score Region Algebra: A flexible framework for structured information retrieval
- Mihajlovic
- 2006
Citation Context ...k documents. For most retrieval applications, ranking is of the utmost importance and ranking extensions have been proposed of the Boolean model (Salton, Fox and Wu 1983) as well as of region models (Mihajlovic 2006). These extensions are based on models that take the need for ranking as their starting point. The remaining sections of this chapter discuss these models of ranked retrieval. 3 Vector space approach...

4 | Structured text retrieval models - Hiemstra, Baeza-Yates - 2009

3 | An algorithm for probabilistic indexing - Harter - 1975

1 | Huete (eds - Campos, Fernandez-Luna, et al. - 2004

1 | Models for retrieval with probabilistic indexing. Information processing and management - unknown authors - 1989