## Understanding inverse document frequency: On theoretical arguments for IDF (2004)


Venue: Journal of Documentation

Citations: 98 (1 self)

### BibTeX

```bibtex
@ARTICLE{Robertson04understandinginverse,
  author  = {Stephen Robertson},
  title   = {Understanding inverse document frequency: On theoretical arguments for IDF},
  journal = {Journal of Documentation},
  year    = {2004},
  volume  = {60},
  pages   = {503--520}
}
```

### Abstract

The term weighting function known as IDF was proposed in 1972, and has since been extremely widely used, usually as part of a TF*IDF function. It is often described as a heuristic, and many papers have been written (some based on Shannon’s Information Theory) seeking to establish some theoretical basis for it. Some of these attempts are reviewed, and it is shown that the Information Theory approaches are problematic, but that there are good theoretical justifications of both IDF and TF*IDF in the traditional probabilistic model of information retrieval.
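The TF*IDF function the abstract refers to can be made concrete with a small sketch. This is a hypothetical toy (function and variable names are illustrative, not from the paper), using the plain form idf(t) = log(N / n_t) with raw in-document counts as TF; real systems such as BM25 damp the TF factor, but the product form is the heuristic under discussion.

```python
import math
from collections import Counter

def tf_idf_scores(docs, query_terms):
    """Score each document against the query with a plain TF*IDF weight.

    idf(t) = log(N / n_t), where N is the number of documents and n_t the
    number of documents containing t; tf is the raw in-document count.
    """
    N = len(docs)
    tokenised = [doc.lower().split() for doc in docs]

    # Document frequency n_t: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenised:
        for t in set(tokens):
            df[t] += 1

    scores = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Sum tf * idf over query terms present in the collection.
        score = sum(
            tf[t] * math.log(N / df[t])
            for t in query_terms if df[t]
        )
        scores.append(score)
    return scores

docs = [
    "inverse document frequency weighting",
    "term frequency and document length",
    "language models for retrieval",
]
print(tf_idf_scores(docs, ["document", "frequency"]))
```

The third document shares no query term, so it scores zero; the other two each match both query terms once.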

### Citations

6862 | The Mathematical Theory of Communication - Shannon, Weaver - 1949 |

942 | A language modeling approach to information retrieval - Ponte, Croft - 1998 |

Citation Context: ...even if the probabilities are assumed to be independent.) The relationship between such models and the matters discussed in this paper is a little unclear. Standard simple language models in IR (e.g. Ponte and Croft, 1998) do not make explicit use of either IDF or TF, but nevertheless achieve a similar effect to a TF*IDF measure. It seems possible that they are closer in spirit to the inverse total term frequency meas...
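The context above notes that simple language models use neither IDF nor TF explicitly yet behave like TF*IDF. A minimal query-likelihood sketch (hypothetical names; Jelinek-Mercer smoothing is one standard choice, not necessarily Ponte and Croft's exact estimator) shows where the IDF-like effect enters: the collection statistics in the smoothing term penalise common words and reward rare matched ones.

```python
import math
from collections import Counter

def query_log_likelihood(doc_tokens, query_terms, collection_tokens, lam=0.5):
    """log P(query | doc) under a unigram model with Jelinek-Mercer smoothing.

    Assumes every query term occurs somewhere in the collection (otherwise
    the log of zero would be taken). No explicit IDF factor appears, yet
    rare matched terms move the score far more than common ones.
    """
    doc_tf = Counter(doc_tokens)
    coll_tf = Counter(collection_tokens)
    dlen, clen = len(doc_tokens), len(collection_tokens)
    return sum(
        math.log(lam * doc_tf[t] / dlen + (1 - lam) * coll_tf[t] / clen)
        for t in query_terms
    )

d1 = ["rare", "word"]
d2 = ["common", "word", "word"]
collection = d1 + d2
# d1 contains the rare query term, so it scores higher than d2.
print(query_log_likelihood(d1, ["rare"], collection))
print(query_log_likelihood(d2, ["rare"], collection))
```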

647 | Relevance weighting of search terms - Robertson, Jones - 1976 |

404 | A statistical interpretation of term specificity and its application in retrieval - Jones - 1972 |

402 | The Geometry of Information Retrieval - Rijsbergen - 2004 |

386 | Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval - Robertson, Walker - 1994 |

375 | A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization - Joachims - 1997 |

Citation Context: ... example of an attempt to use information theory for a derivation of a TF*IDF weighting scheme – and we have already seen some problems with such a formulation. The approach taken here (Joachims, 1997) appeals not to information theory, but to naïve Bayes machine learning models, as used in the relevance weighting theory (Joachims’ task is not the usual ranked retrieval task, but a categorisation ...

338 | Relevance-Based Language Models - Lavrenko, Croft - 2001 |

Citation Context: ...r event space. Another difference is that the simple language models make no direct reference to relevance. However, some more recent work in this field does have an explicit relevance variable (e.g. Lavrenko and Croft, 2001). The 2-Poisson model and the BM25 weighting function were formulated before the language modelling idea came along. However, they may be recast as an elementary form of language model. This view is ...

296 | A probabilistic model of information retrieval: Development and comparative experiments - Part 2 - Sparck Jones, Walker, et al. - 2000 |

219 | Two-Stage language models for information retrieval - Zhai, Lafferty - 2002 |

Citation Context: ...oes not present an analogue to a TF*IDF formula. Language models: There has been a great deal of work recently in the application of statistical language models (LM) to information retrieval (see e.g. Croft and Lafferty, 2003). In a statistical language model, each successive word position (in a document or query) is regarded as a slot to be filled with a word; a generative process is envisaged, in which these slots are f...

211 | Entropy and information theory - Gray - 1990 |

139 | On term selection for query expansion - Robertson - 1990 |

Citation Context: ...ia bigrams), and (b) distinguishing between the term weight and the gain from including the term as a feature in the classifier. (However, this distinction was also made earlier in the IR field – see Robertson, 1990.) The classical probabilistic model for IR: The interpretation of IDF as a function of a probability suggests associating it with (one of) the probabilistic approaches to information retrieval. Ther...

101 | On the specification of term values in automatic indexing - Salton, Yang - 1973 |

Citation Context: ...TF*IDF-style measures mentioned at the beginning of this paper. TF*IDF-style measures emerged from extensive empirical studies of combinations of weighting factors, particularly by the Cornell group (Salton and Yang, 1973).[3] To understand or explain the effectiveness of these measures, we need some justification not just for the TF factor itself, but also for why one should want to multiply the log-like IDF weight b...

72 | Probabilistic models for automatic indexing - Bookstein, Swanson - 1976 |

67 | On relevance weights with little relevance information - Robertson, Walker - 1997 |

49 | Inverse document frequency (IDF): A measure of deviations from Poisson - Church, Gale - 1995 |

43 | An information-theoretic perspective of tf-idf measures - Aizawa - 2003 |

Citation Context: ...l entropy of one random variable given another, the mutual information of two variables, the Kullback-Leibler divergence between two probability distributions. All these are closely related (see e.g. Aizawa, 2003). In general, however, these quantities are defined over entire probability distributions, and do not obviously relate to single (elementary or aggregate) events. There is in fairly common use a noti...

34 | Why inverse document frequency - Papineni - 2001 |

Citation Context: ...etermined by the optimal term weight for searching, but requires a related but different measure of the effect of adding such a term to the query (Robertson, 1990), what Papineni refers to as ‘gain’ (Papineni, 2001). Little relevance information: Suppose now that we have little relevance information – the user either knows about in advance, or has seen and judged, just a few documents, and has marked R as re...

20 | Using probabilistic models of information retrieval without relevance information - Croft, Harper - 1979 |

Citation Context: ...n? Concerning the probability pi, relating to relevant documents, at first view we have no information at all. On that basis, probably the best we can do is to assume a fixed value p0 for all the pi (Croft and Harper, 1979). Concerning qi, however, we can make the same assumption made in the previous paragraph – that the collection consists to a very large extent of non-relevant documents. An extreme version of the com...
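The Croft-Harper step quoted above can be checked numerically. Starting from the Robertson/Sparck Jones relevance weight log[p(1-q)/(q(1-p))] (the "Relevance weighting of search terms" entry above), fixing p = p0 = 0.5 and estimating q ≈ n/N collapses the weight to log((N - n)/n), which tracks IDF = log(N/n) closely for rare terms. Function names here are illustrative, not from the paper.

```python
import math

def rsj_weight(p, q):
    """Robertson/Sparck Jones relevance weight: log odds ratio of the term
    occurring in relevant (p) versus non-relevant (q) documents."""
    return math.log((p * (1 - q)) / (q * (1 - p)))

def croft_harper_weight(n, N, p0=0.5):
    """No relevance information: fix p = p0 (Croft and Harper, 1979) and
    estimate q ~ n/N from the whole collection, treating it as almost
    entirely non-relevant. At p0 = 0.5 this reduces to log((N - n)/n)."""
    return rsj_weight(p0, n / N)

N = 1_000_000
for n in (10, 1000, 100_000):
    # Compare the relevance weight with plain IDF = log(N/n):
    print(n, round(croft_harper_weight(n, N), 3), round(math.log(N / n), 3))
```

For n = 10 the two values agree to four decimal places; only for very frequent terms does log((N - n)/n) fall visibly below log(N/n).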

13 | The probability ranking principle in information retrieval - Robertson - 1977 |

Citation Context: ...25) is developed in Sparck Jones, Walker and Robertson (2000). It is not yet quite obvious how IDF might relate to a relevance variable, but this will emerge below. The probability ranking principle (Robertson, 1977) states that for optimal performance, documents should be ranked in order of probability of relevance to the request or information need, as calculated from whatever evidence is available to the syst...

10 | Improving the suitability of imperfect transcriptions for information retrieval from spoken documents - Siegler, Witbrock - 1999 |

Citation Context: ...ult to see it as an explanation of the value of IDF. There are certainly other ways to interpret ‘the heuristic that tf-idf employs’, as discussed below. Siegler and Witbrock: This much shorter paper (Siegler and Witbrock, 1999) defines both queries and documents as mappings of words onto probabilities – in other words, the event space is the space of words but with two different probability measures (to which they confusin...

8 | On event spaces and probabilistic models in information retrieval - Robertson - 2005 |

6 | A frequency-based and a Poisson-based definition of the probability of being informative - Roelleke - 2003 |

2 | Term specificity [letter to the editor] - Robertson - 1972 |

Citation Context: ...oks like it might represent a probability (actually inverted). Thus we can consider the probability that a random document d would contain the term (Robertson, 1972). [Figure 1: Zipf’s law, plotting log frequency against log rank.] This probability has an obvious estimate, namely the inverse of the fraction in the IDF equation: P(ti) = P(ti occurs in d) ≈ ni / N. In the light of this relation, we can reasonably redefine IDF in...
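The estimate in this context gives IDF a direct probabilistic reading: with P(ti) ≈ ni/N, the redefined weight is simply the negative log of the probability that a random document contains the term. A toy sketch (the function name is illustrative):

```python
import math

def idf(n_i, N):
    """IDF via the document-collection probability: P(t_i) ~ n_i / N,
    so idf(t_i) = -log P(t_i) = log(N / n_i)."""
    return -math.log(n_i / N)

# A term occurring in 100 of 1,000,000 documents carries a large weight;
# one occurring in 900,000 of them carries almost none.
print(idf(100, 1_000_000), idf(900_000, 1_000_000))
```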

2 | Overview of the Okapi projects [introduction to special issue of ...] - Robertson - 1997 |

Citation Context: ...planation of TF*IDF based on an extension of the relevance weighting model that accommodates TF. The weighting function known as Okapi BM25 (it was developed as part of the Okapi experimental system (Robertson, 1997), and was one of a series of Best Match functions) is probably the best-known and most widely used such extension. Eliteness and the 2-Poisson model: The following is a brief account of the deriva...

1 | The history of idf and its influences on IR and other - Harman - 1975 |

1 | Specificity and weighted retrieval [documentation note] - Robertson - 1974 |

Citation Context: ...tion to the IDF formula, by regarding IDF as measuring the amount of information carried by the term. Indeed, something along these lines has been done several times, including by the present author (Robertson, 1974). Some problems with this view of IDF are discussed below. First, it will be useful to define some other concepts associated with the Shannon entropy. Messages and random variables: We can...