## An Investigation of Dirichlet Prior Smoothing's Performance Advantage (2005)

Citations: 7 (0 self)

### BibTeX

```bibtex
@TECHREPORT{Smucker05aninvestigation,
  author      = {Mark D. Smucker and James Allan},
  title       = {An Investigation of Dirichlet Prior Smoothing's Performance Advantage},
  institution = {},
  year        = {2005}
}
```

### Abstract

In the language modeling approach to information retrieval, Dirichlet prior smoothing frequently outperforms Jelinek-Mercer smoothing. Both Dirichlet prior and Jelinek-Mercer are forms of linearly interpolated smoothing. The only difference between them is that Dirichlet prior determines the amount of smoothing based on a document's length. Theory suggests that Dirichlet prior's advantage should be the result of better document model estimation, for Dirichlet prior sensibly smooths longer documents less. In contrast, our hypothesis was that Dirichlet prior's performance advantage comes primarily from a penalization of shorter documents' scores. We conducted two experiments to test our hypothesis. In our first experiment, when we transformed the test collections to have a uniform probability of relevance given document length, P(Rel|Len), Dirichlet prior's performance advantage disappeared. If Dirichlet prior's advantage came from better estimation, it should have retained that advantage even with a uniform P(Rel|Len). In our second experiment, we gave the known P(Rel|Len) as a document prior to the retrieval method. With the document prior, Jelinek-Mercer's performance increased to match Dirichlet prior's, while Dirichlet prior itself showed some degradation in performance. These results confirm our hypothesis. While better estimation was formerly a plausible explanation of Dirichlet prior's performance advantage, we now know that Dirichlet prior smoothing's advantage appears to come from its penalization of shorter documents.
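The two smoothing methods contrasted in the abstract differ only in how the interpolation weight is set. A minimal Python sketch, following the standard formulations (Zhai and Lafferty 2001); the toy counts in the usage below are illustrative, not from the paper:

```python
from collections import Counter

def p_jelinek_mercer(word, doc, coll, lam=0.5):
    """Jelinek-Mercer: linear interpolation of the document and
    collection language models with a fixed weight lambda."""
    p_doc = doc[word] / sum(doc.values())
    p_coll = coll[word] / sum(coll.values())
    return (1 - lam) * p_doc + lam * p_coll

def p_dirichlet(word, doc, coll, mu=1000):
    """Dirichlet prior: equivalent to linear interpolation with a
    document-dependent weight mu / (|D| + mu), so longer documents
    are smoothed less."""
    dlen = sum(doc.values())
    p_coll = coll[word] / sum(coll.values())
    return (doc[word] + mu * p_coll) / (dlen + mu)

# Toy example: a 3-word document against a 10-word collection.
doc = Counter({"a": 2, "b": 1})
coll = Counter({"a": 5, "b": 3, "c": 2})
print(p_jelinek_mercer("a", doc, coll))   # fixed smoothing weight
print(p_dirichlet("a", doc, coll, mu=10)) # weight shrinks with |D|
```

Rewriting Dirichlet prior as interpolation with lambda = mu / (|D| + mu) makes the paper's point concrete: as |D| grows the weight on the collection model shrinks, which is exactly the length-dependent behavior whose benefit the paper investigates.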

### Citations

949 | An empirical study of smoothing techniques for language modeling - Chen, Goodman - 1998 |

757 | A study of smoothing methods for language models applied to ad hoc information retrieval - Zhai, Lafferty - 2001
Citation Context: ...ssful with large amounts of smoothing for all document lengths. We will explain that because the amount of smoothing controls the amount of smoothing's inverse document frequency (IDF) like behavior (Zhai and Lafferty 2001), one needs significant amounts of smoothing to maximize that behavior. Finally, we discuss the potential of using a document prior for performance improvements. 4.1 Smoothing Longer Documents Less O... |

392 | Document length normalization - Singhal, Salton, et al. - 1996
Citation Context: ...or's performance advantage comes more from its penalization of shorter documents than from its potentially better estimation. In the TREC collections, longer documents are more likely to be relevant (Singhal et al. 1996). It appears that by smoothing shorter documents more than longer documents, Dirichlet prior is able to suppress the scores of short documents, which is advantageous given the nature of the TREC co... |

345 | R: A Language and Environment for Statistical Computing - R Development Core Team - 2004 |

298 | Viewing morphology as an inference process - Krovetz - 1993
Citation Context: ... TREC 7 and 8 consists of TREC volumes 4 and 5 minus the Congressional Record (CR) subcollection. We preprocessed the collections and queries in the same manner. We stemmed using the Krovetz stemmer (Krovetz 1993) and removed stopwords using an in-house stopword list of 418 noise words. We used Lemur 4.3.2 (Lemur 2003) for all experiments. The calculations in Section 5.2 and the data plotted in Figure 6 were ... |

285 | Information retrieval as statistical translation - Berger, Lafferty - 1999
Citation Context: ... Introduction The language modeling approach to information retrieval (IR) represents documents as generative probabilistic models (Ponte and Croft 1998, Miller et al. 1998, Hiemstra and Kraaij 1998, Berger and Lafferty 1999, Song and Croft 1999). Documents with higher probabilities for query words are preferred over other documents. A document's score is computed to be the probability that it would generate the query. T... |

206 | A General Language Model for Information Retrieval - Song, Croft - 1999 |

110 | Twenty-One at TREC-7: ad-hoc and crosslanguage track - Hiemstra, Kraaij - 1999
Citation Context: ...nt length normalization. 1 Introduction The language modeling approach to information retrieval (IR) represents documents as generative probabilistic models (Ponte and Croft 1998, Miller et al. 1998, Hiemstra and Kraaij 1998, Berger and Lafferty 1999, Song and Croft 1999). Documents with higher probabilities for query words are preferred over other documents. A document's score is computed to be the probability that it w... |

65 | Good-turing frequency estimation without tears - Gale, Sampson - 1995 |

43 | BBN at TREC7: Using Hidden Markov Models for Information Retrieval - Miller, Schwartz - 1998 |

34 | Schools of Linguistics - Sampson - 1980
Citation Context: ...thing. Good-Turing explicitly uses the zero probability mass, P0, and estimates it for a document D to be: P0 = N1(D) / |D|, where N1(D) is the number of words that occur exactly once in the document D (Sampson 2001). We will not use or discuss Good-Turing smoothing beyond using its estimation of the zero probability mass. Gale and Sampson (1995) provide a good explanation of Good-Turing smoothing. The λ paramet... |
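The Good-Turing zero-mass estimate in the snippet above, P0 = N1(D) / |D|, is a one-line computation; a minimal Python sketch (the token list is an invented example):

```python
from collections import Counter

def zero_mass(doc_tokens):
    """Good-Turing estimate of the total probability mass reserved
    for unseen words: P0 = N1(D) / |D|, where N1(D) is the number of
    word types occurring exactly once in the document."""
    counts = Counter(doc_tokens)
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / len(doc_tokens)

# "a" and "c" are singletons, so P0 = 2 / 4 = 0.5.
print(zero_mass(["a", "b", "b", "c"]))
```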

7 | A language modeling approach to information retrieval - Ponte, Croft - 1998 |

4 | The SMART project at TREC - Buckley |

4 | Relating the new language models of information retrieval to the traditional retrieval models, Centre for Telematics and Information - Hiemstra, de Vries - 2000 |

2 | Algorithm AS 266: Maximum likelihood estimation of the parameters of the Dirichlet distribution - Narayanan - 1991
Citation Context: ... finds the density parameters that produce the highest likelihood for a collection of documents when the density is used as a prior. The MLE can be computed numerically using a Newton-Raphson method (Narayanan 1991) or via an expectation maximization (EM) like method (Sjölander et al. 1996). In contrast, common practice in information retrieval, and the one we follow, is to let P(w|M) = P(w|C), i.e. use the ML... |

2 | A comparison of statistical significance tests for information retrieval evaluation - Smucker, Allan, Carterette - 2007 |

1 | Empirical estimates of adaptation: the chance of two Noriegas is closer to p/2 than p² - Church - 2000 |

1 | Probability: deductive and inductive problems - Johnson - 1932 |

1 | Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology - Sjölander, Karplus, Mian, Haussler, et al. - 1996
Citation Context: ...ution to the problem of zero probabilities and poor probability estimates is to bring prior knowledge to the estimation process. A natural fit as a prior for the multinomial is the Dirichlet density (Sjölander et al. 1996). A Dirichlet density can be thought of as an urn containing multinomial dies. All the multinomials are of the same size with |V| parameters. The Dirichlet density has the same number of parameters ... |
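The "urn containing multinomials" picture in the snippet above can be made concrete: drawing from a Dirichlet yields one multinomial parameter vector. A short sketch using the standard normalized-Gamma construction (this construction is a well-known fact, not something described in the paper):

```python
import random

def sample_multinomial(alpha):
    """Draw one multinomial parameter vector from a Dirichlet with
    parameters alpha, by normalizing independent Gamma(alpha_i, 1)
    draws. Each draw is one 'die' pulled from the urn."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

# A symmetric Dirichlet over a 3-word vocabulary: each draw is a
# valid probability distribution over the |V| = 3 words.
random.seed(0)
print(sample_multinomial([1.0, 1.0, 1.0]))
```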

1 | Lightening the load of document smoothing for better language modeling retrieval - Smucker, Allan - 2006 |