## Empirical Development of an Exponential Probabilistic Model for Text Retrieval: Using Textual Analysis to Build a Better Model (2003)

### Download Links

- [haystack.lcs.mit.edu]
- [people.csail.mit.edu]
- [www.csail.mit.edu]
- [goanna.cs.rmit.edu.au]

### Other Repositories/Bibliography

- DBLP

Venue: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2003)

Citations: 11 (0 self)

### BibTeX

@INPROCEEDINGS{Teevan03empiricaldevelopment,
  author    = {Jaime Teevan and David R. Karger},
  title     = {Empirical Development of an Exponential Probabilistic Model for Text Retrieval: Using Textual Analysis to Build a Better Model},
  booktitle = {Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year      = {2003},
  pages     = {18--25}
}

### Abstract

Much work in information retrieval focuses on using a model of documents and queries to derive retrieval algorithms. Model-based development is a useful alternative to heuristic development because in a model the assumptions are explicit and can be examined and refined independently of the particular retrieval algorithm. We explore the explicit assumptions underlying the naive Bayesian framework by performing computational analysis of actual corpora and queries to devise a generative document model that closely matches text. Our thesis is that a model so developed will be more accurate than existing models, and thus more useful in retrieval, as well as in other applications. We test this by learning the best document model from a corpus. We find the learned model better predicts the text data and has improved performance on certain IR tasks.

### Citations

1135 | Bayesian Theory
- Bernardo, Smith
- 1994
Citation Context: ...family, and thus our model, we do semi-parametric analysis of a single-parameter exponential family distribution, and there is substantial relevant statistical and machine learning literature on this [1, 6, 9]. Previously, attempts have been made to learn IR model parameters from text [24], but such work has not focused on learning the actual model. In learning the model, we use relevance judgments on past...

941 | A Language Modeling Approach to Information Retrieval
- Ponte, Croft
- 1998
Citation Context: ...val and present experimental results. 2. RELATED WORK While the statistical properties of text corpora are fundamental to the use of probabilistic models, as well as to the use of other recent models [19, 21], the statistical properties have not necessarily been fundamental to the models' development, nor to understanding their assumptions. Most IR models do attempt to minimize unfounded assumptions, alth...

891 | A tutorial on learning with Bayesian networks
- Heckerman
- 1998
Citation Context: ...family, and thus our model, we do semi-parametric analysis of a single-parameter exponential family distribution, and there is substantial relevant statistical and machine learning literature on this [1, 6, 9]. Previously, attempts have been made to learn IR model parameters from text [24], but such work has not focused on learning the actual model. In learning the model, we use relevance judgments on past...

402 | Statistical interpretation of term specificity and its application in retrieval
- Spärck Jones, K.
- 1972
Citation Context: ...of a document to rank documents, without investigating whether queries look anything like document titles. Early IR systems did place focus on text during their development. For example, Sparck Jones [11] used analysis of three small corpora to suggest the use of inverse document frequency for term weighting, now a common practice, based on textual analysis. Recently, people have revisited textual ana...
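The inverse document frequency idea credited to Sparck Jones in this snippet can be sketched in a few lines of Python. This is an illustrative helper with a made-up toy corpus, not code or data from the paper:

```python
import math

def idf(term, documents):
    # Inverse document frequency: log(N / df), where df is the number
    # of documents containing the term. Rare terms get higher weight.
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df) if df else 0.0

# Hypothetical corpus of term sets.
docs = [{"exponential", "model"}, {"model", "text"}, {"retrieval"}]
print(round(idf("retrieval", docs), 3))  # rare term -> 1.099
print(round(idf("model", docs), 3))      # common term -> 0.405
```

In a tf.idf weighting scheme this factor multiplies the within-document term frequency, down-weighting terms that appear in many documents.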

386 | Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval
- Robertson, Walker
- 1994
Citation Context: ... assumptions against text data. Often a model is developed without reference to the data; the only actual interaction with text is when testing a particular retrieval system based on the model (e.g., [8, 15, 22]). At that point it is difficult to decide whether unsatisfactory retrieval is due to the retrieval system or the underlying model. In this work our primary goal is to develop a generative probabilist...

376 | Naive (Bayes) at forty: The independence assumption in information retrieval
- Lewis
- 1998
Citation Context: ...p further by, instead of imposing a model based on analysis, learning one computationally. We derive our model within the well-studied naive Bayesian framework, of which Lewis provides a good overview [16]. Within this framework, researchers have considered a number of term distribution families to model text, with the selection sometimes based on textual analysis [2, 14]. Distributions that have been ...
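The naive Bayesian ranking framework mentioned in this snippet scores a document by summing per-term log-odds under a term-independence assumption. A minimal sketch, with made-up probability tables rather than distributions learned from a corpus as in the paper:

```python
import math

def nb_log_odds(doc_terms, p_rel, p_nonrel, floor=1e-6):
    # Naive Bayes ranking score: sum over terms of
    # log P(t | relevant) - log P(t | non-relevant), assuming terms
    # are independent given the relevance class. Unseen terms fall
    # back to a small floor probability to avoid log(0).
    return sum(math.log(p_rel.get(t, floor)) - math.log(p_nonrel.get(t, floor))
               for t in doc_terms)

# Hypothetical per-class term probabilities.
p_rel = {"retrieval": 0.05, "model": 0.04}
p_nonrel = {"retrieval": 0.01, "model": 0.02}
print(nb_log_odds(["retrieval", "model"], p_rel, p_nonrel) > 0)  # True
```

Documents are then ranked by this score; the modeling question the paper pursues is which distribution family those per-term probabilities should come from.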

295 | A probabilistic model of information retrieval: development and comparative experiments - part 1
- Jones, Walker, et al.
Citation Context: ...val algorithm, yielding better performance in retrieval. This hope is not always fulfilled, as has been illustrated by the relatively unsuccessful attempts to allow for term dependencies in retrieval [12]. Still, by focusing on a few empirically inaccurate and easily correctable flaws in current models, we find we are able to improve our model's performance on certain IR tasks. We begin by discussing ...

217 | Language Modeling for Information Retrieval
- Croft, Lafferty
- 2003
Citation Context: ...ential family distribution, and there is substantial relevant statistical and machine learning literature on this [1, 6, 9]. Previously, attempts have been made to learn IR model parameters from text [24], but such work has not focused on learning the actual model. In learning the model, we use relevance judgments on past queries in a principled way to devise a retrieval strategy for future queries. T...

171 | Learning to classify text from labeled and unlabeled documents
- Nigam, McCallum, et al.
- 1998
Citation Context: ...he Bayesian machine learning framework could be used for automatic relevance feedback, using unlabeled documents that seem relevant to modify its believed relevant-document distribution. Nigam et al. [20] got good results with this approach using the multinomial model, and we conjecture that our model, more accurate than the multinomial, could further improve performance. We also believe that by learn...

127 | A statistical approach to the mechanized encoding and searching of literary information
- Luhn
- 1957
Citation Context: ...1. INTRODUCTION The goal of information retrieval (IR) is to determine which documents are relevant to a user's information need. In early IR work, this determination was based on heuristic judgments [17] (e.g., that documents containing the user's query terms are likely to be relevant) followed by heuristic tweaking of parameters (e.g., term weights) to make the system work. Subsequently, attempts we...

113 | Probabilistic models in information retrieval
- Fuhr
- 1992
Citation Context: ...esearchers have considered a number of term distribution families to model text, with the selection sometimes based on textual analysis [2, 14]. Distributions that have been explored include binomial [5], multinomial [13, 16], Poisson [22], Poisson mixtures [2] and more [3]. The work we present here differs from this earlier naive Bayesian work because we do not restrict our model to a particular famil...
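As a concrete example of fitting one of the term-distribution families named in this snippet, a Poisson model's rate can be set by maximum likelihood and its fit judged by log-likelihood. This sketch uses invented per-document term counts; the paper itself fits a more general single-parameter exponential-family distribution rather than assuming Poisson:

```python
import math

def poisson_loglik(counts, lam):
    # Log-likelihood of observed per-document counts of a term under a
    # Poisson with rate lam: sum of k*log(lam) - lam - log(k!).
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)

# Hypothetical counts of one term across eight documents.
counts = [0, 0, 1, 0, 2, 0, 0, 1]
lam_mle = sum(counts) / len(counts)  # the sample mean maximizes the likelihood
print(poisson_loglik(counts, lam_mle) > poisson_loglik(counts, 1.0))  # True
```

Comparing such log-likelihoods across candidate families (binomial, multinomial, Poisson, mixtures) is one way to ask which distribution best describes real term counts.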

87 | Poisson Mixtures
- Church, Gale
- 1995
Citation Context: ...ibes text, and we rely on this better model to improve retrieval. Some previous work has tried to match underlying model assumptions closely to text through manual analysis of a small number of terms [2, 14]. In contrast, we match the assumptions to text computationally using a large text corpus (TREC). Thus, though we restrict ourselves throughout our analysis to naive Bayesian models, we are able to sig...

87 | Automatic feedback using past queries: Social searching
- Fitzpatrick, Dent
- 1997
Citation Context: ...relevance judgments on past queries in a principled way to devise a retrieval strategy for future queries. Thus our work serves as an instance of cross query learning. Similarly, Fitzpatrick and Dent [4] used textually similar past queries for relevance feedback. In contrast to their work, rather than using topically related past queries within a specific model, we use all past queries to teach us an...

81 | Distribution of content words and phrases in text and language modeling (Natural Language Engineering)
- Katz
- 1996
Citation Context: ...ibes text, and we rely on this better model to improve retrieval. Some previous work has tried to match underlying model assumptions closely to text through manual analysis of a small number of terms [2, 14]. In contrast, we match the assumptions to text computationally using a large text corpus (TREC). Thus, though we restrict ourselves throughout our analysis to naive Bayesian models, we are able to sig...

65 | Towards multidocument summarization by reformulation: Progress and prospects," presented at
- McKeown, Klavans, et al.
- 1999
Citation Context: ...ue, we only summed over query terms. The results, displayed in Figure 4, show performance under this scheme is better in all areas of the curve than the Poisson model, as well as comparable to tf.idf [18], performing slightly worse in areas of high precision, and better in areas of low precision. [precision-recall plot: Tf.idf, New Model, Poisson] Figure 4: The hyper-le...
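The precision-recall comparison this snippet describes can be reproduced in miniature: given a ranked result list with binary relevance judgments, emit a (recall, precision) point at each relevant hit. The judgments below are illustrative only, not the paper's TREC runs:

```python
def pr_points(ranked_relevance):
    # (recall, precision) points from a ranked list of 0/1 relevance
    # judgments: at each relevant document, recall = hits / total relevant
    # and precision = hits / rank so far.
    total_rel = sum(ranked_relevance)
    points, hits = [], 0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            points.append((hits / total_rel, hits / rank))
    return points

print(pr_points([1, 0, 1, 0]))  # [(0.5, 1.0), (1.0, 0.6666666666666666)]
```

Plotting such points for each ranking scheme (tf.idf, Poisson, a learned model) yields curves like the paper's Figure 4.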

54 | Using interdocument similarity information in document retrieval systems
- Griffiths, Luckhurst, et al.
- 1986
Citation Context: ... assumptions against text data. Often a model is developed without reference to the data; the only actual interaction with text is when testing a particular retrieval system based on the model (e.g., [8, 15, 22]). At that point it is difficult to decide whether unsatisfactory retrieval is due to the retrieval system or the underlying model. In this work our primary goal is to develop a generative probabilistic...

46 | A Maximum Likelihood Ratio Information Retrieval Model
- Ng
- 1999
Citation Context: ...val and present experimental results. 2. RELATED WORK While the statistical properties of text corpora are fundamental to the use of probabilistic models, as well as to the use of other recent models [19, 21], the statistical properties have not necessarily been fundamental to the models' development, nor to understanding their assumptions. Most IR models do attempt to minimize unfounded assumptions, alth...

42 | Title language model for information retrieval
- Jin, Hauptmann, et al.
- 2002
Citation Context: ...erstanding their assumptions. Most IR models do attempt to minimize unfounded assumptions, although often without understanding which ones actually are unfounded. For example, Jin, Hauptmann and Zhai [10] suggested using the probability a query would be the title of a document to rank documents, without investigating whether queries look anything like document titles. Early IR systems did place focus ...

42 | Improving two-stage ad-hoc retrieval for short queries
- Kwok, Chan

41 | A Theory of Term Weighting Based on Exploratory Data Analysis
- Greiff
- 1998
Citation Context: ...of inverse document frequency for term weighting, now a common practice, based on textual analysis. Recently, people have revisited textual analysis on newer and larger data sets. For example, Greiff [7] suggested improvements to tf.idf by studying 85,000 Associated Press articles from TREC. Because probabilistic models have explicit assumptions that make such textual analysis straightforward, some r...

34 | A new probabilistic model of text classification and retrieval
- Kalt
- 1996
Citation Context: ...onsidered a number of term distribution families to model text, with the selection sometimes based on textual analysis [2, 14]. Distributions that have been explored include binomial [5], multinomial [13, 16], Poisson [22], Poisson mixtures [2] and more [3]. The work we present here differs from this earlier naive Bayesian work because we do not restrict our model to a particular family within the framework...

4 | On the naive Bayes model for text classification
- Eyheramendy, Lewis, et al.
- 2003
Citation Context: ...documents are more likely to be relevant than long ones." While such a conclusion may be supportable, some prefer to apply length normalization to eliminate this dependence, usually by ad hoc methods [3]. Length normalization can be pursued in a principled fashion. From the naive Bayesian model, we could derive an appropriate and tractable length normalization scheme by conditioning the probability di...

2 | Adaptive estimation of distributions using exponential sub-families
- Gous
- 1998
Citation Context: ...family, and thus our model, we do semi-parametric analysis of a single-parameter exponential family distribution, and there is substantial relevant statistical and machine learning literature on this [1, 6, 9]. Previously, attempts have been made to learn IR model parameters from text [24], but such work has not focused on learning the actual model. In learning the model, we use relevance judgments on past...