## Inferring probability of relevance using the method of logistic regression (1994)

Venue: Proceedings of ACM SIGIR '94

Citations: 44 (1 self)

### BibTeX

```bibtex
@inproceedings{Gey94inferringprobability,
  author    = {Fredric C. Gey},
  title     = {Inferring probability of relevance using the method of logistic regression},
  booktitle = {Proceedings of ACM SIGIR '94},
  year      = {1994},
  pages     = {222--231},
  publisher = {Springer-Verlag}
}
```

### Abstract

This research evaluates a model for probabilistic text and document retrieval; the model uses logistic regression to obtain equations that rank documents by probability of relevance as a function of document and query properties. Since the model infers probability of relevance from statistical clues present in the texts of documents and queries, we call it logistic inference. By transforming the distribution of each statistical clue into its standardized distribution (one with mean μ = 0 and standard deviation σ = 1), the method allows one to apply logistic coefficients derived from a training collection to other document collections with little loss of predictive power. The model is applied to three well-known information retrieval test collections, and the results are compared directly to the vector space model of retrieval that uses term-frequency/inverse-document-frequency (tfidf) weighting and the cosine similarity measure. In the comparison, the logistic inference method performs significantly better than (in two collections) or as well as (in the third collection) the tfidf/cosine vector space model. The differences in performance of the two models were subjected to statistical tests to determine whether they are statistically significant or could have occurred by chance.
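The core mechanism the abstract describes — standardize each clue to zero mean and unit variance, fit logistic regression on a training collection, then reuse the coefficients on a new collection standardized with its own statistics — can be sketched as follows. This is a minimal illustration with synthetic data and a hand-rolled gradient-ascent fit, not the paper's actual clue set or fitting procedure; all variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy clue matrix: one row per (query, document) pair, columns are
# statistical clues (e.g. query term frequency, tf.idf) -- synthetic
# stand-ins for a training collection.
X = rng.normal(loc=[2.0, 5.0], scale=[1.0, 3.0], size=(200, 2))
# Synthetic relevance labels loosely correlated with the clues.
y = (X @ np.array([0.8, 0.3]) + rng.normal(0, 1, 200) > 3.0).astype(float)

def standardize(M, mean=None, std=None):
    """Transform each clue to mean 0 and standard deviation 1."""
    if mean is None:
        mean, std = M.mean(axis=0), M.std(axis=0)
    return (M - mean) / std, mean, std

Z, mu, sigma = standardize(X)

# Fit logistic regression by gradient ascent on the log-likelihood
# (a simple stand-in for a maximum-likelihood fit).
w = np.zeros(Z.shape[1])
b = 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))  # predicted P(relevance)
    w += 0.1 * (Z.T @ (y - p)) / len(y)
    b += 0.1 * (y - p).mean()

# To score a *different* collection, standardize its clues with its
# own mean/std and reuse the trained coefficients w, b unchanged.
X_new = rng.normal(loc=[1.0, 8.0], scale=[0.5, 4.0], size=(50, 2))
Z_new, _, _ = standardize(X_new)
log_odds = Z_new @ w + b
ranking = np.argsort(-log_odds)  # documents ranked by probability of relevance
```

The standardization step is what makes the transfer plausible: the coefficients are expressed per standard deviation of each clue rather than in collection-specific raw units.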

### Citations

3458 | Introduction to Modern Information Retrieval
- Salton, McGill
Citation Context: ... of this task is the concise representation of the meanings of the texts contained in the collection. In the vector space model developed by Gerard Salton and associates at Cornell University [1] [2] [3] both documents and queries are represented as vectors in the “m”-dimensional term space, assuming that the indexing vocabulary of nontrivial terms is of size “m”. For this model, we have a clear geo...

1713 | Term Weighting Approaches in Automatic Text Retrieval
- Salton, Buckley
- 1987
Citation Context: ...ts show that a simple retrieval weight, which only accounted for the occurrence of the term in the document and the occurrence of the term in the query, performed worst in their retrieval comparisons [5]. By a probabilistic search system we mean one whose query specification, indexing processes, and retrieval rules are derived from the formal application of the theory of probability to the logic of t...

1338 | Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer
- Salton
- 1989
Citation Context: ...eart of this task is the concise representation of the meanings of the texts contained in the collection. In the vector space model developed by Gerard Salton and associates at Cornell University [1] [2] [3] both documents and queries are represented as vectors in the “m”-dimensional term space, assuming that the indexing vocabulary of nontrivial terms is of size “m”. For this model, we have a clear...

1282 | Applied Logistic Regression
- Hosmer, Lemeshow
- 1989
Citation Context: ... a document chosen at random from the collection will be relevant to query qi. The coefficients of individual query and document term properties are derived by using the method of logistic regression [10], which fits an equation to predict a dichotomous dependent variable as a function of (possibly continuous) independent variables which show statistical variation. Once the log odds of relevance is compu...

247 | The SMART Retrieval System – Experiments in Automatic Document Processing
- Salton
- 1971
Citation Context: ...he heart of this task is the concise representation of the meanings of the texts contained in the collection. In the vector space model developed by Gerard Salton and associates at Cornell University [1] [2] [3] both documents and queries are represented as vectors in the “m”-dimensional term space, assuming that the indexing vocabulary of nontrivial terms is of size “m”. For this model, we have a c...

247 | Inference networks for document retrieval
- Turtle
- 1991
Citation Context: ...r distribution is binomial rather than normal, as is usually assumed by ordinary regression [15] [16]. An alternative approach is to use Bayesian inference networks as developed by Turtle and others [14]. 3. Sampling for logistic regression: One of the computational problems in using logistic regression is the computational size necessary to compute logistic regression coefficients. Even supposing that...

188 | Using statistical testing in the evaluation of retrieval experiments
- Hull
- 1993
Citation Context: ...y a T-test to these differences under the null hypothesis that the methods perform identically, and hence their mean difference should be zero and their standard deviation not significantly different. [18] provides a summary of possible statistical tests which might be used to evaluate retrieval experiments. 6. Logistic model performance for the Cranfield collection: For the logistic inferen...
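The paired T-test this context describes — testing per-query differences between two methods under the null hypothesis of zero mean difference — can be sketched as follows. The scores are made-up illustrative numbers, not results from the paper.

```python
import math
from statistics import mean, stdev

# Hypothetical per-query effectiveness scores for two retrieval methods
# (e.g. average precision on the same six queries).
method_a = [0.42, 0.55, 0.31, 0.60, 0.47, 0.52]
method_b = [0.38, 0.49, 0.33, 0.51, 0.40, 0.45]

# Per-query differences; under the null hypothesis their mean is zero.
diffs = [a - b for a, b in zip(method_a, method_b)]
n = len(diffs)

# Paired t statistic: mean difference over its standard error.
t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
```

The resulting `t` would then be compared against a Student's t distribution with n − 1 degrees of freedom to decide whether the observed difference could have occurred by chance.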

70 | Overview of the first TREC conference
- Harman
- 1993
Citation Context: ...ents, and apply them directly to clue values for the new collection. This methodology was first applied by Cooper, Gey and Chen [20] to the queries of the NIST text retrieval conference (TREC 1) [21]. This paper provides a more complete analysis of the method. 11. Applying to CACM and CISI collections: In order to obtain means and standard deviation for all clues for the new collection, we must ob...

63 | A statistical interpretation of term specificity and its application in retrieval
- Sparck-Jones
- 1972
Citation Context: ... number of documents for which term tj occurs in the document’s text or has been used to index the document. The attribute IDF, which is usually logged, was first suggested by Karen Sparck-Jones [4]. Two well-known weights, used in the SMART retrieval system, are the term frequency for the query terms (QAF), and, for the document terms, the term frequency–inverse document frequency product (DAF...

47 | Extending the Boolean and Vector Space Model of Information Retrieval with P-norm Queries and Multiple Concept Types
- Fox
- 1983
Citation Context: ...approximate the dependent variable, document relevance, by linear combinations of multiple clues such as term frequency, authorship, and co-citation were first introduced, with some success, by Ed Fox [11]. More recently Fuhr and Buckley [12] [13] used polynomial regression to approximate relevance. There are two well-known major problems with the application of ordinary regression approaches to probab...

2 | Probabilistic Dependence and Logistic Inference in Information Retrieval
- Gey
- 1993
Citation Context: ...ween CISI standardized and tfidf/cosine. Among the reasons for this failure to achieve performance improvement in the CISI collection is the different prior probability of relevance for the collections. [22] gives more detail on the sensitivity of the model to the proper estimation of the prior probability of relevance. 12. Conclusions: In this research we have investigated a new probabilistic text and docume...

1 | A generalized term dependence model in information retrieval
- Salton, Buckley, Lam
- 1983
Citation Context: ...sed on relevance, we have a chicken-and-egg process, whereby the test is applied to the sample which was used for fitting. In past tests of probabilistic models where this approach has also been used [19], the next step in testing such models has been to train on half the number of queries and then use that training to predict the results of the other half of the queries. We feel that a large number o...

1 | Information retrieval from the TIPSTER collection: an application of staged logistic regression
- Cooper, Gey, Chen
Citation Context: ...d standard deviation σv for each clue v, compute the new coefficients, and apply them directly to clue values for the new collection. This methodology was first applied by Cooper, Gey and Chen [20] to the queries of the NIST text retrieval conference (TREC 1) [21]. This paper provides a more complete analysis of the method. 11. Applying to CACM and CISI collections: In order to obtain means and ...