## Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval (1998)

Citations: 347 (1 self)

### BibTeX

@INPROCEEDINGS{Lewis98naive(bayes),

author = {David D. Lewis},

title = {Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval},

booktitle = {},

year = {1998},

pages = {4--15},

publisher = {Springer Verlag}

}

### Abstract

The naive Bayes classifier, currently experiencing a renaissance in machine learning, has long been a core technique in information retrieval. We review some of the variations of naive Bayes models used for text retrieval and classification, focusing on the distributional assumptions made about word occurrences in documents.

### Citations

3127 | Introduction to Modern Information Retrieval - Salton, McGill - 1983 |

1703 | Text categorization with support vector machines: Learning with many relevant features - Joachims - 1998 |
Citation Context: ...rrange to compare posterior log odds of classes for each document individually, without comparisons across documents. Indeed, we know of many applications of multinomial models to text categorization [3, 14, 15, 25, 32, 34] but none to text retrieval. 5.3 Non-Distributional Approaches A variety of ad hoc approaches have been developed that more or less gracefully integrate term frequency and document length information ...

852 | Relevance feedback in information retrieval - Rocchio - 1971 |
Citation Context: ...ive Bayes, but may be less aware of an equally large information retrieval (IR) literature dating back almost forty years [37, 38]. In fact, naive Bayes methods, along with prototype formation methods [44, 45, 24], accounted for most applications of supervised learning to information retrieval until quite recently. In this paper we briefly review the naive Bayes classifier and its use in information retrieval....

601 | Improving retrieval performance by relevance feedback - Salton, Buckley - 1990 |
Citation Context: ...ive Bayes, but may be less aware of an equally large information retrieval (IR) literature dating back almost forty years [37, 38]. In fact, naive Bayes methods, along with prototype formation methods [44, 45, 24], accounted for most applications of supervised learning to information retrieval until quite recently. In this paper we briefly review the naive Bayes classifier and its use in information retrieval....

600 | On the optimality of the simple Bayesian classifier under zero-one loss - Domingos, Pazzani - 1997 |
Citation Context: ...imental evidence has been developed that a training procedure based on the naive Bayes assumptions can yield an optimal classifier in a variety of situations where the assumptions are wildly violated [9]. 7 Conclusion Naive Bayes models have been remarkably successful in information retrieval. In the yearly TREC evaluations [16-19, 52], numerous variations of naive Bayes models have been used, produc...

597 | Relevance weighting of search terms - Robertson, Sparck Jones - 1976 |
Citation Context: ...ial score of a document to be the constant term in Equation 11, the full score can be computed by adding up values involving only those words present in a document, not those absent from the document [41, 48]. Since most words do not occur in most documents, this is desirable from the standpoint of computational efficiency. The two-class, binary feature naive Bayes model has come to be known in information r...

518 | Perceptrons: An Introduction to Computational Geometry - Minsky, Papert - 1969 |

474 | A Sequential Algorithm for Training Text Classifiers - Lewis, Gale - 1994 |
Citation Context: ...rrange to compare posterior log odds of classes for each document individually, without comparisons across documents. Indeed, we know of many applications of multinomial models to text categorization [3, 14, 15, 25, 32, 34] but none to text retrieval. 5.3 Non-Distributional Approaches A variety of ad hoc approaches have been developed that more or less gracefully integrate term frequency and document length information ...

366 | Pivoted document length normalization - Singhal, Buckley, et al. - 1996 |
Citation Context: ...h can only increase the estimate $P(c_k \mid x)$, regardless of the actual content of the document. While a case can be made that longer documents are somewhat more likely to be of interest to any given user [43, 47], the above effect is likely to be far stronger than appropriate. 5 Other Distributional Models In this section we look at a number of variations on the naive Bayes model that attempt to address the w...

353 | Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval - Robertson, Walker - 1994 |
Citation Context: ...h can only increase the estimate $P(c_k \mid x)$, regardless of the actual content of the document. While a case can be made that longer documents are somewhat more likely to be of interest to any given user [43, 47], the above effect is likely to be far stronger than appropriate. 5 Other Distributional Models In this section we look at a number of variations on the naive Bayes model that attempt to address the w...

247 | Context-sensitive learning methods for text categorization - Cohen, Singer - 1999 |
Citation Context: ...ations of naive Bayes models have been used, producing some of the best results. Recent comparisons of learning methods for text categorization have been somewhat less favorable to naive Bayes models [5, 25] while still showing them to achieve respectable effectiveness. This may be because the larger amount of training data available in text categorization data sets favors algorithms which produce more c...

230 | A method for disambiguating word senses in a large corpus - Gale, Church, et al. - 1993 |
Citation Context: ...rrange to compare posterior log odds of classes for each document individually, without comparisons across documents. Indeed, we know of many applications of multinomial models to text categorization [3, 14, 15, 25, 32, 34] but none to text retrieval. 5.3 Non-Distributional Approaches A variety of ad hoc approaches have been developed that more or less gracefully integrate term frequency and document length information ...

229 | Evaluation of an inference network-based retrieval model - Turtle, Croft - 1991 |
Citation Context: ...sumption isn't really needed anyway. Whatever its successes in machine learning, the first strategy has not met with great success in IR. While interesting research on dependence models has been done [8, 11, 21, 49, 50], these models are rarely used in practice. Even most work in the "inference net" approach to information retrieval has mostly used independence (or ad hoc) models. Results from the second strategy ar...

198 | Information storage and retrieval - Korfhage - 1997 |

193 | The Third Text REtrieval Conference (TREC-3) - Harman (editor) - 1994 |

189 | On relevance, probabilistic indexing and information retrieval - Maron, Kuhns - 1960 |
Citation Context: ...researchers tend to be aware of the large pattern recognition literature on naive Bayes, but may be less aware of an equally large information retrieval (IR) literature dating back almost forty years [37, 38]. In fact, naive Bayes methods, along with prototype formation methods [44, 45, 24], accounted for most applications of supervised learning to information retrieval until quite recently. In this paper...

180 | Scaling Up the Accuracy of Naïve-Bayes Classifiers: a Decision Tree Hybrid - Kohavi - 1996 |
Citation Context: ...g them to achieve respectable effectiveness. This may be because the larger amount of training data available in text categorization data sets favors algorithms which produce more complex classifiers [27], or may be because the more elaborate representation and estimation tricks developed for retrieval and routing with naive Bayes have not been applied to categorization. There are many open research ques...

159 | Pattern classification and scene analysis. Wiley-Interscience - Duda, Hart - 1973 |
Citation Context: ...s' rule as: $P(c_k \mid x) = P(c_k) \times \frac{P(x \mid c_k)}{P(x)}$ (3) When we know the $P(c_k \mid x)$ exactly for a classification problem, classification can be done in an optimal way for a wide variety of effectiveness measures [10, 31]. For instance, the expected number of classification errors can be minimized by assigning a document with feature vector $x$ to the class $c_k$ for which $P(c_k \mid x)$ is highest. We of course do not know the...

124 | A theoretical basis for the use of co-occurrence data in information retrieval - van Rijsbergen - 1977 |
Citation Context: ...sumption isn't really needed anyway. Whatever its successes in machine learning, the first strategy has not met with great success in IR. While interesting research on dependence models has been done [8, 11, 21, 49, 50], these models are rarely used in practice. Even most work in the "inference net" approach to information retrieval has mostly used independence (or ad hoc) models. Results from the second strategy ar...

122 | Information retrieval systems: theory and implementation - Kowalski - 1997 |

102 | Evaluating and optimizing autonomous text classification systems - Lewis - 1995 |
Citation Context: ...s' rule as: $P(c_k \mid x) = P(c_k) \times \frac{P(x \mid c_k)}{P(x)}$ (3) When we know the $P(c_k \mid x)$ exactly for a classification problem, classification can be done in an optimal way for a wide variety of effectiveness measures [10, 31]. For instance, the expected number of classification errors can be minimized by assigning a document with feature vector $x$ to the class $c_k$ for which $P(c_k \mid x)$ is highest. We of course do not know the...

90 | A probabilistic approach to automatic keyword indexing: part 1 - Harter - 1975 |
Citation Context: ...e context of naive Bayes classifiers and some for other purposes. The distributions investigated have mostly been Poisson mixtures [4, 26]: the Poisson itself [40], mixtures of 2, 3, or more Poissons [1, 2, 22, 23, 36], and the negative binomial (an infinite mixture of Poissons) [40]. The details of the particular models can be complex, sometimes involving latent variables that intervene between the class label and...

86 | Models for retrieval with probabilistic indexing - Fuhr - 1989 |
Citation Context: ...ormation into the BIM itself. The widely used probabilistic indexing approach assumes there is an ideal binary indexing of the document, for which the observed index term occurrences provide evidence [7, 13]. Retrieval or classification is based on computing (or approximating) the expected value of the posterior log odds. The expectation is taken with respect to the probabilities of various ideal indexin...

83 | Relevance feedback and other query modification techniques - Harman - 1992 |
Citation Context: ... related and partially ad hoc applications of naive Bayes dating back to Maron [37]. Robertson and Sparck Jones' particular interest in the binary independence model was its use in relevance feedback [20, 45]. In relevance feedback, a user query is given to a search engine, which produces an initial ranking of its document collection by some means. The user examines the initial top-ranked documents and gi...

81 | Natural language processing for information retrieval - Lewis, Sparck Jones - 1996 |
Citation Context: ...] [46, Ch. 3], [51, Chs. 2-3]). An ongoing surprise and disappointment is that structurally simple representations produced without linguistic or domain knowledge have been as effective as any others [30, 33]. We therefore make the common assumption that the preprocessing of the document produces a bag (multiset) of index terms which do not themselves have internal structure. This representation is someti...

78 | Distribution of content words and phrases in text and language modelling - Katz - 1996 |
Citation Context: ...distributions for term frequencies have been investigated, some in the context of naive Bayes classifiers and some for other purposes. The distributions investigated have mostly been Poisson mixtures [4, 26]: the Poisson itself [40], mixtures of 2, 3, or more Poissons [1, 2, 22, 23, 36], and the negative binomial (an infinite mixture of Poissons) [40]. The details of the particular models can be complex, som...

77 | Using taxonomy, discriminants, and signatures for navigating in text databases - Chakrabarti, Dom, et al. - 1997 |

73 | Automatic indexing: an experimental inquiry - Maron - 1961 |
Citation Context: ...researchers tend to be aware of the large pattern recognition literature on naive Bayes, but may be less aware of an equally large information retrieval (IR) literature dating back almost forty years [37, 38]. In fact, naive Bayes methods, along with prototype formation methods [44, 45, 24], accounted for most applications of supervised learning to information retrieval until quite recently. In this paper...

70 | The First Text REtrieval - Harman - 1999 |

61 | Text categorization of low quality images - Ittner, Lewis, et al. - 1995 |
Citation Context: ...ive Bayes, but may be less aware of an equally large information retrieval (IR) literature dating back almost forty years [37, 38]. In fact, naive Bayes methods, along with prototype formation methods [44, 45, 24], accounted for most applications of supervised learning to information retrieval until quite recently. In this paper we briefly review the naive Bayes classifier and its use in information retrieval....

52 | An evaluation of feedback in document retrieval using co-occurrence data - Harper, van Rijsbergen - 1978 |
Citation Context: ...sumption isn't really needed anyway. Whatever its successes in machine learning, the first strategy has not met with great success in IR. While interesting research on dependence models has been done [8, 11, 21, 49, 50], these models are rarely used in practice. Even most work in the "inference net" approach to information retrieval has mostly used independence (or ad hoc) models. Results from the second strategy ar...

49 | Probabilistic models of indexing and searching - Robertson, van Rijsbergen, et al. - 1981 |
Citation Context: ...ce [40] is the most clear treatment from a classification standpoint. Despite considerable study, explicit use of Poisson mixtures for text retrieval have not proven more effective than using the BIM [35, 42]. This failure has been variously blamed on the larger number of parameters these models require estimating, the choice of estimation methods, the difficulty of accounting for document length in these...

34 | One term or two - Church - 1995 |
Citation Context: ...distributions for term frequencies have been investigated, some in the context of naive Bayes classifiers and some for other purposes. The distributions investigated have mostly been Poisson mixtures [4, 26]: the Poisson itself [40], mixtures of 2, 3, or more Poissons [1, 2, 22, 23, 36], and the negative binomial (an infinite mixture of Poissons) [40]. The details of the particular models can be complex, som...

32 | Search term relevance weighting given little relevance information - Sparck Jones - 1979 |
Citation Context: ...ial score of a document to be the constant term in Equation 11, the full score can be computed by adding up values involving only those words present in a document, not those absent from the document [41, 48]. Since most words do not occur in most documents, this is desirable from the standpoint of computational efficiency. The two-class, binary feature naive Bayes model has come to be known in information r...

31 | Parameter estimation for probabilistic document retrieval models - Losee - 1988 |
Citation Context: ...ce [40] is the most clear treatment from a classification standpoint. Despite considerable study, explicit use of Poisson mixtures for text retrieval have not proven more effective than using the BIM [35, 42]. This failure has been variously blamed on the larger number of parameters these models require estimating, the choice of estimation methods, the difficulty of accounting for document length in these...

28 | Experiments with representation in a document retrieval system - Croft - 1983 |
Citation Context: ...ormation into the BIM itself. The widely used probabilistic indexing approach assumes there is an ideal binary indexing of the document, for which the observed index term occurrences provide evidence [7, 13]. Retrieval or classification is based on computing (or approximating) the expected value of the posterior log odds. The expectation is taken with respect to the probabilities of various ideal indexin...

27 | Some inconsistencies and misidentified modelling assumptions in probabilistic information retrieval - Cooper - 1992 |
Citation Context: ...es [4]. In any case, the effectiveness improvements yielded by these strategies have been small (with the possible selection of feature selection). IR's representative of the third strategy is Cooper [6], who points out that in the case of a two-class naive Bayes model, the usual independence assumptions (Equation 4) can be replaced by a weaker "linked dependence" assumption: $\frac{P(x \mid c_1)}{P(x \mid c_2)} = \prod_{j} \frac{P(x_j \mid c_1)}{P(x_j \mid c_2)}$ (14...

26 | Text Representation for Intelligent Text Retrieval: A Classification-Oriented View - Lewis - 1992 |
Citation Context: ...] [46, Ch. 3], [51, Chs. 2-3]). An ongoing surprise and disappointment is that structurally simple representations produced without linguistic or domain knowledge have been as effective as any others [30, 33]. We therefore make the common assumption that the preprocessing of the document produces a bag (multiset) of index terms which do not themselves have internal structure. This representation is someti...

26 | Boolean queries and term dependencies in probabilistic retrieval models - Croft - 1986 |
Citation Context: ...sumption isn't really needed anyway. Whatever its successes in machine learning, the first strategy has not met with great success in IR. While interesting research on dependence models has been done [8, 11, 21, 49, 50], these models are rarely used in practice. Even most work in the "inference net" approach to information retrieval has mostly used independence (or ad hoc) models. Results from the second strategy ar...

24 | Document classification by machine: theory and practice - Guthrie, Walker, et al. - 1994 |
Citation Context: ...m frequencies is to treat the bag of words for a length f document as resulting from f draws on a d-valued multinomial variable X, rather than as a single draw on a vector-valued variable of length d [15]. The naive Bayes assumption then is that the draws on X are independent: each word of the document is generated independently from every other. A multinomial model has the advantage that document le...

20 | A decision theoretic foundation for indexing - Bookstein, Swanson - 1975 |
Citation Context: ...e context of naive Bayes classifiers and some for other purposes. The distributions investigated have mostly been Poisson mixtures [4, 26]: the Poisson itself [40], mixtures of 2, 3, or more Poissons [1, 2, 22, 23, 36], and the negative binomial (an infinite mixture of Poissons) [40]. The details of the particular models can be complex, sometimes involving latent variables that intervene between the class label and...

17 | Modelling documents with multiple Poisson distributions - Margulis - 1993 |
Citation Context: ...e context of naive Bayes classifiers and some for other purposes. The distributions investigated have mostly been Poisson mixtures [4, 26]: the Poisson itself [40], mixtures of 2, 3, or more Poissons [1, 2, 22, 23, 36], and the negative binomial (an infinite mixture of Poissons) [40]. The details of the particular models can be complex, sometimes involving latent variables that intervene between the class label and...

9 | Information Retrieval: Data Structures and Algorithms - Frakes, Baeza-Yates (editors) - 1992 |
Citation Context: ... Sparck Jones noted that if a system does not need to choose between $c_1$ and $c_2$, but only to rank documents in order of $P(c_1 \mid x)$, then the only quantity needed from Equation 11 is: $\sum_{j} x_j \log \frac{p_{j1}(1 - p_{j2})}{(1 - p_{j1})\, p_{j2}}$ (12) All other values in Equation 11 are constant across x's, and so can be dropped. The result is still monotonic with $P(c_1 \mid x)$, but does not require an estimate of the prior $P(c_1)$. Such an estimate is difficult ...

9 | Operations research applied to document indexing and retrieval decisions - Bookstein, Kraft - 1977 |
Citation Context: ...e context of naive Bayes classifiers and some for other purposes. The distributions investigated have mostly been Poisson mixtures [4, 26]: the Poisson itself [40], mixtures of 2, 3, or more Poissons [1, 2, 22, 23, 36], and the negative binomial (an infinite mixture of Poissons) [40]. The details of the particular models can be complex, sometimes involving latent variables that intervene between the class label and...

7 | Two learning schemes in information retrieval - Yu, Mizuno - 1988 |
Citation Context: ...oach is to fit a distributional but nonparametric model (for instance a linear regression) to predict the probability that a given term frequency will be observed in a document of a particular length [53]. Such nonparametric approaches have been relatively rare in IR, and it appears that the sophisticated discretization and kernel based approaches investigated in machine learning have not been tried. ...

5 | Perceptrons: An Introduction to Computational Geometry - Minsky, Papert - 1969 |

5 | Applied Bayesian and classical inference - Mosteller, Wallace - 1984 |
Citation Context: ...encies have been investigated, some in the context of naive Bayes classifiers and some for other purposes. The distributions investigated have mostly been Poisson mixtures [4, 26]: the Poisson itself [40], mixtures of 2, 3, or more Poissons [1, 2, 22, 23, 36], and the negative binomial (an infinite mixture of Poissons) [40]. The details of the particular models can be complex, sometimes involving latent v...

3 | Bayesian inference with node aggregation for information retrieval - Del Favero, Fung - 1994 |

2 | Document classification using a finite mixture model - Li, Yamanishi - 1997 |

1 | Operations research applied to document indexing and retrieval decisions - Bookstein, Kraft - 1977 |