## Modeling word burstiness using the Dirichlet distribution (2005)

Venue: Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), volume 119 of the ACM International Conference Proceeding Series

Citations: 49 (4 self)

### BibTeX

@INPROCEEDINGS{Madsen05modelingword,
  author = {Rasmus E. Madsen and David Kauchak and Charles Elkan},
  title = {Modeling word burstiness using the Dirichlet distribution},
  booktitle = {Proceedings of the 22nd International Conference on Machine Learning},
  series = {ACM International Conference Proceeding Series},
  volume = {119},
  year = {2005},
  pages = {545--552}
}

### Abstract

Multinomial distributions are often used to model text documents. However, they do not capture well the phenomenon that words in a document tend to appear in bursts: if a word appears once, it is more likely to appear again. In this paper, we propose the Dirichlet compound multinomial model (DCM) as an alternative to the multinomial. The DCM model has one additional degree of freedom, which allows it to capture burstiness. We show experimentally that the DCM is substantially better than the multinomial at modeling text data, measured by perplexity. We also show using three standard document collections that the DCM leads to better classification than the multinomial model. DCM performance is comparable to that obtained with multiple heuristic changes to the multinomial model.
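The generative process behind the abstract's DCM is simple to simulate: a Dirichlet draw picks one multinomial per document, and all of that document's words come from it, so a word that appears once tends to reappear. The following is an illustrative sketch only (not the authors' code; the vocabulary size, alpha values, and document length are invented), contrasting per-word count variance under a DCM against a plain multinomial:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-word vocabulary; a small Dirichlet parameter sum
# concentrates each document's theta on few words, producing bursts.
alpha = np.array([0.1, 0.1, 0.1, 0.1])
doc_len = 20

def sample_dcm_doc(alpha, n, rng):
    """DCM: one theta per document, then all n words from that multinomial."""
    theta = rng.dirichlet(alpha)
    return rng.multinomial(n, theta)

def sample_multinomial_doc(p, n, rng):
    """Plain multinomial baseline with a fixed word distribution."""
    return rng.multinomial(n, p)

dcm_counts = np.array([sample_dcm_doc(alpha, doc_len, rng)
                       for _ in range(1000)])
mn_counts = np.array([sample_multinomial_doc(np.full(4, 0.25), doc_len, rng)
                      for _ in range(1000)])

# Burstiness shows up as much larger per-word count variance under the DCM,
# even though both models have the same mean count per word.
print(dcm_counts.var(axis=0).mean(), mn_counts.var(axis=0).mean())
```

Both samplers produce documents with identical expected word counts; only the DCM's extra degree of freedom (the Dirichlet concentration) lets within-document repetition exceed what independent draws allow.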

### Citations

2721 | Indexing by Latent Semantic Analysis
- Deerwester, Dumais, et al.
- 1990
Citation Context: ...ferent from the within-document diversity allowed by latent topic modeling, where each topic is represented by a single multinomial, but each word in a document may be generated by a different topic (Deerwester et al., 1990; Hofmann, 1999; Blei et al., 2003). We hope that many applications of text modeling in addition to those outlined in this paper will benefit from using DCM models in the future. Acknowledgments. This...

2366 | Latent Dirichlet Allocation
- Blei, Ng, et al.
Citation Context: ...of-bags-of-words. We show below that the latter approach works well. Dirichlet distributions have been used previously to model text, but our approach is fundamentally different. In the LDA approach (Blei et al., 2003) the Dirichlet is a distribution over topics, while each topic is modeled in the usual way as a multinomial distribution over words. In our approach, each topic, i.e. each class of documents, is mode...

1103 | Machine learning in automated text categorization
- Sebastiani
- 2002
Citation Context: ...sumption, word emissions are independent given the class, i.e. the naive Bayes property holds. This property is not valid (Lewis, 1998), but naive Bayes models remain popular (McCallum & Nigam, 1998; Sebastiani, 2002) because they are fast and easy to implement, they can be fitted even with limited training data, and they do yield accurate classification when heuristics are applied (Jones, 1972; Rennie et al., 20...

828 | A Vector Space Model for Automatic Indexing
- Salton, Wong, et al.
- 1975
Citation Context: ...tic model that, without any heuristic changes, is far better suited for representing a class of text documents. As most researchers do, we represent an individual document as a vector of word counts (Salton et al., 1975). This bag-of-words representation loses semantic information, but it simplifies further processing. The usual next simplification is the assumption that documents are generated by repeatedly drawing...

758 | A comparison of event models for naive Bayes text classification
- McCallum, Nigam
Citation Context: ...tribution. Under this assumption, word emissions are independent given the class, i.e. the naive Bayes property holds. This property is not valid (Lewis, 1998), but naive Bayes models remain popular (McCallum & Nigam, 1998; Sebastiani, 2002) because they are fast and easy to implement, they can be fitted even with limited training data, and they do yield accurate classification when heuristics are applied (Jones, 1972;...

356 | A statistical interpretation of term specificity and its application in retrieval
- Jones
- 1972
Citation Context: ...Nigam, 1998; Sebastiani, 2002) because they are fast and easy to implement, they can be fitted even with limited training data, and they do yield accurate classification when heuristics are applied (Jones, 1972; Rennie et al., 2003). The central problem with the naive Bayes assumption is that words tend to appear in bursts, as opposed to being emitted independently (Church & Gale, 1995; Katz, 1996). Rennie...

347 | Naive (Bayes) at forty: The independence assumption in information retrieval
- Lewis
- 1998
Citation Context: ...enerated by repeatedly drawing words from a fixed distribution. Under this assumption, word emissions are independent given the class, i.e. the naive Bayes property holds. This property is not valid (Lewis, 1998), but naive Bayes models remain popular (McCallum & Nigam, 1998; Sebastiani, 2002) because they are fast and easy to implement, they can be fitted even with limited training data, and they do yield a...

321 | Human Behaviour and the Principle of Least-Effort: An Introduction to Human Ecology. Addison-Wesley, Cambridge, MA
- Zipf
- 1949
Citation Context: ...el is appropriate for common words, but not for other words. The distributions of counts produced by multinomials are fundamentally different from the count distributions of natural text. Zipf's law (Zipf, 1949) states that the probability p_i of occurrence of an event follows a power law p_i ≈ i^(-a), where i is the rank of the event and a is a parameter. The most famous example of Zipf's law is that the freq...
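The rank-frequency power law in that passage is easy to check numerically. In this sketch the word frequencies are made up for the example; the exponent a is recovered by a least-squares fit in log-log space, since p_i ≈ i^(-a) implies log p_i ≈ -a log i + c:

```python
import numpy as np

# Hypothetical word frequencies, already sorted by rank (rank 1 = most frequent).
# These roughly follow 1000 / rank, i.e. a Zipf exponent near one.
freqs = np.array([1000, 520, 330, 250, 210, 160, 140, 130, 115, 100], float)
probs = freqs / freqs.sum()
ranks = np.arange(1, len(freqs) + 1)

# Linear regression in log-log space: slope = -a, intercept = c.
slope, intercept = np.polyfit(np.log(ranks), np.log(probs), 1)
a = -slope
print(f"fitted Zipf exponent a ≈ {a:.2f}")
```

On real corpora the same two-line fit applied to the full vocabulary typically yields an exponent close to one for English, matching the "exponent close to minus one" the snippet mentions.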

255 | Automated learning of decision rules for text categorization
- Apté, Damerau, et al.
- 1994
Citation Context: ...n the industry and newsgroup data sets each document belongs to one class only. The Reuters-21578 data set contains 21,578 documents. We use the Mod Apte split which only contains 10,789 documents (Apte et al., 1994), those in the 90 classes with at least one training and one test example. The Mod Apte split uses a predefined set of 7,770 training documents and 3,019 test documents. The documents are multi-label...

246 | A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering (unpublished manuscript)
- McCallum
- 1996
Citation Context: ...2003). We have made every effort to reproduce previous results in order to ensure that a fair comparison is made. Documents are preprocessed and count vectors are extracted using the Rainbow toolbox (McCallum, 1996). The 500 most common words are removed from the vocabulary to ensure that our results are comparable with previous results. The Dirichlet toolbox (Minka, 2003) is used to estimate the parameters of...

108 | Tackling the poor assumptions of naive Bayes text classifiers
- Rennie, Shih, et al.
- 2003
Citation Context: ...ic model that represents the data well. Unfortunately, for text classification too little attention has been devoted to this task. Instead, a generic multinomial model is typically used. Recent work (Rennie et al., 2003) has pointed out a number of deficiencies of the multinomial model, and Appearing in Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005. Copyright 2005 by the...

81 | Poisson mixtures
- Church, Gale
- 1995
Citation Context: ...when heuristics are applied (Jones, 1972; Rennie et al., 2003). The central problem with the naive Bayes assumption is that words tend to appear in bursts, as opposed to being emitted independently (Church & Gale, 1995; Katz, 1996). Rennie et al. (2003) address this issue by log-normalizing counts, reducing the impact of burstiness on the likelihood of a document. Teevan and Karger (2003) empirically search for a m...

78 | Distribution of content words and phrases in text and language modelling
- Katz
- 1996
Citation Context: ...applied (Jones, 1972; Rennie et al., 2003). The central problem with the naive Bayes assumption is that words tend to appear in bursts, as opposed to being emitted independently (Church & Gale, 1995; Katz, 1996). Rennie et al. (2003) address this issue by log-normalizing counts, reducing the impact of burstiness on the likelihood of a document. Teevan and Karger (2003) empirically search for a model that fi...

39 | An Information-theoretic Perspective of TF-IDF
- Aizawa
- 2003
Citation Context: ...+ x_dw) where all logarithms are natural (base e). A traditional information retrieval heuristic is the term-frequency inverse-document-frequency (TFIDF) transformation, which exists in various forms (Aizawa, 2003). The version used here includes the log-transformation: x^tfidf_dw = log(1 + x_dw) · log(D / Σ_{d'=1..D} δ_{d'w}), where δ_dw is 1 if word w is present in document d. After the TFIDF transformation, document v...
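The log-TFIDF transform described in that context can be transcribed directly. In this sketch the count matrix is invented for illustration: x[d, w] is the count of word w in document d, D is the number of documents, and δ[d, w] indicates whether word w occurs in document d:

```python
import numpy as np

# Hypothetical document-term count matrix: 3 documents, 4 vocabulary words.
x = np.array([[3, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 4]], float)
D = x.shape[0]

# delta[d, w] = 1 if word w is present in document d; df[w] is the
# document frequency, i.e. the sum over d of delta[d, w].
delta = (x > 0).astype(float)
df = delta.sum(axis=0)

# Log-TFIDF as in the quoted context:
#   x_tfidf[d, w] = log(1 + x[d, w]) * log(D / df[w])
# (every word here occurs in at least one document, so df > 0).
idf = np.log(D / df)
x_tfidf = np.log1p(x) * idf
print(x_tfidf)
```

Note that the term factor log(1 + x_dw) damps repeated occurrences of a word, which is one of the heuristic corrections for burstiness that the DCM is meant to replace with a principled model.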

17 | Indexing by Latent Semantic Analysis - Harshman - 1990

11 | Empirical development of an exponential probabilistic model for text retrieval: using textual analysis to build a better model - Teevan, Karger - 2003

6 | Parametric Models of Linguistic Count Data. 41st Annual Meeting of the Association for Computational Linguistics - Jansche - 2003

1 | Probabilistic latent semantic indexing (PLSI)
- Hofmann
- 1999
Citation Context: ...cument diversity allowed by latent topic modeling, where each topic is represented by a single multinomial, but each word in a document may be generated by a different topic (Deerwester et al., 1990; Hofmann, 1999; Blei et al., 2003). We hope that many applications of text modeling in addition to those outlined in this paper will benefit from using DCM models in the future. Acknowledgments. This paper is based...

1 | Estimating a Dirichlet distribution. www.stat.cmu.edu/˜minka/papers/dirichlet
- Minka
- 2003
Citation Context: ...frequency of an English word, as a function of the word's rank, follows a power law with exponent close to minus one. We propose to model a collection of text documents with a Dirichlet distribution (Minka, 2003). The Dirichlet distribution can be interpreted in two ways for this purpose, either as a bag-of-scaled-documents or as a bag-of-bags-of-words. We show below that the latter approach works well. Diri...
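Minka's note covers maximum-likelihood fitting of a Dirichlet by fixed-point and Newton iterations; even the much cruder moment-matching estimate often used to initialize those iterations recovers the parameters approximately. A sketch with simulated data (the true parameters here are invented), using E[p_k] = α_k / s and Var[p_1] = m_1(1 − m_1)/(s + 1) with s = Σ α_k:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated "documents": proportion vectors drawn from a known Dirichlet.
true_alpha = np.array([2.0, 5.0, 3.0])
samples = rng.dirichlet(true_alpha, size=20000)

# Moment matching: the sample means give alpha up to the scale s,
# and the variance of the first component pins down s itself.
m = samples.mean(axis=0)              # estimates alpha_k / s
v = samples[:, 0].var()               # estimates m_1 (1 - m_1) / (s + 1)
s = m[0] * (1 - m[0]) / v - 1         # solve for the concentration s
alpha_hat = s * m
print(alpha_hat)
```

With enough samples the estimate lands close to the true parameters; in practice one would hand this estimate to an iterative maximum-likelihood routine such as those in Minka's Dirichlet toolbox (mentioned in the McCallum, 1996 context above) for refinement.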