Poisson Mixtures (1995)

by Kenneth W. Church, William A. Gale
Natural Language Engineering

Abstract:

Shannon (1948) showed that a wide range of practical problems can be reduced to the problem of estimating probability distributions of words and ngrams in text. It has become standard practice in text compression, speech recognition, information retrieval and many other applications of Shannon's theory to introduce a "bag-of-words" assumption. But word rates clearly vary from genre to genre, author to author, topic to topic, document to document, section to section, and paragraph to paragraph. The proposed Poisson mixture captures much of this heterogeneous structure by allowing the Poisson parameter theta to vary over documents subject to a density function phi. phi is intended to capture dependencies on hidden variables such as genre, author, topic, etc. (The Negative Binomial is a well-known special case where phi is a Gamma distribution.) Poisson mixtures fit the data better than standard Poissons, producing more accurate estimates of the variance over documents (sigma^2), entropy (H), inverse document frequency (IDF), and adaptation (Pr(x>=2|x>=1)).
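
As an illustration of the special case mentioned in the abstract, the following minimal sketch compares a single Poisson with a Negative Binomial (a Poisson whose rate theta follows a Gamma density) at the same mean rate. It is not taken from the paper. The function names, the shape/scale parameterization (k, s), and the example values are assumptions chosen for illustration, and IDF is taken here to be -log2 Pr(x>=1).

```python
import math

def poisson_pmf(x, theta):
    """Pr(x) for a single Poisson with rate theta > 0."""
    return math.exp(-theta + x * math.log(theta) - math.lgamma(x + 1))

def negbin_pmf(x, k, s):
    """Pr(x) when theta ~ Gamma(shape=k, scale=s) and x | theta ~ Poisson(theta).

    This Gamma mixture of Poissons is the Negative Binomial; its mean is k * s.
    """
    log_p = (math.lgamma(k + x) - math.lgamma(k) - math.lgamma(x + 1)
             + x * math.log(s / (1 + s)) - k * math.log(1 + s))
    return math.exp(log_p)

def variance(pmf, n_terms=500):
    """Variance over documents, by summing a truncated series (assumes a light tail)."""
    mean = sum(x * pmf(x) for x in range(n_terms))
    return sum((x - mean) ** 2 * pmf(x) for x in range(n_terms))

def idf(pmf):
    """Predicted IDF, assumed here to be -log2 Pr(x >= 1)."""
    return -math.log2(1 - pmf(0))

def adaptation(pmf):
    """Pr(x >= 2 | x >= 1): chance of a second occurrence given a first."""
    p0, p1 = pmf(0), pmf(1)
    return (1 - p0 - p1) / (1 - p0)

# Hypothetical word with a mean of 0.1 occurrences per document.
mean_rate = 0.1
k, s = 0.05, 2.0  # assumed Gamma shape and scale, chosen so that k * s == mean_rate

models = {
    "Poisson": lambda x: poisson_pmf(x, mean_rate),
    "Negative Binomial": lambda x: negbin_pmf(x, k, s),
}
for name, pmf in models.items():
    print(f"{name}: variance={variance(pmf):.4f} "
          f"IDF={idf(pmf):.4f} adaptation={adaptation(pmf):.4f}")
```

With these assumed parameters the two models share a mean but differ in the other statistics: the mixture spreads the same mass over fewer documents, so it predicts a larger variance, a larger IDF, and a much higher chance of a second occurrence once a word has appeared.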
