## An Empirical Study of Smoothing Techniques for Language Modeling (1996)


Citations: 851 (20 self)

### BibTeX

@TECHREPORT{Chen96anempirical,
  author = {Stanley F. Chen and Joshua Goodman},
  title = {An Empirical Study of Smoothing Techniques for Language Modeling},
  institution = {},
  year = {1996}
}


### Abstract

We present a tutorial introduction to n-gram models for language modeling and survey the most widely-used smoothing algorithms for such models. We then present an extensive empirical comparison of several of these smoothing techniques, including those described by Jelinek and Mercer (1980), Katz (1987), Bell, Cleary, and Witten (1990), Ney, Essen, and Kneser (1994), and Kneser and Ney (1995). We investigate how factors such as training data size, training corpus (e.g., Brown versus Wall Street Journal), count cutoffs, and n-gram order (bigram versus trigram) affect the relative performance of these methods, which is measured through the cross-entropy of test data. Our results show that previous comparisons have not been complete enough to fully characterize smoothing algorithm performance. We introduce methodologies for analyzing smoothing algorithm efficacy in detail, and using these techniques we motivate a novel variation of Kneser-Ney smoothing that consistently outperforms all oth...

### Citations

697 | Class-based Ngram models of natural language - Brown - 1992 |

692 | A stochastic parts program and noun phrase parser for unrestricted text - Church - 1988
Citation Context: ...some domain of interest. Language models are employed in many tasks including speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction (Church, 1988; Brown et al., 1990; Kernighan, Church & Gale, 1990; Hull, 1992; Srihari & Baltus, 1992). The central goal of the most commonly used language models, trigram models, is to determine the probability o... |

664 | Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer - Katz - 1987
Citation Context: ...is a central issue in language modeling, the literature lacks a definitive comparison between the many existing techniques. Most previous studies that have compared smoothing algorithms (Nádas, 1984; Katz, 1987; Church & Gale, 1991; Kneser & Ney, 1995; MacKay & Peto, 1995) have only done so with a small number of methods (typically two) on one or two corpora and using a single training set size. Perhaps the... |

617 | Text Compression - Bell, Cleary, et al. - 1990
Citation Context: ...T measured in words. This value can be interpreted as the average number of bits needed to encode each of the $W_T$ words in the test data using the compression algorithm associated with model $p(t_k)$ (Bell et al., 1990). We sometimes refer to cross-entropy as just entropy. The perplexity $PP_p(T)$ of a model $p$ is the reciprocal of the (geometric) average probability assigned by the model to each word in the test set... |
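The context above defines cross-entropy and perplexity only in passing. A minimal sketch of both quantities for a toy unigram model (the function and variable names are hypothetical, not from the paper):

```python
import math

def cross_entropy(model_probs, test_words):
    """Average negative log2 probability per word: the number of bits
    needed to encode each test word under the model."""
    total = sum(-math.log2(model_probs[w]) for w in test_words)
    return total / len(test_words)

def perplexity(model_probs, test_words):
    """Reciprocal of the geometric-mean probability: 2**cross_entropy."""
    return 2 ** cross_entropy(model_probs, test_words)

# Toy unigram model over a three-word vocabulary.
probs = {"a": 0.5, "b": 0.25, "c": 0.25}
test = ["a", "b", "a", "c"]
h = cross_entropy(probs, test)   # (1 + 2 + 1 + 2) / 4 = 1.5 bits/word
pp = perplexity(probs, test)     # 2 ** 1.5
```

Lower cross-entropy (equivalently, lower perplexity) on held-out text is the measure of smoothing quality used throughout the paper.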

584 | A Statistical Approach to Machine Translation - Brown, Cocke, et al. - 1990
Citation Context: ...interest. Language models are employed in many tasks including speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction (Church, 1988; Brown et al., 1990; Kernighan, Church & Gale, 1990; Hull, 1992; Srihari & Baltus, 1992). The central goal of the most commonly used language models, trigram models, is to determine the probability of a word given the p... |

526 | Theory of probability - Jeffreys - 1961 |

525 | Switchboard: Telephone speech corpus for research and development - Godfrey, Holliman, et al. - 1992 |

512 | An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process - Baum - 1972
Citation Context: ...be the uniform distribution $p_{unif}(w_i) = \frac{1}{|V|}$. Given fixed $p_{ML}$, it is possible to search efficiently for the $\lambda_{w_{i-n+1}^{i-1}}$ that maximize the probability of some data using the Baum–Welch algorithm (Baum, 1972). Training a distinct $\lambda_{w_{i-n+1}^{i-1}}$ for each $w_{i-n+1}^{i-1}$ is not generally felicitous, while setting all $\lambda_{w_{i-n+1}^{i-1}}$ to the same value leads to poor performance (Ristad, 1995). Bahl, Jelinek and Mer... |
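The interpolation this snippet refers to can be sketched with a single shared λ standing in for the bucketed $\lambda_{w_{i-n+1}^{i-1}}$ weights that the paper trains with Baum–Welch (all names here are hypothetical, and a fixed λ is a simplification):

```python
from collections import Counter

def interpolated_bigram(corpus, lam=0.7):
    """Jelinek-Mercer style interpolation:
    p(w|v) = lam * p_ML(w|v) + (1 - lam) * p_ML(w).
    A single shared lam replaces the per-history weights of the paper."""
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    n = len(corpus)

    def p(w, v):
        p_uni = unigrams[w] / n
        p_bi = bigrams[(v, w)] / unigrams[v] if unigrams[v] else 0.0
        return lam * p_bi + (1 - lam) * p_uni
    return p

corpus = "the cat sat on the mat".split()
p = interpolated_bigram(corpus, lam=0.5)
# "sat" is never followed by "sat" in the corpus, but interpolation
# with the unigram model keeps the probability nonzero.
```

As the snippet notes, in practice the weights are bucketed by history count rather than shared or trained one per history.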

503 | Three generative, lexicalised models for statistical parsing - Collins - 1997
Citation Context: ...her applications as well, e.g. prepositional phrase attachment (Collins & Brooks, 1995), part-of-speech tagging (Church, 1988), and stochastic parsing (Magerman, 1994; Collins, 1997; Goodman, 1997). Whenever data sparsity is an issue, smoothing can help performance, and data sparsity is almost always an issue in statistical modeling. In the extreme case where there is so much tr... |

391 | A Maximum Likelihood Approach to Continuous Speech Recognition - Bahl, Jelinek, et al. - 1983 |

353 | The population frequencies of the species and the estimation of population parameters - Good - 1953
Citation Context: ...words considered. Lidstone and Jeffreys advocate taking δ = 1. Gale and Church (1990, 1994) have argued that this method generally performs poorly. 2.2. Good–Turing estimate: The Good–Turing estimate (Good, 1953) is central to many smoothing techniques. The Good–Turing estimate states that for any n-gram that occurs r times, we should pretend that it occurs $r^*$ times, where $r^* = (r+1)\frac{n_{r+1}}{n_r}$ (2) and where $n_r$... |
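The adjusted-count formula $r^* = (r+1)\,n_{r+1}/n_r$ can be sketched directly from a table of n-gram counts. The fallback for empty count-of-count buckets below is an assumption for illustration, not part of the estimate itself:

```python
from collections import Counter

def good_turing_adjusted(counts):
    """Good-Turing: treat an n-gram seen r times as if it were seen
    r* = (r + 1) * n_{r+1} / n_r times, where n_r is the number of
    distinct n-grams occurring exactly r times."""
    n = Counter(counts.values())          # count-of-counts n_r
    adjusted = {}
    for gram, r in counts.items():
        if n[r + 1] > 0:
            adjusted[gram] = (r + 1) * n[r + 1] / n[r]
        else:
            adjusted[gram] = float(r)     # assumed fallback when n_{r+1} = 0
    return adjusted

bigram_counts = {("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1, ("a", "c"): 2}
adj = good_turing_adjusted(bigram_counts)
# n_1 = 3 and n_2 = 1, so each singleton gets r* = 2 * 1/3
```

As the Nádas (1985) entry below notes, the raw estimate is reliable only where the $n_r$ values are large; real implementations smooth the $n_r$ sequence first.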

336 | Interpolated estimation of markov source parameters from sparse data - Jelinek, Mercer - 1980 |

271 | Improved backing-off for m-gram language modeling - Kneser, Ney - 1995
Citation Context: ...deling, the literature lacks a definitive comparison between the many existing techniques. Most previous studies that have compared smoothing algorithms (Nádas, 1984; Katz, 1987; Church & Gale, 1991; Kneser & Ney, 1995; MacKay & Peto, 1995) have only done so with a small number of methods (typically two) on one or two corpora and using a single training set size. Perhaps the most complete previous comparison is tha... |

229 | A Gaussian prior for smoothing maximum entropy models - Chen, Rosenfeld - 1999
Citation Context: ...us $p(s) = \prod_{i=1}^{l+1} p(w_i|w_1 \cdots w_{i-1}) \approx \prod_{i=1}^{l+1} p(w_i|w_{i-n+1}^{i-1})$, where $w_i^j$ denotes the words $w_i \cdots w_j$ and where we take $w_{-n+2}$ through $w_0$ to be 〈BOS〉. (Maximum entropy techniques can also be used to smooth n-gram models. A discussion of these techniques can be found elsewhere (Chen & Rosenfeld, 1999).) To estimate $p(w_i|w_{i-n+1}^{i-1})$, a nat... |
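The product in the snippet (the chain rule plus the n-gram approximation) is usually computed in log space. A sketch with a hypothetical constant trigram model, just to show the bookkeeping with 〈BOS〉 padding:

```python
import math

def sentence_logprob(sentence, cond_prob, n=3):
    """Chain rule with the n-gram approximation:
    log2 p(s) = sum_i log2 p(w_i | w_{i-n+1} .. w_{i-1}),
    padding the start with <BOS> markers and ending with <EOS>."""
    words = ["<BOS>"] * (n - 1) + sentence + ["<EOS>"]
    total = 0.0
    for i in range(n - 1, len(words)):
        history = tuple(words[i - n + 1:i])
        total += math.log2(cond_prob(words[i], history))
    return total

# Hypothetical model that assigns probability 0.25 to every event.
logp = sentence_logprob("the cat".split(), lambda w, h: 0.25, n=3)
# Three events (the, cat, <EOS>), each -2 bits, so log2 p(s) = -6
```

The end-of-sentence token accounts for the $l+1$ factors in the product above.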

228 | The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression - Witten, Bell - 1991
Citation Context: ...to Jelinek–Mercer smoothing. To end the recursion, the Katz unigram model is taken to be the maximum likelihood unigram model. 2.5. Witten–Bell smoothing: Witten–Bell smoothing (Bell et al., 1990; Witten & Bell, 1991) was developed for the task of text compression, and can be considered to be an instance of Jelinek–Mercer smoothing. In particular, the nth-order smoothed model is defined recursively as a linear... |

169 | On structuring probabilistic dependences in stochastic language modeling - Ney, Essen, et al. - 1994 |

155 | Natural Language Parsing as Statistical Pattern Recognition - Magerman - 1994
Citation Context: ...but for many other applications as well, e.g. prepositional phrase attachment (Collins & Brooks, 1995), part-of-speech tagging (Church, 1988), and stochastic parsing (Magerman, 1994; Collins, 1997; Goodman, 1997). Whenever data sparsity is an issue, smoothing can help performance, and data sparsity is almost always an issue in statistical modeling. In the extreme case where ther... |

136 | Prepositional phrase attachment through a backed-off model - Collins, Brooks - 1995
Citation Context: ...e. 6. Discussion: Smoothing is a fundamental technique for statistical modeling, important not only for language modeling but for many other applications as well, e.g. prepositional phrase attachment (Collins & Brooks, 1995), part-of-speech tagging (Church, 1988), and stochastic parsing (Magerman, 1994; Collins, 1997; Goodman, 1997). Whenever data sparsity is an issue, smoothing can help... |

128 | A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams - Church, Gale - 1991
Citation Context: ...issue in language modeling, the literature lacks a definitive comparison between the many existing techniques. Most previous studies that have compared smoothing algorithms (Nádas, 1984; Katz, 1987; Church & Gale, 1991; Kneser & Ney, 1995; MacKay & Peto, 1995) have only done so with a small number of methods (typically two) on one or two corpora and using a single training set size. Perhaps the most complete previo... |

120 | A tree-based statistical language model for natural language speech recognition - Bahl - 1987 |

119 | Entropy-based pruning of backoff language models - Stolcke - 1998
Citation Context: ...he differences in performance seem to be less when cutoffs are used. Recently, several more sophisticated n-gram model pruning techniques have been developed (Kneser, 1996; Seymore & Rosenfeld, 1996; Stolcke, 1998). It remains to be seen how smoothing interacts with these new techniques. 5.3.3. Cross-entropy and speech recognition: In this section, we briefly examine how the performance of a language model meas... |

81 | A spelling correction program based on a noisy channel model - Kernighan, Church, et al. - 1990 |

79 | A hierarchical Dirichlet language model - MacKay, Peto - 1994
Citation Context: ...re lacks a definitive comparison between the many existing techniques. Most previous studies that have compared smoothing algorithms (Nádas, 1984; Katz, 1987; Church & Gale, 1991; Kneser & Ney, 1995; MacKay & Peto, 1995) have only done so with a small number of methods (typically two) on one or two corpora and using a single training set size. Perhaps the most complete previous comparison is that of Ney, Martin and... |

68 | Good-turing frequency estimation without tears - Gale, Sampson - 1995 |

65 | Scalable backoff language models - Seymore, Rosenfeld - 1996
Citation Context: ...dition, the magnitudes of the differences in performance seem to be less when cutoffs are used. Recently, several more sophisticated n-gram model pruning techniques have been developed (Kneser, 1996; Seymore & Rosenfeld, 1996; Stolcke, 1998). It remains to be seen how smoothing interacts with these new techniques. 5.3.3. Cross-entropy and speech recognition: In this section, we briefly examine how the performance of a lang... |

64 | An estimate of an upper bound for the entropy of English - Brown, Pietra, et al. - 1992 |

64 | Building Probabilistic Models for Natural Language - Chen - 1996
Citation Context: ...In Section 5, we present the results of all of our experiments. Finally, in Section 6 we summarize the most important conclusions of this work. This work builds on our previously reported research (Chen, 1996; Chen & Goodman, 1996). An extended version of this paper (Chen & Goodman, 1998) is available; it contains a tutorial introduction to n-gram models and smoothing, more complete descriptions of existi... |

49 | Statistical Language Modeling Using Leaving-One-Out - Ney, Martin, et al. - 1997 |

40 | Note on the general case of the Bayes-Laplace formula for inductive or a posteriori probabilities - Lidstone - 1920
Citation Context: ...entations sometimes differ significantly from the original algorithm description. 2.1. Additive smoothing: One of the simplest types of smoothing used in practice is additive smoothing (Laplace, 1825; Lidstone, 1920; Johnson, 1932; Jeffreys, 1948). To avoid zero probabilities, we pretend that each n-gram occurs slightly more often than it actually does: we add a factor δ to every count, where typically 0 < δ ≤ 1... |
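The add-δ recipe in the snippet, sketched for bigrams (function and variable names are hypothetical; δ = 1 gives the Laplace case the text mentions):

```python
from collections import Counter

def additive_bigram(corpus, vocab, delta=1.0):
    """Additive (Lidstone) smoothing: add delta to every bigram count,
    p(w|v) = (c(v, w) + delta) / (c(v) + delta * |V|)."""
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    V = len(vocab)

    def p(w, v):
        return (bigrams[(v, w)] + delta) / (unigrams[v] + delta * V)
    return p

corpus = "the cat sat on the mat".split()
vocab = set(corpus)   # five distinct words
p = additive_bigram(corpus, vocab, delta=1.0)
# The unseen bigram ("mat", "cat") still gets nonzero probability.
```

The denominator's $\delta |V|$ term keeps each conditional distribution normalized, which is exactly the property Gale and Church criticize for distorting probabilities when counts are small.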

35 | Estimation of probabilities in the language model of the IBM speech recognition system - Nádas - 1984
Citation Context: ...le smoothing is a central issue in language modeling, the literature lacks a definitive comparison between the many existing techniques. Most previous studies that have compared smoothing algorithms (Nádas, 1984; Katz, 1987; Church & Gale, 1991; Kneser & Ney, 1995; MacKay & Peto, 1995) have only done so with a small number of methods (typically two) on one or two corpora and using a single training set size.... |


31 | Probability: the deductive and inductive problems - Johnson - 1932
Citation Context: ...mes differ significantly from the original algorithm description. 2.1. Additive smoothing: One of the simplest types of smoothing used in practice is additive smoothing (Laplace, 1825; Lidstone, 1920; Johnson, 1932; Jeffreys, 1948). To avoid zero probabilities, we pretend that each n-gram occurs slightly more often than it actually does: we add a factor δ to every count, where typically 0 < δ ≤ 1. Thus, we set... |

28 | The JanusRTk Switchboard/Callhome 1997 Evaluation System: Pronunciation Modeling - Finke - 1997 |

28 | On Turing's Formula for Word Probabilities - Nádas - 1985
Citation Context: ...malize: for an n-gram $w_{i-n+1}^i$ with r counts, we take $p_{GT}(w_{i-n+1}^i) = \frac{r^*}{N}$, where $N = \sum_{r=0}^{\infty} n_r r^*$. The Good–Turing estimate can be derived theoretically using only a couple of weak assumptions (Nádas, 1985), and has been shown empirically to accurately describe data when $n_r$ values are large. In practice, the Good–Turing estimate is not used by itself for n-gram smoothing, because it does not include th... |

26 | Statistical Language Modeling Using a Variable Context Length - Kneser - 1996
Citation Context: ...ff case. In addition, the magnitudes of the differences in performance seem to be less when cutoffs are used. Recently, several more sophisticated n-gram model pruning techniques have been developed (Kneser, 1996; Seymore & Rosenfeld, 1996; Stolcke, 1998). It remains to be seen how smoothing interacts with these new techniques. 5.3.3. Cross-entropy and speech recognition: In this section, we briefly examine ho... |

24 | What is wrong with adding one - Gale, Church - 1994 |


22 | On smoothing techniques for bigram-based natural language modelling - Ney, Essen - 1991
Citation Context: ...to method C in these references. Different notation is used in the original text. 2.6. Absolute discounting: Absolute discounting (Ney & Essen, 1991; Ney, Essen and Kneser, 1994), like Jelinek–Mercer smoothing, involves the interpolation of higher- and lower-order models. However, instead of multiplying the higher-order maximum-likelihood dist... |
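The scheme the snippet begins to describe, subtracting a fixed discount d from each nonzero count and redistributing the freed mass over a lower-order distribution, can be sketched as follows. Interpolating with the ML unigram model is a simplifying assumption here; the names are hypothetical:

```python
from collections import Counter

def absolute_discount_bigram(corpus, d=0.75):
    """Absolute discounting sketch:
    p(w|v) = max(c(v,w) - d, 0) / c(v) + gamma(v) * p_ML(w),
    where gamma(v) = d * (# distinct words following v) / c(v)
    is exactly the probability mass freed by the discounts."""
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    n = len(corpus)
    # Number of distinct continuations observed after each history v.
    followers = Counter(v for (v, _w) in bigrams)

    def p(w, v):
        c_v = unigrams[v]
        discounted = max(bigrams[(v, w)] - d, 0) / c_v
        backoff_mass = d * followers[v] / c_v
        return discounted + backoff_mass * (unigrams[w] / n)
    return p

corpus = "the cat sat on the mat".split()
p = absolute_discount_bigram(corpus, d=0.5)
```

Kneser–Ney smoothing, the family the paper's winning variant belongs to, keeps this discounting step but replaces the unigram backoff with a continuation-count distribution.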

18 | Language and Pronunciation Modeling - Seymore, Chen, et al. - 1997 |

9 | Combining syntactic knowledge and visual text recognition: A hidden Markov model for part of speech tagging in a word recognition algorithm - Hull - 1992
Citation Context: ...s including speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction (Church, 1988; Brown et al., 1990; Kernighan, Church & Gale, 1990; Hull, 1992; Srihari & Baltus, 1992). The central goal of the most commonly used language models, trigram models, is to determine the probability of a word given the previous two words: $p(w_i|w_{i-2}w_{i-1})$. The simpl... |


8 | Specification of the 1995 ARPA hub 3 evaluation: Unlimited vocabulary NAB news baseline - Stern - 1996
Citation Context: ...eet Journal (WSJ) text from 1990–1995, and 35 million words of San Jose Mercury News (SJM) text from 1991. We used the 20 000 word vocabulary supplied for the 1995 ARPA speech recognition evaluation (Stern, 1996). For the NAB corpus, we primarily used the Wall Street Journal text, and only used the other text if more than 98 million words o... |

7 | K.W.: Estimation procedures for language context: Poor estimates are worse than none - Gale, Church - 1990 |

7 | Combining statistical and syntactic methods in recognizing handwritten sentences - Srihari, Baltus - 1992
Citation Context: ...speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction (Church, 1988; Brown et al., 1990; Kernighan, Church & Gale, 1990; Hull, 1992; Srihari & Baltus, 1992). The central goal of the most commonly used language models, trigram models, is to determine the probability of a word given the previous two words: $p(w_i|w_{i-2}w_{i-1})$. The simplest... |

7 | Hub 4 language modeling using domain interpolation and data clustering - Weng, Stolcke, et al. - 1997
Citation Context: ...speed and memory of computers, there has been some use of higher-order n-gram models such as 4-gram and 5-gram models in speech recognition in recent years (Seymore, Chen, Eskenazi & Rosenfeld, 1997; Weng et al., 1997). In this section, we examine... [figure: difference in test cross-entropy from baseline (bits/token)] |

6 | The 1996 hub-4 sphinx-3 system - Placeway, Chen, et al. - 1996 |

6 | Hub 4: Business Broadcast News - Rudnicky - 1996
Citation Context: ...rd data is three million words of telephone conversation transcriptions (Godfrey, Holliman & McDaniel, 1992). We used the 9800 word vocabulary created by Finke et al. (1997). The Broadcast News text (Rudnicky, 1996) consists of 130 million words of transcriptions of television and radio news shows from 1992–1996. We used the 50 000 word vocabulary developed by Placeway et al. (1997). For all corpora, any out-of... |

4 | Evaluation metric for language models. In DARPA Broadcast News Transcription Understanding Workshop - Chen, Beeferman, et al. - 1998 |

3 | Theory of Probability (2nd Edition) - Jeffreys - 1948
Citation Context: ...ificantly from the original algorithm description. 2.1. Additive smoothing: One of the simplest types of smoothing used in practice is additive smoothing (Laplace, 1825; Lidstone, 1920; Johnson, 1932; Jeffreys, 1948). To avoid zero probabilities, we pretend that each n-gram occurs slightly more often than it actually does: we add a factor δ to every count, where typically 0 < δ ≤ 1. Thus, we set $p_{add}(w_i|w_{i-n+1}^{i-1})$... |
