## An Empirical Study of Smoothing Techniques for Language Modeling (1998)

Citations: 924 (20 self)

### BibTeX

@TECHREPORT{Chen98anempirical,

author = {Stanley F. Chen and Joshua Goodman},

title = {An Empirical Study of Smoothing Techniques for Language Modeling},

institution = {},

year = {1998}

}

### Abstract

We present an extensive empirical comparison of several smoothing techniques in the domain of language modeling, including those described by Jelinek and Mercer (1980), Katz (1987), and Church and Gale (1991). We investigate for the first time how factors such as training data size, corpus (e.g., Brown versus Wall Street Journal), and n-gram order (bigram versus trigram) affect the relative performance of these methods, which we measure through the cross-entropy of test data. In addition, we introduce two novel smoothing techniques, one a variation of Jelinek-Mercer smoothing and one a very simple linear interpolation technique, both of which outperform existing methods.

### Citations

736 | Class-Based n-gram Models of Natural Language - Brown, Pietra, et al. - 1992 |

706 | A stochastic parts program and a noun phrase parser for unrestricted text - Church - 1988 |
Citation Context: ...some domain of interest. Language models are employed in many tasks including speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction (Church, 1988; Brown et al., 1990; Kernighan, Church & Gale, 1990; Hull, 1992; Srihari & Baltus, 1992). The central goal of the most commonly used language models, trigram models, is to determine the probability o...

698 | Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer - Katz - 1987 |
Citation Context: ...is a central issue in language modeling, the literature lacks a definitive comparison between the many existing techniques. Most previous studies that have compared smoothing algorithms (Nádas, 1984; Katz, 1987; Church & Gale, 1991; Kneser & Ney, 1995; MacKay & Peto, 1995) have only done so with a small number of methods (typically two) on one or two corpora and using a single training set size. Perhaps the...

639 | Text Compression - Bell, Cleary, et al. - 1990 |
Citation Context: ...$T$ measured in words. This value can be interpreted as the average number of bits needed to encode each of the $W_T$ words in the test data using the compression algorithm associated with model $p(t_k)$ (Bell et al., 1990). We sometimes refer to cross-entropy as just entropy. The perplexity $PP_p(T)$ of a model $p$ is the reciprocal of the (geometric) average probability assigned by the model to each word in the test set ...
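
The relationship between cross-entropy and perplexity described in this snippet can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation code: the function names and the list `probs` of per-word model probabilities are assumptions made for the example, and the exact way the paper counts the words in $W_T$ is not visible in the snippet.

```python
import math

def cross_entropy(word_probs):
    """Average number of bits per word: -(1/W_T) * sum(log2 p(word))."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)

def perplexity(word_probs):
    """Reciprocal of the geometric mean probability, i.e. 2 ** cross-entropy."""
    return 2.0 ** cross_entropy(word_probs)

# Hypothetical probabilities a model assigned to four test words.
probs = [0.25, 0.5, 0.125, 0.25]
print(cross_entropy(probs))  # 2.0 bits per word
print(perplexity(probs))     # 4.0
```

The geometric mean of the four probabilities is 0.25, so its reciprocal (4) matches $2^{2}$, which is the identity the snippet relies on.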

629 | A statistical approach to machine translation - Brown, Cocke, et al. - 1990 |
Citation Context: ... interest. Language models are employed in many tasks including speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction (Church, 1988; Brown et al., 1990; Kernighan, Church & Gale, 1990; Hull, 1992; Srihari & Baltus, 1992). The central goal of the most commonly used language models, trigram models, is to determine the probability of a word given the p...

559 | Theory of Probability - Jeffreys - 1961 |

556 | SWITCHBOARD: Telephone speech corpus for research and development - Godfrey, Holliman, et al. - 1992 |

544 | An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes, Inequalities 3 - Baum - 1972 |
Citation Context: ...be the uniform distribution $p_{\mathrm{unif}}(w_i) = \frac{1}{|V|}$. Given fixed $p_{\mathrm{ML}}$, it is possible to search efficiently for the $\lambda_{w_{i-n+1}^{i-1}}$ that maximize the probability of some data using the Baum–Welch algorithm (Baum, 1972). Training a distinct $\lambda_{w_{i-n+1}^{i-1}}$ for each $w_{i-n+1}^{i-1}$ is not generally felicitous, while setting all $\lambda_{w_{i-n+1}^{i-1}}$ to the same value leads to poor performance (Ristad, 1995). Bahl, Jelinek and Mer...
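
To make the recursion in this snippet concrete, here is a rough Python sketch of interpolation that bottoms out in the uniform distribution. It deliberately uses a single fixed weight `lam` for every history, which is the setting the snippet says performs poorly; it is meant only to show the shape of the recursion, not the tuned method. The helper names, the `counts` layout, and the rule of falling back entirely to the lower order for unseen histories are assumptions made for this example.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def interpolated_prob(word, history, counts, vocab, lam=0.7):
    """Interpolate the ML estimate with the next lower-order model.

    history: tuple of preceding words; () selects the unigram model and
    None selects the uniform distribution that ends the recursion.
    counts[k] is a Counter of k-gram tuples.
    """
    if history is None:
        return 1.0 / len(vocab)  # uniform distribution over the vocabulary
    order = len(history) + 1
    lower_history = history[1:] if history else None
    lower = interpolated_prob(word, lower_history, counts, vocab, lam)
    hist_total = sum(counts[order][history + (v,)] for v in vocab)
    if hist_total == 0:
        # Assumption: an unseen history gives all its weight to the lower order.
        return lower
    p_ml = counts[order][history + (word,)] / hist_total
    return lam * p_ml + (1.0 - lam) * lower

tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
counts = {1: ngram_counts(tokens, 1), 2: ngram_counts(tokens, 2)}
print(interpolated_prob("cat", ("the",), counts, vocab))
```

Because each level is a convex combination of two distributions over the same vocabulary, the result still sums to one over `vocab` for any history.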

524 | Three Generative, Lexicalised Models for Statistical Parsing - Collins - 1997 |
Citation Context: ...her applications as well, e.g. prepositional phrase attachment (Collins & Brooks, 1995), part-of-speech tagging (Church, 1988), and stochastic parsing (Magerman, 1994; Collins, 1997; Goodman, 1997). [Running head, p. 391: S. F. Chen and J. Goodman.] Whenever data sparsity is an issue, smoothing can help performance, and data sparsity is almost always an issue in statistical modeling. In the extreme case where there is so much tr...

412 | A maximum likelihood approach to continuous speech recognition - Bahl, Jelinek, et al. - 1983 |

393 | The population frequencies of species and the estimation of population parameters - Good - 1953 |
Citation Context: ...words considered. Lidstone and Jeffreys advocate taking $\delta = 1$. Gale and Church (1990, 1994) have argued that this method generally performs poorly. 2.2. Good–Turing estimate. The Good–Turing estimate (Good, 1953) is central to many smoothing techniques. The Good–Turing estimate states that for any n-gram that occurs $r$ times, we should pretend that it occurs $r^*$ times where $r^* = (r + 1)\frac{n_{r+1}}{n_r}$ (2) and where $n_r$...
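
The adjusted count $r^* = (r+1)\,n_{r+1}/n_r$, where $n_r$ is the number of n-grams seen exactly $r$ times, can be computed directly from a table of counts. The sketch below is illustrative only: the function name and toy data are assumptions, and it simply skips any $r$ for which $n_{r+1} = 0$, whereas practical implementations usually smooth the $n_r$ values first.

```python
from collections import Counter

def good_turing_adjusted_counts(ngram_counts):
    """Map each observed count r to r* = (r + 1) * n_{r+1} / n_r.

    Counts r with n_{r+1} == 0 are left out, because the raw formula
    would assign them an adjusted count of zero.
    """
    n = Counter(ngram_counts.values())  # n[r] = number of n-grams seen r times
    adjusted = {}
    for r in n:
        if n[r + 1] > 0:
            adjusted[r] = (r + 1) * n[r + 1] / n[r]
    return adjusted

# Toy bigram counts: three bigrams seen once, two seen twice, one seen three times.
toy = {("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1,
       ("d", "e"): 2, ("e", "f"): 2, ("f", "g"): 3}
print(good_turing_adjusted_counts(toy))
```

With $n_1 = 3$, $n_2 = 2$, $n_3 = 1$, the adjusted counts are $1^* = 4/3$ and $2^* = 3/2$; $r = 3$ is skipped because no bigram occurs four times.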

351 | Interpolated estimation of Markov source parameters from sparse data. Pattern Recognition in Practice - Jelinek, Mercer - 1980 |

296 | Improved backing-off for m-gram language modeling - Kneser, Ney - 1995 |
Citation Context: ...deling, the literature lacks a definitive comparison between the many existing techniques. Most previous studies that have compared smoothing algorithms (Nádas, 1984; Katz, 1987; Church & Gale, 1991; Kneser & Ney, 1995; MacKay & Peto, 1995) have only done so with a small number of methods (typically two) on one or two corpora and using a single training set size. Perhaps the most complete previous comparison is tha...

241 | The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression - Witten, Bell - 1991 |
Citation Context: ... to Jelinek–Mercer smoothing. To end the recursion, the Katz unigram model is taken to be the maximum likelihood unigram model. 2.5. Witten–Bell smoothing. Witten–Bell smoothing (Bell et al., 1990; Witten & Bell, 1991) was developed for the task of text compression, and can be considered to be an instance of Jelinek–Mercer smoothing. In particular, the nth-order smoothed model is defined recursively as a linear ...

232 | A Gaussian prior for smoothing maximum entropy models - Chen, Rosenfeld - 1999 |
Citation Context: ... us $p(s) = \prod_{i=1}^{l+1} p(w_i \mid w_1 \cdots w_{i-1}) \approx \prod_{i=1}^{l+1} p(w_i \mid w_{i-n+1}^{i-1})$ where $w_i^j$ denotes the words $w_i \cdots w_j$ and where we take $w_{-n+2}$ through $w_0$ to be ⟨BOS⟩. To estimate $p(w_i \mid w_{i-n+1}^{i-1})$, a nat... [Footnote 1: Maximum entropy techniques can also be used to smooth n-gram models. A discussion of these techniques can be found elsewhere (Chen & Rosenfeld, 1999).]
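
The n-gram approximation in this snippet, with the history padded by ⟨BOS⟩ tokens, can be sketched as a loop over positions. This is only an illustration under assumptions: `cond_prob` is a hypothetical callable standing in for any smoothed model, the marker strings are invented for the example, and treating position $l+1$ as an end-of-sentence token is an assumption that the truncated snippet does not state.

```python
import math

BOS, EOS = "<BOS>", "<EOS>"  # hypothetical marker strings

def sentence_log2_prob(words, n, cond_prob):
    """Sum of log2 p(w_i | previous n-1 words) over i = 1 .. l+1.

    The history is padded with n-1 BOS tokens, and an EOS token is
    appended as the assumed (l+1)-th word.
    cond_prob(word, history) must return a probability in (0, 1].
    """
    padded = [BOS] * (n - 1) + list(words) + [EOS]
    total = 0.0
    for i in range(n - 1, len(padded)):
        history = tuple(padded[i - n + 1:i])
        total += math.log2(cond_prob(padded[i], history))
    return total

# Usage with a dummy model that always answers 0.25.
print(sentence_log2_prob(["a", "b"], 3, lambda w, h: 0.25))  # -6.0
```

Working in log space avoids underflow on long sentences; the plain product from the formula is recovered as `2 ** sentence_log2_prob(...)`.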

183 | On structuring probabilistic dependences in stochastic language modeling - Ney, Essen, et al. - 1994 |

157 | Natural Language Parsing as Statistical Pattern Recognition - Magerman - 1994 |
Citation Context: ... but for many other applications as well, e.g. prepositional phrase attachment (Collins & Brooks, 1995), part-of-speech tagging (Church, 1988), and stochastic parsing (Magerman, 1994; Collins, 1997; Goodman, 1997). Whenever data sparsity is an issue, smoothing can help performance, and data sparsity is almost always an issue in statistical modeling. In the extreme case where ther...

141 | Prepositional Phrase Attachment through a Backed-off Model - Collins, Brooks - 1995 |
Citation Context: ...e. 6. Discussion. Smoothing is a fundamental technique for statistical modeling, important not only for language modeling but for many other applications as well, e.g. prepositional phrase attachment (Collins & Brooks, 1995), part-of-speech tagging (Church, 1988), and stochastic parsing (Magerman, 1994; Collins, 1997; Goodman, 1997). Whenever data sparsity is an issue, smoothing can help ...

130 | A Comparison of the Enhanced Good-Turing and Deleted Estimation Methods for Estimating - Church, Gale - 1991 |
Citation Context: ... issue in language modeling, the literature lacks a definitive comparison between the many existing techniques. Most previous studies that have compared smoothing algorithms (Nádas, 1984; Katz, 1987; Church & Gale, 1991; Kneser & Ney, 1995; MacKay & Peto, 1995) have only done so with a small number of methods (typically two) on one or two corpora and using a single training set size. Perhaps the most complete previo...

127 | Entropy-based Pruning of Backoff Language Models - Stolcke - 1998 |
Citation Context: ...he differences in performance seem to be less when cutoffs are used. Recently, several more sophisticated n-gram model pruning techniques have been developed (Kneser, 1996; Seymore & Rosenfeld, 1996; Stolcke, 1998). It remains to be seen how smoothing interacts with these new techniques. 5.3.3. Cross-entropy and speech recognition. In this section, we briefly examine how the performance of a language model meas...

124 | A tree-based statistical language model for natural language speech recognition - Bahl, Brown, et al. - 1989 |

84 | A spelling correction program based on a noisy channel model - Kernighan, Church, et al. - 1990 |

82 | A hierarchical Dirichlet language model - MacKay, Peto - 1995 |
Citation Context: ...re lacks a definitive comparison between the many existing techniques. Most previous studies that have compared smoothing algorithms (Nádas, 1984; Katz, 1987; Church & Gale, 1991; Kneser & Ney, 1995; MacKay & Peto, 1995) have only done so with a small number of methods (typically two) on one or two corpora and using a single training set size. Perhaps the most complete previous comparison is that of Ney, Martin and ...

68 | An estimate of an upper bound for the entropy of English - Brown, Pietra - 1992 |

66 | Building Probabilistic Models for Natural Language - Chen - 1996 |
Citation Context: .... In Section 5, we present the results of all of our experiments. Finally, in Section 6 we summarize the most important conclusions of this work. This work builds on our previously reported research (Chen, 1996; Chen & Goodman, 1996). An extended version of this paper (Chen & Goodman, 1998) is available; it contains a tutorial introduction to n-gram models and smoothing, more complete descriptions of existi...

66 | Good-Turing frequency estimation without tears - Gale, Sampson - 1995 |

66 | Scalable Backoff Language Models - Seymore, Rosenfeld - 1996 |
Citation Context: ...dition, the magnitudes of the differences in performance seem to be less when cutoffs are used. Recently, several more sophisticated n-gram model pruning techniques have been developed (Kneser, 1996; Seymore & Rosenfeld, 1996; Stolcke, 1998). It remains to be seen how smoothing interacts with these new techniques. 5.3.3. Cross-entropy and speech recognition. In this section, we briefly examine how the performance of a lang...

50 | Statistical language modeling using leaving-one-out - Ney, Martin, et al. - 1997 |

43 | Note on the general case of the Bayes-Laplace formula for inductive or a posteriori probabilities - Lidstone - 1920 |
Citation Context: ...entations sometimes differ significantly from the original algorithm description. 2.1. Additive smoothing. One of the simplest types of smoothing used in practice is additive smoothing (Laplace, 1825; Lidstone, 1920; Johnson, 1932; Jeffreys, 1948). To avoid zero probabilities, we pretend that each n-gram occurs slightly more often than it actually does: we add a factor δ to every count, where typically 0 < δ ≤ 1...
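
Adding δ to every count, as this snippet describes, can be sketched as follows. The snippet is cut off before the formula, so the normalization used here (dividing by the history count plus δ times the vocabulary size, so that the probabilities sum to one) is stated as an assumption of the example, as are the function name and the toy data.

```python
from collections import Counter

def additive_prob(word, history, ngram_counts, vocab, delta=1.0):
    """Additive (add-delta) estimate of p(word | history).

    ngram_counts: Counter of n-gram tuples; history: tuple of n-1 words.
    Every n-gram count is increased by delta before normalizing.
    """
    numerator = ngram_counts[history + (word,)] + delta
    denominator = (sum(ngram_counts[history + (v,)] for v in vocab)
                   + delta * len(vocab))
    return numerator / denominator

tokens = ["a", "b", "a", "c"]
bigrams = Counter(zip(tokens, tokens[1:]))
vocab = sorted(set(tokens))
print(additive_prob("b", ("a",), bigrams, vocab))  # (1 + 1) / (2 + 3) = 0.4
print(additive_prob("a", ("a",), bigrams, vocab))  # (0 + 1) / (2 + 3) = 0.2
```

The unseen bigram `("a", "a")` receives a nonzero probability, which is the point of the method; smaller values of `delta` move less probability mass to unseen events.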

35 | Estimation of probabilities in the language model of the IBM speech recognition system - Nadas - 1984 |
Citation Context: ...le smoothing is a central issue in language modeling, the literature lacks a definitive comparison between the many existing techniques. Most previous studies that have compared smoothing algorithms (Nádas, 1984; Katz, 1987; Church & Gale, 1991; Kneser & Ney, 1995; MacKay & Peto, 1995) have only done so with a small number of methods (typically two) on one or two corpora and using a single training set size....

34 | A statistical approach to machine translation - Lafferty, Roossin - 1990 |

31 | Probability: The deductive and inductive problems - Johnson - 1932 |
Citation Context: ...mes differ significantly from the original algorithm description. 2.1. Additive smoothing. One of the simplest types of smoothing used in practice is additive smoothing (Laplace, 1825; Lidstone, 1920; Johnson, 1932; Jeffreys, 1948). To avoid zero probabilities, we pretend that each n-gram occurs slightly more often than it actually does: we add a factor δ to every count, where typically 0 < δ ≤ 1. Thus, we set ...

30 | On Turing’s formula for word probabilities - Nadas - 1985 |
Citation Context: ...malize: for an n-gram $w_{i-n+1}^{i}$ with $r$ counts, we take $p_{\mathrm{GT}}(w_{i-n+1}^{i}) = \frac{r^*}{N}$ where $N = \sum_{r=0}^{\infty} n_r r^*$. The Good–Turing estimate can be derived theoretically using only a couple of weak assumptions (Nádas, 1985), and has been shown empirically to accurately describe data when $n_r$ values are large. In practice, the Good–Turing estimate is not used by itself for n-gram smoothing, because it does not include th...
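
As a short worked step, the normalizer $N$ can be related to the raw counts. Assuming the standard Good–Turing adjusted count $r^* = (r+1)\,n_{r+1}/n_r$, with $n_r$ the number of n-grams that occur exactly $r$ times, substituting into the sum gives:

```latex
N \;=\; \sum_{r=0}^{\infty} n_r\, r^{*}
  \;=\; \sum_{r=0}^{\infty} (r+1)\, n_{r+1}
  \;=\; \sum_{r=1}^{\infty} r\, n_r
```

So $N$ equals the total number of n-gram tokens in the original counts, and dividing $r^*$ by $N$ redistributes that same total mass across seen and unseen n-grams.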

28 | The JanusRTk Switchboard/Callhome 1997 Evaluation System: Pronunciation Modeling - Finke - 1997 |

27 | Statistical language modeling using a variable context - Kneser - 1996 |
Citation Context: ...ff case. In addition, the magnitudes of the differences in performance seem to be less when cutoffs are used. Recently, several more sophisticated n-gram model pruning techniques have been developed (Kneser, 1996; Seymore & Rosenfeld, 1996; Stolcke, 1998). It remains to be seen how smoothing interacts with these new techniques. 5.3.3. Cross-entropy and speech recognition. In this section, we briefly examine ho...

26 | A hierarchical Dirichlet language model. Natural language engineering - MacKay, Peto - 1995 |

25 | What's wrong with adding one - Gale, Church - 1994 |

23 | On smoothing techniques for bigram-based natural language modelling - Ney, Essen - 1991 |
Citation Context: ... to method C in these references. Different notation is used in the original text. 2.6. Absolute discounting. Absolute discounting (Ney & Essen, 1991; Ney, Essen and Kneser, 1994), like Jelinek–Mercer smoothing, involves the interpolation of higher- and lower-order models. However, instead of multiplying the higher-order maximum-likelihood dist...

21 | Language and Pronunciation Modeling - Seymore, Chen, et al. - 1997 |

9 | A Maximum Likelihood Approach to Continuous Speech Recognition - L, Mercer - 1983 |

9 | Combining syntactic knowledge and visual text recognition: A hidden Markov model for part of speech tagging in a word recognition algorithm - Hull - 1992 |
Citation Context: ...s including speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction (Church, 1988; Brown et al., 1990; Kernighan, Church & Gale, 1990; Hull, 1992; Srihari & Baltus, 1992). The central goal of the most commonly used language models, trigram models, is to determine the probability of a word given the previous two words: $p(w_i \mid w_{i-2} w_{i-1})$. The simpl...

8 | Combining statistical and syntactic methods in recognizing handwritten sentences - Srihari, Baltus - 1992 |
Citation Context: ...speech recognition, optical character recognition, handwriting recognition, machine translation, and spelling correction (Church, 1988; Brown et al., 1990; Kernighan, Church & Gale, 1990; Hull, 1992; Srihari & Baltus, 1992). The central goal of the most commonly used language models, trigram models, is to determine the probability of a word given the previous two words: $p(w_i \mid w_{i-2} w_{i-1})$. The simplest ... [Footnote: E-mail: sfc@cs.cmu...]

8 | Specification of the 1995 ARPA hub 3 evaluation: Unlimited vocabulary NAB news baseline - Stern - 1996 |
Citation Context: ...eet Journal (WSJ) text from 1990–1995, and 35 million words of San Jose Mercury News (SJM) text from 1991. We used the 20 000 word vocabulary supplied for the 1995 ARPA speech recognition evaluation (Stern, 1996). For the NAB corpus, we primarily used the Wall Street Journal text, and only used the other text if more than 98 million words o...

7 | Estimation procedures for language context: Poor estimates are worse than none - Gale, Church - 1990 |

7 | Hub 4 language modeling using domain interpolation and data clustering - Weng, Stolcke, et al. - 1997 |
Citation Context: ... speed and memory of computers, there has been some use of higher-order n-gram models such as 4-gram and 5-gram models in speech recognition in recent years (Seymore, Chen, Eskenazi & Rosenfeld, 1997; Weng et al., 1997). In this section, we examine... [Figure residue; y-axis: difference in test cross-entropy from baseline (bits/token).]

6 | hub-4 sphinx-3 system - Placeway, Chen, et al. - 1996 |

6 | Hub 4: Business Broadcast News - Rudnicky - 1996 |
Citation Context: ...rd data is three million words of telephone conversation transcriptions (Godfrey, Holliman & McDaniel, 1992). We used the 9800 word vocabulary created by Finke et al. (1997). The Broadcast News text (Rudnicky, 1996) consists of 130 million words of transcriptions of television and radio news shows from 1992–1996. We used the 50 000 word vocabulary developed by Placeway et al. (1997). For all corpora, any out-of...

4 | Evaluation metric for language models. In DARPA Broadcast News Transcription Understanding Workshop - Chen, Beeferman, et al. - 1998 |

4 | Theory of Probability, 2nd Edition - Jeffreys - 1948 |
Citation Context: ...ificantly from the original algorithm description. 2.1. Additive smoothing. One of the simplest types of smoothing used in practice is additive smoothing (Laplace, 1825; Lidstone, 1920; Johnson, 1932; Jeffreys, 1948). To avoid zero probabilities, we pretend that each n-gram occurs slightly more often than it actually does: we add a factor δ to every count, where typically 0 < δ ≤ 1. Thus, we set padd(wi|w i−1 i−...

1 | What’s wrong with one - Gale, Church - 1994 |