## A survey of smoothing techniques for ME models (2000)

Venue: IEEE Transactions on Speech and Audio Processing

Citations: 86 (1 self)

### BibTeX

@ARTICLE{Chen00asurvey,
  author  = {Stanley F. Chen and Ronald Rosenfeld},
  title   = {A survey of smoothing techniques for ME models},
  journal = {IEEE Transactions on Speech and Audio Processing},
  year    = {2000},
  volume  = {8},
  pages   = {37--50}
}


### Abstract

In certain contexts, maximum entropy (ME) modeling can be viewed as maximum likelihood (ML) training for exponential models, and like other ML methods is prone to overfitting of training data. Several smoothing methods for ME models have been proposed to address this problem, but previous results do not make it clear how these smoothing methods compare with smoothing methods for other types of related models. In this work, we survey previous work in ME smoothing and compare the performance of several of these algorithms with conventional techniques for smoothing n-gram language models. Because of the mature body of research in n-gram model smoothing and the close connection between ME and conventional n-gram models, this domain is well-suited to gauge the performance of ME smoothing methods. Over a large number of data sets, we find that fuzzy ME smoothing performs as well as or better than all other algorithms under consideration. We contrast this method with previous n-gram smoothing methods to explain its superior performance.

Index Terms: Exponential models, language modeling, maximum entropy, minimum divergence, n-gram models, smoothing.

### Citations

1154 | Information theory and statistics
- Kullback
Citation Context: ... employed profitably in a conventional n-gram smoothing technique. Not only can fuzzy ME smoothing be applied to ME modeling, but it can also be applied in the more general minimum divergence paradigm [40], [41]. Maximizing entropy is equivalent to finding the model with the smallest Kullback–Leibler divergence from the uniform distribution. In minimum divergence modeling, one selects the model satisfy...

851 | An Empirical Study of Smoothing Techniques for Language Modeling
- Chen, Goodman
- 1998
Citation Context: ...mum a posteriori instead of ML parameter values. While simple and efficient, this method exhibits all of the behaviors that have been observed by Chen and Goodman to be beneficial for n-gram smoothing [8]. In the remainder of this section, we present an introduction to ME modeling and discuss why smoothing ME models is necessary. In Section II, we introduce n-gram language models and summarize previous...

751 | Computational Analysis of Present-Day American English
- Kucera, Francis
- 1967
Citation Context: ... smoothing methods generally leads to worse performance for the methods that perform well. We used data from four sources: the Brown corpus, which contains text from a number of miscellaneous sources [33]; WSJ newspaper text [34]; the Broadcast News (BN) corpus, which contains transcriptions of television and radio news shows [35]; Fig. 2. Cross-entropy of baseline smoothing algorithm on test set over...

664 | Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer
- Katz
- 1987
Citation Context: ...etermined while ignoring information from lower-order distributions. Interpolated models include Jelinek–Mercer smoothing [14] and Witten–Bell smoothing [16]; backed-off models include Katz smoothing [17], absolute discounting [19], and Kneser–Ney smoothing [18]. To describe the different types of discounting, we write as where can be viewed as the discount in count space from the ML estimate and wher...

617 | Text Compression
- Bell, Cleary, et al.
- 1990
Citation Context: ... are less sparsely estimated from the training data, their interpolation generally reduces overfitting. A large number of other smoothing methods for n-gram models have been proposed, e.g., [8], [14], [16]–[19]. We present a brief overview of past work in n-gram model smoothing. One basic observation is that the ML estimate of the probability of an n-gram that does not occur in the training data is zero ...

553 | Inducing features of random fields
- Pietra, Pietra, et al.
- 1997
Citation Context: ... can be interpreted as the model that assumes only the knowledge that is represented by the features derived from the training data, and nothing else. The ME paradigm has many elegant properties [2], [3]. The ME model is unique and can be shown to be an exponential model of the form where is a normalization factor and are the parameters of the model. Furthermore, the ME model is also the ML model in ...

525 | Switchboard: Telephone speech corpus for research and development
- Godfrey, Holliman, et al.
- 1992
Citation Context: ...of baseline smoothing algorithm on test set over multiple training set sizes on Brown, SWB, and WSJ corpora. and the Switchboard (SWB) corpus, which contains transcriptions of telephone conversations [36]. In each experiment, we selected a training set of a given length from one source, and two held-out sets from the same source. The first held-out set was used to optimize the parameters of each smoot...

431 | Generalized iterative scaling for log-linear models
- Darroch, Ratcliff
- 1972
Citation Context: ...ally, the log-likelihood of the training data is concave in the model parameters, and thus it is relatively easy to find the unique ME/ML model using algorithms such as generalized iterative scaling [10] or improved iterative scaling [3]. These properties hold when constraining feature expectations to be equal to those found in a training set. When constraining expectations to alternate values, the...
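The iterative-scaling procedure this snippet refers to can be sketched on a toy joint model. This is a minimal, hypothetical illustration of generalized iterative scaling, not the surveyed paper's implementation; it assumes binary features, every event activating the same number of features (the classic GIS requirement), and every feature observed at least once in the data.

```python
import math

def gis(events, features, empirical_counts, iterations=200):
    """Fit an exponential model p(x) ∝ exp(sum_i lam_i f_i(x)) by GIS.

    events: list of hashable outcomes.
    features: list of binary functions f(x) -> 0/1.
    empirical_counts: dict mapping event -> observed count.
    """
    n = sum(empirical_counts.values())
    # Empirical feature expectations E_~p[f_i]
    emp = [sum(empirical_counts.get(x, 0) * f(x) for x in events) / n
           for f in features]
    # GIS constant: number of active features per event (assumed constant)
    C = max(sum(f(x) for f in features) for x in events)
    lam = [0.0] * len(features)
    for _ in range(iterations):
        # Current model distribution p_lam(x)
        weights = [math.exp(sum(l * f(x) for l, f in zip(lam, features)))
                   for x in events]
        Z = sum(weights)
        p = [w / Z for w in weights]
        # Model feature expectations E_p[f_i]
        mod = [sum(px * f(x) for px, x in zip(p, events)) for f in features]
        # GIS update: lam_i += (1/C) * log(E_~p[f_i] / E_p[f_i])
        lam = [l + (1.0 / C) * math.log(e / m)
               for l, e, m in zip(lam, emp, mod)]
    return lam
```

Because the log-likelihood is concave, as the snippet notes, these multiplicative updates converge to the unique ME/ML solution.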

391 | A Maximum Likelihood Approach to Continuous Speech Recognition
- Bahl, Jelinek, et al.
- 1983
Citation Context: ...bution over word sequences that models how often each sequence occurs as a sentence. Language models have many applications, including speech recognition, machine translation, and spelling correction [11]–[13]. For a word sequence, we can express its probability as where the token signals the end of the sentence. The most widely-used language models, by far, are n-gram language models. In an n-gram mod...

353 | The population frequencies of the species and the estimation of population parameters
- Good
- 1953
Citation Context: ...onal to the original count, as in (4) where the discount is. In absolute discounting, is taken to be a constant. In Good–Turing discounting, the discount is calculated using the Good–Turing estimate [20], a theoretically motivated discount that has been shown to be accurate in nonsparse data situations [17], [21]. A brief description of the Good–Turing estimate is given in Section IV-B. Jelinek–Merce...
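The Good–Turing estimate mentioned here reassigns to an n-gram seen r times the adjusted count r* = (r + 1) n_{r+1} / n_r, where n_r is the number of distinct n-grams occurring exactly r times. A minimal sketch, assuming raw count-of-counts; real implementations smooth the n_r values (e.g., Simple Good–Turing) to handle gaps where n_{r+1} = 0:

```python
from collections import Counter

def good_turing_adjusted_counts(counts):
    """counts: dict mapping n-gram -> observed count r.
    Returns dict mapping n-gram -> Good-Turing adjusted count r*."""
    n_r = Counter(counts.values())  # count-of-counts: n_r
    adjusted = {}
    for gram, r in counts.items():
        if n_r.get(r + 1, 0) > 0:
            # r* = (r + 1) * n_{r+1} / n_r
            adjusted[gram] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            # No n-grams with count r+1: fall back to the ML count
            adjusted[gram] = float(r)
    return adjusted
```

The discount for count r is then r − r*, and the total probability mass n_1 / N freed this way is what backoff methods redistribute to unseen n-grams.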

336 | Interpolated estimation of Markov source parameters from sparse data
- Jelinek, Mercer
- 1980
Citation Context: ...le to use smoothed estimates of these values. For example, one simple smoothing technique is to linearly interpolate the ML estimate of the n-gram probability with an estimate of the (n−1)-gram probability [14], [15]. The lower-order estimate can be defined analogously, and the recursion can end with a unigram or uniform distribution. Since the lower-order distributions are less sparsely estimated from the t...
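The linear interpolation this snippet describes (Jelinek–Mercer smoothing) can be sketched for a bigram model as p(w | v) = λ p_ML(w | v) + (1 − λ) p_ML(w). A minimal illustration; the function name and the fixed λ = 0.7 are arbitrary choices here, and in practice λ is tuned on held-out data (often bucketed by history count):

```python
from collections import Counter

def interpolated_bigram(tokens, lam=0.7):
    """Return p(w, v) ~= p(w | v) via Jelinek-Mercer interpolation of
    the bigram ML estimate with the unigram ML estimate."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def p(w, v):
        p_uni = unigrams[w] / total                      # p_ML(w)
        p_bi = (bigrams[(v, w)] / unigrams[v]            # p_ML(w | v)
                if unigrams[v] else 0.0)
        return lam * p_bi + (1 - lam) * p_uni

    return p
```

Because the unigram distribution is far less sparsely estimated, the mixture never assigns zero probability to a word seen anywhere in training, which is exactly the overfitting reduction the snippet points to.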

169 | On structuring probabilistic dependences in stochastic language modeling
- Ney, Essen, et al.
- 1994
Citation Context: ...less sparsely estimated from the training data, their interpolation generally reduces overfitting. A large number of other smoothing methods for n-gram models have been proposed, e.g., [8], [14], [16]–[19]. We present a brief overview of past work in n-gram model smoothing. One basic observation is that the ML estimate of the probability of an n-gram that does not occur in the training data is zero and i...

157 | The design for the Wall Street Journal-based CSR corpus
- Paul, Baker
- 1992
Citation Context: ...s developed for conventional n-gram models can also be utilized for ME n-gram models. For example, ME n-gram models can be expressed efficiently in the standard ARPA format for conventional n-gram models [25]. IV. SMOOTHING MAXIMUM ENTROPY MODELS In this section, we survey previous work in ME model smoothing, including constraint exclusion, Good–Turing ... The equivalence is not exact as exponential models ...

128 | A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams (Computer Speech and Language)
- Church, Gale
- 1991
Citation Context: ... . In Good–Turing discounting, the discount is calculated using the Good–Turing estimate [20], a theoretically motivated discount that has been shown to be accurate in nonsparse data situations [17], [21]. A brief description of the Good–Turing estimate is given in Section IV-B. Jelinek–Mercer smoothing and Witten–Bell smoothing use linear discounting, Kneser–Ney smoothing uses absolute discounting, a...

83 | Information theory and statistical mechanics - Jaynes - 1957

81 | A spelling correction program based on a noisy channel model
- Kernighan, Church, et al.
- 1990
Citation Context: ...n over word sequences that models how often each sequence occurs as a sentence. Language models have many applications, including speech recognition, machine translation, and spelling correction [11]–[13]. For a word sequence, we can express its probability as where the token signals the end of the sentence. The most widely-used language models, by far, are n-gram language models. In an n-gram model, w...

79 | A hierarchical Dirichlet language model
- MacKay, Peto
- 1994
Citation Context: ...ltiple flat discounts. We can contrast the Gaussian prior on parameters of fuzzy ME with previous work in n-gram smoothing where priors have been applied directly in probability space. MacKay and Peto [30] use a Dirichlet prior and Nádas [31] uses a beta prior, both resulting in linear discounting, which has been shown to perform suboptimally. Applying a prior to parameters proportional to log-probabili...

50 | A model of lexical attraction and repulsion
- Beeferman, Berger, et al.
- 1997
Citation Context: ...yed profitably in a conventional n-gram smoothing technique. Not only can fuzzy ME smoothing be applied to ME modeling, but it can also be applied in the more general minimum divergence paradigm [40], [41]. Maximizing entropy is equivalent to finding the model with the smallest Kullback–Leibler divergence from the uniform distribution. In minimum divergence modeling, one selects the model satisfying th...

35 | Evaluation metrics for language models
- Chen, Beeferman, et al.
- 1998
Citation Context: ...n the test set. They also use the derivative measure cross-entropy, which can be interpreted as the average number of bits needed to code each word in the test set using the model. Chen et al. [8], [22] also conducted experiments investigating how the cross-entropy of a language model is related to its performance when used in a speech recognition system. They found a strong linear correlation betwe...

35 | Estimation of probabilities in the language model of the IBM speech recognition system
- Nadas
- 1984
Citation Context: ...t the Gaussian prior on parameters of fuzzy ME with previous work in n-gram smoothing where priors have been applied directly in probability space. MacKay and Peto [30] use a Dirichlet prior and Nádas [31] uses a beta prior, both resulting in linear discounting, which has been shown to perform suboptimally. Applying a prior to parameters proportional to log-probabilities instead of to probabilities prod...

27 | Just-in-time language modeling
- Berger, Miller
- 1998
Citation Context: ...s active), and this is not the case in general. However, promising results have been achieved with fuzzy ME smoothing in text classification [42], as well as in topic adaptation for language modeling [43]. In addition, how parameters should be tied in other domains has yet to be explored. Nonetheless, our results and analysis justify the choice of fuzzy ME smoothing...

23 | A maximum entropy approach to adaptive statistical language modeling
- Rosenfeld
- 1996
Citation Context: ...models, smoothing. I. INTRODUCTION Maximum entropy (ME) modeling has been successfully applied to a wide range of domains, including language modeling as well as many other natural language tasks [2]–[5]. For many problems, this type of modeling can be viewed as maximum likelihood (ML) training for exponential models, and like other ML methods is prone to overfitting of training data. While several s...

16 | Adaptive Statistical Language Modelling - Lau - 1994

14 | Experiments in spoken document retrieval at CMU
- Siegler, Slattery, et al.
- 1999
Citation Context: ...y more relevant training material is available. We calculated word error rates by rescoring lattices produced by the Sphinx-III speech recognition system for the TREC-7 spoken document retrieval task [38]. The word error rates achieved on a 33 000-word test set are displayed in Table I, as well as the cross-entropy of each model on the correct transcript of the test set. As expected, kneser-ney-mod ...

11 | A method of maximum entropy estimation with relaxed constraints (Johns Hopkins Univ. Language Modeling Workshop)
- Khudanpur
- 1995
Citation Context: ...e than the perplexities achieved by the deleted interpolation and Good–Turing discounted ME models. D. Fat Constraints. Other methods for relaxing constraints include work by Newman [28] and Khudanpur [29]. In these algorithms, instead of selecting the ME model over models that satisfy a set of constraints exactly, they only require that the given marginals of fall in some range around the target value...

9 | Extension to the maximum entropy method
- Newman
- 1977
Citation Context: ...is is slightly worse than the perplexities achieved by the deleted interpolation and Good–Turing discounted ME models. D. Fat Constraints. Other methods for relaxing constraints include work by Newman [28] and Khudanpur [29]. In these algorithms, instead of selecting the ME model over models that satisfy a set of constraints exactly, they only require that the given marginals of fall in some range arou...

8 | Specification of the 1995 ARPA hub 3 evaluation: Unlimited vocabulary NAB news baseline
- Stern
- 1996
Citation Context: ...lly leads to worse performance for the methods that perform well. We used data from four sources: the Brown corpus, which contains text from a number of miscellaneous sources [33]; WSJ newspaper text [34]; the Broadcast News (BN) corpus, which contains transcriptions of television and radio news shows [35]; Fig. 2. Cross-entropy of baseline smoothing algorithm on test set over multiple training set si...

7 | A statistical approach to machine translation - Brown, et al. - 1990

6 | Private communication
- Rockmore
Citation Context: ... still concave in and it is still straightforward to find the optimal model. For instance, we can make a simple modification to improved iterative scaling to find the maximum a posteriori (MAP) model [27]. The original update of each in this algorithm is to take where satisfies (12). In an earlier incarnation of this work [26], the authors were unaware of the equivalence of fuzzy ME and the Gaussian ...

6 | Hub 4: Business Broadcast News
- Rudnicky
- 1996
Citation Context: ...rown corpus, which contains text from a number of miscellaneous sources [33]; WSJ newspaper text [34]; the Broadcast News (BN) corpus, which contains transcriptions of television and radio news shows [35]; Fig. 2. Cross-entropy of baseline smoothing algorithm on test set over multiple training set sizes on Brown, SWB, and WSJ corpora. and the Switchboard (SWB) corpus, which contains transcriptions of ...

4 | Statistical Modeling by
- Pietra, Pietra
- 1993
Citation Context: ...hood (ML) training for exponential models, and like other ML methods is prone to overfitting of training data. While several smoothing methods for ME models have been proposed to address this problem [1], [5]–[7], previous results do not make it clear how these smoothing methods compare with smoothing methods for other types of related models. However, there has been a great deal of research in smoot...

3 | The 1997 CMU Sphinx-3 English Broadcast News Transcription System
- Seymore, et al.
- 1998
Citation Context: ...o run with only a single decoding pass and with narrow search beams. With several passes, wider beams, and a larger language model, Sphinx-III achieved a word-error rate of 23.8% on a similar BN task [39]. (Table I: speech recognition word error rates and test set cross-entropies of various smoothing algorithms over several...)

2 | A maximum entropy approach to natural language processing
- Berger, Pietra, et al.
- 1996
Citation Context: ...gram models, smoothing. I. INTRODUCTION Maximum entropy (ME) modeling has been successfully applied to a wide range of domains, including language modeling as well as many other natural language tasks [2]–[5]. For many problems, this type of modeling can be viewed as maximum likelihood (ML) training for exponential models, and like other ML methods is prone to overfitting of training data. While sever...

2 | Maximum entropy models for natural language ambiguity resolution - Ratnaparkhi - 1998

2 | Improved Backing-Off for M-gram Language Modeling
- Kneser, Ney
- 1995
Citation Context: ...f n-grams with nonzero counts are generally too high. This dichotomy motivates the following framework for expressing smoothing methods, which can be used to express most existing smoothing techniques [18]. That is, if an n-gram occurs in the training data, the estimate is used; this estimate is generally a discounted version of the ML estimate. Otherwise, we back off to a scaled version of the ...
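The two-case backoff framework in this snippet can be sketched concretely. A minimal, hypothetical bigram example using absolute discounting for the seen case and a unigram back-off for the unseen case; the discount D = 0.5 is arbitrary, and the back-off scale is chosen so each conditional distribution sums to one:

```python
from collections import Counter

def backoff_bigram(tokens, D=0.5):
    """Return p(w, v) ~= p(w | v) under the backed-off framework:
    a discounted estimate if (v, w) was seen, else a scaled unigram."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = set(tokens)
    total = len(tokens)

    def p_uni(w):
        return unigrams[w] / total

    def p(w, v):
        seen = {u for (x, u) in bigrams if x == v}  # successors of v
        if (v, w) in bigrams:
            # Seen case: absolute discounting of the ML count
            return (bigrams[(v, w)] - D) / unigrams[v]
        # Unseen case: redistribute the discounted mass over the
        # lower-order (unigram) estimates of unseen successors
        held_out = D * len(seen) / unigrams[v]
        unseen_mass = sum(p_uni(u) for u in vocab - seen)
        return held_out * p_uni(w) / unseen_mass

    return p
```

Interpolated methods differ only in the seen case, where the lower-order estimate is mixed in even for observed n-grams rather than being used solely as a fallback.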

1 | An estimate of an upper bound for the entropy
- Brown
- 1992
Citation Context: ...use smoothed estimates of these values. For example, one simple smoothing technique is to linearly interpolate the ML estimate of the n-gram probability with an estimate of the (n−1)-gram probability [14], [15]. The lower-order estimate can be defined analogously, and the recursion can end with a unigram or uniform distribution. Since the lower-order distributions are less sparsely estimated from the trainin...

1 | A maximum penalized entropy construction of conditional log-linear language and translation models using learned features and a generalized Csiszar algorithm
- Brown
- 1992
Citation Context: ...in the ME framework. The ME models described in Section I-A are joint models; to create the conditional distributions used in conventional n-gram models we use the framework introduced by Brown et al. [23]. Instead of estimating a joint distribution over samples, we estimate a conditional distribution over samples. Instead of constraints as given by (2), we have constraints of the form This can be int...

1 | Adaptive language modeling using the maximum entropy principle
- Lau, Rosenfeld, et al.
- 1993
Citation Context: ...ties. Conditional ME models share many of the same properties as joint models, including being ML models, and have computational and performance advantages over joint models in language modeling [5], [24]. A conditional ME model has the form given in (7)–(8). To construct an ME n-gram model, we take to be the history and to be the following word. For each n-gram with that occurs in...

1 | A Gaussian prior for smoothing maximum entropy models
- Chen, Rosenfeld
- 1999
Citation Context: ...to improved iterative scaling to find the maximum a posteriori (MAP) model [27]. The original update of each in this algorithm is to take where satisfies (12). In an earlier incarnation of this work [26], the authors were unaware of the equivalence of fuzzy ME and the Gaussian prior, and mistakenly attributed the Gaussian prior to John Lafferty. ... and where . With the G...
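The fuzzy ME / Gaussian-prior smoothing this snippet discusses maximizes a penalized log-likelihood rather than the raw likelihood. A minimal sketch of that objective; the function and the toy event/feature representation are illustrative, not the paper's code, and a single shared variance sigma² is assumed:

```python
import math

def penalized_log_likelihood(lam, events, features, empirical_counts,
                             sigma2=1.0):
    """MAP objective for an exponential model with a Gaussian prior:
    L(lam) = sum_x ~p(x) log p_lam(x) - sum_i lam_i^2 / (2 sigma^2).
    Penalizing large lam_i discounts the model toward uniformity,
    which is the smoothing effect discussed in the snippet."""
    n = sum(empirical_counts.values())
    # Unnormalized model weights exp(sum_i lam_i f_i(x)) and partition Z
    weights = {x: math.exp(sum(l * f(x) for l, f in zip(lam, features)))
               for x in events}
    Z = sum(weights.values())
    # Empirical expectation of log p_lam(x)
    ll = sum(empirical_counts[x] / n * math.log(weights[x] / Z)
             for x in events)
    penalty = sum(l * l for l in lam) / (2 * sigma2)
    return ll - penalty
```

Maximizing this objective shifts each feature-expectation constraint by lam_i / sigma², which is why, as noted above, a simple modification of improved iterative scaling suffices to find the MAP model.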

1 | Cluster expansions and iterative scaling for maximum entropy language models
- Lafferty, Suhm
- 1995
Citation Context: ...tialized to zero and improved iterative scaling is applied to train the model. Iterative scaling is terminated when the perplexity of a held-out set no longer decreases appreciably. Cluster expansion [32] is employed to reduce computation. In the implementation ME-no-smooth, no smoothing is performed. (Since training is terminated when performance on a held-out set no longer improves, no probabilities...

1 | Using maximum entropy for text classification
- Nigam, Lafferty, et al.
- 1999
Citation Context: ...ure is active for a superset of the events that the other is active), and this is not the case in general. However, promising results have been achieved with fuzzy ME smoothing in text classification [42], as well as in topic adaptation for language modeling [43]. In addition, how parameters should be tied in other domains has yet to be explored. Nonetheless, our re...