## A Gaussian Prior for Smoothing Maximum Entropy Models (1999)

### Download Links

- [reports-archive.adm.cs.cmu.edu]
- CiteULike

### Other Repositories/Bibliography

Citations: 226 (2 self)

### BibTeX

```bibtex
@TECHREPORT{Chen99agaussian,
  author      = {Stanley F. Chen and Ronald Rosenfeld},
  title       = {A Gaussian Prior for Smoothing Maximum Entropy Models},
  institution = {Carnegie Mellon University},
  year        = {1999}
}
```

### Abstract

In certain contexts, maximum entropy (ME) modeling can be viewed as maximum likelihood training for exponential models, and like other maximum likelihood methods is prone to overfitting of training data. Several smoothing methods for maximum entropy models have been proposed to address this problem, but previous results do not make it clear how these smoothing methods compare with smoothing methods for other types of related models. In this work, we survey previous work in maximum entropy smoothing and compare the performance of several of these algorithms with conventional techniques for smoothing n-gram language models. Because of the mature body of research in n-gram model smoothing and the close connection between maximum entropy and conventional n-gram models, this domain is well-suited to gauge the performance of maximum entropy smoothing methods. Over a large number of data sets, we find that an ME smoothing method proposed to us by Lafferty [1] performs as well as or better than all other algorithms under consideration. This general and efficient method involves using a Gaussian prior on the parameters of the model and selecting maximum a posteriori rather than maximum likelihood parameter values.
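
The Gaussian prior the abstract describes is, for conditional exponential models, equivalent to L2-penalizing the log-likelihood. A minimal sketch of the resulting MAP objective (the feature encoding, the toy data in the usage note, and a single shared variance σ² are illustrative assumptions, not the report's implementation):

```python
import math

def map_objective(lams, contexts, sigma2):
    """Penalized log-likelihood of a conditional exponential model:
    sum_i log q(y_i | x_i)  -  sum_j lam_j^2 / (2 * sigma2).
    Each context is (feats_per_y, true_y), where feats_per_y[y] lists
    the active feature indices for candidate outcome y."""
    ll = 0.0
    for feats_per_y, true_y in contexts:
        scores = [sum(lams[f] for f in feats) for feats in feats_per_y]
        log_z = math.log(sum(math.exp(s) for s in scores))
        ll += scores[true_y] - log_z
    penalty = sum(l * l for l in lams) / (2.0 * sigma2)
    return ll - penalty
```

With all weights zero the model is uniform and the objective is simply minus (number of events) times log of the number of candidates; any nonzero weights additionally pay the Gaussian penalty, which is what discourages overfitting.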

### Citations

1167 | Information Theory and Statistics
- Kullback
- 1968
Citation Context: ...y and quantitatively similar to the empirical ideal. Not only can the Gaussian prior be applied to maximum entropy modeling, but it can also be applied in the more general minimum divergence paradigm [37, 38]. Maximizing entropy is equivalent to finding the model with the smallest Kullback-Leibler divergence from the uniform distribution. In minimum divergence modeling, one selects the model satisfying th...
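
The equivalence stated in this context — maximizing entropy is minimizing KL divergence from the uniform distribution — follows from the identity H(q) = log K − D(q‖u) over K outcomes, which can be checked numerically (the toy distribution below is illustrative):

```python
import math

def entropy(q):
    """Shannon entropy H(q) = -sum q(x) log q(x), in nats."""
    return -sum(p * math.log(p) for p in q if p > 0)

def kl(q, r):
    """Kullback-Leibler divergence D(q || r) = sum q(x) log(q(x) / r(x))."""
    return sum(p * math.log(p / s) for p, s in zip(q, r) if p > 0)

# For any q over K outcomes and uniform u = (1/K, ..., 1/K):
#   H(q) = log K - D(q || u),
# so the maximum-entropy model is exactly the model with minimum
# divergence from the uniform distribution.
```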

1082 | A maximum entropy approach to natural language processing
- Berger, Pietra, et al.
- 1996
Citation Context: ...imum entropy, smoothing. 1 Introduction. Maximum entropy (ME) modeling has been successfully applied to a wide range of domains, including language modeling as well as many other natural language tasks [2, 3, 4, 5]. For many problems, this type of modeling can be viewed as maximum likelihood (ML) training for exponential models, and like other maximum likelihood methods is prone to overfitting of training data...

859 | An empirical study of smoothing techniques for language modeling, Computer Speech and Language
- Chen, Goodman
- 1999
Citation Context: ...ximum likelihood parameter values are selected. While simple and efficient, this method exhibits all of the behaviors that have been observed by Chen and Goodman to be beneficial for n-gram smoothing [8]. In the remainder of this section, we present an introduction to maximum entropy modeling and discuss why smoothing ME models is necessary. In Section 2, we introduce n-gram language models and summa...

758 | Computational analysis of present-day American English
- Kucera, Francis
- 1967
Citation Context: ...earlier, this method has n free parameters σ_m, one for each level of the n-gram model. We used data from four sources: the Brown corpus, which contains text from a number of miscellaneous sources [32]; Wall Street Journal (WSJ) newspaper text [33]; the Broadcast News (BN) corpus, which contains transcriptions of television and radio news shows [34]; and the Switchboard (SWB) corpus, which contains...

673 | Information Theory and Statistical Mechanics - Jaynes - 1957 |

666 | Estimation of probabilities from sparse data for the language model component of a speech recognizer
- Katz
- 1987
Citation Context: ...tributions are less sparsely estimated from the training data, their interpolation generally reduces overfitting. A large number of other smoothing methods for n-gram models have been proposed, e.g., [16, 8, 14, 17, 18, 19]. We present a brief overview of past work in n-gram model smoothing. One basic observation is that the maximum likelihood estimate of the probability of an n-gram that does not occur in the training...

619 | Text Compression
- Bell, Cleary, et al.
- 1990
Citation Context: ...tributions are less sparsely estimated from the training data, their interpolation generally reduces overfitting. A large number of other smoothing methods for n-gram models have been proposed, e.g., [16, 8, 14, 17, 18, 19]. We present a brief overview of past work in n-gram model smoothing. One basic observation is that the maximum likelihood estimate of the probability of an n-gram that does not occur in the training...

587 | A Statistical Approach to Machine Translation
- Brown, Cocke, et al.
- 1990
Citation Context: ...s) over word sequences s that models how often each sequence s occurs as a sentence. Language models have many applications, including speech recognition, machine translation, and spelling correction [11, 12, 13]. For a word sequence s = w_1 ··· w_l, we can express its probability Pr(s) as Pr(s) = Pr(w_1) × Pr(w_2 | w_1) × ··· × Pr(w_l | w_1 ··· w_{l−1})...
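
The chain-rule factorization quoted above can be sketched directly (the `cond_prob` callback and the explicit end token are interface assumptions, not details of the cited systems):

```python
def sentence_prob(words, cond_prob):
    """Chain rule: Pr(s) = prod_i Pr(w_i | w_1 ... w_{i-1}), with an
    end-of-sentence token appended as in the quoted context.
    cond_prob(word, history_tuple) -> conditional probability."""
    prob = 1.0
    history = []
    for w in words + ["<end>"]:
        prob *= cond_prob(w, tuple(history))
        history.append(w)
    return prob
```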

551 | Inducing features of random fields
- Della Pietra, Della Pietra, et al.
- 1997
Citation Context: ...imum entropy, smoothing. 1 Introduction. Maximum entropy (ME) modeling has been successfully applied to a wide range of domains, including language modeling as well as many other natural language tasks [2, 3, 4, 5]. For many problems, this type of modeling can be viewed as maximum likelihood (ML) training for exponential models, and like other maximum likelihood methods is prone to overfitting of training data...

528 | Telephone speech corpus for research and development
- Godfrey, Holliman, et al.
- 1992
Citation Context: ...[33]; the Broadcast News (BN) corpus, which contains transcriptions of television and radio news shows [34]; and the Switchboard (SWB) corpus, which contains transcriptions of telephone conversations [35]. In each experiment, we selected a training set of a given length from one source, and two held-out sets from the same source. The first held-out set was used to optimize the parameters of each smoot...

429 | Generalized iterative scaling for log-linear models
- Darroch, Ratcliff
- 1972
Citation Context: ...consistent. data is concave in the model parameters, and thus it is relatively easy to find the unique maximum entropy/maximum likelihood model using algorithms such as generalized iterative scaling [10] or improved iterative scaling [3]. While models with high entropy tend to be rather uniform or smooth and we may only constrain properties of q(x) we consider significant, a maximum entropy model can...
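
One generalized iterative scaling step, as introduced by Darroch and Ratcliff, can be sketched for a joint model over a small finite space (the indicator features in the test are illustrative; the sketch assumes the standard GIS condition that the feature values sum to a constant C for every outcome):

```python
import math

def gis_update(lams, feats, samples, emp_counts, C):
    """One GIS step for a joint model q(x) ∝ exp(sum_j lam_j f_j(x))
    over the finite space `samples`. Assumes sum_j f_j(x) == C for all x.
    feats[j] maps x to a feature value; emp_counts[j] is the empirical
    expectation of f_j. Update: lam_j += (1/C) log(E_emp[f_j]/E_q[f_j])."""
    weights = [math.exp(sum(l * f(x) for l, f in zip(lams, feats)))
               for x in samples]
    z = sum(weights)
    model_exp = [sum(w * f(x) for w, x in zip(weights, samples)) / z
                 for f in feats]
    return [l + math.log(e / m) / C
            for l, e, m in zip(lams, emp_counts, model_exp)]
```

When the features partition the space into indicators (so C = 1), a single update already matches the empirical expectations, which is why the toy check below converges in one step.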

395 | A maximum likelihood approach to continuous speech recognition
- Bahl, Jelinek, et al.
- 1983
Citation Context: ...s) over word sequences s that models how often each sequence s occurs as a sentence. Language models have many applications, including speech recognition, machine translation, and spelling correction [11, 12, 13]. For a word sequence s = w_1 ··· w_l, we can express its probability Pr(s) as Pr(s) = Pr(w_1) × Pr(w_2 | w_1) × ··· × Pr(w_l | w_1 ··· w_{l−1})...

355 | The population frequencies of species and the estimation of population parameters
- Good
- 1953
Citation Context: ...c_X(w_{i−(n−1)}^i). In absolute discounting, d(w_{i−(n−1)}^i) is taken to be a constant 0 ≤ D ≤ 1. In Good-Turing discounting, the discount is calculated using the Good-Turing estimate [20], a theoretically-motivated discount that has been shown to be accurate in non-sparse data situations [17, 21]. A brief description of the Good-Turing estimate is given in Section 4.1. Jelinek-Mercer...
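
The Good-Turing estimate mentioned here replaces a raw count r with an adjusted count r* = (r + 1) · n_{r+1} / n_r, where n_r is the number of distinct types seen exactly r times. A small sketch (the fallback for empty count-of-count bins is a practical assumption, not part of the original estimate):

```python
from collections import Counter

def good_turing_counts(counts):
    """Good-Turing adjusted counts r* = (r + 1) * n_{r+1} / n_r, where
    n_r is the number of distinct types observed exactly r times.
    Returns {r: r*}. Counts r with n_{r+1} == 0 are left unadjusted,
    a common practical fallback (real systems smooth the n_r curve)."""
    n = Counter(counts.values())  # n[r] = number of types with count r
    adjusted = {}
    for r in n:
        adjusted[r] = (r + 1) * n[r + 1] / n[r] if n[r + 1] > 0 else float(r)
    return adjusted
```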

338 | Interpolated estimation of Markov source parameters from sparse data
- Jelinek, Mercer
- 1980
Citation Context: ...polate the maximum likelihood estimate of the n-gram probability q_ML(w_i | w_{i−(n−1)}^{i−1}) with an estimate of the (n−1)-gram probability Pr(w_i | w_{i−(n−2)}^{i−1}) [14, 15]: q_int(w_i | w_{i−(n−1)}^{i−1}) = λ q_ML(w_i | w_{i−(n−1)}^{i−1}) + (1 − λ) q_int(w_i | w_{i−(n−2)}^{i−1}), 0 ≤ λ ≤ 1. (4) The lower-order estimate can be defined ana...
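
Equation (4) quoted in this context — recursive Jelinek-Mercer interpolation — can be sketched as follows (a single shared λ and a uniform base case over the vocabulary are simplifying assumptions; the cited work trains bucketed λ values):

```python
def interp_prob(word, history, counts, lam, vocab_size):
    """Jelinek-Mercer: q_int(w | h) = lam * q_ML(w | h)
    + (1 - lam) * q_int(w | shorter h), recursing on ever-shorter
    histories and bottoming out at the uniform distribution 1/V.
    `counts` maps (history_tuple, word) -> n-gram count."""
    if not history:
        return 1.0 / vocab_size
    h = tuple(history)
    denom = sum(c for (hh, w), c in counts.items() if hh == h)
    ml = (counts.get((h, word), 0) / denom) if denom else 0.0
    shorter = history[1:]  # drop the oldest word from the context
    return lam * ml + (1 - lam) * interp_prob(word, shorter, counts,
                                              lam, vocab_size)
```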

275 | Improved backing-off for m-gram language modeling - Kneser, Ney - 1995 |

245 | A Maximum Entropy Approach to Adaptive Statistical Language Modeling, Computer Speech and Language
- Rosenfeld
- 1996
Citation Context: ...imum entropy, smoothing. 1 Introduction. Maximum entropy (ME) modeling has been successfully applied to a wide range of domains, including language modeling as well as many other natural language tasks [2, 3, 4, 5]. For many problems, this type of modeling can be viewed as maximum likelihood (ML) training for exponential models, and like other maximum likelihood methods is prone to overfitting of training data...

203 | Maximum Entropy Models for Natural Language Ambiguity Resolution - Ratnaparkhi - 1998 |

172 | On structuring probabilistic dependences in stochastic language modeling
- Ney, Essen, et al.
- 1994
Citation Context: ...tributions are less sparsely estimated from the training data, their interpolation generally reduces overfitting. A large number of other smoothing methods for n-gram models have been proposed, e.g., [16, 8, 14, 17, 18, 19]. We present a brief overview of past work in n-gram model smoothing. One basic observation is that the maximum likelihood estimate of the probability of an n-gram that does not occur in the training...

128 | A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams, Computer Speech and Language
- Church, Gale
- 1991
Citation Context: ...0 ≤ D ≤ 1. In Good-Turing discounting, the discount is calculated using the Good-Turing estimate [20], a theoretically-motivated discount that has been shown to be accurate in non-sparse data situations [17, 21]. A brief description of the Good-Turing estimate is given in Section 4.1. Jelinek-Mercer smoothing and Witten-Bell smoothing use linear discounting; Kneser-Ney smoothing uses absolute discounting, and...

84 | Trigger-Based Language Models: a Maximum Entropy Approach
- Lau, Rosenfeld, et al.
- 1993
Citation Context: ...istribution of f_i(x) in the training data. Della Pietra and Della Pietra describe a variant of generalized iterative scaling that can be used to find the optimal model under this objective function [26]. We can interpret this algorithm from the viewpoint of maximum a posteriori (MAP) estimation. In MAP estimation, we attempt to find the model q with the highest posterior probability given the traini...

82 | A spelling correction program based on a noisy channel model
- Kernighan, Church, et al.
- 1990
Citation Context: ...s) over word sequences s that models how often each sequence s occurs as a sentence. Language models have many applications, including speech recognition, machine translation, and spelling correction [11, 12, 13]. For a word sequence s = w_1 ··· w_l, we can express its probability Pr(s) as Pr(s) = Pr(w_1) × Pr(w_2 | w_1) × ··· × Pr(w_l | w_1 ··· w_{l−1})...

80 | A hierarchical Dirichlet language model
- MacKay, Peto
- 1995
Citation Context: ...rameters are the variances σ_i. We can contrast a Gaussian prior on λ parameters with previous work in n-gram smoothing where priors have been applied directly in probability space. MacKay and Peto [29] use a Dirichlet prior and Nádas [30] uses a Beta prior, both resulting in linear discounting, which has been shown to perform suboptimally. In the basic version of the Gaussian prior method that we i...

65 | An estimate of an upper bound for the entropy of English
- Brown, Pietra, et al.
- 1992
Citation Context: ...polate the maximum likelihood estimate of the n-gram probability q_ML(w_i | w_{i−(n−1)}^{i−1}) with an estimate of the (n−1)-gram probability Pr(w_i | w_{i−(n−2)}^{i−1}) [14, 15]: q_int(w_i | w_{i−(n−1)}^{i−1}) = λ q_ML(w_i | w_{i−(n−1)}^{i−1}) + (1 − λ) q_int(w_i | w_{i−(n−2)}^{i−1}), 0 ≤ λ ≤ 1. (4) The lower-order estimate can be defined ana...

50 | A model of lexical attraction and repulsion
- Beeferman, Berger, et al.
- 1997
Citation Context: ...y and quantitatively similar to the empirical ideal. Not only can the Gaussian prior be applied to maximum entropy modeling, but it can also be applied in the more general minimum divergence paradigm [37, 38]. Maximizing entropy is equivalent to finding the model with the smallest Kullback-Leibler divergence from the uniform distribution. In minimum divergence modeling, one selects the model satisfying th...

37 | Adaptive language modeling using the maximum entropy principle
- Lau, Rosenfeld, et al.
- 1993
Citation Context: ...tional ME models share many of the same properties as joint models, including being maximum likelihood models, and have computational and performance advantages over joint models in language modeling [24, 5]. A conditional maximum entropy model has the form q_ME(y | x) = (1/Z_λ(x)) exp(Σ_{i=1}^F λ_i f_i(x, y)). (8) To construct a maximum entropy n-gram model, we take x = w_{i−(n−1)}^{i−1} to be the...

35 | Evaluation metrics for language models
- Chen, Beeferman, et al.
- 1998
Citation Context: ...use the derivative measure cross-entropy H_q(X′) = log₂ PP_q(X′), which can be interpreted as the average number of bits needed to code each word in the test set using the model q. Chen et al. [8, 22] also conducted experiments investigating how the cross-entropy of a language model is related to its performance when used in a speech recognition system. They found a strong linear correlation betwe...

35 | Estimation of probabilities from the language model of the IBM speech recognition system
- Nádas
- 1984
Citation Context: ...can contrast a Gaussian prior on λ parameters with previous work in n-gram smoothing where priors have been applied directly in probability space. MacKay and Peto [29] use a Dirichlet prior and Nádas [30] uses a Beta prior, both resulting in linear discounting, which has been shown to perform suboptimally. In the basic version of the Gaussian prior method that we implemented, we had n free parameters...

27 | Just-in-time language modeling
- Berger, Miller
- 1998
Citation Context: ...ntial models, and like other maximum likelihood methods is prone to overfitting of training data. While several smoothing methods for maximum entropy models have been proposed to address this problem [6, 1, 7, 5], previous results do not make it clear how these smoothing methods compare with smoothing methods for other types of related models. However, there has been a great deal of research in smoothing n-gr...

20 | Cluster Expansions and Iterative Scaling for Maximum Entropy Language Models
- Lafferty, Suhm
- 1996
Citation Context: ...alized to zero and improved iterative scaling is applied to train the model. Iterative scaling is terminated when the perplexity of a held-out set no longer decreases significantly. Cluster expansion [31] is employed to reduce computation. In the implementation ME-no-smooth, no smoothing is performed. (Since training is terminated when performance on a held-out set no longer improves, no probabilities...

16 | Adaptive Statistical Language Modelling
- Lau
- 1994
Citation Context: ...ntial models, and like other maximum likelihood methods is prone to overfitting of training data. While several smoothing methods for maximum entropy models have been proposed to address this problem [6, 1, 7, 5], previous results do not make it clear how these smoothing methods compare with smoothing methods for other types of related models. However, there has been a great deal of research in smoothing n-gr...


8 | Specification of the 1995 ARPA Hub 3 evaluation: Unlimited vocabulary NAB news baseline
- Stern
- 1996
Citation Context: ...σ_m, one for each level of the n-gram model. We used data from four sources: the Brown corpus, which contains text from a number of miscellaneous sources [32]; Wall Street Journal (WSJ) newspaper text [33]; the Broadcast News (BN) corpus, which contains transcriptions of television and radio news shows [34]; and the Switchboard (SWB) corpus, which contains transcriptions of telephone conversations [35]...

7 | Extension to the maximum entropy method
- Newman
- 1977
Citation Context: ...s is slightly worse than the perplexities achieved by the deleted interpolation and Good-Turing discounted ME models. 4.3 Fat Constraints. Other methods for relaxing constraints include work by Newman [27] and Khudanpur [28]. In these algorithms, instead of selecting the maximum entropy model over models q(x) that satisfy a set of constraints exactly, they only require that the given marginals of q(x)...

6 | Personal communication
- Lafferty
- 1997
Citation Context: ...gram models, this domain is well-suited to gauge the performance of maximum entropy smoothing methods. Over a large number of data sets, we find that an ME smoothing method proposed to us by Lafferty [1] performs as well as or better than all other algorithms under consideration. This general and efficient method involves using a Gaussian prior on the parameters of the model and selecting maximum a p...

6 |
Hub 4: Business Broadcast News
- Rudnicky
- 1996
(Show Context)
Citation Context ...tains text from a number of miscellaneous sources [32]; Wall Street Journal (WSJ) newspaper text [33]; the Broadcast News (BN) corpus, which contains transcriptions of television and radio news shows =-=[34]-=-; and the Switchboard (SWB) corpus, which contains transcriptions of telephone conversations [35]. In each experiment, we selected a training set of a given length from one source, and two held-out se... |

5 | A method of maximum entropy estimation with relaxed constraints
- Khudanpur
- 1995
Citation Context: ...than the perplexities achieved by the deleted interpolation and Good-Turing discounted ME models. 4.3 Fat Constraints. Other methods for relaxing constraints include work by Newman [27] and Khudanpur [28]. In these algorithms, instead of selecting the maximum entropy model over models q(x) that satisfy a set of constraints exactly, they only require that the given marginals of q(x) fall in some range...

2 | A maximum penalized entropy construction of conditional log-linear language and translation models using learned features and a generalized Csiszár algorithm
- Brown, Pietra, et al.
- 1992
Citation Context: ...ework. The maximum entropy models described in Section 1.1 are joint models; to create the conditional distributions used in conventional n-gram models we use the framework introduced by Brown et al. [23]. Instead of estimating a joint distribution q(x) over samples x, we estimate a conditional distribution q(y | x) over samples (x, y). Instead of constraints as given by equation (2), we have constraint...

1 | Statistical modeling by maximum entropy (unpublished report)
- Della Pietra, Della Pietra
- 1993
Citation Context: ...ified Kneser-Ney smoothing, would outperform deleted interpolation by a much larger margin. 4.2 Fuzzy Maximum Entropy. In the fuzzy maximum entropy framework developed by Della Pietra and Della Pietra [25], instead of requiring that constraints are satisfied exactly, a penalty is associated with inexact constraint satisfaction. Finding the maximum entropy model is equivalent to finding the model q(x)...

1 |
erty, personal communication
- La
- 1997
(Show Context)
Citation Context ... n-gram models, this domain is well-suited to gauge the performance of maximum entropy smoothing methods. Over a large number of data sets, we nd that an ME smoothing method proposed to us by La erty =-=[1]-=- performs as well as or better than all other algorithms under consideration. This general and e cient method involves using a Gaussian prior on the parameters of the model and selecting maximum apost... |