## Smoothing Methods In Maximum Entropy Language Modeling (1999)

Venue: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume I

Citations: 7 (1 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Martin99smoothingmethods,
  author    = {S. C. Martin and H. Ney and J. Zaplo},
  title     = {Smoothing Methods In Maximum Entropy Language Modeling},
  booktitle = {Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume I},
  year      = {1999},
  pages     = {545--548}
}
```

### Abstract

This paper discusses various aspects of smoothing techniques in maximum entropy language modeling, a topic not sufficiently covered by previous publications. We show (1) that straightforward maximum entropy models with nested features, e.g. tri-, bi-, and unigrams, result in unsmoothed relative frequencies models; (2) that maximum entropy models with nested features and discounted feature counts approximate backing-off smoothed relative frequencies models with Kneser's advanced marginal back-off distribution; this explains some of the reported success of maximum entropy models in the past; (3) perplexity results for nested and non-nested features, e.g. trigrams and distance-trigrams, on a 4-million word subset of the Wall Street Journal Corpus, showing that the smoothing method has more effect on the perplexity than the method to combine information.
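
The log-linear functional form that the abstract refers to can be sketched as follows. This is a minimal illustration, not the paper's trained model: the feature functions, weights, and toy vocabulary are assumptions chosen only to show the shape of p(w|h) = exp(Σᵢ λᵢ fᵢ(h, w)) / Z(h) with nested trigram, bigram, and unigram indicator features.

```python
import math

def maxent_prob(w, h, features, weights, vocab):
    """Maximum entropy / log-linear model:
    p(w|h) = exp(sum_i lambda_i * f_i(h, w)) / Z(h),
    where Z(h) normalizes over the whole vocabulary."""
    def score(word):
        return sum(lam * f(h, word) for lam, f in zip(weights, features))
    z = sum(math.exp(score(cand)) for cand in vocab)  # normalization Z(h)
    return math.exp(score(w)) / z

# Illustrative nested n-gram indicator features for a history h = (u, v).
vocab = ["the", "cat", "sat"]
features = [
    lambda h, w: 1.0 if h == ("the", "cat") and w == "sat" else 0.0,  # trigram
    lambda h, w: 1.0 if h[1] == "cat" and w == "sat" else 0.0,        # bigram
    lambda h, w: 1.0 if w == "the" else 0.0,                          # unigram
]
weights = [1.2, 0.8, 0.3]  # assumed lambdas, not trained values
p = maxent_prob("sat", ("the", "cat"), features, weights, vocab)
```

Because the trigram and bigram features are nested, both fire for the same event; how their counts are smoothed is exactly the question the paper studies.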

### Citations

573 | Inducing features of random fields
- Della Pietra, Della Pietra, Lafferty
- 1997

Citation Context: ...of the Wall Street Journal Corpus, showing that the smoothing method has more effect on the perplexity than the method to combine information. 1. MAXIMUM ENTROPY APPROACH The maximum entropy principle [1, 5] is a well-defined method for incorporating different types of features into a language model [4, 9]. For a word w given its history h it has the following functional form [2, pp. 83-87]: p_λ(w|h) = ex...

444 | Generalized iterative scaling for log-linear models
- Darroch, Ratcliff
- 1972

Citation Context: ...auxiliary function Q_i(λ) and the λ-independent feature counts N_i. There is no closed solution to the set of constraint equations. We train them with the Generalized Iterative Scaling (GIS) algorithm [3] implemented as described in [10] with the addition of Ristad's speedup technique [11]. In this paper the baseline maximum entropy model uses the nested trigram, bigram, and unigram features with (h, ...
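
The GIS update this context refers to can be sketched as below. It is a toy implementation under the classical GIS assumption that every (h, w) event activates the same total feature mass C (otherwise a slack feature must be added); the training data and feature functions are illustrative, and the update rule is λᵢ ← λᵢ + (1/C)·log(Nᵢ/Eᵢ) with observed counts Nᵢ and model expectations Eᵢ.

```python
import math

def gis(train, features, vocab, iterations=100):
    """Generalized Iterative Scaling for a conditional log-linear model.
    Update: lambda_i += (1/C) * log(N_i / E_i)."""
    lam = [0.0] * len(features)
    # C: total feature value per event (assumed constant across events)
    C = max(sum(f(h, w) for f in features) for (h, w) in train)
    # N_i: observed feature counts on the training data
    N = [sum(f(h, w) for (h, w) in train) for f in features]
    for _ in range(iterations):
        E = [0.0] * len(features)  # expected counts under the current model
        for (h, _) in train:
            expo = {c: math.exp(sum(l * f(h, c) for l, f in zip(lam, features)))
                    for c in vocab}
            z = sum(expo.values())
            for i, f in enumerate(features):
                E[i] += sum(expo[c] / z * f(h, c) for c in vocab)
        lam = [l + math.log(n / e) / C if n > 0 else l
               for l, n, e in zip(lam, N, E)]
    return lam

# Toy usage (assumed data): two outcomes after the context "x".
train = [("x", "a"), ("x", "a"), ("x", "b")]
vocab = ["a", "b"]
feats = [lambda h, w: 1.0 if w == "a" else 0.0,
         lambda h, w: 1.0 if w == "b" else 0.0]
lam = gis(train, feats, vocab)
```

With one indicator per word the model should converge to the relative frequencies p(a|x) = 2/3 and p(b|x) = 1/3, which is the unsmoothed behaviour the paper's point (1) describes.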

393 | The population frequencies of species and the estimation of population parameters
- Good
- 1953

Citation Context: ...marginal back-off distribution we get a reduction in perplexity by 10% for the maximum entropy model with distance-m-gram features. A similar figure is reported in [9] using Turing-Good smoothing [6] for the maximum entropy model [9, p. 204], a smoothing method comparable to absolute discounting [8]. However, as can be seen from Table 2, roughly a third of this perplexity reduction is already ach...

296 | Improved backing-off for m-gram language modeling
- Kneser, Ney
- 1995

Citation Context: ...n_{>0}(·, v, w) / Σ_{w̃: N(u,v,w̃)=0} n_{>0}(·, v, w̃). Thus, the resulting model is a standard backing-off model [8] with a back-off distribution β_uv(w) known as Kneser's marginal distribution [7]. A closed solution including unigram features has not yet been found for either smoothing approach, but we assume that the resulting models would be similar to the above. 3. EXPERIMENTAL RESULTS For the ex...
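
The back-off construction in this context can be illustrated for the bigram case. The sketch below is an assumption-laden toy (discount d and data are invented): seen events get absolutely discounted relative frequencies, and the freed mass is redistributed over unseen words in proportion to Kneser's marginal distribution, which counts distinct left contexts rather than raw occurrences.

```python
from collections import defaultdict

def kneser_backoff_bigram(bigrams, vocab, d=0.5):
    """Backing-off bigram model with absolute discounting and a
    Kneser marginal (continuation-count) back-off distribution."""
    count = defaultdict(int)          # bigram counts N(v, w)
    ctx_total = defaultdict(int)      # context counts N(v)
    left_contexts = defaultdict(set)  # distinct predecessors of each w
    for v, w in bigrams:
        count[(v, w)] += 1
        ctx_total[v] += 1
        left_contexts[w].add(v)
    n_types = sum(len(s) for s in left_contexts.values())

    def beta(w):  # Kneser's marginal back-off distribution
        return len(left_contexts[w]) / n_types

    def prob(w, v):
        seen = {x for x in vocab if count[(v, x)] > 0}
        if w in seen:
            return (count[(v, w)] - d) / ctx_total[v]
        freed = d * len(seen) / ctx_total[v]  # discounted probability mass
        denom = sum(beta(x) for x in vocab if x not in seen)
        return freed * beta(w) / denom if denom > 0 else 0.0

    return prob

# Toy corpus (assumed data), just to exercise the model.
bigrams = [("a", "b"), ("a", "b"), ("b", "a"), ("c", "a"), ("a", "c")]
vocab = ["a", "b", "c"]
prob = kneser_backoff_bigram(bigrams, vocab, d=0.5)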

253 | A maximum entropy approach to adaptive statistical language modelling
- Rosenfeld
- 1996

Citation Context: ...bigrams because bigrams are better trained than trigrams. Thus, the way in which the features are combined becomes more dominant, obviously in favour of the maximum entropy model, as theory suggests [1, 9]. Compared to the backing-off smoothed relative frequencies model without marginal back-off distribution we get a reduction in perplexity by 10% for the maximum entropy model with distance-m-gram...

209 | Discrete Multivariate Analysis
- Bishop, Fienberg, et al.
- 1975

50 | Statistical language modeling using leaving-one-out
- Ney, Martin, et al.
- 1997

Citation Context: ...because of e^{λ_uvw + λ_vw + λ_w} > 0, even though bigram and unigram features exist for backing-off. Therefore, smoothing must be applied, a technique that redistributes probability mass from seen to unseen events [8]. 2.2. Smoothing Using Cut-Offs and Absolute Discounting We do not know an obvious smoothing technique for maximum entropy, so we adapted techniques from known smoothing methods: • Cut-Offs: Proba...
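
The two adapted smoothing techniques this context names, cut-offs and absolute discounting, amount to simple transformations of the feature counts N_i before training. The sketch below shows that idea in isolation; the threshold, discount value, and example counts are assumptions for illustration, not values from the paper.

```python
def smooth_feature_counts(counts, cutoff=1, discount=0.7):
    """Adapted smoothing for maximum entropy feature counts:
    - Cut-offs: drop features observed at most `cutoff` times.
    - Absolute discounting: subtract a constant `discount` from each
      surviving count, freeing mass for events the retained features
      do not cover."""
    return {feat: c - discount
            for feat, c in counts.items()
            if c > cutoff}

# Assumed example: a frequent trigram feature and a singleton one.
counts = {("the", "cat", "sat"): 5, ("a", "dog", "ran"): 1}
smoothed = smooth_feature_counts(counts)
```

Training GIS on the discounted counts, rather than the raw ones, is what the paper shows approximates a backing-off model with Kneser's marginal back-off distribution.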

3 | Distance bigram language modelling using maximum entropy
- Simons, Ney, et al.
- 1997

Citation Context: ...λ-independent feature counts N_i. There is no closed solution to the set of constraint equations. We train them with the Generalized Iterative Scaling (GIS) algorithm [3] implemented as described in [10] with the addition of Ristad's speedup technique [11]. In this paper the baseline maximum entropy model uses the nested trigram, bigram, and unigram features with (h, w) = (u, v, w): f_uvw(ũ, ṽ, ...

2 | A Maximum Entropy Approach to Natural Language Processing
- Berger, Della Pietra, Della Pietra
- 1996

Citation Context: ...of the Wall Street Journal Corpus, showing that the smoothing method has more effect on the perplexity than the method to combine information. 1. MAXIMUM ENTROPY APPROACH The maximum entropy principle [1, 5] is a well-defined method for incorporating different types of features into a language model [4, 9]. For a word w given its history h it has the following functional form [2, pp. 83-87]: p_λ(w|h) = ex...

1 | Adaptive Language Modeling Using Minimum Discriminant Information
- Della Pietra, Della Pietra, et al.
- 1992

Citation Context: ...y than the method to combine information. 1. MAXIMUM ENTROPY APPROACH The maximum entropy principle [1, 5] is a well-defined method for incorporating different types of features into a language model [4, 9]. For a word w given its history h it has the following functional form [2, pp. 83-87]:

p_λ(w|h) = exp[Σ_i λ_i f_i(h, w)] / Z(h)    (1)
Z(h) := Σ_{w̃} exp[Σ_i λ_i f_i(h, w̃)],

where for each feature i w...

1 | Dependency Language Modeling
- Stolcke, Chelba, et al.
- 1996

Citation Context: ...solution to the set of constraint equations. We train them with the Generalized Iterative Scaling (GIS) algorithm [3] implemented as described in [10] with the addition of Ristad's speedup technique [11]. In this paper the baseline maximum entropy model uses the nested trigram, bigram, and unigram features with (h, w) = (u, v, w):

f_uvw(ũ, ṽ, w̃) = 1 if w̃ = w and ṽ = v and ũ = u, 0 otherw...