## Entropy-based Pruning of Backoff Language Models (1998)

Citations: 129 (7 self)

### BibTeX

@MISC{Stolcke98entropy-basedpruning,
    author = {Andreas Stolcke},
    title = {Entropy-based Pruning of Backoff Language Models},
    year = {1998}
}

### Abstract

A criterion for pruning parameters from N-gram backoff language models is developed, based on the relative entropy between the original and the pruned model. It is shown that the relative entropy resulting from pruning a single N-gram can be computed exactly and efficiently for backoff models. The relative entropy measure can be expressed as a relative change in training set perplexity. This leads to a simple pruning criterion whereby all N-grams that change perplexity by less than a threshold are removed from the model. Experiments show that a production-quality Hub4 LM can be reduced to 26% of its original size without increasing recognition error. We also compare the approach to a heuristic pruning criterion by Seymore and Rosenfeld [9], and show that their approach can be interpreted as an approximation to the relative entropy criterion. Experimentally, both approaches select similar sets of N-grams (about 85% overlap), with the exact relative entropy criterion giving marginally better performance.
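The pruning criterion described in the abstract can be illustrated with a small sketch. The code below is not the paper's efficient closed-form computation; it is a brute-force version on a hypothetical toy bigram backoff model, assuming the relation (PP' - PP) / PP = exp(D(p || p')) - 1 between relative entropy and relative perplexity change. The vocabulary, the probabilities, the history marginals `p_hist`, and all function names are illustrative assumptions.

```python
import math

# Hypothetical toy model (all numbers are made up for illustration).
unigram = {"a": 0.5, "b": 0.3, "c": 0.2}   # lower-order distribution p(w)
p_hist = {"a": 0.5, "b": 0.3, "c": 0.2}    # assumed history marginals p(h)
# explicit[h][w] = p(w | h) for bigrams that have an explicit estimate
explicit = {
    "a": {"a": 0.1, "b": 0.6},
    "b": {"a": 0.5, "c": 0.21},
}

def backoff_weight(expl_h):
    """Backoff weight alpha(h) that makes p(. | h) sum to one."""
    left_over = 1.0 - sum(expl_h.values())
    denom = 1.0 - sum(unigram[w] for w in expl_h)
    return left_over / denom

def cond_prob(model, h, w):
    """p(w | h): explicit estimate if present, else alpha(h) * p(w)."""
    expl_h = model.get(h, {})
    if w in expl_h:
        return expl_h[w]
    return backoff_weight(expl_h) * unigram[w]

def relative_entropy(orig, pruned):
    """D(p || p') summed by brute force over all (h, w) pairs."""
    d = 0.0
    for h, ph in p_hist.items():
        for w in unigram:
            p = cond_prob(orig, h, w)
            q = cond_prob(pruned, h, w)
            d += ph * p * (math.log(p) - math.log(q))
    return d

def perplexity_change(model, h, w):
    """Relative perplexity change from pruning the single bigram (h, w)."""
    candidate = {hh: dict(ws) for hh, ws in model.items()}
    del candidate[h][w]
    return math.exp(relative_entropy(model, candidate)) - 1.0

def prune(model, threshold):
    """Drop every bigram whose individual perplexity change is below threshold."""
    kept = {h: dict(ws) for h, ws in model.items()}
    for h, ws in model.items():
        for w in ws:
            if perplexity_change(model, h, w) < threshold:
                del kept[h][w]
    return kept

pruned_model = prune(explicit, threshold=1e-3)
```

Each bigram is scored on its own against the original model, and the backoff weights of the surviving model are recomputed on demand by `backoff_weight`. On this toy data, the two bigrams with history `"b"` sit close to their backed-off estimates and fall below the threshold, while the two with history `"a"` are kept.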

### Citations

9294 | Elements of Information Theory
- Cover, Thomas
- 1991
Citation Context ...tween the distribution embodied by the original model and that of the pruned model. A standard measure of divergence between distributions is relative entropy or Kullback-Leibler distance (see, e.g., [2]). Although not strictly a distance metric, it is a non-negative, continuous function that is zero if and only if the two distributions are identical. Let p(·|·) denote the conditional proba...

704 | Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer
- Katz
- 1987
Citation Context ...perimentally, both approaches select similar sets of N-grams (about 85% overlap), with the exact relative entropy criterion giving marginally better performance. 1. Introduction N-gram backoff models [5], despite their shortcomings, still dominate as the technology of choice for state-of-the-art speech recognizers [4]. Two sources of performance improvements are the use of higher-order models (severa...

652 | Text Compression
- Bell, Cleary, et al.
- 1990
Citation Context ...s by no means new. Relative entropy minimization (sometimes in the guise of likelihood maximization) is the basis of many model optimization techniques proposed in the past, e.g., for text compression [1], Markov model induction [10, 7]. Kneser [6] first suggested applying it to backoff N-gram models, although, as shown in Section 5, the heuristic pruning algorithm of Seymore and Rosenfeld [9] amounts...

413 | The Population Frequencies of Species and the Estimation of Population Parameters
- Good
- 1953
Citation Context ... function of pruning threshold and language model sizes. As noted in Section 2, the pruning algorithm is applicable irrespective of the particular N-gram estimator used. We used Good-Turing smoothing [3] throughout and did not investigate possible interactions between smoothing methods and pruning. Table 1 shows model size, perplexity and word error results as determined on the development test set, ...

140 | Hidden Markov model induction by Bayesian model merging
- Stolcke, Omohundro
- 1992
Citation Context ...entropy minimization (sometimes in the guise of likelihood maximization) is the basis of many model optimization techniques proposed in the past, e.g., for text compression [1], Markov model induction [10, 7]. Kneser [6] first suggested applying it to backoff N-gram models, although, as shown in Section 5, the heuristic pruning algorithm of Seymore and Rosenfeld [9] amounts to an approximate relative entr...

80 | The power of amnesia
- Ron, Singer, et al.
- 1994
Citation Context ...entropy minimization (sometimes in the guise of likelihood maximization) is the basis of many model optimization techniques proposed in the past, e.g., for text compression [1], Markov model induction [10, 7]. Kneser [6] first suggested applying it to backoff N-gram models, although, as shown in Section 5, the heuristic pruning algorithm of Seymore and Rosenfeld [9] amounts to an approximate relative entr...

66 | Scalable Backoff Language Models
- Seymore, Rosenfeld
- 1996
Citation Context ... that a production-quality Hub4 LM can be reduced to 26% of its original size without increasing recognition error. We also compare the approach to a heuristic pruning criterion by Seymore and Rosenfeld [9], and show that their approach can be interpreted as an approximation to the relative entropy criterion. Experimentally, both approaches select similar sets of N-grams (about 85% overlap), with the ex...

33 | Up from trigrams! The struggle for improved language models
- Jelinek
- 1991
Citation Context ...iterion giving marginally better performance. 1. Introduction N-gram backoff models [5], despite their shortcomings, still dominate as the technology of choice for state-of-the-art speech recognizers [4]. Two sources of performance improvements are the use of higher-order models (several DARPA Hub4 sites now use 4-gram or 5-gram models) and the inclusion of more training data from more sources (Hub4 m...

27 | Statistical language modeling using a variable context length
- Kneser
- 1996
Citation Context ...ave explicit conditional probability estimates assigned by the model, so as to maximize performance (i.e., minimize perplexity and/or recognition error) while minimizing model size. As pointed out in [6], pruning (selecting parameters from) a full N-gram model of higher order amounts to building a variable-length N-gram model, i.e., one in which training set contexts are not uniformly represented by ...

5 | Acoustic modeling for the SRI Hub4 partitioned evaluation continuous speech recognition system
- Sankar, Heck, et al.
- 1997
Citation Context ... ), respectively, to obtain α'(h) for each pruned w. 4. Experiments We evaluated relative entropy-based language model pruning in the Broadcast News domain, using SRI's 1996 Hub4 evaluation system [8]. N-best lists generated with a bigram language model were rescored with various pruned versions of a large four-gram language model. (Footnote 1: We used the 1996 system, partly due to time constraints, partl...)

2 | Hidden Markov model induction by Bayesian model merging
- Stolcke, Omohundro
- 1993
Citation Context ...entropy minimization (sometimes in the guise of likelihood maximization) is the basis of many model optimization techniques proposed in the past, e.g., for text compression [1], Markov model induction [10, 7]. Kneser [6] first suggested applying it to backoff N-gram models, although, as shown in Section 5, the heuristic pruning algorithm of Seymore and Rosenfeld [9] amounts to an approximate relative entr...

1 | Scalable backoff language models
- Seymore, Rosenfeld
- 1996
Citation Context ... that a production-quality Hub4 LM can be reduced to 26% of its original size without increasing recognition error. We also compare the approach to a heuristic pruning criterion by Seymore and Rosenfeld [9], and show that their approach can be interpreted as an approximation to the relative entropy criterion. Experimentally, both approaches select similar sets of N-grams (about 85% overlap), with the ex...