## Pruning Exponential Language Models

### BibTeX

```bibtex
@MISC{Chen_pruningexponential,
  author = {Stanley F. Chen and Abhinav Sethy and Bhuvana Ramabhadran},
  title  = {Pruning Exponential Language Models},
  year   = {}
}
```

### Abstract

Language model pruning is an essential technology for speech applications running on resource-constrained devices, and many pruning algorithms have been developed for conventional word n-gram models. However, while exponential language models can give superior performance, there has been little work on the pruning of these models. In this paper, we propose several pruning algorithms for general exponential language models. We show that our best algorithm applied to an exponential n-gram model outperforms existing n-gram model pruning algorithms by up to 0.4% absolute in speech recognition word-error rate on Wall Street Journal and Broadcast News data sets. In addition, we show that Model M, an exponential class-based language model, retains its performance improvement over conventional word n-gram models when pruned to equal size, with gains of up to 2.5% absolute in word-error rate.

### Citations

857 | An empirical study of smoothing techniques for language modeling
- Chen, Goodman
- 1999
Citation Context: ..., and α and σ are regularization hyperparameters. Unpruned n-gram models regularized this way have about the same performance as conventional n-gram models smoothed with modified Kneser-Ney smoothing [4]. Model M is a class-based n-gram model composed of two separate exponential models, one for predicting classes and one for predicting words. Let $p_{\text{ng}}(y|\theta)$ denote an exponential n-gram model and let pn...
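The exponential n-gram models referred to in this context are log-linear models, $p_\Lambda(y|x) = \exp\big(\sum_i \lambda_i f_i(x,y)\big) / Z_\Lambda(x)$. A minimal sketch of such a model, with invented indicator features, weights, and vocabulary (none of these values are from the paper):

```python
import math

def exp_model_prob(features, lambdas, y, x, vocab):
    """p_Lambda(y|x) = exp(sum_i lambda_i f_i(x, y)) / Z_Lambda(x).

    `features(x, w)` returns the names of the features active on
    history x and word w; `lambdas` maps feature names to weights.
    """
    def score(word):
        return sum(lambdas.get(f, 0.0) for f in features(x, word))

    z = sum(math.exp(score(w)) for w in vocab)  # normalizer Z_Lambda(x)
    return math.exp(score(y)) / z

# Toy setup: unigram and bigram indicator features (illustrative only).
def features(history, word):
    return [("uni", word), ("bi", history, word)]

vocab = ["a", "b"]
lambdas = {("uni", "a"): 0.5, ("bi", "a", "b"): 1.0}
p_b = exp_model_prob(features, lambdas, "b", "a", vocab)
```

Pruning such a model amounts to deleting entries from `lambdas`, which is why the scoring criteria in the contexts below estimate the damage caused by setting an individual $\lambda_\theta$ to zero.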

702 | Class-based n-gram models of natural language. Computational Linguistics
- Brown, Pietra, et al.
- 1990
Citation Context: ... For the Wall Street Journal experiments, we use the enhanced word classing algorithm for Model M developed in [5]. For the Broadcast News experiments, we use bigram mutual information word clustering [12] to build word classes. In Figure 5, we compare algorithms for pruning Model M against our main baselines for word n-gram model pruning. Despite its larger unpruned size, Model M consistently yields s...

554 | Inducing features of random fields
- Pietra, Pietra, et al.
- 1997
Citation Context: ...ogether, we get

$$\text{score}_{\text{train}}(\theta) = c_{\text{trn}}(\theta)\,\lambda_\theta + D \cdot E_{p_\Lambda}[f_\theta] \cdot (e^{-\lambda_\theta} - 1) - \alpha|\lambda_\theta| - \frac{\lambda_\theta^2}{2\sigma^2} \quad (14)$$

Our second method for approximating the effect of renormalization, norm pruning, is inspired by ideas from [10]. We make the assumption that $p_\Lambda(y_\theta|x)$ is constant across all $x$ for which a feature $\theta$ is active. Then, we have

$$E_{p_\Lambda}[f_\theta] = \frac{1}{D}\, p_\Lambda(y_\theta|x) \sum_{x:\, f_\theta \text{ active}} c_{\text{trn}}(x) \equiv \frac{1}{D}\, p_\Lambda(y_\theta|x)\, c_{\text{hist}}(f_\theta)$$

where $c_{\text{hist}}(f_\theta)$ den...
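Once $c_{\text{trn}}(\theta)$, $E_{p_\Lambda}[f_\theta]$, and the hyperparameters are fixed, the scoring formula in this context and its norm-pruning approximation are direct to compute. A sketch under that reading; all numbers and variable names are illustrative, not taken from the paper:

```python
import math

def score_train(c_trn, e_f, lam, D, alpha, sigma2):
    """Eq. (14): estimated change in the regularized training
    objective from pruning feature theta (setting lambda_theta to 0)."""
    return (c_trn * lam
            + D * e_f * (math.exp(-lam) - 1.0)
            - alpha * abs(lam)
            - lam * lam / (2.0 * sigma2))

def e_f_norm_approx(p_y_given_x, c_hist, D):
    """Norm-pruning approximation: E_{p_Lambda}[f_theta] is taken as
    (1/D) * p_Lambda(y_theta|x) * c_hist(f_theta), assuming
    p_Lambda(y_theta|x) is constant over histories where theta fires."""
    return p_y_given_x * c_hist / D

# Illustrative numbers only.
e_f = e_f_norm_approx(p_y_given_x=0.2, c_hist=50.0, D=1000.0)
s = score_train(c_trn=12.0, e_f=e_f, lam=0.8, D=1000.0, alpha=0.5, sigma2=6.0)
```

Features whose score falls below a threshold would be the pruning candidates under this criterion.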

274 | Improved backing-off for m-gram language modeling
- Kneser, Ney
- 1995
Citation Context: ... exponential language models as well. A. Weighted difference pruning. One of the earliest and simplest algorithms for pruning conventional n-gram models is weighted difference pruning [8]. As noted in [9], smoothed conventional n-gram models can generally be expressed as

$$q(w_j \mid w_{j-n+1}^{j-1}) = \begin{cases} p_s(w_j \mid w_{j-n+1}^{j-1}) & \text{if } c(w_{j-n+1}^{j}) > 0 \\ \alpha(w_{j-n+1}^{j-1}) \cdot p_s(w_j \mid w_{j-n+2}^{j-1}) & \text{otherwise} \end{cases} \quad (5)$$

where $w_j^k \equiv w_j w_{j+1} \cdots w_k$ and...
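The backoff form in eq. (5) of this context translates directly into a lookup-with-fallback. A toy sketch; the probabilities, and the discounting inside $p_s$, are invented for illustration:

```python
def backoff_prob(w, history, ps_higher, ps_lower, alpha):
    """Eq. (5): use the higher-order smoothed estimate when the full
    n-gram was seen; otherwise back off, scaled by alpha(history)."""
    if (history, w) in ps_higher:            # c(w_{j-n+1}^j) > 0
        return ps_higher[(history, w)]
    return alpha[history] * ps_lower[w]      # alpha(h) * p_s(w | shorter h)

# Toy bigram-over-unigram example (all values invented).
ps_bigram = {(("the",), "cat"): 0.4}
ps_unigram = {"cat": 0.1, "dog": 0.05}
alpha = {("the",): 0.6}

p_seen = backoff_prob("cat", ("the",), ps_bigram, ps_unigram, alpha)
p_backoff = backoff_prob("dog", ("the",), ps_bigram, ps_unigram, alpha)
```

Weighted difference pruning scores each explicit higher-order entry by how much its probability differs from what backing off would give, weighted by how often the entry is used.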

246 | A maximum entropy approach to adaptive statistical language modelling
- Rosenfeld
- 1996
Citation Context: ...ns of up to 1.5% absolute with heavier pruning. IV. RELATED WORK. Here, we discuss previous work on pruning for exponential language models. The use of count cutoffs is a common technique, e.g., [13], [14]. Count cutoffs are compared with smoothing techniques for exponential n-gram models in [2]. The use of $\ell_1$ or $\ell_1 + \ell_2^2$ regularization [3], [15] has been noted to produce sparse solutions; i.e., many...

122 | Entropy-based pruning of backoff language models
- Stolcke
- 1998
Citation Context: ... also be viewed as attempting to optimize test set perplexity. In particular, several algorithms attempt to minimize the Kullback-Leibler distance between the original model and the pruned model [6], [7]:

$$D(p_{\text{orig}} \,\|\, p_{\text{prune}}) = \sum_{x,y} p(x)\, p_{\text{orig}}(y|x) \log \frac{p_{\text{orig}}(y|x)}{p_{\text{prune}}(y|x)} \quad (4)$$

Test set log perplexity is equivalent to the cross-entropy between the empirical test set distribution $p_{\text{tst}}$ and the final mode...
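The KL distance in eq. (4) of this context can be evaluated directly when both models are small enough to enumerate. A sketch with toy conditional distributions (all values invented for illustration):

```python
import math

def kl_pruning_distance(p_x, p_orig, p_prune):
    """Eq. (4): D(p_orig || p_prune) =
    sum_{x,y} p(x) * p_orig(y|x) * log(p_orig(y|x) / p_prune(y|x))."""
    total = 0.0
    for x, px in p_x.items():
        for y, py in p_orig[x].items():
            total += px * py * math.log(py / p_prune[x][y])
    return total

# Toy conditional distributions over two histories.
p_x = {"h1": 0.5, "h2": 0.5}
p_orig = {"h1": {"a": 0.7, "b": 0.3}, "h2": {"a": 0.5, "b": 0.5}}
p_prune = {"h1": {"a": 0.6, "b": 0.4}, "h2": {"a": 0.5, "b": 0.5}}
d = kl_pruning_distance(p_x, p_orig, p_prune)
```

The distance is zero exactly when the pruned model matches the original on every history, and otherwise positive; relative entropy pruning greedily removes the parameters that increase it least.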

65 | Scalable backoff language models
- Seymore, Rosenfeld
- 1996
Citation Context: ...can be applied to exponential language models as well. A. Weighted difference pruning. One of the earliest and simplest algorithms for pruning conventional n-gram models is weighted difference pruning [8]. As noted in [9], smoothed conventional n-gram models can generally be expressed as

$$q(w_j \mid w_{j-n+1}^{j-1}) = \begin{cases} p_s(w_j \mid w_{j-n+1}^{j-1}) & \text{if } c(w_{j-n+1}^{j}) > 0 \\ \alpha(w_{j-n+1}^{j-1}) \cdot p_s(w_j \mid w_{j-n+2}^{j-1}) & \text{otherwise} \end{cases} \quad (5)$$

where $w_j^k \equiv w_j$...

57 | Exponential priors for maximum entropy models
- Goodman
- 2004
Citation Context: ...se of count cutoffs is a common technique, e.g., [13], [14]. Count cutoffs are compared with smoothing techniques for exponential n-gram models in [2]. The use of $\ell_1$ or $\ell_1 + \ell_2^2$ regularization [3], [15] has been noted to produce sparse solutions; i.e., many $\lambda_i$ are set to 0 and can be trivially pruned. For example, 8% of the $\lambda_i$'s in our unpruned Wall Street Journal exponential 4-gram model are zero d...
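Pruning the exactly-zero parameters that $\ell_1$ regularization produces is trivial once training is done, since dropping a zero-weight feature leaves every prediction unchanged. A sketch (feature names are invented):

```python
def prune_zero_features(lambdas, tol=0.0):
    """Drop features whose weight was driven to (near) zero by the
    l1 penalty; with tol == 0 the model's probabilities are unchanged."""
    return {f: lam for f, lam in lambdas.items() if abs(lam) > tol}

# Invented weights; the middle feature was zeroed out during training.
lambdas = {
    ("uni", "cat"): 0.9,
    ("bi", "the", "cat"): 0.0,
    ("tri", "on", "the", "mat"): -0.2,
}
pruned = prune_zero_features(lambdas)
```

As the context notes, this free sparsity only goes so far; reaching heavier pruning levels by raising α tends to hurt test set performance.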

26 | Statistical Language Modeling Using a Variable Context Length
- Kneser
- 1996
Citation Context: ...s can also be viewed as attempting to optimize test set perplexity. In particular, several algorithms attempt to minimize the Kullback-Leibler distance between the original model and the pruned model [6], [7]:

$$D(p_{\text{orig}} \,\|\, p_{\text{prune}}) = \sum_{x,y} p(x)\, p_{\text{orig}}(y|x) \log \frac{p_{\text{orig}}(y|x)}{p_{\text{prune}}(y|x)} \quad (4)$$

Test set log perplexity is equivalent to the cross-entropy between the empirical test set distribution $p_{\text{tst}}$ and the final...

25 | Evaluation and extension of maximum entropy models with inequality constraints
- Kazama, Tsujii
- 2003
Citation Context: ...d can be thought of as an alternate parameterization of the same model space [2]. One advantage of this representation is that smoothing can be done simply and effectively via $\ell_1 + \ell_2^2$ regularization [3], [1], where the parameters $\Lambda$ are chosen to optimize

$$O_{\ell_1+\ell_2^2}(\Lambda) = \log PP_{\text{trn}} + \frac{\alpha}{D} \sum_i |\lambda_i| + \frac{1}{2\sigma^2 D} \sum_i \lambda_i^2 \quad (2)$$

where $PP_{\text{trn}}$ is training set perplexity, D is the number of words in the training set, and...
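The regularized objective in eq. (2) of this context combines training perplexity with the two penalty terms. A sketch of the computation; the numbers fed in are illustrative, not from the paper:

```python
def l1_l2_objective(log_pp_trn, lambdas, D, alpha, sigma2):
    """Eq. (2): O(Lambda) = log PP_trn
       + (alpha / D) * sum_i |lambda_i|
       + (1 / (2 * sigma^2 * D)) * sum_i lambda_i^2."""
    l1 = sum(abs(lam) for lam in lambdas)
    l2 = sum(lam * lam for lam in lambdas)
    return log_pp_trn + (alpha / D) * l1 + l2 / (2.0 * sigma2 * D)

# Illustrative values: three weights, a 100-word training set.
obj = l1_l2_objective(log_pp_trn=5.0, lambdas=[0.5, -1.0, 0.0],
                      D=100.0, alpha=0.5, sigma2=6.0)
```

Dividing both penalties by D keeps the hyperparameters α and σ² roughly comparable across training sets of different sizes.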

21 | A survey of smoothing techniques for maximum entropy models
- Chen, Rosenfeld
- 2000
Citation Context: ...ork on pruning for exponential language models. The use of count cutoffs is a common technique, e.g., [13], [14]. Count cutoffs are compared with smoothing techniques for exponential n-gram models in [2]. The use of $\ell_1$ or $\ell_1 + \ell_2^2$ regularization [3], [15] has been noted to produce sparse solutions; i.e., many $\lambda_i$ are set to 0 and can be trivially pruned. For example, 8% of the $\lambda_i$'s in our unpruned W...

16 | Adaptive Statistical Language Modelling
- Lau
- 1994
Citation Context: ...th gains of up to 1.5% absolute with heavier pruning. IV. RELATED WORK. Here, we discuss previous work on pruning for exponential language models. The use of count cutoffs is a common technique, e.g., [13], [14]. Count cutoffs are compared with smoothing techniques for exponential n-gram models in [2]. The use of $\ell_1$ or $\ell_1 + \ell_2^2$ regularization [3], [15] has been noted to produce sparse solutions; i.e., ...

16 | A comparison of criteria for maximum entropy / minimum divergence feature selection
- Berger, Printz
- 1998
Citation Context: ... effective for light pruning, α needs to be increased for heavier pruning, likely leading to poor test set performance. A related technique to model pruning is model growing, or feature induction. In [16], several criteria for inducing features in a maximum entropy language model are compared, including the feature gain computation described in [10], count cutoffs, and a criterion based on mutual info...

11 | Performance prediction for exponential language models
- Chen
- 2009
Citation Context: ... has been little work on pruning for exponential (or maximum entropy) language models. However, recent work has shown that exponential language models such as Model M can achieve superior performance [1]. In this paper, we show how many existing n-gram model pruning algorithms can be viewed as attempting to optimize estimated test set perplexity, and discuss how ideas from these techniques can be ada...

11 | Multi-class composite n-gram language model
- Yamamoto, Isogai, et al.
- 2003

7 | On growing and pruning Kneser-Ney smoothed n-gram models
- Siivola, Hirsimäki, et al.
- 2007
Citation Context: ...train: eq. (14) with α = 0.5, σ² = 6. perfpred: eq. (14) with α = 0.938, σ² = ∞. norm: eq. (18). iter.grow.rand: iterative random growing and norm pruning. C. Revised Kneser pruning. Revised Kneser pruning [11] is specific to n-gram models smoothed with Kneser-Ney smoothing and variants [9], and improves upon relative entropy pruning in several ways. One of the characteristics of Kneser-Ney smoothing is tha...

6 | Enhanced word classing for model m
- Chen, Chu
Citation Context: ...ss of word $w_j$. Model M has achieved among the largest word-error rate improvements over word n-gram models ever reported, with gains as high as 3% absolute as compared to a Katz-smoothed trigram model [5]. II. PRUNING FOR EXPONENTIAL LANGUAGE MODELS. In this section, we present a general framework for analyzing pruning algorithms. A natural goal in designing a pruning algorithm is to maximize test set ...