## POWER LAW DISCOUNTING FOR N-GRAM LANGUAGE MODELS

Citations: 1 (0 self)

### BibTeX

```bibtex
@MISC{Huang_powerlaw,
  author = {Songfang Huang and Steve Renals},
  title  = {Power Law Discounting for N-gram Language Models},
  year   = {2010},
  note   = {ICASSP 2010}
}
```

### Abstract

We present an approximation to the Bayesian hierarchical Pitman-Yor process language model which maintains the power law distribution over word tokens, while not requiring a computationally expensive approximate inference process. This approximation, which we term power law discounting, has a similar computational complexity to interpolated and modified Kneser-Ney smoothing. We performed experiments on meeting transcription using the NIST RT06s evaluation data and the AMI corpus, with a vocabulary of 50,000 words and a language model training set of up to 211 million words. Our results indicate that power law discounting results in statistically significant reductions in perplexity and word error rate compared to both interpolated and modified Kneser-Ney smoothing, while producing similar results to the hierarchical Pitman-Yor process language model.

Index Terms: language model, smoothing, absolute discount, Kneser-Ney, Bayesian, Pitman-Yor, power law

### Citations

857 | An empirical study of smoothing techniques for language modeling
- Chen, Goodman
- 1996
Citation Context: "...te discount, Kneser-Ney, Bayesian, Pitman-Yor, power law 1. INTRODUCTION Smoothing is crucial when estimating a language model (LM), and a large number of methods have been proposed in the literature [1], including interpolated Kneser-Ney [2] and modified Kneser-Ney [1] smoothing which are generally regarded as the best approaches in practice. The Kneser-Ney approaches are based on absolute discounti..."

760 | SRILM - an extensible language modeling toolkit
- Stolcke
Citation Context: "...AMI meeting corpus. In each case we compared the PLDLM to the IKNLM, the MKNLM and the HPYLM. We trained all the following trigram LMs using cutoff values for counts of 1, by using the SRILM toolkit [11] and the PLDLM program 1. We did not use the strength parameter θ when training PLDLMs, i.e., we set θ = 0 in (6). We used the AMIASR system [12] as the baseline platform for our ASR experiments, using..."

274 | Improved backing-off for m-gram language modeling
- Kneser, Ney
- 1995
Citation Context: "...an-Yor, power law 1. INTRODUCTION Smoothing is crucial when estimating a language model (LM), and a large number of methods have been proposed in the literature [1], including interpolated Kneser-Ney [2] and modified Kneser-Ney [1] smoothing which are generally regarded as the best approaches in practice. The Kneser-Ney approaches are based on absolute discounting, with lower order distributions refl..."
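
The absolute-discounting idea this context describes — subtract a fixed discount from every observed count and redistribute the freed mass to a lower-order distribution — can be sketched in a few lines. This is an illustrative sketch, not code from the cited papers; the function name, the single-context scope, and the uniform treatment of the backoff distribution are all assumptions.

```python
from collections import Counter

def absolute_discount(counts: Counter, D: float, p_lower: dict) -> dict:
    """Absolute discounting for one n-gram context (sketch).

    Subtracts a fixed discount 0 <= D <= 1 from each observed count and
    gives the freed probability mass to a lower-order distribution p_lower
    (interpolated form). With D <= 1, max() never clips, so the result
    sums to exactly 1 when p_lower does.
    """
    total = sum(counts.values())
    lam = D * len(counts) / total          # mass freed by discounting
    return {w: max(counts[w] - D, 0.0) / total + lam * p_lower.get(w, 0.0)
            for w in set(counts) | set(p_lower)}

counts = Counter({"cat": 3, "dog": 1})
p_lower = {"cat": 0.5, "dog": 0.3, "fish": 0.2}   # toy lower-order model
p = absolute_discount(counts, D=0.75, p_lower=p_lower)
assert abs(sum(p.values()) - 1.0) < 1e-9           # proper distribution
```

Kneser-Ney's refinement, as the context notes, is to build `p_lower` from marginal (continuation) constraints rather than raw lower-order counts.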

221 | The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator
- Pitman, Yor
- 1997
Citation Context: "...terms of the frequency ranking of words: a few outcomes have very high probability while most outcomes occur with low probability. It follows that a stochastic process, such as the Pitman-Yor process [7], that has the “rich-get-richer” capacity to generate a power law distribution is able to take advantage of this property for natural language modeling. In this paper we briefly outline the HPYLM, whic..."

79 | A hierarchical Dirichlet language model
- MacKay, Peto
- 1995
Citation Context: "...where S denotes latent variables and Θ = {d_m, θ_m : 0 ≤ m < n} represents hyperparameters. If we set the discounting parameters d_{|u|} = 0 for all u, we obtain the hierarchical Dirichlet language model (HDLM) [9]. The overall predictive probability can be approximately obtained by collecting I samples from the posterior over S and Θ, and then averaging (1) to approximate the integral with samples: P(w|u) ≈ I..."

76 | Interpolating between types and tokens by estimating power-law generators
- Goldwater, Griffiths, et al.
Citation Context: "...oportional to θ + d·t•, where c_k is the number of customers sitting at table k and t• is the current number of occupied tables. Goldwater et al. [8] demonstrated that a Pitman-Yor process is capable of producing a power law distribution with index 1 + d over the number of customers seated at each table. When the Pitman-Yor process is applied to la..."
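
The Chinese restaurant seating scheme quoted in this context (join table k with weight c_k − d, open a new table with weight θ + d·t) is easy to simulate. Below is an illustrative sketch, not the paper's code; the function name and parameter defaults are assumptions.

```python
import random

def pitman_yor_seating(n_customers: int, d: float = 0.8,
                       theta: float = 1.0, seed: int = 0) -> list:
    """Simulate Pitman-Yor Chinese restaurant seating (sketch).

    Each customer joins an existing table k with probability proportional
    to c_k - d, or opens a new table with probability proportional to
    theta + d * t, where t is the current number of occupied tables.
    Returns the list of table occupancies.
    """
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers at table k
    for _ in range(n_customers):
        weights = [c - d for c in tables] + [theta + d * len(tables)]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)   # open a new table
        else:
            tables[k] += 1     # rich get richer
    return tables
```

With a nonzero discount d, the rich-get-richer dynamics yield the power law over table sizes that Goldwater et al. describe: most tables stay singletons while a few grow large.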

24 | The AMI system for the transcription of speech in meetings
- Hain, Burget, et al.
- 2007
Citation Context: "...toff values for counts of 1, by using the SRILM toolkit [11] and the PLDLM program 1. We did not use the strength parameter θ when training PLDLMs, i.e., we set θ = 0 in (6). We used the AMIASR system [12] as the baseline platform for our ASR experiments, using all LMs in the first pass decoding. 5.1. NIST Rich Transcription 2006 Evaluation For the RT06s task we trained LMs on 1.8M words of transcribed..."

9 | Strategies for language model web-data collection
- Wan, Hain
- 2006
Citation Context: "...ational telephone speech (fisher-03-p1), and web data matched to meeting (webmeet; 36.1M words) and conversational (webconv; 162.9M words) speech collected using the approach described by Wan and Hain [13]. In total this resulted in 211.4M words of LM training data (ALL-1). We performed experiments using a vocabulary of 50,000 words. Table 1 shows the perplexity results on the NIST RT06s test data rt06..."

6 | A Hierarchical Bayesian Language Model based on Pitman-Yor Processes
- Teh
- 2006
Citation Context: "...recent body of work in which the Kneser-Ney methods have been shown to approximate a hierarchical Bayesian language model which incorporates a nonparametric prior distribution, the Pitman-Yor process [4]. Our previous work [5] demonstrated the practical application of hierarchical Pitman-Yor process language models (HPYLM) to large vocabulary automatic speech recognition (ASR) of conversational speec..."

5 | A Bayesian interpretation of interpolated Kneser-Ney
- Teh
- 2006
Citation Context: "...of t_{uw} is time and memory intensive. However, the expected number of tables E(t_{u•}) in a Pitman-Yor process used in the HPYLM follows a power law growth with c_{u•}, where • denotes the marginal operation [10]. Based on this observation, we therefore propose a power law discounting LM (PLDLM) which smoothes n-grams as follows: $d = n_1/(n_1 + 2n_2)$, $t_{uw} = f(c_{uw}) = c_{uw}^d$ (4), $t_{u\bullet} = \sum_w t_{uw} = \sum_w c_{uw}^d$, $P^{PLD}(w|u) = \max(c_{uw}...$"
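
The smoothing rule quoted in this context (d = n₁/(n₁ + 2n₂), pseudo table counts t_uw = c_uw^d replacing the sampled table counts of the HPYLM) can be sketched for a single context. This is an illustrative sketch, not the paper's released implementation; the function name, the 0.5 fallback for d, and the uniform stand-in for the lower-order model are assumptions.

```python
from collections import Counter

def power_law_discount_probs(counts: Counter) -> dict:
    """Power law discounting for one context u (sketch).

    counts maps word w -> c_uw. Estimates the discount d from singleton
    and doubleton counts, forms pseudo table counts t_uw = c_uw ** d, and
    interpolates with a lower-order model (uniform here for brevity):
        P(w|u) = max(c_uw - d*t_uw, 0)/c_u. + (d*t_u./c_u.) * P_lower(w)
    """
    n1 = sum(1 for c in counts.values() if c == 1)
    n2 = sum(1 for c in counts.values() if c == 2)
    d = n1 / (n1 + 2 * n2) if (n1 + 2 * n2) > 0 else 0.5  # assumed fallback
    t = {w: c ** d for w, c in counts.items()}  # t_uw = c_uw^d
    c_total = sum(counts.values())              # c_u.
    t_total = sum(t.values())                   # t_u.
    p_lower = 1.0 / len(counts)                 # stand-in lower-order model
    return {w: max(counts[w] - d * t[w], 0.0) / c_total
              + (d * t_total / c_total) * p_lower
            for w in counts}

probs = power_law_discount_probs(Counter({"the": 5, "a": 2, "an": 1}))
assert abs(sum(probs.values()) - 1.0) < 1e-9   # proper distribution
```

Because t_uw = c_uw^d is computed in closed form, this keeps the Kneser-Ney-level cost that the abstract claims, instead of sampling table counts as the full HPYLM does.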

2 | Extensions of Absolute Discounting (Kneser-Ney Method)
- Andrés-Ferrer, Ney
Citation Context: "...Kneser-Ney approaches are based on absolute discounting, with lower order distributions reflecting the marginal constraints. In addition to exploring further constraints and more efficient algorithms [3], there has been a recent body of work in which the Kneser-Ney methods have been shown to approximate a hierarchical Bayesian language model which incorporates a nonparametric prior distribution, the ..."