## A Bayesian Interpretation of Interpolated Kneser-Ney (2006)


Citations: 20 (2 self)

### BibTeX

@TECHREPORT{Teh06abayesian,
  author      = {Yee Whye Teh},
  title       = {A Bayesian Interpretation of Interpolated Kneser-Ney},
  institution = {},
  year        = {2006}
}

### Abstract

Interpolated Kneser-Ney is one of the best smoothing methods for n-gram language models. Previous explanations for its superiority have been based on intuitive and empirical justifications of specific properties of the method. We propose a novel interpretation of interpolated Kneser-Ney as approximate inference in a hierarchical Bayesian model consisting of Pitman-Yor processes. As opposed to past explanations, our interpretation can recover exactly the formulation of interpolated Kneser-Ney, and performs better than interpolated Kneser-Ney when a better inference procedure is used.
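The method the abstract refers to can be illustrated with a short sketch. The bigram version below follows the standard textbook formulation of interpolated Kneser-Ney (absolute discounting of the higher-order counts, interpolated with a "continuation" distribution over word types); the function name, the single shared discount of 0.75, and the toy data in the usage note are illustrative choices, not taken from this paper.

```python
from collections import Counter

def interpolated_kneser_ney(bigrams, discount=0.75):
    """Bigram interpolated Kneser-Ney from a list of (u, w) pairs.

    Returns prob(w, u): the smoothed probability of word w following context u.
    """
    bigram_counts = Counter(bigrams)                 # c(u, w)
    context_totals = Counter(u for u, _ in bigrams)  # c(u)
    fanout = Counter()        # distinct word types observed after context u
    continuation = Counter()  # distinct contexts observed before word w
    for u, w in bigram_counts:
        fanout[u] += 1
        continuation[w] += 1
    num_bigram_types = len(bigram_counts)

    def prob(w, u):
        # Lower-order "continuation" probability: fraction of bigram *types*
        # ending in w -- the Kneser-Ney twist on the unigram distribution.
        p_cont = continuation[w] / num_bigram_types
        c_u = context_totals[u]
        if c_u == 0:
            return p_cont  # unseen context: back off entirely
        # Absolute discounting, interpolated with the continuation distribution.
        lam = discount * fanout[u] / c_u
        return max(bigram_counts[(u, w)] - discount, 0.0) / c_u + lam * p_cont

    return prob
```

For any context seen in training, the probabilities over the vocabulary sum to one: the mass removed by discounting is exactly the mass handed to the continuation distribution via `lam`.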

### Citations

1589 | Bayesian Data Analysis
- Gelman, Carlin, et al.
- 1995
Citation Context: ...ial priors. We will give a new interpretation of interpolated Kneser-Ney as an approximate inference method in a Bayesian model. The model we propose is a straightforward hierarchical Bayesian model (Gelman et al. 1995), where each hidden variable represents the distribution over next words given a particular context. These variables are related hierarchically such that the prior mean of a hidden variable correspon...

1182 | A maximum entropy approach to natural language processing
- Berger, Pietra, et al.
- 1996
Citation Context: ...essing and language modelling given the probabilistic nature of most approaches. Maximum-entropy models have found many uses relating features of inputs to distributions over outputs (Rosenfeld 1994; Berger et al. 1996; McCallum et al. 2000; Lafferty et al. 2001). Use of priors is widespread and a number of studies have been conducted comparing different types of priors (Brand 1999; Chen and Rosenfeld 2000; Goodman...

961 | An empirical study of smoothing techniques for language modeling
- Chen, Goodman
- 1996
Citation Context: ...ng arrangement and over all trigrams that occurred c times in the full training set. The last entry is averaged over all trigrams that occurred at least 50 times. cross-entropy on the validation set (Chen and Goodman 1998). At the optimal values, we folded the validation set into the training set to obtain the final trigram probability estimates. For the hierarchical Pitman-Yor language model we inferred the posterior...

613 | Hierarchical Dirichlet processes
- Teh, Jordan, et al.
- 2006
Citation Context: ... namely interpolated Kneser-Ney, is a great approximation to a Bayesian model. The hierarchical Pitman-Yor process is a natural generalization of the recently proposed hierarchical Dirichlet process (Teh et al. 2006). The hierarchical Dirichlet process was proposed to solve a clustering problem instead and it is interesting to note that such a direct generalization leads us to a well-established solution for a d...

476 | Maximum entropy markov models for information extraction and segmentation
- McCallum, Freitag, et al.
- 2000
Citation Context: ...modelling given the probabilistic nature of most approaches. Maximum-entropy models have found many uses relating features of inputs to distributions over outputs (Rosenfeld 1994; Berger et al. 1996; McCallum et al. 2000; Lafferty et al. 2001). Use of priors is widespread and a number of studies have been conducted comparing different types of priors (Brand 1999; Chen and Rosenfeld 2000; Goodman 2004). Even hierarchi...

357 | A constructive definition of Dirichlet priors
- Sethuraman
- 1994
Citation Context: ...and d control the amount of variability around the base distribution G0. An explicit construction of draws G1 ∼ PY(d,θ,G0) from a Pitman-Yor process is given by the stick-breaking construction (Sethuraman 1994; Ishwaran and James 2001). This construction shows that G1 is a weighted sum of an infinite sequence of point masses (with probability one). Let V1,V2,... and φ1,φ2,... be two sequences of independent...
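The stick-breaking construction this excerpt describes is straightforward to sketch. The snippet below draws a truncated approximation to G1 ∼ PY(d, θ, G0): each V_k ∼ Beta(1 − d, θ + kd), the weight π_k = V_k ∏_{j<k}(1 − V_j) is attached to an atom φ_k ∼ G0, and the infinite sequence is cut off at a fixed truncation level. The function name, seed handling, and truncation level are illustrative choices, not code from the paper.

```python
import random

def pitman_yor_stick_breaking(d, theta, base_sampler, truncation=500, seed=0):
    """Truncated stick-breaking draw from PY(d, theta, G0).

    base_sampler(rng) draws one atom from the base distribution G0.
    Returns (weights, atoms); the weights sum to slightly less than one
    because the infinite sequence is truncated.
    """
    rng = random.Random(seed)
    weights, atoms = [], []
    remaining = 1.0  # length of stick not yet broken off
    for k in range(1, truncation + 1):
        # V_k ~ Beta(1 - d, theta + k*d); d = 0 recovers the Dirichlet process.
        v = rng.betavariate(1.0 - d, theta + k * d)
        weights.append(remaining * v)    # pi_k = V_k * prod_{j<k} (1 - V_j)
        atoms.append(base_sampler(rng))  # phi_k ~ G0
        remaining *= 1.0 - v
    return weights, atoms
```

With d > 0 the weights decay as a power law rather than geometrically, which is the property that makes the Pitman-Yor process a better fit for word frequencies than the Dirichlet process.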

313 | Improved backing-off for m-gram language modeling - Kneser, Ney - 1995

249 | Gibbs sampling methods for stick breaking priors
- Ishwaran, James
- 2001
Citation Context: ...ed according to a well-studied nonparametric generalization of the Dirichlet distribution variously known as the two-parameter Poisson-Dirichlet process or the Pitman-Yor process (Pitman and Yor 1997; Ishwaran and James 2001; Pitman 2002) (in this paper we shall refer to this as the Pitman-Yor process for succinctness). As we shall show in this paper, this hierarchical structure corresponds exactly to the technique of in...

239 | The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability
- Pitman, Yor
- 1997
Citation Context: ...riables are distributed according to a well-studied nonparametric generalization of the Dirichlet distribution variously known as the two-parameter Poisson-Dirichlet process or the Pitman-Yor process (Pitman and Yor 1997; Ishwaran and James 2001; Pitman 2002) (in this paper we shall refer to this as the Pitman-Yor process for succinctness). As we shall show in this paper, this hierarchical structure corresponds exact...

191 | Adaptive statistical language modeling: a maximum entropy approach
- Rosenfeld
- 1994
Citation Context: ...al language processing and language modelling given the probabilistic nature of most approaches. Maximum-entropy models have found many uses relating features of inputs to distributions over outputs (Rosenfeld 1994; Berger et al. 1996; McCallum et al. 2000; Lafferty et al. 2001). Use of priors is widespread and a number of studies have been conducted comparing different types of priors (Brand 1999; Chen and Ros...

182 | A neural probabilistic language model - Bengio, Ducharme, et al. - 2003

176 | Two decades of statistical language modeling: where do we go from here
- Rosenfeld
- 2000
Citation Context: ...994; Berger et al. 1996; McCallum et al. 2000; Lafferty et al. 2001). Use of priors is widespread and a number of studies have been conducted comparing different types of priors (Brand 1999; Chen and Rosenfeld 2000; Goodman 2004). Even hierarchical Bayesian models have been applied to language modelling—MacKay and Peto (1994) have proposed one based on Dirichlet distributions. Our model is a natural generalizat...

140 | Combinatorial Stochastic Processes
- Pitman
- 2006
Citation Context: ...udied nonparametric generalization of the Dirichlet distribution variously known as the two-parameter Poisson-Dirichlet process or the Pitman-Yor process (Pitman and Yor 1997; Ishwaran and James 2001; Pitman 2002) (in this paper we shall refer to this as the Pitman-Yor process for succinctness). As we shall show in this paper, this hierarchical structure corresponds exactly to the technique of interpolating b...

134 | A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams - Church - 1991

97 | A Bit of Progress in Language Modeling - Goodman - 2000

96 | Factored Language Models and Generalized Parallel Backoff - Bilmes, Kirchhoff - 2003

88 | A survey of smoothing techniques for ME models
- Chen, Rosenfeld
- 2000
Citation Context: ...senfeld 1994; Berger et al. 1996; McCallum et al. 2000; Lafferty et al. 2001). Use of priors is widespread and a number of studies have been conducted comparing different types of priors (Brand 1999; Chen and Rosenfeld 2000; Goodman 2004). Even hierarchical Bayesian models have been applied to language modelling—MacKay and Peto (1994) have proposed one based on Dirichlet distributions. Our model is a natural generalizat...

86 | Interpolating between types and tokens by estimating power-law generators. Advances in Neural Information Processing Systems - Goldwater, Griffiths, et al. - 2006

70 | Structure learning in conditional probability models via an entropic prior and parameter extinction
- Brand
- 1999
Citation Context: ...outputs (Rosenfeld 1994; Berger et al. 1996; McCallum et al. 2000; Lafferty et al. 2001). Use of priors is widespread and a number of studies have been conducted comparing different types of priors (Brand 1999; Chen and Rosenfeld 2000; Goodman 2004). Even hierarchical Bayesian models have been applied to language modelling—MacKay and Peto (1994) have proposed one based on Dirichlet distributions. Our model...

70 | Offline recognition of unconstrained handwritten texts using HMMs and statistical language models - Vinciarelli, Bengio, et al.

60 | Exponential priors for maximum entropy models
- Goodman
- 2004
Citation Context: ...l. 1996; McCallum et al. 2000; Lafferty et al. 2001). Use of priors is widespread and a number of studies have been conducted comparing different types of priors (Brand 1999; Chen and Rosenfeld 2000; Goodman 2004). Even hierarchical Bayesian models have been applied to language modelling—MacKay and Peto (1994) have proposed one based on Dirichlet distributions. Our model is a natural generalization of this mo...

26 | A hierarchical Dirichlet language model. Natural Language Engineering - MacKay, Peto - 1995

23 | A unified approach to generalized Stirling numbers - Hsu, Shiue - 1998

22 | Random forest in language modeling - Xu, Jelinek - 2004

13 | Dirichlet processes, Chinese restaurant processes and all that - Jordan - 2005

11 | Distributed latent variable models of lexical co-occurrences - Blitzer, Globerson, et al.

8 | Immediate Head Parsing for Language Models - Charniak - 2001

2 | Conditional random fields: Probabilistic models for segmenting and labeling sequence data
- Lafferty, McCallum, et al.
- 2001
Citation Context: ...obabilistic nature of most approaches. Maximum-entropy models have found many uses relating features of inputs to distributions over outputs (Rosenfeld 1994; Berger et al. 1996; McCallum et al. 2000; Lafferty et al. 2001). Use of priors is widespread and a number of studies have been conducted comparing different types of priors (Brand 1999; Chen and Rosenfeld 2000; Goodman 2004). Even hierarchical Bayesian models ha...
