## A hierarchical Bayesian language model based on Pitman–Yor processes (2006)

Venue: Coling/ACL, 2006

Citations: 90 (8 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Teh06ahierarchical,
  author    = {Yee Whye Teh},
  title     = {A hierarchical Bayesian language model based on Pitman–Yor processes},
  booktitle = {Coling/ACL},
  year      = {2006}
}
```



### Abstract

We propose a new hierarchical Bayesian n-gram model of natural language. Our model makes use of a generalization of the commonly used Dirichlet distribution, called the Pitman-Yor process, which produces power-law distributions more closely resembling those in natural languages. We show that an approximation to the hierarchical Pitman-Yor language model recovers the exact formulation of interpolated Kneser-Ney, one of the best smoothing methods for n-gram language models. Experiments verify that our model gives cross-entropy results superior to interpolated Kneser-Ney and comparable to modified Kneser-Ney.
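To illustrate the power-law behavior the abstract refers to, here is a minimal sketch of sampling from a single (non-hierarchical) Pitman-Yor process via its Chinese restaurant representation. The function name and parameter defaults are ours for illustration, not the paper's; the paper's actual model stacks these processes hierarchically over n-gram contexts.

```python
import random

def pitman_yor_sample(n, d=0.5, theta=1.0, seed=0):
    """Seat n customers in a Pitman-Yor Chinese restaurant process.

    Customer i+1 joins existing table k with probability proportional to
    (c_k - d), and opens a new table with probability proportional to
    (theta + d * t), where c_k is table k's occupancy and t the current
    number of tables. Returns the list of table occupancies.
    """
    rng = random.Random(seed)
    tables = []  # occupancy count per table
    for i in range(n):
        total = theta + i  # normalizer: theta + customers seated so far
        r = rng.random() * total
        acc = 0.0
        new_table = True
        for k, c in enumerate(tables):
            acc += c - d  # discounted mass of existing table k
            if r < acc:
                tables[k] += 1
                new_table = False
                break
        if new_table:  # remaining mass theta + d * len(tables)
            tables.append(1)
    return tables

tables = pitman_yor_sample(10_000, d=0.5, theta=1.0)
# A larger discount d yields more tables (more distinct word types) for
# the same n, with occupancies following an approximate power law.
```

With d = 0 this reduces to the ordinary Dirichlet-process Chinese restaurant process, which grows the number of tables only logarithmically in n; the discount is what produces the heavier, more language-like tail.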

### Citations

1546 | Bayesian Data Analysis
- Gelman, Carlin, et al.
- 1997
Citation Context: ...sian language models had been dismal compared to other smoothing methods (Nadas, 1984; MacKay and Peto, 1994). In this paper, we propose a novel language model based on a hierarchical Bayesian model (Gelman et al., 1995) where each hidden variable is distributed according to a Pitman-Yor process, a nonparametric generalization of the Dirichlet distribution that is widely studied in the statistics and probability the...

949 | An empirical study of smoothing techniques for language modeling
- Chen, Goodman
- 1998
Citation Context: ...ng severely overfits to the training data, and smoothing methods are indispensable for proper training of n-gram models. A large number of smoothing methods have been proposed in the literature (see (Chen and Goodman, 1998; Goodman, 2001; Rosenfeld, 2000) for good overviews). Most methods take a rather ad hoc approach, where n-gram probabilities for various values of n are combined together, using either interpolation ...

599 | Hierarchical dirichlet processes
- Teh, Jordan, et al.
Citation Context: ...cess), Bayesian methods can be competitive with the best smoothing techniques. The hierarchical Pitman-Yor process is a natural generalization of the recently proposed hierarchical Dirichlet process (Teh et al., 2006). The hierarchical Dirichlet process was proposed to solve a different problem—that of clustering, and it is interesting to note that such a direct generalization leads us to a good language model. B...

597 | Probabilistic inference using markov chain monte carlo methods
- Neal
- 1993
Citation Context: ... where the counts are obtained from the seating arrangement Su in the Chinese restaurant process corresponding to Gu. We use Gibbs sampling to obtain the posterior samples {S, Θ} (Neal, 1993). Gibbs sampling keeps track of the current state of each variable of interest in the model, and iteratively resamples the state of each variable given the current states of all other variables. It c...
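The snippet above describes Gibbs sampling in general terms. As a hedged illustration of the scheme only (not the paper's sampler over seating arrangements S and hyperparameters Θ), here is a textbook Gibbs sampler for a standard bivariate normal with correlation rho, where each full conditional is available in closed form:

```python
import math
import random

def gibbs_bivariate_normal(n_samples, rho=0.8, seed=0):
    """Gibbs sampling for a standard bivariate normal with correlation rho.

    Each full conditional is itself normal:
        x | y ~ N(rho * y, 1 - rho^2),  and symmetrically for y | x,
    so one Gibbs sweep resamples each coordinate given the other's
    current value, exactly as described in the quoted passage.
    """
    rng = random.Random(seed)
    x, y = 0.0, 0.0                      # arbitrary initial state
    sd = math.sqrt(1.0 - rho * rho)      # conditional standard deviation
    samples = []
    for _ in range(n_samples):
        x = rng.gauss(rho * y, sd)       # resample x given current y
        y = rng.gauss(rho * x, sd)       # resample y given current x
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(20_000)
# The empirical correlation of the chain converges toward rho.
```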

309 | Improved backing-off for m-gram language modeling
- Kneser, Ney
- 1995
Citation Context: ... be interpreted as a particular approximate inference scheme in the hierarchical Pitman-Yor language model. Our interpretation is more useful than past interpretations involving marginal constraints (Kneser and Ney, 1995; Chen and Goodman, 1998) or maximum-entropy models (Goodman, 2004) as it can recover the exact formulation of interpolated Kneser-Ney, and actually produces superior results. (Goldwater et al., 2006)...
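Since the paper's central claim is that an approximation to the hierarchical Pitman-Yor model recovers interpolated Kneser-Ney, a compact bigram version of interpolated Kneser-Ney may be useful for reference. This is the standard textbook formulation, with illustrative names of our own choosing:

```python
from collections import Counter, defaultdict

def kn_bigram(tokens, discount=0.75):
    """Interpolated Kneser-Ney bigram probabilities (a minimal sketch).

    P(w | u) = max(c(u,w) - D, 0) / c(u)
             + D * |{w': c(u,w') > 0}| / c(u) * Pcont(w),
    where Pcont(w) is the fraction of distinct bigram *types* ending in w
    (the continuation probability that gives Kneser-Ney its character).
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    context = Counter(tokens[:-1])            # c(u): tokens used as a context
    followers = defaultdict(set)              # distinct words following u
    preceders = defaultdict(set)              # distinct contexts preceding w
    for (u, w) in bigrams:
        followers[u].add(w)
        preceders[w].add(u)
    n_types = len(bigrams)                    # total distinct bigram types

    def prob(w, u):
        pcont = len(preceders[w]) / n_types
        if context[u] == 0:
            return pcont                      # unseen context: back off fully
        lam = discount * len(followers[u]) / context[u]
        return max(bigrams[(u, w)] - discount, 0) / context[u] + lam * pcont
    return prob
```

On a toy corpus the probabilities for any seen context sum to one over the vocabulary, since the mass removed by the discount is exactly redistributed through the continuation distribution.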

240 | Gibbs sampling methods for stick-breaking priors
- ISHWARAN, JAMES
- 2001
Citation Context: ...uted according to a Pitman-Yor process, a nonparametric generalization of the Dirichlet distribution that is widely studied in the statistics and probability theory communities (Pitman and Yor, 1997; Ishwaran and James, 2001; Pitman, 2002). Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 985–992, Sydney, July 2006. ©2006 Association for Computat...

238 | The two parameter Poisson–Dirichlet distribution derived from a stable subordinator
- Pitman, Yor
- 1997
Citation Context: ...en variable is distributed according to a Pitman-Yor process, a nonparametric generalization of the Dirichlet distribution that is widely studied in the statistics and probability theory communities (Pitman and Yor, 1997; Ishwaran and James, 2001; Pitman, 2002).

173 | Two decades of statistical language modeling: Where do we go from here?
- Rosenfeld

96 | A bit of progress in language modeling
- Goodman
- 2001
Citation Context: ...the training data, and smoothing methods are indispensable for proper training of n-gram models. A large number of smoothing methods have been proposed in the literature (see (Chen and Goodman, 1998; Goodman, 2001; Rosenfeld, 2000) for good overviews). Most methods take a rather ad hoc approach, where n-gram probabilities for various values of n are combined together, using either interpolation or back-off sch...

84 | Interpolating Between Types and Tokens by Estimating Power-Law Generators
- Goldwater, Griffiths, et al.
- 2006
Citation Context: ...roduce power-law distributions that more closely resemble those seen in natural languages, and it has been argued that as a result they are more suited to applications in natural language processing (Goldwater et al., 2006). We show experimentally that our hierarchical Pitman-Yor language model does indeed produce results superior to interpolated Kneser-Ney and comparable to modified Kneser-Ney, two of the currently be...

35 | Estimation of probabilities in the language model of the IBM speech recognition system
- Nadas
- 1984
Citation Context: ...sources and to include them in larger models in a principled manner. Unfortunately the performance of previously proposed Bayesian language models had been dismal compared to other smoothing methods (Nadas, 1984; MacKay and Peto, 1994). In this paper, we propose a novel language model based on a hierarchical Bayesian model (Gelman et al., 1995) where each hidden variable is distributed according to a Pitman-...

26 | A hierarchical Dirichlet language model. Natural Language Engineering
- MacKay, Peto
- 1995
Citation Context: ...o include them in larger models in a principled manner. Unfortunately the performance of previously proposed Bayesian language models had been dismal compared to other smoothing methods (Nadas, 1984; MacKay and Peto, 1994). In this paper, we propose a novel language model based on a hierarchical Bayesian model (Gelman et al., 1995) where each hidden variable is distributed according to a Pitman-Yor process, a nonparam...

12 | Dirichlet processes, Chinese restaurant processes and all that. Tutorial presentation at the NIPS Conference
- Jordan
- 2005
Citation Context: ... nonparametric Bayesian models. Here we give a quick description of the Pitman-Yor process in the context of a unigram language model; good tutorials on such models are provided in (Ghahramani, 2005; Jordan, 2005). Let W be a fixed and finite vocabulary of V words. For each word w ∈ W let G(w) be the (to be estimated) probability of w, and let G = [G(w)]w∈W be the vector of word probabilities. We place a Pitm...

4 | Non-parametric Bayesian methods. Tutorial presentation
- Ghahramani
- 2005
Citation Context: ...es are examples of nonparametric Bayesian models. Here we give a quick description of the Pitman-Yor process in the context of a unigram language model; good tutorials on such models are provided in (Ghahramani, 2005; Jordan, 2005). Let W be a fixed and finite vocabulary of V words. For each word w ∈ W let G(w) be the (to be estimated) probability of w, and let G = [G(w)]w∈W be the vector of word probabilities. W...
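The unigram setup quoted above places a Pitman-Yor prior on the word-probability vector G. Under the Chinese restaurant representation, the posterior predictive probability of a word takes a discounted, Kneser-Ney-like form. The sketch below uses hypothetical names and assumes the seating statistics (token and table counts) are simply given:

```python
def py_unigram_predictive(word, counts, tables, d, theta, base):
    """Predictive probability of `word` in a Pitman-Yor unigram model.

    counts[w] -- customers (tokens) of word w in the restaurant
    tables[w] -- tables labelled with word w
    base(w)   -- base distribution G0(w), e.g. uniform 1/V
    with discount 0 <= d < 1 and concentration theta > -d.
    """
    c = sum(counts.values())   # total customers seated
    t = sum(tables.values())   # total tables in use
    cw = counts.get(word, 0)
    tw = tables.get(word, 0)
    # Discounted count for seen words, plus mass reserved for the base
    # distribution that grows with the number of tables.
    return (cw - d * tw) / (theta + c) + (theta + d * t) * base(word) / (theta + c)

# Toy seating over a 4-word vocabulary with a uniform base distribution.
counts = {"the": 10, "cat": 3, "sat": 1}
tables = {"the": 2, "cat": 1, "sat": 1}
uniform = lambda w: 1.0 / 4
p = {w: py_unigram_predictive(w, counts, tables, d=0.5, theta=1.0, base=uniform)
     for w in ["the", "cat", "sat", "mat"]}
```

Note the structural match with interpolated Kneser-Ney: the discount d subtracts mass from each seen word in proportion to its table count, and that mass is redistributed through the base distribution, so the probabilities still sum to one over the vocabulary.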