## The infinite PCFG using hierarchical Dirichlet processes (2007)

Venue: EMNLP ’07

Citations: 63 (6 self)

### BibTeX

@INPROCEEDINGS{Liang07theinfinite,
  author = {Percy Liang and Slav Petrov and Michael I. Jordan and Dan Klein},
  title = {The infinite PCFG using hierarchical Dirichlet processes},
  booktitle = {EMNLP ’07},
  year = {2007},
  pages = {688--697}
}

### Abstract

We present a nonparametric Bayesian model of tree structures based on the hierarchical Dirichlet process (HDP). Our HDP-PCFG model allows the complexity of the grammar to grow as more training data becomes available. In addition to presenting a fully Bayesian model for the PCFG, we also develop an efficient variational inference procedure. On synthetic data, we recover the correct grammar without having to specify its complexity in advance. We also show that our techniques can be applied to full-scale parsing applications by demonstrating their effectiveness in learning state-split grammars.

### Citations

953 | Head-Driven Statistical Models for Natural Language Parsing - Collins - 1999
Citation Context: ...Probabilistic context-free grammars (PCFGs) have been a core modeling technique for many aspects of linguistic structure, particularly syntactic phrase structure in treebank parsing (Charniak, 1996; Collins, 1999). An important question when learning PCFGs is how many grammar symbols to allocate to the learning algorithm based on the amount of available data. The question of “how many clusters (symbols)?” has...

822 | A Maximum-Entropy-Inspired Parser - Charniak - 2000
Citation Context: ...are more suitable for the modeling task. Lexical methods split each pre-terminal symbol into many subsymbols, one for each word, and then focus on smoothing sparse lexical statistics (Collins, 1999; Charniak, 2000). Unlexicalized methods refine the grammar in a more conservative fashion, splitting each non-terminal or pre-terminal symbol into a much smaller number of subsymbols (Klein and Manning, 2003; Matsuz...

708 | A Bayesian analysis of some nonparametric problems - Ferguson - 1973
Citation Context: ...rametric Bayesian mixture model based on the Dirichlet process. We focus on the stick-breaking representation (Sethuraman, 1994) of the Dirichlet process instead of the stochastic process definition (Ferguson, 1973) or the Chinese restaurant process (Pitman, 2002). The stick-breaking representation captures the DP prior most explicitly and allows us to extend the finite mixture model with minimal changes. Later,...

682 | Accurate unlexicalized parsing - Klein, Manning - 2003
Citation Context: ...(Collins, 1999; Charniak, 2000). Unlexicalized methods refine the grammar in a more conservative fashion, splitting each non-terminal or pre-terminal symbol into a much smaller number of subsymbols (Klein and Manning, 2003; Matsuzaki et al., 2005; Petrov et al., 2006). We apply our HDP-PCFG-GR model to automatically learn the number of subsymbols for each symbol. 2 Models based on Dirichlet processes At the heart of th...

489 | Factorial Hidden Markov Models - Ghahramani, Jordan - 1998
Citation Context: ...arametrics literature via Dirichlet process (DP) mixture models (Antoniak, 1974). DP mixture models have since been extended to hierarchical Dirichlet processes (HDPs) and HDP-HMMs (Teh et al., 2006; Beal et al., 2002) and applied to many different types of clustering/induction problems in NLP (Johnson et al., 2006; Goldwater et al., 2006). In this paper, we present the hierarchical Dirichlet process PCFG (HDP-PCF...

416 | Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems (The Annals of Statistics) - Antoniak - 1974
Citation Context: ...g algorithm based on the amount of available data. The question of “how many clusters (symbols)?” has been tackled in the Bayesian nonparametrics literature via Dirichlet process (DP) mixture models (Antoniak, 1974). DP mixture models have since been extended to hierarchical Dirichlet processes (HDPs) and HDP-HMMs (Teh et al., 2006; Beal et al., 2002) and applied to many different types of clustering/induction ...

397 | Bayesian density estimation and inference using mixtures - Escobar, West - 1995
Citation Context: ...or the HDP-PCFG model described in Section 2.4, which can also be adapted to the HDP-PCFG-GR model with a bit more bookkeeping. Most previous inference algorithms for DP-based models involve sampling (Escobar and West, 1995; Teh et al., 2006). However, we chose to use variational inference (Blei and Jordan, 2005), which provides a fast deterministic alternative to sampling, hence avoiding issues of diagnosing convergenc...

307 | A constructive definition of Dirichlet priors - Sethuraman - 1994
Citation Context: ...xture model We now consider the extension of the Bayesian finite mixture model to a nonparametric Bayesian mixture model based on the Dirichlet process. We focus on the stick-breaking representation (Sethuraman, 1994) of the Dirichlet process instead of the stochastic process definition (Ferguson, 1973) or the Chinese restaurant process (Pitman, 2002). The stick-breaking representation captures the DP prior most e...

283 | Learning accurate, compact, and interpretable tree annotation - Petrov, Barrett, et al. - 2006
Citation Context: ...ethods refine the grammar in a more conservative fashion, splitting each non-terminal or pre-terminal symbol into a much smaller number of subsymbols (Klein and Manning, 2003; Matsuzaki et al., 2005; Petrov et al., 2006). We apply our HDP-PCFG-GR model to automatically learn the number of subsymbols for each symbol. 2 Models based on Dirichlet processes At the heart of the HDP-PCFG is the Dirichlet process (DP) mixt...

228 | Tree-bank grammars - Charniak - 1996
Citation Context: ...1 Introduction Probabilistic context-free grammars (PCFGs) have been a core modeling technique for many aspects of linguistic structure, particularly syntactic phrase structure in treebank parsing (Charniak, 1996; Collins, 1999). An important question when learning PCFGs is how many grammar symbols to allocate to the learning algorithm based on the amount of available data. The question of “how many clusters ...

212 | Gibbs sampling methods for stick-breaking priors - Ishwaran, James - 2001
Citation Context: ...for z > K. While the posterior grammar does have an infinite number of symbols, the exponential decay of the DP prior ensures that most of the probability mass is contained in the first few symbols (Ishwaran and James, 2001). While our variational approximation q is truncated, the actual PCFG model is not. As K increases, our approximation improves. 2.8 Coordinate-wise ascent The optimization problem defined by Equati...

129 | Variational inference for Dirichlet process mixtures - Blei, Jordan - 2005
Citation Context: ...model with a bit more bookkeeping. Most previous inference algorithms for DP-based models involve sampling (Escobar and West, 1995; Teh et al., 2006). However, we chose to use variational inference (Blei and Jordan, 2005), which provides a fast deterministic alternative to sampling, hence avoiding issues of diagnosing convergence and aggregating samples. Furthermore, our variational inference algorithm establishes a ...

129 | Inducing probabilistic grammars by Bayesian model merging - Stolcke, Omohundro - 1994

68 | Probabilistic CFG with latent annotations - Matsuzaki, Miyao, et al. - 2005
Citation Context: ..., 2000). Unlexicalized methods refine the grammar in a more conservative fashion, splitting each non-terminal or pre-terminal symbol into a much smaller number of subsymbols (Klein and Manning, 2003; Matsuzaki et al., 2005; Petrov et al., 2006). We apply our HDP-PCFG-GR model to automatically learn the number of subsymbols for each symbol. 2 Models based on Dirichlet processes At the heart of the HDP-PCFG is the Dirich...

56 | Contextual dependencies in unsupervised word segmentation - Goldwater, Griffiths, et al. - 2006
Citation Context: ...tended to hierarchical Dirichlet processes (HDPs) and HDP-HMMs (Teh et al., 2006; Beal et al., 2002) and applied to many different types of clustering/induction problems in NLP (Johnson et al., 2006; Goldwater et al., 2006). In this paper, we present the hierarchical Dirichlet process PCFG (HDP-PCFG), a nonparametric Bayesian model of syntactic tree structures based on Dirichlet processes. Specifically, an HDP-PCFG is ...

54 | Adaptor grammars: a framework for specifying compositional nonparametric Bayesian models - Johnson, Griffiths, Goldwater - 2007
Citation Context: ...els have since been extended to hierarchical Dirichlet processes (HDPs) and HDP-HMMs (Teh et al., 2006; Beal et al., 2002) and applied to many different types of clustering/induction problems in NLP (Johnson et al., 2006; Goldwater et al., 2006). In this paper, we present the hierarchical Dirichlet process PCFG (HDP-PCFG), a nonparametric Bayesian model of syntactic tree structures based on Dirichlet processes. Speci...

27 | The infinite tree - Finkel, Grenager, et al. - 2007

24 | Variational Bayesian grammar induction for natural language - Kurihara, Sato - 2006

23 | An application of the variational Bayesian approach to probabilistic context-free grammars - Kurihara, Sato - 2004

17 | Learning and inference for hierarchically split PCFGs - Petrov, Klein - 2007

2 | Nonparametric PCFGs using Dirichlet processes - Liang, Petrov, et al. - 2007
Citation Context: ...PCFG: optimizing q(z) is the analogue of the E-step, and optimizing q(φ) is the analogue of the M-step; however, optimizing q(β) has no analogue in EM. We summarize each of these updates below (see (Liang et al., 2007) for complete derivations). Parse trees q(z): The distribution over parse trees q(z) can be summarized by the expected sufficient statistics (rule counts), which we denote as C(z → zl zr) for binary ...
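The expected sufficient statistics mentioned in this context are just posterior-weighted rule counts. A toy sketch of that bookkeeping (the posterior, rules, and probabilities below are invented for illustration; a real implementation would obtain them from inside-outside over a sentence):

```python
from collections import Counter

# Hypothetical posterior over two candidate parse trees for one sentence.
# Each tree is a list of binary rule applications (parent, left, right).
posterior = [
    (0.7, [("S", "NP", "VP"), ("NP", "DT", "NN")]),
    (0.3, [("S", "NP", "VP"), ("VP", "V", "NP")]),
]

# Expected sufficient statistics C(z -> zl zr): each rule occurrence is
# weighted by the posterior probability of the tree containing it.
expected_counts = Counter()
for prob, rules in posterior:
    for rule in rules:
        expected_counts[rule] += prob

print(expected_counts[("S", "NP", "VP")])  # appears in both trees: 1.0
print(expected_counts[("NP", "DT", "NN")])  # only the first tree: 0.7
```

These weighted counts play the role that hard rule counts play in ordinary EM for a PCFG, which is why the q(z) update is described as the E-step analogue.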