## Variational Inference for Adaptor Grammars

Citations: 14 (2 self)

### BibTeX

@MISC{Cohen_variationalinference,

author = {Shay B. Cohen and David M. Blei and Noah A. Smith},

title = {Variational Inference for Adaptor Grammars},

year = {}

}

### Abstract

Adaptor grammars extend probabilistic context-free grammars to define prior distributions over trees with “rich get richer” dynamics. Inference for adaptor grammars seeks to find parse trees for raw text. This paper describes a variational inference algorithm for adaptor grammars, providing an alternative to Markov chain Monte Carlo methods. To derive this method, we develop a stick-breaking representation of adaptor grammars, a representation that enables us to define adaptor grammars with recursion. We report experimental results on a word segmentation task, showing that variational inference performs comparably to MCMC. Further, we show a significant speed-up when parallelizing the algorithm. Finally, we report promising results for a new application for adaptor grammars, dependency grammar induction.
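The stick-breaking idea mentioned in the abstract can be illustrated with a minimal sketch of a truncated Pitman-Yor stick-breaking prior. This is not the paper's implementation: the function names, the truncation level `n`, and the mapping of `a` to the discount and `b` to the concentration parameter are assumptions made for illustration. It uses the standard construction in which the i-th stick proportion is drawn from Beta(1 - a, b + i·a) and the last proportion is fixed to 1 so that the weights sum to one.

```python
import random


def truncated_py_sticks(a, b, n, rng=None):
    """Sample n stick weights from a Pitman-Yor stick-breaking prior.

    a: discount in [0, 1); b: concentration with b > -a (assumed naming).
    The final proportion is set to 1, which truncates the infinite process.
    """
    if not (0.0 <= a < 1.0) or b <= -a or n < 1:
        raise ValueError("need 0 <= a < 1, b > -a and n >= 1")
    rng = rng or random.Random()
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for i in range(1, n + 1):
        # v_i ~ Beta(1 - a, b + i * a); the last v is forced to 1 (truncation)
        v = 1.0 if i == n else rng.betavariate(1.0 - a, b + i * a)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights


def draw_indices(weights, num_draws, rng=None):
    """Draw item indices in proportion to their stick weights."""
    rng = rng or random.Random()
    return rng.choices(range(len(weights)), weights=weights, k=num_draws)


if __name__ == "__main__":
    sticks = truncated_py_sticks(a=0.5, b=1.0, n=20, rng=random.Random(0))
    print(sum(sticks))
    print(draw_indices(sticks, 10, rng=random.Random(1)))
```

In an adaptor grammar each stick would be associated with a cached subtree, so early sticks with large weights are reused often, which is one way to picture the "rich get richer" behaviour; the sketch above only samples the weights and does not model trees.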

### Citations

4270 | Convex Optimization
- Boyd, Vandenberghe
- 2004
Citation Context ...ariational inference (Fig. 1). The M-step optimizes the hyperparameters (a, b and α) with respect to expected sufficient statistics under the variational distribution. We use Newton-Raphson for each (Boyd and Vandenberghe, 2004); Fig. 2 gives the objectives. 3.1 Note about Recursive Grammars With recursive grammars, the stick-breaking process representation gives probability mass to events which are ill-defined. In step 2(i... |

2146 | MapReduce: Simplified Data Processing on Large Clusters
- Dean, Ghemawat
- 2004
Citation Context ...stribution is close to the posterior of interest. Variational methods tend to converge faster than MCMC, and can be more easily parallelized over multiple processors in a framework such as MapReduce (Dean and Ghemawat, 2004). The variational bound on the likelihood of the data is: $\log p(x \mid a, \alpha) \geq H(q) + \sum_{A \in M} E_q[\log p(v_A \mid a_A)] + \sum_{A \in M} E_q[\log p(\theta_A \mid \alpha_A)] + \sum_{A \in M} E_q[\log p(z_A \mid v, \theta)] + E_q[\log p(z \mid v_A)]$. Expectations ar... |

1057 | Monte Carlo Statistical Methods
- Robert, Casella
- 1999
Citation Context ...nal problem of determining the posterior distribution of parse trees given a set of observed sentences. Current posterior inference algorithms for adaptor grammars are based on MCMC sampling methods (Robert and Casella, 2005). MCMC methods are theoretically guaranteed to converge to the true posterior, but come at great expense: they are notoriously slow to converge, especially with complex hidden structures such as synt... |

890 | An introduction to variational methods for graphical models. Machine learning 37
- Jordan, Ghahramani, et al.
- 1999
Citation Context ... in 3(a) to sample trees for an adapted non-terminal. 3 Variational Inference Variational inference is a deterministic alternative to MCMC, which casts posterior inference as an optimization problem (Jordan et al., 1999; Wainwright and Jordan, 2008). The optimized function is a bound on the marginal likelihood of the observations, which is expressed in terms of a so-called “variational distribution” over the hidden ... |

694 | Tree Adjoining Grammars
- Joshi, Schabes
- 1997
Citation Context ... Johnson et al. (2006) use an embedded Metropolis-Hastings sampler (Robert and Casella, 2005) inside a Gibbs sampler. The proposal distribution is a PCFG, resembling a tree substitution grammar (TSG; Joshi, 2003). The sampler of Johnson et al. is based on the representation of the PY process as a distribution over partitions of integers. This representation is not amenable to variational inference. 2.2 Stick... |

483 | Graphical models, exponential families, and variational inference
- Wainwright, Jordan
- 2008
Citation Context ...ees for an adapted non-terminal. 3 Variational Inference Variational inference is a deterministic alternative to MCMC, which casts posterior inference as an optimization problem (Jordan et al., 1999; Wainwright and Jordan, 2008). The optimized function is a bound on the marginal likelihood of the observations, which is expressed in terms of a so-called “variational distribution” over the hidden variables. When the bound is ... |

469 | Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics
- Antoniak
- 1974
Citation Context ...riffiths, 2007; Toutanova and Johnson, 2007; Johnson et al., 2007). Such methods have been made more flexible with nonparametric Bayesian (NP Bayes) methods, such as Dirichlet process mixture models (Antoniak, 1974; Pitman, 2002). One line of research uses NP Bayes methods on whole tree structures, in the form of adaptor grammars (Johnson et al., 2006; Johnson, 2008b; Johnson, 2008a; Johnson and Goldwater, 2009... |

359 | A constructive definition of Dirichlet priors
- Sethuraman
- 1994
Citation Context ...Dirichlet process mixtures by Blei and Jordan (2005) and applied to infinite grammars by Liang et al. (2007). With NP Bayes models, variational methods are based on the stick-breaking representation (Sethuraman, 1994). Devising a stick-breaking representation is a central challenge to using variational inference in this setting. The rest of this paper is organized as follows. In §2 we describe a stick-breaking re... |

239 | The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability
- Pitman, Yor
- 1997
Citation Context ...definition of adaptor grammars (Johnson et al., 2006): (i) in our construction, we assume (as prior work does in practice) that the adaptors in A = 〈G, M, a, b, α〉 follow the Pitman-Yor (PY) process (Pitman and Yor, 1997), though in general other stochastic processes might be used; and (ii) we place a symmetric Dirichlet over the parameters of the PCFG, θ, whereas Johnson et al. used a fixed PCFG for the definition (... |

181 | Corpus-based induction of syntactic structure: Models of dependency and constituency
- Klein, Manning
- 2004
Citation Context ...or grammars, which have so far been used in segmentation (Johnson and Goldwater, 2009) and named entity recognition (Elsner et al., 2009). The grammar we use is the dependency model with valence (DMV Klein and Manning, 2004) represented as a probabilistic context-free grammar, GDMV (Smith, 2006). We note that GDMV is recursive; this is not a problem (§3.1). We used part-of-speech sequences from the Wall Street Journal P... |

153 | Variational inference for Dirichlet process mixtures
- Blei, Jordan
- 2005
Citation Context ..., the hidden variables are highly coupled.) To account for the infinite collection of random variables, for which we cannot define a variational distribution, we use the truncated stick distribution (Blei and Jordan, 2005). Hence, we assume that, for all A ∈ M, there is some value NA such that q(vA,NA = 1) = 1. The assigned probability to parse trees in the stick will be 0 for i > NA, so we can ignore zA,i for i > NA.... |

145 | Distributional regularity and phonotactic constraints are useful for segmentation
- Brent, Cartwright
- 1996
Citation Context ...tal setting of Johnson and Goldwater (2009), who present state-of-the-art results for inference with adaptor grammars using Gibbs sampling on a segmentation problem. We use the standard Brent corpus (Brent and Cartwright, 1996), which includes 9,790 unsegmented phonemic representations of utterances of child-directed speech from the Bernstein-Ratner (1987) corpus. Johnson and Goldwater (2009) test three grammars for this s... |

133 | A fully Bayesian approach to unsupervised part-of-speech tagging
- Goldwater, Griffiths
- 2007
Citation Context ...sults for a new application for adaptor grammars, dependency grammar induction. 1 Introduction Recent research in unsupervised learning for NLP focuses on Bayesian methods for probabilistic grammars (Goldwater and Griffiths, 2007; Toutanova and Johnson, 2007; Johnson et al., 2007). Such methods have been made more flexible with nonparametric Bayesian (NP Bayes) methods, such as Dirichlet process mixture models (Antoniak, 1974... |

99 | Parsing algorithms and metrics
- Goodman
- 1996
Citation Context ...2006) and Wang and Blei (2009), but we did not achieve better performance and it had an adverse effect on runtime. For completeness, we give these results in §4. and minimum Bayes risk decoding (MBR; Goodman, 1996). To parse a string with Viterbi (or MBR) decoding, we find the tree with highest score for the grammaton which is attached to that string. For all rules which rewrite to strings in the resulting tre... |

78 | The infinite PCFG using hierarchical Dirichlet processes - Liang, Petrov, et al. - 2007 |

70 | Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models
- Johnson, Griffiths, et al.
- 2006
Citation Context ...sian (NP Bayes) methods, such as Dirichlet process mixture models (Antoniak, 1974; Pitman, 2002). One line of research uses NP Bayes methods on whole tree structures, in the form of adaptor grammars (Johnson et al., 2006; Johnson, 2008b; Johnson, 2008a; Johnson and Goldwater, 2009), in order to identify recurrent subtree patterns. Adaptor grammars provide a flexible distribution over parse trees that has more structu... |

51 | Shared logistic normal distributions for soft parameter tying in unsupervised grammar induction
- Cohen, Smith
- 2009
Citation Context ...or strings s (and trees for x) globally normalized weighted grammars. Choosing such distributions is motivated by their ability to make the variational bound tight (similar to Cohen et al., 2008, and Cohen and Smith, 2009). In practice we do not have to use rewrite rules for all strings t ⊆ s in the grammaton. It suffices to add rewrite rules only for the strings t = sA,i that have some grammaton attached to them, G(A... |

46 | Bayesian Inference for PCFGs via Markov Chain Monte Carlo
- Johnson, Griffiths, et al.
- 2007
Citation Context ... grammar induction. 1 Introduction Recent research in unsupervised learning for NLP focuses on Bayesian methods for probabilistic grammars (Goldwater and Griffiths, 2007; Toutanova and Johnson, 2007; Johnson et al., 2007). Such methods have been made more flexible with nonparametric Bayesian (NP Bayes) methods, such as Dirichlet process mixture models (Antoniak, 1974; Pitman, 2002). One line of research uses NP Bayes... |

42 | A Bayesian LDA-based model for semisupervised part-of-speech tagging
- Toutanova, Johnson
- 2008
Citation Context ... adaptor grammars, dependency grammar induction. 1 Introduction Recent research in unsupervised learning for NLP focuses on Bayesian methods for probabilistic grammars (Goldwater and Griffiths, 2007; Toutanova and Johnson, 2007; Johnson et al., 2007). Such methods have been made more flexible with nonparametric Bayesian (NP Bayes) methods, such as Dirichlet process mixture models (Antoniak, 1974; Pitman, 2002). One line of ... |

39 | The phonology of parent-child speech - Bernstein-Ratner - 1987 |

39 | Improving unsupervised dependency parsing with richer contexts and smoothing - Johnson, McClosky |

32 | Novel Estimation Methods for Unsupervised Discovery of Latent Structure in Natural Language Text
- Smith
- 2006
Citation Context ...9) and named entity recognition (Elsner et al., 2009). The grammar we use is the dependency model with valence (DMV Klein and Manning, 2004) represented as a probabilistic context-free grammar, GDMV (Smith, 2006). We note that GDMV is recursive; this is not a problem (§3.1). We used part-of-speech sequences from the Wall Street Journal Penn Treebank (Marcus et al., 1993), stripped of words and punctuation. W... |

29 | Accelerated variational Dirichlet process mixtures - Kurihara, Welling, et al. - 2007 |

23 | Logistic normal priors for unsupervised probabilistic grammar induction
- Cohen, Gimpel, et al.
- 2008
Citation Context ...butions over the trees for strings s (and trees for x) globally normalized weighted grammars. Choosing such distributions is motivated by their ability to make the variational bound tight (similar to Cohen et al., 2008, and Cohen and Smith, 2009). In practice we do not have to use rewrite rules for all strings t ⊆ s in the grammaton. It suffices to add rewrite rules only for the strings t = sA,i that have some gram... |

22 | Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling - Mochihashi, Yamada, et al. - 2009 |

11 | Structured generative models for unsupervised named-entity clustering. In: Proceedings of human language technologies: the 2009 annual conference of the North American chapter of the Association for Computational Linguistics. Association for Computational
- Elsner, Charniak, et al.
- 2009
Citation Context ...minary results for unsupervised syntax learning. This is a new application of adaptor grammars, which have so far been used in segmentation (Johnson and Goldwater, 2009) and named entity recognition (Elsner et al., 2009). The grammar we use is the dependency model with valence (DMV Klein and Manning, 2004) represented as a probabilistic context-free grammar, GDMV (Smith, 2006). We note that GDMV is recursive; this i... |

5 | Unsupervised word segmentation for Sesotho using Adaptor Grammars - 2008a |

5 | Using Adaptor Grammars to identify synergies in the unsupervised acquisition of linguistic structure - 2008b |

1 | Distributed algorithms for topic models |