## Shared logistic normal distributions for soft parameter tying in unsupervised grammar induction (2009)

Venue: Proceedings of NAACL-HLT 2009

Citations: 50 (8 self)

### BibTeX

@INPROCEEDINGS{Cohen09sharedlogistic,
  author    = {Shay B. Cohen and Noah A. Smith},
  title     = {Shared logistic normal distributions for soft parameter tying in unsupervised grammar induction},
  booktitle = {Proceedings of NAACL-HLT 2009},
  year      = {2009}
}

### Abstract

We present a family of priors over probabilistic grammar weights, called the shared logistic normal distribution. This family extends the partitioned logistic normal distribution, enabling factored covariance between the probabilities of different derivation events in the probabilistic grammar, providing a new way to encode prior knowledge about an unknown grammar. We describe a variational EM algorithm for learning a probabilistic grammar based on this family of priors. We then experiment with unsupervised dependency grammar induction and show significant improvements using our model for both monolingual learning and bilingual learning with a non-parallel, multilingual corpus.

### Citations

2577 | Latent Dirichlet allocation
- Blei, Ng, et al.
- 2003
Citation Context: ...Blei and Lafferty (2006) defined correlated topic models by replacing the Dirichlet in latent Dirichlet allocation models (Blei et al., 2003) with a LN distribution. Cohen et al. (2008) compared Dirichlet and LN distributions for learning DMV using empirical Bayes, finding substantial improvements for English using the latter. In that wor...

856 | An introduction to variational methods for graphical models
- Jordan, Ghahramani, et al.
- 1999
Citation Context: ...Our inference algorithm aims to find the posterior over the grammar probabilities θ and the hidden structures (grammar trees y). To do that, we use variational approximation techniques (Jordan et al., 1999), which treat the problem of finding the posterior as an optimization problem aimed to find the best approximation q(θ, y) of the posterior p(θ, y | x, µ, Σ, S). The posterior q needs to be constrain...

456 | Stochastic inversion transduction grammars and bilingual parsing of parallel corpora
- Wu
- 1997
Citation Context: ...natural language processing. They are most commonly used for parsing and linguistic analysis (Charniak and Johnson, 2005; Collins, 2003), but are now commonly seen in applications like machine translation (Wu, 1997) and question answering (Wang et al., 2007). An attractive property of probabilistic grammars is that they permit the use of well-understood parameter estimation methods for learning—both from labele...

423 | Dynamic topic models - Blei, Lafferty - 2006

403 | Coarse-to-fine nbest parsing and maxent discriminative reranking
- Charniak, Johnson
- 2005
Citation Context: ...Probabilistic grammars have become an important tool in natural language processing. They are most commonly used for parsing and linguistic analysis (Charniak and Johnson, 2005; Collins, 2003), but are now commonly seen in applications like machine translation (Wu, 1997) and question answering (Wang et al., 2007). An attractive property of probabilistic grammars is that the...

218 | The Statistical Analysis of Compositional Data
- Aitchison
- 1986
Citation Context: ...where Γ(α, 1) is a Gamma distribution with shape α and scale 1. Correlation among θi and θj, i ≠ j, cannot be modeled directly, only through the normalization in step 2. In contrast, LN distributions (Aitchison, 1986) provide a natural way to model such correlation. The LN draws a multinomial θ as follows: 1. Generate η ∼ Normal(µ, Σ). 2. θj ← exp(ηj) / ∑i exp(ηi)...
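The two-step LN draw quoted in this excerpt (generate η from a multivariate Gaussian, then exponentiate and normalize) is easy to sketch directly. The helper name and the example covariance below are illustrative choices of ours, not taken from the paper:

```python
import numpy as np

def draw_logistic_normal(mu, sigma, rng):
    """Step 1: eta ~ Normal(mu, Sigma); step 2: softmax-normalize."""
    eta = rng.multivariate_normal(mu, sigma)
    exp_eta = np.exp(eta - eta.max())  # subtract max for numerical stability
    return exp_eta / exp_eta.sum()

# A positive off-diagonal covariance entry makes theta[0] and theta[1]
# move together -- correlation a Dirichlet prior cannot encode directly.
mu = np.zeros(3)
sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
theta = draw_logistic_normal(mu, sigma, np.random.default_rng(0))
# theta is a valid multinomial: strictly positive and sums to 1.
```

Unlike a Dirichlet draw, repeated samples from this prior exhibit the encoded correlation between the first two components, which is what the paper exploits to tie related grammar events.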

177 | Corpus-based induction of syntactic structure: models of dependency and constituency
- Klein, Manning
- 2004
Citation Context: ...The attachment accuracy for this set of experiments is described in Table 1. The baselines include right attachment (where each word is attached to the word to its right), MLE via EM (Klein and Manning, 2004), and empirical Bayes with Dirichlet and LN priors (Cohen et al., 2008). We also include a “ceiling” (DMV trained using supervised MLE from the training sentences’ trees). For English, we see that ty...

150 | Products of experts
- Hinton
- 1999
Citation Context: ...relationships among θk,j and permit the model—at least in principle—to learn patterns from the data. Def. 1 also implies that we multiply several multinomials together in a product-of-experts style (Hinton, 1999), because the exponential of a mixture of normals becomes a product of (unnormalized) probabilities. Our extension to the model in Cohen et al. (2008) follows naturally after we have defined the shar...
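The claim quoted here, that the exponential of a mixture of normals becomes a product of unnormalized probabilities, can be checked numerically: softmax-normalizing an equal-weight average of two normal vectors gives the same multinomial as renormalizing the product of the two experts' exponentiated scores, each raised to its mixing weight. A small sketch with invented variable names:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(1)
eta1 = rng.normal(size=4)
eta2 = rng.normal(size=4)

# Softmax of an equal-weight mixture of two normal draws ...
mixed = softmax(0.5 * eta1 + 0.5 * eta2)

# ... equals the renormalized product of the two "experts",
# each unnormalized probability raised to its mixing weight,
# since exp((a + b) / 2) == exp(a)**0.5 * exp(b)**0.5.
product = np.exp(eta1) ** 0.5 * np.exp(eta2) ** 0.5
product /= product.sum()

assert np.allclose(mixed, product)
```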

112 | Two languages are more informative than one
- Dagan, Itai, et al.
- 1991
Citation Context: ...significantly worse (binomial sign test, p < 0.05). 4.3 Bilingual Experiments. Leveraging information from one language for the task of disambiguating another language has received considerable attention (Dagan, 1991; Smith and Smith, 2004; Snyder and Barzilay, 2008; Burkett and Klein, 2008). Usually such a setting requires a parallel corpus or other annotated data that ties between those two languages. Our bil...

97 | Parsing algorithms and metrics
- Goodman
- 1996
Citation Context: ...grammar: the most probable “Viterbi” parse (argmax_y p(y | x, θ)) and the minimum Bayes risk (MBR) parse (argmin_y E_{p(y′ | x, θ)}[ℓ(y; x, y′)]) with dependency attachment error as the loss function (Goodman, 1996). Performance with MBR parsing is consistently higher than its Viterbi counterpart, so we report only performance with MBR parsing. 4.1 Nouns, Verbs, and Adjectives. In this paper, we use a few simple...
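On a toy posterior the two decoding rules quoted in this excerpt can disagree: Viterbi picks the single most probable parse, while MBR picks the parse minimizing expected attachment error under the posterior. A minimal sketch; the candidate parses and probabilities are invented for illustration and are not from the paper:

```python
def mbr_parse(candidates, loss):
    """Pick argmin_y E_{p(y')}[loss(y, y')] over a list of (parse, prob) pairs."""
    def risk(y):
        return sum(p * loss(y, y2) for y2, p in candidates)
    return min((y for y, _ in candidates), key=risk)

def attachment_error(y, y_ref):
    """Dependency attachment error: count heads differing from the reference."""
    return sum(a != b for a, b in zip(y, y_ref))

# Toy posterior over three parses of a 3-word sentence; each parse is a
# tuple of head indices (0 denotes the wall symbol $).
posterior = [((0, 0, 0), 0.4), ((0, 0, 1), 0.3), ((0, 1, 1), 0.3)]

viterbi = max(posterior, key=lambda c: c[1])[0]  # most probable: (0, 0, 0)
mbr = mbr_parse(posterior, attachment_error)     # minimum risk: (0, 0, 1)
# MBR prefers (0, 0, 1) because it is close to both other parses: its
# expected attachment error is 0.7, versus 0.9 for the Viterbi parse.
```

This illustrates why the paper reports MBR rather than Viterbi numbers: the minimum-risk parse aggregates probability mass over similar trees instead of betting on a single derivation.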

88 | A Hierarchical Bayesian Language Model Based On Pitman-Yor Processes
- Teh
- 2006
Citation Context: ...In future work we plan to lexicalize the model, including a Bayesian grammar prior that accounts for the syntactic patterns of words. Nonparametric models (Teh, 2006) may be appropriate. We also believe that Bayesian discovery of cross-linguistic patterns is an exciting topic worthy of further exploration. 6 Conclusion. We described a Bayesian model that allows so...

72 | The Infinite PCFG using Hierarchical Dirichlet Processes
- Liang, Petrov, et al.
- 2007
Citation Context: ...putting priors over grammar probabilities (Johnson et al., 2007) to putting non-parametric priors over derivations (Johnson et al., 2006) to learning the set of states in a grammar (Finkel et al., 2007; Liang et al., 2007). Bayesian methods offer an elegant framework for combining prior knowledge with data. The main challenge in Bayesian grammar learning is efficiently approximating probabilistic inference, which is g...

60 | Learning bilingual lexicons from monolingual corpora - Haghighi, Liang, et al. - 2008

58 | Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models
- Johnson, Griffiths, et al.
- 2007
Citation Context: ...employing Bayesian modeling for probabilistic grammars in different settings, ranging from putting priors over grammar probabilities (Johnson et al., 2007) to putting non-parametric priors over derivations (Johnson et al., 2006) to learning the set of states in a grammar (Finkel et al., 2007; Liang et al., 2007). Bayesian methods offer an elegant framework for combining prior knowledge with data. The main challenge in Bayes...

51 | Bilingual parsing with factored estimation: Using english to parse korean
- Smith, Smith
- 2004
Citation Context: ...significantly worse (binomial sign test, p < 0.05). 4.3 Bilingual Experiments. Leveraging information from one language for the task of disambiguating another language has received considerable attention (Dagan, 1991; Smith and Smith, 2004; Snyder and Barzilay, 2008; Burkett and Klein, 2008). Usually such a setting requires a parallel corpus or other annotated data that ties between those two languages. Our bilingual experiments use ...

44 | Two languages are better than one (for syntactic parsing)
- Burkett, Klein
- 2008
Citation Context: ...Leveraging information from one language for the task of disambiguating another language has received considerable attention (Dagan, 1991; Smith and Smith, 2004; Snyder and Barzilay, 2008; Burkett and Klein, 2008). Usually such a setting requires a parallel corpus or other annotated data that ties between those two languages. Our bilingual experiments use the English and Chinese treebanks, which are not par...

44 | Bayesian inference for PCFGs via Markov chain Monte Carlo
- Johnson, Griffiths, et al.
- 2007
Citation Context: ...family of priors. There has been an increased interest recently in employing Bayesian modeling for probabilistic grammars in different settings, ranging from putting priors over grammar probabilities (Johnson et al., 2007) to putting non-parametric priors over derivations (Johnson et al., 2006) to learning the set of states in a grammar (Finkel et al., 2007; Liang et al., 2007). Bayesian methods offer an elegant frame...

42 | Unsupervised multilingual learning for morphological segmentation
- Snyder, Barzilay
- 2008
Citation Context: ...(binomial sign test, p < 0.05). 4.3 Bilingual Experiments. Leveraging information from one language for the task of disambiguating another language has received considerable attention (Dagan, 1991; Smith and Smith, 2004; Snyder and Barzilay, 2008; Burkett and Klein, 2008). Usually such a setting requires a parallel corpus or other annotated data that ties between those two languages. Our bilingual experiments use the English and Chinese tre...

38 | Improving unsupervised dependency parsing with richer contexts and smoothing - Headden, Johnson, et al. - 2009

32 | What is the Jeopardy model? A quasi-synchronous grammar for QA
- Wang, Smith, et al.
- 2007
Citation Context: ...most commonly used for parsing and linguistic analysis (Charniak and Johnson, 2005; Collins, 2003), but are now commonly seen in applications like machine translation (Wu, 1997) and question answering (Wang et al., 2007). An attractive property of probabilistic grammars is that they permit the use of well-understood parameter estimation methods for learning—both from labeled and unlabeled data. Here we tackle the un...

30 | Novel estimation methods for unsupervised discovery of latent structures in natural language text
- Smith
- 2006
Citation Context: ...in this paper, because it does not fit directly in a Bayesian setting (it is highly deficient) and because state-of-the-art unsupervised dependency parsing results have been achieved with DMV alone (Smith, 2006). Using the notation above, DMV defines x = ⟨x1, x2, ..., xn⟩ to be a sentence. x0 is a special “wall” symbol, $, on the left of every sentence. A tree y is defined by a pair of functions yleft and y...

29 | The Infinite Tree
- Finkel, Grenager, et al.
- 2007
Citation Context: ...settings, ranging from putting priors over grammar probabilities (Johnson et al., 2007) to putting non-parametric priors over derivations (Johnson et al., 2006) to learning the set of states in a grammar (Finkel et al., 2007; Liang et al., 2007). Bayesian methods offer an elegant framework for combining prior knowledge with data. The main challenge in Bayesian grammar learning is efficiently approximating probabilistic i...

25 | Variational Bayesian grammar induction for natural language
- Kurihara, Sato
- 2006
Citation Context: ...prior knowledge with data. The main challenge in Bayesian grammar learning is efficiently approximating probabilistic inference, which is generally intractable. Most commonly, variational (Johnson, 2007; Kurihara and Sato, 2006) or sampling techniques are applied (Johnson et al., 2006). Because probabilistic grammars are built out of multinomial distributions, the Dirichlet family (or, more precisely, a collection of Dirich...

22 | Logistic normal priors for unsupervised probabilistic grammar induction
- Cohen, Gimpel, et al.
- 2008
Citation Context: ...rely mainly on the centrality of content words: nouns, verbs, and adjectives. For example, in the English treebank, the most common attachment errors (with the LN prior from Cohen et al., 2008) happen with a noun (25.9%) or a verb (16.9%) parent. In the Chinese treebank, the most common attachment errors happen with noun (36.0%) and verb (21.2%) parents as well. The errors being governed b...

7 | Transformational priors over grammars - Eisner - 2002

1 | Inference for probabilistic grammars with shared logistic normal distributions - Cohen, Smith - 2009