## Covariance in Unsupervised Learning of Probabilistic Grammars

Citations: 12 (5 self)

### BibTeX

```bibtex
@MISC{Cohen_covariancein,
  author = {Shay B. Cohen and Noah A. Smith and Alex Clark and Dorota Glowacka and Colin de la Higuera and Mark Johnson and John Shawe-Taylor},
  title  = {Covariance in Unsupervised Learning of Probabilistic Grammars},
  year   = {}
}
```


### Abstract

Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text. Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of well-understood, general-purpose learning algorithms. There has been an increased interest in using probabilistic grammars in the Bayesian setting. To date, most of the literature has focused on using a Dirichlet prior. The Dirichlet prior has several limitations, including that it cannot directly model covariance between the probabilistic grammar’s parameters. Yet, various grammar parameters are expected to be correlated because the elements in language they represent share linguistic properties. In this paper, we suggest an alternative to the Dirichlet prior, a family of logistic normal distributions. We derive an inference algorithm for this family of distributions and experiment with the task of dependency grammar induction, demonstrating performance improvements with our priors on a set of six treebanks in different natural languages. Our covariance framework permits soft parameter tying within grammars and across grammars for text in different languages, and we show empirical gains in a novel learning setting using bilingual, non-parallel data.
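The abstract's central modeling point — a logistic normal prior can encode covariance between rule probabilities, which a Dirichlet cannot — can be checked with a small simulation. The sketch below is illustrative only: it assumes a hypothetical three-event multinomial and an arbitrary covariance matrix, and is not the paper's inference algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-event multinomial (say, three rules for one nonterminal).
# The covariance matrix is made up for illustration: events 0 and 1 are
# positively correlated in the underlying Gaussian.
mu = np.zeros(3)
cov = np.array([[1.0, 0.8, 0.0],
                [0.8, 1.0, 0.0],
                [0.0, 0.0, 1.0]])

# Logistic normal draw: sample a Gaussian vector eta, then map it onto
# the probability simplex with a softmax.
eta = rng.multivariate_normal(mu, cov, size=10_000)
theta = np.exp(eta) / np.exp(eta).sum(axis=1, keepdims=True)

# Each row of theta is a valid multinomial ...
assert np.allclose(theta.sum(axis=1), 1.0)

# ... and the correlation encoded in cov survives the mapping: rules 0
# and 1 tend to receive high (or low) probability together, which no
# setting of Dirichlet hyperparameters can express directly.
corr = np.corrcoef(theta[:, 0], theta[:, 1])[0, 1]
print(corr > 0)  # True
```

A Dirichlet over the same simplex has only a concentration vector, so the off-diagonal structure above has no analogue there.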

### Citations

2419 | Latent Dirichlet allocation - Blei, Ng, et al. |

2121 | Building a large annotated corpus of English: The Penn Treebank
- Marcus, Santorini, et al.
- 1993
Citation Context: ...ts in a column. Training is done on sentences of length ≤ 10, though testing is done on longer sentences as well. 5.1 English Text We begin our experiments with the Wall Street Journal Penn treebank (Marcus et al., 1993). Following standard practice, sentences were stripped of words and punctuation, leaving part-of-speech tags for the unsupervised induction of dependency structure. We note that, in this setting, usi... |

1759 | MapReduce: Simplified data processing on large clusters
- Dean, Ghemawat
- 2008
Citation Context: ...rammar, providing a fast, parallelizable, and deterministic alternative to MCMC methods to approximate the posterior over derivations and grammar parameters. 8. We used a cluster running MapReduce (Dean and Ghemawat, 2004) to perform inference when training our models. We experimented with grammar induction on six different languages, demonstrating the... |

958 | Head-Driven Statistical Models for Natural Language Parsing - Collins - 1999 |

706 | Class-Based n-Gram Models of Natural Language
- Brown, Della Pietra, et al.
- 1992
Citation Context: ...entence (see Section 2.2). 2.1 Simple Example: Class-Based Unigram Model It is helpful to keep in mind a simple model with a relatively small number of parameters such as a class-based unigram model (Brown et al., 1990). Let the observed symbols in x range over words in some language’s vocabulary Γ. Let each word token xi have an associated word class from a finite set Λ, denoted yi; the yi are all hidden. The deri... |

693 | Accurate unlexicalized parsing
- Klein, Manning
- 2003
Citation Context: ...from annotated data, thus allowing easy comparison between unsupervised and supervised techniques. NLP applications of probabilistic grammars and their generalizations include parsing (Collins, 2003; Klein and Manning, 2003; Charniak and Johnson, 2005), machine translation (Wu, 1997; Ding and Palmer, 2005; Chiang, 2005), and question answering (Wang et al., 2007). Probabilistic grammars are probabilistic models, so they... |

511 | Learning regular sets from queries and counterexamples
- Angluin
- 1987
Citation Context: ...he probability of the ith rule for the kth nonterminal. The field of grammatical inference also includes algorithms and methods for learning the structure of a (formal) language generator or grammar (Angluin, 1988; de la Higuera, 2005; Clark and Thollard, 2004; Clark et al., 2008, inter alia). This paper is complementary, focusing on the estimation of the weights assigned to the grammar’s rules. The choice of ... |

444 | Graphical models, exponential families, and variational inference - Wainwright, Jordan - 2008 |

431 | Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora
- Wu
- 1997
Citation Context: ... supervised techniques. NLP applications of probabilistic grammars and their generalizations include parsing (Collins, 2003; Klein and Manning, 2003; Charniak and Johnson, 2005), machine translation (Wu, 1997; Ding and Palmer, 2005; Chiang, 2005), and question answering (Wang et al., 2007). Probabilistic grammars are probabilistic models, so they permit the use of well-understood methods for learning. Mea... |

398 | Dynamic topic models
- Blei, Lafferty
- 2006
Citation Context: ...er and interpretability. We begin by motivating the modeling of covariance among the probabilities of grammar derivation events, and propose the use of logistic normal distributions (Aitchison, 1986; Blei and Lafferty, 2006) over multinomials to build priors over grammars (Section 3). Our motivation relies on the observation that various grammar parameters are expected to be correlated because of the elements in languag... |

390 | Coarse-to-fine n-best parsing and MaxEnt discriminative reranking
- Charniak, Johnson
- 2005
Citation Context: ... allowing easy comparison between unsupervised and supervised techniques. NLP applications of probabilistic grammars and their generalizations include parsing (Collins, 2003; Klein and Manning, 2003; Charniak and Johnson, 2005), machine translation (Wu, 1997; Ding and Palmer, 2005; Chiang, 2005), and question answering (Wang et al., 2007). Probabilistic grammars are probabilistic models, so they permit the use of well-unde... |

390 | Maximum likelihood estimation from incomplete data
- Dempster, Laird, et al.
- 1977
Citation Context: ...derivations from the grammar. Baker (1979) and Lari and Young (1990) describe how dynamic programming (the “inside-outside” algorithm) can be used within an Expectation-Maximization algorithm (Dempster et al., 1977) to estimate the grammar’s probabilities from a corpus of text, in the context-free case. Probabilistic grammars are attractive for several reasons. Like symbolic grammars, they are amenable to inspe... |

375 | The estimation of stochastic context-free grammars using the inside-outside algorithm
- Lari, Young
- 1990
Citation Context: ... and Manning learned the DMV using maximum likelihood estimation, carried out by the Expectation-Maximization (EM) algorithm. Because EM for probabilistic grammars has been well documented elsewhere (Lari and Young, 1990; Pereira and Schabes, 1992; Carroll and Charniak, 1992), we only briefly mention that it proceeds by alternating between two steps that update the model parameters. Let θ(t) denote their values at t... |

366 | A hierarchical phrase-based model for statistical machine translation
- Chiang
- 2005
Citation Context: ...cations of probabilistic grammars and their generalizations include parsing (Collins, 2003; Klein and Manning, 2003; Charniak and Johnson, 2005), machine translation (Wu, 1997; Ding and Palmer, 2005; Chiang, 2005), and question answering (Wang et al., 2007). Probabilistic grammars are probabilistic models, so they permit the use of well-understood methods for learning. Meanwhile, in machine learning, signific... |

273 | Inside-outside reestimation from partially bracketed corpora
- Pereira, Schabes
- 1992
Citation Context: ...n it comes to the problem of grammar induction from natural language data, a fruitful research direction has built on the view of a grammar as a parameterized, generative process explaining the data (Pereira and Schabes, 1992; Carroll and Charniak, 1992; Chen, 1995; Klein and Manning, 2002, 2004, inter alia). If the grammar is a probability model, then learning a grammar means selecting a model from a prespecified model f... |

269 | Trainable grammars for speech recognition - Baker - 1979 |

252 | The CoNLL-X shared task on multilingual dependency parsing
- Buchholz, Marsi
- 2006
Citation Context: ...ank (Xue et al., 2004). We train on §1–270, use §301–1151 for development and test on §271–300. • For Portuguese, we used the Bosque treebank (Afonso et al., 2002) from the CoNLL shared task in 2006 (Buchholz and Marsi, 2006). • For Turkish, we used the METU-Sabancı treebank (Atalay et al., 2003; Oflazer et al., 2003) from the CoNLL shared task in 2006. • For Czech, we used the Prague treebank (Hajič et al., 2000) from t... |

205 | The Statistical Analysis of Compositional Data
- Aitchison
- 1986
Citation Context: ...ng expressive power and interpretability. We begin by motivating the modeling of covariance among the probabilities of grammar derivation events, and propose the use of logistic normal distributions (Aitchison, 1986; Blei and Lafferty, 2006) over multinomials to build priors over grammars (Section 3). Our motivation relies on the observation that various grammar parameters are expected to be correlated because o... |

191 | Éléments de syntaxe structurale
- Tesnière
- 1959
- 1959
Citation Context: ...tion.) In this case, the derivation vector y corresponds to a set of topics selected for each word in the bag of words representing the document. 2.2 Dependency Model with Valence Dependency grammar (Tesnière, 1959) refers to linguistic theories that posit graphical representations of sentences in which words are vertices and the syntax is a directed tree. Such grammars can be context-free or context-sensitive ... |

171 | Corpus-based induction of syntactic structure: Models of dependency and constituency
- Klein, Manning
- 2004
Citation Context: ...wering, and other natural language processing applications. Our experiments perform unsupervised induction of probabilistic dependency grammars using a model known as “dependency model with valence” (Klein and Manning, 2004). The model is a probabilistic split head automaton grammar... |

147 |
The CoNLL 2007 shared task on dependency parsing
- Nivre, Hall, et al.
- 2007
Citation Context: ...METU-Sabancı treebank (Atalay et al., 2003; Oflazer et al., 2003) from the CoNLL shared task in 2006. • For Czech, we used the Prague treebank (Hajič et al., 2000) from the CoNLL shared task in 2007 (Nivre et al., 2007). • For Japanese, we used the VERBMOBIL Treebank for Japanese (Kawata and Bartels, 2000) from the CoNLL shared task in 2006. Whenever using CoNLL shared task data, we used the first 80% of the data d... |

146 | Products of experts
- Hinton
- 1999
Citation Context: ...e relationships among θk,j and permit learning of the one to affect the learning of the other. Definition 1 also implies that we multiply several multinomials together in a product-of-experts style (Hinton, 1999), because the exponential of an average of normals becomes a product of (unnormalized) probabilities. We note that the partition structure is a hyperparameter. In our experiments, it encodes domain k... |
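The identity invoked in this snippet — the exponential of an average of normal vectors is a renormalized product of (unnormalized) expert distributions — is easy to verify numerically. The sketch below uses arbitrary made-up vectors and is not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(v):
    # Numerically stable softmax: shift by the max before exponentiating.
    e = np.exp(v - v.max())
    return e / e.sum()

# Two hypothetical Gaussian "expert" vectors for the same 4-way event.
eta1 = rng.normal(size=4)
eta2 = rng.normal(size=4)

# Softmax of the average of the normals ...
lhs = softmax((eta1 + eta2) / 2.0)

# ... equals the renormalized elementwise product of the two
# exponentiated experts, each raised to the power 1/2:
prod = (np.exp(eta1) * np.exp(eta2)) ** 0.5
rhs = prod / prod.sum()

print(np.allclose(lhs, rhs))  # True
```

The same algebra holds for any number of experts: averaging K Gaussian vectors before the softmax multiplies K unnormalized distributions, each taken to the 1/K power.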

117 | A fully Bayesian approach to unsupervised part-of-speech tagging
- Goldwater, Griffiths
- 2007
Citation Context: ...: Draw (xm, ym) from p(xm, ym | θ, G). Figure 2: Two variations on Bayesian modeling of probabilistic grammars. literature has focused on Bayesian models with a Dirichlet prior (Johnson et al., 2007; Goldwater and Griffiths, 2007; Toutanova and Johnson, 2007; Kurihara and Sato, 2006, inter alia), which is conjugate to the multinomial family. We argue that the last two requirements are actually more important than the first on... |

113 |
Dependency systems and phrase-structure systems
- Gaifman
- 1965
Citation Context: ...phical representations of sentences in which words are vertices and the syntax is a directed tree. Such grammars can be context-free or context-sensitive in power, and they can be made probabilistic (Gaifman, 1965). Dependency syntax is used in information extraction, machine translation, question answering, and other natural language processing applications. Our experiments perform unsupervised induction of p... |

111 | Two languages are more informative than one
- Dagan, Itai, et al.
- 1991
Citation Context: ...5.4 Bilingual Experiments Leveraging linguistic information from one language for the task of disambiguating another language has received considerable attention (Dagan, 1991; Yarowsky et al., 2001; Hwa et al., 2005; Smith and Smith, 2004; Snyder and Barzilay, 2008; Burkett and Klein, 2008). Usually such a setting requires a parallel corpus or other annotated data that ti... |

104 | The Penn Chinese TreeBank: Phrase structure annotation of a large corpus - Xue, Xia, et al. - 2005 |

98 | Two experiments on learning probabilistic dependency grammars from corpora
- Carroll, Charniak
- 1992
Citation Context: ...f grammar induction from natural language data, a fruitful research direction has built on the view of a grammar as a parameterized, generative process explaining the data (Pereira and Schabes, 1992; Carroll and Charniak, 1992; Chen, 1995; Klein and Manning, 2002, 2004, inter alia). If the grammar is a probability model, then learning a grammar means selecting a model from a prespecified model family. In much prior work, t... |

90 | A generative constituent-context model for improved grammar induction
- Klein, Manning
- 2002
Citation Context: ...e data, a fruitful research direction has built on the view of a grammar as a parameterized, generative process explaining the data (Pereira and Schabes, 1992; Carroll and Charniak, 1992; Chen, 1995; Klein and Manning, 2002, 2004, inter alia). If the grammar is a probability model, then learning a grammar means selecting a model from a prespecified model family. In... |

88 | Bootstrapping parsers via syntactic projection across parallel texts
- Hwa, Resnik, et al.
- 2005
Citation Context: ... 5.4 Bilingual Experiments Leveraging linguistic information from one language for the task of disambiguating another language has received considerable attention (Dagan, 1991; Yarowsky et al., 2001; Hwa et al., 2005; Smith and Smith, 2004; Snyder and Barzilay, 2008; Burkett and Klein, 2008). Usually such a setting requires a parallel corpus or other annotated data that ties between those two languages. One notab... |

78 | PAC-Bayesian model averaging - McAllester - 1999 |

65 | The infinite PCFG using hierarchical Dirichlet processes
- Liang, Petrov, et al.
- 2007
Citation Context: ...lied to probabilistic grammars in various ways: specifying priors over the parameters of a PCFG (Johnson et al., 2007; Headden et al., 2009) as well as over the states in a PCFG (Finkel et al., 2007; Liang et al., 2007), and even over grammatical derivation structures larger than context-free production rules (Johnson et al., 2006; Cohn et al., 2009). The challenge in Bayesian grammar learning is efficiently approx... |

64 | Floresta sintá(c)tica: A treebank for Portuguese
- Afonso, Bick, et al.
- 2002
Citation Context: ...ch and Japanese. • For Chinese, we used the Chinese treebank (Xue et al., 2004). We train on §1–270, use §301–1151 for development and test on §271–300. • For Portuguese, we used the Bosque treebank (Afonso et al., 2002) from the CoNLL shared task in 2006 (Buchholz and Marsi, 2006). • For Turkish, we used the METU-Sabancı treebank (Atalay et al., 2003; Oflazer et al., 2003) from the CoNLL shared task in 2006. • For ... |

62 | Machine translation using probabilistic synchronous dependency insertion grammars
- Ding, Palmer
- 2005
Citation Context: ...d techniques. NLP applications of probabilistic grammars and their generalizations include parsing (Collins, 2003; Klein and Manning, 2003; Charniak and Johnson, 2005), machine translation (Wu, 1997; Ding and Palmer, 2005; Chiang, 2005), and question answering (Wang et al., 2007). Probabilistic grammars are probabilistic models, so they permit the use of well-understood methods for learning. Meanwhile, in machine lear... |

57 | Learning bilingual lexicons from monolingual corpora - Haghighi, Liang, et al. - 2008 |

57 | Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models
- Johnson, Griffiths, et al.
- 2007
Citation Context: ... 2007; Headden et al., 2009) as well as over the states in a PCFG (Finkel et al., 2007; Liang et al., 2007), and even over grammatical derivation structures larger than context-free production rules (Johnson et al., 2006; Cohn et al., 2009). The challenge in Bayesian grammar learning is efficiently approximating probabilistic inference. Variational approximations (Johnson, 2007; Kurihara and Sato, 2006) and randomize... |

55 | Painless unsupervised learning with features
- Berg-Kirkpatrick, Bouchard-Côté, et al.
- 2010
Citation Context: ...a base model within various estimation techniques with the goal of improving its performance by relying on other properties of language and text such as: dependencies between parameters in the model (Berg-Kirkpatrick et al., 2010), sparsity (Gillenwater et al., 2010), preference for short attachments (Smith and Eisner, 2006), and additional annotation offered by hypertext markup as found on the Web (Spitkovsky et al., 2010c).... |

53 | Bayesian grammar induction for language modeling
- Chen
- 1995
Citation Context: ...ural language data, a fruitful research direction has built on the view of a grammar as a parameterized, generative process explaining the data (Pereira and Schabes, 1992; Carroll and Charniak, 1992; Chen, 1995; Klein and Manning, 2002, 2004, inter alia). If the grammar is a probability model, then learning a grammar means selecting a model from a prespecified model family. In much prior work, the family is... |

49 | Bilexical grammars and a cubic-time probabilistic parser
- Eisner
- 1997
Citation Context: ... marked in blue (lighter). l, r, t, and f denote left, right, true, and false, respectively (see Equation 4). (Alshawi and Buchsbaum, 1996) that renders inference cubic in the length of the sentence (Eisner, 1997). The language of the grammar is context-free, though our models are permissive and allow the derivation of any string in Γ∗. This is a major point of departure between theoretical work in grammati... |

49 | Bilingual parsing with factored estimation: Using English to parse Korean
- Smith, Smith
- 2004
Citation Context: ...eriments Leveraging linguistic information from one language for the task of disambiguating another language has received considerable attention (Dagan, 1991; Yarowsky et al., 2001; Hwa et al., 2005; Smith and Smith, 2004; Snyder and Barzilay, 2008; Burkett and Klein, 2008). Usually such a setting requires a parallel corpus or other annotated data that ties between those two languages. One notable exception is Haghigh... |

48 | Shared logistic normal distributions for soft parameter tying in unsupervised grammar induction - Cohen, Smith - 2009 |

46 | Syntactic topic models - Boyd-Graber, Blei - 2009 |

45 | The Prague dependency treebank: A three-level annotation scenario
- Böhmová, Hajičová, et al.
- 2000
Citation Context: ... (Buchholz and Marsi, 2006). • For Turkish, we used the METU-Sabancı treebank (Atalay et al., 2003; Oflazer et al., 2003) from the CoNLL shared task in 2006. • For Czech, we used the Prague treebank (Hajič et al., 2000) from the CoNLL shared task in 2007 (Nivre et al., 2007). • For Japanese, we used the VERBMOBIL Treebank for Japanese (Kawata and Bartels, 2000) from the CoNLL shared task in 2006. Whenever using CoN... |

43 | Head automata and bilingual tiling: Translation with minimal representations
- Alshawi, Buchsbaum
- 1996
Citation Context: ...tions. We break the probability of the tree down into recursive parts, one per head word, marked in blue (lighter). l, r, t, and f denote left, right, true, and false, respectively (see Equation 4). (Alshawi and Buchsbaum, 1996) that renders inference cubic in the length of the sentence (Eisner, 1997). The language of the grammar is context-free, though our models are permissive and allow the derivation of any string in Γ∗... |

43 | Two languages are better than one (for syntactic parsing
- Burkett, Klein
- 2008
Citation Context: ...e language for the task of disambiguating another language has received considerable attention (Dagan, 1991; Yarowsky et al., 2001; Hwa et al., 2005; Smith and Smith, 2004; Snyder and Barzilay, 2008; Burkett and Klein, 2008). Usually such a setting requires a parallel corpus or other annotated data that ties between those two languages. One notable exception is Haghighi et al. (2008), where bilingual lexicons were learn... |

42 | Bayesian inference for PCFGs via Markov chain Monte Carlo
- Johnson, Griffiths, et al.
- 2007
Citation Context: ...f the size of sample required for estimation of probabilistic grammars). Bayesian methods have been applied to probabilistic grammars in various ways: specifying priors over the parameters of a PCFG (Johnson et al., 2007; Headden et al., 2009) as well as over the states in a PCFG (Finkel et al., 2007; Liang et al., 2007), and even over grammatical derivation structures larger than context-free production rules (Johns... |

39 | A bibliographical study of grammatical inference
- de la Higuera, C.
- 2005
Citation Context: ... ith rule for the kth nonterminal. The field of grammatical inference also includes algorithms and methods for learning the structure of a (formal) language generator or grammar (Angluin, 1988; de la Higuera, 2005; Clark and Thollard, 2004; Clark et al., 2008, inter alia). This paper is complementary, focusing on the estimation of the weights assigned to the grammar’s rules. The choice of using a fixed model f... |

37 | Improving unsupervised dependency parsing with richer contexts and smoothing
- Headden, Johnson, et al.
- 2009
Citation Context: ...equired for estimation of probabilistic grammars). Bayesian methods have been applied to probabilistic grammars in various ways: specifying priors over the parameters of a PCFG (Johnson et al., 2007; Headden et al., 2009) as well as over the states in a PCFG (Finkel et al., 2007; Liang et al., 2007), and even over grammatical derivation structures larger than context-free production rules (Johnson et al., 2006; Cohn ... |

37 | Unsupervised multilingual learning for morphological segmentation - Snyder, Barzilay - 2008 |

36 | The annotation process in the Turkish treebank
- Atalay, Oflazer, et al.
- 2003
Citation Context: ... test on §271–300. • For Portuguese, we used the Bosque treebank (Afonso et al., 2002) from the CoNLL shared task in 2006 (Buchholz and Marsi, 2006). • For Turkish, we used the METU-Sabancı treebank (Atalay et al., 2003; Oflazer et al., 2003) from the CoNLL shared task in 2006. • For Czech, we used the Prague treebank (Hajič et al., 2000) from the CoNLL shared task in 2007 (Nivre et al., 2007). • For Japanese, we us... |

36 | A Bayesian LDA-based model for semi-supervised part-of-speech tagging
- Toutanova, Johnson
- 2007
- 2007
Citation Context: ... of analytical tractability. As a result, most of the Bayesian language learning literature has focused on Bayesian models with a Dirichlet prior (Johnson et al., 2007; Goldwater and Griffiths, 2007; Toutanova and Johnson, 2007; Kurihara and Sato, 2006, inter alia), which is conjugate to the multinomial family. We argue that the last two requirements are actually more important than the first one, which is motivated by mere... |