## Annealing structural bias in multilingual weighted grammar induction (2006)

Venue: Proc. of ACL

Citations: 30 (9 self)

### BibTeX

@INPROCEEDINGS{Smith06annealingstructural,
  author    = {Noah A. Smith and Jason Eisner},
  title     = {Annealing structural bias in multilingual weighted grammar induction},
  booktitle = {Proc. of ACL},
  year      = {2006},
  pages     = {569--576}
}

### Abstract

We first show how a structural locality bias can improve the accuracy of state-of-the-art dependency grammar induction models trained by EM from unannotated examples (Klein and Manning, 2004). Next, by annealing the free parameter that controls this bias, we achieve further improvements. We then describe an alternative kind of structural bias, toward “broken” hypotheses consisting of partial structures over segmented sentences, and show a similar pattern of improvement. We relate this approach to contrastive estimation (Smith and Eisner, 2005a), apply the latter to grammar induction in six languages, and show that our new approach improves accuracy by 1–17% (absolute) over CE (and 8–30% over EM), achieving, to our knowledge, the best results on this task to date. Our method, structural annealing, is a general technique with broad applicability to hidden-structure discovery problems.
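The locality bias the abstract anneals can be pictured as multiplying each candidate tree's probability by e^(δ·total dependency length), with δ starting very negative (strongly favoring string-local attachments) and relaxing toward 0. The sketch below is illustrative only: the heads-array tree encoding and the schedule constants are my assumptions, not the paper's actual hyperparameters.

```python
def dependency_lengths(heads):
    """heads[i - 1] is the index of word i's parent (0 = the wall symbol $).
    Returns the string distance spanned by each attachment."""
    return [abs(i - h) for i, h in enumerate(heads, start=1)]

def biased_log_score(base_log_prob, heads, delta):
    """Locality-biased score: the model's log-probability for a tree plus
    delta times the total dependency length. delta << 0 strongly prefers
    string-local attachments; delta = 0 recovers the unbiased model."""
    return base_log_prob + delta * sum(dependency_lengths(heads))

def structural_annealing_schedule(delta=-0.6, step=0.1, delta_final=0.0):
    """Yield progressively weaker locality biases (structural annealing);
    training would re-run EM at each delta, warm-started from the last."""
    while delta <= delta_final + 1e-12:   # epsilon guards float drift
        yield delta
        delta += step
```

With a fixed base log-probability, lowering δ penalizes long-distance attachments more; at δ = 0 the score is just the model's own.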

### Citations

2105 | Building a large annotated corpus of English: the Penn Treebank - Marcus, Santorini - 1994

1173 | The mathematics of statistical machine translation: parameter estimation - Brown, Pietra, et al. - 1994

561 | Statistical language learning - Charniak - 1996
Citation context: “... not necessarily assign a linguistically defensible syntactic structure. Second, the likelihood surface is not globally concave, and learners such as the EM algorithm can get trapped on local maxima (Charniak, 1993). We seek here to capitalize on the intuition that, at least early in learning, the learner should search primarily for string-local structure, because most structure is local. By penalizing depend...”

503 | Three generative, lexicalised models for statistical parsing - Collins - 1997

490 | Unsupervised word sense disambiguation rivaling supervised methods - Yarowsky - 1995
Citation context: “...tion. Boldface marks scores better than EM-trained models selected the same way (Table 1). The score is the F1 measure on non-$ attachments. Annealing β resembles the popular bootstrapping technique (Yarowsky, 1995), which starts out aiming for high precision, and gradually improves coverage over time. With strong bias (β ≫ 0), we seek a model that maintains high dependency precision on (non-$) attachments by a...”

384 | Maximum likelihood estimation from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977
Citation context: “...Inducing a weighted context-free grammar from flat text is a hard problem. A common starting point for weighted grammar induction is the Expectation-Maximization (EM) algorithm (Dempster et al., 1977; Baker, 1979). EM’s mediocre performance (Table 1) reflects two problems. First, it seeks to maximize likelihood, but a grammar that makes the training data likely does not necessarily assign a lingu...”

268 | Trainable grammars for speech recognition - Baker - 1979
Citation context: “...nducing a weighted context-free grammar from flat text is a hard problem. A common starting point for weighted grammar induction is the Expectation-Maximization (EM) algorithm (Dempster et al., 1977; Baker, 1979). EM’s mediocre performance (Table 1) reflects two problems. First, it seeks to maximize likelihood, but a grammar that makes the training data likely does not necessarily assign a linguistically def...”

249 | CoNLL-X shared task on multilingual dependency parsing - Buchholz, Marsi - 2006

247 | Deterministic annealing for clustering, compression, classification, regression, and related optimization problems - Rose - 1998
Citation context: “...milarity to deterministic annealing (DA), a technique used in clustering and classification to smooth out objective functions that are piecewise constant (hence discontinuous) or bumpy (non-concave) (Rose, 1998; Ueda and Nakano, 1998). In unsupervised learning, DA iteratively re-estimates parameters like EM, but begins by requiring that the entropy of the posterior pΘ(y | x) be maximal, then gradually relax...”
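The DA recipe quoted here, exponentiate the E-step posterior with an inverse temperature and renormalize, can be sketched in a few lines. This is a toy illustration of the general technique, not the authors' code: β → 0 gives the maximum-entropy (uniform) posterior DA starts from, and β = 1 recovers the ordinary EM posterior.

```python
import math

def annealed_posterior(log_probs, beta):
    """Return p(y | x)^beta renormalized over hidden structures y.
    beta ~ 0 forces a near-uniform (maximum-entropy) posterior;
    beta = 1 is the standard EM E-step posterior."""
    scaled = [beta * lp for lp in log_probs]
    m = max(scaled)                      # log-sum-exp for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [math.exp(s - log_z) for s in scaled]
```

Annealing then means re-running the E-step with β slowly raised toward 1 (and, in some variants, beyond), so early iterations are maximally smoothed.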

243 | The TIGER treebank - Brants, Dipper, et al. - 2002

240 | Noun classification from predicate-argument structures - Hindle - 1990

225 | Online large-margin training of dependency parsers - McDonald, Crammer, et al. - 2005

201 | Fast exact inference with a factored model for natural language parsing - Klein, Manning - 2002

170 | Corpus-based induction of syntactic structure: models of dependency and constituency - Klein, Manning - 2004
Citation context: “...We first show how a structural locality bias can improve the accuracy of state-of-the-art dependency grammar induction models trained by EM from unannotated examples (Klein and Manning, 2004). Next, by annealing the free parameter that controls this bias, we achieve further improvements. We then describe an alternative kind of structural bias, toward “broken” hypotheses consisting of par...”

117 | Contrastive estimation: training log-linear models on unlabeled data - Smith, Eisner - 2005

103 | The Penn Chinese Treebank: phrase structure annotation of a large corpus. Natural Language Engineering - Xue, Xia, et al. - 2004

92 | Efficient parsing for bilexical context-free grammars and head automaton grammars - Eisner, Satta - 1999

89 | A generative constituent-context model for improved grammar induction - Klein, Manning - 2002

62 | “Floresta sintá(c)tica”: a treebank for Portuguese - Afonso, Bick, et al. - 2002

51 | Bootstrapping statistical parsers from small datasets - Steedman, Osborne, et al. - 2003

48 | Bilexical grammars and a cubic-time probabilistic parser - Eisner - 1997
Citation context: “...simple unlexicalized dependency model due to Klein and Manning (2004). The model is a probabilistic head automaton grammar (Alshawi, 1996) with a “split” form that renders it parseable in cubic time (Eisner, 1997). Let x = 〈x1, x2, ..., xn〉 be the sentence. x0 is a special “wall” symbol, $, on the left of every sentence. A tree y is defined by a pair of functions yleft and yright (both {0, 1, 2, ..., n} → 2 {...”

42 | Head automata and bilingual tiling: translation with minimal representations - Alshawi - 1996
Citation context: “...In this paper we use a simple unlexicalized dependency model due to Klein and Manning (2004). The model is a probabilistic head automaton grammar (Alshawi, 1996) with a “split” form that renders it parseable in cubic time (Eisner, 1997). Let x = 〈x1, x2, ..., xn〉 be the sentence. x0 is a special “wall” symbol, $, on the left of every sentence. A tree y is de...”
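The excerpts above encode a tree y as child-set functions yleft and yright over positions 0..n, with position 0 the wall symbol $. A small well-formedness check under that encoding, my own sketch following the excerpt's naming; the acyclicity and projectivity conditions the full model also imposes are omitted for brevity:

```python
def is_valid_attachment(n, yleft, yright):
    """yleft[i] / yright[i] are the sets of left / right children of
    position i (0 = the wall symbol $). Checks that $ takes no left
    children, that children lie on the stated side of their head, and
    that every word 1..n gets exactly one parent."""
    if yleft.get(0):                      # $ sits left of the whole sentence
        return False
    parent = {}
    for head in range(n + 1):
        for child in yleft.get(head, ()):
            if not (1 <= child < head) or child in parent:
                return False              # wrong side, or a second parent
            parent[child] = head
        for child in yright.get(head, ()):
            if not (head < child <= n) or child in parent:
                return False
            parent[child] = head
    return len(parent) == n               # no word left unattached
```

For a two-word sentence, the right-branching chain $ → word1 → word2 is expressed as yright = {0: {1}, 1: {2}} with empty yleft.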

41 | Automatically acquiring phrase structure using distributional analysis - Brill, Marcus - 1992

35 | The annotation process in the Turkish treebank - Atalay, Oflazer, et al. - 2003

32 | Parsing with soft and hard constraints on dependency length - Eisner, Smith - 2005

29 | Annealing techniques for unsupervised statistical language learning - Smith, Eisner - 2004

26 | Building a Turkish treebank - Oflazer, Say, et al. - 2003

25 | Guiding unsupervised grammar induction using contrastive estimation - Smith, Eisner - 2005

24 | Deterministic annealing EM algorithm. Neural Networks - Ueda, Nakano - 1998
Citation context: “...deterministic annealing (DA), a technique used in clustering and classification to smooth out objective functions that are piecewise constant (hence discontinuous) or bumpy (non-concave) (Rose, 1998; Ueda and Nakano, 1998). In unsupervised learning, DA iteratively re-estimates parameters like EM, but begins by requiring that the entropy of the posterior pΘ(y | x) be maximal, then gradually relaxes this entropy constra...”

23 | Learning hidden variable networks: the information bottleneck approach - Elidan, Friedman - 2005

21 | Design and implementation of the Bulgarian HPSG-based treebank - Simov, Osenova, et al. - 2004

12 | HPSG-based syntactic treebank of Bulgarian (BulTreeBank) - Simov, Popova, et al. - 2002

8 | Bootstrapping without the boot - Eisner, Karakos - 2005
Citation context: “...select values simultaneously for many hyperparameters, perhaps using a small annotated corpus (as done here), extrinsic figures of merit on successful learning trajectories, or plausibility criteria (Eisner and Karakos, 2005). Grammar induction serves as a tidy example for structural annealing. In future work, we envision that other kinds of structural bias and annealing will be useful in other difficult learning problem...”

5 | Practical annotation scheme for an HPSG treebank of Bulgarian - Simov, Osenova - 2003