## Bootstrapping Feature-Rich Dependency Parsers with Entropic Priors

Citations: 8 (1 self)

### BibTeX

```bibtex
@MISC{Smith_bootstrappingfeature-rich,
  author = {David A. Smith and Jason Eisner},
  title = {Bootstrapping Feature-Rich Dependency Parsers with Entropic Priors},
  year = {}
}
```

### Abstract

One may need to build a statistical parser for a new language, using only a very small labeled treebank together with raw text. We argue that bootstrapping a parser is most promising when the model uses a rich set of redundant features, as in recent models for scoring dependency parses (McDonald et al., 2005). Drawing on Abney’s (2004) analysis of the Yarowsky algorithm, we perform bootstrapping by entropy regularization: we maximize a linear combination of conditional likelihood on labeled data and confidence (negative Rényi entropy) on unlabeled data. In initial experiments, this surpassed EM for training a simple feature-poor generative model, and also improved the performance of a feature-rich, conditionally estimated model where EM could not easily have been applied. For our models and training sets, more peaked measures of confidence, measured by Rényi entropy, outperformed smoother ones. We discuss how our feature set could be extended with cross-lingual or cross-domain features, to incorporate knowledge from parallel or comparable corpora during bootstrapping.
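The training criterion the abstract describes can be written schematically as follows. This is a sketch in generic notation, not copied from the paper: the trade-off weight γ, the sets L (labeled) and U (unlabeled), and the exact placement of the Rényi term are illustrative.

```latex
% Maximize conditional likelihood on labeled data while rewarding
% confidence (low Rényi entropy) on unlabeled data:
\max_{\theta}\;
  \sum_{(x,y)\in L} \log p_\theta(y \mid x)
  \;-\; \gamma \sum_{x\in U} R_\alpha\!\bigl(p_\theta(\cdot \mid x)\bigr),
\qquad
R_\alpha(p) \;=\; \frac{1}{1-\alpha}\,\log \sum_{y} p(y)^{\alpha}
```

Larger α makes R_α more sensitive to the peak of the distribution, which matches the abstract's observation that "more peaked measures of confidence" worked best.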

### Citations

1257 | Combining Labeled and Unlabeled Data with Co-Training. Conference on Computational Learning Theory
- Blum, Mitchell
- 1998
Citation Context: ...unctions. ...the projective case, crossings. Our observation is that this situation is ideal for so-called “bootstrapping,” “co-training,” or “minimally supervised” learning methods (Yarowsky, 1995; Blum and Mitchell, 1998; Yarowsky and Wicentowski, 2000). Such methods should thrive when the right answer is overdetermined owing to redundant features and/or global constraints. Concretely, suppose we start by training a ...

391 | Maximum Likelihood Estimation from Incomplete Data via the EM Algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...proved our results by refining the values of known, non-lexicalized features. 3.4 Comparison with EM Perhaps the most popular statistical method for learning from incomplete data is the EM algorithm (Dempster et al., 1977). Since we cannot try EM on McDonald’s conditional model, we ran some pilot experiments using the generative dependency model with valence (DMV) of Klein and Manning (2004). As in their experiments, ...

255 | Three new probabilistic models for dependency parsing: An exploration
- Eisner
- 1996
Citation Context: ...rser begins by scoring each of the O(n²) possible edges, and then seeks the highest-scoring legal dependency tree formed by any n − 1 of these edges, using an O(n³) dynamic programming algorithm (Eisner, 1996) for projective trees. For non-projective parsing, O(n³), or with some trickery O(n²), greedy algorithms exist (Chu and Liu, 1965; Edmonds, 1967; Gabow et al., 1986). The feature function f may p...
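The context above describes edge-factored parsing: score every possible head→modifier edge, then find the highest-scoring tree. A minimal sketch of that search, assuming hypothetical edge scores; the exponential brute-force enumeration stands in for the efficient O(n³) (Eisner) or Chu-Liu/Edmonds algorithms the context names:

```python
from itertools import product

def best_dependency_tree(score):
    """Brute-force the highest-scoring dependency tree for a toy
    sentence. score[h][m] is the score of edge h -> m, with node 0
    as the artificial ROOT. Illustrative stand-in for the efficient
    dynamic-programming / greedy algorithms cited above."""
    n = len(score)              # nodes 0..n-1, node 0 = ROOT
    words = range(1, n)
    best, best_heads = float("-inf"), None
    # Enumerate every assignment of one head per word ...
    for heads in product(range(n), repeat=n - 1):
        # ... keeping only assignments forming a tree rooted at 0:
        # each word must reach ROOT by following head links.
        def reaches_root(m):
            seen = set()
            while m != 0:
                if m in seen:
                    return False        # cycle, not a tree
                seen.add(m)
                m = heads[m - 1]
            return True
        if not all(reaches_root(m) for m in words):
            continue
        total = sum(score[heads[m - 1]][m] for m in words)
        if total > best:
            best, best_heads = total, heads
    return best, best_heads

# Hypothetical scores for a 2-word sentence plus ROOT (node 0):
scores = [
    [0, 10, 2],   # edges from ROOT
    [0,  0, 8],   # edges from word 1
    [0,  3, 0],   # edges from word 2
]
print(best_dependency_tree(scores))  # -> (18, (0, 1)): ROOT->1, 1->2
```

The n − 1 chosen edges must form a spanning arborescence rooted at ROOT, which is exactly the "legal dependency tree" constraint in the snippet.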

253 | CoNLL-X shared task on multilingual dependency parsing
- Buchholz, Marsi
- 2006
Citation Context: ...ning (§3.4). Section 3.3 presents overall bootstrapping accuracy. 3.1 Experimental Design We bootstrapped non-projective parsers for languages assembled for the CoNLL dependency parsing competitions (Buchholz and Marsi, 2006). We selected German, Spanish, and Czech (Brants et al., 2002; Civit Torruella and Martí Antonín, 2002; Böhmová et al., 2003). After removing sentences more than 60 words long, we randomly divided ea...

249 | The TIGER treebank
- Brants, Dipper, et al.
- 2002
Citation Context: ...1 Experimental Design We bootstrapped non-projective parsers for languages assembled for the CoNLL dependency parsing competitions (Buchholz and Marsi, 2006). We selected German, Spanish, and Czech (Brants et al., 2002; Civit Torruella and Martí Antonín, 2002; Böhmová et al., 2003). After removing sentences more than 60 words long, we randomly divided each corpus into small seed sets of 100 and 1000 trees; developm...

248 | Tagging English text with a probabilistic model - Merialdo - 1994

228 | Online large-margin training of dependency parsers
- McDonald, Crammer, et al.
- 2005
Citation Context: ...abeled treebank together with raw text. We argue that bootstrapping a parser is most promising when the model uses a rich set of redundant features, as in recent models for scoring dependency parses (McDonald et al., 2005). Drawing on Abney’s (2004) analysis of the Yarowsky algorithm, we perform bootstrapping by entropy regularization: we maximize a linear combination of conditional likelihood on labeled data and conf...

191 | Scalable inference and training of context-rich syntactic translation models
- Galley, Graehl, et al.
- 2006
Citation Context: ...paths in dependency parse trees to define their extraction patterns and classification features. Parsing is also key to the latest advances in machine translation, which translate syntactic phrases (Galley et al., 2006; Marcu et al., 2006; Cowan et al., 2006). 2 Our Approach Our approach rests on three observations: • Recent “feature-based” parsing models are an excellent fit for bootstrapping, because the parse is...

179 | Optimum branchings
- Edmonds
- 1967
Citation Context: ...es, using an O(n³) dynamic programming algorithm (Eisner, 1996) for projective trees. For non-projective parsing, O(n³), or with some trickery O(n²), greedy algorithms exist (Chu and Liu, 1965; Edmonds, 1967; Gabow et al., 1986). The feature function f may pay attention to many properties of the directed edge e. Of course, features may consider the parent and child words connected by e, and their parts o...

172 | Corpus-based induction of syntactic structure: Models of dependency and constituency
- Klein, Manning
- 2004
Citation Context: ...d features.

| % accuracy (train) | Bulgarian | German | Spanish |
|---|---|---|---|
| supervised ML | 74.2 | 80.0 | 71.3 |
| supervised CL | 77.5 | 79.3 | 75.0 |
| semi-supervised EM | 58.6 | 58.8 | 68.4 |
| semi-supervised Conf. | 80.0 | 80.5 | 76.7 |

Table 4: Dependency accuracy of the DMV model (Klein and Manning, 2004). Maximizing confidence using R1 (Shannon) entropy improved performance over its own conditional likelihood (CL) baseline and over maximum likelihood (ML). EM degraded its ML baseline. Since these mo...

162 | Online learning of approximate dependency parsing algorithms
- McDonald, Pereira
- 2006
Citation Context: ...ntroducing noise from out-of-domain data. We could also better exploit the data we have with richer models of syntax. In supervised dependency parsing, second-order edge features provide improvements (McDonald and Pereira, 2006; Riedel and Clarke, 2006); moreover, the feature-based approach is not limited to dependency parsing. Similar techniques could score parses in other formalisms, such as CFG or TAG. In this case, f ex...

148 | Domain adaptation with structural correspondence learning
- Blitzer, McDonald, et al.
- 2006
Citation Context: ...eranking”), as we would predict. However, another approach is to train a separate out-of-domain parser, and use this to generate additional features on the supervised and unsupervised in-domain data (Blitzer et al., 2006). Bootstrapping now teaches us where to trust the out-of-domain parser. If our basic model has 100 features, we could add features 101 through 200, where for example f123(e) = f23 · log P̃r(e) and ...

116 | On the shortest arborescence of a directed graph
- Chu, Liu
- 1965
Citation Context: ...n − 1 of these edges, using an O(n³) dynamic programming algorithm (Eisner, 1996) for projective trees. For non-projective parsing, O(n³), or with some trickery O(n²), greedy algorithms exist (Chu and Liu, 1965; Edmonds, 1967; Gabow et al., 1986). The feature function f may pay attention to many properties of the directed edge e. Of course, features may consider the parent and child words connected by e, an...

88 | Bootstrapping parsers via syntactic projection across parallel texts - Hwa, Resnik, et al. - 2005

86 | Efficient algorithms for finding minimum spanning trees in undirected and directed graphs
- Gabow, Galil, et al.
- 1986
Citation Context: ...O(n³) dynamic programming algorithm (Eisner, 1996) for projective trees. For non-projective parsing, O(n³), or with some trickery O(n²), greedy algorithms exist (Chu and Liu, 1965; Edmonds, 1967; Gabow et al., 1986). The feature function f may pay attention to many properties of the directed edge e. Of course, features may consider the parent and child words connected by e, and their parts of speech. But some...

82 | Spmt: Statistical machine translation with syntactified target language phrases
- Marcu, Wang, et al.
- 2006
Citation Context: ...parse trees to define their extraction patterns and classification features. Parsing is also key to the latest advances in machine translation, which translate syntactic phrases (Galley et al., 2006; Marcu et al., 2006; Cowan et al., 2006). 2 Our Approach Our approach rests on three observations: • Recent “feature-based” parsing models are an excellent fit for bootstrapping, because the parse is often overdetermine...

81 | Semi-supervised learning by entropy minimization
- Grandvalet, Bengio
- 2004
Citation Context: ...is bootstrapping algorithms locally minimize K. We now present a generalization of Abney’s K function and relate it to another semi-supervised learning technique, entropy regularization (Brand, 1999; Grandvalet and Bengio, 2005; Jiao et al., 2006). Our experiments will tune the feature weight vector, θ, to minimize our function. We will do so simply by applying a generic function minimization method (stochastic gradient des...

69 | Reranking and self-training for parser adaptation - McClosky, Charniak, et al. - 2006

66 | Structure learning in conditional probability models via an entropic prior and parameter extinction
- Brand
- 1998
Citation Context: ..., and shows his bootstrapping algorithms locally minimize K. We now present a generalization of Abney’s K function and relate it to another semi-supervised learning technique, entropy regularization (Brand, 1999; Grandvalet and Bengio, 2005; Jiao et al., 2006). Our experiments will tune the feature weight vector, θ, to minimize our function. We will do so simply by applying a generic function minimization me...

61 | Stochastic learning
- Bottou
- 2004
Citation Context: ...vised corpus. For many machine learning problems over large datasets, online learning methods such as stochastic gradient descent (SGD) have been empirically observed to converge in fewer iterations (Bottou, 2003). In SGD, instead of taking an optimization step in the direction of the gradient calculated over all unsupervised training examples, we parse each example, calculate the gradient of the objective fu...
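The context above contrasts batch optimization with SGD, which updates after every example. A generic sketch of that update loop; the objective, learning rate, and toy data are our own illustration, not the paper's training code:

```python
import random

def sgd(grad, theta, data, lr=0.01, epochs=300, seed=0):
    """Stochastic gradient descent as described above: take a small
    step against the per-example gradient after each example, rather
    than summing the gradient over the whole dataset first.
    grad(theta, x) is the gradient of the per-example objective."""
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)               # visit examples in random order
        for x in data:
            theta = theta - lr * grad(theta, x)
    return theta

# Toy use: minimize sum of (theta - x)^2, whose optimum is the mean.
points = [1.0, 2.0, 3.0, 6.0]
theta = sgd(lambda t, x: 2 * (t - x), 0.0, points)
print(round(theta, 2))  # close to the mean, 3.0
```

With a constant step size the iterate hovers near the optimum rather than converging exactly, which is why the snippet's "fewer iterations" claim is about reaching a good region quickly, not about final precision.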

60 | Semi-supervised conditional random fields for improved sequence segmentation and labeling
- Jiao, Wang, et al.
- 2006
Citation Context: ...ocally minimize K. We now present a generalization of Abney’s K function and relate it to another semi-supervised learning technique, entropy regularization (Brand, 1999; Grandvalet and Bengio, 2005; Jiao et al., 2006). Our experiments will tune the feature weight vector, θ, to minimize our function. We will do so simply by applying a generic function minimization method (stochastic gradient descent), rather than ...

51 | Bootstrapping statistical parsers from small datasets
- Steedman, Osborne, et al.
- 2003
Citation Context: ...ple sentence if there is no consensus. Previous work on parser bootstrapping has not been able to exploit this redundancy among features, because it has used PCFG-like models with far fewer features (Steedman et al., 2003). 2.3 Adaptation and Projection via Features The previous section assumed that we had a small supervised treebank in the target language and domain (plus a large unsupervised corpus). We now consider...

48 | Incremental integer linear programming for non-projective dependency parsing
- Riedel, Clarke
Citation Context: ...-domain data. We could also better exploit the data we have with richer models of syntax. In supervised dependency parsing, second-order edge features provide improvements (McDonald and Pereira, 2006; Riedel and Clarke, 2006); moreover, the feature-based approach is not limited to dependency parsing. Similar techniques could score parses in other formalisms, such as CFG or TAG. In this case, f extracts features from each...

44 | Minimally supervised morphological analysis by multimodal alignment
- Yarowsky, Wicentowski
- 2000
Citation Context: ...ive case, crossings. Our observation is that this situation is ideal for so-called “bootstrapping,” “co-training,” or “minimally supervised” learning methods (Yarowsky, 1995; Blum and Mitchell, 1998; Yarowsky and Wicentowski, 2000). Such methods should thrive when the right answer is overdetermined owing to redundant features and/or global constraints. Concretely, suppose we start by training a supervised parser on only 100 ex...

43 | Understanding the Yarowsky algorithm - Abney - 2004

31 | Structured prediction models via the matrix-tree theorem. EMNLP
- Koo, Globerson, et al.
- 2007
Citation Context: ...ide algorithm. For non-projective parsing, the analogy to the inside algorithm is the O(n³) “matrix-tree algorithm,” which is dominated asymptotically by a matrix determinant (Smith and Smith, 2007; Koo et al., 2007; McDonald and Satta, 2007). The gradient of a determinant may be computed by matrix inversion, so evaluating the gradient again has the same O(n³) complexity as evaluating the function. The second ...

29 | Inducing translation lexicons via diverse similarity measures and bridge languages
- Schafer, Yarowsky
- 2002
Citation Context: ...apple is a plausible direct object for eat. We can discover this last bit of world knowledge from comparable English text. Translation dictionaries can themselves be induced from comparable corpora (Schafer and Yarowsky, 2002; Schafer, 2006; Klementiev and Roth, 2006), or extracted from bitext or digitized versions of human-readable dictionaries if these are available. The above inference pattern can be captured by featur...

27 | Weakly supervised named entity transliteration and discovery from multilingual comparable corpora
- Klementiev, Roth
- 2006
Citation Context: .... We can discover this last bit of world knowledge from comparable English text. Translation dictionaries can themselves be induced from comparable corpora (Schafer and Yarowsky, 2002; Schafer, 2006; Klementiev and Roth, 2006), or extracted from bitext or digitized versions of human-readable dictionaries if these are available. The above inference pattern can be captured by features similar to those in equation (1). For e...

26 | On the complexity of non-projective datadriven dependency parsing
- McDonald, Satta
- 2007
Citation Context: ...non-projective parsing, the analogy to the inside algorithm is the O(n³) “matrix-tree algorithm,” which is dominated asymptotically by a matrix determinant (Smith and Smith, 2007; Koo et al., 2007; McDonald and Satta, 2007). The gradient of a determinant may be computed by matrix inversion, so evaluating the gradient again has the same O(n³) complexity as evaluating the function. The second term of (4) is the Shannon...

25 | Probabilistic models of nonprojective dependency trees
- Smith, Smith
- 2007
Citation Context: ...responding O(n³) outside algorithm. For non-projective parsing, the analogy to the inside algorithm is the O(n³) “matrix-tree algorithm,” which is dominated asymptotically by a matrix determinant (Smith and Smith, 2007; Koo et al., 2007; McDonald and Satta, 2007). The gradient of a determinant may be computed by matrix inversion, so evaluating the gradient again has the same O(n³) complexity as evaluating the fun...
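The "matrix-tree algorithm" mentioned in the context above reduces a sum over all non-projective trees to one determinant. A small self-contained sketch of that computation, with illustrative function names and a pure-Python cofactor determinant in place of a linear-algebra library:

```python
def det(m):
    """Determinant by cofactor expansion (fine for tiny matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j in range(len(m)):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total

def partition_function(weights):
    """Sum, over all spanning arborescences rooted at node 0, of the
    product of edge weights, via the directed Matrix-Tree theorem:
    build the Laplacian L (in-degree on the diagonal, -w(h->m) off
    the diagonal) and take det of L with the root row/column deleted.
    weights[h][m] is the weight of edge h -> m; the diagonal is ignored."""
    n = len(weights)
    L = [[0.0] * n for _ in range(n)]
    for m in range(n):
        for h in range(n):
            if h != m:
                L[m][m] += weights[h][m]    # total incoming weight of m
                L[h][m] = -weights[h][m]
    reduced = [row[1:] for row in L[1:]]    # delete root row and column
    return det(reduced)

# With all weights 1 on 3 nodes, this counts the spanning
# arborescences rooted at node 0: there are 3.
ones = [[1.0] * 3 for _ in range(3)]
print(partition_function(ones))  # -> 3.0
```

In a real parser the weights are exponentiated edge scores, the determinant gives the normalizer Z, and (as the context notes) its gradient comes from the inverse of the same matrix, keeping everything O(n³).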

24 | Quasi-synchronous grammars: Alignment by soft projection of syntactic dependencies - Smith, Eisner - 2006

22 | Design and implementation of the Bulgarian HPSG-based treebank
- Simov, Osenova, et al.
- 2005
Citation Context: ...information. Since the DMV models projective trees, we ran experiments on three CoNLL corpora that had augmented their primary non-projective parses with alternate projective annotations: Bulgarian (Simov et al., 2005), German, and Spanish. We performed supervised maximum likelihood and conditional likelihood estimation on a seed set of 100 sentences for each language. These models respectively initialized EM and ...

19 | The PDT: a 3-level annotation scenario
- Böhmová, Hajič, et al.
- 2003
Citation Context: ...for languages assembled for the CoNLL dependency parsing competitions (Buchholz and Marsi, 2006). We selected German, Spanish, and Czech (Brants et al., 2002; Civit Torruella and Martí Antonín, 2002; Böhmová et al., 2003). After removing sentences more than 60 words long, we randomly divided each corpus into small seed sets of 100 and 1000 trees; development and test sets of 200 trees each; and an unlabeled training ...

14 | Efficient computation of entropy gradient for semisupervised conditional random fields
- Mann, McCallum
- 2007
Citation Context: ...gle known parse. But the denominator Zi is a normalizing constant that sums over all parses; it is found by a dependency-parsing variant of the inside algorithm, following (Eisner, 1996). See also (Mann and McCallum, 2007) for similar results on conditional random fields. ...is in fact the Shannon entropy H(p) and that lim_{α→∞} R_α(p) = −log max_y p(y), i.e. the negative log probability of the modal or “Viterbi” label (...

13 | A discriminative model for tree-to-tree translation - Cowan, Kučerová, et al. - 2006

11 | Information Measures
- Arndt
- 2001
Citation Context: ...(2007) for similar results on conditional random fields. ...is in fact the Shannon entropy H(p) and that lim_{α→∞} R_α(p) = −log max_y p(y), i.e. the negative log probability of the modal or “Viterbi” label (Arndt, 2001; Karakos et al., 2007). The α = 2 case, widely used as a measure of purity in decision tree learning, is often called the “Gini index.” Finally, when α = 0, we get the log of the number of labels, wh...

7 | Cross-instance tuning of unsupervised document clustering algorithms
- Karakos, Eisner, et al.
- 2007
Citation Context: ...results on conditional random fields. ...is in fact the Shannon entropy H(p) and that lim_{α→∞} R_α(p) = −log max_y p(y), i.e. the negative log probability of the modal or “Viterbi” label (Arndt, 2001; Karakos et al., 2007). The α = 2 case, widely used as a measure of purity in decision tree learning, is often called the “Gini index.” Finally, when α = 0, we get the log of the number of labels, which equals the H(unifo...
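The context above walks through the special cases of Rényi entropy: α → 1 recovers Shannon entropy, α → ∞ gives the negative log probability of the "Viterbi" label, α = 2 relates to the Gini index, and α = 0 gives the log of the number of labels. A small sketch making those cases concrete (the function name is ours):

```python
import math

def renyi_entropy(p, alpha):
    """R_alpha(p) = log(sum_y p(y)^alpha) / (1 - alpha), with the
    standard limiting cases described above. p is a discrete
    distribution; zero-probability labels are dropped."""
    p = [q for q in p if q > 0]
    if alpha == 1:                       # limit: Shannon entropy H(p)
        return -sum(q * math.log(q) for q in p)
    if alpha == float("inf"):            # limit: -log max_y p(y)
        return -math.log(max(p))
    return math.log(sum(q ** alpha for q in p)) / (1 - alpha)

p = [0.5, 0.25, 0.25]
print(renyi_entropy(p, 0))             # log of the number of labels: log 3
print(renyi_entropy(p, 1))             # Shannon entropy
print(renyi_entropy(p, 2))             # -log sum_y p(y)^2 (the Gini-related case)
print(renyi_entropy(p, float("inf")))  # -log 0.5, the Viterbi confidence
```

Note that R_α is non-increasing in α, which is one way to see why large α gives the most "peaked" confidence measure.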

6 | Combining deep linguistics analysis and surface pattern learning: A hybrid approach to chinese definitional question answering
- Peng, Weischedel, et al.
- 2005
Citation Context: ...a key component in leading systems for information extraction (Weischedel, 2004) and question answering (Peng et al., 2005). These systems rely on edges or paths in dependency parse trees to define their extraction patterns and classification features. Parsing is also key to the latest advances in machine translation, wh...

4 | Semi-supervised learning by entropy minimization - Grandvalet, Bengio - 2005

3 | Design principles for a Spanish treebank - Civit Torruella, Martí Antonín - 2002

1 | Treebank transfer - Jansche - 2005

1 | Translation Discovery Using Diverse Similarity Measures
- Schafer
- 2006
Citation Context: ...object for eat. We can discover this last bit of world knowledge from comparable English text. Translation dictionaries can themselves be induced from comparable corpora (Schafer and Yarowsky, 2002; Schafer, 2006; Klementiev and Roth, 2006), or extracted from bitext or digitized versions of human-readable dictionaries if these are available. The above inference pattern can be captured by features similar to t...

1 | Extracting dynamic evidence networks
- Weischedel
- 2004
Citation Context: ...key component in leading systems for information extraction (Weischedel, 2004) and question answering (Peng et al., 2005). These systems rely on edges or paths in dependency parse trees to define their extraction patterns and classification features. Parsing is also key to t...