## Discriminative log-linear grammars with latent variables (2008)

### Download Links

- [books.nips.cc]
- [www.cs.berkeley.edu]
- [www.petrovi.de]
- [nlp.cs.berkeley.edu]
- DBLP

### Other Repositories/Bibliography

Venue: Proceedings of NIPS 20

Citations: 38 (6 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Petrov08discriminativelog-linear,
  author    = {Slav Petrov and Dan Klein},
  title     = {Discriminative log-linear grammars with latent variables},
  booktitle = {Proceedings of NIPS 20},
  year      = {2008}
}
```

### Abstract

We demonstrate that log-linear grammars with latent variables can be practically trained using discriminative methods. Central to efficient discriminative training is a hierarchical pruning procedure which allows feature expectations to be efficiently approximated in a gradient-based procedure. We compare L1 and L2 regularization and show that L1 regularization is superior, requiring fewer iterations to converge and yielding sparser solutions. On full-scale treebank parsing experiments, the discriminative latent models outperform both the comparable generative latent models and the discriminative non-latent baselines.
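The training setup the abstract describes can be illustrated in miniature: in a log-linear model with latent variables, each observed tree covers several latent derivations, and the gradient of the conditional log-likelihood is a difference of two feature expectations. The sketch below is a hypothetical toy (two observed labels, hand-picked feature vectors), not the paper's grammar:

```python
import math

# Toy log-linear model with a latent variable: each observed label T covers
# several latent derivations t, and P(T|w) = sum_{t in T} exp(theta.f(t)) / Z.
# All labels and feature vectors here are made up for illustration.
DERIVS = {
    "T1": [(1.0, 0.0), (0.5, 0.5)],  # two latent variants of label T1
    "T2": [(0.0, 1.0)],              # one latent variant of label T2
}

def score(theta, f):
    return sum(a * b for a, b in zip(theta, f))

def cond_loglik(theta, gold="T1"):
    # log P(gold|w) = log sum_{t in gold} e^{s(t)} - log sum_{all t} e^{s(t)}
    num = sum(math.exp(score(theta, f)) for f in DERIVS[gold])
    den = sum(math.exp(score(theta, f))
              for fs in DERIVS.values() for f in fs)
    return math.log(num) - math.log(den)

def grad(theta, gold="T1"):
    # gradient = E[f | gold tree] - E[f | sentence]: expected feature counts
    # restricted to the gold label, minus expectations over all derivations
    gold_fs = DERIVS[gold]
    all_fs = [f for fs in DERIVS.values() for f in fs]
    wg = [math.exp(score(theta, f)) for f in gold_fs]
    wa = [math.exp(score(theta, f)) for f in all_fs]
    zg, za = sum(wg), sum(wa)
    dim = len(theta)
    e_gold = [sum(w * f[k] for w, f in zip(wg, gold_fs)) / zg for k in range(dim)]
    e_all = [sum(w * f[k] for w, f in zip(wa, all_fs)) / za for k in range(dim)]
    return [g - a for g, a in zip(e_gold, e_all)]
```

A finite-difference check confirms the expectation-difference form of the gradient; in the paper the same expectations are computed over parse forests with inside-outside recursions rather than by enumeration.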

### Citations

2306 | Conditional random fields: probabilistic models for segmenting and labeling sequence data
- Lafferty, McCallum, et al.
- 2001
Citation Context: ..., object NPs, and so on. At the same time, discriminative methods have consistently provided advantages over their generative counterparts, including less restriction on features and greater accuracy [3, 4, 5]. In this work, we therefore investigate discriminative learning of latent PCFGs, hoping to gain the best from both lines of work. Discriminative methods for parsing are not new. However, most discrim...

1883 | Numerical Optimization
- Nocedal, Wright
- 1999
Citation Context: ...ze the log conditional likelihood: L_cond(θ) = log ∏_i P_θ(T_i | w_i) = log ∏_i ∑_{t ∈ T_i} P_θ(w_i, t) / Z(θ, w_i) (6). We directly optimize this non-convex objective function using a numerical gradient based method (LBFGS [19] in our implementation). Fitting the log-linear model involves the following derivatives: ∂L_cond(θ)/∂θ_{X→γ} = ∑_i ( E_θ[f_{X→γ}(t) | T_i] − E_θ[f_{X→γ}(t) | w_i] ) (7), where the first term is the expected count ...
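The context above fits the non-convex objective with LBFGS [19]. As a hedged sketch of what such gradient-based fitting looks like in practice, here is SciPy's L-BFGS-B minimizing a stand-in smooth function (the Rosenbrock function, purely illustrative; the paper's actual objective is the negated conditional likelihood with the expectation-difference gradient of eq. (7)):

```python
import numpy as np
from scipy.optimize import minimize

# Stand-in smooth objective for illustration; an actual trainer would plug in
# -L_cond(theta) and its gradient here.
def neg_objective(x):
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

def neg_gradient(x):
    # analytic gradient, playing the role eq. (7) plays for the real objective
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0] ** 2),
        200 * (x[1] - x[0] ** 2),
    ])

# L-BFGS maintains a limited-memory approximation of the inverse Hessian
# from recent gradient evaluations, so only function/gradient calls are needed.
res = minimize(neg_objective, x0=np.zeros(2), jac=neg_gradient,
               method="L-BFGS-B")
```

The minimum of this stand-in is at (1, 1); for the grammar objective, which is non-convex in the latent annotations, L-BFGS only guarantees a local optimum.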

953 | Head-Driven Statistical Models for Natural Language Parsing
- Collins
- 1999
Citation Context: ...The original tree. (b) The (binarized) X-bar tree. (c) The annotated tree. 2 Grammars with latent annotations: Context-free grammars (CFGs) underlie most high-performance parsers in one way or another [13, 12, 14]. However, a CFG which simply takes the empirical productions and probabilities off of a treebank does not perform well. This naive grammar is a poor one because its context-freedom assumptions are to...

385 | Coarse-to-fine n-best parsing and maxent discriminative reranking
- Charniak, Johnson
- 2005
Citation Context: ...training set can be avoided because the discriminative component only needs to select the best tree from a fixed candidate list. While most state-of-the-art parsing systems apply this hybrid approach [10, 11, 12], it has the limitation that the candidate list often does not contain the correct parse tree. For example, 41% of the correct parses were not in the candidate pool of ≈30-best parses in [10]. In this ...

373 | The estimation of stochastic context-free grammars using the Inside–Outside algorithm. Computer Speech and Language
- Lari, Young
- 1990

367 | On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes
- Ng, Jordan
- 2002

283 | Learning accurate, compact, and interpretable tree annotation
- Petrov, Barrett, et al.
- 2006
Citation Context: ...he discriminative non-latent baselines. 1 Introduction: In recent years, latent annotation of PCFG has been shown to perform as well as or better than standard lexicalized methods for treebank parsing [1, 2]. In the latent annotation scenario, we imagine that the observed treebank is a coarse trace of a finer, unobserved grammar. For example, the single treebank category NP (noun phrase) may be better mo...

272 | Inside-Outside Reestimation from Partially Bracketed Corpora
- Pereira, Schabes
- 1992
Citation Context: ... sentences in the training set. We will discuss ways to make their computation on large data sets practical in the next section. Since the tree structure is observed, this can be done in linear time [17]. Alternatively, maximum conditional likelihood estimation can also be seen as a special case of maximum likelihood estimation, where P(w) is assumed to be the empirical one and not learned. The con...

268 | Discriminative reranking for natural language parsing
- Collins, Koo
- 2005
Citation Context: ...training set can be avoided because the discriminative component only needs to select the best tree from a fixed candidate list. While most state-of-the-art parsing systems apply this hybrid approach [10, 11, 12], it has the limitation that the candidate list often does not contain the correct parse tree. For example, 41% of the correct parses were not in the candidate pool of ≈30-best parses in [10]. In this ...

180 | Improved inference for unlexicalized parsing
- Petrov, Klein
- 2007
Citation Context: ...The original tree. (b) The (binarized) X-bar tree. (c) The annotated tree. 2 Grammars with latent annotations: Context-free grammars (CFGs) underlie most high-performance parsers in one way or another [13, 12, 14]. However, a CFG which simply takes the empirical productions and probabilities off of a treebank does not perform well. This naive grammar is a poor one because its context-freedom assumptions are to...

138 | Max-margin parsing
- Taskar, Klein, et al.
- 2004
Citation Context: ...epeated parsing of the training set, which is generally impractical. Previous work on end-to-end discriminative parsing has therefore resorted to “toy setups,” considering only sentences of length 15 [6, 7, 8] or extremely small corpora [9]. To get the benefits of discriminative methods, it has therefore become common practice to extract n-best candidate lists from a generative parser and then use a discri...

123 | Scalable training of l1-regularized log-linear models
- Andrew, Gao
- 2007
Citation Context: ...L1 case, however, the penalty term is discontinuous whenever some parameter equals zero. To handle the discontinuity of the gradient, we used the orthant-wise limited-memory quasi-Newton algorithm of [24]. Table 2 shows that while there is no significant performance difference in models trained with L1 or L2 regularization, there is a significant difference in the number of training iterations and the s...
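The sparsity this context attributes to L1 can be seen in a few lines: the proximal operator of the L1 penalty is soft-thresholding, which sets small coordinates exactly to zero, whereas L2 only shrinks them. The following is a toy per-coordinate sketch with made-up numbers, not the orthant-wise method of [24]:

```python
# Soft-thresholding: the proximal operator of lam * |w|. Coordinates whose
# pull toward the data is weaker than the penalty land exactly at zero.
def soft_threshold(x, lam):
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

# Proximal-gradient minimization of 0.5*(w - t)^2 + lam*|w|, coordinate-wise.
# (Illustrative stand-in objective; a grammar trainer would use -L_cond.)
def prox_grad_l1(target, lam, step=0.1, iters=500):
    w = [0.0] * len(target)
    for _ in range(iters):
        w = [soft_threshold(wi - step * (wi - ti), step * lam)
             for wi, ti in zip(w, target)]
    return w

weights = prox_grad_l1(target=[2.0, 0.05, -1.0, 0.02], lam=0.5)
```

Coordinates with |target| ≤ λ end at exactly 0.0, matching the closed-form solution soft_threshold(target, λ); an L2 penalty has no such zeroing effect, which is why L1-trained grammars have far fewer nonzero parameters.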

116 | Contrastive estimation: Training log-linear models on unlabeled data
- Smith, Eisner
- 2005
Citation Context: ...unlikely items are removed from the chart, pruning has virtually no effect on the conditional likelihood. Pruning more aggressively leads to a training procedure reminiscent of contrastive estimation [23], where the denominator is restricted to a neighborhood of the correct parse tree (rather than containing all possible parse trees). In our experiments, pruning more aggressively did not hurt performa...

77 | The mathematics of statistical machine translation: Parameter estimation
- Brown, Pietra, et al.
- 1993
Citation Context: ...rchical Estimation: The number of training iterations can be reduced by training models of increasing complexity in a hierarchical fashion. For example, in mixture modeling [20] and machine translation [21], a sequence of increasingly more complex models is constructed and each model is initialized with its (simpler) predecessor. In our case, we begin with the unsplit X-Bar grammar and iteratively split...
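The hierarchical scheme in this context (initialize each more complex model from its simpler predecessor) can be sketched as a split step: every coarse parameter spawns two refined copies, with a little noise to break the symmetry between them. The symbol names and noise scale below are hypothetical:

```python
import random

# Split each coarse parameter into two refined ones, initialized near the
# parent value; small random noise breaks the symmetry between the copies,
# so gradient training can differentiate them.
def split_params(params, noise=0.01, rng=None):
    rng = rng or random.Random(0)  # seeded for reproducibility
    out = {}
    for name, value in params.items():
        for i in (0, 1):
            out[f"{name}_{i}"] = value + rng.uniform(-noise, noise)
    return out

coarse = {"NP": 0.7, "VP": 0.3}  # unsplit X-Bar-style symbols (toy values)
fine = split_params(coarse)      # NP_0, NP_1, VP_0, VP_1
```

Repeating the split on the trained fine model (NP_0 into NP_0_0 and NP_0_1, and so on) yields the sequence of increasingly refined grammars; each round starts optimization close to the previous optimum, which is what cuts down the iteration count.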

68 | Probabilistic CFG with latent annotations
- Matsuzaki, Miyao, et al.
- 2005
Citation Context: ...he discriminative non-latent baselines. 1 Introduction: In recent years, latent annotation of PCFG has been shown to perform as well as or better than standard lexicalized methods for treebank parsing [1, 2]. In the latent annotation scenario, we imagine that the observed treebank is a coarse trace of a finer, unobserved grammar. For example, the single treebank category NP (noun phrase) may be better mo...

63 | The infinite PCFG using hierarchical Dirichlet processes
- Liang, Petrov, et al.
- 2007
Citation Context: ...sed on different estimation techniques for learning generative grammars with latent labels (training with basic EM [1], an EM-based split and merge approach [2], a non-parametric variational approach [16]). In the following, we review how generative grammars are learned and present an algorithm for estimating discriminative grammars with latent variables. 2.1 Generative Grammars: Generative grammars wi...

42 | Discriminative training of a neural network statistical parser
- Henderson
- 2004
Citation Context: ...epeated parsing of the training set, which is generally impractical. Previous work on end-to-end discriminative parsing has therefore resorted to “toy setups,” considering only sentences of length 15 [6, 7, 8] or extremely small corpora [9]. To get the benefits of discriminative methods, it has therefore become common practice to extract n-best candidate lists from a generative parser and then use a discri...

30 | Hidden-Variable Models for Discriminative Reranking
- Koo, Collins
- 2005
Citation Context: ...training set can be avoided because the discriminative component only needs to select the best tree from a fixed candidate list. While most state-of-the-art parsing systems apply this hybrid approach [10, 11, 12], it has the limitation that the candidate list often does not contain the correct parse tree. For example, 41% of the correct parses were not in the candidate pool of ≈30-best parses in [10]. In this ...

20 | Scalable discriminative learning for natural language parsing and translation
- Turian, Wellington, et al.
- 2007
Citation Context: ...epeated parsing of the training set, which is generally impractical. Previous work on end-to-end discriminative parsing has therefore resorted to “toy setups,” considering only sentences of length 15 [6, 7, 8] or extremely small corpora [9]. To get the benefits of discriminative methods, it has therefore become common practice to extract n-best candidate lists from a generative parser and then use a discri...

16 | Joint and conditional estimation of tagging and parsing models
- Johnson
- 2001
Citation Context: ...which is generally impractical. Previous work on end-to-end discriminative parsing has therefore resorted to “toy setups,” considering only sentences of length 15 [6, 7, 8] or extremely small corpora [9]. To get the benefits of discriminative methods, it has therefore become common practice to extract n-best candidate lists from a generative parser and then use a discriminative component to rerank th...

14 | Weighted and probabilistic contextfree grammars are equally expressive
- Smith, Johnson
- 2007
Citation Context: ...free grammars (CFGs), where ∑_{γ′} φ_{X→γ′} = 1 and Z(θ) = 1. Note, however, that this normalization constraint poses no restriction on the model class, as probabilistic and weighted CFGs are equivalent [18]. 2.2 Discriminative Grammars: Discriminative grammars with latent variables can be seen as conditional random fields [4] over trees. For discriminative grammars, we maximize the log conditional likeli...

7 | Conditional structure vs. conditional estimation in NLP models
- Klein, Manning
- 2002
Citation Context: ..., object NPs, and so on. At the same time, discriminative methods have consistently provided advantages over their generative counterparts, including less restriction on features and greater accuracy [3, 4, 5]. In this work, we therefore investigate discriminative learning of latent PCFGs, hoping to gain the best from both lines of work. Discriminative methods for parsing are not new. However, most discrim...

3 | Split and merge EM algorithm for mixture models
- Ueda, Nakano, et al.
Citation Context: ...ired per iteration. 3.1 Hierarchical Estimation: The number of training iterations can be reduced by training models of increasing complexity in a hierarchical fashion. For example, in mixture modeling [20] and machine translation [21], a sequence of increasingly more complex models is constructed and each model is initialized with its (simpler) predecessor. In our case, we begin with the unsplit X-Bar ...