## Wide-coverage efficient statistical parsing with CCG and log-linear models (2007)


Venue: Computational Linguistics

Citations: 164 (36 self)

### BibTeX

```bibtex
@article{Clark07wide-coverageefficient,
  author  = {Stephen Clark and James R. Curran},
  title   = {Wide-coverage efficient statistical parsing with CCG and log-linear models},
  journal = {Computational Linguistics},
  year    = {2007},
  volume  = {33}
}
```

### Abstract

This paper describes a number of log-linear parsing models for an automatically extracted lexicalized grammar. The models are "full" parsing models in the sense that probabilities are defined for complete parses, rather than for independent events derived by decomposing the parse tree. Discriminative training is used to estimate the models, which requires incorrect parses for each sentence in the training data as well as the correct parse. The lexicalized grammar formalism used is Combinatory Categorial Grammar (CCG), and the grammar is automatically extracted from CCGbank, a CCG version of the Penn Treebank. The combination of discriminative training and an automatically extracted grammar leads to a significant memory requirement (over 20 GB), which is satisfied using a parallel implementation of the BFGS optimisation algorithm running on a Beowulf cluster. Dynamic programming over a packed chart, in combination with the parallel implementation, allows us to solve one of the largest-scale estimation problems in the statistical parsing literature in under three hours. A key component of the parsing system, for both training and testing, is a Maximum Entropy supertagger which assigns CCG lexical categories to words in a sentence. The supertagger makes the discriminative training feasible, and also leads to a highly efficient parser. Surprisingly, …
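
The "full" log-linear model described in the abstract can be sketched minimally in code: the probability of a complete parse is the normalised exponential of a weighted feature sum, computed over all candidate parses for the sentence. The features and weights below are invented for illustration; they are not the paper's feature set.

```python
import math

def parse_probs(feature_vectors, weights):
    """p(parse | sentence): normalised exponential of a weighted
    feature sum, computed over all candidate parses."""
    scores = [sum(weights.get(f, 0.0) * v for f, v in fv.items())
              for fv in feature_vectors]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

# Two hypothetical candidate parses for one sentence.
candidates = [
    {"rule:S->NP_VP": 1.0, "dep:saw-man": 1.0},
    {"rule:S->NP_VP": 1.0, "dep:saw-telescope": 1.0},
]
weights = {"rule:S->NP_VP": 0.5, "dep:saw-man": 1.2, "dep:saw-telescope": 0.3}
probs = parse_probs(candidates, weights)
```

Discriminative training adjusts the weights so that the probability of each gold parse rises relative to the incorrect candidates for the same sentence.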

### Citations

2541 | Conditional random fields: Probabilistic models for segmenting and labeling sequence data
- Lafferty, McCallum, et al.
- 2001
Citation Context ...monstrate that both accurate and highly efficient parsing is possible with CCG. 1 Introduction Log-linear models have been applied to a number of problems in NLP, e.g. POS tagging (Ratnaparkhi, 1996; Lafferty, McCallum, and Pereira, 2001), named entity recognition (Borthwick, 1999), chunking (Koeling, 2000) and parsing (Johnson et al., 1999). Log-linear models are also referred to as maximum entropy models and random fields in the NLP... |

2273 | Building a large annotated corpus of English: The Penn Treebank
- Marcus, Marcinkiewicz, et al.
- 1993
Citation Context ...ules used by the parser, and it is used as training data for the statistical models. The treebank is CCGbank (Hockenmaier and Steedman, 2002a; Hockenmaier, 2003a), a CCG version of the Penn Treebank (Marcus, Santorini, and Marcinkiewicz, 1993). Penn Treebank conversions have also been carried out for other linguistic formalisms, including TAG (Chen and Vijay-Shanker, 2000; Xia, Palmer, and Joshi, 2000), LFG (Burke et al., 2004) and HPSG (... |

2248 | Numerical optimization
- Nocedal, Wright
- 2000
Citation Context ...rgence was extremely slow; Sha and Pereira (2003) present a similar finding for globally optimised log-linear models for sequences. As an alternative to GIS, we use the limited-memory BFGS algorithm (Nocedal and Wright 1999). As Malouf (2002) demonstrates, general purpose numerical optimisation algorithms such as BFGS can converge much faster than iterative scaling algorithms (including Improved Iterative Scaling (Della... |

887 | A maximum-entropy-inspired parser - Charniak |

802 | A high-performance, portable implementation of the MPI Message-Passing Interface standard
- Gropp, Lusk, et al.
- 1996
Citation Context ...reduction in estimation time: using 18 nodes allows our best-performing model to be estimated in less than three hours. We use the Message Passing Interface (MPI) standard for the implementation (Gropp et al. 1996). The parallel implementation is a straightforward extension of the BFGS algorithm. Each machine in the cluster deals with a subset of the training data, holding the packed charts for that subset in ... |
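
The data-parallel scheme this excerpt describes, where each node computes the gradient contribution of its own shard of the training data and the contributions are then summed, can be sketched without MPI using threads. The per-example "expected counts" here are supplied as constants purely for illustration; in the real system they come from sums over packed charts.

```python
from concurrent.futures import ThreadPoolExecutor

def shard_gradient(shard):
    """Partial gradient of the conditional log-likelihood for one shard:
    empirical feature values minus model-expected values (the expectations
    are precomputed constants here, purely for illustration)."""
    grad = {}
    for feats, expected in shard:
        for f, v in feats.items():
            grad[f] = grad.get(f, 0.0) + v - expected.get(f, 0.0)
    return grad

def parallel_gradient(shards, workers=4):
    """Each worker handles one shard; the partial gradients are summed,
    mirroring the reduce step in the parallel BFGS implementation."""
    total = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for g in pool.map(shard_gradient, shards):
            for f, v in g.items():
                total[f] = total.get(f, 0.0) + v
    return total

# Two shards, one training example each (invented feature counts).
shards = [[({"f1": 1.0}, {"f1": 0.4})], [({"f1": 1.0}, {"f1": 0.6})]]
grad = parallel_gradient(shards)
```

Because the log-likelihood gradient decomposes as a sum over sentences, summing the shard gradients gives exactly the full-data gradient, which is what makes the MPI distribution straightforward.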

604 | The Syntactic Process
- Steedman
- 2000
Citation Context ...e estimation problem by developing a parallelised version of the estimation algorithm which runs on a Beowulf cluster. The lexicalized grammar formalism we use is Combinatory Categorial Grammar (CCG; Steedman, 2000). A number of statistical parsing models have recently been developed for CCG and used in parsers applied to newspaper text (Clark, Hockenmaier, and Steedman 2002; Hockenmaier and Steedman 2002b; Hoc... |

582 | Inducing Features Of Random Fields - Della Pietra, Della Pietra, et al. - 1995 |

479 | Shallow parsing with conditional random fields - Sha, Pereira - 2003 |

455 | A New Statistical Parser Based on Bigram Lexical Dependencies - Collins - 1996 |

454 | Generalized iterative scaling for log-linear models
- Darroch, Ratcliff
- 1972
Citation Context ...nd outside scores to calculate expectations, similar to the inside-outside algorithm for estimating the parameters of a PCFG from unlabelled data (Lari and Young 1990). Generalised Iterative Scaling (Darroch and Ratcliff 1972) is a common choice in the NLP literature for estimating a log-linear model, e.g. (Ratnaparkhi 1998; Curran and Clark 2003). Initially we used GIS for the parsing models described here, but found tha... |
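
For reference, the GIS update itself is simple: each weight moves by (1/C) times the log ratio of the empirical to the model-expected feature count. The sketch below runs it on a toy model with one indicator feature per outcome (so the feature-sum constant C is 1); the counts are invented.

```python
import math

def gis(empirical, n_obs, C=1, iters=200):
    """Generalised Iterative Scaling for a toy log-linear model with one
    context and one indicator feature per outcome (feature sum C = 1)."""
    lam = {y: 0.0 for y in empirical}
    for _ in range(iters):
        z = sum(math.exp(lam[y]) for y in lam)
        # Model-expected count of each feature under the current weights.
        expected = {y: n_obs * math.exp(lam[y]) / z for y in lam}
        for y in lam:
            lam[y] += (1.0 / C) * math.log(empirical[y] / expected[y])
    z = sum(math.exp(lam[y]) for y in lam)
    return {y: math.exp(lam[y]) / z for y in lam}

# Hypothetical counts: outcome 0 observed 70 times, outcome 1 observed 30.
probs = gis({0: 70.0, 1: 30.0}, n_obs=100)
```

With C = 1 and non-overlapping features the update converges almost immediately; with the large C and heavily overlapping features of a real parsing model, convergence is far slower, which is the behaviour that led the authors to switch to BFGS.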

383 | The estimation of stochastic context-free grammars using the inside-outside algorithm
- Lari, Young
- 1990
Citation Context ...entence. The dynamic programming method uses inside and outside scores to calculate expectations, similar to the inside-outside algorithm for estimating the parameters of a PCFG from unlabelled data (Lari and Young 1990). Generalised Iterative Scaling (Darroch and Ratcliff 1972) is a common choice in the NLP literature for estimating a log-linear model, e.g. (Ratnaparkhi 1998; Curran and Clark 2003). Initially we us... |

380 | Statistical parsing with a context-free grammar and word statistics
- Charniak
- 1997
Citation Context ...ext. Hockenmaier (2003a) and Hockenmaier and Steedman (2002b) present a generative model of normal-form derivations, based on various techniques from the statistical parsing literature (Collins 2003; Charniak 1997; Goodman 1997). A CCG binary derivation tree is generated top-down, with the probability of generating particular child nodes being conditioned on some limited context from the previously generated s... |

339 | A maximum entropy part-of-speech tagger
- Ratnaparkhi
- 1996
Citation Context ... CCG parser. We demonstrate that both accurate and highly efficient parsing is possible with CCG. 1. Introduction Log-linear models have been applied to a number of problems in NLP, e.g. POS tagging (Ratnaparkhi 1996; Lafferty, McCallum, and Pereira 2001), named entity recognition ... |

285 | Discriminative reranking for natural language parsing
- Collins, Koo
- 2005
Citation Context ...ng the generative model of Hockenmaier and Steedman (2002b) in a similar way. Using a generative model’s score as a feature in a discriminative framework has been beneficial for reranking approaches (Collins and Koo 2005). Since the generative model uses local features similar to those in our log-linear models, it could be incorporated into the estimation and decoding processes without the need for reranking. One way ... |

257 | Categorial type logics
- Moortgat
- 1997
Citation Context ...iments. A recent development in the theory of CCG is the multi-modal treatment given by Baldridge (2002) and Baldridge and Kruijff (2003), following the type-logical approaches to categorial grammar (Moortgat 1997). One possible extension to the parser and grammar described in this paper is to incorporate the multi-modal approach; Baldridge (2002) suggests that, as well as having theoretical motivation, a mult... |

234 | A Gaussian prior for smoothing maximum entropy models
- Chen, Rosenfeld
- 1999
Citation Context ... follow Riezler et al. (2002) in using a discriminative estimation method by maximising the conditional log-likelihood of the model given the data, minus a Gaussian prior term to prevent overfitting (Chen and Rosenfeld 1999; Johnson et al. 1999). Thus, given training sentences S1, . . . , Sm, gold-standard dependency structures, π1, . . . , πm, and the definition of the probability of a dependency structure (13), the ob... |
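
The objective described in this excerpt, conditional log-likelihood minus a Gaussian prior term, can be written down directly for a toy parse-selection model. The candidate feature vectors are hypothetical; in the real system the normalising sums are computed over packed charts rather than explicit candidate lists.

```python
import math

def penalised_loglik(data, weights, sigma=1.0):
    """Conditional log-likelihood of the gold parse for each sentence,
    minus the Gaussian prior term sum(lambda_i^2 / (2 sigma^2))."""
    ll = 0.0
    for gold_idx, candidates in data:
        scores = [sum(weights.get(f, 0.0) * v for f, v in fv.items())
                  for fv in candidates]
        log_z = math.log(sum(math.exp(s) for s in scores))
        ll += scores[gold_idx] - log_z
    prior = sum(w * w for w in weights.values()) / (2.0 * sigma ** 2)
    return ll - prior

# One sentence, two candidate parses, gold parse is candidate 0.
data = [(0, [{"a": 1.0}, {"b": 1.0}])]
obj = penalised_loglik(data, weights={})
```

Maximising this objective rewards weights that favour the gold parses while the prior keeps the weights small, preventing overfitting.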

229 | PCFG models of linguistic tree representations
- Johnson
- 1998
Citation Context ...aseline are considered: increasing the amount of lexicalisation; generating a lexical category at its maximal projection; conditioning the probability of a rule instantiation on the grandparent node (Johnson 1998); adding features designed to deal with coordination; and adding distance to the dependency features. Some of these extensions, such as increased lexicalisation and generating a lexical category at i... |

214 | Maximum Entropy Models for Natural Language Ambiguity Resolution
- Ratnaparkhi
- 1998
Citation Context ...eters of a PCFG from unlabelled data (Lari and Young 1990). Generalised Iterative Scaling (Darroch and Ratcliff 1972) is a common choice in the NLP literature for estimating a log-linear model, e.g. (Ratnaparkhi 1998; Curran and Clark 2003). Initially we used GIS for the parsing models described here, but found that convergence was extremely slow; Sha and Pereira (2003) present a similar finding for globally opti... |

190 | Surface structure and interpretation
- Steedman
- 1996
Citation Context ...rties of the extracted grammars, is an open question. Related work on statistical parsing with CCG will be described in Section 3. 3. Combinatory Categorial Grammar Combinatory Categorial Grammar (CCG) (Steedman 1996, 2000) is a type-driven lexicalised theory of grammar based on categorial grammar (Wood 1993). CCG lexical entries consist of a syntactic category, which defines valency and directionality, and a sem... |

187 | Robust accurate statistical annotation of general text - Briscoe, Carroll - 2002 |

172 | Parsing the WSJ using CCG and log-linear models - Clark, Curran - 2004 |

171 | Learning to parse natural language with maximum entropy models
- Ratnaparkhi
- 1999
Citation Context ...highly efficient parsing is possible with CCG. 2 Related Work The first application of log-linear models to parsing is the work of Ratnaparkhi (Ratnaparkhi, Roukos, and Ward, 1994; Ratnaparkhi, 1996; Ratnaparkhi, 1999). Similar to Della Pietra, Della Pietra, and Lafferty (1997), Ratnaparkhi motivates log-linear models from the perspective of maximising entropy, subject to certain constraints. Ratnaparkhi models th... |

171 | Recognition and parsing of context-free languages in time n³. Information and Control
- Younger
- 1967
Citation Context ...ng loss in accuracy. Section 10.3 gives results for the speed of the parser. 9.2 Chart parsing algorithm The algorithm used to build the packed charts is the CKY chart parsing algorithm (Kasami 1965; Younger 1967) described in Steedman (2000). The CKY algorithm applies naturally to CCG since the grammar is binary. It builds the chart bottom-up, starting with constituents spanning a single word, incrementally ... |
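
A minimal CKY recognizer over a binary grammar, in the spirit of the algorithm this excerpt describes (the toy CNF grammar below is invented, not CCGbank's):

```python
def cky_recognise(words, lexicon, rules):
    """CKY recognition for a binary grammar (as CCG's combinatory rules
    are): fill the chart bottom-up, combining adjacent spans."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    # Width-1 spans come from the lexicon (the supertagger's role in CCG).
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexicon.get(w, ()))
    # Wider spans combine every split point of every shorter pair.
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for left in chart[i][k]:
                    for right in chart[k][j]:
                        chart[i][j] |= set(rules.get((left, right), ()))
    return chart[0][n]

# Toy CNF grammar: S -> NP VP, VP -> V NP (hypothetical categories).
lexicon = {"John": ["NP"], "saw": ["V"], "Mary": ["NP"]}
rules = {("NP", "VP"): ["S"], ("V", "NP"): ["VP"]}
result = cky_recognise(["John", "saw", "Mary"], lexicon, rules)
```

In the real parser the chart cells are packed, so equivalent analyses of the same span share structure and the chart stays polynomial in sentence length.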

156 | A Maximum Entropy Approach to Named Entity Recognition
- Borthwick
- 1999
Citation Context ...ved: 27 April 2006; Revised submission received: 30 November 2006; Accepted for publication: 16 March 2007. © Association for Computational Linguistics ... (Borthwick 1999), chunking (Koeling 2000) and parsing (Johnson et al. 1999). Log-linear models are also referred to as maximum entropy models and random fields in the NLP literature. They are popular because of the ... |

151 | Incremental parsing with the perceptron algorithm - Collins, Roark - 2004 |

150 | Estimators for stochastic “unification-based” grammars
- Johnson, Geman, et al.
- 1999
Citation Context ...vember 2006; Accepted for publication: 16 March 2007. ... (Borthwick 1999), chunking (Koeling 2000) and parsing (Johnson et al. 1999). Log-linear models are also referred to as maximum entropy models and random fields in the NLP literature. They are popular because of the ease with which complex discriminating features can be incl... |

146 | Stochastic attribute-value grammars - Abney - 1997 |

146 | Supertagging: an approach to almost parsing
- Bangalore, Joshi
- 1999
Citation Context ...ng a number of incorrect but plausible lexical categories for each word in the sentence. Second, it greatly increases the efficiency of the parser, which was the original motivation for supertagging (Bangalore and Joshi 1999). One possible criticism of CCG has been that highly efficient parsing is not possible because of the additional "spurious" derivations. In fact, we show that a novel method which tightly integrates ... |
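
The multi-tagging idea behind the supertagger, assigning every lexical category whose probability is within some factor β of the best one for that word, can be sketched as follows. The per-word category distributions are invented; a real supertagger estimates them with a maximum entropy model over local context.

```python
def multitag(tag_probs, beta=0.1):
    """Keep every category whose probability is at least beta times the
    most probable category for that word: small beta keeps more
    alternatives, large beta keeps the chart small."""
    assigned = []
    for word, probs in tag_probs:
        best = max(probs.values())
        cats = sorted(c for c, p in probs.items() if p >= beta * best)
        assigned.append((word, cats))
    return assigned

# Hypothetical supertagger distributions over CCG lexical categories.
sentence = [
    ("John", {"NP": 0.9, "N": 0.1}),
    ("saw", {"(S\\NP)/NP": 0.6, "S\\NP": 0.3, "N": 0.01}),
]
tags = multitag(sentence, beta=0.1)
```

Keeping a few plausible categories per word is what both supplies incorrect parses for discriminative training and keeps the packed charts small enough to parse efficiently.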

143 | Prepositional Phrase Attachment through a Backed-Off Model
- Collins, Brooks
- 1995
Citation Context ...gramming performed over the chart. Two possible extensions, which we have not investigated, include defining dependency features which account for all three elements of the triple in a PP-attachment (Collins and Brooks 1995), and defining a rule feature which includes the grandparent node (Johnson 1998). Another alternative for future work is to compare the dynamic programming approach taken here with the beam-search ap... |

141 | Max-margin parsing
- Taskar, Klein, et al.
- 2004
Citation Context ...that it is difficult to test different configurations of the system, for example different feature sets. It may also not be possible to train or run the system on anything other than short sentences (Taskar et al. 2004). ... The supertagger is a key component in our parsing system. It reduces the size of the charts considerably compared with naive methods for assigning le... |

137 |
A quasi-arithmetical notation for syntactic description, Language 29
- Bar-Hillel
- 1953
Citation Context ...ive, infinitival and wh-question. This additional information will be described in later sections. Categories are combined in a derivation using combinatory rules. In the original Categorial Grammar (Bar-Hillel 1953), which is context-free, there are two rules of functional application: X/Y Y ⇒ X (>) (3) and Y X\Y ⇒ X (<) (4), where X and Y denote categories (either basic or complex). The first rule is forward ap... |
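
The two application rules quoted above can be implemented naively over string categories (a deliberately crude sketch: real CCG parsers represent categories structurally rather than as strings):

```python
def forward_apply(fn, arg):
    """Forward application (>): X/Y combines with a following Y to give X."""
    if fn.endswith("/" + arg):
        return fn[: -len("/" + arg)]
    return None

def backward_apply(arg, fn):
    """Backward application (<): Y followed by X\\Y gives X."""
    if fn.endswith("\\" + arg):
        return fn[: -len("\\" + arg)]
    return None

# "saw Mary": (S\NP)/NP applied forward to NP yields (S\NP).
vp = forward_apply("(S\\NP)/NP", "NP")
# "John saw Mary": NP combines backward with S\NP to yield S.
# (strip("()") crudely removes the outer brackets of the string category.)
s = backward_apply("NP", vp.strip("()"))
```

Full CCG adds composition and type-raising to these two rules, which is what yields the flexible surface structure and the "spurious" derivational ambiguity discussed elsewhere on this page.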

135 | Parser evaluation: a survey and a new proposal
- Carroll, Briscoe, et al.
- 1998
Citation Context ...-CCG parser, namely the RASP parser – and since we are converting the CCG output into the format used by RASP the CCG parser is not at an unfair advantage. There is also the SUSANNE GR gold standard (Carroll, Briscoe, and Sanfilippo, 1998), on which the B&C annotation is based, but we chose not to use this for evaluation. This earlier GR scheme is less like the dependencies output by the CCG parser, and the comparison would be complic... |

129 | An efficient recognition and syntax-analysis algorithm for context-free languages
- Kasami
- 1965
Citation Context ...y corresponding loss in accuracy. Section 10.3 gives results for the speed of the parser. 9.2 Chart parsing algorithm The algorithm used to build the packed charts is the CKY chart parsing algorithm (Kasami 1965; Younger 1967) described in Steedman (2000). The CKY algorithm applies naturally to CCG since the grammar is binary. It builds the chart bottom-up, starting with constituents spanning a single word, ... |

126 | Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques
- Riezler, King, et al.
- 2002
Citation Context ...the model, and have been shown to give good performance across a range of NLP tasks. Log-linear models have previously been applied to statistical parsing (Johnson et al. 1999; Toutanova et al. 2002; Riezler et al. 2002; Malouf and van Noord 2004) but typically under the assumption that all possible parses for a sentence can be enumerated. For manually constructed grammars, this assumption is usually sufficient for ... |

98 | Parsing algorithms and metrics - Goodman - 1996 |

84 | The PARC 700 Dependency Bank
- King, Crouch, et al.
- 2003
Citation Context ...rature. The parser is evaluated on CCGbank (available through the LDC). In order to facilitate comparisons with parsers using different formalisms, we also evaluate on the publicly available DepBank (King et al. 2003), using the Briscoe and Carroll annotation consistent with the RASP parser (Briscoe, Carroll, and Watson 2006). The dependency annotation is designed to be as theory-neutral as possible to allow easy... |

83 | Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations. 42nd Meeting of the Association for Computational Linguistics - Cahill, Burke, et al. - 2004 |

77 | The Importance of Supertagging for Wide-Coverage CCG Parsing - Clark, Curran - 2004 |

72 | Multi-modal combinatory categorial grammar - Baldridge, Kruijff - 2003 |

71 | Maximum entropy estimation for feature forests - Miyao, Tsujii - 2002 |

69 | Wide coverage parsing with stochastic attribute value grammars - Malouf, van Noord |

68 | Automated Extraction of TAGs from the Penn Treebank
- Chen, Vijay-Shanker
- 2000
Citation Context ...provide. The formalism most closely related to CCG from this list is TAG. TAG grammars have been automatically extracted from the Penn Treebank, using techniques similar to those used by Hockenmaier (Chen and Vijay-Shanker 2000; Xia, Palmer, and Joshi 2000). Also, the supertagging idea which is central to the efficiency of the CCG parser originated with TAG (Bangalore and Joshi 1999). Chen et al. (2002) describe the results... |

59 | The proposition bank: A corpus annotated with semantic roles
- Palmer, Gildea, et al.
- 2005
Citation Context ... (King et al., 2003). Cahill et al. (2004) evaluate an LFG parser, which uses an automatically extracted grammar, against DepBank. Miyao and Tsujii (2004) evaluate their HPSG parser against PropBank (Palmer, Gildea, and Kingsbury, 2005). Kaplan et al. (2004) compare the Collins parser with the Parc LFG parser by mapping Penn Treebank parses into the dependencies of DepBank, claiming that the LFG parser is more accurate with only a ... |

58 | Parsing biomedical literature
- Lease, Charniak
- 2005
Citation Context ...e showing that, perhaps not surprisingly, the performance of parsers trained on the WSJ Penn Treebank drops significantly when the parser is applied to domains outside of newspaper text (Gildea 2001; Lease and Charniak 2005). The difficulty is that developing new treebanks for each of these domains is infeasible. Developing the techniques to extract a CCG grammar from the Penn Treebank, together with the pre-processing ... |

56 | Categorial Grammars
- Wood, M
- 1993
Citation Context ...G will be described in Section 3. 3. Combinatory Categorial Grammar Combinatory Categorial Grammar (CCG) (Steedman 1996, 2000) is a type-driven lexicalised theory of grammar based on categorial grammar (Wood 1993). CCG lexical entries consist of a syntactic category, which defines valency and directionality, and a semantic interpretation. In this paper we are concerned with the syntactic component; see Steedm... |

54 | Cascaded grammatical relation assignment - Buchholz, Veenstra, et al. - 1999 |

53 | Speed and accuracy in shallow and deep stochastic parsing
- Kaplan, Riezler, et al.
- 2004
Citation Context ...istical parsing using linguistically motivated grammar formalisms is large and growing. Statistical parsers have been developed for TAG (Chiang 2000; Sarkar and Joshi 2003), LFG (Riezler et al. 2002; Kaplan et al. 2004; Cahill et al. 2004) and HPSG (Toutanova et al. 2002; Toutanova, Markova, and Manning 2004; Miyao and Tsujii 2004; Malouf and van Noord 2004), among others. The motivation for using these formalisms ... |

52 | Efficient normal-form parsing for Combinatory Categorial Grammar. ACL
- Eisner
- 1996
Citation Context ...ons in CCG complicates the modelling and parsing problems. In this paper we consider two solutions. The first, following Hockenmaier (2003a), is to define a model in terms of normal-form derivations (Eisner 1996). In this approach we recover only one derivation leading to a given set of predicate-argument dependencies and ignore the rest. The second approach is to define a model over the predicate-argument d... |

50 | A classifier-based parser with linear run-time complexity - Sagae, Lavie - 2005 |

46 | Investigating GIS and smoothing for maximum entropy taggers
- Curran, Clark
- 2003
Citation Context ...om unlabelled data (Lari and Young 1990). Generalised Iterative Scaling (Darroch and Ratcliff 1972) is a common choice in the NLP literature for estimating a log-linear model, e.g. (Ratnaparkhi 1998; Curran and Clark 2003). Initially we used GIS for the parsing models described here, but found that convergence was extremely slow; Sha and Pereira (2003) present a similar finding for globally optimised log-linear models... |