Results 1 - 10 of 17
Wide-coverage efficient statistical parsing with CCG and log-linear models
Computational Linguistics, 2007
Cited by 87 (20 self)
This paper describes a number of log-linear parsing models for an automatically extracted lexicalized grammar. The models are "full" parsing models in the sense that probabilities are defined for complete parses, rather than for independent events derived by decomposing the parse tree. Discriminative training is used to estimate the models, which requires incorrect parses for each sentence in the training data as well as the correct parse. The lexicalized grammar formalism used is Combinatory Categorial Grammar (CCG), and the grammar is automatically extracted from CCGbank, a CCG version of the Penn Treebank. The combination of discriminative training and an automatically extracted grammar leads to a significant memory requirement (over 20 GB), which is satisfied using a parallel implementation of the BFGS optimisation algorithm running on a Beowulf cluster. Dynamic programming over a packed chart, in combination with the parallel implementation, allows us to solve one of the largest-scale estimation problems in the statistical parsing literature in under three hours. A key component of the parsing system, for both training and testing, is a Maximum Entropy supertagger which assigns CCG lexical categories to words in a sentence. The supertagger makes the discriminative training feasible, and also leads to a highly efficient parser. Surprisingly,
Linguistically motivated large-scale NLP with C&C and Boxer
In Proceedings of the Demonstrations Session of the 45th Annual Meeting of the Association for Computational Linguistics (ACL-07), 2007
Cited by 20 (1 self)
The statistical modelling of language, together with advances in wide-coverage grammar development, have led to high levels of robustness and efficiency in NLP systems and made linguistically motivated
Parsing noun phrase structure with CCG
In Proc. ACL-08: HLT, 2008
Cited by 9 (1 self)
Statistical parsing of noun phrase (NP) structure has been hampered by a lack of gold-standard data. This is a significant problem for CCGbank, where binary branching NP derivations are often incorrect, a result of the automatic conversion from the Penn Treebank. We correct these errors in CCGbank using a gold-standard corpus of NP structure, resulting in a much more accurate corpus. We also implement novel NER features that generalise the lexical information needed to parse NPs and provide important semantic information. Finally, evaluating against DepBank demonstrates the effectiveness of our modified corpus and novel features, with an increase in parser performance of 1.51%.
Task-oriented Evaluation of Syntactic Parsers and Their Representations
Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2008
Cited by 7 (2 self)
This paper presents a comparative evaluation of several state-of-the-art English parsers based on different frameworks. Our approach is to measure the impact of each parser when it is used as a component of an information extraction system that performs protein-protein interaction (PPI) identification in biomedical papers. We evaluate eight parsers (based on dependency parsing, phrase structure parsing, or deep parsing) using five different parse representations. We run a PPI system with several combinations of parser and parse representation, and examine their impact on PPI identification accuracy. Our experiments show that the levels of accuracy obtained with these different parsers are similar, but that accuracy improvements vary when the parsers are retrained with domain-specific data.
Which Are the Best Features for Automatic Verb Classification
In Proc. of ACL, 2008
Cited by 7 (1 self)
In this work, we develop and evaluate a wide range of feature spaces for deriving Levin-style verb classifications (Levin, 1993). We perform the classification experiments using
Adapting a lexicalized-grammar parser to contrasting domains
2008
Cited by 7 (2 self)
Most state-of-the-art wide-coverage parsers are trained on newspaper text and suffer a loss of accuracy in other domains, making parser adaptation a pressing issue. In this paper we demonstrate that a CCG parser can be adapted to two new domains, biomedical text and questions for a QA system, by using manually-annotated training data at the POS and lexical category levels only. This approach achieves parser accuracy comparable to that on newspaper data without the need for annotated parse trees in the new domain. We find that retraining at the lexical category level yields a larger performance increase for questions than for biomedical text and analyze the two datasets to investigate why different domains might behave differently for parser adaptation.
Challenges in Mapping of Syntactic Representations for Framework-Independent Parser Evaluation
2008
Cited by 5 (3 self)
We explore some of the issues and challenges created by the incompatibility of diverse representation schemes for syntactic parsing. In particular, we examine the problem of output format conversion for evaluation of parsers that use different formalisms. We discuss recent related efforts, and present an evaluation of different parsers that use representations that vary not only in formalisms, but also in depth of syntactic information. We attempt to compare these parsers in a domain widely used for parser evaluation, the Wall Street Journal section of the Penn Treebank, and in the academic biomedical literature, where the use of parsing technologies is expected to contribute in practical applications, such as information extraction and text mining.
Perceptron training for a wide-coverage lexicalized-grammar parser
In Proceedings of the ACL Workshop on Deep Linguistic Processing, 2007
Cited by 4 (2 self)
This paper investigates perceptron training for a wide-coverage CCG parser and compares the perceptron with a log-linear model. The CCG parser uses a phrase-structure parsing model and dynamic programming in the form of the Viterbi algorithm to find the highest scoring derivation. The difficulty in using the perceptron for a phrase-structure parsing model is the need for an efficient decoder. We exploit the lexicalized nature of CCG by using a finite-state supertagger to do much of the parsing work, resulting in a highly efficient decoder. The perceptron performs as well as the log-linear model; it trains in a few hours on a single machine; and it requires only a few hundred MB of RAM for practical training compared to 20 GB for the log-linear model. We also investigate the order in which the training examples are presented to the online perceptron learner, and find that order does not significantly affect the results.
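The perceptron training described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the decoder here takes an argmax over a short explicit list of candidate analyses, standing in for Viterbi search over a packed chart, and the feature names and toy data are hypothetical.

```python
from collections import Counter


def score(weights, features):
    """Linear score: dot product of the weight vector and a feature vector."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def decode(weights, candidates):
    """Return the index of the highest-scoring candidate.

    Assumption: candidates are enumerated explicitly. A real parser would
    search a packed chart with dynamic programming instead.
    """
    return max(range(len(candidates)), key=lambda i: score(weights, candidates[i]))


def train_perceptron(examples, epochs=5):
    """Online perceptron: on each mistake, add the gold features and
    subtract the features of the wrongly predicted candidate."""
    weights = Counter()
    for _ in range(epochs):
        for candidates, gold_index in examples:
            predicted = decode(weights, candidates)
            if predicted != gold_index:
                for name, value in candidates[gold_index].items():
                    weights[name] += value
                for name, value in candidates[predicted].items():
                    weights[name] -= value
    return weights


# Hypothetical toy data: each example is (candidate feature vectors, gold index).
examples = [
    ([{"rule:fwd_app": 1.0, "dep:saw-man": 1.0},
      {"rule:bwd_app": 1.0, "dep:saw-telescope": 1.0}], 1),
    ([{"rule:fwd_app": 1.0, "dep:ate-pizza": 1.0},
      {"rule:bwd_app": 1.0, "dep:ate-fork": 1.0}], 0),
]

weights = train_perceptron(examples)
```

Only the current example and the weight vector need to be held in memory at each update, which is consistent with the abstract's point that online training needs far less RAM than batch estimation of the log-linear model.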
Accurate Conversion of Dependency Parses: Targeting the Stanford Scheme
Cited by 1 (0 self)
We present a conversion from the dependency scheme employed by the Pro3Gres parser to the Stanford scheme, as a further step towards unification of dependency schemes. An evaluation of the conversion shows that it is highly reliable, resulting in less than one percentage point performance penalty on the actual parser output. This supports the suitability of the Stanford scheme as a unifying representation and the applicability of our conversion formalism to parser scheme conversions. We further provide an evaluation of the Pro3Gres parser, thus adding it to the growing set of parsers evaluated under comparable conditions using the Stanford scheme.
Parsing Noun Phrases in the Penn Treebank
Cited by 1 (0 self)
Noun phrases (NPs) are a crucial part of natural language, and can have a very complex structure. However, this NP structure is largely ignored by the statistical parsing field, as the most widely used corpus is not annotated with it. This lack of gold-standard data has restricted previous efforts to parse NPs, making it impossible to perform the supervised experiments that have achieved high performance in so many Natural Language Processing (NLP) tasks. We comprehensively solve this problem by manually annotating NP structure for the entire Wall Street Journal section of the Penn Treebank. The inter-annotator agreement scores that we attain dispel the belief that the task is too difficult, and demonstrate that consistent NP annotation is possible. Our gold-standard NP data is now available for use in all parsers. We experiment with this new data, applying the Collins (2003) parsing model, and find that its recovery of NP structure is significantly worse than its overall performance. The parser’s F-score is up to 5.69% lower than a baseline that uses deterministic rules. Through much experimentation, we determine that this result is primarily caused by a lack of lexical information. To solve this problem we construct a wide-coverage, large-scale NP Bracketing system. With our Penn Treebank data set, which is orders of magnitude larger than those used previously, we build a supervised model that achieves excellent results. Our model performs at 93.8% F-score on the simple NP task that most previous work has undertaken, and extends to bracket longer, more complex NPs that are rarely dealt with in the literature. We attain 89.14% F-score on this much more difficult task. Finally, we implement a post-processing module that brackets NPs identified by the Bikel (2004) parser. Our NP Bracketing model includes a wide variety of features that provide the lexical information that was missing during the parser experiments, and as a result, we outperform the parser’s F-score by 9.04%. These experiments demonstrate the utility of the corpus, and show that many NLP applications can now make use of NP structure.

