Parsing noun phrase structure with CCG
In Proc. ACL-08:HLT, 2008
Cited by 9 (1 self)
Statistical parsing of noun phrase (NP) structure has been hampered by a lack of gold-standard data. This is a significant problem for CCGbank, where binary branching NP derivations are often incorrect, a result of the automatic conversion from the Penn Treebank. We correct these errors in CCGbank using a gold-standard corpus of NP structure, resulting in a much more accurate corpus. We also implement novel NER features that generalise the lexical information needed to parse NPs and provide important semantic information. Finally, evaluating against DepBank demonstrates the effectiveness of our modified corpus and novel features, with an increase in parser performance of 1.51%.
Profiting from Mark-Up: Hyper-Text Annotations for Guided Parsing
Cited by 8 (4 self)
We show how web mark-up can be used to improve unsupervised dependency parsing. Starting from raw bracketings of four common HTML tags (anchors, bold, italics and underlines), we refine approximate partial phrase boundaries to yield accurate parsing constraints. Conversion procedures fall out of our linguistic analysis of a newly available million-word hyper-text corpus. We demonstrate that derived constraints aid grammar induction by training Klein and Manning’s Dependency Model with Valence (DMV) on this data set: parsing accuracy on Section 23 (all sentences) of the Wall Street Journal corpus jumps to 50.4%, beating previous state-of-the-art by more than 5%. Web-scale experiments show that the DMV, perhaps because it is unlexicalized, does not benefit from orders of magnitude more annotated but noisier data. Our model, trained on a single blog, generalizes to 53.3% accuracy out-of-domain, against the Brown corpus, nearly 10% higher than the previous published best. The fact that web mark-up strongly correlates with syntactic structure may have broad applicability in NLP.
Creating Robust Supervised Classifiers via Web-Scale N-gram Data
Cited by 5 (3 self)
In this paper, we systematically assess the value of using web-scale N-gram data in state-of-the-art supervised NLP classifiers. We compare classifiers that include or exclude features for the counts of various N-grams, where the counts are obtained from a web-scale auxiliary corpus. We show that including N-gram count features can advance the state-of-the-art accuracy on standard data sets for adjective ordering, spelling correction, noun compound bracketing, and verb part-of-speech disambiguation. More importantly, when operating on new domains, or when labeled training data is not plentiful, we show that using web-scale N-gram features is essential for achieving robust performance.
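To make the idea of N-gram count features concrete for one of the tasks named above (noun compound bracketing), here is a minimal sketch. It is not the paper's classifier: the counts in `TOY_COUNTS`, the feature names, and the `bracket` rule are all hypothetical, and the rule shown is only the simple adjacency heuristic (compare the two adjacent bigram counts), whereas a supervised classifier would learn weights over the features.

```python
import math
from typing import Dict, Tuple

Bigram = Tuple[str, str]

# Hypothetical bigram counts standing in for a web-scale N-gram corpus.
TOY_COUNTS: Dict[Bigram, int] = {
    ("liver", "cell"): 1200,
    ("cell", "antibody"): 300,
    ("liver", "antibody"): 40,
}


def count_features(w1: str, w2: str, w3: str,
                   counts: Dict[Bigram, int]) -> Dict[str, float]:
    """Log-count features for the three word pairs of a three-word compound.

    Add-one smoothing keeps the logarithm defined for unseen pairs.
    """
    pairs = {"w1_w2": (w1, w2), "w2_w3": (w2, w3), "w1_w3": (w1, w3)}
    return {name: math.log(counts.get(pair, 0) + 1)
            for name, pair in pairs.items()}


def bracket(w1: str, w2: str, w3: str, counts: Dict[Bigram, int]) -> str:
    """Adjacency heuristic: attach w2 to whichever neighbour it co-occurs with more."""
    feats = count_features(w1, w2, w3, counts)
    if feats["w1_w2"] >= feats["w2_w3"]:
        return f"(({w1} {w2}) {w3})"   # left-branching
    return f"({w1} ({w2} {w3}))"       # right-branching


if __name__ == "__main__":
    print(bracket("liver", "cell", "antibody", TOY_COUNTS))
```

With the toy counts above, "liver cell" is more frequent than "cell antibody", so the heuristic picks the left-branching analysis. In a supervised setting the dictionary returned by `count_features` would be passed to the classifier alongside its other features.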
LTAG-spinal and the Treebank: a new resource for incremental, dependency and semantic parsing
2007
Cited by 3 (1 self)
We introduce LTAG-spinal, a novel variant of traditional Lexicalized Tree Adjoining Grammar (LTAG) with desirable linguistic, computational and statistical properties. Unlike in traditional LTAG, subcategorization frames and the argument-adjunct distinction are left underspecified in LTAG-spinal. LTAG-spinal with adjunction constraints is weakly equivalent to LTAG. The LTAG-spinal formalism is used to extract an LTAG-spinal Treebank from the Penn Treebank with Propbank annotation. Based on Propbank annotation, predicate coordination and LTAG adjunction structures are successfully extracted. The LTAG-spinal Treebank makes explicit semantic relations that are implicit or absent from the original PTB. LTAG-spinal provides a very desirable resource for statistical LTAG parsing, incremental parsing, dependency parsing, and semantic parsing. This treebank has been successfully used to
Quadratic-Time Dependency Parsing for Machine Translation
Cited by 3 (1 self)
Efficiency is a prime concern in syntactic MT decoding, yet significant developments in statistical parsing with respect to asymptotic efficiency haven’t yet been explored in MT. Recently, McDonald et al. (2005b) formalized dependency parsing as a maximum spanning tree (MST) problem, which can be solved in quadratic time relative to the length of the sentence. They show that MST parsing is almost as accurate as cubic-time dependency parsing in the case of English, and that it is more accurate with free word order languages. This paper applies MST parsing to MT, and describes how it can be integrated into a phrase-based decoder to compute dependency language model scores. Our results show that augmenting a state-of-the-art phrase-based system with this dependency language model leads to significant improvements in TER (0.92%) and BLEU (0.45%) scores on five NIST Chinese-English evaluation test sets.
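As a rough illustration of where the quadratic cost comes from, the sketch below shows only the first step of a Chu-Liu-Edmonds-style MST search: every word greedily picks its highest-scoring head (an O(n^2) scan over a score matrix), and the result is checked for cycles. A full MST parser would go on to contract any cycle and repeat; that part is omitted here. The function names and the score matrix are hypothetical and are not taken from the paper or from McDonald et al.'s implementation.

```python
from typing import List, Optional


def greedy_heads(scores: List[List[float]]) -> List[int]:
    """Pick the best head for each word independently.

    scores[h][m] is the score of the arc head h -> modifier m.
    Index 0 is an artificial ROOT, which never receives a head (-1).
    """
    n = len(scores)
    heads = [-1] * n
    for m in range(1, n):
        heads[m] = max((h for h in range(n) if h != m),
                       key=lambda h: scores[h][m])
    return heads


def find_cycle(heads: List[int]) -> Optional[List[int]]:
    """Return the nodes of one cycle, or None if every word reaches ROOT."""
    n = len(heads)
    state = [0] * n          # 0 = unvisited, 1 = on current path, 2 = finished
    state[0] = 2             # ROOT terminates every acyclic path
    for start in range(1, n):
        path = []
        node = start
        while state[node] == 0:
            state[node] = 1
            path.append(node)
            node = heads[node]
        if state[node] == 1:                 # walked back onto the current path
            return path[path.index(node):]
        for visited in path:
            state[visited] = 2
    return None


if __name__ == "__main__":
    # Hypothetical arc scores for ROOT(0) John(1) saw(2) Mary(3).
    scores = [
        [0.0, 5.0, 10.0, 5.0],
        [0.0, 0.0, 3.0, 1.0],
        [0.0, 9.0, 0.0, 8.0],
        [0.0, 2.0, 4.0, 0.0],
    ]
    heads = greedy_heads(scores)
    print(heads, find_cycle(heads))
```

For these toy scores the greedy choice is already a tree ("saw" attaches to ROOT, "John" and "Mary" attach to "saw"), so no contraction step would be needed.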
Web-Scale Features for Full-Scale Parsing
Cited by 3 (1 self)
Counts from large corpora (like the web) can be powerful syntactic cues. Past work has used web counts to help resolve isolated ambiguities, such as binary noun-verb PP attachments and noun compound bracketings. In this work, we first present a method for generating web count features that address the full range of syntactic attachments. These features encode both surface evidence of lexical affinities as well as paraphrase-based cues to syntactic structure. We then integrate our features into full-scale dependency and constituent parsers. We show relative error reductions of 7.0% over the second-order dependency parser of McDonald and Pereira (2006), 9.2% over the constituent parser of Petrov et al. (2006), and 3.4% over a non-local constituent reranker.
An English Dependency Treebank à la Tesnière
Cited by 2 (0 self)
During the last decade, the Computational Linguistics community has shown an increased interest in Dependency Treebanks. Several groups have developed new annotated corpora using dependency representation, while others have proposed automatic conversion algorithms to transform available Phrase Structure (PS) treebanks into Dependency Structure (DS) notation. Such projects typically refer to Tesnière as the father of dependency syntax, but little attempt has been made to explain how the chosen representation relates to the original work. A careful comparison reveals substantial differences: modern DS annotations discard some relevant features characterizing Tesnière’s model. This paper presents our attempt to go back to the roots of dependency theory, and shows how it is possible to transform a PS English treebank to a DS notation that is closer to the one proposed by Tesnière, which we will refer to as TDS. We will show how this representation can incorporate all main advantages of modern DS, while avoiding well-known problems concerning the choice of heads, and better representing common linguistic phenomena such as coordination.
Ad Hoc Treebank Structures
In Proceedings of ACL-08, 2008
Cited by 2 (1 self)
We outline the problem of ad hoc rules in treebanks, rules used for specific constructions in one data set and unlikely to be used again. These include ungeneralizable rules, erroneous rules, rules for ungrammatical text, and rules which are not consistent with the rest of the annotation scheme. Based on a simple notion of rule equivalence and on the idea of finding rules unlike any others, we develop two methods for detecting ad hoc rules in flat treebanks and show they are successful in detecting such rules. This is done by examining evidence across the grammar and without making any reference to context.
Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation
Cited by 2 (0 self)
Resolving coordination ambiguity is a classic hard problem. This paper looks at coordination disambiguation in complex noun phrases (NPs). Parsers trained on the Penn Treebank are reporting impressive numbers these days, but they don’t do very well on this problem (79%). We explore systems trained using three types of corpora: (1) annotated (e.g. the Penn Treebank), (2) bitexts (e.g. Europarl), and (3) unannotated monolingual (e.g. Google N-grams). Size matters: (1) is a million words, (2) is potentially billions of words and (3) is potentially trillions of words. The unannotated monolingual data is helpful when the ambiguity can be resolved through associations among the lexical items. The bilingual data is helpful when the ambiguity can be resolved by the order of words in the translation. We train separate classifiers with monolingual and bilingual features and iteratively improve them via co-training. The co-trained classifier achieves close to 96% accuracy on Treebank data and makes 20% fewer errors than a supervised system trained with Treebank annotations.
Parsing Internal Noun Phrase Structure with Collins’ Models
2007
Cited by 1 (0 self)
Collins’ widely-used parsing models treat noun phrases (NPs) in a different manner to other constituents. We investigate these differences, using the recently released internal NP bracketing data (Vadas and Curran, 2007a). Altering the structure of the Treebank, as this data does, has a number of consequences, as parsers built using Collins’ models assume that their training and test data will have structure similar to the Penn Treebank’s. Our results demonstrate that it is difficult for Collins’ models to adapt to this new NP structure, and that parsers using these models make mistakes as a result. This emphasises how important treebank structure itself is, and the large amount of influence it can have.

