Results 1 - 10
of
30
Domain adaptation with structural correspondence learning
- In EMNLP
, 2006
"... Discriminative learning methods are widely used in natural language processing. These methods work best when their training and test data are drawn from the same distribution. For many NLP tasks, however, we are confronted with new domains in which labeled data is scarce or non-existent. In such cas ..."
Abstract
-
Cited by 91 (9 self)
- Add to MetaCart
Discriminative learning methods are widely used in natural language processing. These methods work best when their training and test data are drawn from the same distribution. For many NLP tasks, however, we are confronted with new domains in which labeled data is scarce or non-existent. In such cases, we seek to adapt existing models from a resourcerich source domain to a resource-poor target domain. We introduce structural correspondence learning to automatically induce correspondences among features from different domains. We test our technique on part of speech tagging and show performance gains for varying amounts of source and target training data, as well as improvements in target domain parsing accuracy using our improved tagger. 1
Wide-coverage efficient statistical parsing with CCG and log-linear models
- COMPUTATIONAL LINGUISTICS
, 2007
"... This paper describes a number of log-linear parsing models for an automatically extracted lexicalized grammar. The models are "full" parsing models in the sense that probabilities are defined for complete parses, rather than for independent events derived by decomposing the parse tree. Discriminativ ..."
Abstract
-
Cited by 87 (20 self)
- Add to MetaCart
This paper describes a number of log-linear parsing models for an automatically extracted lexicalized grammar. The models are "full" parsing models in the sense that probabilities are defined for complete parses, rather than for independent events derived by decomposing the parse tree. Discriminative training is used to estimate the models, which requires incorrect parses for each sentence in the training data as well as the correct parse. The lexicalized grammar formalism used is Combinatory Categorial Grammar (CCG), and the grammar is automatically extracted from CCGbank, a CCG version of the Penn Treebank. The combination of discriminative training and an automatically extracted grammar leads to a significant memory requirement (over 20 GB), which is satisfied using a parallel implementation of the BFGS optimisation algorithm running on a Beowulf cluster. Dynamic programming over a packed chart, in combination with the parallel implementation, allows us to solve one of the largest-scale estimation problems in the statistical parsing literature in under three hours. A key component of the parsing system, for both training and testing, is a Maximum Entropy supertagger which assigns CCG lexical categories to words in a sentence. The supertagger makes the discriminative training feasible, and also leads to a highly efficient parser. Surprisingly,
2006b. Reranking and self-training for parser adaptation
- ACL-COLING
"... Statistical parsers trained and tested on the Penn Wall Street Journal (WSJ) treebank have shown vast improvements over the last 10 years. Much of this improvement, however, is based upon an ever-increasing number of features to be trained on (typically) the WSJ treebank data. This has led to concer ..."
Abstract
-
Cited by 42 (0 self)
- Add to MetaCart
Statistical parsers trained and tested on the Penn Wall Street Journal (WSJ) treebank have shown vast improvements over the last 10 years. Much of this improvement, however, is based upon an ever-increasing number of features to be trained on (typically) the WSJ treebank data. This has led to concern that such parsers may be too finely tuned to this corpus at the expense of portability to other genres. Such worries have merit. The standard “Charniak parser ” checks in at a labeled precisionrecall f-measure of 89.7 % on the Penn WSJ test set, but only 82.9 % on the test set from the Brown treebank corpus. This paper should allay these fears. In particular, we show that the reranking parser described in Charniak and Johnson (2005) improves performance of the parser on Brown to 85.2%. Furthermore, use of the self-training techniques described in (Mc-Closky et al., 2006) raise this to 87.8% (an error reduction of 28%) again without any use of labeled Brown data. This is remarkable since training the parser and reranker on labeled Brown data achieves only 88.4%. 1
A Graph Kernel for Protein-Protein Interaction Extraction
"... In this paper, we propose a graph kernel based approach for the automated extraction of protein-protein interactions (PPI) from scientific literature. In contrast to earlier approaches to PPI extraction, the introduced alldependency-paths kernel has the capability to consider full, general dependenc ..."
Abstract
-
Cited by 30 (5 self)
- Add to MetaCart
In this paper, we propose a graph kernel based approach for the automated extraction of protein-protein interactions (PPI) from scientific literature. In contrast to earlier approaches to PPI extraction, the introduced alldependency-paths kernel has the capability to consider full, general dependency graphs. We evaluate the proposed method across five publicly available PPI corpora providing the most comprehensive evaluation done for a machine learning based PPI-extraction system. Our method is shown to achieve state-of-theart performance with respect to comparable evaluations, achieving 56.4 F-score and 84.8 AUC on the AImed corpus. Further, we identify several pitfalls that can make evaluations of PPI-extraction systems incomparable, or even invalid. These include incorrect crossvalidation strategies and problems related to comparing F-score results achieved on different evaluation resources. 1
The Stanford typed dependencies representation
"... This paper examines the Stanford typed dependencies representation, which was designed to provide a straightforward description of grammatical relations for any user who could benefit from automatic text understanding. For such purposes, we argue that dependency schemes must follow a simple design a ..."
Abstract
-
Cited by 15 (1 self)
- Add to MetaCart
This paper examines the Stanford typed dependencies representation, which was designed to provide a straightforward description of grammatical relations for any user who could benefit from automatic text understanding. For such purposes, we argue that dependency schemes must follow a simple design and provide semantically contentful information, as well as offer an automatic procedure to extract the relations. We consider the underlying design principles of the Stanford scheme from this perspective, and compare it to the GR and PARC representations. Finally, we address the question of the suitability of the Stanford scheme for parser evaluation. 1
SelfTraining for Biomedical Parsing
- ACL
, 2008
"... Parser self-training is the technique of taking an existing parser, parsing extra data and then creating a second parser by treating the extra data as further training data. Here we apply this technique to parser adaptation. In particular, we self-train the standard Charniak/Johnson Penn-Treebank pa ..."
Abstract
-
Cited by 15 (0 self)
- Add to MetaCart
Parser self-training is the technique of taking an existing parser, parsing extra data and then creating a second parser by treating the extra data as further training data. Here we apply this technique to parser adaptation. In particular, we self-train the standard Charniak/Johnson Penn-Treebank parser using unlabeled biomedical abstracts. This achieves an f-score of 84.3 % on a standard test set of biomedical abstracts from the Genia corpus. This is a 20 % error reduction over the best previous result on biomedical data (80.2 % on the same test set). 1
Evaluating Impact of Re-training a Lexical Disambiguation Model on Domain Adaptation of an HPSG Parser
- In the Proceedings of IWPT 2007
, 2007
"... This paper describes an effective approach to adapting an HPSG parser trained on the Penn Treebank to a biomedical domain. In this approach, we train probabilities of lexical entry assignments to words in a target domain and then incorporate them into the original parser. Experimental results show t ..."
Abstract
-
Cited by 8 (4 self)
- Add to MetaCart
This paper describes an effective approach to adapting an HPSG parser trained on the Penn Treebank to a biomedical domain. In this approach, we train probabilities of lexical entry assignments to words in a target domain and then incorporate them into the original parser. Experimental results show that this method can obtain higher parsing accuracy than previous work on domain adaptation for parsing the same data. Moreover, the results show that the combination of the proposed method and the existing method achieves parsing accuracy that is as high as that of an HPSG parser retrained from scratch, but with much lower training cost. We also evaluated our method in the Brown corpus to show the portability of our approach in another domain. 1
On the unification of syntactic annotations under the Stanford dependency scheme: A case study on BioInfer and GENIA
- In Proc. BioNLP 2007
"... Several incompatible syntactic annotation schemes are currently used by parsers and corpora in biomedical information extraction. The recently introduced Stanford dependency scheme has been suggested to be a suitable unifying syntax formalism. In this paper, we present a step towards such unificatio ..."
Abstract
-
Cited by 8 (1 self)
- Add to MetaCart
Several incompatible syntactic annotation schemes are currently used by parsers and corpora in biomedical information extraction. The recently introduced Stanford dependency scheme has been suggested to be a suitable unifying syntax formalism. In this paper, we present a step towards such unification by creating a conversion from the Link Grammar to the Stanford scheme. Further, we create a version of the BioInfer corpus with syntactic annotation in this scheme. We present an application-oriented evaluation of the transformation and assess the suitability of the scheme and our conversion to the unification of the syntactic annotations of BioInfer and the GENIA Treebank. We find that a highly reliable conversion is both feasible to create and practical, increasing the applicability of both the parser and the corpus to information extraction. 1
Adapting a lexicalized-grammar parser to contrasting domains
, 2008
"... Most state-of-the-art wide-coverage parsers are trained on newspaper text and suffer a loss of accuracy in other domains, making parser adaptation a pressing issue. In this paper we demonstrate that a CCG parser can be adapted to two new domains, biomedical text and questions for a QA system, by usi ..."
Abstract
-
Cited by 7 (2 self)
- Add to MetaCart
Most state-of-the-art wide-coverage parsers are trained on newspaper text and suffer a loss of accuracy in other domains, making parser adaptation a pressing issue. In this paper we demonstrate that a CCG parser can be adapted to two new domains, biomedical text and questions for a QA system, by using manually-annotated training data at the POS and lexical category levels only. This approach achieves parser accuracy comparable to that on newspaper data without the need for annotated parse trees in the new domain. We find that retraining at the lexical category level yields a larger performance increase for questions than for biomedical text and analyze the two datasets to investigate why different domains might behave differently for parser adaptation. 1
Domain Adaptation of Natural Language Processing Systems
, 2007
"... My first thanks must go to Fernando Pereira. He was a wonderful advisor, and every aspect of this thesis has benefitted from his insight. At times I was a difficult, even unruly graduate student, and Fernando had patience with all my ideas, whether good or bad. What I’ll miss most, though, is the qu ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
My first thanks must go to Fernando Pereira. He was a wonderful advisor, and every aspect of this thesis has benefitted from his insight. At times I was a difficult, even unruly graduate student, and Fernando had patience with all my ideas, whether good or bad. What I’ll miss most, though, is the quick trip to Fernando’s office, coming away with new insights on everything from numerical underflow to the state of the academic community in machine learning and NLP. In addition to Fernando, this thesis was shaped by a great committee. Having Ben Taskar as committee chairman has given me the perfect excuse to interrupt his workday with new, ostensibly-thesis-related machine learning ideas. Mark Liberman and Mitch Marcus brought a much-needed linguistic perspective to a thesis on language, and many of the techniques described are based on work by Tong Zhang, who kindly served as my external committee member. Although he didn’t directly serve on my committee, Shai Ben-David got me started on the theoretical aspects of this work, and chapter 4 grew out of work I co-authored with him. I was also fortunate to have a great academic family. With brothers (and one sister!)

