Results 1 - 10 of 23
Identifying gene and protein mentions in text using conditional random fields
- BMC Bioinformatics
Cited by 56 (5 self)
Applying information extraction techniques in the biological domain has been a growing research area over the past few years. Numerous large scale corpora have been developed [10] or are being developed [4]
Dependency parsing and domain adaptation with LR models and parser ensembles
- In Proceedings of the Eleventh Conference on Computational Natural Language Learning, 2007
Cited by 29 (5 self)
We present a data-driven variant of the LR algorithm for dependency parsing, and extend it with a best-first search for probabilistic generalized LR dependency parsing. Parser actions are determined by a classifier, based on features that represent the current state of the parser. We apply this parsing framework to both tracks of the CoNLL 2007 shared task, in each case taking advantage of multiple models trained with different learners. In the multilingual track, we train three LR models for each of the ten languages, and combine the analyses obtained with each individual model with a maximum spanning tree voting scheme. In the domain adaptation track, we use two models to parse unlabeled data in the target domain to supplement the labeled out-of-domain training set, in a scheme similar to one iteration of co-training.
Corpus Design For Biomedical Natural Language Processing
- In Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases, 2005
Cited by 12 (0 self)
This paper classifies six publicly available biomedical corpora according to various corpus design features and characteristics.
Annotation of chemical named entities
- In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2007
Cited by 7 (1 self)
We describe the annotation of chemical named entities in scientific text. A set of annotation guidelines defines 5 types of named entities, and provides instructions for the resolution of special cases. A corpus of full-text chemistry papers was annotated, with an inter-annotator agreement score of 93%. An investigation of named entity recognition using LingPipe suggests that scores of 63% are possible without customisation, and scores of 74% are possible with the addition of custom tokenisation and the use of dictionaries.
Creating Robust Supervised Classifiers via Web-Scale N-gram Data
Cited by 5 (3 self)
In this paper, we systematically assess the value of using web-scale N-gram data in state-of-the-art supervised NLP classifiers. We compare classifiers that include or exclude features for the counts of various N-grams, where the counts are obtained from a web-scale auxiliary corpus. We show that including N-gram count features can advance the state-of-the-art accuracy on standard data sets for adjective ordering, spelling correction, noun compound bracketing, and verb part-of-speech disambiguation. More importantly, when operating on new domains, or when labeled training data is not plentiful, we show that using web-scale N-gram features is essential for achieving robust performance.
Simultaneous Identification of Biomedical Named-Entity and Functional Relation Using Statistical Parsing Techniques
- In NAACL-HLT 2007 (short
, 2007
Cited by 3 (2 self)
In this paper we propose a statistical parsing technique that simultaneously identifies biomedical named-entities (NEs) and extracts subcellular localization relations for bacterial proteins from the text in MEDLINE articles. We build a parser that derives both syntactic and domain-dependent semantic information and achieves an F-score of 48.4% for the relation extraction task. We then propose a semi-supervised approach that incorporates noisy automatically labeled data to improve the F-score of our parser to 83.2%. Our key contributions are: learning from noisy data, and building an annotated corpus that can benefit relation extraction research.
The ITI TXM Corpora: Tissue Expressions and Protein-Protein Interactions
Cited by 3 (1 self)
We report on two large corpora of semantically annotated full-text biomedical research papers created in order to develop information extraction (IE) tools for the TXM project. Both corpora have been annotated with a range of entities (CellLine, Complex, Developmental-
Ambiguous Part-of-Speech Tagging for Improving Accuracy and Domain Portability of Syntactic Parsers
- Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, 2007
Cited by 3 (0 self)
We aim to improve the performance of a syntactic parser that uses a part-of-speech (POS) tagger as a preprocessor. Pipelined parsers consisting of POS taggers and syntactic parsers have several advantages, such as the capability of domain adaptation. However, the performance of such systems on raw texts tends to be disappointing as they are affected by the errors of automatic POS tagging. We attempt to compensate for the decrease in accuracy caused by automatic taggers by allowing the taggers to output multiple answers when the tags cannot be determined reliably enough. We empirically verify the effectiveness of the method using an HPSG parser trained on the Penn Treebank. Our results show that ambiguous POS tagging improves parsing if outputs of taggers are weighted by probability values, and the results support previous studies with similar intentions. We also examine the effectiveness of our method for adapting the parser to the GENIA corpus and show that the use of ambiguous POS taggers can help development of portable parsers while keeping accuracy high.
Building Domain-Specific Taggers without Annotated (Domain
- Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007
Cited by 3 (1 self)
Part-of-speech tagging is a fundamental component in many NLP systems. When taggers developed in one domain are used in another domain, the performance can degrade considerably. We present a method for developing taggers for new domains without requiring POS-annotated text in the new domain. Our method involves using raw domain text and identifying related words to form a domain-specific lexicon. This lexicon provides the initial lexical probabilities for EM training of an HMM model. We evaluate the method by applying it in the Biology domain and show that we achieve results that are comparable with some taggers developed for this domain.
Structural correspondence learning for dependency parsing
- In Proc
, 2007
Cited by 2 (0 self)
Following (Blitzer et al., 2006), we present an application of structural correspondence learning to non-projective dependency parsing (McDonald et al., 2005). To induce the correspondences among dependency edges from different domains, we looked at every two tokens in a sentence and examined whether or not there is a preposition, a determiner or a helping verb between them. Three binary linear classifiers were trained to predict the existence of a preposition, etc., on unlabeled data and we used singular value decomposition to induce new features. During the training, the parser was trained with these additional features in addition to those described in (McDonald et al., 2005). We discriminatively trained our parser in an on-line fashion using a variant of the voted perceptron (Collins, 2002;

