Evaluation and Extension of Maximum Entropy Models with Inequality Constraints
2003
Cited by 20 (0 self)
A maximum entropy (ME) model is usually estimated so that it conforms to equality constraints on feature expectations.
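As a rough illustration of what "equality constraints on feature expectations" means, the standard ME constraint can be written as below. The notation ($f_i$ for feature functions, $\tilde{p}$ for the empirical distribution, $p$ for the model, and the widths $A_i$, $B_i$) is assumed here and does not come from the listing; the second line is only a sketch of how an inequality-constrained variant, as named in the title, would relax the first.

```latex
% Equality constraint: model expectation matches empirical expectation
\mathrm{E}_{\tilde{p}}[f_i] = \mathrm{E}_{p}[f_i],
\qquad
\mathrm{E}_{\tilde{p}}[f_i] = \sum_{x,y} \tilde{p}(x,y)\, f_i(x,y),
\quad
\mathrm{E}_{p}[f_i] = \sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, f_i(x,y)

% Inequality relaxation (assumed form): the difference is only bounded
-B_i \le \mathrm{E}_{\tilde{p}}[f_i] - \mathrm{E}_{p}[f_i] \le A_i,
\qquad A_i, B_i \ge 0
```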
A Robust Risk Minimization based Named Entity Recognition System
In Proceedings of CoNLL-2003, 2003
Cited by 18 (2 self)
This paper describes a robust linear classification system for Named Entity Recognition. A similar system has been applied to the CoNLL text chunking shared task with state-of-the-art performance. By using different linguistic features, we can easily adapt this system to other token-based linguistic tagging problems. The main focus of the current paper is to investigate the impact of various local linguistic features for named entity recognition on the CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003) shared task data. We show that the system performance can be enhanced significantly with some relatively simple token-based features that are available for many languages. Although more sophisticated linguistic features will also be helpful, they provide much less improvement than might be expected.
Exploring the boundaries: Gene and protein identification in biomedical text
In Proceedings of the BioCreative Workshop, 2004
Cited by 15 (5 self)
Background: Good automatic information extraction tools offer hope for automatic processing of the exploding biomedical literature, and successful named entity recognition is a key component for such tools. Methods: We present a maximum-entropy based system incorporating a diverse set of features for identifying gene and protein names in biomedical abstracts. Results: This system was entered in the BioCreative comparative evaluation and achieved a precision of 0.83 and recall of 0.84 in the “open” evaluation and a precision of 0.78 and recall of 0.85 in the “closed” evaluation. Conclusions: Central contributions are rich use of features derived from the training data at multiple levels of granularity, a focus on correctly identifying entity boundaries, and the innovative use of several external knowledge sources including full MEDLINE abstracts and web searches. Background The explosion of information in the biomedical domain and particularly in genetics has highlighted the need for automated text information extraction techniques. MEDLINE, the primary research database serving the biomedical community, currently contains over 14 million abstracts, with 60,000 new abstracts appearing each month. There is also an impressive number of molecular biological databases covering an
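The abstract above reports precision and recall but no combined score. As a small worked example, the balanced F-measure (the harmonic mean of precision and recall) can be computed from the reported figures. The function name `f_measure` and the choice of `beta = 1` are assumptions made here for illustration; the listing does not say which F variant, if any, the authors used.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision/recall pairs as reported in the abstract.
open_f1 = f_measure(0.83, 0.84)
closed_f1 = f_measure(0.78, 0.85)

print(f"open: {open_f1:.3f}, closed: {closed_f1:.3f}")
# prints "open: 0.835, closed: 0.813"
```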
Exploiting domain structure for named entity recognition
In Human Language Technology Conference, 2006
Cited by 14 (2 self)
Named Entity Recognition (NER) is a fundamental task in text mining and natural language understanding. Current approaches to NER (mostly based on supervised learning) perform well on domains similar to the training domain, but they tend to adapt poorly to slightly different domains. We present several strategies for exploiting the domain structure in the training data to learn a more robust named entity recognizer that can perform well on a new domain. First, we propose a simple yet effective way to automatically rank features based on their generalizability across domains. We then train a classifier with strong emphasis on the most generalizable features. This emphasis is imposed by putting a rank-based prior on a logistic regression model. We further propose a domain-aware cross validation strategy to help choose an appropriate parameter for the rank-based prior. We evaluated the proposed method with a task of recognizing named entities (genes) in biology text involving three species. The experimental results show that the new domain-aware approach outperforms a state-of-the-art baseline method in adapting to new domains, especially when there is a great difference between the new domain and the training domain.
Improving Name Tagging by Reference Resolution and Relation Detection
Proc. ACL 2005, 2005
Cited by 13 (8 self)
Information extraction systems incorporate multiple stages of linguistic analysis. Although errors are typically compounded from stage to stage, it is possible to reduce the errors in one stage by harnessing the results of the other stages. We demonstrate this by using the results of coreference analysis and relation extraction to reduce the errors produced by a Chinese name tagger. We use an N-best approach to generate multiple hypotheses and have them re-ranked by subsequent stages of processing. We obtained thereby a reduction of 24% in spurious and incorrect name tags, and a reduction of 14% in missed tags.
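The N-best re-ranking idea described above can be sketched generically as follows. This is a minimal illustration, not the authors' system: the `Hypothesis` fields, the linear combination of scores, and the weights are all hypothetical, chosen only to show how later-stage evidence can overturn the tagger's first choice.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    tags: List[str]        # one name tag per token
    tagger_score: float    # hypothetical log-score from the name tagger
    coref_score: float     # hypothetical score from a coreference stage
    relation_score: float  # hypothetical score from a relation stage


def rerank(n_best: List[Hypothesis],
           w_tagger: float = 1.0,
           w_coref: float = 1.0,
           w_relation: float = 1.0) -> List[Hypothesis]:
    """Sort hypotheses by a weighted sum of stage scores, best first."""
    def combined(h: Hypothesis) -> float:
        return (w_tagger * h.tagger_score
                + w_coref * h.coref_score
                + w_relation * h.relation_score)
    return sorted(n_best, key=combined, reverse=True)


# Two made-up hypotheses: the tagger alone prefers the first one,
# but the later stages give more support to the second.
n_best = [
    Hypothesis(["B-PER", "O"], tagger_score=-1.0, coref_score=0.2, relation_score=0.1),
    Hypothesis(["B-ORG", "O"], tagger_score=-1.2, coref_score=0.9, relation_score=0.4),
]

best = rerank(n_best)[0]
print(best.tags)
```

With the later-stage weights set to zero, the ranking falls back to the tagger's own order, which is a quick way to check how much the extra stages actually change.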
Markov models for language-independent named entity recognition
In Proceedings of the Conference on Natural Language Learning, 2002
Cited by 12 (0 self)
This report describes the application of Markov models to the problem of language-independent named entity recognition for the CoNLL-2002 shared task (Tjong Kim Sang, 2002).
Semi-supervised learning for natural language
Master’s thesis, MIT, 2005
Cited by 10 (0 self)
Statistical supervised learning techniques have been successful for many natural language processing tasks, but they require labeled datasets, which can be expensive to obtain. On the other hand, unlabeled data (raw text) is often available “for free” in large quantities. Unlabeled data has shown promise in improving the performance of a number of tasks, e.g. word sense disambiguation, information extraction, and natural language parsing. In this thesis, we focus on two segmentation tasks, named-entity recognition and Chinese word segmentation. The goal of named-entity recognition is to detect and classify names of people, organizations, and locations in a sentence. The goal of Chinese word segmentation is to find the word boundaries in a sentence that has been written as a string of characters without spaces. Our approach is as follows: In a preprocessing step, we use raw text to cluster words and calculate mutual information statistics. The output of this step is then used as features in a supervised model, specifically a global linear model trained using
Enhancing HMM-based biomedical named entity recognition by studying special phenomena
J. Biomed. Inform., 2004
Cited by 9 (0 self)
The purpose of this research is to enhance an HMM-based named entity recognizer in the biomedical domain. First, we analyze the characteristics of biomedical named entities. Then, we propose a rich set of features, including orthographic, morphological, part-of-speech and semantic trigger features. All these features are integrated via a Hidden Markov Model with back-off modeling. Furthermore, we propose a method for biomedical abbreviation recognition and two methods for cascaded named entity recognition. Evaluation on GENIA V3.02 and V1.1 shows that our system achieves F-measures of 66.5 and 62.5 respectively and outperforms the previous best published system by 8.1 F-measure in the same experimental setting. The major contribution of this paper lies in its rich feature set specially designed for the biomedical domain and its effective methods for abbreviation and cascaded named entity recognition. To the best of our knowledge, our system is the first one that copes with the cascaded phenomena.
Applying Coreference to Improve Name Recognition
Proc. ACL 2004 Workshop on Reference Resolution and Its Applications, 2004
Cited by 9 (4 self)
We present a novel method of applying the results of coreference resolution to improve Name Recognition for Chinese. We consider first some methods for gauging the confidence of individual tags assigned by a statistical name tagger. For names with low confidence, we show how these names can be filtered using coreference features to improve accuracy. In addition, we present rules which use coreference information to correct some name tagging errors. Finally, we show how these gains can be magnified by clustering documents and using cross-document coreference in these clusters. These combined methods yield an absolute improvement of about 3.1% in tagger F score.

