Results 1 - 10 of 59
Exploiting domain structure for named entity recognition. In Human Language Technology Conference, 2006. Cited by 28 (3 self).
Abstract: Named Entity Recognition (NER) is a fundamental task in text mining and natural language understanding. Current approaches to NER (mostly based on supervised learning) perform well on domains similar to the training domain, but they tend to adapt poorly to slightly different domains. We present several strategies for exploiting the domain structure in the training data to learn a more robust named entity recognizer that can perform well on a new domain. First, we propose a simple yet effective way to automatically rank features based on their generalizability across domains. We then train a classifier with strong emphasis on the most generalizable features. This emphasis is imposed by putting a rank-based prior on a logistic regression model. We further propose a domain-aware cross-validation strategy to help choose an appropriate parameter for the rank-based prior. We evaluated the proposed method on the task of recognizing named entities (genes) in biology text involving three species. The experimental results show that the new domain-aware approach outperforms a state-of-the-art baseline method in adapting to new domains, especially when there is a great difference between the new domain and the training domain.
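The feature-ranking idea above can be sketched as follows. This is a minimal illustration, not the paper's exact formula: it assumes per-domain feature weights are available (e.g. from logistic regressions trained separately per domain) and scores a feature as generalizable when its weight is consistently large across domains; the rank-based prior then regularizes low-ranked features harder.

```python
# Hypothetical sketch: rank features by cross-domain generalizability,
# then derive a rank-based Gaussian prior variance per feature.
import numpy as np

def generalizability_ranks(weights_per_domain):
    """weights_per_domain: (n_domains, n_features) array of per-domain
    feature weights. Illustrative scoring: large mean magnitude with low
    cross-domain variance = generalizable."""
    W = np.asarray(weights_per_domain, dtype=float)
    mean_abs = np.abs(W).mean(axis=0)
    std = W.std(axis=0)
    score = mean_abs / (std + 1e-8)        # high = strong and stable
    order = np.argsort(-score)             # best feature first
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks                           # rank 1 = most generalizable

def rank_based_prior_variance(ranks, sigma2=1.0):
    # Higher-ranked (more generalizable) features get a looser prior;
    # lower-ranked features are regularized harder.
    return sigma2 / np.asarray(ranks, dtype=float)
```

The prior-variance schedule (here simply sigma2 / rank) is the tunable parameter the abstract's domain-aware cross-validation would select.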
An approach to text corpus construction which cuts annotation costs and maintains reusability of annotated data. In Proc. of EMNLP/CoNLL07, 2007. Cited by 25 (4 self).
Abstract: We consider the impact Active Learning (AL) has on effective and efficient text corpus annotation, and report reduction rates for annotation effort of up to 72%. We also address the issue of whether a corpus annotated by means of AL – using a particular classifier and a particular feature set – can be re-used to train classifiers different from the ones employed by AL, supplying alternative feature sets as well. Finally, we report on our experience with the AL paradigm under real-world conditions, i.e., the annotation of large-scale document corpora for the life sciences.
Multi-task active learning for linguistic annotations. In ACL, 2008. Cited by 25 (1 self).
Abstract: We extend the classical single-task active learning (AL) approach. In the multi-task active learning (MTAL) paradigm, we select examples for several annotation tasks rather than for a single one, as usually done in the context of AL. We introduce two MTAL meta-protocols, alternating selection and rank combination, and propose a method to implement them in practice. We experiment with a two-task annotation scenario that includes named entity and syntactic parse tree annotations on three different corpora. MTAL outperforms random selection and a stronger baseline, one-sided example selection, in which one task is pursued using AL and the selected examples are also provided to the other task.
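The rank-combination meta-protocol can be sketched in a few lines. This is an illustrative reading, with assumed names and scoring: each task ranks the candidate pool by its own uncertainty, and the examples with the best combined (summed) rank are selected for annotation by all tasks.

```python
# Sketch of rank combination for multi-task active learning.
def rank_combination(task_scores, k):
    """task_scores: list of dicts {example_id: uncertainty}, one per task.
    Returns the k examples with the lowest summed rank across tasks."""
    combined = {}
    for scores in task_scores:
        # rank 1 = most uncertain example for this task
        ordered = sorted(scores, key=lambda e: -scores[e])
        for rank, ex in enumerate(ordered, start=1):
            combined[ex] = combined.get(ex, 0) + rank
    return sorted(combined, key=combined.get)[:k]
```

Summing ranks rather than raw uncertainties avoids having to calibrate scores from heterogeneous tasks (e.g. an NE tagger and a parser) onto a common scale.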
Stopping criteria for active learning of named entity recognition. 2008. Cited by 18 (0 self).
Abstract: Active learning is a proven method for reducing the cost of creating the training sets that are necessary for statistical NLP. However, there has been little work on stopping criteria for active learning. An operational stopping criterion is necessary to be able to use active learning in NLP applications. We investigate three different stopping criteria for active learning of named entity recognition (NER) and show that one of them, gradient-based stopping, (i) reliably stops active learning, (ii) achieves near-optimal NER performance, and (iii) needs only about 20% as much training data as exhaustive labeling.
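A gradient-based criterion of this kind can be sketched as follows. This is a hedged illustration, not the paper's exact method: it fits a least-squares slope to the recent portion of a per-iteration quality curve (held-out F1, or an aggregate confidence estimate when no held-out labels exist) and stops once the curve has flattened. Window size and threshold are assumed values.

```python
# Sketch: stop active learning when the performance curve flattens.
def should_stop(history, window=5, min_gradient=1e-3):
    """history: per-iteration quality estimates, one per AL round.
    Returns True when the least-squares slope over the last `window`
    rounds drops below min_gradient."""
    if len(history) < window:
        return False
    recent = history[-window:]
    n = len(recent)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(recent) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, recent))
    den = sum((x - mean_x) ** 2 for x in xs)
    return (num / den) < min_gradient
```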
Domain adaptive bootstrapping for named entity recognition. In ACL, 2009. Cited by 17 (0 self).
Abstract: Bootstrapping is the process of improving the performance of a trained classifier by iteratively adding data that is labeled by the classifier itself to the training set, and retraining the classifier. It is often used in situations where labeled training data is scarce but unlabeled data is abundant. In this paper, we consider the problem of domain adaptation: the situation where training data may not be scarce, but belongs to a different domain from the target application domain. As the distribution of unlabeled data is different from the training data, standard bootstrapping often has difficulty selecting informative data to add to the training set. We propose an effective domain adaptive bootstrapping algorithm that selects unlabeled target domain data that are informative about the target domain and easy to automatically label correctly. We call these instances bridges, as they are used to bridge the source domain to the target domain. We show that the method outperforms supervised, transductive, and bootstrapping algorithms on the named entity recognition task.
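The bridge-selection step can be caricatured as a two-part filter. The scoring below is an assumption for illustration, not the paper's criterion: keep target-domain examples the source classifier labels with high confidence (easy to auto-label correctly), then prefer those most representative of the target distribution (informative about the target domain).

```python
# Illustrative sketch of selecting "bridge" instances for domain
# adaptive bootstrapping. All names and thresholds are hypothetical.
def select_bridges(candidates, confidence, target_density, k,
                   min_confidence=0.9):
    """candidates: example ids from the unlabeled target pool;
    confidence[e]: source-classifier confidence in its predicted label;
    target_density[e]: how representative e is of the target domain."""
    easy = [e for e in candidates if confidence[e] >= min_confidence]
    return sorted(easy, key=lambda e: -target_density[e])[:k]
```

Each bootstrapping round would add the selected bridges, with their predicted labels, to the training set before retraining.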
A Web Survey on the Use of Active Learning to Support Annotation of Text Data. Cited by 12 (0 self).
Abstract: As supervised machine learning methods for addressing tasks in natural language processing (NLP) prove increasingly viable, the focus of attention naturally shifts towards the creation of training data. The manual annotation of corpora is a tedious and time-consuming process. Obtaining high-quality annotated data constitutes a bottleneck in machine learning for NLP today. Active learning is one way of easing the burden of annotation. This paper presents a first probe into the NLP research community concerning the nature of the annotation projects undertaken in general, and the use of active learning as annotation support in particular.
Bootstrapping and Evaluating Named Entity Recognition in the Biomedical Domain. In Proceedings of BioNLP in HLT-NAACL, 2006. Cited by 11 (3 self).
Abstract: We demonstrate that bootstrapping a gene name recognizer for FlyBase curation from automatically annotated noisy text is more effective than fully supervised training of the recognizer on more general manually annotated biomedical text. We present a new test set for this task based on an annotation scheme which distinguishes gene names from gene mentions, enabling a more consistent annotation. Evaluating our recognizer on this test set indicates that performance on unseen genes is its main weakness. We evaluate extensions to the technique used to generate training data designed to ameliorate this problem.
A Bayesian Network Model for Automatic and Interactive Image Segmentation. IEEE Transactions on Image Processing, 2011. Cited by 7 (0 self).
Abstract: We propose a new Bayesian network (BN) model for both automatic and interactive image segmentation. A multilayer BN is constructed from an oversegmentation to model the statistical dependencies among superpixel regions, edge segments, vertices, and their measurements. The BN also incorporates various local constraints to further restrain the relationships among these image entities. Given the BN model and various image measurements, belief propagation is performed to update the probability of each node. Image segmentation is generated by most-probable-explanation inference of the true states of both region and edge nodes from the updated BN. Besides automatic image segmentation, the proposed model can also be used for interactive image segmentation. While existing interactive segmentation (IS) approaches often passively depend on the user to provide exact intervention, we propose a new active input selection approach to provide suggestions for the user's intervention. Such intervention can be conveniently incorporated into the BN model to perform active IS. We evaluate the proposed model on both the Weizmann dataset and VOC2006 cow images. The results demonstrate that the BN model can be used for automatic segmentation and, more importantly, for active IS. The experiments also show that IS with active input selection can improve both the overall segmentation accuracy and efficiency over IS with passive intervention. Index Terms: Active labeling, Bayesian network (BN), image segmentation, interactive image segmentation.
An intrinsic stopping criterion for committee-based Active Learning. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), 2009. Cited by 6 (0 self).
Abstract: As supervised machine learning methods are increasingly used in language technology, the need for high-quality annotated language data becomes pressing. Active learning (AL) is a means to alleviate the burden of annotation. This paper addresses the problem of knowing when to stop the AL process without having the human annotator make an explicit decision on the matter. We propose and evaluate an intrinsic criterion for committee-based AL of named entity recognizers.
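An intrinsic (annotator-free) committee-based signal of this flavor can be sketched via vote entropy over the remaining pool: when the committee members largely agree on everything left unlabeled, further queries are unlikely to help. This is an illustrative stand-in, not the paper's specific criterion; the threshold is an assumed value.

```python
# Sketch: stop committee-based AL when mean vote entropy over the
# unlabeled pool falls below a threshold.
import math

def vote_entropy(votes, n_members):
    """votes: {label: count} for one example from an n_members committee."""
    ent = 0.0
    for count in votes.values():
        p = count / n_members
        if p > 0:
            ent -= p * math.log(p, 2)
    return ent

def committee_says_stop(pool_votes, n_members, threshold=0.1):
    mean_ent = sum(vote_entropy(v, n_members) for v in pool_votes) / len(pool_votes)
    return mean_ent < threshold
```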
Discriminative Sample Selection for Statistical Machine Translation. Cited by 6 (1 self).
Abstract: Production of parallel training corpora for the development of statistical machine translation (SMT) systems for resource-poor languages usually requires extensive manual effort. Active sample selection aims to reduce the labor, time, and expense incurred in producing such resources, attaining a given performance benchmark with the smallest possible training corpus by choosing informative, non-redundant source sentences from an available candidate pool for manual translation. We present a novel, discriminative sample selection strategy that preferentially selects batches of candidate sentences with constructs that lead to erroneous translations on a held-out development set. The proposed strategy supports a built-in diversity mechanism that reduces redundancy in the selected batches. Simulation experiments on English-to-Pashto and Spanish-to-English translation tasks demonstrate the superiority of the proposed approach to a number of competing techniques, such as random selection, dissimilarity-based selection, and a recently proposed semi-supervised active learning strategy.
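A batch selector with a built-in diversity mechanism, as the abstract describes, is commonly implemented as a greedy loop that discounts candidates similar to those already chosen. The sketch below assumes an error score per sentence (how strongly its constructs correlate with held-out translation errors) and a similarity function; both are illustrative placeholders, not the paper's definitions.

```python
# Sketch: greedy batch selection with a redundancy penalty.
def select_batch(candidates, error_score, similarity, batch_size,
                 diversity_weight=0.5):
    """candidates: sentence ids; error_score[c]: error correlation score;
    similarity(a, b) in [0, 1]. Higher diversity_weight = less redundancy."""
    batch = []
    remaining = list(candidates)
    while remaining and len(batch) < batch_size:
        def adjusted(c):
            # penalize similarity to sentences already in the batch
            redundancy = max((similarity(c, b) for b in batch), default=0.0)
            return error_score[c] - diversity_weight * redundancy
        best = max(remaining, key=adjusted)
        batch.append(best)
        remaining.remove(best)
    return batch
```

The penalty makes the second-best raw candidate lose to a less similar one once a near-duplicate is already in the batch.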