Results 1 - 7 of 7
Taking into account the differences between actively and passively acquired data: the case of active learning with support vector machines for imbalanced datasets
- In NAACL, 2009
Cited by 8 (0 self)
Actively sampled data can have very different characteristics than passively sampled data. Therefore, it’s promising to investigate using different inference procedures during AL than are used during passive learning (PL). This general idea is explored in detail for the focused case of AL with cost-weighted SVMs for imbalanced data, a situation that arises for many HLT tasks. The key idea behind the proposed InitPA method for addressing imbalance is to base cost models during AL on an estimate of overall corpus imbalance computed via a small unbiased sample rather than the imbalance in the labeled training data, which is the leading method used during PL.
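The contrast described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `positive_cost_ratio`, the choice of negatives-per-positive as the cost ratio, and all of the data are assumptions made for illustration only.

```python
from collections import Counter


def positive_cost_ratio(labels):
    """Return (#negatives / #positives) for a list of 0/1 labels.

    Using this ratio as the misclassification-cost weight of the positive
    class is a common convention and an assumption of this sketch.
    """
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("sample contains no positive examples")
    return counts[0] / counts[1]


# Hypothetical small unbiased (random) sample of the corpus: 1 positive in 20.
unbiased_sample = [1] + [0] * 19

# Hypothetical labeled set after several AL rounds. In this made-up example
# the actively selected data is far less imbalanced than the corpus.
actively_labeled = [1] * 10 + [0] * 20

# Cost ratio from the unbiased estimate of corpus imbalance.
ratio_from_sample = positive_cost_ratio(unbiased_sample)

# Cost ratio from the labeled training data, the usual choice in PL.
ratio_from_training = positive_cost_ratio(actively_labeled)

print(ratio_from_sample, ratio_from_training)
```

With these invented numbers the two estimates differ by roughly an order of magnitude, which is the kind of gap between actively and passively acquired data that the abstract points to.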
A method for stopping Active Learning based on stabilizing predictions and the need for user-adjustable stopping
- In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), 2009
Cited by 4 (1 self)
A survey of existing methods for stopping active learning (AL) reveals the needs for methods that are: more widely applicable; more aggressive in saving annotations; and more stable across changing datasets. A new method for stopping AL based on stabilizing predictions is presented that addresses these needs. Furthermore, stopping methods are required to handle a broad range of different annotation/performance tradeoff valuations. Despite this, the existing body of work is dominated by conservative methods with little (if any) attention paid to providing users with control over the behavior of stopping methods. The proposed method is shown to fill a gap in the level of aggressiveness available for stopping AL and supports providing users with control over stopping behavior.
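One way to picture a stopping rule based on stabilizing predictions is the sketch below. The abstract does not give the details, so everything specific here is an assumption: agreement between successive models is measured with Cohen's kappa on a fixed set of unlabeled examples, and the hypothetical parameters `threshold` and `window` stand in for the user-adjustable controls the abstract calls for.

```python
def cohen_kappa(a, b):
    """Cohen's kappa between two equal-length lists of predicted labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        # Both models predict one and the same label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


def should_stop(prediction_history, threshold=0.99, window=3):
    """Stop once the last `window` pairs of consecutive models agree.

    prediction_history[i] holds the predictions of the model from AL round i
    on the same fixed set of unlabeled examples. Raising `threshold` or
    `window` makes stopping more conservative; lowering them makes it more
    aggressive.
    """
    if len(prediction_history) < window + 1:
        return False
    recent = prediction_history[-(window + 1):]
    return all(cohen_kappa(p, q) >= threshold
               for p, q in zip(recent, recent[1:]))


# Made-up predictions from six AL rounds on four unlabeled examples.
history = [[0, 0, 1, 1], [0, 1, 1, 1], [0, 1, 1, 0], [0, 1, 1, 0]]
print(should_stop(history))   # predictions still changing
history += [[0, 1, 1, 0], [0, 1, 1, 0]]
print(should_stop(history))   # last four rounds identical
```

Because the rule only needs model predictions on unlabeled data, a sketch like this does not depend on the learner or on held-out labeled data.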
Why Label when you can Search? Alternatives to Active Learning for Applying Human Resources to Build Classification Models Under Extreme Class Imbalance
Cited by 4 (0 self)
This paper analyses alternative techniques for deploying low-cost human resources for data acquisition for classifier induction in domains exhibiting extreme class imbalance, where traditional labeling strategies, such as active learning, can be ineffective. Consider the problem of building classifiers to help brands control the content adjacent to their on-line advertisements. Although frequent enough to worry advertisers, objectionable categories are rare in the distribution of impressions encountered by most on-line advertisers, so rare that traditional sampling techniques do not find enough positive examples to train effective models. An alternative way to deploy human resources for training-data acquisition is to have them “guide” the learning by searching explicitly for training examples of each class. We show that under extreme skew, even basic techniques for guided learning completely dominate smart (active) strategies for applying human resources to select cases for labeling. Therefore, it is critical to consider the relative cost of search versus labeling, and we demonstrate the tradeoffs for different relative costs. We show that in cost/skew settings where the choice between search and active labeling is equivocal, a hybrid strategy can combine the benefits.
Guided Feature Labeling for Budget-Sensitive Learning Under Extreme Class Imbalance
Cited by 2 (2 self)
Extreme class skew is a hurdle in many machine learning tasks. In such skewed settings, traditional methods for procuring labeled examples, including random sampling and active learning, are often ineffective: they struggle to find representative minority examples. The framework of Dual Supervision, which incorporates feature-based background information into traditional supervised learning, provides one avenue to combat this problem. However, active learning for feature information (feature labeling), like active learning, is often not resilient to extreme class skew. In this work, we present an alternative to active feature labeling, Guided Feature Labeling. In this paradigm, human domain experts are tasked with finding class-indicative features given a description of a class. This work explores different data acquisition costs, and demonstrates that under certain conditions, Guided Feature Labeling does indeed offer high performance models at a far lower budget than complementary active labeling approaches.
Inactive Learning? Difficulties Employing Active Learning in Practice
Cited by 1 (1 self)
Despite the tremendous level of adoption of machine learning techniques in real-world settings, and the large volume of research on active learning, active learning techniques have been slow to gain substantial traction in practical applications. This reluctance of adoption is contrary to active learning’s promise of reduced model-development costs and increased performance on a model-development budget. This essay presents several important and under-discussed challenges to using active learning well in practice. We hope this paper can serve as a call to arms for researchers in active learning, an encouragement to focus even more attention on how practitioners might actually use active learning.
Learning SVM Ranking Function from User Feedback Using Document Metadata and Active Learning in the Biomedical Domain
Information overload is a well-known problem facing biomedical professionals. MEDLINE, the biomedical bibliographic database, adds hundreds of articles daily to the millions already in its collection. This overload is exacerbated by the lack of relevance-based ranking for search results, as well as disparate levels of search skill and domain experience of professionals using systems designed to search MEDLINE. We propose to address these problems through learning ranking functions from user relevance feedback. We hypothesize that learning from feedback will give performance similar to learning from the entire data set. We hypothesize that, by employing active learning techniques, we can achieve this performance using feedback on a fraction of the total number of results. We further hypothesize that learning from metadata, specifically the Medical Subject Heading (MeSH) terms associated with MEDLINE citations, will result in better performance than learning from textual features. We test our hypotheses through simulation, using the OHSUMED data set. Our results show that ranking functions learned from user feedback approach the performance of ranking functions learned from the entire data set using one half of the total data available. Our results also show that learning from MeSH features greatly outperforms learning from textual features.

