Results 1 - 10
of
38
Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers
"... This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of smal ..."
Abstract
-
Cited by 65 (5 self)
- Add to MetaCart
This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon’s Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated-labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
Ensemble-based Active Learning for Parse Selection
"... Supervised estimation methods are widely seen as being superior to semi and fully unsupervised methods. However, supervised methods crucially rely upon training sets that need to be manually annotated. This can be very expensive, especially when skilled annotators are required. Active learning (AL) ..."
Abstract
-
Cited by 18 (3 self)
- Add to MetaCart
Supervised estimation methods are widely seen as being superior to semi and fully unsupervised methods. However, supervised methods crucially rely upon training sets that need to be manually annotated. This can be very expensive, especially when skilled annotators are required. Active learning (AL) promises to help reduce this annotation cost. Within the complex domain of HPSG parse selection, we show that ideas from ensemble learning can help further reduce the cost of annotation. Our main results show that at times, an ensemble model trained with randomly sampled examples can outperform a single model trained using AL. However, converting the single-model AL method into an ensemble-based AL method shows that even this much stronger baseline model can be improved upon. Our best results show a ¢¤£¦ ¥ reduction in annotation cost compared with single-model random sampling.
Active learning for anomaly and rare-category detection
- In Advances in Neural Information Processing Systems 18
, 2004
"... We introduce a novel active-learning scenario in which a user wants to work with a learning algorithm to identify useful anomalies. These are distinguished from the traditional statistical definition of anomalies as outliers or merely ill-modeled points. Our distinction is that the usefulness of ano ..."
Abstract
-
Cited by 17 (0 self)
- Add to MetaCart
We introduce a novel active-learning scenario in which a user wants to work with a learning algorithm to identify useful anomalies. These are distinguished from the traditional statistical definition of anomalies as outliers or merely ill-modeled points. Our distinction is that the usefulness of anomalies is categorized subjectively by the user. We make two additional assumptions. First, there exist extremely few useful anomalies to be hunted down within a massive dataset. Second, both useful and useless anomalies may sometimes exist within tiny classes of similar anomalies. The challenge is thus to identify “rare category ” records in an unlabeled noisy set with help (in the form of class labels) from a human expert who has a small budget of datapoints that they are prepared to categorize. We propose a technique to meet this challenge, which assumes a mixture model fit to the data, but otherwise makes no assumptions on the particular form of the mixture components. This property promises wide applicability in real-life scenarios and for various statistical models. We give an overview of several alternative methods, highlighting their strengths and weaknesses, and conclude with a detailed empirical analysis. We show that our method can quickly zoom in on an anomaly set containing a few tens of points in a dataset of hundreds of thousands. 1
Active coevolutionary learning of deterministic finite automata
- Journal of Machine Learning Research
, 2005
"... This paper describes an active learning approach to the problem of grammatical inference, specifically the inference of deterministic finite automata (DFAs). We refer to the algorithm as the estimation-exploration algorithm (EEA). This approach differs from previous passive and active learning appro ..."
Abstract
-
Cited by 17 (6 self)
- Add to MetaCart
This paper describes an active learning approach to the problem of grammatical inference, specifically the inference of deterministic finite automata (DFAs). We refer to the algorithm as the estimation-exploration algorithm (EEA). This approach differs from previous passive and active learning approaches to grammatical inference in that training data is actively proposed by the algorithm, rather than passively receiving training data from some external teacher. Here we show that this algorithm outperforms one version of the most powerful set of algorithms for grammatical inference, evidence driven state merging (EDSM), on randomly-generated DFAs. The performance increase is due to the fact that the EDSM algorithm only works well for DFAs with specific balances (percentage of positive labelings), while the EEA is more consistent over a wider range of balances. Based on this finding we propose a more general method for generating DFAs to be used in the development of future grammatical inference algorithms.
Active Learning to Recognize Multiple Types of Plankton
- Journal of Machine Learning Research
, 2004
"... Active learning has been applied with support vector machines to reduce the data labeling effort in pattern recognition domains. However, most of those applications only deal with two class problems. In this paper, we extend the active learning approach to multiple class support vector machines. The ..."
Abstract
-
Cited by 16 (3 self)
- Add to MetaCart
Active learning has been applied with support vector machines to reduce the data labeling effort in pattern recognition domains. However, most of those applications only deal with two class problems. In this paper, we extend the active learning approach to multiple class support vector machines. The experimental results from a plankton recognition system indicate that our approach often requires significantly less labeled images to maintain the same accuracy level as random sampling. 1.
An information theoretic tradeoff between complexity and accuracy
- In Proceedings of the COLT
, 2003
"... ..."
Dual-strategy active learning
- In Proc. of the 18th European Conf. on Machine Learning
, 2007
"... Abstract. Active Learning methods rely on static strategies for sampling unlabeled point(s). These strategies range from uncertainty sampling and density estimation to multi-factor methods with learn-onceuse-always model parameters. This paper proposes a dynamic approach, called DUAL, where the stra ..."
Abstract
-
Cited by 12 (5 self)
- Add to MetaCart
Abstract. Active Learning methods rely on static strategies for sampling unlabeled point(s). These strategies range from uncertainty sampling and density estimation to multi-factor methods with learn-onceuse-always model parameters. This paper proposes a dynamic approach, called DUAL, where the strategy selection parameters are adaptively updated based on estimated future residual error reduction after each actively sampled point. The objective of dual is to outperform static strategies over a large operating range: from very few to very many labeled points. Empirical results over six datasets demonstrate that DUAL outperforms several state-of-the-art methods on most datasets. 1
Active learning with feedback on both features and instances
- Journal of Machine Learning Research
, 2006
"... We extend the traditional active learning framework to include feedback on features in addition to labeling instances, and we execute a careful study of the effects of feature selection and human feedback on features in the setting of text categorization. Our experiments on a variety of categorizati ..."
Abstract
-
Cited by 10 (0 self)
- Add to MetaCart
We extend the traditional active learning framework to include feedback on features in addition to labeling instances, and we execute a careful study of the effects of feature selection and human feedback on features in the setting of text categorization. Our experiments on a variety of categorization tasks indicate that there is significant potential in improving classifier performance by feature re-weighting, beyond that achieved via membership queries alone (traditional active learning) if we have access to an oracle that can point to the important (most predictive) features. Our experiments on human subjects indicate that human feedback on feature relevance can identify a sufficient proportion of the most relevant features (over 50 % in our experiments). We find that on average, labeling a feature takes much less time than labeling a document. We devise an algorithm that interleaves labeling features and documents which significantly accelerates standard active learning in our simulation experiments. Feature feedback can complement traditional active learning in applications such as news filtering, e-mail classification, and personalization, where the human teacher can have significant knowledge on the relevance of features.
Using the Experimental Method to Produce Reliable Self-Organised Systems
- Engineering Self Organising Sytems: Methodologies and Applications. Volume 3464 of Lecture Notes in Artificial Intelligence
, 2004
"... cfpm.org/~bruce Abstract. The ‘engineering ’ and ‘adaptive ’ approaches to system production are distinguished. It is argued that producing reliable self-organised software systems (SOSS) will necessarily involve considerable use of adaptive approaches. A class of apparently simple multi-agent syste ..."
Abstract
-
Cited by 9 (3 self)
- Add to MetaCart
cfpm.org/~bruce Abstract. The ‘engineering ’ and ‘adaptive ’ approaches to system production are distinguished. It is argued that producing reliable self-organised software systems (SOSS) will necessarily involve considerable use of adaptive approaches. A class of apparently simple multi-agent systems is defined, which however has all the power of a Turing machine, and hence is beyond formal specification and design methods (in general). It is then shown that such systems can be evolved to perform simple tasks. This highlights how we may be faced with systems whose workings we have not wholly designed and hence that we will have to treat them more as natural science treat the systems it encounters, namely using the classic experimental method. An example is briefly discussed. A system for annotating such systems with hypotheses, and conditions of application is proposed that would be a natural extension of current methods of open source code development. 1.
A stopping criterion for active learning
, 2007
"... Active learning (AL) is a framework that attempts to reduce the cost of annotating training material for statistical learning methods. While a lot of papers have been presented on applying AL to natural language processing tasks reporting impressive savings, little work has been done on defining a s ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
Active learning (AL) is a framework that attempts to reduce the cost of annotating training material for statistical learning methods. While a lot of papers have been presented on applying AL to natural language processing tasks reporting impressive savings, little work has been done on defining a stopping criterion. In this work, we present a stopping criterion for active learning based on the way instances are selected during uncertainty-based sampling and verify its applicability in a variety of settings. The statistical learning models used in our study are support vector machines (SVMs), maximum entropy models and Bayesian logistic regression and the tasks performed are text classification, named entity recognition and shallow parsing. In addition, we present a method for multiclass mutually exclusive SVM active learning.

