Results 1 - 10
of
57
Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers
"... This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of smal ..."
Abstract
-
Cited by 65 (5 self)
- Add to MetaCart
This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon’s Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated-labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
Editorial: Special Issue on Learning from Imbalanced Data Sets
- SIGKDD Explorations
, 2004
"... The class imbalance problem is one of the (relatively) new problems that emerged when machine learning matured from an embryonic science to an applied technology, amply used in the worlds of business, industry and scientific research. ..."
Abstract
-
Cited by 60 (1 self)
- Add to MetaCart
The class imbalance problem is one of the (relatively) new problems that emerged when machine learning matured from an embryonic science to an applied technology, amply used in the worlds of business, industry and scientific research.
Budgeted learning of naive-bayes classifiers
- In Proceedings of 19th Conference on Uncertainty in Artificial Intelligence (UAI-2003
, 2003
"... There is almost always a cost associated with acquiring training data. We consider the situation where the learner, with a fixed budget, may ‘purchase ’ data during training. In particular, we examine the case where observing the value of a feature of a training example has an associated cost, and t ..."
Abstract
-
Cited by 33 (3 self)
- Add to MetaCart
There is almost always a cost associated with acquiring training data. We consider the situation where the learner, with a fixed budget, may ‘purchase ’ data during training. In particular, we examine the case where observing the value of a feature of a training example has an associated cost, and the total cost of all feature values acquired during training must remain less than this fixed budget. This paper compares methods for sequentially choosing which feature value to purchase next, given the budget and user’s current knowledge of Naïve Bayes model parameters. Whereas active learning has traditionally focused on myopic (greedy) approaches and uniform/round-robin policies for query selection, this paper shows that such methods are often suboptimal and presents a tractable method for incorporating knowledge of the budget in the information acquisition process. 1
Decision trees with minimal costs
- In Proceedings of the Twenty-First International Conference on Machine Learning
, 2004
"... We propose a simple, novel and yet effective method for building and testing decision trees that minimizes the sum of the misclassification and test costs. More specifically, we first put forward an original and simple splitting criterion for attribute selection in tree building. Our treebuilding al ..."
Abstract
-
Cited by 26 (13 self)
- Add to MetaCart
We propose a simple, novel and yet effective method for building and testing decision trees that minimizes the sum of the misclassification and test costs. More specifically, we first put forward an original and simple splitting criterion for attribute selection in tree building. Our treebuilding algorithm has many desirable properties for a cost-sensitive learning system that must account for both types of costs. Then, assuming that the test cases may have a large number of missing values, we design several intelligent test strategies that can suggest ways of obtaining the missing values at a cost in order to minimize the total cost. We experimentally compare these strategies and C4.5, and demonstrate that our new algorithms significantly outperform C4.5 and its variations. In addition, our algorithm’s complexity is similar to that of C4.5, and is much lower than that of previous work. Our work is useful for many diagnostic tasks which must factor in the misclassification and test costs for obtaining missing information. 1.
Discovering significant patterns
, 2007
"... Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patter ..."
Abstract
-
Cited by 25 (3 self)
- Add to MetaCart
Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Due to the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear due to chance alone to satisfy the constraints on the sample data. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and when applied to real-world data result in large numbers of patterns that are rejected when subjected to sound statistical evaluation. They also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.
Active Feature-Value Acquisition for Classifier Induction
, 2004
"... Many induction problems, such as on-line customer profiling, include missing data that can be acquired at a cost, such as incomplete customer information that can be filled in by an intermediary. For building accurate predictive models, acquiring complete information for all instances is often prohi ..."
Abstract
-
Cited by 18 (10 self)
- Add to MetaCart
Many induction problems, such as on-line customer profiling, include missing data that can be acquired at a cost, such as incomplete customer information that can be filled in by an intermediary. For building accurate predictive models, acquiring complete information for all instances is often prohibitively expensive or unnecessary. Randomly selecting instances for feature acquisition allows a representative sampling, but does not incorporate estimations of the value of acquisition. Active feature acquisition aims at reducing the cost of achieving a desired model accuracy by identifying instances for which complete information is most informative to obtain. We present approaches in which instances are selected for feature acquisition based on the current model's ability to predict accurately and the model's confidence in its prediction. Experimental results on several real-world data sets demonstrate that these approaches can induce accurate models using substantially fewer feature acquisitions, and suggest promising directions for improvements.
Economical Active Feature-value Acquisition through Expected Utility Estimation
- In Proceedings of the KDD05 Workshop on Utility-Based Data Mining
, 2005
"... In many classification tasks training data have missing feature values that can be acquired at a cost. For building accurate predictive models, acquiring all missing values is often prohibitively expensive or unnecessary, while acquiring a random subset of feature values may not be most effective. T ..."
Abstract
-
Cited by 14 (2 self)
- Add to MetaCart
In many classification tasks training data have missing feature values that can be acquired at a cost. For building accurate predictive models, acquiring all missing values is often prohibitively expensive or unnecessary, while acquiring a random subset of feature values may not be most effective. The goal of active feature-value acquisition is to incrementally select feature values that are most cost-effective for improving the model’s accuracy. We present two policies, Sampled Expected Utility and Expected Utility-ES, that acquire feature values for inducing a classification model based on an estimation of the expected improvement in model accuracy per unit cost. A comparison of the two policies to each other and to alternative policies demonstrate that Sampled Expected Utility is preferable as it effectively reduces the cost of producing a model of a desired accuracy and exhibits a consistent performance across domains.
Test-cost sensitive naive bayes classification
- In Proc. 4th IEEE International Conference on Data Mining (ICDM’04
, 2004
"... Inductive learning techniques such as the naive Bayes and decision tree algorithms have been extended in the past to handle different types of costs mainly by distinguishing different costs of classification errors. However, it is an equally important issue to consider how to handle the test costs a ..."
Abstract
-
Cited by 11 (1 self)
- Add to MetaCart
Inductive learning techniques such as the naive Bayes and decision tree algorithms have been extended in the past to handle different types of costs mainly by distinguishing different costs of classification errors. However, it is an equally important issue to consider how to handle the test costs associated with querying the missing values in a test case. When the value of an attribute is missing in a test case, it may or may not be worthwhile to take the effort to obtain its missing value, depending on how much the value will result in a potential gain in the classification accuracy. In this paper, we show how to obtain a test-cost sensitive naive Bayes classifier (csNB) by including a test strategy which determines how unknown attributes are selected to perform test on in order to minimize the sum of the misclassification costs and test costs. We propose and evaluate several potential test strategies including one that allows several tests to be done at once. We empirically evaluate the csNB method, and show that it compares favorably with its decision tree counterpart. 1.
Feature value acquisition in testing: A sequential batch test algorithm
- In Proceedings of 2006 International Conference on Machine Learning (ICML 2006
, 2006
"... In medical diagnosis, doctors often have to order sets of medical tests in sequence in order to make an accurate diagnosis of patient diseases. While doing so they have to make a trade-off between the cost of the tests and possible misdiagnosis. In this paper, we use cost-sensitive learning to model ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
In medical diagnosis, doctors often have to order sets of medical tests in sequence in order to make an accurate diagnosis of patient diseases. While doing so they have to make a trade-off between the cost of the tests and possible misdiagnosis. In this paper, we use cost-sensitive learning to model this process. We assume that test examples (new patients) may contain missing values, and their actual values can be acquired at cost (similar to doing medical tests) in order to reduce misclassification errors (misdiagnosis). We propose a novel Sequential Batch Test algorithm that can acquire sets of attribute values in sequence, similar to sets of medical tests ordered by doctors in sequence. The goal of our algorithm is to minimize the total cost (i.e., the trade-off) of acquiring attribute values and misclassifications. We demonstrate the effectiveness of our algorithm, and show that it outperforms previous methods significantly. Our algorithm can be readily applied in real-world diagnosis tasks. A case study on the heart disease is given in the paper. 1.
Budgeted learning of Naive Bayes classifiers
- In UAI’03
, 2003
"... of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. The author reserves all other publication and other rights in association with the copyright in the thesis, and except as herein before provided, neither the thesis nor any substantial portion ..."
Abstract
-
Cited by 10 (3 self)
- Add to MetaCart
of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. The author reserves all other publication and other rights in association with the copyright in the thesis, and except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatever without the author’s prior written permission. Date:

