Learning on the border: Active learning in imbalanced data classification
In Proc. ACM Conf. on Information and Knowledge Management (CIKM ’07), 2007
Cited by 7 (0 self)
This paper is concerned with the class imbalance problem, which has been known to hinder the learning performance of classification algorithms. The problem occurs when there are significantly fewer observations of the target concept. Various real-world classification tasks, such as medical diagnosis, text categorization, and fraud detection, suffer from this phenomenon. The standard machine learning algorithms yield better prediction performance with balanced datasets. In this paper, we demonstrate that active learning is capable of solving the class imbalance problem by providing the learner with more balanced classes. We also propose an efficient way of selecting informative instances from a smaller pool of samples for active learning, which does not necessitate a search through the entire dataset. The proposed method yields an efficient querying system and allows active learning to be applied to very large datasets. Our experimental results show that, with an early stopping criterion, active learning achieves a fast solution with competitive prediction performance in imbalanced data classification.
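The small-pool selection idea in this abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `select_query`, the `decision_value` callback, the default pool size of 50, and the "closest to the decision boundary" criterion are all hypothetical choices that the abstract does not specify.

```python
import random

def select_query(unlabeled, decision_value, pool_size=50, rng=None):
    """Return the index of the next instance to query.

    Assumption: informativeness is measured by |decision_value(x)|, i.e. how
    close x lies to the current decision boundary. Only a random subsample of
    `pool_size` candidates is scored, so the full dataset is never scanned.
    """
    rng = rng or random.Random()
    n_candidates = min(pool_size, len(unlabeled))
    # Draw candidate indices without replacement from the unlabeled set.
    candidates = rng.sample(range(len(unlabeled)), n_candidates)
    # Pick the candidate with the smallest absolute decision value.
    return min(candidates, key=lambda i: abs(decision_value(unlabeled[i])))
```

Each query then costs a number of model evaluations bounded by `pool_size`, independent of the size of the unlabeled set.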
Why Label when you can Search? Alternatives to Active Learning for Applying Human Resources to Build Classification Models Under Extreme Class Imbalance
Cited by 4 (0 self)
This paper analyses alternative techniques for deploying low-cost human resources for data acquisition for classifier induction in domains exhibiting extreme class imbalance, where traditional labeling strategies, such as active learning, can be ineffective. Consider the problem of building classifiers to help brands control the content adjacent to their on-line advertisements. Although frequent enough to worry advertisers, objectionable categories are rare in the distribution of impressions encountered by most on-line advertisers, so rare that traditional sampling techniques do not find enough positive examples to train effective models. An alternative way to deploy human resources for training-data acquisition is to have them “guide” the learning by searching explicitly for training examples of each class. We show that under extreme skew, even basic techniques for guided learning completely dominate smart (active) strategies for applying human resources to select cases for labeling. Therefore, it is critical to consider the relative cost of search versus labeling, and we demonstrate the tradeoffs for different relative costs. We show that in cost/skew settings where the choice between search and active labeling is equivocal, a hybrid strategy can combine the benefits.
Learning terrain segmentation with classifier ensembles for autonomous robot navigation in unstructured environments
Journal of Field Robotics, 2009
Cited by 4 (3 self)
Autonomous robot navigation in unstructured outdoor environments is a challenging area of active research and is currently unsolved. The navigation task requires identifying safe, traversable paths that allow the robot to progress toward a goal while avoiding obstacles. Stereo is an effective tool in the near field, but used alone leads to a common failure mode in autonomous navigation in which suboptimal trajectories are followed due to nearsightedness, or the robot’s inability to distinguish obstacles and safe terrain in the far field. This can be addressed through the use of machine learning methods to accomplish near-to-far learning, in which near-field terrain appearance and stereo readings are used to train models able to predict far-field terrain. This paper proposes to enhance existing, memoryless near-to-far learning approaches through the use of classifier ensembles that allow terrain models trained on data seen at different points in time to be preserved and referenced later. These stored models serve as memory, and we show that they can be leveraged for more effective far-field terrain classification on future images seen by the robot. A five-factor, full-factorial, repeated-measures experimental evaluation is performed on hand-labeled data sets taken directly from the problem domain. The experiments
Cost-Sensitive Learning Methods for Imbalanced Data
Cited by 3 (3 self)
Class imbalance is one of the challenging problems for machine learning algorithms. When learning from highly imbalanced data, most classifiers are overwhelmed by the majority class examples, so the false negative rate is always high. Although researchers have introduced many methods to deal with this problem, including resampling techniques and cost-sensitive learning (CSL), most of them focus on only one of these techniques. This study presents two empirical methods that deal with class imbalance using both resampling and CSL. The first method combines and compares several sampling techniques with CSL using support vector machines (SVM). The second method proposes using CSL by optimizing the cost ratio (cost matrix) locally. Our experimental results on 18 imbalanced datasets from the UCI repository show that the first method can reduce the misclassification costs, and the second method can improve the classifier performance.
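To make the role of the cost ratio concrete, here is a generic sketch of cost-sensitive decision making. It is not the local optimization procedure from this paper. It assumes a binary problem, zero cost for correct predictions, and calibrated probability estimates, in which case predicting the positive class whenever p >= C_FP / (C_FP + C_FN) minimizes expected cost. The function names are hypothetical.

```python
def cost_threshold(cost_fp, cost_fn):
    """Probability threshold that minimizes expected cost, assuming
    correct predictions cost nothing and probabilities are calibrated."""
    return cost_fp / (cost_fp + cost_fn)

def total_cost(probs, labels, cost_fp, cost_fn):
    """Total misclassification cost when thresholding positive-class
    probabilities at the cost-derived threshold. Labels are 0 or 1."""
    threshold = cost_threshold(cost_fp, cost_fn)
    cost = 0.0
    for p, y in zip(probs, labels):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 0:
            cost += cost_fp  # false positive
        elif predicted == 0 and y == 1:
            cost += cost_fn  # false negative
    return cost
```

Raising `cost_fn` relative to `cost_fp` lowers the threshold, so more instances are assigned to the minority (positive) class and fewer false negatives occur.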
Guided Feature Labeling for Budget-Sensitive Learning Under Extreme Class Imbalance
Cited by 2 (2 self)
Extreme class skew is a hurdle in many machine learning tasks. In such skewed settings, traditional methods for procuring labeled examples, including random sampling and active learning, are often ineffective: they struggle to find representative minority examples. The framework of Dual Supervision, which incorporates feature-based background information into traditional supervised learning, provides one avenue to combat this problem. However, active learning for feature information (feature labeling), like active learning, is often not resilient to extreme class skew. In this work, we present an alternative to active feature labeling, Guided Feature Labeling. In this paradigm, human domain experts are tasked with finding class-indicative features given a description of a class. This work explores different data acquisition costs, and demonstrates that under certain conditions, Guided Feature Labeling does indeed offer high-performance models at a far lower budget than complementary active labeling approaches.
Negative Correlation Learning for Classification Ensembles
IJCNN
Cited by 1 (1 self)
This paper proposes a new negative correlation learning (NCL) algorithm, called AdaBoost.NC, which uses an ambiguity term derived theoretically for classification ensembles to introduce diversity explicitly. All existing NCL algorithms, such as CELS [1] and NCCD [2], and their theoretical backgrounds were studied in the regression context. We focus on classification problems in this paper. First, we study the ambiguity decomposition with the 0-1 error function, which is different from the one proposed by Krogh et al. [3]. It is applicable to both binary-class and multi-class problems. Then, to overcome the identified drawbacks of the existing algorithms, AdaBoost.NC is proposed by exploiting the ambiguity term in the decomposition to improve diversity. Comprehensive experiments are performed on a collection of benchmark data sets. The results show AdaBoost.NC is a promising algorithm to solve classification problems, which gives better performance than the standard AdaBoost and NCCD, and consumes much less computation time than CELS.
Active Inference and Learning for Classifying Streams
Cited by 1 (1 self)
In this position paper we introduce Active Inference, a paradigm for intelligently requesting human labels for inference and learning in situations with a finite budget for applying human resources for labeling cases. Many machine learning systems are applied to a stream of instances that can repeat, such as queries posed to a search engine or web pages for potential ad impressions. When a particular instance x can be subject to classification more than once, we have an additional complication to the budgeted learning setting. In such applications, frequently the distributions will be non-uniform; for instance, in the above applications the distributions p(x) over examples are highly skewed and thus a few x’s result in a large percentage of the actual cases for prediction. In such settings, it may be beneficial to allocate a human “labeling” budget to selectively perform direct inference, requesting human labels on a selected subset of the instances to be provided to an end system in an effort to reduce misclassification cost on the x’s with the highest expected utility. In estimating the utility of labeling a particular x, one must consider three factors: misclassification cost, the probability of encountering x, p(x), and the value x and its associated label may bring for (active) learning. We will discuss the illustrative application of machine learning for safe advertising, where there is a limited budget for acquiring ground-truth labels for labeling web pages.
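One way the three factors could be combined into a single score is sketched below. The abstract does not give a formula, so everything here is an assumption made for illustration: the additive form, the extra `error_prob` term (the model's estimated chance of misclassifying x), the dictionary field names, and both function names.

```python
def labeling_utility(p_x, error_prob, misclass_cost, learning_value):
    """Hypothetical utility of buying a human label for instance x.

    p_x * error_prob * misclass_cost: expected cost avoided on future
    occurrences of x. learning_value: assumed benefit of adding (x, label)
    to the training data.
    """
    return p_x * error_prob * misclass_cost + learning_value

def choose_instances(candidates, budget):
    """Return the ids of the `budget` candidates with the highest utility."""
    ranked = sorted(
        candidates,
        key=lambda c: labeling_utility(
            c["p_x"], c["error_prob"], c["misclass_cost"], c["learning_value"]
        ),
        reverse=True,
    )
    return [c["id"] for c in ranked[:budget]]
```

Under a skewed p(x), a few frequent instances dominate the first term, which matches the abstract's observation that labeling them directly can remove a large share of the misclassification cost.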
Concept Learning for Image and Video Retrieval: The Inverse Random Under Sampling Approach
A typical concept-detection problem is characterised by greatly disproportionate sizes of the populations of training samples in the concept and anti-concept classes. In many cases, the population of anti-concept (negative) examples outnumbers the concept examples. In this paper, an inverse random under sampling method is proposed to solve this imbalance problem. By the proposed method of inverse under sampling of the anti-concept class we can construct a large number of concept detectors which in the fusion stage facilitate a fine control of both false negative rates and false positive rates. In this method the main emphasis in learning the discriminant functions is on the concept class, leading to an almost perfect separation of the two classes for each detector. The proposed methodology is applied to commonly used video and image collection benchmarks: Mediamill and Scene datasets. The results indicate significant performance gains. For some concepts, the improvement in the average precision is by several orders of magnitude, and the mean average precision is 12% and 17% better for Mediamill and Scene datasets respectively when compared with a conventionally trained logistic regression classifier.
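The sampling step described in this abstract might look roughly like the sketch below. It assumes that "inverse" means each training set keeps every concept (positive) example and draws fewer anti-concept (negative) examples than there are positives, so the original imbalance is reversed. The function name, parameters, and that size constraint are assumptions, and detector training and fusion are left out.

```python
import random

def inverse_undersample(positives, negatives, n_sets, neg_per_set, rng=None):
    """Build `n_sets` training sets, one per detector.

    Each set is a pair (all positives, random subset of negatives), with the
    negative subset strictly smaller than the positive class (assumption).
    """
    if neg_per_set >= len(positives):
        raise ValueError("neg_per_set must be smaller than the number of positives")
    rng = rng or random.Random()
    training_sets = []
    for _ in range(n_sets):
        # Sample negatives without replacement within a single set.
        neg_subset = rng.sample(negatives, neg_per_set)
        training_sets.append((list(positives), neg_subset))
    return training_sets
```

Because every subset is small, many detectors can be trained cheaply. How their outputs are fused is not covered by this sketch.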
Coping with Imbalanced Training Data for Improved Terrain Prediction in Autonomous Outdoor Robot Navigation
Autonomous robot navigation in unstructured outdoor environments is a challenging and largely unsolved area of active research. The navigation task requires identifying safe, traversable paths that allow the robot to progress towards a goal while avoiding obstacles. Machine learning techniques are well adapted to this task, accomplishing near-to-far learning by training appearance-based models using near-field stereo readings in order to predict safe terrain and obstacles in the far field. However, these methods are subject to degraded performance when training data sets exhibit class imbalance, or skew, where data instances of one class outnumber those in another. In such scenarios, classifiers can be overwhelmed by the majority class, and will tend to ignore the minority class. In this paper, we show that typical outdoor terrain scenarios are associated with training data imbalance, and examine the impact of using undersampling, oversampling, SMOTE, and biased-penalty techniques to correct for imbalance in stereo-derived training data. We conduct a statistically significant, repeated measures empirical evaluation and demonstrate improved far-field terrain prediction performance when using such methods for handling class imbalance versus taking no corrective action at all.
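Of the correction techniques listed, SMOTE is the least self-explanatory, so a minimal sketch of its core idea follows: a synthetic minority point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not the configuration evaluated in the paper. The function name, the default k, and the use of squared Euclidean distance on plain tuples are assumptions.

```python
import random

def smote(minority, n_synthetic, k=5, rng=None):
    """Generate `n_synthetic` synthetic minority points by interpolation.

    `minority` is a list of equal-length numeric tuples.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # All other minority samples, sorted by squared distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic
```

Plain random oversampling would instead duplicate existing minority samples, and undersampling would discard majority samples. The interpolation step is what distinguishes SMOTE from both.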
Incorporating Reviewer and Product Information for Review Rating Prediction
Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence
Traditional sentiment analysis mainly considers binary classifications of reviews, but in many real-world sentiment classification problems, nonbinary review ratings are more useful. This is especially true when consumers wish to compare two products, both of which are not negative. Previous work has addressed this problem by extracting various features from the review text for learning a predictor. Since the same word may have different sentiment effects when used by different reviewers on different products, we argue that it is necessary to model such reviewer- and product-dependent effects in order to predict review ratings more accurately. In this paper, we propose a novel learning framework to incorporate reviewer and product information into the text-based learner for rating prediction. The reviewer, product and text features are modeled as a three-dimensional tensor. Tensor factorization techniques can then be employed to reduce the data sparsity problems. We perform extensive experiments to demonstrate the effectiveness of our model, which achieves a significant improvement over state-of-the-art methods, especially for reviews with unpopular products and inactive reviewers.

