Results 1 - 10
of
135
Learning when Training Data are Costly: The Effect of Class Distribution on Tree Induction
, 2002
"... For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the data and/or the computational costs associated with learning from the data. One question of practical importance is: if n ..."
Abstract
-
Cited by 79 (9 self)
- Add to MetaCart
For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the data and/or the computational costs associated with learning from the data. One question of practical importance is: if n training examples are going to be selected, in what proportion should the classes be represented? In this article we analyze the relationship between the marginal class distribution of training data and the performance of classification trees induced from these data, when the size of the training set is fixed. We study twenty-six data sets and, for each, determine the best class distribution for learning. Our results show that, for a fixed number of training examples, it is often possible to obtain improved classifier performance by training with a class distribution other than the naturally occurring class distribution. For example, we show that to build a classifier robust to different misclassification costs, a balanced class distribution generally performs quite well. We also describe and evaluate a budgetsensitive progressive-sampling algorithm that selects training examples such that the resulting training set has a good (near-optimal) class distribution for learning.
A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data
, 2004
"... There are several aspects that might influence the performance achieved by existing learning systems. It has been reported that one of these aspects is related to class imbalance in which examples in training data belonging to one class heavily outnumber the examples in the other class. In this situ ..."
Abstract
-
Cited by 61 (0 self)
- Add to MetaCart
There are several aspects that might influence the performance achieved by existing learning systems. It has been reported that one of these aspects is related to class imbalance in which examples in training data belonging to one class heavily outnumber the examples in the other class. In this situation, which is found in real world data describing an infrequent but important event, the learning system may have di#culties to learn the concept related to the minority class. In this work we perform a broad experimental evaluation involving ten methods, three of them proposed by the authors, to deal with the class imbalance problem in thirteen UCI data sets. Our experiments provide evidence that class imbalance does not systematically hinder the performance of learning systems. In fact, the problem seems to be related to learning with too few minority class examples in the presence of other complicating factors, such as class overlapping. Two of our proposed methods, Smote + Tomek and Smote + ENN, deal with these conditions directly, allying a known over-sampling method with data cleaning methods in order to produce better-defined class clusters. Our comparative experiments show that, in general, over-sampling methods provide more accurate results than under-sampling methods considering the area under the ROC curve (AUC). This result seems to contradict results previously published in the literature. Smote + Tomek and Smote + ENN presented very good results for data sets with a small number of positive examples. Moreover, Random over-sampling, a very simple over-sampling method, is very competitive to more complex over-sampling methods. Since the over-sampling methods provided very good performance results, we also measured the syntactic complexity of decision trees induc...
Editorial: Special Issue on Learning from Imbalanced Data Sets
- SIGKDD Explorations
, 2004
"... The class imbalance problem is one of the (relatively) new problems that emerged when machine learning matured from an embryonic science to an applied technology, amply used in the worlds of business, industry and scientific research. ..."
Abstract
-
Cited by 60 (1 self)
- Add to MetaCart
The class imbalance problem is one of the (relatively) new problems that emerged when machine learning matured from an embryonic science to an applied technology, amply used in the worlds of business, industry and scientific research.
Mining with Rarity: A Unifying Framework
"... Rare objects are often of great interest and great value. Until recently, however, rarity has not received much attention in the context of data mining. Now, as increasingly complex real-world problems are addressed, rarity, and the related problem of imbalanced data, are taking center stage. This a ..."
Abstract
-
Cited by 57 (6 self)
- Add to MetaCart
Rare objects are often of great interest and great value. Until recently, however, rarity has not received much attention in the context of data mining. Now, as increasingly complex real-world problems are addressed, rarity, and the related problem of imbalanced data, are taking center stage. This article discusses the role that rare classes and rare cases play in data mining. The problems that can result from these two forms of rarity are described in detail, as are methods for addressing these problems. These descriptions utilize examples from existing research, so that this article provides a good survey of the literature on rarity in data mining. This article also demonstrates that rare classes and rare cases are very similar phenomena---both forms of rarity are shown to cause similar problems during data mining and benefit from the same remediation methods.
Applying support vector machines to imbalanced datasets
- In Proceedings of the 15th European Conference on Machine Learning (ECML
, 2004
"... Abstract. Support Vector Machines (SVM) have been extensively studied and have shown remarkable success in many applications. However the success of SVM is very limited when it is applied to the problem of learning from imbalanced datasets in which negative instances heavily outnumber the positive i ..."
Abstract
-
Cited by 49 (1 self)
- Add to MetaCart
Abstract. Support Vector Machines (SVM) have been extensively studied and have shown remarkable success in many applications. However the success of SVM is very limited when it is applied to the problem of learning from imbalanced datasets in which negative instances heavily outnumber the positive instances (e.g. in gene profiling and detecting credit card fraud). This paper discusses the factors behind this failure and explains why the common strategy of undersampling the training data may not be the best choice for SVM. We then propose an algorithm for overcoming these problems which is based on a variant of the SMOTE algorithm by Chawla et al, combined with Veropoulos et al’s different error costs algorithm. We compare the performance of our algorithm against these two algorithms, along with undersampling and regular SVM and show that our algorithm outperforms all of them. 1
C4.5, Class Imbalance, and Cost Sensitivity: Why Under-sampling beats Over-sampling
, 2003
"... This paper takes a new look at two sampling schemes commonly used to adapt machine learning algorithms to imbalanced classes and misclassification costs. It uses a performance analysis technique called cost curves to explore the interaction of over and undersampling with the decision tree learner C4 ..."
Abstract
-
Cited by 48 (3 self)
- Add to MetaCart
This paper takes a new look at two sampling schemes commonly used to adapt machine learning algorithms to imbalanced classes and misclassification costs. It uses a performance analysis technique called cost curves to explore the interaction of over and undersampling with the decision tree learner C4.5. C4.5 was chosen as, when combined with one of the sampling schemes, it is quickly becoming the community standard when evaluating new cost sensitive learning algorithms. This paper shows that using C4.5 with undersampling establishes a reasonable standard for algorithmic comparison. But it is recommended that the cheapest class classifier be part of that standard as it can be better than under-sampling for relatively modest costs. Over-sampling, however, shows little sensitivity, there is often little difference in performance when misclassification costs are changed. 1.
Learning when data sets are imbalanced and when costs are unequal and unknown
- ICML-2003 Workshop on Learning from Imbalanced Data Sets II
, 2003
"... The problem of learning from imbalanced data sets, while not the same problem as learning when misclassification costs are unequal and unknown, can be handled in a similar manner. That is, in both contexts, we can use techniques from roc analysis to help with classifier design. We present results fr ..."
Abstract
-
Cited by 36 (0 self)
- Add to MetaCart
The problem of learning from imbalanced data sets, while not the same problem as learning when misclassification costs are unequal and unknown, can be handled in a similar manner. That is, in both contexts, we can use techniques from roc analysis to help with classifier design. We present results from two studies in which we dealt with skewed data sets and unequal, but unknown costs of error. We also compare for one domain these results to those obtained by over-sampling and under-sampling the data set. The operations of sampling, moving the decision threshold, and adjusting the cost matrix produced sets of classifiers that fell on the same roc curve. 1.
SMOTEBoost: improving prediction of the minority class in boosting
- In Proceedings of the Principles of Knowledge Discovery in Databases, PKDD-2003
, 2003
"... Abstract. Many real world data mining applications involve learning from imbalanced data sets, where the particular events of interest may be very few when compared to the other classes. Learning from data sets that contain rare events usually produces biased classifiers that have a higher predictiv ..."
Abstract
-
Cited by 31 (8 self)
- Add to MetaCart
Abstract. Many real world data mining applications involve learning from imbalanced data sets, where the particular events of interest may be very few when compared to the other classes. Learning from data sets that contain rare events usually produces biased classifiers that have a higher predictive accuracy over the majority class(es), but poorer predictive accuracy over the minority class of interest. SMOTE (Synthetic Minority Over-sampling Technique) is a recent approach that is specifically designed for learning with minority classes. Boosting is a promising ensemble-based learning algorithm that can improve the classification performance of any weak classifier. This paper presents a novel approach for learning from rare classes, based on a combination of the SMOTE algorithm and the boosting procedure. Unlike standard boosting where all misclassified examples are given equal weights, the novel SMOTEBoost approach creates synthetic examples from the minority class, thus indirectly changing the updating weights and compensating for skewed distributions. The SMOTEBoost algorithm applied to several highly and moderately imbalanced data sets shows improvement in prediction performance on the rare class compared to standard boosting and to the SMOTE algorithm used with a single classifier. 1.
C4.5 and imbalanced data sets: investigating the effect of sampling method, probabilistic estimate, and decision tree structure
- In Proceedings of the ICML’03 Workshop on Class Imbalances
, 2003
"... Imbalanced data sets are becoming ubiquitous, as many applications have very few instances of the “interesting ” or “abnormal” class. Traditional machine learning algorithms can be biased towards majority class due to over-prevalence. It is desired that the interesting (minority) class prediction be ..."
Abstract
-
Cited by 21 (3 self)
- Add to MetaCart
Imbalanced data sets are becoming ubiquitous, as many applications have very few instances of the “interesting ” or “abnormal” class. Traditional machine learning algorithms can be biased towards majority class due to over-prevalence. It is desired that the interesting (minority) class prediction be improved, even if at the cost of additional majority class errors. In this paper, we study three issues, usually considered separately, concerning decision trees and imbalanced data sets — quality of probabilistic estimates, pruning, and effect of preprocessing the imbalanced data set by over or undersampling methods such that a fairly balanced training set is provided to the decision trees. We consider each issue independently and in conjunction with each other, highlighting the scenarios where one method might be preferred over another for learning decision trees from imbalanced data sets. 1.
Class-boundary alignment for imbalanced dataset learning
- In ICML 2003 Workshop on Learning from Imbalanced Data Sets
, 2003
"... In this paper, we propose the class-boundaryalignment algorithm to augment SVMs to deal with imbalanced training-data problems posed by many emerging applications (e.g., image retrieval, video surveillance, and gene profiling). Through a simple example, we first show that SVMs can be ineffective in ..."
Abstract
-
Cited by 18 (0 self)
- Add to MetaCart
In this paper, we propose the class-boundaryalignment algorithm to augment SVMs to deal with imbalanced training-data problems posed by many emerging applications (e.g., image retrieval, video surveillance, and gene profiling). Through a simple example, we first show that SVMs can be ineffective in determining the class boundary when the training instances of the target class are heavily outnumbered by the nontarget training instances. To remedy this problem, we propose to adjust the class boundary either by transforming the kernel function when the training data can be represented in a vector space, or by modifying the kernel matrix when the data do not have a vector-space representation (e.g., sequence data). Through theoretical analysis and empirical study, we show that the classboundary-alignment algorithm works effectively with images (data that have a vector-space representation) and video sequences (data that do not have a vector-space representation). 1.

