Results 1-10 of 24
SMOTE: Synthetic Minority Oversampling Technique
 Journal of Artificial Intelligence Research
, 2002
Cited by 301 (21 self)
An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominantly composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Undersampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of oversampling the minority (abnormal) class and undersampling the majority (normal) class can achieve better classifier performance (in ROC space) than only undersampling the majority class. This paper also shows that a combination of our method of oversampling the minority class and undersampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of oversampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
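The abstract above does not spell out how the synthetic minority examples are built. A minimal sketch, assuming the commonly described form of SMOTE (interpolating between a minority example and one of its k nearest minority neighbours), is given here; the function name `smote_sketch`, its parameters, and the random choice of base example are illustrative assumptions rather than the authors' code.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority points by interpolation (illustrative only).

    minority    -- list of equal-length numeric tuples (minority-class examples)
    n_synthetic -- number of synthetic points to generate
    k           -- number of nearest minority neighbours to choose from
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        # Assumption: pick the base example at random; the usual description
        # instead loops over every minority example a fixed number of times.
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    for point in smote_sketch(pts, 3, k=2):
        print(point)
```

Because each synthetic point lies on a segment between two existing minority examples, the new points stay inside the region already occupied by the minority class.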
Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distributions
 In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining
, 1997
Cited by 261 (14 self)
Applications of inductive learning algorithms to real-world data mining problems have shown repeatedly that using accuracy to compare classifiers is not adequate because the underlying assumptions rarely hold. We present a method for the comparison of classifier performance that is robust to imprecise class distributions and misclassification costs. The ROC convex hull method combines techniques from ROC analysis, decision analysis and computational geometry, and adapts them to the particulars of analyzing learned classifiers. The method is efficient and incremental, minimizes the management of classifier performance data, and allows for clear visual comparisons and sensitivity analyses. Introduction When mining data with inductive methods, we often experiment with a wide variety of learning algorithms, using different algorithm parameters, varying output threshold values, and using different training regimens. Such experimentation yields a large number of classifiers to be evaluated a...
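To make the geometric idea concrete, the following is a minimal sketch of computing the upper convex hull of a set of classifiers in ROC space. It assumes each classifier is summarized by one (false positive rate, true positive rate) pair and that the trivial classifiers (0, 0) and (1, 1) are always available; the function names are hypothetical and this is not the authors' implementation.

```python
def _cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points):
    """Return the upper convex hull of (fpr, tpr) points, left to right.

    The trivial classifiers (0, 0) and (1, 1) are added so the hull always
    spans the whole ROC space.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Keep only clockwise turns; anything else lies on or below the hull.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


if __name__ == "__main__":
    classifiers = [(0.1, 0.6), (0.3, 0.5), (0.4, 0.9), (0.5, 0.2)]
    print(roc_convex_hull(classifiers))
```

Classifiers that fall below the hull, such as (0.3, 0.5) in the usage example, are dominated: some point on the hull does at least as well on both rates.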
Robust Classification for Imprecise Environments
, 1989
Cited by 255 (14 self)
In real-world environments it is usually difficult to specify target operating conditions precisely. This uncertainty makes building robust classification systems problematic. We present a method for the comparison of classifier performance that is robust to imprecise class distributions and misclassification costs. The ROC convex hull method combines techniques from ROC analysis, decision analysis and computational geometry, and adapts them to the particulars of analyzing learned classifiers. The method is efficient and incremental, minimizes the management of classifier performance data, and allows for clear visual comparisons and sensitivity analyses. We then show that it is possible to build a hybrid classifier that will perform at least as well as the best available classifier for any target conditions. This robust performance extends across a wide variety of comparison frameworks, including the optimization of metrics such as accuracy, expected cost, lift, precision, recall, and ...
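One way to read the "hybrid classifier" idea is that, once target conditions are known, the operating point with the lowest expected cost is selected from the available ROC points. The sketch below assumes the standard expected-cost formula for a binary problem and a hypothetical list of (fpr, tpr) operating points; it illustrates the selection step only and is not taken from the paper.

```python
def expected_cost(fpr, tpr, p_pos, cost_fn, cost_fp):
    """Expected misclassification cost per example at one operating point.

    p_pos   -- prior probability of the positive class
    cost_fn -- cost of a false negative (missed positive)
    cost_fp -- cost of a false positive
    """
    return p_pos * (1.0 - tpr) * cost_fn + (1.0 - p_pos) * fpr * cost_fp


def best_operating_point(points, p_pos, cost_fn, cost_fp):
    """Pick the (fpr, tpr) point with minimum expected cost."""
    return min(
        points,
        key=lambda pt: expected_cost(pt[0], pt[1], p_pos, cost_fn, cost_fp),
    )


if __name__ == "__main__":
    # Hypothetical operating points, e.g. the vertices of a ROC convex hull.
    candidates = [(0.0, 0.0), (0.1, 0.6), (0.4, 0.9), (1.0, 1.0)]
    print(best_operating_point(candidates, p_pos=0.5, cost_fn=5.0, cost_fp=1.0))
    print(best_operating_point(candidates, p_pos=0.01, cost_fn=1.0, cost_fp=1.0))
```

The two calls in the usage example show how the preferred point moves as the class prior and cost ratio change, which is why a single accuracy figure can mislead under imprecise conditions.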
Exploiting the cost (in)sensitivity of decision tree splitting criteria
 In Proceedings of the Seventeenth International Conference on Machine Learning
, 2000
Cited by 47 (4 self)
This paper investigates how the splitting criteria and pruning methods of decision tree learning algorithms are influenced by misclassification costs or changes to the class distribution. Splitting criteria that are relatively insensitive to costs (class distributions) are found to perform as well as or better than, in terms of expected misclassification cost, splitting criteria that are cost sensitive. Consequently there are two opposite ways of dealing with imbalance. One is to combine a cost-insensitive splitting criterion with a cost-insensitive pruning method to produce a decision tree algorithm little affected by cost or prior class distribution. The other is to grow a cost-independent tree which is then pruned in a cost-sensitive manner.
C4.5 and imbalanced data sets: investigating the effect of sampling method, probabilistic estimate, and decision tree structure
 In Proceedings of the ICML’03 Workshop on Class Imbalances
, 2003
Cited by 30 (3 self)
Imbalanced data sets are becoming ubiquitous, as many applications have very few instances of the “interesting” or “abnormal” class. Traditional machine learning algorithms can be biased towards the majority class due to overprevalence. It is desired that the interesting (minority) class prediction be improved, even if at the cost of additional majority class errors. In this paper, we study three issues, usually considered separately, concerning decision trees and imbalanced data sets: quality of probabilistic estimates, pruning, and effect of preprocessing the imbalanced data set by over- or undersampling methods such that a fairly balanced training set is provided to the decision trees. We consider each issue independently and in conjunction with each other, highlighting the scenarios where one method might be preferred over another for learning decision trees from imbalanced data sets.
B.: Borderline-SMOTE: A New Over-Sampling Method
 in Imbalanced Data Sets Learning. In: Huang, D.S., Zhang, X.P., Huang, G.B. (eds.) ICIC
Cited by 21 (0 self)
In recent years, mining imbalanced data sets has received more and more attention in both theoretical and practical aspects. This paper introduces the importance of imbalanced data sets and their broad application domains in data mining, and then summarizes the evaluation metrics and the existing methods to evaluate and solve the imbalance problem. Synthetic minority oversampling technique (SMOTE) is one of the oversampling methods addressing this problem. Based on the SMOTE method, this paper presents two new minority oversampling methods, borderline-SMOTE1 and borderline-SMOTE2, in which only the minority examples near the borderline are oversampled. For the minority class, experiments show that our approaches achieve better TP rate and F-value than SMOTE and random oversampling methods.
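The abstract does not define what counts as "near the borderline". A minimal sketch follows, assuming the rule usually attributed to this method: a minority example is treated as borderline when at least half, but not all, of its m nearest neighbours in the whole training set belong to the majority class. The function name, the parameter `m`, and the toy data are illustrative assumptions.

```python
import math


def borderline_minority(minority, majority, m=5):
    """Return the minority examples judged to lie near the class border.

    Assumed rule: with n_major majority points among the m nearest
    neighbours, an example is borderline when m / 2 <= n_major < m.
    n_major == m is treated as noise; fewer than m / 2 as safe.
    """
    # Label 0 = minority, 1 = majority; minority points come first so that
    # index i refers to the same point in both lists.
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    m = min(m, len(labelled) - 1)
    danger = []
    for i, p in enumerate(minority):
        others = [t for j, t in enumerate(labelled) if j != i]
        others.sort(key=lambda t: math.dist(p, t[0]))
        n_major = sum(label for _, label in others[:m])
        if m / 2 <= n_major < m:
            danger.append(p)
    return danger


if __name__ == "__main__":
    minority = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (1.0, 0.0), (1.05, 0.0)]
    majority = [(0.9, 0.0), (1.1, 0.0), (1.2, 0.0), (5.0, 0.0), (5.1, 0.0)]
    print(borderline_minority(minority, majority, m=3))
```

Only the examples returned by such a filter would then be passed to a SMOTE-style interpolation step, which concentrates the synthetic points where the two classes meet.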
Cost-sensitive boosting for classification of imbalanced data
, 2007
Cited by 19 (0 self)
Classification of data with imbalanced class distribution poses a significant drawback to the performance attainable by most standard classifier learning algorithms, which assume a relatively balanced class distribution and equal misclassification costs. The significant difficulty and frequent occurrence of the class imbalance problem indicate the need for extra research efforts. The objective of this paper is to investigate meta-techniques applicable to most classifier learning algorithms, with the aim of advancing the classification of imbalanced data. The AdaBoost algorithm is reported as a successful meta-technique for improving classification accuracy. The insight gained from a comprehensive analysis of the AdaBoost algorithm in terms of its advantages and shortcomings in tackling the class imbalance problem leads to the exploration of three cost-sensitive boosting algorithms, which are developed by introducing cost items into the learning framework of AdaBoost. Further analysis shows that one of the proposed algorithms tallies with the stagewise additive modelling in statistics to minimize the cost exponential loss. These boosting algorithms are also studied with respect to their weighting strategies towards different types of samples, and their effectiveness in identifying rare cases through experiments on several real-world medical data sets, where the class imbalance problem prevails.
A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2006
Cited by 17 (0 self)
We propose a new scoring function for learning Bayesian networks from data using score search algorithms. This is based on the concept of mutual information and exploits some well-known properties of this measure in a novel way. Essentially, a statistical independence test based on the chi-square distribution, associated with the mutual information measure, together with a property of additive decomposition of this measure, are combined in order to measure the degree of interaction between each variable and its parent variables in the network. The result is a non-Bayesian scoring function called MIT (mutual information tests) which belongs to the family of scores based on information theory. The MIT score also represents a penalization of the Kullback-Leibler divergence between the joint probability distributions associated with a candidate network and with the available data set. Detailed results of a complete experimental evaluation of the proposed scoring function and its comparison with the well-known K2, BDeu and BIC/MDL scores are also presented.
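The link between mutual information and a chi-square independence test can be illustrated with a short sketch. It relies on the standard fact that 2N times the empirical mutual information (in nats) is the likelihood-ratio G statistic, which is asymptotically chi-square distributed under independence. This is not the MIT score itself, and the function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical mutual information (in nats) of a list of (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts.
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi


def g_statistic(pairs):
    """Likelihood-ratio statistic G = 2 * N * MI for testing independence."""
    return 2.0 * len(pairs) * mutual_information(pairs)


if __name__ == "__main__":
    dependent = [(0, 0), (1, 1)] * 10
    independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 5
    print(g_statistic(dependent), g_statistic(independent))
```

Comparing the G value with a chi-square critical value for the appropriate degrees of freedom gives the independence test; the sketch leaves that lookup out to stay within the standard library.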
Supporting Software Maintenance by Mining Software Update Records
, 2001
Cited by 16 (0 self)
This paper describes the application of inductive methods to data extracted from both source code and software maintenance records. We would like to extract relations that indicate which files in a legacy system are relevant to each other in the context of program maintenance. We call these relations Maintenance Relevance Relations. Such a relation could reveal existing complex interconnections among files in the system, which may in turn be useful in comprehending them. We discuss the methodology we employed to extract and evaluate the relations. We also point out some of the problems we encountered and our solutions for them. Finally, we present some of the results that we have obtained.
DATA MINING FOR IMBALANCED DATASETS: AN OVERVIEW
, 2005
Cited by 13 (1 self)
A dataset is imbalanced if the classification categories are not approximately equally represented. Recent years brought increased interest in applying machine learning techniques to difficult "real-world" problems, many of which are characterized by imbalanced data. Additionally the distribution of the testing data may differ from that of the training data, and the true misclassification costs may be unknown at learning time. Predictive accuracy, a popular choice for evaluating performance of a classifier, might not be appropriate when the data is imbalanced and/or the costs of different errors vary markedly. In this chapter, we discuss some of the sampling techniques used for balancing the datasets, and the performance measures more appropriate for mining imbalanced datasets.