Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distributions
In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, 1997
Cited by 225 (13 self)
Applications of inductive learning algorithms to real-world data mining problems have shown repeatedly that using accuracy to compare classifiers is not adequate because the underlying assumptions rarely hold. We present a method for the comparison of classifier performance that is robust to imprecise class distributions and misclassification costs. The ROC convex hull method combines techniques from ROC analysis, decision analysis and computational geometry, and adapts them to the particulars of analyzing learned classifiers. The method is efficient and incremental, minimizes the management of classifier performance data, and allows for clear visual comparisons and sensitivity analyses. Introduction When mining data with inductive methods, we often experiment with a wide variety of learning algorithms, using different algorithm parameters, varying output threshold values, and using different training regimens. Such experimentation yields a large number of classifiers to be evaluated a...
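The abstract above names the ROC convex hull method without spelling out the computation. The following is a minimal sketch of the idea, not the authors' implementation: it assumes each classifier is summarized as a (false positive rate, true positive rate) tuple, builds the upper convex hull with a standard monotone-chain scan, and then picks the hull point with the lowest expected cost for a given class prior and pair of error costs. The function names, the example points, and the cost values are all hypothetical.

```python
def _cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive means a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points):
    """Upper convex hull of (fpr, tpr) tuples, running from (0, 0) to (1, 1)."""
    # The two trivial classifiers (always negative, always positive) are always available.
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last hull point while it lies on or below the segment hull[-2] -> p.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def expected_cost(point, p_pos, cost_fn, cost_fp):
    """Expected misclassification cost per example at one ROC point."""
    fpr, tpr = point
    return p_pos * (1.0 - tpr) * cost_fn + (1.0 - p_pos) * fpr * cost_fp


def best_operating_point(hull, p_pos, cost_fn, cost_fp):
    """Hull point with the lowest expected cost under the given conditions."""
    return min(hull, key=lambda pt: expected_cost(pt, p_pos, cost_fn, cost_fp))


# Hypothetical classifiers; (0.3, 0.5) falls below the hull and is dropped.
classifiers = [(0.1, 0.6), (0.3, 0.5), (0.4, 0.9)]
hull = roc_convex_hull(classifiers)
print(hull)
# Costly false negatives favor a liberal classifier; rare positives favor a conservative one.
print(best_operating_point(hull, p_pos=0.5, cost_fn=5.0, cost_fp=1.0))
print(best_operating_point(hull, p_pos=0.1, cost_fn=1.0, cost_fp=1.0))
```

Only classifiers on the hull can be optimal for some combination of class distribution and costs, which is why the comparison stays meaningful when those conditions are imprecise.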
Robust Classification for Imprecise Environments
1989
Cited by 209 (12 self)
In real-world environments it is usually difficult to specify target operating conditions precisely. This uncertainty makes building robust classification systems problematic. We present a method for the comparison of classifier performance that is robust to imprecise class distributions and misclassification costs. The ROC convex hull method combines techniques from ROC analysis, decision analysis and computational geometry, and adapts them to the particulars of analyzing learned classifiers. The method is efficient and incremental, minimizes the management of classifier performance data, and allows for clear visual comparisons and sensitivity analyses. We then show that it is possible to build a hybrid classifier that will perform at least as well as the best available classifier for any target conditions. This robust performance extends across a wide variety of comparison frameworks, including the optimization of metrics such as accuracy, expected cost, lift, precision, recall, and ...
SMOTE: Synthetic Minority Over-sampling Technique
Journal of Artificial Intelligence Research, 2002
Cited by 175 (11 self)
An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominantly composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
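The abstract says only that synthetic minority class examples are created. As a rough illustration of what that can look like, the sketch below interpolates between a minority example and one of its nearest minority-class neighbors, which is the commonly described form of SMOTE. It is a simplification and an assumption, not the paper's procedure: base examples are drawn at random rather than visited in turn, features are assumed numeric, and the function name, parameters, and data are hypothetical.

```python
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points from a list of numeric minority-class vectors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority examples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # The k nearest minority-class neighbors of x, excluding x itself.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbor = rng.choice(others[:k])
        # The new point lies at a random position on the segment from x to the neighbor.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbor)])
    return synthetic


# Hypothetical two-feature minority class.
minority = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [2.0, 2.0]]
for point in smote_sketch(minority, n_new=3, k=2):
    print(point)
```

Because each synthetic point sits between two real minority examples, the minority region is filled in rather than duplicated, which is the contrast with plain over-sampling by replication.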
Exploiting the Cost (In)sensitivity of Decision Tree Splitting Criteria
In Proceedings of the Seventeenth International Conference on Machine Learning, 2000
Cited by 42 (4 self)
This paper investigates how the splitting criteria and pruning methods of decision tree learning algorithms are influenced by misclassification costs or changes to the class distribution. Splitting criteria that are relatively insensitive to costs (class distributions) are found to perform as well as or better than, in terms of expected misclassification cost, splitting criteria that are cost sensitive. Consequently there are two opposite ways of dealing with imbalance. One is to combine a cost-insensitive splitting criterion with a cost-insensitive pruning method to produce a decision tree algorithm little affected by cost or prior class distribution. The other is to grow a cost-independent tree which is then pruned in a cost-sensitive manner. 1. Introduction When applying machine learning to real-world classification problems two complications that often arise are imbalanced classes (one class occurs much more often than the other (Kubat et al., 1998; Ezawa et al., 1...
C4.5 and imbalanced data sets: investigating the effect of sampling method, probabilistic estimate, and decision tree structure
In Proceedings of the ICML’03 Workshop on Class Imbalances, 2003
Cited by 21 (3 self)
Imbalanced data sets are becoming ubiquitous, as many applications have very few instances of the “interesting” or “abnormal” class. Traditional machine learning algorithms can be biased towards the majority class due to over-prevalence. It is desired that the interesting (minority) class prediction be improved, even if at the cost of additional majority class errors. In this paper, we study three issues, usually considered separately, concerning decision trees and imbalanced data sets: quality of probabilistic estimates, pruning, and the effect of preprocessing the imbalanced data set by over- or under-sampling methods such that a fairly balanced training set is provided to the decision trees. We consider each issue independently and in conjunction with each other, highlighting the scenarios where one method might be preferred over another for learning decision trees from imbalanced data sets.
Supporting Software Maintenance by Mining Software Update Records
2001
Cited by 13 (0 self)
This paper describes the application of inductive methods to data extracted from both source code and software maintenance records. We would like to extract relations that indicate which files in a legacy system are relevant to each other in the context of program maintenance. We call these relations Maintenance Relevance Relations. Such a relation could reveal existing complex interconnections among files in the system, which may in turn be useful in comprehending them. We discuss the methodology we employed to extract and evaluate the relations. We also point out some of the problems we encountered and our solutions for them. Finally, we present some of the results that we have obtained.
A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests
Journal of Machine Learning Research, 2006
Cited by 5 (0 self)
We propose a new scoring function for learning Bayesian networks from data using score search algorithms. This is based on the concept of mutual information and exploits some well-known properties of this measure in a novel way. Essentially, a statistical independence test based on the chi-square distribution, associated with the mutual information measure, together with a property of additive decomposition of this measure, are combined in order to measure the degree of interaction between each variable and its parent variables in the network. The result is a non-Bayesian scoring function called MIT (mutual information tests) which belongs to the family of scores based on information theory. The MIT score also represents a penalization of the Kullback-Leibler divergence between the joint probability distributions associated with a candidate network and with the available data set. Detailed results of a complete experimental evaluation of the proposed scoring function and its comparison with the well-known K2, BDeu and BIC/MDL scores are also presented.
A recognition-based alternative to discrimination-based multi-layer perceptrons
Proc. Workshop on Learning from Imbalanced Data Sets, 2000
Cited by 2 (0 self)
Though impressive classification accuracy is often obtained via discrimination-based learning techniques such as Multi-Layer Perceptrons (DMLP), these techniques often assume that the underlying training sets are optimally balanced (in terms of the number of positive and negative examples). Unfortunately, this is not always the case. In this paper, we look at a recognition-based approach whose accuracy in such environments is superior to that obtained via more conventional mechanisms. At the heart of the new technique is a modified auto-encoder that allows for the incorporation of a recognition component into the conventional MLP mechanism. In short, rather than being associated with an output value of "1", positive examples are fully reconstructed at the network output layer while negative examples, rather than being associated with an output value of "0", have their inverse derived at the output layer. The result is an auto-encoder able to recognize positive examples while discriminating against negative ones by virtue of the fact that negative cases generate larger reconstruction errors. A simple technique is employed to exaggerate the impact of training with these negative examples so that reconstruction errors can be more reliably established. Preliminary testing on both seismic and sonar data sets has demonstrated that the new method produces lower error rates than standard connectionist systems in imbalanced settings. Our approach thus suggests a simple and more robust alternative to commonly used classification mechanisms.
Evaluation of Classifiers for an Uneven Class Distribution Problem
Applied Artificial Intelligence, 2006
Cited by 2 (1 self)
Classification problems with uneven class distributions present several difficulties during the training as well as during the evaluation process of classifiers. A classification problem with such characteristics has resulted from a data-mining project where the objective was to predict customer insolvency. Using the dataset from the customer insolvency problem we study several alternative methodologies which have been reported to better suit the specific characteristics of this type of problem. Three different but equally important directions are examined: (a) the performance measures that should be used for problems in this domain, (b) the class distributions that should be used for the training data sets, (c) the classification algorithms to be used. The final evaluation of the resulting classifiers is based on a study of the economic impact of classification results. This study concludes with a framework that provides the “best” classifiers, identifies the performance measures that should be used as the decision criterion and suggests the “best” class distribution based on the value of the relative gain from correct classification in the positive class. This framework has been applied in the customer insolvency problem, but it is claimed that it can be applied to many similar problems with uneven class distributions that almost always require a multi-objective evaluation process. Keywords: data mining, classification, imbalanced class distributions, voting algorithms, cost-sensitive learning
Wrapper-based computation and evaluation of sampling methods for imbalanced datasets
In Workshop on Utility-Based Data Mining held in conjunction with the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2005
Cited by 2 (1 self)
Learning from imbalanced datasets presents an interesting problem both from modeling and economy standpoints. When the imbalance is large, classification accuracy on the smaller class(es) tends to be lower. In particular, when a class is of great interest but occurs relatively rarely such as cases of fraud, instances of disease, and regions of interest in large-scale simulations, it is important to accurately identify it. It then becomes more costly to misclassify the interesting class. In this paper, we implement a wrapper approach that computes the amount of under-sampling and synthetic generation of the minority class examples (SMOTE) to improve minority class accuracy. The f-value serves as the evaluation function. Experimental results show the wrapper approach is effective in optimization of the composite f-value, and reduces the average cost per test example for the datasets considered. We report both average cost per test example and the cost curves in the paper. The true positive rate of the minority class increases significantly without causing a significant change in the f-value. We also obtain the lowest cost per test example, compared to any result we are aware of for the KDD Cup-99 intrusion detection data set.
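The two quantities this abstract reports, the f-value and the average cost per test example, can be computed from confusion-matrix counts. The sketch below uses the standard F-beta definition and a two-class cost model; the abstract does not state which beta or cost matrix the authors used, so the default beta of 1, the two-class simplification, the function names, and the example counts are all assumptions.

```python
def f_value(tp, fp, fn, beta=1.0):
    """F-beta score for the minority (positive) class from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def average_cost(tp, fp, fn, tn, cost_fp=1.0, cost_fn=1.0):
    """Average misclassification cost per test example; correct predictions cost nothing."""
    n = tp + fp + fn + tn
    return (fp * cost_fp + fn * cost_fn) / n


# Hypothetical counts: 10 minority examples among 100 test cases.
print(f_value(tp=8, fp=2, fn=2))
print(average_cost(tp=8, fp=2, fn=2, tn=88, cost_fp=1.0, cost_fn=10.0))
```

A wrapper of the kind described would evaluate a score like `f_value` on held-out data for each candidate amount of under-sampling and synthetic generation, and keep the setting that scores best.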

