The Effect of Class Distribution on Classifier Learning: An Empirical Study (2001) [44 citations — 1 self]

Abstract:

In this article we analyze the effect of class distribution on classifier learning. We begin by describing the different ways in which class distribution affects learning and how it affects the evaluation of learned classifiers. We then present the results of two comprehensive experimental studies. The first study compares the performance of classifiers generated from unbalanced data sets with the performance of classifiers generated from balanced versions of the same data sets. This comparison allows us to isolate and quantify the effect that the training set's class distribution has on learning and to contrast the performance of the classifiers on the minority and majority classes. The second study assesses what distribution is "best" for training, with respect to two performance measures: classification accuracy and the area under the ROC curve (AUC). A tacit assumption behind much research on classifier induction is that the class distribution of the training data should match the "natural" distribution of the data. This study shows that the naturally occurring class distribution is often not the best one for learning, and that substantially better performance can often be obtained by using a different class distribution. Understanding how classifier performance is affected by class distribution can help practitioners choose training data: in real-world situations the number of training examples often must be limited because of computational costs or the costs of procuring and preparing the data.
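The experimental setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names (`resample_to_fraction`, `auc`, `accuracy`) are hypothetical, and it assumes binary labels (1 for the minority class, 0 for the majority class) and a fixed training-set size drawn without replacement. It shows how a training subset with a chosen class distribution might be selected, and how the two performance measures named in the abstract could be computed.

```python
import random


def resample_to_fraction(labels, positive_fraction, size, seed=0):
    """Pick `size` example indices so that about `positive_fraction`
    of them have label 1. Sampling is without replacement (an assumption)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_pos = round(size * positive_fraction)
    n_neg = size - n_pos
    if n_pos > len(pos) or n_neg > len(neg):
        raise ValueError("not enough examples of one class for this distribution")
    chosen = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(chosen)
    return chosen


def accuracy(predictions, labels):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a random
    positive example receives a higher score than a random negative one
    (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # A toy pool with a 10% minority class.
    pool_labels = [1] * 10 + [0] * 90
    # Build a balanced training subset of 20 examples from the same pool.
    balanced = resample_to_fraction(pool_labels, 0.5, 20)
    print(sum(pool_labels[i] for i in balanced), "minority examples of", len(balanced))
```

Holding `size` constant while varying `positive_fraction` keeps the amount of training data fixed, so any change in accuracy or AUC can be attributed to the class distribution rather than to the number of examples. The pairwise AUC computation is quadratic in the number of examples and is meant only for illustration.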

Citations

3356 C4.5: Programs for Machine Learning – Quinlan - 1993
2573 Classification and Regression Trees – Breiman, Friedman, et al. - 1984
2227 UCI repository of machine learning databases – Blake, Merz
215 The case against accuracy estimation for comparing classifiers – Provost, Fawcett, et al. - 1998
123 Concept Learning and the Problem of Small Disjuncts – Holte, Acker, et al. - 1989
119 Construction and Assessment of Classification Rules – Hand - 1997
78 Megainduction: Machine Learning on Very Large Databases – Catlett - 1991
76 Toward scalable learning with non-uniform class and cost distributions – Chan, Stolfo - 1998
63 Addressing the Curse of Imbalanced Training Sets: One-Sided Selection – Kubat, Matwin - 1997
43 Robust classification systems for imprecise environments – Provost, Fawcett
32 Exploiting the Cost (In)Sensitivity of Decision Tree Splitting Criteria – Drummond, Holte - 2000
15 A quantitative study of small disjuncts – Weiss, Hirsh - 2000
14 Better decisions through science – Swets, Dawes - 2000