Learning when Training Data are Costly: The Effect of Class Distribution on Tree Induction
, 2002
Cited by 79 (9 self)
For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the data and/or the computational costs associated with learning from the data. One question of practical importance is: if n training examples are going to be selected, in what proportion should the classes be represented? In this article we analyze the relationship between the marginal class distribution of training data and the performance of classification trees induced from these data, when the size of the training set is fixed. We study twenty-six data sets and, for each, determine the best class distribution for learning. Our results show that, for a fixed number of training examples, it is often possible to obtain improved classifier performance by training with a class distribution other than the naturally occurring class distribution. For example, we show that to build a classifier robust to different misclassification costs, a balanced class distribution generally performs quite well. We also describe and evaluate a budget-sensitive progressive-sampling algorithm that selects training examples such that the resulting training set has a good (near-optimal) class distribution for learning.
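The core setup in this abstract, a training set of fixed size n drawn with a chosen class proportion, can be sketched as follows. This is only an illustration under stated assumptions, not the authors' procedure or their progressive-sampling algorithm: the function name `sample_fixed_size`, the binary-label convention, and the parameter `positive_fraction` are all hypothetical.

```python
import random


def sample_fixed_size(labels, n, positive_fraction, positive_label=1, seed=0):
    """Pick n example indices so that a chosen fraction belongs to one class.

    Hypothetical helper for illustration: `labels` is a list of class labels,
    and every label other than `positive_label` is treated as the other class.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == positive_label]
    neg = [i for i, y in enumerate(labels) if y != positive_label]

    # Split the fixed budget n between the two classes.
    n_pos = round(n * positive_fraction)
    n_neg = n - n_pos
    if n_pos > len(pos) or n_neg > len(neg):
        raise ValueError("not enough examples of one class for this distribution")

    # Sample without replacement within each class, then mix the order.
    chosen = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(chosen)
    return chosen


if __name__ == "__main__":
    # Toy labels (made up): 10% positive in the naturally occurring data.
    toy_labels = [1] * 100 + [0] * 900
    balanced = sample_fixed_size(toy_labels, n=200, positive_fraction=0.5)
    print(len(balanced), sum(toy_labels[i] for i in balanced))
```

Holding n fixed while varying `positive_fraction` is what lets different class distributions be compared at equal training-set size.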
The Effect Of Small Disjuncts And Class Distribution On Decision Tree Learning
- Rutgers University
, 2003
Cited by 3 (0 self)
Abstract of the dissertation, by Gary Mitchell Weiss. Dissertation Director: Haym Hirsh.
The main goal of classifier learning is to generate a model that makes few misclassification errors. Given this emphasis on error minimization, it makes sense to try to understand how the induction process gives rise to classifiers that make errors and whether we can identify those parts of the classifier that generate most of the errors. In this thesis we provide the first comprehensive studies of two major sources of classification errors. The first study concerns small disjuncts, which are those disjuncts within a classifier that cover only a few training examples. An analysis of classifiers induced from thirty data sets shows that these small disjuncts are extremely error prone and often account for the majority of all classification errors. Because small disjuncts largely determine classifier performance, we use them as a "lens" through which to study classifier induction. Factors such as pruning, training-set size, noise and class imbalance are each analyzed to determine how they affect small disjuncts and, more generally, classifier learning.
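The notion of a small disjunct can be made concrete with a short sketch. It assumes each example has already been mapped to the disjunct (for a decision tree, the leaf) that covers it, and it measures what share of test errors falls in disjuncts covering few training examples. The function name, the `max_size` threshold of 5, and the toy data are illustrative assumptions and are not taken from the thesis.

```python
from collections import Counter


def small_disjunct_error_share(train_leaf_ids, test_leaf_ids, test_correct, max_size=5):
    """Fraction of test errors made by disjuncts covering <= max_size training examples.

    Hypothetical helper: the *_leaf_ids lists name the disjunct (leaf) covering
    each example, and test_correct[i] is True when test example i was
    classified correctly.
    """
    # Disjunct size = number of training examples it covers.
    sizes = Counter(train_leaf_ids)

    total_errors = sum(1 for ok in test_correct if not ok)
    if total_errors == 0:
        return 0.0

    # Count errors whose covering disjunct is small (unseen leaves have size 0).
    small_errors = sum(
        1
        for leaf, ok in zip(test_leaf_ids, test_correct)
        if not ok and sizes[leaf] <= max_size
    )
    return small_errors / total_errors


if __name__ == "__main__":
    # Toy data (made up): leaf "a" covers 10 training examples, leaf "b" covers 2.
    train = ["a"] * 10 + ["b"] * 2
    test_leaves = ["a", "a", "b", "b"]
    correct = [True, False, False, False]
    print(small_disjunct_error_share(train, test_leaves, correct))
```

With a real tree learner, the leaf identifiers would come from whatever mechanism that library offers for reporting the leaf reached by each example.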

