## An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants (1999)

Venue: MACHINE LEARNING

Citations: 539 (2 self)

### BibTeX

```
@ARTICLE{Bauer99anempirical,
  author  = {Eric Bauer and Ron Kohavi},
  title   = {An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants},
  journal = {Machine Learning},
  year    = {1999},
  volume  = {36},
  pages   = {105--139}
}
```


### Abstract

Methods for voting classification algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and real-world datasets. We review these algorithms and describe a large empirical study comparing several variants in conjunction with a decision tree inducer (three variants) and a Naive-Bayes inducer. The purpose of the study is to improve our understanding of why and when these algorithms, which use perturbation, reweighting, and combination techniques, affect classification error.

We provide a bias and variance decomposition of the error to show how different methods and variants influence these two terms. This allowed us to determine that Bagging reduced variance of unstable methods, while boosting methods (AdaBoost and Arc-x4) reduced both the bias and variance of unstable methods but increased the variance for Naive-Bayes, which was very stable. We observed that Arc-x4 behaves differently than AdaBoost if reweighting is used instead of resampling, indicating a fundamental difference.

Voting variants, some of which are introduced in this paper, include: pruning versus no pruning, use of probabilistic estimates, weight perturbations (Wagging), and backfitting of data. We found that Bagging improves when probabilistic estimates in conjunction with no-pruning are used, as well as when the data was backfit.

We measure tree sizes and show an interesting positive correlation between the increase in the average tree size in AdaBoost trials and its success in reducing the error. We compare the mean-squared error of voting methods to non-voting methods and show that the voting methods lead to large and significant reductions in the mean-squared errors. Practical problems that arise in implementing boosting algorithms are explored, including numerical instabilities and underflows. We use scatterplots that graphically show how AdaBoost reweights instances, emphasizing not only "hard" areas but also outliers and noise.
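The non-adaptive family the abstract describes can be illustrated with a minimal Bagging sketch: each base classifier is trained on a bootstrap sample and the ensemble predicts by unweighted majority vote. This is an illustrative toy (1-D decision stumps on made-up data), not the paper's MC4 setup; the names `stump_fit`, `bag`, and `vote` are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def stump_fit(X, y):
    """One-node decision stump on a 1-D feature: predict 1 iff pol*(x - t) >= 0."""
    best_t, best_pol, best_err = X[0], 1, np.inf
    for t in np.unique(X):
        for pol in (1, -1):
            err = np.mean(np.where(pol * (X - t) >= 0, 1, 0) != y)
            if err < best_err:
                best_t, best_pol, best_err = t, pol, err
    return best_t, best_pol

def stump_predict(model, X):
    t, pol = model
    return np.where(pol * (X - t) >= 0, 1, 0)

def bag(X, y, n_estimators=25):
    """Bagging: fit each stump on a bootstrap sample (~63.2% unique instances)."""
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, n)   # sample n indices with replacement
        models.append(stump_fit(X[idx], y[idx]))
    return models

def vote(models, X):
    """Unweighted majority vote over the ensemble."""
    return (np.mean([stump_predict(m, X) for m in models], axis=0) >= 0.5).astype(int)

X = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
y = np.array([0, 0, 0, 1, 1, 1])
models = bag(X, y)
print(vote(models, X))   # ensemble predictions on the training points
```

Because each bootstrap sample keeps only about 63.2% unique instances, the fitted stumps differ from trial to trial; that instability of the base learner is exactly what Bagging exploits.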

### Citations

3921 | Pattern Classification and Scene Analysis - Duda, Hart - 1973 |

2868 | UCI Repository of machine learning databases - Blake, Merz - 1998 |

2534 | An Introduction to the Bootstrap - Efron, Tibshirani - 1993 |

2492 | Bagging predictors - Breiman - 1996 |
Citation Context: ...hods for voting classification algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and real-world datasets (Breiman 1996b, Freund & Schapire 1996, Quinlan 1996). Voting algorithms can be divided into two types: those that adaptively change the distribution of the training set based on the performance of previous classi... |

2308 | A decision-theoretic generalization of on-line learning and an application to boosting. EuroCOLT - Freund, Schapire - 1995 |
Citation Context: ...provides a recent review of related algorithms, and additional recent work can be found in Chan, Stolfo & Wolpert (1996). Algorithms that adaptively change the distribution include AdaBoost (Freund & Schapire 1995) and Arc-x4 (Breiman 1996a). Drucker & Cortes (1996) and Quinlan (1996) applied boosting to decision tree induction, observing both that error significantly decreases and that the generalization erro... |
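The adaptive reweighting that distinguishes AdaBoost from Bagging can be sketched in a few lines, following the AdaBoost.M1 update: correctly classified instances are downweighted by β = ε/(1-ε) and the weights are renormalized. The clamping of ε is a guard against the numerical underflows the abstract mentions; the function name `adaboost_reweight` is ours, not the paper's.

```python
import math

def adaboost_reweight(weights, errors, eps=1e-10):
    """One AdaBoost.M1 round: `errors[i]` is True where the weak learner
    misclassified instance i. Returns the renormalized weight distribution."""
    err = sum(w for w, e in zip(weights, errors) if e)   # weighted error ε
    err = min(max(err, eps), 1 - eps)                    # clamp to avoid under/overflow
    beta = err / (1 - err)                               # < 1 when the learner beats chance
    new = [w * (1.0 if e else beta) for w, e in zip(weights, errors)]
    z = sum(new)
    return [w / z for w in new]                          # renormalize to sum to 1

w = adaboost_reweight([0.25] * 4, [True, False, False, False])
print(w)   # the single misclassified instance now carries half the total mass
```

After the update, misclassified instances collectively hold weight 1/2, which is what forces the next weak learner to focus on the "hard" regions (and, as the paper observes, on outliers and noise).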

1631 | Experiments with a new boosting algorithm - Freund, Schapire - 1996 |
Citation Context: ...cation algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and real-world datasets (Breiman 1996b, Freund & Schapire 1996, Quinlan 1996). Voting algorithms can be divided into two types: those that adaptively change the distribution of the training set based on the performance of previous classifiers (as in boosting met... |

1039 | Bayesian Theory - Bernardo, Smith - 1994 |

777 | C4.5: Programs for Machine Learning - Quinlan - 1993 |
Citation Context: ...ion tree inducer we used, called MC4 (MLC++ C4.5), is a TopDown Decision Tree (TDDT) induction algorithm implemented in MLC++ (Kohavi, Sommerfield & Dougherty 1997). The algorithm is similar to C4.5 (Quinlan 1993) with the exception that unknowns are regarded as a separate value. The algorithm grows the decision tree following the standard methodology of choosing the best attribute according to the evaluation... |

752 | A study of cross-validation and bootstrap for accuracy estimation and model selection - Kohavi - 1995 |
Citation Context: ...rd deviations of the error estimate from each run were computed as the standard deviation of the three outer runs, assuming they were independent. Although such an assumption is not strictly correct (Kohavi 1995a, Dietterich 1998), it is quite reasonable given our circumstances because our training sets are small in size and we only average three values. 6. Experimental Design We now describe our desiderata ... |

721 | Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods, The Annals of Statistics - Bartlett, Freund, et al. |

665 | The strength of weak learnability - Schapire - 1990 |

653 | Multi-interval discretization of continuous-valued attributes for classification learning - Fayyad, Irani - 1993 |

609 | Neural networks and the bias/variance dilemma - Geman, Bienenstock, et al. - 1992 |

528 | Approximate statistical tests for comparing supervised classification learning algorithms - Dietterich - 1998 |
Citation Context: ...of the error estimate from each run were computed as the standard deviation of the three outer runs, assuming they were independent. Although such an assumption is not strictly correct (Kohavi 1995a, Dietterich 1998), it is quite reasonable given our circumstances because our training sets are small in size and we only average three values. 6. Experimental Design We now describe our desiderata for comparisons, s... |

438 | Very simple classification rules perform well on most commonly used datasets - Holte - 1993 |

423 | Boosting a weak learning algorithm by majority - Freund - 1995 |

334 | An analysis of Bayesian classifiers - Langley, Iba, et al. - 1992 |

298 | Beyond independence: Conditions for the optimality of the simple Bayesian classifier - Domingos, Pazzani |

278 | Arcing classifiers - Breiman - 1998 |
Citation Context: ...hods for voting classification algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and real-world datasets (Breiman 1996b, Freund & Schapire 1996, Quinlan 1996). Voting algorithms can be divided into two types: those that adaptively change the distribution of the training set based on the performance of previous classi... |

271 | Bagging, boosting, and C4.5 - Quinlan - 1996 |
Citation Context: ...ms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and real-world datasets (Breiman 1996b, Freund & Schapire 1996, Quinlan 1996). Voting algorithms can be divided into two types: those that adaptively change the distribution of the training set based on the performance of previous classifiers (as in boosting methods) and thos... |

173 | Bias plus variance decomposition for zero-one loss functions - Kohavi, Wolpert - 1996 |

154 | Data mining using MLC++: a machine learning library in C++ - Kohavi, Sommerfield - 1996 |
Citation Context: ...uild a structured model that has the same effect as Bagging. Ridgeway, Madigan & Richardson (1998) convert a boosted Naive-Bayes to a regular Naive-Bayes, which then allows for visualizations (Becker, Kohavi & Sommerfield 1997). Are there ways to make boosting comprehensible for general models? Craven & Shavlik (1993) built a single decision tree that attempts to make the same classifications as a neural network. Quinlan (... |

149 | The Estimation of Probabilities: An Essay on Modern Bayesian Methods, volume 30 of Research Monographs - Good - 1965 |
Citation Context: ...ly unable to build a good classifier because the tree consists of a single binary root split with leaves as children. 3.2. The Naive-Bayes Inducer The Naive-Bayes Inducer (Good 1965, Duda & Hart 1973, Langley, Iba & Thompson 1992), sometimes called Simple-Bayes (Domingos & Pazzani 1997), builds a simple conditional independence classifier. Formally, the probability of a class la... |
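The conditional independence classifier described in this excerpt scores a class c as P(c) · Π_j P(x_j | c). A minimal categorical sketch with Laplace smoothing, computed in log-space to sidestep the underflow problems the paper discusses; the function names and the toy weather data are ours, not the paper's.

```python
import math
from collections import Counter

def nb_fit(X, y, alpha=1.0):
    """Estimate class priors P(c) and smoothed conditionals P(x_j = v | c)."""
    classes = sorted(set(y))
    prior = {c: y.count(c) / len(y) for c in classes}
    class_n = Counter(y)
    n_feat = len(X[0])
    counts = {(c, j): Counter() for c in classes for j in range(n_feat)}
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            counts[(yi, j)][v] += 1
    n_values = [len({xi[j] for xi in X}) for j in range(n_feat)]  # for smoothing
    return classes, prior, class_n, counts, n_values, alpha

def nb_predict(model, x):
    """argmax_c  log P(c) + sum_j log P(x_j | c); summing logs avoids
    underflowing the long product of small probabilities."""
    classes, prior, class_n, counts, n_values, alpha = model
    best, best_lp = None, -math.inf
    for c in classes:
        lp = math.log(prior[c])
        for j, v in enumerate(x):
            lp += math.log((counts[(c, j)][v] + alpha) /
                           (class_n[c] + alpha * n_values[j]))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

X = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
y = ["no", "no", "yes", "yes"]
model = nb_fit(X, y)
print(nb_predict(model, ("rain", "hot")))   # "yes": the rain feature dominates
```

Laplace smoothing (alpha=1) keeps unseen feature values from zeroing out a class, which matters when boosting reweights the training distribution.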

147 | Error-Correcting Output Coding Corrects Bias and Variance - Kong, Dietterich - 1995 |

146 | A Conservation Law for Generalization Performance - Schaffer - 1994 |
Citation Context: ...the segment dataset with MC4(1), the error increased as the training set size grew. While in theory such behavior must happen for every induction algorithm (Wolpert 1994, Schaffer 1994), this is the first time we have seen it in a real dataset. Further investigation revealed that in this problem all seven classes are equiprobable, i.e., the dataset was stratified. A strong majority... |

124 | Learning classification trees - Buntine - 1992 |
Citation Context: ...ging). Algorithms that do not adaptively change the distribution include option decision tree algorithms that construct decision trees with multiple options at some nodes (Buntine 1992b, Buntine 1992a, Kohavi & Kunz 1997); averaging path sets, fanned sets, and extended fanned sets as alternatives to pruning (Oliver & Hand 1995); voting trees using different splitting criteria and hu... |

112 | Reducing misclassification costs - Pazzani, Merz, et al. - 1994 |

107 | Wrappers for Performance Enhancement and Oblivious Decision Graphs - Kohavi - 1995 |
Citation Context: ...rd deviations of the error estimate from each run were computed as the standard deviation of the three outer runs, assuming they were independent. Although such an assumption is not strictly correct (Kohavi 1995a, Dietterich 1998), it is quite reasonable given our circumstances because our training sets are small in size and we only average three values. 6. Experimental Design We now describe our desiderata ... |

103 | Error-based and entropy-based discretization of continuous features - Kohavi, Sahami - 1996 |

90 | Boosting decision trees - Drucker, Cortes - 1996 |

88 | Error-correcting output codes: A general method for improving multiclass inductive learning programs - Dietterich, Bakiri - 1991 |

79 | A theory of learning classification rules - Buntine - 1991 |
Citation Context: ...ging). Algorithms that do not adaptively change the distribution include option decision tree algorithms that construct decision trees with multiple options at some nodes (Buntine 1992b, Buntine 1992a, Kohavi & Kunz 1997); averaging path sets, fanned sets, and extended fanned sets as alternatives to pruning (Oliver & Hand 1995); voting trees using different splitting criteria and hu... |

75 | Boosting the margin: A new explanation for the effectiveness of voting methods - Schapire, Freund, et al. - 1998 |

66 | The effects of training set size on decision tree complexity - Oates, Jensen - 1997 |

64 | Naive bayesian learning - Elkan - 1997 |

60 | Multiple decision trees - Kwok, Carter - 1990 |

60 | Stacked generalization, Neural Networks 5 - Wolpert - 1992 |

58 | Arcing the edge - Breiman - 1997 |

52 | On bias, variance, 0/1 loss and the curse of dimensionality - Friedman - 1997 |
Citation Context: ... independence assumption is not true in many cases, causing a single factor to affect several attributes whose probabilities are multiplied assuming they are conditionally independent given the label (Friedman 1997). To summarize, we have seen error reductions for the family of decision-tree algorithms when probabilistic estimates were used. The error reductions were larger for the one level decision trees. Thi... |

49 | The heuristics of instability in model selection - Breiman - 1996 |
Citation Context: ... sample contains only about 63.2% unique instances from the training set. This perturbation causes different classifiers to be built if the inducer is unstable (e.g., neural networks, decision trees) (Breiman 1994) and the performance can improve if the induced classifiers are good and not correlated; however, Bagging may slightly degrade the performance of stable algorithms (e.g., k-nearest neighbor) because ... |

48 | Comparing connectionist and symbolic learning methods - Quinlan - 1993 |

44 | Induction of one-level decision trees - Iba, Langley - 1992 |

41 | Learning symbolic rules using artificial neural networks - Craven, Shavlik - 1993 |

41 | Option decision trees with majority votes - Kohavi, Kunz - 1997 |

37 | Visualizing the simple Bayesian classifier - Becker, Kohavi, et al. - 2001 |

37 | Feature subset selection using the wrapper model: Overfitting and dynamic search space topology - Kohavi, Sommerfield |
Citation Context: ...rd deviations of the error estimate from each run were computed as the standard deviation of the three outer runs, assuming they were independent. Although such an assumption is not strictly correct (Kohavi 1995a, Dietterich 1998), it is quite reasonable given our circumstances because our training sets are small in size and we only average three values. 6. Experimental Design We now describe our desiderata ... |

37 | On Pruning and Averaging Decision Trees - Oliver, Hand - 1995 |

32 | Why does bagging work? A Bayesian account and its implications - Domingos - 1997 |

23 | Interpretable boosted naive Bayes classification - Ridgeway, Madigan, et al. - 1998 |

22 | Learning probabilistic relational concept descriptions - Ali - 1996 |