## Why Does Bagging Work? A Bayesian Account and its Implications (1997)

Venue: Proceedings of the Third International Conference on Knowledge Discovery and Data Mining

Citations: 33 (7 self)

### BibTeX

```bibtex
@inproceedings{Domingos97whydoes,
  author    = {Pedro Domingos},
  title     = {Why Does Bagging Work? A Bayesian Account and its Implications},
  booktitle = {Proceedings of the Third International Conference on Knowledge Discovery and Data Mining},
  year      = {1997},
  pages     = {155--158},
  publisher = {AAAI Press}
}
```

### Abstract

The error rate of decision-tree and other classification learners can often be much reduced by bagging: learning multiple models from bootstrap samples of the database, and combining them by uniform voting. In this paper we empirically test two alternative explanations for this, both based on Bayesian learning theory: (1) bagging works because it is an approximation to the optimal procedure of Bayesian model averaging, with an appropriate implicit prior; (2) bagging works because it effectively shifts the prior to a more appropriate region of model space. All the experimental evidence contradicts the first hypothesis, and confirms the second.

**Bagging.** Bagging (Breiman 1996a) is a simple and effective way to reduce the error rate of many classification learning algorithms. For example, in the empirical study described below, it reduces the error of a decision-tree learner in 19 of 26 databases, by 4% on average. In the bagging procedure, given a training set of size s, a "bootstrap" re...
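The bagging procedure described in the abstract can be sketched directly. The sketch below is illustrative only: `learn` and the classifier interface `h(x)` are stand-ins for any base learner (such as a decision-tree inducer), not the paper's implementation.

```python
import random
from collections import Counter

def bag_predict(train, learn, x, n_models=10, seed=0):
    """Bagging sketch: learn models on bootstrap resamples of the training
    set, then combine their predictions by uniform voting.

    `learn` takes a list of (instance, class) pairs and returns a classifier
    h with h(instance) -> class. These interfaces are illustrative.
    """
    rng = random.Random(seed)
    s = len(train)
    models = []
    for _ in range(n_models):
        # A "bootstrap" sample: s examples drawn with replacement.
        sample = [train[rng.randrange(s)] for _ in range(s)]
        models.append(learn(sample))
    # Uniform voting: one vote per model, the most common class wins.
    return Counter(h(x) for h in models).most_common(1)[0][0]
```

Because each bootstrap sample omits roughly a third of the original examples, the models disagree on some instances, and the uniform vote smooths out those disagreements.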

### Citations

5584 | C4.5: Programs for machine learning
- Quinlan
- 1993
Citation Context: ...relies on the fact that, implicitly or explicitly, a classification model divides the instance space into regions, and labels each region with a class. For example, if the model is a decision tree (Quinlan 1993), each leaf corresponds to a region. A noise level can then be estimated separately for each region, by making: $\Pr(x_i, c_i \mid h) = n_{r,c_i} / n_r$ (3), where $r$ is the region $x_i$ is in, $n_r$ is the total n...
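The per-region noise estimate in Eq. (3) of the context above is a simple frequency count. A minimal sketch, assuming a hypothetical `region_of(x)` that returns the region an instance falls in (for a decision tree, the leaf it reaches):

```python
from collections import Counter

def region_noise_estimates(examples, region_of):
    """Per-region class frequencies, as in Eq. (3):
    Pr(x_i, c_i | h) = n_{r,c_i} / n_r.

    `examples` is a list of (instance, class) pairs; `region_of` is a
    hypothetical interface mapping an instance to its model region.
    """
    n_r = Counter()   # n_r: total examples falling in region r
    n_rc = Counter()  # n_{r,c}: examples of class c in region r
    for x, c in examples:
        r = region_of(x)
        n_r[r] += 1
        n_rc[(r, c)] += 1
    # One estimate per (region, class) pair observed in the data.
    return {(r, c): n_rc[(r, c)] / n_r[r] for (r, c) in n_rc}
```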

3141 | Uci repository of machine learning databases - Blake, Merz - 1998 |

2866 | Bagging predictors
- Breiman
- 1996
Citation Context: ...ng works because it effectively shifts the prior to a more appropriate region of model space. All the experimental evidence contradicts the first hypothesis, and confirms the second. Bagging (Breiman 1996a) is a simple and effective way to reduce the error rate of many classification learning algorithms. For example, in the empirical study described below, it reduces the error of a decision-tree learn...

1823 | Experiments with a new boosting algorithm
- Freund, Schapire
- 1996
Citation Context: ...g. Bagging is one of several "multiple model" approaches that have recently received much attention (see, for example, (Chan, Stolfo, & Wolpert 1996)). Other procedures of this type include boosting (Freund & Schapire 1996) and stacking (Wolpert 1992). Two related explanations have been proposed for bagging's success, both in a classical statistical framework. ...

1221 | Bayesian Theory
- Bernardo, Smith
- 1994
Citation Context: ...wered the question of how the success of bagging relates to domain characteristics. In this paper, an alternate line of reasoning is pursued, one that draws on Bayesian learning theory (Buntine 1990; Bernardo & Smith 1994). In this theory, knowledge of the domain is (or assumptions about it are) contained in the prior probability assigned to the different models in the model space under consideration, the training set...

211 | On bias, variance, 0/1-loss, and the curse-of-dimensionality
- Friedman
- 1997
Citation Context: ...nd variances of a learning algorithm. Several alternative definitions of bias and variance for classification learners have been proposed (Kong & Dietterich 1995; Kohavi & Wolpert 1996; Breiman 1996b; Friedman 1996). Loosely, bias measures the systematic component of a learner's error (i.e., its average error over many different training sets), and variance measures the additional error that is due to the varia...

200 | Constructing optimal binary decision trees is NP-complete
- Hyafil, Rivest
- 1976
Citation Context: ...ly representing the same model as a bagged ensemble, and compare its complexity with that of the single tree induced from the whole training set. Although this is likely to be an NP-complete problem (Hyafil & Rivest 1976), an approximation to this approach can be obtained by simply applying the base learner to a training set composed of a large number of examples generated at random, and classified according to the b...
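The approximation described in the context above, training the base learner on random examples labeled by the bagged ensemble, can be sketched as follows. All three function interfaces (`ensemble_predict`, `sample_instance`, `learn`) are hypothetical stand-ins, not the paper's code:

```python
import random

def distill_ensemble(ensemble_predict, sample_instance, learn, n=1000, seed=0):
    """Approximate a bagged ensemble with a single model.

    Generates n random instances, labels each one with the ensemble's
    prediction, and runs the base learner on this synthetic training set.
    The resulting single model can then be compared in complexity with
    the tree induced from the original training set.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        x = sample_instance(rng)                    # draw a random instance
        synthetic.append((x, ensemble_predict(x)))  # label it by the ensemble
    return learn(synthetic)                         # one model mimicking the vote
```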

191 | Bias plus variance decomposition for zero-one loss functions
- Kohavi, Wolpert
- 1996
Citation Context: ...ss of bagging to the notions of bias and variances of a learning algorithm. Several alternative definitions of bias and variance for classification learners have been proposed (Kong & Dietterich 1995; Kohavi & Wolpert 1996; Breiman 1996b; Friedman 1996). Loosely, bias measures the systematic component of a learner's error (i.e., its average error over many different training sets), and variance measures the additional ...

157 | Error-correcting output coding corrects bias and variance
- Kong, Dietterich
- 1995
Citation Context: ...(1996) relates the success of bagging to the notions of bias and variances of a learning algorithm. Several alternative definitions of bias and variance for classification learners have been proposed (Kong & Dietterich 1995; Kohavi & Wolpert 1996; Breiman 1996b; Friedman 1996). Loosely, bias measures the systematic component of a learner's error (i.e., its average error over many different training sets), and variance m...

100 | Bias, variance and arcing classifiers - Breiman - 1996 |

82 | A Theory of Learning Classification Rules
- Buntine
- 1990
Citation Context: ...t leaves unanswered the question of how the success of bagging relates to domain characteristics. In this paper, an alternate line of reasoning is pursued, one that draws on Bayesian learning theory (Buntine 1990; Bernardo & Smith 1994). In this theory, knowledge of the domain is (or assumptions about it are) contained in the prior probability assigned to the different models in the model space under consider...

57 | Knowledge acquisition from examples via multiple models
- Domingos
- 1997
Citation Context: ...plexities of the two directly comparable. This "meta-learning" procedure was carried out for the 26 databases previously mentioned, using C4.5 as before. Details and full results are given elsewhere (Domingos 1997). In all but four of the 22 databases where bagging improves on the single rule set, metalearning also produces a rule set with lower error, with over 99% confidence according to sign and Wilcoxon te...

51 | Bayesian model averaging
- Madigan, Raftery, et al.
- 1996
Citation Context: ...necessary. Since $\Pr(h \mid \vec{x}, \vec{c})$ is often very peaked, using only the model with highest posterior can be an acceptable approximation. Alternatively, a sampling scheme (e.g., Markov chain Monte Carlo (Madigan et al. 1996)) can be used. Empirical Tests of the First Hypothesis: This section empirically tests the following hypothesis: 1. Bagging reduces a classification learner's error rate because it more closely approx...
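The contrast the context above draws is between bagging's uniform vote and Bayesian model averaging's posterior-weighted combination. A minimal sketch of the latter, assuming each model comes paired with a log-posterior score (an illustrative interface, not the paper's code):

```python
import math
from collections import defaultdict

def bma_predict(models, x):
    """Bayesian model averaging sketch: weight each model's vote by its
    posterior probability instead of voting uniformly.

    `models` is a list of (h, log_posterior) pairs, where h(x) -> class.
    Because Pr(h | x, c) is often very peaked, a single model usually
    dominates the weighted vote.
    """
    # Normalize posteriors with the log-sum-exp trick for stability.
    m = max(lp for _, lp in models)
    z = sum(math.exp(lp - m) for _, lp in models)
    scores = defaultdict(float)
    for h, lp in models:
        scores[h(x)] += math.exp(lp - m) / z  # posterior-weighted vote
    return max(scores, key=scores.get)
```

Under uniform voting the two low-posterior models below would outvote the high-posterior one; posterior weighting reverses that, which is exactly the difference the paper's first hypothesis turns on.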

45 | On Finding the Most Probable Model
- Cheeseman
- 1990
Citation Context: ...the region(s) of model space occupied by the models sampled by bootstrapping. This use of training-set information in the prior is not strictly allowed by Bayesian theory, but is nevertheless common (Cheeseman 1990). Although counter-intuitive, penalizing models that have lower error on the training data simply corresponds to an assumption that the models overfit the data, or more precisely, that the models tha...