## Bias Plus Variance Decomposition for Zero-One Loss Functions (1996)


### Download Links

- [robotics.stanford.edu]
- [ai.stanford.edu]
- DBLP

### Other Repositories/Bibliography

Venue: Machine Learning: Proceedings of the Thirteenth International

Citations: 188 (4 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Kohavi96biasplus,
  author    = {Ron Kohavi and David H. Wolpert},
  title     = {Bias Plus Variance Decomposition for Zero-One Loss Functions},
  booktitle = {MACHINE LEARNING: PROCEEDINGS OF THE THIRTEENTH INTERNATIONAL},
  year      = {1996},
  pages     = {275--283},
  publisher = {Morgan Kaufmann Publishers}
}
```

### Abstract

We present a bias-variance decomposition of expected misclassification rate, the most commonly used loss function in supervised classification learning. The bias-variance decomposition for quadratic loss functions is well known and serves as an important tool for analyzing learning algorithms, yet no decomposition was offered for the more commonly used zero-one (misclassification) loss functions until the recent work of Kong & Dietterich (1995) and Breiman (1996). Their decompositions, however, suffer from major shortcomings (e.g., potentially negative variance), which our decomposition avoids. We show that, in practice, the naive frequency-based estimation of the decomposition terms is itself biased and show how to correct for this bias. We illustrate the decomposition on various algorithms and datasets from the UCI repository.
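The decomposition described in the abstract can be sketched in code. This is a minimal illustration, not the authors' implementation: `kw_decomposition` is a hypothetical helper, and it assumes a noise-free (deterministic) target, so the intrinsic-noise term is zero. It follows the paper's per-instance form, with bias-squared as half the sum of squared differences between the target and learner prediction distributions, and variance as half of one minus the sum of squared learner prediction probabilities, where the learner's prediction distribution is estimated by frequency counts over many training sets.

```python
def kw_decomposition(pred_dist, true_label):
    """Bias^2 and variance at a single test point x under zero-one loss.

    pred_dist:  dict mapping class label -> P(learner predicts label | x),
                estimated by frequency counts over many training sets.
    true_label: the target class at x (deterministic target assumed,
                so the intrinsic-noise term is zero and omitted).
    """
    labels = set(pred_dist) | {true_label}
    p_f = {y: 1.0 if y == true_label else 0.0 for y in labels}  # target distribution
    p_h = {y: pred_dist.get(y, 0.0) for y in labels}            # learner distribution
    bias_sq = 0.5 * sum((p_f[y] - p_h[y]) ** 2 for y in labels)
    variance = 0.5 * (1.0 - sum(p_h[y] ** 2 for y in labels))
    return bias_sq, variance

# A learner that predicts class 0 or 1 with equal probability on a point
# whose true class is 0: bias^2 = 0.25 and variance = 0.25, which sum to
# the expected zero-one error of 0.5.
print(kw_decomposition({0: 0.5, 1: 0.5}, 0))  # (0.25, 0.25)
```

For a noisy target the paper adds an intrinsic-noise term; this sketch omits it for brevity.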

### Citations

9359 | Elements of Information Theory - Cover & Thomas, 1991

> Citation context: "...en averaging, minus estimating based on the full set of $N = 2n$ training sets):
> $$E\left[\frac{1}{2}\sum_{i=1}^{2}(T - w_i)^2 - \Big(T - \frac{1}{2}\sum_{i=1}^{2} w_i\Big)^2\right] = \frac{1}{2}\sum_{i=1}^{2} E\left[(T - w_i)^2\right] - E\left[\Big(T - \frac{1}{2}\sum_{i=1}^{2} w_i\Big)^2\right]$$
> By Jensen's inequality (Cover & Thomas 1991), the first term on the right-hand side is greater than or equal to the second term. This shows that when we average once (over 2n instances) rather than twice (over n instances) we get a smaller estimate fo..."
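The inequality quoted in this context can be checked numerically. A small sketch under assumed values (the fixed target `T` and the Gaussian estimates `w` are invented for illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1.0                                       # fixed target value (illustrative)
w = rng.normal(0.8, 0.5, size=(100_000, 2))   # pairs (w1, w2) of noisy estimates

# Average the two squared errors ("average twice")...
avg_twice = np.mean(0.5 * ((T - w[:, 0]) ** 2 + (T - w[:, 1]) ** 2))
# ...versus the squared error of the averaged estimate ("average once").
avg_once = np.mean((T - w.mean(axis=1)) ** 2)

# Jensen's inequality: averaging the estimates first yields the smaller value.
assert avg_once <= avg_twice
```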

3679 | Simplifying Decision Trees - Quinlan, 1987

> Citation context: "...set E. (Equation 4 was used to estimate $p(y_H \mid f, m, x)$.) At first, all these terms were estimated using frequency counts. Figure 1 (left) shows the estimate for $\text{bias}^2$ for different values of $N$ when ID3 (Quinlan 1986) was executed on three datasets from the UCI repository. It is clear that our estimate of $\text{bias}^2$ using frequency counts shrinks as we increase $N$. Since infinite $N$ gives the correct value of $\text{bias}^2$, t..."

3487 | An Introduction to the Bootstrap - Efron, 1993

2804 | Bagging Predictors - Breiman, 1996

> Citation context: "...ng Classifiers There has been a lot of work recently on combining classifiers, with the terms aggregation, averages, ensembles, classifier combinations, voting, and stacking commonly used (Wolpert 1992, Breiman 1994a, Perrone 1993, Ali 1996). In the simplest scheme, multiple classifiers are generated and then vote for each test instance, with the majority prediction used as the final prediction. Figure 3 shows ID3 ..."
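The simple voting scheme described in this context, a majority vote over multiple trained classifiers, can be sketched as follows; `majority_vote` is an illustrative helper, not code from the paper:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the classifiers' predictions
    for one test instance (ties go to the label seen first)."""
    return Counter(predictions).most_common(1)[0][0]

# Predictions from 5 classifiers on two test instances.
per_instance = [[1, 1, 0, 1, 0], [0, 0, 0, 1, 0]]
final = [majority_vote(p) for p in per_instance]
print(final)  # [1, 0]
```

Averaging votes this way mainly reduces the variance term of the decomposition, which is why unstable learners such as decision trees tend to benefit most.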

788 | UCI Repository of Machine Learning Databases - Murphy & Aha, 1992

> Citation context: "...erms in our decomposition by using frequency counts. 4.1 Our frequency-counts experiments To investigate the behavior of the terms in our decomposition, we ran a set of experiments on the UCI repository (Murphy & Aha 1996). In each of those experiments, for a given dataset and a given learning algorithm, we estimated (the x-average of) $\text{bias}^2$, variance, intrinsic noise, and overall error as follows. 1. We randomly div..."

650 | Neural Networks and the Bias/Variance Dilemma - Geman, Bienenstock, et al.

591 | Stacked Generalization - Wolpert, 1992

> Citation context: "...n. 4.5 Combining Classifiers There has been a lot of work recently on combining classifiers, with the terms aggregation, averages, ensembles, classifier combinations, voting, and stacking commonly used (Wolpert 1992, Breiman 1994a, Perrone 1993, Ali 1996). In the simplest scheme, multiple classifiers are generated and then vote for each test instance, with the majority prediction used as the final prediction. Figur..."

156 | Error-Correcting Output Coding Corrects Bias and Variance - Kong & Dietterich, 1995

100 | Bias, Variance, and Arcing Classifiers - Breiman, 1996

80 | Improving Regression Estimation: Averaging Methods for Variance Reduction with Extensions to General Convex Measure Optimization - Perrone, 1993

> Citation context: "...here has been a lot of work recently on combining classifiers, with the terms aggregation, averages, ensembles, classifier combinations, voting, and stacking commonly used (Wolpert 1992, Breiman 1994a, Perrone 1993, Ali 1996). In the simplest scheme, multiple classifiers are generated and then vote for each test instance, with the majority prediction used as the final prediction. Figure 3 shows ID3 versus a combin..."

51 | The Heuristics of Instability in Model Selection - Breiman, 1996

50 | Machine Learning Bias, Statistical Bias, and Statistical Variance of Decision Tree Algorithms - Dietterich & Kong, 1995

23 | Learning Probabilistic Relational Concept Descriptions - Ali, 1996

> Citation context: "...a lot of work recently on combining classifiers, with the terms aggregation, averages, ensembles, classifier combinations, voting, and stacking commonly used (Wolpert 1992, Breiman 1994a, Perrone 1993, Ali 1996). In the simplest scheme, multiple classifiers are generated and then vote for each test instance, with the majority prediction used as the final prediction. Figure 3 shows ID3 versus a combination of 5..."

18 | Almost Sure Consistent Nonparametric Regression from Recursive Partitioning Schemes. Journal of Multivariate Analysis 15:147-163 - Gordon & Olshen, 1984

> Citation context: "...equence, decision trees are very unstable, and therefore they usually gain by aggregation techniques (Breiman 1994b). Note also that the smaller the internal sample, the more bias we potentially add (Gordon & Olshen 1984) but the more different the classifiers will be, leading to a more stable average. Our results show that in this voting scheme, the reduction in error is almost solely due to the reduction in variance..."

3 | Bias, Variance, and Arcing Classifiers - Breiman, 1996

1 | Heuristics of Instability in Model Selection - Breiman, 1994

> Citation context: "...ng Classifiers There has been a lot of work recently on combining classifiers, with the terms aggregation, averages, ensembles, classifier combinations, voting, and stacking commonly used (Wolpert 1992, Breiman 1994a, Perrone 1993, Ali 1996). In the simplest scheme, multiple classifiers are generated and then vote for each test instance, with the majority prediction used as the final prediction. Figure 3 shows ID3 ..."