Recently, ensemble methods like AdaBoost have been applied successfully to many problems while seemingly defying the problem of overfitting. AdaBoost rarely overfits in the low-noise regime; however, we show that it clearly does so at higher noise levels. Central to understanding this fact is the margin distribution. AdaBoost can be viewed as a constrained gradient descent on an error function with respect to the margin. We find that AdaBoost asymptotically achieves a hard margin distribution, i.e., the algorithm concentrates its resources on a few hard-to-learn patterns that are, interestingly, very similar to Support Vectors. A hard margin is clearly a sub-optimal strategy in the noisy case, and regularization, in our case a "mistrust" in the data, must be introduced into the algorithm to alleviate the distortions that single difficult patterns (e.g., outliers) can cause to the margin distribution. We propose several regularization methods and generalizations of the original AdaBoost algorithm to achieve a soft margin. In particular, we suggest (1) regularized AdaBoost-Reg, where the gradient descent is done directly with respect to the soft margin, and (2) regularized linear and quadratic programming (LP/QP-) AdaBoost, where the soft margin is attained by introducing slack variables. Extensive simulations demonstrate that the proposed regularized AdaBoost-type algorithms are useful and yield competitive results for noisy data.
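To make the notion of a margin distribution concrete, the following is a minimal sketch of the original (unregularized) AdaBoost with threshold stumps on a one-dimensional toy set. It is not the authors' experimental setup: the data, the stump learner, and the names `train_stump`, `adaboost`, and `margins` are assumptions chosen for illustration. The margin of a pattern is taken here as its label times the ensemble output, normalized by the sum of the hypothesis weights, so it lies in [-1, 1].

```python
import math

def stump_predict(x, t, s):
    """Threshold stump: predicts s for x > t and -s otherwise (s is +1 or -1)."""
    return s if x > t else -s

def train_stump(xs, ys, w):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    pts = sorted(set(xs))
    thresholds = [pts[0] - 1.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    best = None
    for t in thresholds:
        for s in (1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if stump_predict(x, t, s) != y)
            if best is None or err < best[0]:
                best = (err, t, s)
    return best

def adaboost(xs, ys, rounds):
    """Plain AdaBoost; returns the ensemble and the final pattern weights."""
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, t, s = train_stump(xs, ys, w)
        err = min(max(err, 1e-12), 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, s))
        # Exponential reweighting: misclassified patterns gain weight.
        w = [wi * math.exp(-alpha * y * stump_predict(x, t, s))
             for x, y, wi in zip(xs, ys, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble, w

def margins(xs, ys, ensemble):
    """Normalized margin y * f(x) / sum(alpha) for every pattern."""
    total = sum(a for a, _, _ in ensemble)
    return [y * sum(a * stump_predict(x, t, s) for a, t, s in ensemble) / total
            for x, y in zip(xs, ys)]

# Hypothetical toy data: the patterns at x = 3 and x = 4 break the
# otherwise monotone labeling and play the role of hard-to-learn patterns.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [-1, -1, -1, 1, -1, 1, 1, 1]

ensemble, w = adaboost(xs, ys, rounds=50)
m = margins(xs, ys, ensemble)
for x, mi, wi in zip(xs, m, w):
    print(f"x={x:.0f}  margin={mi:+.3f}  weight={wi:.3f}")
```

Printing the margins next to the final pattern weights lets one inspect which patterns the algorithm spends its weight on. A soft-margin variant in the sense of the abstract would change the reweighting step or add slack variables, which this sketch does not attempt.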