## Additive Logistic Regression: a Statistical View of Boosting (1998)

Venue: Annals of Statistics

Citations: 1303 (21 self)

### BibTeX

@ARTICLE{Friedman98additivelogistic,
  author  = {Jerome Friedman and Trevor Hastie and Robert Tibshirani},
  title   = {Additive Logistic Regression: a Statistical View of Boosting},
  journal = {Annals of Statistics},
  year    = {2000},
  volume  = {28},
  pages   = {337--407}
}

### Abstract

Boosting (Freund & Schapire 1996, Schapire & Singer 1998) is one of the most important recent developments in classification methodology. The performance of many classification algorithms can often be dramatically improved by sequentially applying them to reweighted versions of the input data, and taking a weighted majority vote of the sequence of classifiers thereby produced. We show that this seemingly mysterious phenomenon can be understood in terms of well known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion. We develop more direct approximations and show that they exhibit nearly identical results to boosting. Direct multi-class generalizations based on multinomial likelihood are derived that exhibit performance comparable to other recently proposed multi-class generalizations of boosting in most...
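The reweighting-plus-weighted-vote scheme the abstract describes is, concretely, Discrete AdaBoost. Below is a minimal sketch using decision stumps as the base classifier; the function names, the toy data, and the stump learner are all illustrative, not the paper's own code:

```python
import numpy as np

def best_stump(X, y, w):
    """Weighted-error-minimizing decision stump: (error, feature, threshold, sign)."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.where(X[:, j] > t, 1, -1)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, j, t, s)
    return best

def discrete_adaboost(X, y, M=30):
    """y must be coded +/-1. Returns a list of (vote_weight, stump)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # start with uniform weights
    ensemble = []
    for _ in range(M):
        err, j, t, s = best_stump(X, y, w)
        err = min(max(err, 1e-12), 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)  # vote weight of this round's stump
        pred = s * np.where(X[:, j] > t, 1, -1)
        w = w * np.exp(-alpha * y * pred)      # upweight misclassified points
        w = w / w.sum()
        ensemble.append((alpha, (j, t, s)))
    return ensemble

def vote(ensemble, X):
    """Weighted majority vote of the stump sequence."""
    F = sum(a * s * np.where(X[:, j] > t, 1, -1) for a, (j, t, s) in ensemble)
    return np.sign(F)

# toy two-class problem with an additive decision boundary
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
ensemble = discrete_adaboost(X, y, M=30)
train_acc = (vote(ensemble, X) == y).mean()
```

Each round refits the base classifier on reweighted data, so later stumps concentrate on the points earlier ones got wrong; the paper's thesis is that this greedy sequence fits an additive model on the logistic scale, approximately maximizing Bernoulli likelihood.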

### Citations

4358 | Classification and Regression Trees - Breiman, Friedman, et al. - 1984

Citation Context: ...p(x)) − f(x))^2 (32). The population algorithm described here translates immediately to an implementation on data when E(·|x) is replaced by a regression method, such as regression trees (Breiman et al. 1984). While the role of the weights is somewhat artificial in the L2 case, it is not in any implementation; w(x) is constant when conditioned on x, but the w(x_i) in a terminal node of a tree, for ...

2721 | Bagging Predictors - Breiman - 1996

Citation Context: ...M (Breiman, Friedman, Olshen & Stone 1984) as the base classifier. This adaptation grows fixed-size trees in a "best-first" manner (see Section 7, page 32). Included in the figure is the bagged tree (Breiman 1996) which averages trees grown on bootstrap resampled versions of the training data. Bagging is purely a variance-reduction technique, and since trees tend to have high variance, bagging often produces ...
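The bagged tree described in this snippet averages predictors grown on bootstrap resamples. A minimal sketch of that mechanic, using a stump rather than a full CART tree purely for brevity (all names and the toy data are illustrative):

```python
import numpy as np

def fit_stump(X, y):
    """Best unweighted decision stump by training error: (feature, threshold, sign)."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.where(X[:, j] > t, 1, -1)
                err = (pred != y).mean()
                if err < best[0]:
                    best = (err, j, t, s)
    return best[1:]

def stump_predict(model, X):
    j, t, s = model
    return s * np.where(X[:, j] > t, 1, -1)

def bagging(X, y, B=25, rng=None):
    """Fit one base classifier per bootstrap resample of the training data."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = len(y)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # draw n rows with replacement
        models.append(fit_stump(X[idx], y[idx]))
    return models

def bagged_predict(models, X):
    """Unweighted majority vote over the B resampled classifiers."""
    votes = sum(stump_predict(m, X) for m in models)
    return np.sign(votes)

# illustrative data
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
models = bagging(X, y, B=25)
bag_acc = (bagged_predict(models, X) == y).mean()
```

Unlike boosting, every resample is drawn from the same distribution and the final vote is unweighted, which is why the context above describes bagging as purely a variance-reduction technique.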

2470 | A decision-theoretic generalization of online learning and an application to boosting - Freund, Schapire - 1997

Citation Context: ...generalization error. This theory (Schapire 1990) has evolved in the machine learning community, initially based on the concepts of PAC learning (Kearns & Vazirani 1994), and later from game theory (Freund 1995, Breiman 1997). Early versions of boosting "weak learners" (Schapire 1990) are far simpler than those described here, and the theory is more precise. The bounds and the theory associated with the Ada...

1791 | Generalized Linear Models - McCullagh, Nelder - 1983

1750 | Experiments with a new boosting algorithm - Freund, Schapire - 1996

1611 | Generalized Additive Models - Hastie, Tibshirani - 1990

1133 | Matching pursuits with time-frequency dictionaries - Mallat, Zhang - 1993

767 | Boosting the margin: A new explanation for the effectiveness of voting methods - Schapire, Freund, et al. - 1997

699 | The strength of weak learnability - Schapire - 1990

Citation Context: ...imate for f_m(x) in Section 3. Freund & Schapire (1996) and Schapire & Singer (1998) provide some theory to support their algorithms, in the form of upper bounds on generalization error. This theory (Schapire 1990) has evolved in the machine learning community, initially based on the concepts of PAC learning (Kearns & Vazirani 1994), and later from game theory (Freund 1995, Breiman 1997). Early versions of boo...

615 | An introduction to computational learning theory - Kearns, Vazirani - 1994

471 | Very simple classification rules perform well on most commonly used datasets - Holte - 1993

Citation Context: ...elicit performance differences among the methods being tested. Such complicated boundaries are not likely to often occur in practice. Many practical problems involve comparatively simple boundaries (Holte 1993); in such cases performance differences will still be situation dependent, but correspondingly less pronounced. 6 Some experiments with data In this section we show the results of running the four fi...

466 | An experimental comparison of three methods for constructing ensembles of decision trees - Dietterich - 2000

Citation Context: ...A variety of other examples (not shown) exhibit similar behavior with all boosting methods. Note that other committee approaches to classification such as bagging (Breiman 1996) and randomized trees (Dietterich 1998), while admitting parallel implementations, cannot take advantage of this approach to reduce computation. 9 Concluding remarks In order to understand a learning procedure statistically it is necessar...

440 | Boosting a Weak Learning Algorithm by Majority - Freund - 1990

Citation Context: ...generalization error. This theory (Schapire 1990) has evolved in the machine learning community, initially based on the concepts of PAC learning (Kearns & Vazirani 1994), and later from game theory (Freund 1995, Breiman 1997). Early versions of boosting "weak learners" (Schapire 1990) are far simpler than those described here, and the theory is more precise. The bounds and the theory associated with the Ada...

439 | Projection pursuit regression - Friedman, Stuetzle - 1981

255 | Multivariate Adaptive Regression Splines (with discussion) - Friedman - 1991

Citation Context: ...ient number of boosts, the stump based model achieved superior performance. More generally, one can consider an expansion of the decision boundary function in a functional ANOVA decomposition (Friedman 1991): B(x) = Σ_j f_j(x_j) + Σ_{j,k} f_{jk}(x_j, x_k) + Σ_{j,k,l} f_{jkl}(x_j, x_k, x_l) + ... (43). The first sum represents the closest function to B(x) that is additive in the original features, the fir...

145 | Another approach to polychotomous classification, Dept. Statistics - Friedman - 1996

Citation Context: ...pooled complement classes. Even if the decision boundaries separating all class pairs are relatively simple, pooling classes can produce complex decision boundaries that are difficult to approximate (Friedman 1996). By considering all of the classes simultaneously, the symmetric multi-class model is better able to take advantage of simple pairwise boundaries when they exist (Hastie & Tibshirani 1998). As note...

144 | Prediction games and arcing algorithms - Breiman - 1999

Citation Context: ...on error. This theory (Schapire 1990) has evolved in the machine learning community, initially based on the concepts of PAC learning (Kearns & Vazirani 1994), and later from game theory (Freund 1995, Breiman 1997). Early versions of boosting "weak learners" (Schapire 1990) are far simpler than those described here, and the theory is more precise. The bounds and the theory associated with the AdaBoost algorith...

116 | Flexible discriminant analysis by optimal scoring - Hastie, Tibshirani, et al. - 1994

97 | Bias, variance and arcing classifiers - Breiman - 1996

Citation Context: ...M (Breiman, Friedman, Olshen & Stone 1984) as the base classifier. This adaptation grows fixed-size trees in a "best-first" manner (see Section 7, page 32). Included in the figure is the bagged tree (Breiman 1996a) which averages trees grown on bootstrap resampled versions of the training data. Bagging is purely a variance-reduction technique. [Footnote 1, attached to Discrete AdaBoost: "Essentially the same as AdaBoost.M1 for binary data (Freund & Schapire 1996)."] ...

67 | Linear Smoothers and Additive Models (with discussion) - Buja, Hastie, et al. - 1989

24 | Classification by pairwise coupling - Hastie, Tibshirani - 1998

7 | Nearest neighbor pattern classification - Cover, Hart - 1967