We give a unified account of boosting and logistic regression in which each learning problem is cast in terms of optimization of Bregman distances. The striking similarity of the two problems in this framework allows us to design and analyze algorithms for both simultaneously, and to easily adapt algorithms designed for one problem to the other. For both problems, we give new algorithms and explain their potential advantages over existing methods. These algorithms can be divided into two types based on whether the parameters are iteratively updated sequentially (one at a time) or in parallel (all at once). We also describe a parameterized family of algorithms which interpolates smoothly between these two extremes. For all of the algorithms, we give convergence proofs using a general formalization of the auxiliary-function proof technique. As one of our sequential-update algorithms is equivalent to AdaBoost, this provides the first general proof of convergence for AdaBoost. We show th...
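As a rough illustration of the two objectives and of a sequential (one parameter at a time) update, here is a minimal sketch. It is not the paper's algorithm as stated: it assumes binary features, so that each entry `M[i][j] = y_i * h_j(x_i)` is +1 or -1, and it uses the standard AdaBoost-style coordinate step for the exponential loss. All names (`margins`, `exp_loss`, `log_loss`, `sequential_step`) and the toy data are hypothetical.

```python
import math

def margins(M, lam):
    # Margin of example i: sum_j lam[j] * M[i][j], with M[i][j] = y_i * h_j(x_i).
    return [sum(l * m for l, m in zip(lam, row)) for row in M]

def exp_loss(M, lam):
    # Exponential loss, the objective associated with boosting.
    return sum(math.exp(-z) for z in margins(M, lam))

def log_loss(M, lam):
    # Logistic loss, the objective associated with logistic regression.
    return sum(math.log1p(math.exp(-z)) for z in margins(M, lam))

def sequential_step(M, lam):
    # Sequential update: change only the single coordinate with the largest
    # weighted correlation r. Assumes every M[i][j] is +1 or -1 (an
    # assumption of this sketch), in which case the step below is the exact
    # minimizer of the exponential loss along that coordinate.
    q = [math.exp(-z) for z in margins(M, lam)]
    total = sum(q)
    best_j, best_r = 0, 0.0
    for j in range(len(lam)):
        r = sum(qi * row[j] for qi, row in zip(q, M)) / total
        if abs(r) > abs(best_r):
            best_j, best_r = j, r
    if best_r == 0.0 or abs(best_r) >= 1.0:
        # No progress possible, or the feature separates the data perfectly
        # (the step would be infinite); leave the parameters unchanged.
        return list(lam)
    new_lam = list(lam)
    new_lam[best_j] += 0.5 * math.log((1 + best_r) / (1 - best_r))
    return new_lam

# Toy data (made up for illustration): 4 examples, 2 binary features.
M = [[1, 1], [1, -1], [-1, 1], [1, 1]]
lam0 = [0.0, 0.0]
lam1 = sequential_step(M, lam0)
print("exp loss:", exp_loss(M, lam0), "->", exp_loss(M, lam1))
print("log loss:", log_loss(M, lam0), "->", log_loss(M, lam1))
```

A parallel update would instead adjust every coordinate of `lam` in the same iteration. The sketch shows only the sequential case and only for the exponential loss.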