## Boosting as Entropy Projection (1999)


Citations: 66 (9 self)

### BibTeX

@MISC{Kivinen99boostingas,

author = {Jyrki Kivinen and Manfred K. Warmuth},

title = {Boosting as Entropy Projection},

year = {1999}

}

### Abstract

We consider the AdaBoost procedure for boosting weak learners. In AdaBoost, a key step is choosing a new distribution on the training examples based on the old distribution and the mistakes made by the present weak hypothesis. We show how AdaBoost's choice of the new distribution can be seen as an approximate solution to the following problem: Find a new distribution that is closest to the old distribution subject to the constraint that the new distribution is orthogonal to the vector of mistakes of the current weak hypothesis. The distance (or divergence) between distributions is measured by the relative entropy. Alternatively, we could say that AdaBoost approximately projects the distribution vector onto a hyperplane defined by the mistake vector. We show that this new view of AdaBoost as an entropy projection is dual to the usual view of AdaBoost as minimizing the normalization factors of the updated distributions.
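The projection described in the abstract can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a binary weak hypothesis, so that each entry of the mistake vector `u` is `y_i * h(x_i)` in {-1, +1}, and it uses the standard AdaBoost step size, which in this binary case minimizes the normalization factor exactly. The function names and the toy data are hypothetical.

```python
import math


def adaboost_update(d, u):
    """One AdaBoost reweighting step (illustrative sketch).

    d: current distribution over training examples (positive, sums to 1).
    u: u[i] = y_i * h(x_i), assumed here to be +1 (correct) or -1 (mistake).
    Returns the new distribution, the step size alpha, and the
    normalization factor Z(alpha).
    """
    eps = sum(di for di, ui in zip(d, u) if ui < 0)  # weighted error
    if not 0.0 < eps < 1.0:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    # For u in {-1, +1}, this alpha minimizes Z(alpha) = sum_i d_i exp(-alpha u_i).
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    w = [di * math.exp(-alpha * ui) for di, ui in zip(d, u)]
    z = sum(w)
    return [wi / z for wi in w], alpha, z


def relative_entropy(p, q):
    """Relative entropy sum_i p_i ln(p_i / q_i) between two distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy data (hypothetical): four examples, the second one is misclassified.
d = [0.1, 0.2, 0.3, 0.4]
u = [1, -1, 1, 1]
d_new, alpha, z = adaboost_update(d, u)

# Constraint from the abstract: the new distribution is orthogonal to u.
dot = sum(di * ui for di, ui in zip(d_new, u))
# Duality: the relative entropy reached equals -ln of the minimized Z.
gap = relative_entropy(d_new, d) + math.log(z)
print(dot, gap)  # both should be zero up to rounding error
```

In this binary setting the orthogonality constraint holds exactly; the "approximate" in the abstract presumably concerns more general settings in which the step size is not the exact minimizer.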

### Citations

3976 | Variational Analysis
- Rockafellar, Wets
- 1998
Citation Context ...1 ($\alpha$) is as in (B.4) and $\alpha_t = \arg\max_\alpha Q_t(\alpha)$ (B.10). To make more use of this, let us write $Q_t(\alpha)$ out in a more explicit form. For this, it is useful to introduce the convex conjugate of $F$ [Roc70]. This is the function $G$ that satisfies $G(\ell) + F(w) = \ell \cdot w$ (B.11) for $\ell = f(w)$. From the convexity of $F$, we know that $G$ is well-defined and convex. Further, by differentiating the definition (... |

2613 | A decision-theoretic generalization of on-line learning and an application to boosting
- Freund, Schapire
- 1997
Citation Context ...ng it several times on slightly modified training data and then combining the results in a suitable manner. Currently the most popular variants of boosting are based on Freund and Schapire's AdaBoost [FS97b]. The details of the boosting framework of our paper are mainly taken from Schapire and Singer's work on confidence-rated boosting [SS98]. Let us review the basic idea of boosting on a very rough leve... |

1357 | Additive logistic regression: a statistical view of boosting
- Friedman, Hastie, et al.
- 1998
Citation Context ...ial. In particular, the proof of (2.4) uses the property that $e^{-\alpha y h(x)} \geq 1$ if and only if $\mathrm{sign}(h(x)) \neq \mathrm{sign}(y)$. Thus the exponential gives a nice approximation to the discrete loss [SS98]. (See [FHT98] for more discussion.) We now suggest an alternative view, in which the corrective and totally corrective updates appear as solutions to constrained relative entropy minimization problems. The exponen... |

1225 | Linear and nonlinear programming
- Luenberger, Ye
- 2008
Citation Context ...oise. The distance is measured by the relative entropy $\Delta(d_{t+1}, d_t)$ defined in (1.2). The following theorem is basically an application of standard duality techniques from convex optimization [Lue84]. Intuitively, a constrained minimization problem for $\Delta(d, d_t)$ turns out to be equivalent to an unconstrained maximization problem for $-\ln Z_t(\alpha)$. A similar result, but with more emphas... |

937 | Nonlinear programming: theory and algorithms - Bazaraa, Sherali, et al. - 1993 |

762 | Improved Boosting Algorithms Using Confidence-rated Predictions
- Schapire, Singer
- 1999
Citation Context ...iants of boosting are based on Freund and Schapire's AdaBoost [FS97b]. The details of the boosting framework of our paper are mainly taken from Schapire and Singer's work on confidence-rated boosting [SS98]. Let us review the basic idea of boosting on a very rough level. We take as our starting point an arbitrary learning algorithm, which in this context is called the weak learner (as Supported by ESPRI... |

724 | The strength of weak learnability
- Schapire
- 1990
Citation Context ...AdaBoost as an entropy projection is dual to the usual view of AdaBoost as minimizing the normalization factors of the updated distributions. 1 Introduction Boosting, originally suggested by Schapire [Sch90], is a particular method for improving the performance of a (supervised) learning algorithm by applying it several times on slightly modified training data and then combining the results in a suitable... |

592 | Inducing features of random fields
- Della Pietra, Della Pietra, et al.
- 1997
Citation Context ... less equivalent with the corrective boosting algorithm, but in a context somewhat different from boosting weak learners, was analysed using a duality relation similar to (1.5) by Della Pietra et al. [DDL97]; see Lafferty [Laf99] for connecting this work to boosting as it is understood in computational learning theory. Considering problems other than boosting, one should notice work on on-line prediction... |

452 | Boosting a weak learning algorithm by majority
- Freund
- 1990
Citation Context ...ng Group NeuroCOLT2 y Supported by NSF grant CCR 9700201 opposed to the master algorithm that implements the whole boosting procedure). We also have a fixed training set of examples. Following Freund [Fre95], we choose some probability distribution over the training set as the initial training distribution. We then repeat the following until some termination condition is met. We call the weak learner and... |

313 | The relaxation method for finding the common point of convex sets and its application to the solution of problems
- Bregman
- 1967
Citation Context ... $U_t$, then (4.1) becomes the familiar Pythagorean Theorem. This property of minimum distance projections onto sets defined by linear constraints holds even more generally for all Bregman divergences [Bre67]. For applications in on-line learning theory, see [HW98]. 5 Boosting in a regression framework It has been pointed out [ROM98] that on highly noisy training sets, AdaBoost may tend to overfit. Consid... |

282 | Soft margins for AdaBoost
- Rätsch, Onoda, et al.
- 2001
Citation Context ...linear constraints holds even more generally for all Bregman divergences [Bre67]. For applications in on-line learning theory, see [HW98]. 5 Boosting in a regression framework It has been pointed out [ROM98] that on highly noisy training sets, AdaBoost may tend to overfit. Considering this, the strict constraint $d_{t+1} \cdot u_t = 0$ of the corrective algorithm, and the even tighter constraint of the tota... |

207 | Entropy Optimization Principles with Applications - Kapur, Kesavan - 1992 |

199 | Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems - Csiszár - 1991 |

145 | Adaptive game playing using multiplicative weights - Freund, Schapire |

143 | Additive versus exponentiated gradient updates for linear prediction
- Kivinen, Warmuth
- 1997
Citation Context ...it is understood in computational learning theory. Considering problems other than boosting, one should notice work on on-line prediction algorithms using experts [FSSW97, KW99] and linear regression [KW97]. In this context, the relative entropy has been used in the same kind of double role as here, both in deriving updates and then proving (worst-case) performance bounds for them. Outside the context o... |

126 | Relative loss bounds for online density estimation with the exponential family of distributions - Azoury, Warmuth |

103 | Using and combining predictors that specialize - Freund, Schapire, et al. - 1997 |

66 | Averaging expert predictions - Kivinen, Warmuth - 1999 |

57 | An iterative row-action method for interval convex programming - Censor, Lent - 1981 |

53 | Differential Geometrical Method in Statistics - Amari - 1985 |

51 | Convex Analysis and Minimization Algorithms II
- Hiriart-Urruty, Lemaréchal
- 1993
Citation Context ...earners). Similar generalizations have been done in parallel work by Lafferty [Laf99]. For solving the minimization problem (1.3), one can use standard methods of constrained convex optimization; see [HUL91] for an overview. Here we want to point out two earlier papers that use the actual boosting update (1.1) to solve (ostensibly) a different numerical problem. Littlestone, Long and Warmuth [LLW92] sugg... |

44 | Online learning of linear functions
- Littlestone, Long, et al.
- 1991
Citation Context ... see [HUL91] for an overview. Here we want to point out two earlier papers that use the actual boosting update (1.1) to solve (ostensibly) a different numerical problem. Littlestone, Long and Warmuth [LLW92] suggest this update for solving iteratively a system of linear equations with a sparse solution. Cesa-Bianchi, Krogh, and Warmuth [CBKW94] developed the same algorithm in the context of finding a max... |

40 | Additive models, boosting, and inference for generalized divergences
- Lafferty
- 1999
Citation Context ...the corrective boosting algorithm, but in a context somewhat different from boosting weak learners, was analysed using a duality relation similar to (1.5) by Della Pietra et al. [DDL97]; see Lafferty [Laf99] for connecting this work to boosting as it is understood in computational learning theory. Considering problems other than boosting, one should notice work on on-line prediction algorithms using expe... |

34 | General entropy criteria for inverse problems, with applications to data compression, pattern classification and cluster analysis - Jones, Byrne - 1990 |

25 | Relative information - Jumarie - 1990 |

24 | A geometric approach to leveraging weak learners
- Duffy, Helmbold
- 1999
Citation Context ...updates motivated by different Bregman divergences are briefly discussed in Appendix B, but without any results on the training error of the resulting boosting procedure. Perhaps the GeoLev procedure [DH99] could be related to projections with respect to the squared Euclidean distances. Note that the relative entropy is a special divergence in that it is defined on the simplex $P_m$ and this is the natural... |

21 | Improved boosting algorithms using confidence-rated predictions - Schapire, Singer - 1999 |

20 | Tracking the best regressor
- Herbster, Warmuth
- 1998
Citation Context .... This property of minimum distance projections onto sets defined by linear constraints holds even more generally for all Bregman divergences [Bre67]. For applications in on-line learning theory, see [HW98]. 5 Boosting in a regression framework It has been pointed out [ROM98] that on highly noisy training sets, AdaBoost may tend to overfit. Considering this, the strict constraint $d_{t+1} \cdot u_t = 0$ of... |

16 | Inducing features of random fields - Pietra, S, et al. - 1997 |

12 | Bounds on approximate steepest descent for likelihood maximization in exponential families
- Cesa-Bianchi, Krogh, et al.
- 1994
Citation Context ... a different numerical problem. Littlestone, Long and Warmuth [LLW92] suggest this update for solving iteratively a system of linear equations with a sparse solution. Cesa-Bianchi, Krogh, and Warmuth [CBKW94] developed the same algorithm in the context of finding a maximum likelihood model from an exponential family. Both papers actually give a more general algorithm that corresponds to (1.4) generalized ... |

10 | The binary exponentiated gradient algorithm for learning linear functions
- Bylander
- 1997
Citation Context ...result is that we also get Theorem 1 for the usual relative entropy as a special case of the derivation given here. Another related divergence is the sum of binary relative entropies used by Bylander [Byl97] to analyse on-line linear regression. This divergence is defined for vectors in $[0, 1]^m$, with $\Delta_F(\tilde{w}, w) = \sum_{i=1}^{m} \left( \tilde{w}_i \ln \frac{\tilde{w}_i}{w_i} + (1 - \tilde{w}_i) \ln \frac{1 - \tilde{w}_i}{1 - w_i} \right)$ f... |

10 | The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming - Bregman - 1967 |