## Logistic Regression, AdaBoost and Bregman Distances (2000)

Citations: 213 (43 self)

### BibTeX

@MISC{Collins00logisticregression,

author = {Michael Collins and Robert E. Schapire and Yoram Singer},

title = {Logistic Regression, AdaBoost and Bregman Distances},

year = {2000}

}

### Abstract

We give a unified account of boosting and logistic regression in which each learning problem is cast in terms of optimization of Bregman distances. The striking similarity of the two problems in this framework allows us to design and analyze algorithms for both simultaneously, and to easily adapt algorithms designed for one problem to the other. For both problems, we give new algorithms and explain their potential advantages over existing methods. These algorithms can be divided into two types based on whether the parameters are iteratively updated sequentially (one at a time) or in parallel (all at once). We also describe a parameterized family of algorithms which interpolates smoothly between these two extremes. For all of the algorithms, we give convergence proofs using a general formalization of the auxiliary-function proof technique. As one of our sequential-update algorithms is equivalent to AdaBoost, this provides the first general proof of convergence for AdaBoost. We show that all of our algorithms generalize easily to the multiclass case, and we contrast the new algorithms with iterative scaling. We conclude with a few experimental results with synthetic data that highlight the behavior of the old and newly proposed algorithms in different settings.
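The sequential-update scheme described above, which the paper shows is equivalent to AdaBoost for the exponential loss, amounts to coordinate descent: on each round one feature (weak hypothesis) is chosen and only its parameter is updated. A minimal sketch on synthetic ±1-valued features; the helper names (`exp_loss`, `sequential_update`) are ours, not the paper's:

```python
import numpy as np

def exp_loss(y, f):
    """Exponential loss of Eq. (2): sum_i exp(-y_i f(x_i))."""
    return np.exp(-y * f).sum()

def sequential_update(X, y, n_rounds=20):
    """AdaBoost-style coordinate descent on the exponential loss.

    X[i, j] = h_j(x_i) with values in {-1, +1}; on each round one
    feature j is chosen and only lambda_j is updated.
    """
    m, n = X.shape
    lam = np.zeros(n)
    for _ in range(n_rounds):
        w = np.exp(-y * (X @ lam))      # unnormalized example weights
        q = w / w.sum()                 # distribution q_t over examples
        r = (q * y) @ X                 # edge r_j = sum_i q_i y_i h_j(x_i)
        j = np.argmax(np.abs(r))        # best weak hypothesis this round
        # weighted error of h_j; clipped to keep the log finite
        eps = np.clip((1.0 - r[j]) / 2.0, 1e-12, 1.0 - 1e-12)
        lam[j] += 0.5 * np.log((1.0 - eps) / eps)   # AdaBoost's alpha
    return lam
```

The closed-form step for ±1 features follows from W⁺ = (1 + r_j)/2 and W⁻ = (1 − r_j)/2, the usual AdaBoost derivation; a parallel variant would update every λ_j at once.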

### Citations

2426 | A Decision-Theoretic Generalization of Online Learning and an Application to Boosting
- Freund, Schapire
- 1997
Citation Context ...s intractable (see, for instance, (Höffgen & Simon, 1992)). It is therefore often advantageous to instead minimize some other nonnegative loss function. For instance, the boosting algorithm AdaBoost (Freund & Schapire, 1997; Schapire & Singer, 1999) is based on the exponential loss $\sum_{i=1}^{m} \exp(-y_i f(x_i))$ (2). It can be verified that Eq. (1) is upper bounded by Eq. (2). However, the latter loss is much easier to work wi...
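The upper bound mentioned in this context is easy to check numerically: exp(−z) ≥ 1 whenever z ≤ 0 and exp(−z) ≥ 0 always, so each term of the exponential loss dominates the corresponding 0/1 error indicator. A quick sketch on random margins:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.choice([-1.0, 1.0], size=100)      # labels y_i
f = rng.normal(size=100)                   # arbitrary margins f(x_i)

zero_one = float((y * f <= 0).sum())       # Eq. (1): training errors
exponential = float(np.exp(-y * f).sum())  # Eq. (2): exponential loss

# exp(-z) >= 1 for z <= 0, so Eq. (2) upper-bounds Eq. (1) term by term
assert exponential >= zero_one
```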

1287 | Additive logistic regression: a statistical view of boosting
- Friedman, Hastie, et al.
- 2000
Citation Context ...owever, in this paper, we make the idealized assumption that the weak learner always chooses the best h_j. Given this assumption, it has been noted by Breiman (1997a, 1997b) and various later authors (Friedman et al., 2000; Mason et al., 1999; Rätsch, Onoda, & Müller, 2001; Schapire & Singer, 1999) that the choice of both h_j and α_j is made in such a way as to cause the greatest decrease in the exponential loss induced by λ...

1123 | A maximum entropy approach to natural language processing - Berger, Pietra, et al. - 1996 |

717 | Improved boosting algorithms using confidence-rated predictions
- Schapire, Singer
- 1999
Citation Context ...nstance, (Höffgen & Simon, 1992)). It is therefore often advantageous to instead minimize some other nonnegative loss function. For instance, the boosting algorithm AdaBoost (Freund & Schapire, 1997; Schapire & Singer, 1999) is based on the exponential loss $\sum_{i=1}^{m} \exp(-y_i f(x_i))$ (2). It can be verified that Eq. (1) is upper bounded by Eq. (2). However, the latter loss is much easier to work with as demonstrated by Ada...

567 | Inducing Features of Random Fields
- Pietra, Pietra, et al.
- 1997
Citation Context ...aximizing this likelihood then is equivalent to minimizing the log loss of this model $\sum_{i=1}^{m} \ln(1 + \exp(-y_i f(x_i)))$ (3). Generalized and improved iterative scaling (Darroch & Ratcliff, 1972; Della Pietra et al., 1997) are popular parallel-update methods for minimizing this loss. In this paper, we give an alternative parallel-update algorithm which we compare to iterative scaling techniques in preliminary experimen...
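The log loss of Eq. (3) can be sketched directly; `np.logaddexp` is one numerically stable way to evaluate ln(1 + exp(−z)) without overflow for large margins (the helper name is ours):

```python
import numpy as np

def log_loss(y, f):
    """Logistic loss of Eq. (3): sum_i ln(1 + exp(-y_i f(x_i))).

    np.logaddexp(0, -z) computes ln(1 + exp(-z)) without overflowing
    when -z is large.
    """
    return np.logaddexp(0.0, -y * f).sum()

# each example at zero margin contributes ln(2) to the loss
total = log_loss(np.array([1.0, -1.0]), np.array([0.0, 0.0]))
```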

513 | BoosTexter: a boosting-based system for text categorization
- Schapire, Singer
- 2000
Citation Context ...tic loss rather than exponential loss—the only difference is in the manner in which q_t is computed from λ_t. Thus, we could easily convert any system such as SLIPPER (Cohen & Singer, 1999), BoosTexter (Schapire & Singer, 2000) or alternating trees (Freund & Mason, 1999) to use logistic loss. We can even do this for systems based on “confidence-rated” boosting (Schapire & Singer, 1999) in which α_t and j_t are chosen together...
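The point of this passage is that the boosting and logistic variants differ only in how the example weights q_t are formed: proportional to exp(−y_i f(x_i)) for the exponential loss versus 1/(1 + exp(y_i f(x_i))) for the logistic loss. A sketch of the two weightings (function names are ours):

```python
import numpy as np

def boosting_weights(y, f):
    """AdaBoost's distribution: q_i proportional to exp(-y_i f(x_i))."""
    w = np.exp(-y * f)
    return w / w.sum()

def logistic_weights(y, f):
    """Logistic-loss variant: q_i proportional to 1/(1 + exp(y_i f(x_i)));
    swapping this in is the 'only difference' the passage describes."""
    w = 1.0 / (1.0 + np.exp(y * f))
    return w / w.sum()
```

Note that for large positive margins both weightings vanish, but the logistic weights are bounded above, which is one reason they behave better with noisy labels.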

437 | Generalized iterative scaling for log-linear models
- Darroch, Ratcliff
- 1972
Citation Context ...erations, a single feature (weak hypothesis) is chosen and the parameter associated with that single feature is adjusted. In contrast, methods for logistic regression, most notably iterative scaling (Darroch & Ratcliff, 1972; Della Pietra, Della Pietra, & Lafferty, 1997), update all parameters in parallel on each iteration. Our first new algorithm is a method for optimizing the exponential loss using parallel updates. It...
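The parallel-update idea, in which every λ_j moves on every iteration, can be sketched for the exponential loss under the paper's normalization assumption that Σ_j |h_j(x_i)| ≤ 1 for every example. This is an illustrative reading, not the paper's exact pseudocode, and the function name is ours:

```python
import numpy as np

def parallel_update(X, y, n_rounds=20):
    """Parallel-update minimization of the exponential loss:
    every parameter lambda_j moves on every round.

    Assumes features are scaled so sum_j |X[i, j]| <= 1 for all i,
    the normalization under which convergence is proved.
    """
    assert np.all(np.abs(X).sum(axis=1) <= 1.0 + 1e-9)
    lam = np.zeros(X.shape[1])
    for _ in range(n_rounds):
        w = np.exp(-y * (X @ lam))          # example weights
        s = y[:, None] * X                  # signed feature values
        w_plus = (w[:, None] * np.clip(s, 0.0, None)).sum(axis=0)
        w_minus = (w[:, None] * np.clip(-s, 0.0, None)).sum(axis=0)
        # smoothing keeps the log finite when a column is one-sided
        lam += 0.5 * np.log((w_plus + 1e-12) / (w_minus + 1e-12))
    return lam
```

Each step is the same 0.5·ln(W⁺/W⁻) rule as the sequential case, applied to all coordinates at once; the normalization prevents the simultaneous steps from overshooting.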

277 | The relaxation method to find the common point of convex sets and its applications to the solution of problems in convex programming - Bregman - 1967 |

268 | I-divergence geometry of probability distributions and minimization problems - Csiszar - 1975 |

267 | Parallel Optimization, Theory, Algorithms and Applications - Censor, Zenios - 1997 |

265 | Soft margins for AdaBoost
- Rätsch, Onoda, et al.
- 2001
Citation Context ...d assumption that the weak learner always chooses the best h_j. Given this assumption, it has been noted by Breiman (1997a, 1997b) and various later authors (Friedman et al., 2000; Mason et al., 1999; Rätsch, Onoda, & Müller, 2001; Schapire & Singer, 1999) that the choice of both h_j and α_j is made in such a way as to cause the greatest decrease in the exponential loss induced by λ, given that only a single component of λ is to be u...

181 | Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems - Csiszár - 1991 |

146 | Functional gradient techniques for combining hypotheses - Mason, Baxter, et al. - 2000 |

136 | Additive versus Exponentiated Gradient updates for linear prediction - Kivinen, Warmuth - 1997 |

131 | The alternating decision tree learning algorithm
- Freund, Mason
- 1999
Citation Context ...difference is in the manner in which q_t is computed from λ_t. Thus, we could easily convert any system such as SLIPPER (Cohen & Singer, 1999), BoosTexter (Schapire & Singer, 2000) or alternating trees (Freund & Mason, 1999) to use logistic loss. We can even do this for systems based on “confidence-rated” boosting (Schapire & Singer, 1999) in which α_t and j_t are chosen together on each round to minimize Eq. (30) rather t...

96 | A simple, fast, and effective rule learner - Cohen, Singer - 1999 |

85 | Robust Trainability of Single Neurons
- Höffgen, Simon, Van Horn
- 1995
Citation Context ...e [[π]] is 1 if π is true and 0 otherwise. Although minimization of the number of classification errors may be a worthwhile goal, in its most general form, the problem is intractable (see, for instance, (Höffgen & Simon, 1992)). It is therefore often advantageous to instead minimize some other nonnegative loss function. For instance, the boosting algorithm AdaBoost (Freund & Schapire, 1997; Schapire & Singer, 1999) is bas...

77 | Sanov property, generalized I-projection and a conditional limit theorem - Csiszár - 1984 |

76 | Improved boosting algorithms using confidence-rated predictions - Schapire, Singer - 1999 |

61 | Boosting as entropy projection - Kivinen, Warmuth - 1999 |

60 | Arcing the edge - Breiman - 1997 |

47 | An iterative row-action method for interval convex programming - Censor, Lent - 1981 |

42 | On-line learning of linear functions - Littlestone, Long, et al. - 1991 |

40 | Additive models, boosting, and inference for generalized divergences - Lafferty - 1999 |

35 | Prediction Games and Arcing Classifiers - Breiman |

31 | Information theoretical optimization techniques - Topsoe - 1979 |

29 | Duality and auxiliary functions for Bregman distances - Pietra, Pietra - 2001 |

26 | Generalized projections for non-negative functions - Csiszár - 1995 |

25 | Statistical learning algorithms based on Bregman distances - Pietra, Pietra, et al. - 1997 |

19 | Information Theory - Csiszár, Körner - 1981 |

19 | Potential boosters - Duffy, Helmbold - 2000 |

11 | Scaling up a boosting-based learner via adaptive sampling - Domingo, Watanabe - 2000 |

10 | Bounds on approximate steepest descent for likelihood maximization in exponential families - Cesa-Bianchi, Krogh, et al. - 1994 |

5 | From Computational Learning Theory to Discovery Science - Watanabe - 1999 |

3 | Robust trainability of single neurons - Höffgen, Simon - 1992 |

2 | Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. The Annals of Statistics - Csiszár - 1991 |

2 | I-divergence geometry of probability distributions and minimization problems. The Annals of Probability - Csiszár - 1975 |

1 | Generalized projections for non-negative functions - Csiszár - 1995 |

1 | Sanov property, generalized I-projection and a conditional limit theorem - Csiszár - 1984 |