Results 1-10 of 44
Boosting the margin: A new explanation for the effectiveness of voting methods
 In Proceedings International Conference on Machine Learning
, 1997
Cited by 898 (52 self)
One of the surprising recurring phenomena observed in experiments with boosting is that the test error of the generated classifier usually does not increase as its size becomes very large, and often is observed to decrease even after the training error reaches zero. In this paper, we show that this phenomenon is related to the distribution of margins of the training examples with respect to the generated voting classification rule, where the margin of an example is simply the difference between the number of correct votes and the maximum number of votes received by any incorrect label. We show that techniques used in the analysis of Vapnik’s support vector classifiers and of neural networks with small weights can be applied to voting methods to relate the margin distribution to the test error. We also show theoretically and experimentally that boosting is especially effective at increasing the margins of the training examples. Finally, we compare our explanation to those based on the bias-variance decomposition.
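The margin definition in this abstract can be illustrated with a minimal Python sketch. It assumes an unweighted vote in which every base classifier casts exactly one vote; the function name `voting_margin` is hypothetical and does not come from the paper.

```python
from collections import Counter


def voting_margin(votes, true_label):
    """Margin of one example under an unweighted vote (illustrative assumption).

    votes: labels predicted by the individual classifiers for this example.
    Returns (#votes for the correct label) minus
    (largest #votes received by any single incorrect label).
    """
    counts = Counter(votes)
    correct = counts.get(true_label, 0)
    # default=0 covers the case where no classifier voted for a wrong label.
    wrong = max((c for label, c in counts.items() if label != true_label), default=0)
    return correct - wrong
```

A positive margin means the vote classifies the example correctly, and a larger value means the vote is more lopsided in favor of the correct label. For example, `voting_margin(["a", "a", "b", "c", "a"], "a")` counts three correct votes against at most one vote for any wrong label.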
Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers
 Journal of Machine Learning Research
, 2000
Cited by 560 (20 self)
We present a unifying framework for studying the solution of multiclass categorization problems by reducing them to multiple binary problems that are then solved using a margin-based binary learning algorithm. The proposed framework unifies some of the most popular approaches in which each class is compared against all others, or in which all pairs of classes are compared to each other, or in which output codes with error-correcting properties are used. We propose a general method for combining the classifiers generated on the binary problems, and we prove a general empirical multiclass loss bound given the empirical loss of the individual binary learning algorithms. The scheme and the corresponding bounds apply to many popular classification learning algorithms including support-vector machines, AdaBoost, regression, logistic regression and decision-tree algorithms. We also give a multiclass generalization error analysis for general output codes with AdaBoost as the binary learner. Experimental results with SVM and AdaBoost show that our scheme provides a viable alternative to the most commonly used multiclass algorithms.
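One way to picture the reduction described in this abstract is a coding matrix with one row per class and one column per binary problem. The sketch below is an illustration under stated assumptions, not the paper's implementation: entries are taken to be in {-1, 0, +1} (0 meaning the class is left out of that binary problem), prediction uses a simple Hamming-style decoding, and all function names are hypothetical.

```python
from itertools import combinations

import numpy as np


def one_vs_all_matrix(k):
    """k x k matrix: column c separates class c (+1) from all other classes (-1)."""
    return 2 * np.eye(k, dtype=int) - 1


def all_pairs_matrix(k):
    """k x C(k,2) matrix: each column compares one pair of classes, others are 0."""
    pairs = list(combinations(range(k), 2))
    M = np.zeros((k, len(pairs)), dtype=int)
    for col, (a, b) in enumerate(pairs):
        M[a, col] = 1
        M[b, col] = -1
    return M


def hamming_decode(M, f):
    """Pick the class whose code row best agrees with the binary outputs f.

    f: real-valued outputs of the binary classifiers, one per column of M.
    Each column adds 0 on sign agreement, 1 on disagreement, 1/2 for a zero entry.
    """
    dist = ((1 - np.sign(M * np.asarray(f, dtype=float))) / 2).sum(axis=1)
    return int(np.argmin(dist))
```

With three classes and binary outputs `[-0.5, 2.0, -1.0]` under the one-vs-all matrix, only the second classifier claims its class, so the decoding selects class index 1.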
Logistic Regression, AdaBoost and Bregman Distances
, 2000
Cited by 261 (44 self)
We give a unified account of boosting and logistic regression in which each learning problem is cast in terms of optimization of Bregman distances. The striking similarity of the two problems in this framework allows us to design and analyze algorithms for both simultaneously, and to easily adapt algorithms designed for one problem to the other. For both problems, we give new algorithms and explain their potential advantages over existing methods. These algorithms can be divided into two types based on whether the parameters are iteratively updated sequentially (one at a time) or in parallel (all at once). We also describe a parameterized family of algorithms which interpolates smoothly between these two extremes. For all of the algorithms, we give convergence proofs using a general formalization of the auxiliary-function proof technique. As one of our sequential-update algorithms is equivalent to AdaBoost, this provides the first general proof of convergence for AdaBoost. We show that all of our algorithms generalize easily to the multiclass case, and we contrast the new algorithms with iterative scaling. We conclude with a few experimental results with synthetic data that highlight the behavior of the old and newly proposed algorithms in different settings.
Theoretical Views of Boosting and Applications
, 1999
Cited by 74 (2 self)
Boosting is a general method for improving the accuracy of any given learning algorithm. Focusing primarily on the AdaBoost algorithm, we briefly survey theoretical work on boosting including analyses of AdaBoost's training error and generalization error, connections between boosting and game theory, methods of estimating probabilities using boosting, and extensions of AdaBoost for multiclass classification problems. Some empirical work and applications are also described. Background: Boosting is a general method which attempts to "boost" the accuracy of any given learning algorithm. Kearns and Valiant [29, 30] were the first to pose the question of whether a "weak" learning algorithm which performs just slightly better than random guessing in Valiant's PAC model [44] can be "boosted" into an arbitrarily accurate "strong" learning algorithm. Schapire [36] came up with the first provable polynomial-time boosting algorithm in 1989. A year later, Freund [16] developed a much more effici...
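For readers unfamiliar with the algorithm this survey focuses on, the following is a minimal sketch of binary AdaBoost. The choice of decision stumps as the weak learner, the labels in {-1, +1}, and every function name are assumptions made for illustration; the survey itself is not tied to any particular base learner.

```python
import numpy as np


def stump_predict(X, j, thr, pol):
    """Decision stump: predict pol where feature j <= thr, and -pol elsewhere."""
    return np.where(X[:, j] <= thr, pol, -pol)


def fit_stump(X, y, w):
    """Exhaustively search for the stump with the lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                err = w[stump_predict(X, j, thr, pol) != y].sum()
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best


def adaboost(X, y, rounds=10):
    """Return a list of (alpha, feature, threshold, polarity) weighted stumps."""
    w = np.full(len(y), 1.0 / len(y))  # start from the uniform distribution
    ensemble = []
    for _ in range(rounds):
        err, j, thr, pol = fit_stump(X, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)  # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)
        # Increase the weight of misclassified examples, decrease the rest.
        w = w * np.exp(-alpha * y * stump_predict(X, j, thr, pol))
        w /= w.sum()
        ensemble.append((alpha, j, thr, pol))
    return ensemble


def predict(ensemble, X):
    """Sign of the weighted vote of all stumps."""
    score = sum(a * stump_predict(X, j, thr, pol) for a, j, thr, pol in ensemble)
    return np.sign(score)
```

On a one-dimensional dataset labeled `+ + - - + +`, no single stump is correct everywhere, but a weighted vote of three stumps can be, which is the sense in which a weak learner is "boosted".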
How boosting the margin can also boost classifier complexity
 In Proceedings of the 23rd International Conference on Machine Learning
, 2006
Cited by 55 (6 self)
Boosting methods are known not to usually overfit training data even as the size of the generated classifiers becomes large. Schapire et al. attempted to explain this phenomenon in terms of the margins the classifier achieves on training examples. Later, however, Breiman cast serious doubt on this explanation by introducing a boosting algorithm, arc-gv, that can generate a higher margins distribution than AdaBoost and yet performs worse. In this paper, we take a close look at Breiman’s compelling but puzzling results. Although we can reproduce his main finding, we find that the poorer performance of arc-gv can be explained by the increased complexity of the base classifiers it uses, an explanation supported by our experiments and entirely consistent with the margins theory. Thus, we find maximizing the margins is desirable, but not necessarily at the expense of other factors, especially base-classifier complexity.
Process Consistency for AdaBoost
 Annals of Statistics
, 2000
Cited by 44 (1 self)
Introduction. Some recent experimental results [e.g., Friedman, Hastie and Tibshirani (1999); Grove and Schuurmans (1998); Mason et al. (1998)] and theoretical examples [Jiang (1999)] suggest that the AdaBoost algorithm [e.g., Schapire (1999); Freund and Schapire (1997)] can overfit in the limit of (very) large time (or the number of rounds of AdaBoost), despite the observation that the algorithm is often found to be resistant to overfitting after running hundreds of rounds. Jiang (1999) provides examples where it can be shown that the prediction error of AdaBoost [PE(AdaBoost t n), depending on the sample size n and the time t] is asymptotically suboptimal at t = ∞, in the sense that the prediction at t = ∞ is not consistent. Here by consistency of a prediction we mean that as the sample size n ...
A Gradient-Based Boosting Algorithm for Regression Problems
 In Advances in Neural Information Processing Systems
, 2001
Cited by 32 (0 self)
In adaptive boosting, several weak learners trained sequentially are combined to boost the overall algorithm performance. Recently adaptive boosting methods for classification problems have been derived as gradient descent algorithms. This formulation justifies key elements and parameters in the methods, all chosen to optimize a single common objective function. We propose an analogous formulation for adaptive boosting of regression problems, utilizing a novel objective function that leads to a simple boosting algorithm. We prove that this method reduces training error, and compare its performance to other regression methods. The aim of boosting algorithms is to "boost" the small advantage that a hypothesis produced by a weak learner can achieve over random guessing, by using the weak learning procedure several times on a sequence of carefully constructed distributions. Boosting methods, notably AdaBoost (Freund & Schapire, 1997), are simple yet powerful algorithms that are easy to im...
Boosting on Manifolds: Adaptive Regularization of Base Classifiers
 In Advances in Neural Information Processing Systems 16
, 2004
Cited by 10 (0 self)
In this paper we propose to combine two powerful ideas, boosting and manifold learning. On the one hand, we improve ADABOOST by incorporating knowledge on the structure of the data into base classifier design and selection. On the other hand, we use ADABOOST's efficient learning mechanism to significantly improve supervised and semi-supervised algorithms proposed in the context of manifold learning. Besides the specific manifold-based penalization, the resulting algorithm also accommodates the boosting of a large family of regularized learning algorithms.
The Rate of Convergence of AdaBoost
Cited by 10 (2 self)
The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that at iteration t, the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1-norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence on these parameters is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the dataset. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss.
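The quantity whose convergence this abstract studies can be written down in a few lines. The sketch below assumes the usual form of the exponential loss, the average over training examples of exp(-y_i Σ_j λ_j h_j(x_i)), with labels and hypothesis outputs in {-1, +1}; the names `exponential_loss`, `H`, and `lam` are chosen here for illustration.

```python
import numpy as np


def exponential_loss(lam, H, y):
    """Average exponential loss of the combination with weights lam.

    H[i, j] = h_j(x_i) in {-1, +1} is the output of hypothesis j on example i,
    y[i] in {-1, +1} is the label, lam[j] is the weight on hypothesis j.
    """
    margins = y * (H @ lam)  # unnormalized margin of each training example
    return float(np.exp(-margins).mean())
```

At `lam = 0` the loss equals 1 for any dataset. When the data are separable by the hypotheses the loss can be driven toward 0 only by letting the weights grow without bound, which is one reason an analysis that does not require finite minimizers is of interest.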