Boosting for high-dimensional linear models
 THE ANNALS OF STATISTICS
, 2006
Cited by 39 (5 self)
We prove that boosting with the squared error loss, L2Boosting, is consistent for very high-dimensional linear models, where the number of predictor variables is allowed to grow essentially as fast as O(exp(sample size)), assuming that the true underlying regression function is sparse in terms of the ℓ1-norm of the regression coefficients. In the language of signal processing, this means consistency for denoising using a strongly overcomplete dictionary if the underlying signal is sparse in terms of the ℓ1-norm. We also propose here an AIC-based method for tuning, namely for choosing the number of boosting iterations. This makes L2Boosting computationally attractive since it is not required to run the algorithm multiple times for cross-validation as commonly used so far. We demonstrate L2Boosting for simulated data, in particular where the predictor dimension is large in comparison to sample size, and for a difficult tumor-classification problem with gene expression microarray data.
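As a rough illustration of boosting with the squared error loss, the sketch below uses componentwise linear least squares as the base learner: each iteration refits the current residuals on the single best predictor and takes a shrunken step. The function name `l2boost`, the step size `nu`, and the fixed iteration count `n_iter` are assumptions made for this example; the AIC-based stopping rule mentioned in the abstract is not implemented here.

```python
import numpy as np

def l2boost(X, y, n_iter=100, nu=0.1):
    """Componentwise least-squares boosting with squared error loss (sketch).

    X: (n, p) array whose columns are assumed centered and not all zero.
    y: (n,) response vector.
    n_iter, nu: illustrative defaults, not values taken from the paper.
    """
    n, p = X.shape
    intercept = y.mean()
    coef = np.zeros(p)
    resid = y - intercept
    col_norm2 = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        # Least-squares coefficient of the residuals on each single predictor.
        b = X.T @ resid / col_norm2
        # Choose the predictor whose one-variable fit reduces the RSS the most.
        j = int(np.argmax(b ** 2 * col_norm2))
        # Take a shrunken step along that coordinate and update the residuals.
        coef[j] += nu * b[j]
        resid -= nu * b[j] * X[:, j]
    return intercept, coef
```

Because only one coefficient changes per iteration, stopping early leaves most coefficients at exactly zero, which is why the number of iterations acts as the tuning parameter.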
Boosting algorithms: Regularization, prediction and model fitting
 Statistical Science
, 2007
Cited by 38 (5 self)
We present a statistical perspective on boosting. Special emphasis is given to estimating potentially complex parametric or nonparametric models, including generalized linear and additive models as well as regression models for survival analysis. Concepts of degrees of freedom and corresponding Akaike or Bayesian information criteria, particularly useful for regularization and variable selection in high-dimensional covariate spaces, are discussed as well. The practical aspects of boosting procedures for fitting statistical models are illustrated by means of the dedicated open-source software package mboost. This package implements functions which can be used for model fitting, prediction and variable selection. It is flexible, allowing for the implementation of new boosting algorithms optimizing user-specified loss functions. Key words and phrases: Generalized linear models, generalized additive models, gradient boosting, survival analysis, variable selection, software.
Statistical analysis of Bayes optimal subset ranking
 IEEE Transactions on Information Theory
, 2008
Cited by 23 (0 self)
The ranking problem has become increasingly important in modern applications of statistical methods in automated decision-making systems. In particular, we consider a formulation of the statistical ranking problem which we call subset ranking, and focus on the DCG (discounted cumulated gain) criterion that measures the quality of items near the top of the rank-list. Similar to error minimization for binary classification, direct optimization of natural ranking criteria such as DCG leads to nonconvex optimization problems that can be NP-hard. Therefore a computationally more tractable approach is needed. We present bounds that relate the approximate optimization of DCG to the approximate minimization of certain regression errors. These bounds justify the use of convex learning formulations for solving the subset ranking problem. The resulting estimation methods are not conventional, in that we focus on the estimation quality in the top portion of the rank-list. We further investigate the asymptotic statistical behavior of these formulations. Under appropriate conditions, the consistency of the estimation schemes with respect to the DCG metric can be derived.
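To make the DCG criterion concrete, here is a minimal sketch that scores a ranked list by summing discounted relevance grades over the top k positions. The logarithmic discount 1/log2(rank + 1) is one common convention and is an assumption here; the abstract does not fix a particular discount. The helper names `dcg_at_k` and `dcg_of_scores` are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    """DCG of the top k items, given relevance grades in ranked order.

    Uses the (assumed) discount 1 / log2(rank + 1), with ranks starting at 1.
    """
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def dcg_of_scores(scores, relevances, k):
    """Rank items by predicted score (descending), then evaluate DCG at k."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return dcg_at_k([relevances[i] for i in order], k)
```

Since the discount shrinks with rank, errors in predicted scores matter most for items that end up near the top, which matches the abstract's focus on estimation quality in the top portion of the list.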
Random Classification Noise Defeats All Convex Potential Boosters
Cited by 17 (3 self)
A broad class of boosting algorithms can be interpreted as performing coordinate-wise gradient descent to minimize some potential function of the margins of a data set. This class includes AdaBoost, LogitBoost, and other widely used and well-studied boosters. In this paper we show that for a broad class of convex potential functions, any such boosting algorithm is highly susceptible to random classification noise. We do this by showing that for any such booster and any nonzero random classification noise rate η, there is a simple data set of examples which is efficiently learnable by such a booster if there is no noise, but which cannot be learned to accuracy better than 1/2 if there is random classification noise at rate η. This negative result is in contrast with known branching-program-based boosters which do not fall into the convex potential function framework and which can provably learn to high accuracy in the presence of random classification noise.
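The coordinate-wise view can be sketched for one member of this class, the exponential potential used by AdaBoost. The code below is an illustration under stated assumptions, not the construction from the paper: `M` is a hypothetical matrix with entries M[i, j] = y_i h_j(x_i) in {-1, +1}, and each round performs an exact line search along the coordinate with the largest weighted edge.

```python
import numpy as np

def exp_potential_booster(M, n_rounds=50):
    """Coordinate descent on the potential sum_i exp(-(M @ lam)_i) (sketch).

    M: (m, n) array, M[i, j] = y_i * h_j(x_i) with entries in {-1, +1}.
    Returns the weight vector lam over the n base classifiers.
    """
    m, n = M.shape
    lam = np.zeros(n)
    for _ in range(n_rounds):
        # Example weights are proportional to the potential's derivative.
        w = np.exp(-(M @ lam))
        w /= w.sum()
        # Weighted edge of each base classifier under the current weights.
        edges = w @ M
        j = int(np.argmax(np.abs(edges)))
        r = edges[j]
        if abs(r) >= 1.0:
            # A perfect base classifier: the exact line search is unbounded.
            break
        # Exact minimizer of the potential along coordinate j.
        lam[j] += 0.5 * np.log((1.0 + r) / (1.0 - r))
    return lam
```

Flipping a label under random classification noise makes the corresponding margin negative, so the exponential weight `w` on that example grows, which gives some intuition for why this family is sensitive to such noise.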
Boosting with Structural Sparsity
Cited by 13 (1 self)
We derive generalizations of AdaBoost and related gradient-based coordinate descent methods that incorporate sparsity-promoting penalties for the norm of the predictor that is being learned. The end result is a family of coordinate descent algorithms that integrate forward feature induction and back-pruning through regularization and give an automatic stopping criterion for feature induction. We study penalties based on the ℓ1, ℓ2, and ℓ∞ norms of the predictor and introduce mixed-norm penalties that build upon the initial penalties. The mixed-norm regularizers facilitate structural sparsity in parameter space, which is a useful property in multiclass prediction and other related tasks. We report empirical results that demonstrate the power of our approach in building accurate and structurally sparse models.
Complexities of convex combinations and bounding the generalization error in classification
, 2008
Cited by 12 (0 self)
We introduce and study several measures of complexity of functions from the convex hull of a given base class. These complexity measures take into account the sparsity of the weights of a convex combination as well as certain clustering properties of the base functions involved in it. We prove new upper confidence bounds on generalization error of ensemble (voting) classification algorithms that utilize the new complexity measures along with the empirical distributions of classification margins, providing a better explanation of generalization performance of large margin classification methods.
Some theory for generalized boosting algorithms
 J. Machine Learning Research
, 2006
Cited by 10 (1 self)
We give a review of various aspects of boosting, clarifying the issues through a few simple results, and relate our work and that of others to the minimax paradigm of statistics. We consider the population version of the boosting algorithm and prove its convergence to the Bayes classifier as a corollary of a general result about Gauss-Southwell optimization in Hilbert space. We then investigate the algorithmic convergence of the sample version, and give bounds to the time until perfect separation of the sample. We conclude with some results on the statistical optimality of L2 boosting.
Analysis of boosting algorithms using the smooth margin function
, 2007
Cited by 10 (1 self)
We introduce a useful tool for analyzing boosting algorithms called the “smooth margin function,” a differentiable approximation of the usual margin for boosting algorithms. We present two boosting algorithms based on this smooth margin, “coordinate ascent boosting” and “approximate coordinate ascent boosting,” which are similar to Freund and Schapire’s AdaBoost algorithm and Breiman’s arc-gv algorithm. We give convergence rates to the maximum margin solution for both of our algorithms and for arc-gv. We then study AdaBoost’s convergence properties using the smooth margin function. We precisely bound the margin attained by AdaBoost when the edges of the weak classifiers fall within a specified range. This shows that a previous bound proved by Rätsch and Warmuth is exactly tight. Furthermore, we use the smooth margin to capture explicit properties of AdaBoost in cases where cyclic behavior occurs.
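The abstract does not state the smooth margin formula, so the sketch below assumes one standard differentiable surrogate: the negative log of the exponential loss, normalized by the ℓ1-norm of the weights. With `M[i, j] = y_i h_j(x_i)` and nonnegative weights `lam` (both hypothetical names), this quantity lies between the usual margin minus ln(m)/‖lam‖1 and the usual margin itself.

```python
import numpy as np

def margin(M, lam):
    """Usual (minimum) normalized margin of the combined classifier."""
    return (M @ lam).min() / lam.sum()

def smooth_margin(M, lam):
    """Assumed smooth margin: -log(sum_i exp(-(M lam)_i)) / ||lam||_1.

    lam is assumed nonnegative and not identically zero.
    """
    a = -(M @ lam)
    # Numerically stable log-sum-exp.
    lse = a.max() + np.log(np.exp(a - a.max()).sum())
    return -lse / lam.sum()
```

Because the log-sum-exp is a smooth stand-in for the maximum, this function is differentiable in `lam` while the minimum margin is not, which is what makes coordinate ascent with a line search well defined.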
Ingrid Daubechies. Boosting based on a smooth margin
 In Proceedings of the Seventeenth Annual Conference on Computational Learning Theory
, 2004
Cited by 9 (3 self)
We study two boosting algorithms, Coordinate Ascent Boosting and Approximate Coordinate Ascent Boosting, which are explicitly designed to produce maximum margins. To derive these algorithms, we introduce a smooth approximation of the margin that one can maximize in order to produce a maximum margin classifier. Our first algorithm is simply coordinate ascent on this function, involving a line search at each step. We then make a simple approximation of this line search to reveal our second algorithm. These algorithms are proven to asymptotically achieve maximum margins, and we provide two convergence rate calculations. The second calculation yields a faster rate of convergence than the first, although the first gives a more explicit (still fast) rate. These algorithms are very similar to AdaBoost in that they are based on coordinate ascent, easy to implement, and empirically tend to converge faster than other boosting algorithms. Finally, we attempt to understand AdaBoost in terms of our smooth margin, focusing on cases where AdaBoost exhibits cyclic behavior.
ON EARLY STOPPING IN GRADIENT DESCENT LEARNING
Cited by 8 (2 self)
In this paper, we study a family of gradient descent algorithms to approximate the regression function from Reproducing Kernel Hilbert Spaces (RKHSs), the family being characterized by a polynomial decreasing rate of step sizes (or learning rate). By solving a bias-variance trade-off we obtain an early stopping rule and some probabilistic upper bounds for the convergence of the algorithms. These upper bounds have improved rates where the usual regularized least squares algorithm fails and achieve the minimax optimal rate O(m^{-1/2}) in some cases. We also discuss the implication of these results in the context of classification. Some connections are addressed with Boosting, Landweber iterations, and the online learning algorithms as stochastic approximations of the gradient descent method.
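A small sketch of the idea: run gradient descent on the empirical squared loss in an RKHS with polynomially decaying step sizes, and stop early. Several choices here are assumptions for illustration only: the Gaussian kernel, the schedule 1/(t + 1)^theta, and especially the stopping rule, which uses a hold-out set with a patience counter in place of the theoretical rule derived in the paper.

```python
import numpy as np

def gaussian_kernel(A, B, width=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def kernel_gd_early_stopping(X, y, X_val, y_val,
                             max_iter=500, theta=0.25, patience=20):
    """Kernel gradient descent with hold-out early stopping (sketch).

    The fitted function is f(x) = sum_i alpha[i] * K(x_i, x).
    Returns the coefficients at the best validation error and that iteration.
    """
    m = len(y)
    K = gaussian_kernel(X, X)
    K_val = gaussian_kernel(X_val, X)
    alpha = np.zeros(m)
    best_err, best_alpha, best_t = np.inf, alpha.copy(), 0
    for t in range(max_iter):
        # Polynomially decaying step size; K(x, x) = 1 for this kernel.
        step = 1.0 / (t + 1) ** theta
        # Gradient step on the empirical squared loss, in coefficient form.
        alpha -= step / m * (K @ alpha - y)
        val_err = np.mean((K_val @ alpha - y_val) ** 2)
        if val_err < best_err:
            best_err, best_alpha, best_t = val_err, alpha.copy(), t + 1
        elif t + 1 - best_t >= patience:
            # Validation error has stopped improving: stop early.
            break
    return best_alpha, best_t
```

The iteration count plays the role of a regularization parameter: few iterations give a smooth, biased fit, while many iterations reduce bias at the cost of variance.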