Results 1  10
of
90
Regularization and variable selection via the Elastic Net
 Journal of the Royal Statistical Society, Series B
, 2005
"... Summary. We propose the elastic net, a new regularization and variable selection method. Real world data and a simulation study show that the elastic net often outperforms the lasso, while enjoying a similar sparsity of representation. In addition, the elastic net encourages a grouping effect, where ..."
Abstract

Cited by 362 (8 self)
 Add to MetaCart
Summary. We propose the elastic net, a new regularization and variable selection method. Real world data and a simulation study show that the elastic net often outperforms the lasso, while enjoying a similar sparsity of representation. In addition, the elastic net encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together.The elastic net is particularly useful when the number of predictors (p) is much bigger than the number of observations (n). By contrast, the lasso is not a very satisfactory variable selection method in the p n case. An algorithm called LARSEN is proposed for computing elastic net regularization paths efficiently, much like algorithm LARS does for the lasso.
The tradeoffs of large scale learning
 IN: ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 20
, 2008
"... This contribution develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for the case of smallscale and largescale learning problems. Smallscale learning problems are subject to the usual approx ..."
Abstract

Cited by 133 (4 self)
 Add to MetaCart
This contribution develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for the case of smallscale and largescale learning problems. Smallscale learning problems are subject to the usual approximation–estimation tradeoff. Largescale learning problems are subject to a qualitatively different tradeoff involving the computational complexity of the underlying optimization algorithms in nontrivial ways.
Convexity, Classification, and Risk Bounds
 JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
, 2003
"... Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 01 loss function. The convexity makes these algorithms computationally efficien ..."
Abstract

Cited by 121 (14 self)
 Add to MetaCart
Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 01 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 01 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial upper bounds on excess risk under the weakest possible condition on the loss function: that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a simple variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we
An introduction to boosting and leveraging
 Advanced Lectures on Machine Learning, LNCS
, 2003
"... ..."
Piecewise linear regularized solution paths
 Ann. Statist
, 2007
"... We consider the generic regularized optimization problem ˆ β(λ) = arg minβ L(y, Xβ) + λJ(β). Recently, Efron et al. (2004) have shown that for the Lasso – that is, if L is squared error loss and J(β) = ‖β‖1 is the l1 norm of β – the optimal coefficient path is piecewise linear, i.e., ∂ ˆ β(λ)/∂λ i ..."
Abstract

Cited by 86 (8 self)
 Add to MetaCart
We consider the generic regularized optimization problem ˆ β(λ) = arg minβ L(y, Xβ) + λJ(β). Recently, Efron et al. (2004) have shown that for the Lasso – that is, if L is squared error loss and J(β) = ‖β‖1 is the l1 norm of β – the optimal coefficient path is piecewise linear, i.e., ∂ ˆ β(λ)/∂λ is piecewise constant. We derive a general characterization of the properties of (loss L, penalty J) pairs which give piecewise linear coefficient paths. Such pairs allow for efficient generation of the full regularized coefficient paths. We investigate the nature of efficient path following algorithms which arise. We use our results to suggest robust versions of the Lasso for regression and classification, and to develop new, efficient algorithms for existing problems in the literature, including Mammen & van de Geer’s Locally Adaptive Regression Splines. 1
On the Rate of Convergence of Regularized Boosting Classifiers
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2003
"... A regularized boosting method is introduced, for which regularization is obtained through a penalization function. It is shown through oracle inequalities that this method is model adaptive. The rate of convergence of the probability of misclassification is investigated. It is shown that for quite ..."
Abstract

Cited by 46 (10 self)
 Add to MetaCart
A regularized boosting method is introduced, for which regularization is obtained through a penalization function. It is shown through oracle inequalities that this method is model adaptive. The rate of convergence of the probability of misclassification is investigated. It is shown that for quite a large class of distributions, the probability of error converges to the Bayes risk at a rate faster than n (V+2)/(4(V+1)) where V is the VC dimension of the "base" class whose elements are combined by boosting methods to obtain an aggregated classifier. The dimensionindependent nature of the rates may partially explain the good behavior of these methods in practical problems. Under Tsybakov's noise condition the rate of convergence is even faster. We investigate the conditions necessary to obtain such rates for different base classes. The special case of boosting using decision stumps is studied in detail. We characterize the class of classifiers realizable by aggregating decision stumps.
On the consistency of multiclass classification methods
 In Proceedings of the 18th Conference on Computational Learning Theory (COLT
, 2005
"... Binary classification is a well studied special case of the classification problem. Statistical properties of binary classifiers, such as consistency, have been investigated in a variety of settings. Binary classification methods can be generalized in many ways to handle multiple classes. It turns o ..."
Abstract

Cited by 43 (1 self)
 Add to MetaCart
Binary classification is a well studied special case of the classification problem. Statistical properties of binary classifiers, such as consistency, have been investigated in a variety of settings. Binary classification methods can be generalized in many ways to handle multiple classes. It turns out that one can lose consistency in generalizing a binary classification method to deal with multiple classes. We study a rich family of multiclass methods and provide a necessary and sufficient condition for their consistency. We illustrate our approach by applying it to some multiclass methods proposed in the literature.
Statistical performance of support vector machines
 ANN. STATIST
, 2008
"... The support vector machine (SVM) algorithm is well known to the computer learning community for its very good practical results. The goal of the present paper is to study this algorithm from a statistical perspective, using tools of concentration theory and empirical processes. Our main result build ..."
Abstract

Cited by 43 (8 self)
 Add to MetaCart
The support vector machine (SVM) algorithm is well known to the computer learning community for its very good practical results. The goal of the present paper is to study this algorithm from a statistical perspective, using tools of concentration theory and empirical processes. Our main result builds on the observation made by other authors that the SVM can be viewed as a statistical regularization procedure. From this point of view, it can also be interpreted as a model selection principle using a penalized criterion. It is then possible to adapt general methods related to model selection in this framework to study two important points: (1) what is the minimum penalty and how does it compare to the penalty actually used in the SVM algorithm; (2) is it possible to obtain “oracle inequalities ” in that setting, for the specific loss function used in the SVM algorithm? We show that the answer to the latter question is positive and provides relevant insight to the former. Our result shows that it is possible to obtain fast rates of convergence for SVMs.