Regularization paths for generalized linear models via coordinate descent
, 2009
Cited by 228 (8 self)
We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems, while the penalties include ℓ1 (the lasso), ℓ2 (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coordinate descent, computed along a regularization path. The methods can handle large problems and can also deal efficiently with sparse features. In comparative timings we find that the new algorithms are considerably faster than competing methods.
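The cyclical coordinate descent idea can be sketched for the linear regression case as follows. This is a minimal illustration, not the authors' implementation: it assumes squared-error loss, no intercept (centered data), and the elastic-net parameterization (1/2n)‖y − Xb‖² + λ[α‖b‖₁ + (1 − α)/2 ‖b‖²]. The function names `soft_threshold`, `elastic_net_cd` and `regularization_path` are hypothetical.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator: sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_cd(X, y, lam, alpha, beta=None, n_sweeps=200, tol=1e-10):
    """Cyclical coordinate descent for the assumed objective
    (1/2n)*||y - X b||^2 + lam*(alpha*||b||_1 + (1 - alpha)/2*||b||_2^2).
    X is a list of rows; no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p if beta is None else list(beta)
    # Residual y - X beta, kept up to date after every coordinate update.
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Inner product of column j with the partial residual
            # (the residual with coordinate j's contribution added back).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1.0 - alpha))
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


def regularization_path(X, y, lambdas, alpha):
    """Solve on a decreasing grid of lambda values, warm-starting each fit."""
    beta, path = None, []
    for lam in sorted(lambdas, reverse=True):
        beta = elastic_net_cd(X, y, lam, alpha, beta=beta)
        path.append((lam, beta))
    return path
```

Starting from the largest λ and reusing each solution as the starting point for the next is one common way to compute a path cheaply; the grid itself is left to the caller here.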
An interior-point method for large-scale ℓ1-regularized logistic regression
 Journal of Machine Learning Research
, 2007
Cited by 167 (5 self)
Logistic regression with ℓ1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale ℓ1-regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium-sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method that uses a preconditioned conjugate gradient method to compute the search step can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently.
The group Lasso for logistic regression
 Journal of the Royal Statistical Society, Series B
, 2008
Cited by 152 (7 self)
Summary. The group lasso is an extension of the lasso to do variable selection on (predefined) groups of variables in linear regression models. The estimates have the attractive property of being invariant under groupwise orthogonal reparameterizations. We extend the group lasso to logistic regression models and present an efficient algorithm, especially suitable for high-dimensional problems, which can also be applied to generalized linear models to solve the corresponding convex optimization problem. The group lasso estimator for logistic regression is shown to be statistically consistent even if the number of predictors is much larger than the sample size but with sparse true underlying structure. We further use a two-stage procedure which aims for sparser models than the group lasso, leading to improved prediction performance in some cases. Moreover, owing to the two-stage nature, the estimates can be constructed to be hierarchical. The methods are used on simulated and real data sets about splice site detection in DNA sequences.
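To make the grouping idea concrete, here is a small sketch of a group lasso penalty. It assumes the common form λ Σ_g √(df_g) ‖β_g‖₂, where df_g is the size of group g; the exact weighting is an assumption, since the abstract does not state it. The helper names are hypothetical. Because the penalty depends on each group only through its Euclidean norm, rotating the coefficients within a group leaves it unchanged, which is the mechanism behind the invariance property mentioned above.

```python
import math


def group_lasso_penalty(beta, groups, lam):
    """lam * sum over groups of sqrt(group size) * Euclidean norm of the group.
    `groups` is a list of index lists that partition the coefficients.
    """
    total = 0.0
    for idx in groups:
        norm = math.sqrt(sum(beta[j] ** 2 for j in idx))
        total += math.sqrt(len(idx)) * norm
    return lam * total


def active_groups(beta, groups):
    """Indices of groups with at least one nonzero coefficient."""
    return [g for g, idx in enumerate(groups) if any(beta[j] != 0.0 for j in idx)]


def rotate_pair(beta, i, j, theta):
    """Apply a 2-D rotation (an orthogonal map) to coordinates i and j."""
    out = list(beta)
    c, s = math.cos(theta), math.sin(theta)
    out[i] = c * beta[i] - s * beta[j]
    out[j] = s * beta[i] + c * beta[j]
    return out


if __name__ == "__main__":
    beta = [3.0, 4.0, 0.0, 0.0, 5.0]
    groups = [[0, 1], [2, 3], [4]]
    print(group_lasso_penalty(beta, groups, lam=1.0))
    # Rotating within the first group does not change the penalty.
    print(group_lasso_penalty(rotate_pair(beta, 0, 1, 0.7), groups, lam=1.0))
    print(active_groups(beta, groups))
```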
Grouped and hierarchical model selection through composite absolute penalties
 Annals of Statistics
, 2006
Cited by 94 (3 self)
Extracting useful information from high-dimensional data is an important part of the focus of today's statistical research and practice. Penalized loss function minimization has been shown to be effective for this task both theoretically and empirically. With the virtues of both regularization and sparsity, the L1-penalized L2 minimization method Lasso has been popular in regression models. In this paper, we combine different norms including L1 to form an intelligent penalty in order to add side information to the fitting of a regression or classification model to obtain reasonable estimates. Specifically, we introduce the Composite Absolute Penalties (CAP) family, which allows the grouping and hierarchical relationships between the predictors to be expressed. CAP penalties are built by defining groups and combining the properties of norm penalties at the across-group and within-group levels. Grouped selection occurs for non-overlapping groups. In that case, we give a Bayesian interpretation for CAP penalties. Hierarchical variable selection is reached by defining groups with particular overlapping patterns. In the computation aspect, we propose using the BLASSO and cross-validation to obtain CAP estimates. For a subfamily of CAP estimates involving only the L1 and L∞ norms, we introduce the iCAP algorithm to trace the entire regularization path for the grouped selection problem. Within this subfamily, unbiased estimates of the degrees of freedom (df) are derived, allowing the regularization parameter to be selected without cross-validation. CAP is shown to improve on the predictive performance of the LASSO in a series of simulated experiments including cases with p ≫ n and misspecified groupings. When the complexity of a model is properly calculated, iCAP is seen to be parsimonious in the experiments.
Boosting algorithms: Regularization, prediction and model fitting
 Statistical Science
, 2007
Cited by 48 (9 self)
We present a statistical perspective on boosting. Special emphasis is given to estimating potentially complex parametric or nonparametric models, including generalized linear and additive models as well as regression models for survival analysis. Concepts of degrees of freedom and corresponding Akaike or Bayesian information criteria, particularly useful for regularization and variable selection in high-dimensional covariate spaces, are discussed as well. The practical aspects of boosting procedures for fitting statistical models are illustrated by means of the dedicated open-source software package mboost. This package implements functions which can be used for model fitting, prediction and variable selection. It is flexible, allowing for the implementation of new boosting algorithms optimizing user-specified loss functions. Key words and phrases: Generalized linear models, generalized additive models, gradient boosting, survival analysis, variable selection, software.
ℓ1 Trend Filtering
, 2007
Cited by 22 (5 self)
The problem of estimating underlying trends in time series data arises in a variety of disciplines. In this paper we propose a variation on Hodrick-Prescott (HP) filtering, a widely used method for trend estimation. The proposed ℓ1 trend filtering method substitutes a sum of absolute values (i.e., an ℓ1-norm) for the sum of squares used in HP filtering to penalize variations in the estimated trend. The ℓ1 trend filtering method produces trend estimates that are piecewise linear, and therefore is well suited to analyzing time series with an underlying piecewise linear trend. The kinks, knots, or changes in slope of the estimated trend can be interpreted as abrupt changes or events in the underlying dynamics of the time series. Using specialized interior-point methods, ℓ1 trend filtering can be carried out with not much more effort than HP filtering; in particular, the number of arithmetic operations required grows linearly with the number of data points. We describe the method and some of its basic properties, and give some illustrative examples. We show how the method is related to ℓ1 regularization based methods in sparse signal recovery and feature selection, and list some extensions of the basic method.
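The difference between the two penalties can be illustrated by evaluating both objectives on a candidate trend. The sketch below assumes the usual second-difference formulation, in which "variation" of the trend x is measured by x[t-1] − 2x[t] + x[t+1], and a 1/2 scaling on the fit term; both are assumptions not spelled out in the abstract, and the function names are hypothetical. It only evaluates the objectives and does not implement the interior-point solver.

```python
def second_differences(x):
    """Discrete second differences x[t-1] - 2*x[t] + x[t+1] for interior t."""
    return [x[t - 1] - 2 * x[t] + x[t + 1] for t in range(1, len(x) - 1)]


def hp_objective(y, x, lam):
    """Assumed HP objective: squared-error fit plus sum of SQUARED second differences."""
    fit = 0.5 * sum((yt - xt) ** 2 for yt, xt in zip(y, x))
    return fit + lam * sum(d ** 2 for d in second_differences(x))


def l1_trend_objective(y, x, lam):
    """Assumed l1 trend filtering objective: same fit term, ABSOLUTE second differences."""
    fit = 0.5 * sum((yt - xt) ** 2 for yt, xt in zip(y, x))
    return fit + lam * sum(abs(d) for d in second_differences(x))


def kink_positions(x, tol=1e-12):
    """Interior indices where the slope of x changes (nonzero second difference)."""
    return [t for t, d in enumerate(second_differences(x), start=1) if abs(d) > tol]


if __name__ == "__main__":
    trend = [0, 1, 2, 3, 2, 1, 0]  # piecewise linear with a single kink
    print(second_differences(trend))
    print(kink_positions(trend))
    print(l1_trend_objective(trend, trend, lam=1.5), hp_objective(trend, trend, lam=1.5))
```

A piecewise linear trend has zero second difference everywhere except at its kinks, so the ℓ1 penalty charges only for the slope changes themselves.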
Variable inclusion and shrinkage algorithms
 Journal of the American Statistical Association
, 2008
Cited by 17 (10 self)
The Lasso is a popular and computationally efficient procedure for automatically performing both variable selection and coefficient shrinkage on linear regression models. One limitation of the Lasso is that the same tuning parameter is used for both variable selection and shrinkage. As a result, it typically ends up selecting a model with too many variables to prevent over-shrinkage of the regression coefficients. We suggest an improved class of methods called "Variable Inclusion and Shrinkage Algorithms" (VISA). Our approach is capable of selecting sparse models while avoiding over-shrinkage problems, and it uses a path algorithm, so it is also computationally efficient. We show through extensive simulations that VISA significantly outperforms the Lasso and also provides improvements over more recent procedures, such as the Dantzig selector, Relaxed Lasso and Adaptive Lasso. In addition, we provide theoretical justification for VISA in terms of non-asymptotic bounds on the estimation error that suggest it should exhibit good performance even for large numbers of predictors. Finally, we extend the VISA methodology, path algorithm, and theoretical bounds to the Generalized Linear Models framework.
Unified lasso estimation via least squares approximation
, 2007
Cited by 17 (4 self)
We propose a method of least squares approximation (LSA) for unified yet simple LASSO estimation. Our general theoretical framework includes ordinary least squares, generalized linear models, quantile regression, and many others as special cases. Specifically, LSA can transfer many different types of LASSO objective functions into their asymptotically equivalent least-squares problems. Thereafter, the standard asymptotic theory can be established and the LARS algorithm can be applied. In particular, if the adaptive LASSO penalty and a BIC-type tuning parameter selector are used, the resulting LSA estimator can be as efficient as the oracle. Extensive numerical studies confirm our theory.
Approximation Accuracy, Gradient Methods, and Error Bound for Structured Convex Optimization
, 2009
Cited by 13 (1 self)
Convex optimization problems arising in applications, possibly as approximations of intractable problems, are often structured and large scale. When the data are noisy, it is of interest to bound the solution error relative to the (unknown) solution of the original noiseless problem. Related to this is an error bound for the linear convergence analysis of first-order gradient methods for solving these problems. Example applications include compressed sensing, variable selection in regression, TV-regularized image denoising, and sensor network localization.
A generalized Dantzig selector with shrinkage tuning
 Biometrika
, 2009
Cited by 12 (7 self)
The Dantzig selector performs variable selection and model fitting in linear regression. It uses an L1 penalty to shrink the regression coefficients towards zero, in a similar fashion to the Lasso. While both the Lasso and Dantzig selector potentially do a good job of selecting the correct variables, they tend to overshrink the final coefficients. This results in an unfortunate tradeoff. One can either select a high shrinkage tuning parameter that produces an accurate model but poor coefficient estimates, or a low shrinkage parameter that produces more accurate coefficients but includes many irrelevant variables. We extend the Dantzig selector to fit generalized linear models while also eliminating overshrinkage of the coefficient estimates. In addition, we develop a computationally efficient algorithm, similar in nature to least angle regression, to compute the entire path of coefficient estimates. A detailed simulation study illustrates the advantages of our approach relative to several other possible methods. Finally, we apply the methodology to two real-world datasets.