## Piecewise linear regularized solution paths (2007)

Venue: Ann. Statist.

Citations: 87 (8 self)

### BibTeX

@ARTICLE{Rosset07piecewiselinear,
  author  = {Saharon Rosset and Ji Zhu},
  title   = {Piecewise linear regularized solution paths},
  journal = {Ann. Statist.},
  year    = {2007},
  pages   = {1030}
}

### Abstract

We consider the generic regularized optimization problem β̂(λ) = argmin_β L(y, Xβ) + λJ(β). Recently, Efron et al. (2004) have shown that for the Lasso, that is, if L is squared error loss and J(β) = ‖β‖₁ is the ℓ1 norm of β, the optimal coefficient path is piecewise linear, i.e., ∂β̂(λ)/∂λ is piecewise constant. We derive a general characterization of the properties of (loss L, penalty J) pairs which give piecewise linear coefficient paths. Such pairs allow for efficient generation of the full regularized coefficient paths. We investigate the nature of the efficient path-following algorithms which arise. We use our results to suggest robust versions of the Lasso for regression and classification, and to develop new, efficient algorithms for existing problems in the literature, including Mammen and van de Geer's locally adaptive regression splines.
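The piecewise-linear path described in the abstract is easiest to see in the one-predictor Lasso, where the minimizer has a closed form: soft-thresholding of the least-squares inner product, so ∂β̂(λ)/∂λ is constant between knots. A minimal stdlib-only sketch (the data and function name are illustrative, not from the paper):

```python
def lasso_1d(x, y, lam):
    """Closed-form Lasso solution for a single predictor:
    minimize sum((y_i - x_i*b)^2) + lam*|b|.
    Stationarity soft-thresholds s = sum(x_i*y_i) at lam/2."""
    s = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    mag = max(abs(s) - lam / 2.0, 0.0)
    return (mag if s >= 0 else -mag) / sxx

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]  # exact fit at beta = 2, so s = 28, sxx = 14
# The path is linear on 0 <= lam <= 2*|s| with constant slope
# d beta / d lam = -sign(s)/(2*sxx), then identically zero past the knot.
betas = [lasso_1d(x, y, lam) for lam in (0.0, 7.0, 14.0, 56.0, 60.0)]
```

Between λ = 0 and the knot at λ = 2|s| the derivative of the coefficient is constant, and past the knot the coefficient stays at zero: exactly the piecewise-constant ∂β̂(λ)/∂λ the abstract refers to.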

### Citations

9002 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context ...x_i^⊤β) + λ‖β‖₂². Many “modern” methods for machine learning and signal processing can also be cast in the framework of regularized optimization. For example, the regularized support vector machine [20] uses the hinge loss function and the ℓ2-norm penalty: (4) β̂(λ) = min_β ∑_{i=1}^n (1 − y_i x_i^⊤β)₊ + λ‖β‖₂², where (·)₊ is the positive part of the argument. Boosting [6] is a popular and highly succe... |

2055 |
The Elements of Statistical Learning
- Hastie, Tibshirani, et al.
- 2001
Citation Context ...sso to that of the Lasso on the original data and after we artificially “contaminate” the data by adding large constants to a small number of responses. We use the training-test configuration as in [6], page 48. The training set consists of 67 observations and the test set of 30 observations. We ran the Lasso and the Huberized Lasso with “knot” at t = 1 on the original dataset, and on the “contamin... |

1858 | Regression shrinkage and selection via the lasso
- Tibshirani
- 1996
Citation Context ...roaches, such as ridge regression and the Lasso. Both of these use squared error loss, but they differ in the penalty they impose on the coefficient vector β describing the fitted model: Ridge: β̂(λ) = min_β ∑_{i=1}^n (y_i − x_i^T β)² + λ‖β‖₂², (2) Lasso [16]: β̂(λ) = min_β ∑_{i=1}^n (y_i − x_i^T β)² + λ‖β‖₁. (3) Another example from the statistics literature is the penalized logistic regression model [20... |

1639 | Experiments with a new boosting algorithm
- Freund, Schapire
- 1996
Citation Context ...rized support vector machine [20] uses the hinge loss function and the ℓ2-norm penalty: (4) β̂(λ) = min_β ∑_{i=1}^n (1 − y_i x_i^⊤β)₊ + λ‖β‖₂², where (·)₊ is the positive part of the argument. Boosting [6] is a popular and highly successful method for iteratively building an additive model from a dictionary of “weak learners.” In [15] we have shown that the AdaBoost algorithm approximately follows the ... |

1280 |
Spline models for observational data
- Wahba
- 1990
Citation Context ...16]: Ridge: β̂(λ) = min_β ∑_{i=1}^n (y_i − x_i^T β)² + λ‖β‖₂², (2) Lasso: β̂(λ) = min_β ∑_{i=1}^n (y_i − x_i^T β)² + λ‖β‖₁. (3) Another example from the statistics literature is the penalized logistic regression model [20] for classification, which is widely used in medical decisions and credit scoring: β̂(λ) = min_β ∑_{i=1}^n log(1 + e^{−y_i x_i^T β}) + λ‖β‖₂², where the loss is the negative binomial log-likelihood. Many “... |

700 |
The elements of statistical learning: data mining, inference, and prediction: with 200 full-color illustrations
- Hastie, Tibshirani, et al.
- 2001
Citation Context ...erized LASSO to that of the LASSO on the original data and after we artificially “contaminate” the data by adding large constants to a small number of responses. We used the training-test split as in [8]. The training set consists of 67 observations and the test set of 30 observations. We ran the LASSO and the Huberized LASSO with a knot at t = 1 on the original dataset, and on the “contaminated” dat... |

493 |
Ridge regression: biased estimation for nonorthogonal problems
- Hoerl, Kennard
- 1970
Citation Context ...cribed as exact or approximate regularized optimization approaches. The obvious examples from the statistics literature are explicit regularized linear regression approaches, such as ridge regression [9] and the LASSO [17]. Both of these use squared error loss, but they differ in the penalty they impose on the coefficient vector β: (2) Ridge: β̂(λ) = min_β ∑_{i=1}^n (y_i − x_i^⊤... |

342 | Variable selection via nonconcave penalized likelihood and its oracle properties
- Fan, Li
- 2001
Citation Context ...ch efficient algorithms can be designed, leaves out some other statistically well-motivated fitting approaches. The use of a nonconvex penalty was advocated by Fan and collaborators in several papers [4, 5]. They expose the favorable variable selection property of the penalty function they offer, which can be viewed as an improvement over the use of the ℓ1 penalty. [16] advocates the use of nonconvex ψ-loss... |

329 |
Regression shrinkage and selection via the lasso
- Tibshirani
- 1996
Citation Context ... approximate regularized optimization approaches. The obvious examples from the statistics literature are explicit regularized linear regression approaches, such as ridge regression [9] and the LASSO [17]. Both of these use squared error loss, but they differ in the penalty they impose on the coefficient vector β: (2) Ridge: β̂(λ) = min_β ∑_{i=1}^n (y_i − x_i^⊤β)² + λ‖β‖₂², ... |

327 |
Robust Estimation of a Location Parameter
- Huber
- 1964
Citation Context ...ons. On the loss side, this leads us to consider functions L which are: • Pure quadratic loss functions, like those of linear regression. • A mixture of quadratic and linear pieces, like Huber’s loss [10]. These loss functions are of interest because they generate robust modeling tools. They will be the focus of Section 3. • Loss functions which are piecewise linear. These include several widely used ... |
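The “mixture of quadratic and linear pieces” in this snippet is Huber’s loss; in one common parameterization with knot t (a sketch, not necessarily the paper’s exact scaling) it is quadratic inside the knot and linear outside, with value and slope matching at |r| = t:

```python
def huber(r, t=1.0):
    """One common Huber parameterization: r^2 for |r| <= t,
    else the linear tail 2*t*|r| - t^2. Value and first derivative
    (2t) agree at the knot, so the loss is C^1."""
    if abs(r) <= t:
        return r * r
    return 2.0 * t * abs(r) - t * t
```

The linear tails grow at the constant rate 2t, so a single large residual contributes proportionally rather than quadratically; this is what makes the Huberized Lasso in the surrounding contexts robust to contaminated responses.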

327 | Quantile Regression
- Koenker, Hallock
- 2001
Citation Context ...obviously of interest, if we can offer new, efficient algorithms for solving them. In this paper we discuss in this context locally adaptive regression splines [13] (Section 4.1), quantile regression [11] and support vector machines (Section 5). (b) Our efficient algorithms allow us to pose statistically motivated regularized problems that have not been considered in the literature. In this context, w... |
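Quantile regression, cited here, minimizes the piecewise-linear “check” (pinball) loss; in the intercept-only case the minimizer is a sample quantile, which a brute-force stdlib sketch can verify (function names are illustrative):

```python
def check_loss(r, tau):
    """Check/pinball loss: tau*r for r >= 0, (tau - 1)*r for r < 0."""
    return tau * r if r >= 0 else (tau - 1.0) * r

def sample_quantile(y, tau):
    """Brute force over the data values: the constant c minimizing
    the total check loss; for tau = 0.5 this is a sample median."""
    return min(y, key=lambda c: sum(check_loss(yi - c, tau) for yi in y))

data = [1.0, 2.0, 3.0, 4.0, 100.0]
med = sample_quantile(data, 0.5)  # unaffected by the outlier 100.0
```

Because the loss is piecewise linear in the residual, combining it with an ℓ1 penalty yields a linear program, which is why it fits the path-following framework discussed in the paper.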

163 | A new approach to variable selection in least squares problems - Osborne, Presnell, et al. |

148 |
Wavelet shrinkage: asymptopia? (with discussion)
- Donoho, Johnstone, et al.
- 1995
Citation Context ...tion perspective, ℓ1 regularization gives “smooth” variable selection and more compact models than ℓ2 regularization. In the case of orthogonal wavelet bases, the soft thresholding method proposed by [2], which is equivalent to ℓ1 regularization, is asymptotically nearly optimal (in a minimax sense) over a wide variety of loss functions and estimated functions. It is not surprising, therefore, that ℓ... |

148 | The entire regularization path for the support vector machine
- Hastie, Rosset, et al.
Citation Context ...hms were suggested for the LASSO in [14] and for total-variation penalized squared error loss in [13]. We have extended some path-following ideas to versions of the regularized support vector machine [7, 21]. In this paper, we systematically investigate the usefulness of piecewise linear solution paths. We aim to combine efficient computational methods based on piecewise linear paths and statistical cons... |

145 | On the lasso and its dual
- Osborne, Presnell, et al.
Citation Context ...e shown that the piecewise linear coefficient paths property holds for the Lasso, and suggested the LAR-Lasso algorithm which takes advantage of it. Similar algorithms were suggested for the Lasso in [11, 13] and for total-variation penalized squared error loss in [10]. We have extended some path-following ideas to versions of the regularized support vector machine [22, 7]. The results in [3] show that th... |

137 | Feature selection, L1 vs. L2 regularization, and rotational invariance
- Ng
- 2004
Citation Context ...uch superior to l2 regularization from a prediction error perspective. Indeed, a sense can be defined in which l1 regularized problems have lower complexity than l2 regularized ones in high dimension [5, 12]. From an inference/interpretation perspective, l1 regularization gives “smooth” variable selection and more compact models than l2 regularization [3]. In the case of orthogonal wavelet bases, the sof... |

133 | Sparsity and smoothness via the fused lasso
- Tibshirani, Saunders, et al.
- 2005
Citation Context ...liminate the rest of the changes for brevity. An example where a specific two-penalty formulation is natural and of great practical interest in a specific application can be found in our recent paper [17]. The problem considered is that of protein mass spectroscopy, and the predictors correspond to a continuum of “time of flight” sites. Thus the predictors have an order and a distance between them, an... |

111 | Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist. 32 56–134. MR2051001
- ZHANG
- 2004
Citation Context ...if t < r ≤ 1, and 0 otherwise. It is trivial to show that argmin_f E_y ℓ(r(y, f)) = 2 Pr(y = 1) − 1, hence the population minimizer of the Huberized squared hinge loss gives the correct sign for classification. [21] has also considered this loss function for the t = −1 case, but more so from a theoretical perspective. It is also worth noting that [15] has proposed a ψ loss function that is as robust as the 0−... |

90 |
Least angle regression (with discussion
- Efron, Hastie, et al.
Citation Context ..., γ_{m−1}. Our discussion will concentrate on (L, J) pairs which allow efficient generation of the whole path and give statistically useful modeling tools. A canonical example is the LASSO (3). Recently [3] has shown that the piecewise linear coefficient paths property holds for the LASSO, and suggested the LAR–LASSO algorithm which takes advantage of it. Similar algorithms were suggested for the LASSO ... |

79 | Nonconcave penalized likelihood with a diverging number of parameters
- Fan, Peng
- 2004
Citation Context ...ch efficient algorithms can be designed, leaves out some other statistically well-motivated fitting approaches. The use of a nonconvex penalty was advocated by Fan and collaborators in several papers [4, 5]. They expose the favorable variable selection property of the penalty function they offer, which can be viewed as an improvement over the use of the ℓ1 penalty. [16] advocates the use of nonconvex ψ-loss... |

70 | Boosting as a regularized path to a maximum margin classifier
- Rosset, Zhu, et al.
- 2004
Citation Context ...₊ + λ‖β‖₂², where (·)₊ is the positive part of the argument. Boosting [6] is a popular and highly successful method for iteratively building an additive model from a dictionary of “weak learners.” In [15] we have shown that the AdaBoost algorithm approximately follows the path of the ℓ1-regularized solutions to the exponential loss function e^{−yf} as the regularizing parameter λ decreases. In this pape... |

66 |
Least angle regression. Annals of Statistics 32(2), 407–499 (with discussion and a rejoinder from the authors)
- Efron, Hastie, et al.
- 2004
Citation Context ..., γ_{m−1}. Our discussion will concentrate on (L, J) pairs which allow efficient generation of the whole path and give statistically useful modeling tools. A canonical example is the Lasso (3). Recently [3] have shown that the piecewise linear coefficient paths property holds for the Lasso, and suggested the LAR-Lasso algorithm which takes advantage of it. Similar algorithms were suggested for the Lasso... |

66 | An algorithm for multi-parametric quadratic programming and explicit MPC solutions
- Tøndel, Johansen, et al.
Citation Context ...h requires significant generalization of the theory presented here (the area of multi-parametric quadratic programming offers some relevant tools, although no general efficient tools; see, for example, [18]). However, if we limit our interest to a line in this p-dimensional space, then our algorithms for the relevant loss can be applied almost as-is. For example, assume we wanted to find the solution pa... |

64 | Quantile Smoothing Splines
- Koenker, Ng, et al.
- 1994
Citation Context ...idual” being r = (y − β^⊤x) for regression and r = (y · β^⊤x) for classification. When these loss functions are combined with the ℓ1 penalty (or total variation penalty, in appropriate function classes [12]), the resulting regularized problems can be formulated as linear programming problems. When the path of regularized solutions β̂(λ) is considered, it turns out to have interesting structure with reg... |

48 |
Local extremes, runs, strings and multiresolution
- Davies, Kovac
- 2001
Citation Context ...2, ..., then TV(f) is the sum of TV_dif(f), calculated over the differentiable set only, and the absolute “jumps” in f where it is discontinuous. In what follows we assume the range of f is limited to [0, 1]. Total variation penalties tend to lead to regularized solutions which are polynomial splines. [13] investigates the solutions to total-variation penalized least squares problems. The authors use tot... |
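For a sampled signal, the total-variation penalty in this snippet reduces to the sum of absolute successive differences (a discrete stand-in for the continuous definition; the function name is illustrative):

```python
def total_variation(f):
    """Discrete total variation: the sum of absolute jumps between
    consecutive samples; zero for a constant signal."""
    return sum(abs(b - a) for a, b in zip(f, f[1:]))
```

A signal with a few flat pieces and isolated jumps has TV equal to the total jump height, so penalizing TV favors the piecewise-constant and spline-like solutions the snippet describes.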

46 | Regularization of wavelet approximations (with discussion - Antoniadis, Fan - 2001 |

44 |
On ψ-learning
- Shen, Tseng, et al.
- 1998
Citation Context ...and collaborators in several papers [4, 5]. They expose the favorable variable selection property of the penalty function they offer, which can be viewed as an improvement over the use of the ℓ1 penalty. [16] advocates the use of nonconvex ψ-loss in the classification setting, minimizing the effect of outliers and misclassified points. This approach can be viewed as an even more robust version of our Hube... |

42 |
Sparsity and smoothness via the fused lasso
- Tibshirani, Saunders, et al.
Citation Context ... This becomes much more challenging once we stray away from squared error loss. We may also consider more complex penalty structure, such as local or data-dependent penalties [1] or multiple penalties [18]. Finally, it is worth noting that limiting our discussion to convex problems, for which efficient algorithms can be designed, leaves out some other statistically we... |

19 |
Locally adaptive regression splines
- Mammen, Geer
- 1997
Citation Context ...holds for the LASSO, and suggested the LAR–LASSO algorithm which takes advantage of it. Similar algorithms were suggested for the LASSO in [14] and for total-variation penalized squared error loss in [13]. We have extended some path-following ideas to versions of the regularized support vector machine [7, 21]. In this paper, we systematically investigate the usefulness of piecewise linear solution pat... |

18 | Least angle regression (with discussion), Annals of Statistics 32 - Efron, Hastie, et al. - 2004 |

7 | Image reconstruction by linear programming
- Tsuda, Ratsch
Citation Context ...with an ℓ∞ loss, which is also piecewise linear and nondifferentiable. It leads to interesting “minimax” estimation procedures, popular in many areas, including engineering and control. For example, [19] proposes the use of ℓ1-penalized ℓ∞-loss solutions in an image reconstruction problem (but does not consider the solution path). Path-following algorithms can be designed in the same spirit as the ℓ1 ... |

6 |
Locally adaptive regression splines,” Annals of Statistics
- Mammen, Geer
- 1997
Citation Context ...s for the Lasso, and suggested the LAR-Lasso algorithm which takes advantage of it. Similar algorithms were suggested for the Lasso in [11, 13] and for total-variation penalized squared error loss in [10]. We have extended some path-following ideas to versions of the regularized support vector machine [22, 7]. The results in [3] show that the number of linear pieces in the Lasso path is approximately ... |

6 | Quantile Smoothing Splines, Biometrika 81 - Koenker, Ng, et al. - 1994 |

2 |
1-norm support vector machines
- Hastie, T, et al.
- 2003
Citation Context ...hms were suggested for the LASSO in [14] and for total-variation penalized squared error loss in [13]. We have extended some path-following ideas to versions of the regularized support vector machine [7, 21]. In this paper, we systematically investigate the usefulness of piecewise linear solution paths. We aim to combine efficient computational methods based on piecewise linear paths and statistical cons... |

1 |
Discussion of 3 boosting papers. The Annals of Statistics 32(1
- Friedman, Hastie, et al.
- 2004
Citation Context ...uch superior to l2 regularization from a prediction error perspective. Indeed, a sense can be defined in which l1 regularized problems have lower complexity than l2 regularized ones in high dimension [5, 12]. From an inference/interpretation perspective, l1 regularization gives “smooth” variable selection and more compact models than l2 regularization [3]. In the case of orthogonal wavelet bases, the sof... |

1 | On the degrees of freedom of the Lasso. Technical report. Predictive Modeling Group - Zou, Hastie, et al. - 2004 |