
## Piecewise linear regularized solution paths (2007)



Venue: The Annals of Statistics

Citations: 138 (9 self)

### Citations

13209 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ...$x_i^\top \beta) + \lambda\|\beta\|_2^2$. Many “modern” methods for machine learning and signal processing can also be cast in the framework of regularized optimization. For example, the regularized support vector machine [20] uses the hinge loss function and the ℓ2-norm penalty: (4) $\hat{\beta}(\lambda) = \min_\beta \sum_{i=1}^n (1 - y_i x_i^\top \beta)_+ + \lambda \|\beta\|_2^2$, where $(\cdot)_+$ is the positive part of the argument. Boosting [6] is a popular and highly succe...
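The hinge-loss objective in the snippet above can be evaluated directly; a minimal NumPy sketch (function and data names are my own, purely illustrative):

```python
import numpy as np

# Objective (4): sum_i (1 - y_i x_i^T beta)_+ + lam * ||beta||_2^2,
# where (.)_+ is the positive part and y_i takes values in {-1, +1}.

def svm_objective(X, y, beta, lam):
    margins = y * (X @ beta)                  # y_i x_i^T beta
    hinge = np.maximum(1.0 - margins, 0.0)    # positive part of (1 - margin)
    return hinge.sum() + lam * np.dot(beta, beta)

# Three separable points: both classes sit at margin >= 1 under beta = (1, 1),
# so only the penalty term contributes to the objective.
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
val = svm_objective(X, y, np.array([1.0, 1.0]), lam=0.5)  # 0 + 0.5 * 2 = 1.0
```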

4189 | Regression shrinkage and selection via the lasso
- TIBSHIRANI
- 1996
Citation Context: ...roaches, such as ridge regression and the Lasso. Both of these use squared error loss, but they differ in the penalty they impose on the coefficient vector β describing the fitted model: Ridge (2): $\hat{\beta}(\lambda) = \min_\beta \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \|\beta\|_2^2$; Lasso [16] (3): $\hat{\beta}(\lambda) = \min_\beta \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \|\beta\|_1$. Another example from the statistics literature is the penalized logistic regression model [20...
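Of the two penalties in (2) and (3), only the ridge problem has a closed-form solution; a small NumPy sketch (variable names are my own) contrasting λ = 0 with heavy shrinkage:

```python
import numpy as np

# Ridge regression (eq. 2) has the closed form
#   beta_hat = (X^T X + lam * I)^{-1} X^T y.
# The Lasso (eq. 3) swaps the squared l2 penalty for an l1 penalty, which
# has no closed form but yields sparse, piecewise-linear paths in lambda.

def ridge(X, y, lam):
    """Minimize ||y - X beta||_2^2 + lam * ||beta||_2^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([2.0, 0.0, -1.0])
y = X @ beta_true                # noiseless response for illustration
b0 = ridge(X, y, 0.0)            # lam = 0 recovers least squares exactly
b_big = ridge(X, y, 1e6)         # heavy shrinkage drives coefficients to ~0
```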

3463 | The Elements of Statistical Learning
- Hastie, Tibshirani, et al.
- 2001
Citation Context: ...so to that of the Lasso on the original data and after we artificially “contaminate” the data by adding large constants to a small number of responses. We use the training-test configuration as in [6], page 48. The training set consists of 67 observations and the test set of 30 observations. We ran the Lasso and the Huberized Lasso with “knot” at t = 1 on the original dataset, and on the “contamin...

2211 | Experiments with a new boosting algorithm
- Freund, Schapire
- 1996
Citation Context: ...rized support vector machine [20] uses the hinge loss function and the ℓ2-norm penalty: (4) $\hat{\beta}(\lambda) = \min_\beta \sum_{i=1}^n (1 - y_i x_i^\top \beta)_+ + \lambda \|\beta\|_2^2$, where $(\cdot)_+$ is the positive part of the argument. Boosting [6] is a popular and highly successful method for iteratively building an additive model from a dictionary of “weak learners.” In [15] we have shown that the AdaBoost algorithm approximately follows the ...

1859 | Spline Models for Observational Data
- Wahba
- 1990
Citation Context: ...[16]: Ridge (2): $\hat{\beta}(\lambda) = \min_\beta \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \|\beta\|_2^2$; Lasso (3): $\hat{\beta}(\lambda) = \min_\beta \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \|\beta\|_1$. Another example from the statistics literature is the penalized logistic regression model [20] for classification, which is widely used in medical decisions and credit scoring: $\hat{\beta}(\lambda) = \min_\beta \sum_{i=1}^n \log(1 + e^{-y_i x_i^\top \beta}) + \lambda \|\beta\|_2^2$, where the loss is the negative binomial log-likelihood. Many “...
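The penalized logistic objective above is smooth, so any gradient method applies; a plain gradient-descent sketch (not the paper's algorithm; names, step size, and data are my own choices):

```python
import numpy as np

# Minimize  sum_i log(1 + exp(-y_i x_i^T beta)) + lam * ||beta||_2^2
# by plain gradient descent, with labels y_i in {-1, +1}.

def penalized_logistic(X, y, lam, lr=0.1, steps=2000):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(steps):
        m = y * (X @ beta)                                  # margins
        grad = -(X.T @ (y / (1.0 + np.exp(m)))) + 2.0 * lam * beta
        beta -= lr * grad / n                               # averaged step
    return beta

# Linearly separable toy data: labels from a hypothetical direction (1, 0.5).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.sign(X @ np.array([1.0, 0.5]))
beta = penalized_logistic(X, y, lam=0.1)
acc = (np.sign(X @ beta) == y).mean()   # training accuracy of the fit
```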

1419 | The elements of statistical learning: data mining, inference, and prediction
- Hastie, Tibshirani, et al.
- 2001
Citation Context: ...erized LASSO to that of the LASSO on the original data and after we artificially “contaminate” the data by adding large constants to a small number of responses. We used the training-test split as in [8]. The training set consists of 67 observations and the test set of 30 observations. We ran the LASSO and the Huberized LASSO with a knot at t = 1 on the original dataset, and on the “contaminated” dat...

974 | Quantile regression
- Koenker
- 2005
Citation Context: ...obviously of interest, if we can offer new, efficient algorithms for solving them. In this paper we discuss in this context locally adaptive regression splines [13] (Section 4.1), quantile regression [11] and support vector machines (Section 5). (b) Our efficient algorithms allow us to pose statistically motivated regularized problems that have not been considered in the literature. In this context, w...

953 | Ridge regression: biased estimation for non-orthogonal problems
- Hoerl, Kennard
- 1970
Citation Context: ...cribed as exact or approximate regularized optimization approaches. The obvious examples from the statistics literature are explicit regularized linear regression approaches, such as ridge regression [9] and the LASSO [17]. Both of these use squared error loss, but they differ in the penalty they impose on the coefficient vector β: Ridge (2): $\hat{\beta}(\lambda) = \min_\beta \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \|\beta\|_2^2$; LASSO (3): ...

941 | Variable selection via nonconcave penalized likelihood and its oracle properties
- Fan, Li
- 2001
Citation Context: ...ch efficient algorithms can be designed, leaves out some other statistically well motivated fitting approaches. The use of a nonconvex penalty was advocated by Fan and collaborators in several papers [4, 5]. They expose the favorable variable selection property of the penalty function they offer, which can be viewed as an improvement over the use of ℓ1 penalty. [16] advocates the use of nonconvex ψ-loss...

697 | Regression shrinkage and selection via the lasso
- Tibshirani
- 1996
Citation Context: ... approximate regularized optimization approaches. The obvious examples from the statistics literature are explicit regularized linear regression approaches, such as ridge regression [9] and the LASSO [17]. Both of these use squared error loss, but they differ in the penalty they impose on the coefficient vector β: Ridge (2): $\hat{\beta}(\lambda) = \min_\beta \sum_{i=1}^n (y_i - x_i^\top \beta)^2 + \lambda \|\beta\|_2^2$, ...

635 | Robust estimation of a location parameter
- Huber
- 1964
Citation Context: ...ons. On the loss side, this leads us to consider functions L which are: • Pure quadratic loss functions, like those of linear regression. • A mixture of quadratic and linear pieces, like Huber’s loss [10]. These loss functions are of interest because they generate robust modeling tools. They will be the focus of Section 3. • Loss functions which are piecewise linear. These include several widely used ...
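Huber's loss with knot t is quadratic for |r| ≤ t and linear beyond, which is exactly what bounds the influence of large residuals; a one-function sketch (naming is my own):

```python
import numpy as np

# Huber's loss: 0.5 * r^2 for |r| <= t, and t * (|r| - 0.5 * t) beyond,
# so the two pieces match in value and slope at |r| = t.

def huber(r, t=1.0):
    r = np.asarray(r, dtype=float)
    quad = 0.5 * r ** 2                 # quadratic piece near zero
    lin = t * (np.abs(r) - 0.5 * t)     # linear piece for large residuals
    return np.where(np.abs(r) <= t, quad, lin)
```

At the knot the pieces agree (huber(1.0) = 0.5 either way), and beyond it the loss grows only linearly, unlike squared error.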

332 | Sparsity and smoothness via the fused lasso
- TIBSHIRANI, SAUNDERS, et al.
- 2005
Citation Context: ...liminate the rest of the changes for brevity. An example where a specific two-penalty formulation is natural and of great practical interest in a specific application can be found in our recent paper [17]. The problem considered is that of protein mass spectroscopy, and the predictors correspond to a continuum of “time of flight” sites. Thus the predictors have an order and a distance between them, an...

239 | A new approach to variable selection in least squares problems
- Osborne, Presnell, et al.
- 2000
Citation Context: ...alent to it. To generate the whole regularized path $\hat{\beta}(\lambda)$, $0 \le \lambda \le \infty$, one simply sequentially calculates the “step sizes” between each two consecutive λ values and the “directions” $\gamma_1, \dots, \gamma_{m-1}$. Our discussion will concentrate on (L, J) pairs which allow efficient generation of the whole path and give statistically useful modeling tools. A canonical example is the Lasso (3): recently [3] have shown that the piecewise linear coefficient paths property holds for the Lasso, and suggested the LAR-Lasso algorithm which takes advantage of it. Similar algorithms were suggested for the Lasso in [11, 13] and for total-variation penalized squared error loss in [10]. We have extended some path-following ideas to versions of the regularized support vector machine [22, 7]. The results in [3] show that the number of linear pieces in the Lasso path is approximately the number of the variables in X, and the complexity of generating the whole solution path, for all values of λ, using the LAR-Lasso algorithm, is approximately equal to one least squares calculation on the full sample. A simple example to illustrate the piecewise linear property can be seen in Figure 1, where we show the Lasso optimal so...
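The piecewise linear path property is easiest to see in the orthonormal-design special case, where the Lasso solution is coordinatewise soft thresholding of the least-squares coefficients; an illustrative sketch (this special case stands in for, but does not implement, the general LAR-Lasso algorithm):

```python
import numpy as np

# With an orthonormal design (X^T X = I), the Lasso minimizer of
#   ||y - X beta||_2^2 + lam * ||beta||_1
# is beta_j(lam) = sign(b_j) * max(|b_j| - lam/2, 0), with b = X^T y the
# OLS coefficients: each coordinate moves on a straight line in lam
# until it hits zero and stays there -- a piecewise linear path.

def lasso_orthonormal(X, y, lam):
    b = X.T @ y                                       # OLS when X^T X = I
    return np.sign(b) * np.maximum(np.abs(b) - lam / 2.0, 0.0)

# Build an orthonormal design via QR and trace the path on a lam grid.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(40, 3)))
y = Q @ np.array([3.0, 1.0, -2.0])                    # noiseless toy response
lams = [0.0, 0.5, 1.0]
path = np.array([lasso_orthonormal(Q, y, l) for l in lams])
```

On this segment all three coefficients stay active, so the three path points are collinear (zero second difference), and a large enough λ zeroes the smallest coefficient.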

208 | On the LASSO and its dual
- Osborne, Presnell, et al.
- 2000
Citation Context: ...e shown that the piecewise linear coefficient paths property holds for the Lasso, and suggested the LAR-Lasso algorithm which takes advantage of it. Similar algorithms were suggested for the Lasso in [11, 13] and for total-variation penalized squared error loss in [10]. We have extended some path-following ideas to versions of the regularized support vector machine [22, 7]. The results in [3] show that th...

204 | Feature selection, l1 vs. l2 regularization, and rotational invariance.
- Ng
- 2004
Citation Context: ...uch superior to l2 regularization from a prediction error perspective. Indeed, a sense can be defined in which l1 regularized problems have lower complexity than l2 regularized ones in high dimension [5, 12]. From an inference/interpretation perspective, l1 regularization gives “smooth” variable selection and more compact models than l2 regularization [3]. In the case of orthogonal wavelet bases, the sof...

201 | The entire regularization path for the support vector machine
- Rosset, Tibshirani, et al.
- 2004
Citation Context: ...hms were suggested for the LASSO in [14] and for total-variation penalized squared error loss in [13]. We have extended some path-following ideas to versions of the regularized support vector machine [7, 21]. In this paper, we systematically investigate the usefulness of piecewise linear solution paths. We aim to combine efficient computational methods based on piecewise linear paths and statistical cons...

179 | Wavelet shrinkage: asymptopia? (with discussion)
- Donoho, Johnstone, et al.
- 1995
Citation Context: ...tion perspective, ℓ1 regularization gives “smooth” variable selection and more compact models than ℓ2 regularization. In the case of orthogonal wavelet bases, the soft thresholding method proposed by [2], which is equivalent to ℓ1 regularization, is asymptotically nearly optimal (in a minimax sense) over a wide variety of loss functions and estimated functions. It is not surprising, therefore, that ℓ...
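The soft thresholding rule referenced above is the exact minimizer of the one-dimensional ℓ1-penalized problem, which a brute-force grid search can confirm; a small sketch (naming is my own):

```python
import numpy as np

# Soft thresholding S(z, lam) = sign(z) * max(|z| - lam, 0) is the exact
# minimizer of 0.5 * (b - z)^2 + lam * |b|, which is why coefficientwise
# l1 shrinkage of orthogonal (e.g. wavelet) coefficients reduces to it.

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Brute-force check: minimize the scalar objective on a fine grid.
z, lam = 1.3, 0.5
grid = np.linspace(-3.0, 3.0, 60001)
brute = grid[np.argmin(0.5 * (grid - z) ** 2 + lam * np.abs(grid))]
```

The grid minimizer agrees with the closed form (here 0.8), and inputs smaller than the threshold in magnitude are set exactly to zero.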

166 | Non-concave penalized likelihood with a diverging number of parameters
- Fan, Peng
- 2004
Citation Context: ...ch efficient algorithms can be designed, leaves out some other statistically well motivated fitting approaches. The use of a nonconvex penalty was advocated by Fan and collaborators in several papers [4, 5]. They expose the favorable variable selection property of the penalty function they offer, which can be viewed as an improvement over the use of ℓ1 penalty. [16] advocates the use of nonconvex ψ-loss...

158 | Statistical behavior and consistency of classification methods based on convex risk minimization’,
- Zhang
- 2004
Citation Context: ...f t < r ≤ 1, 0 otherwise. It is trivial to show that $\arg\min_f E_y\, l(r(y, f)) = 2\Pr(y = 1) - 1$, hence the population minimizer of the Huberized squared hinge loss gives the correct sign for classification. [21] has also considered this loss function for the t = −1 case, but more so from a theoretical perspective. It is also worth noting that [15] has proposed a ψ loss function that is as robust as the 0−...

153 | Least angle regression (with discussion)
- Efron, Hastie, et al.
- 2004
Citation Context: ...$\gamma_{m-1}$. Our discussion will concentrate on (L, J) pairs which allow efficient generation of the whole path and give statistically useful modeling tools. A canonical example is the LASSO (3). Recently [3] has shown that the piecewise linear coefficient paths property holds for the LASSO, and suggested the LAR–LASSO algorithm which takes advantage of it. Similar algorithms were suggested for the LASSO ...

128 | Quantile smoothing splines.
- Koenker, Ng, et al.
- 1994
Citation Context: ...idual” being $r = y - \beta^\top x$ for regression and $r = y \cdot \beta^\top x$ for classification. When these loss functions are combined with ℓ1 penalty (or total variation penalty, in appropriate function classes [12]), the resulting regularized problems can be formulated as linear programming problems. When the path of regularized solutions $\hat{\beta}(\lambda)$ is considered, it turns out to have interesting structure with reg...

109 | Least angle regression, The Annals of Statistics 32
- Efron, Hastie, et al.
- 2004
Citation Context: ...$\gamma_{m-1}$. Our discussion will concentrate on (L, J) pairs which allow efficient generation of the whole path and give statistically useful modeling tools. A canonical example is the Lasso (3): recently [3] have shown that the piecewise linear coefficient paths property holds for the Lasso, and suggested the LAR-Lasso algorithm which takes advantage of it. Similar algorithms were suggested for the Lasso...

93 | Boosting as a regularized path to a maximum margin classifier
- ROSSET, ZHU, et al.
- 2002
Citation Context: ...$+ \lambda \|\beta\|_2^2$, where $(\cdot)_+$ is the positive part of the argument. Boosting [6] is a popular and highly successful method for iteratively building an additive model from a dictionary of “weak learners.” In [15] we have shown that the AdaBoost algorithm approximately follows the path of the ℓ1-regularized solutions to the exponential loss function $e^{-yf}$ as the regularizing parameter λ decreases. In this pape...

92 | An algorithm for multiparametric quadratic programming and explicit MPC solutions.
- Tøndel, Johansen, et al.
- 2003
Citation Context: ...h requires significant generalization of the theory presented here (the area of multi-parametric quadratic programming offers some relevant tools, although no general efficient tools; see for example [18]). However, if we limit our interest to a line in this p-dimensional space, then our algorithms for the relevant loss can be applied almost as-is. For example, assume we wanted to find the solution pa...

84 | Sparsity and smoothness via the fused lasso
- Tibshirani, Saunders, et al.
- 2005
Citation Context: ... This becomes much more challenging once we stray away from squared error loss. We may also consider more complex penalty structure, such as local or data-dependent penalties [1] or multiple penalties [18]. Finally, it is worth noting that limiting our discussion to convex problems, for which efficient algorithms can be designed, leaves out some other statistically we...

81 | Local extremes, runs, strings and multiresolution
- Davies, Kovac
- 2001
Citation Context: ...2, ..., then TV(f) is the sum of $TV_{dif}(f)$, calculated over the differentiable set only, and the absolute “jumps” in f where it is discontinuous. In what follows we assume the range of f is limited to [0, 1]. Total variation penalties tend to lead to regularized solutions which are polynomial splines. [13] investigates the solutions to total-variation penalized least squares problems. The authors use tot...

59 | Regularization of wavelet approximations (with discussion) - Antoniadis, Fan - 2001

50 | On ψ-learning
- Shen, Tseng, et al.
- 2003
Citation Context: ...and collaborators in several papers [4, 5]. They expose the favorable variable selection property of the penalty function they offer, which can be viewed as an improvement over the use of ℓ1 penalty. [16] advocates the use of nonconvex ψ-loss in the classification setting, minimizing the effect of outliers and misclassified points. This approach can be viewed as an even more robust version of our Hube...

31 | Locally adaptive regression splines
- Mammen, van de Geer
- 1997
Citation Context: ...holds for the LASSO, and suggested the LAR–LASSO algorithm which takes advantage of it. Similar algorithms were suggested for the LASSO in [14] and for total-variation penalized squared error loss in [13]. We have extended some path-following ideas to versions of the regularized support vector machine [7, 21]. In this paper, we systematically investigate the usefulness of piecewise linear solution pat...

31 | Least angle regression (with discussion), The Annals of Statistics 32 - Efron, Hastie, et al. - 2004 |

8 | Quantile smoothing splines, Biometrika 81(4): 673–680 - Koenker, Ng, et al. - 1994 |

7 | Locally adaptive regression splines, Annals of Statistics 25(1):387
- Mammen, van de Geer
- 1997
Citation Context: ...s for the Lasso, and suggested the LAR-Lasso algorithm which takes advantage of it. Similar algorithms were suggested for the Lasso in [11, 13] and for total-variation penalized squared error loss in [10]. We have extended some path-following ideas to versions of the regularized support vector machine [22, 7]. The results in [3] show that the number of linear pieces in the Lasso path is approximately ...

6 | Image reconstruction by linear programming
- Tsuda, Rätsch
Citation Context: ...with an ℓ∞ loss, which is also piecewise linear and nondifferentiable. It leads to interesting “mini-max” estimation procedures, popular in many areas, including engineering and control. For example, [19] proposes the use of ℓ1-penalized ℓ∞-loss solutions in an image reconstruction problem (but does not consider the solution path). Path-following algorithms can be designed in the same spirit as the ℓ1 ...

2 | 1-norm support vector machines
- Zhu, Rosset, Hastie, et al.
- 2003
Citation Context: ...hms were suggested for the LASSO in [14] and for total-variation penalized squared error loss in [13]. We have extended some path-following ideas to versions of the regularized support vector machine [7, 21]. In this paper, we systematically investigate the usefulness of piecewise linear solution paths. We aim to combine efficient computational methods based on piecewise linear paths and statistical cons...

2 | An algorithm for multi-parametric quadratic programming and explicit MPC solutions
- Tøndel, Johansen, et al.
- 2003
Citation Context: ...tractive because the l1 penalty is not scale or rotation invariant, and so the “correct” scaling for the predictor variables may not be known in advance. It can also be viewed as assigning varying “importance” to different variables by penalizing them more or less. General exploration of the p-dimensional surface of solutions implied by setting the values of λ1, ..., λp is a difficult problem, which requires significant generalization of the theory presented here (the area of multi-parametric quadratic programming offers some relevant tools, although no general efficient tools; see for example [18]). However, if we limit our interest to a line in this p-dimensional space, then our algorithms for the relevant loss can be applied almost as-is. For example, assume we wanted to find the solution path for a local-penalty version of the Lasso, with the λk's limited to a line: $\hat{\beta}(\lambda) = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda|\beta_1| + b_2\lambda|\beta_2| + \dots + b_p\lambda|\beta_p|$, with $b_1 = 1$ and $b_2, \dots, b_p$ non-negative constants (independent of λ). Then the LAR-Lasso algorithm, or alternatively our Algorithm 1, can be applied almost as-is, with minimal changes required to account for the “scaling factors” $b_k$, $k = 1, \dots, p$. For...
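One way to see why the scaling factors b_k need only minimal algorithmic changes is the standard rescaling reduction (my own illustration, not taken from the paper): the weighted problem equals an ordinary Lasso on the rescaled columns X_k / b_k. The identity of the two objectives can be checked numerically:

```python
import numpy as np

# Rescaling reduction: for any candidate beta, with alpha_k = b_k * beta_k
# and rescaled design Xs = X / b (columnwise),
#   ||y - X beta||^2 + lam * sum_k b_k |beta_k|
#     == ||y - Xs alpha||^2 + lam * sum_k |alpha_k|,
# so the two problems share minimizers up to the coordinatewise rescaling.

def weighted_objective(X, y, beta, lam, b):
    return np.sum((y - X @ beta) ** 2) + lam * np.sum(b * np.abs(beta))

def plain_objective(Xs, y, alpha, lam):
    return np.sum((y - Xs @ alpha) ** 2) + lam * np.sum(np.abs(alpha))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = rng.normal(size=30)
b = np.array([1.0, 2.0, 0.5])   # per-variable penalty weights, b_1 = 1
Xs = X / b                      # rescaled design, columnwise division
beta = rng.normal(size=3)       # an arbitrary candidate coefficient vector
alpha = beta * b                # its image in the rescaled problem
```

Since the objectives agree at every point, a plain Lasso path on Xs, divided coordinatewise by b, traces the weighted-penalty path.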

1 |
Discussion of 3 boosting papers. The Annals of Statistics.
- Friedman, Hastie, et al.
- 2004
Citation Context: ...uch superior to l2 regularization from a prediction error perspective. Indeed, a sense can be defined in which l1 regularized problems have lower complexity than l2 regularized ones in high dimension [5, 12]. From an inference/interpretation perspective, l1 regularization gives “smooth” variable selection and more compact models than l2 regularization [3]. In the case of orthogonal wavelet bases, the sof...

1 | On the degrees of freedom of the Lasso. Technical report. Predictive Modeling Group - Zou, Hastie, et al. - 2004