Group Lasso with Overlap and Graph Lasso
Cited by 47 (6 self)
We propose a new penalty function which, when used as regularization for empirical risk minimization procedures, leads to sparse estimators. The support of the sparse vector is typically a union of potentially overlapping groups of covariates defined a priori, or a set of covariates which tend to be connected to each other when a graph of covariates is given. We study theoretical properties of the estimator, and illustrate its behavior on simulated and breast cancer gene expression data.
Exploring large feature spaces with hierarchical MKL
, 2008
Cited by 45 (9 self)
For supervised and unsupervised learning, positive definite kernels allow the use of large and potentially infinite-dimensional feature spaces with a computational cost that depends only on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1-norm or the block ℓ1-norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to nonlinear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
Sparsity oracle inequalities for the lasso
- Electronic Journal of Statistics
Cited by 43 (5 self)
This paper studies oracle properties of ℓ1-penalized least squares in a nonparametric regression setting with random design. We show that the penalized least squares estimator satisfies sparsity oracle inequalities, i.e., bounds in terms of the number of non-zero components of the oracle vector. The results are valid even when the dimension of the model is (much) larger than the sample size and the regression matrix is not positive definite. They can be applied to high-dimensional linear regression, to nonparametric adaptive regression estimation and to the problem of aggregation of arbitrary estimators.
Sure independence screening for ultra-high dimensional feature space
, 2006
Cited by 32 (3 self)
Variable selection plays an important role in high dimensional statistical modeling which nowadays appears in many areas and is key to various scientific discoveries. For problems of large scale or dimensionality p, estimation accuracy and computational cost are two top concerns. In a recent paper, Candes and Tao (2007) propose the Dantzig selector using L1 regularization and show that it achieves the ideal risk up to a logarithmic factor log p. Their innovative procedure and remarkable result are challenged when the dimensionality is ultra high as the factor log p can be large and their uniform uncertainty principle can fail. Motivated by these concerns, we introduce the concept of sure screening and propose a sure screening method based on correlation learning, called the Sure Independence Screening (SIS), to reduce dimensionality from high to a moderate scale that is below sample size. In a fairly general asymptotic framework, the SIS is shown to have the sure screening property for even exponentially growing dimensionality. As a methodological extension, an iterative SIS (ISIS) is also proposed to enhance its finite sample performance. With dimension reduced accurately from high to below sample size, variable selection can be improved on both speed and accuracy, and can then be ac-
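The correlation-based screening step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `sis_screen`, the choice of `d`, the standardization details, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def sis_screen(X, y, d):
    """Return indices of the d columns of X with the largest absolute
    marginal correlation with y (a hypothetical screening helper)."""
    Xc = X - X.mean(axis=0)
    scale = Xc.std(axis=0)
    scale[scale == 0] = 1.0          # constant columns end up with score 0
    Xs = Xc / scale                  # standardized columns
    yc = y - y.mean()
    scores = np.abs(Xs.T @ yc)       # componentwise marginal scores
    return np.argsort(scores)[::-1][:d]

# Synthetic example (assumed): p = 1000 covariates, n = 50 observations,
# only the first two covariates carry signal.
rng = np.random.default_rng(0)
n, p = 50, 1000
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(n)

kept = sis_screen(X, y, d=20)        # reduce from p = 1000 to d = 20 < n
```

After screening, a standard selection method can be run on `X[:, kept]` only, since the retained dimension is below the sample size.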
Multi-task feature selection
- In the Workshop of Structural Knowledge Transfer for Machine Learning in the 23rd International Conference on Machine Learning (ICML)
, 2006
Cited by 31 (1 self)
We address the problem of joint feature selection across a group of related classification or regression tasks. We propose a novel type of joint regularization of the model parameters in order to couple feature selection across tasks. Intuitively, we extend the ℓ1 regularization for single-task estimation to the multi-task setting. By penalizing the sum of ℓ2-norms of the blocks of coefficients associated with each feature across different tasks, we encourage multiple predictors to have similar parameter sparsity patterns. To fit parameters under this regularization, we propose a blockwise boosting scheme that follows the regularization path. The algorithm introduces and updates simultaneously the coefficients associated with one feature in all tasks. We show empirically that this approach outperforms independent ℓ1-based feature selection on several datasets.
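The penalty described above (a sum, over features, of the ℓ2-norm of that feature's coefficients across tasks) can be written down in a few lines. The sketch below assumes a coefficient matrix `W` with one row per feature and one column per task; that layout and the name `l1_l2_penalty` are illustrative choices, not taken from the paper.

```python
import numpy as np

def l1_l2_penalty(W):
    """Sum over features (rows of W) of the l2-norm of that feature's
    coefficients across all tasks (columns of W)."""
    return float(np.linalg.norm(W, axis=1).sum())

# Assumed toy example: 3 features, 2 tasks.
W = np.array([[3.0, 4.0],    # feature used by both tasks, row norm 5
              [0.0, 0.0],    # feature dropped in every task, row norm 0
              [1.0, 0.0]])   # feature used by one task, row norm 1
penalty = l1_l2_penalty(W)   # 5 + 0 + 1
```

A row contributes nothing only when it is zero in every task, which is why this penalty favors sparsity patterns shared across tasks.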
A note on the LASSO and related procedures in model selection
- STATISTICA SINICA
, 2004
Cited by 28 (5 self)
The Lasso, the Forward Stagewise regression and the Lars are closely related procedures recently proposed for linear regression problems. Each of them can produce sparse models and can be used both for estimation and variable selection. In practical implementations these algorithms are typically tuned to achieve optimal prediction accuracy. We show that, when the prediction accuracy is used as the criterion to choose the tuning parameter, in general these procedures are not consistent in terms of variable selection. That is, the sets of variables selected are not consistent at finding the true set of important variables. In particular, we show that for any sample size n, when there are superfluous variables in the linear regression model and the design matrix is orthogonal, the probability of the procedures correctly identifying the true set of important variables is less than a constant (smaller than one) not depending on n. This result is also shown to hold for two dimensional problems with general correlated design matrices. The results indicate that in problems where
Variable Selection for Cox's Proportional Hazards Model and Frailty Model
- ANNALS OF STATISTICS
, 2002
Cited by 24 (7 self)
A class of variable selection procedures for parametric models via nonconcave penalized likelihood was proposed in Fan and Li (2001a). It has been shown there that the resulting procedures perform as well as if the subset of significant variables were known in advance. Such a property is called an oracle property. The proposed procedures were illustrated in the context of linear regression, robust linear regression and generalized linear models. In this paper, the nonconcave penalized likelihood approach is extended further to the Cox proportional hazards model and the Cox proportional hazards frailty model, two commonly used semi-parametric models in survival analysis. As a result, new variable selection procedures for these two commonly-used models are proposed. It is demonstrated how the rates of convergence depend on the regularization parameter in the penalty function. Further, with a proper choice of the regularization parameter and the penalty function, the proposed estimators possess an oracle property. Standard error formulae are derived and their accuracies are empirically tested. Simulation studies show that the proposed procedures are more stable in prediction and more effective in computation than the best subset variable selection, and they reduce model complexity as effectively as the best subset variable selection. Compared with the LASSO, which is the penalized likelihood method with the L1-penalty, proposed by Tibshirani, the newly proposed approaches have better theoretic properties and finite sample performance.
Hiroshi Imai and Masao Iri. Polygonal approximations of a curve – formulations and algorithms
- Computational Morphology
, 1988
Cited by 23 (5 self)
Regularization by the sum of singular values, also referred to as the trace norm, is a popular technique for estimating low rank rectangular matrices. In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled.
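As a small illustration of the regularizer named in this abstract, the trace norm of a rectangular matrix can be computed as the sum of its singular values. The helper name `trace_norm` and the rank-one test matrix are assumptions for the example only.

```python
import numpy as np

def trace_norm(M):
    """Trace norm of M: the sum of its singular values."""
    return float(np.linalg.svd(M, compute_uv=False).sum())

# Assumed toy example: a rank-one 3 x 2 matrix u v^T with |u| = 3, |v| = 5,
# so its only nonzero singular value is |u| * |v| = 15.
M = np.outer([1.0, 2.0, 2.0], [3.0, 4.0])
value = trace_norm(M)
rank = int(np.linalg.matrix_rank(M))
```

The trace norm plays the same role for singular values that the ℓ1-norm plays for coefficients in the Lasso, which is the link the abstract draws on.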
Boosted lasso
, 2004
Cited by 19 (5 self)
In this paper, we propose the Boosted Lasso (BLasso) algorithm which ties the Boosting algorithm with the Lasso method. BLasso is derived as a coordinate descent method with a fixed step size applied to the general Lasso loss function (L1 penalized convex loss). It consists of both a forward step and a backward step. The forward step is similar to Boosting and Forward Stagewise Fitting, but the backward step is new and makes the Boosting path approximate the Lasso path. In the cases of a finite number of base learners and a bounded Hessian of the loss function, when the step size goes to zero, the BLasso path is shown to converge to the Lasso path. For cases with a large number of base learners, our simulations show that since BLasso approximates the Lasso paths, the model estimates are sparser than Forward Stagewise Fitting with equivalent or better prediction performance when the true model is sparse and there are more predictors than the sample size. In addition, we extend BLasso to minimizing a general convex loss penalized by a general convex function. Since BLasso relies only on differences, not derivatives, we demonstrate this extension as a simple off-the-shelf algorithm for tracing the solution paths of regularization problems.
Adaptive Lasso for sparse high-dimensional regression
- University of Iowa
, 2006
Cited by 19 (4 self)
We study the asymptotic properties of adaptive LASSO estimators in sparse, high-dimensional, linear regression models when the number of covariates may increase with the sample size. We consider variable selection using the adaptive LASSO, where the L1 norms in the penalty are re-weighted by data-dependent weights. We show that, if a reasonable initial estimator is available, then under appropriate conditions, the adaptive LASSO correctly selects covariates with nonzero coefficients with probability converging to one and that the estimators of nonzero coefficients have the same asymptotic distribution that they would have if the zero coefficients were known in advance. Thus, the adaptive LASSO has an oracle property in the sense of Fan and Li (2001) and Fan and Peng (2004). In addition, under a partial orthogonality condition in which the covariates with zero coefficients are weakly correlated with the covariates with nonzero coefficients, univariate regression can be used to obtain the initial estimator. With this initial estimator, the adaptive LASSO has the oracle property even when the number of covariates is greater than the sample size. Key words and phrases: penalized regression, high-dimensional data, variable selection, asymptotic normality, oracle property, zero-consistency. Short title: Sparse high-dimensional regression.
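For orientation, a re-weighted L1 criterion of the kind this abstract describes is commonly written as below. The notation is an assumption made here, not taken from the abstract: $\tilde\beta_j$ stands for the initial estimator, $\lambda$ for the tuning parameter, and the inverse-magnitude form of the weights $w_j$ is one common choice; the paper's exact scaling may differ.

```latex
\hat{\beta}
  = \arg\min_{\beta \in \mathbb{R}^{p}}
    \; \lVert y - X\beta \rVert_2^2
    + \lambda \sum_{j=1}^{p} w_j \lvert \beta_j \rvert ,
\qquad
w_j = \frac{1}{\lvert \tilde{\beta}_j \rvert} .
```

Under this form, covariates whose initial estimates are close to zero receive large weights and are penalized heavily, while covariates with large initial estimates are penalized lightly.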

