Stability selection
"... Proofs subject to correction. Not to be reproduced without permission. Contributions to the discussion must not exceed 400 words. Contributions longer than 400 words will be cut by the editor. 1 2 ..."
Abstract

Cited by 60 (2 self)
 Add to MetaCart
Proofs subject to correction. Not to be reproduced without permission. Contributions to the discussion must not exceed 400 words. Contributions longer than 400 words will be cut by the editor. 1 2
Near-ideal model selection by ℓ1 minimization
, 2008
"... We consider the fundamental problem of estimating the mean of a vector y = Xβ + z, where X is an n × p design matrix in which one can have far more variables than observations and z is a stochastic error term—the socalled ‘p> n ’ setup. When β is sparse, or more generally, when there is a sparse su ..."
Abstract

Cited by 45 (2 self)
 Add to MetaCart
We consider the fundamental problem of estimating the mean of a vector y = Xβ + z, where X is an n × p design matrix in which one can have far more variables than observations and z is a stochastic error term, the so-called ‘p > n’ setup. When β is sparse, or more generally, when there is a sparse subset of covariates providing a close approximation to the unknown mean vector, we ask whether or not it is possible to accurately estimate Xβ using a computationally tractable algorithm. We show that in a surprisingly wide range of situations, the lasso happens to nearly select the best subset of variables. Quantitatively speaking, we prove that solving a simple quadratic program achieves a squared error within a logarithmic factor of the ideal mean squared error one would achieve with an oracle supplying perfect information about which variables should be included in the model and which variables should not. Interestingly, our results describe the average performance of the lasso; that is, the performance one can expect in a vast majority of cases where Xβ is a sparse or nearly sparse superposition of variables, but not in all cases. Our results are non-asymptotic and widely applicable, since they simply require that pairs of predictor variables are not too collinear.
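The "simple quadratic program" the abstract refers to is the lasso itself: least squares plus an ℓ1 penalty. As a minimal sketch (not the authors' code; the data, penalty level `lam`, and iteration count are arbitrary illustrative choices), the lasso can be solved by cyclic coordinate descent with soft thresholding:

```python
import numpy as np

def soft_threshold(z, t):
    # Scalar lasso solution: shrink toward zero by t, clipping at zero.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1.
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]   # residual excluding feature j
            rho = X[:, j] @ r_j / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
    return b

# Sparse ground truth: only 2 of 10 coefficients are nonzero.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
beta_true = np.zeros(10)
beta_true[:2] = [3.0, -2.0]
y = X @ beta_true + 0.1 * rng.standard_normal(100)
b_hat = lasso_cd(X, y, lam=0.1)
```

With a well-conditioned design, the estimate concentrates on the true support, illustrating the near-ideal selection the abstract describes; the nonzero coefficients are shrunk by roughly `lam`.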
Boosting algorithms: Regularization, prediction and model fitting
 Statistical Science
, 2007
"... Abstract. We present a statistical perspective on boosting. Special emphasis is given to estimating potentially complex parametric or nonparametric models, including generalized linear and additive models as well as regression models for survival analysis. Concepts of degrees of freedom and correspo ..."
Abstract

Cited by 38 (5 self)
 Add to MetaCart
We present a statistical perspective on boosting. Special emphasis is given to estimating potentially complex parametric or nonparametric models, including generalized linear and additive models as well as regression models for survival analysis. Concepts of degrees of freedom and corresponding Akaike or Bayesian information criteria, particularly useful for regularization and variable selection in high-dimensional covariate spaces, are discussed as well. The practical aspects of boosting procedures for fitting statistical models are illustrated by means of the dedicated open-source software package mboost. This package implements functions which can be used for model fitting, prediction and variable selection. It is flexible, allowing for the implementation of new boosting algorithms optimizing user-specified loss functions. Key words and phrases: generalized linear models, generalized additive models, gradient boosting, survival analysis, variable selection, software.
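As a rough illustration of the generic boosting idea described above (not mboost itself, which is an R package; the design, step size `nu`, and step count below are illustrative), componentwise linear L2-boosting repeatedly fits the single predictor that best explains the current residual and takes a small step toward that fit:

```python
import numpy as np

def l2_boost(X, y, n_steps=300, nu=0.1):
    # Componentwise linear L2-boosting: greedy stagewise least squares.
    n, p = X.shape
    b = np.zeros(p)
    r = y.astype(float).copy()
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_steps):
        coef = X.T @ r / col_sq                 # per-predictor LS fit to residual
        j = int(np.argmax(coef ** 2 * col_sq))  # largest RSS reduction
        b[j] += nu * coef[j]                    # small step toward the best fit
        r -= nu * coef[j] * X[:, j]             # update the residual
    return b

# One strong predictor among ten; boosting should concentrate on it.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(100)
b = l2_boost(X, y)
```

The small step size is the regularization: stopping early yields sparse fits, which is why stopping time plays the role of the tuning parameter discussed in the paper.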
VARIABLE SELECTION IN NONPARAMETRIC ADDITIVE MODELS
, 2008
"... Summary. We consider a nonparametric additive model of a conditional mean function in which the number of variables and additive components may be larger than the sample size but the number of nonzero additive components is “small” relative to the sample size. The statistical problem is to determin ..."
Abstract

Cited by 17 (1 self)
 Add to MetaCart
Summary. We consider a nonparametric additive model of a conditional mean function in which the number of variables and additive components may be larger than the sample size, but the number of nonzero additive components is “small” relative to the sample size. The statistical problem is to determine which additive components are nonzero. The additive components are approximated by truncated series expansions with B-spline bases. With this approximation, the problem of component selection becomes that of selecting the groups of coefficients in the expansion. We apply the adaptive group Lasso to select nonzero components, using the group Lasso to obtain an initial estimator and reduce the dimension of the problem. We give conditions under which the group Lasso selects a model whose number of components is comparable with that of the underlying model, and the adaptive group Lasso selects the nonzero components correctly with probability approaching one as the sample size increases and achieves the optimal rate of convergence. Following model selection, oracle-efficient, asymptotically normal estimators of the nonzero components can be obtained by using existing methods. The results of Monte Carlo experiments show that the adaptive group Lasso procedure works well with samples of moderate size. A data example is used to illustrate the application of the proposed method. Key words and phrases: adaptive group Lasso; component selection; high-dimensional data; nonparametric regression; selection consistency. Short title: Nonparametric component selection. AMS 2000 subject classification: primary 62G08, 62G20; secondary 62G99.
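The group Lasso selects whole groups of B-spline coefficients at once. Under an orthonormal within-group design (a simplifying assumption for this sketch, not the paper's general setting), the group-Lasso update reduces to group-wise soft thresholding: shrink each group by its Euclidean norm and drop the group entirely when that norm falls below the penalty level.

```python
import numpy as np

def group_soft_threshold(beta_g, lam):
    # Shrink the whole coefficient group toward zero; a group whose
    # Euclidean norm is below lam is zeroed out entirely, which is how
    # an additive component gets removed from the model.
    norm = np.linalg.norm(beta_g)
    if norm <= lam:
        return np.zeros_like(beta_g)
    return (1.0 - lam / norm) * beta_g

# A strong group (norm 5) is shrunk; a weak group is dropped outright.
strong = group_soft_threshold(np.array([3.0, 4.0]), lam=1.0)
weak = group_soft_threshold(np.array([0.3, 0.4]), lam=1.0)
```

The adaptive variant simply replaces the single `lam` with group-specific penalties that are smaller for groups the initial group-Lasso estimate found to be large.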
SparseNet: Coordinate Descent with Non-Convex Penalties
, 2009
"... We address the problem of sparse selection in linear models. A number of nonconvex penalties have been proposed for this purpose, along with a variety of convexrelaxation algorithms for finding good solutions. In this paper we pursue the coordinatedescent approach for optimization, and study its ..."
Abstract

Cited by 12 (0 self)
 Add to MetaCart
We address the problem of sparse selection in linear models. A number of non-convex penalties have been proposed for this purpose, along with a variety of convex-relaxation algorithms for finding good solutions. In this paper we pursue the coordinate-descent approach for optimization, and study its convergence properties. We characterize the properties of penalties suitable for this approach, study their corresponding threshold functions, and describe a df-standardizing reparametrization that assists our pathwise algorithm. The MC+ penalty (Zhang 2010) is ideally suited to this task, and we use it to demonstrate the performance of our algorithm.
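In the orthogonal (scalar) case, the threshold function induced by the MC+ penalty interpolates between soft thresholding (γ → ∞, the lasso) and hard thresholding (γ → 1+, best subset). A sketch of that scalar threshold function, assuming the unit-variance normalization:

```python
import numpy as np

def mcplus_threshold(z, lam, gamma):
    # Piecewise: zero inside [-lam, lam]; rescaled soft-threshold shrinkage
    # up to gamma*lam; identity (no shrinkage at all) beyond gamma*lam.
    z = np.asarray(z, dtype=float)
    soft = np.sign(z) * (np.abs(z) - lam) / (1.0 - 1.0 / gamma)
    return np.where(np.abs(z) <= lam, 0.0,
                    np.where(np.abs(z) <= gamma * lam, soft, z))

vals = mcplus_threshold(np.array([0.5, 1.5, 5.0]), lam=1.0, gamma=2.0)
```

The key practical property is that large coefficients pass through unshrunk, avoiding the bias that the lasso's uniform shrinkage imposes, while the update remains a simple closed form suitable for coordinate descent.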
Discussion of “One-step sparse estimates in nonconcave penalized likelihood models”
, 2007
"... Hui Zou and Runze Li ought to be congratulated for their nice and interesting work which presents a variety of ideas and insights in statistical methodology, computing and asymptotics. We agree with them that one or even multistep (orstage) procedures are currently among the best for analyzing co ..."
Abstract

Cited by 10 (3 self)
 Add to MetaCart
Hui Zou and Runze Li ought to be congratulated for their nice and interesting work, which presents a variety of ideas and insights in statistical methodology, computing and asymptotics. We agree with them that one- or even multi-step (or -stage) procedures are currently among the best for analyzing complex datasets. The focus of our discussion is mainly on high-dimensional problems where p ≫ n: we will illustrate, empirically and by describing some theory, that many of the ideas from the current paper are very useful for the p ≫ n setting as well.

1. Non-convex objective function and multi-step convex optimization. The paper demonstrates a nice, and in a sense surprising, connection between difficult non-convex optimization and computationally efficient Lasso-type methodology which involves one- (or multi-) step convex optimization. The SCAD penalty function [5] has often been criticized from a computational point of view, as it corresponds to a non-convex objective function which is difficult to minimize; especially in situations with many covariates, optimizing a SCAD-penalized likelihood becomes an awkward task. The usual way to optimize a SCAD-penalized likelihood is to use a local quadratic approximation. Zou and Li show here what happens if one uses a local linear approximation instead. In 2001, when Fan and Li [5] proposed the SCAD penalty, it was probably easier to work with a quadratic approximation. Nowadays, and because of the contribution of the current paper, a local linear approximation seems as easy to use, thanks to the homotopy method [12] and the LARS algorithm [4]. While the latter is suited for linear models, more sophisticated algorithms have been proposed for generalized linear models; cf. [6, 8, 13]. In addition, and importantly, the local linear approximation yields sparse model fits where quite a few or even many of the coefficients in a linear or
P-values for high-dimensional regression
, 2009
"... Assigning significance in highdimensional regression is challenging. Most computationally efficient selection algorithms cannot guard against inclusion of noise variables. Asymptotically valid pvalues are not available. An exception is a recent proposal by Wasserman and Roeder (2008) which splits ..."
Abstract

Cited by 9 (0 self)
 Add to MetaCart
Assigning significance in high-dimensional regression is challenging. Most computationally efficient selection algorithms cannot guard against inclusion of noise variables. Asymptotically valid p-values are not available. An exception is a recent proposal by Wasserman and Roeder (2008) which splits the data into two parts. The number of variables is then reduced to a manageable size using the first split, while classical variable selection techniques can be applied to the remaining variables, using the data from the second split. This yields asymptotic error control under minimal conditions. It involves, however, a one-time random split of the data. Results are sensitive to this arbitrary choice: it amounts to a “p-value lottery” and makes it difficult to reproduce results. Here, we show that inference across multiple random splits can be aggregated, while keeping asymptotic control over the inclusion of noise variables. In addition, the proposed aggregation is shown to improve power, while reducing the number of falsely selected variables substantially. Keywords: high-dimensional variable selection, data splitting, multiple comparisons.
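A minimal sketch of the aggregation idea (assuming each split already yields multiplicity-adjusted per-variable p-values; the array below is toy data, not from the paper): for each variable, take an empirical γ-quantile of its p-values across the B splits and divide by γ, capped at 1. With γ = 1/2 this is the "twice the median" rule.

```python
import numpy as np

def aggregate_pvalues(pval_matrix, gamma=0.5):
    # Rows: B random splits; columns: variables. For each variable,
    # the gamma-quantile of its p-values divided by gamma is again a
    # valid p-value, regardless of the dependence between splits.
    q = np.quantile(pval_matrix, gamma, axis=0)
    return np.minimum(1.0, q / gamma)

# B = 5 splits, 3 variables: variable 0 is consistently significant,
# variable 2's p-values fluctuate across splits (the "p-value lottery"
# a single random split would be exposed to).
P = np.array([[0.001, 0.40, 0.03],
              [0.002, 0.55, 0.90],
              [0.001, 0.60, 0.70],
              [0.003, 0.35, 0.04],
              [0.002, 0.50, 0.80]])
agg = aggregate_pvalues(P)
```

Note how variable 2, which would look significant under two of the five splits, is not declared significant after aggregation, while variable 0 survives.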
Smoothing ℓ1-penalized estimators for high-dimensional time-course data
 Electronic Journal of Statistics
, 2007
"... Abstract: When a series of (related) linear models has to be estimated it is often appropriate to combine the different datasets to construct more efficient estimators. We use ℓ1penalized estimators like the Lasso or the Adaptive Lasso which can simultaneously do parameter estimation and model sel ..."
Abstract

Cited by 4 (2 self)
 Add to MetaCart
When a series of (related) linear models has to be estimated, it is often appropriate to combine the different datasets to construct more efficient estimators. We use ℓ1-penalized estimators like the Lasso or the Adaptive Lasso, which can simultaneously do parameter estimation and model selection. We show that for a time-course of high-dimensional linear models, the convergence rates of the Lasso and of the Adaptive Lasso can be improved by combining the different time-points in a suitable way. Moreover, the Adaptive Lasso still enjoys oracle properties and consistent variable selection. The finite-sample properties of the proposed methods are illustrated on simulated data and on a real problem of motif finding in DNA sequences.
Penalized Likelihood Methods for Estimation of sparse high dimensional directed acyclic graphs
, 2010
"... Directed acyclic graphs are commonly used to represent causal relationships among random variables in graphical models. Applications of these models arise in the study of physical, as well as biological systems, where directed edges between nodes represent the influence of components of the system o ..."
Abstract

Cited by 3 (3 self)
 Add to MetaCart
Directed acyclic graphs are commonly used to represent causal relationships among random variables in graphical models. Applications of these models arise in the study of physical as well as biological systems, where directed edges between nodes represent the influence of components of the system on each other. Estimation of directed graphs from observational data is computationally NP-hard. In addition, directed graphs with the same structure may be indistinguishable based on observations alone. When the nodes exhibit a natural ordering, the problem of estimating directed graphs reduces to the problem of estimating the structure of the network. In this paper, we propose an efficient penalized likelihood method for estimation of the adjacency matrix of directed acyclic graphs when variables inherit a natural ordering. We study variable selection consistency of both the lasso and the adaptive lasso penalties in high-dimensional sparse settings, and propose an error-based choice for selecting the tuning parameter. We show that although the lasso is only variable selection consistent under stringent conditions, the adaptive lasso can consistently estimate the true graph under the usual regularity assumptions. Simulation studies indicate that the correct ordering of the variables becomes less critical in estimation of high-dimensional sparse networks.
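When the ordering is known, each node can be regressed on its predecessors only, and the estimated coefficients fill in the adjacency matrix. The sketch below uses thresholded least squares as a simple stand-in for the paper's penalized-likelihood step (an assumption of this illustration, not the authors' estimator); the chain data and threshold are illustrative.

```python
import numpy as np

def fit_dag_adjacency(X, order, lam=0.2):
    # Regress node order[k] on its predecessors order[:k] only, then
    # soft-threshold the coefficients; nonzero entries A[i, j] are the
    # estimated directed edges i -> j.
    n, p = X.shape
    A = np.zeros((p, p))
    for k in range(1, p):
        j = order[k]
        parents = order[:k]
        coef, *_ = np.linalg.lstsq(X[:, parents], X[:, j], rcond=None)
        coef = np.sign(coef) * np.maximum(np.abs(coef) - lam, 0.0)
        A[parents, j] = coef
    return A

# Toy chain X0 -> X1 -> X2 with the ordering [0, 1, 2] known in advance.
rng = np.random.default_rng(1)
x0 = rng.standard_normal(500)
x1 = 0.8 * x0 + 0.1 * rng.standard_normal(500)
x2 = 0.8 * x1 + 0.1 * rng.standard_normal(500)
X = np.column_stack([x0, x1, x2])
A = fit_dag_adjacency(X, order=[0, 1, 2], lam=0.2)
```

The ordering is what makes this tractable: without it, the search over orderings is exactly the NP-hard part mentioned in the abstract.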
Learning Scale Free Networks by Reweighted ℓ1 regularization
"... Methods for ℓ1type regularization have been widely used in Gaussian graphical model selection tasks to encourage sparse structures. However, often we would like to include more structural information than mere sparsity. In this work, we focus on learning socalled “scalefree ” models, a common fea ..."
Abstract

Cited by 3 (0 self)
 Add to MetaCart
Methods for ℓ1-type regularization have been widely used in Gaussian graphical model selection tasks to encourage sparse structures. However, often we would like to include more structural information than mere sparsity. In this work, we focus on learning so-called “scale-free” models, a common feature that appears in many real-world networks. We replace the ℓ1 regularization with a power-law regularization and optimize the objective function by a sequence of iteratively reweighted ℓ1 regularization problems, where the regularization coefficients of nodes with high degree are reduced, encouraging the appearance of hubs with high degree. Our method can be easily adapted to improve any existing ℓ1-based methods, such as graphical lasso, neighborhood selection, and JSRM, when the underlying networks are believed to be scale-free or have dominating hubs. We demonstrate in simulation that our method significantly outperforms a baseline ℓ1 method at learning scale-free networks and hub networks, and also illustrate its behavior on gene expression data.
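A sketch of the reweighting step under one plausible weight choice (penalty weight inversely proportional to a node's current estimated degree plus a constant; the exact weighting in the paper may differ): hubs accumulate degree across iterations and see their edge penalties shrink, which encourages further edges at those nodes.

```python
import numpy as np

def node_weights(Theta, eps=1.0):
    # Degree of each node in the current precision-matrix estimate
    # (off-diagonal nonzeros), mapped to a per-node penalty weight:
    # high-degree nodes get a smaller weight, i.e. a weaker penalty.
    off = Theta - np.diag(np.diag(Theta))
    deg = (np.abs(off) > 1e-8).sum(axis=1)
    return 1.0 / (deg + eps)

# Node 0 is a hub (degree 3); nodes 1-3 each have degree 1. In the next
# reweighted ℓ1 problem, edge (i, j) would be penalized by w[i] + w[j].
Theta = np.array([[1.0, 0.5, 0.4, 0.3],
                  [0.5, 1.0, 0.0, 0.0],
                  [0.4, 0.0, 1.0, 0.0],
                  [0.3, 0.0, 0.0, 1.0]])
w = node_weights(Theta)
```

Because the weights depend on the current estimate, the overall procedure alternates between solving a weighted ℓ1 problem (e.g. a weighted graphical lasso) and recomputing the weights.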