Results 11  20
of
88
Sparse estimation of large covariance matrices via a nested Lasso penalty. Annals of Applied Statistics
, 2007
"... The paper proposes a new covariance estimator for large covariance matrices when the variables have a natural ordering. Using the Cholesky decomposition of the inverse, we impose a banded structure on the Cholesky factor, and select the bandwidth adaptively for each row of the Cholesky factor, using ..."
Abstract

Cited by 33 (8 self)
 Add to MetaCart
The paper proposes a new covariance estimator for large covariance matrices when the variables have a natural ordering. Using the Cholesky decomposition of the inverse, we impose a banded structure on the Cholesky factor, and select the bandwidth adaptively for each row of the Cholesky factor, using a novel penalty we call nested Lasso. This structure has more flexibility than regular banding, but, unlike regular Lasso applied to the entries of the Cholesky factor, results in a sparse estimator for the inverse of the covariance matrix. An iterative algorithm for solving the optimization problem is developed. The estimator is compared to a number of other covariance estimators and is shown to do best, both in simulations and on a real data example. Simulations show that the margin by which the estimator outperforms its competitors tends to increase with dimension. 1. Introduction. Estimating
Structured sparsityinducing norms through submodular functions
 IN ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS
, 2010
"... Sparse methods for supervised learning aim at finding good linear predictors from as few variables as possible, i.e., with small cardinality of their supports. This combinatorial selection problem is often turnedinto a convex optimization problem byreplacing the cardinality function by its convex en ..."
Abstract

Cited by 30 (9 self)
 Add to MetaCart
Sparse methods for supervised learning aim at finding good linear predictors from as few variables as possible, i.e., with small cardinality of their supports. This combinatorial selection problem is often turnedinto a convex optimization problem byreplacing the cardinality function by its convex envelope (tightest convex lower bound), in this case the ℓ1norm. In this paper, we investigate more general setfunctions than the cardinality, that may incorporate prior knowledge or structural constraints which are common in many applications: namely, we show that for nonincreasing submodular setfunctions, the corresponding convex envelope can be obtained from its Lovász extension, a common tool in submodular analysis. This defines a family of polyhedral norms, for which we provide generic algorithmic tools (subgradients and proximal operators) and theoretical results (conditions for support recovery or highdimensional inference). By selecting specific submodular functions, we can give a new interpretation to known norms, such as those based on rankstatistics or grouped norms with potentially overlapping groups; we also define new norms, in particular ones that can be used as nonfactorial priors for supervised learning.
HighDimensional NonLinear Variable Selection through Hierarchical Kernel Learning
, 2009
"... We consider the problem of highdimensional nonlinear variable selection for supervised learning. Our approach is based on performing linear selection among exponentially many appropriately defined positive definite kernels that characterize nonlinear interactions between the original variables. T ..."
Abstract

Cited by 18 (5 self)
 Add to MetaCart
We consider the problem of highdimensional nonlinear variable selection for supervised learning. Our approach is based on performing linear selection among exponentially many appropriately defined positive definite kernels that characterize nonlinear interactions between the original variables. To select efficiently from these many kernels, we use the natural hierarchical structure of the problem to extend the multiple kernel learning framework to kernels that can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a graphadapted sparsityinducing norm, in polynomial time in the number of selected kernels. Moreover, we study the consistency of variable selection in highdimensional settings, showing that under certain assumptions, our regularization framework allows a number of irrelevant variables which is exponential in the number of observations. Our simulations on synthetic datasets and datasets from the UCI repository show stateoftheart predictive performance for nonlinear regression problems. 1
Simultaneous support recovery in high dimensions: Benefits and perils of block ℓ1/ℓ∞regularization
, 2010
"... ..."
Joint support recovery under highdimensional scaling: Benefits and perils of ℓ1,∞regularization
"... Given a collection of r ≥ 2 linear regression problems in p dimensions, suppose that the regression coefficients share partially common supports. This setup suggests the use of ℓ1/ℓ∞regularized regression for joint estimation of the p × r matrix of regression coefficients. We analyze the highdime ..."
Abstract

Cited by 17 (1 self)
 Add to MetaCart
Given a collection of r ≥ 2 linear regression problems in p dimensions, suppose that the regression coefficients share partially common supports. This setup suggests the use of ℓ1/ℓ∞regularized regression for joint estimation of the p × r matrix of regression coefficients. We analyze the highdimensional scaling of ℓ1/ℓ∞regularized quadratic programming, considering both consistency rates in ℓ∞norm, and also how the minimal sample size n required for performing variable selection grows as a function of the model dimension, sparsity, and overlap between the supports. We begin by establishing bounds on the ℓ∞error as well sufficient conditions for exact variable selection for fixed design matrices, as well as designs drawn randomly from general Gaussian matrices. These results show that the highdimensional scaling of ℓ1/ℓ∞regularization is qualitatively similar to that of ordinary ℓ1regularization. Our second set of results applies to design matrices drawn from standard Gaussian ensembles, for which we provide a sharp set of necessary and sufficient conditions: the ℓ1/ℓ∞regularized method undergoes a phase transition characterized by the rescaled sample size θ1,∞(n, p, s, α) = n/{(4 − 3α)s log(p − (2 − α)s)}. More precisely, for any δ> 0, the probability of successfully recovering both supports converges to 1 for scalings such that θ1, ∞ ≥ 1+δ, and converges to 0 for scalings for which θ1, ∞ ≤ 1 −δ. An implication of this threshold is that use of ℓ1,∞regularization yields improved statistical efficiency if the overlap parameter is large enough (α> 2/3), but performs worse than a naive Lassobased approach for moderate to small overlap (α < 2/3). We illustrate the close agreement between these theoretical predictions, and the actual behavior in simulations. 1
VARIABLE SELECTION IN NONPARAMETRIC ADDITIVE MODELS
, 2008
"... Summary. We consider a nonparametric additive model of a conditional mean function in which the number of variables and additive components may be larger than the sample size but the number of nonzero additive components is “small” relative to the sample size. The statistical problem is to determin ..."
Abstract

Cited by 17 (1 self)
 Add to MetaCart
Summary. We consider a nonparametric additive model of a conditional mean function in which the number of variables and additive components may be larger than the sample size but the number of nonzero additive components is “small” relative to the sample size. The statistical problem is to determine which additive components are nonzero. The additive components are approximated by truncated series expansions with Bspline bases. With this approximation, the problem of component selection becomes that of selecting the groups of coefficients in the expansion. We apply the adaptive group Lasso to select nonzero components, using the group Lasso to obtain an initial estimator and reduce the dimension of the problem. We give conditions under which the group Lasso selects a model whose number of components is comparable with the underlying model and, the adaptive group Lasso selects the nonzero components correctly with probability approaching one as the sample size increases and achieves the optimal rate of convergence. Following model selection, oracleefficient, asymptotically normal estimators of the nonzero components can be obtained by using existing methods. The results of Monte Carlo experiments show that the adaptive group Lasso procedure works well with samples of moderate size. A data example is used to illustrate the application of the proposed method. Key words and phrases. Adaptive group Lasso; component selection; highdimensional data; nonparametric regression; selection consistency. Short title. Nonparametric component selection AMS 2000 subject classification. Primary 62G08, 62G20; secondary 62G99 1
Variable inclusion and shrinkage algorithms
 Journal of the American Statistical Association
, 2008
"... The Lasso is a popular and computationally efficient procedure for automatically performing both variable selection and coefficient shrinkage on linear regression models. One limitation of the Lasso is that the same tuning parameter is used for both variable selection and shrinkage. As a result, it ..."
Abstract

Cited by 13 (7 self)
 Add to MetaCart
The Lasso is a popular and computationally efficient procedure for automatically performing both variable selection and coefficient shrinkage on linear regression models. One limitation of the Lasso is that the same tuning parameter is used for both variable selection and shrinkage. As a result, it typically ends up selecting a model with too many variables to prevent over shrinkage of the regression coefficients. We suggest an improved class of methods called ”Variable Inclusion and Shrinkage Algorithms” (VISA). Our approach is capable of selecting sparse models while avoiding over shrinkage problems and uses a path algorithm so is also computationally efficient. We show through extensive simulations that VISA significantly outperforms the Lasso and also provides improvements over more recent procedures, such as the Dantzig selector, Relaxed Lasso and Adaptive Lasso. In addition, we provide theoretical justification for VISA in terms of nonasymptotic bounds on the estimation error that suggest it should exhibit good performance even for large numbers of predictors. Finally, we extend the VISA methodology, path algorithm, and theoretical bounds to the Generalized Linear Models framework.
Heterogeneous Multitask Learning with Joint Sparsity Constraints
"... Multitask learning addresses the problem of learning related tasks that presumably share some commonalities on their inputoutput mapping functions. Previous approaches to multitask learning usually deal with homogeneous tasks, such as purely regression tasks, or entirely classification tasks. In th ..."
Abstract

Cited by 13 (0 self)
 Add to MetaCart
Multitask learning addresses the problem of learning related tasks that presumably share some commonalities on their inputoutput mapping functions. Previous approaches to multitask learning usually deal with homogeneous tasks, such as purely regression tasks, or entirely classification tasks. In this paper, we consider the problem of learning multiple related tasks of predicting both continuous and discrete outputs from a common set of input variables that lie in a highdimensional feature space. All of the tasks are related in the sense that they share the same set of relevant input variables, but the amount of influence of each input on different outputs may vary. We formulate this problem as a combination of linear regressions and logistic regressions, and model the joint sparsity as L1/L ∞ or L1/L2 norm of the model parameters. Among several possible applications, our approach addresses an important open problem in genetic association mapping, where the goal is to discover genetic markers that influence multiple correlated traits jointly. In our experiments, we demonstrate our method in this setting, using simulated and clinical asthma datasets, and we show that our method can effectively recover the relevant inputs with respect to all of the tasks. 1