Results 1–10 of 68
Simultaneous analysis of Lasso and Dantzig selector
ANNALS OF STATISTICS, 2009
"... We show that, under a sparsity scenario, the Lasso estimator and the Dantzig selector exhibit similar behavior. For both methods, we derive, in parallel, oracle inequalities for the prediction risk in the general nonparametric regression model, as well as bounds on the ℓp estimation loss for 1 ≤ p ≤ ..."
Abstract

Cited by 189 (5 self)
 Add to MetaCart
We show that, under a sparsity scenario, the Lasso estimator and the Dantzig selector exhibit similar behavior. For both methods, we derive, in parallel, oracle inequalities for the prediction risk in the general nonparametric regression model, as well as bounds on the ℓp estimation loss for 1 ≤ p ≤ 2 in the linear model when the number of variables can be much larger than the sample size.
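For reference, the two estimators compared here have the following standard formulations (a sketch in conventional notation: design matrix X, response y, tuning parameter λ; this may differ from the paper's exact normalization):

```latex
% Lasso: l1-penalized least squares
\hat{\beta}^{\mathrm{Lasso}} \in \arg\min_{\beta \in \mathbb{R}^p}
  \tfrac{1}{n}\,\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1

% Dantzig selector: minimum l1-norm subject to a correlation constraint
\hat{\beta}^{\mathrm{DS}} \in \arg\min_{\beta \in \mathbb{R}^p} \lVert \beta \rVert_1
  \quad \text{subject to} \quad
  \lVert X^{\top}(y - X\beta) \rVert_\infty \le \lambda
```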
Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso)
2006
"... ..."
Exploring large feature spaces with hierarchical MKL
2008
"... For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or H ..."
Abstract

Cited by 77 (20 self)
 Add to MetaCart
For supervised and unsupervised learning, positive definite kernels make it possible to use large and potentially infinite-dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1-norm or the block ℓ1-norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to nonlinear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
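As a concrete illustration of the two penalties named above, here is a small numpy sketch; the grouping is invented for illustration and is not the paper's DAG-structured kernel decomposition:

```python
import numpy as np

def l1_norm(beta):
    """Plain l1-norm: zeroes out individual coefficients."""
    return np.sum(np.abs(beta))

def block_l1_norm(beta, groups):
    """Block l1-norm (sum of per-group Euclidean norms): zeroes out
    whole groups at once, e.g. whole basis kernels in MKL."""
    return sum(np.linalg.norm(beta[g]) for g in groups)

beta = np.array([0.0, 0.0, 0.3, -1.2, 0.0, 0.5])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
print(l1_norm(beta))                 # 2.0
print(block_l1_norm(beta, groups))   # 0 + ||(0.3, -1.2)||_2 + ||(0.0, 0.5)||_2
```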
Stability selection
"... Proofs subject to correction. Not to be reproduced without permission. Contributions to the discussion must not exceed 400 words. Contributions longer than 400 words will be cut by the editor. 1 2 ..."
Abstract

Cited by 60 (2 self)
 Add to MetaCart
Proofs subject to correction. Not to be reproduced without permission. Contributions to the discussion must not exceed 400 words. Contributions longer than 400 words will be cut by the editor.
Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting
2007
"... Abstract—The problem of sparsity pattern or support set recovery refers to estimating the set of nonzero coefficients of an un3 p known vector 2 based on a set of n noisy observations. It arises in a variety of settings, including subset selection in regression, graphical model selection, signal de ..."
Abstract

Cited by 51 (2 self)
 Add to MetaCart
Abstract—The problem of sparsity pattern or support set recovery refers to estimating the set of nonzero coefficients of an unknown vector β ∈ ℝᵖ based on a set of n noisy observations. It arises in a variety of settings, including subset selection in regression, graphical model selection, signal denoising, compressive sensing, and constructive approximation. The sample complexity of a given method for subset recovery refers to the scaling of the required sample size n as a function of the signal dimension p, the sparsity index k (the number of nonzeros in β), as well as the minimum value β_min of β over its support and other parameters of the measurement matrix. This paper studies the information-theoretic limits of sparsity recovery: in particular, for a noisy linear observation model based on random measurement matrices drawn from general Gaussian measurement ensembles, we derive both a set of sufficient conditions for exact support recovery using an exhaustive search decoder, as well as a set of necessary conditions that any decoder, regardless of its computational complexity, must satisfy for exact support recovery. This analysis of fundamental limits complements our previous work on sharp thresholds for support set recovery over the same set of random measurement ensembles using the polynomial-time Lasso method (ℓ1-constrained quadratic programming). Index Terms—Compressed sensing, ℓ1-relaxation, Fano's method, high-dimensional statistical inference, information-theoretic
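The exhaustive search decoder analyzed here can be sketched as follows (a toy numpy version, exponential in p, so a theoretical benchmark rather than a practical method; the function name is our own):

```python
import numpy as np
from itertools import combinations

def exhaustive_search_decoder(X, y, k):
    """Try every size-k support, fit least squares on it, and return the
    support with the smallest residual sum of squares. Information-
    theoretically optimal but exponential-time."""
    n, p = X.shape
    best_support, best_rss = None, np.inf
    for support in combinations(range(p), k):
        Xs = X[:, support]
        beta_s = np.linalg.lstsq(Xs, y, rcond=None)[0]
        resid = y - Xs @ beta_s
        rss = float(resid @ resid)
        if rss < best_rss:
            best_support, best_rss = support, rss
    return set(best_support)
```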
Fast Optimization Methods for L1 Regularization: A Comparative Study and Two New Approaches
"... Abstract. L1 regularization is effective for feature selection, but the resulting optimization is challenging due to the nondifferentiability of the 1norm. In this paper we compare stateoftheart optimization techniques to solve this problem across several loss functions. Furthermore, we propose ..."
Abstract

Cited by 47 (1 self)
 Add to MetaCart
L1 regularization is effective for feature selection, but the resulting optimization is challenging due to the non-differentiability of the L1-norm. In this paper we compare state-of-the-art optimization techniques to solve this problem across several loss functions. Furthermore, we propose two new techniques. The first is based on a smooth (differentiable) convex approximation for the L1 regularizer that does not depend on any assumptions about the loss function used. The other technique is a new strategy that addresses the non-differentiability of the L1 regularizer by casting the problem as a constrained optimization problem that is then solved using a specialized gradient projection method. Extensive comparisons show that our newly proposed approaches consistently rank among the best in terms of convergence speed and efficiency, as measured by the number of function evaluations required.
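To make the second proposal concrete, here is a minimal sketch (not the authors' exact algorithm) of gradient projection for L1-regularized least squares, using the standard reformulation b = bp - bn with bp, bn >= 0, which turns the non-differentiable penalty into a linear term over the nonnegative orthant:

```python
import numpy as np

def l1_ls_projected_gradient(X, y, lam, n_iter=500):
    """Minimal projected-gradient sketch for
       min_b 0.5 * ||y - X b||_2^2 + lam * ||b||_1
    via the split b = bp - bn, bp >= 0, bn >= 0; projection is clipping."""
    p = X.shape[1]
    bp, bn = np.zeros(p), np.zeros(p)
    step = 0.5 / np.linalg.norm(X, 2) ** 2   # conservative constant step size
    for _ in range(n_iter):
        g = X.T @ (X @ (bp - bn) - y)        # gradient of the smooth part
        bp = np.maximum(bp - step * (g + lam), 0.0)   # project onto bp >= 0
        bn = np.maximum(bn - step * (-g + lam), 0.0)  # project onto bn >= 0
    return bp - bn
```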
Near-ideal model selection by ℓ1 minimization
2008
"... We consider the fundamental problem of estimating the mean of a vector y = Xβ + z, where X is an n × p design matrix in which one can have far more variables than observations and z is a stochastic error term—the socalled ‘p> n ’ setup. When β is sparse, or more generally, when there is a sparse su ..."
Abstract

Cited by 45 (2 self)
 Add to MetaCart
We consider the fundamental problem of estimating the mean of a vector y = Xβ + z, where X is an n × p design matrix in which one can have far more variables than observations, and z is a stochastic error term: the so-called 'p > n' setup. When β is sparse, or more generally, when there is a sparse subset of covariates providing a close approximation to the unknown mean vector, we ask whether or not it is possible to accurately estimate Xβ using a computationally tractable algorithm. We show that in a surprisingly wide range of situations, the lasso happens to nearly select the best subset of variables. Quantitatively speaking, we prove that solving a simple quadratic program achieves a squared error within a logarithmic factor of the ideal mean squared error one would achieve with an oracle supplying perfect information about which variables should be included in the model and which variables should not. Interestingly, our results describe the average performance of the lasso; that is, the performance one can expect in a vast majority of cases where Xβ is a sparse or nearly sparse superposition of variables, but not in all cases. Our results are nonasymptotic and widely applicable, since they simply require that pairs of predictor variables are not too collinear.
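A toy simulation in the spirit of this result (a hypothetical sketch, not the paper's experiment; problem sizes and the tuning of λ are illustrative assumptions) compares the lasso's squared prediction error with the ideal least-squares fit restricted to the true support:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k, sigma = 100, 400, 5, 0.5
X = rng.standard_normal((n, p)) / np.sqrt(n)     # roughly unit-norm columns
beta = np.zeros(p)
support = rng.choice(p, size=k, replace=False)
beta[support] = 5.0 * rng.standard_normal(k)
y = X @ beta + sigma * rng.standard_normal(n)

lam = sigma * np.sqrt(2 * np.log(p)) / n          # illustrative tuning
lasso = Lasso(alpha=lam, fit_intercept=False).fit(X, y)
lasso_err = np.sum((X @ (lasso.coef_ - beta)) ** 2)

beta_oracle = np.zeros(p)                         # oracle: least squares on true support
beta_oracle[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]
oracle_err = np.sum((X @ (beta_oracle - beta)) ** 2)

print(lasso_err / oracle_err)    # ratio of lasso to oracle prediction error
```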
Minimax rates of estimation for high-dimensional linear regression over ℓq-balls
2009
"... Abstract—Consider the highdimensional linear regression model,where is an observation vector, is a design matrix with, is an unknown regression vector, and is additive Gaussian noise. This paper studies the minimax rates of convergence for estimating in eitherloss andprediction loss, assuming tha ..."
Abstract

Cited by 43 (15 self)
 Add to MetaCart
Abstract—Consider the high-dimensional linear regression model y = Xβ* + w, where y ∈ ℝⁿ is an observation vector, X ∈ ℝ^{n×p} is a design matrix with p possibly much larger than n, β* ∈ ℝᵖ is an unknown regression vector, and w is additive Gaussian noise. This paper studies the minimax rates of convergence for estimating β* in ℓ2-loss and ℓ2-prediction loss, assuming that β* belongs to an ℓq-ball B_q(R_q) for some q ∈ [0, 1]. It is shown that under suitable regularity conditions on the design matrix, the minimax optimal rate in ℓ2-loss and ℓ2-prediction loss scales as R_q (log p / n)^{1−q/2}. The analysis in this paper reveals that conditions on the design matrix enter into the rates for ℓ2-error and ℓ2-prediction error in complementary ways in the upper and lower bounds. Our proofs of the lower bounds are information-theoretic in nature, based on Fano's inequality and results on the metric entropy of the ℓq-balls, whereas our proofs of the upper bounds are constructive, involving direct analysis of least squares over ℓq-balls. For the special case q = 0, corresponding to models with an exact sparsity constraint, our results show that although computationally efficient ℓ1-based methods can achieve the minimax rates up to constant factors, they require slightly stronger assumptions on the design matrix than optimal algorithms involving least-squares over the ℓ0-ball. Index Terms—Compressed sensing, minimax techniques, regression analysis.
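In the notation above, the headline scaling can be written out as follows (a simplified sketch; constants and exact log factors are suppressed):

```latex
\min_{\hat{\beta}} \;\max_{\beta^* \in B_q(R_q)}
  \mathbb{E}\,\bigl\lVert \hat{\beta} - \beta^* \bigr\rVert_2^2
  \;\asymp\; R_q \left( \frac{\log p}{n} \right)^{1 - q/2},
  \qquad q \in [0, 1].

% Special case q = 0 (exact sparsity with k = R_0 nonzeros):
% the rate reduces to k \log p / n.
```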
Some sharp performance bounds for least squares regression with L1 regularization
Rutgers Univ., 2009
"... We derive sharp performance bounds for least squares regression with L1 regularization from parameter estimation accuracy and feature selection quality perspectives. The main result proved for L1 regularization extends a similar result in [Ann. Statist. 35 (2007) 2313–2351] for the Dantzig selector. ..."
Abstract

Cited by 42 (2 self)
 Add to MetaCart
We derive sharp performance bounds for least squares regression with L1 regularization from parameter estimation accuracy and feature selection quality perspectives. The main result proved for L1 regularization extends a similar result in [Ann. Statist. 35 (2007) 2313–2351] for the Dantzig selector. It gives an affirmative answer to an open question in [Ann. Statist. 35 (2007) 2358–2364]. Moreover, the result leads to an extended view of feature selection that allows less restrictive conditions than some recent work. Based on the theoretical insights, a novel two-stage L1-regularization procedure with selective penalization is analyzed. It is shown that if the target parameter vector can be decomposed as the sum of a sparse parameter vector with large coefficients and another less sparse vector with relatively small coefficients, then the two-stage procedure can lead to improved performance.
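A minimal sketch of selective penalization (our own illustrative implementation, not the paper's exact procedure): a proximal-gradient solver that soft-thresholds only the penalized coordinates, first run as an ordinary lasso and then re-run with the large stage-1 coefficients left unpenalized. The cutoff for "large" is an assumed placeholder:

```python
import numpy as np

def ista_selective(X, y, lam, penalized, n_iter=2000):
    """Proximal gradient (ISTA) for
       min_b 0.5 * ||y - X b||_2^2 + lam * sum_{j in penalized} |b_j|.
    Coordinates outside `penalized` get a plain gradient step (zero penalty)."""
    b = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2        # 1/L for the smooth part
    thresh = np.where(penalized, step * lam, 0.0)
    for _ in range(n_iter):
        z = b - step * (X.T @ (X @ b - y))        # gradient step
        b = np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)  # selective prox
    return b

# Stage 1: ordinary lasso; Stage 2: leave the large coefficients unpenalized.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 50))
beta = np.zeros(50); beta[:3] = (4.0, -3.0, 2.0)
y = X @ beta + 0.1 * rng.standard_normal(100)
b1 = ista_selective(X, y, lam=5.0, penalized=np.ones(50, dtype=bool))
b2 = ista_selective(X, y, lam=5.0, penalized=np.abs(b1) <= 1.0)
```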
HIGH-DIMENSIONAL ISING MODEL SELECTION USING ℓ1-REGULARIZED LOGISTIC REGRESSION
 SUBMITTED TO THE ANNALS OF STATISTICS
"... We consider the problem of estimating the graph associated with a binary Ising Markov random field. We describe a method based on ℓ1regularized logistic regression, in which the neighborhood of any given node is estimated by performing logistic regression subject to an ℓ1constraint. The method is ..."
Abstract

Cited by 40 (13 self)
 Add to MetaCart
We consider the problem of estimating the graph associated with a binary Ising Markov random field. We describe a method based on ℓ1-regularized logistic regression, in which the neighborhood of any given node is estimated by performing logistic regression subject to an ℓ1-constraint. The method is analyzed under high-dimensional scaling, in which both the number of nodes p and the maximum neighborhood size d are allowed to grow as a function of the number of observations n. Our main results provide sufficient conditions on the triple (n, p, d) and the model parameters for the method to succeed in consistently estimating the neighborhood of every node in the graph simultaneously. With coherence conditions imposed on the population Fisher information matrix, we prove that consistent neighborhood selection can be obtained for sample sizes n = Ω(d³ log p), with exponentially decaying error. When these same conditions are imposed directly on the sample matrices, we show that a reduced sample size of n = Ω(d² log p) suffices for the method to estimate neighborhoods consistently. Although this paper focuses on binary graphical models, we indicate how a generalization of the method would apply to general discrete Markov random fields.
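The per-node regression at the heart of the method can be sketched with scikit-learn's ℓ1-penalized logistic regression (a minimal illustration; the regularization level C is an assumed placeholder, not the paper's theoretically tuned penalty):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_neighborhood(samples, s, C=0.1):
    """Estimate the neighborhood of node s in a binary Markov random field:
    l1-regularized logistic regression of x_s on all other nodes; the
    estimated neighbors are the coordinates with nonzero coefficients."""
    y = samples[:, s]                      # n binary observations of node s
    X = np.delete(samples, s, axis=1)      # remaining p - 1 nodes
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    others = [j for j in range(samples.shape[1]) if j != s]
    return {others[j] for j in np.flatnonzero(clf.coef_[0])}

# Run over all nodes s and combine the neighborhoods (e.g. by AND or OR of
# the pairwise inclusions) to obtain the full graph estimate.
```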