Results 1-10 of 101
Learning with Structured Sparsity
Cited by 127 (15 self)
This paper investigates a new learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. Experiments demonstrate the advantage of structured sparsity over standard sparsity.
Multi-Label Prediction via Compressed Sensing
, 902
Cited by 100 (3 self)
We consider multi-label prediction problems with large output spaces under the assumption of output sparsity, that is, that the target vectors have small support. We develop a general theory for a variant of the popular ECOC (error-correcting output code) scheme, based on ideas from compressed sensing for exploiting this sparsity. The method can be regarded as a simple reduction from multi-label regression problems to binary regression problems. It is shown that the number of subproblems need only be logarithmic in the total number of label values, making this approach radically more efficient than others. We also state and prove performance guarantees for this method, and test it empirically.
Blessing of Dimensionality: High-dimensional Feature and Its Efficient Compression for Face Verification
Cited by 48 (2 self)
Making a high-dimensional (e.g., 100K-dim) feature for face recognition does not seem like a good idea because it complicates subsequent training, computation, and storage. This prevents further exploration of the use of a high-dimensional feature. In this paper, we study the performance of a high-dimensional feature. We first empirically show that high dimensionality is critical to high performance. A 100K-dim feature, based on a single-type Local Binary Pattern (LBP) descriptor, can achieve significant improvements over both its low-dimensional version and the state of the art. We also make the high-dimensional feature practical. With our proposed sparse projection method, named rotated sparse regression, both computation and model storage can be reduced by over 100 times without sacrificing accuracy.
Trading accuracy for sparsity in optimization problems with sparsity constraints
 SIAM Journal on Optimization
Cited by 44 (12 self)
We study the problem of minimizing the expected loss of a linear predictor while constraining its sparsity, i.e., bounding the number of features used by the predictor. While the resulting optimization problem is generally NP-hard, several approximation algorithms are considered. We analyze the performance of these algorithms, focusing on the characterization of the tradeoff between accuracy and sparsity of the learned predictor in different scenarios.
Submodular meets Spectral: Greedy Algorithms for Sparse Approximation and Dictionary Selection
 , 2011. http://arxiv.org/abs/1102.3975
Cited by 29 (1 self)
We study the problem of selecting a subset of k random variables from a large set, in order to obtain the best linear prediction of another variable of interest. This problem can be viewed in the context of both feature selection and sparse approximation. We analyze the performance of widely used greedy heuristics, using insights from the maximization of submodular functions and spectral analysis. We introduce the submodularity ratio as a key quantity to help understand why greedy algorithms perform well even when the variables are highly correlated. Using our techniques, we obtain the strongest known approximation guarantees for this problem, both in terms of the submodularity ratio and the smallest k-sparse eigenvalue of the covariance matrix. We also analyze greedy algorithms for the dictionary selection problem, and significantly improve the previously known guarantees. Our theoretical analysis is complemented by experiments on real-world and synthetic data sets; the experiments show that the submodularity ratio is a stronger predictor of the performance of greedy algorithms than other spectral parameters.
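As a rough illustration of the kind of greedy heuristic this abstract refers to, the following is a minimal sketch of forward selection for least-squares prediction. It is not the authors' code: the function name `forward_select`, the tolerance `tol`, and the plain-list data layout are assumptions made for this example, and squared error is assumed as the objective.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def forward_select(columns, y, k, tol=1e-12):
    """Greedily pick up to k columns, each time taking the one that most
    reduces the residual sum of squares (RSS) of a least-squares fit.

    columns: list of feature columns (each a list of floats); y: target list.
    Returns (indices in the order picked, final RSS).
    Hypothetical helper for illustration only.
    """
    resid = list(y)
    # Working copies, kept orthogonal to every column chosen so far.
    work = [list(c) for c in columns]
    selected = []
    for _ in range(k):
        best_j, best_gain = None, tol
        for j, c in enumerate(work):
            if j in selected:
                continue
            norm2 = dot(c, c)
            if norm2 <= tol:
                continue  # column lies (numerically) in the span of chosen ones
            gain = dot(c, resid) ** 2 / norm2  # RSS reduction if j is added
            if gain > best_gain:
                best_j, best_gain = j, gain
        if best_j is None:
            break  # no remaining column improves the fit
        q = work[best_j]
        qq = dot(q, q)
        coef = dot(q, resid) / qq
        resid = [r - coef * a for r, a in zip(resid, q)]
        # Remove the chosen direction from the remaining candidates.
        for j, c in enumerate(work):
            if j != best_j and j not in selected:
                proj = dot(c, q) / qq
                work[j] = [a - proj * b for a, b in zip(c, q)]
        selected.append(best_j)
    return selected, dot(resid, resid)


# Toy data: column 2 is the sum of columns 0 and 1, so the three are correlated.
cols = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0]]
target = [2, 2, 0.5, 0]
print(forward_select(cols, target, 2))
```

On this toy input the first pick is the column that alone explains most of the target, and the second pick is chosen on what remains after that column's contribution has been removed.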
Trace lasso: A trace norm regularization for correlated designs
 In Advances in Neural Information Processing Systems 24, 2011
Cited by 29 (2 self)
Using the ℓ1-norm to regularize the estimation of the parameter vector of a linear model leads to an unstable estimator when covariates are highly correlated. In this paper, we introduce a new penalty function which takes into account the correlation of the design matrix to stabilize the estimation. This norm, called the trace Lasso, uses the trace norm, which is a convex surrogate of the rank, of the selected covariates as the criterion of model complexity. We analyze the properties of our norm, describe an optimization algorithm based on reweighted least-squares, and illustrate the behavior of this norm on synthetic data, showing that it is more adapted to strong correlations than competing methods such as the elastic net.
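The abstract does not state the penalty itself. As a hedged sketch, the trace Lasso is usually written as follows, where the symbols X (design matrix with unit-norm columns), w (parameter vector) and ‖·‖_* (trace norm, the sum of singular values) are notation assumed here rather than taken from the listing:

```latex
\Omega(w) \;=\; \bigl\lVert X \,\operatorname{Diag}(w) \bigr\rVert_{*}
% Two limiting cases show how the penalty adapts to correlation:
%   orthonormal columns (X^\top X = I):        \Omega(w) = \lVert w \rVert_1
%   all columns equal to one unit vector x:    X\,\operatorname{Diag}(w) = x\,w^\top
%                                              \Rightarrow \Omega(w) = \lVert w \rVert_2
```

Under these assumptions the penalty behaves like the ℓ1-norm for uncorrelated covariates and like the ℓ2-norm for perfectly correlated ones.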
Confidence intervals for low-dimensional parameters with high-dimensional data
 ArXiv.org
Cited by 28 (1 self)
The purpose of this paper is to propose methodologies for statistical inference of low-dimensional parameters with high-dimensional data. We focus on constructing confidence intervals for individual coefficients and linear combinations of several of them in a linear regression model, although our ideas are applicable in a much broader context. The theoretical results presented here provide sufficient conditions for the asymptotic normality of the proposed estimators along with a consistent estimator for their finite-dimensional covariance matrices. These sufficient conditions allow the number of variables to far exceed the sample size. The simulation results presented here demonstrate the accuracy of the coverage probability of the proposed confidence intervals, strongly supporting the theoretical results.
On learning discrete graphical models using greedy methods
 In Neural Information Processing Systems (NIPS) (currently under review), 2011
Cited by 28 (5 self)
In this paper, we address the problem of learning the structure of a pairwise graphical model from samples in a high-dimensional setting. Our first main result studies the sparsistency, or consistency in sparsity pattern recovery, properties of a forward-backward greedy algorithm as applied to general statistical models. As a special case, we then apply this algorithm to learn the structure of a discrete graphical model via neighborhood estimation. As a corollary of our general result, we derive sufficient conditions on the number of samples n, the maximum node degree d and the problem size p, as well as other conditions on the model parameters, so that the algorithm recovers all the edges with high probability. Our result guarantees graph selection for samples scaling as n = Ω(d² log(p)), in contrast to existing convex-optimization-based algorithms that require a sample complexity of Ω(d³ log(p)). Further, the greedy algorithm only requires a restricted strong convexity condition which is typically milder than irrepresentability assumptions. We corroborate these results using numerical simulations at the end.
Boosting with Structural Sparsity
Cited by 25 (2 self)
We derive generalizations of AdaBoost and related gradient-based coordinate descent methods that incorporate sparsity-promoting penalties for the norm of the predictor that is being learned. The end result is a family of coordinate descent algorithms that integrate forward feature induction and back-pruning through regularization and give an automatic stopping criterion for feature induction. We study penalties based on the ℓ1, ℓ2, and ℓ∞ norms of the predictor and introduce mixed-norm penalties that build upon the initial penalties. The mixed-norm regularizers facilitate structural sparsity in parameter space, which is a useful property in multiclass prediction and other related tasks. We report empirical results that demonstrate the power of our approach in building accurate and structurally sparse models.