Results 1–10 of 14
Structured variable selection with sparsity-inducing norms
Abstract
Cited by 97 (15 self)
We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1-norm and the group ℓ1-norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low- and high-dimensional settings.
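As a concrete illustration of such a norm, here is a minimal NumPy sketch (the weight vector and group structure are invented for the example, not taken from the paper):

```python
import numpy as np

def structured_norm(w, groups):
    """Sum of Euclidean norms over (possibly overlapping) index subsets.
    With singleton groups this reduces to the l1-norm; with a partition,
    to the group l1-norm; overlapping groups restrict which nonzero
    patterns the regularized solution can take."""
    return sum(np.linalg.norm(w[list(g)]) for g in groups)

w = np.array([1.0, 0.0, -2.0, 0.0])
overlapping = [[0, 1], [1, 2], [2, 3]]  # consecutive pairs, sharing indices
print(structured_norm(w, overlapping))  # 1.0 + 2.0 + 2.0 = 5.0
```

With singleton groups `[[0], [1], [2], [3]]` the same function returns the ℓ1-norm, 3.0, which is the sense in which the construction extends the usual sparsity-inducing penalties.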
Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
 NIPS'11, 25th Annual Conference on Neural Information Processing Systems
, 2011
Abstract
Cited by 19 (3 self)
We consider the problem of optimizing the sum of a smooth convex function and a nonsmooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the nonsmooth term. We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same convergence rate as in the error-free case, provided that the errors decrease at appropriate rates. Using these rates, we perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems.
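For reference, the basic (error-free) proximal-gradient iteration the paper perturbs can be sketched for the lasso objective; the data and step size below are invented for illustration, and the paper's inexact setting would add decaying errors to the gradient and/or the prox step:

```python
import numpy as np

def soft_threshold(v, t):
    # proximity operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(A, b, lam, step, iters=100):
    """Basic proximal-gradient method for
    min_x 0.5 * ||A x - b||^2 + lam * ||x||_1."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                 # gradient of the smooth term
        x = soft_threshold(x - step * grad, step * lam)
    return x

A = np.eye(3)
b = np.array([3.0, 0.5, -2.0])
x = proximal_gradient(A, b, lam=1.0, step=1.0)
print(x)  # with A = I this is just soft-thresholding of b: [ 2.  0. -1.]
```

An inexact variant in the spirit of the paper would replace `grad` with `grad + e_k`, where the error `e_k` shrinks as the iteration count grows.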
Convex relaxation of combinatorial penalties
, 2011
Abstract
Cited by 7 (6 self)
In this paper, we propose a unifying view of several recently proposed structured sparsity-inducing norms. We consider the situation of a model simultaneously (a) penalized by a set-function defined on the support of the unknown parameter vector, which represents prior knowledge on supports, and (b) regularized in ℓp-norm. We show that the natural combinatorial optimization problems obtained may be relaxed into convex optimization problems, and we introduce a notion, the lower combinatorial envelope of a set-function, that characterizes the tightness of our relaxations. We moreover establish links with norms based on latent representations, including the latent group Lasso and block-coding, and with norms obtained from submodular functions.
Efficient Sparse Modeling with Automatic Feature Grouping
Abstract
Cited by 6 (0 self)
The grouping of features is highly beneficial in learning with high-dimensional data. It reduces the variance in the estimation and improves the stability of feature selection, leading to improved generalization. Moreover, it can also help in data understanding and interpretation. OSCAR is a recent sparse modeling tool that achieves this by using an ℓ1-regularizer and a pairwise ℓ∞-regularizer. However, its optimization is computationally expensive. In this paper, we propose an efficient solver based on accelerated gradient methods. We show that its key projection step can be solved by a simple iterative group merging algorithm. It is highly efficient and reduces the empirical time complexity from O(d³)–O(d⁵) for the existing solvers to just O(d), where d is the number of features. Experimental results on toy and real-world data sets demonstrate that OSCAR is a competitive sparse modeling approach with the added ability of automatic feature grouping.
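The OSCAR regularizer itself is simple to state; a naive O(d²) evaluation is sketched below with invented values (the paper's contribution is precisely avoiding this quadratic cost inside the projection step):

```python
import numpy as np
from itertools import combinations

def oscar_penalty(w, lam1, lam2):
    """OSCAR: lam1 * ||w||_1 + lam2 * sum_{i<j} max(|w_i|, |w_j|).
    The pairwise l-infinity term pushes correlated features toward
    equal magnitudes, which is what yields automatic feature grouping."""
    a = np.abs(np.asarray(w, dtype=float))
    pairwise = sum(max(a[i], a[j]) for i, j in combinations(range(a.size), 2))
    return lam1 * a.sum() + lam2 * pairwise

print(oscar_penalty([1.0, -1.0, 2.0], lam1=1.0, lam2=1.0))  # 4.0 + 5.0 = 9.0
```

Note that the penalty treats `w[0]` and `w[1]` identically despite their opposite signs, since only magnitudes enter.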
Learning invariant feature hierarchies
 In Computer Vision–ECCV 2012. Workshops and Demonstrations
, 2012
Abstract
Cited by 5 (0 self)
Fast visual recognition in the mammalian cortex seems to be a hierarchical process by which the representation of the visual world is transformed in multiple stages, from low-level retinotopic features to high-level, global and invariant features, and on to object categories. Every single step in this hierarchy seems to be subject to learning. How does the visual cortex learn such hierarchical representations by just looking at the world? How could computers learn such representations from data? Computer vision models that are weakly inspired by the visual cortex will be described. A number of unsupervised learning algorithms to train these models will be presented, which are based on the sparse autoencoder concept. The effectiveness of these algorithms for learning invariant feature hierarchies will be demonstrated with a number of practical tasks such as scene parsing, pedestrian detection, and object classification.
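The sparse autoencoder concept mentioned above amounts to a reconstruction objective plus a sparsity penalty on the hidden code. Below is a minimal tied-weight toy objective, purely illustrative and not the exact models described in the paper:

```python
import numpy as np

def sparse_autoencoder_loss(W, c, x, lam):
    """Reconstruction error of a tied-weight autoencoder plus an l1
    sparsity penalty on the hidden activations h."""
    h = np.maximum(W @ x + c, 0.0)   # ReLU encoder
    x_hat = W.T @ h                  # tied-weight linear decoder
    return 0.5 * np.sum((x_hat - x) ** 2) + lam * np.sum(np.abs(h))

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)) * 0.1  # 8 inputs -> 4 hidden units
c = np.zeros(4)
x = rng.standard_normal(8)
print(sparse_autoencoder_loss(W, c, x, lam=0.1))
```

Training would minimize this loss over `W` and `c` across many inputs; the sparsity term forces most hidden units to stay inactive, which is what drives the learned features toward the sparse codes the paper relies on.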
Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
Abstract
Cited by 3 (2 self)
We consider supervised learning problems where the features are embedded in a graph, such as gene expressions in a gene network. In this context, it is of much interest to take the problem structure into account and automatically select a subgraph with a small number of connected components. By exploiting prior knowledge, one can indeed improve the prediction performance and/or obtain more interpretable results. Regularization or penalty functions for selecting features in graphs have recently been proposed, but they raise new algorithmic challenges. For example, they typically require solving a combinatorially hard selection problem among all connected subgraphs. In this paper, we propose computationally feasible strategies to select a sparse and “well connected” subset of features sitting on a directed acyclic graph (DAG). We introduce structured sparsity penalties over paths on a DAG, called “path coding” penalties. Unlike existing regularization functions, path coding penalties can both model long-range interactions between features in the graph and remain tractable. The penalties and their proximal operators involve path selection problems, which we efficiently solve by leveraging network flow optimization. We experimentally show on synthetic, image, and genomic data that our approach is scalable and leads to more connected subgraphs than other regularization functions for graphs.
Learning Hierarchical and Topographic Dictionaries with Structured Sparsity
Abstract
Recent work in signal processing and statistics has focused on defining new regularization functions which not only induce sparsity of the solution but also take into account the structure of the problem [1–7]. We present in this paper a class of convex penalties introduced in the machine learning community, which take the form of a sum of ℓ2- and ℓ∞-norms over groups of variables. They extend the classical group-sparsity regularization [8–10] in the sense that the groups may overlap, allowing more flexibility in the group design. We review efficient optimization methods to deal with the corresponding inverse problems [11–13], and their application to the problem of learning dictionaries of natural image patches [14–18]: on the one hand, dictionary learning has indeed proven effective for various signal processing tasks [17, 19]; on the other hand, structured sparsity provides a natural framework for modeling dependencies between dictionary elements. We thus consider a structured sparse regularization to learn dictionaries embedded in a particular structure, for instance a tree [11] or a two-dimensional grid [20]. In the latter case, the results we obtain are similar to the dictionaries produced by topographic independent component analysis [21].
Topographic Analysis of Correlated Components
 Asian Conference on Machine Learning
Abstract
Independent component analysis (ICA) is a method to estimate components which are as statistically independent as possible. However, in many practical applications, the estimated components are not independent. Recent variants of ICA have made use of such residual dependencies to estimate an ordering (topography) of the components. As in ICA, the components in those variants are assumed to be uncorrelated, which might be a rather strict condition. In this paper, we address this shortcoming. We propose a generative model for the sources in which the components can have linear and higher-order correlations, which generalizes models in use so far. Based on the model, we derive a method to estimate topographic representations. In numerical experiments on artificial data, the new method is shown to be more widely applicable than previously proposed extensions of ICA. We learn topographic representations for two kinds of real data sets: for outputs of simulated complex cells in the primary visual cortex and for text data.
Correlated Topographic Analysis: Estimating an Ordering of Correlated Components
Abstract
This paper describes a novel method, which we call correlated topographic analysis (CTA), to estimate non-Gaussian components and their ordering (topography). The method is inspired by a central motivation of recent variants of independent component analysis (ICA), namely, to make use of the residual statistical dependency which ICA cannot remove. We assume that components nearby in the topographic arrangement have both linear and energy correlations, while faraway components are statistically independent. We use these dependencies to fix the ordering of the components. We start by proposing the generative model for the components. Then, we derive an approximation of the likelihood based on the model. Furthermore, since gradient methods tend to get stuck in local optima, we propose a three-step optimization method which dramatically improves topographic estimation. Using simulated data, we show that CTA estimates an ordering of the components and generalizes a previous method in terms of topography estimation. Finally, to demonstrate that CTA is widely applicable, we learn topographic representations for three kinds of real data: natural images, outputs of simulated complex cells, and text data.