Results 1–10 of 24
Structured variable selection with sparsity-inducing norms
, 2011
Cited by 193 (31 self)
We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1-norm and the group ℓ1-norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low- and high-dimensional settings.
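The norm described in this abstract is straightforward to evaluate: a weighted sum of Euclidean norms over (possibly overlapping) index groups, with the lasso recovered when every group is a singleton. A minimal sketch, with all names and the example groups illustrative:

```python
import numpy as np

def group_norm(w, groups, weights=None):
    """Structured sparsity-inducing norm: Omega(w) = sum_g d_g * ||w[g]||_2,
    where `groups` is a list of (possibly overlapping) index lists and
    d_g are optional per-group weights."""
    if weights is None:
        weights = [1.0] * len(groups)
    return sum(d * np.linalg.norm(w[g]) for d, g in zip(weights, groups))

w = np.array([3.0, 4.0, 0.0, 0.0])
# Overlapping groups {0,1} and {1,2,3}: ||[3,4]|| + ||[4,0,0]|| = 5 + 4
print(group_norm(w, [[0, 1], [1, 2, 3]]))  # 9.0
```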
Proximal methods for sparse hierarchical dictionary learning
 In: ICML
Cited by 126 (23 self)
This paper proposes to combine two approaches for modeling data admitting sparse representations: on the one hand, dictionary learning has proven very effective for various signal restoration and representation tasks. On the other hand, recent work on structured sparsity provides a natural framework for modeling dependencies between dictionary elements. We propose to combine these approaches to learn dictionaries embedded in a hierarchy. We show that the proximal operator for the tree-structured sparse regularization that we consider can be computed exactly in linear time with a primal-dual approach, allowing the use of accelerated gradient methods. Experiments show that for natural image patches, learned dictionary elements organize themselves naturally in such a hierarchical structure, leading to improved performance on restoration tasks. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Learned sparse representations, initially introduced by Olshausen and Field [1997], have been the focus of much research in machine learning, signal processing and neuroscience, leading to state-of-the-art algorithms for several problems in image processing.
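A key fact behind the linear-time result mentioned above is that, for groups forming a hierarchy, the proximal operator of the tree-structured sum of ℓ2-norms decomposes into a composition of per-group soft-thresholdings applied from the leaves up to the root (Jenatton et al.). A hedged sketch of that composition, with the ordering assumption stated and all names illustrative:

```python
import numpy as np

def block_soft_threshold(w, g, t):
    """Prox of t * ||w[g]||_2 applied in place to the block g."""
    norm = np.linalg.norm(w[g])
    scale = max(0.0, 1.0 - t / norm) if norm > 0 else 0.0
    w[g] = scale * w[g]

def tree_prox(v, groups, lam):
    """Sketch: for tree-structured groups ordered leaves-first (every
    group appears before any group containing it), the prox of
    lam * sum_g ||w[g]||_2 is the composition of the per-group
    soft-thresholdings in that order."""
    w = v.copy()
    for g in groups:  # leaves-to-root order assumed
        block_soft_threshold(w, g, lam)
    return w

# Toy hierarchy on 3 variables: leaves {1}, {2}, then root {0,1,2}.
w = tree_prox(np.array([2.0, 0.5, 3.0]), [[1], [2], [0, 1, 2]], lam=1.0)
```

The leaf pass zeroes the small coefficient (0.5 < 1), and the root pass shrinks the surviving block as a whole, which is what produces the hierarchical "select a child only if its ancestors are selected" sparsity pattern.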
Convex and network flow optimization for structured sparsity
 JMLR
, 2011
Cited by 35 (9 self)
We consider a class of learning problems regularized by a structured sparsity-inducing norm defined as the sum of ℓ2- or ℓ∞-norms over groups of variables. Whereas much effort has been put into developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups. To this end, we present two different strategies: on the one hand, we show that the proximal operator associated with a sum of ℓ∞-norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal splitting techniques, and address an equivalent formulation with non-overlapping groups, but in higher dimension and with additional constraints. We propose efficient and scalable algorithms exploiting these two strategies, which are significantly faster than alternative approaches. We illustrate these methods with several problems such as CUR matrix factorization, multi-task learning of tree-structured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches.
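The single-group building block underlying the ℓ∞ case is standard: by the Moreau decomposition, the prox of t·‖·‖∞ is the residual of a Euclidean projection onto the ℓ1-ball of radius t. The sketch below implements only that building block (the paper's min-cost-flow algorithm handles *sums* of overlapping ℓ∞ terms, which this does not); the projection follows the usual sort-and-threshold scheme, names illustrative:

```python
import numpy as np

def project_l1_ball(v, z):
    """Euclidean projection of v onto the l1-ball of radius z
    (sort-and-threshold scheme)."""
    if np.abs(v).sum() <= z:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - z) / (np.arange(len(u)) + 1) > 0)[0][-1]
    theta = (css[rho] - z) / (rho + 1)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def prox_linf(v, t):
    """Prox of t * ||.||_inf via Moreau decomposition:
    prox(v) = v - proj onto the l1-ball of radius t."""
    return v - project_l1_ball(v, t)
```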
Multiscale mining of fMRI data with hierarchical structured sparsity
, 2011
Cited by 19 (6 self)
Reverse inference, or “brain reading”, is a recent paradigm for analyzing functional magnetic resonance imaging (fMRI) data, based on pattern recognition and statistical learning. By predicting some cognitive variables related to brain activation maps, this approach aims at decoding brain activity. Reverse inference takes into account the multivariate information between voxels and is currently the only way to assess how precisely some cognitive information is encoded by the activity of neural populations within the whole brain. However, it relies on a prediction function that is plagued by the curse of dimensionality, since there are far more features than samples, i.e., more voxels than fMRI volumes. To address this problem, different methods have been proposed, among others univariate feature selection, feature agglomeration and regularization techniques. In this paper, we consider a sparse hierarchical structured regularization. Specifically, the penalization we use is constructed from a tree that is obtained by spatially-constrained agglomerative clustering. This approach encodes the spatial structure of the data at different scales into the regularization, which makes the overall prediction procedure more robust to inter-subject variability. The regularization used induces the selection of spatially coherent predictive brain regions simultaneously at different scales. We test our algorithm on real data acquired to study the mental representation of objects, and we show that the proposed algorithm not only delineates meaningful brain regions but also yields better prediction accuracy than reference methods.
Additive Gaussian Processes
Cited by 13 (2 self)
We introduce a Gaussian process model of functions which are additive. An additive function is one which decomposes into a sum of low-dimensional functions, each depending on only a subset of the input variables. Additive GPs generalize both Generalized Additive Models and the standard GP models which use squared-exponential kernels. Hyperparameter learning in this model can be seen as Bayesian Hierarchical Kernel Learning (HKL). We introduce an expressive but tractable parameterization of the kernel function, which allows efficient evaluation of all input interaction terms, whose number is exponential in the input dimension. The additional structure discoverable by this model results in increased interpretability, as well as state-of-the-art predictive power in regression tasks.
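The "efficient evaluation of all interaction terms" rests on a classical identity: the d-th order additive kernel is the d-th elementary symmetric polynomial of the one-dimensional kernel evaluations, and all of these can be computed with the Newton–Girard recursion in O(D²) rather than by enumerating the exponentially many variable subsets. A sketch of that recursion (names illustrative):

```python
import numpy as np

def additive_kernel_terms(z):
    """Given one-dimensional kernel evaluations z_i = k_i(x_i, x'_i),
    return the elementary symmetric polynomials e_0..e_D; e_d sums the
    kernels of all d-way interactions. Uses Newton-Girard:
    e_d = (1/d) * sum_{k=1..d} (-1)^(k-1) * e_{d-k} * p_k,
    with power sums p_k = sum_i z_i^k."""
    D = len(z)
    p = [np.sum(z ** k) for k in range(D + 1)]
    e = np.zeros(D + 1)
    e[0] = 1.0
    for d in range(1, D + 1):
        e[d] = sum((-1) ** (k - 1) * e[d - k] * p[k]
                   for k in range(1, d + 1)) / d
    return e

# With z = [1, 2, 3]: e = [1, 6, 11, 6]  (e_2 = 1*2 + 1*3 + 2*3 = 11)
e = additive_kernel_terms(np.array([1.0, 2.0, 3.0]))
```

The full additive kernel is then a weighted sum of these terms, with one variance hyperparameter per interaction order.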
Tight conditions for consistent variable selection in high-dimensional nonparametric regression
Nonparametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
Cited by 8 (0 self)
We consider regularized risk minimization in a large dictionary of Reproducing Kernel Hilbert Spaces (RKHSs) over which the target function has a sparse representation. This setting, commonly referred to as Sparse Multiple Kernel Learning (MKL), may be viewed as the nonparametric extension of group sparsity in linear models. While the two dominant algorithmic strands of sparse learning, namely convex relaxations using the ℓ1-norm (e.g., Lasso) and greedy methods (e.g., OMP), have both been rigorously extended for group sparsity, the sparse MKL literature has so far mainly adopted the former with mild empirical success. In this paper, we close this gap by proposing a Group-OMP based framework for sparse MKL. Unlike ℓ1-MKL, our approach decouples the sparsity regularizer (via a direct ℓ0 constraint) from the smoothness regularizer (via RKHS norms), which leads to better empirical performance and a simpler optimization procedure that only requires a black-box single-kernel solver. The algorithmic development and empirical studies are complemented by theoretical analyses in terms of Rademacher generalization bounds and sparse recovery conditions analogous to those for OMP [27] and Group-OMP [16].
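The greedy strand the abstract advocates can be caricatured in a few lines: repeatedly pick the kernel most correlated with the current residual, then refit on the selected kernels. This is a hedged sketch only; the score rᵀK_j r and the kernel-ridge refit are simplifying stand-ins for the paper's actual criterion and its black-box single-kernel solver, and all names are illustrative:

```python
import numpy as np

def kernel_group_omp(kernels, y, n_select, ridge=1e-3):
    """Group-OMP-style greedy loop for sparse MKL (sketch).
    `kernels`: list of precomputed Gram matrices on the training set.
    At each step: select the kernel maximizing r^T K_j r, then refit
    kernel ridge regression on the sum of the selected kernels."""
    n = len(y)
    selected, residual = [], y.copy()
    for _ in range(n_select):
        scores = [float(residual @ K @ residual) if j not in selected
                  else -np.inf for j, K in enumerate(kernels)]
        selected.append(int(np.argmax(scores)))
        K_sum = sum(kernels[j] for j in selected)
        alpha = np.linalg.solve(K_sum + ridge * np.eye(n), y)
        residual = y - K_sum @ alpha
    return selected, residual
```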
Learning Nonlinear Functions Using Regularized Greedy Forest
"... We consider the problem of learning a forest of nonlinear decision rules with general loss functions. The standard methods employ boosted decision trees such as Adaboost for exponential loss and Friedman’s gradient boosting for general loss. In contrast to these traditional boosting algorithms that ..."
Abstract

Cited by 4 (0 self)
 Add to MetaCart
(Show Context)
We consider the problem of learning a forest of nonlinear decision rules with general loss functions. The standard methods employ boosted decision trees, such as AdaBoost for exponential loss and Friedman’s gradient boosting for general loss. In contrast to these traditional boosting algorithms that treat a tree learner as a black box, the method we propose directly learns decision forests via fully-corrective regularized greedy search using the underlying forest structure. Our method achieves higher accuracy and smaller models than gradient boosting on many of the datasets we have tested on.
Index Terms: boosting, decision tree, decision forest, ensemble, greedy algorithm
Efficient Rule Ensemble Learning using Hierarchical Kernels
"... This paper addresses the problem of Rule Ensemble Learning (REL), where the goal is simultaneous discovery of a small set of simple rules and their optimal weights that lead to good generalization. Rules are assumed to be conjunctions of basic propositions concerning the values taken by the input fe ..."
Abstract

Cited by 3 (1 self)
 Add to MetaCart
(Show Context)
This paper addresses the problem of Rule Ensemble Learning (REL), where the goal is simultaneous discovery of a small set of simple rules and their optimal weights that lead to good generalization. Rules are assumed to be conjunctions of basic propositions concerning the values taken by the input features. From the perspectives of interpretability as well as generalization, it is highly desirable to construct rule ensembles with low training error, having rules that are i) simple, i.e., involve few conjunctions, and ii) few in number. We propose to explore the (exponentially) large feature space of all possible conjunctions optimally and efficiently by employing the recently introduced Hierarchical Kernel Learning (HKL) framework. The regularizer employed in the HKL formulation can be interpreted as a potential for discouraging the selection of rules involving a large number of conjunctions, justifying its suitability for constructing rule ensembles. Simulation results show that, on many benchmark datasets, the proposed approach improves over state-of-the-art REL algorithms in terms of generalization and indeed learns simple rules. Unfortunately, HKL selects a conjunction only if all its subsets are selected. We propose a novel convex formulation which alleviates this problem and generalizes the HKL framework. The main technical contribution of this paper is an efficient mirror-descent based active set algorithm for solving the new formulation. Empirical evaluations on REL problems illustrate the utility of the generalized HKL framework.
Multiresolution Gaussian processes
 in Advances in Neural Information Processing Systems 25
, 2012
Cited by 2 (0 self)
We propose a multiresolution Gaussian process to capture long-range, non-Markovian dependencies while allowing for abrupt changes and non-stationarity. The multiresolution GP hierarchically couples a collection of smooth GPs, each defined over an element of a random nested partition. Long-range dependencies are captured by the top-level GP, while the partition points define the abrupt changes. Due to the inherent conjugacy of the GPs, one can analytically marginalize the GPs and compute the marginal likelihood of the observations given the partition tree. This property allows for efficient inference of the partition itself, for which we employ graph-theoretic techniques. We apply the multiresolution GP to the analysis of magnetoencephalography (MEG) recordings of brain activity.
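The analytic marginalization this abstract relies on is the standard closed-form GP marginal likelihood: for a GP prior with Gram matrix K and Gaussian noise, log p(y) = −½ yᵀC⁻¹y − ½ log|C| − (n/2) log 2π with C = K + σ²I. A sketch of that computation (how a candidate partition tree would be scored over its blocks is an assumption here, and names are illustrative):

```python
import numpy as np

def gp_log_marginal_likelihood(K, y, noise=1e-2):
    """Closed-form GP log marginal likelihood, computed stably via a
    Cholesky factorization of C = K + noise * I:
    log p(y) = -0.5*y^T C^{-1} y - 0.5*log|C| - 0.5*n*log(2*pi)."""
    n = len(y)
    C = K + noise * np.eye(n)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return float(-0.5 * y @ alpha
                 - np.sum(np.log(np.diag(L)))   # 0.5*log|C| via Cholesky
                 - 0.5 * n * np.log(2 * np.pi))
```

Because this quantity is available per block of a nested partition, comparing candidate partition trees reduces to summing such terms, which is what makes the tree inference in the paper tractable.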