Results 1-10 of 18
Exploring large feature spaces with hierarchical MKL
, 2008
Cited by 77 (20 self)
For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1-norm or the block ℓ1-norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non-linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
ℓp-norm multiple kernel learning
 Journal of Machine Learning Research
, 2011
Cited by 22 (3 self)
Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1-norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is, ℓp-norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse and ℓ∞-norm MKL in various scenarios. Importantly, empirical applications of ℓp-norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and
Multiple kernel learning algorithms
 JMLR
, 2011
Cited by 19 (2 self)
In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there is difference between them in complexity as given by the number of stored support vectors, the sparsity of the solution as given by the number of used kernels, and training time complexity. We see that overall, using multiple kernels instead of a single one is useful and believe that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels.
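As a rough illustration of the linear combination this abstract refers to, the sketch below builds two Gram matrices (one linear, one Gaussian) on a toy point set and mixes them with fixed non-negative weights. The points, the `gamma` value, the weights, and all function names are assumptions made for illustration; an actual MKL method would learn the weights rather than fix them.

```python
import math


def linear_kernel(x, y):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(x, y))


def gaussian_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is an illustrative bandwidth choice."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def gram(kernel, points):
    """Gram matrix of `kernel` evaluated on every pair of points."""
    return [[kernel(p, q) for q in points] for p in points]


def combine(grams, weights):
    """Weighted sum of Gram matrices: K = sum_m weights[m] * K_m."""
    n = len(grams[0])
    return [
        [sum(w * g[i][j] for w, g in zip(weights, grams)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data and hand-picked weights (not learned).
points = [(0.0, 1.0), (1.0, 1.0), (2.0, 0.0)]
grams = [gram(linear_kernel, points), gram(gaussian_kernel, points)]
combined = combine(grams, weights=[0.3, 0.7])
```

Because each base Gram matrix is symmetric positive semidefinite and the weights are non-negative, the combined matrix is too, so it can be passed to any kernel method in place of a single kernel.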
High-Dimensional Non-Linear Variable Selection through Hierarchical Kernel Learning
, 2009
Cited by 18 (5 self)
We consider the problem of high-dimensional non-linear variable selection for supervised learning. Our approach is based on performing linear selection among exponentially many appropriately defined positive definite kernels that characterize non-linear interactions between the original variables. To select efficiently from these many kernels, we use the natural hierarchical structure of the problem to extend the multiple kernel learning framework to kernels that can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a graph-adapted sparsity-inducing norm, in polynomial time in the number of selected kernels. Moreover, we study the consistency of variable selection in high-dimensional settings, showing that under certain assumptions, our regularization framework allows a number of irrelevant variables which is exponential in the number of observations. Our simulations on synthetic datasets and datasets from the UCI repository show state-of-the-art predictive performance for non-linear regression problems.
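One way to picture "exponentially many kernels embedded in a directed acyclic graph" is the sketch below. It assumes (this is not stated in the abstract) that each kernel is indexed by a subset of variables and is the product of one-dimensional base kernels over that subset, with the graph ordered by set inclusion. Under that assumption, the sum over all 2^p subset kernels collapses to a product of p factors, which the code checks numerically. The Gaussian base kernel and every function name are hypothetical choices.

```python
import math
from itertools import combinations


def base_kernel(a, b, gamma=1.0):
    """One-dimensional Gaussian base kernel (illustrative choice)."""
    return math.exp(-gamma * (a - b) ** 2)


def subset_kernel(x, y, subset):
    """Product of base kernels over the variables in `subset` (empty set gives 1)."""
    value = 1.0
    for i in subset:
        value *= base_kernel(x[i], y[i])
    return value


def sum_over_subsets(x, y):
    """Explicit sum over all 2^p subset kernels; exponential cost."""
    p = len(x)
    total = 0.0
    for size in range(p + 1):
        for subset in combinations(range(p), size):
            total += subset_kernel(x, y, subset)
    return total


def product_form(x, y):
    """Same quantity as prod_i (1 + k_i(x_i, y_i)); linear cost in p."""
    value = 1.0
    for a, b in zip(x, y):
        value *= 1.0 + base_kernel(a, b)
    return value


def ancestors(subset):
    """All subsets of `subset` (itself included): its ancestors in the inclusion graph."""
    items = sorted(subset)
    return {
        frozenset(c)
        for size in range(len(items) + 1)
        for c in combinations(items, size)
    }


x = (0.1, 0.5, -0.3)
y = (0.2, 0.1, 0.4)
explicit = sum_over_subsets(x, y)
compact = product_form(x, y)
```

The `ancestors` helper lists the nodes that sit above a given subset in such a graph; a selection rule that follows the graph would have to visit them before reaching the subset itself.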
Simple and Efficient Multiple Kernel Learning by Group Lasso
Cited by 18 (3 self)
We consider the problem of how to improve the efficiency of Multiple Kernel Learning (MKL). In literature, MKL is often solved by an alternating approach: (1) the minimization of the kernel weights is solved by complicated techniques, such as Semi-infinite Linear Programming, Gradient Descent, or Level method; (2) the maximization of SVM dual variables can be solved by standard SVM solvers. However, the minimization step in these methods is usually dependent on its solving techniques or commercial softwares, which therefore limits the efficiency and applicability. In this paper, we formulate a closed-form solution for optimizing the kernel weights based on the equivalence between group-lasso and MKL. Although this equivalence is not our invention, our derived variant equivalence not only leads to an efficient algorithm for MKL, but also generalizes to the case for Lp-MKL (p ≥ 1 and denoting the Lp-norm of kernel weights). Therefore, our proposed algorithm provides a unified solution for the entire family of Lp-MKL models. Experiments on multiple data sets show the promising performance of the proposed technique compared with other competitive methods.
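The abstract does not give the closed-form solution itself. The sketch below is therefore an assumption: it uses the update usually quoted for Lp-norm MKL, in which the weight of kernel m is ||w_m||^(2/(p+1)) divided by (sum_k ||w_k||^(2p/(p+1)))^(1/p), where ||w_m|| is the norm of the m-th block of the current solution. The function name and the example norms are hypothetical.

```python
def lp_kernel_weights(block_norms, p):
    """Kernel weights from per-kernel block norms, normalised to unit Lp-norm.

    Assumed update (not taken from the abstract):
        theta_m = ||w_m||^(2/(p+1)) / (sum_k ||w_k||^(2p/(p+1)))^(1/p)
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    exponent = 2.0 / (p + 1.0)
    # ||w_k||^(2p/(p+1)) for every kernel k
    powered = [norm ** (exponent * p) for norm in block_norms]
    denominator = sum(powered) ** (1.0 / p)
    if denominator == 0.0:
        raise ValueError("all block norms are zero")
    return [norm ** exponent / denominator for norm in block_norms]


# Hypothetical block norms for three kernels.
weights_p1 = lp_kernel_weights([3.0, 4.0, 0.0], p=1.0)
weights_p2 = lp_kernel_weights([3.0, 4.0, 0.0], p=2.0)
```

With p = 1 the weights reduce to the block norms divided by their sum, and a kernel whose block norm is zero receives weight zero for any p. In an alternating scheme this step would be interleaved with a standard SVM solve on the reweighted kernel.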
On the algorithmics and applications of a mixed-norm based kernel learning formulation
 In Advances in Neural Information Processing Systems
, 2009
Cited by 15 (1 self)
Motivated from real world problems, like object categorization, we study a particular mixed-norm regularization for Multiple Kernel Learning (MKL). It is assumed that the given set of kernels are grouped into distinct components where each component is crucial for the learning task at hand. The formulation hence employs ℓ∞ regularization for promoting combinations at the component level and ℓ1 regularization for promoting sparsity among kernels in each component. While previous attempts have formulated this as a non-convex problem, the formulation given here is an instance of non-smooth convex optimization problem which admits an efficient Mirror-Descent (MD) based procedure. The MD procedure optimizes over product of simplexes, which is not a well-studied case in literature. Results on real-world datasets show that the new MKL formulation is well-suited for object categorization tasks and that the MD based algorithm outperforms state-of-the-art MKL solvers like simpleMKL in terms of computational effort.
Non-sparse multiple kernel learning for Fisher discriminant analysis
 In International Conference on Data Mining
, 2009
Cited by 5 (4 self)
We consider the problem of learning a linear combination of pre-specified kernel matrices in the Fisher discriminant analysis setting. Existing methods for such a task impose an ℓ1 norm regularisation on the kernel weights, which produces sparse solution but may lead to loss of information. In this paper, we propose to use ℓ2 norm regularisation instead. The resulting learning problem is formulated as a semi-infinite program and can be solved efficiently. Through experiments on both synthetic data and a very challenging object recognition benchmark, the relative advantages of the proposed method and its ℓ1 counterpart are demonstrated, and insights are gained as to how the choice of regularisation norm should be made.
ℓp−ℓq penalty for Sparse Linear and Sparse Multiple Kernel Multi-Task Learning
Cited by 4 (1 self)
Recently, there has been a lot of interest around multi-task learning (MTL) problem with the constraints that tasks should share a common sparsity profile. Such a problem can be addressed through a regularization framework where the regularizer induces a joint-sparsity pattern between task decision functions. We follow this principled framework and focus on ℓp−ℓq (with 0 ≤ p ≤ 1 and 1 ≤ q ≤ 2) mixed-norms as sparsity-inducing penalties. Our motivation for addressing such a larger class of penalty is to adapt the penalty to a problem at hand leading thus to better performances and better sparsity pattern. For solving the problem in the general multiple kernel case, we first derive a variational formulation of the ℓ1−ℓq penalty which helps us in proposing an alternate optimization algorithm. Although very simple, the latter algorithm provably converges to the global minimum of the ℓ1−ℓq penalized problem. For the linear case, we extend existing works considering accelerated proximal gradient to this penalty. Our contribution in this context is to provide an efficient scheme for computing the ℓ1−ℓq proximal operator. Then, for the more general case when 0 < p < 1, we solve the resulting non-convex problem through a majorization-minimization approach. The resulting algorithm is an iterative scheme which, at each iteration, solves a weighted ℓ1−ℓq sparse MTL problem. Empirical evidences from toy dataset and real-world datasets dealing with BCI single trial EEG classification and protein subcellular localization show the benefit of the proposed approaches and algorithms. Index Terms: Multi-task learning, multiple kernel learning, sparsity, mixed-norm, Support Vector Machines.
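To make the joint-sparsity idea concrete, the sketch below handles only the special case q = 2, where the proximal operator of the ℓ1−ℓ2 penalty has a well-known closed form (group soft-thresholding). It is not the general ℓ1−ℓq scheme the abstract mentions. The layout is an assumption: each row of the weight matrix holds one feature's weights across all tasks, so shrinking a whole row to zero removes that feature from every task at once. Function names and numbers are hypothetical.

```python
import math


def l1_lq_norm(rows, q):
    """Mixed norm: sum over rows of the lq-norm of each row."""
    return sum(sum(abs(v) ** q for v in row) ** (1.0 / q) for row in rows)


def prox_l1_l2(rows, threshold):
    """Group soft-thresholding: proximal operator of threshold * sum_g ||row_g||_2.

    Each row is scaled by max(0, 1 - threshold / ||row||_2), so rows whose
    norm is at most `threshold` are set entirely to zero.
    """
    result = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0.0 else 0.0
        result.append([scale * v for v in row])
    return result


# Hypothetical weights: two features (rows) shared by two tasks (columns).
weights = [[3.0, 4.0], [0.3, 0.4]]
penalty_before = l1_lq_norm(weights, q=2)
shrunk = prox_l1_l2(weights, threshold=1.0)
```

In this example the first feature survives with both task weights shrunk by the same factor, while the second feature, whose row norm is below the threshold, is dropped for both tasks together.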
Efficient Rule Ensemble Learning using Hierarchical Kernels
Cited by 2 (1 self)
This paper addresses the problem of Rule Ensemble Learning (REL), where the goal is simultaneous discovery of a small set of simple rules and their optimal weights that lead to good generalization. Rules are assumed to be conjunctions of basic propositions concerning the values taken by the input features. From the perspectives of interpretability as well as generalization, it is highly desirable to construct rule ensembles with low training error, having rules that are i) simple, i.e., involve few conjunctions and ii) few in number. We propose to explore the (exponentially) large feature space of all possible conjunctions optimally and efficiently by employing the recently introduced Hierarchical Kernel Learning (HKL) framework. The regularizer employed in the HKL formulation can be interpreted as a potential for discouraging selection of rules involving large number of conjunctions, justifying its suitability for constructing rule ensembles. Simulation results show that, in case of many benchmark datasets, the proposed approach improves over state-of-the-art REL algorithms in terms of generalization and indeed learns simple rules. Unfortunately, HKL selects a conjunction only if all its subsets are selected. We propose a novel convex formulation which alleviates this problem and generalizes the HKL framework. The main technical contribution of this paper is an efficient mirror-descent based active set algorithm for solving the new formulation. Empirical evaluations on REL problems illustrate the utility of generalized