Results 1 - 10
of
139
Consistency of the group lasso and multiple kernel learning
- JOURNAL OF MACHINE LEARNING RESEARCH
, 2007
"... We consider the least-square regression problem with regularization by a block 1-norm, i.e., a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the 1-norm where all spaces have dimension one, where it ..."
Abstract
-
Cited by 81 (14 self)
- Add to MetaCart
We consider the least-square regression problem with regularization by a block 1-norm, i.e., a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the 1-norm where all spaces have dimension one, where it is commonly referred to as the Lasso. In this paper, we study the asymptotic model consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for non linear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non adaptive scheme is not satisfied.
Learning the discriminative powerinvariance trade-off
- In ICCV
, 2007
"... We investigate the problem of learning optimal descriptors for a given classification task. Many hand-crafted descriptors have been proposed in the literature for measuring visual similarity. Looking past initial differences, what really distinguishes one descriptor from another is the tradeoff that ..."
Abstract
-
Cited by 80 (3 self)
- Add to MetaCart
We investigate the problem of learning optimal descriptors for a given classification task. Many hand-crafted descriptors have been proposed in the literature for measuring visual similarity. Looking past initial differences, what really distinguishes one descriptor from another is the tradeoff that it achieves between discriminative power and invariance. Since this trade-off must vary from task to task, no single descriptor can be optimal in all situations. Our focus, in this paper, is on learning the optimal tradeoff for classification given a particular training set and prior constraints. The problem is posed in the kernel learning framework. We learn the optimal, domain-specific kernel as a combination of base kernels corresponding to base features which achieve different levels of trade-off (such as no invariance, rotation invariance, scale invariance, affine invariance, etc.) This leads to a convex optimisation problem with a unique global optimum which can be solved for efficiently. The method is shown to achieve state-of-the-art performance on the UIUC textures, Oxford flowers and Caltech 101 datasets. 1.
Efficient structure learning of Markov networks using L1regularization
- In NIPS
, 2006
"... Markov networks are widely used in a wide variety of applications, in problems ranging from computer vision, to natural language, to computational biology. In most current applications, even those that rely heavily on learned models, the structure of the Markov network is constructed by hand, due to ..."
Abstract
-
Cited by 71 (1 self)
- Add to MetaCart
Markov networks are widely used in a wide variety of applications, in problems ranging from computer vision, to natural language, to computational biology. In most current applications, even those that rely heavily on learned models, the structure of the Markov network is constructed by hand, due to the lack of effective algorithms for learning Markov network structure from data. In this paper, we provide a computationally effective method for learning Markov network structure from data. Our method is based on the use of L1 regularization on the weights of the log-linear model, which has the effect of biasing the model towards solutions where many of the parameters are zero. This formulation converts the Markov network learning problem into a convex optimization problem in a continuous space, which can be solved using efficient gradient methods. A key issue in this setting is the (unavoidable) use of approximate inference, which can lead to errors in the gradient computation when the network structure is dense. Thus, we explore the use of different feature introduction schemes and compare their performance. We provide results for our method on synthetic data, and on two real world data sets: modeling the joint distribution of pixel values in the MNIST data, and modeling the joint distribution of genetic sequence variations in the human HapMap data. We show that our L1-based method achieves considerably higher generalization performance than the more standard L2-based method (a Gaussian parameter prior) or pure maximum-likelihood learning. We also show that we can learn MRF network structure at a computational cost that is not much greater than learning parameters alone, demonstrating the existence of a feasible method for this important problem. 1
Learning the kernel function via regularization
- Journal of Machine Learning Research
, 2005
"... We study the problem of finding an optimal kernel from a prescribed convex set of kernels K for learning a real-valued function by regularization. We establish for a wide variety of regularization functionals that this leads to a convex optimization problem and, for square loss regularization, we ch ..."
Abstract
-
Cited by 57 (4 self)
- Add to MetaCart
We study the problem of finding an optimal kernel from a prescribed convex set of kernels K for learning a real-valued function by regularization. We establish for a wide variety of regularization functionals that this leads to a convex optimization problem and, for square loss regularization, we characterize the solution of this problem. We show that, although K may be an uncountable set, the optimal kernel is always obtained as a convex combination of at most m+2 basic kernels, where m is the number of data examples. In particular, our results apply to learning the optimal radial kernel or the optimal dot product kernel. 1.
Multiple Kernels for Object Detection
"... Our objective is to obtain a state-of-the art object category detector by employing a state-of-the-art image classifier to search for the object in all possible image subwindows. We use multiple kernel learning of Varma and Ray (ICCV 2007) to learn an optimal combination of exponential χ 2 kernels, ..."
Abstract
-
Cited by 48 (3 self)
- Add to MetaCart
Our objective is to obtain a state-of-the art object category detector by employing a state-of-the-art image classifier to search for the object in all possible image subwindows. We use multiple kernel learning of Varma and Ray (ICCV 2007) to learn an optimal combination of exponential χ 2 kernels, each of which captures a different feature channel. Our features include the distribution of edges, dense and sparse visual words, and feature descriptors at different levels of spatial organization. Such a powerful classifier cannot be tested on all image sub-windows in a reasonable amount of time. Thus we propose a novel three-stage classifier, which combines linear, quasi-linear, and non-linear kernel SVMs. We show that increasing the non-linearity of the kernels increases their discriminative power, at the cost of an increased computational complexity. Our contributions include (i) showing that a linear classifier can be evaluated with a complexity proportional to the number of sub-windows (independent of the sub-window area and descriptor dimension); (ii) a comparison of three efficient methods of proposing candidate regions (including the jumping window classifier of Chum and Zisserman (CVPR 2007) based on proposing windows from scale invariant features); and (iii) introducing overlap-recall curves as a mean to compare and optimize the performance of the intermediate pipeline stages. The method is evaluated on the PASCAL Visual Object Detection Challenge, and exceeds the performances of previously published methods for most of the classes.
On feature combination for multiclass object classication
- In ICCV
"... A key ingredient in the design of visual object classification systems is the identification of relevant class specific aspects while being robust to intra-class variations. While this is a necessity in order to generalize beyond a given set of training images, it is also a very difficult problem du ..."
Abstract
-
Cited by 47 (1 self)
- Add to MetaCart
A key ingredient in the design of visual object classification systems is the identification of relevant class specific aspects while being robust to intra-class variations. While this is a necessity in order to generalize beyond a given set of training images, it is also a very difficult problem due to the high variability of visual appearance within each class. In the last years substantial performance gains on challenging benchmark datasets have been reported in the literature. This progress can be attributed to two developments: the design of highly discriminative and robust image features and the combination of multiple complementary features based on different aspects such as shape, color or texture. In this paper we study several models that aim at learning the correct weighting of different features from training data. These include multiple kernel learning as well as simple baseline methods. Furthermore we derive ensemble methods inspired by Boosting which are easily extendable to several multiclass setting. All methods are thoroughly evaluated on object classification datasets using a multitude of feature descriptors. The key results are that even very simple baseline methods, that are orders of magnitude faster than learning techniques are highly competitive with multiple kernel learning. Furthermore the Boosting type methods are found to produce consistently better results in all experiments. We provide insight of when combination methods can be expected to work and how the benefit of complementary features can be exploited most efficiently.
Exploring large feature spaces with hierarchical MKL
, 2008
"... For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or H ..."
Abstract
-
Cited by 45 (9 self)
- Add to MetaCart
For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ 1-norm or the block ℓ 1-norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance. 1
More efficiency in multiple kernel learning
- In ICML
, 2007
"... An efficient and general multiple kernel learning (MKL) algorithm has been recently proposed by Sonnenburg et al. (2006). This approach has opened new perspectives since it makes the MKL approach tractable for largescale problems, by iteratively using existing support vector machine code. However, i ..."
Abstract
-
Cited by 38 (3 self)
- Add to MetaCart
An efficient and general multiple kernel learning (MKL) algorithm has been recently proposed by Sonnenburg et al. (2006). This approach has opened new perspectives since it makes the MKL approach tractable for largescale problems, by iteratively using existing support vector machine code. However, it turns out that this iterative algorithm needs several iterations before converging towards a reasonable solution. In this paper, we address the MKL problem through an adaptive 2-norm regularization formulation. Weights on each kernel matrix are included in the standard SVM empirical risk minimization problem with a ℓ1 constraint to encourage sparsity. We propose an algorithm for solving this problem and provide an new insight on MKL algorithms based on block 1-norm regularization by showing that the two approaches are equivalent. Experimental results show that the resulting algorithm converges rapidly and its efficiency compares favorably to other MKL algorithms. 1.
Computing Regularization Paths for Learning Multiple Kernels
, 2005
"... The problem of learning a sparse conic combination of kernel functions or kernel matrices for classification or regression can be achieved via the regularization by a block 1-norm [1]. In this paper, we present an algorithm that computes the entire regularization path for these problems. ..."
Abstract
-
Cited by 34 (10 self)
- Add to MetaCart
The problem of learning a sparse conic combination of kernel functions or kernel matrices for classification or regression can be achieved via the regularization by a block 1-norm [1]. In this paper, we present an algorithm that computes the entire regularization path for these problems.
Multi-task feature selection
- In the workshop of structural Knowledge Transfer for Machine Learning in the 23rd International Conference on Machine Learning (ICML
, 2006
"... We address the problem of joint feature selection across a group of related classification or regression tasks. We propose a novel type of joint regularization of the model parameters in order to couple feature selection across tasks. Intuitively, we extend the ℓ1 regularization for single-task esti ..."
Abstract
-
Cited by 31 (1 self)
- Add to MetaCart
We address the problem of joint feature selection across a group of related classification or regression tasks. We propose a novel type of joint regularization of the model parameters in order to couple feature selection across tasks. Intuitively, we extend the ℓ1 regularization for single-task estimation to the multi-task setting. By penalizing the sum of ℓ2-norms of the blocks of coefficients associated with each feature across different tasks, we encourage multiple predictors to have similar parameter sparsity patterns. To fit parameters under this regularization, we propose a blockwise boosting scheme that follows the regularization path. The algorithm introduces and updates simultaneously the coefficients associated with one feature in all tasks. We show empirically that this approach outperforms independent ℓ1-based feature selection on several datasets. 1

