Results 1–10 of 24
More Generality in Efficient Multiple Kernel Learning
Cited by 80 (3 self)

Abstract
Recent advances in Multiple Kernel Learning (MKL) have positioned it as an attractive tool for tackling many supervised learning tasks. The development of efficient gradient-descent-based optimization schemes has made it possible to tackle large-scale problems. Simultaneously, MKL-based algorithms have achieved very good results on challenging real-world applications. Yet, despite their successes, MKL approaches are limited in that they focus on learning a linear combination of given base kernels. In this paper, we observe that existing MKL formulations can be extended to learn general kernel combinations subject to general regularization. This can be achieved while retaining all the efficiency of existing large-scale optimization algorithms. To highlight the advantages of generalized kernel learning, we tackle feature selection problems on benchmark vision and UCI databases. It is demonstrated that the proposed formulation can lead to better results not only as compared to traditional MKL but also as compared to state-of-the-art wrapper and filter methods for feature selection.
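The weight-learning step the abstract alludes to can be pictured with a toy sketch (not the paper's algorithm): kernel weights d on the simplex are updated by projected gradient ascent on the alignment between the combined kernel K(d) = Σ_k d_k K_k and the label matrix yyᵀ. The alignment objective, step size, and iteration count here are illustrative assumptions.

```python
import numpy as np

def simplex_project(v):
    # Euclidean projection onto the probability simplex.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def learn_kernel_weights(Ks, y, steps=200, lr=0.1):
    """Toy MKL sketch: maximize the alignment of K(d) = sum_k d_k K_k with
    yy^T, keeping the weight vector d on the simplex."""
    yyT = np.outer(y, y)
    d = np.full(len(Ks), 1.0 / len(Ks))
    for _ in range(steps):
        K = sum(dk * Kk for dk, Kk in zip(d, Ks))
        denom = np.linalg.norm(K) + 1e-12
        # gradient of <K, yyT> / ||K||_F with respect to each d_k
        grad = np.array([
            np.sum(Kk * yyT) / denom
            - np.sum(K * yyT) * np.sum(K * Kk) / denom**3
            for Kk in Ks
        ])
        d = simplex_project(d + lr * grad)
    return d
```

On two base kernels, one aligned with the labels and one not, the learned weights concentrate on the informative kernel.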
Learning Sparse SVM for Feature Selection on Very High Dimensional Datasets
Cited by 27 (7 self)

Abstract
A sparse representation of Support Vector Machines (SVMs) with respect to input features is desirable for many applications. In this paper, by introducing a 0-1 control variable for each input feature, the ℓ0-norm Sparse SVM (SSVM) is converted to a mixed integer programming (MIP) problem. Rather than directly solving this MIP, we propose an efficient cutting-plane algorithm, combined with multiple kernel learning, to solve its convex relaxation. A global convergence proof for our method is also presented. Comprehensive experimental results on one synthetic and 10 real-world datasets show that our proposed method can obtain better or competitive performance compared with existing SVM-based feature selection methods in terms of sparsity and generalization performance. Moreover, our proposed method can effectively handle large-scale and extremely high-dimensional problems.
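The 0-1 control variables can be pictured as a mask that switches input features on or off. The sketch below brute-forces the mask on a tiny problem instead of solving the MIP or its convex relaxation, and substitutes a least-squares classifier for the SVM; the objective (training accuracy minus an ℓ0 penalty) and the penalty weight `lam` are illustrative assumptions.

```python
import itertools
import numpy as np

def best_mask(X, y, lam=0.05):
    """Exhaustively search 0-1 feature masks (tiny d only) as a toy stand-in
    for the MIP; a least-squares classifier replaces the SVM."""
    n, d = X.shape
    best, best_obj = None, -np.inf
    for bits in itertools.product([0, 1], repeat=d):
        sigma = np.array(bits, dtype=float)
        Xm = X * sigma                      # zero out de-selected features
        w, *_ = np.linalg.lstsq(Xm, y, rcond=None)
        acc = np.mean(np.sign(Xm @ w) == y)
        obj = acc - lam * sigma.sum()       # accuracy minus l0 penalty
        if obj > best_obj:
            best_obj, best = obj, sigma
    return best
```

On data where only one feature carries the label, the ℓ0 penalty drives the mask to select that feature alone.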
Feature Selection as a One-Player Game
Cited by 22 (2 self)

Abstract
This paper formalizes Feature Selection as a Reinforcement Learning problem, leading to a provably optimal though intractable selection policy. As a second contribution, this paper presents an approximation thereof, based on a one-player game approach and relying on the Monte-Carlo tree search UCT (Upper Confidence Tree) proposed by Kocsis and Szepesvari (2006). The Feature UCT SElection (FUSE) algorithm extends UCT to deal with i) a finite unknown horizon (the target number of relevant features); ii) the huge branching factor of the search tree, reflecting the size of the feature set. Finally, a frugal reward function is proposed as a rough but unbiased estimate of the relevance of a feature subset. A proof of concept of FUSE is shown on benchmark data sets.
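UCT's node-selection rule, which FUSE inherits, is the UCB1 score: a child's mean reward plus an exploration bonus that shrinks as the child is visited. A minimal sketch (the exploration constant `c` is an assumption, and FUSE adds further machinery for the unknown horizon and large branching factor):

```python
import math

def ucb1_pick(values, counts, parent_count, c=math.sqrt(2)):
    """Return the index of the child (candidate feature) with the highest
    UCB1 score; unvisited children score +inf so each is tried once."""
    def score(v, n):
        if n == 0:
            return float("inf")
        return v / n + c * math.sqrt(math.log(parent_count) / n)
    scores = [score(v, n) for v, n in zip(values, counts)]
    return scores.index(max(scores))
```

With equal visit counts the child with the higher total reward wins; an unvisited child is always picked first.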
Non-Monotonic Feature Selection
Cited by 18 (2 self)

Abstract
We consider the problem of selecting a subset of m most informative features, where m is the number of required features. This feature selection problem is essentially a combinatorial optimization problem, and is usually solved by an approximation. Conventional feature selection methods address the computational challenge in two steps: (a) ranking all the features by certain scores that are usually computed independently from the number of specified features m, and (b) selecting the top m ranked features. One major shortcoming of these approaches is that if a feature f is chosen when the number of specified features is m, it will always be chosen when the number of specified features is larger than m. We refer to this property as the “monotonic” property of feature selection. In this work, we argue that it is important to develop efficient algorithms for non-monotonic feature selection. To this end, we develop an algorithm for non-monotonic feature selection that approximates the related combinatorial optimization problem by a Multiple Kernel Learning (MKL) problem. We also present a strategy that derives a discrete solution from the approximate solution of MKL, and show the performance guarantee for the derived discrete solution when compared to the global optimal solution for the related combinatorial optimization problem. An empirical study with a number of benchmark data sets indicates the promising per ...
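The "monotonic" property is easy to state in code: any selector that ranks features once and takes the top m satisfies top_m(s, m) ⊆ top_m(s, m+1), so a feature selected at budget m can never be dropped at a larger budget. A minimal illustration:

```python
def top_m(scores, m):
    """Ranking-based selection: indices of the m highest-scoring features."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return set(order[:m])
```

A non-monotonic method is free to violate this nesting, e.g. replacing two individually strong but redundant features with a complementary pair once the budget grows.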
Maximum Entropy Discrimination Markov Networks
, 2008
Cited by 15 (8 self)

Abstract
Standard max-margin structured prediction methods concentrate directly on the input-output mapping, and the lack of an elegant probabilistic interpretation causes limitations. In this paper, we present a novel framework called Maximum Entropy Discrimination Markov Networks (MaxEntNet) to do Bayesian max-margin structured learning by using expected margin constraints to define a feasible distribution subspace and applying the maximum entropy principle to choose the best distribution from this subspace. We show that MaxEntNet subsumes the standard max-margin Markov networks (M³N) as a special case where the predictive model is assumed to be linear and the parameter prior is a standard normal. Based on this understanding, we propose the Laplace max-margin Markov networks (LapM³N), which use the Laplace prior instead of the standard normal. We show that the adoption of a Laplace prior for the parameters makes LapM³N enjoy properties expected from a sparsified M³N. Unlike L1-regularized maximum likelihood estimation, which sets small weights to zero to achieve sparsity, LapM³N posteriorly weights the parameters, and features with smaller weights are shrunk more. This posterior weighting effect makes LapM³N more stable with respect to the magnitudes of the regularization coefficients and more generalizable. To ...
Online feature selection for mining big data
 In BigMine
, 2012
Cited by 4 (2 self)

Abstract
Most studies of online learning require accessing all the attributes/features of training instances. Such a classical setting is not always appropriate for real-world applications when data instances are of high dimensionality or it is expensive to acquire the full set of attributes/features. To address this limitation, we investigate the problem of Online Feature Selection (OFS), in which the online learner is only allowed to maintain a classifier involving a small and fixed number of features. The key challenge of Online Feature Selection is how to make accurate predictions using a small and fixed number of active features. This is in contrast to the classical setup of online learning, where all the features are active and can be used for prediction. We address this challenge by studying sparsity regularization and truncation techniques. Specifically, we present an effective algorithm to solve the problem, give the theoretical analysis, and evaluate the empirical performance of the proposed algorithms for online feature selection on several public datasets. We also demonstrate the application of our online feature selection technique to tackle real-world problems of big data mining, which is significantly more scalable than some well-known batch feature selection algorithms. The encouraging results of our experiments validate the efficacy and efficiency of the proposed techniques for large-scale applications.
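The truncation idea can be sketched in a few lines (an illustrative simplification, not the paper's exact OFS algorithm; the step size, hinge loss, and absence of a projection step are assumptions): take an online gradient step on the hinge loss, then zero all but the B largest-magnitude weights so at most B features stay active.

```python
import numpy as np

def ofs_truncation(stream, dim, B, eta=0.2):
    """Online feature selection by truncation: hinge-loss gradient step,
    then keep only the B largest-magnitude weight coordinates."""
    w = np.zeros(dim)
    for x, y in stream:
        if y * (w @ x) < 1.0:                    # margin violated: update
            w = w + eta * y * x
        if np.count_nonzero(w) > B:
            small = np.argsort(np.abs(w))[:-B]   # all but the top B
            w[small] = 0.0
    return w
```

On a stream where only the first two features carry the label, the returned classifier keeps at most B nonzero weights and puts the largest on the most informative feature.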
Large margin transformation learning
, 2009
Cited by 3 (2 self)

Abstract
With the current explosion of data coming from many scientific fields and industry, machine learning algorithms are more important than ever to help make sense of this data in an automated manner. Support vector machines (SVMs) have been a very successful learning algorithm in many applied settings. However, the support vector machine only finds linear classifiers, so data often needs to be preprocessed with appropriately chosen nonlinear mappings in order to find a model with good predictive properties. These mappings can either take the form of an explicit transformation or be defined implicitly with a kernel function. Automatically choosing these mappings has been studied under the name of kernel learning. These methods typically optimize a cost function to find a kernel made up of a combination of base kernels, thus implicitly learning mappings. This dissertation investigates methods for choosing explicit transformations automatically. This setting differs from the kernel learning framework by learning a combination of base transformations rather than base kernels. This allows prior knowledge to be exploited in the functional form of the transformations, which may not be easily encoded as kernels, such as when learning monotonic ...
Variable Selection for Gaussian Graphical Models
, 2012
Cited by 2 (2 self)

Abstract
We present a variable-selection structure learning approach for Gaussian graphical models. Unlike standard sparseness-promoting techniques, our method aims at selecting the most important variables besides simply sparsifying the set of edges. Through simulations, we show that our method outperforms the state-of-the-art in recovering the ground-truth model. Our method also exhibits better generalization performance on a wide range of complex real-world datasets: brain fMRI, gene expression, NASDAQ stock prices, and world weather. We also show that our resulting networks are more interpretable in the context of brain fMRI analysis, while retaining discriminability. From an optimization perspective, we show that a block coordinate descent method generates a sequence of positive definite solutions. Thus, we reduce the original problem into a sequence of strictly convex (ℓ1, ℓp) regularized quadratic minimization subproblems for p ∈ {2, ∞}. Our algorithm is well founded since the optimal solution of the maximization problem is unique and bounded.
Markov Networks
, 2008
Cited by 1 (0 self)

Abstract
* To whom correspondence should be addressed. Keywords: Maximum entropy discrimination Markov networks, Bayesian max-margin ...