Results 1–10 of 51
Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space
Journal of Machine Learning Research, 2003
Cited by 79 (2 self)
Abstract
We present a novel and flexible approach to the problem of feature selection, called grafting. Rather than considering feature selection as separate from learning, grafting treats the selection of suitable features as an integral part of learning a predictor in a regularized learning framework. To make this regularized learning process sufficiently fast for large-scale problems, grafting operates in an incremental iterative fashion, gradually building up a feature set while training a predictor model using gradient descent. At each iteration, a fast gradient-based heuristic is used to quickly assess which feature is most likely to improve the existing model; that feature is then added to the model, and the model is incrementally optimized using gradient descent. The algorithm scales linearly with the number of data points and at most quadratically with the number of features. Grafting can be used with a variety of predictor model classes, both linear and nonlinear, and can be used for both classification and regression. Experiments are reported here on a variant of grafting for classification, using both linear and nonlinear models, and using a logistic regression-inspired loss function. Results on a variety of synthetic and real-world data sets are presented. Finally, the relationship between grafting, stagewise additive modelling, and boosting is explored.
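The incremental loop this abstract describes (score the inactive features by the magnitude of the loss gradient, add the most promising one, then re-optimize the active weights) can be sketched in a few lines. The following is an illustrative NumPy toy under an L1-regularized logistic loss; `graft` and all parameter choices are hypothetical, not the authors' implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def graft(X, y, lam=0.05, n_steps=5, lr=0.5, inner_iters=200):
    """Grafting-style incremental feature selection (illustrative sketch).

    Repeatedly add the inactive feature with the largest loss-gradient
    magnitude, then re-optimize the active weights by gradient descent
    on a mean logistic loss with an L1 penalty of strength lam.
    Labels y are assumed to be 0/1."""
    n, d = X.shape
    w = np.zeros(d)
    active = []
    for _ in range(n_steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / n            # gradient of the mean logistic loss
        candidates = [j for j in range(d) if j not in active]
        if not candidates:
            break
        j = max(candidates, key=lambda k: abs(grad[k]))
        if abs(grad[j]) <= lam:             # no inactive feature can beat the penalty
            break
        active.append(j)
        for _ in range(inner_iters):        # re-optimize the active weights only
            p = sigmoid(X @ w)
            g = X.T @ (p - y) / n + lam * np.sign(w)
            for k in active:
                w[k] -= lr * g[k]
    return w, active
```

On data where only one feature carries signal, the gradient heuristic picks it first and the stopping test then rejects the noise features.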
More Generality in Efficient Multiple Kernel Learning
Cited by 47 (2 self)
Abstract
Recent advances in Multiple Kernel Learning (MKL) have positioned it as an attractive tool for tackling many supervised learning tasks. The development of efficient gradient-descent-based optimization schemes has made it possible to tackle large-scale problems. Simultaneously, MKL-based algorithms have achieved very good results on challenging real-world applications. Yet, despite their successes, MKL approaches are limited in that they focus on learning a linear combination of given base kernels. In this paper, we observe that existing MKL formulations can be extended to learn general kernel combinations subject to general regularization. This can be achieved while retaining all the efficiency of existing large-scale optimization algorithms. To highlight the advantages of generalized kernel learning, we tackle feature selection problems on benchmark vision and UCI databases. It is demonstrated that the proposed formulation can lead to better results not only as compared to traditional MKL but also as compared to state-of-the-art wrapper and filter methods for feature selection.
Gene Selection Using Support Vector Machines With Nonconvex Penalty
Bioinformatics, 2006
Cited by 27 (2 self)
Abstract
Motivation: With the development of DNA microarray technology, scientists can now measure the expression levels of thousands of genes simultaneously in a single experiment. One current difficulty in interpreting microarray data comes from their innate nature of “high dimensional, low sample size.” Therefore, robust and accurate gene selection methods are required to identify differentially expressed groups of genes across different samples, e.g., between cancerous and normal cells. Successful gene selection will help to classify different cancer types, lead to a better understanding of genetic signatures in cancers, and improve treatment strategies. Although gene selection and cancer classification are two closely related problems, most existing approaches handle them separately by selecting genes prior to classification. We provide
A comparison of optimization methods and software for large-scale l1-regularized linear classification
The Journal of Machine Learning Research
Cited by 22 (5 self)
Abstract
Large-scale linear classification is widely used in many areas. The L1-regularized form can be applied for feature selection; however, its non-differentiability causes more difficulties in training. Although various optimization methods have been proposed in recent years, they have not yet been compared suitably. In this paper, we first broadly review existing methods. Then, we discuss state-of-the-art software packages in detail and propose two efficient implementations. Extensive comparisons indicate that carefully implemented coordinate descent methods are very suitable for training large document data.
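The coordinate descent methods the comparison favors reduce, in their simplest form, to cyclic closed-form soft-threshold updates. A minimal sketch, shown for squared loss (the lasso) rather than the logistic losses studied in the paper; `cd_lasso` is an illustrative name:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def cd_lasso(X, y, lam=0.1, n_sweeps=100):
    """Cyclic coordinate descent for min_w 1/(2n)||Xw - y||^2 + lam*||w||_1.

    Each coordinate update is an exact closed-form soft-threshold step;
    the residual is maintained incrementally so one sweep costs O(nd)."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n       # per-coordinate curvature
    r = y - X @ w                           # residual, kept up to date
    for _ in range(n_sweeps):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ r / n + col_sq[j] * w[j]
            w_new = soft_threshold(rho, lam) / col_sq[j]
            r += X[:, j] * (w[j] - w_new)   # incremental residual update
            w[j] = w_new
    return w
```

The non-differentiability of the L1 term is no obstacle here because each one-dimensional subproblem is solved exactly.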
Training Support Vector Machine using Adaptive Clustering
in Proc. of the 4th SIAM International Conference on Data Mining, Lake Buena, 2004
Cited by 22 (3 self)
Abstract
Training support vector machines involves a huge optimization problem, and many specially designed algorithms have been proposed. In this paper, we propose an algorithm called ClusterSVM that accelerates the training process by exploiting the distributional properties of the training data, that is, the natural clustering of the training data and the overall layout of these clusters relative to the decision boundary of the support vector machine. The proposed algorithm first partitions the training data into several pairwise-disjoint clusters. Then, the representatives of these clusters are used to train an initial support vector machine, based on which we can approximately identify the support vectors and non-support vectors. After replacing each cluster containing only non-support vectors with its representative, the number of training points can be significantly reduced, thereby speeding up the training process. The proposed ClusterSVM has been tested against the popular training algorithm SMO on both artificial and real data, and a significant speedup was observed. The complexity of ClusterSVM scales with the square of the number of support vectors and, after a further improvement, is expected to scale with the square of the number of non-boundary support vectors.
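The reduction step described above (cluster each class, then collapse clusters that appear to contain only non-support vectors into their representative) can be sketched as follows. An illustrative NumPy sketch assuming a linear decision function w·x + b and the margin test y(w·x + b) > 1 as the non-support-vector criterion; this is not the authors' ClusterSVM code:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Tiny k-means used only to form the clusters (illustrative)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels, centers

def reduce_training_set(X, y, w, b, k=10):
    """One ClusterSVM-style reduction step: cluster each class, and replace
    every cluster whose points all lie safely outside the margin (plausible
    non-support vectors under the current model w, b) by its centroid."""
    keep_X, keep_y = [], []
    for cls in np.unique(y):
        Xc = X[y == cls]
        labels, centers = kmeans(Xc, min(k, len(Xc)))
        for j in range(len(centers)):
            pts = Xc[labels == j]
            if len(pts) == 0:
                continue
            margins = cls * (pts @ w + b)
            if (margins > 1).all():          # all non-support vectors: collapse
                keep_X.append(centers[j][None, :])
                keep_y.append(np.array([cls]))
            else:                            # may contain support vectors: keep
                keep_X.append(pts)
                keep_y.append(np.full(len(pts), cls))
    return np.vstack(keep_X), np.concatenate(keep_y)
```

Retraining on the reduced set is then much cheaper, which is the source of the reported speedup over SMO.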
The Interplay of Optimization and Machine Learning Research
Journal of Machine Learning Research, 2006
Cited by 17 (1 self)
Abstract
The fields of machine learning and mathematical programming are increasingly intertwined. Optimization problems lie at the heart of most machine learning approaches. The Special Topic on Machine Learning and Large Scale Optimization examines this interplay. Machine learning researchers have embraced the advances in mathematical programming, allowing new types of models to be pursued. The special topic includes models using quadratic, linear, second-order cone, semidefinite, and semi-infinite programs. We observe that the qualities of good optimization algorithms from the machine learning and optimization perspectives can be quite different. Mathematical programming puts a premium on accuracy, speed, and robustness. Since generalization is the bottom line in machine learning and training is normally done offline, accuracy and small speed improvements are of little concern in machine learning. Machine learning prefers simpler algorithms that work in reasonable computational time for specific classes of problems. Reducing machine learning problems to well-explored mathematical programming classes with robust general-purpose optimization codes allows machine learning researchers to rapidly develop new techniques.
Direct convex relaxations of sparse SVM
in ICML ’07: Proceedings of the 24th International Conference on Machine Learning
Cited by 16 (0 self)
Abstract
Although support vector machines (SVMs) for binary classification give rise to a decision rule that relies on only a subset of the training data points (the support vectors), it will in general be based on all available features in the input space. We propose two direct, novel convex relaxations of a nonconvex sparse SVM formulation that explicitly constrains the cardinality of the vector of feature weights. One relaxation results in a quadratically-constrained quadratic program (QCQP), while the second is based on a semidefinite programming (SDP) relaxation. The QCQP formulation can be interpreted as applying an adaptive soft-threshold to the SVM hyperplane, while the SDP formulation learns a weighted inner product (i.e., a kernel) that results in a sparse hyperplane. Experimental results show an increase in sparsity while conserving the generalization performance compared to a standard as well as a linear programming SVM.
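The soft-threshold interpretation of the QCQP relaxation can be shown directly: shrinking each hyperplane weight toward zero and truncating small entries yields a sparse hyperplane. A toy version with a single global threshold t (the paper derives an adaptive, data-driven threshold; the function name is hypothetical):

```python
import numpy as np

def soft_threshold_hyperplane(w, t):
    """Shrink each SVM weight toward zero by t, zeroing small entries.

    Features whose weights fall below the threshold drop out of the
    decision rule entirely, giving a sparse hyperplane."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)
```

For example, thresholding w = [0.05, -2.0, 0.6, -0.01] at t = 0.1 removes the two small weights while only mildly shrinking the large ones.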
Exact 1-Norm Support Vector Machines via Unconstrained Convex Differentiable Minimization
Journal of Machine Learning Research, 2006
Cited by 14 (0 self)
Abstract
Support vector machines utilizing the 1-norm, typically set up as linear programs (Mangasarian, 2000; Bradley and Mangasarian, 1998), are formulated here as a completely unconstrained minimization of a convex, differentiable, piecewise-quadratic objective function in the dual space. The objective function, which has a Lipschitz continuous gradient and contains only one additional finite parameter, can be minimized by a generalized Newton method and leads to an exact solution of the support vector machine problem. The approach here is based on a formulation of a very general linear program as an unconstrained minimization problem and its application to support vector machine classification problems. The present approach, which generalizes both (Mangasarian, 2004) and (Fung and Mangasarian, 2004), is also applied to nonlinear approximation, where a minimal number of nonlinear kernel functions are utilized to approximate a function from a given number of function values.
Learning Sparse SVM for Feature Selection on Very High Dimensional Datasets
Cited by 10 (5 self)
Abstract
A sparse representation of Support Vector Machines (SVMs) with respect to input features is desirable for many applications. In this paper, by introducing a 0-1 control variable for each input feature, the l0-norm Sparse SVM (SSVM) is converted to a mixed integer programming (MIP) problem. Rather than directly solving this MIP, we propose an efficient cutting plane algorithm, combined with multiple kernel learning, to solve its convex relaxation. A global convergence proof for our method is also presented. Comprehensive experimental results on one synthetic and 10 real-world datasets show that our proposed method can obtain better or competitive performance compared with existing SVM-based feature selection methods in terms of sparsity and generalization performance. Moreover, our proposed method can effectively handle large-scale and extremely high dimensional problems.
Feature Selection for Nonlinear Kernel Support Vector Machines
Seventh IEEE International Conference on Data Mining Workshops, 2007
Cited by 9 (0 self)
Abstract
An easily implementable mixed-integer algorithm is proposed that generates a nonlinear kernel support vector machine (SVM) classifier with reduced input space features. A single parameter controls the reduction. On one publicly available dataset, the algorithm obtains 92.4% accuracy with 34.7% of the features, compared to 94.1% accuracy with all features. On a synthetic dataset with 1000 features, 900 of which are irrelevant, our approach improves the accuracy of a full-feature classifier by over 30%. The proposed algorithm introduces a diagonal matrix E with ones for features present in the classifier and zeros for removed features. By alternating between optimizing the continuous variables of an ordinary nonlinear SVM and the integer variables on the diagonal of E, a decreasing sequence of objective function values is obtained. This sequence converges to a local solution minimizing the usual data fit and solution complexity while also minimizing the number of features used.
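The alternating scheme described above can be sketched for the linear case: optimize the continuous weights with the 0/1 diagonal of E fixed, then update each diagonal entry to whichever value lowers the objective, so the integer phase never increases it. An illustrative sketch using a logistic surrogate and a per-feature penalty standing in for the paper's nonlinear SVM objective; `alternate_fit` and all constants are hypothetical:

```python
import numpy as np

def objective(X, y, w, e, lam):
    """Mean logistic loss on masked features plus a penalty lam per
    selected feature (a stand-in for the paper's fit/complexity tradeoff).
    Labels y are assumed to be in {-1, +1}."""
    z = (X * e) @ w
    return np.mean(np.log1p(np.exp(-y * z))) + lam * e.sum()

def alternate_fit(X, y, lam=0.02, outer=5, lr=0.5, inner=300):
    """Alternate between the continuous and integer variables.

    Continuous phase: gradient descent on w with the mask e fixed.
    Integer phase: set each e[j] in {0, 1} to whichever value gives the
    lower objective with w fixed, so this phase cannot increase it."""
    n, d = X.shape
    w = np.zeros(d)
    e = np.ones(d)                           # diagonal of E: all features on
    for _ in range(outer):
        for _ in range(inner):               # continuous phase
            z = (X * e) @ w
            s = -y / (1.0 + np.exp(y * z))   # d(loss)/dz per sample
            w -= lr * ((X * e).T @ s) / n
        for j in range(d):                   # integer phase
            vals = []
            for v in (0.0, 1.0):
                e[j] = v
                vals.append(objective(X, y, w, e, lam))
            e[j] = float(np.argmin(vals))
    return w, e
```

On data with one informative feature, the penalty prices out the noise features while the informative one survives, because dropping it raises the loss by more than lam.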