Results 11 - 20
of
1,135
Diffusion Kernels on Graphs and Other Discrete Input Spaces
, 2002
"... The application of kernel-based learning algorithms has, so far, largely been confined to realvalued data and a few special data types, such as strings. In this paper we propose a general method of constructing natural families of kernels over discrete structures, based on the matrix exponenti ..."
Abstract
-
Cited by 134 (7 self)
- Add to MetaCart
The application of kernel-based learning algorithms has, so far, largely been confined to realvalued data and a few special data types, such as strings. In this paper we propose a general method of constructing natural families of kernels over discrete structures, based on the matrix exponentiation idea. In particular, we focus on generating kernels on graphs, for which we propose a special class of exponential kernels called diffusion kernels, which are based on the heat equation and can be regarded as the discretization of the familiar Gaussian kernel of Euclidean space.
Pegasos: Primal Estimated sub-gradient solver for SVM
"... We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy ɛ is Õ(1/ɛ), where each iteration operates on a singl ..."
Abstract
-
Cited by 131 (10 self)
- Add to MetaCart
We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy ɛ is Õ(1/ɛ), where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs require Ω(1/ɛ2) iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with 1/λ, where λ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is Õ(d/(λɛ)), where d is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach also extends to non-linear kernels while working solely on the primal objective function, though in this case the runtime does depend linearly on the training set size. Our algorithm is particularly well suited for large text classification problems, where we demonstrate an order-of-magnitude speedup over previous SVM learning methods.
Large scale multiple kernel learning
- JOURNAL OF MACHINE LEARNING RESEARCH
, 2006
"... While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We s ..."
Abstract
-
Cited by 129 (13 self)
- Add to MetaCart
While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We show that it can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling the standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm works for hundred thousands of examples or hundreds of kernels to be combined, and helps for automatic model selection, improving the interpretability of the learning result. In a second part we discuss general speed up mechanism for SVMs, especially when used with sparse feature maps as appear for string kernels, allowing us to train a string kernel SVM on a 10 million real-world splice data set from computational biology. We integrated multiple kernel learning in our machine learning toolbox SHOGUN for which the source code is publicly available at
Multicategory Support Vector Machines, theory, and application to the classification of microarray data and satellite radiance data
- Journal of the American Statistical Association
, 2004
"... Two-category support vector machines (SVM) have been very popular in the machine learning community for classi � cation problems. Solving multicategory problems by a series of binary classi � ers is quite common in the SVM paradigm; however, this approach may fail under various circumstances. We pro ..."
Abstract
-
Cited by 116 (10 self)
- Add to MetaCart
Two-category support vector machines (SVM) have been very popular in the machine learning community for classi � cation problems. Solving multicategory problems by a series of binary classi � ers is quite common in the SVM paradigm; however, this approach may fail under various circumstances. We propose the multicategory support vector machine (MSVM), which extends the binary SVM to the multicategory case and has good theoretical properties. The proposed method provides a unifying framework when there are either equal or unequal misclassi � cation costs. As a tuning criterion for the MSVM, an approximate leave-one-out cross-validation function, called Generalized Approximate Cross Validation, is derived, analogous to the binary case. The effectiveness of the MSVM is demonstrated through the applications to cancer classi � cation using microarray data and cloud classi � cation with satellite radiance pro � les.
Optimal Aggregation of Classifiers in Statistical Learning
, 2001
"... The problem of statistical learning can be considered as a problem of nonparametric estimation of sets, where the risk is de ned by means of a speci c distance function between sets associated to the misclassi cation error. The rates of convergence of classi ers depend on two parameters: the ..."
Abstract
-
Cited by 100 (4 self)
- Add to MetaCart
The problem of statistical learning can be considered as a problem of nonparametric estimation of sets, where the risk is de ned by means of a speci c distance function between sets associated to the misclassi cation error. The rates of convergence of classi ers depend on two parameters: the complexity of the class of candidate sets and the "margin" parameter. The dependence is explicitly given, in particular the optimal rates up to O(n ) can be attained, where n is the sample size, and the proposed classi ers have the property of robustness to the margin. The main result of the paper concerns optimal aggregation of classi ers: we suggest a classi er that automatically adapts both to the complexity and to the margin, and attains the optimal fast rates, up to a logarithmic factor.
Learning Multiple Tasks with Kernel Methods
- Journal of Machine Learning Research
, 2005
"... Editor: John Shawe-Taylor We study the problem of learning many related tasks simultaneously using kernel methods and regularization. The standard single-task kernel methods, such as support vector machines and regularization networks, are extended to the case of multi-task learning. Our analysis sh ..."
Abstract
-
Cited by 96 (5 self)
- Add to MetaCart
Editor: John Shawe-Taylor We study the problem of learning many related tasks simultaneously using kernel methods and regularization. The standard single-task kernel methods, such as support vector machines and regularization networks, are extended to the case of multi-task learning. Our analysis shows that the problem of estimating many task functions with regularization can be cast as a single task learning problem if a family of multi-task kernel functions we define is used. These kernels model relations among the tasks and are derived from a novel form of regularizers. Specific kernels that can be used for multi-task learning are provided and experimentally tested on two real data sets. In agreement with past empirical work on multi-task learning, the experiments show that learning multiple related tasks simultaneously using the proposed approach can significantly outperform standard single-task learning particularly when there are many related tasks but few data per task.
Marginalized kernels between labeled graphs
- Proceedings of the Twentieth International Conference on Machine Learning
, 2003
"... A new kernel function between two labeled graphs is presented. Feature vectors are defined as the counts of label paths produced by random walks on graphs. The kernel computation finally boils down to obtaining the stationary state of a discrete-time linear system, thus is efficiently performed by s ..."
Abstract
-
Cited by 94 (9 self)
- Add to MetaCart
A new kernel function between two labeled graphs is presented. Feature vectors are defined as the counts of label paths produced by random walks on graphs. The kernel computation finally boils down to obtaining the stationary state of a discrete-time linear system, thus is efficiently performed by solving simultaneous linear equations. Our kernel is based on an infinite dimensional feature space, so it is fundamentally different from other string or tree kernels based on dynamic programming. We will present promising empirical results in classification of chemical compounds. 1 1.
Diffusion kernels on graphs and other discrete structures
- In Proceedings of the ICML
, 2002
"... The application of kernel-based learning algorithms has, so far, largely been confined to realvalued data and a few special data types, such as strings. In this paper we propose a general method of constructing natural families of kernels over discrete structures, based on the matrix exponentiation ..."
Abstract
-
Cited by 94 (4 self)
- Add to MetaCart
The application of kernel-based learning algorithms has, so far, largely been confined to realvalued data and a few special data types, such as strings. In this paper we propose a general method of constructing natural families of kernels over discrete structures, based on the matrix exponentiation idea. In particular, we focus on generating kernels on graphs, for which we propose a special class of exponential kernels, based on the heat equation, called diffusion kernels, and show that these can be regarded as the discretisation of the familiar Gaussian kernel of Euclidean space.
Query Learning with Large Margin Classifiers
, 2000
"... The active selection of instances can significantly improve the generalisation performance of a learning machine. Large margin classifiers such as Support Vector Machines classify data using the most informative instances (the support vectors). This makes them natural candidates for instance s ..."
Abstract
-
Cited by 92 (1 self)
- Add to MetaCart
The active selection of instances can significantly improve the generalisation performance of a learning machine. Large margin classifiers such as Support Vector Machines classify data using the most informative instances (the support vectors). This makes them natural candidates for instance selection strategies. In this paper we propose an algorithm for the training of Support Vector Machines using instance selection. We give a theoretical justification for the strategy and experimental results on real and artificial data demonstrating its effectiveness. The technique is most efficient when the dataset can be learnt using few support vectors. 1. Introduction The labour-intensive task of labelling data is a serious bottleneck for many data mining tasks. Often cost or time constraints mean that only a fraction of the available instances can be labeled. For this reason there has been increasing interest in the problem of handling partially labeled datasets. One approach ...
Semi-Supervised Classification by Low Density Separation
, 2005
"... We believe that the cluster assumption is key to successful semi-supervised learning. Based on this, we propose three semi-supervised algorithms: 1. deriving graph-based distances that emphazise low density regions between clusters, followed by training a standard SVM; 2. optimizing the Transd ..."
Abstract
-
Cited by 89 (9 self)
- Add to MetaCart
We believe that the cluster assumption is key to successful semi-supervised learning. Based on this, we propose three semi-supervised algorithms: 1. deriving graph-based distances that emphazise low density regions between clusters, followed by training a standard SVM; 2. optimizing the Transductive SVM objective function, which places the decision boundary in low density regions, by gradient descent; 3. combining the first two to make maximum use of the cluster assumption. We compare with state of the art algorithms and demonstrate superior accuracy for the latter two methods.

