Results 1  10
of
21
Large scale multiple kernel learning
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2006
"... While classical kernelbased learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We s ..."
Abstract

Cited by 229 (17 self)
 Add to MetaCart
While classical kernelbased learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We show that it can be rewritten as a semiinfinite linear program that can be efficiently solved by recycling the standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and oneclass classification. Experimental results show that the proposed algorithm works for hundred thousands of examples or hundreds of kernels to be combined, and helps for automatic model selection, improving the interpretability of the learning result. In a second part we discuss general speed up mechanism for SVMs, especially when used with sparse feature maps as appear for string kernels, allowing us to train a string kernel SVM on a 10 million realworld splice data set from computational biology. We integrated multiple kernel learning in our machine learning toolbox SHOGUN for which the source code is publicly available at
An introduction to boosting and leveraging
 Advanced Lectures on Machine Learning, LNCS
, 2003
"... ..."
Building Support Vector Machines with Reduced Classifier Complexity
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2006
"... Support vector machines (SVMs), though accurate, are not preferred in applications requiring great classification speed, due to the number of support vectors being large. To overcome this problem we devise a primal method with the following properties: (1) it decouples the idea of basis functions ..."
Abstract

Cited by 60 (1 self)
 Add to MetaCart
Support vector machines (SVMs), though accurate, are not preferred in applications requiring great classification speed, due to the number of support vectors being large. To overcome this problem we devise a primal method with the following properties: (1) it decouples the idea of basis functions from the concept of support vectors; (2) it greedily finds a set of kernel basis functions of a specified maximum size (d max ) to approximate the SVM primal cost function well; (3) it is efficient and roughly scales as O(nd max ) where n is the number of training examples; and, (4) the number of basis functions it requires to achieve an accuracy close to the SVM accuracy is usually far less than the number of SVM support vectors.
Matrix exponentiated gradient updates for online learning and Bregman projections
 Journal of Machine Learning Research
, 2005
"... We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that ..."
Abstract

Cited by 50 (10 self)
 Add to MetaCart
We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: Online learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the nondiagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements. 1
Efficient Margin Maximizing with Boosting
, 2002
"... AdaBoost produces a linear combination of base hypotheses and predicts with the sign of this linear combination. It has been observed that the generalization error of the algorithm continues to improve even after all examples are classified correctly by the current signed linear combination, whic ..."
Abstract

Cited by 37 (7 self)
 Add to MetaCart
AdaBoost produces a linear combination of base hypotheses and predicts with the sign of this linear combination. It has been observed that the generalization error of the algorithm continues to improve even after all examples are classified correctly by the current signed linear combination, which can be viewed as hyperplane in feature space where the base hypotheses form the features.
Learning interpretable SVMs for biological sequence classification
 BMC BIOINFORMATICS
, 2005
"... We propose novel algorithms for solving the socalled Support Vector Multiple Kernel Learning problem and show how they can be used to understand the resulting support vector decision function. While classical kernelbased algorithms (such as SVMs) are based on a single kernel, in Multiple Kernel Le ..."
Abstract

Cited by 32 (10 self)
 Add to MetaCart
We propose novel algorithms for solving the socalled Support Vector Multiple Kernel Learning problem and show how they can be used to understand the resulting support vector decision function. While classical kernelbased algorithms (such as SVMs) are based on a single kernel, in Multiple Kernel Learning a quadraticallyconstraint quadratic program is solved in order to find a sparse convex combination of a set of support vector kernels. We show how this problem can be cast into a semiinfinite linear optimization problem which can in turn be solved efficiently using a boostinglike iterative method in combination with standard SVM optimization algorithms. The proposed method is able to deal with thousands of examples while combining hundreds of kernels within reasonable time. In the second part we show how this technique can be used to understand the obtained decision function in order to extract biologically relevant knowledge about the sequence analysis problem at hand. We consider the problem of splice site identification and combine string kernels at different sequence positions and with various substring (oligomer) lengths. The proposed algorithm computes a sparse weighting over the length and the substring, highlighting which substrings are important for discrimination. Finally, we propose a bootstrap scheme in order to reliably identify a few statistically significant positions, which can then be used for further analysis such as consensus finding.
Maximizing the Margin with Boosting
, 2002
"... AdaBoost produces a linear combination of weak hypotheses. It has been observed that the generalization error of the algorithm continues to improve even after all examples are classified correctly by the current linear combination, i.e. by a hyperplane in feature space spanned by the weak hypotheses ..."
Abstract

Cited by 17 (4 self)
 Add to MetaCart
AdaBoost produces a linear combination of weak hypotheses. It has been observed that the generalization error of the algorithm continues to improve even after all examples are classified correctly by the current linear combination, i.e. by a hyperplane in feature space spanned by the weak hypotheses. The improvement is attributed to the experimental observation that the distances (margins) of the examples to the separating hyperplane are increasing even when the training error is already zero, that is all examples are on the correct side of the hyperplane. We give an iterative version of AdaBoost that explicitly maximizes the minimum margin of the examples. We bound the number of iterations and the number of hypotheses used in the final linear combination which approximates the maximum margin hyperplane with a certain precision. Our modified algorithm essentially retains the exponential convergence properties of AdaBoost and our result does not depend on the size of the hypothesis class. 1
On the convergence of leveraging
 In Advances in Neural Information Processing Systems (NIPS
, 2002
"... We give an unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the LeastSquareBoost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly com ..."
Abstract

Cited by 11 (2 self)
 Add to MetaCart
We give an unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the LeastSquareBoost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the GaussSouthwell method known from numerical optimization and state nonasymptotical convergence results for all these methods. Our analysis includes ℓ1norm regularized cost functions leading to a clean and general way to regularize ensemble learning. 1
Posterior probability support vector machines for unbalanced data
 IEEE Trans. Neural Netw
, 2005
"... Abstract—This paper proposes a complete framework of posterior probability support vector machines (PPSVMs) for weighted training samples using modified concepts of risks, linear separability, margin, and optimal hyperplane. Within this framework, a new optimization problem for unbalanced classifi ..."
Abstract

Cited by 9 (1 self)
 Add to MetaCart
Abstract—This paper proposes a complete framework of posterior probability support vector machines (PPSVMs) for weighted training samples using modified concepts of risks, linear separability, margin, and optimal hyperplane. Within this framework, a new optimization problem for unbalanced classification problems is formulated and a new concept of support vectors established. Furthermore, a soft PPSVM with an interpretable parameter is obtained which is similar to theSVM developed by Schölkopf et al., and an empirical method for determining the posterior probability is proposed as a new approach to determine. The main advantage of an PPSVM classifier lies in that fact that it is closer to the Bayes optimal without knowing the distributions. To validate the proposed method, two synthetic classification examples are used to illustrate the logical correctness of PPSVMs and their relationship to regular SVMs and Bayesian methods. Several other classification experiments are conducted to demonstrate that the performance of PPSVMs is better than regular SVMs in some cases. Compared with fuzzy support vector machines (FSVMs), the proposed PPSVM is a natural and an analytical extension of regular SVMs based on the statistical learning theory. Index Terms—Bayesian decision theory, classification, margin, maximal margin algorithms,SVM, posterior probability, sup
Nonsparse multiple kernel learning for fisher discriminant analysis
 In International Conference on Data Mining
, 2009
"... Abstract—We consider the problem of learning a linear combination of prespecified kernel matrices in the Fisher discriminant analysis setting. Existing methods for such a task impose an ℓ1 norm regularisation on the kernel weights, which produces sparse solution but may lead to loss of information. ..."
Abstract

Cited by 5 (4 self)
 Add to MetaCart
Abstract—We consider the problem of learning a linear combination of prespecified kernel matrices in the Fisher discriminant analysis setting. Existing methods for such a task impose an ℓ1 norm regularisation on the kernel weights, which produces sparse solution but may lead to loss of information. In this paper, we propose to use ℓ2 norm regularisation instead. The resulting learning problem is formulated as a semiinfinite program and can be solved efficiently. Through experiments on both synthetic data and a very challenging object recognition benchmark, the relative advantages of the proposed method and its ℓ1 counterpart are demonstrated, and insights are gained as to how the choice of regularisation norm should be made.