## Multiclass multiple kernel learning

Venue: ICML 2007, ACM Press

Citations: 43 (3 self)

### BibTeX

@INPROCEEDINGS{Zien_multiclassmultiple,
  author    = {Alexander Zien and Cheng Soon Ong},
  title     = {Multiclass multiple kernel learning},
  booktitle = {ICML},
  year      = {2007},
  publisher = {ACM Press}
}


### Abstract

In many applications it is desirable to learn from several kernels. “Multiple kernel learning” (MKL) allows the practitioner to optimize over linear combinations of kernels. By enforcing sparse coefficients, it also generalizes feature selection to kernel selection. We propose MKL for joint feature maps. This provides a convenient and principled way for MKL with multiclass problems. In addition, we can exploit the joint feature map to learn kernels on output spaces. We show the equivalence of several different primal formulations including different regularizers. We present several optimization methods, and compare a convex quadratically constrained quadratic program (QCQP) and two semi-infinite linear programs (SILPs) on toy data, showing that the SILPs are faster than the QCQP. We then demonstrate the utility of our method by applying the SILP to three real world datasets.
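The combined kernel at the heart of MKL can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: the helper names (`linear_kernel`, `poly_kernel`, `combine`) are mine, and the simplex weights β are fixed here rather than learned jointly with the SVM.

```python
# Sketch of the core MKL object: a convex combination of base kernels.
# All names here are illustrative; in MKL, beta is optimized, not fixed.

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def poly_kernel(x, y, degree=2):
    return (1.0 + linear_kernel(x, y)) ** degree

def gram(kernel, xs):
    """Gram matrix K[i][j] = k(x_i, x_j)."""
    return [[kernel(a, b) for b in xs] for a in xs]

def combine(grams, betas):
    """K = sum_p beta_p K_p with beta on the simplex (beta_p >= 0, sum = 1).
    MKL optimizes beta jointly with the classifier; here beta is fixed."""
    assert all(b >= 0 for b in betas) and abs(sum(betas) - 1.0) < 1e-9
    n = len(grams[0])
    return [[sum(b * K[i][j] for b, K in zip(betas, grams))
             for j in range(n)] for i in range(n)]

xs = [(1.0, 0.0), (0.0, 2.0), (1.0, 1.0)]
Ks = [gram(linear_kernel, xs), gram(poly_kernel, xs)]
K = combine(Ks, [0.7, 0.3])   # a sparse beta would zero out useless kernels
```

Enforcing sparsity in β is what turns this into kernel selection: base kernels that do not help the classifier receive zero weight.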

### Citations

3667 | Convex Optimization
- Boyd, Vandenberghe
- 2004

Citation context: "...nary. Within each column the primals are identical up to a variable transformation, hence equivalent. The two convex problems are equivalent as they share the same dual (and strong duality holds, cf. [4]). By the chain of equivalences, all shown OPs are equivalent, despite their different regularizers. 2.4. Optimization Recently unconstrained primal optimization is..."

2030 | Learning with Kernels
- Schölkopf, Smola
- 2002

Citation context: "...tasets. 1. Introduction In support vector machines (SVMs), a kernel function k implicitly maps examples x to a feature space given by a feature map Φ via the identity k(xi, xj) = 〈Φ(xi), Φ(xj)〉 (e.g. [19]). It is often unclear what the most suitable kernel for the task at hand is, and hence the user may wish to combine several possible kernels. One problem with simply adding kernels is that using unif..."
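The feature-map identity quoted above can be checked numerically for any kernel with a known explicit feature map. The homogeneous quadratic kernel on R² is a standard textbook choice for this, not one taken from the paper:

```python
import math

# Check k(x, y) = <Phi(x), Phi(y)> for k(x, y) = (x . y)^2 on R^2,
# whose explicit feature map is Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
# Function names are illustrative.

def k(x, y):
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, -1.0)
print(k(x, y), dot(phi(x), phi(y)))  # both equal (1*3 + 2*(-1))^2 = 1.0
```

The same identity is what lets MKL work with kernel matrices only, without ever materializing the (possibly high-dimensional) feature maps.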

545 | Learning the kernel matrix with semidefinite programming
- Lanckriet, Cristianini, et al.

Citation context: "...are quadratically constrained quadratic programs (QCQPs). 2.3. Relation to Previous Work There have been several developments on optimizing a linear combination of kernels while training a predictor [1, 3, 6, 8, 12, 14, 15, 20] ([16, 18] consider general parameterized kernel functions). To show the relationship of our approach to two previous approaches [1, 20], we consider the unfolded dual. For the case of a single kerne..."

312 | Support vector machine learning for interdependent and structured output spaces
- Tsochantaridis, Hofmann, et al.
- 2004

143 | Predicting subcellular localization of proteins based on their N-terminal amino acid sequence
- Emanuelsson, Nielsen, et al.
- 2000

Citation context: "...ls. When comparing our predictor to current state-of-the-art methods, we perform substantially better. Figure 3 summarizes the results in [17] on three datasets. The original plant dataset of TargetP [7] is classified in [11] as a 4-class problem. The method TargetLoc [11] uses three layers of SVMs, the first layer detecting certain features, and the second and third layers combining the outputs of p..."

138 | A statistical framework for genomic data fusion
- Lanckriet, Bie, et al.

Citation context: "...kernels is that using uniform weights is possibly not optimal. An extreme example is the case that one kernel is not correlated with the labels at all – then giving it positive weight just adds noise [13]. Multiple kernel learning (MKL) is a way of optimizing kernel weights while training the SVM. In addition to leading to good classification accuracies, MKL can also be useful for identifying relevant..."
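The "noise kernel" point in this excerpt can be made concrete with kernel-target alignment, a diagnostic due to Cristianini et al. and not part of this paper's method: a kernel built from the labels aligns perfectly with the target matrix yyᵀ, while a random positive semidefinite kernel barely aligns at all, so giving the latter weight only dilutes the signal.

```python
import random

# Kernel-target alignment A(K, yy^T) = <K, yy^T>_F / (||K||_F ||yy^T||_F).
# A quick diagnostic (not this paper's MKL objective) for how correlated a
# kernel is with the labels. All names here are illustrative.

def frob(A, B):
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def alignment(K, y):
    Y = [[yi * yj for yj in y] for yi in y]   # target kernel yy^T
    return frob(K, Y) / (frob(K, K) ** 0.5 * frob(Y, Y) ** 0.5)

random.seed(0)
n = 30
y = [random.choice([-1.0, 1.0]) for _ in range(n)]

K_good = [[yi * yj for yj in y] for yi in y]          # perfectly informative
G = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]
K_noise = [[sum(G[i][k] * G[j][k] for k in range(n))  # PSD but label-independent
            for j in range(n)] for i in range(n)]

print(alignment(K_good, y), alignment(K_noise, y))
```

Sparse MKL weights are one way to drive the coefficient of a kernel like `K_noise` to zero automatically.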

121 | Semi-supervised classification by low density separation
- Chapelle, Zien

Citation context: "...caling of the features changes the resulting weights w (as it is equivalent to a change of the regularizer). Analogously, in MKL the scaling of the kernels affects their resulting coefficients β. In [5] it is argued that a reasonable order of magnitude of the SVM regularization parameter C can be estimated as the inverse of the variance of the points in feature space. Conversely, for C = 1 a reasona..."
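The C heuristic in this excerpt is computable from the Gram matrix alone, since the variance of the points in feature space is mean_i k(x_i, x_i) − mean_{ij} k(x_i, x_j). A minimal sketch, with function names of my choosing; the exact constant of proportionality is the judgment call discussed in [5]:

```python
# Estimate C as the inverse of the variance of the points in feature space,
# using only the Gram matrix K:
#   var = mean_i k(x_i, x_i) - mean_{ij} k(x_i, x_j)
# Function names are illustrative, not from the cited work.

def feature_space_variance(K):
    n = len(K)
    mean_diag = sum(K[i][i] for i in range(n)) / n
    mean_all = sum(sum(row) for row in K) / (n * n)
    return mean_diag - mean_all

def heuristic_C(K):
    return 1.0 / feature_space_variance(K)

# Sanity check with a linear kernel, where feature space = input space:
xs = [0.0, 1.0, 2.0]
K = [[a * b for b in xs] for a in xs]
print(feature_space_variance(K))   # equals Var(xs) = 2/3
print(heuristic_C(K))              # 1.5
```

For a linear kernel this recovers the ordinary input-space variance, which is a useful consistency check on the formula.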

78 | Learning the kernel with hyperkernels
- Ong, Smola, et al.

Citation context: "...quadratic programs (QCQPs). 2.3. Relation to Previous Work There have been several developments on optimizing a linear combination of kernels while training a predictor [1, 3, 6, 8, 12, 14, 15, 20] ([16, 18] consider general parameterized kernel functions). To show the relationship of our approach to two previous approaches [1, 20], we consider the unfolded dual. For the case of a single kernel, p = 1, ..."

73 | On the complexity of learning the kernel matrix
- Bousquet, Herrmann
- 2002

Citation context: "...are quadratically constrained quadratic programs (QCQPs). 2.3. Relation to Previous Work There have been several developments on optimizing a linear combination of kernels while training a predictor [1, 3, 6, 8, 12, 14, 15, 20] ([16, 18] consider general parameterized kernel functions). To show the relationship of our approach to two previous approaches [1, 20], we consider the unfolded dual. For the case of a single kerne..."

72 | Protein function prediction via graph kernels
- Borgwardt, Ong, et al.

Citation context: "...ng time while increasing a single parameter of the base case of 300 examples, 3 classes and 3 kernels. The range of values [min, max] on a log scale used were [100, 5000] examples, [3, 100] classes, and [2, 20] kernels. The QCQP is also a constant amount slower for any particular dataset size." The excerpt then flattens a table of per-parameter scaling slopes (last row truncated in the source):

| Method | Examples | Classes | Kernels |
| --- | --- | --- | --- |
| QP, unfolded | 2.5 | 1.8 | – |
| QCQP, unfolded (9) | 3.0 | 2.0 | 2.3 |
| SILP, unfolded (11) | 2.4 | 1.7 | 1.1 |
| SILP... | | | |

60 | Kernel design using boosting
- Crammer, Keshet, et al.
- 2002

Citation context: "...are quadratically constrained quadratic programs (QCQPs). 2.3. Relation to Previous Work There have been several developments on optimizing a linear combination of kernels while training a predictor [1, 3, 6, 8, 12, 14, 15, 20] ([16, 18] consider general parameterized kernel functions). To show the relationship of our approach to two previous approaches [1, 20], we consider the unfolded dual. For the case of a single kerne..."

37 | A general and efficient multiple kernel learning algorithm
- Sonnenburg, Rätsch, et al.
- 2005

Citation context: "...ful for identifying relevant and meaningful features [2, 13, 20]. Since in many real-world applications more than two classes are to be distinguished, there has been a lot of work on decomposing multiclass problems into several standard binary classification probl..."

32 | MultiLoc: Prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition
- Höglund, Dönnes, et al.

Citation context: "...r predictor to current state-of-the-art methods, we perform substantially better. Figure 3 summarizes the results in [17] on three datasets. The original plant dataset of TargetP [7] is classified in [11] as a 4-class problem. The method TargetLoc [11] uses three layers of SVMs, the first layer detecting certain features, and the second and third layers combining the outputs of previous SVMs. Their av..."

28 | PSORTb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis
- Gardy, Laird, et al.

Citation context: "..., an unweighted sum often performs well. However, here multiclass MKL performs even better. We further compare our approach to the method PSORTb on another two datasets of bacterial protein locations [9]. The psort+ dataset contains 4 classes, and PSORTb achieves an average F1 score of 90.0%, which we outperform with an F1 score of 93.8%. On psort-, a 5-class problem, PSORTb reaches an average F1 sco..."
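For readers unfamiliar with the "average F1 score" used in these multiclass comparisons, a macro-averaged (one-vs-rest) F1 can be sketched as follows; the per-class counts below are made up for illustration and are not the PSORTb data.

```python
# Macro-averaged F1: compute F1 per class from one-vs-rest counts, then
# average the per-class scores. Counts are hypothetical, not from [9] or [17].

def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

def macro_f1(per_class_counts):
    scores = [f1(tp, fp, fn) for tp, fp, fn in per_class_counts]
    return sum(scores) / len(scores)

# hypothetical (tp, fp, fn) triples for a 4-class problem
counts = [(90, 10, 10), (80, 5, 20), (70, 30, 10), (95, 5, 5)]
print(round(100 * macro_f1(counts), 1))   # 87.3
```

Macro averaging weights every class equally, which matters on localization datasets where class sizes are highly unbalanced.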

28 | Optimal kernel selection in kernel Fisher discriminant analysis
- Kim, Magnani, et al.
- 2006

17 | A fast iterative algorithm for Fisher discriminant using heterogeneous kernels
- Fung, Dundar, et al.
- 2004

14 | Feature space perspectives for learning the kernel
- Micchelli, Pontil
- 2007

9 | Semi-Infinite Programming: Theory
- Hettich, Kortanek
- 1993

Citation context: "...gives rise to a constraint on θ which is linear in β. We alternate generating new constraints in this way and solving the LP with the constraints collected so far. This procedure is known to converge [10, 20]. Hence our seemingly complicated problem can be solved with off-the-shelf solvers. In our implementation, we used CPLEX with the dual simplex method for both QPs and LPs."
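The alternation described in this excerpt (solve the LP restricted to the constraints collected so far, then generate the most violated constraint) can be shown on a toy semi-infinite program. This mirrors only the structure of the SILP, not the actual MKL constraints: `g` and the grid search are hypothetical stand-ins.

```python
# Column generation for a toy semi-infinite program:
#   maximize theta  s.t.  theta <= g(t)  for all t in [0, 1],
# whose optimum is min_t g(t). 'g' is a made-up constraint family; in the
# paper's SILP each SVM solution generates one linear constraint on theta.

def g(t):
    return (t - 0.3) ** 2 + 0.5   # minimum value 0.5, attained at t = 0.3

def most_violated(grid=1000):
    """Constraint generation: the t whose constraint theta <= g(t) is tightest."""
    ts = [i / grid for i in range(grid + 1)]
    return min(ts, key=g)

def solve_silp(tol=1e-6, max_iter=100):
    active = [0.0]                            # start with a single constraint
    theta = g(active[0])
    for _ in range(max_iter):
        theta = min(g(t) for t in active)     # restricted problem: best theta
        t_new = most_violated()
        if g(t_new) >= theta - tol:           # nothing violated: converged
            break
        active.append(t_new)                  # add the new cutting plane
    return theta

print(solve_silp())
```

In the real SILP the restricted problem is an LP over (θ, β) solved with an off-the-shelf solver (the authors use CPLEX), and each intermediate SVM solution plays the role of `most_violated`.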

9 | An Automated Combination of Kernels for Predicting Protein Subcellular Localization
- Ong, Zien
- 2008

Citation context: "...es, 3 kernels from BLAST E-values, and 64 sequence motif kernels. When comparing our predictor to current state-of-the-art methods, we perform substantially better. Figure 3 summarizes the results in [17] on three datasets. The original plant dataset of TargetP [7] is classified in [11] as a 4-class problem. The method TargetLoc [11] uses three layers of SVMs, the first layer detecting certain feature..."