## Composite kernel learning (2008)


Venue: Proc. ICML

Citations: 23 (3 self)

### BibTeX

@INPROCEEDINGS{Szafranski08compositekernel,
  author    = {Marie Szafranski and Yves Grandvalet and Alain Rakotomamonjy},
  title     = {Composite kernel learning},
  booktitle = {Proc. ICML},
  year      = {2008}
}

### Abstract

The Support Vector Machine (SVM) is an acknowledged powerful tool for building classifiers, but it lacks flexibility, in the sense that the kernel is chosen prior to learning. Multiple Kernel Learning (MKL) makes it possible to learn the kernel from an ensemble of basis kernels, whose combination is optimized during learning. Here, we propose Composite Kernel Learning to address the situation where distinct components give rise to a group structure among kernels. Our formulation of the learning problem encompasses several setups, putting more or less emphasis on the group structure. We characterize the convexity of the learning problem, and provide a general wrapper algorithm for computing solutions. Finally, we illustrate the behavior of our method on multi-channel data where groups correspond to channels.

### Citations

9827 | The Nature of Statistical Learning Theory
- Vapnik
- 1995
Citation Context: ...= K(x, x′). In other words, the kernel defines (1) the hypothesis space H; (2) the complexity measure ‖f‖²_H indexing the family of nested functional spaces in the structural risk minimization principle (Vapnik 1995); (3) the representation space of data endowed with a scalar product. These observations motivate the development of means to avoid the use of unsupported kernels, which do not represent prior knowled... |

4037 | Convex Optimization
- Boyd, Vandenberghe
- 2004
Citation Context: ...nvexity: Problem (10) is convex if and only if 0 ≤ q ≤ 1 and 0 ≤ p + q ≤ 1. Proof. A problem minimizing a convex criterion on a convex set is convex. The objective function of Problem (10) is convex (Boyd & Vandenberghe, 2004, p. 89). The first, second and fourth constraints define convex sets, and the third one does too, provided (i) 0 ≤ q ≤ 1 and (ii) 0 ≤ p + q ≤ 1, since Σ_ℓ (Σ_{m∈Gℓ} σ_m^{1/q})^q t_ℓ^{1/(p+q)} is a norm, that is c... |

2048 | Regression shrinkage and selection via the lasso
- Tibshirani
- 1996
Citation Context: ... particular cases listed below, such algorithms exist. They all consider γ₀ = 1, which enforces sparseness at the group level, and identical norms {γℓ}ℓ=1..L at the parameter level: • γℓ = 1 is the LASSO (Tibshirani 1996), which clears the group structure; • γℓ = 4/3 is the Hierarchical Penalization (Szafranski et al. 2008a), which gives rise to few dominant variables within groups; • γℓ = 2 is the group-LASSO (Yuan ... |
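The special cases listed in this context are mixed norms: an outer ℓ1 sum over groups of inner ℓ_γ norms on the coefficients. A minimal numpy sketch (illustrative only; the data and variable names are my own, not the paper's) showing how γ = 1 collapses the penalty to the plain LASSO, while γ = 2 yields the group-LASSO:

```python
import numpy as np

def mixed_norm_penalty(beta, groups, gamma):
    """Omega(beta) = sum over groups l of ||beta_{G_l}||_gamma.
    The outer l1 sum over groups enforces sparsity at the group level."""
    return sum(np.linalg.norm(beta[g], ord=gamma) for g in groups)

beta = np.array([1.0, -2.0, 0.0, 3.0])
groups = [[0, 1], [2, 3]]           # two groups of two coefficients

# gamma = 1: plain LASSO penalty, the group structure vanishes
lasso = mixed_norm_penalty(beta, groups, 1)      # |1| + |-2| + |0| + |3| = 6
# gamma = 2: group-LASSO penalty (sum of per-group Euclidean norms)
glasso = mixed_norm_penalty(beta, groups, 2)     # sqrt(5) + 3
# gamma = 4/3: hierarchical penalization (few dominant variables per group)
hp = mixed_norm_penalty(beta, groups, 4 / 3)
```

Note that for γ = 1 the groups become invisible (the penalty is just Σ|βi|), which is exactly what the quoted passage means by "clears the group structure".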

792 | An introduction to variable and feature selection - Guyon, Elisseeff - 2003 |

578 | Learning the kernel matrix with semi-definite programming - Lanckriet, Cristianini, et al. - 2002 |

443 | Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond
- Schölkopf, Smola
- 2002
Citation Context: ..., . . . , H_M, one may consider the penalty Σ_{m=1}^M ‖fm‖_Hm = Σ_{m=1}^M (αm⊤ Km αm)^{1/2}, where αm ∈ Rⁿ, Km is the mth kernel matrix, Km(i, j) = Km(xi, xj), and fm(x) = Σ_{i=1}^n αm(i) Km(xi, x). The representer theorem (Schölkopf and Smola 2001) ensures that the fm solving the MKL Problem (3a) are of the above form. Hence, MKL may be seen as a kernelization of the LASSO, extended to SVM classifiers, whose penalty generalizes the ones proposed i... |
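The penalty quoted here can be evaluated directly from Gram matrices. A small numpy sketch (hypothetical data and bandwidths, not the authors' code) computing ‖fm‖_Hm = (αm⊤ Km αm)^{1/2} for a few Gaussian basis kernels and summing them, as in the MKL block penalty:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # n = 20 points in R^3

def gaussian_kernel(X, bandwidth):
    """Gram matrix K(i, j) = exp(-||xi - xj||^2 / (2 * bandwidth^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

# M = 3 basis kernels with different bandwidths
kernels = [gaussian_kernel(X, s) for s in (0.5, 1.0, 2.0)]
alphas = [rng.normal(size=20) for _ in kernels]

# ||f_m||_{H_m} = sqrt(alpha_m^T K_m alpha_m); the MKL penalty sums these norms
norms = [np.sqrt(a @ K @ a) for a, K in zip(alphas, kernels)]
mkl_penalty = sum(norms)
```

Each term is non-negative because every Gaussian Gram matrix Km is positive semi-definite, so the square root is always well defined.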

323 | Choosing Multiple Parameters for Support Vector Machines
- Chapelle
- 2002
Citation Context: ...the inner loop of two nested optimizers, whose outer loop is dedicated to adjusting the kernel. This tuning may be guided by various generalization bounds (Cristianini et al., 1999; Weston et al., 2001; Chapelle et al., 2002). Kernel learning can also be embedded in Problem (1), with the SVM objective value minimized jointly with respect to the SVM parameters and the kernel hyper-parameters (Grandvalet & Canu, 2003). Our... |

294 | Multiple kernel learning, conic duality
- Bach, Lanckriet, et al.
- 2004
Citation Context: ...The original MKL formulation of Lanckriet et al. (2004) was based on the dual of the SVM optimization problem. It was later shown to be equivalent to the following primal problem (Bach et al. 2004):

min over f1, …, fM, b, ξ of (1/2)(Σ_{m=1}^M ‖fm‖_Hm)² + C Σ_{i=1}^n ξi (3a)
s.t. yi(Σ_{m=1}^M fm(xi) + b) ≥ 1 − ξi, ξi ≥ 0, i = 1, …, n, (3b)

whose solution leads to a decision rule of the fo... |
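A property this primal formulation relies on: a convex combination Σm dm Km (dm ≥ 0, Σ dm = 1) of positive semi-definite Gram matrices is again a valid positive semi-definite kernel matrix, so the weighted kernel can be handed to any standard SVM solver. A quick numerical check (a sketch with made-up data; not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))          # n = 15 points in R^2

def gaussian_kernel(X, bandwidth):
    """Gram matrix of a Gaussian kernel with the given bandwidth."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

# basis kernels and simplex weights d_m >= 0 with sum(d) = 1
Ks = [gaussian_kernel(X, s) for s in (0.3, 1.0, 3.0)]
d = np.array([0.2, 0.5, 0.3])

# combined kernel K = sum_m d_m K_m
K = sum(w * Km for w, Km in zip(d, Ks))

# eigenvalues of the combined Gram matrix stay (numerically) non-negative
eigmin = np.linalg.eigvalsh(K).min()
```

Learning the kernel then amounts to optimizing the weights d on the simplex jointly with the SVM parameters, which is what MKL (and CKL, with a group structure on the weights) does.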

247 | On kernel target alignment - Cristianini, Kandola, et al. - 2002 |

242 | Large scale multiple kernel learning - Sonnenburg - 2006 |

218 | Feature selection for SVMs
- Weston
- 2000
Citation Context: ...issues to poor performances (see for example Weston et al. 2001, Grandvalet and Canu 2003). "Learning the kernel" aims at alleviating these problems, by adapting the kernel to the problem at hand. A general model of learning the kernel has two components: (i) a ... | (Quoted from the author manuscript published in "Machine Learning" 79(1), 73–103, 2010; DOI: 10.1007/s10994-009-5150-6.)

192 | Talking off the top of your head: Toward a mental prosthesis utilizing event-related brain potentials. Electroencephalography and Clinical Neurophysiology 70, 510–523
- Farwell, Donchin
- 1988
Citation Context: ...Protocol. The so-called oddball paradigm states that a rare expected stimulus produces a positive deflection in an EEG signal after about 300 ms. The P300 speller interface is based on this paradigm (Farwell and Donchin 1988). Its role is to trigger a related event potential, namely the P300, in response to a visual stimulus. This protocol uses a matrix composed of 6 rows and 6 columns of letters and numbers, as illustra... |

179 | Stability and generalization - Bousquet, Elisseeff |

164 | Heuristics of instability and stabilization in model selection
- Breiman
- 1996
Citation Context: ...ently combinatorial problem, for which finding a global optimum is challenging even with a small number of kernels. Second, this type of hard selection problem is known to provide unstable solutions (Breiman 1996), especially when the number of kernels is not orders of magnitude lower than the training set size. Instability refers here to large changes in the overall predictor, in particular via the changes i... |

144 | Convex multi-task feature learning - Argyriou, Evgeniou, et al. |

87 | Contingent negative variation: an electrical sign of sensorimotor association and expectancy in the human brain. Nature
- Walter, Cooper, et al.
- 1964
Citation Context: ...g some activated regions in the brain when an event is being anticipated (Garipelli et al., to appear). The potentials are here recorded according to the Contingent Negative Variation (CNV) paradigm (Walter et al. 1964). In this paradigm, a warning stimulus predicts the appearance of an imperative stimulus in a predictable inter-stimulus interval. More precisely, an experiment proceeds as follows. A subject, looki... |

59 | Dynamically Adapting Kernels in Support Vector Machines
- Cristianini
- 1998
Citation Context: ...fier. In wrapper algorithms, the SVM solver is the inner loop of two nested optimizers, whose outer loop is dedicated to adjusting the kernel. This tuning may be guided by various generalization bounds (Cristianini et al. 1999, Weston et al. 2001, Chapelle et al. 2002). In all these methods, the set of admissible kernels K is defined by kernel parameter(s) θ, where θ may be the kernel bandwidth, or a diagonal or a full cov... |

58 | Local strong homogeneity of a regularized estimator - Nikolova |

46 | Adaptive scaling for feature selection in SVMs
- Grandvalet, Canu
- 2003
Citation Context: ...rmances (see for example Weston et al. 2001, Grandvalet and Canu 2003). "Learning the kernel" aims at alleviating these problems, by adapting the kernel to the problem at hand. A general model of learning the kernel has two components: (i) a family of kernels, that is... |

41 | The BCI competition 2003: progress and perspectives in detection and discrimination of EEG single trials
- Blankertz, Müller, et al.
- 2004
Citation Context: ...for BCI real-life applications, since it makes the acquisition system easier to use and to set up. We use here the dataset from the BCI 2003 competition for the task of interfacing the P300 Speller (Blankertz et al., 2004). The dataset consists of 7560 EEG signals paired with positive or negative stimuli responses. The signal, processed as in (Rakotomamonjy et al., 2005), leads to 7560 examples of dimension 896 (14 ti... |

41 | An extended level method for efficient multiple kernel learning
- Xu, Jin, et al.
- 2009
Citation Context: ...for which several efficient implementations exist. This type of approach was also used in the multiple task learning framework by Argyriou et al. (2008), and again in some recent developments of MKL (Xu et al. 2009, Bach 2009). We first chose the gradient-based approach that was demonstrated to be efficient for MKL (Szafranski et al. 2008b). Nevertheless, moving along a curved surface such as the ones illustrat... |

39 | A DC-programming algorithm for kernel selection - Argyriou, Hauser, et al. - 2006 |

31 | No unbiased estimator of the variance of k-fold cross-validation - Bengio, Grandvalet - 2003 |

30 | Learning bounds for support vector machines with learned kernels - Srebro, Ben-David - 2006 |

25 | Sparsity and persistence: mixed norms provide simple signal models with dependent coefficients
- Kowalski, Torrésani
Citation Context: ...ion where we want all sources to participate in the solution, but where the relevant similarities are to be discovered for each source. It has been used in the regression framework for audio signals (Kowalski and Torrésani 2008). The fourth solution, leading to an ℓ(1, 4/3) norm, is the kernelized version of hierarchical penalization (Szafranski et al. 2008a), which takes into account the group structure, provides sparse resu... |

8 | Optimization problems with perturbations: A guided tour - Bonnans, Shapiro |

8 | Robust EEG channel selection across subjects for brain computer interfaces
- Schröder, Lal, et al.
- 2005
Citation Context: ...sents the frontal direction. Automated channel selection has to be performed for each single subject, since it leads to better performances or a substantial reduction of the number of useful channels (Schröder et al. 2005). Reducing the number of channels involved in the decision function is of primary importance for BCI real-life applications, since it makes the acquisition system cheaper, easier to use and to set up... |

1 | BCI competition 3: Dataset 2 - ensemble of SVM for BCI P300 speller - Rakotomamonjy, Guigue |