## Choosing multiple parameters for support vector machines (2002)

### Download Links

- [www.research.microsoft.com]
- [research.microsoft.com]
- [olivier.chapelle.cc]
- [www.ai.mit.edu]
- [www.cmap.polytechnique.fr]
- DBLP

### Other Repositories/Bibliography

Venue: Machine Learning

Citations: 300 (15 self)

### BibTeX

@ARTICLE{Chapelle02choosingmultiple,
  author = {Olivier Chapelle and Vladimir Vapnik and Olivier Bousquet and Sayan Mukherjee},
  title = {Choosing multiple parameters for support vector machines},
  journal = {Machine Learning},
  volume = {46},
  number = {1--3},
  pages = {131--159},
  year = {2002}
}

### Abstract

The problem of automatically tuning multiple parameters for pattern recognition Support Vector Machines (SVMs) is considered. This is done by minimizing some estimate of the generalization error of SVMs using a gradient descent algorithm over the set of parameters. Usual methods for choosing parameters, based on exhaustive search, become intractable as soon as the number of parameters exceeds two. Experimental results assess the feasibility of our approach for a large number of parameters (more than 100) and demonstrate an improvement of generalization performance.
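As a rough illustration of the approach the abstract describes, the sketch below minimizes a stand-in error estimate T(θ) over several parameters by finite-difference gradient descent. The paper itself differentiates analytic estimates such as R²/M² or the span bound; the function `T` and all names here are hypothetical.

```python
def num_gradient(T, theta, eps=1e-6):
    """Central-difference gradient of T at theta."""
    grad = []
    for i in range(len(theta)):
        up, down = theta[:], theta[:]
        up[i] += eps
        down[i] -= eps
        grad.append((T(up) - T(down)) / (2 * eps))
    return grad

def tune(T, theta, lr=0.1, steps=200):
    """Gradient descent on the error estimate T over all parameters at once."""
    for _ in range(steps):
        g = num_gradient(T, theta)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

# Hypothetical stand-in for an error estimate, minimized at theta_i = i.
T = lambda theta: sum((t - i) ** 2 for i, t in enumerate(theta))
```

Unlike a grid search, the cost per step grows only linearly with the number of parameters, which is what makes tuning more than two parameters tractable.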

### Citations

8984 | The Nature of Statistical Learning Theory
- Vapnik
- 1995
Citation Context ...ort vector), we can restrict the preceding sum to support vectors and upper bound each term in the sum by 1, which gives the following bound on the number of errors made by the leave-one-out procedure [17]: T = N_SV / ℓ, where N_SV denotes the number of support vectors. 3.2.2 Jaakkola-Haussler bound. For SVMs without threshold, analyzing the optimization performed by the SVM algorithm when computing the ...
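The support-vector count bound quoted in this context is simple to compute. A minimal sketch, assuming the trained SVM's multipliers α_p are available (function name hypothetical):

```python
def loo_sv_bound(alphas, tol=1e-12):
    """Upper bound [17] on the leave-one-out error rate: N_SV / l,
    where N_SV is the number of support vectors (alpha_p > 0)."""
    n_sv = sum(1 for a in alphas if a > tol)
    return n_sv / len(alphas)
```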

2173 | Support-Vector networks
- Cortes, Vapnik
- 1995
Citation Context ...ept that the constraint Σ_i α_i y_i = 0 disappears. Dealing with non-separability: for the non-separable case, one needs to allow training errors, which results in the so-called soft margin SVM algorithm [4]. It can be shown that soft margin SVMs with quadratic penalization of errors can be considered as a special case of the hard margin version with the modified kernel [4, 16]: K ← K + (1/C) I, (3) where I is ...
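The modified-kernel trick of Eq. (3) amounts to adding 1/C to the diagonal of the kernel matrix. A minimal sketch on a plain list-of-lists matrix (function name hypothetical):

```python
def soft_margin_kernel(K, C):
    """Eq. (3): K <- K + (1/C) I, so that a quadratic-penalty soft margin SVM
    becomes a hard-margin problem with a modified kernel matrix K."""
    n = len(K)
    return [[K[i][j] + (1.0 / C if i == j else 0.0) for j in range(n)]
            for i in range(n)]
```

This is why C can be treated as just another kernel parameter in the gradient-based tuning scheme.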

1227 | Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring
- Golub, Slonim, et al.
- 1999
Citation Context ...problem, among the filter methods only the Kolmogorov-Smirnov test improved performance over standard SVMs. 8.2. DNA microarray data. Next, we tested this idea on two leukemia discrimination problems (Golub et al., 1999) and a problem of predicting treatment outcome for Medulloblastoma. The first problem was to classify myeloid versus lymphoblastic leukemias based on the expression of 7129 g...

943 | An Introduction to Support Vector Machines
- Cristianini, Shawe-Taylor
- 2000
Citation Context ...nik, 1995). It can be shown that soft margin SVMs with quadratic penalization of errors can be considered as a special case of the hard margin version with the modified kernel (Cortes & Vapnik, 1995; Cristianini & Shawe-Taylor, 2000): K ← K + (1/C) I, (3) where I is the identity matrix and C a constant penalizing the training errors. In the rest of the paper, we will focus on the hard margin SVM and use (3) whenever we have to dea...

272 | Perturbation analysis of optimization problems - Bonnans, Shapiro - 2000

255 | Soft margins for AdaBoost
- Rätsch, Onoda, et al.
Citation Context ...Titanic 22.42 ± 1.02 | 22.88 ± 1.23 | 22.5 ± 0.88. Table 1: Test error found by different algorithms for selecting the SVM parameters C and σ. The first column reports the results from [14]. In the second and last column, the parameters are found by minimizing R^2/M^2 and the span-bound using a gradient descent algorithm. 7.2 Benchmark databases. In a first set of experiments, we tried to...

145 | Handbook of Matrices
- Lütkepohl
- 1996
Citation Context ...optimal value of α̃ is H^{-1} v, it follows: S_p^2 = K(x_p, x_p) − v^T H^{-1} v = 1 / (K̃_SV^{-1})_pp. (12) The last equality comes from the following block matrix identity, known as the "Woodbury" formula [11]: (A_1, A; A^T, A_2)^{-1} = (B_1, B; B^T, B_2), where B_1 = (A_1 − A A_2^{-1} A^T)^{-1}. The closed form we obtain is particularly attractive since we can compute the value of the span for each support vector ...
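Given a small invertible matrix standing in for K̃_SV, the span values of Eq. (12) are just reciprocals of the diagonal of its inverse. A toy 2×2 sketch (names hypothetical; in the paper K̃ is an extended matrix that also accounts for the threshold, which this sketch glosses over):

```python
def inv2(M):
    """Inverse of a 2x2 matrix given as a list of lists."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def spans(K_sv):
    """Span values via Eq. (12): S_p^2 = 1 / (K_sv^{-1})_pp, 2x2 case."""
    Kinv = inv2(K_sv)
    return [1.0 / Kinv[p][p] for p in range(2)]
```

The attraction noted in the context is that one matrix inversion yields the span of every support vector at once.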

104 | Probabilistic kernel regression models
- Jaakkola, Haussler
- 1999
Citation Context ...r of support vectors. 3.2.2 Jaakkola-Haussler bound. For SVMs without threshold, analyzing the optimization performed by the SVM algorithm when computing the leave-one-out error, Jaakkola and Haussler [8] proved the inequality: y_p(f^0(x_p) − f^p(x_p)) ≤ α_p^0 K(x_p, x_p) = U_p, which leads to the following upper bound: T = (1/ℓ) Σ_{p=1}^{ℓ} Ψ(α_p^0 K(x_p, x_p) − 1). Note that Wahba et al. [20] proposed ...

97 | Estimating the generalization performance of a SVM efficiently - Joachims - 2000

82 | Model selection for support vector machines
- Chapelle, Vapnik
Citation Context ...ollowing upper bound on the number of errors of the leave-one-out procedure: T = (1/ℓ) R^2 / M^2, where R and M are the radius and the margin as defined in Theorem 1. 3.2.5 Span bound. Vapnik and Chapelle [19, 3] derived an estimate using the concept of span of support vectors. Under the assumption that the set of support vectors remains the same during the leave-one-out procedure, the following equality is t...
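The radius-margin bound quoted in this context is a one-liner once R, M and ℓ are known (function name hypothetical):

```python
def radius_margin_bound(R, M, l):
    """Radius-margin bound on the leave-one-out error rate: T = (1/l) R^2 / M^2,
    with R the radius of the enclosing sphere and M the margin."""
    return (R ** 2) / (M ** 2) / l
```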

63 | Face detection in still gray images
- Heisele, Pontil
Citation Context ...60 genes using the gradient descent on R^2/M^2 we achieved an error of 15. 8.3 Face detection. The trainable system for detecting frontal and near-frontal views of faces in gray images presented in [7] gave good results in terms of detection rates. The system used gray values of 19×19 images as inputs to a second-degree polynomial kernel SVM. This choice of kernel led to more than 40,000 features ...

63 | Bounds on error expectation for support vector machines
- Vapnik, Chapelle
Citation Context ...Suppose that the maximal distance is equal to M and that the images Φ(x_1), ..., Φ(x_ℓ) of the training vectors x_1, ..., x_ℓ are within a sphere of radius R. Then the following theorem holds true [19]. Theorem 1. Given a training set Z = {(x_1, y_1), ..., (x_ℓ, y_ℓ)} of size ℓ, a feature space H and a hyperplane (w, b), the margin M(w, b, Z) and the radius R(Z) are defined by M(w, b, Z) = mi...

57 | Dynamically adapting kernels in support vector machines
- Cristianini, Campbell, et al.
- 1998
Citation Context ...criterion. We note P̃(x) as an abbreviation for P̃_{A,B}(x). When both T and the SVM solution are continuous with respect to θ, a better approach has been proposed by Cristianini et al. [5]: using an incremental optimization algorithm, one can train an SVM with little effort when θ is changed by a small amount. However, as soon as θ has more than one component, computing T(θ) for ever...

51 |
On estimation of characters obtained in statistical procedure of recognition
- Luntz, Brailovsky
- 1969
Citation Context ...shion one tests all ℓ elements of the training data (using ℓ different decision rules). Let us denote the number of errors in the leave-one-out procedure by L(x_1, y_1, ..., x_ℓ, y_ℓ). It is known [10] that the leave-one-out procedure gives an almost unbiased estimate of the expected generalization error: Lemma 1. E p_err^{ℓ−1} = (1/ℓ) E(L(x_1, y_1, ..., x_ℓ, y_ℓ)), where p_err^{ℓ−1} is the probabi...

46 |
Probabilities for support vector machines
- Platt
- 1999
Citation Context ...tly a smooth approximation of the test error by estimating posterior probabilities. Recently, Platt proposed the following estimate of the posterior distribution P(Y = 1 | X = x) of an SVM output f(x) [13]: P̃_{A,B}(x) = P̃(Y = 1 | X = x) = 1 / (1 + exp(A f(x) + B)) ...
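Platt's sigmoid estimate quoted above can be sketched directly (function name hypothetical; in practice A is fitted on held-out data and comes out negative, so that larger SVM outputs map to larger posteriors):

```python
import math

def platt_posterior(f, A, B):
    """Platt's [13] sigmoid estimate of P(Y = 1 | X = x)
    from the raw SVM output f = f(x)."""
    return 1.0 / (1.0 + math.exp(A * f + B))
```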

25 | Robust bounds on the generalization from the margin distribution
- Shawe-Taylor, Cristianini
- 1998
Citation Context ...alled soft margin SVM algorithm [4]. It can be shown that soft margin SVMs with quadratic penalization of errors can be considered as a special case of the hard margin version with the modified kernel [4, 16]: K ← K + (1/C) I, (3) where I is the identity matrix and C a constant penalizing the training errors. We will focus in the remainder on the hard margin SVM and use (3) whenever we have to deal with non-sep...

23 | Gradient-based optimization of hyperparameters
- Bengio
- 2000
Citation Context ... and one rather looks for a way to optimize T along a trajectory in the kernel parameter space. Using the gradient of a model selection criterion to optimize the model parameters has been proposed in =-=[2]-=- and demonstrated in the case of linear regression and time-series prediction. It has also been proposed by [9] to optimize the regularization parameters of a neural network. Here we propose an algori... |

22 | Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring, Science 286
- Golub, Slonim, et al.
- 1999
Citation Context ...res. The x-axis is the number of training points, and the y-axis the test error as a fraction of test points. 8.2 DNA Microarray Data. Next, we tested this idea on two leukemia discrimination problems [6] and a problem of predicting treatment outcome for Medulloblastoma [1]. The first problem was to classify myeloid versus lymphoblastic leukemias based on the expression of 7129 genes. The training set c...

18 | Gaussian Processes and SVM: Mean Field and Leave-One-Out - Opper, Winther - 2000

14 | Adaptive regularization in neural network modeling, in Neural Networks
- Larsen, Svarer, et al.
- 1998
Citation Context ...ent of a model selection criterion to optimize the model parameters has been proposed in [2] and demonstrated in the case of linear regression and time-series prediction. It has also been proposed by [9] to optimize the regularization parameters of a neural network. Here we propose an algorithm that alternates the SVM optimization with a gradient step in the direction of the gradient of T in the para...

14 | Generalized approximate cross validation for support vector machines - Wahba, Lin, et al. - 2000

13 |
Feature selection for support vector machines
- Weston, Mukherjee, et al.
- 2000
Citation Context ...cies. In the two following artificial datasets our objective was to assess the ability of the algorithm to select a small number of target features in the presence of irrelevant and redundant features [21]. For the first example, six dimensions of 202 were relevant. The probability of y = 1 or −1 was equal. The first three features {x_1, x_2, x_3} were drawn as x_i = y N(i, 1) and the second three feature...

7 | Generalized approximate cross validation for support vector machines
- Wahba, Lin, et al.
- 2000
Citation Context ...d Haussler [8] proved the inequality: y_p(f^0(x_p) − f^p(x_p)) ≤ α_p^0 K(x_p, x_p) = U_p, which leads to the following upper bound: T = (1/ℓ) Σ_{p=1}^{ℓ} Ψ(α_p^0 K(x_p, x_p) − 1). Note that Wahba et al. [20] proposed an estimate of the number of errors made by the leave-one-out procedure, which in the hard margin SVM case turns out to be T = Σ_p α_p^0 K(x_p, x_p), which can be seen as an upper bound of t...
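The two leave-one-out estimates quoted in this context compare directly: since the step function satisfies Ψ(x − 1) ≤ x for x ≥ 0, Wahba's sum dominates the Jaakkola-Haussler count. A minimal sketch of both error counts (function names hypothetical):

```python
def jh_errors(alphas, K_diag):
    """Jaakkola-Haussler bound [8]: number of points p
    with alpha_p * K(x_p, x_p) >= 1."""
    return sum(1 for a, k in zip(alphas, K_diag) if a * k >= 1)

def wahba_errors(alphas, K_diag):
    """Wahba et al. [20]: sum_p alpha_p * K(x_p, x_p); upper-bounds
    jh_errors because step(x - 1) <= x for x >= 0."""
    return sum(a * k for a, k in zip(alphas, K_diag))
```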

5 | Feature selection for face detection
- Serre, Heisele, et al.
- 2000
Citation Context ...al-time, reducing the dimensionality of the input space and the feature space was required. Feature selection in principal components space was used to reduce the dimensionality of the input space [15]. The method was evaluated on the large CMU test set 1 consisting of 479 faces and about 57,000,000 non-face patterns. In Figure 9, we compare the ROC curves obtained for different numbers of selected ...

2 | Estimating the generalization performance of a SVM efficiently - Joachims - 2000

1 | Gaussian processes and SVM: Mean field and leave-one-out
- Opper, Winther
- 2000
Citation Context ...Σ_p α_p^0 K(x_p, x_p), which can be seen as an upper bound of the Jaakkola-Haussler one since Ψ(x − 1) ≤ x for x ≥ 0. 3.2.3 Opper-Winther bound. For hard margin SVMs without threshold, Opper and Winther [12] used a method inspired from linear response theory to prove the following: under the assumption that the set of support vectors does not change when removing the example p, we have y_p(f^0(x_p) − f...

1 | Gradient-based optimization of hyper-parameters - unknown authors - 2000