## On the Influence of the Kernel on the Consistency of Support Vector Machines (2001)

### Download Links

- [www.jmlr.org]
- [jmlr.csail.mit.edu]
- [www0.cs.ucl.ac.uk]
- [jmlr.org]
- [cbio.ensmp.fr]
- DBLP

### Other Repositories/Bibliography

Venue: Journal of Machine Learning Research

Citations: 170 (20 self)

### BibTeX

```bibtex
@article{Steinwart01onthe,
  author  = {Ingo Steinwart},
  title   = {On the Influence of the Kernel on the Consistency of Support Vector Machines},
  journal = {Journal of Machine Learning Research},
  year    = {2001},
  volume  = {2},
  pages   = {67--93}
}
```

### Abstract

In this article we study the generalization ability of several classifiers of support vector machine (SVM) type that use a certain class of kernels we call universal. We show that soft margin algorithms with universal kernels are consistent for a large class of classification problems, including certain noisy tasks, provided that the regularization parameter is chosen appropriately. In particular, we derive a simple sufficient condition on this parameter for Gaussian RBF kernels. On the one hand, our considerations are based on an investigation of an approximation property, the so-called universality, of the kernels used, which ensures that every continuous function can be approximated by certain kernel expressions. This approximation property also gives new insight into the role of kernels in these and other algorithms. On the other hand, the results rest on a precise study of the optimization problems underlying the classifiers. Furthermore, we show consistency for the maximal margin classifier as well as for soft margin SVMs in the presence of large margins. In this case it turns out that even constant regularization parameters ensure consistency for soft margin SVMs. Finally, we prove that even for simple, noise-free classification problems, SVMs with polynomial kernels can behave arbitrarily badly.
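The universality property can be illustrated numerically: finite kernel expressions of the form Σᵢ aᵢ k(xᵢ, ·) with a Gaussian RBF kernel can approximate a continuous function in supremum norm. The sketch below fits such an expansion by least squares; the target function, grid sizes, and kernel width are arbitrary illustrative choices, and least squares stands in for the SVM training procedure, which the paper does not use for this purpose.

```python
import numpy as np

def rbf_kernel(x, centers, sigma=5.0):
    """Gaussian RBF kernel matrix k(x_i, c_j) = exp(-sigma^2 * (x_i - c_j)^2)."""
    return np.exp(-sigma**2 * (x[:, None] - centers[None, :])**2)

def target(x):
    # a continuous function on [0, 1] to approximate (illustrative choice)
    return np.cos(3.0 * x)

centers = np.linspace(0.0, 1.0, 30)      # expansion points x_i
x_train = np.linspace(0.0, 1.0, 200)
K = rbf_kernel(x_train, centers)
coef, *_ = np.linalg.lstsq(K, target(x_train), rcond=None)

# evaluate the kernel expansion on fresh points and measure the sup-norm error
x_test = np.linspace(0.0, 1.0, 57)
approx = rbf_kernel(x_test, centers) @ coef
max_err = np.max(np.abs(approx - target(x_test)))
print(f"sup-norm error of the kernel expansion: {max_err:.2e}")
```

Increasing the number of expansion points drives the error to zero, which is the content of the density statement for universal kernels.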

### Citations

1698 | An Introduction to Support Vector Machines and Other Kernel-based Learning Methods
- Cristianini, Shawe-Taylor
- 2000
Citation Context: ...ctor machines, other concepts such as data dependent structural risk minimization, e.g. in terms of the observed margin, were introduced (cf. Shawe-Taylor et al., 1998; Bartlett & Shawe-Taylor, 1999; Cristianini & Shawe-Taylor, 2000, chap. 2). The latter usually needs large margins on the training sets to provide good bounds. It is, however, open which distributions and kernels guarantee this assumpt...

1048 | A Probabilistic Theory of Pattern Recognition
- Devroye, Györfi, et al.
- 1996
Citation Context: ...B_1(P) := {x ∈ X : P(y = 1|x) > P(y = −1|x)}, B_0(P) := {x ∈ X : P(y = 1|x) = P(y = −1|x)} and a function f_0 : X → {−1, 1} with f_0(x) = 1 if x ∈ B_1(P) and f_0(x) = −1 if x ∈ B_{−1}(P) we have (cf. Devroye et al., 1997, Thm. 2.1.) R_P(f_0) = inf{R_P(f) : f : X → {−1, 1} measurable} = ∫_X p(x) P_X(dx), (1) where the noise level p : X → ℝ is defined by p(x) := P(y = −1|x) for x ∈ B_1(P), p(x) := P(y = 1|x)...
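The Bayes-risk identity in this context, R_P(f_0) = ∫_X p(x) P_X(dx) with noise level p(x) = min(P(y = 1|x), P(y = −1|x)), can be checked by simulation. The conditional probability η below and the uniform marginal are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    # hypothetical eta(x) = P(y = 1 | x) for a noisy task on X = [0, 1]
    return 0.5 + 0.4 * np.sin(2 * np.pi * x)

# noise level p(x) = min(eta(x), 1 - eta(x)); Bayes risk = integral of p dP_X
x = rng.uniform(0.0, 1.0, size=200_000)       # P_X = uniform on [0, 1]
p = np.minimum(eta(x), 1.0 - eta(x))
bayes_risk = p.mean()                         # Monte Carlo estimate of (1)

# the Bayes classifier f_0(x) = 1 iff eta(x) >= 1/2 attains this risk
y = np.where(rng.uniform(size=x.size) < eta(x), 1, -1)
f0 = np.where(eta(x) >= 0.5, 1, -1)
empirical_risk = np.mean(f0 != y)
print(bayes_risk, empirical_risk)             # the two estimates agree closely
```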

497 | Real Analysis and Probability
- Dudley
- 1989
Citation Context: ...8; Cristianini & Shawe-Taylor, 2000). Let (X, d) be a compact metric space, Y := {−1, 1} and P be a probability measure on X × Y, where X is equipped with the Borel σ-algebra. By disintegration (cf. Dudley, 1989, Lem. 1.2.1.) there exists a map x ↦ P(·|x) from X into the set of all probability measures on Y such that P is the joint distribution of (P(·|x))_x and of the marginal distribution P_X of P on...

376 | Weak Convergence and Empirical Processes - van der Vaart, Wellner - 2000

257 | Structural risk minimization over data-dependent hierarchies
- Shawe-Taylor, Bartlett, et al.
- 1998
Citation Context: ...-Chervonenkis theory fails to be applicable for support vector machines, other concepts such as data dependent structural risk minimization, e.g. in terms of the observed margin, were introduced (cf. Shawe-Taylor et al., 1998; Bartlett & Shawe-Taylor, 1999; Cristianini & Shawe-Taylor, 2000, chap. 2). The latter usually needs large margins on the training sets to provide good bounds. It is, how...

146 | A generalized representer theorem
- Schölkopf, Herbrich, et al.
- 2001
Citation Context: ...f_{2,k,c_n}^T(x) = Σ_{i=1}^n y_i α_i k(x_i, x) + b_{2,k,c_n}^T, where α_i ≥ 0 are suitable constants depending on T and b_{2,k,c_n}^T can also be computed with the help of the kernel (cf. Cristianini & Shawe-Taylor, 2000; Vapnik, 1998; Schölkopf et al., 2001). Note that if k is a kernel on X which separates all finite sets and X has infinitely many elements, then the function class represented by the 2-SMC has infinite VC-dimension. For more information ...
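The kernel expansion f(x) = Σᵢ yᵢ αᵢ k(xᵢ, x) + b in this context can be sketched directly. The αᵢ and b below are hypothetical dual coefficients, not the solution of an actual SVM optimization problem; the point is only that the decision function is evaluated entirely through the kernel.

```python
import numpy as np

def k(u, v, sigma=1.0):
    """Gaussian RBF kernel, evaluated row-wise between u and a single point v."""
    return np.exp(-sigma**2 * np.sum((u - v)**2, axis=-1))

x_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y_train = np.array([1, 1, -1, -1])
alpha = np.array([0.7, 0.7, 0.7, 0.7])   # hypothetical, alpha_i >= 0
b = 0.0                                  # hypothetical offset

def decision(x):
    # f(x) = sum_i y_i * alpha_i * k(x_i, x) + b
    return np.sum(y_train * alpha * k(x_train, x), axis=-1) + b

print(decision(np.array([0.1, 0.1])))    # positive: classified +1
print(decision(np.array([0.1, 0.9])))    # negative: classified -1
```

Note that the remark in the context about infinite VC-dimension applies exactly to such expansions: with a kernel separating all finite sets, the represented function class is not VC-bounded.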

141 | Central limit theorems for empirical measures
- Dudley
- 1978
Citation Context: ... For A ∈ P_{−1} ∪ P_1 let F_{n,A} := {((x_1, y_1), ..., (x_n, y_n)) ∈ (X × Y)^n : #{l : x_l ∈ A} ≥ m}, F_n := ∩_{A ∈ P_{−1} ∪ P_1} F_{n,A}. For n ≥ mM the Chernoff–Okamoto inequality (see e.g. Dudley, 1978) then yields P^n((X × Y)^n \ F_{n,A}) ≤ exp(−(nδ/M − (m − 1))² / (2n(δ/M)(1 − δ/M))) ≤ exp(−(n²(δ/M)² − 2n(δ/M)(m − 1) + (m − 1)²) / (2nδ/M)) ≤ exp(−δn/(2M) + m) and thus P^n(F_n) ≥ 1 −...

123 | Generalization Performance of Support Vector Machines and Other Pattern Classifiers
- Bartlett, Shawe-Taylor
- 1999
Citation Context: ...to be applicable for support vector machines, other concepts such as data dependent structural risk minimization, e.g. in terms of the observed margin, were introduced (cf. Shawe-Taylor et al., 1998; Bartlett & Shawe-Taylor, 1999; Cristianini & Shawe-Taylor, 2000, chap. 2). The latter usually needs large margins on the training sets to provide good bounds. It is, however, open which distributions ...

65 | Entropy, compactness and the approximation of operators
- Carl, Stephani
- 1990
Citation Context: ...1 − e^{−t} ≤ t for all t ≥ 0 we observe d_k(x, y) = √(2 − 2 exp(−σ²‖x − y‖²_2)) ≤ √2 σ‖x − y‖_2. This yields N((X, d_k), ε) ≤ N((X, ‖·‖_2), ε/(√2 σ)) and thus N((X, d_k), ε) ∈ O(ε^{−d}) (cf. Carl & Stephani, 1990, p. 9). For the classification problems we have considered up to now we usually may not expect that we obtain a large margin for sample sizes growing to infinity. In the following we restrict ourselv...
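The bound on the kernel metric in this context, d_k(x, y) = √(2 − 2 exp(−σ²‖x − y‖²)) ≤ √2 σ‖x − y‖, follows from 1 − e^{−t} ≤ t and can be checked numerically; the sample points and σ below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0

# random pairs of points in R^3 (illustrative choice of dimension and scale)
x = rng.normal(size=(1000, 3))
y = rng.normal(size=(1000, 3))
dist = np.linalg.norm(x - y, axis=1)

# d_k(x, y)^2 = k(x,x) - 2 k(x,y) + k(y,y) = 2 - 2 exp(-sigma^2 ||x - y||^2)
d_k = np.sqrt(2.0 - 2.0 * np.exp(-sigma**2 * dist**2))

# the bound d_k <= sqrt(2) * sigma * ||x - y|| should hold with slack
print(np.max(d_k - np.sqrt(2.0) * sigma * dist))
```

The printed maximum gap is negative, confirming that the Lipschitz bound used for the covering-number estimate holds pointwise.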

53 | Analysis Now
- Pedersen
- 1989
Citation Context: ...continuous functions f : X → ℝ on the compact metric space (X, d) endowed with the usual supremum norm ‖f‖_∞ := sup_{x∈X} |f(x)|. The following well-known approximation theorem of Stone–Weierstraß (cf. Pedersen, 1988, Cor. 4.3.5.) states that certain subalgebras of C(X) generate the whole space. This result will be the key tool when considering approximation properties of kernels in the next section: ...

51 | On optimal nonlinear associative recall
- Poggio
- 1975
Citation Context: ...Σ_{k_1,...,k_d ≥ 0} a_{k_1+⋯+k_d} c_{k_1,...,k_d} ∏_{i=1}^d (x_i y_i)^{k_i} = Σ_{k_1,...,k_d ≥ 0} a_{k_1+⋯+k_d} c_{k_1,...,k_d} ∏_{i=1}^d x_i^{k_i} ∏_{i=1}^d y_i^{k_i}, where c_{k_1,...,k_d} := (∏_{i=1}^d k_i!)^{−1} (Σ_{i=1}^d k_i)! (cf. also Poggio, 1975, Lem. 2.1). Note that the series can be rearranged since it is absolutely summable. In particular, for x = y we obtain that Φ : X → ℓ_2(ℕ_0^d) is well defined by Φ(x) := (√(a_{k_1+⋯+k_d} c_{k_1,...,k_d})...
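The multi-index expansion in this context, with multinomial coefficients c_{k_1,...,k_d} = (∏ᵢ kᵢ!)^{−1}(Σᵢ kᵢ)!, can be verified numerically for a concrete dot-product kernel. Taking a_n = 1/n!, i.e. k(x, y) = exp(⟨x, y⟩), is an illustrative choice not fixed by the snippet; the rearranged series over multi-indices in d = 2 should then reproduce the kernel value.

```python
import math
import numpy as np

x = np.array([0.3, -0.2])
y = np.array([0.5, 0.4])
N = 25  # truncate at total degree N

series = 0.0
for k1 in range(N):
    for k2 in range(N - k1):
        n = k1 + k2
        a_n = 1.0 / math.factorial(n)                # Taylor coefficient of exp
        # multinomial coefficient c_{k1,k2} = n! / (k1! * k2!)
        c = math.factorial(n) / (math.factorial(k1) * math.factorial(k2))
        series += a_n * c * (x[0] * y[0])**k1 * (x[1] * y[1])**k2

print(series, math.exp(x @ y))   # the truncated series matches k(x, y)
```

Because a_n c_{k_1,k_2} = 1/(k_1! k_2!) here, the double series factorizes, which is exactly the feature-map structure Φ the context goes on to define.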

29 | Uniqueness of the SVM solution - Burges, Crisp - 2000

8 | On the generalization ability of support vector machines - Steinwart - 2001

8 | A leave-one-out cross validation bound for kernel methods with applications in learning
- Zhang
- 2001
Citation Context: ...ques and a geometric motivation we again refer to Cristianini & Shawe-Taylor (2000, Ch. 6 and 7) and Vapnik (1998, Ch. 10). The MMC is assumed to work poorly in the absence of large margins (cf. Zhang, 2001). Thus we only consider the setting of Theorem 18. We begin with a result similar to Theorem 18 and Theorem 24: Theorem 25 Let (X, d) be a compact metric space and k a universal kernel on X. Suppose ...

6 | Theory of support vector machines
- Stitson, Weston, et al.
- 1996
Citation Context: ...ion 8 and exp(−σ²‖x − y‖²_2) = exp(−σ²‖x‖²_2) exp(−σ²‖y‖²_2) exp(⟨√2 σx, √2 σy⟩) the assertion follows for the RBF kernel. Example 2: Let X := {x ∈ ℝ^d : ‖x‖_2 < 1} and α > 0. Then V. Vovk's (cf. Saunders et al., 1998, p. 15) infinite polynomial kernel k(x, y) := (1 − ⟨x, y⟩)^{−α}, x, y ∈ X, is universal on every compact subset of X. Proof: To check the assertion we use that (1 − t)^{−α} = Σ_{n=0}^∞ binom(−α, n) (−1)^n t^n ...
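The binomial series underlying Vovk's infinite polynomial kernel, (1 − t)^{−α} = Σ_{n≥0} binom(−α, n)(−1)^n t^n for |t| < 1, can be checked numerically. The values of α and t below stand in for an arbitrary kernel parameter and inner product ⟨x, y⟩ and are illustrative choices.

```python
# Verify the binomial series for (1 - t)^(-alpha), |t| < 1, by building the
# positive coefficients C(alpha + n - 1, n) = prod_{j=1}^{n} (alpha + j - 1) / j
# with a running recurrence and summing the series.

alpha = 1.7     # hypothetical kernel parameter, alpha > 0
t = 0.35        # stands in for <x, y>, |t| < 1

coef = 1.0      # C(alpha - 1, 0) = 1
series = 0.0
for n in range(200):
    series += coef * t**n
    coef *= (alpha + n) / (n + 1)   # advance to C(alpha + n, n + 1)

closed_form = (1.0 - t)**(-alpha)
print(series, closed_form)          # the partial sum converges to the closed form
```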

5 | Summen-, Produkt- und Integral-Tafeln
- Gradstein, Ryshik
- 1981
Citation Context: ...apnik (1998, p. 470) and Saunders et al. (1998, p. 15) is universal on every compact subset of [0, 2π)^d. Proof: The assertion can be seen using Corollary 11 and f(t) = 1/2 + Σ_{n=1}^∞ q^n cos(nt) (cf. Gradstein & Ryshik, 1981, p. 68). Example 4: Let 0 < q and f(t) := cosh((π − |t|)/q) / (2q sinh(π/q)) for all t with −2π ≤ t ≤ 2π. Then the weaker regularized Fourier kernel k(x, y) := ∏_{i=1}^d f(x_i − y_i) considered by...
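The series f(t) = 1/2 + Σ_{n≥1} qⁿ cos(nt) in this context sums, by a standard identity, to the Poisson kernel (1 − q²) / (2(1 − 2q cos t + q²)); this closed form is a textbook fact and not stated in the snippet itself. The values of q and t below are arbitrary.

```python
import math

q = 0.4   # hypothetical kernel parameter, 0 < q < 1
t = 1.1   # an arbitrary evaluation point

# partial sum of the Fourier series defining f(t)
partial = 0.5 + sum(q**n * math.cos(n * t) for n in range(1, 200))

# Poisson-kernel closed form of the same series
closed = (1 - q**2) / (2 * (1 - 2*q*math.cos(t) + q**2))
print(partial, closed)   # the two values agree
```

Positivity of all Fourier coefficients qⁿ is what makes k(x, y) = ∏ᵢ f(xᵢ − yᵢ) a valid (and, per the context, universal) kernel.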

4 | On the influence of the kernel on the generalization ability of support vector machines - Steinwart - 2001 |