## Consistency of support vector machines and other regularized kernel classifiers (2002)

Citations: 54 (13 self)

### BibTeX

```bibtex
@MISC{Steinwart02consistencyof,
  author = {Ingo Steinwart},
  title  = {Consistency of support vector machines and other regularized kernel classifiers},
  year   = {2002}
}
```

### Citations

9023 | The Nature of Statistical Learning Theory
- Vapnik
Citation Context: ...small. Moreover, in order to apply results on empirical risk minimization it was assumed to use , or—as an approximation— . Unfortunately, using universal kernels on infinite spaces the motivation in [16] cannot work since the corresponding function classes always have infinite Vapnik–Chervonenkis (VC) dimension. However, the classifiers based on are universally consistent for suitable sequences . Thi...

2188 | Support-vector networks
- Cortes, Vapnik
- 1995
Citation Context: ... in [3]. We begin with the most common SVM. Example 1.1: Let , be the hinge loss function , , and be a positive sequence with . Then the classifiers based on either (2) or (3) are called L1-SVMs (cf. [5]). We will show that the L1-SVM based on (2) is universally consistent whenever a universal kernel (see the following examples and Section II for a definition) is used and satisfies . If the L1-SVM is...
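The hinge-loss formulas in Example 1.1 are elided in the snippet above. As a hedged illustration only, here is a minimal NumPy sketch of the standard form the L1-SVM objective (2) presumably takes, namely minimizing λ‖f‖²_H + (1/n) Σ max(0, 1 − y_i f(x_i)); the function names and toy data are mine, not the paper's:

```python
import numpy as np

def hinge_loss(y, fx):
    """Hinge loss L(y, f(x)) = max(0, 1 - y * f(x))."""
    return np.maximum(0.0, 1.0 - y * fx)

def l1_svm_objective(alpha, K, y, lam):
    """Regularized empirical hinge risk for f = sum_j alpha_j k(., x_j):
    lam * ||f||_H^2 + (1/n) * sum_i hinge(y_i, f(x_i))."""
    fx = K @ alpha                 # f evaluated at the training points
    reg = lam * alpha @ K @ alpha  # RKHS norm: ||f||_H^2 = alpha^T K alpha
    return reg + hinge_loss(y, fx).mean()

# toy usage with a linear kernel on four 1-D points
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])
K = np.outer(X, X)                 # linear kernel matrix k(x, x') = x * x'
alpha = np.array([0.0, 0.0, 0.0, 0.25])
print(l1_svm_objective(alpha, K, y, lam=0.1))  # prints 0.275
```

The consistency question the snippet raises is then how the regularization sequence may shrink with the sample size while this minimizer still approaches the Bayes risk.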

2059 | The elements of statistical learning
- Hastie, Tibshirani, et al.
- 2001
Citation Context: ...inally, our techniques can also be adapted to classifiers based on (3). The resulting conditions are very similar. Several other loss functions can also be treated by our results, including (see, e.g., [13] and [14]) the sigmoid loss function, a truncated hinge loss function, and some smooth approximations of the margin loss functions of the above examples. The following last two examples are mainly of ...

2045 | Learning with kernels
- Schölkopf, Smola
- 2002
Citation Context: ...a minimizer a solution of (2) is given by . A similar argument can be employed for (3). However, for and specific convex loss functions the dual of (2) or (3) is usually solved instead (cf. [3] and [4]). For example, if is the hinge loss then the dual problem becomes: maximize subject to (5). Recall that if (3) is considered instead, then the additional constraint appears in (5). Finally, for the squ...
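The dual problem (5) referenced here has its formulas stripped by the extraction. For orientation, the standard hinge-loss dual of (2), which is presumably what (5) states (with the box constraint C determined by the regularization parameter), reads:

```latex
\max_{\alpha \in \mathbb{R}^n} \quad \sum_{i=1}^n \alpha_i
  - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j \, y_i y_j \, k(x_i, x_j)
\qquad \text{subject to} \quad 0 \le \alpha_i \le C, \quad i = 1, \dots, n .
```

If the offset version (3) is used, the additional equality constraint \(\sum_{i=1}^n \alpha_i y_i = 0\) is added, which matches the snippet's remark about an extra constraint appearing in (5).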

1002 | A Probabilistic Theory of Pattern Recognition (Stochastic Modelling and Applied Probability)
- Devroye, Györfi, et al.
- 1997
Citation Context: ...larization, support vector machines (SVMs), universal consistency. I. INTRODUCTION We treat the statistical classification problem, which has been studied in both statistics and machine learning (cf. [1] for a thorough treatment). To recall this problem, let be a nonempty set, and . A classifier is a rule that assigns to every training set a measurable function . Here, it is always assumed that is...

945 | An introduction to support vector machines
- Shawe-Taylor, Cristianini
- 2000
Citation Context: ...s found a minimizer a solution of (2) is given by . A similar argument can be employed for (3). However, for and specific convex loss functions the dual of (2) or (3) is usually solved instead (cf. [3] and [4]). For example, if is the hinge loss then the dual problem becomes: maximize subject to (5). Recall that if (3) is considered instead, then the additional constraint appears in (5). Finally, for...

787 | Theory of reproducing kernels
- Aronszajn
- 1950
Citation Context: ... all . Again, means that both and hold. Furthermore, we always assume . Throughout the paper, let be a compact metric space. For a positive semidefinite kernel , we denote the corresponding RKHS (cf. [24], [25, Ch. 3], and the Appendix) by or simply . For its closed unit ball we write . Recall that the map fulfills by the reproducing property. We will often use the quantity . Recall that is the smalle...
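The formulas around the reproducing property are elided above. In standard RKHS notation (a reconstruction from the usual statement in [24], not verbatim from the paper), the map in question is \(x \mapsto k(\cdot, x)\) and the property reads:

```latex
f(x) = \langle f, \, k(\cdot, x) \rangle_H
\qquad \text{for all } f \in H, \; x \in X ,
```

so in particular \(\lVert k(\cdot, x) \rVert_H^2 = k(x, x)\), which is what makes the sup-norm of functions in the unit ball controllable by the kernel.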

314 | Regularization theory and neural networks architectures
- Girosi, Jones, et al.
- 1995
Citation Context: ... and . The analogous classifiers based on (2) are called regularization networks or kernel ridge regression classifiers and were introduced in [12]. Since , , and for , the conditions for both classifiers coincide with those of the corresponding L2-SVMs. Since many SVMs for classification are actually based on a regression-like approach, it is r...

269 | Regularization networks and support vector machines
- Evgeniou, Pontil, et al.
- 2000
Citation Context: ...s. Example 1.8: In order to motivate SVMs, the regularization function defined by in combination with the structural risk minimization method with respect to the hinge loss function was considered in [15]. Using Proposition 3.3, this approach actually yields universal consistency for classifiers with the above in terms of structural risk minimization. Moreover, our results also yield another method ma...

263 | Least squares support vector machine classifiers. Neural Processing Letters
- Suykens, Vandewalle
- 1999
Citation Context: ...e square loss function. Interestingly, the first has been inspired by the SVM approach but the second has been introduced independently from SVMs. Example 1.3: Least squares SVMs (LS-SVMs), proposed in [11], are based on the minimization problem (3) with and . The analogous classifiers based on (2) are called regularization network...
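The LS-SVM minimization problem is elided in the snippet. As a hedged reconstruction from the standard LS-SVM formulation in [11] (the symbols below are the usual ones, not necessarily the paper's), it is the squared-loss analogue of the SVM problem with offset:

```latex
\min_{f \in H, \; b \in \mathbb{R}} \quad
\lambda \, \lVert f \rVert_H^2
+ \frac{1}{n} \sum_{i=1}^n \bigl( 1 - y_i \, ( f(x_i) + b ) \bigr)^2 .
```

Because the loss is a plain square, the optimality conditions are a linear system rather than a quadratic program, which is the practical appeal noted in the LS-SVM literature.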

224 | On the mathematical foundations of learning
- Cucker, Smale
Citation Context: ... we have for all . This also holds for restrictions of to , . Now we can state a concentration inequality for classifiers based on (2). The proof of this inequality, which is similar to the methods of [30] for the square loss function in regression scenarios, can be found in Section IV. Lemma 3.4: Let be a continuous kernel on , be an admissible loss function, and be a regularization function. Then for ...

206 | Scale-sensitive dimensions, uniform convergence, and learnability - Alon, Ben-David, et al. - 1997

169 | Convex functions, monotone operators and differentiability
- Phelps
- 1993
Citation Context: ...vex, continuous function the subdifferential of in is defined by . Since is regular, there also exists a constant independent of such that and . Therefore, we get . For basic properties we refer to [37], [38], and in particular [46, Theorems 23.8 and 23.9]. In the following, denotes the expectation of with respect to the empirical measure induced by . We will also use this notation for -valued functions...

167 | Stability and generalization
- Bousquet, Elisseeff
Citation Context: ...n followed so far. Unfortunately, it is shown in [19] that the existing bounds cannot explain the generalization performance of SVMs and thus a sample-dependent theory still has to be developed (cf. [20] for the best known results in this direction). On the other hand, there are only a few works dealing with consistency of classifiers based on (2) or (3). Some preliminary results in [21] and [22] sho...

163 | On the influence of the kernel on the consistency of support vector machines
- Steinwart
Citation Context: ...cf. [20] for the best known results in this direction). On the other hand, there are only a few works dealing with consistency of classifiers based on (2) or (3). Some preliminary results in [21] and [22] show consistency of L1-SVMs for restricted classes of distributions. Furthermore, there exist two results establishing universal consistency for classifiers based on (2) or (3). As already mentioned,...

138 | Central limit theorem for empirical measures - Dudley - 1978

138 | A generalized representer theorem - Schölkopf, Herbrich, et al. - 2001

136 | Sparseness of support vector machines
- Steinwart
- 2003
Citation Context: ...ch we will need in the following. Definition 2.1: Let be a function which is continuous in . Roughly speaking, it turns out that the solutions of (2) or (3) tend to a function that minimizes this -risk (see [27] for details). Hence, the following definition is fundamental in order to guarantee that these solutions tend to have the same sign as the Bayes decision rule. Definition 2.2: A continuous function wi...

113 | Statistical behavior and consistency of classification methods based on convex risk minimization (with discussion
- Zhang
- 2004
Citation Context: ...rived from our general results using the facts , , and for (see Section III for a definition of these quantities and the mentioned results). The above conditions on are stronger than those derived in [7] for the L1-SVM without offset. However, they significantly improve the only known conditions (cf. [8]) for the L1-SVM with offset. Furthermore, for both SVMs the conditions almost coincide with the c...

102 | Harmonic analysis on semigroups: Theory of positive definite and related functions - Berg, Christensen, et al. - 1984

90 | Lagrangian support vector machines
- Mangasarian
- 2001
Citation Context: ...y small in order to ensure strong universal consistency. Again, for a Gaussian RBF kernel on this condition can be further weakened to if one is only interested in universal consistency. Moreover, in [10], Mangasarian and Musicant proposed another variation of the theme, called Lagrangian SVM, which is based on the optimization problem (6) with and . The corresponding consistency conditions on are obv...

84 | Robust trainability of single neurons
- Höffgen, Horn, et al.
- 1995
Citation Context: ...samples: the admissible loss function , , approximates the loss function for . Since error minimization in with respect to is NP-hard whenever the training set cannot be linearly separated (cf. [17], [18]), it was proposed to replace by for small . Moreover, in order to apply results on empirical risk minimization it was assumed to use , or—as an approximation— . Unfortunately, using universal kernels ...

84 | Support vector machines and the Bayes rule in classification
- Lin
- 1999
Citation Context: ...veloped (cf. [20] for the best known results in this direction). On the other hand, there are only a few works dealing with consistency of classifiers based on (2) or (3). Some preliminary results in [21] and [22] show consistency of L1-SVMs for restricted classes of distributions. Furthermore, there exist two results establishing universal consistency for classifiers based on (2) or (3). As already m...

72 | Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators - Williamson, Smola, et al. - 2001

71 | Asymptotic analysis of penalized likelihood and related estimators. Ann. Statist.
- Cox, O’Sullivan
- 1990
Citation Context: ...ion approach. Regularization techniques are well known and have a broad range of applications. In particular, for statistical problems they have been intensively studied in the literature (cf., e.g., [23] and the references therein). However, the classifiers based on (2) or (3) differ from the commonly considered regularization scenarios in statistics. • In general, using the loss function of interest...

61 | Entropy, compactness and the approximation of operators
- Carl, Stephani
- 1990
Citation Context: ... we define . Note that we have for all and and instead. Finally, recall that there also exists a concept—the so-called entropy numbers—which is “inverse” to the above notions. For details we refer to [29]. In the following, we also have to measure the continuity of a given loss function . To this end, we use the inverted modulus of continuity, which is defined by . Moreover, we even obtain for . The main...

46 | Support vector machines are universally consistent
- Steinwart
Citation Context: ...quantities and the mentioned results). The above conditions on are stronger than those derived in [7] for the L1-SVM without offset. However, they significantly improve the only known conditions (cf. [8]) for the L1-SVM with offset. Furthermore, for both SVMs the conditions almost coincide with the condition in [7] if the used kernel is smooth. Recall that there exists another well-known variant, th...

38 | Support-vector networks. Machine Learning - Cortes, Vapnik - 1995

38 | The covering number in learning theory
- Zhou
Citation Context: ...le 3.8: As already mentioned, the Gaussian RBF kernel on also fits into the framework of Example 3.7. For this kernel, however, a sharper upper bound for the covering numbers of was recently shown in [32]. Namely, it was proved that . Consequently, the classifier considered in Corollary 3.6 is universally consistent if and . Finally, we consider classifiers based on (3). As already indicated, we only hav...

37 | Algorithmic stability and generalization performance
- Bousquet, Elisseeff
Citation Context: ...nd . Then the classifier based on (2) with respect to , , , and is universally consistent. In order to prove this kernel-independent condition on we have to recall the notion of stable classifiers (cf. [36]): let and . Moreover, let denote the training set that is identical to apart from the th sample, which is replaced by . A classifier based on the optimization problem (2) is called stable with respect...
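The stability notion of [36] is only sketched in the snippet: replace one training sample and ask how much the learned decision function can move. A minimal numerical illustration, using kernel ridge regression as a stand-in for a classifier based on (2) (all names and the toy data are mine, purely illustrative):

```python
import numpy as np

def kernel_ridge_fit(K, y, lam):
    """Coefficients a of f = sum_j a_j k(., x_j) minimizing
    (1/n) * sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2,
    i.e., a = (K + n * lam * I)^{-1} y."""
    n = len(y)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = np.sign(X[:, 0])
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))  # Gaussian RBF

# perturbed training set: flip the label of the 0th sample, then refit
y_pert = y.copy()
y_pert[0] = -y_pert[0]
a = kernel_ridge_fit(K, y, lam=0.1)
a_pert = kernel_ridge_fit(K, y_pert, lam=0.1)

# uniform change of the decision function over the training points
beta = np.max(np.abs(K @ a - K @ a_pert))
print(beta)
```

In the terminology quoted above, a classifier would be called stable with respect to a bound if this uniform change never exceeds that bound, whichever sample is replaced; a larger regularization parameter makes the refit move less.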

32 | SSVM: A smooth support vector machine for classification
- Lee, Mangasarian
- 1999
Citation Context: ...ur techniques can also be adapted to classifiers based on (3). The resulting conditions are very similar. Several other loss functions can also be treated by our results, including (see, e.g., [13] and [14]) the sigmoid loss function, a truncated hinge loss function, and some smooth approximations of the margin loss functions of the above examples. The following last two examples are mainly of theoretic...

30 | A note on margin-based loss functions in classification. Statist.
- Lin
- 2004
Citation Context: ...on 2.2: A continuous function with is called an admissible loss function if for every and every with (12) we have if and if . A similar notion together with some sufficient conditions can be found in [28]. Besides the asymmetric hinge loss function discussed in Example 1.5, all loss functions treated in the examples of the Introduction are admissible. We will see in Lemma 4.1 that there always exists ...

29 | Uniqueness of the SVM solution - Burges, Crisp - 1999

24 | Average-case analysis of numerical problems
- Ritter
- 2000
Citation Context: ...e find for some . Then the classifier based on (2) with respect to , , , and is strongly universally consistent. Examples of kernels that satisfy one of the above smoothness assumptions can be found in [31]. Here, we only consider a specific class of universal kernels. Example 3.7: Let and be a function that can be expressed by its Taylor series in , i.e., if where the last estimate is due to (14). The ...

18 | Metric entropy of convex hulls in Banach spaces - Carl, Kyrezi, et al. - 1999

18 | Gelfand numbers of operators with values in a Hilbert space
- Carl, Pajor
- 1988
Citation Context: ...oof of Lemma 3.4, we define . Then Lemma 3.4 in [43] yields . Now the assertion follows by the arguments used in the proof of Lemma 3.4. Proof of Corollary 3.15: By the dual Maurey-Carl inequality (cf. [34], [44], and [40]) we have . Therefore, we find , and we can easily deduce the second assertion. Proof of Lemma 3.11: By Lemma 4.4, we may assume without loss of generality that is not degenerate. Let be chos...

11 | Convexity and optimization in Banach spaces
- Barbu, Precupanu
- 1986
Citation Context: ... a convex, continuous function the subdifferential of in is defined by . Since is regular, there also exists a constant independent of such that and . Therefore, we get . For basic properties we refer to [37], [38], and in particular [46, Theorems 23.8 and 23.9]. In the following, denotes the expectation of with respect to the empirical measure induced by . We will also use this notation for -valued funct...

7 | Convergence of large margin separable linear classification - Zhang - 2001

3 | The densest hemisphere problem. Theor.
- Johnson, Preparata
- 1978
Citation Context: ...ified samples: the admissible loss function , , approximates the loss function for . Since error minimization in with respect to is NP-hard whenever the training set cannot be linearly separated (cf. [17], [18]), it was proposed to replace by for small . Moreover, in order to apply results on empirical risk minimization it was assumed to use , or—as an approximation— . Unfortunately, using universal ke...

2 | Which Data-Dependent Bounds are Suitable for SVM's
- Steinwart
- 2002
Citation Context: ...dent approach that estimates the risk of decision functions in terms of observed data on the training set. For SVMs, mainly the latter approach has been followed so far. Unfortunately, it is shown in [19] that the existing bounds cannot explain the generalization performance of SVMs and thus a sample-dependent theory still has to be developed (cf. [20] for the best known results in this direction). O...

2 | Theory of Function Spaces. Akademische Verlagsgesellschaft Geest & Portig - Triebel

1 | s-numbers of integral operators with Hölder-continuous kernels over metric compacta - Carl, Heinrich, et al. - 1988

1 | Average-Case Analysis of Numerical Problems, Lecture Notes in Math. 1733 - Ritter - 2000

[26] R. T. Rockafellar, Convex Analysis - 1970

1 | Entropy numbers of convex hulls and an application to learning algorithms
- Steinwart
- 2003
Citation Context: ...inuous functions that are uniformly bounded with respect to the -norm and is a strictly positive sequence. Examples of such kernels can be found in [22]. If we even have for some , then it was shown in [35] that . Then the classifier based on (3) with respect to , , , and is strongly universally consistent. D. Consistency Results Based on Stability In practice, one usually considers convex loss functions ...

1 | Support Vector Machine Reference
- Saunders, Stitson, et al.
- 1998
Citation Context: ...d by using smoothness properties of . In particular, if is even a universal -kernel on a closed ball of —e.g., the Gaussian radial basis function (RBF) kernel or Vovk's infinite polynomial kernel (see [6] and Example 3.7)—it suffices to use a sequence with for some arbitrarily small in order to ensure strong universal consistency. For a Gaussian RBF kernel, the latter condition can be further weakened t...

1 | On the optimal parameter choice for ν-support vector machines
- Steinwart
- 2003
Citation Context: ... classifier is different from that of the algorithms in consideration. Indeed, the (almost) optimal value for the regularization parameter is determined by the Bayes risk of the underlying measure (see [9]). Let us now treat the so-called L2-SVM. Example 1.2: Let , be the squared hinge loss function , and be a positive sequence with . Then the classifiers based on either (2) or (3) are called L2-SVMs. ...