## A Generalized Representer Theorem (2001)



### Venue and Citations

Venue: Proceedings of the Annual Conference on Computational Learning Theory

Citations: 135 (17 self)

### BibTeX

```bibtex
@inproceedings{Schoelkopf2001generalized,
  author    = {Bernhard Sch{\"o}lkopf and Ralf Herbrich and Alex J. Smola},
  title     = {A Generalized Representer Theorem},
  booktitle = {Proceedings of the Annual Conference on Computational Learning Theory},
  year      = {2001},
  pages     = {416--426}
}
```



### Abstract

Wahba's classical representer theorem states that the solutions of certain risk minimization problems involving an empirical risk term and a quadratic regularizer can be written as expansions in terms of the training examples. We generalize the theorem to a larger class of regularizers and empirical risk terms, and give a self-contained proof utilizing the feature space associated with a kernel. The result shows that a wide range of problems have optimal solutions that live in the finite dimensional span of the training examples mapped into feature space, thus enabling us to carry out kernel algorithms independent of the (potentially infinite) dimensionality of the feature space.
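The practical upshot of the theorem is that, for a squared-loss empirical risk with a quadratic regularizer λ‖f‖², the minimizer is guaranteed to have the form f = Σᵢ αᵢ k(·, xᵢ), so fitting reduces to solving an m×m linear system. A minimal kernel ridge regression sketch (the RBF kernel, the helper names, and all parameter values here are illustrative choices, not taken from the paper):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian RBF kernel matrix: k(a, b) = exp(-gamma * ||a - b||^2)
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def fit_kernel_ridge(X, y, lam=1e-3, gamma=1.0):
    # Representer theorem: the minimizer of the regularized risk lies in
    # span{k(., x_i)}, so we solve only for m coefficients alpha,
    # regardless of the (possibly infinite) feature-space dimension.
    m = len(X)
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * m * np.eye(m), y)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
alpha = fit_kernel_ridge(X, y, lam=1e-3)

X_test = np.array([[0.5]])
f_test = rbf_kernel(X_test, X, 1.0) @ alpha  # f(x) = sum_i alpha_i k(x, x_i)
```

The key point is that `alpha` has one entry per training example: the infinite-dimensional feature space never appears explicitly, only the Gram matrix does.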

### Citations

8953 | The Nature of Statistical Learning Theory - Vapnik - 2000

1284 | A training algorithm for optimal margin classifiers - Boser, Guyon, et al. - 1992

Citation Context: ...Positive Definite Kernels. The question under which conditions kernels correspond to dot products in linear spaces has been brought to the attention of the machine learning community by Vapnik and coworkers [1,5,23]. In functional analysis, the same problem has been studied under the heading of Hilbert space representations of kernels. A good monograph on the functional analytic theory of kernels is [4]. Most of...

776 | Theory of reproducing kernels - Aronszajn - 1950

Citation Context: ...directly from the definition: for all functions (8), ⟨k(·,x), f⟩ = f(x), (11) i.e., k is the representer of evaluation. In particular, ⟨k(·,x), k(·,x′)⟩ = k(x, x′), the reproducing kernel property [2,4,24], hence (cf. (7)) we indeed have k(x, x′) = ⟨φ(x), φ(x′)⟩. Moreover, by (11) and (6) we have |f(x)|² = |⟨k(·,x), f⟩|² ≤ k(x,x) · ⟨f,f⟩. (12) Therefore, ⟨f,f⟩ = 0 implies f = 0, which is the last pr...
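The identities quoted in this snippet can be checked numerically for functions in the span of the kernel sections: for f = Σᵢ αᵢ k(·, xᵢ) the reproducing property gives f(x) = Σᵢ αᵢ k(x, xᵢ) and ⟨f, f⟩ = αᵀKα, so the bound (12) becomes a concrete inequality. A small sketch (Gaussian kernel and all names are illustrative):

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2); note k(x, x) = 1
    return np.exp(-gamma * np.sum((a - b) ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))    # training points x_1, ..., x_m
alpha = rng.normal(size=10)     # coefficients of f = sum_i alpha_i k(., x_i)

K = np.array([[rbf(xi, xj) for xj in X] for xi in X])
ff = alpha @ K @ alpha          # <f, f> = alpha^T K alpha

x = rng.normal(size=2)          # an arbitrary evaluation point
kx = np.array([rbf(x, xi) for xi in X])
fx = kx @ alpha                 # reproducing property: f(x) = <k(., x), f>

# the Cauchy-Schwarz bound (12): |f(x)|^2 <= k(x, x) * <f, f>
assert fx**2 <= rbf(x, x) * ff + 1e-9
```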

412 | Large margin classification using the perceptron algorithm - Freund, Schapire - 1999

Citation Context: ...x, we will typically only be able to find local minima. Independent of the convexity issue, the result lends itself well to gradient-based on-line algorithms for minimizing RKHS-based risk functionals [10,9,17,11,8,16]: for the computation of gradients, we only need the objective function to be differentiable; convexity is not required. Such algorithms can thus be adapted to deal with more general regularizers. Exa...

383 | Text classification using string kernels - Lodhi, Saunders, et al. - 2002

Citation Context: ...think of the pair (X, k) as a (subset of a) Hilbert space. From a mathematical point of view, this is attractive, since we can thus study various data structures (e.g., strings over discrete alphabets [26,13,18]) in Hilbert spaces, whose theory is very well developed. From a practical point of view, however, we now face the problem that for many popular kernels, the Hilbert space is known to be infinite-dime...

367 | Convolution kernels on discrete structures - Haussler - 1999

Citation Context: ...think of the pair (X, k) as a (subset of a) Hilbert space. From a mathematical point of view, this is attractive, since we can thus study various data structures (e.g., strings over discrete alphabets [26,13,18]) in Hilbert spaces, whose theory is very well developed. From a practical point of view, however, we now face the problem that for many popular kernels, the Hilbert space is known to be infinite-dime...

311 | Regularization theory and neural-network architectures (Neural Computation) - Girosi, Jones, et al. - 1995

Citation Context: ...r coupling of losses at different points). The present generalization to g(‖f‖) is, to our knowledge, new. For a machine learning point of view on the representer theorem and a variational proof, cf. [12]. The significance of the theorem is that it shows that a whole range of learning algorithms have solutions that can be expressed as expansions in terms of the training examples. Note that monotonicit...

282 | Theoretical foundations of the potential function method in pattern recognition learning (Automation and Remote Control) - Aizerman, Braverman, et al. - 1964

Citation Context: ...Positive Definite Kernels. The question under which conditions kernels correspond to dot products in linear spaces has been brought to the attention of the machine learning community by Vapnik and coworkers [1,5,23]. In functional analysis, the same problem has been studied under the heading of Hilbert space representations of kernels. A good monograph on the functional analytic theory of kernels is [4]. Most of...

278 | Some results on Tchebycheffian spline functions - Kimeldorf, Wahba - 1971

Citation Context: ...roof. In its original form, with mean squared loss c((x₁,y₁,f(x₁)),...,(x_m,y_m,f(x_m))) = (1/m) Σᵢ₌₁ᵐ (yᵢ − f(xᵢ))², (17) or hard constraints on the outputs, and g(‖f‖) = λ‖f‖² (λ > 0), the theorem is due to [15]. Note that in our formulation, hard constraints on the solution are included by the possibility of c taking the value ∞. A generalization to non-quadratic cost functions was stated by [7], cf. the di...

254 | Functions of positive and negative type and their connection with the theory of integral equations - Mercer - 1909

Citation Context: ...r to study the problem of learning, we need additional structure. In kernel methods, this is provided by a similarity measure k : X × X → R, (x, x′) ↦ k(x, x′). (2) The function k is called a kernel [20]. The term stems from the first use of this type of function in the study of integral operators, where a function k giving rise to an operator T_k via (T_k f)(x) = ∫_X k(x, x′) f(x′) dx′ (3) is calle...
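A quick way to make the "positive definite kernels correspond to dot products" claim concrete is that any Gram matrix built from such a kernel must be symmetric positive semidefinite, for any choice of points. A sketch of that check (Gaussian kernel assumed; PSD verified up to round-off):

```python
import numpy as np

def gaussian_kernel(X, gamma=0.5):
    # Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X**2, 1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
K = gaussian_kernel(X)

# a positive definite kernel yields a symmetric Gram matrix whose
# eigenvalues are all non-negative, whatever the sample points are
eigvals = np.linalg.eigvalsh(K)
assert np.allclose(K, K.T)
assert eigvals.min() > -1e-8   # non-negative up to floating-point error
```

Equivalently, K admits a factorization K = ΦᵀΦ, which is exactly the statement k(x, x′) = ⟨φ(x), φ(x′)⟩ restricted to the sample.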

195 | Prediction with Gaussian processes: From linear regression to linear prediction and beyond - Williams - 1999

Citation Context: ...posterior [14,12]. In this case, exp(−c((xᵢ,yᵢ,f(xᵢ))ᵢ₌₁,...,m)) is the likelihood of the data, while exp(−g(‖f‖)) is the prior over the set of functions. The well-known Gaussian process prior (e.g. [24,27]), with covariance function k, is obtained by using g(‖f‖) = λ‖f‖² (here, λ > 0, and, as above, ‖·‖ is the norm of the RKHS associated with k). A Laplacian prior would be obtained by using...

193 | Nonlinear programming - Mangasarian - 1994

Citation Context: ...ns thereof), this regularizer leads to a convex quadratic programming problem [5,23]. In that case, the standard Kuhn-Tucker machinery of optimization theory [19] can be applied to derive a so-called dual optimization problem, which consists of finding the expansion coefficients α₁,...,α_m rather than the solution f in the RKHS. From the point of view of learni...

177 | Kernel principal component analysis - Schölkopf, Müller - 1999

Citation Context: ...shown to correspond to the case of c((xᵢ,yᵢ,f(xᵢ))ᵢ₌₁,...,m) = 0 if (1/m) Σᵢ₌₁ᵐ (f(xᵢ) − (1/m) Σⱼ₌₁ᵐ f(xⱼ))² = 1, and ∞ otherwise, (31) with g an arbitrary strictly monotonically increasing function [21]. The constraint ensures that we are only considering linear feature extraction functionals that produce outputs of unit empirical variance. Note that in this case of unsupervised learning, there are...
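The unit-empirical-variance constraint in this snippet is what kernel PCA enforces in practice: the expansion coefficients of the extracted feature are rescaled so that its projections have unit variance over the training set. A minimal kernel PCA sketch (centering, leading eigenvector, unit-variance scaling; the kernel choice and all names are illustrative):

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    sq = np.sum(X**2, 1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))
m = len(X)

K = rbf_gram(X)
# center in feature space: Kc = H K H with H = I - (1/m) 1 1^T
H = np.eye(m) - np.ones((m, m)) / m
Kc = H @ K @ H

# the leading eigenvector of the centered Gram matrix gives the
# expansion coefficients of the first kernel principal component
w, V = np.linalg.eigh(Kc)
lam1, a1 = w[-1], V[:, -1]

# rescale so the projections have unit empirical variance:
# sum_i proj_i^2 = alpha^T Kc^2 alpha = lam1^2 * ||a1||^2, so dividing
# by lam1 and multiplying by sqrt(m) makes the mean square equal 1
alpha = a1 * np.sqrt(m) / lam1
proj = Kc @ alpha            # projections of the training points
```

Because the centered Gram matrix annihilates the all-ones vector, the projections also have zero mean, so "mean square = 1" is the same as "unit empirical variance".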

150 | Support vector machines, reproducing kernel Hilbert spaces and randomized GACV - Wahba - 1998

Citation Context: ...in our formulation, hard constraints on the solution are included by the possibility of c taking the value ∞. A generalization to non-quadratic cost functions was stated by [7], cf. the discussion in [25] (note, however, that [7] did not yet allow for coupling of losses at different points). The present generalization to g(‖f‖) is, to our knowledge, new. For a machine learning point of view on the rep...

146 | Spline models for observational data (volume 59) - Wahba - 1990

Citation Context: ...posterior [14,12]. In this case, exp(−c((xᵢ,yᵢ,f(xᵢ))ᵢ₌₁,...,m)) is the likelihood of the data, while exp(−g(‖f‖)) is the prior over the set of functions. The well-known Gaussian process prior (e.g. [24,27]), with covariance function k, is obtained by using g(‖f‖) = λ‖f‖² (here, λ > 0, and, as above, ‖·‖ is the norm of the RKHS associated with k). A Laplacian prior would be obtained by using...

133 | Harmonic Analysis on Semigroups - Berg, Christensen, et al. - 1984

Citation Context: ...ers [1,5,23]. In functional analysis, the same problem has been studied under the heading of Hilbert space representations of kernels. A good monograph on the functional analytic theory of kernels is [4]. Most of the material in the present introductory section is taken from that work. Readers familiar with the basics of kernels can skip over the remainder of it...

131 | A Correspondence Between Bayesian Estimation of Stochastic Processes and Smoothing by Splines - Kimeldorf, Wahba - 1970

Citation Context: ...dapted to deal with more general regularizers. Example 4 (Bayesian MAP estimates). The well-known correspondence to Bayesian methods is established by identifying (15) with the negative log posterior [14,12]. In this case, exp(−c((xᵢ,yᵢ,f(xᵢ))ᵢ₌₁,...,m)) is the likelihood of the data, while exp(−g(‖f‖)) is the prior over the set of functions. The well-known Gaussian process prior (e.g. [24,27]), with cov...

122 | Dynamic alignment kernels - Watkins - 1999

Citation Context: ...think of the pair (X, k) as a (subset of a) Hilbert space. From a mathematical point of view, this is attractive, since we can thus study various data structures (e.g., strings over discrete alphabets [26,13,18]) in Hilbert spaces, whose theory is very well developed. From a practical point of view, however, we now face the problem that for many popular kernels, the Hilbert space is known to be infinite-dime...

118 | Generalization performance of support vector machines and other pattern classifiers - Bartlett, Shawe-Taylor - 1998

Citation Context: ...great interest for learning theory, both since they comprise a number of useful algorithms as special cases and since their statistical performance can be analyzed with tools of learning theory (see [23,3], and, more specifically dealing with regularized risk functionals, [6]). Theorem 1 (Nonparametric Representer Theorem). Suppose we are given a nonempty set X, a positive definite real-valued kernel...

95 | The kernel adatron algorithm: a fast and simple learning procedure for support vector machines - Friess, Cristianini, et al. - 1998

Citation Context: ...x, we will typically only be able to find local minima. Independent of the convexity issue, the result lends itself well to gradient-based on-line algorithms for minimizing RKHS-based risk functionals [10,9,17,11,8,16]: for the computation of gradients, we only need the objective function to be differentiable; convexity is not required. Such algorithms can thus be adapted to deal with more general regularizers. Exa...

77 | On a kernel-based method for pattern recognition, regression, approximation and operator inversion - Smola, Schölkopf - 1998

71 | Asymptotic analysis of penalized likelihood and related estimators - Cox, O'Sullivan - 1990

Citation Context: ...is due to [15]. Note that in our formulation, hard constraints on the solution are included by the possibility of c taking the value ∞. A generalization to non-quadratic cost functions was stated by [7], cf. the discussion in [25] (note, however, that [7] did not yet allow for coupling of losses at different points). The present generalization to g(‖f‖) is, to our knowledge, new. For a machine learn...

48 | Sparse representation for Gaussian process models - Csató, Opper

Citation Context: ...x, we will typically only be able to find local minima. Independent of the convexity issue, the result lends itself well to gradient-based on-line algorithms for minimizing RKHS-based risk functionals [10,9,17,11,8,16]: for the computation of gradients, we only need the objective function to be differentiable; convexity is not required. Such algorithms can thus be adapted to deal with more general regularizers. Exa...

37 | Algorithmic stability and generalization performance - Bousquet, Elisseeff

Citation Context: ...useful algorithms as special cases and since their statistical performance can be analyzed with tools of learning theory (see [23,3], and, more specifically dealing with regularized risk functionals, [6]). Theorem 1 (Nonparametric Representer Theorem). Suppose we are given a nonempty set X, a positive definite real-valued kernel k on X × X, a training sample (x₁,y₁),...,(x_m,y_m) ∈ X × R, a strictly monoton...

32 | Semiparametric support vector and linear programming machines - Smola, Frieß, et al. - 1999

Citation Context: ...e trade-off between regularization and fit to the training set. In addition, a single (M = 1) constant function ψ₁(x) = b (b ∈ R) is used as an offset that is not regularized by the algorithm [25]. In [22], a semiparametric extension was proposed which shows how to deal with the case M > 1 algorithmically. Theorem 2 applies in that case, too. Example 2 (SV classification). Here, the targets satisfy yᵢ ∈ {...

17 | Maximal margin perceptron - Kowalczyk - 2000

8 | An overview of statistical learning theory - Vapnik

Citation Context: ...thus enabling us to carry out kernel algorithms independent of the (potentially infinite) dimensionality of the feature space. 1 Introduction. Following the development of support vector (SV) machines [23], positive definite kernels have recently attracted considerable attention in the machine learning community. It turns out that a number of results that have now become popular were already known in t...

3 | Some results on Tchebycheffian spline functions - Kimeldorf, Wahba - 1971

1 | Approximate maximal margin classification with respect to an arbitrary norm - Gentile

1 | On-line algorithms for kernel methods (in preparation) - Kivinen, Smola, et al. - 2001