## Entropy Numbers, Operators and Support Vector Kernels (1998)

### Download Links

- [www.neurocolt.org]
- [mlg.anu.edu.au]
- [axiom.anu.edu.au]
- [users.cecs.anu.edu.au]
- DBLP

### Other Repositories/Bibliography

Venue: IEEE Transactions on Information Theory

Citations: 12 (3 self)

### BibTeX

@INPROCEEDINGS{Williamson98entropynumbers,

author = {Robert C. Williamson and Alex J. Smola and Bernhard Schölkopf},

title = {Entropy Numbers, Operators and Support Vector Kernels},

booktitle = {IEEE TRANSACTIONS ON INFORMATION THEORY},

year = {1998},

pages = {127--144},

publisher = {MIT Press}

}

### Abstract

We derive new bounds for the generalization error of feature space machines, such as support vector machines and related regularization networks, by obtaining new bounds on their covering numbers. The proofs are based on a viewpoint that is apparently novel in the field of statistical learning theory. The hypothesis class is described in terms of a linear operator mapping from a possibly infinite-dimensional unit ball in feature space into a finite-dimensional space. The covering numbers of the class are then determined via the entropy numbers of the operator. These numbers, which characterize the degree of compactness of the operator, can be bounded in terms of the eigenvalues of an integral operator induced by the kernel function used by the machine. As a consequence, we are able to theoretically explain the effect of the choice of kernel functions on the generalization performance of support vector machines.
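The abstract's chain of reasoning (kernel → integral operator → eigenvalue decay → entropy/covering number bounds) can be made concrete with a small numeric sketch. This is not code from the paper: it uses the standard Nyström-style observation that the eigenvalues of the normalized Gram matrix K_ij = k(x_i, x_j)/m approximate the leading eigenvalues of the kernel's integral operator, and simply exhibits their rapid decay for a Gaussian RBF kernel.

```python
import numpy as np

def gram_eigenvalues(x, k):
    """Eigenvalues (descending) of the normalized kernel Gram matrix,
    an empirical proxy for the spectrum of the integral operator
    induced by the kernel k."""
    m = len(x)
    K = np.array([[k(xi, xj) for xj in x] for xi in x]) / m
    return np.sort(np.linalg.eigvalsh(K))[::-1]

def rbf(x, y, width=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-(x - y)^2 / (2 width^2))."""
    return np.exp(-((x - y) ** 2) / (2 * width ** 2))

x = np.linspace(-1.0, 1.0, 200)
lam = gram_eigenvalues(x, rbf)
# A smooth kernel yields rapidly decaying eigenvalues; this decay is
# what drives the small entropy numbers, hence small covering numbers.
print(lam[:5] / lam[0])
```

Smoother kernels (wider RBFs) decay faster and hence, by the paper's argument, give smaller covering numbers for the induced hypothesis class.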

### Citations

9946 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ...y control. Support Vector (SV) machines, which have recently been proposed as a new class of learning algorithms solving problems of pattern recognition, regression estimation, and operator inversion [13] are a well known example of this class. A key feature of the present paper is the manner in which we directly bound the covering numbers of interest rather than making use of a combinatorial dimens...

7146 | A mathematical theory of communication
- Shannon
- 1948
Citation Context: ...oncerning the asymptotic rate of decrease of the entropy numbers in terms of the asymptotic behaviour of the eigenvalues. A similar result is actually implicit in section 22 of Shannon's famous paper [27], where he considered the effect of different convolution operators on the entropy of an ensemble. Prosser's paper [25] led to a handful of papers (see e.g. [26,15,3,21]) which studied various convolu...

2386 | Support vector network
- Cortes, Vapnik
- 1995
Citation Context: ...e generates can be expressed as ⟨w, x⟩ + b where both w and x are defined in the feature space S = span(Φ(X)) and b ∈ ℝ. The kernel trick as introduced by [1] was then successfully employed in [4] and [7] to extend the Optimal Margin Hyperplane classifier to what is now known as the SV machine. (The "+b" term is readily dealt with; we omit such considerations here though.) Consider the class F_{R_w} := {...

1422 | A training algorithm for optimal margin classifiers
- Boser, Guyon, et al.
- 1992
Citation Context: ...rnels. In order to apply the above reasoning to a rather general class of nonlinear functions, one can use kernels computing dot products in high-dimensional spaces nonlinearly related to input space [1, 4]. Under certain conditions on a kernel k, to be stated below (Theorem 2), there exists a nonlinear map Φ into a reproducing kernel Hilbert space F such that k computes the dot product in F, i.e. k(x,...

301 | Theoretical foundations of the potential function method in pattern recognition learning
- Aizerman, Braverman, et al.
- 1964
Citation Context: ...rnels. In order to apply the above reasoning to a rather general class of nonlinear functions, one can use kernels computing dot products in high-dimensional spaces nonlinearly related to input space [1, 4]. Under certain conditions on a kernel k, to be stated below (Theorem 2), there exists a nonlinear map Φ into a reproducing kernel Hilbert space F such that k computes the dot product in F, i.e. k(x,...

259 | Structural risk minimization over data-dependent hierarchies
- Shawe-Taylor, Bartlett, et al.
- 1998
Citation Context: ...rm pattern recognition using linear hyperplanes, often a maximum margin of separation between the classes is sought, as this leads to good generalization ability independent of the dimensionality [28]. It can be shown that for separable training data (x_1, y_1), ..., (x_m, y_m) ∈ ℝ^d × {±1}, this is achieved by minimizing ‖w‖² subject to the constraints y_j(⟨w, x_j⟩ + b) ≥ 1 for j = 1, ..., m, and some b ∈ ℝ. The deci...

211 | Scale sensitive dimensions, uniform convergence and learnability
- Alon, Ben-David, et al.
- 1997
Citation Context: ...ass --- the focus of the present paper. Results for both classification and regression are now known. For the sake of concreteness, we quote below a result suitable for regression which was proved in [2]. Let P_m(f) := (1/m) Σ_{i=1}^m f(x_i) denote the empirical mean of f on the sample x_1, ..., x_m. Lemma 1 (Alon, Ben-David, Cesa-Bianchi, and Haussler, 1997) Let F be a class of functions from X...

154 | The connection between regularization operators and support vector kernels. Neural Networks
- Smola, Scholkopf, et al.
- 1998
Citation Context: ...ctions written as kernel expansions f(x) = Σ_{j=1}^m a_j k(x_j, x) + b, with a_j ∈ ℝ, j = 1, ..., m. It has been noticed that different kernels can be characterized by their regularization properties [12]. This provides insight into the regularization properties of SV kernels. However, it does not give us a comprehensive understanding of how to select a kernel for a given learning problem, and how usi...

88 | Operator ideals
- Pietsch
- 1980
Citation Context: ...Kolmogorov's ɛ-entropy of a stochastic process was shown in [2]. Independently, another group of mathematicians including Carl and Stephani [8] studied covering numbers [31] and later entropy numbers [23] in the context of operator ideals. (They seem to be unaware of Prosser's work — see e.g. [9, p. 136].) Connections between the local theory of Banach spaces and uniform convergence of empirical means...

72 | Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators
- Williamson, Smola, et al.
- 1998
Citation Context: ...fined by (12). 1. If ɛ_n(A) = O(log^{−α} n) for some α > 0 then ɛ_n(T) = O(log^{−(α+2)} n). 2. If log ɛ_n(A) = O(log^{−β} n) for some β > 0 then log ɛ_n(T) = O(log^{−β} n). This Lemma (the proof of which is omitted; see [35]) shows that in the first case, Maurey's result (Theorem 3) allows an improvement in the exponent of the entropy number of T, whereas in the second, it affords none (since the entropy numbers decay s...

67 | Necessary and sufficient conditions for the uniform convergence of means to their expectations. Theory of Probability and its Applications
- Vapnik, Chervonenkis
- 1981
Citation Context: ...2 Generalization Bounds via Uniform Convergence: The generalization performance of learning machines can be bounded via uniform convergence results as in [15]. The key thing about these results is the role of the covering numbers of the hypothesis class --- the focus of the present paper. Results for both classification and regression are now known. For th...

67 | ε-entropy and ε-capacity of sets in functional spaces
- Kolmogorov, Tihomirov
- 1959
Citation Context: ...arks. The concept of the metric entropy of a set has been around for some time. It seems to have been introduced by Pontriagin and Schnirelmann [24] and was studied in detail by Kolmogorov and others [19]. The use of metric entropy to say something about linear operators was developed independently by several people. Prosser [25] appears to have been the first to make the idea explicit. He determined ...

65 | Entropy, Compactness and the Approximation of Operators
- Carl, Stephani
- 1990
Citation Context: ...pact operator, i.e. if T(U_E) is compact. The dyadic entropy numbers of an operator are defined by e_n(T) := ε_{2^{n−1}}(T), n ∈ ℕ. A very nice introduction to entropy numbers of operators is [6]. The ε-covering number of F with respect to the metric d, denoted N(ε, F, d), is the size of the smallest ε-cover for F using the metric d. Let N_m(ε, F) := sup_{x_1,...,x_m ∈ X^m} N(ε, F, ℓ_∞^{x_1,...,x_m}). By log and...
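The ε-covering number N(ε, F, d) defined in this context can be illustrated with a small greedy sketch on a finite point set. This is an illustrative helper, not a construction from the paper: greedy covering yields an upper bound on the covering number under the ℓ∞ metric used above.

```python
import numpy as np

def greedy_cover(points, eps):
    """Greedy epsilon-cover under the l-infinity metric: return a subset
    `centers` of `points` such that every point lies within eps of some
    center. len(centers) upper-bounds the covering number N(eps)."""
    centers = []
    for p in points:
        if not any(np.max(np.abs(p - c)) <= eps for c in centers):
            centers.append(p)
    return centers

rng = np.random.default_rng(0)
# 500 "functions" represented by their values at 3 sample points.
F = rng.uniform(-1.0, 1.0, size=(500, 3))
cover = greedy_cover(F, eps=0.5)
print(len(cover))  # an upper bound on N(0.5, F, l_inf on the sample)
```

The paper's point is that for SV classes this quantity need not be estimated empirically: it can be bounded analytically through the entropy numbers of the associated operator.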

41 | Theory of Pattern Recognition [in Russian]
- Vapnik, Chervonenkis
- 1974
Citation Context: ...l in terms of the parameter R_w which is the inverse of the size of the margin in feature space, or equivalently, the size of the weight vector in feature space as defined by the dot product in S (see [14, 13] for details). In the following we will call such hypothesis classes with length constraint on the weight vectors in feature space SV classes. Let T be the operator T = S_{X^m} R_w where R_w ∈ ℝ and the op...

40 | Eigenvalue distribution of compact operators
- König
- 1986

29 | A note on a scale-sensitive dimension of linear bounded functionals
- Gurvits
- 1997
Citation Context: ...re of Prosser's work — see e.g. [9, p. 136].) Connections between the local theory of Banach spaces and uniform convergence of empirical means have been noted before (e.g. [22]). More recently Gurvits [14] has obtained a result relating the Rademacher type of a Banach space to the fat-shattering dimension of linear functionals on that space and hence via the key result in [4] to the covering numbers of ...

28 | Information Theory, Interscience
- Ash
- 1965

25 | Inequalities of Bernstein–Jackson-type and the degree of compactness of operators in Banach spaces
- Carl
- 1985
Citation Context: ...⟨x_1, w⟩, ..., ⟨x_m, w⟩) (15) with x_j ∈ Φ(X) for all j. The following theorem is useful when computing entropy numbers in terms of T and A. It is originally due to Maurey, and was extended by Carl [5]. See [16] for some extensions and historical remarks. Theorem 6 (Carl and Stephani [6, p. 246]) Let S ∈ L(H, ℓ_∞^m) where H is a Hilbert space. Then there exists a constant c > 0 such that for all m ∈ ℕ, and 1 ≤ ...

20 | Probabilistic analysis of learning in artificial neural networks: The pac model and its variants
- Anthony
- 1997
Citation Context: ...n of N_m(ɛ, F) to the computation of a single "dimension-like" quantity. An overview of these various dimensions, some details of their history, and some examples of their computation can be found in [5]. In the present work, we view the class F as being induced by an operator T̄_k depending on some kernel function k. Thus F is the image of...

16 | Geometric and probabilistic estimates for entropy and approximation numbers of operators
- Gordon, König, et al.
- 1987

16 | A framework for structural risk minimisation
- Shawe-Taylor, Bartlett, et al.
- 1996
Citation Context: ...rm pattern recognition using linear hyperplanes, often a maximum margin of separation between the classes is sought, as this leads to good generalization ability independent of the dimensionality [10]. It can be shown that for separable training data (x_1, y_1), ..., (x_m, y_m) ∈ ℝ^d × {±1}; (An extended version of this paper is available as NC2-TR-1998-019.) Introduction, Definit...

15 | Generalization performance of classifiers in terms of observed covering numbers
- Shawe-Taylor, Williamson
- 1999
Citation Context: ...is paper (NC2-TR-1998-019). Furthermore the statistical argument needed to exploit such techniques (bounding generalization error in terms of empirical covering numbers) has now been developed --- see [11]. 4 Eigenvalue Decay Rates: The results presented above show that if one knows the eigenvalue sequence (λ_i)_i of a compact operator, one can bound its entropy numbers. A commonly used kernel is k(x, y...

12 | Entropy numbers of diagonal operators with an application to eigenvalue problems
- Carl
- 1981
Citation Context: ...Σ_{j=1}^n λ_j |ψ_j(x)|². (8) Then ‖a_n(·)‖_{L1(X)} = A_n due to the normalization condition on ψ_j. However, as µ(X) < ∞ there exists a set X̃ of nonzero measure such that a_n(x) ≥ A_n/µ(X) for all x ∈ X̃. (9) Combining the left side of (7) with (8) we obtain ‖SΦ(x)‖²_{ℓ2} ≥ a_n(x) for all n ∈ ℕ and almost all x. Since a_n(x) is unbounded on a set X̃ with nonzero measure in X, we can see that SΦ(X) ⊄ ℓ₂. □ The...

10 | A Maximum Margin Miscellany
- Williamson, Scholkopf, et al.
- 1999
Citation Context: ...⟨x_1, w⟩, ..., ⟨x_m, w⟩) (15) with x_j ∈ Φ(X) for all j. The following theorem is useful when computing entropy numbers in terms of T and A. It is originally due to Maurey, and was extended by Carl [5]. See [16] for some extensions and historical remarks. Theorem 6 (Carl and Stephani [6, p. 246]) Let S ∈ L(H, ℓ_∞^m) where H is a Hilbert space. Then there exists a constant c > 0 such that for all m ∈ ℕ, and 1 ≤ ...

5 | Sur une propriété métrique de la dimension
- Pontriagin, Schnirelmann
- 1932
Citation Context: ...We conclude this section with some brief historical remarks. The concept of the metric entropy of a set has been around for some time. It seems to have been introduced by Pontriagin and Schnirelmann [24] and was studied in detail by Kolmogorov and others [19]. The use of metric entropy to say something about linear operators was developed independently by several people. Prosser [25] appears to have ...

4 | Characterization of weak type by the entropy distribution of r-nuclear operators
- Defant, Junge
- 1993
Citation Context: ...he space it maps to), and the rate of decay of its entropy numbers has been (independently) shown by Kolchinskiĭ [17,18] and Defant and Junge [12,16]. Note that the exact formulation of their results differs. Kolchinskiĭ was motivated by probabilistic problems not unlike ours. 3 Generalization Bounds via Uniform Convergence: The generalization perf...

4 | ε-entropy and approximation of bandlimited functions
- Jagerman
- 1969
Citation Context: ...n section 22 of Shannon's famous paper [27], where he considered the effect of different convolution operators on the entropy of an ensemble. Prosser's paper [25] led to a handful of papers (see e.g. [26,15,3,21]) which studied various convolutional operators. A connection between Prosser's ɛ-entropy of an operator and Kolmogorov's ɛ-entropy of a stochastic process was shown in [2]. Independently, another gro...

4 | Some estimates on entropy numbers
- Defant, Junge
- 1993
Citation Context: ...he space it maps to), and the rate of decay of its entropy numbers has been (independently) shown by Kolchinskiĭ [17,18] and Defant and Junge [12,16]. Note that the exact formulation of their results differs. Kolchinskiĭ was motivated by probabilistic problems not unlike ours. 3 Generalization Bounds via Uniform Convergence: The generalization perf...

3 | Interpolationseigenschaften von Entropie- und Durchmesseridealen kompakter Operatoren
- Triebel
- 1970
Citation Context: ...s ɛ-entropy of an operator and Kolmogorov's ɛ-entropy of a stochastic process was shown in [2]. Independently, another group of mathematicians including Carl and Stephani [8] studied covering numbers [31] and later entropy numbers [23] in the context of operator ideals. (They seem to be unaware of Prosser's work — see e.g. [9, p. 136].) Connections between the local theory of Banach spaces and uniform...

2 | The ε-entropy and ε-capacity of certain time-varying channels
- Prosser
- 1966
Citation Context: ...and Schnirelmann [24] and was studied in detail by Kolmogorov and others [19]. The use of metric entropy to say something about linear operators was developed independently by several people. Prosser [25] appears to have been the first to make the idea explicit. He determined the effect of an operator's spectrum on its entropy numbers. In particular, he proved a number of results concerning the asympt...

2 | A maximum margin miscellany (Typescript)
- Williamson, Schölkopf, et al.
- 1999
Citation Context: ...ing task. We believe that the new viewpoint in itself is potentially very valuable, perhaps more so than the specific results in the paper. A further exploitation of the new viewpoint can be found in [36]. There are in fact a variety of ways to define exactly what is meant by T̄_k, and we have deliberately not been explicit in the picture. We make use of one particular T̄_k in this paper. A slightly d...

1 | An operator theoretical characterization of ɛ-entropy in Gaussian processes
- Akashi
- 1986
Citation Context: ...f papers (see e.g. [26,15,3,21]) which studied various convolutional operators. A connection between Prosser's ɛ-entropy of an operator and Kolmogorov's ɛ-entropy of a stochastic process was shown in [2]. Independently, another group of mathematicians including Carl and Stephani [8] studied covering numbers [31] and later entropy numbers [23] in the context of operator ideals. (They seem to be unawar...

1 | The asymptotic behaviour of ε-entropy of a compact positive operator
- Akashi
- 1990
Citation Context: ...n section 22 of Shannon's famous paper [27], where he considered the effect of different convolution operators on the entropy of an ensemble. Prosser's paper [25] led to a handful of papers (see e.g. [26,15,3,21]) which studied various convolutional operators. A connection between Prosser's ɛ-entropy of an operator and Kolmogorov's ɛ-entropy of a stochastic process was shown in [2]. Independently, another gro...

1 | Operators of type p and metric entropy
- Kolchinskiĭ
- 1988
Citation Context: ...type of an operator (or of the space it maps to), and the rate of decay of its entropy numbers has been (independently) shown by Kolchinskiĭ [17,18] and Defant and Junge [12,16]. Note that the exact formulation of their results differs. Kolchinskiĭ was motivated by probabilistic problems not unlike ours. 3 Generalization Bounds via Uniform Conver...

1 | Entropic order of operators in Banach spaces and the central limit theorem
- Kolchinskii
- 1991
Citation Context: ...type of an operator (or of the space it maps to), and the rate of decay of its entropy numbers has been (independently) shown by Kolchinskiĭ [17,18] and Defant and Junge [12,16]. Note that the exact formulation of their results differs. Kolchinskiĭ was motivated by probabilistic problems not unlike ours. 3 Generalization Bounds via Uniform Conver...

1 | ε-entropy, ε-rate, and interpolation spaces revisited with an application to linear communication channels
- Koski, Persson, et al.
- 1994
Citation Context: ...n section 22 of Shannon's famous paper [27], where he considered the effect of different convolution operators on the entropy of an ensemble. Prosser's paper [25] led to a handful of papers (see e.g. [26,15,3,21]) which studied various convolutional operators. A connection between Prosser's ɛ-entropy of an operator and Kolmogorov's ɛ-entropy of a stochastic process was shown in [2]. Independently, another gro...

1 | Sous-espaces ℓ¹_n des espaces de Banach
- Pajor
- 1985
Citation Context: ...deals. (They seem to be unaware of Prosser's work — see e.g. [9, p. 136].) Connections between the local theory of Banach spaces and uniform convergence of empirical means has been noted before (e.g. [22]). More recently Gurvits [14] has obtained a result relating the Rademacher type of a Banach space to the fat-shattering dimension of linear functionals on that space and hence via the key result in [4...

1 | The ε-entropy and ε-capacity of certain time-invariant channels
- Prosser, Root
- 1968