## A unified framework for Regularization Networks and Support Vector Machines (1999)

Citations: 50 (13 self)

### BibTeX

```
@MISC{Evgeniou99aunified,
    author = {Theodoros Evgeniou and Massimiliano Pontil},
    title  = {A unified framework for Regularization Networks and Support Vector Machines},
    year   = {1999}
}
```

### Abstract

This report describes research done at the Center for Biological & Computational Learning and the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. This research was sponsored by the National Science Foundation under contract No. IIS-9800032, and by the Office of Naval Research under contract No. N00014-93-1-0385 and contract No. N00014-95-1-0600. Partial support was also provided by Daimler-Benz AG, Eastman Kodak, Siemens Corporate Research, Inc., ATR and AT&T.

Contents:
1. Introduction
2. Overview of statistical learning theory
    - 2.1 Uniform convergence and the Vapnik-Chervonenkis bound
    - 2.2 The method of Structural Risk Minimization
    - 2.3 ε-uniform convergence and the V_γ dimension
    - 2.4 Overview of our approach
3. Reproducing Kernel Hilbert Spaces: a brief overview
4. Regularization Networks
    - 4.1 Radial Basis Functions
    - 4.2 Regularization, generalized splines and kernel smoothers
    - 4.3 Dual representation of Regularization Networks
    - 4.4 From regression to classification
5. Support vector machines
    - 5.1 SVM in RKHS
    - 5.2 From regression to classification
6. SRM for RNs and SVMs
    - 6.1 SRM for SVM Classification
        - 6.1.1 Distribution dependent bounds for SVMC
7. A Bayesian Interpretation of Regularization and SRM?
    - 7.1 Maximum A Posteriori Interpretation of …
    - 7.2 Bayesian interpretation of the stabilizer in the RN and SVM functionals
    - 7.3 Bayesian interpretation of the data term in the Regularization and SVM functionals
    - 7.4 Why a MAP interpretation may be misleading
8. Connections between SVMs and Sparse Ap…

### Citations

8980 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: "…and from now on we concentrate on the two-sided uniform convergence in probability, which we simply refer to as uniform convergence. The theory of uniform convergence of ERM has been developed in [95, 96, 97, 92, 94]. It has also been studied in the context of empirical processes [28, 72, 29]. Here we summarize the main results of the theory. Typically for regression the loss function is of the form V(y − f(x))…" |

3921 | Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context: "…which we simply refer to as uniform convergence. The theory of uniform convergence of ERM has been developed in [95, 96, 97, 92, 94]. It has also been studied in the context of empirical processes [28, 72, 29]. Here we summarize the main results of the theory. Typically for regression the loss function is of the form V(y − f(x)). In the case that V is (y − f(x))², the minimizer of eq. (10) is the r…" |

3629 | Neural Networks: A Comprehensive Foundation (2nd ed.)
- Haykin
- 1999
Citation Context: "…quadratic functional min_{f∈H} H[f] = (1/l) Σ_{i=1}^l (y_i − f(x_i))² + λ‖f‖²_K (32) for a fixed λ. Formulations like equation (32) are a special form of regularization theory developed by Tikhonov, Ivanov [90, 44] and others to solve ill-posed problems and in particular to solve the problem of approximating the functional relation between x and y given a finite number of examples D = {x_i, y_i}_{i=1}^l. As we ment…" |
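The functional in the context above, min_{f∈H} (1/l) Σ_{i=1}^l (y_i − f(x_i))² + λ‖f‖²_K, has a minimizer of the form f(x) = Σ_i c_i K(x, x_i), with coefficients solving (K + λlI)c = y. A minimal numerical sketch; the Gaussian kernel, data, and parameter values here are illustrative assumptions, not taken from the report:

```python
import numpy as np

# Tikhonov regularization (eq. 32): by the representer theorem the
# minimizer is f(x) = sum_i c_i K(x, x_i), with (K + lam * l * I) c = y.

def gaussian_kernel(X1, X2, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2)), a Gaussian RBF kernel
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def fit_rn(X, y, lam):
    l = len(X)
    K = gaussian_kernel(X, X)
    return np.linalg.solve(K + lam * l * np.eye(l), y)  # coefficients c

def predict(Xtrain, c, Xtest):
    return gaussian_kernel(Xtest, Xtrain) @ c           # f(x) = sum_i c_i K(x, x_i)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(40, 1))
y = np.sin(3 * X[:, 0]) + 0.05 * rng.standard_normal(40)
c = fit_rn(X, y, lam=1e-3)
yhat = predict(X, c, X)
print(np.abs(yhat - y).mean())  # small training error for small lambda
```

Larger λ trades data fit for a smaller RKHS norm ‖f‖²_K, i.e. a smoother solution.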

2284 | A tutorial on support vector machines for pattern recognition
- Burges
- 1998
Citation Context: "…Σ_{i=1}^l ξ_i while controlling capacity measured in terms of the norm of f in the RKHS. In fact, the norm of f is related to the notion of margin, an important idea for SVMC for which we refer the reader to [94, 15]. As we mentioned in section 2, for binary pattern classification the empirical error is defined as a sum of binary numbers which in problem (5.4) would correspond to Σ_{i=1}^l θ(ξ_i). However in such a…" |

2171 | Support vector networks
- Cortes, Vapnik
- 1995
Citation Context: "…clearer we sketch a way to construct an RKHS, which is relevant to our paper. The mathematical details (such as the convergence or not of certain series) can be found in the theory of integral equations [43, 19, 22]. Let us assume that we have a sequence of positive numbers λ_n and linearly independent functions φ_n(x) such that they define a function K(x, y) in the following way: K(x, y) ≡ Σ_n λ_n φ_n(x) φ_n(y) (2…" |
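The construction sketched in this context, K(x, y) defined as a weighted series Σ_n λ_n φ_n(x) φ_n(y), can be checked numerically: any such (truncated) series produces a positive semidefinite Gram matrix. The cosine features and 1/n² coefficients below are a hypothetical choice for illustration:

```python
import numpy as np

# Truncated expansion K(x, y) = sum_n lambda_n phi_n(x) phi_n(y)
# with lambda_n = 1/n^2 and phi_n(x) = cos(n x) (assumed for the sketch).

def K(x, y, N=20):
    n = np.arange(1, N + 1)
    lam = 1.0 / n ** 2                 # summable positive coefficients
    phi_x = np.cos(np.outer(x, n))     # phi_n evaluated at each x
    phi_y = np.cos(np.outer(y, n))
    return (phi_x * lam) @ phi_y.T

x = np.linspace(0, np.pi, 15)
G = K(x, x)                            # Gram matrix K(x_i, x_j)
eig = np.linalg.eigvalsh(G)
print(eig.min())                       # >= 0 up to round-off: K is positive semidefinite
```

The matrix G factors as Φ diag(λ) Φᵀ, which is why positive semidefiniteness holds by construction here.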

1722 | Ten Lectures on Wavelets - Daubechies - 1992 |

1695 | A Theory of the Learnable
- Valiant
- 1984
Citation Context: "…analysis arguments which rely on asymptotic results and do not consider finite data sets. Regularization is the approach we have taken in earlier work on learning [67, 37, 75]. The seminal work of Vapnik [92, 93, 94] has now set the foundations for a more general theory that justifies regularization functionals for learning from finite sets and can be used to extend considerably the classical framework of regular…" |

1652 | Atomic decomposition by basis pursuit
- Chen, Donoho, et al.
- 2001
Citation Context: "…linear superposition of a small number of basis functions selected from a large, redundant set of basis functions, called a dictionary. These techniques go under the name of Sparse Approximations (SAs) [17, 16, 63, 40, 23, 55, 20, 25]. We will start with a short overview of SAs. Then we will discuss a result due to Girosi [36] that shows an equivalence between SVMs and a particular SA technique. Finally we will discuss the prob…" |

1273 | Spline models for observational data - Wahba - 1990 |

1160 | Modeling by shortest data description - Rissanen - 1983 |

994 | A probabilistic theory of pattern recognition
- Devroye, Györfi, et al.
- 1996
Citation Context: "…binary pattern classification, i.e. the case where we are given data that belong to one of two classes (classes −1 and 1) and we want to find a function that separates these classes. It can be shown [27] that, if V in equation (35) is (y − f(x))², and if K defines a finite dimensional RKHS, then the minimizer of the equation H[f] = (1/l)…" |
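A tiny illustration of the remark above: minimizing the square loss on ±1 labels estimates the regression function, and its sign can then serve as the classifier. The linear model and synthetic Gaussian classes below are illustrative assumptions, not the setting of the report:

```python
import numpy as np

# Two synthetic classes with labels -1 and +1; least squares on the
# labels is the square-loss minimizer, and sign(f) is the classifier.

rng = np.random.default_rng(2)
X = np.r_[rng.normal(-2, 1, (30, 1)), rng.normal(2, 1, (30, 1))]
y = np.r_[-np.ones(30), np.ones(30)]

A = np.c_[X, np.ones(len(X))]               # linear model f(x) = w x + b
w, *_ = np.linalg.lstsq(A, y, rcond=None)   # square-loss minimizer
pred = np.sign(A @ w)                       # threshold the regression estimate
print((pred == y).mean())                   # accuracy of sign(f)
```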

946 | On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab.
- Vapnik, Chervonenkis
- 1971
Citation Context: "…and from now on we concentrate on the two-sided uniform convergence in probability, which we simply refer to as uniform convergence. The theory of uniform convergence of ERM has been developed in [95, 96, 97, 92, 94]. It has also been studied in the context of empirical processes [28, 72, 29]. Here we summarize the main results of the theory. Typically for regression the loss function is of the form V(y − f(x))…" |

928 | Emergence of simple-cell receptive field properties by learning a sparse code for natural images
- Olshausen, Field
- 1996
Citation Context: "…functional is the same as that of equation (70), but here it is important to notice that λ(l) = α/l. As noticed by Girosi et al. [37], functionals of the type (72) are common in statistical physics [65], where the stabilizer (here ‖f‖²_K) plays the role of an energy functional. As we will see later, the RKHS setting we use in this paper makes clear that the correlation function of the physical syst…" |

824 | Solution of Ill-posed Problems
- Tikhonov, Arsenin
Citation Context: "…analysis arguments which rely on asymptotic results and do not consider finite data sets. Regularization is the approach we have taken in earlier work on learning [67, 37, 75]. The seminal work of Vapnik [92, 93, 94] has now set the foundations for a more general theory that justifies regularization functionals for learning from finite sets and can be used to extend considerably the classical framework of regular…" |

803 | Estimation of Dependences Based on Empirical Data
- Vapnik
- 1979
Citation Context: "…Σ_{i=1}^l ξ_i while controlling capacity measured in terms of the norm of f in the RKHS. In fact, the norm of f is related to the notion of margin, an important idea for SVMC for which we refer the reader to [94, 15]. As we mentioned in section 2, for binary pattern classification the empirical error is defined as a sum of binary numbers which in problem (5.4) would correspond to Σ_{i=1}^l θ(ξ_i). However in such a…" |

777 | Theory of reproducing kernels
- Aronszajn
- 1950
Citation Context: "…discussed. First we present an overview of RKHS which are the hypothesis spaces we consider in the paper. 3 Reproducing Kernel Hilbert Spaces: a brief overview. A Reproducing Kernel Hilbert Space (RKHS) [5] is a Hilbert space H of functions defined over some bounded domain X ⊂ R^d with the property that, for each x ∈ X, the evaluation functionals F_x defined as F_x[f] = f(x) ∀f ∈ H are linear, bounded funct…" |

639 | Networks for approximation and learning
- Poggio, Girosi
- 1990
Citation Context: "…necessarily orthonormal), and the kernel K is the "correlation" matrix associated with these basis functions. It is in fact well known that there is a close relation between Gaussian processes and RKHS [56, 38, 70]. Wahba [100] discusses in depth the relation between regularization, RKHS and correlation functions of Gaussian processes. The choice of the φ_n defines a space of functions – the functions that are s…" |

628 | Constructive Approximation
- Devore, Lorentz
- 1993
Citation Context: "…annealed VC-entropy also exist. These are tighter than the VC-dimension ones. We want (16) to hold simultaneously for all spaces H_i, since we choose the best f̂_{i,l}. Various cases are discussed in [26], e.g. n(l) = l (17). However, in practice l is finite ("small"), so n(l) is small which means that H = ∪_{i=1}^{n(l)} H_i is a small space. Therefore I[f_H] may be much larger than the expected risk of ou…" |

567 | Convergence of Stochastic Processes
- Pollard
- 1984
Citation Context: "…found to be f. Then, there exists a value a ∈ (0, 1) such that for every ɛ ∈ [a, 1), if the regression problem (5.1) is solved with parameter (1 − ɛ)C, the optimal solution will be (1 − ɛ)f. We refer to [74] for the proof. A sketch of the proof is given in Appendix D. A direct implication of this result is that one can solve any SVMC problem through the SVMR formulation. A formal proof of this result can…" |

493 | Entropy-based algorithms for best basis selection
- Coifman, Wickerhauser
- 1992
Citation Context: "…x_i)) (7), V(y_i, f(x_i)) = θ(−y_i f(x_i)) (8), where θ(·) is the Heaviside function. For classification one should minimize (8) (or (7)), but in practice other loss functions, such as the soft margin one (6) [21, 93], are used. We discuss this issue further in section 6. The minimizer of (3) using the three loss functions has the same general form (2) (or f(x) = Σ_{i=1}^l c_i K(x, x_i) + b, see later) but interestingly d…" |
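The losses in this context can be compared directly: the Heaviside (misclassification) loss θ(−y_i f(x_i)) only counts errors, while the soft margin loss also penalizes correctly classified points that fall inside the margin. A small sketch; the classifier outputs are made-up values:

```python
import numpy as np

# Hard misclassification loss theta(-y f(x)) from eq. (8) versus the
# soft margin loss max(0, 1 - y f(x)) of eq. (6).

def heaviside(t):
    return (t > 0).astype(float)

y = np.array([1, 1, -1, -1])
f = np.array([2.0, 0.3, -0.2, 0.5])        # hypothetical classifier outputs

hard = heaviside(-y * f)                   # 1 only on misclassified points
soft = np.maximum(0.0, 1 - y * f)          # also penalizes points inside the margin

print(hard)  # [0. 0. 0. 1.]
print(soft)  # [0.  0.7 0.8 1.5]
```

The second and third points are classified correctly (hard loss 0) yet still incur soft margin loss, which is what makes the soft margin objective tractable to minimize.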

356 | Spline Functions: Basic Theory
- Schumaker
- 1981
Citation Context: "…f ∈ H_A: Pr{I[f] > ɛ + I_emp[f; l]} ≤ G(ɛ, m, h_γ) (69). Notice that (69) is different from existing bounds that use the empirical hard margin (θ(1 − yf(x))) error [8]. It is similar in spirit to bounds in [85] where Σ_{i=1}^l ξ_i² is used. On the other hand, it can be shown [33] that the V_γ dimension for loss functions of the form |1 − yf(x)|^σ_+ is of the form O(R²A²/γ²) for all 0 < σ ≤ 1. Thus, using t…" |

311 | Regularization theory and neural-network architectures, Neural Computation
- Girosi, Jones, et al.
- 1995
Citation Context: "…tion of the approximation problem is known as the dual of equation (48), and the basis functions b_i(x) are called the equivalent kernels, because of the similarity with the kernel smoothing technique [86, 39, 41]. Notice that, while in equation (48) the difficult part is the computation of the coefficients c_i, the kernel function K(x, x_i) being predefined, in the dual representation (49) the difficult part is the…" |

297 | An information maximization approach to blind separation and blind deconvolution
- Bell, Sejnowski
- 1995
Citation Context: "…and the sources x are unknown, and we assume that the x_i(t) are statistically independent, while we don't have any explicit restriction on A. Various methods for ICA have been developed in recent years [3, 9, 61, 51, 63]. A review of the methods can be found in [50]. Typically the problem is solved by assuming a probability distribution model for the sources x_i(t). A typical prior distribution is the Laplacian, namel…" |

289 | Natural gradient works efficiently in learning
- Amari
- 1998
Citation Context: "…tion is Blind Source Separation (BSS) where one is given a signal and seeks to decompose it as a linear combination of a number of unknown statistically independent sources. Following the notation in [4], the problem can be formulated as finding at any time t both the n (n predefined) sources x(t) = (x₁(t), …, x_n(t)) and the mixing matrix A (which is assumed to be the same for every t) of the system o…" |

277 | Interpolation of scattered data: Distance matrices and conditionally positive definite functions
- Micchelli
- 1986
Citation Context: "…may give better results than other regression approaches [64]. • Invariances and Virtual Examples. In many pattern recognition problems specific invariances are known to hold a priori. Niyogi et al. [60] showed how several invariances can be embedded in the stabilizer or, equivalently, in virtual examples (see [87] for related work on tangent distance, and [82]). • Generative probabilistic models. J…" |

254 | Structural risk minimization over data-dependent hierarchies
- Shawe-Taylor, Bartlett, et al.
- 1998
Citation Context: "…tion of the approximation problem is known as the dual of equation (48), and the basis functions b_i(x) are called the equivalent kernels, because of the similarity with the kernel smoothing technique [86, 39, 41]. Notice that, while in equation (48) the difficult part is the computation of the coefficients c_i, the kernel function K(x, x_i) being predefined, in the dual representation (49) the difficult part is the…" |

239 | Statistical Field Theory
- Parisi
- 1988
Citation Context: "…formulation is based on functional analysis arguments which rely on asymptotic results and do not consider finite data sets. Regularization is the approach we have taken in earlier work on learning [67, 37, 75]. The seminal work of Vapnik [92, 93, 94] has now set the foundations for a more general theory that justifies regularization functionals for learning from finite sets and can be used to extend consid…" |

239 | Computational Vision and Regularization Theory
- Poggio, Koch, et al.
- 1985
Citation Context: "…in section 8, given the equivalence with SVM. Data terms of the type V(y_i − f(x_i)) can be interpreted [38] in probabilistic terms as non-Gaussian noise models. Recently, Pontil, Mukherjee and Girosi [73] have derived a noise model corresponding to Vapnik's ɛ-insensitive loss function. It turns out that the underlying noise model consists of the superposition of Gaussian processes with different varia…" |

238 | Efficient pattern recognition using a new transformation distance
- Simard, LeCun, et al.
- 1993
Citation Context: "…e operator in L₂(Ω) has an expansion of the form (27), in which the φ_i and the λ_i are respectively the orthogonal eigenfunctions and the positive eigenvalues of the operator corresponding to K. In [89] it is reported that the positivity of the operator associated to K is equivalent to the statement that the kernel K is positive definite, that is the matrix K_{ij} = K(x_i, x_j) is positive definite for a…" |

233 | Learning from Data: Concepts, Theory and Methods
- Cherkassky, Mulier
- 1998
Citation Context: "…clearer we sketch a way to construct an RKHS, which is relevant to our paper. The mathematical details (such as the convergence or not of certain series) can be found in the theory of integral equations [43, 19, 22]. Let us assume that we have a sequence of positive numbers λ_n and linearly independent functions φ_n(x) such that they define a function K(x, y) in the following way: K(x, y) ≡ Σ_n λ_n φ_n(x) φ_n(y) (2…" |

205 | Scale-sensitive dimensions, uniform convergence, and learnability
- Alon, Ben-David, et al.
- 1997
Citation Context: "…necessary condition for uniform convergence in the case of real valued functions. To get a necessary condition we need a slight extension of the VC-dimension that has been developed (among others) in [48, 2], known as the V_γ-dimension. Here we summarize the main results of that theory that we will also use later on to design regression machines for which we will have distribution independent uniform…" |

203 | An equivalence between sparse approximation and support vector machines
- Girosi
- 1998
Citation Context: "…in the most compact way is chosen as the "best" hypothesis. Similar ideas have been explored by others (see [93, 94] for a summary)…where V(x) is any monotonically increasing loss function (see [38]). In particular it can be applied to the SVM (regression) case in which the relevant functional is (1/l) Σ_{i=1}^l |y_i − f(x_i)|_ɛ + λ‖f‖²_K (76). In both cases, one can write appropriate P[D_l|f] and P[f…" |
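The data term of the SVM regression functional quoted above uses Vapnik's ɛ-insensitive loss, |y − f(x)|_ɛ = max(0, |y − f(x)| − ɛ), which ignores residuals smaller than ɛ. A one-function sketch; the residual values and ɛ = 0.5 are arbitrary choices:

```python
import numpy as np

# Vapnik's epsilon-insensitive loss |r|_eps = max(0, |r| - eps):
# residuals inside the eps-tube contribute nothing to the data term.

def eps_insensitive(r, eps=0.5):
    return np.maximum(0.0, np.abs(r) - eps)

r = np.array([-1.2, -0.3, 0.0, 0.4, 2.0])  # hypothetical residuals y - f(x)
print(eps_insensitive(r))  # [0.7 0.  0.  0.  1.5]
```

The flat region of this loss is what produces the sparsity of the SVM solution: points inside the tube get zero coefficients.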

197 | Efficient Distribution-free Learning of Probabilistic Concepts
- Kearns, Schapire
- 1994
Citation Context: "…are statistically independent, while we don't have any explicit restriction on A. Various methods for ICA have been developed in recent years [3, 9, 61, 51, 63]. A review of the methods can be found in [50]. Typically the problem is solved by assuming a probability distribution model for the sources x_i(t). A typical prior distribution is the Laplacian, namely P(x(t)) ∝ e^{−(|x₁(t)| + ··· + |x_n(t)|)}. Moreover,…" |

195 | A Theory of Networks for Approximation and Learning
- Poggio, Girosi
Citation Context: "…m of equations as in equation (34) [35, 38, 88]. For a proof see Appendix C. The approximation scheme of equation (33) has a simple interpretation in terms of a network with one layer of hidden units [69, 37]. Using different kernels we get various RNs. A short list of examples is given in Table 1, e.g. the Gaussian RBF K(x − y) = exp(−‖x − y‖²) and the inverse multiquadric K(x − y) = (‖x − y‖² + c²)^{−1/2}…" |

184 | Probabilistic solution of ill-posed problems in computational vision - Marroquin, Mitter, et al. - 1987 |

179 | Ill-posed problems in early vision - Bertero, Poggio, et al. - 1987 |

178 | The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights is More Important than the Size of the Network
- Bartlett
- 1998
Citation Context: "…(65), and not for the machines of the form (3) typically used in practice [33]. This is unlike the bound in [8] which holds for machines of the form (65) and is derived using the theoretical results of [6] where a type of "continuous" SRM (for example for a structure of hypothesis spaces defined through the continuous parameter A of (65)) is studied. In the case of classification the difficulty is t…" |

175 | Methods of Mathematical Physics
- Courant, Hilbert
- 1962
Citation Context: "…linear superposition of a small number of basis functions selected from a large, redundant set of basis functions, called a dictionary. These techniques go under the name of Sparse Approximations (SAs) [17, 16, 63, 40, 23, 55, 20, 25]. We will start with a short overview of SAs. Then we will discuss a result due to Girosi [36] that shows an equivalence between SVMs and a particular SA technique. Finally we will discuss the prob…" |

166 | The Theory of Radial Basis Function Approximation - Powell - 1992 |

164 | Approximation of Functions
- Lorentz
- 1986
Citation Context: "…c positive definite so that the stabilizer is a norm. However, the theory can be extended without problems to the case in which K is positive semidefinite, in which case the stabilizer is a semi-norm [100, 54, 30, 32]. This approach was also sketched in [88]. The stabilizer in equation (32) effectively constrains f to be in the RKHS defined by K. It is possible to show (see for example [67, 37]) that the function…" |

163 | Contributions to the problem of approximation of equidistant data by analytic functions
- Schoenberg
Citation Context: "…known to hold a priori. Niyogi et al. [60] showed how several invariances can be embedded in the stabilizer or, equivalently, in virtual examples (see [87] for related work on tangent distance, and [82]). • Generative probabilistic models. Jaakkola and Haussler [45] consider the case in which prior information is available in terms of a parametric probabilistic model P(x, y) of the process generatin…" |

142 | Matching pursuit in a time-frequency dictionary
- Mallat, Zhang
- 1993
Citation Context: "…K, on the density of data points, and on the regularization parameter λ. This shows that apparently "global" approximation schemes can be regarded as local, memory-based techniques (see equation 49) [57]. 4.4 From regression to classification. So far we only considered the case that the unknown function can take any real values, specifically the case of regression. In the particular case that the unkn…" |

132 | The relationship between variable selection and data augmentation and a method for prediction
- Allen
- 1974
Citation Context: "…es a RKHS in which the "features" φ_n are Fourier components, that is K(x, y) ≡ Σ_{n=0}^∞ λ_n φ_n(x) φ_n(y) ≡ Σ_{n=0}^∞ λ_n e^{i2πn·x} e^{−i2πn·y} (44). Thus any positive definite radial kernel defines a RKHS over [0, 1] with a scalar product of the form ⟨f, g⟩_H ≡ Σ_{n=0}^∞ f̃(n) g̃*(n) / λ_n (45), where f̃ is the Fourier transform of f. The RKHS becomes simply the subspace of L₂([0, 1]^d) of the functions such that…" |

123 | Basis Pursuit
- Chen, Donoho
- 1994
Citation Context: "…tion of H in equation (3) for different choices of V: • Classical (L₂) Regularization Networks (RN): V(y_i, f(x_i)) = (y_i − f(x_i))² (4). There is a large literature on the subject: useful reviews are [42, 18, 100, 37], [94] and references therein. The general regularization scheme for learning is sketched in Appendix A. The method of quasi-solutions of Ivanov and the equivalent Tikhonov's regularization techni…" |

120 | Local learning algorithms - Bottou, Vapnik - 1992 |

118 | Generalization performance of support vector machines and other pattern classifiers
- Bartlett, Shawe-Taylor
- 1998
Citation Context: "…ity (68) implies that (uniformly) for all f ∈ H_A: Pr{I[f] > ɛ + I_emp[f; l]} ≤ G(ɛ, m, h_γ) (69). Notice that (69) is different from existing bounds that use the empirical hard margin (θ(1 − yf(x))) error [8]. It is similar in spirit to bounds in [85] where Σ_{i=1}^l ξ_i² is used. On the other hand, it can be shown [33] that the V_γ dimension for loss functions of the form |1 − yf(x)|^σ_+ is of the for…" |

112 | Introduction to Gaussian processes
- MacKay
- 1998
Citation Context: "…linear superposition of a small number of basis functions selected from a large, redundant set of basis functions, called a dictionary. These techniques go under the name of Sparse Approximations (SAs) [17, 16, 63, 40, 23, 55, 20, 25]. We will start with a short overview of SAs. Then we will discuss a result due to Girosi [36] that shows an equivalence between SVMs and a particular SA technique. Finally we will discuss the prob…" |

109 | An experimental and theoretical comparison of model selection methods
- Kearns, Mansour, et al.
- 1995
Citation Context: "…n of Regularization. It is well known that a variational principle of the type of equation (1) can be derived not only in the context of functional analysis [90], but also in a probabilistic framework [49, 100, 98, 71, 56, 11]. In this section we illustrate this connection for both RN and SVM, in the setting of RKHS. Consider the classical regularization case. Following Girosi et al. [37] let us define min_{f∈H} H[f] = (1/l)…" |

104 | Probabilistic kernel regression models
- Jaakkola, Haussler
- 1999
Citation Context: "…(l) is not known in practice, we can only "implement" the extended SRM approximately by minimizing (67) with various values of λ and then picking the best λ using techniques such as cross-validation [1, 98, 99, 47], Generalized Cross Validation, Finite Prediction Error and the MDL criteria (see [94] for a review and comparison). Summarizing, both the RN and the SVMR methods discussed in sections 4 and 5 can be…" |
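Picking the best λ as described in this context — minimize the regularized functional for several values of λ, then select by validation — can be sketched as follows. The kernel, grid of λ values, and hold-out split are illustrative assumptions, a simple stand-in for cross-validation rather than the procedure of the report:

```python
import numpy as np

# Hold-out selection of the regularization parameter lambda for a
# kernel regularization network (all modeling choices hypothetical).

def gaussian_kernel(X1, X2, sigma=1.0):
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def fit_predict(Xtr, ytr, Xte, lam):
    K = gaussian_kernel(Xtr, Xtr)
    c = np.linalg.solve(K + lam * len(Xtr) * np.eye(len(Xtr)), ytr)
    return gaussian_kernel(Xte, Xtr) @ c

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, 80)
y = np.sin(2 * X) + 0.1 * rng.standard_normal(80)
Xtr, ytr, Xva, yva = X[:60], y[:60], X[60:], y[60:]

# Validation error for each candidate lambda; pick the smallest.
errors = {lam: np.mean((fit_predict(Xtr, ytr, Xva, lam) - yva) ** 2)
          for lam in [1e-4, 1e-2, 1.0, 100.0]}
best = min(errors, key=errors.get)
print(best, errors[best])
```

k-fold cross-validation would repeat the same fit/score loop over k splits and average, but the selection logic is identical.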

87 | Interpolation and approximation by radial and related functions. In: Approximation Theory - Dyn - 1989 |