## The mathematics of learning: Dealing with data (2003)


Venue: Notices of the American Mathematical Society

Citations: 103 (15 self)

### BibTeX

```bibtex
@ARTICLE{Poggio03themathematics,
  author  = {Tomaso Poggio and Steve Smale},
  title   = {The mathematics of learning: Dealing with data},
  journal = {Notices of the American Mathematical Society},
  year    = {2003},
  volume  = {50},
  pages   = {537--544}
}
```


### Abstract

Learning is key to developing systems tailored to a broad range of data analysis and information extraction tasks. We outline the mathematical foundations of learning theory and describe a key algorithm of it.

### Citations

8980 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ...be as in Equation 10, where BR is the ball of radius R in a Reproducing Kernel Hilbert Space (RKHS) with a smooth K (or in a Sobolev space). In this context, R plays an analogous role to VC dimension [50]. Estimates for the covering numbers in these cases were provided by Cucker, Smale and Zhou [10, 54, 55]. The proof of Theorem 3.1 starts from Hoeffding inequality (which can be regarded as an exponen...

2171 | Support-vector networks
- Cortes, Vapnik
- 1995
Citation Context: ...ch corresponds to minimizing the RKHS norm. The regularization algorithm and learning theory The Mercer theorem was introduced in learning theory by Vapnik and RKHS by Girosi [22] and later by Vapnik [9, 50]. Poggio and Girosi [41, 40, 23] had introduced Tikhonov regularization in learning theory (the reformulation of Support Vector Machines as a special case of regularization can be found in [19]). Earl...

2046 | The Elements of Statistical Learning
- Hastie, Tibshirani, et al.
- 2001
Citation Context: ...tational speed and use of computational resources. For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks (see [24]). Hierarchical systems usually decompose a task into a series of simple computations at each level – often an advantage for fast implementations. There may also be the more fundamental issue of sample c...

1722 | Ten Lectures on Wavelets
- Daubechies
- 1992
Citation Context: ...time also the estimation error). It is naturally studied in the theory of probability and of empirical processes [16, 30, 31]. The second term (A) is dealt with via approximation theory (see [15] and [12, 14, 13, 32, 33]) and is called the approximation error. The decomposition of Equation 12 is related, but not equivalent, to the well known bias (A) and variance (S) decomposition in statistics. 3.1 Sample Error Firs...

1273 | Spline models for observational data
- Wahba
- 1990
Citation Context: ...e predicted label is then {−1, +1}, depending on the sign of the function f of Equation 2. Regression applications are the oldest. Typically they involved fitting data in a small number of dimensions [53, 44, 45]. More recently, they also included typical learning applications, sometimes with a very high dimensionality. One example is the use of algorithms in computer graphics for synthesizing new images an...

994 | A probabilistic theory of pattern recognition
- Devroye, Györfi, et al.
- 1996
Citation Context: ...t be estimated in probability over z and the estimate is called the sample error (sometimes also the estimation error). It is naturally studied in the theory of probability and of empirical processes [16, 30, 31]. The second term (A) is dealt with via approximation theory (see [15] and [12, 14, 13, 32, 33]) and is called the approximation error. The decomposition of Equation 12 is related, but not equivalent,...

946 | On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab.
- Vapnik, Chervonenkis
- 1971
Citation Context: ...ns f. F is a uniform Glivenko-Cantelli class if for every ε > 0, lim_{m→∞} sup_ρ IP{ sup_{f∈F} |E_{ρ_m} f − E_ρ f| > ε } = 0 (21), where ρ_m is the empirical measure supported on a set x_1, ..., x_m. In [1] – following [51, 17] – a necessary and sufficient condition is proved for uniform convergence of |I_emp[f] − I_exp[f]|, in terms of the finiteness for all γ > 0 of a combinatorial quantity called the V_γ dimension of F (which is ...

777 | Theory of reproducing kernels
- Aronszajn
- 1950
Citation Context: ...this space by setting 〈K_x, K_{x_j}〉 = K(x, x_j) and extend linearly to Σ_{j=1}^r a_j K_{x_j}. The completion of the space in the associated norm is the RKHS, that is a Hilbert space H_K with the norm ‖f‖²_K (see [10, 2]). Note that 〈f, K_x〉 = f(x) for f ∈ H_K (just let f = K_{x_j} and extend linearly). To minimize the functional in Equation 4 we take the functional derivative with respect to f, apply it to an element f of t...

639 | Networks for approximation and learning
- Poggio, Girosi
- 1990
Citation Context: ...(1 − yf(x))_+, we can perform least-squares regularized classification via the loss function V(f(x), y) = (f(x) − y)². This classification scheme was used at least as early as 1989 (for reviews see [7, 40]) and then rediscovered again by many others (see [21, 49]), including Mangasarian (who refers to square loss regularization as “proximal vector machines”) and Suykens (who uses the name “least square ...
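The square-loss classification scheme described in this excerpt can be sketched in a few lines. This is a minimal illustration, not the paper's code: the data, parameter values, and function names are hypothetical, a linear kernel is used for brevity, and the coefficients come from the regularized least-squares system (K + mγI)c = y.

```python
import numpy as np

def rls_train(X, y, gamma=0.1):
    # Square loss V(f(x), y) = (f(x) - y)^2 plus a Tikhonov penalty,
    # with the linear kernel K(x, x') = <x, x'>. By the representer
    # theorem f(x) = sum_j c_j <x, x_j>, with (K + m*gamma*I) c = y.
    m = len(X)
    K = X @ X.T
    return np.linalg.solve(K + m * gamma * np.eye(m), y)

def rls_predict(X_train, c, X_new):
    # The predicted label in {-1, +1} is the sign of f
    return np.sign(X_new @ X_train.T @ c)

# Hypothetical toy data: two well-separated point clouds, labels -1/+1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
c = rls_train(X, y)
print((rls_predict(X, c, X) == y).mean())  # training accuracy
```

On data this well separated, the square-loss classifier recovers the labels despite using a regression loss, which is the empirical point attributed to Mangasarian, Suykens, and Rifkin in the excerpt.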

628 | Constructive Approximation
- Devore, Lorentz
- 1993
Citation Context: ...time also the estimation error). It is naturally studied in the theory of probability and of empirical processes [16, 30, 31]. The second term (A) is dealt with via approximation theory (see [15] and [12, 14, 13, 32, 33]) and is called the approximation error. The decomposition of Equation 12 is related, but not equivalent, to the well known bias (A) and variance (S) decomposition in statistics. 3.1 Sample Error Firs...

476 | Fast learning in networks of locally-tuned processing units
- Moody, Darken
- 1989
Citation Context: ...c.) to the output space (the price of the option) from historical data [27]. Binary classification applications abound. The algorithm was used to perform binary classification on a number of problems [7, 34]. It was also used to perform visual object recognition in a view-independent way and in particular face recognition and sex categorization from face images [39, 8]. Other applications span bioinforma...

434 | Multivariable functional interpolation and adaptive networks
- Broomhead, Lowe
- 1988
Citation Context: ...c.) to the output space (the price of the option) from historical data [27]. Binary classification applications abound. The algorithm was used to perform binary classification on a number of problems [7, 34]. It was also used to perform visual object recognition in a view-independent way and in particular face recognition and sex categorization from face images [39, 8]. Other applications span bioinforma...

311 | Regularization theory and neural-network architectures, Neural Computation
- Girosi, Jones, et al.
- 1995
Citation Context: ...g the RKHS norm. The regularization algorithm and learning theory The Mercer theorem was introduced in learning theory by Vapnik and RKHS by Girosi [22] and later by Vapnik [9, 50]. Poggio and Girosi [41, 40, 23] had introduced Tikhonov regularization in learning theory (the reformulation of Support Vector Machines as a special case of regularization can be found in [19]). Earlier, Gaussian Radial Basis Funct...

292 | A general framework for object detection
- Papageorgiou, Oren, et al.
- 1998
Citation Context: ...rained in this way with thousands of images has been recently tested in an experimental car of Daimler. It runs on a PC in the trunk and looks at the road in front of the car through a digital camera [36, 26, 43]. Algorithms have been developed that can produce a diagnosis of the type of cancer from a set of measurements of the expression level of many thousands of human genes in a biopsy of the tumor measured w...

277 | Interpolation of scattered data: Distance matrices and conditionally positive definite functions
- Micchelli
- 1986
Citation Context: ...time also the estimation error). It is naturally studied in the theory of probability and of empirical processes [16, 30, 31]. The second term (A) is dealt with via approximation theory (see [15] and [12, 14, 13, 32, 33]) and is called the approximation error. The decomposition of Equation 12 is related, but not equivalent, to the well known bias (A) and variance (S) decomposition in statistics. 3.1 Sample Error Firs...

277 | Regularization algorithms for learning that are equivalent to multilayer networks
- Poggio, Girosi
- 1990
Citation Context: ...g the RKHS norm. The regularization algorithm and learning theory The Mercer theorem was introduced in learning theory by Vapnik and RKHS by Girosi [22] and later by Vapnik [9, 50]. Poggio and Girosi [41, 40, 23] had introduced Tikhonov regularization in learning theory (the reformulation of Support Vector Machines as a special case of regularization can be found in [19]). Earlier, Gaussian Radial Basis Funct...

267 | Regularization networks and support vector machines
- Evgeniou, Pontil, et al.
- 2000
Citation Context: ...n (ERM) – of finding the function in H which minimizes (1/m) Σ_{i=1}^m (f(x_i) − y_i)², which is in general ill-posed, depending on the choice of the hypothesis space H. Following Tikhonov (see for instance [19]) we minimize, instead, over the hypothesis space H_K, for a fixed positive parameter γ, the regularized functional (1/m) Σ_{i=1}^m (y_i − f(x_i))² + γ‖f‖²_K, (4) where ‖f‖²_K is the norm in H_K – the Repr...
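The regularized functional in this excerpt (Equation 4 of the paper) admits a short numerical sketch. Everything below is an illustrative assumption rather than the paper's own code: a Gaussian Mercer kernel, synthetic data, and the closed-form minimizer over H_K given by the representer theorem, f(x) = Σ_j c_j K(x, x_j) with (K + mγI)c = y.

```python
import numpy as np

def gram(A, B, sigma=1.0):
    # Gaussian Mercer kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def regularized_fit(X, y, gamma, sigma=1.0):
    # Minimizer of (1/m) sum_i (y_i - f(x_i))^2 + gamma * ||f||_K^2
    # over H_K: by the representer theorem the solution is
    # f(x) = sum_j c_j K(x, x_j) with (K + m*gamma*I) c = y.
    m = len(X)
    return np.linalg.solve(gram(X, X, sigma) + m * gamma * np.eye(m), y)

def regularized_predict(X_train, c, X_new, sigma=1.0):
    return gram(X_new, X_train, sigma) @ c

# Illustrative use: recover a smooth function from noisy samples
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (80, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(80)
c = regularized_fit(X, y, gamma=1e-4)
```

The positive γ both stabilizes the linear system (the original ERM problem is ill-posed, as the excerpt notes) and controls the smoothness of the recovered f.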

263 | Radial basis functions for multivariable interpolation: a review
- Powell
- 1987
Citation Context: ...e predicted label is then {−1, +1}, depending on the sign of the function f of Equation 2. Regression applications are the oldest. Typically they involved fitting data in a small number of dimensions [53, 44, 45]. More recently, they also included typical learning applications, sometimes with a very high dimensionality. One example is the use of algorithms in computer graphics for synthesizing new images an...

239 |
Computational Vision and Regularization Theory
- Poggio, Koch, et al.
- 1985
Citation Context: ...head and Loewe. Of course, RKHS had been pioneered by Parzen and Wahba ([37, 53]) for applications closely related to learning, including data smoothing (for image processing and computer vision, see [4, 42]). A Bayesian interpretation The learning algorithm Equation 4 has an interesting Bayesian interpretation [52, 53]: the data term – that is the first term with the quadratic loss function – is a model...

230 | Multiclass cancer diagnosis using tumor gene expression signatures
- Ramaswamy, Tamayo, et al.
Citation Context: ...of the type of cancer from a set of measurements of the expression level of many thousands of human genes in a biopsy of the tumor measured with a cDNA microarray containing probes for a number of genes [46]. Again, the software learns the classification rule from a set of examples, that is from examples of expression patterns in a number of patients with known diagnoses. The challenge, in this case, is ...

224 | On the mathematical foundations of learning
- Cucker, Smale
Citation Context: ...this space by setting 〈K_x, K_{x_j}〉 = K(x, x_j) and extend linearly to Σ_{j=1}^r a_j K_{x_j}. The completion of the space in the associated norm is the RKHS, that is a Hilbert space H_K with the norm ‖f‖²_K (see [10, 2]). Note that 〈f, K_x〉 = f(x) for f ∈ H_K (just let f = K_{x_j} and extend linearly). To minimize the functional in Equation 4 we take the functional derivative with respect to f, apply it to an element f of t...
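The stationarity argument in this excerpt (setting the functional derivative of Equation 4 to zero) can be checked numerically. Restricted to the span of the kernels at the data, the functional becomes J(c) = (1/m)‖Kc − y‖² + γ cᵀKc, which is convex in c, so the c solving (K + mγI)c = y should score no worse than any perturbation. The kernel, data, and perturbation scale below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(25, 1))   # hypothetical inputs
y = np.sin(2 * X[:, 0])                # hypothetical targets
m, gamma = len(X), 1e-2

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / 2)                    # Gaussian Gram matrix

def J(c):
    # Equation 4 restricted to f = sum_j c_j K(., x_j):
    # empirical risk plus gamma * ||f||_K^2, with ||f||_K^2 = c^T K c
    return ((K @ c - y) ** 2).mean() + gamma * c @ K @ c

# Zero of the functional derivative: (K + m*gamma*I) c = y
c_star = np.linalg.solve(K + m * gamma * np.eye(m), y)

# Since J is convex in c, no perturbation should lower the objective
assert all(J(c_star) <= J(c_star + 0.1 * rng.standard_normal(m))
           for _ in range(100))
```

The final assertion is exactly the first-order optimality condition the excerpt derives: the gradient (2/m)K((K + mγI)c − y) vanishes at c_star, and the Hessian (2/m)K² + 2γK is positive semidefinite.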

205 | Scale-sensitive dimensions, uniform convergence, and learnability
- Alon, Ben-David, et al.
- 1997
Citation Context: ...class of functions f. F is a uniform Glivenko-Cantelli class if for every ε > 0, lim_{m→∞} sup_ρ IP{ sup_{f∈F} |E_{ρ_m} f − E_ρ f| > ε } = 0 (21), where ρ_m is the empirical measure supported on a set x_1, ..., x_m. In [1] – following [51, 17] – a necessary and sufficient condition is proved for uniform convergence of |I_emp[f] − I_exp[f]|, in terms of the finiteness for all γ > 0 of a combinatorial quantity called the V_γ dime...

203 | An equivalence between sparse approximation and support vector machines
- Girosi
- 1998
Citation Context: ...o minimizing ||w||², which corresponds to minimizing the RKHS norm. The regularization algorithm and learning theory The Mercer theorem was introduced in learning theory by Vapnik and RKHS by Girosi [22] and later by Vapnik [9, 50]. Poggio and Girosi [41, 40, 23] had introduced Tikhonov regularization in learning theory (the reformulation of Support Vector Machines as a special case of regularization...

179 | Ill-posed problems in early vision
- Bertero, Poggio, et al.
- 1987
Citation Context: ...head and Loewe. Of course, RKHS had been pioneered by Parzen and Wahba ([37, 53]) for applications closely related to learning, including data smoothing (for image processing and computer vision, see [4, 42]). A Bayesian interpretation The learning algorithm Equation 4 has an interesting Bayesian interpretation [52, 53]: the data term – that is the first term with the quadratic loss function – is a model...

166 | The Theory of Radial Basis Function Approximation
- Powell
- 1992
Citation Context: ...e predicted label is then {−1, +1}, depending on the sign of the function f of Equation 2. Regression applications are the oldest. Typically they involved fitting data in a small number of dimensions [53, 44, 45]. More recently, they also included typical learning applications, sometimes with a very high dimensionality. One example is the use of algorithms in computer graphics for synthesizing new images an...

165 | Stability and generalization
- Bousquet, Elisseeff
Citation Context: ...y of the solution. In the learning problem, this condition refers to stability of the solution of ERM with respect to small changes of the training set S_m. In a similar way, the condition number (see [6] and especially [29]) characterizes the stability of the solution of Equation 3. Is it possible that some specific form of stability may be necessary and sufficient for consistency of ERM? Such a resu...

157 | Trainable videorealistic speech animation
- Ezzat, Geiger, et al.
Citation Context: ...cently, they also included typical learning applications, sometimes with a very high dimensionality. One example is the use of algorithms in computer graphics for synthesizing new images and videos [38, 5, 20]. The inverse problem of estimating facial expression and object pose from an image is another successful application [25]. Still another case is the control of mechanical arms. There are also applica...

138 | Data compression and harmonic analysis
- Donoho, Vetterli, et al.
- 1998

115 | Approximation and estimation bounds for artificial neural networks
- Barron
- 1994
Citation Context: ...xity of the hypothesis space for a given number of training data. In the case of the regularization algorithm described in this paper this tradeoff corresponds to an optimum value for γ as studied by [11, 35, 3]. In empirical work, the optimum value is often found through cross-validation techniques [53]. This tradeoff between approximation error and sample error is probably the most critical issue in determ...

113 | Image representations for visual learning
- Beymer, Poggio
Citation Context: ...cently, they also included typical learning applications, sometimes with a very high dimensionality. One example is the use of algorithms in computer graphics for synthesizing new images and videos [38, 5, 20]. The inverse problem of estimating facial expression and object pose from an image is another successful application [25]. Still another case is the control of mechanical arms. There are also applica...

112 | Proximal Support Vector Machine Classifiers, Knowledge Discovery and Data Mining
- Fung, Mangasarian
- 2001
Citation Context: ...classification via the loss function V(f(x), y) = (f(x) − y)². This classification scheme was used at least as early as 1989 (for reviews see [7, 40]) and then rediscovered again by many others (see [21, 49]), including Mangasarian (who refers to square loss regularization as “proximal vector machines”) and Suykens (who uses the name “least square SVMs”). Rifkin ([47]) has confirmed the interesting empi...

105 | A Nonparametric Approach to Pricing and Hedging Derivative Securities via Learning Networks
- Hutchinson, Lo, et al.
- 1994
Citation Context: ...irst principles) by learning the map from an input space (volatility, underlying stock price, time to expiration of the option etc.) to the output space (the price of the option) from historical data [27]. Binary classification applications abound. The algorithm was used to perform binary classification on a number of problems [7, 34]. It was also used to perform visual object recognition in a view-in...

88 | Everything Old Is New Again: A Fresh Look at Historical Approaches in Machine Learning
- Rifkin
- 2002
Citation Context: ...vered again by many others (see [21, 49]), including Mangasarian (who refers to square loss regularization as “proximal vector machines”) and Suykens (who uses the name “least square SVMs”). Rifkin ([47]) has confirmed the interesting empirical results by Mangasarian and Suykens: “classical” square loss regularization works well also for binary classification (examples are in tables 1 and 2). In refe...

82 | A network that learns to recognize 3D objects
- Poggio, Edelman
- 1990
Citation Context: ...ssification on a number of problems [7, 34]. It was also used to perform visual object recognition in a view-independent way and in particular face recognition and sex categorization from face images [39, 8]. Other applications span bioinformatics for classification of human cancer from microarray data, text summarization, sound classification. Surprisingly, it has been realized quite recently that the s...

62 | Categorization by learning and combining object parts
- Heisele, Serre, et al.
- 2001
Citation Context: ...gorithms in computer graphics for synthesizing new images and videos [38, 5, 20]. The inverse problem of estimating facial expression and object pose from an image is another successful application [25]. Still another case is the control of mechanical arms. There are also applications in finance, as, for instance, the estimation of the price of derivative securities, such as stock options. In this c...

48 | On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions
- Niyogi, Girosi
- 1996
Citation Context: ...ssified documents. From [26] for a given number of training data. In the case of the regularization algorithm described in this paper this tradeoff corresponds to an optimum value for γ as studied by [7, 20]. In empirical work, the optimum value is often found through cross-validation techniques [31]. This tradeoff between approximation error and sample error is probably the most critical issue in determi...

44 | Uniform and universal Glivenko-Cantelli classes
- Dudley, Giné, et al.
- 1991
Citation Context: ...which uses the metric entropy of F, defined as H_m(ε, F) = sup_{x_m ∈ X^m} log N(ε, F, x_m), where N(ε, F, x_m) is the ε-covering of F w.r.t. l∞_{x_m} (the l∞ distance on the points x_m): Theorem (Dudley, see [18]). F is a uniform Glivenko-Cantelli class iff lim_{m→∞} H_m(ε, F)/m = 0 for all ε > 0. We saw earlier that the regularization algorithm Equation 4 ensures (through the resulting compactness of the “effectiv...

43 | Almost-everywhere algorithmic stability and generalization error
- Kutin, Niyogi
- 2002
Citation Context: ...In the learning problem, this condition refers to stability of the solution of ERM with respect to small changes of the training set S_m. In a similar way, the condition number (see [6] and especially [29]) characterizes the stability of the solution of Equation 3. Is it possible that some specific form of stability may be necessary and sufficient for consistency of ERM? Such a result would be surprisi...

41 | Estimating the approximation error in learning theory
- Smale, Zhou
- 2003
Citation Context: ...where K is a Mercer kernel and L_K f(x) = ∫_X f(x′) K(x, x′) (15), and we have taken the square root of the operator L_K. In this case E is H_K as above. Details and proofs may be found in [10] and in [48]. 3.3 Sample and approximation error for the regularization algorithm The previous discussion depends upon a compact hypothesis space H from which the experimental optimum f_z and the true optimum f_H a...

40 | Best choices for regularization parameters in learning theory: on the bias-variance problem
- Cucker, Smale, et al.
Citation Context: ...r a given m. In our case, this bias-variance problem is to minimize S(γ) + A(γ) over γ > 0. There is a unique solution – a best γ – for the choice in Equation 4. For this result and its consequences see [11]. 4 Remarks The tradeoff between sample complexity and hypothesis space complexity For a given, fixed hypothesis space H only the sample error component of the error of f_z can be controlled (in Equ...
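The S(γ) + A(γ) tradeoff described in this excerpt can be probed empirically by sweeping γ and scoring each regularized solution on held-out data, which mirrors the cross-validation practice mentioned in the other contexts. Everything here is a sketch under assumed choices: a Gaussian kernel, synthetic noisy data, and a hypothetical γ grid.

```python
import numpy as np

def gram(A, B, sigma=1.0):
    # Gaussian kernel Gram matrix between two point sets
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def holdout_error(X_tr, y_tr, X_va, y_va, gamma):
    # Solve the regularized problem at this gamma, score on held-out data
    m = len(X_tr)
    c = np.linalg.solve(gram(X_tr, X_tr) + m * gamma * np.eye(m), y_tr)
    return np.mean((gram(X_va, X_tr) @ c - y_va) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (120, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(120)
X_tr, y_tr, X_va, y_va = X[:80], y[:80], X[80:], y[80:]

grid = [10.0 ** k for k in range(-6, 1)]   # hypothetical gamma grid
errs = {g: holdout_error(X_tr, y_tr, X_va, y_va, g) for g in grid}
best = min(errs, key=errs.get)             # empirically best gamma
```

Tiny γ drives the sample-error term up (the fit chases noise), large γ drives the approximation-error term up (the fit is over-smoothed); the held-out curve bottoms out at an intermediate value, the empirical counterpart of the unique best γ the excerpt refers to.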

40 | Multiclass least squares support vector machines
- Suykens, Vandewalle
- 1999
Citation Context: ...classification via the loss function V(f(x), y) = (f(x) − y)². This classification scheme was used at least as early as 1989 (for reviews see [7, 40]) and then rediscovered again by many others (see [21, 49]), including Mangasarian (who refers to square loss regularization as “proximal vector machines”) and Suykens (who uses the name “least square SVMs”). Rifkin ([47]) has confirmed the interesting empi...

38 | Universal Donsker classes and metric entropy
- Dudley
- 1987
Citation Context: ...ns f. F is a uniform Glivenko-Cantelli class if for every ε > 0, lim_{m→∞} sup_ρ IP{ sup_{f∈F} |E_{ρ_m} f − E_ρ f| > ε } = 0 (21), where ρ_m is the empirical measure supported on a set x_1, ..., x_m. In [1] – following [51, 17] – a necessary and sufficient condition is proved for uniform convergence of |I_emp[f] − I_exp[f]|, in terms of the finiteness for all γ > 0 of a combinatorial quantity called the V_γ dimension of F (which is ...

38 | The covering number in learning theory
- Zhou
Citation Context: ...KHS) with a smooth K (or in a Sobolev space). In this context, R plays an analogous role to VC dimension [50]. Estimates for the covering numbers in these cases were provided by Cucker, Smale and Zhou [10, 54, 55]. The proof of Theorem 3.1 starts from Hoeffding inequality (which can be regarded as an exponential version of Chebyshev’s inequality of probability theory). One applies this estimate to the function...

37 | A survey of optimal recovery
- Micchelli, Rivlin
- 1977

36 | Improving the sample complexity using global data
- Mendelson
- 2002
Citation Context: ...t be estimated in probability over z and the estimate is called the sample error (sometimes also the estimation error). It is naturally studied in the theory of probability and of empirical processes [16, 30, 31]. The second term (A) is dealt with via approximation theory (see [15] and [12, 14, 13, 32, 33]) and is called the approximation error. The decomposition of Equation 12 is related, but not equivalent,...

31 | Generalization bounds for function approximation from scattered noisy data
- Niyogi, Girosi
- 1999
Citation Context: ...xity of the hypothesis space for a given number of training data. In the case of the regularization algorithm described in this paper this tradeoff corresponds to an optimum value for γ as studied by [11, 35, 3]. In empirical work, the optimum value is often found through cross-validation techniques [53]. This tradeoff between approximation error and sample error is probably the most critical issue in determ...

31 | A novel approach to graphics
- Poggio, Brunelli
- 1992
Citation Context: ...cently, they also included typical learning applications, sometimes with a very high dimensionality. One example is the use of algorithms in computer graphics for synthesizing new images and videos [38, 5, 20]. The inverse problem of estimating facial expression and object pose from an image is another successful application [25]. Still another case is the control of mechanical arms. There are also applica...

30 | An approach to time series analysis
- Parzen
- 1961
Citation Context: ...tion can be found in [19]). Earlier, Gaussian Radial Basis Functions were proposed as an alternative to neural networks by Broomhead and Loewe. Of course, RKHS had been pioneered by Parzen and Wahba ([37, 53]) for applications closely related to learning, including data smoothing (for image processing and computer vision, see [4, 42]). A Bayesian interpretation The learning algorithm Equation 4 has an int...

21 | HyperBF networks for real object recognition
- Brunelli, Poggio
Citation Context: ...ssification on a number of problems [7, 34]. It was also used to perform visual object recognition in a view-independent way and in particular face recognition and sex categorization from face images [39, 8]. Other applications span bioinformatics for classification of human cancer from microarray data, text summarization, sound classification. Surprisingly, it has been realized quite recently that the s...

11 | Mathematical challenges from genomics and molecular biology
- Karp
- 2002
Citation Context: ...y classification (examples are in tables 1 and 2). In references to supervised learning the Support Vector Machine method is often described (see for instance a recent issue of the Notices of the AMS [28]) according to the “traditional” approach, introduced by Vapnik and followed by almost everybody else. In this approach, one starts with the concepts of separating hyperplanes and margin. Given the da...