## A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split (1996)

### Download Links

- [www.informatik.uni-bonn.de]
- [www.cis.upenn.edu]
- DBLP

### Venue and Citations

Venue: Neural Computation

Citations: 24 (0 self)

### BibTeX

```
@INPROCEEDINGS{Kearns96abound,
  author    = {Michael Kearns},
  title     = {A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split},
  booktitle = {Neural Computation},
  year      = {1996},
  pages     = {183--189},
  publisher = {Morgan Kaufmann}
}
```


### Abstract

We give an analysis of the generalization error of cross validation in terms of two natural measures of the difficulty of the problem under consideration: the approximation rate (the accuracy to which the target function can be ideally approximated as a function of the number of hypothesis parameters), and the estimation rate (the deviation between the training and generalization errors as a function of the number of hypothesis parameters). The approximation rate captures the complexity of the target function with respect to the hypothesis model, and the estimation rate captures the extent to which the hypothesis model suffers from overfitting. Using these two measures, we give a rigorous and general bound on the error of cross validation. The bound clearly shows the tradeoffs involved in making γ (the fraction of data saved for testing) too large or too small. By optimizing the bound with respect to γ, we then argue (through a combination of formal analysis, plotting, and ...
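The γ-split scheme the abstract analyzes can be sketched as follows. This is a minimal illustration (not the paper's construction): least-squares polynomial fitting stands in as a hypothetical family of nested models indexed by d, and the first (1 − γ)m examples are used for training with the remaining γm held out for testing.

```python
import numpy as np

def cv_select_degree(x, y, gamma, max_degree):
    """Pick a polynomial degree d by a single training/test split:
    fit h_d on the first (1 - gamma)*m examples, measure its error
    on the held-out remainder, and return the d with lowest test error."""
    m = len(x)
    n_train = int(round((1 - gamma) * m))
    x_tr, y_tr = x[:n_train], y[:n_train]
    x_te, y_te = x[n_train:], y[n_train:]
    best_d, best_err = 0, float("inf")
    for d in range(max_degree + 1):
        # h_d: the training-error (least-squares) minimizer of degree d
        coeffs = np.polyfit(x_tr, y_tr, d)
        # test error on the gamma-fraction estimates generalization error
        err = float(np.mean((np.polyval(coeffs, x_te) - y_te) ** 2))
        if err < best_err:
            best_d, best_err = d, err
    return best_d, best_err
```

A larger γ gives a better estimate of each h_d's generalization error but leaves less data to fit h_d; the paper's bound makes this tradeoff explicit.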

### Citations

976 | On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications
- Vapnik, Chervonenkis
- 1971

Citation Context: ...is a class of boolean functions of d parameters, each function being a mapping from some input space X into {0, 1}. For simplicity, in this paper we assume that the Vapnik-Chervonenkis (VC) dimension [10, 9] of the class H_d is O(d). To remove this assumption, one simply replaces all occurrences of d in our bounds by the VC dimension of H_d. We assume that we have in our possession a learning algorithm ...

835 | Estimation of Dependencies Based on Empirical Data
- Vapnik
- 1979

Citation Context: ...Barron and Cover [2], who introduced the idea of bounding the error of a model selection method (MDL in their case) in terms of a quantity known as the index of resolvability; and the work of Vapnik [9], who provides extremely powerful and general tools for uniformly bounding the deviations between training and generalization errors. We combine these methods to give a new and general analysis of cro...

829 | Cross-validatory choice and assessment of statistical predictions (with discussion)
- Stone
- 1974

Citation Context: ...izes asymptotic statistical properties, or the exact calculation of the generalization error for simple models. (The literature is too large to survey here; foundational papers include those of Stone [7, 8].) Our approach here is somewhat different, and is primarily inspired by two sources: the work of Barron and Cover [2], who introduced the idea of bounding the error of a model selection method (MDL i...

406 | Universal Approximation Bounds for Superpositions of a Sigmoid Function
- Barron
- 1993

Citation Context: ...paper we concentrate on the simplest version of cross validation. Unlike the methods mentioned above, which use the entire sample for training the h_d, in cross validation we choose a parameter γ ∈ [0, 1], which determines the split between training and test data. Given the input sample S of m examples, let S′ be the subsample consisting of the first (1 − γ)m examples in S, and S′′ the subsamp...

214 | Minimum complexity density estimation
- Barron, Cover
- 1991

Citation Context: ...rature is too large to survey here; foundational papers include those of Stone [7, 8].) Our approach here is somewhat different, and is primarily inspired by two sources: the work of Barron and Cover [2], who introduced the idea of bounding the error of a model selection method (MDL in their case) in terms of a quantity known as the index of resolvability; and the work of Vapnik [9], who provides ext...

111 | An experimental and theoretical comparison of model selection methods
- Kearns, Mansour, et al.
- 1997

Citation Context: ...ed with simplifying the calculation. A nice feature of the intervals problem is the fact that training error minimization can be performed in almost linear time using a dynamic programming approach [4]. It would be interesting to verify this prediction experimentally, perhaps on a different problem where the predicted effect is more pronounced. (Section 7, "Power Law Decay and the Perceptron Problem") For the case...
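The dynamic programming idea mentioned for the intervals problem can be sketched as follows. This is a simple O(m·d) version, not the almost-linear-time algorithm of the cited work: over points given in sorted order, a hypothesis that is a union of at most d intervals corresponds to a predicted 0/1 label sequence with at most d maximal runs of 1s, and we minimize the number of disagreements with the true labels.

```python
def min_intervals_error(labels, d):
    """Minimum training error achievable by a union of at most d
    intervals, over points in sorted order (labels[i] in {0, 1}).
    State: (runs of 1s used so far, current predicted label)."""
    INF = float("inf")
    # dp[k][b] = min errors so far with k runs of 1s used, predicting b now
    dp = [[INF, INF] for _ in range(d + 1)]
    dp[0][0] = 0
    for y in labels:
        new = [[INF, INF] for _ in range(d + 1)]
        for k in range(d + 1):
            for b in (0, 1):
                if dp[k][b] == INF:
                    continue
                for nb in (0, 1):
                    # switching 0 -> 1 opens a new run of 1s (a new interval)
                    nk = k + (1 if b == 0 and nb == 1 else 0)
                    if nk > d:
                        continue
                    cost = dp[k][b] + (1 if nb != y else 0)
                    if cost < new[nk][nb]:
                        new[nk][nb] = cost
        dp = new
    return min(min(row) for row in dp)
```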

53 | Rigorous learning curve bounds from statistical mechanics (preprint)
- Haussler, Kearns, et al.
- 1994

Citation Context: ...the problem: f, D and the structure. Indeed, much of the recent work on the statistical physics theory of learning curves has documented the wide variety of behaviors that such deviations may assume [6, 3]. However, for many natural problems it is both convenient and accurate to rely on a universal estimation rate bound provided by the powerful theory of uniform convergence: namely, for any f, D and a...

49 | Asymptotics for and against cross-validation
- Stone
- 1977

Citation Context: ...izes asymptotic statistical properties, or the exact calculation of the generalization error for simple models. (The literature is too large to survey here; foundational papers include those of Stone [7, 8].) Our approach here is somewhat different, and is primarily inspired by two sources: the work of Barron and Cover [2], who introduced the idea of bounding the error of a model selection method (MDL i...

46 | Stochastic complexity in statistical inquiry, volume 15
- Rissanen
- 1989

Citation Context: ...ting of m random examples drawn according to D and labeled by f (with the labels possibly corrupted by noise). In many model selection methods (such as Rissanen's Minimum Description Length Principle [5] and Vapnik's Guaranteed Risk Minimization [9]), for each value of d = 1, 2, 3, ... we give the entire sample S and d to the learning algorithm L to obtain the function h_d minimizing the training e...

46 | Statistical Mechanics of Learning from Examples
- Seung, Sompolinsky, et al.
- 1993

Citation Context: ...the target function is a function in H_s with all s nonzero weights equal to 1, then it can be shown that the approximation rate function ε_g(d) is ε_g(d) = (1/π) cos⁻¹(√(d/N)) for d < s [6], and of course ε_g(d) = 0 for d ≥ s. This problem provides a nice contrast to the intervals problem, since here the behavior of the approximation rate for small d is concave down: as long as d < s, ...
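The approximation rate quoted in this context, read as ε_g(d) = (1/π) cos⁻¹(√(d/N)) for d < s (the 1/π factor is an assumption; the extraction dropped the symbol), can be checked numerically: it starts at 1/2 for d = 0 and decreases toward 0 as d grows.

```python
import math

def approximation_rate(d, N):
    """eps_g(d) = (1/pi) * arccos(sqrt(d/N)), as read from the
    garbled formula in the citation context above (assumed form)."""
    return math.acos(math.sqrt(d / N)) / math.pi

N = 100
rates = [approximation_rate(d, N) for d in range(N + 1)]
# eps_g(0) = 1/2 since arccos(0) = pi/2; eps_g(N) = 0 since arccos(1) = 0
```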