## Inference for the generalization error (2003)

### Download Links

- [www.iro.umontreal.ca]
- [www.cirano.qc.ca]
- DBLP

Venue: Machine Learning

Citations: 153 (4 self)

### BibTeX

@ARTICLE{Nadeau03inference,

author = {Claude Nadeau and Yoshua Bengio},

title = {Inference for the generalization error},

journal = {Machine Learning},

year = {2003},

pages = {239--281},

publisher = {Kluwer Academic Publishers}

}


### Abstract

CIRANO is a private non-profit organization incorporated under the Québec Companies Act. Its infrastructure and research activities are funded through fees paid by member organizations, an infrastructure grant from the Ministère de l'Industrie, du Commerce, de la Science et de la Technologie, and grants and research mandates obtained by its research teams.

### Citations

4517 | Classification and Regression Trees
- Breiman, Friedman, et al.
- 1984

Citation Context: ...st relative to perturbations in the training set. For instance, one could argue that the support vector machine (Burges, 1998) would tend to fall in this category. Classification and regression trees (Breiman, Friedman, Olshen, & Stone, 1984) however will typically not have this property, as a slight modification in the data may lead to substantially different tree growths, so that for two different training sets, the corresponding decision funct...

3110 | UCI repository of machine learning databases
- Blake, Keogh, et al.
- 1998

Citation Context: ...f interest, namely n1, (n1; n2) = 2(n1;n2) 1(n1;n2) and r. 3. Classification of letters. We consider the problem of estimating generalization errors in the Letter Recognition classification problem (Blake, Keogh, & Merz, 1998). The learning algorithms are (A) Classification tree: we train a classification tree (Breiman et al., 1984) to obtain its decision function FA(ZS)(X). Here the classification loss function LA(j; i) =...

2530 | A tutorial on support vector machines for pattern recognition
- Burges
- 1998

Citation Context: ...ng set (for instance a parametric model that is not too complex). The algorithm is robust relative to perturbations in the training set. For instance, one could argue that the support vector machine (Burges, 1998) would tend to fall in this category. Classification and regression trees (Breiman et al., 1984) however will typically not have this property, as a slight modification in the data may lead to substantially...

1076 | A Probabilistic Theory of Pattern Recognition
- Devroye, Györfi, et al.
- 1996

869 | A study of cross-validation and bootstrap for accuracy estimation and model selection
- Kohavi
- 1995

Citation Context: ...tor. There are many variants of cross-validation, and the above variant is close to the popular K-fold cross-validation estimator, which has been found more reliable than the leave-one-out estimator (Kohavi, 1995). It should be noted that our goal in this paper is not to compare algorithms in order to perform model selection (i.e. to choose exactly one among several learning algorithms for a particular task, ...
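The K-fold cross-validation estimator discussed in this context can be sketched in a few lines. This is an illustrative stand-in, not code from the paper: the toy dataset, `train_fn` (predict the training mean), and squared-error loss are all assumptions chosen only to make the sketch runnable.

```python
# Minimal sketch of K-fold cross-validation for estimating generalization
# error. The model and loss below are illustrative stand-ins.
import random

def k_fold_cv(data, train_fn, loss_fn, k=10, seed=0):
    """Average held-out loss over k disjoint test folds."""
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]  # k roughly equal folds
    fold_errors = []
    for i in range(k):
        test = folds[i]
        train = [z for j, f in enumerate(folds) if j != i for z in f]
        model = train_fn(train)
        fold_errors.append(sum(loss_fn(model, z) for z in test) / len(test))
    return sum(fold_errors) / k

# Toy regression data: y ~ 2x + noise; predict with the training mean.
points = [(x, 2.0 * x + random.Random(x).gauss(0, 0.1)) for x in range(50)]
train_mean = lambda d: sum(y for _, y in d) / len(d)
sq_loss = lambda m, z: (z[1] - m) ** 2
err = k_fold_cv(points, train_mean, sq_loss, k=5)
```

Each example is held out exactly once, which is the property that distinguishes K-fold from the repeated random splits analyzed elsewhere in the paper.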

841 | Estimation of Dependences Based on Empirical Data
- Vapnik
- 1982

Citation Context: ... is a test error or a difference of test errors, and when the learning algorithms have a finite capacity, it can be shown that n1 is bounded with a given high probability by a decreasing function of n1 (Vapnik, 1982), converging to the asymptotic training error (which is both the training error and the expected generalization error when n1 → ∞). This argument is based on bounds on the cumulative distribution of ...

593 | Approximate statistical tests for comparing supervised classification learning algorithms
- Dietterich
- 1998

Citation Context: ... of variability such as the choice of the training set (Breiman, 1996) or initial conditions of a learning algorithm (Kolen and Pollack, 1991). A notable effort in that direction is Dietterich's work (Dietterich, 1998). Building upon this work, in this paper we take into account the variability due to the choice of training sets and test examples. Specifically, an investigation of the variance to be estimated allows ...

576 | Maximum likelihood estimation of misspecified models. Econometrica
- White
- 1982

Citation Context: ...estimates. Since we generate 1000 independent data sets, we have 1000 independent instances of such vectors. As may be seen in Appendix A.3, appropriate use of the theory of estimating functions (White, 1982) then yields approximate confidence intervals for n1 and (n1; n2). Confidence intervals for r = Corr[(μ̂(m) − μ̂c(m))², (μ̂(m′) − μ̂c(m′))²], defined in Section 3, are obtained in the same manner we g...

258 | No Free Lunch Theorems for Search
- Wolpert, Macready
- 1995

Citation Context: ... which to train them). The use of cross-validation estimators for model selection has sparked a debate in the last few years (Zhu & Rohwer, 1996; Goutte, 1997) related to the "no free lunch theorem" (Wolpert & Macready, 1995), since cross-validation model selection often works well in practice but it is probably not a universally good procedure. This paper does not address the issue of model selection but rather that of ...

170 | Heuristics of instability and stabilization in model selection
- Breiman
- 1996

Citation Context: ...e intervals are possible, is more difficult, especially, as pointed out in (Hinton et al., 1995), if one wants to take into account various sources of variability such as the choice of the training set (Breiman, 1996) or initial conditions of a learning algorithm (Kolen and Pollack, 1991). A notable effort in that direction is Dietterich's work (Dietterich, 1998). Building upon this work, in this paper we take into...

148 | The analysis of contingency tables
- Everitt
- 1977

Citation Context: ...nce of L(1; i) conditional on the training set ZS1. That is so because, given ZS1, the L(1; i)'s are independent variates. We note that this statistic is closely related to the McNemar statistic (Everitt, 1977) when the problem at hand is the comparison of two classification algorithms, i.e. L is of the form (4) with Q of the form (2). Indeed, let LA−B(1; i) = LA(1; i) − LB(1; i), where LA(1; i) indicates wh...
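The McNemar statistic referenced in this context compares two classifiers on a shared test set by counting the examples that exactly one of them misclassifies. The sketch below is a standard textbook form with continuity correction, not the paper's own code; the 0/1 error vectors are made-up toy data.

```python
# Hedged sketch of the McNemar statistic (Everitt, 1977) for comparing
# two classifiers A and B on the same test examples.
def mcnemar_statistic(errors_a, errors_b):
    """errors_a, errors_b: per-example 0/1 error indicators for A and B."""
    # Discordant pairs: exactly one classifier errs on the example.
    n01 = sum(1 for a, b in zip(errors_a, errors_b) if a == 0 and b == 1)
    n10 = sum(1 for a, b in zip(errors_a, errors_b) if a == 1 and b == 0)
    if n01 + n10 == 0:
        return 0.0
    # Chi-squared statistic with continuity correction (1 degree of freedom).
    return (abs(n10 - n01) - 1) ** 2 / (n10 + n01)

# Toy error indicators: here n01 = 4, n10 = 1.
a = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]
b = [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]
stat = mcnemar_statistic(a, b)
```

Only the discordant counts enter the statistic, which matches the context's observation that the test is built from the per-example loss differences LA − LB.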

112 | Algorithmic stability and sanity-check bounds for leave-one-out cross-validation
- Kearns, Ron
- 1999

Citation Context: ... does n1 affect the first source of variation? It is not unreasonable to expect that the decision function yielded by a "stable" learning algorithm is less variable when the training set is larger. See (Kearns & Ron, 1997) showing that for a large class of algorithms including those minimizing training error, cross-validation estimators are not much worse than the training error estimator (which itself improves in O(V...

94 | Back propagation is sensitive to initial conditions
- Kolen, Pollack
- 1991

Citation Context: ...ed out in (Hinton et al., 1995), if one wants to take into account various sources of variability such as the choice of the training set (Breiman, 1996) or initial conditions of a learning algorithm (Kolen and Pollack, 1991). A notable effort in that direction is Dietterich's work (Dietterich, 1998). Building upon this work, in this paper we take into account the variability due to the choice of training sets and test exa...

83 | An introduction to the bootstrap, Monographs on statistics and applied probability
- Efron, Tibshirani
- 1993

Citation Context: ... small relative to n (the total number of examples available). One may use n2 = n/10 for instance, provided J is not smallish. 5. Bootstrap. To estimate the variance of the estimator μ̂J by the bootstrap (Efron and Tibshirani, 1993), we must obtain R other instances of that random variable by redoing the computation with different splits; call these μ̂1, ..., μ̂R. Thus, in total, (R + 1)J training and testing sets are needed here. Th...
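The bootstrap variance estimate described in this context (Efron & Tibshirani, 1993) recomputes the statistic on R resampled datasets and takes the empirical variance of the replicates. The sketch below applies it to the sample mean of a made-up dataset; the statistic, data, and R are illustrative assumptions, not the paper's setup.

```python
# Illustrative sketch of bootstrap variance estimation: resample the data
# with replacement R times, recompute the statistic, and take the
# empirical variance of the R replicates.
import random

def bootstrap_variance(sample, statistic, R=200, seed=0):
    rng = random.Random(seed)
    n = len(sample)
    replicates = []
    for _ in range(R):
        # Draw n points with replacement from the original sample.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    mean = sum(replicates) / R
    return sum((r - mean) ** 2 for r in replicates) / (R - 1)

# Toy data and statistic: the sample mean.
data = [1.0, 2.0, 2.5, 3.0, 4.5, 5.0, 6.0]
mean_stat = lambda s: sum(s) / len(s)
var_hat = bootstrap_variance(data, mean_stat)
```

As the context notes, applying this to a cross-validation estimator is expensive: each of the R replicates requires redoing all J train/test splits, hence (R + 1)J fitted models in total.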

29 | Note on free lunches and cross-validation
- Goutte
- 1997

Citation Context: ... algorithms for a particular task, given a data set on which to train them). The use of cross-validation estimators for model selection has sparked a debate in the last few years (Zhu & Rohwer, 1996; Goutte, 1997) related to the "no free lunch theorem" (Wolpert & Macready, 1995), since cross-validation model selection often works well in practice but it is probably not a universally good procedure. This paper...

4 | DELVE team members
- Hinton, Neal, et al.
- 1995

Citation Context: ... trivial through cross-validation. Providing a variance estimate of that estimation, so that hypothesis testing and/or confidence intervals are possible, is more difficult, especially, as pointed out in (Hinton et al., 1995), if one wants to take into account various sources of variability such as the choice of the training set (Breiman, 1996) or initial conditions of a learning algorithm (Kolen and Pollack, 1991). A no...

3 | Inference for the generalisation error
- Nadeau, Bengio
- 1999

Citation Context: ... four simulations we performed for the regression problem. For instance, in Simulation 1, we generated 1000 samples of size 200, with μx = 10, σ²X = 1, a = 100, b = 1 and σ²Y|X = 97. It is shown in (Nadeau & Bengio, 1999) that n1μA = ((n1+1)/n1)(σ²Y|X + b²σ²X) and n1μB = ((n1+1)/n1)((n1−2)/(n1−3))σ²Y|X. Thus the first and third simulations correspond to cases where the two algorithms generalize equally well (for n1 = n2); in...
