## The supervised learning no-free-lunch Theorems (2001)

Venue: In Proc. 6th Online World Conference on Soft Computing in Industrial Applications

Citations: 31 (0 self)

### BibTeX

@INPROCEEDINGS{Wolpert01thesupervised,
  author = {David H. Wolpert},
  title = {The supervised learning no-free-lunch Theorems},
  booktitle = {In Proc. 6th Online World Conference on Soft Computing in Industrial Applications},
  year = {2001},
  pages = {25--42}
}

### Abstract

This paper reviews the supervised learning versions of the no-free-lunch theorems in a simplified form. It also discusses the significance of those theorems, and their relation to other aspects of supervised learning.

### Citations

722 | Statistical Analysis of Finite Mixture Distributions
- Titterington, Smith, et al.
- 1985
Citation Context ...d functions (i.e., I will rewrite (x) as h(x), and (x) as f(x)). It will be convenient to write P(d_Y | d_X, f) as (d f), of f. For the views of conventional statistics on this issue, see also Titterington [5] and, in particular, the Dawid references therein concerning the "predictive" paradigm and the "diagnostic" paradigm. ... i.e., (d f) = 1 if d lies on f, 0 otherwise. The notation is motivated by viewing f ...

580 | Stacked generalization
- Wolpert
- 1992
Citation Context ...thout h, the EBF could not encompass the non-Bayesian frameworks like computational learning theory (e.g., PAC and the VC framework) and sampling theory statistics (e.g., confidence intervals). (See [13] for an overview of those frameworks.) To understand this, note that one's learning algorithm (or "generalizer") is given by the conditional probability distribution P(h | d). There is no direct analog...

379 | Computer Systems That Learn
- Weiss, Kulikowski
- 1991
Citation Context ...Off-Training-Set Error Many introductory supervised learning texts take the view that "the overall objective ... is to learn from samples and to generalize to new, as yet unseen cases" (italics mine; see [7] for example). Similarly, it is common practice to try to avoid fitting the training set exactly, i.e., to try to avoid "overtraining." One of the major rationales given for this is that if one overtrai...

127 | Bayesian back-propagation
- Buntine, Weigend
- 1991
Citation Context ... P(h | d) are nonzero) is always 1. One should not confuse the error function with the "error surface" found in techniques like backpropagation. In the standard Bayesian formulation of backpropagation [1, 10], the error surface is (the log of) P(w | d), where w is the weight vector parametrizing f. So, for example, the term in that error surface that equals the squared error on the training set simply reflec...

61 | On the connection between in-sample testing and generalization error
- Wolpert
- 1992
Citation Context ...st rather than on parametrizations of that object. 1.2 Overview In Section 2 of this paper I present a framework capable of addressing off-training-set error, the "Extended Bayesian Framework" (EBF; [9, 10, 14]). This framework has the other major advantage that it encompasses the conventional frameworks, illustrating the subtleties of how they are related, and suggesting variants of them. In Section 3 I pr...

41 | The relationship between PAC, the statistical physics framework, the Bayesian framework, and the VC framework
- Wolpert
- 1994
Citation Context ...t one can ignore the distinction between the two kinds of error. However, this heuristic is often wrong. See, for example, the discussion on the "statistical physics supervised learning framework" in [11]. None of this means that one should never allow test sets to overlap with training sets. Rather it means that off-training-set testing is an issue of major importance which warrants scrutiny. Howeve...

24 | Priors for infinite networks
- Neal
- 1994
Citation Context ...eurons, etc. Often, for lack of an alternative, they do this without taking into account the ultimate effect on the direct object of interest, the input-output functions parametrized by those weights [10, 6]. One advantage of the framework presented below is that it is nonparametric and, therefore, helps focus attention directly on the object of interest rather than on parametrizations of that object. ...

22 | On bias plus variance
- Wolpert
- 1997
Citation Context ...st rather than on parametrizations of that object. 1.2 Overview In Section 2 of this paper I present a framework capable of addressing off-training-set error, the "Extended Bayesian Framework" (EBF; [9, 10, 14]). This framework has the other major advantage that it encompasses the conventional frameworks, illustrating the subtleties of how they are related, and suggesting variants of them. In Section 3 I pr...

13 | Bayesian backpropagation over I-O functions rather than weights
- Wolpert
- 1994
Citation Context ...neurons, etc. Often, for lack of an alternative, they do this without taking into account the ultimate effect on the direct object of interest, the input-output functions parametrized by those weights [10, 6]. One advantage of the framework presented below is that it is nonparametric and, therefore, helps focus attention directly on the object of interest rather than on parametrizations of that object. ...

12 | Reconciling Bayesian and non-Bayesian analysis
- Wolpert
- 1996
Citation Context ...g lower-case letters. For the purposes of this paper, there is no reason to be concerned with quasi-philosophical distinctions between random variables and "parameters." (See the discussion in Wolpert [15] on prior information.) Whenever possible, "P" notation will be used: the arguments of a "P" indicates if it is a probability or a density or a mixture of the two and, if it involves densities, what...

7 | The lack of a priori distinctions between learning algorithms and the existence of a priori distinctions between learning algorithms, Neural Computation 8
- Wolpert
- 1996
Citation Context ... 1 for all such experiments rather than randomly to choose a new member of the G_i to use for each experiment. Intuitively, this is because such an average reduces variance without changing bias; see [12]. (Note though that this in no way implies that using G 0 for all the experiments is better than using any particular single G ∈ {G_i} for all the experiments.) Though important, such geometry-based dist...

5 | Filter likelihoods and exhaustive learning
- Wolpert
- 1994
Citation Context ...tional frameworks implicitly assumes that g and f are independent. That is why that assumption is adopted here. [Footnote 2: See Wolpert [8] for a discussion of the ramifications of the assumption that g is independent ...] The fact that they themselves parameterize distributions does not forbid either f or g from being arguments of probability distributions. For example, it is perfectly meaningful to write P(f | ...

2 | Priors for infinite networks
- Neal
- 1994
Citation Context ...neurons, etc. Often, for lack of an alternative, they do this without taking into account the ultimate effect on the direct object of interest, the input-output functions parametrized by those weights [10, 6]. One advantage of the framework presented below is that it is nonparametric and, therefore, helps focus attention directly on the object of interest rather than on parametrizations of that object. ...

1 | Occam's razor
- Blumer, et al.
- 1987
Citation Context ...rks use language that implies that their goal is understanding off-training-set behavior, even when they use a test set that can overlap with the training set. For example, in a paper by Blumer et al. [4], in the context of noisefree supervised learning, we read that "the real value of a scientific explanation lies not in its ability to explain [what one has already seen], but in predicting events that h...

1 | [5] Titterington et al. Statistical Analysis of Finite Mixture Distributions
- Blumer
- 1987
Citation Context ...ks use language that implies that their goal is understanding off-training-set behavior, even when they use a test set that can overlap with the training set. For example, in a paper by Blumer et al. [4], in the context of noisefree supervised learning, we read that "the real value of a scientific explanation lies not in its ability to explain [what one has already seen], but in predicting events that...