Results 11–20 of 42
The Relationship between PAC, the Statistical Physics framework, the Bayesian framework, and the VC framework
Cited by 40 (7 self)
Abstract:
This paper discusses the intimate relationships between the supervised learning frameworks mentioned in the title. In particular, it shows how all of those frameworks can be viewed as particular instances of a single overarching formalism. In doing this, many commonly misunderstood aspects of those frameworks are explored. In addition, the strengths and weaknesses of those frameworks are compared, and some novel frameworks are suggested (resulting, for example, in a "correction" to the familiar bias-plus-variance formula).
On Overfitting Avoidance As Bias
SFI TR, 1993
Cited by 33 (6 self)
Abstract:
In supervised learning it is commonly believed that penalizing complex functions helps one avoid "overfitting" functions to data, and therefore improves generalization. It is also commonly believed that cross-validation is an effective way to choose amongst algorithms for fitting functions to data. In a recent paper, Schaffer (1993) presents experimental evidence disputing these claims. The current paper consists of a formal analysis of these contentions of Schaffer's. It proves that his contentions are valid, although some of his experiments must be interpreted with caution.
Evaluation and Selection of Biases in Machine Learning
ACM Computing Surveys, 1995
Cited by 32 (0 self)
Abstract:
In this introduction, we define the term bias as it is used in machine learning systems. We motivate the importance of automated methods for evaluating and selecting biases using a framework of bias selection as search in bias and meta-bias spaces. Recent research in the field of machine learning bias is summarized.
The Supervised Learning No-Free-Lunch Theorems
In Proc. 6th Online World Conference on Soft Computing in Industrial Applications, 2001
Cited by 25 (0 self)
Abstract:
This paper reviews the supervised learning versions of the no-free-lunch theorems in a simplified form. It also discusses the significance of those theorems, and their relation to other aspects of supervised learning.
How to Shift Bias: Lessons from the Baldwin Effect
1996
Cited by 19 (3 self)
Abstract:
An inductive learning algorithm takes a set of data as input and generates a hypothesis as output. A set of data is typically consistent with an infinite number of hypotheses; therefore, there must be factors other than the data that determine the output of the learning algorithm. In machine learning, these other factors are called the bias of the learner. Classical learning algorithms have a fixed bias, implicit in their design. Recently developed learning algorithms dynamically adjust their bias as they search for a hypothesis. Algorithms that shift bias in this manner are not as well understood as classical algorithms. In this paper, we show that the Baldwin effect has implications for the design and analysis of bias-shifting algorithms. The Baldwin effect was proposed in 1896 to explain how phenomena that might appear to require Lamarckian evolution (inheritance of acquired characteristics) can arise from purely Darwinian evolution. Hinton and Nowlan presented a computational model of the Baldwin effect in 1987. We explore a variation on their model, which we constructed explicitly to illustrate the lessons that the Baldwin effect has for research in bias-shifting algorithms. The main lesson is that a good strategy for shifting bias in a learning algorithm appears to be to begin with a weak bias and gradually shift to a strong bias.
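The core of Hinton and Nowlan's 1987 model is easy to sketch. The fragment below is a simplified, hypothetical version of their fitness function (the names and constants such as `TRIALS` are our own, not from the abstract): a genome with any locus fixed to the wrong allele can never reach the target, while one with plastic ('?') loci can find it by guessing during its lifetime, and is rewarded for finding it early.

```python
import random

LOCI = 20      # loci in the genome (Hinton and Nowlan used 20)
TRIALS = 1000  # learning trials per lifetime

def fitness(genome, rng):
    """Fitness of one individual in a simplified Hinton-Nowlan model.

    Genome entries: 1 (correct, fixed), 0 (wrong, fixed), '?' (plastic,
    i.e. settable by lifetime learning). A fixed wrong allele makes the
    target unreachable; otherwise the individual guesses its plastic
    loci on each trial, and earlier success means higher fitness.
    """
    if 0 in genome:
        return 1.0  # baseline fitness: target configuration unreachable
    n_plastic = genome.count('?')
    for t in range(TRIALS):
        # each trial, every plastic locus is guessed correctly with prob. 1/2
        if all(rng.random() < 0.5 for _ in range(n_plastic)):
            return 1.0 + 19.0 * (TRIALS - t) / TRIALS
    return 1.0
```

Under selection with this fitness, genomes that fix more loci correctly are favored, so the population's bias shifts from weak (many plastic loci) to strong (mostly fixed loci), mirroring the strategy the abstract recommends for bias-shifting learners.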
Off-Training-Set Error and A Priori Distinctions Between . . .
1995
Cited by 17 (3 self)
Abstract:
This paper uses off-training-set (OTS) error to investigate the assumption-free relationship between learning algorithms. It is shown, loosely speaking, that for any two algorithms A and B, there are as many targets (or priors over targets) for which A has lower expected OTS error than B as vice versa, for loss functions like zero-one loss. In particular, this is true if A is cross-validation and B is "anti-cross-validation" (choose the generalizer with largest cross-validation error). On the other hand, for loss functions other than zero-one (e.g., quadratic loss), there are a priori distinctions between algorithms. However, even for such loss functions, any algorithm is equivalent on average to its "randomized" version, and in this sense still has no first-principles justification in terms of average error. Nonetheless, it may be that (for example) cross-validation has better minimax properties than anti-cross-validation, even for zero-one loss. This paper also analyzes averages over hypotheses rather than targets. Such analyses hold for all possible priors. Accordingly they prove, as a particular example, that cross-validation cannot be justified as a Bayesian procedure. In fact, for a very natural restriction of the class of learning algorithms, one should use anti-cross-validation rather than cross-validation (!). This paper ends with a discussion of the implications of these results for computational learning theory. It is shown that one cannot say: if the empirical misclassification rate is low, the VC dimension of your generalizer is small, and the training set is large, then with high probability your OTS error is small. Other implications for "membership queries" algorithms and "punting" algorithms are also discussed.
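The averaging claim can be verified exhaustively in a tiny instance. The sketch below is our own toy construction, not from the paper: it averages zero-one off-training-set error uniformly over all Boolean targets on a three-point input space, and any algorithm, however sensible or perverse, gets the same average.

```python
from itertools import product

def avg_ots_error(algorithm, train_x=(0, 1), ots_x=(2,), n_points=3):
    """Average zero-one off-training-set error of `algorithm`, uniformly
    over every Boolean target on an n_points input space. `algorithm`
    maps a training set [(x, y), ...] to a predictor x -> {0, 1}."""
    total, count = 0.0, 0
    for target in product([0, 1], repeat=n_points):
        data = [(x, target[x]) for x in train_x]  # training data drawn from target
        h = algorithm(data)
        for x in ots_x:
            total += (h(x) != target[x])
            count += 1
    return total / count

# Two very different algorithms: one always predicts 0, one always predicts 1.
always0 = lambda data: (lambda x: 0)
always1 = lambda data: (lambda x: 1)
```

Both return an average OTS error of exactly 0.5, and the same holds for any deterministic algorithm: for every target on which it is right off the training set, there is a matching target (flipped on the OTS inputs) on which it is wrong.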
Bayesian Backpropagation Over I-O Functions Rather Than Weights
Advances in Neural Information Processing Systems 6, 1994
Cited by 13 (5 self)
Abstract:
1 INTRODUCTION
In the conventional Bayesian view of backpropagation (BP) (Buntine and Weigend, 1991; Nowlan and Hinton, 1994; MacKay, 1992; Wolpert, 1993), one starts with the "likelihood" conditional distribution P(training set = t | weight vector w) and the "prior" distribution P(w). As an example, in regression one might have a "Gaussian likelihood", P(t | w) ∝ exp[−χ²(w, t)] ≡ ∏_i exp[−{net(w, t_X(i)) − t_Y(i)}² / 2σ²] for some constant σ. (t_X(i) and t_Y(i) are the successive input and output values in the training set respectively, and net(w, ·) is the function, induced by w, taking input neuron values to output neuron values.) As another example, the "weight decay" (Gaussian) prior is P(w) ∝ exp(−α(w²)) for some constant α. Bayes' theorem tells us that P(w | t) ∝ P(t | w) P(w). Accordingly, the most probable weight given the data, the "maximum a posteriori" (MAP) w, is the mode over w of P(t | w) P(w), which equals the mode over w of the "cost function" ...
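Under those two Gaussians, the negative log posterior is, up to an additive constant, the familiar sum-of-squares-plus-weight-decay cost. A minimal sketch, where the function names and the linear `net` used in the test are our own illustrative assumptions:

```python
def map_cost(w, xs, ys, net, sigma=1.0, alpha=0.01):
    """Negative log posterior, up to an additive constant, for a Gaussian
    likelihood with noise scale sigma and a Gaussian weight-decay prior
    with strength alpha. The MAP w is the minimizer of this cost."""
    # chi-squared term: -log likelihood
    chi2 = sum((net(w, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * sigma ** 2)
    # weight-decay term: -log prior
    decay = alpha * sum(wi * wi for wi in w)
    return chi2 + decay
```

For example, with a one-weight linear net `lambda w, x: w[0] * x` and data generated by w = [2], the chi-squared term vanishes at the true weight and only the decay term remains.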
Combining Generalizers Using Partitions Of The Learning Set
1992 Lectures in Complex Systems, 1992
Cited by 12 (1 self)
Abstract:
For any real-world generalization problem, there are always many generalizers which could be applied to the problem. This paper discusses some algorithmic techniques for dealing with this multiplicity of possible generalizers. All of these techniques rely on partitioning the provided learning set in two, many different times. The first technique discussed is cross-validation, which is a winner-takes-all strategy (based on the behavior of the generalizers on the partitions of the learning set, it picks one single generalizer from amongst the set of candidate generalizers, and tells you to use that generalizer). The second technique discussed, the one this paper concentrates on, is an extension of cross-validation called stacked generalization. As opposed to cross-validation's winner-takes-all strategy, stacked generalization uses the partitions of the learning set to combine the generalizers, in a nonlinear manner, via another generalizer (hence the term "stacked generalization"). Af...
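A minimal sketch of the partitioning scheme follows. The two level-0 learners and the trivial averaging level-1 combiner are our own placeholders; real stacked generalization would fit a nonlinear level-1 generalizer to the meta-level data.

```python
def mean_learner(data):
    """Level-0 generalizer: predict the mean of the training outputs."""
    m = sum(y for _, y in data) / len(data)
    return lambda x: m

def linear_learner(data):
    """Level-0 generalizer: least-squares fit of y = a*x through the origin."""
    a = sum(x * y for x, y in data) / sum(x * x for x, y in data)
    return lambda x: a * x

def avg_combiner(meta_data):
    """Placeholder level-1 generalizer: ignores meta_data and averages the
    level-0 predictions (a real one would be fit to meta_data)."""
    return lambda feats: sum(feats) / len(feats)

def stacked(data, level0, level1, k=2):
    """Stacked generalization: out-of-partition predictions of the level-0
    generalizers become the training inputs of the level-1 generalizer."""
    folds = [data[i::k] for i in range(k)]
    meta = []
    for i in range(k):
        rest = [p for j, f in enumerate(folds) if j != i for p in f]
        models = [g(rest) for g in level0]
        meta += [([m(x) for m in models], y) for x, y in folds[i]]
    combine = level1(meta)
    final = [g(data) for g in level0]  # refit level-0 on the whole learning set
    return lambda x: combine([m(x) for m in final])
```

Swapping `avg_combiner` for a learner fitted to `meta` is exactly what distinguishes stacking from simple averaging of the level-0 outputs.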
Reconciling Bayesian and Non-Bayesian Analysis
In Maximum Entropy and Bayesian Methods, 1994
Cited by 12 (8 self)
Abstract:
This paper shows that when one extends Bayesian analysis to distinguish the truth from one's guess for the truth, one gains a broader perspective which allows the inclusion of non-Bayesian formalisms. This perspective shows how it is possible for non-Bayesian techniques to perform well, despite their handicaps. It also highlights some difficulties with the "degree of belief" interpretation of probability.
1 Introduction
Why should one want to reconcile Bayesian and non-Bayesian analysis? Bayesian analysis forces one to make one's assumptions explicit; it ensures self-consistency; it provides a single unified approach to all inference problems; if one is very sure of the prior (e.g., as an extreme, you constructed the data-generating mechanism yourself) it is essentially impossible to beat; and, in some ways most important of all (sociologically speaking), Bayesian analysis is in some senses more elegant than non-Bayesian analysis. For these very reasons I have used Bayesian techniques...
Combining Stacking With Bagging To Improve A Learning Algorithm
1996
Cited by 8 (1 self)
Abstract:
In bagging [Bre94a] one uses bootstrap replicates of the training set [Efr79, ET93] to improve a learning algorithm's performance, often by tens of percent. This paper presents several ways that stacking [Wol92b, Bre92] can be used in concert with the bootstrap procedure to achieve a further improvement on the performance of bagging for some regression problems. In particular, in some of the work presented here, one first converts a single underlying learning algorithm into several learning algorithms. This is done by bootstrap resampling the training set, exactly as in bagging. The resultant algorithms are then combined via stacking. This procedure can be viewed as a variant of bagging, where stacking rather than uniform averaging is used to achieve the combining. The stacking improves performance over simple bagging by up to a factor of 2 on the tested problems, and never resulted in worse performance than simple bagging. In other work presented here, there is no step of converting t...
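The bagging half of the procedure is easy to sketch. Below is a hypothetical minimal version (the learner and combiner names are our own): bootstrap-resample the learning set to turn one learner into `n_boot` fitted models, then merge their predictions with `combine`. Passing a uniform average recovers plain bagging; fitting a combiner on held-out predictions, as in stacking, gives the variant this paper studies.

```python
import random

def bag(data, learner, combine, n_boot=10, seed=0):
    """Fit `learner` on n_boot bootstrap replicates of `data` and return a
    predictor that merges the replicates' predictions with `combine`."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_boot):
        # bootstrap replicate: sample len(data) points with replacement
        boot = [data[rng.randrange(len(data))] for _ in data]
        models.append(learner(boot))
    return lambda x: combine([m(x) for m in models])

def constant_learner(data):
    """Toy underlying learner: predict the mean training output."""
    m = sum(y for _, y in data) / len(data)
    return lambda x: m

# plain bagging's combiner; a stacked combiner would replace this
uniform_average = lambda preds: sum(preds) / len(preds)
```

Replacing `uniform_average` with a generalizer trained on out-of-sample predictions of the bootstrap models is the stacking-with-bagging combination described above.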