On Overfitting Avoidance As Bias
- SFI TR, 1993
- Cited by 30 (6 self)
In supervised learning it is commonly believed that penalizing complex functions helps one avoid "overfitting" functions to data, and therefore improves generalization. It is also commonly believed that cross-validation is an effective way to choose amongst algorithms for fitting functions to data. In a recent paper, Schaffer (1993) presents experimental evidence disputing these claims. The current paper consists of a formal analysis of these contentions of Schaffer's. It proves that his contentions are valid, although some of his experiments must be interpreted with caution.
Generalization in Interactive Networks: The Benefits of Inhibitory Competition and Hebbian Learning
- Neural Computation, 2001
- Cited by 28 (5 self)
Computational models in cognitive neuroscience should ideally use biological properties and powerful computational principles to produce behavior consistent with psychological findings. Error-driven backpropagation is computationally powerful, and has proven useful for modeling a range of psychological data, but is not biologically plausible. Several approaches to implementing backpropagation in a biologically plausible fashion converge on the idea of using bidirectional activation propagation in interactive networks to convey error signals. This paper demonstrates two main points about these error-driven interactive networks: (a) they generalize poorly due to attractor dynamics that interfere with the network's ability to systematically produce novel combinatorial representations in response to novel inputs; and (b) this generalization problem can be remedied by adding two widely used mechanistic principles, inhibitory competition and Hebbian learning, that can be independent...
The supervised learning no-free-lunch Theorems
- In Proc. 6th Online World Conference on Soft Computing in Industrial Applications, 2001
- Cited by 22 (0 self)
This paper reviews the supervised learning versions of the no-free-lunch theorems in a simplified form. It also discusses the significance of those theorems, and their relation to other aspects of supervised learning.
How to Shift Bias: Lessons from the Baldwin Effect
- 1996
- Cited by 18 (3 self)
An inductive learning algorithm takes a set of data as input and generates a hypothesis as output. A set of data is typically consistent with an infinite number of hypotheses; therefore, there must be factors other than the data that determine the output of the learning algorithm. In machine learning, these other factors are called the bias of the learner. Classical learning algorithms have a fixed bias, implicit in their design. Recently developed learning algorithms dynamically adjust their bias as they search for a hypothesis. Algorithms that shift bias in this manner are not as well understood as classical algorithms. In this paper, we show that the Baldwin effect has implications for the design and analysis of bias shifting algorithms. The Baldwin effect was proposed in 1896, to explain how phenomena that might appear to require Lamarckian evolution (inheritance of acquired characteristics) can arise from purely Darwinian evolution. Hinton and Nowlan presented a computational model of the Baldwin effect in 1987. We explore a variation on their model, which we constructed explicitly to illustrate the lessons that the Baldwin effect has for research in bias shifting algorithms. The main lesson is that it appears that a good strategy for shift of bias in a learning algorithm is to begin with a weak bias and gradually shift to a strong bias.
Off-Training Set Error And a Priori Distinctions Between . . .
- 1995
- Cited by 16 (3 self)
This paper uses off-training set (OTS) error to investigate the assumption-free relationship between learning algorithms. It is shown, loosely speaking, that for any two algorithms A and B, there are as many targets (or priors over targets) for which A has lower expected OTS error than B as vice versa, for loss functions like zero-one loss. In particular, this is true if A is cross-validation and B is "anti-cross-validation" (choose the generalizer with largest cross-validation error). On the other hand, for loss functions other than zero-one (e.g., quadratic loss), there are a priori distinctions between algorithms. However, even for such loss functions, any algorithm is equivalent on average to its "randomized" version, and in this still has no first principles justification in terms of average error. Nonetheless, it may be that (for example) cross-validation has better minimax properties than anti-cross-validation, even for zero-one loss. This paper also analyzes averages over hypotheses rather than targets. Such analyses hold for all possible priors. Accordingly they prove, as a particular example, that cross-validation cannot be justified as a Bayesian procedure. In fact, for a very natural restriction of the class of learning algorithms, one should use anti-cross-validation rather than cross-validation (!). This paper ends with a discussion of the implications of these results for computational learning theory. It is shown that one cannot say: if empirical misclassification rate is low; the VC dimension of your generalizer is small; and the training set is large, then with high probability your OTS error is small. Other implications for "membership queries" algorithms and "punting" algorithms are also discussed.
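As a toy illustration of the zero-one-loss claim above, the following sketch enumerates every binary target on a five-point input space and compares two learners by their off-training-set error. The input space, the fixed training inputs, and the two learners (`majority_rule` and `minority_rule`) are assumptions made for this example, not constructions from the paper, and the uniform average over targets only stands in for the paper's more general statement.

```python
from itertools import product

INPUTS = range(5)      # hypothetical input space
TRAIN_X = [0, 1, 2]    # inputs whose labels the learner sees
OTS_X = [3, 4]         # off-training-set inputs used for scoring


def majority_rule(train_labels):
    """Predict the most common training label at every unseen input."""
    return int(2 * sum(train_labels) > len(train_labels))


def minority_rule(train_labels):
    """Predict the opposite of majority_rule (a deliberately perverse learner)."""
    return 1 - majority_rule(train_labels)


def ots_error(learner, target):
    """Zero-one OTS error of `learner` when the true function is `target`."""
    guess = learner([target[x] for x in TRAIN_X])
    return sum(guess != target[x] for x in OTS_X) / len(OTS_X)


def compare(learner_a, learner_b):
    """Return (mean error of A, mean error of B, targets A wins, targets B wins)."""
    targets = list(product([0, 1], repeat=len(INPUTS)))
    errs_a = [ots_error(learner_a, f) for f in targets]
    errs_b = [ots_error(learner_b, f) for f in targets]
    a_wins = sum(a < b for a, b in zip(errs_a, errs_b))
    b_wins = sum(b < a for a, b in zip(errs_a, errs_b))
    return sum(errs_a) / len(targets), sum(errs_b) / len(targets), a_wins, b_wins
```

Calling `compare(majority_rule, minority_rule)` gives both learners the same mean OTS error and the same number of winning targets, because every labeling of the unseen inputs is equally represented among the targets.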
Bayesian Backpropagation Over I-O Functions Rather Than Weights
- Advances in Neural Information Processing Systems 6, 1994
- Cited by 13 (5 self)
1 INTRODUCTION In the conventional Bayesian view of backpropagation (BP) (Buntine and Weigend, 1991; Nowlan and Hinton, 1994; MacKay, 1992; Wolpert, 1993), one starts with the "likelihood" conditional distribution P(training set = t | weight vector w) and the "prior" distribution P(w). As an example, in regression one might have a "Gaussian likelihood", $P(t \mid w) \propto \exp[-\chi^2(w, t)] \equiv \prod_i \exp[-\{\mathrm{net}(w, t_X(i)) - t_Y(i)\}^2 / 2\sigma^2]$ for some constant $\sigma$. ($t_X(i)$ and $t_Y(i)$ are the successive input and output values in the training set respectively, and net(w, .) is the function, induced by w, taking input neuron values to output neuron values.) As another example, the "weight decay" (Gaussian) prior is $P(w) \propto \exp(-\alpha w^2)$ for some constant $\alpha$. Bayes' theorem tells us that $P(w \mid t) \propto P(t \mid w) P(w)$. Accordingly, the most probable weight given the data - the "maximum a posteriori" (MAP) w - is the mode over w of P(t | w) P(w), which equals the mode over w of the "cost function" ...
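To make the MAP step concrete: because the logarithm is monotonic, maximizing P(t | w) P(w) over w is the same as minimizing its negative logarithm. Writing the exponent of the Gaussian likelihood as $\chi^2(w, t)$ and the weight-decay constant as $\alpha$, a short derivation follows. It is a sketch that assumes only the two example distributions named above, with $w^2$ read as the squared norm of the weight vector.

```latex
\begin{aligned}
w_{\mathrm{MAP}}
  &= \arg\max_w \, P(t \mid w)\, P(w) \\
  &= \arg\min_w \, \bigl[ -\ln P(t \mid w) - \ln P(w) \bigr] \\
  &= \arg\min_w \, \bigl[ \chi^2(w, t) + \alpha\, w^2 \bigr],
\qquad
\chi^2(w, t) = \sum_i \frac{\{\mathrm{net}(w, t_X(i)) - t_Y(i)\}^2}{2\sigma^2}.
\end{aligned}
```

The normalization constants of both distributions do not depend on w, so they drop out of the minimization.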
Combining Generalizers Using Partitions Of The Learning Set
- 1992 Lectures in Complex Systems, 1992
- Cited by 12 (1 self)
For any real-world generalization problem, there are always many generalizers which could be applied to the problem. This paper discusses some algorithmic techniques for dealing with this multiplicity of possible generalizers. All of these techniques rely on partitioning the provided learning set in two, many different times. The first technique discussed is cross-validation, which is a winner-takes-all strategy (based on the behavior of the generalizers on the partitions of the learning set, it picks one single generalizer from amongst the set of candidate generalizers, and tells you to use that generalizer). The second technique discussed, the one this paper concentrates on, is an extension of cross-validation called stacked generalization. As opposed to cross-validation's winner-takes-all strategy, stacked generalization uses the partitions of the learning set to combine the generalizers, in a non-linear manner, via another generalizer (hence the term "stacked generalization"). Af...
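The contrast between the two techniques can be sketched in a few lines. Everything named here is an assumption for illustration: the two candidate generalizers (`fit_mean`, `fit_line`), the leave-one-out partitioning, and above all the combiner, which is a plain least-squares weighting of two generalizers rather than the non-linear second-level generalizer the abstract describes. Inputs are Python lists with at least two points.

```python
def fit_mean(xs, ys):
    """Generalizer 1: always predict the mean training output."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Generalizer 2: ordinary least-squares line through the training points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: mean_y + slope * (x - mean_x)


def held_out_predictions(fit, xs, ys):
    """Leave-one-out partitions: train on all points but one, predict that one."""
    return [fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])(xs[i])
            for i in range(len(xs))]


def cross_validate(fits, xs, ys):
    """Winner-takes-all: return the generalizer with the lowest held-out error."""
    def held_out_error(fit):
        preds = held_out_predictions(fit, xs, ys)
        return sum((p - y) ** 2 for p, y in zip(preds, ys))
    return min(fits, key=held_out_error)


def stack(fit_a, fit_b, xs, ys):
    """Combine two generalizers with weights fitted on their held-out predictions."""
    p = held_out_predictions(fit_a, xs, ys)
    q = held_out_predictions(fit_b, xs, ys)
    # Normal equations for the weights minimizing sum (wa*p + wb*q - y)^2.
    spp = sum(u * u for u in p)
    sqq = sum(v * v for v in q)
    spq = sum(u * v for u, v in zip(p, q))
    spy = sum(u * y for u, y in zip(p, ys))
    sqy = sum(v * y for v, y in zip(q, ys))
    det = spp * sqq - spq * spq
    if abs(det) < 1e-12:  # held-out predictions collinear: fall back to averaging
        wa = wb = 0.5
    else:
        wa = (sqq * spy - spq * sqy) / det
        wb = (spp * sqy - spq * spy) / det
    model_a, model_b = fit_a(xs, ys), fit_b(xs, ys)
    return lambda x: wa * model_a(x) + wb * model_b(x)
```

`cross_validate` discards every generalizer except the winner, while `stack` keeps both and lets the held-out predictions decide how much each contributes.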
Reconciling Bayesian And Non-Bayesian Analysis
- In Maximum Entropy and Bayesian Methods, 1994
- Cited by 12 (8 self)
This paper shows that when one extends Bayesian analysis to distinguish the truth from one's guess for the truth, one gains a broader perspective which allows the inclusion of non-Bayesian formalisms. This perspective shows how it is possible for non-Bayesian techniques to perform well, despite their handicaps. It also highlights some difficulties with the "degree of belief" interpretation of probability. 1 Introduction Why should one want to reconcile Bayesian and non-Bayesian analysis? Bayesian analysis forces one to make one's assumptions explicit; it ensures self-consistency; it provides a single unified approach to all inference problems; if one is very sure of the prior (e.g., as an extreme, you constructed the data-generating mechanism yourself) it is essentially impossible to beat; and in some ways most important of all (sociologically speaking), Bayesian analysis is in some senses more elegant than non-Bayesian analysis. For these very reasons I have used Bayesian techniques...
Combining Stacking With Bagging To Improve A Learning Algorithm
- 1996
- Cited by 7 (1 self)
In bagging [Bre94a] one uses bootstrap replicates of the training set [Efr79, ET93] to improve a learning algorithm's performance, often by tens of percent. This paper presents several ways that stacking [Wol92b, Bre92] can be used in concert with the bootstrap procedure to achieve a further improvement on the performance of bagging for some regression problems. In particular, in some of the work presented here, one first converts a single underlying learning algorithm into several learning algorithms. This is done by bootstrap resampling the training set, exactly as in bagging. The resultant algorithms are then combined via stacking. This procedure can be viewed as a variant of bagging, where stacking rather than uniform averaging is used to achieve the combining. The stacking improves performance over simple bagging by up to a factor of 2 on the tested problems, and never resulted in worse performance than simple bagging. In other work presented here, there is no step of converting t...
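The bagging baseline described above (bootstrap replicates combined by uniform averaging) can be sketched as follows. The base learner `fit_nearest`, the replicate count, and the fixed seed are assumptions chosen for illustration. The stacked combination that the paper studies is not sketched, since the abstract does not spell out its details.

```python
import random


def fit_nearest(xs, ys):
    """Hypothetical base learner: 1-nearest-neighbour regression in one dimension."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda point: abs(point[0] - x))[1]


def bootstrap_models(fit, xs, ys, n_replicates=25, seed=0):
    """Train one model per bootstrap replicate (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_replicates):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagged(models):
    """Uniform averaging of the replicate models, as in plain bagging."""
    return lambda x: sum(model(x) for model in models) / len(models)
```

A stacking-style variant would replace the uniform average in `bagged` with a combiner trained on the replicate models' predictions.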
Constructing New Attributes for Decision Tree Learning
- 1996
- Cited by 7 (3 self)
A well-known fundamental limitation of selective induction algorithms is that when task-supplied attributes are not adequate for, or directly relevant to, describing hypotheses, their performance in terms of prediction accuracy and/or theory complexity is poor. One solution to this problem is constructive induction. It constructs, by using task-supplied attributes, new attributes that are expected to be more appropriate than the task-supplied attributes for describing the target concepts. This thesis focuses on constructive induction with decision trees as the theory description language. It explores: (1) novel approaches to constructing new binary attributes using existing constructive operators, and (2) novel methods of constructing new nominal and new continuous-valued attributes based on a newly proposed constructive operator. The thesis investigates a fixed rule-based approach to constructing new binary attributes for decision tree learning. It generates conjunctions from producti...

