## A Model of Inductive Bias Learning (2000)

Venue: Journal of Artificial Intelligence Research

Citations: 150 (0 self)

### BibTeX

```bibtex
@ARTICLE{Baxter00amodel,
  author  = {Jonathan Baxter},
  title   = {A Model of Inductive Bias Learning},
  journal = {Journal of Artificial Intelligence Research},
  year    = {2000},
  volume  = {12},
  pages   = {149--198}
}
```

### Abstract

A major problem in machine learning is that of inductive bias: how to choose a learner's hypothesis space so that it is large enough to contain a solution to the problem being learnt, yet small enough to ensure reliable generalization from reasonably-sized training sets. Typically such bias is supplied by hand through the skill and insights of experts. In this paper a model for automatically learning bias is investigated. The central assumption of the model is that the learner is embedded within an environment of related learning tasks. Within such an environment the learner can sample from multiple tasks, and hence it can search for a hypothesis space that contains good solutions to many of the problems in the environment. Under certain restrictions on the set of all hypothesis spaces available to the learner, we show that a hypothesis space that performs well on a sufficiently large number of training tasks will also perform well when learning novel tasks in the same environment. Exp...
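The abstract's central idea — sample several related tasks, search for a hypothesis space (here, a shared feature map) that fits many of them, then reuse it on a novel task — can be sketched in a toy linear setting. This is an illustrative sketch only, not the paper's algorithm: all names (`sample_task`, `B_hat`, etc.) are hypothetical, and recovering the shared subspace via an SVD of per-task solutions is one simple stand-in for searching a hypothesis space family.

```python
import numpy as np

rng = np.random.default_rng(0)

# Environment: tasks share a hidden 2-D feature subspace of a 10-D input space.
d, k, n_tasks, n_samples = 10, 2, 5, 200
B_true = rng.normal(size=(d, k))  # shared feature map (unknown to the learner)

def sample_task(n):
    """Draw one task from the environment: shared features, task-specific weights."""
    w = rng.normal(size=k)
    X = rng.normal(size=(n, d))
    y = X @ B_true @ w + 0.1 * rng.normal(size=n)
    return X, y

# "Bias learning": estimate the shared subspace from several training tasks
# by stacking per-task least-squares solutions and taking their top-k span.
W = np.column_stack([np.linalg.lstsq(*sample_task(n_samples), rcond=None)[0]
                     for _ in range(n_tasks)])
B_hat = np.linalg.svd(W, full_matrices=False)[0][:, :k]  # learned bias

# Transfer: a novel task is learned in the small k-dimensional feature space
# instead of the full d-dimensional input space.
X_new, y_new = sample_task(30)
coef, *_ = np.linalg.lstsq(X_new @ B_hat, y_new, rcond=None)
pred = X_new @ B_hat @ coef
print(f"novel-task MSE: {np.mean((pred - y_new) ** 2):.4f}")
```

Learning the novel task then requires fitting only `k = 2` coefficients instead of `d = 10`; this reduction in the effective hypothesis space is the kind of sample-complexity saving the paper quantifies.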

### Citations

1734 | A theory of the learnable - Valiant - 1984
Citation Context: ...methods for automatically learning the bias. In this paper we introduce and analyze a formal model of bias learning that builds upon the PAC model of machine learning and its variants (Vapnik, 1982; Valiant, 1984; Blumer, Ehrenfeucht, Haussler, & Warmuth, 1989; Haussler, 1992). These models typically take the following general form: the learner is supplied with a hypothesis space and training data drawn indep...

1349 | Bayesian data analysis - Gelman, Carlin, et al. - 1995 |

1320 | Statistical Decision Theory and Bayesian Analysis. Second Edition - Berger - 1985
Citation Context: ...utions. See Thrun and Pratt (1997, chapter 1) for a more comprehensive treatment. Hierarchical Bayes. The earliest approaches to bias learning come from Hierarchical Bayesian methods in statistics (Berger, 1985; Good, 1980; Gelman, Carlin, Stern, & Rubin, 1995). In contrast to the Bayesian methodology, the present paper takes an essentially empirical process approach to modeling the problem of bias learning...

1023 | A probabilistic theory of pattern recognition - Devroye, Gyorfi, et al. - 1996 |

826 | Estimation of Dependences Based on Empirical Data - Vapnik, Kotz - 2006
Citation Context: ...to search for methods for automatically learning the bias. In this paper we introduce and analyze a formal model of bias learning that builds upon the PAC model of machine learning and its variants (Vapnik, 1982; Valiant, 1984; Blumer, Ehrenfeucht, Haussler, & Warmuth, 1989; Haussler, 1992). These models typically take the following general form: the learner is supplied with a hypothesis space and training d...

635 | Learnability and the Vapnik-Chervonenkis dimension - Blumer, Ehrenfeucht, et al. - 1989
Citation Context: ...esis spaces available to the bias learner. For Boolean learning problems (pattern classification) these parameters are the bias learning analogue of the Vapnik-Chervonenkis dimension (Vapnik, 1982; Blumer et al., 1989). As an application of the general theory, the problem of learning an appropriate set of neural-network features for an environment of related tasks is formulated as a bias learning problem. In the ca...

598 | Convergence of Stochastic Processes - Pollard - 1984 |

494 | Multitask learning - Caruana - 1997
Citation Context: ...ive bias learning. Preliminary results with this approach on a chess domain are reported in Khan, Muggleton, and Parson (1998). Improving performance on a fixed reference task. “Multi-task learning” (Caruana, 1997) trains extra neural network outputs to match related tasks in order to improve generalization performance on a fixed reference task. Although this approach does not explicitly identify the extra b...

485 | Real Analysis and Probability - Dudley - 1989 |

380 | Decision theoretic generalizations of the PAC model for neural net and other learning applications - Haussler - 1992
Citation Context: ...introduce and analyze a formal model of bias learning that builds upon the PAC model of machine learning and its variants (Vapnik, 1982; Valiant, 1984; Blumer, Ehrenfeucht, Haussler, & Warmuth, 1989; Haussler, 1992). These models typically take the following general form: the learner is supplied with a hypothesis space and training data drawn independently according to some underlying distribution...

351 | Classical descriptive set theory - Kechris - 1994 |

320 | Neural Network Learning: Theoretical Foundations - Anthony, Bartlett - 1999 |

256 | Probability measures on metric spaces - Parthasarathy - 1967
Citation Context: ...ogy of weak convergence on ... If we assume that ... are separable metric spaces, then ... is also a separable metric space in the Prohorov metric (which metrizes the topology of weak convergence) (Parthasarathy, 1967), so there is no problem with the existence of measures on ... See Appendix D for further discussion, particularly the proof of part 5 in Lemma 32. We define the goal of a bias le...

243 | On the density of families of sets - Sauer - 1972
Citation Context: ...kis dimension ... is the size of the largest set shattered by ... An important result in the theory of learning Boolean functions is Sauer’s Lemma (Sauer, 1972), of which we will also make use. Lemma 9 (Sauer’s Lemma). For a Boolean function class with VC dimension ..., for all positive integers ... We now gene...
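The lemma statement in this excerpt did not survive extraction. As a hedged reconstruction of the standard result (not a verbatim recovery of the paper's Lemma 9), with $d$ the VC dimension of a Boolean function class $H$ and $\Pi_H(m)$ its growth function on $m$ points, Sauer's Lemma reads:

```latex
\Pi_H(m) \;\le\; \sum_{i=0}^{d} \binom{m}{i}
\;\le\; \left(\frac{em}{d}\right)^{d} \qquad \text{for } m \ge d.
```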

181 | The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network - Bartlett - 1998
Citation Context: ...hey assign to class labels), it can be shown that there is an optimal metric or distance measure to use for vector quantization and one-nearest-neighbour classification (Baxter, 1995a, 1997b; Baxter & Bartlett, 1998). This metric can be learnt by sampling from a subset of tasks from the environment, and then used as a distance measure when learning novel tasks drawn from the same environment. Bounds on the numbe...

165 | Transfer of learning by composing solutions for elemental sequential tasks - Singh - 1992 |

146 | Is learning the n-th thing any easier than learning the first - Thrun - 1996 |

122 | Shift of Bias for Inductive Concept Learning - Utgoff - 1984
Citation Context: ...“VBMS” or Variable Bias Management System was introduced as a mechanism for selecting amongst different learning algorithms when tackling a new learning problem. “STABB” or Shift To a Better Bias (Utgoff, 1986) was another early scheme for adjusting bias, but unlike VBMS, STABB was not primarily focussed on searching for bias applicable to large problem domains. Our use of an “environment of related tasks”...

105 | Learning to learn - Thrun, Pratt - 1998
Citation Context: ...ly many training tasks it produces a hypothesis space that with high probability contains good solutions to novel tasks. Another term that has been used for this process is Learning to Learn (Thrun & Pratt, 1997). Our main theorems are stated in an agnostic setting (that is, the hypothesis space family does not necessarily contain a hypothesis space with solutions to all the problems in the environment), but we also give improved bou...

102 | Finding structure in reinforcement learning - Thrun, Schwartz - 1995 |

93 | Discovering structure in multiple learning tasks: The TC algorithm - Thrun, O’Sullivan - 1996 |

91 | Learning internal representations - Baxter - 1995
Citation Context: ...the conditional probabilities they assign to class labels), it can be shown that there is an optimal metric or distance measure to use for vector quantization and one-nearest-neighbour classification (Baxter, 1995a, 1997b; Baxter & Bartlett, 1998). This metric can be learnt by sampling from a subset of tasks from the environment, and then used as a distance measure when learning novel tasks drawn from the same...

84 | A course on empirical processes - Dudley - 1984
Citation Context: ...ssibility” on the hypothesis space family. Permissibility was introduced by Pollard (1984) for ordinary hypothesis classes. His definition is very similar to Dudley’s “image admissible Suslin” (Dudley, 1984). We will be extending this definition to cover hypothesis space families. Throughout this section we assume all functions map from the complete separable metric space ... into ...

76 | A bayesian/information theoretic model of learning to learn via multiple task sampling - Baxter - 1997 |

76 | Adapting bias by gradient descent: an incremental version of delta-bar-delta - Sutton - 1992 |

62 | Learning one more thing - Thrun, Mitchell - 1995 |

46 | The Use of Knowledge in Analogy and Induction - Russell - 1989 |

43 | Learning from hints - Abu-Mostafa - 1994 |

43 | Discriminability-based transfer between neural networks - Pratt - 1992 |

34 | Some History of the Hierarchical Bayesian Methodology - Good - 1980
Citation Context: ...run and Pratt (1997, chapter 1) for a more comprehensive treatment. Hierarchical Bayes. The earliest approaches to bias learning come from Hierarchical Bayesian methods in statistics (Berger, 1985; Good, 1980; Gelman, Carlin, Stern, & Rubin, 1995). In contrast to the Bayesian methodology, the present paper takes an essentially empirical process approach to modeling the problem of bias learning. However, a...

29 | Symbolic-neural systems and the use of hints for developing complex systems - Suddarth, Holden - 1991 |

27 | The parallel transfer of task knowledge using dynamic learning rates based on a measure of relatedness - Silver, Mercer - 1996 |

25 | Rule injection hints as a means of improving network performance and learning time - Suddarth, Kergoisien - 1990 |

20 | Layered concept learning and dynamically-variable bias management - Rendell, Seshu, et al. - 1987 |

18 | How to make a low-dimensional representation suitable for diverse tasks - Intrator, Edelman - 1996 |

17 | The canonical distortion measure for vector quantization and function approximation - Baxter - 1998 |

16 | Solving a huge number of similar tasks: a combination of multi-task learning and hierarchical Bayesian modeling - Heskes - 1998 |

15 | Adaptive Generalisation and the Transfer of Knowledge,” University of Exeter: R257 - Sharkey, Sharkey - 1992 |

14 | Distribution inequalities for the binomial law - Slud - 1977
Citation Context: ...which is half the probability that a binomial random variable is at least ... By Slud’s inequality (Slud, 1977), ...
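The inequality cited in this excerpt did not survive extraction. A standard form of Slud's inequality, as used in learning-theory lower bounds (see, e.g., Anthony & Bartlett, 1999), states that for a binomial random variable $B(n,p)$ with $p \le 1/4$ and integer $k \ge np$:

```latex
\Pr\{B(n,p) \ge k\} \;\ge\; \Pr\!\left\{ Z \ge \frac{k - np}{\sqrt{np(1-p)}} \right\},
\qquad Z \sim N(0,1).
```

This lower-bounds a binomial tail by a normal tail, which is then bounded below by an inequality such as Tate's (cited later in this list).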

10 | The Need for Biases in Learning Generalisations - Mitchell - 1980
Citation Context: ...task is the initial choice of hypothesis space; it has to be large enough to contain a solution to the problem at hand, yet small enough to ensure good generalization from a small number of examples (Mitchell, 1991). Once a suitable bias has been found, the actual learning task is often straightforward. Existing methods of bias generally require the input of a human expert in the form of heuristics and domain k...

8 | Repeat learning using predicate invention - Khan, Muggleton, et al. - 1998 |

4 | The canonical distortion measure in feature space and 1-NN classification - Baxter, Bartlett - 1998
Citation Context: ...hey assign to class labels), it can be shown that there is an optimal metric or distance measure to use for vector quantization and one-nearest-neighbour classification (Baxter, 1995a, 1997b; Baxter & Bartlett, 1998). This metric can be learnt by sampling from a subset of tasks from the environment, and then used as a distance measure when learning novel tasks drawn from the same environment. Bounds on the numbe...

2 | The process of learning - Langford - 1989 |

1 | Lower bounds on the VC-dimension of multi-layer threshold networks - Bartlett - 1993 |

1 | Continual Learning in Reinforcement Environments. R. Oldenbourg Verlag - Ring - 1995 |

1 | On a double inequality of the normal distribution - Tate - 1953
Citation Context: ...random variable is at least ..., where ... is normal. Tate’s inequality (Tate, 1953) states that for all ..., ... Combining the last two inequalities completes the proof. ...