## A Bayesian/information theoretic model of learning to learn via multiple task sampling (1997)

### Other Repositories/Bibliography

Venue: Machine Learning

Citations: 79 (2 self)

### BibTeX

```bibtex
@ARTICLE{Baxter97abayesian/information,
  author  = {Jonathan Baxter},
  title   = {A Bayesian/information theoretic model of learning to learn via multiple task sampling},
  journal = {Machine Learning},
  volume  = {28},
  year    = {1997},
  pages   = {7--39}
}
```

### Abstract

A Bayesian model of learning to learn by sampling from multiple tasks is presented. The multiple tasks are themselves generated by sampling from a distribution over an environment of related tasks. Such an environment is shown to be naturally modelled within a Bayesian context by the concept of an objective prior distribution. It is argued that for many common machine learning problems, although in general we do not know the true (objective) prior for the problem, we do have some idea of a set of possible priors to which the true prior belongs. It is shown that under these circumstances a learner can use Bayesian inference to learn the true prior by learning sufficiently many tasks from the environment. In addition, bounds are given on the amount of information required to learn a task when it is learnt simultaneously with several other tasks. The bounds show that if the learner has little knowledge of the true prior, but the dimensionality of the true prior is small, then sampling multiple tasks is highly advantageous. The theory is applied to the problem of learning a common feature set, or equivalently a low-dimensional representation (LDR), for an environment of related tasks.
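The abstract's central mechanism, inferring the true prior $\pi^*$ within a known set $\Pi$ of candidate priors by observing many tasks, can be sketched with a toy hierarchical model. Everything below (Beta candidate priors over a Bernoulli task parameter, 20 examples per task, the particular hyperparameter values) is a hypothetical illustration of the idea, not the paper's construction:

```python
import math
import random

random.seed(0)

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

# Hypothetical finite set Pi of candidate priors: each is a Beta(a, b)
# distribution over a Bernoulli task parameter theta. The true
# (objective) prior pi* is one of them.
candidates = [(1.0, 1.0), (5.0, 2.0), (2.0, 5.0)]
true_prior = (5.0, 2.0)

def sample_task_data(m):
    """Draw one task theta ~ pi*, then m Bernoulli(theta) examples."""
    theta = random.betavariate(*true_prior)
    k = sum(random.random() < theta for _ in range(m))
    return k, m

def log_marginal(k, m, a, b):
    # log p(data | candidate prior), with theta integrated out;
    # the binomial coefficient is common to all candidates and cancels.
    return log_beta(a + k, b + m - k) - log_beta(a, b)

def posterior_over_priors(n_tasks, m=20):
    """Posterior over candidate priors after n_tasks sampled tasks."""
    logs = [0.0] * len(candidates)        # uniform hyper-prior
    for _ in range(n_tasks):
        k, m_seen = sample_task_data(m)
        for i, (a, b) in enumerate(candidates):
            logs[i] += log_marginal(k, m_seen, a, b)
    z = max(logs)
    w = [math.exp(l - z) for l in logs]
    s = sum(w)
    return [x / s for x in w]

# Posterior mass concentrates on the true prior as tasks accumulate.
for n in (1, 10, 100):
    print(n, [round(p, 3) for p in posterior_over_priors(n)])
```

With enough tasks the hyper-posterior concentrates on $\pi^*$, which is the "learning the true prior" claim of the abstract in its simplest finite form.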

### Citations

8565 citations
Elements of Information Theory
- Cover, Thomas
- 1991
Citation context: …$D_K(P_{\Theta^n|\pi^*} \,\|\, P_{\Theta^n|\pi}) = -\log E_\Pi e^{-n\Delta_K(\pi,\pi^*)} \le -\log E_\Pi e^{-n\alpha\Delta_H(\pi,\pi^*)}$. The penultimate line follows because the KL divergence is additive over the product of independent distributions (see e.g. Cover & Thomas, 1991). Lemma 11: If $\dim_{P\Pi}(\pi^*)$ exists then for any $0 < \alpha < \infty$, $\lim_{n\to\infty} \frac{-\log E_\Pi e^{-n\alpha\Delta_H(\pi,\pi^*)}}{\log n} = \dim_{P\Pi}(\pi^*)$ (A.5). Proof: The arguments used in the proof of Lemma 11 are similar to those used in (Hau…
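The additivity property this excerpt invokes — the KL divergence between two product distributions is the sum of the componentwise divergences — is easy to check numerically on a toy pair of discrete product distributions (the particular probability vectors below are arbitrary):

```python
import math
from itertools import product

def kl(p, q):
    """KL divergence between two discrete distributions (nats)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two independent coordinates: a binary one and a ternary one.
p1, q1 = [0.7, 0.3], [0.5, 0.5]
p2, q2 = [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]

# Explicit product distributions over the 6-element joint outcome space.
pj = [a * b for a, b in product(p1, p2)]
qj = [a * b for a, b in product(q1, q2)]

# KL(P1 x P2 || Q1 x Q2) = KL(P1 || Q1) + KL(P2 || Q2)
assert abs(kl(pj, qj) - (kl(p1, q1) + kl(p2, q2))) < 1e-12
print(kl(pj, qj))
```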

1240 citations
Statistical Decision Theory and Bayesian Analysis. Springer Series in Statistics
- Berger
- 1985
Citation context: …there the training sets are not generated independently for each task. The Bayesian aspect of the model presented here is a special case of what is known as hierarchical Bayesian inference (see e.g. Berger, 1985; Berger, 1986; Good, 1980). To the best of my knowledge the asymptotic analysis given in this paper for these models is new, as is the consideration of the effect of the difference in the number of…

232 citations
Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition. Neurocomputing: Algorithms, Architectures and Applications
- Bridle
- 1989
Citation context: …$p(x)$ is not modeled, only the conditional distribution on class labels $p(y|x)$. Denoting the output of a network with weights $\theta$ by $f_\theta(x)$, and interpreting $f_\theta(x)$ as $p(y{=}1|x)$, it can easily be shown (Bridle, 1989) that the probability of data set $z^n = (x_1,y_1),\dots,(x_n,y_n)$ given weights $\theta$ is $p(z^n|\theta) = \prod_{i=1}^n p(x_i)\, e^{-E(z^n;\theta)}$ (10), where $E(z^n;\theta) = -\sum_{i=1}^n \big[ y_i \log f_\theta(x_i) + (1-y_i)\log(1-f_\theta(x_i)) \big]$ (11).…
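The identity quoted in this excerpt — that the label likelihood under a network whose output is read as $p(y{=}1|x)$ equals $e^{-E}$ with $E$ the cross-entropy error — can be verified directly. The network outputs below are made-up numbers standing in for $f_\theta(x_i)$:

```python
import math

def cross_entropy(ys, fs):
    # E(z^n; theta) = -sum_i [ y_i log f(x_i) + (1 - y_i) log(1 - f(x_i)) ]
    return -sum(y * math.log(f) + (1 - y) * math.log(1 - f)
                for y, f in zip(ys, fs))

# Toy labels and network outputs f_theta(x_i), read as p(y = 1 | x_i).
ys = [1, 0, 1, 1]
fs = [0.9, 0.2, 0.7, 0.6]

# Likelihood of the labels two ways: directly as a product of
# per-example probabilities, and as exp(-E).
direct = 1.0
for y, f in zip(ys, fs):
    direct *= f if y == 1 else 1 - f

assert abs(direct - math.exp(-cross_entropy(ys, fs))) < 1e-12
print(direct)
```

This is why minimising cross-entropy error is maximum-likelihood estimation of $\theta$ under the probabilistic reading of the outputs.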

204 citations
Approximation capabilities of multilayer feedforward networks
- Hornik
- 1991

153 citations
The evidence framework applied to classification networks
- MacKay
- 1992
Citation context: …of these results to representation or feature-map learning with neural networks. Hierarchical Bayesian inference has also been discussed in the context of neural networks by several authors (see e.g. MacKay, 1991, although the techniques presented there are not explicitly identified as hierarchical Bayes). As far as I know the idea of an objective prior has not been employed previously in Bayesi…

107 citations
Information-theoretic asymptotics of Bayes methods
- Clarke, Barron
- 1990
Citation context: …$\log n + o(\log n)$ (3). The form of this equation, $\log n$ multiplied by the dimension of the space of possible priors around the true prior $\pi^*$, is similar to results from ordinary Bayesian inference (Clarke & Barron, 1990). The results of section 3 are purely concerned with the amount of information required to learn each task within an $n$-task training set; they do not address the problem of how the information is obt…
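The "dimension times $\log n$" scaling the excerpt compares against can be seen in the simplest one-dimensional case: the extra code length (regret) of a Bayes mixture over Bernoulli sources, with a uniform prior on the parameter, grows like $\frac{1}{2}\log n$. This is a standard toy illustration of the Clarke–Barron asymptotics, not a computation from the paper; the choice $\theta = 0.7$ and the typical-sequence simplification $k = \theta n$ are mine:

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def bayes_regret(n, theta=0.7):
    """Code-length overhead of the Bayes mixture (uniform prior on theta)
    relative to the true Bernoulli(theta) source, evaluated on a typical
    sequence with k = theta * n ones."""
    k = round(theta * n)
    ll_true = k * math.log(theta) + (n - k) * math.log(1 - theta)
    # log of the integrated likelihood: Beta(k + 1, n - k + 1)
    ll_mix = log_beta(k + 1, n - k + 1)
    return ll_true - ll_mix

# The regret tracks (1/2) log n up to an additive constant.
for n in (100, 1000, 10000):
    print(n, round(bayes_regret(n), 3), round(0.5 * math.log(n), 3))
```

Multiplying $n$ by 100 adds about $\frac{1}{2}\log 100 \approx 2.3$ nats of regret, matching the one-dimensional ($d = 1$) case of the $\frac{d}{2}\log n$ asymptotics.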

90 citations
Learning internal representations
- Baxter
- 1995
Citation context: …this expression to the upper bound on the number of examples required per task for good generalisation in a PAC sense of $O(W_{OUT} + W_{LDR}/n)$ is noteworthy (see Baxter, 1995b for a derivation of the latter expression). Note how the amount of information required to learn each task decays to $kW_{OUT}$ as the number of tasks being learnt increases. $kW_{OUT}$ is the minimum amount…
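The amortisation in the quoted bound $O(W_{OUT} + W_{LDR}/n)$ is simple arithmetic: the cost of the shared low-dimensional representation is divided across the $n$ tasks learnt simultaneously, so the per-task burden decays toward the task-specific part alone. The weight counts below are invented purely for illustration:

```python
# Hypothetical sizes: W_LDR weights in the shared low-dimensional
# representation, W_OUT weights in each task-specific output layer.
W_LDR, W_OUT = 10_000, 50

def examples_per_task(n):
    # O(W_OUT + W_LDR / n): the shared representation's cost is
    # amortised across the n tasks learnt simultaneously.
    return W_OUT + W_LDR / n

for n in (1, 10, 100, 1000):
    print(n, examples_per_task(n))
```

A single task pays for the whole representation; with a thousand tasks the per-task cost is dominated by the output weights, mirroring the decay to $kW_{OUT}$ noted in the excerpt.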

87 citations
Learning from hints in neural networks
- Abu-Mostafa
- 1990
Citation context: …distribution is given in section 4.4. 1.1. Related Work: Several authors have made empirical studies of the idea that learning multiple related tasks should improve performance, see e.g. (Caruana, 1993; Abu-Mostafa, 1989; Mitchell & Thrun, 1994). Experimental verification of this for feedforward nets was also reported in (Baxter, 1995b). The additional assumption that the tasks are distributed according to an objecti…

61 citations
Fat-shattering and the learnability of real-valued functions
- Bartlett, Long, et al.
- 1996

61 citations
Learning one more thing
- Thrun, Mitchell
- 1995
Citation context: …in section 4.4. 1.1. Related Work: Several authors have made empirical studies of the idea that learning multiple related tasks should improve performance, see e.g. (Caruana, 1993; Abu-Mostafa, 1989; Mitchell & Thrun, 1994). Experimental verification of this for feedforward nets was also reported in (Baxter, 1995b). The additional assumption that the tasks are distributed according to an objective distribution is what…

56 citations
Learning many related tasks at the same time with backpropagation
- Caruana
- 1995
Citation context: …of a normal distribution is given in section 4.4. 1.1. Related Work: Several authors have made empirical studies of the idea that learning multiple related tasks should improve performance, see e.g. (Caruana, 1993; Abu-Mostafa, 1989; Mitchell & Thrun, 1994). Experimental verification of this for feedforward nets was also reported in (Baxter, 1995b). The additional assumption that the tasks are distributed acco…

33 citations
Some History of the Hierarchical Bayesian Methodology
- Good
- 1980
Citation context: …not generated independently for each task. The Bayesian aspect of the model presented here is a special case of what is known as hierarchical Bayesian inference (see e.g. Berger, 1985; Berger, 1986; Good, 1980). To the best of my knowledge the asymptotic analysis given in this paper for these models is new, as is the consideration of the effect of the difference in the number of hyper-parameters and model…

29 citations
Function learning from interpolation
- Anthony, Bartlett
Citation context: …by making extra assumptions, such as that the function values are corrupted by noise (Bartlett, Long & Williamson, 1994), or that every algorithm within a certain class of algorithms performs well (Anthony & Bartlett, 1995). Without such assumptions it is possible to construct (albeit artificial) scenarios in which every function within some class encodes its identity at every point (Bartlett, Long & Williamson, 1994).…

16 citations
Learning model bias
- Baxter
- 1996

14 citations
General bounds on the mutual information between a parameter and n conditionally independent observations
- Haussler, Opper
- 1995
Citation context: …Euclidean models given in section 4.2 could also be derived more directly from the results of (Clarke & Barron, 1990). The motivation behind the approach taken here (which is based on the ideas in Haussler & Opper, 1995a) is that it provides results for general metric spaces, not just Euclidean models, although this is at the expense of losing lower order terms in the asymptotic estimates. Theorem 1 can also be der…

12 citations
A Bayesian/information theoretic model of bias learning
- Baxter
- 1996
Citation context: …the bias can be viewed as a form of learning to learn. In this paper a Bayesian model of bias learning is introduced, based on the VC/PAC-type models of bias learning introduced in (Baxter, 1995b; Baxter, 1996b). The central assumption of all these models (including that of the present paper) is that the learner is embedded within an environment of related tasks. The learner is able to sample from the envi…

10 citations
The Need for Biases in Learning Generalisations
- Mitchell
- 1980
Citation context: …1. Introduction: Hume's analysis shows that there is no a priori basis for induction. In a machine learning context, this means that a learner must be biased in some way for it to generalise well (Mitchell, 1990). Typically such bias is introduced by hand through the skill and insights of experts, but despite many notable successes, this process is clearly limited by the experts' abilities. Hence a desirable…

6 citations
Jeffreys' Prior is Asymptotically Least Favourable under Entropy Risk
- Clarke, Barron
- 1994
Citation context: …the conditions under which Jeffreys' prior is the optimal hyper-prior to use for the hierarchical models discussed here. This question has only recently been settled for ordinary Bayes models (Barron & Clarke, 1994). Another important question is to what extent the assumption of realizability (i.e. $\pi^* \in \Pi$) can be relaxed. Also, the results of (Haussler & Opper, 1995b) can be used to derive asymptotic bounds on…

4 citations
A model for bias learning
- Baxter
- 1997
Citation context: …easier, learning the bias can be viewed as a form of learning to learn. In this paper a Bayesian model of bias learning is introduced, based on the VC/PAC-type models of bias learning introduced in (Baxter, 1995b; Baxter, 1996b). The central assumption of all these models (including that of the present paper) is that the learner is embedded within an environment of related tasks. The learner is able to sampl…

2 citations
Multivariate Estimation
- Berger
- 1986
Citation context: …the training sets are not generated independently for each task. The Bayesian aspect of the model presented here is a special case of what is known as hierarchical Bayesian inference (see e.g. Berger, 1985; Berger, 1986; Good, 1980). To the best of my knowledge the asymptotic analysis given in this paper for these models is new, as is the consideration of the effect of the difference in the number of hyper-paramete…

1 citation
Reconstructing a neural network from its output
- Fefferman
- 1994