## Ensemble learning in Bayesian neural networks (1998)

Venue: Neural Networks and Machine Learning

Citations: 23 (5 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Barber98ensemblelearning,
  author    = {David Barber and Christopher M. Bishop},
  title     = {Ensemble learning in Bayesian neural networks},
  booktitle = {Neural Networks and Machine Learning},
  year      = {1998},
  pages     = {215--237},
  publisher = {Springer}
}
```


### Abstract

Bayesian treatments of learning in neural networks are typically based either on a local Gaussian approximation to a mode of the posterior weight distribution, or on Markov chain Monte Carlo simulations. A third approach, called ensemble learning, was introduced by Hinton and van Camp (1993). It aims to approximate the posterior distribution by minimizing the Kullback-Leibler divergence between the true posterior and a parametric approximating distribution. The original derivation of a deterministic algorithm relied on the use of a Gaussian approximating distribution with a diagonal covariance matrix and hence was unable to capture the posterior correlations between parameters. In this chapter we show how the ensemble learning approach can be extended to full-covariance Gaussian distributions while remaining computationally tractable. We also extend the framework to deal with hyperparameters, leading to a simple re-estimation procedure. One of the benefits of our approach is that it yields a strict lower bound on the marginal likelihood, in contrast to other approximate procedures.
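
The lower-bound property the abstract emphasizes can be illustrated on a toy conjugate model (a hypothetical one-parameter Gaussian model, not the chapter's neural network): the objective maximized over the approximating Gaussian q(w) is the expected log joint minus the entropy term, equivalently log P(D) minus KL(q || posterior), so it is a strict lower bound on the log marginal likelihood that becomes tight exactly when q matches the posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy conjugate model (illustrative only): prior w ~ N(0, 1),
# likelihood y_i ~ N(w, sigma2).  The exact posterior is Gaussian.
sigma2 = 0.5
y = rng.normal(1.0, np.sqrt(sigma2), size=20)
n = len(y)

# Exact posterior N(m, v) for this conjugate case
v = 1.0 / (1.0 + n / sigma2)
m = v * y.sum() / sigma2

def log_evidence():
    # log P(D): y ~ N(0, sigma2*I + J), marginalizing w analytically
    cov = sigma2 * np.eye(n) + np.ones((n, n))
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(cov, y))

def elbo(mu, s2):
    # E_q[log P(D|w)] - KL(q || prior), closed form for Gaussian q = N(mu, s2)
    exp_loglik = (-0.5 * n * np.log(2 * np.pi * sigma2)
                  - 0.5 * (((y - mu) ** 2).sum() + n * s2) / sigma2)
    kl_to_prior = 0.5 * (s2 + mu ** 2 - 1.0 - np.log(s2))
    return exp_loglik - kl_to_prior

# Bound is tight when q equals the exact posterior...
assert abs(elbo(m, v) - log_evidence()) < 1e-6
# ...and strictly below the log evidence for any other q
assert elbo(m + 0.3, v) < log_evidence()
```

In the non-conjugate neural-network case the bound never becomes exactly tight, but maximizing it plays the same role: the gap is the KL divergence being minimized.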

### Citations

8090 | Maximum likelihood from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977 |

4828 | Neural Networks for Pattern Recognition - Bishop - 1995 |

Citation Context: ...yields a posterior distribution of network parameters, P(w|D), conditional on the training data D, and predictions are expressed in terms of expectations with respect to this posterior distribution (Bishop 1995). However, the corresponding integrals over weight space are analytically intractable. One well-established procedure for approximating these integrals, known as Laplace's method, is to model the pos... |
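
The Laplace procedure this excerpt refers to (fit a Gaussian at a mode of the log posterior using the local Hessian) can be sketched in one dimension; the quartic energy function below is a hypothetical stand-in for a negative log posterior, not anything from the chapter.

```python
import numpy as np

# Toy negative log of an unnormalized density exp(-E(w))
def E(w):
    return 1.5 * w ** 2 + 0.1 * w ** 4

def dE(w):
    return 3.0 * w + 0.4 * w ** 3

def d2E(w):
    return 3.0 + 1.2 * w ** 2

# Newton's method finds the mode w* (conjugate gradients plays this
# role in the network setting described in the text)
w = 1.0
for _ in range(50):
    w -= dE(w) / d2E(w)

# Laplace estimate of log Z, Z = integral of exp(-E(w)) dw:
# expand E to second order at the mode and integrate the Gaussian
H = d2E(w)
logZ_laplace = -E(w) + 0.5 * np.log(2 * np.pi / H)

# Compare against brute-force numerical integration
grid = np.linspace(-10.0, 10.0, 200001)
logZ_true = np.log(np.exp(-E(grid)).sum() * (grid[1] - grid[0]))
assert abs(w) < 1e-8                         # mode of this E is at 0
assert abs(logZ_laplace - logZ_true) < 0.05  # close, but not exact
```

The residual gap comes from the quartic term the local Gaussian ignores; ensemble learning instead optimizes the Gaussian globally against a bound.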

1240 | Statistical Decision Theory and Bayesian Analysis. Springer Series in Statistics - Berger - 1985 |

Citation Context: ...ine prior distributions of the hyperparameters and then integrate them out. Since exact integration is analytically intractable, MacKay (1992) uses an approximation called type-II maximum likelihood (Berger 1985) which involves estimating specific values for the hyper-parameters by maximizing the marginal likelihood P(D|β, A) with respect to β and A. The marginal likelihood is given by P(D|β, A) = ∫ P(D|w... |
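
Type-II maximum likelihood as described in this excerpt (choose hyperparameter values that maximize the marginal likelihood) can be demonstrated on a toy conjugate model where the marginal likelihood is available in closed form; the model and the grid search below are illustrative assumptions, not MacKay's actual procedure for networks.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy model: prior w ~ N(0, 1/alpha), likelihood y_i ~ N(w, 1/beta).
# Treat beta as known and pick the prior precision alpha by type-II ML.
beta = 2.0
y = rng.normal(0.8, 1.0 / np.sqrt(beta), size=30)
n = len(y)

def log_marginal(alpha):
    # Marginalizing w gives y ~ N(0, I/beta + J/alpha), J = all-ones
    cov = np.eye(n) / beta + np.ones((n, n)) / alpha
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(cov, y))

# Grid search stands in for the re-estimation formulas used in practice
alphas = np.linspace(0.05, 10.0, 200)
evidence = [log_marginal(a) for a in alphas]
best = alphas[int(np.argmax(evidence))]
```

The chapter's contribution is different in spirit: rather than plugging in point estimates of β and A, the hyperparameters are folded into the variational framework with a simple re-estimation procedure.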

831 | An introduction to variational methods for graphical models - Jordan, Ghahramani, et al. - 1999 |

764 | A view of the EM algorithm that justifies incremental, sparse, and other variants - Neal, Hinton - 1998 |

608 | Bayesian Learning for Neural Networks - Neal - 1996 |

Citation Context: ...trast to other approximate procedures. 1 Introduction Bayesian techniques have been successfully applied to neural networks in the context of both regression and classification problems (MacKay 1992; Neal 1996). In contrast to the maximum likelihood approach, which finds a single estimate for the regression parameters, the Bayesian approach yields a posterior distribution of network parameters, P(w|D), co... |

400 | A practical Bayesian framework for backpropagation networks - MacKay - 1992 |

Citation Context: ...ihood, in contrast to other approximate procedures. 1 Introduction Bayesian techniques have been successfully applied to neural networks in the context of both regression and classification problems (MacKay 1992; Neal 1996). In contrast to the maximum likelihood approach, which finds a single estimate for the regression parameters, the Bayesian approach yields a posterior distribution of network parameters, ... |

139 | Probable networks and plausible predictions -- a review of practical Bayesian methods for supervised neural networks - MacKay - 1995 |

128 | Keeping the neural networks simple by minimizing the description length of the weights - Hinton, van Camp - 1993 |

69 | Fast exact multiplication by the Hessian - Pearlmutter - 1994 |

Citation Context: ...njugate gradients is used to find a mode w∗ of the log posterior distribution. The local Hessian H can then be evaluated efficiently using an extension of the back-propagation procedure (Bishop 1992; Pearlmutter 1994) or by using one of several approximation schemes (Bishop 1995). So far in this discussion of Laplace's method we have assumed that the hyper-parameters β and A are fixed. In a fully Bayesian treatme... |
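
The fast Hessian multiplication cited here rests on the identity Hv = d/dr ∇E(w + rv) at r = 0; Pearlmutter evaluates that directional derivative exactly with an extended backpropagation pass, while the sketch below uses a central-difference stand-in on a hypothetical quadratic, where the answer is known.

```python
import numpy as np

# Toy quadratic E(w) = 0.5 w^T A w, so the Hessian is exactly A
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def grad(w):
    return A @ w  # gradient of the toy E

def hessian_vector(w, v, eps=1e-6):
    # H v = d/dr grad E(w + r v) at r = 0, approximated by a
    # central difference of two gradient evaluations; note that H
    # itself is never formed, only two gradient calls are needed
    return (grad(w + eps * v) - grad(w - eps * v)) / (2.0 * eps)

w = np.array([0.5, -1.0])
v = np.array([1.0, 2.0])
assert np.allclose(hessian_vector(w, v), A @ v, atol=1e-6)
```

For a network with W weights this costs two gradient evaluations per product instead of the O(W²) storage of the full Hessian, which is what makes curvature information usable at scale.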

36 | Ensemble learning for Multi-Layer Networks - Barber, Bishop - 1998 |

Citation Context: ...this chapter we show that the ensemble learning approach can be extended to allow a Gaussian approximating distribution with a general covariance matrix, while still leading to a tractable algorithm (Barber and Bishop 1998). Our focus is on the essential principles of the approach, with the mathematical details relegated to the Appendix. 1.1 Bayesian Neural Networks Consider a two-layer feed-forward network having H hi... |

29 | Approximating posterior distributions in belief networks using mixtures - Bishop, Lawrence, et al. - 1998 |

7 | Approximating posteriors via mixture models - Jaakkola, Jordan - 1998 |

7 | Mixture representations for inference and learning in Boltzmann machines - Lawrence, Bishop, et al. - 1998 |

6 | Tractable undirected approximations for graphical models - Barber, Wiegerinck - 1998 |

4 | Radial Basis Functions: a Bayesian treatment - Barber, Schottky - 1997 |

4 | Variational learning in graphical models and neural networks - Bishop - 1998 |

Citation Context: ...r we show that the ensemble learning approach can be extended to allow a Gaussian approximating distribution with a general covariance matrix, while still leading to a tractable algorithm (Barber and Bishop 1998). Our focus is on the essential principles of the approach, with the mathematical details relegated to the Appendix. 1.1 Bayesian Neural Networks Consider a two-layer feed-forward network having H hi... |

1 | Latent variables, mixture distributions and topographic mappings - Bishop - 1998 |

Citation Context: ...r we show that the ensemble learning approach can be extended to allow a Gaussian approximating distribution with a general covariance matrix, while still leading to a tractable algorithm (Barber and Bishop 1998). Our focus is on the essential principles of the approach, with the mathematical details relegated to the Appendix. 1.1 Bayesian Neural Networks Consider a two-layer feed-forward network having H hi... |