## Bayesian Neural Networks and Density Networks (1994)

Venue: Nuclear Instruments and Methods in Physics Research, A

Citations: 40 (8 self)

### BibTeX

```bibtex
@INPROCEEDINGS{MacKay94bayesianneural,
  author    = {David J.C. MacKay},
  title     = {Bayesian Neural Networks and Density Networks},
  booktitle = {Nuclear Instruments and Methods in Physics Research, A},
  year      = {1994},
  pages     = {73--80}
}
```

### Abstract

This paper reviews the Bayesian approach to learning in neural networks, then introduces a new adaptive model, the density network. This is a neural network for which target outputs are provided, but the inputs are unspecified. When a probability distribution is placed on the unknown inputs, a latent variable model is defined that is capable of discovering the underlying dimensionality of a data set. A Bayesian learning algorithm for these networks is derived and demonstrated.

**1 Introduction to the Bayesian view of learning.** A binary classifier is a parameterized mapping from an input x to an output y ∈ [0, 1]; when its parameters w are specified, the classifier states the probability that an input x belongs to class t = 1, rather than the alternative t = 0. Consider a binary classifier which models the probability as a sigmoid function of x:

P(t = 1 | x, w, H) = y(x, w, H) = 1 / (1 + e^(−w·x))   (1)

This form of model is known to statisticians as a linear logistic model, and in the neural networks ...
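The linear logistic model of eq. (1) can be sketched in a few lines; this is a minimal illustration, not the paper's code, and the function and variable names are my own:

```python
import numpy as np

def sigmoid_classifier(x, w):
    """Linear logistic model of eq. (1): P(t=1 | x, w) = 1 / (1 + exp(-w.x))."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

# A point far on the positive side of the weight vector gets probability near 1:
w = np.array([2.0, -1.0])
p = sigmoid_classifier(np.array([3.0, 0.5]), w)   # w.x = 5.5, so p is close to 1
```

When w·x = 0 the classifier is maximally uncertain and returns probability 0.5, which is the decision boundary of the model.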

### Citations

524 | Hidden Markov models in computational biology
- Krogh, Brown
- 1994
Citation Context: ...ucture might be elucidated by a model capable of discovering suspicious long-range correlations. The only probabilistic model that has so far been applied to protein families is a hidden Markov model [13]. This model is not inherently capable of discovering long-range correlations, as Markov models, by definition, produce no correlations between observables, given a hidden state sequence. The next-door...

399 | A practical Bayesian framework for backpropagation networks
- MacKay
- 1992
Citation Context: ...esian learning algorithm for these networks is derived and demonstrated. 1 Introduction to the Bayesian view of learning A binary classifier is a parameterized mapping from an input x to an output y ∈ [0, 1]; when its parameters w are specified, the classifier states the probability that an input x belongs to class t = 1, rather than the alternative t = 0. Consider a binary classifier which models the prob...

283 | Learning and relearning in Boltzmann machines
- Hinton, Sejnowski
- 1986
Citation Context: ...rvable quantities is constructed. Multilayer perceptrons have not conventionally been used to create density models (though belief networks [8] and other neural networks such as the Boltzmann machine [9] do define density models). Various interesting research problems in this field relate to the difficulty of defining a full probabilistic model with an MLP. For example, if some inputs in a regression prob...

260 | Learning representations by back-propagating errors. Nature
- Rumelhart, Hinton, et al.
- 1986
Citation Context: ... P(D | w, H) = Σ_{n=1}^{N} log y_{t^(n)}(x, w). (7) The derivative of this function with respect to w is simple to evaluate for the models described above using the 'backpropagation' algorithm (chain rule) [7]. This paper now describes a new Bayesian neural network model. 2 Density Modelling The most popular supervised neural networks, multilayer perceptrons (MLPs), are well established as probabilistic mo...
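For the binary logistic model, the log likelihood of eq. (7) and its chain-rule derivative have a closed form. The sketch below assumes that model (rows of X are the inputs x^(n), t holds the 0/1 targets); it is an illustration of the cited technique, not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, t):
    """Eq. (7): sum over data points n of log P(t^(n) | x^(n), w)."""
    y = sigmoid(X @ w)
    return np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def grad_log_likelihood(w, X, t):
    """Chain-rule ('backpropagation') derivative: for the logistic model
    d/dw log P(D | w) = sum_n (t^(n) - y^(n)) x^(n)."""
    y = sigmoid(X @ w)
    return X.T @ (t - y)
```

A finite-difference check of the gradient against the log likelihood is a standard way to verify such a derivative.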

242 | RNA sequence analysis using covariance models
- Eddy, Durbin
- 1994
Citation Context: ... given a hidden state sequence. The next-door neighbour of proteins, RNA, has been modelled with a 'covariance model' capable of capturing correlations between base-pairs in anti-parallel RNA strands [14]. Here I model the protein families using a density network containing one softmax group for each column. Toy data is shown in table 1. Real data describing 400 proteins in the globin family was recei...

197 | Sequential updating of conditional probabilities on directed graphical structures
- Spiegelhalter, Lauritzen
- 1990
Citation Context: ...ame for modelling tasks in which a density over all the observable quantities is constructed. Multilayer perceptrons have not conventionally been used to create density models (though belief networks [8] and other neural networks such as the Boltzmann machine [9] do define density models). Various interesting research problems in this field relate to the difficulty of defining a full probabilistic model w...

132 | An introduction to latent variable models
- Everitt
- 1984
Citation Context: ...ereas only two independent degrees of freedom are really present. These observations motivate the development of density models that have components rather than categories as their 'latent variables' [10]. Let us denote the observables by t. If a density is defined on the latent variables x, and a parameterized mapping is defined from these latent variables to a probability distribution over the observa...

110 | Autoencoders, minimum description length, and Helmholtz free energy
- Hinton, Zemel
- 1994
Citation Context: ...d Zemel describe the use of the recognition mapping to compute an approximate distribution over the latent variables that is used to train an autoencoder by an elegant free energy minimization method [18]. The connection between the 'MDL' approach that they use and the Bayesian viewpoint is explained in ref. [6]. The main differences between this work and Hinton and Zemel's are the types of network stu...

58 | Bayesian Methods for Backpropagation Networks
- MacKay
- 1993
Citation Context: ...that represent the posterior distribution by a set of Monte Carlo samples from it [3]. The former approach has been successfully applied to practical problems, as described elsewhere [4, 5]. See ref. [6] for a review. In the general case of a classification problem with multiple classes i = 1, ..., I, a 'softmax' classifier is a natural form of model. This assigns probabilities to the alternative class...
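The softmax classifier mentioned in this excerpt assigns a probability to each of the I classes by normalizing exponentiated activations. A minimal sketch, with my own names and a linear activation a = W x assumed for illustration:

```python
import numpy as np

def softmax_classifier(x, W):
    """Softmax model: P(t = i | x, W) = exp(a_i) / sum_j exp(a_j), with a = W x.
    W has one row of weights per class i = 1, ..., I."""
    a = W @ x
    a = a - a.max()          # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum()

probs = softmax_classifier(np.array([1.0, 2.0]),
                           np.array([[0.5, -0.2], [1.0, 0.3], [-0.7, 0.1]]))
```

With I = 2 classes and weights w and −w this reduces to the sigmoid of eq. (1), which is why the softmax is the natural multi-class generalization.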

38 | Dimensionality-reduction using connectionist networks
- Saund
- 1989
Citation Context: ...toencoders One class of MLPs relates closely to density networks: the autoencoding network is trained to reproduce the input vector at its output after mapping it through a low-dimensional bottleneck [16, 17]. The density network is like the second 'generative' half of the autoencoder, from the bottleneck to the output. The first mapping in an autoencoder, the 'recognition' mapping from the input to the hi...

35 | Static and dynamic error propagation networks with applications to speech coding
- Robinson, Fallside
- 1988
Citation Context: ...toencoders One class of MLPs relates closely to density networks: the autoencoding network is trained to reproduce the input vector at its output after mapping it through a low-dimensional bottleneck [16, 17]. The density network is like the second 'generative' half of the autoencoder, from the bottleneck to the output. The first mapping in an autoencoder, the 'recognition' mapping from the input to the hi...

27 | Bayesian learning via stochastic dynamics
- Neal
- 1993
Citation Context: ...rst by approximating the posterior distribution of w (2) by a Gaussian fitted at the optimum w [1, 2]; and by methods that represent the posterior distribution by a set of Monte Carlo samples from it [3]. The former approach has been successfully applied to practical problems, as described elsewhere [4, 5]. See ref. [6] for a review. In the general case of a classification problem with multiple classe...

20 | Bayesian non-linear modelling for the 1993 energy prediction competition
- MacKay
- 1995
Citation Context: ...; and by methods that represent the posterior distribution by a set of Monte Carlo samples from it [3]. The former approach has been successfully applied to practical problems, as described elsewhere [4, 5]. See ref. [6] for a review. In the general case of a classification problem with multiple classes i = 1, ..., I, a 'softmax' classifier is a natural form of model. This assigns probabilities to the alt...

14 | Ace of Bayes: Application of Neural Networks with Pruning. Roskilde, The Danish Meat Research Institute
- Thodberg
- 1993
Citation Context: ...; and by methods that represent the posterior distribution by a set of Monte Carlo samples from it [3]. The former approach has been successfully applied to practical problems, as described elsewhere [4, 5]. See ref. [6] for a review. In the general case of a classification problem with multiple classes i = 1, ..., I, a 'softmax' classifier is a natural form of model. This assigns probabilities to the alt...

10 | Duality between learning machines: a bridge between supervised and unsupervised learning
- Nadal, Parga
- 1994
Citation Context: ...roblem, on the other hand, the hidden vector x is unknown, and the parameters w are conditionally fixed, for the purposes of the evidence evaluation. This is an example of the duality discussed in ref. [11]. Learning: the derivative of the evidence with respect to w The derivative of the log of the evidence (equation 11) is: ∂/∂w log P(t^(n) | w, H) = (1 / P(t^(n) | w, H)) ∫ d^H x exp(G^(n)(x, w)) P(x | H) ∂...
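The integral over the H-dimensional latent space in this derivative can be estimated by simple Monte Carlo over the latent prior. The sketch below assumes a standard Gaussian prior P(x | H) and the sigmoid output model of eq. (1) in place of the paper's general G^(n)(x, w); all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mc_evidence(t, w, n_samples=10000, latent_dim=2):
    """Monte Carlo estimate of the evidence
    P(t | w, H) = ∫ d^H x P(t | x, w) P(x | H),
    assuming a standard Gaussian latent prior and a sigmoid output model."""
    xs = rng.standard_normal((n_samples, latent_dim))
    y = sigmoid(xs @ w)                 # P(t = 1 | x, w) for each latent sample
    lik = y if t == 1 else 1.0 - y
    return lik.mean()
```

With w = 0 the output is 0.5 for every latent sample, so the estimate is exactly 0.5, which is a convenient sanity check on the sampler.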

7 | The evidence framework applied to classification networks
- MacKay
- 1992
Citation Context: ...w P(t^(N+1) | w, H) P(w | D, α, H) P(α | D, H). (5) The Bayesian framework has been implemented in two ways: first by approximating the posterior distribution of w (2) by a Gaussian fitted at the optimum w [1, 2]; and by methods that represent the posterior distribution by a set of Monte Carlo samples from it [3]. The former approach has been successfully applied to practical problems, as described elsewhere ...

6 | Automatic relevance determination for neural networks
- MacKay, Neal
- 1994
Citation Context: ...ollections of weights assume small values is to use multiple undetermined regularization constants {α_c}, each one associated with a class of weights (cf. the automatic relevance determination model [5, 15]). For example, a weight class could consist of all the weights from one latent input to one softmax group. This prior would then favour solutions in which one latent input has non-zero connections to...