## Ensemble Learning and Evidence Maximization (1995)

Venue: Proc. NIPS

Citations: 18 (2 self)

### BibTeX

@INPROCEEDINGS{MacKay95ensemblelearning,
  author    = {David J. C. MacKay},
  title     = {Ensemble Learning and Evidence Maximization},
  booktitle = {Proc. NIPS},
  year      = {1995}
}

### Abstract

Ensemble learning by variational free energy minimization is a tool introduced to neural networks by Hinton and van Camp in which learning is described in terms of the optimization of an ensemble of parameter vectors. The optimized ensemble is an approximation to the posterior probability distribution of the parameters. This tool has now been applied to a variety of statistical inference problems. In this paper I study a linear regression model with both parameters and hyperparameters. I demonstrate that the evidence approximation for the optimization of regularization constants can be derived in detail from a free energy minimization viewpoint.

1 Ensemble Learning by Free Energy Minimization

A new tool has recently been introduced into the field of neural networks and statistical inference. In traditional approaches to neural networks, a single parameter vector w is optimized by maximum likelihood or penalized maximum likelihood. In the Bayesian interpretation, these optimized param...
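The idea described in the abstract can be illustrated on a toy problem. Below is a minimal sketch, not code from the paper: the model, data, and the precisions `alpha` and `beta` are invented for illustration. A Gaussian ensemble Q(w) = Normal(mu, s2) is fitted to a one-parameter linear regression by minimizing the variational free energy; for this conjugate model the minimizer coincides with the exact posterior, which the final checks confirm numerically.

```python
import numpy as np

# Toy one-parameter linear model y = w*x + noise (all values invented).
rng = np.random.default_rng(0)
x = rng.normal(size=50)
w_true = 2.0
beta = 4.0    # assumed noise precision (1/sigma^2)
alpha = 1.0   # assumed prior precision on w
y = w_true * x + rng.normal(scale=beta ** -0.5, size=50)

def free_energy(mu, s2):
    """F(Q) = E_Q[-log P(D|w)] + E_Q[-log P(w)] - H[Q], up to constants,
    for the ensemble Q(w) = Normal(mu, s2)."""
    exp_nll = 0.5 * beta * np.sum((y - mu * x) ** 2 + s2 * x ** 2)
    exp_prior = 0.5 * alpha * (mu ** 2 + s2)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s2)  # Gaussian entropy
    return exp_nll + exp_prior - entropy

# For this conjugate model the minimizing ensemble is the exact posterior:
s2_post = 1.0 / (alpha + beta * np.sum(x ** 2))
mu_post = beta * s2_post * np.sum(x * y)

# Perturbing (mu, s2) away from the posterior can only increase F.
f_opt = free_energy(mu_post, s2_post)
assert f_opt < free_energy(mu_post + 0.1, s2_post)
assert f_opt < free_energy(mu_post, s2_post * 2.0)
```

The same stationarity conditions, applied to a model that also has hyperparameters, are what the paper uses to recover the evidence approximation.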

### Citations

764 | A view of the EM algorithm that justifies incremental, sparse, and other variants - Neal, Hinton - 1998 |

522 | Bayesian interpolation - MacKay - 1992 |
Citation Context: ...ers are viewed as defining the mode of a posterior probability distribution P(w|D, H) (given data D and model assumptions H), which can be approximated, with a Gaussian distribution P̃ for example (MacKay 1992b), in order to obtain predictive distributions and optimize model control parameters. The new concept introduced by Hinton and van Camp (1993) is to work in terms of an approximating ensemble Q(w; θ)... |

399 | A practical Bayesian framework for backpropagation networks - MacKay - 1992 |
Citation Context: ...ers are viewed as defining the mode of a posterior probability distribution P(w|D, H) (given data D and model assumptions H), which can be approximated, with a Gaussian distribution P̃ for example (MacKay 1992b), in order to obtain predictive distributions and optimize model control parameters. The new concept introduced by Hinton and van Camp (1993) is to work in terms of an approximating ensemble Q(w; θ)... |

128 | Keeping the neural networks simple by minimizing the description length of the weights - Hinton, van Camp - 1993 |

120 | Statistical Mechanics - Feynman - 1972 |
Citation Context: ...ins this value for Q(w; θ) = P(w|D, H). F(θ) can be viewed as the sum of −log P(D|H) and the Kullback–Leibler divergence between Q(w; θ) and P(w|D, H).¹ [Footnote 1: Variational free energy minimization is a well-established tool in statistical physics (Feynman 1972); `mean field theory' is an important special case. The free energy can also be described in terms of description lengths.] For certain models and ce... |
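The decomposition quoted in this context can be written out explicitly, using the notation of the surrounding snippets (Q(w; θ) the approximating ensemble, P(w|D, H) the posterior):

```latex
F(\theta)
  = \int Q(w;\theta)\,\ln\frac{Q(w;\theta)}{P(D\mid w,H)\,P(w\mid H)}\,\mathrm{d}w
  = -\ln P(D\mid H)
    + D_{\mathrm{KL}}\!\bigl(Q(w;\theta)\,\big\Vert\,P(w\mid D,H)\bigr).
```

Since the KL divergence is non-negative and vanishes only when the two distributions coincide, F(θ) ≥ −ln P(D|H), with equality exactly at Q(w; θ) = P(w|D, H), which is why minimizing F drives the ensemble toward the posterior.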

110 | Autoencoders, minimum description length, and Helmholtz free energy - Hinton, Zemel - 1994 |

79 | Bayesian inductive inference and maximum entropy - Gull - 1988 |
Citation Context: ...ed by a regularization constant α. The variables α and β are known as hyperparameters. Problems for which models can be written in the form (2) include linear interpolation with a fixed basis set (Gull 1988; MacKay 1992a), non-linear regression with a neural network (MacKay 1992b), and image deconvolution (Gull 1989). In the simplest case (linear models, Gaussian noise), the first factor in (2), the lik... |

73 | Developments in maximum entropy data analysis - Gull - 1989 |
Citation Context: ...ht decay parameters. 2 Inference of parameters and hyperparameters There has been a debate over the appropriateness of the generalized maximum likelihood method, also known as the evidence framework (Gull 1989; MacKay 1992a), for controlling hyperparameters in linear and non-linear regression models (Wolpert 1993; MacKay 1994). In this section I demonstrate that, for linear models, a simple free energy min... |

20 | The Helmholtz machine - Dayan, Hinton, et al. - 1995 |
Citation Context: ...ting distribution Q for this distribution P. A single objective function F can then be defined for simultaneous optimization of the generative model and the recognition model. The Helmholtz machine (Dayan et al. 1995) is a further generalization of these ideas. In a broader statistical context, Neal and Hinton (1993) have shown that it is possible to view the Expectation-Maximization (EM) algorithm in terms of a ... |
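Neal and Hinton's observation mentioned here, that EM is coordinate ascent on a single free-energy-style objective F(Q, θ), can be sketched on a toy two-component Gaussian mixture. This is not code from any of the cited papers; the data and initialization are invented for illustration.

```python
import numpy as np

# Toy mixture of two unit-variance Gaussians with unknown means and weights.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.0, 100)])

mu = np.array([-1.0, 1.0])   # initial component means (invented)
pi = np.array([0.5, 0.5])    # initial mixing weights

def log_lik(mu, pi):
    comp = np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    return np.sum(np.log(comp @ pi))

lls = []
for _ in range(25):
    # E-step: set Q(z) to the posterior responsibilities;
    # this maximizes F over Q with the parameters held fixed.
    comp = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    r = comp / comp.sum(axis=1, keepdims=True)
    # M-step: maximize F over (mu, pi) with Q held fixed.
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    pi = r.mean(axis=0)
    lls.append(log_lik(mu, pi))

# Both half-steps increase the same F, so no full sweep can
# decrease the data log-likelihood that F lower-bounds.
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
```

The point of the single-objective view is exactly this monotonicity: each half-step is an ascent step on the same F, so convergence arguments need only one function.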

18 | Hyperparameters: Optimize, or integrate out? - MacKay - 1996 |
Citation Context: ...mation: find the self-consistent solution {w_MP|α_MP, α_MP} such that w_MP|α_MP maximizes P(w|D, α_MP, H) and α_MP satisfies equation (9). Justifications for this approximation are given in (MacKay 1995b; MacKay 1994), where correction terms of order 1/√γ are also given. 2.3 Free energy approximation Let us consider approximating the joint distribution of w and α given the data, P(w, α|D, H) =... |
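The self-consistent condition described in this context can be sketched for a linear model. The design matrix, noise precision `beta`, and data below are invented for illustration, and the paper's equation (9) is represented here by MacKay's re-estimation formula α = γ/‖w_MP‖² with γ = Σᵢ λᵢ/(λᵢ + α), where the λᵢ are eigenvalues of β AᵀA; this is an assumption about which fixed point is meant, not a quotation of the paper.

```python
import numpy as np

# Invented linear-regression problem y = A w + noise.
rng = np.random.default_rng(1)
A = rng.normal(size=(100, 5))            # design matrix (assumed)
w_true = rng.normal(scale=2.0, size=5)
beta = 25.0                              # assumed known noise precision
y = A @ w_true + rng.normal(scale=beta ** -0.5, size=100)

# Eigenvalues of beta * A^T A, used to count well-determined parameters.
lam = beta * np.linalg.eigvalsh(A.T @ A)

alpha = 1.0
for _ in range(100):
    # w_MP for the current alpha: the ridge-regularized solution.
    w_mp = np.linalg.solve(beta * (A.T @ A) + alpha * np.eye(5),
                           beta * (A.T @ y))
    # gamma = effective number of well-determined parameters.
    gamma = np.sum(lam / (lam + alpha))
    # Re-estimation step: alpha = gamma / ||w_MP||^2.
    alpha = gamma / np.sum(w_mp ** 2)
```

At convergence the pair {w_MP|α_MP, α_MP} satisfies both conditions simultaneously, which is the self-consistent solution the snippet refers to.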

16 | Free energy minimisation algorithm for decoding and cryptanalysis (Electronics Letters 31:445-47) - MacKay - 1995 |
Citation Context: ...mation: find the self-consistent solution {w_MP|α_MP, α_MP} such that w_MP|α_MP maximizes P(w|D, α_MP, H) and α_MP satisfies equation (9). Justifications for this approximation are given in (MacKay 1995b; MacKay 1994), where correction terms of order 1/√γ are also given. 2.3 Free energy approximation Let us consider approximating the joint distribution of w and α given the data, P(w, α|D, H) =... |

2 | Hyperparameters: optimize, or integrate out? (submitted to Neural Computation) - MacKay - 1995 |
Citation Context: ...of the generalized maximum likelihood method, also known as the evidence framework (Gull 1989; MacKay 1992a), for controlling hyperparameters in linear and non-linear regression models (Wolpert 1993; MacKay 1994). In this section I demonstrate that, for linear models, a simple free energy minimization approximation reproduces the method of the evidence framework precisely. This demonstration further clarifie... |