## Ace of Bayes: Application of Neural Networks with Pruning (1993)

Venue: The Danish Meat Research Institute, Maglegaardsvej 2, DK-4000

Citations: 14 (0 self)

### BibTeX

@TECHREPORT{Thodberg93aceof,
    author = {Hans Henrik Thodberg},
    title = {Ace of Bayes: Application of Neural Networks with Pruning},
    institution = {The Danish Meat Research Institute, Maglegaardsvej 2, DK-4000},
    year = {1993}
}

### Abstract

MacKay's Bayesian framework for backpropagation is a practical and powerful means of improving the generalisation ability of neural networks. The framework is reviewed and extended in a pedagogical way. The notation is simplified using the ordinary weight decay parameter, and the noise parameter β is shown to be nothing more than an overall scale. A detailed and explicit procedure for adjusting several weight decay parameters is given. Pruning is incorporated into the Bayesian framework. Appropriate symmetry factors on sparse architectures are deduced. Bayesian weight decay is demonstrated using artificial data generated by a sparsely connected network. Pruning yields computational advantages: by removing unimportant weights the posterior weight distribution becomes Gaussian, and pruning removes zero-modes of the Hessian and redundant hidden units. In addition, pruning improves generalisation. The Bayesian evidence is used as a stop criterion for pruning. Bayesian backprop is applied ...
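
The remark that the noise parameter is only an overall scale can be illustrated with a short sketch. It assumes the objective as it is usually written in MacKay's papers, with data error $E_D$, weight error $E_W$, noise parameter $\beta$ and regulariser $\alpha$; the symbol $\lambda$ for their ratio is introduced here for illustration, and the report's own simplified notation is not visible in this abstract.

```latex
M(w) \;=\; \beta\,E_D(w) + \alpha\,E_W(w)
     \;=\; \beta\,\bigl[\,E_D(w) + \lambda\,E_W(w)\,\bigr],
\qquad \lambda = \frac{\alpha}{\beta}
```

Because $\beta > 0$ multiplies the whole bracket, the weights that minimise $M$ depend only on the ratio $\lambda$, which plays the role of an ordinary weight decay parameter.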

### Citations

1320 | Statistical Decision Theory and Bayesian Analysis. Second Edition
- Berger
- 1985
Citation Context ...nd the approach is both theoretical and applied. However, the maximum entropy principle plays a minor if any role in Bayesian backprop. Bayesian backprop is based on the Bayesian school of statistics [6, 7]. This is distinct from the main-stream "sampling theory" statistics, where the concept of probability must be attached to frequencies of samples drawn from a distribution. In contrast, Bayesians also...

560 | Bayesian interpolation
- MacKay
- 1992
Citation Context ...ing to the evidence. The evidence can also be used as a stop criterion for network growing or pruning. 1.2 The Proper Treatment of Bayesian Backprop Bayesian backprop was introduced by MacKay in 1991 [1, 2, 3, 4] as a radically different approach to the problem of overfitting and model comparison. The Bayesian framework for backprop has its origin in the field of Maximum Entropy [5] which develops better mode...

421 | A Practical Bayesian Framework for Backpropagation Networks
- MacKay
- 1992
Citation Context ...ing to the evidence. The evidence can also be used as a stop criterion for network growing or pruning. 1.2 The Proper Treatment of Bayesian Backprop Bayesian backprop was introduced by MacKay in 1991 [1, 2, 3, 4] as a radically different approach to the problem of overfitting and model comparison. The Bayesian framework for backprop has its origin in the field of Maximum Entropy [5] which develops better mode...

341 | Information-based objective functions for active data selection
- MacKay
- 1992
Citation Context ...ing to the evidence. The evidence can also be used as a stop criterion for network growing or pruning. 1.2 The Proper Treatment of Bayesian Backprop Bayesian backprop was introduced by MacKay in 1991 [1, 2, 3, 4] as a radically different approach to the problem of overfitting and model comparison. The Bayesian framework for backprop has its origin in the field of Maximum Entropy [5] which develops better mode...

180 | Second order derivatives for network pruning: Optimal brain surgeon
- Hassibi, Stork
- 1993
Citation Context ...ements the optimal trade-off between data fit and model simplicity. There are several examples in the literature demonstrating that sparse networks can generalise better than fully connected networks [11, 12, 13, 14, 15, 16]. 3.1 The Prior for Sparsely Connected Networks There is a problem with the factor SH in the Ockham Factor for the weights (24). To see this, consider a fully connected network H. Assume that ED is in...

174 | The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems
- Moody
- 1992
Citation Context ...rror of Prediction) on test samples grouped according to the predicted error bars (see figure 11). The observed SEPs are given with standard errors computed from χ²-statistics. 6 Moody's GPE Moody [23] proposed an estimator of the generalisation error for neural networks. It is a generalisation of Akaike's Final Prediction Error and is called the Generalised Prediction Error. GPE predicts the test ...

165 | The evidence framework applied to classification networks
- MacKay
- 1992

159 | Bayesian Methods for Adaptive Models
- MacKay
- 1992
Citation Context ...pends on the output or input. A possible solution is to transform the output variable or to introduce a noise level which depends on the output. A treatment of input dependent noise level is given in [25]. 7.3 Testing for Non-linearities Neural networks are well suited to test whether a data set defines a linear or a nonlinear regression. Linear models and neural networks are trained and the evidences...

51 | Exact calculation of the Hessian matrix for the multilayer perceptron
- Bishop
- 1992
Citation Context ...etermined parameters in group g [...] ω_g = γ_g / k_g. The Hessian can be evaluated analytically using an extension to backprop developed by Bishop [21]. This involves some elaborate programming but the Hessian is evaluated in just h epochs. The equations for [...] and γ_g are coupled and it may seem natural to solve these iteratively, but this t...

21 | On the use of evidence in neural networks
- Wolpert
- 1993
Citation Context ...ay's papers display a good balance between these two aspects. Other workers in the field have dealt exclusively with the second aspect and presented alternative Bayesian approaches to neural networks [8, 9]. Wolpert [9] argues on purely theoretical grounds that his alternative treatment of the weight priors is more correct. However, he presents no simulations, so irrespective of the correctness of his t...

21 | Bayesian model comparison and backprop nets
- DHAENE, MacKay
Citation Context ...timator but is a quality measure based on metaprinciples. Performance. For the spectroscopic data GPE is a poorer selector of good networks than is the evidence. However, for the robot arm problem in [24] and the artificial data of section 5.3 the GPE is as good as the evidence. This suggests that the GPE performs no better than the evidence in model comparison. Properties. GPE has a simpler structure...

18 | Improving generalization of neural networks through pruning
- Thodberg
- 1991
Citation Context ...ements the optimal trade-off between data fit and model simplicity. There are several examples in the literature demonstrating that sparse networks can generalise better than fully connected networks [11, 12, 13, 14, 15, 16]. 3.1 The Prior for Sparsely Connected Networks There is a problem with the factor SH in the Ockham Factor for the weights (24). To see this, consider a fully connected network H. Assume that ED is in...

5 | "Predicting the Future: A Connectionist Approach", Int
- Weigend, Huberman, Rumelhart
- 1990
Citation Context ...ements the optimal trade-off between data fit and model simplicity. There are several examples in the literature demonstrating that sparse networks can generalise better than fully connected networks [11, 12, 13, 14, 15, 16]. 3.1 The Prior for Sparsely Connected Networks There is a problem with the factor SH in the Ockham Factor for the weights (24). To see this, consider a fully connected network H. Assume that ED is in...

4 | Neural Networks and the Bias/Variance Dilemma
- Geman, Bienenstock, Doursat
- 1992
Citation Context ...its a lot from pruning. On the other hand, pruning benefits from the Bayesian method which provides the evidence as a stop criterion for pruning. 3.4 Alleviating the Bias/Variance Dilemma Geman et al [17] introduced the Bias/Variance Dilemma for neural networks. They argue as follows: Neural networks are nearly un-biased, they can realise any mapping, but the price is a large number of parameters. Wi...

3 | Neural Network Ensembles
- Hansen, Salamon
- 1990
Citation Context ...s how this uncertainty can be estimated. The committee gives two advantages. • Predictions using the average of the committee gives a better generalisation than the average network in the committee [18]. • The degree of dissent within the committee contributes to the uncertainty of the predictions. By including this we get more reliable error bars on the predictions. This is treated in section 4.4...

2 | "Bayesian Learning via Stochastic Dynamics", Neural Information Processing Systems, Vol. 5, ed. C.L. Giles, S.J. Hanson and J.D. Cowan
- Neal
- 1993
Citation Context ... predicted quantity at the mode (the maximum) is not always a good approximation to the integral over the posterior distribution. To overcome these problems Neal has developed a Monte Carlo technique [19]. • The evidence as a quality measure could reflect a mixture of virtues, of which the generalisation error is just one. Other virtues could be correct architecture, explanatory power and other conc...

2 | "Optimal Minimal Neural Interpretation of Spectra", Analytical Chemistry 64
- Borggaard, Thodberg
- 1992
Citation Context ...to a real-life application from the meat industry. The data were recorded by a Tecator near-infrared spectrometer which measured the spectrum of light transmitted through a sample of minced pork meat [22]. The spectrum gives the absorbance at 100 wavelengths in the region 850-1050 nm. We want to calibrate the spectrometer to determine the fat content from the spectrum. The target values of the fat con...

1 | Simplifying Neural Networks by Soft Weight-Sharing
- Nowlan, Hinton
- 1992

1 | paper on Bayesian hyperparameters
- MacKay
- 1993
Citation Context ...ternative treatment of the weight priors is more correct. However, he presents no simulations, so irrespective of the correctness of his theory, it is unclear what kind of problems it is suitable for [20]. There is no doubt that MacKay's aim was that the Bayesian techniques should be used in real-life applications. However, this has not yet happened in the case of neural networks. This paper is intend...
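
The two committee advantages quoted in the citation context for Neural Network Ensembles [18] can be illustrated with a minimal sketch. The formula the report uses in its section 4.4 is not shown in the snippet, so the combination below (mean member noise variance plus the variance of the member predictions) is an assumption, and the function names `committee_predict` and `squared_errors` are hypothetical.

```python
import numpy as np


def committee_predict(member_preds, member_vars):
    """Average a committee and widen the error bars by its dissent.

    member_preds, member_vars: arrays of shape (members, samples).
    Assumption: equal-weight committee; total variance is the mean of the
    members' own predictive variances plus the spread between members.
    """
    preds = np.asarray(member_preds, dtype=float)
    noise = np.asarray(member_vars, dtype=float)
    mean = preds.mean(axis=0)        # committee prediction
    dissent = preds.var(axis=0)      # disagreement between members
    error_bar = np.sqrt(noise.mean(axis=0) + dissent)
    return mean, error_bar


def squared_errors(member_preds, targets):
    """Return (committee error, average error of the individual members)."""
    preds = np.asarray(member_preds, dtype=float)
    t = np.asarray(targets, dtype=float)
    committee_err = float(np.mean((preds.mean(axis=0) - t) ** 2))
    avg_member_err = float(np.mean((preds - t) ** 2))
    return committee_err, avg_member_err


if __name__ == "__main__":
    preds = [[1.0, 2.0], [3.0, 2.0]]      # two members, two test samples
    noise = [[0.25, 0.25], [0.25, 0.25]]  # each member's own variance
    print(committee_predict(preds, noise))
    print(squared_errors(preds, [2.0, 2.0]))
```

For squared error, the committee error can never exceed the average member error (a consequence of Jensen's inequality), which matches the first advantage. The second sample, where the members agree, keeps the narrower error bar.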