## Bayesian Regularisation and Pruning using a Laplace Prior (1994)

Venue: Neural Computation

Citations: 19 (0 self)

### BibTeX

@ARTICLE{Williams94bayesianregularisation,

author = {Peter M. Williams},

title = {Bayesian Regularisation and Pruning using a Laplace Prior},

journal = {Neural Computation},

year = {1994},

volume = {7},

pages = {117--143}

}


### Abstract

Standard techniques for improved generalisation from neural networks include weight decay and pruning. Weight decay has a Bayesian interpretation with the decay function corresponding to a prior over weights. The method of transformation groups and maximum entropy indicates a Laplace rather than a Gaussian prior. After training, the weights then arrange themselves into two classes: (1) those with a common sensitivity to the data error, and (2) those failing to achieve this sensitivity and which therefore vanish. Since the critical value is determined adaptively during training, pruning---in the sense of setting weights to exact zeros---becomes a consequence of regularisation alone. The count of free parameters is also reduced automatically as weights are pruned. A comparison is made with results of MacKay using the evidence framework and a Gaussian regulariser.

1 Introduction. Neural networks designed for regression or classification need to be trained using some form of stabilisation or re...
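The two weight classes described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: a linear model stands in for a neural network, the regularising constant `alpha` is fixed by hand (the paper determines the critical value adaptively during training), and a proximal-gradient loop (ISTA) is used as the solver for brevity. All function and variable names are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each component of v towards zero by t, clipping at exactly zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_laplace_map(X, t, alpha, n_iter=5000):
    """Minimise E_D(w) + alpha * sum_j |w_j| with E_D = 0.5 * ||X w - t||^2.

    The L1 penalty is the negative log of a Laplace prior (up to a constant).
    ISTA is an assumed solver here, not the optimiser used in the paper.
    """
    w = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of grad E_D
    for _ in range(n_iter):
        grad = X.T @ (X @ w - t)            # gradient of the data misfit E_D
        w = soft_threshold(w - step * grad, step * alpha)
    return w

# Hypothetical toy data: two relevant inputs, six irrelevant ones.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
true_w = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
t = X @ true_w + 0.1 * rng.normal(size=50)

alpha = 5.0
w = fit_laplace_map(X, t, alpha)

# Sensitivity of the data error to each weight: |dE_D / dw_j|.
sensitivity = np.abs(X.T @ (X @ w - t))
print("weights:    ", np.round(w, 4))
print("sensitivity:", np.round(sensitivity, 4))
```

At a minimum of this penalised cost, every nonzero weight has sensitivity equal to `alpha`, while every weight whose sensitivity stays below `alpha` is held at exactly zero. That is the split into surviving and pruned weights, shown here for a fixed rather than adaptive `alpha`.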

### Citations

1515 | Practical Optimization - Gill, Murray, et al. - 1981 |

1199 | Practical Methods of Optimization - Fletcher - 1981 |

Citation Context: ...[12]. Alternatively $f''(0)$ can be calculated by differencing first derivatives. Levenberg-Marquardt methods can be used in case $f''(0)$ is sometimes negative or if the quadratic assumption is poor [4, 24, 13]. [...] that $\lambda^*$ is determined in this way. All that is required is an iterative procedure that moves at each step some distance along a search direction $s$ from $w$ to $w + \lambda s$, together with some preferred wa...

927 | Solutions of Ill-Posed Problems - Tikhonov, Arsenin - 1977 |

Citation Context: ... would be to assume each $|w_j|$ has a log-normal distribution or a mixture of a log-normal and an exponential distribution, compare [16]. For an approach to formal stabilisation, more in the style of [21], see Bishop [1]. The weight prior in (2) depends on $\alpha$ and can be written $P(w \mid \alpha) = Z_W(\alpha)^{-1} \exp(-\alpha E_W)$ (10), where $\alpha$ is now considered as a nuisance parameter. If a prior $P(\alpha)$ is assu...

574 | Bayesian interpolation - MacKay - 1992 |

Citation Context: ...ference for a particular type of model. From a Bayesian point of view the regulariser corresponds to a prior probability distribution over free parameters $w$ of the model. Using the notation of MacKay [9, 10] the regularised cost function can be written as $M(w) = \beta E_D(w) + \alpha E_W(w)$ (1), where $E_D$ measures the data misfit, $E_W$ is the penalty term and $\alpha, \beta > 0$ are regularising parameters determining a bal...

443 | Optimal brain damage - Le Cun, Denker, et al. - 1989 |

Citation Context: ... stabilisation is exemplified in polynomial curve fitting by explicitly limiting the degree of the polynomial. Examples relating to neural networks are found in the pruning algorithms of le Cun et al [8] and Hassibi & Stork [6]. These use second-order information to determine which weight can be eliminated next at the cost of minimum increase in data misfit. They do not by themselves, however, give a...

427 | A practical Bayesian framework for backpropagation networks - MacKay - 1992 |

Citation Context: ...ference for a particular type of model. From a Bayesian point of view the regulariser corresponds to a prior probability distribution over free parameters $w$ of the model. Using the notation of MacKay [9, 10] the regularised cost function can be written as $M(w) = \beta E_D(w) + \alpha E_W(w)$ (1), where $E_D$ measures the data misfit, $E_W$ is the penalty term and $\alpha, \beta > 0$ are regularising parameters determining a bal...

182 | order derivatives for network pruning: Optimal Brain Surgeon - Hassibi, Stork - 1992 |

Citation Context: ...ified in polynomial curve fitting by explicitly limiting the degree of the polynomial. Examples relating to neural networks are found in the pruning algorithms of le Cun et al [8] and Hassibi & Stork [6]. These use second-order information to determine which weight can be eliminated next at the cost of minimum increase in data misfit. They do not by themselves, however, give a criterion for when to s...

145 | Generalization by weight-elimination with application to forecasting - Weigend, Rumelhart, et al. - 1991 |

Citation Context: ...e real numbers, but the same ideas can be applied to classification networks where the targets are exclusive class labels. (Footnote 3: A penalty function $w^2/(1 + w^2)$, similar to $\log(1 + w^2)$, is the basis of [23].) [...] that constraining the mean of the signed weights to be zero is not an adequate expression of the intrinsic symmetry in the signs of the weights. A zero mean distribution need not be symmetric and a ...

127 | Bayesian back-propagation - Buntine, Weigend - 1991 |

Citation Context: ...iation $\sigma$, the likelihood of the data is $\prod_{p=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2}\left(\frac{y_p - t_p}{\sigma}\right)^2\right)$, which implies that $E_D = \frac{1}{2}\sum_{p=1}^{N} (y_p - t_p)^2$ (3) according to (2) with $\beta = 1/\sigma^2$ and $Z_D = (2\pi/\beta)^{N/2}$. (Footnote 1: The notation is somewhat schematic. See [2, 9, 15] for more explicit notations.) As $\alpha \to 0$ we have the improper uniform prior over $w$ so that $P(w \mid D) \propto P(D \mid w)$ and $M$ is proportional to $E_D$. This...

83 | Experiments on Learning by Back Propagation - Plaut, Nowlan |

Citation Context: ...l stabilisation. Formal stabilisation involves adding an extra term to the cost function that penalises more complex models. In the neural network literature this often takes the form of weight decay [18] using the penalty function $\sum_j w_j^2$, where summation is over components of the weight vector. Structural stabilisation is exemplified in polynomial curve fitting by explicitly limiting the degree of ...

79 | Large automatic learning, rule extraction and generalisation - Denker, Schwartz, et al. - 1987 |

Citation Context: ...ion or regularisation if they are to generalise well beyond the original training set. This means finding a balance between complexity of the network and information content of the data. Denker et al [3] distinguish formal and structural stabilisation. Formal stabilisation involves adding an extra term to the cost function that penalises more complex models. In the neural network literature this ofte...

65 | Rational Descriptions, Decisions and Designs - Tribus - 1969 |

48 | A Scaled Conjugate Gradient Algorithm for Fast - Møller - 1993 |

Citation Context: ...[12]. Alternatively $f''(0)$ can be calculated by differencing first derivatives. Levenberg-Marquardt methods can be used in case $f''(0)$ is sometimes negative or if the quadratic assumption is poor [4, 24, 13]. [...] that $\lambda^*$ is determined in this way. All that is required is an iterative procedure that moves at each step some distance along a search direction $s$ from $w$ to $w + \lambda s$, together with some preferred wa...

40 | Bayesian training of backpropagation networks by the Hybrid Monte Carlo method - Neal - 1992 |

Citation Context: ... (Footnote 5: This means assuming that $\log \alpha$ is uniformly distributed or equivalently that $\log \alpha$ [...] is uniformly distributed for any $s > 0$ and $|[...]| > 0$. The same results can be obtained as the limit of a Gamma prior [14, 26].) (Footnote 6: The $\frac{1}{2}$ comes from the fact that $E_D$ is measured in squared units. Assuming Laplacian noise this term becomes $N \log E_D$ with $E_D = \sum_p |y_p - t_p|$.) 5 Priors, regularisation classes and initialisa...

29 | Bayesian learning via stochastic dynamics - Neal - 1993 |

Citation Context: ...iation $\sigma$, the likelihood of the data is $\prod_{p=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2}\left(\frac{y_p - t_p}{\sigma}\right)^2\right)$, which implies that $E_D = \frac{1}{2}\sum_{p=1}^{N} (y_p - t_p)^2$ (3) according to (2) with $\beta = 1/\sigma^2$ and $Z_D = (2\pi/\beta)^{N/2}$. (Footnote 1: The notation is somewhat schematic. See [2, 9, 15] for more explicit notations.) As $\alpha \to 0$ we have the improper uniform prior over $w$ so that $P(w \mid D) \propto P(D \mid w)$ and $M$ is proportional to $E_D$. This...

21 | On the use of evidence in neural networks - Wolpert - 1993 |

Citation Context: ...ns the differences in results, when using these two factors with Laplace regularisation, are not sufficiently clear to decide the matter empirically and it needs to be settled on grounds of principle [11, 27]. In the present context, this paper prefers the method of integrating over hyperparameters for reasons of simplicity. Its main purpose, however, is to advocate the Laplace over the Gaussian regularis...

14 | Hyperparameters: optimise or integrate out - MacKay - 1993 |

Citation Context: ...ns the differences in results, when using these two factors with Laplace regularisation, are not sufficiently clear to decide the matter empirically and it needs to be settled on grounds of principle [11, 27]. In the present context, this paper prefers the method of integrating over hyperparameters for reasons of simplicity. Its main purpose, however, is to advocate the Laplace over the Gaussian regularis...

14 | A Marquardt Algorithm for Choosing the Step-size in Back-Propagation Learning with Conjugate Gradients - Williams - 1992 |

Citation Context: ...[12]. Alternatively $f''(0)$ can be calculated by differencing first derivatives. Levenberg-Marquardt methods can be used in case $f''(0)$ is sometimes negative or if the quadratic assumption is poor [4, 24, 13]. [...] that $\lambda^*$ is determined in this way. All that is required is an iterative procedure that moves at each step some distance along a search direction $s$ from $w$ to $w + \lambda s$, together with some preferred wa...

11 | Curvature-driven smoothing in backpropagation neural networks, in Theory and Applications of Neural Networks - Bishop - 1992 |

Citation Context: ...me each $|w_j|$ has a log-normal distribution or a mixture of a log-normal and an exponential distribution, compare [16]. For an approach to formal stabilisation, more in the style of [21], see Bishop [1]. The weight prior in (2) depends on $\alpha$ and can be written $P(w \mid \alpha) = Z_W(\alpha)^{-1} \exp(-\alpha E_W)$ (10), where $\alpha$ is now considered as a nuisance parameter. If a prior $P(\alpha)$ is assumed, $\alpha$ can be i...

11 | Adaptive soft weight tying using gaussian mixtures - Nowlan, Hinton - 1991 |

Citation Context: ... [2]. A comparison is made in the Appendix. (Footnote 4: A possible alternative would be to assume each $|w_j|$ has a log-normal distribution or a mixture of a log-normal and an exponential distribution, compare [16]. For an approach to formal stabilisation, more in the style of [21], see Bishop [1].) The weight prior in (2) depends on $\alpha$ and can be written $P(w \mid \alpha) = Z_W(\alpha)^{-1} \exp(-\alpha E_W)$ (10), where $\alpha$ ...

8 | Exact Calculation of the Product of the Hessian Matrix of Feed-Forward Network Error Functions and a Vector in O(N) Time - Møller - 1993 |

Citation Context: ...$L$ and $f''(0) = s \cdot \nabla\nabla L\, s$, where $\nabla\nabla L$ is the Hessian of $L$ (see footnote 10). The new weight vector is then $w + \lambda^* s$. It is not required, however, [...] Footnote 10: The matrix-vector product $\nabla\nabla L\, s$ can be calculated using [17] or [12]. Alternatively $f''(0)$ can be calculated by differencing first derivatives. Levenberg-Marquardt methods can be used in case $f''(0)$ is sometimes negative or if the quadratic assumption is poor [4, ...

2 | Prior probabilities - Jaynes - 1968 |

Citation Context: ...ce) (5), where $1/\alpha$ is the mean absolute value. Another possibility is the Cauchy distribution $E_W = (1/\alpha) \sum_{j=1}^{W} \log\left(1 + \alpha^2 w_j^2\right)$ (Cauchy) (6), where $1/\alpha$ is the median absolute value. 3 Jaynes [7] offers two principles---transformation groups and maximum entropy---for setting up probability distributions in the absence of frequency data. These can be applied as follows. For any feed-forward ne...

2 | Ace of Bayes: Application of neural networks with pruning - Thodberg - 1993 |

Citation Context: ...erical approximations need be made and the method can be applied exactly to small noisy data sets where the ratio of free parameters to data points may approach unity. Appendix. The evidence framework [9, 10, 20] proposes to set the regularising parameters $\alpha$ and $\beta$ by maximising $P(D) = \int P(D \mid w)\, P(w)\, dw$ (23), considered as a function of $\alpha$ and $\beta$. This quantity is interpreted as the evidence for the overall ...

2 | Aeromagnetic compensation using neural networks - Williams - 1993 |

Citation Context: ...e or with a strict reduction in $\sum_c W_c$. Since each $W_c$ is finite the compound process must terminate. 9 Examples. Examples of Laplace regularisation applied to problems in geophysics can be found in [25] and [26]. This section compares results obtained using the Laplace regulariser with those of MacKay [10] using the Gaussian regulariser and the evidence framework. The problem concerns a simple two j...

1 | Fast exact multiplication by the Hessian - Pearlmutter - 1994 |

Citation Context: ...$\cdot \nabla L$ and $f''(0) = s \cdot \nabla\nabla L\, s$, where $\nabla\nabla L$ is the Hessian of $L$ (see footnote 10). The new weight vector is then $w + \lambda^* s$. It is not required, however, [...] Footnote 10: The matrix-vector product $\nabla\nabla L\, s$ can be calculated using [17] or [12]. Alternatively $f''(0)$ can be calculated by differencing first derivatives. Levenberg-Marquardt methods can be used in case $f''(0)$ is sometimes negative or if the quadratic assumption is p...

1 | Improved generalization and network pruning using adaptive Laplace regularization - Williams - 1993 |

Citation Context: ... (Footnote 5: This means assuming that $\log \alpha$ is uniformly distributed or equivalently that $\log \alpha$ [...] is uniformly distributed for any $s > 0$ and $|[...]| > 0$. The same results can be obtained as the limit of a Gamma prior [14, 26].) (Footnote 6: The $\frac{1}{2}$ comes from the fact that $E_D$ is measured in squared units. Assuming Laplacian noise this term becomes $N \log E_D$ with $E_D = \sum_p |y_p - t_p|$.) 5 Priors, regularisation classes and initialisa...
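
Several of the citation contexts above describe a step along a search direction $s$ based on the first and second directional derivatives of the cost $L$, with the second derivative obtained by differencing first derivatives. The following is a rough sketch of that idea only. The minimiser of the quadratic model, taken here as minus the first derivative divided by the second, is the standard textbook result and is assumed rather than quoted from the paper. The names `quadratic_step`, `grad_fn` and `eps`, and the quadratic test cost, are hypothetical.

```python
import numpy as np

def quadratic_step(grad_fn, w, s, eps=1e-6):
    """Step length minimising a quadratic model of f(step) = L(w + step * s).

    f'(0)  = s . grad L(w)
    f''(0) = s . (Hessian of L) s, estimated by differencing first derivatives.
    """
    g0 = grad_fn(w)
    f1 = s @ g0
    f2 = s @ (grad_fn(w + eps * s) - g0) / eps
    if f2 <= 0:
        # The source mentions Levenberg-Marquardt methods for this case;
        # this sketch simply refuses to continue.
        raise ValueError("non-positive curvature along s")
    return -f1 / f2

# Hypothetical quadratic cost L(w) = 0.5 * w.A.w - b.w with gradient A w - b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

def cost(w):
    return 0.5 * w @ A @ w - b @ w

def grad(w):
    return A @ w - b

w0 = np.zeros(2)
s = -grad(w0)                     # steepest-descent direction, for illustration
step = quadratic_step(grad, w0, s)
w1 = w0 + step * s                # new weight vector after the step
print(step, cost(w0), cost(w1))
```

For a cost that is exactly quadratic, as in this toy case, the differenced curvature agrees with `s @ A @ s` up to rounding, so the step coincides with the exact line minimum. For a network cost the quadratic model is only an approximation, which is where the safeguards mentioned in the source would come in.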