## Computing Second Derivatives in Feed-Forward Networks: a Review (1994)

Venue: IEEE Transactions on Neural Networks

Citations: 27 (4 self)

### BibTeX

@ARTICLE{Buntine94computingsecond,
  author  = {Wray Buntine and Andreas S. Weigend},
  title   = {Computing Second Derivatives in Feed-Forward Networks: a Review},
  journal = {IEEE Transactions on Neural Networks},
  year    = {1994},
  volume  = {5},
  pages   = {480--488}
}

### Abstract

The calculation of second derivatives is required by recent training and analysis techniques for connectionist networks, such as the elimination of superfluous weights and the estimation of confidence intervals both for weights and network outputs. We here review and develop exact and approximate algorithms for calculating second derivatives. For networks with |w| weights, simply writing the full matrix of second derivatives requires O(|w|^2) operations. For networks of radial basis units or sigmoid units, exact calculation of the necessary intermediate terms requires of the order of 2h + 2 backward/forward-propagation passes, where h is the number of hidden units in the network. We also review and compare three approximations (ignoring some components of the second derivative, numerical differentiation, and scoring). Our algorithms apply to arbitrary activation functions, networks, and error functions (for instance, with connections that skip layers, or radial basis functions, or ...
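
The O(|w|^2) cost of the full matrix can be illustrated by the numerical-differentiation approximation the abstract mentions: differencing the gradient once per weight. The sketch below is a minimal illustration on a hypothetical toy quadratic error, not the paper's network-specific algorithm; `grad`, `A`, and the weight vector are assumptions for demonstration.

```python
import numpy as np

def finite_difference_hessian(grad, w, eps=1e-5):
    """Approximate the Hessian of an error function by central differences
    of its gradient: one column per weight, so O(|w|) gradient evaluations
    and O(|w|^2) entries to store."""
    n = w.size
    H = np.zeros((n, n))
    for i in range(n):
        step = np.zeros(n)
        step[i] = eps
        H[:, i] = (grad(w + step) - grad(w - step)) / (2 * eps)
    return 0.5 * (H + H.T)  # symmetrize to suppress rounding noise

# Hypothetical toy error E(w) = 0.5 w^T A w, whose exact Hessian is A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda w: A @ w
H = finite_difference_hessian(grad, np.array([0.4, -0.7]))
```

For a quadratic error the central difference is exact up to floating-point error, which makes the toy case a convenient sanity check.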

### Citations

2723 | Learning internal representations by error propagation - Rumelhart, Hinton, et al. - 1986

Citation Context: ...In the last decade, error backpropagation (Rumelhart et al. [RHW86]) has emerged as the most popular training method for connectionist neural networks. Several of the more recent variations of backpropagation require second order derivatives in addition to the first ...

1869 | Numerical Recipes in C: The Art of Scientific Computing - Press, Teukolsky, et al. - 1992 |

1363 | Generalized linear models - McCullagh, Nelder - 1990

Citation Context: ...for the probability density function p(z|θ) parameterized by a vector of parameters θ. This approximation is used in "Fisher's scoring method" for maximum likelihood training [MN89]. It is best used during search when fast estimates of second derivatives are required (see, for instance, [PFTV86, Section 14.4]). The approximation would be misleading when good approximations for e...
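
The scoring approximation referred to in this context replaces the Hessian by a sum of outer products of per-pattern gradients, dropping the term that involves second derivatives of the network outputs. A minimal sketch, assuming hypothetical per-pattern gradient rows `G` for a three-weight network:

```python
import numpy as np

def scoring_approximation(per_pattern_grads):
    """Scoring (outer-product) approximation to the Hessian:
    sum over patterns p of g_p g_p^T, where g_p is the gradient of the
    error on pattern p. Always symmetric and positive semi-definite."""
    G = np.asarray(per_pattern_grads)  # shape: (patterns, weights)
    return G.T @ G

# Hypothetical per-pattern gradients (two patterns, three weights)
G = np.array([[0.2, -0.1, 0.4],
              [0.5,  0.3, -0.2]])
H_approx = scoring_approximation(G)
```

The positive semi-definiteness is what makes this estimate convenient during search but, as the excerpt notes, potentially misleading when accurate second derivatives are needed.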

419 | Optimal brain damage - LeCun, Denker, et al. - 1990

Citation Context: ...iteration, the efficiency of their computation is less crucial here than in the previous two cases. For example, Le Cun, Denker and Solla [LDS90] ("Optimal Brain Damage") use the Hessian of the error (calculated with the Becker and Le Cun approximation referred to above) to simplify the network by pruning weights in order to achieve good gene...

399 | A practical Bayesian framework for backpropagation networks - MacKay - 1992

Citation Context: ...of the precision of the network outputs (i.e., confidence intervals or error bars), as well as the comparison of different networks trained on the same data, see Buntine and Weigend [BW91] and MacKay [Mac92]. The network pruning strategy given by Buntine and Weigend [BW91] has the advantage over Hassibi and Stork in that it does not require calculation of the inverse of the Hessian. MacKay [Mac92] uses n...

339 | Increased rates of convergence through learning rate adaptation - Jacobs - 1988

Citation Context: ...Within learning: Analysis. It is well known that the speed of training in least mean square algorithms is related to the ratio of the largest to the smallest eigenvalues. A good description is given by Jacobs [Jac88], and further analysis is presented by Le Cun, Kanter and Solla [LKS91]. This ratio is called the condition number and is also associated with the accuracy to which the minimum can be calculated. The ...

252 | Fast-learning variations on back-propagation: An empirical study - Fahlman - 1989

Citation Context: ...In Section 4.1 we review several approximations: Becker and Le Cun [BL88] suggest a simple diagonal approximation to the Hessian. El-Jaroudi and Makhoul [EJM90] make a block matrix approximation. Fahlman [Fah88] uses in the Quickprop algorithm a simple diagonal approximation and also uses numerical differentiation from the error derivatives of the previous cycle to approximate the diagonal terms of the Hessi...
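
The Quickprop-style diagonal estimate mentioned here can be sketched as a secant: difference the gradient between two successive weight vectors, componentwise, ignoring all off-diagonal terms. The toy error below is an assumption for illustration, not Fahlman's original setup:

```python
import numpy as np

def diagonal_secant_estimate(w_prev, g_prev, w_curr, g_curr):
    """Diagonal Hessian estimate from two successive (weights, gradient)
    pairs: one secant slope per weight; cross-terms are ignored."""
    return (g_curr - g_prev) / (w_curr - w_prev)

# Toy separable error E(w) = 0.5 * sum_i a_i w_i^2, with Hessian diag a
a = np.array([2.0, 5.0])
grad = lambda w: a * w
w0, w1 = np.array([1.0, -1.0]), np.array([0.9, -0.8])
d = diagonal_secant_estimate(w0, grad(w0), w1, grad(w1))
```

For a separable quadratic the secant recovers the diagonal exactly, which shows why the approximation is cheap: it reuses gradients that training computes anyway.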

232 | Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition. Neurocomputing: Algorithms, Architectures and Applications - Bridle - 1989

Citation Context: ...with an offset (or bias or "0-weight") w_{n,0} which can be thought of as the weight for a constant activation u_0 = 1. Seemingly an exception to these two classes are so-called "Softmax" units (Bridle, [Bri89]). They are a convenient choice for 1-of-N classification tasks: The constraints on probabilities to be non-negative and add up to one are modeled by using exponential units that are normalized, a...
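
The normalized exponential units described in this excerpt are what is now the standard softmax; a minimal sketch (the max-shift is a common numerical-stability convention, not part of the quoted text):

```python
import numpy as np

def softmax(z):
    """Softmax unit: exponentials normalized to sum to one, so the
    outputs are non-negative and form a valid 1-of-N distribution."""
    e = np.exp(z - np.max(z))  # shift inputs for numerical stability
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
```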

188 | Predicting the Future: A Connectionist Approach - Weigend, Huberman, et al. - 1990

Citation Context: ...differentiation to calculate the Hessian of the error. An approximation of the Minimum Description Length principle that does not require calculation of the second derivatives is given by Weigend et al. [WHR90] ("Weight Elimination"); see also Barron and Barron [BB88]. • After learning: Network analysis. One of the striking differences between connectionist modeling and traditional statistics is the larg...

170 | Second order derivatives for network pruning: Optimal Brain Surgeon - Stork, Hassibi - 1993

Citation Context: ...of the error (calculated with the Becker and Le Cun approximation referred to above) to simplify the network by pruning weights in order to achieve good generalization performance. Hassibi and Stork [HS93] ("Optimal Brain Surgeon") apply the Sherman-Morrison-Woodbury formula for iterative matrix inversion [GV89, PFTV86] to find the inverse of an approximate Hessian. In Section 4.2 of this paper, we sug...
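
The rank-one case of the Sherman-Morrison-Woodbury formula cited here updates an existing inverse in O(n^2) rather than re-inverting in O(n^3), which is what makes building up the inverse of an approximate Hessian one outer product at a time feasible. A minimal sketch with a hypothetical toy matrix:

```python
import numpy as np

def sherman_morrison_update(H_inv, u, v):
    """Given H^{-1}, return (H + u v^T)^{-1} via the Sherman-Morrison
    formula: H^{-1} - H^{-1} u v^T H^{-1} / (1 + v^T H^{-1} u)."""
    Hu = H_inv @ u
    vH = v @ H_inv
    return H_inv - np.outer(Hu, vH) / (1.0 + v @ Hu)

# Toy check: update the inverse of H after a rank-one outer product
H = np.array([[4.0, 1.0], [1.0, 3.0]])
u = np.array([0.5, -0.2])
updated = sherman_morrison_update(np.linalg.inv(H), u, u)
```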

169 | The Effective Number of Parameters: An Analysis of Generalization and Regularization in Nonlinear Learning Systems - Moody - 1992

Citation Context: ...statistics is the larger number of parameters in neural networks. However, a key feature is that the potential number of parameters is often much larger than the effective number of parameters. Moody [Moo92] uses the Hessian of the error in a regularization framework to estimate the effective number of parameters of the network. Whereas Moody only considers the effective number of parameters at the end o...

123 | Bayesian back-propagation - Buntine, Weigend - 1991

Citation Context: ...of the network and of the precision of the network outputs (i.e., confidence intervals or error bars), as well as the comparison of different networks trained on the same data, see Buntine and Weigend [BW91] and MacKay [Mac92]. The network pruning strategy given by Buntine and Weigend [BW91] has the advantage over Hassibi and Stork in that it does not require calculation of the inverse of the Hessian. Ma...

97 | Improving the Convergence of Back-Propagation Learning with Second Order Methods - Becker, LeCun - 1988

Citation Context: ...update, the speed of the computation is here crucial. For large networks, calculation of the full Hessian was considered prohibitive. In Section 4.1 we review several approximations: Becker and Le Cun [BL88] suggest a simple diagonal approximation to the Hessian. El-Jaroudi and Makhoul [EJM90] make a block matrix approximation. Fahlman [Fah88] uses in the Quickprop algorithm a simple diagonal approximati...

93 | Backpropagation: The basic theory - Rumelhart, Durbin, et al. - 1993

Citation Context: ...pattern depends on the outputs and targets for that pattern; typical error functions are mean squared error and cross-entropy. Details are given by Buntine and Weigend [BW91] and by Rumelhart et al. [RDGC93]. For many uses mentioned in the introduction we are interested in second derivatives for the sum (or average) error over the entire training set (the so-called "batch mode"), or at least a reasonable...

69 | Fast exact multiplication by the Hessian - Pearlmutter - 1994

Citation Context: ...y large networks). Here second derivatives are used during line search, so the full Hessian is not required, just the product of the Hessian and a given vector. Møller [Mo93b, Mo93a] and Pearlmutter [Pea93] have both independently suggested numerical differentiation and exact calculation to compute this product. The calculation can also be used iteratively in the power method [GV89] to efficiently appro...
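
The numerical-differentiation route to this Hessian-vector product is a directional difference of the gradient: two extra gradient evaluations give Hv without ever forming the full matrix. A minimal sketch on a hypothetical toy quadratic (Pearlmutter's [Pea93] exact R-operator method is not shown here):

```python
import numpy as np

def hessian_vector_product(grad, w, v, eps=1e-5):
    """Approximate H v as (grad(w + eps*v) - grad(w - eps*v)) / (2*eps):
    a central difference of the gradient along direction v, costing only
    two gradient evaluations regardless of the number of weights."""
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

# Toy quadratic error whose Hessian is the constant matrix A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda w: A @ w
v = np.array([1.0, -1.0])
Hv = hessian_vector_product(grad, np.zeros(2), v)
```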

49 | Statistical Learning Networks: A Unifying View - Barron, Barron - 1988

Citation Context: ...approximation of the Minimum Description Length principle that does not require calculation of the second derivatives is given by Weigend et al. [WHR90] ("Weight Elimination"); see also Barron and Barron [BB88]. • After learning: Network analysis. One of the striking differences between connectionist modeling and traditional statistics is the larger number of parameters in neural networks. However, a key ...

47 | Exact calculation of the Hessian matrix for the multilayer perceptron - Bishop - 1992

Citation Context: ...Instead we use the notion of derivatives local to a unit and its immediate inputs/outputs, and derivatives global to the network, since these correspond to the key facets of the computation. Bishop [Bis92] avoided this complication by restricting results to sigmoidal units with error a simple sum and with no connections skipping layers. The goal of this paper is to present exact and efficient methods f...

47 | Supervised learning of probability distributions by neural networks - Baum, Wilczek - 1988 |

44 | A scaled conjugate gradient algorithm for fast supervised learning - Møller - 1993 |

24 | Second order properties of error surfaces - LeCun, Kanter, et al. - 1991

Citation Context: ...mean square algorithms is related to the ratio of the largest to the smallest eigenvalues. A good description is given by Jacobs [Jac88], and further analysis is presented by Le Cun, Kanter and Solla [LKS91]. This ratio is called the condition number and is also associated with the accuracy to which the minimum can be calculated. The condition number can be approximated by approximating the largest and s...

12 | Generalization through Minimal Networks with Application to Forecasting (Interface '91) - Weigend, Rumelhart - 1991

Citation Context: ...tive number of parameters of the network. Whereas Moody only considers the effective number of parameters at the end of the training process, when the error has reached a minimum, Weigend and Rumelhart [WR91] analyze the effective network size during training (via the dimension of the space spanned by the activations of the hidden units). The gradual increase in network size with training time provides a ...

11 | A new error criterion for posterior probability estimation with neural nets - El-Jaroudi, Makhoul - 1990

Citation Context: ...of the full Hessian was considered prohibitive. In Section 4.1 we review several approximations: Becker and Le Cun [BL88] suggest a simple diagonal approximation to the Hessian. El-Jaroudi and Makhoul [EJM90] make a block matrix approximation. Fahlman [Fah88] uses in the Quickprop algorithm a simple diagonal approximation and also uses numerical differentiation from the error derivatives of the previous c...

7 | Exact calculation of the product of the Hessian matrix of feed-forward network error functions and a vector in O(N) time - Møller - 1993 |

6 | Automatic learning rate maximization in large adaptive machines - LeCun, Simard, et al. - 1993

Citation Context: ...can also be used iteratively in the power method [GV89] to efficiently approximate the principal eigenvectors of the Hessian. The principal eigenvectors are used by Le Cun, Simard and Pearlmutter [LSP93] to speed up gradient descent. • Within learning: Analysis. It is well known that the speed of training in least mean square algorithms is related to the ratio of the largest to the smallest eigenva...
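
The power method referred to in this context needs only matrix-vector products, so it composes directly with a Hessian-vector product routine: the Hessian itself is never formed. A minimal sketch with a hypothetical explicit matrix standing in for the Hessian:

```python
import numpy as np

def principal_eigenpair(matvec, n, iters=200, seed=0):
    """Power method: repeatedly apply H to a random unit vector and
    renormalize; the iterate converges to the eigenvector of largest
    absolute eigenvalue, and the Rayleigh quotient gives the eigenvalue."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = matvec(v)
        v = w / np.linalg.norm(w)
    return v, v @ matvec(v)  # eigenvector, Rayleigh-quotient eigenvalue

# Toy stand-in for the Hessian; in practice matvec would be an Hv routine
H = np.array([[3.0, 1.0], [1.0, 2.0]])
vec, lam = principal_eigenpair(lambda v: H @ v, 2)
```

Running the same iteration on the inverse (or on a shifted matrix) estimates the smallest eigenvalue, which together with the largest gives the condition number discussed in the excerpt.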

6 | Neural networks, system identification, and control in the chemical industries - Werbos - 1992

Citation Context: ...rule for differentiation? Yes, but there are many ways, with varying degrees of efficiency and accuracy. To handle the complication of second derivatives over a complex network, Werbos, McAvoy and Su [WMS92] introduced the notion of the ordered derivative. Instead we use the notion of derivatives local to a unit and its immediate inputs/outputs, and derivatives global to the network, since these correspo...