Results 1 - 8 of 8
Fast Exact Multiplication by the Hessian
Neural Computation, 1994
Abstract

Cited by 74 (4 self)
Just storing the Hessian H (the matrix of second derivatives d^2 E/dw_i dw_j of the error E with respect to each pair of weights) of a large neural network is difficult. Since a common use of a large matrix like H is to compute its product with various vectors, we derive a technique that directly calculates Hv, where v is an arbitrary vector. This allows H to be treated as a generalized sparse matrix. To calculate Hv, we first define a differential operator R{f(w)} = (d/dr) f(w + rv)|_{r=0}, note that R{grad_w} = Hv and R{w} = v, and then apply R{} to the equations used to compute grad_w. The result is an exact and numerically stable procedure for computing Hv, which takes about as much computation, and is about as local, as a gradient evaluation. We then apply the technique to backpropagation networks, recurrent backpropagation, and stochastic Boltzmann Machines. Finally, we show that this technique can be used at the heart of many iterative techniques for computing various properties of H, obviating the need for direct methods.
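The R{} operator described in this abstract is straightforward to implement. The following sketch (our own illustration, not code from the paper) applies R{} to the forward and backward passes of a one-hidden-layer tanh network with squared error, and checks the resulting exact Hv against a central-difference approximation of the gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
nin, nhid, nout = 3, 4, 2
x = rng.standard_normal(nin)           # single input pattern (R{x} = 0)
t = rng.standard_normal(nout)          # target
W1, b1 = rng.standard_normal((nhid, nin)), rng.standard_normal(nhid)
W2, b2 = rng.standard_normal((nout, nhid)), rng.standard_normal(nout)

def pack(W1, b1, W2, b2):
    return np.concatenate([W1.ravel(), b1, W2.ravel(), b2])

def unpack(w):
    i = 0
    W1 = w[i:i + nhid * nin].reshape(nhid, nin); i += nhid * nin
    b1 = w[i:i + nhid]; i += nhid
    W2 = w[i:i + nout * nhid].reshape(nout, nhid); i += nout * nhid
    return W1, b1, W2, w[i:]

def gradient(w):
    """Ordinary backprop gradient of E = 0.5*||y - t||^2."""
    W1, b1, W2, b2 = unpack(w)
    h = np.tanh(W1 @ x + b1)
    dy = (W2 @ h + b2) - t
    da = (W2.T @ dy) * (1 - h**2)
    return pack(np.outer(da, x), da, np.outer(dy, h), dy)

def hessian_vector(w, v):
    """Exact H v: apply R{.} = (d/dr)(.)(w + r v)|_{r=0} to every quantity
    in the forward and backward passes, using R{w} = v and R{x} = 0."""
    W1, b1, W2, b2 = unpack(w)
    V1, c1, V2, c2 = unpack(v)
    h = np.tanh(W1 @ x + b1)
    dy = (W2 @ h + b2) - t
    # R{} of the forward pass
    Ra = V1 @ x + c1
    Rh = (1 - h**2) * Ra
    Rdy = V2 @ h + W2 @ Rh + c2          # R{y}; R{t} = 0, so R{dy} = R{y}
    # R{} of the backward pass (product rule on da = (W2.T dy)*(1 - h^2))
    Rda = (V2.T @ dy + W2.T @ Rdy) * (1 - h**2) - (W2.T @ dy) * 2 * h * Rh
    return pack(np.outer(Rda, x), Rda, np.outer(Rdy, h) + np.outer(dy, Rh), Rdy)

w = pack(W1, b1, W2, b2)
v = rng.standard_normal(w.size)
hv = hessian_vector(w, v)
eps = 1e-5                               # central-difference check of R{grad_w} = Hv
hv_fd = (gradient(w + eps * v) - gradient(w - eps * v)) / (2 * eps)
print(np.max(np.abs(hv - hv_fd)))        # small: finite-difference error only
```

Each R{} line is obtained by differentiating the corresponding forward or backward line in the direction v; the cost is a small constant multiple of one gradient evaluation, as the abstract claims.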
Efficient Training of FeedForward Neural Networks
, 1997
Abstract

Cited by 15 (0 self)
Table of contents (appendix A): A.2 Introduction; A.2.1 Motivation; A.3 Optimization strategy; A.4 The Backpropagation algorithm; A.5 Conjugate direction methods; A.5.1 Conjugate gradients; A.5.2 The CGL algorithm; A.5.3 The BFGS algorithm; A.6 The SCG algorithm; A.7 Test results; A.7.1 Comparison metric ...
Exact Calculation of the Product of the Hessian Matrix of FeedForward Network Error Functions and a Vector in O(N) Time
Abstract

Cited by 4 (0 self)
Several methods for training feedforward neural networks require second order information from the Hessian matrix of the error function. Although it is possible to calculate the Hessian matrix exactly, this is often not desirable because of the computation and memory requirements involved. Some learning techniques do, however, only need the Hessian matrix times a vector. This paper presents a method to calculate the Hessian matrix times a vector in O(N) time, where N is the number of variables in the network. This is of the same order as the calculation of the gradient of the error function. The usefulness of this algorithm is demonstrated by improvement of existing learning techniques.
1 Introduction
The second derivative information of the error function associated with feedforward neural networks forms an N × N matrix, which is usually referred to as the Hessian matrix. Second derivative information is needed in several learning algorithms, e.g., in some conjugate gradient a...
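One way an O(N) Hessian-vector product improves existing techniques is that iterative linear solvers need nothing more than a matvec. A minimal conjugate-gradient solver for the Newton system H d = -g (our own sketch, with a small dense SPD matrix standing in for the network's Hessian-times-vector routine):

```python
import numpy as np

def cg_solve(hv, b, iters=50, tol=1e-16):
    """Linear conjugate gradient for H d = b using only the product hv(v);
    H is never stored, so memory stays O(N): H acts as a generalized
    sparse matrix."""
    d = np.zeros_like(b)
    r = b.copy()                  # residual b - H d for d = 0
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Hp = hv(p)
        alpha = rs / (p @ Hp)
        d += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

# Toy check: a dense SPD matrix stands in for the network Hessian.
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
H = A @ A.T + 6 * np.eye(6)
g = rng.standard_normal(6)
d = cg_solve(lambda v: H @ v, -g)   # Newton step: solve H d = -g
print(np.allclose(H @ d, -g))       # → True
```

In a real network the lambda would be replaced by an O(N) Hessian-vector routine such as the one this paper derives.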
Performance Evaluation of Feedforward Networks Using Computational Methods
In Proceedings of NEURAP'95, 1996
"... . We will demonstrate that the performance evaluation of feedforward neural networks using computational methods may result in overly pessimistic estimates of the prediction error. In fact, they capture unwanted variability in the distribution of weights, introduced by local maxima in the likelihood ..."
Abstract

Cited by 1 (1 self)
We will demonstrate that the performance evaluation of feedforward neural networks using computational methods may result in overly pessimistic estimates of the prediction error. In fact, these methods capture unwanted variability in the distribution of weights, introduced by local maxima in the likelihood function in connection with deficiencies of gradient based learning procedures. An analysis of the influence of local maxima is hampered by the nontrivial algebraic structure of the weight space; we will show that typical feedforward networks exhibit a large number of symmetries due to a nontrivial symmetry group acting on the weight space. We will present an algorithm which divides out these symmetries. In the resulting, much smaller effective weight space, clustering algorithms may be used to improve the assessment of prediction errors. We will demonstrate that this method can be successfully applied.
1 Introduction
Feedforward networks can be interpreted as a form of nonlinear regressi...
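For tanh hidden units the symmetries in question are sign flips (tanh is odd) and permutations of hidden units. A minimal canonicalization that divides them out for a one-hidden-layer network (our own sketch; the paper's actual algorithm may differ in detail):

```python
import numpy as np

def canonicalize(W1, b1, W2):
    """Map a one-hidden-layer tanh network to a canonical representative of
    its symmetry class: since tanh(-a) = -tanh(a), flipping the sign of a
    hidden unit's input weights and bias is absorbed by negating its
    outgoing weights. We flip each unit so its bias is non-negative, then
    sort the units lexicographically by their input weight rows."""
    W1, b1, W2 = W1.copy(), b1.copy(), W2.copy()
    flip = np.where(b1 < 0, -1.0, 1.0)
    W1 *= flip[:, None]
    b1 *= flip
    W2 *= flip[None, :]
    order = np.lexsort(W1.T[::-1])     # rows sorted by first column, then second, ...
    return W1[order], b1[order], W2[:, order]

# Two symmetry-equivalent networks collapse onto the same representative.
rng = np.random.default_rng(2)
W1 = rng.standard_normal((4, 3)); b1 = rng.standard_normal(4); W2 = rng.standard_normal((2, 4))
s = np.array([1.0, -1.0, -1.0, 1.0])   # arbitrary sign flips
perm = rng.permutation(4)              # arbitrary hidden-unit permutation
W1b, b1b, W2b = (s[:, None] * W1)[perm], (s * b1)[perm], (W2 * s[None, :])[:, perm]
a, b, c = canonicalize(W1, b1, W2)
ap, bp, cp = canonicalize(W1b, b1b, W2b)
print(np.allclose(a, ap) and np.allclose(b, bp) and np.allclose(c, cp))  # → True
```

With nhid hidden units the symmetry group has 2^nhid * nhid! elements, which is why dividing it out shrinks the effective weight space so dramatically before clustering.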
, 1997
Abstract
Since the discovery of the backpropagation method, many modified and new algorithms have been proposed for training of feedforward neural networks. The problem of slow convergence has, however, not been solved when the training is on large scale problems. There is still a need for more efficient algorithms. This Ph.D. thesis describes different approaches to improve convergence. The main results of the thesis are the development of the Scaled Conjugate Gradient algorithm and the stochastic version of this algorithm. Other important results are the development of methods that can derive and use Hessian information in an efficient way. The main part of this thesis is the 5 papers presented in appendices A-E. Chapters 1-6 give an overview of learning in feedforward neural networks, put these papers in perspective and present the most important results. The conclusion of this thesis is: conjugate gradient algorithms are very suitable for training of feedforward networks, and second order information obtained by calculations on the Hessian matrix can be used to improve convergence.
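The heart of the Scaled Conjugate Gradient algorithm mentioned above is the step size alpha = -g.p / (p.Hp + lam*|p|^2), where the curvature term p.Hp needs only a Hessian-vector product and lam is a Levenberg-Marquardt damping term. A stripped-down sketch on a convex quadratic (our own; the full algorithm adapts lam from a trust-region comparison, which is omitted here):

```python
import numpy as np

def scg_minimize(grad, hv, w, iters=100, lam=1e-3):
    """Stripped-down scaled conjugate gradient: conjugate directions with
    step size alpha = -g.p / (p.Hp + lam*|p|^2). The damping lam guards
    against non-positive curvature (held fixed here for simplicity)."""
    g = grad(w)
    p = -g
    for _ in range(iters):
        curv = p @ hv(w, p) + lam * (p @ p)
        if curv <= 0:                    # indefinite curvature: raise damping
            lam *= 10
            continue
        alpha = -(g @ p) / curv
        w = w + alpha * p
        g_new = grad(w)
        beta = (g_new @ (g_new - g)) / (g @ g)   # Polak-Ribiere coefficient
        p = -g_new + max(beta, 0.0) * p
        g = g_new
        if np.linalg.norm(g) < 1e-8:
            break
    return w

# Convex quadratic test: E(w) = 0.5 w.A w - b.w, minimized where A w = b.
rng = np.random.default_rng(3)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)
b = rng.standard_normal(5)
w_star = scg_minimize(lambda w: A @ w - b,      # gradient
                      lambda w, v: A @ v,       # Hessian-vector product
                      np.zeros(5))
print(np.allclose(A @ w_star, b))               # → True
```

Because only Hessian-vector products appear, the per-iteration cost is a small multiple of a gradient evaluation, which is the efficiency argument the thesis makes.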
Aarhus University
Abstract
Several methods for training feedforward neural networks require second order information from the Hessian matrix of the error function. Although it is possible to calculate the Hessian matrix exactly, this is often not desirable because of the computation and memory requirements involved. Some learning techniques do, however, only need the Hessian matrix times a vector. This paper presents a method to calculate the Hessian matrix times a vector in O(N) time, where N is the number of variables in the network. This is of the same order as the calculation of the gradient of the error function. The usefulness of this algorithm is demonstrated by improvement of existing learning techniques.
parameters such as σ and λ, that is Pr(λ | w, σ) = Pr(λ). Posterior probabilities of network weights are as follows. For regression with Gaussian error and unknown σ,
Abstract
This paper has covered Bayesian theory relevant to the problem of training feedforward connectionist networks. We now sketch out how this might be put together in practice, assuming a standard gradient descent algorithm as used during search.
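Concretely, putting Bayesian theory together with gradient descent usually means descending the negative log posterior: with a Gaussian prior on the weights, this is the data-misfit term plus a weight-decay penalty. A minimal sketch (our own, on linear regression so the MAP estimate has a closed form to check against; the precisions alpha and beta are assumed values, not from the paper):

```python
import numpy as np

# MAP estimation: with prior w ~ N(0, 1/alpha) and Gaussian noise of
# precision beta, the negative log posterior (up to constants) is
#   beta/2 * ||X w - y||^2 + alpha/2 * ||w||^2,
# so gradient descent on it is ordinary training with weight decay.
rng = np.random.default_rng(4)
X = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(50)
alpha, beta = 1.0, 100.0          # assumed prior/noise precisions

def grad(w):
    return beta * (X.T @ (X @ w - y)) + alpha * w

w = np.zeros(3)
for _ in range(2000):             # plain gradient descent on -log posterior
    w -= 1e-4 * grad(w)

# Closed-form MAP solution for comparison.
w_map = np.linalg.solve(beta * X.T @ X + alpha * np.eye(3), beta * X.T @ y)
print(np.allclose(w, w_map, atol=1e-4))   # → True
```

For a nonlinear network the closed form disappears, but the descent loop is unchanged; only `grad` becomes a backpropagation pass plus the weight-decay term.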