Results 1  10
of
10
Efficient BackProp
, 1998
"... . The convergence of backpropagation learning is analyzed so as to explain common phenomenon observed by practitioners. Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks, and offers expl ..."
Abstract

Cited by 125 (24 self)
 Add to MetaCart
. The convergence of backpropagation learning is analyzed so as to explain common phenomenon observed by practitioners. Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks, and offers explanations of why they work. Many authors have suggested that secondorder optimization methods are advantageous for neural net training. It is shown that most "classical" secondorder methods are impractical for large neural networks. A few methods are proposed that do not have these limitations. 1 Introduction Backpropagation is a very popular neural network learning algorithm because it is conceptually simple, computationally efficient, and because it often works. However, getting it to work well, and sometimes to work at all, can seem more of an art than a science. Designing and training a network using backprop requires making many seemingly arbitrary choices such as the number ...
Fast Exact Multiplication by the Hessian
 Neural Computation
, 1994
"... Just storing the Hessian H (the matrix of second derivatives d^2 E/dw_i dw_j of the error E with respect to each pair of weights) of a large neural network is difficult. Since a common use of a large matrix like H is to compute its product with various vectors, we derive a technique that directly ca ..."
Abstract

Cited by 70 (4 self)
 Add to MetaCart
Just storing the Hessian H (the matrix of second derivatives d^2 E/dw_i dw_j of the error E with respect to each pair of weights) of a large neural network is difficult. Since a common use of a large matrix like H is to compute its product with various vectors, we derive a technique that directly calculates Hv, where v is an arbitrary vector. This allows H to be treated as a generalized sparse matrix. To calculate Hv, we first define a differential operator R{f(w)} = (d/dr)f(w + rv)_{r=0}, note that R{grad_w} = Hv and R{w} = v, and then apply R{} to the equations used to compute grad_w. The result is an exact and numerically stable procedure for computing Hv, which takes about as much computation, and is about as local, as a gradient evaluation. We then apply the technique to backpropagation networks, recurrent backpropagation, and stochastic Boltzmann Machines. Finally, we show that this technique can be used at the heart of many iterative techniques for computing various properties of H, obviating the need for direct methods.
Training Neural Nets with the Reactive Tabu Search
"... In this paper the task of training subsymbolic systems is considered as a combinatorial optimization problem and solved with the heuristic scheme of the Reactive Tabu Search. An iterative optimization process based on a "modified greedy search" component is complemented with a metastrategy to real ..."
Abstract

Cited by 33 (7 self)
 Add to MetaCart
In this paper the task of training subsymbolic systems is considered as a combinatorial optimization problem and solved with the heuristic scheme of the Reactive Tabu Search. An iterative optimization process based on a "modified greedy search" component is complemented with a metastrategy to realize a discrete dynamical system that discourages limit cycles and the confinement of the search trajectory in a limited portion of the search space. The possible cycles are discouraged by prohibiting (i.e., making tabu) the execution of moves that reverse the ones applied in the most recent part of the search, for a prohibition period that is adapted in an automated way. The confinement is avoided and a proper exploration is obtained by activating a diversification strategy when too many configurations are repeated excessively often. The RTS method is applicable to nondifferentiable functions, it is robust with respect to the random initialization and effective in continuing the search after local minima. Three tests of the technique on feedforward and feedback systems are presented.
Efficient Training of FeedForward Neural Networks
, 1997
"... : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 61 A.2 Introduction : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 61 A.2.1 Motivation : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 61 A.3 Optimization strategy : : : : : : : : : : : : ..."
Abstract

Cited by 12 (0 self)
 Add to MetaCart
: : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 61 A.2 Introduction : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 61 A.2.1 Motivation : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 61 A.3 Optimization strategy : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 62 A.4 The Backpropagation algorithm : : : : : : : : : : : : : : : : : : : : : : : : 63 A.5 Conjugate direction methods : : : : : : : : : : : : : : : : : : : : : : : : : : 63 A.5.1 Conjugate gradients : : : : : : : : : : : : : : : : : : : : : : : : : : 65 A.5.2 The CGL algorithm : : : : : : : : : : : : : : : : : : : : : : : : : : : 67 A.5.3 The BFGS algorithm : : : : : : : : : : : : : : : : : : : : : : : : : : 67 A.6 The SCG algorithm : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 67 A.7 Test results : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 70 A.7.1 Comparison metric : : : : : : : : : : : : : : : : : : : : : : : :...
Backpropagation Convergence Via Deterministic Nonmonotone Perturbed Minimization
, 1994
"... The fundamental backpropagation (BP) algorithm for training artificial neural networks is cast as a deterministic nonmonotone perturbed gradient method . Under certain natural assumptions, such as the series of learning rates diverging while the series of their squares converging, it is established ..."
Abstract

Cited by 9 (5 self)
 Add to MetaCart
The fundamental backpropagation (BP) algorithm for training artificial neural networks is cast as a deterministic nonmonotone perturbed gradient method . Under certain natural assumptions, such as the series of learning rates diverging while the series of their squares converging, it is established that every accumulation point of the online BP iterates is a stationary point of the BP error function. The results presented cover serial and parallel online BP, modified BP with a momentum term, and BP with weight decay.
Exact Calculation of the Product of the Hessian Matrix of FeedForward Network Error Functions and a Vector in O(N) Time
"... Several methods for training feedforward neural networks require second order information from the Hessian matrix of the error function. Although it is possible to calculate the Hessian matrix exactly it is often not desirable because of the computation and memory requirements involved. Some learni ..."
Abstract

Cited by 5 (0 self)
 Add to MetaCart
Several methods for training feedforward neural networks require second order information from the Hessian matrix of the error function. Although it is possible to calculate the Hessian matrix exactly it is often not desirable because of the computation and memory requirements involved. Some learning techniques does, however, only need the Hessian matrix times a vector. This paper presents a method to calculate the Hessian matrix times a vector in O(N ) time, where N is the number of variables in the network. This is in the same order as the calculation of the gradient to the error function. The usefulness of this algorithm is demonstrated by improvement of existing learning techniques. 1 Introduction The second derivative information of the error function associated with feedforward neural networks forms an N \Theta N matrix, which is usually referred to as the Hessian matrix. Second derivative information is needed in several learning algorithms, e.g., in some conjugate gradient a...
Scaled Conjugate Gradients for Maximum likelihood: An Empirical Comparison with the EM Algorithm
 Proceedings of the First European Workshop on Probabilistic Graphical Models (PGM02
, 2002
"... Abstract. To learn Bayesian networks, one must estimate the parameters of the network from the data. EM (ExpectationMaximization) and gradientbased algorithms are the two best known techniques to estimate these parameters. Although the theoretical properties of these two frameworks are wellstudie ..."
Abstract

Cited by 3 (2 self)
 Add to MetaCart
Abstract. To learn Bayesian networks, one must estimate the parameters of the network from the data. EM (ExpectationMaximization) and gradientbased algorithms are the two best known techniques to estimate these parameters. Although the theoretical properties of these two frameworks are wellstudied, it remains an open question as to when and whether EM is to be preferred over gradients. We will answer this question empirically. More specifically, we first adapt scaled conjugate gradients wellknown from neural network learning. This accelerated conjugate gradient avoids the time consuming line search of more traditional methods. Secondly, we empirically compare scaled conjugate gradients with EM. The experiments show that accelerated conjugate gradients are competitive with EM. Although, in general EM is the domain independent method of choice, gradientbased methods can be superior. 1 Introduction Bayesian networks are one of the most important, efficient and elegant frameworks for representing and reasoning with probabilistic models. They specify joint probability distributions over finite sets of random variables, and have been applied to many realworld problems in diagnosis, forecasting, automated vision, sensor fusion and manufacturing control. Over the past years, there has been much interest in the problem of learning Bayesian networks from data to avoid the problems of knowledge elicitation. For learning Bayesian networks, parameter estimation is a fundamental task not only because of the inability of humans to reliably estimate the parameters, but also because it forms the basis for the overall learning problem [10].
Parametric Regression
"... 8.39> w, the set of parameters for which the model is the closest to the original mapping, or system, f . 1.2 Learning and optimisation The fit of the model to the system in a given point x is measured using a criterion representing the distance from the model prediction b y to the system, e (y; f w ..."
Abstract
 Add to MetaCart
8.39> w, the set of parameters for which the model is the closest to the original mapping, or system, f . 1.2 Learning and optimisation The fit of the model to the system in a given point x is measured using a criterion representing the distance from the model prediction b y to the system, e (y; f w (x)). This is the local risk . The performance of the model is measured by the expected risk : G (w) = E x;y (e (y; f w (x))) = Z Z e (y; f w (x)) p (yjx) p (x) dxdy (1.1) This quantity represents the ability to yield good performance f
Learning with First, Second, and No Derivatives: a Case Study in High Energy Physics
 Neurocomputing
, 1994
"... this paper different algorithms for training multilayer perceptron architecture are applied to a significant discrimination task in High Energy Physics. The OneStep Secant technique is compared with OnLine Backpropagation, the "Bold Driver" batch version and Conjugate Gradient methods. In addition ..."
Abstract
 Add to MetaCart
this paper different algorithms for training multilayer perceptron architecture are applied to a significant discrimination task in High Energy Physics. The OneStep Secant technique is compared with OnLine Backpropagation, the "Bold Driver" batch version and Conjugate Gradient methods. In addition, a new algorithm (Affine Shaker) is proposed that uses stochastic search based on function values and affine transformations of the local search region. Although the Affine Shaker requires more CPU time to reach the maximum generalization, the technique can be interesting for specialpurpose VLSI implementations and for nondifferentiable functions