Efficient Back Prop, 1996
Cited by 93 (16 self)
[Figure: a learning machine with input X0, X1, ..., Xp, parameters w, output Y0, Y1, ..., Yp and desired output D0, D1, ..., Dp; a cost function compares output and desired output and yields the error E0, E1, ..., Ep.]
COMPUTING THE GRADIENT WITH BACKPROPAGATION
O = A(I1, I2)
dI1 = dO ∂A/∂I1
dI2 = dO ∂A/∂I2
- The learning machine is composed of modules (e.g. layers)
- Each module can do two things:
  1- compute its outputs from its inputs (FPROP)
  2- compute gradient vectors at its inputs from gradient vectors at its outputs (BPROP)
[Figure: module A with output O, dO and inputs I1, dI1 and I2, dI2.]
AN INTERESTING SPECIAL CASE: MULTILAYER NETWORKS
[Figure: a multilayer network that alternates weight matrices (WX) with sigmoids and biases (F()), trained on the mean square error 1/2 ||D - Y||^2; the parameters w are the weights and biases.]
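The FPROP/BPROP module view above can be illustrated with a minimal sketch. The class names (`Linear`, `Sigmoid`, `HalfSquaredError`), the scalar inputs, and the numeric values are assumptions chosen for illustration; they do not come from the paper.

```python
import math

class Linear:
    """Hypothetical scalar module: o = w * x + b."""
    def __init__(self, w, b):
        self.w, self.b = w, b

    def fprop(self, x):
        self.x = x                     # keep the input for bprop
        return self.w * x + self.b

    def bprop(self, d_out):
        self.dw = d_out * self.x       # gradient w.r.t. the weight
        self.db = d_out                # gradient w.r.t. the bias
        return d_out * self.w          # gradient w.r.t. the input

class Sigmoid:
    def fprop(self, x):
        self.y = 1.0 / (1.0 + math.exp(-x))
        return self.y

    def bprop(self, d_out):
        return d_out * self.y * (1.0 - self.y)

class HalfSquaredError:
    """Cost module: E = 1/2 (y - d)^2."""
    def fprop(self, y, d):
        self.diff = y - d
        return 0.5 * self.diff ** 2

    def bprop(self):
        return self.diff

def forward(modules, cost, x, d):
    # FPROP: each module computes its output from its input.
    for m in modules:
        x = m.fprop(x)
    return cost.fprop(x, d)

def backward(modules, cost):
    # BPROP: gradients flow from the cost back through the modules.
    g = cost.bprop()
    for m in reversed(modules):
        g = m.bprop(g)
    return g

net = [Linear(0.5, -0.1), Sigmoid(), Linear(-1.2, 0.3), Sigmoid()]
cost = HalfSquaredError()
error = forward(net, cost, x=0.8, d=1.0)
d_input = backward(net, cost)
```

After `backward`, each `Linear` module holds its own parameter gradients in `dw` and `db`, so no module needs to know what sits before or after it in the chain.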
Fast Exact Multiplication by the Hessian - Neural Computation, 1994
Cited by 54 (3 self)
Just storing the Hessian H (the matrix of second derivatives d^2 E/dw_i dw_j of the error E with respect to each pair of weights) of a large neural network is difficult. Since a common use of a large matrix like H is to compute its product with various vectors, we derive a technique that directly calculates Hv, where v is an arbitrary vector. This allows H to be treated as a generalized sparse matrix. To calculate Hv, we first define a differential operator R{f(w)} = (d/dr)f(w + rv)|_{r=0}, note that R{grad_w} = Hv and R{w} = v, and then apply R{} to the equations used to compute grad_w. The result is an exact and numerically stable procedure for computing Hv, which takes about as much computation, and is about as local, as a gradient evaluation. We then apply the technique to backpropagation networks, recurrent backpropagation, and stochastic Boltzmann Machines. Finally, we show that this technique can be used at the heart of many iterative techniques for computing various properties of H, obviating the need for direct methods.
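As a rough illustration of applying R{} to the gradient equations, the sketch below uses a single sigmoid unit with error E = 1/2 (y - d)^2 summed over a few samples. The model, the data, and the function names are assumptions made for the example; the paper itself treats general backpropagation networks.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gradient(w, data):
    """Gradient of E = sum 1/2 (sigmoid(w . x) - d)^2 with respect to w."""
    g = [0.0] * len(w)
    for x, d in data:
        a = sum(wi * xi for wi, xi in zip(w, x))
        y = sigmoid(a)
        delta = (y - d) * y * (1.0 - y)
        for i, xi in enumerate(x):
            g[i] += delta * xi
    return g

def hessian_vector(w, v, data):
    """Hv obtained by applying R{} to each line of gradient(), using R{w} = v."""
    hv = [0.0] * len(w)
    for x, d in data:
        a = sum(wi * xi for wi, xi in zip(w, x))
        r_a = sum(vi * xi for vi, xi in zip(v, x))         # R{a}
        y = sigmoid(a)
        dy = y * (1.0 - y)                                 # sigmoid'(a)
        d2y = dy * (1.0 - 2.0 * y)                         # sigmoid''(a)
        r_y = dy * r_a                                     # R{y}
        r_delta = r_y * dy + (y - d) * d2y * r_a           # R{delta}
        for i, xi in enumerate(x):
            hv[i] += r_delta * xi                          # R{g_i}
    return hv

W = [0.3, -0.2, 0.5]
V = [1.0, 0.5, -2.0]
DATA = [([1.0, 2.0, -1.0], 1.0), ([0.5, -1.5, 2.0], 0.0)]
hv = hessian_vector(W, V, DATA)
```

The product costs one extra pass of about the same size as the gradient pass, and the Hessian itself is never formed.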
Training Neural Nets with the Reactive Tabu Search, 1995
Cited by 29 (7 self)
In this paper the task of training sub-symbolic systems is considered as a combinatorial optimization problem and solved with the heuristic scheme of the Reactive Tabu Search (RTS) proposed by the authors and based on F. Glover's Tabu Search. An iterative optimization process based on a "modified greedy search" component is complemented with a meta-strategy to realize a discrete dynamical system that discourages limit cycles and the confinement of the search trajectory in a limited portion of the search space. The possible cycles are discouraged by prohibiting (i.e., making tabu) the execution of moves that reverse the ones applied in the most recent part of the search, for a prohibition period that is adapted in an automated way. The confinement is avoided and a proper exploration is obtained by activating a diversification strategy when too many configurations are repeated excessively often. The RTS method is applicable to non-differentiable functions, it is robust with respect to the...
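The mechanism described in this abstract can be sketched on bit strings as follows. This is a heavily simplified illustration, not the authors' algorithm: the function name, the rule for growing and shrinking the prohibition period, the repetition threshold, and the toy objective are all assumptions.

```python
import random

def reactive_tabu_search(f, n_bits, n_iters=300, seed=0):
    """Minimize f over bit lists of length n_bits (n_bits >= 2)."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n_bits)]
    best_x, best_f = x[:], f(x)
    last_flip = [-10**9] * n_bits   # iteration at which each bit was last flipped
    visits = {}                     # configuration -> number of visits
    period = 1                      # current prohibition period
    last_change = 0
    for t in range(n_iters):
        key = tuple(x)
        visits[key] = visits.get(key, 0) + 1
        if visits[key] > 1:
            # Reaction: a repeated configuration lengthens the prohibition.
            period = min(period + 1, n_bits - 1)
            last_change = t
        elif t - last_change > 2 * n_bits:
            # No repetition for a while: shorten the prohibition.
            period = max(period - 1, 1)
            last_change = t
        if visits[key] > 3:
            # Diversification: too many repetitions, flip a random half.
            for i in rng.sample(range(n_bits), max(1, n_bits // 2)):
                x[i] ^= 1
            val = f(x)
            if val < best_f:
                best_x, best_f = x[:], val
            continue
        # Modified greedy step: take the best non-tabu flip even if it worsens f.
        # At most `period` <= n_bits - 1 bits are tabu, so one flip is always allowed.
        best_i, best_val = None, None
        for i in range(n_bits):
            if t - last_flip[i] <= period:
                continue            # reversing a recent move is prohibited
            x[i] ^= 1
            val = f(x)
            x[i] ^= 1
            if best_val is None or val < best_val:
                best_i, best_val = i, val
        x[best_i] ^= 1
        last_flip[best_i] = t
        if best_val < best_f:
            best_x, best_f = x[:], best_val
    return best_x, best_f

# Toy non-differentiable objective (an assumption): distance of a subset sum to a target.
WEIGHTS = [3, 5, 7, 11, 13, 17]
TARGET = 28

def subset_gap(x):
    return abs(sum(w * b for w, b in zip(WEIGHTS, x)) - TARGET)

best_x, best_f = reactive_tabu_search(subset_gap, n_bits=len(WEIGHTS))
```

Only function values are used, which is why a scheme of this kind does not need the objective to be differentiable.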
Efficient Training of Feed-Forward Neural Networks, 1997
Cited by 9 (0 self)
A.2 Introduction
A.2.1 Motivation
A.3 Optimization strategy
A.4 The Backpropagation algorithm
A.5 Conjugate direction methods
A.5.1 Conjugate gradients
A.5.2 The CGL algorithm
A.5.3 The BFGS algorithm
A.6 The SCG algorithm
A.7 Test results
A.7.1 Comparison metric ...
Backpropagation Convergence Via Deterministic Nonmonotone Perturbed Minimization, 1994
Cited by 9 (5 self)
The fundamental backpropagation (BP) algorithm for training artificial neural networks is cast as a deterministic nonmonotone perturbed gradient method. Under certain natural assumptions, such as the series of learning rates diverging while the series of their squares converging, it is established that every accumulation point of the online BP iterates is a stationary point of the BP error function. The results presented cover serial and parallel online BP, modified BP with a momentum term, and BP with weight decay.
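The learning-rate assumption can be made concrete with a small sketch. The schedule 0.5 / (t + 5) is one assumed example of a series that diverges while its squares converge, and the single linear unit and the three data points are likewise illustrative stand-ins for a network.

```python
# Toy data (an assumption): no single weight fits all three points exactly.
DATA = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]

def learning_rate(t):
    # Sum over t diverges (harmonic-like); sum of squares converges.
    return 0.5 / (t + 5)

def online_bp(data, n_epochs=10000, w=0.0):
    """Serial online updates on E(w) = sum 1/2 (w * x - d)^2, one sample at a time."""
    t = 0
    for _ in range(n_epochs):
        for x, d in data:
            grad = (w * x - d) * x          # gradient of the current sample's error
            w -= learning_rate(t) * grad
            t += 1
    return w

# Stationary point of the full error function, for comparison.
w_star = sum(x * d for x, d in DATA) / sum(x * x for x, _ in DATA)
w_final = online_bp(DATA)
```

With a constant learning rate the iterates would keep cycling around `w_star` because each sample pulls in a different direction; the shrinking steps damp that perturbation.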
Exact Calculation of the Product of the Hessian Matrix of Feed-Forward Network Error Functions and a Vector in O(N) Time
Cited by 4 (0 self)
Several methods for training feed-forward neural networks require second order information from the Hessian matrix of the error function. Although it is possible to calculate the Hessian matrix exactly, it is often not desirable because of the computation and memory requirements involved. Some learning techniques, however, only need the Hessian matrix times a vector. This paper presents a method to calculate the Hessian matrix times a vector in O(N) time, where N is the number of variables in the network. This is of the same order as the calculation of the gradient of the error function. The usefulness of this algorithm is demonstrated by improvement of existing learning techniques.
1 Introduction
The second derivative information of the error function associated with feed-forward neural networks forms an N × N matrix, which is usually referred to as the Hessian matrix. Second derivative information is needed in several learning algorithms, e.g., in some conjugate gradient a...
Scaled Conjugate Gradients for Maximum likelihood: An Empirical Comparison with the EM Algorithm - Proceedings of the First European Workshop on Probabilistic Graphical Models (PGM-02), 2002
Cited by 3 (2 self)
To learn Bayesian networks, one must estimate the parameters of the network from the data. EM (Expectation-Maximization) and gradient-based algorithms are the two best known techniques to estimate these parameters. Although the theoretical properties of these two frameworks are well-studied, it remains an open question as to when and whether EM is to be preferred over gradients. We will answer this question empirically. More specifically, we first adapt scaled conjugate gradients, well-known from neural network learning. This accelerated conjugate gradient avoids the time-consuming line search of more traditional methods. Secondly, we empirically compare scaled conjugate gradients with EM. The experiments show that accelerated conjugate gradients are competitive with EM. Although EM is in general the domain-independent method of choice, gradient-based methods can be superior.
1 Introduction
Bayesian networks are one of the most important, efficient and elegant frameworks for representing and reasoning with probabilistic models. They specify joint probability distributions over finite sets of random variables, and have been applied to many real-world problems in diagnosis, forecasting, automated vision, sensor fusion and manufacturing control. Over the past years, there has been much interest in the problem of learning Bayesian networks from data to avoid the problems of knowledge elicitation. For learning Bayesian networks, parameter estimation is a fundamental task not only because of the inability of humans to reliably estimate the parameters, but also because it forms the basis for the overall learning problem [10].
Parametric Regression
w, the set of parameters for which the model is the closest to the original mapping, or system, f.
1.2 Learning and optimisation
The fit of the model to the system in a given point x is measured using a criterion representing the distance from the model prediction ŷ to the system, e(y, f_w(x)). This is the local risk. The performance of the model is measured by the expected risk:
$$G(w) = E_{x,y}\big(e(y, f_w(x))\big) = \iint e(y, f_w(x))\, p(y \mid x)\, p(x)\, dx\, dy \tag{1.1}$$
This quantity represents the ability to yield good performance f
Learning with First, Second, and No Derivatives: a Case Study in High Energy Physics - Neurocomputing, 1994
In this paper different algorithms for training multi-layer perceptron architectures are applied to a significant discrimination task in High Energy Physics. The One-Step Secant technique is compared with On-Line Backpropagation, the "Bold Driver" batch version and Conjugate Gradient methods. In addition, a new algorithm (Affine Shaker) is proposed that uses stochastic search based on function values and affine transformations of the local search region. Although the Affine Shaker requires more CPU time to reach the maximum generalization, the technique can be interesting for special-purpose VLSI implementations and for non-differentiable functions.

