## Local Gain Adaptation in Stochastic Gradient Descent (1999)

Venue: Proc. Intl. Conf. Artificial Neural Networks

Citations: 58 (12 self)

### BibTeX

```bibtex
@inproceedings{Schraudolph99localgain,
  author    = {Nicol N. Schraudolph},
  title     = {Local Gain Adaptation in Stochastic Gradient Descent},
  booktitle = {Proc. Intl. Conf. Artificial Neural Networks},
  year      = {1999},
  pages     = {569--574}
}
```

### Abstract

Gain adaptation algorithms for neural networks typically adjust learning rates by monitoring the correlation between successive gradients. Here we discuss the limitations of this approach, and develop an alternative by extending Sutton's work on linear systems to the general, nonlinear case. The resulting online algorithms are computationally little more expensive than other acceleration techniques, do not assume statistical independence between successive training patterns, and do not require an arbitrary smoothing parameter. In our benchmark experiments, they consistently outperform other acceleration methods, and show remarkable robustness when faced with non-i.i.d. sampling of the input space.
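For contrast with the approach developed in the paper, the correlation-based scheme the abstract critiques can be sketched in a few lines. This is an illustrative delta-bar-delta/SuperSAB-style rule, not the paper's algorithm; the growth and shrink factors are hypothetical choices:

```python
import numpy as np

def correlation_gain_step(rates, grad, prev_grad, up=1.05, down=0.7):
    """Grow each local rate whose gradient component kept its sign,
    shrink it where the sign flipped. Illustrative heuristic only;
    the factors up/down are hypothetical, not from the paper."""
    agree = grad * prev_grad > 0
    return np.where(agree, rates * up, rates * down)

rates = np.full(3, 0.1)
g_prev = np.array([1.0, -2.0, 0.5])
g_curr = np.array([0.5, 3.0, 0.2])   # component 1 flipped sign
rates = correlation_gain_step(rates, g_curr, g_prev)
```

Because the decision is based only on the sign agreement of two successive gradients, such rules implicitly assume successive training patterns are statistically independent, which is one of the limitations the paper addresses.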

### Citations

640 | A direct adaptive method for faster backpropagation learning: The Rprop algorithm
- Riedmiller, Braun
- 1993
Citation Context: ...incrementally adapt the gain [5]. Unfortunately most of the existing gain adaptation algorithms for neural networks adapt only a single, global learning rate [6, 7], can be used only in batch training [8, 9, 10, 11, 12, 13], or both [14, 15, 16, 17]. Given the well-known advantages of stochastic gradient descent, it would be desirable to have online methods for local gain adaptation in nonlinear neural networks. The first...

416 | A learning algorithm for continually running fully recurrent neural networks
- Williams, Zipser
- 1989
Citation Context: ...p effects. By contrast, Sutton [5] models the long-term effect of ~p on future weight updates in a linear system by carrying the relevant partials forward through time (cf. real-time recurrent learning [24]). This results in an iterative update rule for ~v, which we extend here to nonlinear systems. As before, we differentiate (1) with respect to ln ~p, but we now consider the change in ~p to have occur...

339 | Increased rates of convergence through learning rate adaptation
- Jacobs
- 1988
Citation Context: ...incrementally adapt the gain [5]. Unfortunately most of the existing gain adaptation algorithms for neural networks adapt only a single, global learning rate [6, 7], can be used only in batch training [8, 9, 10, 11, 12, 13], or both [14, 15, 16, 17]. Given the well-known advantages of stochastic gradient descent, it would be desirable to have online methods for local gain adaptation in nonlinear neural networks. The first...

247 | Exponentiated gradient versus gradient descent for linear predictors
- Kivinen, Warmuth
- 1997
Citation Context: ... its parameters ~w by stochastic gradient descent: ~w_{t+1} = ~w_t + ~p_t · ~δ_t (1), where ~δ_t ≡ −∂f_{~w_t}(~x_t)/∂~w. The local learning rates ~p are best adapted by exponentiated gradient descent [21, 22], so that they can cover a wide dynamic range while staying strictly positive: ln ~p_t = ln ~p_{t−1} − μ ∂f_{~w_t}(~x_t)/∂ ln ~p, i.e. ~p_t = ~p_{t−1} · exp(μ ~δ_t · ~v_t) (2), where ~v_t ≡ ∂~w_t/∂ ln ~p and ...
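The exponentiated-gradient rate update (2) quoted in this context can be sketched directly; the meta-learning rate mu and all numeric values below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def eg_rate_update(p, delta, v, mu=0.05):
    """Exponentiated-gradient step on the local rates p, as in eq. (2)
    of the quoted context: descending in ln p means multiplying p by
    exp(mu * delta * v), so every rate stays strictly positive.
    mu = 0.05 is an arbitrary illustrative choice."""
    return p * np.exp(mu * delta * v)

p = np.array([1e-3, 1e-1, 10.0])    # rates spanning several decades
delta = np.array([2.0, -1.0, 0.5])  # current stochastic gradient delta_t
v = np.array([0.5, 0.5, -2.0])      # v_t = d w_t / d ln p
p_new = eg_rate_update(p, delta, v)
assert np.all(p_new > 0)            # positivity holds for any delta, v
```

The multiplicative form is what lets a single update rule serve rates that differ by orders of magnitude, which an additive update on p could not do safely.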

134 | Additive versus exponentiated gradient updates for linear prediction
- Kivinen, Warmuth
- 1997
Citation Context: ... its parameters ~w by stochastic gradient descent: ~w_{t+1} = ~w_t + ~p_t · ~δ_t (1), where ~δ_t ≡ −∂f_{~w_t}(~x_t)/∂~w. The local learning rates ~p are best adapted by exponentiated gradient descent [21, 22], so that they can cover a wide dynamic range while staying strictly positive: ln ~p_t = ln ~p_{t−1} − μ ∂f_{~w_t}(~x_t)/∂ ln ~p, i.e. ~p_t = ~p_{t−1} · exp(μ ~δ_t · ~v_t) (2), where ~v_t ≡ ∂~w_t/∂ ln ~p and ...

123 | Accelerating the Convergence of the Back-Propagation Method, Biological Cybernetics 59
- Vogl, Mangis, et al.
- 1988
Citation Context: ...Unfortunately most of the existing gain adaptation algorithms for neural networks adapt only a single, global learning rate [5, 6], can be used only in batch training [7, 8, 9, 10, 11, 12], or both [13, 14, 15, 16]. Given the well-known advantages of stochastic gradient descent, it would be desirable to have online methods for local gain adaptation in nonlinear neural networks. The first such algorithms have re...

97 | Improving the Convergence of Back-Propagation Learning with Second Order Methods
- Becker, LeCun
- 1988
Citation Context: ...1]. Stochastic gradient descent remains the algorithm of choice. The central problem here is how to set the local learning rate, or gain of the algorithm, for rapid convergence. Normalization methods [2, 3, 4] calculate the optimal gain under simplifying assumptions, which may or may not model a given situation well. Even such "optimal" algorithms as Kalman filtering can thus be outperformed by adaptation m...

82 | Advanced Supervised Learning in Multi-Layer Perceptron from Back Propagation to Adaptive Learning Algorithms
- Riedmiller
- 1994
Citation Context: ...incrementally adapt the gain [5]. Unfortunately most of the existing gain adaptation algorithms for neural networks adapt only a single, global learning rate [6, 7], can be used only in batch training [8, 9, 10, 11, 12, 13], or both [14, 15, 16, 17]. Given the well-known advantages of stochastic gradient descent, it would be desirable to have online methods for local gain adaptation in nonlinear neural networks. The first...

75 | Adapting bias by gradient descent: an incremental version of delta-bar-delta - Sutton - 1992

69 | Fast exact multiplication by the Hessian
- Pearlmutter
- 1994
Citation Context: ...where H_t denotes the instantaneous Hessian of f_~w(~x) at time t. Note that there is an efficient O(n) algorithm to calculate H_t ~v_t without ever having to compute or store the matrix H_t itself [24]. Meta-level conditioning. The gradient descent in ~p at the meta-level (2) may of course suffer from ill-conditioning as much as the descent in ~w at the main level (1); the meta-descent in fact sq...
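As a rough illustration of computing H·v without ever forming H: Pearlmutter's method is an exact O(n) pass (the R-operator), but a finite difference of gradients, sketched below, shows the same matrix-free idea. This is only an approximation, not Pearlmutter's algorithm, and all names here are ours:

```python
import numpy as np

def hessian_vector(grad_fn, w, v, eps=1e-5):
    """Matrix-free Hessian-vector product via a central difference of
    gradients: H v ~ (g(w + eps*v) - g(w - eps*v)) / (2*eps).
    Pearlmutter's R-operator computes H v exactly; this finite
    difference is an illustrative stand-in with the same O(n) cost
    profile (two gradient evaluations, no n-by-n matrix)."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)

# Sanity check on a quadratic f(w) = 0.5 w^T A w, whose Hessian is A,
# so the gradient is A w and H v must equal A v.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_fn = lambda w: A @ w
w = np.array([0.2, -0.4])
v = np.array([1.0, 2.0])
Hv = hessian_vector(grad_fn, w, v)
```

For a quadratic the central difference is exact up to rounding; on a general nonlinear network it incurs truncation error, which is exactly why the exact R-operator pass cited in the context is preferable in practice.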

63 | SuperSAB: fast adaptive backpropagation with good scaling properties
- Tollenaere
- 1990

63 | Training multilayer perceptrons with the extended Kalman algorithm
- Singhal, Wu
- 1989
Citation Context: ...ptron, providing a computationally attractive diagonal approximation to the full Hessian update for ~v given in (4). 4 Benchmark Results: We evaluated our work on the "four regions" classification task [27], a well-known benchmark problem [17, 28, 29, 30]: a fully connected feedforward network with 2 hidden layers of 10 units each (tanh nonlinearity) is to classify two continuous inputs (range [-1,1]) a...

47 | Accelerated backpropagation learning: two optimization methods
- Battiti
- 1989
Citation Context: ...Unfortunately most of the existing gain adaptation algorithms for neural networks adapt only a single, global learning rate [6, 7], can be used only in batch training [8, 9, 10, 11, 12, 13], or both [14, 15, 16, 17]. Given the well-known advantages of stochastic gradient descent, it would be desirable to have online methods for local gain adaptation in nonlinear neural networks. The first such algorithms have rece...

47 | Decoupled extended Kalman filter training of feedforward layered networks
- Puskorius, Feldkamp
- 1991
Citation Context: ...ractive diagonal approximation to the full Hessian update for ~v given in (4). 4 Benchmark Results: We evaluated our work on the "four regions" classification task [26], a well-known benchmark problem [16, 27, 28, 29]: a fully connected feedforward network with 2 hidden layers of 10 units each (tanh nonlinearity) is to classify two continuous inputs (range [-1,1]) as illustrated in Figure 1. We used "softmax" outp...

42 | Gain adaptation beats least squares
- Sutton
- 1992
Citation Context: ...en situation well. Even such "optimal" algorithms as Kalman filtering can thus be outperformed by adaptation methods which measure the effects of finite step sizes in order to incrementally adapt the gain [5]. Unfortunately most of the existing gain adaptation algorithms for neural networks adapt only a single, global learning rate [6, 7], can be used only in batch training [8, 9, 10, 11, 12, 13], or both...

42 | Speeding up backpropagation
- Silva, Almeida
- 1990

29 | Optimal filtering algorithms for fast learning in feedforward neural networks
- Shah, Palmieri, et al.
- 1992
Citation Context: ...ractive diagonal approximation to the full Hessian update for ~v given in (4). 4 Benchmark Results: We evaluated our work on the "four regions" classification task [26], a well-known benchmark problem [16, 27, 28, 29]: a fully connected feedforward network with 2 hidden layers of 10 units each (tanh nonlinearity) is to classify two continuous inputs (range [-1,1]) as illustrated in Figure 1. We used "softmax" outp...

27 | Adaptive on-line learning in changing environments
- Murata, Müller, et al.
- 1997
Citation Context: ...the effects of finite step sizes in order to incrementally adapt the gain [5]. Unfortunately most of the existing gain adaptation algorithms for neural networks adapt only a single, global learning rate [6, 7], can be used only in batch training [8, 9, 10, 11, 12, 13], or both [14, 15, 16, 17]. Given the well-known advantages of stochastic gradient descent, it would be desirable to have online methods for ...

25 | An adaptive training algorithm for back-propagation networks
- Chan, Fallside
- 1987

20 | Multi-player residual advantage learning with general function approximation (Tech. Rep.)
- Harmon, Baird
- 1996
Citation Context: ...own advantages of stochastic gradient descent, it would be desirable to have online methods for local gain adaptation in nonlinear neural networks. The first such algorithms have recently been proposed [18, 19]; here we develop a more sophisticated alternative by extending Sutton's work on linear systems [5, 20] to the general, nonlinear case. The resulting stochastic meta-descent (SMD) algorithms support o...

19 | A fast, compact approximation of the exponential function
- Schraudolph
- 1999
Citation Context: ...through the corresponding element of ~w. With considerable variation, (2) forms the basis of most local rate adaptation methods found in the literature. In order to avoid an expensive exponentiation [23] for each weight update, we typically use the linearization e^u ≈ 1 + u, valid for small |u|, giving ~p_t = ~p_{t−1} · max(ρ, 1 + μ ~δ_t · ~v_t) (3), where we constrain the multiplier to be at least ρ (typi...
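The linearized update (3) described in this context can be sketched as follows; the values of the meta-rate mu and the floor rho below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def linearized_rate_update(p, delta, v, mu=0.05, rho=0.1):
    """Eq. (3) from the quoted context: replace exp(u) with 1 + u,
    valid for small |u|, and floor the multiplier at rho so a large
    negative meta-gradient can only shrink a rate, never drive it to
    zero or negative. mu and rho here are illustrative values."""
    return p * np.maximum(rho, 1.0 + mu * delta * v)

p = np.array([0.1, 0.1])
delta = np.array([1.0, -40.0])  # second component strongly disagrees
v = np.array([1.0, 1.0])
p_new = linearized_rate_update(p, delta, v)
# first rate is scaled by 1.05; second is clipped to rho times its
# old value instead of going negative
```

Compared with the exact exponential update, this trades a small bias for one multiply-add per weight, while the floor preserves the key property of exponentiated gradient: the rates remain strictly positive.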

15 | Parameter adaptation in stochastic optimization
- Almeida, Langlois, et al.
- 1999
Citation Context: ...own advantages of stochastic gradient descent, it would be desirable to have online methods for local gain adaptation in nonlinear neural networks. The first such algorithms have recently been proposed [18, 19]; here we develop a more sophisticated alternative by extending Sutton's work on linear systems [5, 20] to the general, nonlinear case. The resulting stochastic meta-descent (SMD) algorithms support o...

13 | Tempering backpropagation networks: Not all weights are created equal
- Schraudolph, Sejnowski
- 1996
Citation Context: ...1]. Stochastic gradient descent remains the algorithm of choice. The central problem here is how to set the local learning rate, or gain of the algorithm, for rapid convergence. Normalization methods [2, 3, 4] calculate the optimal gain under simplifying assumptions, which may or may not model a given situation well. Even such "optimal" algorithms as Kalman filtering can thus be outperformed by adaptation m...

13 | Online local gain adaptation for multi-layer perceptrons, Tech. Rep. IDSIA-09-98, Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, Corso Elvezia 36, 6900
- Schraudolph
- 1998
Citation Context: ... the neural network trained on it (right). where k_t ≡ 1 / (1 + ~x_t^T (~p_t · ~x_t)) (9), we obtain Sutton's K1 algorithm [5]: ~v_{t+1} = (~v_t + k_t ~p_t ~δ_t)(1 − k_t ~p_t ~x_t²) (10). Our elk1 algorithm [26] extends K1 by adding a nonlinear function σ to the system's output: f_{~w_t}(~x_t, y_t) ≡ ½ [y_t − σ(a_t)]², k_t ≡ 1 / (1 + ~x_t^T (~p_t · ~x_t) σ′(a_t)²) (11), ~v_{t+1} = (~v_t + k_t ~p_t ~δ_t)[1...

12 | A self-optimizing, nonsymmetrical neural net for content addressable memory and pattern recognition
- Lapedes, Farber
- 1986
Citation Context: ...Unfortunately most of the existing gain adaptation algorithms for neural networks adapt only a single, global learning rate [6, 7], can be used only in batch training [8, 9, 10, 11, 12, 13], or both [14, 15, 16, 17]. Given the well-known advantages of stochastic gradient descent, it would be desirable to have online methods for local gain adaptation in nonlinear neural networks. The first such algorithms have rece...


6 | Automatic learning rate maximization in large adaptive machines
- LeCun, Simard, et al.
- 1993
Citation Context: ...the effects of finite step sizes in order to incrementally adapt the gain [5]. Unfortunately most of the existing gain adaptation algorithms for neural networks adapt only a single, global learning rate [6, 7], can be used only in batch training [8, 9, 10, 11, 12, 13], or both [14, 15, 16, 17]. Given the well-known advantages of stochastic gradient descent, it would be desirable to have online methods for ...


6 | How to train neural networks, in Neural Networks
- Neuneier, Zimmermann
- 1998
Citation Context: ...1]. Stochastic gradient descent remains the algorithm of choice. The central problem here is how to set the local learning rate, or gain of the algorithm, for rapid convergence. Normalization methods [2, 3, 4] calculate the optimal gain under simplifying assumptions, which may or may not model a given situation well. Even such "optimal" algorithms as Kalman filtering can thus be outperformed by adaptation...

3 | Acceleration of backpropagation learning using optimised learning rate and momentum
- Yu, Chen, et al.
- 1993
Citation Context: ...Unfortunately most of the existing gain adaptation algorithms for neural networks adapt only a single, global learning rate [6, 7], can be used only in batch training [8, 9, 10, 11, 12, 13], or both [14, 15, 16, 17]. Given the well-known advantages of stochastic gradient descent, it would be desirable to have online methods for local gain adaptation in nonlinear neural networks. The first such algorithms have rece...


3 | Training neural networks using sequential-update forms of the extended Kalman filter
- Plumer
- 1995
Citation Context: ...tractive diagonal approximation to the full Hessian update for ~v given in (4). 4 Benchmark Results: We evaluated our work on the "four regions" classification task [27], a well-known benchmark problem [17, 28, 29, 30]: a fully connected feedforward network with 2 hidden layers of 10 units each (tanh nonlinearity) is to classify two continuous inputs (range [-1,1]) as illustrated in Figure 1. We used "softmax" outp...



