## A Fast Stochastic Error-Descent Algorithm for Supervised Learning and Optimization (1993)

Venue: Advances in Neural Information Processing Systems 5 (Morgan Kaufmann)

Citations: 35 (7 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Cauwenberghs93afast,
  author    = {Gert Cauwenberghs},
  title     = {A Fast Stochastic Error-Descent Algorithm for Supervised Learning and Optimization},
  booktitle = {Advances in Neural Information Processing Systems 5},
  year      = {1993},
  pages     = {244--251},
  publisher = {Morgan Kaufmann}
}
```

### Abstract

A parallel stochastic algorithm is investigated for error-descent learning and optimization in deterministic networks of arbitrary topology. No explicit information about internal network structure is needed. The method is based on the model-free distributed learning mechanism of Dembo and Kailath. A modified parameter update rule is proposed by which each individual parameter vector perturbation contributes a decrease in error, allowing substantially faster learning. Furthermore, the modified algorithm supports learning time-varying features in dynamical networks. We analyze the convergence and scaling properties of the algorithm, and present simulation results for dynamic trajectory learning in recurrent networks.

1 Background and Motivation: We address general optimization tasks that require finding a set of constant parameter values $p_i$ that minimize a given error functional $E(p)$. For supervised learning, the error functional consists of some quantitativ...
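The parallel update rule described in the abstract can be sketched in a few lines. This is a minimal illustration under assumed details (binary ±σ perturbations, a toy quadratic error, invented step sizes), not the paper's exact formulation:

```python
import numpy as np

def stochastic_error_descent(error, p, sigma=0.01, mu=0.1, steps=2000, seed=0):
    """Perturb ALL parameters in parallel with random signs, measure the
    resulting change in error with a single probe, and update every
    parameter in proportion to that one scalar measurement."""
    rng = np.random.default_rng(seed)
    p = p.copy()
    for _ in range(steps):
        pi = sigma * rng.choice([-1.0, 1.0], size=p.shape)  # parallel perturbation
        delta_e = error(p + pi) - error(p)                  # differential error probe
        p -= mu * delta_e * pi / sigma**2                   # correlate and descend
    return p

# Toy quadratic error with minimum at (1, -2)
target = np.array([1.0, -2.0])
error = lambda q: float(np.sum((q - target) ** 2))
p_final = stochastic_error_descent(error, np.zeros(2))
```

Each iteration needs only two error evaluations regardless of the number of parameters, which is the source of the O(P) per-epoch complexity discussed in the citation contexts below.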

### Citations

3021 | Learning internal representations by error propagation - Rumelhart, Hinton, et al. - 1986

Citation Context: ...o be uniquely defined in the latter dynamic case, initial conditions $x(t_{\mathrm{init}})$ need to be specified. A popular method for minimizing the error functional is steepest error descent (gradient descent) [1]-[6]: $\Delta p = -\eta\, \partial E / \partial p$ (3). Iteration of (3) leads asymptotically to a local minimum of $E(p)$, provided $\eta$ is strictly positive and small. The computation of the gradient is often cumbersome, esp...

438 | A Learning Algorithm for Continually Running Fully Recurrent Neural Networks - Williams, Zipser - 1989

Citation Context: ...-[5], with a computational complexity scaling as either $O(P)$ per epoch for an off-line method [2] (requiring history storage over the complete time interval of the error functional), or as $O(P^2)$ [3] and recently as $O(P^{3/2})$ [4]-[5] per epoch for an on-line method (with only most current history storage). The stochastic error-descent algorithm provides an on-line alternative with an $O(P)$ per ep...

170 | Learning state space trajectories in recurrent neural networks - Pearlmutter - 1989

Citation Context: ...teration of (3) leads asymptotically to a local minimum of $E(p)$, provided $\eta$ is strictly positive and small. The computation of the gradient is often cumbersome, especially for time-dependent problems [2]-[5], and is even ill-posed for analog hardware learning systems that unavoidably contain unknown process impurities. This calls for error descent methods avoiding calculation of the gradients but rat...

68 | 30 years of adaptive neural networks - Widrow, Lehr - 1990

Citation Context: ...uniquely defined in the latter dynamic case, initial conditions $x(t_{\mathrm{init}})$ need to be specified. A popular method for minimizing the error functional is steepest error descent (gradient descent) [1]-[6]: $\Delta p = -\eta\, \partial E / \partial p$ (3). Iteration of (3) leads asymptotically to a local minimum of $E(p)$, provided $\eta$ is strictly positive and small. The computation of the gradient is often cumbersome, especia...

65 | Weight perturbation: An optimal architecture and learning technique for analog VLSI feedforward and recurrent multilayer networks - Jabri, Flower - 1992

Citation Context: ...multi-perceptron network structure and requires access to internal nodes, are therefore excluded. Two typical methods which satisfy the above condition are illustrated below: Weight Perturbation [7], a simple sequential parameter perturbation technique. The method updates the individual parameters in sequence, by measuring the change in error resulting from a perturbation of a single parameter a...
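The sequential scheme described in this context can be sketched as follows (a hedged illustration with an invented quadratic error; the perturbation size and learning rate are arbitrary choices, not values from the cited paper):

```python
import numpy as np

def weight_perturbation_step(error, p, delta=1e-4, eta=0.1):
    """One sweep of sequential weight perturbation: estimate each partial
    derivative by perturbing a single parameter at a time, so one full
    gradient estimate costs P + 1 error evaluations."""
    grad = np.zeros_like(p)
    e0 = error(p)
    for i in range(len(p)):
        p_pert = p.copy()
        p_pert[i] += delta
        grad[i] = (error(p_pert) - e0) / delta  # finite-difference estimate
    return p - eta * grad

target = np.array([0.5, -1.5, 2.0])
error = lambda q: float(np.sum((q - target) ** 2))
p = np.zeros(3)
for _ in range(100):
    p = weight_perturbation_step(error, p)
```

The P + 1 evaluations per update are the "as many computation cycles as there are parameters" cost that the parallel stochastic scheme avoids.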

32 | A fixed size storage O(n^3) time complexity learning algorithm for fully recurrent continually running networks - Schmidhuber - 1992

Citation Context: ...tion of (3) leads asymptotically to a local minimum of $E(p)$, provided $\eta$ is strictly positive and small. The computation of the gradient is often cumbersome, especially for time-dependent problems [2]-[5], and is even ill-posed for analog hardware learning systems that unavoidably contain unknown process impurities. This calls for error descent methods avoiding calculation of the gradients but rather ...

25 | A VLSI Efficient Technique for Generating Multiple Uncorrelated Noise Sources and its Application to Stochastic Neural Networks - Alspector, Gannett, et al. - 1991

Citation Context: ...qual probability for both polarities, simplifying the multiply operations in the parameter updates. In addition, powerful techniques exist to generate large-scale streams of pseudo-random bits in VLSI [15]. 3 Numerical Simulations: For a test of the learning algorithm on time-dependent problems, we selected dynamic trajectory learning (a "Figure 8") as a representative example [2]. Several exact gradien...

24 | A Parallel Gradient Descent Method for Learning - Alspector, Meir, et al. - 1993

Citation Context: ...ve been suggested and analyzed by P. Baldi [12]. As a matter of coincidence, independent derivations of basically the same algorithm but from different approaches are presented in this volume as well [13],[14]. Rather than focussing on issues of originality, we proceed by analyzing the virtues and scaling properties of this method. We directly present the results below, and defer the formal derivation...

24 | Summed weight neuron perturbation: An O(n) improvement over weight perturbation - Flower, Jabri - 1993

Citation Context: ...en suggested and analyzed by P. Baldi [12]. As a matter of coincidence, independent derivations of basically the same algorithm but from different approaches are presented in this volume as well [13],[14]. Rather than focussing on issues of originality, we proceed by analyzing the virtues and scaling properties of this method. We directly present the results below, and defer the formal derivations to ...

18 | Model-free distributed learning - Dembo, Kailath - 1990

Citation Context: ...components of the gradient sequentially, which for a complete knowledge of the gradient requires as many computation cycles as there are parameters in the system. Model-Free Distributed Learning [8], which is based on the "M.I.T." rule in adaptive control [9]. Inspired by analog hardware, the distributed algorithm makes use of time-varying perturbation signals $\pi_i(t)$ supplied in parallel to the p...

15 | Learning a trajectory using adjoint functions and teacher forcing - Toomarian, Barhen - 1992

Citation Context: ...plexity scaling as either $O(P)$ per epoch for an off-line method [2] (requiring history storage over the complete time interval of the error functional), or as $O(P^2)$ [3] and recently as $O(P^{3/2})$ [4]-[5] per epoch for an on-line method (with only most current history storage). The stochastic error-descent algorithm provides an on-line alternative with an $O(P)$ per epoch complexity. As a consequen...

9 | Analog VLSI implementation of gradient descent - Kirk, Kerns, et al. - 1993

Citation Context: ...he perturbation and error signals prior to correlating them to construct the parameter increments. A complete demonstration of an analog VLSI system based on this approach is presented in this volume [11]. As a matter of fact, the modified noise-injection algorithm corresponds to a continuous-time version of the algorithm presented here, for networks and error functionals free of time-varying features...

2 | Using Noise Injection and Correlation in Analog Hardware to Estimate Gradients (submitted) - Anderson, Kerns - 1992

Citation Context: ...process. However, the individual fluctuations satisfy the following desirable regularity: An interesting noise-injection variant on the model-free distributed learning paradigm of [8], presented in [10], avoids the bias due to the offset level $E(p)$ as well, by differentiating the perturbation and error signals prior to correlating them to construct the parameter increments. A complete demonstration ...
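The role of the offset level mentioned in this context can be made concrete with a small numerical comparison. The quadratic error and perturbation statistics below are invented for illustration; the point is only that correlating the raw error with the perturbation is unbiased but far noisier than correlating an offset-free (differenced) error:

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0])
error = lambda q: float(np.sum((q - target) ** 2))

p = np.zeros(2)
sigma, n = 0.01, 20000
e0 = error(p)                        # offset level E(p)
raw, diff = [], []
for _ in range(n):
    pi = sigma * rng.choice([-1.0, 1.0], size=2)
    raw.append(error(p + pi) * pi / sigma**2)          # correlate raw error
    diff.append((error(p + pi) - e0) * pi / sigma**2)  # offset removed first
raw, diff = np.array(raw), np.array(diff)
true_grad = 2 * (p - target)         # analytic gradient of the toy quadratic

# Both estimators average to the true gradient, but the raw correlation
# carries the offset E(p) into its variance (per-sample std ~ E(p)/sigma).
```

This is the practical reason the differential measurement (or, in [10], differentiating the signals before correlating) matters: the mean is the same, the noise is not.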

1 | An Adaptive System for the Control of Aircraft and Spacecraft - Whitaker - 1959

1 | Learning in Dynamical Systems: Gradient Descent, Random Descent and Modular Approaches - Baldi - 1992

Citation Context: ...nforcement learning, and likely finds parallels in other fields as well. Random direction and line-search error-descent algorithms for trajectory learning have been suggested and analyzed by P. Baldi [12]. As a matter of coincidence, independent derivations of basically the same algorithm but from different approaches are presented in this volume as well [13],[14]. Rather than focussing on issues of o...