## Gradient calculation for dynamic recurrent neural networks: a survey (1995)

Venue: IEEE Transactions on Neural Networks

Citations: 132 (3 self)

### BibTeX

```bibtex
@ARTICLE{Pearlmutter95gradientcalculation,
  author  = {Barak A. Pearlmutter},
  title   = {Gradient calculation for dynamic recurrent neural networks: a survey},
  journal = {IEEE Transactions on Neural Networks},
  year    = {1995}
}
```

### Abstract

We survey learning algorithms for recurrent neural networks with hidden units, and put the various techniques into a common framework. We discuss fixedpoint learning algorithms, namely recurrent backpropagation and deterministic Boltzmann Machines, and non-fixedpoint algorithms, namely backpropagation through time, Elman's history cutoff, and Jordan's output feedback architecture. Forward propagation, an online technique that uses adjoint equations, and variations thereof, are also discussed. In many cases, the unified presentation leads to generalizations of various sorts. We discuss advantages and disadvantages of temporally continuous neural networks in contrast to clocked ones, continue with some "tricks of the trade" for training, using, and simulating continuous time and recurrent neural networks. We present some simulations, and at the end, address issues of computational complexity and learning speed.

### Citations

3531 | Optimization by simulated annealing
- Kirkpatrick, Gelatt, et al.
- 1983
Citation Context: ...onstraint satisfaction task of the sort that neural networks are sometimes applied to, such as the traveling salesman problem [149]. Two competing techniques for such problems are simulated annealing [150], [58] and mean field theory [92]. By providing a network with a noise source which can be modulated (by second order connections, say) we could see if the learning algorithm constructs a network that...

2723 | Learning internal representations by error propagation
- Rumelhart, Hinton, et al.
- 1986
Citation Context: ...k at hand, such as symmetries or replicated structure [56], [57], and training procedures capable of exploiting hidden units, such as the Boltzmann machine learning procedure [58] and backpropagation [59], [60], [61], [62], are behind much of the current excitement in the neural network field [63]. Also, training algorithms that do not operate with hidden units, such as the Widrow-Hoff LMS procedure [...

2112 | A New Approach to Linear Filtering and Prediction Problems
- Kalman
- 1960
Citation Context: ...n, and whose target output is the first derivative of this signal to be learned. F. Teacher Forcing, RTRL, and the Kalman Filter [125], [126] have pointed out that RTRL is related to a version of the [127] filter, in the extension that allows it to apply to nonlinear systems, namely the extended Kalman filter (EKF) [128], [107], [129]. The EKF has time and space complexity of the same order as those of...

1542 | Finding structure in time
- Elman
- 1990
Citation Context: ...are not of interest. This technique, which explicitly modulates the behavior we only measured above, has not yet been applied in a control domain. VI. Other Non-fixedpoint Techniques A. "Elman Nets" [118] considers a version of backpropagation through time in discrete time in which the temporal history is cut off. Typically, only one or two timesteps are preserved, at the discretion of the architect...
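The Elman-style history cutoff described in this context (context units hold a copy of the previous hidden state, and gradients are not propagated through them) has a very simple forward pass. A minimal illustrative sketch, with made-up weights, not code from the surveyed papers:

```python
import math

def elman_step(W_xh, W_ch, h_prev, x):
    """One step of an Elman net: the context units carry a copy of the
    previous hidden state; during learning, gradients are cut off there
    (the context is treated as a constant input)."""
    n = len(h_prev)
    return [math.tanh(sum(W_xh[i][j] * x[j] for j in range(len(x))) +
                      sum(W_ch[i][j] * h_prev[j] for j in range(n)))
            for i in range(n)]

# Unroll a short input sequence; each step sees only one-step-old context.
W_xh = [[0.5], [-0.3]]            # input-to-hidden weights (illustrative)
W_ch = [[0.1, 0.2], [0.4, -0.1]]  # context-to-hidden weights (illustrative)
h = [0.0, 0.0]
states = []
for x in [0.2, -0.1, 0.6]:
    h = elman_step(W_xh, W_ch, h, [x])
    states.append(h)
```

Because backpropagation stops at the context units, each weight update is no more expensive than in a feedforward net, at the cost of ignoring dependencies more than one step back.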

533 | Adaptive switching circuits
- Widrow, Hoff
- 1960
Citation Context: ...], [60], [61], [62], are behind much of the current excitement in the neural network field [63]. Also, training algorithms that do not operate with hidden units, such as the Widrow-Hoff LMS procedure [64], can be used to train recurrent networks without hidden units, so recurrent networks without hidden units reduce to nonrecurrent networks without hidden units, and therefore do not need special learn...
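The Widrow-Hoff LMS procedure mentioned here is the delta rule w ← w + η(d − w·x)x on a linear unit. A minimal sketch fitting a known linear target (learning rate, epochs, and data are illustrative choices, not from the cited paper):

```python
import random

def lms_train(samples, eta=0.05, epochs=50):
    """Widrow-Hoff LMS: w <- w + eta * (d - w.x) * x for each (x, d) pair."""
    n = len(samples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, d in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))  # linear output
            err = d - y                               # instantaneous error
            w = [wi + eta * err * xi for wi, xi in zip(w, x)]
    return w

# Recover a known linear map d = 2*x0 - x1 from noiseless samples.
random.seed(0)
xs = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(100)]
samples = [(x, 2 * x[0] - x[1]) for x in xs]
w = lms_train(samples)
```

With no hidden units there is nothing recurrent to unfold, which is the point the context makes: such networks reduce to the nonrecurrent case and need no special learning machinery.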

509 | Beyond Regression: New Tools for Prediction and Analysis
- Werbos
Citation Context: ...and, such as symmetries or replicated structure [56], [57], and training procedures capable of exploiting hidden units, such as the Boltzmann machine learning procedure [58] and backpropagation [59], [60], [61], [62], are behind much of the current excitement in the neural network field [63]. Also, training algorithms that do not operate with hidden units, such as the Widrow-Hoff LMS procedure [64], c...

506 | Learning regular sets from queries and counterexamples
- Angluin
- 1987
Citation Context: ... [44], [45], [46], [47], [48], [49], typically involves recurrent neural networks as components of more complex systems, and also at present is inferior in practice to discrete algorithmic techniques [50], [51]. Grammar learning is therefore beyond our scope here. Similarly, learning of multiscale phenomena, which again typically consists of larger systems containing recurrent networks as components [...

483 | Neural computation of decisions in optimization problems
- Hopfield, Tank
- 1985
Citation Context: ...er hand, we can turn the logic of section V around. Consider a difficult constraint satisfaction task of the sort that neural networks are sometimes applied to, such as the traveling salesman problem [149]. Two competing techniques for such problems are simulated annealing [150], [58] and mean field theory [92]. By providing a network with a noise source which can be modulated (by second order connecti...

457 | Identification and control of dynamical systems using neural networks (1990)
- Narendra, Parthasarathy
Citation Context: ...paper is concerned with learning algorithms for recurrent networks themselves, and not with recurrent networks as elements of larger systems, such as specialized architectures for control [36], [37], [38], [39]. Also, since we are concerned with learning, we will not discuss the computational power of recurrent networks considered as abstract machines [40], [41], [42]. Although we consider techniques ...

431 | A learning algorithm for Boltzmann machines
- Ackley, Hinton, et al.
- 1985
Citation Context: ...t regularities of the task at hand, such as symmetries or replicated structure [56], [57], and training procedures capable of exploiting hidden units, such as the Boltzmann machine learning procedure [58] and backpropagation [59], [60], [61], [62], are behind much of the current excitement in the neural network field [63]. Also, training algorithms that do not operate with hidden units, such as the Wi...

416 | A learning algorithm for continually running fully recurrent neural networks
- Williams, Zipser
Citation Context: ...s of functions of the states of a dynamic system with respect to that system's internal parameters has been discovered and applied to recurrent neural networks a number of times [100], [101], [102], [103]; for reviews see also [81], [76], [104]. It is called by various researchers forward propagation, forward perturbation, or real time recurrent learning, RTRL. Like BPTT, the technique was known and a...
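The forward propagation / RTRL idea this context describes — carrying sensitivities ∂y_k/∂w_ij forward in time alongside the state — can be sketched directly. A pure-Python toy (discrete time, logistic units; the network size and data are made up for illustration, and the gradient is checked against finite differences below):

```python
import math, random

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

def run_rtrl(W, xs, ds):
    """Run y(t+1) = sigma(W y(t) + x(t)) forward while propagating the RTRL
    sensitivities p[k][i][j] = dy_k/dw_ij. Returns the summed squared error
    E and dE/dW, accumulated online as the trajectory unfolds."""
    n = len(W)
    y = [0.0] * n
    p = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    E = 0.0
    grad = [[0.0] * n for _ in range(n)]
    for x, d in zip(xs, ds):
        s = [sum(W[k][l] * y[l] for l in range(n)) + x[k] for k in range(n)]
        y_new = [sigma(sk) for sk in s]
        p_new = [[[0.0] * n for _ in range(n)] for _ in range(n)]
        for k in range(n):
            sp = y_new[k] * (1.0 - y_new[k])  # sigma'(s_k)
            for i in range(n):
                for j in range(n):
                    acc = sum(W[k][l] * p[l][i][j] for l in range(n))
                    if k == i:
                        acc += y[j]           # direct dependence of s_i on w_ij
                    p_new[k][i][j] = sp * acc
        y, p = y_new, p_new
        for k in range(n):                    # accumulate the gradient online
            e = y[k] - d[k]
            E += 0.5 * e * e
            for i in range(n):
                for j in range(n):
                    grad[i][j] += e * p[k][i][j]
    return E, grad

random.seed(1)
n, T = 3, 5
W = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
xs = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(T)]
ds = [[random.uniform(0, 1) for _ in range(n)] for _ in range(T)]
E, grad = run_rtrl(W, xs, ds)
```

Note the n³ sensitivity array updated at O(n⁴) cost per timestep: this is the space/time burden that motivates the BPTT-based alternatives discussed elsewhere on this page.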

339 | Increased rates of convergence through learning rate adaptation
- Jacobs
- 1988
Citation Context: ...essian of the error with respect to the weights (the matrix of second derivatives) tends to have a wide eigenvalue spread. One technique that has proven useful in this particular situation is that of [134] which was applied by Fang and Sejnowski to the single figure eight problem perturbed in figure 9 with great success by [135]. For a modern variant of this technique which is suitable to online patter...
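The learning rate adaptation scheme of [134] (delta-bar-delta) keeps a separate step size per weight, growing it additively while the current gradient agrees in sign with a trace of past gradients, and shrinking it multiplicatively on disagreement. A minimal sketch on a deliberately ill-conditioned quadratic; all constants are illustrative, not taken from the cited paper:

```python
def delta_bar_delta(grad_fn, w, eta0=0.005, kappa=0.01, phi=0.5,
                    theta=0.7, steps=500):
    """Per-weight learning rates: increase eta[i] by kappa when the gradient
    agrees in sign with the trace dbar[i], multiply by (1 - phi) when it
    disagrees; dbar is an exponential average of past gradients."""
    eta = [eta0] * len(w)
    dbar = [0.0] * len(w)
    for _ in range(steps):
        g = grad_fn(w)
        for i in range(len(w)):
            if g[i] * dbar[i] > 0:
                eta[i] += kappa
            elif g[i] * dbar[i] < 0:
                eta[i] *= 1 - phi
            dbar[i] = (1 - theta) * g[i] + theta * dbar[i]
            w[i] -= eta[i] * g[i]
    return w

# Quadratic error surface with a 10:1 eigenvalue spread -- the wide
# Hessian eigenvalue spread the context describes.
grad = lambda w: [10.0 * w[0], w[1]]
w = delta_bar_delta(grad, [1.0, 1.0])
```

The stiff direction hits its stability limit early and has its rate cut back, while the shallow direction keeps growing its rate, which is exactly what a single global learning rate cannot do.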

304 | Backpropagation applied to handwritten zip code recognition
- LeCun, Boser, et al.
- 1989
Citation Context: ...processed inputs such as tapped delay lines, and various other architectural embellishments [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [10], [32], [33], [34], [35]. For this reason, if one is interested in solving a particular problem, it would be only prudent to try a variety of non-recurrent architectures before resorting to the more powerful and general recu...

272 | Parallel Distributed Processing
- Rumelhart, McClelland
- 1986
Citation Context: ...r exploring recurrent architectures is their potential for dealing with two sorts of temporal behavior. First, recurrent networks are capable of settling to a solution that satisfies many constraints [1], as in a vision system which relaxes to an interpretation of an image which maximally satisfies a complex set of conflicting constraints [2], [3], [4], [5], [6], a system which relaxes to find a post...

270 | Backpropagation through time: What it does and how to do it
- Werbos
- 1990
Citation Context: ...domain. We will consider two major gradient calculation techniques, and then a few more derived from them. The first is the obvious extension of backpropagation through time (BPTT) to continuous time [95], [96], [62]. A. Backpropagation Through Time The fixedpoint learning procedures discussed above are unable to learn non-fixedpoint attractors, or to produce desired temporal behavior over a bounded i...

265 | Cooperative computation of stereo disparity
- Marr, Poggio
- 1976
Citation Context: ... of settling to a solution that satisfies many constraints [1], as in a vision system which relaxes to an interpretation of an image which maximally satisfies a complex set of conflicting constraints [2], [3], [4], [5], [6], a system which relaxes to find a posture for a robot satisfying many criteria [7], and models of language parsing [8]. Although algorithms suitable for building systems of this t...

246 | Absolute stability of global pattern formation and parallel memory storage by competitive neural networks
- Cohen, Grossberg
- 1983
Citation Context: ...mmetry ($w_{ij} = w_{ji}$, $w_{ii} = 0$) guarantee that the Lyapunov function $L = -\sum_{i,j} w_{ij} y_i y_j + \sum_i (y_i \log y_i + (1 - y_i) \log(1 - y_i))$ (5) decreases until a fixedpoint is reached [83]. This weight symmetry condition arises naturally if weights are considered to be Bayesian constraints, as in Boltzmann Machines [84]. • A unique fixedpoint is reached regardless of initial conditio...
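The claim in this context — with symmetric weights and zero self-connections the energy decreases along trajectories — is easy to check numerically. An illustrative simulation of Euler-integrated mean-field dynamics du/dt = −u + W·σ(u); the ½ on the quadratic term is the usual convention when the sum runs over all ordered pairs (i, j), and the network size, step size, and weights are all made up here:

```python
import math, random

def sigma(u):
    return 1.0 / (1.0 + math.exp(-u))

def energy(W, y):
    """Hopfield/mean-field Lyapunov function: quadratic term plus the
    entropy term whose derivative is the inverse sigmoid."""
    n = len(y)
    quad = -0.5 * sum(W[i][j] * y[i] * y[j] for i in range(n) for j in range(n))
    ent = sum(yi * math.log(yi) + (1 - yi) * math.log(1 - yi) for yi in y)
    return quad + ent

random.seed(0)
n, dt = 4, 0.01
W = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        W[i][j] = W[j][i] = random.uniform(-1, 1)  # symmetric, zero diagonal

u = [random.uniform(-1, 1) for _ in range(n)]
energies = []
for _ in range(500):
    y = [sigma(ui) for ui in u]
    energies.append(energy(W, y))
    du = [-u[i] + sum(W[i][j] * y[j] for j in range(n)) for i in range(n)]
    u = [u[i] + dt * du[i] for i in range(n)]   # Euler step of du/dt = -u + Wy
```

Breaking the symmetry (or adding self-connections) removes the guarantee, which is why the snippet singles out those two conditions.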

246 | Attractor dynamics and parallelism in a connectionist sequential machine
- Jordan
Citation Context: ...rted by Williams and Zipser for some cases where oscillations trained with teacher forcing exhibited radically and systematically lower frequency and amplitude when running free [123]. E. Jordan Nets [124] used a backpropagation network with the outputs clocked back to the inputs to generate temporal sequences. Although these networks were used long before teacher forcing, from our perspective Jordan n...
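Teacher forcing, in the Jordan-style output-feedback setting this context describes, replaces the fed-back output with the desired output during training. A toy single-unit sketch (the weights and targets are arbitrary illustrative numbers, not from the cited work):

```python
import math

def step(w_in, w_rec, fb, h):
    """One step of a toy Jordan-style net: the previous output arrives
    as an input via the feedback connection."""
    h = math.tanh(w_in * fb + w_rec * h)
    return h, h  # (new hidden state, output)

def rollout(targets, w_in=0.5, w_rec=0.3, teacher_forcing=True):
    h, fb, outputs = 0.0, 0.0, []
    for d in targets:
        h, y = step(w_in, w_rec, fb, h)
        outputs.append(y)
        # Teacher forcing: feed back the *desired* output rather than the
        # network's own (possibly wrong) output.
        fb = d if teacher_forcing else y
    return outputs
```

The divergence between forced and free-running rollouts is exactly the failure mode reported by Williams and Zipser above: a network that looks accurate under forcing can behave quite differently when fed its own outputs.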

180 | Analysis of hidden units in a layered network trained to classify sonar targets
- Gorman, Sejnowski
- 1988
Citation Context: ...or their solution turn out to be solvable with feedforward architectures, sometimes augmented with preprocessed inputs such as tapped delay lines, and various other architectural embellishments [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [10], [32], [33], [34], [35]. For this reason, if one is interested in solving a particular problem, it would be only prudent ...

173 | Learning distributed representations of concepts
- Hinton
- 1986
Citation Context: ...e virtues of hidden units and internal representations. Hidden units make it possible for networks to discover and exploit regularities of the task at hand, such as symmetries or replicated structure [56], [57], and training procedures capable of exploiting hidden units, such as the Boltzmann machine learning procedure [58] and backpropagation [59], [60], [61], [62], are behind much of the current exc...

170 | A time-delay neural network architecture for isolated word recognition
- Lang, Waibel, et al.
- 1990
Citation Context: ...ir solution turn out to be solvable with feedforward architectures, sometimes augmented with preprocessed inputs such as tapped delay lines, and various other architectural embellishments [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [10], [32], [33], [34], [35]. For this reason, if one is interested in solving a particular problem, it would be only prudent to try...

165 | Nonlinear signal processing using neural networks: prediction and system modeling
- Lapedes, Farber
Citation Context: ... computational role in the brain [68], [69], there are no specialized training procedures for chaotic attractors in networks with hidden units. However, Crutchfield et al. [70] and Lapedes and Farber [71] have had success with the identification of chaotic systems using models without hidden state, and there is no reason to believe that learning the dynamics of chaotic systems is more difficult than l...

164 | How brains make chaos in order to make sense of the world
- Skarda, Freeman
- 1987
Citation Context: ...scuss specialized non-gradient methods for learning limit cycle attractors, such as [66], [67]. Although it has been theorized that chaotic dynamics play a significant computational role in the brain [68], [69], there are no specialized training procedures for chaotic attractors in networks with hidden units. However, Crutchfield et al. [70] and Lapedes and Farber [71] have had success with the identi...

164 | Learning state space trajectories in recurrent neural networks
- Pearlmutter
- 1989
Citation Context: .... We will consider two major gradient calculation techniques, and then a few more derived from them. The first is the obvious extension of backpropagation through time (BPTT) to continuous time [95], [96], [62]. A. Backpropagation Through Time The fixedpoint learning procedures discussed above are unable to learn non-fixedpoint attractors, or to produce desired temporal behavior over a bounded interva...

149 | Finite state automata and simple recurrent networks
- Cleeremans, Servan-Schreiber, et al.
- 1989
Citation Context: ...], [41], [42]. Although we consider techniques for trajectory learning, we will not review practical applications thereof. In particular, grammar learning, although intriguing and progressing rapidly [43], [44], [45], [46], [47], [48], [49], typically involves recurrent neural networks as components of more complex systems, and also at present is inferior in practice to discrete algorithmic techniques...

144 | A mean field theory learning algorithm for neural networks
- Peterson, Anderson
- 1987
Citation Context: ...l of the distribution of surfaces to be encountered, as is usual. D. Deterministic Boltzmann Machines The Mean Field form of the stochastic Boltzmann Machine learning rule, or MFT Boltzmann Machines, [91] have been shown to descend an error functional [74]. Stochastic Boltzmann Machines themselves [58] are beyond our scope here; instead, we give only the probabilistic interpretation of MFT Boltzmann M...

137 | Generalization by weight-elimination with application to forecasting
- Weigend, Huberman, et al.
- 1991
Citation Context: ...ward architectures, sometimes augmented with preprocessed inputs such as tapped delay lines, and various other architectural embellishments [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [10], [32], [33], [34], [35]. For this reason, if one is interested in solving a particular problem, it would be only prudent to try a variety of non-recurrent architectures before...

137 | Generalization of back-propagation to recurrent neural networks
- Pineda
- 1987
Citation Context: ...ems is more difficult than learning the dynamics of non-chaotic ones. Special learning algorithms are available for various restricted cases. There are fixedpoint learning algorithms (for details see [72], [73], [74], [75], or for a survey see [76])¹ Typically $\sigma(\xi) = (1 + e^{-\xi})^{-1}$, in which case $\sigma'(\xi) = \sigma(\xi)(1 - \sigma(\xi))$, or the scaled $\sigma(\xi) = \tanh(\xi)$, in which case $\sigma'(\xi)$...
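The derivative identities in the footnote are easy to verify numerically; a quick illustrative check against central finite differences:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def check(f, fprime, xs, eps=1e-6):
    """Compare a claimed derivative against central finite differences."""
    return all(abs((f(x + eps) - f(x - eps)) / (2 * eps) - fprime(x)) < 1e-8
               for x in xs)

xs = [-2.0, -0.5, 0.0, 0.7, 3.0]
# sigma'(x) = sigma(x)(1 - sigma(x)) for the logistic squashing function
ok_logistic = check(logistic, lambda x: logistic(x) * (1 - logistic(x)), xs)
# For the scaled sigma = tanh: tanh'(x) = (1 + tanh(x))(1 - tanh(x)) = 1 - tanh(x)^2
ok_tanh = check(math.tanh, lambda x: (1 + math.tanh(x)) * (1 - math.tanh(x)), xs)
```

These closed-form derivatives are what make the backward (adjoint) passes in the surveyed algorithms cheap: no extra function evaluations are needed beyond the forward activations.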

117 | Neural Networks for Control
- Miller, Sutton, et al.
- 1990
Citation Context: ...is concerned with learning algorithms for recurrent networks themselves, and not with recurrent networks as elements of larger systems, such as specialized architectures for control [36], [37], [38], [39]. Also, since we are concerned with learning, we will not discuss the computational power of recurrent networks considered as abstract machines [40], [41], [42]. Although we consider techniques for tr...

116 | Gradient-based learning algorithms for recurrent networks and their computational complexity
- Williams, Zipser
- 1995
Citation Context: ...m + nm) time per step, on average, and O(nm + sm) space. Choosing s = n makes this O(nm) time and O(nm) space, which dominates RTRL. This technique has been discovered independently a number of times [110], [111]. Finally, one can note that, although the forward equations for y are nonlinear, and therefore require numeric integration, the backwards equations for z in BPTT are linear. Since the dE/dw te...

115 | An efficient gradient-based algorithm for on-line training of recurrent network trajectories
- Williams, Peng
- 1990
Citation Context: ...relate the states of units in one module to weights in another, has been explored by Zipser [108]. Another is to use BPTT with a history cutoff of k units of time, termed BPTT(k) by Williams and Peng [109], and make a small weight change each timestep. This obviates the need for epochs, resulting in a purely online technique, and is probably the best technique for most practical problems. A third is to...
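BPTT(k) — backpropagating each error at most k steps into the past — can be sketched for a single-unit tanh recurrence. The data and weight below are toy values; with k equal to the sequence length the computation reduces to exact BPTT, which the test below confirms against finite differences:

```python
import math

def bptt_k(w, xs, ds, k):
    """Gradient of E = 0.5 * sum_t (y_t - d_t)^2 for y_t = tanh(w*y_{t-1} + x_t),
    propagating each error at most k steps into the past (BPTT(k))."""
    ys, y = [], 0.0
    for x in xs:                              # forward pass
        y = math.tanh(w * y + x)
        ys.append(y)
    grad = 0.0
    for t in range(len(xs)):                  # truncated backward pass per error
        delta = ys[t] - ds[t]                 # dE_t/dy_t
        for s in range(t, max(t - k, -1), -1):
            local = 1.0 - ys[s] * ys[s]       # tanh'(net_s)
            y_prev = ys[s - 1] if s > 0 else 0.0
            grad += delta * local * y_prev    # dE_t/dw through step s
            delta = delta * local * w         # push the error one step back
    return grad

xs = [0.5, -0.3, 0.8, 0.1]
ds = [0.2, 0.1, -0.1, 0.3]
g_full = bptt_k(0.7, xs, ds, len(xs))   # exact gradient (k = sequence length)
g_trunc = bptt_k(0.7, xs, ds, 1)        # cheap online approximation
```

Making a small weight change every timestep with the truncated gradient gives the purely online scheme the context describes, trading a biased gradient for constant per-step cost.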

104 | Massively parallel parsing: a strongly interactive model of natural language interpretation
- Waltz, Pollack
- 1985
Citation Context: ...maximally satisfies a complex set of conflicting constraints [2], [3], [4], [5], [6], a system which relaxes to find a posture for a robot satisfying many criteria [7], and models of language parsing [8]. Although algorithms suitable for building systems of this type are reviewed to some extent below, such as the algorithm used in [9], the bulk of this paper is concerned with the problem of causing n...

92 | Optimal perceptual inference
- Hinton, Sejnowski
- 1983
Citation Context: ...$\log(1 - y_i))$ (5) decreases until a fixedpoint is reached [83]. This weight symmetry condition arises naturally if weights are considered to be Bayesian constraints, as in Boltzmann Machines [84]. • A unique fixedpoint is reached regardless of initial conditions if $\sum_{ij} w_{ij}^2 < \max(\sigma')$ where $\max(\sigma')$ is the maximal value of $\sigma'(x)$ for any $x$ [85], but in practice much weaker bounds o...

87 | Analog computation via neural networks
- Siegelmann, Sontag
- 1994
Citation Context: ...itectures for control [36], [37], [38], [39]. Also, since we are concerned with learning, we will not discuss the computational power of recurrent networks considered as abstract machines [40], [41], [42]. Although we consider techniques for trajectory learning, we will not review practical applications thereof. In particular, grammar learning, although intriguing and progressing rapidly [43], [44], [...

84 | Simulation of chaotic EEG patterns with a dynamic model of the olfactory system
- Freeman
- 1987
Citation Context: ...specialized non-gradient methods for learning limit cycle attractors, such as [66], [67]. Although it has been theorized that chaotic dynamics play a significant computational role in the brain [68], [69], there are no specialized training procedures for chaotic attractors in networks with hidden units. However, Crutchfield et al. [70] and Lapedes and Farber [71] have had success with the identificati...

83 | Stationary and nonstationary learning characteristics of the LMS adaptive filter
- Widrow, McCool, et al.
- 1976
Citation Context: ...he techniques used to analyze the limitations of convergence under various conditions in systems of this sort, and of some other techniques for accelerating their convergence; see [139, page 304] and [140], [141], [142], [143], [144], [145], [146], [147], [148]. C. Prospects and Future Work Control domains are the most natural application for continuous time recurrent networks, but signal processing and...

81 | Generalization of backpropagation with application to a recurrent gas market model
- Werbos
- 1988
Citation Context: ...all change to $y_i$ at time t affects E if everything else is left unchanged. As usual in backpropagation, let us define $\tilde z_i(t) = \partial^+ E / \partial \tilde y_i(t)$ (20) where the $\partial^+$ denotes the ordered derivative of [97], with variables ordered here by time and not unit index. Intuitively, $\tilde z_i(t)$ measures how much a small change to $\tilde y_i$ at time t affects E when this change is propagated forward through time and i...

76 | A learning rule for asynchronous perceptrons with feedback in a combinatorial environment
- Almeida
- 1987
Citation Context: ... more difficult than learning the dynamics of non-chaotic ones. Special learning algorithms are available for various restricted cases. There are fixedpoint learning algorithms (for details see [72], [73], [74], [75], or for a survey see [76])¹ Typically $\sigma(\xi) = (1 + e^{-\xi})^{-1}$, in which case $\sigma'(\xi) = \sigma(\xi)(1 - \sigma(\xi))$, or the scaled $\sigma(\xi) = \tanh(\xi)$, in which case $\sigma'(\xi) = (1 + \sigma(\xi))(1 - \sigma(\xi))$...

75 | Adapting bias by gradient descent: an incremental version of delta-bar-delta
- Sutton
- 1992
Citation Context: ...and Sejnowski to the single figure eight problem perturbed in figure 9 with great success by [135]. For a modern variant of this technique which is suitable to online pattern presentation, see [136], [137], [138]. Since the acceleration of convergence in these gradient systems is such an important issue, it can be helpful to know some of the techniques used to analyze the limitations of convergence und...

74 | FIR and IIR synapses, a new neural network architecture for time series modeling
- Back, Tsoi
- 1991
Citation Context: ...ctures, sometimes augmented with preprocessed inputs such as tapped delay lines, and various other architectural embellishments [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [10], [32], [33], [34], [35]. For this reason, if one is interested in solving a particular problem, it would be only prudent to try a variety of non-recurrent architectures before resorting t...

69 | Progress in supervised neural networks
- Hush, Horne
- 1993
Citation Context: ...augmented with preprocessed inputs such as tapped delay lines, and various other architectural embellishments [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [10], [32], [33], [34], [35]. For this reason, if one is interested in solving a particular problem, it would be only prudent to try a variety of non-recurrent architectures before resorting to the more powerfu...

69 | A focused backpropagation algorithm for temporal pattern recognition
- Mozer
- 1989
Citation Context: ...rward networks, but are applicable only when w is upper-triangular but not necessarily zero-diagonal, in other words, when the network is feedforward except for recurrent self-connections [77], [78], [79], [80], [25] or for a survey, [81]. Later, we will describe a number of training procedures that, for a price in space or time, do not rely on such restrictions and can be applied to training networks...

67 | Random DFA’s can be approximately learned from sparse uniform examples
- Lang
- 1992
Citation Context: ... [45], [46], [47], [48], [49], typically involves recurrent neural networks as components of more complex systems, and also at present is inferior in practice to discrete algorithmic techniques [50], [51]. Grammar learning is therefore beyond our scope here. Similarly, learning of multiscale phenomena, which again typically consists of larger systems containing recurrent networks as components [52], [...

67 | Deterministic Boltzmann Learning Performs Steepest Descent
- Hinton
- 1989
Citation Context: ...difficult than learning the dynamics of non-chaotic ones. Special learning algorithms are available for various restricted cases. There are fixedpoint learning algorithms (for details see [72], [73], [74], [75], or for a survey see [76])¹ Typically $\sigma(\xi) = (1 + e^{-\xi})^{-1}$, in which case $\sigma'(\xi) = \sigma(\xi)(1 - \sigma(\xi))$, or the scaled $\sigma(\xi) = \tanh(\xi)$, in which case $\sigma'(\xi) = (1 + \sigma(\xi))(1 - \sigma(\xi))$...

66 | Learning to Control an Unstable System with Forward Modeling
- Jordan, Jacobs
- 1990
Citation Context: ... This paper is concerned with learning algorithms for recurrent networks themselves, and not with recurrent networks as elements of larger systems, such as specialized architectures for control [36], [37], [38], [39]. Also, since we are concerned with learning, we will not discuss the computational power of recurrent networks considered as abstract machines [40], [41], [42]. Although we consider techn...

64 | Gradient Methods for the Optimization of Dynamical Systems Containing Neural Networks
- Narendra, Parthasarathy
- 1991
Citation Context: ...c system with respect to that system's internal parameters has been discovered and applied to recurrent neural networks a number of times [100], [101], [102], [103]; for reviews see also [81], [76], [104]. It is called by various researchers forward propagation, forward perturbation, or real time recurrent learning, RTRL. Like BPTT, the technique was known and applied to other sorts of systems since t...

58 | Learning complex, extended sequences using the principle of history compression
- Schmidhuber
- 1992
Citation Context: ...mmar learning is therefore beyond our scope here. Similarly, learning of multiscale phenomena, which again typically consists of larger systems containing recurrent networks as components [52], [53], [54], [55], will not be discussed. B. Why Hidden Units We will restrict our attention to training procedures for networks which may include hidden units, units which have no particular desired behavior an...

55 | Induction of multiscale temporal structure
- Mozer
- 1992
Citation Context: ...]. Grammar learning is therefore beyond our scope here. Similarly, learning of multiscale phenomena, which again typically consists of larger systems containing recurrent networks as components [52], [53], [54], [55], will not be discussed. B. Why Hidden Units We will restrict our attention to training procedures for networks which may include hidden units, units which have no particular desired behav...

54 | Learning algorithms for connectionist networks: applied gradient methods of nonlinear optimization
- Watrous
- 1987
Citation Context: ...e the limitations of convergence under various conditions in systems of this sort, and of some other techniques for accelerating their convergence; see [139, page 304] and [140], [141], [142], [143], [144], [145], [146], [147], [148]. C. Prospects and Future Work Control domains are the most natural application for continuous time recurrent networks, but signal processing and speech generation (and reco...

49 | Une procédure d'apprentissage pour réseau à seuil asymétrique
- LeCun
- 1985
Citation Context: ...ere uniform in ±0.1, and in the figure eight network (right) in ±0.05. ...ties are equal, but find that breaking this symmetry allows these nets to learn the task. B. The Moving Targets Method [120], [121], [122] propose a moving targets learning algorithm. Such an algorithm maintains a target value for each hidden unit at each point in time. These target values are typically initialized either ...