Results 1–10 of 25
Long Short-Term Memory
, 1995
Abstract

Cited by 387 (58 self)
"Recurrent backprop" for learning to store information over extended time intervals takes too long. The main reason is insufficient, decaying error back flow. We briefly review Hochreiter's 1991 analysis of this problem. Then we overcome it by introducing a novel, efficient method called "Long Short-Term Memory" (LSTM). LSTM can learn to bridge minimal time lags in excess of 1000 time steps by enforcing constant error flow through internal states of special units. Multiplicative gate units learn to open and close access to constant error flow. LSTM's update
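The abstract's core mechanism — multiplicative gates controlling access to an additively updated internal state — can be sketched as a single cell step. This is an illustrative modern-style layout (including a forget gate, which later LSTM variants added), not the paper's exact 1995 parameterisation; the weight names are assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h, c, W):
    """One step of a minimal LSTM cell (illustrative sketch).

    Gates modulate access to the internal state c, whose additive
    update is what gives near-constant error flow through time.
    W is a dict of weight matrices; this layout (and the forget
    gate) is an assumption, not the original 1995 formulation.
    """
    z = np.concatenate([x, h])       # combined input and previous output
    i = sigmoid(W["i"] @ z)          # input gate: open/close write access
    f = sigmoid(W["f"] @ z)          # forget gate (a later LSTM addition)
    o = sigmoid(W["o"] @ z)          # output gate: open/close read access
    g = np.tanh(W["g"] @ z)          # candidate cell input
    c = f * c + i * g                # additive state update ("constant error carousel")
    h = o * np.tanh(c)               # gated output
    return h, c
```

Because the state update is additive rather than squashed through a nonlinearity at every step, the backward error signal through `c` is not forced to decay the way it is in a plain recurrent net.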
Gradient-Based Learning Algorithms for Recurrent Networks and Their Computational Complexity
, 1995
Abstract

Cited by 150 (4 self)
Introduction. 1.1 Learning in Recurrent Networks. Connectionist networks having feedback connections are interesting for a number of reasons. Biological neural networks are highly recurrently connected, and many authors have studied recurrent network models of various types of perceptual and memory processes. The general property making such networks interesting and potentially useful is that they manifest highly nonlinear dynamical behavior. One such type of dynamical behavior that has received much attention is that of settling to a fixed stable state, but probably of greater importance both biologically and from an engineering viewpoint are time-varying behaviors. Here we consider algorithms for training recurrent networks to perform temporal supervised learning tasks, in which the specification of desired behavior is in the form of specific examples of input and desired output trajectories. One example of such a task is sequence classification, where
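The sequence-classification task the introduction mentions can be made concrete with a minimal recurrent forward pass: the hidden state feeds back on itself, and a class score is read from the final state. This is a generic sketch with invented parameter names, not code from the paper.

```python
import numpy as np

def rnn_classify(seq, Wxh, Whh, Why):
    """Forward pass of a simple recurrent net for sequence
    classification (illustrative sketch; parameter names are
    assumptions). The feedback connection Whh makes the hidden
    state h depend on the entire input history."""
    h = np.zeros(Whh.shape[0])
    for x in seq:
        h = np.tanh(Wxh @ x + Whh @ h)   # recurrent update
    return Why @ h                        # class logits from the final state
```

Training such a net means computing the gradient of a loss on these logits with respect to all three weight matrices through time — which is exactly what algorithms like BPTT and RTRL, surveyed in this paper, provide.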
New Results on Recurrent Network Training: Unifying the Algorithms and Accelerating Convergence
 IEEE TRANS. NEURAL NETWORKS
, 2000
Abstract

Cited by 62 (3 self)
How to efficiently train recurrent networks remains a challenging and active research topic. Most of the proposed training approaches are based on computational ways to efficiently obtain the gradient of the error function, and can be generally grouped into five major groups. In this study we present a derivation that unifies these approaches. We demonstrate that the approaches are only five different ways of solving a particular matrix equation. The second goal of this paper is to develop a new algorithm based on the insights gained from the novel formulation. The new algorithm, which is based on approximating the error gradient, has lower computational complexity in computing the weight update than the competing techniques for most typical problems. In addition, it reaches the error minimum in a much smaller number of iterations. A desirable characteristic of recurrent network training algorithms is to be able to update the weights in an online fashion. We have also developed an online version of the proposed algorithm, which is based on updating the error gradient approximation in a recursive manner.
Extracting Regularities in Space and Time Through a Cascade of Prediction Networks: The Case of a Mobile Robot Navigating in a Structured Environment
, 1999
Abstract

Cited by 42 (8 self)
We propose that the ability to extract regularities from time series through prediction learning can be enhanced if we use a hierarchical architecture in which higher layers are trained to predict the internal state of lower layers when such states change significantly. This hierarchical organization has two functions: (a) it forces the system to progressively recode sensory information so as to enhance useful regularities and filter out useless information; (b) it progressively reduces the length of the sequences which should be predicted going from lower to higher layers. This, in turn, allows higher levels to extract higher level regularities which are hidden at the sensory level. By training an architecture of this type to predict the next sensory state of a robot navigating in an environment divided into two rooms we show how the first level prediction layer extracts low level regularities such as `walls', `corners', and `corridors' while the second level prediction laye...
Training Recurrent Networks by Evolino
, 2007
Abstract

Cited by 35 (5 self)
In recent years, gradient-based LSTM recurrent neural networks (RNNs) solved many previously RNN-unlearnable tasks. Sometimes, however, gradient information is of little use for training RNNs, due to numerous local minima. For such cases, we present a novel method: EVOlution of systems with LINear Outputs (Evolino). Evolino evolves weights to the nonlinear, hidden nodes of RNNs while computing optimal linear mappings from hidden state to output, using methods such as pseudoinverse-based linear regression. If we instead use quadratic programming to maximize the margin, we obtain the first evolutionary recurrent support vector machines. We show that Evolino-based LSTM can solve tasks that Echo State nets (Jaeger, 2004a) cannot and achieves higher accuracy in certain continuous function generation tasks than conventional gradient descent RNNs, including gradient-based LSTM.
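The Evolino split described here — evolve the nonlinear hidden weights, solve the linear output layer in closed form — can be sketched as a fitness evaluation for one candidate individual. This is an illustrative sketch, not the authors' code: the candidate weights (which Evolino would evolve) are simply passed in, and the output layer is fit by pseudoinverse-based least squares.

```python
import numpy as np

def evolino_fitness(Wxh, Whh, inputs, targets):
    """Evaluate one Evolino-style candidate (illustrative sketch).

    Wxh, Whh are the candidate's hidden-layer weights, which the
    evolutionary outer loop would mutate and select. Given those,
    the optimal linear output mapping is computed in closed form
    with least squares; the residual error serves as fitness.
    """
    h = np.zeros(Whh.shape[0])
    H = []
    for x in inputs:
        h = np.tanh(Wxh @ x + Whh @ h)       # run the fixed nonlinear recurrence
        H.append(h)
    H = np.array(H)                           # hidden states, one row per time step
    Wout, *_ = np.linalg.lstsq(H, targets, rcond=None)  # pseudoinverse-based fit
    mse = float(np.mean((targets - H @ Wout) ** 2))
    return mse, Wout                          # lower error = fitter individual
```

Replacing the least-squares fit with a margin-maximizing quadratic program is what yields the recurrent support vector machines the abstract mentions.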
A Fixed Size Storage O(n³) Time Complexity Learning Algorithm for Fully Recurrent Continually Running Networks
 NEURAL COMPUTATION
, 1992
Abstract

Cited by 32 (12 self)
The RTRL algorithm for fully recurrent continually running networks (Robinson and Fallside, 1987; Williams and Zipser, 1989) requires O(n⁴) computations per time step, where n is the number of non-input units. I describe a method suited for online learning which computes exactly the same gradient and requires fixed-size storage of the same order but has an average time complexity per time step of O(n³).
Learning Unambiguous Reduced Sequence Descriptions
 Advances in Neural Information Processing Systems 4
, 1992
Abstract

Cited by 29 (8 self)
You want your neural net algorithm to learn sequences? Do not just use conventional gradient descent (or approximations thereof) in recurrent nets, time-delay nets etc. Instead, use your sequence learning algorithm to implement the following method: No matter what your final goals are, train a network to predict its next input from the previous ones. Since only unpredictable inputs convey new information, ignore all predictable inputs but let all unexpected inputs (plus information about the time step at which they occurred) become inputs to a higher-level network of the same kind (working on a slower, self-adjusting time scale). Go on building a hierarchy of such networks. This principle reduces the descriptions of event sequences without loss of information, thus easing supervised or reinforcement learning tasks. Experiments show that systems based on this principle can require less computation per time step and many fewer training sequences than conventional training algorithms for ...
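The reduction principle this abstract describes — forward only mispredicted inputs, tagged with their time steps, to the next level — can be shown with a deliberately simple table-based predictor standing in for the neural net (an illustrative sketch, not the paper's architecture):

```python
def compress(seq):
    """One level of history compression (illustrative sketch).

    A table-based predictor guesses the next symbol from the
    current one; only mispredicted symbols, paired with the time
    step at which they occurred, are passed upward. Predictable
    inputs carry no new information and are dropped.
    """
    pred = {}            # last-seen successor of each symbol
    reduced = []         # what the higher level would receive
    prev = None
    for t, sym in enumerate(seq):
        if prev is None or pred.get(prev) != sym:
            reduced.append((t, sym))   # unexpected: forward it upward
        if prev is not None:
            pred[prev] = sym           # learn: expect sym after prev
        prev = sym
    return reduced
```

On a repetitive sequence the reduction is dramatic: `compress("ababab")` returns `[(0, 'a'), (1, 'b'), (2, 'a')]` — once the alternation is learned, nothing further is passed up, and the original sequence remains recoverable from the reduced one plus the predictor.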
A `Self-Referential' Weight Matrix
 IN PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON ARTIFICIAL NEURAL NETWORKS
Abstract

Cited by 21 (15 self)
Weight modifications in traditional neural nets are computed by hardwired algorithms. Without exception, all previous weight change algorithms have many specific limitations. Is it (in principle) possible to overcome limitations of hardwired algorithms by allowing neural nets to run and improve their own weight change algorithms? This paper constructively demonstrates that the answer (in principle) is `yes'. I derive an initial gradient-based sequence learning algorithm for a `self-referential' recurrent network that can `speak' about its own weight matrix in terms of activations. It uses some of its input and output units for observing its own errors and for explicitly analyzing and modifying its own weight matrix, including those parts of the weight matrix responsible for analyzing and modifying the weight matrix. The result is the first `introspective' neural net with explicit potential control over all of its own adaptive parameters. A disadvantage of the algorithm is its high c...
The Vanishing Gradient Problem during Learning Recurrent Neural Nets and . . .
 INTERNATIONAL JOURNAL OF UNCERTAINTY, FUZZINESS AND KNOWLEDGE-BASED SYSTEMS
Abstract

Cited by 18 (0 self)
... In this article the decaying error flow is theoretically analyzed. Then methods trying to overcome vanishing gradients are briefly discussed. Finally, experiments comparing conventional algorithms and alternative methods are presented. With advanced methods long time lag problems can be solved in reasonable time.
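The decaying error flow this article analyzes can be demonstrated numerically: the gradient reaching time step 0 from step T in a tanh recurrent net is a product of T Jacobians, and with modest weights its norm shrinks roughly geometrically in T. This is an illustrative sketch of the phenomenon, not the article's formal analysis.

```python
import numpy as np

def error_flow(Whh, T):
    """Norm of the backpropagated error signal across T steps of a
    tanh recurrence (illustrative sketch). Each step contributes a
    local Jacobian diag(1 - h_t**2) @ Whh; their product governs how
    much gradient survives the trip back through time.
    """
    rng = np.random.default_rng(0)
    n = Whh.shape[0]
    h = rng.standard_normal(n)
    grad = np.eye(n)                          # accumulated Jacobian product
    for _ in range(T):
        h = np.tanh(Whh @ h)
        J = np.diag(1.0 - h ** 2) @ Whh       # local Jacobian dh_t/dh_{t-1}
        grad = J @ grad
    return float(np.linalg.norm(grad))
```

With `Whh = 0.5 * np.eye(4)`, for example, the surviving gradient norm over 50 steps is many orders of magnitude below that over 5 steps — the long-time-lag problem the advanced methods in this article's experiments are designed to overcome.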
Continuous History Compression
 Proc. of Intl. Workshop on Neural Networks, RWTH Aachen
, 1993
Abstract

Cited by 9 (6 self)
Neural networks have proven poor at learning the structure in complex and extended temporal sequences in which contingencies among elements can span long time lags. The principle of history compression [18] provides a means of transforming long sequences with redundant information into equivalent shorter sequences; the shorter sequences are more easily manipulated and learned by neural networks. The principle states that expected sequence elements can be removed from the sequence to form an equivalent, more compact sequence without loss of information. The principle was embodied in a neural net predictive architecture that attempted to anticipate the next element of a sequence given the previous elements. If the prediction was accurate, the next element was discarded; otherwise, it was passed on to a second network that processed the sequence in some fashion (e.g., recognition, classification, autoencoding, etc.). As originally proposed, a binary judgement was made as to the predictabi...