Results 1  10
of
27
Long Shortterm Memory
, 1995
"... "Recurrent backprop" for learning to store information over extended time intervals takes too long. The main reason is insufficient, decaying error back flow. We briefly review Hochreiter's 1991 analysis of this problem. Then we overcome it by introducing a novel, efficient method called "Long Sho ..."
Abstract

Cited by 244 (55 self)
 Add to MetaCart
"Recurrent backprop" for learning to store information over extended time intervals takes too long. The main reason is insufficient, decaying error back flow. We briefly review Hochreiter's 1991 analysis of this problem. Then we overcome it by introducing a novel, efficient method called "Long Short Term Memory" (LSTM). LSTM can learn to bridge minimal time lags in excess of 1000 time steps by enforcing constant error flow through internal states of special units. Multiplicative gate units learn to open and close access to constant error flow. LSTM's update
A General Framework for Adaptive Processing of Data Structures
 IEEE TRANSACTIONS ON NEURAL NETWORKS
, 1998
"... A structured organization of information is typically required by symbolic processing. On the other hand, most connectionist models assume that data are organized according to relatively poor structures, like arrays or sequences. The framework described in this paper is an attempt to unify adaptive ..."
Abstract

Cited by 117 (46 self)
 Add to MetaCart
A structured organization of information is typically required by symbolic processing. On the other hand, most connectionist models assume that data are organized according to relatively poor structures, like arrays or sequences. The framework described in this paper is an attempt to unify adaptive models like artificial neural nets and belief nets for the problem of processing structured information. In particular, relations between data variables are expressed by directed acyclic graphs, where both numerical and categorical values coexist. The general framework proposed in this paper can be regarded as an extension of both recurrent neural networks and hidden Markov models to the case of acyclic graphs. In particular we study the supervised learning problem as the problem of learning transductions from an input structured space to an output structured space, where transductions are assumed to admit a recursive hidden statespace representation. We introduce a graphical formalism for r...
Representation of Finite State Automata in Recurrent Radial Basis Function Networks
, 1996
"... to :hs paper we propose some techniques ft>r injccling linite Stale automata rate l.ec:rr,zn Radial Basis Functlt>n networks (R2BF). When providing proper hints and constraining the v,oght space prlpe'ly. we show that thc,e nelworks behave as automata. A teebraque is snggcsted /"t ebrorag the lemmn ..."
Abstract

Cited by 36 (5 self)
 Add to MetaCart
to :hs paper we propose some techniques ft>r injccling linite Stale automata rate l.ec:rr,zn Radial Basis Functlt>n networks (R2BF). When providing proper hints and constraining the v,oght space prlpe'ly. we show that thc,e nelworks behave as automata. A teebraque is snggcsted /"t ebrorag the lemmng process re develop aulomata representationq that is based on adding a pro)per penalty tunelton to the mdinary cost. Successful experinental results are shown for tuducttvc mcrenc.' 1 regular gramrnar Keywords: Attemala, backpropagation t[rough trine, high(rder neural networks, induclix. c reference. learning item hints. radial basis ftlnctions, rectarent radial basra tnnclmns. recurrent netw(>rks 1. introduction The ability (>f learning fiom examples is certainly lhe most appealing l'eature c)f neu ral networks. In the last lw years, several researchers have used conncctontst models for solving different kinds ol probfoms ranging from robot control to pattern recogmtioa Coping wilh optimization of [unctions with several thousands of x, ariablcs s quite common Surprisingly, in many practical cases, global or near global r)ptimization is attained also wth non sophistteated numertcal methods. For example, successlul applications of neural nets fi)r recognition of handwritten characters (le Cun, 189) md for phoncmc discrimination (Waibcl c al., 1989) ave bccn proposed which d() n<,t report serious convergence problems Some attempts to understand the theoretical reasons )r lhc successes and atlures of supervised }earrang schemes have been carried oat which explain when such schemes are likely to succeed in discovering oplmal solutions (Bmnchini cl al.. 1994; Gori & Tesi, 1992; Yu, 192), and to gencrali7c to new examples (Baum & Haussler. 1989L These results give st>me ...
Constructive Learning of Recurrent Neural Networks: Limitations of Recurrent Casade Correlation and a Simple Solution
, 1993
"... It is often difficult to predict the optimal neural network size for a particular application. Constructive or destructive methods that add or subtract neurons, layers, connections, etc. might offer a solution to this problem. We prove that one method, Recurrent Cascade Correlation, due to its topol ..."
Abstract

Cited by 28 (9 self)
 Add to MetaCart
It is often difficult to predict the optimal neural network size for a particular application. Constructive or destructive methods that add or subtract neurons, layers, connections, etc. might offer a solution to this problem. We prove that one method, Recurrent Cascade Correlation, due to its topology, has fundamental limitations in representation and thus in its learning capabilities. It cannot represent with monotone (i.e. sigmoid) and hardthreshold activation functions certain finite state automata. We give a "preliminary" approach on how to get around these limitations by devising a simple constructive training method that adds neurons during training while still preserving the powerful fullyrecurrent structure. We illustrate this approach by simulations which learn many examples of regular grammars that the Recurrent Cascade Correlation method is unable to learn. 1 Introduction Choosing the architecture of a neural network for a particular problem usually requires some prior k...
Rule Extraction from Recurrent Neural Networks: a Taxonomy and Review
 Neural Computation
, 2005
"... this paper, the progress of this development is reviewed and analysed in detail. In order to structure the survey and to evaluate the techniques, a taxonomy, specifically designed for this purpose, has been developed. Moreover, important open research issues are identified, that, if addressed pr ..."
Abstract

Cited by 24 (3 self)
 Add to MetaCart
this paper, the progress of this development is reviewed and analysed in detail. In order to structure the survey and to evaluate the techniques, a taxonomy, specifically designed for this purpose, has been developed. Moreover, important open research issues are identified, that, if addressed properly, possibly can give the field a significant push forward
An Analysis of Noise in Recurrent Neural Networks: Convergence and Generalization
 IEEE Transactions on Neural Networks
, 1996
"... There has been much interest in applying noise to feedforward neural networks in order to observe their effect on network performance. We extend these results by introducing and analyzing various methods of injecting synaptic noise into dynamicallydriven recurrent networks during training. We prese ..."
Abstract

Cited by 19 (0 self)
 Add to MetaCart
There has been much interest in applying noise to feedforward neural networks in order to observe their effect on network performance. We extend these results by introducing and analyzing various methods of injecting synaptic noise into dynamicallydriven recurrent networks during training. We present theoretical results which show that applying a controlled amount of noise during training may improve convergence and generalization performance. In addition, we analyze the effects of various noise parameters (additive vs. multiplicative, cumulative vs. noncumulative, per time step vs. per string) and predict that best overall performance can be achieved by injecting additive noise at each time step. Noise contributes a secondorder gradient term to the error function which can be viewed as an anticipatory agent to aid convergence. This term appears to find promising regions of weight space in the beginning stages of training when the training error is large and should improve convergen...
The Neural Network Pushdown Automaton: Model, Stack and Learning Simulations
, 1993
"... In order for neural networks to learn complex languages or grammars, they must have sufficient computational power or resources to recognize or generate such languages. Though many approaches to effectively utilizing the computational power of neural networks have been discussed, an obvious one is t ..."
Abstract

Cited by 17 (2 self)
 Add to MetaCart
In order for neural networks to learn complex languages or grammars, they must have sufficient computational power or resources to recognize or generate such languages. Though many approaches to effectively utilizing the computational power of neural networks have been discussed, an obvious one is to couple a recurrent neural network with an external stack memory in effect creating a neural network pushdown automata (NNPDA). This NNPDA generalizes the concept of a recurrent network so that the network becomes a more complex computing structure. This paper discusses in detail a NNPDA its construction, how it can be trained and how useful symbolic information can be extracted from the trained network. To effectively couple the external stack to the neural network, an optimization method is developed which uses an error function that connects the learning of the state automaton of the neural network to the learning of the operation of the external stack: push, pop, and nooperation. To minimize the error function using gradient descent learning, an analog stack is designed such that the action and storage of information in the stack are continuous. One interpretation of a continuous stack is the probabilistic storage of and action on data. After training on sample strings of an unknown source grammar, a quantization procedure extracts from the analog stack and neural network a discrete pushdown automata (PDA). Simulations show that in learning deterministic contextfree grammars the balanced parenthesis language, 1 n 0 n, and the deterministic Palindrome the extracted PDA is correct in the sense that it can correctly recognize unseen strings of arbitrary length. In addition, the extracted PDAs can be shown to be identical or equivalent to the PDAs of the source grammars which were used to generate the training strings.
Guessing Can Outperform Many Long Time Lag Algorithms
, 1996
"... Numerous recent papers focus on standard recurrent nets' problems with long time lags between relevant signals. Some propose rather sophisticated, alternative methods. We show: many problems used to test previous methods can be solved more quickly by random weight guessing. ..."
Abstract

Cited by 16 (4 self)
 Add to MetaCart
Numerous recent papers focus on standard recurrent nets' problems with long time lags between relevant signals. Some propose rather sophisticated, alternative methods. We show: many problems used to test previous methods can be solved more quickly by random weight guessing.
Learning and Extracting Initial Mealy Automata With a Modular Neural Network Model
 NEURAL COMPUTATION
, 1995
"... A hybrid recurrent neural network is shown to learn small initial mealy machines (that can be thought of as translation machines translating input strings to corresponding output strings, as opposed to recognition automata that classify strings as either grammatical or nongrammatical) from positive ..."
Abstract

Cited by 16 (0 self)
 Add to MetaCart
A hybrid recurrent neural network is shown to learn small initial mealy machines (that can be thought of as translation machines translating input strings to corresponding output strings, as opposed to recognition automata that classify strings as either grammatical or nongrammatical) from positive training samples. A welltrained neural net is then presented once again with the training set and a Kohonen selforganizing map with the "star" topology of neurons is used to quantize recurrent network state space into distinct regions representing corresponding states of a mealy machine being learned. This enables us to extract the learned mealy machine from the trained recurrent network. One neural network (Kohonen selforganizing map) is used to extract meaningful information from another network (recurrent neural network).
LSTM Can Solve Hard Long Time Lag Problems
 Advances in Neural Information Processing Systems 9
, 1997
"... Standard recurrent nets cannot deal with long minimal time lags between relevant signals. Several recent NIPS papers propose alternative methods. We first show: problems used to promote various previous algorithms can be solved more quickly by random weight guessing than by the proposed algorithms. ..."
Abstract

Cited by 15 (8 self)
 Add to MetaCart
Standard recurrent nets cannot deal with long minimal time lags between relevant signals. Several recent NIPS papers propose alternative methods. We first show: problems used to promote various previous algorithms can be solved more quickly by random weight guessing than by the proposed algorithms. We then use LSTM, our own recent algorithm, to solve a hard problem that can neither be quickly solved by random search nor by any other recurrent net algorithm we are aware of. 1 TRIVIAL PREVIOUS LONG TIME LAG PROBLEMS Traditional recurrent nets fail in case of long minimal time lags between input signals and corresponding error signals [7, 3]. Many recent papers propose alternative methods, e.g., [16, 12, 1, 5, 9]. For instance, Bengio et al. investigate methods such as simulated annealing, multigrid random search, timeweighted pseudoNewton optimization, and discrete error propagation [3]. They also propose an EM approach [1]. Quite a few papers use variants of the "2sequence ...