Learning Long-Term Dependencies with Gradient Descent is Difficult
 TO APPEAR IN THE SPECIAL ISSUE ON RECURRENT NETWORKS OF THE IEEE TRANSACTIONS ON NEURAL NETWORKS
Abstract

Cited by 256 (24 self)
Recurrent neural networks can be used to map input sequences to output sequences, such as for recognition, production or prediction problems. However, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals. We show why gradient-based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases. These results expose a trade-off between efficient learning by gradient descent and latching on to information for long periods. Based on an understanding of this problem, alternatives to standard gradient descent are considered.
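The effect the abstract describes can be illustrated numerically. The sketch below is not from the paper (the weight value, initial state, and horizons are invented): it tracks the derivative of the state of the scalar recurrence h_t = tanh(w * h_{t-1}) with respect to the initial state. The gradient is a product of one factor per time step, so it shrinks geometrically when the recurrent weight is small.

```python
# Hypothetical illustration of the vanishing-gradient effect for the
# scalar recurrence h_t = tanh(w * h_{t-1}); not the paper's analysis.
import math

def gradient_through_time(w, steps, h0=0.5):
    """Return |dh_T / dh_0|: a product of per-step factors w * (1 - h_t^2)."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)   # chain rule contributes one factor per step
    return abs(grad)

for T in (5, 20, 50):
    print(T, gradient_through_time(w=0.9, steps=T))
```

With |w| < 1 each factor is below one in magnitude, so the gradient reaching the early time steps decays geometrically with the span of the dependency.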
Neural Net Architectures for Temporal Sequence Processing
, 1994
Abstract

Cited by 106 (0 self)
I present a general taxonomy of neural net architectures for processing time-varying patterns. This taxonomy subsumes many existing architectures in the literature, and points to several promising architectures that have yet to be examined. Any architecture that processes time-varying patterns requires two conceptually distinct components: a short-term memory that holds on to relevant past events and an associator that uses the short-term memory to classify or predict. My taxonomy is based on a characterization of short-term memory models along the dimensions of form, content, and adaptability. Experiments on predicting future values of a financial time series (US dollar-Swiss franc exchange rates) are presented using several alternative memory models. The results of these experiments serve as a baseline against which more sophisticated architectures can be compared. Neural networks have proven to be a promising alternative to traditional techniques for nonlinear temporal prediction t...
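The simplest memory form in such a taxonomy is a tapped delay line: the short-term memory holds the last k raw values, and the associator maps them to a prediction. A minimal sketch (not Mozer's implementation; the toy series, window size, and coefficients are invented) with a linear associator fit by least squares:

```python
# Sketch: delay-line short-term memory + linear associator (invented example).
import numpy as np

def delay_line_predictor(series, k):
    """Fit y_t ~ w . [y_{t-1}, ..., y_{t-k}] by least squares; return w."""
    X = np.array([series[t - k:t][::-1] for t in range(k, len(series))])
    y = np.array(series[k:])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Toy series generated by a fixed linear recurrence, which a depth-2
# delay line can represent exactly.
series = [1.0, 2.0]
for _ in range(50):
    series.append(0.5 * series[-1] + 0.3 * series[-2])

w = delay_line_predictor(series, k=2)
print(w)   # recovers approximately [0.5, 0.3]
```

This is the "form" dimension at its most basic: the memory content is the raw signal, the depth is fixed, and nothing is adaptive; richer memory models vary exactly these choices.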
Input/output hmms for sequence processing
 IEEE Transactions on Neural Networks
, 1996
Abstract

Cited by 98 (12 self)
We consider problems of sequence processing and propose a solution based on a discrete state model in order to represent past context. We introduce a recurrent connectionist architecture having a modular structure that associates a subnetwork to each state. The model has a statistical interpretation we call Input/Output Hidden Markov Model (IOHMM). It can be trained by the EM or GEM algorithms, considering state trajectories as missing data, which decouples temporal credit assignment and actual parameter estimation. The model presents similarities to hidden Markov models (HMMs), but allows us to map input sequences to output sequences, using the same processing style as recurrent neural networks. IOHMMs are trained using a more discriminant learning paradigm than HMMs, while potentially taking advantage of the EM algorithm. We demonstrate that IOHMMs are well suited for solving grammatical inference problems on a benchmark problem. Experimental results are presented for the seven Tomita grammars, showing that these adaptive models can attain excellent generalization.
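The defining feature of an IOHMM, compared with a plain HMM, is that both the transition and emission probabilities are conditioned on the current input. A toy sketch of the resulting forward recursion (all probability tables below are made-up numbers, not from the paper, and real IOHMMs compute them with subnetworks rather than lookup tables):

```python
# Hypothetical 2-state, 2-symbol IOHMM forward pass (invented numbers).
# trans[x][i][j] = P(next state j | state i, input x)
trans = [
    [[0.9, 0.1], [0.2, 0.8]],   # input symbol 0
    [[0.3, 0.7], [0.6, 0.4]],   # input symbol 1
]
# emit[x][j][y] = P(output y | state j, input x)
emit = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.7, 0.3]],
]

def iohmm_likelihood(xs, ys, init=(0.5, 0.5)):
    """P(ys | xs): forward algorithm with input-conditioned tables."""
    alpha = list(init)
    for x, y in zip(xs, ys):
        alpha = [
            sum(alpha[i] * trans[x][i][j] for i in range(2)) * emit[x][j][y]
            for j in range(2)
        ]
    return sum(alpha)

p = iohmm_likelihood([0, 1, 0], [1, 0, 1])
print(p)
```

Because the state trajectory is summed out, it can be treated as missing data, which is exactly what makes EM/GEM training applicable.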
Neural network music composition by prediction: Exploring the benefits of psychoacoustic constraints and multiscale processing
 Connection Science
, 1994
Abstract

Cited by 38 (0 self)
In algorithmic music composition, a simple technique involves selecting notes sequentially according to a transition table that specifies the probability of the next note as a function of the previous context. I describe an extension of this transition table approach using a recurrent auto-predictive connectionist network called CONCERT. CONCERT is trained on a set of pieces with the aim of extracting stylistic regularities. CONCERT can then be used to compose new pieces. A central ingredient of CONCERT is the incorporation of psychologically-grounded representations of pitch, duration, and harmonic structure. CONCERT was tested on sets of examples artificially generated according to simple rules and was shown to learn the underlying structure, even where other approaches failed. In larger experiments, CONCERT was trained on sets of J. S. Bach pieces and traditional European folk melodies and was then allowed to compose novel melodies. Although the compositions are occasionally pleasa...
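The baseline technique the abstract contrasts CONCERT with is easy to sketch: sample each note from P(next note | previous note) using a first-order transition table. The three-note alphabet and all probabilities below are invented for illustration:

```python
# Sketch of first-order transition-table composition (invented probabilities).
import random

transition = {
    "C": [("C", 0.2), ("E", 0.5), ("G", 0.3)],
    "E": [("C", 0.4), ("E", 0.1), ("G", 0.5)],
    "G": [("C", 0.6), ("E", 0.3), ("G", 0.1)],
}

def compose(start, length, seed=0):
    """Generate a melody by repeatedly sampling the next-note distribution."""
    rng = random.Random(seed)
    melody = [start]
    for _ in range(length - 1):
        notes, weights = zip(*transition[melody[-1]])
        melody.append(rng.choices(notes, weights=weights)[0])
    return melody

m = compose("C", 12)
print(" ".join(m))
```

The limitation motivating CONCERT is visible here: the table conditions on only the previous note, so any regularity spanning a longer context (phrase structure, harmonic movement) is invisible to it.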
Architectural Bias in Recurrent Neural Networks: Fractal Analysis
 IEEE Transactions on Neural Networks
Abstract

Cited by 28 (7 self)
We have recently shown that when initialized with "small" weights, recurrent neural networks (RNNs) with standard sigmoid-type activation functions are inherently biased towards Markov models, i.e. even prior to any training, RNN dynamics can be readily used to extract finite memory machines (Hammer & Tino, 2002; Tino, Cernansky & Benuskova, 2002; Tino, Cernansky & Benuskova, 2002a). Following Christiansen and Chater (1999), we refer to this phenomenon as the architectural bias of RNNs. In this paper we further extend our work on the architectural bias in RNNs by performing a rigorous fractal analysis of recurrent activation patterns. We assume the network is driven by sequences obtained by traversing an underlying finite-state transition diagram, a scenario that has been frequently considered in the past, e.g. when studying RNN-based learning and implementation of regular grammars and finite-state transducers. We obtain lower and upper bounds on various types of fractal dimensions, such as box-counting and Hausdorff dimensions. It turns out that not only can the recurrent activations inside RNNs with small initial weights be exploited to build Markovian predictive models, but also the activations form fractal clusters whose dimension can be bounded by the scaled entropy of the underlying driving source. The scaling factors are fixed and are given by the RNN parameters.
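The box-counting dimension the abstract bounds analytically can be estimated empirically: drive an untrained small-weight RNN with a symbol stream, collect the hidden activations, and compare box counts at two scales. The sketch below is a toy illustration, not the paper's analysis; the network size, weight ranges, scales, and the random driving source are all invented:

```python
# Toy estimate of the box-counting dimension of activation clusters in an
# untrained 2-unit RNN with small recurrent weights (invented parameters).
import math, random

rng = random.Random(0)
n_sym, dim = 2, 2
W = [[rng.uniform(-0.3, 0.3) for _ in range(dim)] for _ in range(dim)]   # small recurrent weights
W_in = [[rng.uniform(-2.0, 2.0) for _ in range(dim)] for _ in range(n_sym)]

h = [0.0] * dim
points = []
for _ in range(5000):                       # random symbol stream drives the RNN
    s = rng.randrange(n_sym)
    h = [math.tanh(sum(W[i][j] * h[j] for j in range(dim)) + W_in[s][i])
         for i in range(dim)]
    points.append(tuple(h))

def n_boxes(pts, eps):
    """Count occupied boxes of side eps covering the activation points."""
    return len({tuple(math.floor(c / eps) for c in p) for p in pts})

e1, e2 = 0.05, 0.01                         # e1 is an integer multiple of e2
est = math.log(n_boxes(points, e2) / n_boxes(points, e1)) / math.log(e1 / e2)
print(round(est, 2))
```

Because the recurrent map is strongly contractive at small weights, the activations collapse into symbol-history clusters, and the two-scale slope gives a crude finite-sample stand-in for the fractal dimensions bounded in the paper.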
On the Problem of Local Minima in Recurrent Neural Networks
 IEEE Transactions on Neural Networks
, 1994
Abstract

Cited by 14 (7 self)
Many researchers have recently focused their efforts on devising efficient algorithms, mainly based on optimization schemes, for learning the weights of recurrent neural networks. As with feedforward networks, however, these learning algorithms may get stuck in local minima during gradient descent, thus discovering suboptimal solutions. This paper analyzes the problem of optimal learning in recurrent networks by proposing some sufficient conditions which guarantee error surfaces free of local minima. An example is given which also shows the constructive role of the proposed theory in designing networks suitable for solving a given task. Moreover, a formal relationship between recurrent and static feedforward networks is established such that the examples of local minima for feedforward networks already known in the literature can be associated with analogous ones in recurrent networks. Index Terms—Backpropagation through time, feedforward networks, learning environment, linearly separabl...
An EM Approach to Learning Sequential Behavior
, 1994
Abstract

Cited by 11 (5 self)
We consider problems of sequence processing and we propose a solution based on a discrete state model. We introduce a recurrent architecture having a modular structure that allocates subnetworks to discrete states. Different subnetworks model the dynamics (state transition) and the output of the model, conditional on the previous state and an external input. The model has a statistical interpretation and can be trained by the EM or GEM algorithms, considering state trajectories as missing data. This makes it possible to decouple temporal credit assignment from actual parameter estimation. The model presents similarities to hidden Markov models, but allows mapping input sequences to output sequences, using the same processing style as recurrent networks. For this reason we call it Input/Output HMM (IOHMM). Another remarkable difference is that IOHMMs are trained using a supervised learning paradigm (while potentially taking advantage of the EM algorithm), whereas standard HMMs are trained by an...
Scheduling of Modular Architectures for Inductive Inference of Regular Grammars
 In ECAI'94 Workshop on Combining Symbolic and Connectionist Processing
, 1994
Abstract

Cited by 8 (3 self)
The problem of inductive inference of regular grammars has recently been faced with recurrent neural networks by many researchers [Giles et al., 1992; Pollack, 1991; Watrous & Kuhn, 1992]. We claim that recurrent radial basis function (R²BF) networks [Gori et al., 1993a] are very well-suited for dealing with such inferential problems, because of their clustered representation of the network states. On the other hand, the main problems that seem to affect the success of such inferential methods are those of vanishing gradients [Bengio et al., 1993] and of bifurcations of the weight-space trajectory [Doya, 1993] when learning long-term dependencies, no matter what recurrent network is used. In particular, in this paper, we propose using a modular neural network architecture in which the activations of each module are updated with their own rate. The updating rates and the connection of the different modules are arranged in such a way that the number of updating steps taking place in ea...
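The scheduling idea can be sketched in a few lines: each module carries its own update rate, so a slow module changes state only every r input steps and effectively sees the sequence at a coarser time scale, shortening the credit-assignment path through time. The class below is an invented illustration of that mechanism, not the paper's architecture:

```python
# Hypothetical sketch of per-module update rates (invented dynamics).
import math

class Module:
    """A recurrent unit whose state is refreshed only every `rate` input steps."""
    def __init__(self, rate, w=0.8):
        self.rate, self.w = rate, w
        self.h, self.t, self.n_updates = 0.0, 0, 0

    def step(self, x):
        self.t += 1
        if self.t % self.rate == 0:      # update only on this module's schedule
            self.h = math.tanh(self.w * self.h + x)
            self.n_updates += 1
        return self.h                    # state is held constant between updates

fast, slow = Module(rate=1), Module(rate=4)
for _ in range(16):
    fast.step(1.0)
    slow.step(1.0)
print(fast.n_updates, slow.n_updates)    # 16 updates vs. 4
```

For a dependency spanning T input steps, gradients through the slow module traverse only T / rate recurrent applications, which is the intuition behind arranging the rates to mitigate vanishing gradients.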
Fast training of recurrent networks based on EM algorithm
 IEEE Transactions on Neural Networks
, 1998
Abstract

Cited by 6 (2 self)
In this work, a probabilistic model is established for recurrent networks. The EM (expectation-maximization) algorithm is then applied to derive a new fast training algorithm for recurrent networks through mean-field approximation. This new algorithm converts training a complicated recurrent network into training an array of individual feedforward neurons. These neurons are then trained via a linear weighted regression algorithm. The training time has been improved by five to fifteen times on benchmark problems. Index Terms—EM algorithm, fast, mean-field approximation, moving targets, probability model, recurrent networks.
Learning Beyond Finite Memory in Recurrent Networks of Spiking Neurons
 Advances in Natural Computation  ICNC 2005, Lecture Notes in Computer Science
, 2005
Abstract

Cited by 4 (0 self)
We investigate possibilities of inducing temporal structures without fading memory in recurrent networks of spiking neurons strictly operating in the pulse-coding regime. We extend the existing gradient-based algorithm for training feedforward spiking neuron networks (SpikeProp [1]) to recurrent network topologies, so that temporal dependencies in the input stream are taken into account. It is shown that temporal structures with unbounded input memory specified by simple Moore machines (MM) can be induced by recurrent spiking neuron networks (RSNN). The networks are able to discover pulse-coded representations of abstract information processing states coding potentially unbounded histories of processed inputs.