## Continuous History Compression (1993)

Venue: Proc. of Intl. Workshop on Neural Networks, RWTH Aachen

Citations: 9 (6 self)

### BibTeX

```bibtex
@inproceedings{Schmidhuber93continuoushistory,
  author    = {Jürgen H. Schmidhuber and Michael C. Mozer and Daniel Prelinger},
  title     = {Continuous History Compression},
  booktitle = {Proc. of Intl. Workshop on Neural Networks, RWTH Aachen},
  year      = {1993},
  pages     = {87--95},
  publisher = {Augustinus}
}
```

### Abstract

Neural networks have proven poor at learning the structure in complex and extended temporal sequences in which contingencies among elements can span long time lags. The principle of history compression [18] provides a means of transforming long sequences with redundant information into equivalent shorter sequences; the shorter sequences are more easily manipulated and learned by neural networks. The principle states that expected sequence elements can be removed from the sequence to form an equivalent, more compact sequence without loss of information. The principle was embodied in a neural net predictive architecture that attempted to anticipate the next element of a sequence given the previous elements. If the prediction was accurate, the next element was discarded; otherwise, it was passed on to a second network that processed the sequence in some fashion (e.g., recognition, classification, autoencoding, etc.). As originally proposed, a binary judgement was made as to the predictability...
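The binary scheme the abstract describes can be sketched as follows. This is an illustrative toy, not the paper's neural architecture: `compress_history` and the trivial repeat-predictor are made-up stand-ins for the predictor network.

```python
# Sketch of the (binary) principle of history compression: a predictor
# guesses the next element; correctly predicted elements are dropped,
# leaving a shorter sequence of unexpected events for a second network.
# The trivial last-symbol predictor is an illustrative stand-in only.

def compress_history(sequence, predict):
    """Return only the elements the predictor failed to anticipate."""
    compressed = []
    previous = None
    for element in sequence:
        if element != predict(previous):
            compressed.append(element)  # unexpected: keep it
        previous = element
    return compressed

# Toy predictor: expect the same symbol to repeat.
predict_repeat = lambda prev: prev

# Runs of repeated symbols collapse to their first occurrence.
print(compress_history(list("aaabbbbcaaa"), predict_repeat))
# ['a', 'b', 'c', 'a']
```

The "continuous" variant introduced in this paper replaces the binary keep/discard decision with a graded measure of how unexpected each element was.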

### Citations

2746 | Learning internal representations by error propagation
- Rumelhart, Hinton, et al.
- 1986

Citation Context: ...en by existing non-local algorithms. 1.1 A WEAKNESS OF PREVIOUS LEARNING ALGORITHMS Exact gradient-based supervised learning algorithms for minimizing E are back-propagation through time (BPTT), e.g. **[14]**[24][10], the real time recurrent learning algorithm (RTRL) [13][27], its accelerated versions [26][28][17], and the recent fast-weight algorithm [19]. All these approaches are non-local -- for a rest...

1550 | Finding structure in time
- Elman
- 1990

Citation Context: ...restricted class of recurrent networks, however, there is a local gradient-based algorithm [6]. Local (but much weaker) approximations of the general supervised algorithms have been proposed (e.g. [3]**[1]**). Local approaches to reinforcement learning in recurrent networks [25][15][5] unfortunately are not very practicable in realistic applications. Although non-local gradient-based recurrent nets are g...

518 | Beyond Regression: New Tools for Prediction and Analysis
- Werbos
- 1974

Citation Context: ... takes on a default value, say 0. At time $0 < t \le n_p$, A computes $h_p(t) = g(x_p(t), h_p(t-1))$. Here g is implemented by the conventional activation spreading rules for back-propagation nets **[23]** [4] [9] [14]. A's output $z_p(t)$ is computed from $h_p(t)$ according to the same rules. $z_p(t)$ is interpreted as a 'reconstruction' of $x_p(t) \circ h_p(t-1)$. Following [11], we modify g such t...

337 | Recursive distributed representations
- Pollack
- 1990

Citation Context: ...lidean space---have similar 'semantics'. 3 CONTINUOUS HISTORY COMPRESSION WITH RAAMs One way to implement continuous history compression involves Pollack's recursive auto-associative memories (RAAMs) **[11]**. This section first explains RAAMs and demonstrates that local supervised learning algorithms based on RAAMs may theoretically bridge arbitrary time lags between correlated events (modulo crosstalk)....
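As a structural sketch of the sequential RAAM step described in the contexts above: an encoder maps the pair $(x_t, h_{t-1})$ to a fixed-size code $h_t$, and a decoder tries to reconstruct the pair from $h_t$. The dimensions, weight shapes, and names below are assumptions for illustration; the weights are random and untrained, so this shows only the data flow, not a working memory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper).
D_IN, D_HID = 4, 3

# Encoder compresses (x_t, h_{t-1}) into h_t; decoder reconstructs
# the pair from h_t. Randomly initialised, untrained.
W_enc = rng.standard_normal((D_HID, D_IN + D_HID)) * 0.1
W_dec = rng.standard_normal((D_IN + D_HID, D_HID)) * 0.1

def raam_step(x_t, h_prev):
    """One encode/decode step of a sequential RAAM."""
    pair = np.concatenate([x_t, h_prev])
    h_t = np.tanh(W_enc @ pair)   # h_t = g(x_t, h_{t-1})
    z_t = np.tanh(W_dec @ h_t)    # attempted reconstruction of the pair
    return h_t, z_t

h = np.zeros(D_HID)               # default initial state
for x in np.eye(D_IN):            # a toy 4-step sequence
    h, z = raam_step(x, h)

print(h.shape)  # (3,) -- the whole sequence compressed into one code
```

Training the autoencoder to make $z_t$ match its input pair is what lets $h_t$ serve as a compressed representation of the entire history.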

178 | Serial Order: A Parallel Distributed Processing Approach
- Jordan
- 1989

Citation Context: ... a restricted class of recurrent networks, however, there is a local gradient-based algorithm [6]. Local (but much weaker) approximations of the general supervised algorithms have been proposed (e.g. **[3]**[1]). Local approaches to reinforcement learning in recurrent networks [25][15][5] unfortunately are not very practicable in realistic applications. Although non-local gradient-based recurrent nets ar...

165 | Learning State Space Trajectories in Recurrent Neural Networks
- Pearlmutter
- 1989

Citation Context: ...isting non-local algorithms. 1.1 A WEAKNESS OF PREVIOUS LEARNING ALGORITHMS Exact gradient-based supervised learning algorithms for minimizing E are back-propagation through time (BPTT), e.g. [14][24]**[10]**, the real time recurrent learning algorithm (RTRL) [13][27], its accelerated versions [26][28][17], and the recent fast-weight algorithm [19]. All these approaches are non-local -- for a restricted c...

118 | Gradient-based learning algorithms for recurrent networks and their computational complexity
- Williams, Zipser
- 1992

Citation Context: ...d supervised learning algorithms for minimizing E are back-propagation through time (BPTT), e.g. [14][24][10], the real time recurrent learning algorithm (RTRL) [13][27], its accelerated versions [26]**[28]**[17], and the recent fast-weight algorithm [19]. All these approaches are non-local -- for a restricted class of recurrent networks, however, there is a local gradient-based algorithm [6]. Local (but ...

81 | Generalization of backpropagation with application to a recurrent gas market model
- Werbos
- 1988

Citation Context: ...y existing non-local algorithms. 1.1 A WEAKNESS OF PREVIOUS LEARNING ALGORITHMS Exact gradient-based supervised learning algorithms for minimizing E are back-propagation through time (BPTT), e.g. [14]**[24]**[10], the real time recurrent learning algorithm (RTRL) [13][27], its accelerated versions [26][28][17], and the recent fast-weight algorithm [19]. All these approaches are non-local -- for a restrict...

69 | A focused backpropagation algorithm for temporal pattern recognition
- Mozer
- 1989

Citation Context: ...ersions [26][28][17], and the recent fast-weight algorithm [19]. All these approaches are non-local -- for a restricted class of recurrent networks, however, there is a local gradient-based algorithm **[6]**. Local (but much weaker) approximations of the general supervised algorithms have been proposed (e.g. [3][1]). Local approaches to reinforcement learning in recurrent networks [25][15][5] unfortunate...

69 | Experimental Analysis of the Real-Time Recurrent Learning Algorithm
- Williams, Zipser
- 1989

Citation Context: ...NING ALGORITHMS Exact gradient-based supervised learning algorithms for minimizing E are back-propagation through time (BPTT), e.g. [14][24][10], the real time recurrent learning algorithm (RTRL) [13]**[27]**, its accelerated versions [26][28][17], and the recent fast-weight algorithm [19]. All these approaches are non-local -- for a restricted class of recurrent networks, however, there is a local gradie...

59 | Learning complex extended sequences using the principle of history compression
- Schmidhuber
- 1992

Citation Context: ...ural networks have proven poor at learning the structure in complex and extended temporal sequences in which contingencies among elements can span long time lags. The principle of history compression **[18]** provides a means of transforming long sequences with redundant information into equivalent shorter sequences; the shorter sequences are more easily manipulated and learned by neural networks. The pri...

56 | Induction of multiscale temporal structure
- Mozer
- 1992

Citation Context: ...ons. Although non-local gradient-based recurrent nets are general and can sometimes learn to perform quite complicated algorithms, they tend to fail when it comes to long time lags (see e.g. [18] and **[8]**): Suppose you want your learning algorithm to make the distinction between two possible input sequences: $a, x_1, x_2, \ldots, x_{10}$ and $b, x_1, x_2, \ldots, x_{10}$. The distinction is to be made by...

53 | Reinforcement learning in Markovian and non-Markovian environments
- Schmidhuber
- 1991

Citation Context: ... trained to do whatever its task is¹. ¹ For instance, a supervised feed-forward net can be trained to emit desired outputs. A reinforcement learner with a non-Markovian interface to its environment **[16]** will be potentially able to build a Markovian interface using RAAMs. 3.2 RAAMs FAIL WITH LONG TIME LAGS We conducted experiments similar to the one described in [18] which showed that RAAMs usually f...

50 | Une procédure d'apprentissage pour réseau à seuil asymétrique
- LeCun
- 1985

Citation Context: ...s on a default value, say 0. At time $0 < t \le n_p$, A computes $h_p(t) = g(x_p(t), h_p(t-1))$. Here g is implemented by the conventional activation spreading rules for back-propagation nets [23] **[4]** [9] [14]. A's output $z_p(t)$ is computed from $h_p(t)$ according to the same rules. $z_p(t)$ is interpreted as a 'reconstruction' of $x_p(t) \circ h_p(t-1)$. Following [11], we modify g such that ...

41 | Toward a theory of reinforcement-learning connectionist systems
- Williams
- 1988

Citation Context: ...ent-based algorithm [6]. Local (but much weaker) approximations of the general supervised algorithms have been proposed (e.g. [3][1]). Local approaches to reinforcement learning in recurrent networks **[25]**[15][5] unfortunately are not very practicable in realistic applications. Although non-local gradient-based recurrent nets are general and can sometimes learn to perform quite complicated algorithms, ...

35 | Static and dynamic error propagation networks with applications to speech coding
- Robinson, Fallside
- 1988

Citation Context: ...LEARNING ALGORITHMS Exact gradient-based supervised learning algorithms for minimizing E are back-propagation through time (BPTT), e.g. [14][24][10], the real time recurrent learning algorithm (RTRL) **[13]**[27], its accelerated versions [26][28][17], and the recent fast-weight algorithm [19]. All these approaches are non-local -- for a restricted class of recurrent networks, however, there is a local gr...

33 | A Mathematical Theory of Communication, Part I
- Shannon
- 1948

Citation Context: ...$(t) \log z^p_j(t)$ can be interpreted as a measure of the predictor's confidence. How much information is conveyed by $x^p(t+1)$ (relative to the current predictor), once it is observed? According to **[22]** it is $-\log z^p_j(t)$ with $j$ chosen such that $x^p_j(t+1) = 1$. In section 3 we will define an update procedure where the 'strength' of an update in response to a more or less unexpected event w...
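The Shannon-style measure in the context above is easy to make concrete: the information conveyed by an observed element is $-\log p$ under the predictor, and continuous history compression uses this surprise as a graded update strength instead of a binary keep/discard decision. The probabilities below are a made-up example, not data from the paper.

```python
import math

def surprise(prob):
    """Information content (in bits) of an event the predictor
    assigned probability `prob`: -log2(prob)."""
    return -math.log2(prob)

# Predictor's probabilities for the symbols that actually occurred
# (illustrative values):
observed_probs = [0.9, 0.9, 0.1, 0.5]

update_strengths = [surprise(p) for p in observed_probs]
# Well-predicted symbols (p = 0.9) contribute little; a surprising
# one (p = 0.1) contributes log2(10) ≈ 3.32 bits.
for p, s in zip(observed_probs, update_strengths):
    print(f"p = {p:.1f}  ->  surprise = {s:.2f} bits")
```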

31 | A fixed size storage O(n^3) time complexity learning algorithm for fully recurrent continually running networks
- Schmidhuber
- 1992

Citation Context: ...pervised learning algorithms for minimizing E are back-propagation through time (BPTT), e.g. [14][24][10], the real time recurrent learning algorithm (RTRL) [13][27], its accelerated versions [26][28]**[17]**, and the recent fast-weight algorithm [19]. All these approaches are non-local -- for a restricted class of recurrent networks, however, there is a local gradient-based algorithm [6]. Local (but much...

27 | A local learning algorithm for dynamic feedforward and recurrent networks
- Schmidhuber
- 1990

Citation Context: ...based algorithm [6]. Local (but much weaker) approximations of the general supervised algorithms have been proposed (e.g. [3][1]). Local approaches to reinforcement learning in recurrent networks [25]**[15]**[5] unfortunately are not very practicable in realistic applications. Although non-local gradient-based recurrent nets are general and can sometimes learn to perform quite complicated algorithms, they...

27 | Learning unambiguous reduced sequence descriptions. NIPS 4
- Schmidhuber
- 1992

Citation Context: ...s the case. The essential problem is: In unstructured nets, error signals tend to disperse the further they have to go 'back in time'. 1.2 HISTORY COMPRESSION The principle of history compression [18]**[20]** can help to overcome problems like the one with extended sequences mentioned in the last subsection. This principle essentially states that only unexpected events (including representations of the ti...

25 | Connectionist music composition based on melodic, stylistic, and psychophysical constraints. Music and Connectionism
- Mozer
- 1990

Citation Context: ...how to extend the capabilities of sequential RAAMs. 3.3 RAAMs AND CONTINUOUS HISTORY COMPRESSION By combining continuous history compression, RAAMs, and an update rule previously proposed by Mozer in **[7]**, we can greatly extend the capabilities of sequential RAAMs as follows. We need two modules. The first module is a RAAM A which corresponds to the architecture described in section 3.1. A is going to...

22 | Complexity of exact gradient computation algorithms for recurrent neural networks
- Williams
- 1989

Citation Context: ...based supervised learning algorithms for minimizing E are back-propagation through time (BPTT), e.g. [14][24][10], the real time recurrent learning algorithm (RTRL) [13][27], its accelerated versions **[26]**[28][17], and the recent fast-weight algorithm [19]. All these approaches are non-local -- for a restricted class of recurrent networks, however, there is a local gradient-based algorithm [6]. Local (...

19 | Learning to control fast-weight memories: An alternative to recurrent nets
- Schmidhuber
- 1992

6 | Netzwerkarchitekturen, Zielfunktionen und Kettenregel. Habilitationsschrift, Institut für Informatik, Technische Universität München
- Schmidhuber
- 1993

Citation Context: ...time lags, supervised E-minimization became an easy job for the second network. In some cases it was possible to obtain speed-up factors of more than 1,000 over conventional learning algorithms [18] **[21]**. The history compression technique formulated in [18], however, suffers from the weakness that a binary judgement is made as to the predictability of each input vector. Expectation-mismatches are def...

2 | Diploma thesis
- Hochreiter
- 1991

Citation Context: ...orks have the greatest difficulties. If not carefully designed, they hardly learn to bridge the 10 step time lags between the end of the sequences and the discriminating events a and b, respectively. **[2]** provides some theoretical analysis why this is the case. The essential problem is: In unstructured nets, error signals tend to disperse the further they have to go 'back in time'. 1.2 HISTORY COMPRES...

2 | Review of Schmidhuber's paper 'Recurrent networks adjusted by adaptive critics'. Neural Network Reviews
- Lukes
- 1990

Citation Context: ...d algorithm [6]. Local (but much weaker) approximations of the general supervised algorithms have been proposed (e.g. [3][1]). Local approaches to reinforcement learning in recurrent networks [25][15]**[5]** unfortunately are not very practicable in realistic applications. Although non-local gradient-based recurrent nets are general and can sometimes learn to perform quite complicated algorithms, they te...

1 | Diploma thesis
- Prelinger
- 1992

Citation Context: ...experiments similar to the one described in [18] which showed that RAAMs usually fail to create sufficiently distinct representations of sequences with lengths of the order of as few as 10 time steps **[12]**. The following section shows how to extend the capabilities of sequential RAAMs. 3.3 RAAMs AND CONTINUOUS HISTORY COMPRESSION By combining continuous history compression, RAAMs, and an update rule pr...