Results 1 - 10
of
11
A Learning Algorithm for Continually Running Fully Recurrent Neural Networks
, 1989
"... The exact form of a gradient-following learning algorithm for completely recurrent networks running in continually sampled time is derived and used as the basis for practical algorithms for temporal supervised learning tasks. These algorithms have: (1) the advantage that they do not require a precis ..."
Abstract
-
Cited by 368 (4 self)
- Add to MetaCart
The exact form of a gradient-following learning algorithm for completely recurrent networks running in continually sampled time is derived and used as the basis for practical algorithms for temporal supervised learning tasks. These algorithms have: (1) the advantage that they do not require a precisely defined training interval, operating while the network runs; and (2) the disadvantage that they require nonlocal communication in the network being trained and are computationally expensive. These algorithms are shown to allow networks having recurrent connections to learn complex tasks requiring the retention of information over time periods having either fixed or indefinite length. 1 Introduction A major problem in connectionist theory is to develop learning algorithms that can tap the full computational power of neural networks. Much progress has been made with feedforward networks, and attention has recently turned to developing algorithms for networks with recurrent connections, wh...
First and Second-Order Methods for Learning: between Steepest Descent and Newton's Method
- Neural Computation
, 1992
"... On-line first order backpropagation is sufficiently fast and effective for many large-scale classification problems but for very high precision mappings, batch processing may be the method of choice. This paper reviews first- and second-order optimization methods for learning in feedforward neura ..."
Abstract
-
Cited by 108 (6 self)
- Add to MetaCart
On-line first order backpropagation is sufficiently fast and effective for many large-scale classification problems but for very high precision mappings, batch processing may be the method of choice. This paper reviews first- and second-order optimization methods for learning in feedforward neural networks. The viewpoint is that of optimization: many methods can be cast in the language of optimization techniques, allowing the transfer to neural nets of detailed results about computational complexity and safety procedures to ensure convergence and to avoid numerical problems. The review is not intended to deliver detailed prescriptions for the most appropriate methods in specific applications, but to illustrate the main characteristics of the different methods and their mutual relations.
Local Gain Adaptation in Stochastic Gradient Descent
- In Proc. Intl. Conf. Artificial Neural Networks
, 1999
"... Gain adaptation algorithms for neural networks typically adjust learning rates by monitoring the correlation between successive gradients. Here we discuss the limitations of this approach, and develop an alternative by extending Sutton's work on linear systems to the general, nonlinear case. The res ..."
Abstract
-
Cited by 42 (9 self)
- Add to MetaCart
Gain adaptation algorithms for neural networks typically adjust learning rates by monitoring the correlation between successive gradients. Here we discuss the limitations of this approach, and develop an alternative by extending Sutton's work on linear systems to the general, nonlinear case. The resulting online algorithms are computationally little more expensive than other acceleration techniques, do not assume statistical independence between successive training patterns, and do not require an arbitrary smoothing parameter. In our benchmark experiments, they consistently outperform other acceleration methods, and show remarkable robustness when faced with noni. i.d. sampling of the input space.
Centering Neural Network Gradient Factors
- Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science
, 1997
"... It has long been known that neural networks can learn faster when their input and hidden unit activities are centered about zero; recently we have extended this approach to also encompass the centering of error signals [2]. Here we generalize this notion to all factors involved in the network's grad ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
It has long been known that neural networks can learn faster when their input and hidden unit activities are centered about zero; recently we have extended this approach to also encompass the centering of error signals [2]. Here we generalize this notion to all factors involved in the network's gradient, leading us to propose centering the slope of hidden unit activation functions as well. Slope centering removes the linear component of backpropagated error; this improves credit assignment in networks with shortcut connections. Benchmark results show that this can speed up learning significantly without adversely affecting the trained network's generalization ability.
Slope Centering: Making Shortcut Weights Effective
- Proceedings of the 8th International Conference on Artificial Neural Networks, Perspectives in Neural Computing
, 1998
"... Shortcut connections are a popular architectural feature of multi-layer perceptrons. It is generally assumed that by implementing a linear submapping, shortcuts assist the learning process in the remainder of the network. Here we find that this is not always the case: shortcut weights may also act a ..."
Abstract
-
Cited by 3 (2 self)
- Add to MetaCart
Shortcut connections are a popular architectural feature of multi-layer perceptrons. It is generally assumed that by implementing a linear submapping, shortcuts assist the learning process in the remainder of the network. Here we find that this is not always the case: shortcut weights may also act as distractors that slow down convergence and can lead to inferior solutions. This problem can be addressed with slope centering, a particular form of gradient factor centering [2]. By removing the linear component of the error signal at a hidden node, slope centering effectively decouples that node from the shortcuts that bypass it. This eliminates the possibility of destructive interference from shortcut weights, and thus ensures that the benefits of shortcut connections are fully realized.
On Centering Neural Network Weight Updates
- Tricks of the Trade
, 1997
"... It has long been known that neural networks can learn faster when their input and hidden unit activities are centered about zero; recently we have extended this approach to also encompass the centering of error signals (Schraudolph and Sejnowski, 1996). Here we generalize this notion to all factors ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
It has long been known that neural networks can learn faster when their input and hidden unit activities are centered about zero; recently we have extended this approach to also encompass the centering of error signals (Schraudolph and Sejnowski, 1996). Here we generalize this notion to all factors involved in the weight update, leading us to propose centering the slope of hidden unit activation functions as well. Slope centering removes the linear component of backpropagated error; this improves credit assignment in networks with shortcut connections. Benchmark results show that this can speed up learning significantly without adversely affecting the trained network's generalization ability.
Dynamic Recurrent Neural Networks: a Dynamical Analysis
- IEEE TRANS. ON SYSTEMS MAN AND CYBERNETICS, PART B
, 1996
"... In this paper, we explore the dynamical features of a neural network model which presents two types of adaptative parameters : the classical weights between the units and the time constants associated with each artificial neuron. The purpose of this study is to provide a strong theoretical basis for ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
In this paper, we explore the dynamical features of a neural network model which presents two types of adaptative parameters : the classical weights between the units and the time constants associated with each artificial neuron. The purpose of this study is to provide a strong theoretical basis for modeling and simulating dynamic recurrent neural networks. In order to achieve this, we study the effect of the statistical distribution of the weights and of the time constants on the network dynamics and we make a statistical analysis of the neural transformation. We examine the network power spectra (to draw some conclusions over the frequential behavior of the network) and we compute the stability regions to explore the stability of the model. We show that the network is sensitive to the variations of the mean values of the weights and the time constants (because of the temporal aspects of the learned tasks). Nevertheless, our results highlight the improvements in the network dynamics d...
Accelerated Gradient Descent by Factor-Centering Decomposition
, 1998
"... Gradient factor centering is a new methodology for decomposing neural networks into biased and centered subnets which are then trained in parallel. The decomposition can be applied to any pattern-dependent factor in the network's gradient, and is designed such that the subnets are more amenable to o ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
Gradient factor centering is a new methodology for decomposing neural networks into biased and centered subnets which are then trained in parallel. The decomposition can be applied to any pattern-dependent factor in the network's gradient, and is designed such that the subnets are more amenable to optimization by gradient descent than the original network: biased subnets because of their simplified architecture, centered subnets due to a modified gradient that improves conditioning. The architectural and algorithmic modifications mandated by this approach include both familiar and novel elements, often in prescribed combinations. The framework suggests for instance that shortcut connections --- a well-known architectural feature --- should work best in conjunction with slope centering, a new technique described herein. Our benchmark experiments bear out this prediction, and show that factorcentering decomposition can speed up learning significantly without adversely affecting the train...
FIXED POINT ANALYSIS FOR RECURRENT NETWORKS
"... This paper provides a systematic analysis of the recurrentbackpropagation (RBP) algorithm, introducing a number of new results. The main limitation of the RBP algorithm is that it assumes the convergence of the network to a stable xed point in order to backpropagate the error signals. We show by exp ..."
Abstract
- Add to MetaCart
This paper provides a systematic analysis of the recurrentbackpropagation (RBP) algorithm, introducing a number of new results. The main limitation of the RBP algorithm is that it assumes the convergence of the network to a stable xed point in order to backpropagate the error signals. We show by experiment and eigenvalue analysis that this condition can be violated and that chaotic behavior can be avoided. Next we examine the advantages of RBP over the standard backpropagation algorithm. RBP is shown to build stable xed points corresponding to the input patterns. This makes it an appropriate tool for content addressable memories, one-to-many function learning, and inverse problems.
Online Local Gain Adaptation for Multi-Layer Perceptrons
, 1998
"... We introduce a new method for adapting the step size of each individual weight in a multi-layer perceptron trained by stochastic gradient descent. Our technique derives from the K1 algorithm for linear systems (Sutton, 1992b), which in turn is based on a diagonalized Kalman Filter. We expand upon Su ..."
Abstract
- Add to MetaCart
We introduce a new method for adapting the step size of each individual weight in a multi-layer perceptron trained by stochastic gradient descent. Our technique derives from the K1 algorithm for linear systems (Sutton, 1992b), which in turn is based on a diagonalized Kalman Filter. We expand upon Sutton's work in two regards: K1 is a) extended to multi-layer perceptrons, and b) made more efficient by linearizing an exponentiation operation. The resulting elk1 (extended, linearized K1) algorithm is computationally little more expensive than alternative proposals (Zimmermann, 1994; Almeida et al., 1997, 1998), and does not require an arbitrary smoothing parameter. In our benchmark experiments, elk1 consistently outperforms these alternatives, as well as stochastic gradient descent with momentum, even when the number of floating-point operations required per weight update is taken into account. Unlike the method of Almeida et al. (1997, 1998), elk1 does not require statistical independence between successive training patterns, and handles large initial learning rates well.

