Results 1-10 of 29
Natural Gradient Works Efficiently in Learning
Neural Computation, 1998
Cited by 289 (16 self)
When a parameter space has a certain underlying structure, the ordinary gradient of a function does not represent its steepest direction but the natural gradient does. Information geometry is used for calculating the natural gradients in the parameter space of perceptrons, the space of matrices (for blind source separation) and the space of linear dynamical systems (for blind source deconvolution). The dynamical behavior of natural gradient online learning is analyzed and is proved to be Fisher efficient, implying that it has asymptotically the same performance as the optimal batch estimation of parameters. This suggests that the plateau phenomenon which appears in the backpropagation learning algorithm of multilayer perceptrons might disappear or might not be so serious when the natural gradient is used. An adaptive method of updating the learning rate is proposed and analyzed. 1 Introduction The stochastic gradient method (Widrow, 1963; Amari, 1967; Tsypkin, 1973; Rumelhart et al...
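As a rough illustration of the idea in this abstract, the sketch below compares an ordinary gradient step with a natural gradient step, where the gradient is premultiplied by the inverse of a metric matrix G. The quadratic loss, the choice G = A, and all function names are assumptions made for this example; they are not the Fisher information metric derived in the paper for perceptrons.

```python
# Minimal sketch: ordinary vs. natural gradient step on a toy quadratic loss
# L(w) = 0.5 * w^T A w. The metric G = A is an assumption for illustration.

def solve_2x2(G, g):
    """Solve G x = g for a 2x2 matrix G using Cramer's rule."""
    (a, b), (c, d) = G
    det = a * d - b * c
    return [(d * g[0] - b * g[1]) / det, (a * g[1] - c * g[0]) / det]

def grad(A, w):
    """Ordinary gradient of L(w) = 0.5 * w^T A w, which is A w."""
    return [A[0][0] * w[0] + A[0][1] * w[1],
            A[1][0] * w[0] + A[1][1] * w[1]]

def step(w, direction, lr):
    """Move w against the given direction with learning rate lr."""
    return [w[0] - lr * direction[0], w[1] - lr * direction[1]]

A = [[100.0, 0.0], [0.0, 1.0]]   # ill-conditioned curvature
G = A                            # assumed metric (hypothetical choice)
w0 = [1.0, 1.0]

# Ordinary gradient: the step size is limited by the steep coordinate,
# so the flat coordinate barely moves.
w_plain = step(w0, grad(A, w0), lr=0.01)

# Natural gradient: the direction is G^{-1} times the ordinary gradient,
# which rescales both coordinates consistently.
w_nat = step(w0, solve_2x2(G, grad(A, w0)), lr=1.0)

print("ordinary gradient step:", w_plain)
print("natural gradient step: ", w_nat)
```

In this toy setting the metric matches the curvature exactly, which is the most favourable case; with a real Fisher metric the two generally differ.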
Neural Learning in Structured Parameter Spaces - Natural Riemannian Gradient
In Advances in Neural Information Processing Systems, 1997
Cited by 44 (4 self)
The parameter space of neural networks has a Riemannian metric structure. The natural Riemannian gradient should be used instead of the conventional gradient, since the former denotes the steepest descent direction of a loss function in the Riemannian space. The stochastic gradient learning algorithm is much more effective if the natural gradient is used. The present paper studies the information-geometrical structure of perceptrons and other networks, and proves that the online learning method based on the natural gradient is asymptotically as efficient as the optimal batch algorithm. Adaptive modification of the learning constant is proposed and analyzed in terms of the Riemannian measure and is shown to be efficient. The natural gradient is finally applied to blind separation of mixed independent signal sources. 1 Introduction Neural learning takes place in the parameter space of modifiable synaptic weights of a neural network. The role of each parameter is dif...
On-Line Learning Processes in Artificial Neural Networks
1993
Cited by 31 (4 self)
We study online learning processes in artificial neural networks from a general point of view. Online learning means that a learning step takes place at each presentation of a randomly drawn training pattern. It can be viewed as a stochastic process governed by a continuous-time master equation. Online learning is necessary if not all training patterns are available all the time. This occurs in many applications when the training patterns are drawn from a time-dependent environmental distribution. Studying learning in a changing environment, we encounter a conflict between the adaptability and the confidence of the network's representation. Minimization of a criterion incorporating both effects yields an algorithm for online adaptation of the learning parameter. The inherent noise of online learning makes it possible to escape from undesired local minima of the error potential on which the learning rule performs (stochastic) gradient descent. We try to quantify these often made cl...
Online Learning in Changing Environments with Applications in Supervised and Unsupervised Learning
2002
Cited by 9 (3 self)
An adaptive online algorithm extending the learning-of-learning idea is proposed and theoretically motivated. Relying only on gradient flow information, it can be applied to learning continuous functions or distributions, even when no explicit loss function is given and the Hessian is not available. The framework is applied to unsupervised and supervised learning. Its efficiency is demonstrated for drifting and switching nonstationary blind separation tasks of acoustic signals. Furthermore, applications to classification (USPS data set) and time-series prediction in changing environments are presented.
Superefficiency in Blind Source Separation
Cited by 9 (3 self)
Blind source separation extracts independent component signals from their mixtures without knowing either the mixing coefficients or the probability distributions of the source signals. It is known that some algorithms work surprisingly well. The present paper elucidates the superefficiency of algorithms based on the statistical analysis. It is known in general from the asymptotic theory of statistical analysis that the covariance of any two extracted independent signals converges to 0 on the order of $1/t$ in the case of statistical estimation using $t$ examples. In the case of online learning, the theory of online dynamics shows that the covariances converge to 0 on the order of $\eta$ when the learning rate $\eta$ is fixed to be a small constant. In contrast with the above general properties, the surprising superefficiency holds in blind source separation under certain conditions. The superefficiency implies that the covariance decreases on the order of $1/t^2$ or of $\eta^2$. The present paper uses t...
Online Pattern Analysis by Evolving Self-Organizing Maps
J. of Neurocomputing, 2003
Cited by 8 (2 self)
Many real-world data analysis and processing tasks require systems with the ability of online, self-adaptive learning. In this paper we present some theoretical background for the Evolving Self-Organising Map (ESOM) model and further apply it in solving some online pattern analysis problems. Results are compared with some benchmarks.
Stochastic Dynamics of Learning with Momentum in Neural Networks
1994
Cited by 7 (4 self)
We study online learning with momentum term for nonlinear learning rules. Through introduction of auxiliary variables, we show that the learning process can be described by a Markov process. For small learning parameters $\eta$ and momentum parameters $\alpha$ close to 1, such that $\gamma = \eta/(1-\alpha)^2$ is finite, the time scales for the evolution of the weights and the auxiliary variables are the same. In this case Van Kampen's expansion can be applied in a straightforward manner. We obtain evolution equations for the average network state and the fluctuations around this average. These evolution equations depend (after rescaling of time and fluctuations) only on $\gamma$: all combinations $(\eta, \alpha)$ with the same value of $\gamma$ give rise to similar behaviour. The case $\alpha$ constant and $\eta$ small requires a completely different analysis. There are two different time scales: a fast time scale on which the auxiliary variables equilibrate and a slow time scale for the change of the weights. By projection on t...
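A small sketch may help make the momentum setup in this abstract concrete. It writes the momentum rule with one auxiliary variable (the previous weight change), so that the pair (weight, auxiliary variable) evolves as a first-order process, and it computes the scaling parameter, read here as $\gamma = \eta/(1-\alpha)^2$. The noise-free quadratic error, the function names, and the two parameter pairs are assumptions chosen for illustration; the code does not reproduce the paper's analysis.

```python
# Minimal sketch: momentum learning with an auxiliary variable v,
# plus the scaling parameter gamma = eta / (1 - alpha)^2.
# The quadratic error E(w) = 0.5 * w^2 is an assumed toy problem.

def gamma(eta, alpha):
    """Scaling parameter combining learning rate eta and momentum alpha."""
    return eta / (1.0 - alpha) ** 2

def momentum_step(w, v, grad, eta, alpha):
    """One update: v holds the previous weight change (auxiliary variable)."""
    v_new = alpha * v - eta * grad
    return w + v_new, v_new

def run(eta, alpha, steps, w0=1.0):
    """Iterate the momentum rule on E(w) = 0.5 * w^2, whose gradient is w."""
    w, v = w0, 0.0
    for _ in range(steps):
        w, v = momentum_step(w, v, grad=w, eta=eta, alpha=alpha)
    return w

# Two hypothetical (eta, alpha) pairs chosen to share the same gamma.
pairs = [(0.01, 0.9), (0.0025, 0.95)]
gammas = [gamma(eta, alpha) for eta, alpha in pairs]

for (eta, alpha), g in zip(pairs, gammas):
    print(f"eta={eta}, alpha={alpha}, gamma={g:.6f}, "
          f"w after 2000 steps={run(eta, alpha, 2000):.3e}")
```

The example only checks that both pairs share one value of the scaling parameter and that the iteration settles at the minimum; it makes no claim about the fluctuation behaviour studied in the paper.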
Approximation with neural networks: Between local and global approximation
In: Proceedings of the 1995 International Conference on Neural Networks, 1995
Cited by 6 (0 self)
We investigate neural-network-based approximation methods. These methods depend on the locality of the basis functions. After discussing local and global basis functions, we propose a multiresolution hierarchical method. The various resolutions are stored at various levels in a tree. At the root of the tree, a global approximation is kept; the leaves store the learning samples themselves. Intermediate nodes store intermediate representations. In order to find an optimal partitioning of the input space, self-organising maps (SOMs) are used. The proposed method has implementational problems reminiscent of those encountered in many-particle simulations. We will investigate the parallel implementation of this method, using parallel hierarchical methods for many-particle simulations as a starting point. 1. Introduction The use of neural networks for the approximation of functions of high dimensionality from randomly distributed learning samples has long been established. Successful res...
Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks
2004
Cited by 6 (3 self)
Gradient-following learning methods can encounter problems of implementation in many applications, and stochastic variants are sometimes used to overcome these difficulties. We analyze three online training methods used with a linear perceptron: direct gradient descent, node perturbation, and weight perturbation. Learning speed is defined as the rate of exponential decay in the learning curves. When the scalar parameter that controls the size of weight updates is chosen to maximize learning speed, node perturbation is slower than direct gradient descent by a factor equal to the number of output units; weight perturbation is slower still by an additional factor equal to the number of input units. Parallel perturbation allows faster learning than sequential perturbation, by a factor that does not depend on network size. We also characterize how uncertainty in quantities used in the stochastic updates affects the learning curves. This study suggests that in practice, weight perturbation may be slow for large networks, and node perturbation can have performance comparable to that of direct gradient descent when there are few output units. However, these statements depend on the specifics of the learning problem, such as the input distribution and the target function, and are not universally applicable.
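The three update rules named in this abstract can be sketched for a small linear perceptron as follows. This is a minimal illustration under stated assumptions: a squared-error loss, Gaussian perturbations of standard deviation sigma, and hypothetical function names and toy values. It only checks that the two perturbation estimates agree with the direct gradient on average; it does not reproduce the learning-speed results of the paper.

```python
# Minimal sketch: direct gradient, weight perturbation, and node perturbation
# for a linear perceptron y = W x with error E = 0.5 * ||y - t||^2.
import random

def forward(W, x):
    """Output of the linear perceptron."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def error(y, t):
    """Squared error between output y and target t."""
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, t))

def direct_gradient(W, x, t):
    """Exact gradient dE/dW = (y - t) x^T."""
    y = forward(W, x)
    return [[(yi - ti) * xj for xj in x] for yi, ti in zip(y, t)]

def weight_perturbation(W, x, t, sigma, rng):
    """Perturb every weight, then correlate the error change with the noise."""
    xi = [[rng.gauss(0.0, sigma) for _ in row] for row in W]
    W_pert = [[w + n for w, n in zip(row, nrow)] for row, nrow in zip(W, xi)]
    delta = error(forward(W_pert, x), t) - error(forward(W, x), t)
    return [[delta * n / sigma ** 2 for n in nrow] for nrow in xi]

def node_perturbation(W, x, t, sigma, rng):
    """Perturb each output unit, then correlate the error change with the noise."""
    y = forward(W, x)
    xi = [rng.gauss(0.0, sigma) for _ in y]
    delta = error([yi + n for yi, n in zip(y, xi)], t) - error(y, t)
    return [[delta * n / sigma ** 2 * xj for xj in x] for n in xi]

def average(estimator, W, x, t, sigma, n_samples, seed):
    """Average a stochastic gradient estimate over many perturbations."""
    rng = random.Random(seed)
    acc = [[0.0 for _ in row] for row in W]
    for _ in range(n_samples):
        est = estimator(W, x, t, sigma, rng)
        for i, row in enumerate(est):
            for j, value in enumerate(row):
                acc[i][j] += value / n_samples
    return acc

# Toy problem (assumed values): one output unit, two input units.
W = [[0.0, 0.0]]
x = [1.0, 2.0]
t = [1.0]

g_true = direct_gradient(W, x, t)
g_wp = average(weight_perturbation, W, x, t, sigma=0.01, n_samples=20000, seed=0)
g_np = average(node_perturbation, W, x, t, sigma=0.01, n_samples=20000, seed=1)

print("direct gradient:     ", g_true)
print("weight perturbation: ", g_wp)
print("node perturbation:   ", g_np)
```

Each perturbation estimate is noisy on any single trial, which is why many samples are averaged here; in online learning each noisy estimate would be applied directly as an update.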
On Fokker-Planck approximations of online learning processes
Journal of Physics A, 1994
Cited by 6 (5 self)
There are several ways to describe online learning in neural networks. The two major ones are a continuous-time master equation and a discrete-time random-walk equation. The random-walk equation is obtained in the case of fixed time intervals between subsequent learning steps; the master equation results when the time intervals are drawn from a Poisson distribution. Following Van Kampen [1], we give a rigorous expansion of both the master and the random-walk equation in the limit of small learning parameters. The results explain the difference between the Fokker-Planck approaches proposed by Radons et al. [2] and Hansen et al. [3]. Furthermore, we find that the mathematical validity of these approaches is restricted to local properties of the learning process. Yet Fokker-Planck approaches are often suggested as models to study global properties, such as mean first passage times and stationary solutions. To check their accuracy and usefulness in these situations we compare simulations of t...