Results 1-10 of 28
First- and Second-Order Methods for Learning: Between Steepest Descent and Newton's Method
Neural Computation, 1992
Cited by 126 (6 self)
On-line first-order backpropagation is sufficiently fast and effective for many large-scale classification problems, but for very high-precision mappings, batch processing may be the method of choice. This paper reviews first- and second-order optimization methods for learning in feedforward neural networks. The viewpoint is that of optimization: many methods can be cast in the language of optimization techniques, allowing the transfer to neural nets of detailed results about computational complexity and safety procedures to ensure convergence and to avoid numerical problems. The review is not intended to deliver detailed prescriptions for the most appropriate methods in specific applications, but to illustrate the main characteristics of the different methods and their mutual relations.
A New Class Of Incremental Gradient Methods For Least Squares Problems
SIAM J. Optim., 1996
Cited by 34 (3 self)
The LMS method for linear least squares problems differs from the steepest descent method in that it processes data blocks one-by-one, with intermediate adjustment of the parameter vector under optimization. This mode of operation often leads to faster convergence when far from the eventual limit, and to slower (sublinear) convergence when close to the optimal solution. We embed both LMS and steepest descent, as well as other intermediate methods, within a one-parameter class of algorithms, and we propose a hybrid class of methods that combine the faster early convergence rate of LMS with the faster ultimate linear convergence rate of steepest descent. These methods are well-suited for neural network training problems with large data sets. Furthermore, these methods allow the effective use of scaling based, for example, on diagonal or other approximations of the Hessian matrix. 1 Research supported by NSF under Grant 9300494DMI. 2 Dept. of Electrical Engineering and Computer Science, M...
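The contrast the abstract draws between LMS and steepest descent can be illustrated with a minimal sketch. It is not the paper's hybrid method: the synthetic data, the step size of 0.1, and all function names below are assumptions chosen only for illustration. LMS updates the parameters after every data point, while steepest descent makes one update per pass through the data.

```python
import random

def make_data(n=200, seed=0):
    """Synthetic 1-D regression data y = 2x + 1 + noise (illustrative only)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)))
    return data

def loss(w, b, data):
    """Mean squared residual over the full data set."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def steepest_descent_pass(w, b, data, step):
    """One batch step: average the gradient over all data, then update once."""
    gw = sum(2.0 * (w * x + b - y) * x for x, y in data) / len(data)
    gb = sum(2.0 * (w * x + b - y) for x, y in data) / len(data)
    return w - step * gw, b - step * gb

def lms_pass(w, b, data, step):
    """One LMS pass: update the parameters after every single data point."""
    for x, y in data:
        r = w * x + b - y  # residual at the current parameters
        w, b = w - step * 2.0 * r * x, b - step * 2.0 * r
    return w, b

data = make_data()
batch = lms = (0.0, 0.0)  # both methods start far from the solution
for _ in range(3):  # three passes through the data for each method
    batch = steepest_descent_pass(*batch, data, step=0.1)
    lms = lms_pass(*lms, data, step=0.1)

batch_loss = loss(*batch, data)
lms_loss = loss(*lms, data)
print(f"steepest descent loss after 3 passes: {batch_loss:.4f}")
print(f"LMS loss after 3 passes:              {lms_loss:.4f}")
```

With the same number of passes, the incremental updates get much closer to the solution in this early phase. The sketch does not show the slower convergence of LMS near the optimum that the abstract also mentions.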
The Ordered Subsets Mirror Descent Optimization Method with Applications to Tomography
SIAM J. Optim., 2001
Cited by 34 (6 self)
We describe an optimization problem arising in reconstructing 3D medical images from Positron Emission Tomography (PET). A mathematical model of the problem, based on the Maximum Likelihood principle, is posed as a problem of minimizing a convex function of several million variables over the standard simplex. To solve a problem of these characteristics, we develop and implement a new algorithm, Ordered Subsets Mirror Descent, and demonstrate, theoretically and computationally, that it is well suited for solving the PET reconstruction problem. Key words: positron emission tomography, maximum likelihood, image reconstruction, convex optimization, mirror descent.
Training Neural Nets with the Reactive Tabu Search
Cited by 33 (7 self)
In this paper the task of training subsymbolic systems is considered as a combinatorial optimization problem and solved with the heuristic scheme of the Reactive Tabu Search. An iterative optimization process based on a "modified greedy search" component is complemented with a meta-strategy to realize a discrete dynamical system that discourages limit cycles and the confinement of the search trajectory in a limited portion of the search space. The possible cycles are discouraged by prohibiting (i.e., making tabu) the execution of moves that reverse the ones applied in the most recent part of the search, for a prohibition period that is adapted in an automated way. The confinement is avoided and a proper exploration is obtained by activating a diversification strategy when too many configurations are repeated excessively often. The RTS method is applicable to non-differentiable functions; it is robust with respect to the random initialization and effective in continuing the search after local minima. Three tests of the technique on feedforward and feedback systems are presented.
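The prohibition mechanism described above can be sketched roughly as follows. This is a simplified illustration, not the authors' algorithm: the bit-string encoding, the rule for adapting the prohibition period, the toy cost function, and every name in the code are assumptions, and the diversification strategy is omitted entirely.

```python
import random

def reactive_tabu_search(cost, n_bits, iters=200, seed=0):
    """Minimize cost over bit strings using single-bit-flip moves.

    Assumes n_bits >= 2 so that at least one move is always allowed.
    """
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n_bits)]
    best, best_cost = tuple(x), cost(x)
    last_flip = [-10**9] * n_bits  # iteration at which each bit was last flipped
    visited = {}                   # configuration -> number of visits
    tenure = 1                     # current prohibition period (assumed rule)
    for t in range(iters):
        key = tuple(x)
        visited[key] = visited.get(key, 0) + 1
        if visited[key] > 1:
            # Reaction: a repeated configuration lengthens the prohibition.
            tenure = min(n_bits - 1, tenure + 1)
        elif t % 10 == 0:
            # Slow relaxation while no repetition is observed.
            tenure = max(1, tenure - 1)
        # Moves that would reverse a recent flip are tabu.
        allowed = [i for i in range(n_bits) if t - last_flip[i] > tenure]

        def cost_after_flip(i):
            y = x[:]
            y[i] ^= 1
            return cost(y)

        # Greedy step over allowed moves, accepted even if it goes uphill.
        i = min(allowed, key=cost_after_flip)
        x[i] ^= 1
        last_flip[i] = t
        c = cost(x)
        if c < best_cost:
            best, best_cost = tuple(x), c
    return best, best_cost

# Hypothetical toy problem: pick weights whose sum is close to a target.
WEIGHTS = [3, 5, 7, 11, 13, 17]
TARGET = 31

def subset_cost(bits):
    return (sum(w * b for w, b in zip(WEIGHTS, bits)) - TARGET) ** 2

best, best_cost = reactive_tabu_search(subset_cost, n_bits=len(WEIGHTS))
print(best, best_cost)
```

Because the best allowed move is taken even when it increases the cost, the search can leave a local minimum, and the prohibition keeps it from stepping straight back.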
On-Line Learning Processes in Artificial Neural Networks
1993
Cited by 31 (4 self)
We study on-line learning processes in artificial neural networks from a general point of view. On-line learning means that a learning step takes place at each presentation of a randomly drawn training pattern. It can be viewed as a stochastic process governed by a continuous-time master equation. On-line learning is necessary if not all training patterns are available all the time. This occurs in many applications when the training patterns are drawn from a time-dependent environmental distribution. Studying learning in a changing environment, we encounter a conflict between the adaptability and the confidence of the network's representation. Minimization of a criterion incorporating both effects yields an algorithm for on-line adaptation of the learning parameter. The inherent noise of on-line learning makes it possible to escape from undesired local minima of the error potential on which the learning rule performs (stochastic) gradient descent. We try to quantify these often made cl...
Serial And Parallel Backpropagation Convergence Via Nonmonotone Perturbed Minimization
Optimization Methods and Software, 1994
Cited by 28 (11 self)
A general convergence theorem is proposed for a family of serial and parallel nonmonotone unconstrained minimization methods with perturbations. A principal application of the theorem is to establish convergence of backpropagation (BP), the classical algorithm for training artificial neural networks. Under certain natural assumptions, such as divergence of the sum of the learning rates and convergence of the sum of their squares, it is shown that every accumulation point of the BP iterates is a stationary point of the error function associated with the given set of training examples. The results presented cover serial and parallel BP, as well as modified BP with a momentum term.
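The learning-rate assumptions named in the abstract can be written out as below. The symbol $\eta_t$ for the learning rate at iteration $t$ is an assumed notation, and the harmonic schedule is only one familiar example that satisfies both conditions.

```latex
\sum_{t=1}^{\infty} \eta_t = \infty,
\qquad
\sum_{t=1}^{\infty} \eta_t^{2} < \infty,
\qquad \text{e.g. } \eta_t = \frac{1}{t}:
\quad \sum_{t=1}^{\infty} \frac{1}{t} = \infty,
\quad \sum_{t=1}^{\infty} \frac{1}{t^{2}} = \frac{\pi^{2}}{6}.
```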
A convergent incremental gradient method with constant step size
SIAM J. Optim., 2004
Cited by 26 (2 self)
An incremental gradient method for minimizing a sum of continuously differentiable functions is presented. The method requires a single gradient evaluation per iteration and uses a constant step size. For the case that the gradient is bounded and Lipschitz continuous, we show that the method visits regions in which the gradient is small infinitely often. Under certain unimodality assumptions, global convergence is established. In the quadratic case, a global linear rate of convergence is shown. The method is applied to distributed optimization problems arising in wireless sensor networks, and numerical experiments compare the new method with the standard incremental gradient method.
Incremental Gradient Algorithms with Stepsizes Bounded Away From Zero
Computational Opt. and Appl., 1998
Cited by 25 (2 self)
We consider the class of incremental gradient methods for minimizing a sum of continuously differentiable functions. An important novel feature of our analysis is that the stepsizes are kept bounded away from zero. We derive the first convergence results of any kind for this computationally important case. In particular, we show that a certain ε-approximate solution can be obtained and establish the linear dependence of ε on the stepsize limit. Incremental gradient methods are particularly well-suited for large neural network training problems where obtaining an approximate solution is typically sufficient and is often preferable to computing an exact solution. Thus, in the context of neural networks, the approach presented here is related to the principle of tolerant training. Our results justify numerous stepsize rules that were derived on the basis of extensive numerical experimentation but for which no theoretical analysis was previously available. In addition, convergence to (exact) stationary points is established when the gradient satisfies a certain growth property.
An Efficient Method to Construct a Radial Basis Function Neural Network Classifier
1997
Cited by 23 (1 self)
A radial basis function neural network (RBFN) has the power of universal function approximation, but it is usually not straightforward to construct an RBFN to solve a given problem. This paper describes a method to construct an RBFN classifier efficiently and effectively. The method determines the middle layer neurons by a fast clustering algorithm and computes the optimal weights between the middle and the output layers statistically. We applied the proposed method to construct an RBFN classifier for unconstrained handwritten digit recognition. The experiment showed that the method could construct an RBFN classifier quickly and that the performance of the classifier was better than the best result previously reported. Keywords: Radial Basis Function, Linear Discriminant Function, Classification, APC-III, Clustering, GRBF, LMS, Handwritten Digit Recognition
A Counterexample to Temporal Differences Learning
1995
Cited by 22 (3 self)
Sutton's TD(λ) method aims to provide a representation of the cost function in an absorbing Markov chain with transition costs. A simple example is given where the representation obtained depends on λ. For λ = 1 the representation is optimal with respect to a least-squares error criterion, but as λ decreases toward 0 the representation becomes progressively worse and, in some cases, very poor. The example suggests a need to understand better the circumstances under which TD(0) and Q-learning obtain satisfactory neural network-based compact representations of the cost function. A variation of TD(0) is also given, which performs better on the example.
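For orientation, the usual textbook form of the TD(λ) update for a parametric cost representation is sketched below. The notation is assumed rather than taken from the paper: $\tilde{J}(i, w)$ is the approximate cost of state $i$ with parameter vector $w$, $g(i_t, i_{t+1})$ is the transition cost, and $\eta$ is a step size.

```latex
d_t = g(i_t, i_{t+1}) + \tilde{J}(i_{t+1}, w) - \tilde{J}(i_t, w),
\qquad
w \leftarrow w + \eta \, d_t \sum_{k=0}^{t} \lambda^{\,t-k} \, \nabla_w \tilde{J}(i_k, w).
```

With $\lambda = 1$ every earlier state receives full credit for each temporal difference $d_t$, while with $\lambda = 0$ only the current state $i_t$ is updated.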