Results 1–10 of 41
First- and Second-Order Methods for Learning: Between Steepest Descent and Newton's Method
 Neural Computation
, 1992
Abstract

Cited by 162 (7 self)
Online first-order backpropagation is sufficiently fast and effective for many large-scale classification problems, but for very high-precision mappings, batch processing may be the method of choice. This paper reviews first- and second-order optimization methods for learning in feedforward neural networks. The viewpoint is that of optimization: many methods can be cast in the language of optimization techniques, allowing the transfer to neural nets of detailed results about computational complexity and safety procedures to ensure convergence and to avoid numerical problems. The review is not intended to deliver detailed prescriptions for the most appropriate methods in specific applications, but to illustrate the main characteristics of the different methods and their mutual relations.
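The gap between the first- and second-order updates the review contrasts is easy to see on an ill-conditioned quadratic. The following toy sketch (not taken from the paper; all names are made up for the example) compares a fixed-rate gradient step with a Newton step:

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 * x^T A x, minimized at the origin.
A = np.diag([1.0, 100.0])
grad = lambda x: A @ x
hess = lambda x: A

def steepest_descent_step(x, lr):
    # First-order update: step against the gradient with a fixed rate.
    return x - lr * grad(x)

def newton_step(x):
    # Second-order update: rescale the gradient by the inverse Hessian.
    return x - np.linalg.solve(hess(x), grad(x))

print(newton_step(np.array([1.0, 1.0])))   # exact minimum in one step

x = np.array([1.0, 1.0])
for _ in range(100):                       # rate is limited by the stiff direction
    x = steepest_descent_step(x, lr=0.005)
print(x)                                   # still far from 0 along the shallow axis
```

The step size for steepest descent must stay below 2 divided by the largest curvature (here 100), which is exactly why progress along the shallow direction is so slow.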
Parallel stochastic gradient algorithms for large-scale matrix completion
 Mathematical Programming Computation
, 2013
Abstract

Cited by 69 (7 self)
This paper develops Jellyfish, an algorithm for solving data-processing problems with matrix-valued decision variables regularized to have low rank. Particular examples of problems solvable by Jellyfish include matrix completion problems and least-squares problems regularized by the nuclear norm or γ2-norm. Jellyfish implements a projected incremental gradient method with a biased, random ordering of the increments. This biased ordering allows for a parallel implementation that admits a speedup nearly proportional to the number of processors. On large-scale matrix completion tasks, Jellyfish is orders of magnitude more efficient than existing codes. For example, on the Netflix Prize data set, prior art computes rating predictions in approximately 4 hours, while Jellyfish solves the same problem in under 3 minutes on a 12-core workstation.
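A minimal serial sketch of the incremental-gradient idea underlying this line of work: fit a low-rank factorization to observed entries, one entry at a time, in a random order. This is not Jellyfish itself (no parallelism, no biased ordering; plain uniform shuffling), and all sizes and rates below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth rank-2 matrix and a roughly 50% sample of observed entries.
m, n, r = 30, 20, 2
M = rng.normal(size=(m, r)) @ rng.normal(size=(n, r)).T
obs = [(i, j) for i in range(m) for j in range(n) if rng.random() < 0.5]

# Factored model M ~ L R^T, updated one observed entry at a time.
L = 0.3 * rng.normal(size=(m, r))
R = 0.3 * rng.normal(size=(n, r))
lr = 0.02
for epoch in range(300):
    for k in rng.permutation(len(obs)):   # random ordering of the increments
        i, j = obs[k]
        err = L[i] @ R[j] - M[i, j]
        # Simultaneous update of the two factor rows touched by this entry.
        L[i], R[j] = L[i] - lr * err * R[j], R[j] - lr * err * L[i]

rmse = np.sqrt(np.mean([(L[i] @ R[j] - M[i, j]) ** 2 for i, j in obs]))
print(rmse)   # training RMSE on the observed entries becomes small
```

Because each update touches only one row of `L` and one row of `R`, updates on non-overlapping entries commute, which is what a careful ordering can exploit for parallel speedup.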
A New Class Of Incremental Gradient Methods For Least Squares Problems
 SIAM J. Optim
, 1996
Abstract

Cited by 65 (2 self)
The LMS method for linear least squares problems differs from the steepest descent method in that it processes data blocks one by one, with intermediate adjustment of the parameter vector under optimization. This mode of operation often leads to faster convergence when far from the eventual limit, and to slower (sublinear) convergence when close to the optimal solution. We embed both LMS and steepest descent, as well as other intermediate methods, within a one-parameter class of algorithms, and we propose a hybrid class of methods that combine the faster early convergence rate of LMS with the faster ultimate linear convergence rate of steepest descent. These methods are well suited for neural network training problems with large data sets. Furthermore, these methods allow the effective use of scaling based, for example, on diagonal or other approximations of the Hessian matrix.
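The two endpoints of the one-parameter class can be sketched side by side on a small least-squares problem. This is an illustrative toy (the problem data and rates are assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 3))
x_true = np.array([2.0, -1.0, 0.5])
b = A @ x_true + 0.01 * rng.normal(size=100)

def lms(A, b, lr=0.01, epochs=50):
    # LMS: process the data rows one by one, adjusting the parameter
    # vector after every single row.
    x = np.zeros(A.shape[1])
    for _ in range(epochs):
        for a_i, b_i in zip(A, b):
            x -= lr * (a_i @ x - b_i) * a_i
    return x

def steepest_descent(A, b, lr=1e-3, iters=5000):
    # Batch method: one full-gradient step per pass over all the data.
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x -= lr * A.T @ (A @ x - b)
    return x

print(lms(A, b))                # both land near x_true = [2, -1, 0.5]
print(steepest_descent(A, b))
```

LMS makes 100 cheap adjustments per pass and moves quickly at first; the batch method makes one step per pass but converges linearly near the solution, which is exactly the trade-off the hybrid class interpolates.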
A convergent incremental gradient method with constant step size
 SIAM J. OPTIM
, 2004
Abstract

Cited by 63 (3 self)
An incremental gradient method for minimizing a sum of continuously differentiable functions is presented. The method requires a single gradient evaluation per iteration and uses a constant step size. For the case that the gradient is bounded and Lipschitz continuous, we show that the method visits regions in which the gradient is small infinitely often. Under certain unimodality assumptions, global convergence is established. In the quadratic case, a global linear rate of convergence is shown. The method is applied to distributed optimization problems arising in wireless sensor networks, and numerical experiments compare the new method with the standard incremental gradient method.
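In the sensor-network setting the abstract mentions, each component function can be thought of as one node's local objective. A minimal sketch (toy data, not from the paper) of a constant-stepsize incremental gradient pass over such components:

```python
# Each node i of a toy network holds a measurement y_i; the network
# minimizes sum_i 0.5 * (x - y_i)^2, whose minimizer is the mean of the y_i.
y = [1.0, 3.0, 8.0, 4.0]
alpha = 0.05                      # constant step size
x = 0.0
for _ in range(500):              # cycle through the component functions
    for y_i in y:
        x -= alpha * (x - y_i)    # one single-component gradient step
print(x)                          # settles near mean(y) = 4.0, within O(alpha)
```

With a constant step the iterate does not converge to the exact minimizer but to a limit cycle around it whose radius shrinks with `alpha`, consistent with the paper's "gradient is small infinitely often" style of guarantee.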
The Ordered Subsets Mirror Descent Optimization Method with Applications to Tomography
 SIAM J. Optim
, 2001
Abstract

Cited by 48 (5 self)
We describe an optimization problem arising in reconstructing 3D medical images from Positron Emission Tomography (PET). A mathematical model of the problem, based on the Maximum Likelihood principle, is posed as a problem of minimizing a convex function of several million variables over the standard simplex. To solve a problem of these characteristics, we develop and implement a new algorithm, Ordered Subsets Mirror Descent, and demonstrate, theoretically and computationally, that it is well suited for solving the PET reconstruction problem. Key words: positron emission tomography, maximum likelihood, image reconstruction, convex optimization, mirror descent.
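Mirror descent over the simplex with the entropy mirror map reduces to a multiplicative (exponentiated-gradient) update. The sketch below illustrates only that basic setup on a toy linear objective, not the ordered-subsets variant or the PET likelihood:

```python
import numpy as np

def mirror_descent_simplex(grad, x0, lr=0.1, iters=200):
    # Entropic mirror descent: a multiplicative update followed by
    # normalization, which keeps the iterate on the probability simplex.
    x = x0.copy()
    for _ in range(iters):
        x = x * np.exp(-lr * grad(x))
        x /= x.sum()
    return x

# Toy objective: f(x) = <c, x> over the simplex, minimized at the
# vertex with the smallest cost coefficient.
c = np.array([3.0, 1.0, 2.0])
x = mirror_descent_simplex(lambda x: c, np.full(3, 1 / 3))
print(x)   # mass concentrates on coordinate 1
```

No projection step is needed: the entropy geometry handles the simplex constraint for free, which is one reason mirror descent scales to millions of variables.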
Serial And Parallel Backpropagation Convergence Via Nonmonotone Perturbed Minimization
 OPTIMIZATION METHODS AND SOFTWARE
, 1994
Abstract

Cited by 37 (11 self)
A general convergence theorem is proposed for a family of serial and parallel nonmonotone unconstrained minimization methods with perturbations. A principal application of the theorem is to establish convergence of backpropagation (BP), the classical algorithm for training artificial neural networks. Under certain natural assumptions, such as divergence of the sum of the learning rates and convergence of the sum of their squares, it is shown that every accumulation point of the BP iterates is a stationary point of the error function associated with the given set of training examples. The results presented cover serial and parallel BP, as well as modified BP with a momentum term.
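The two learning-rate conditions (divergent sum, convergent sum of squares) are easy to demonstrate on the simplest possible stochastic objective. This toy sketch (not from the paper) uses the classical choice lr_t = 1/t:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=5.0, scale=1.0, size=10_000)

# Stochastic gradient descent on f(x) = E[0.5 * (x - d)^2] under the
# classical conditions: the sum of the rates diverges while the sum
# of their squares converges (here lr_t = 1/t).
x = 0.0
for t, d in enumerate(data, start=1):
    x -= (1.0 / t) * (x - d)   # noisy gradient from a single sample
print(x)                        # close to the true mean, 5.0
```

With lr_t = 1/t this recursion is exactly the running sample mean: the divergent sum lets the iterate travel anywhere, while the square-summable rates damp the noise enough for the accumulation points to be stationary.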
On-Line Learning Processes in Artificial Neural Networks
, 1993
Abstract

Cited by 35 (4 self)
We study online learning processes in artificial neural networks from a general point of view. Online learning means that a learning step takes place at each presentation of a randomly drawn training pattern. It can be viewed as a stochastic process governed by a continuous-time master equation. Online learning is necessary if not all training patterns are available all the time. This occurs in many applications when the training patterns are drawn from a time-dependent environmental distribution. Studying learning in a changing environment, we encounter a conflict between the adaptability and the confidence of the network's representation. Minimization of a criterion incorporating both effects yields an algorithm for online adaptation of the learning parameter. The inherent noise of online learning makes it possible to escape from undesired local minima of the error potential on which the learning rule performs (stochastic) gradient descent. We try to quantify these often made cl...
Training Neural Nets with the Reactive Tabu Search
Abstract

Cited by 35 (8 self)
In this paper, the task of training subsymbolic systems is considered as a combinatorial optimization problem and solved with the heuristic scheme of the Reactive Tabu Search. An iterative optimization process based on a "modified greedy search" component is complemented with a meta-strategy to realize a discrete dynamical system that discourages limit cycles and the confinement of the search trajectory in a limited portion of the search space. The possible cycles are discouraged by prohibiting (i.e., making tabu) the execution of moves that reverse the ones applied in the most recent part of the search, for a prohibition period that is adapted in an automated way. The confinement is avoided and a proper exploration is obtained by activating a diversification strategy when too many configurations are repeated excessively often. The RTS method is applicable to non-differentiable functions, it is robust with respect to the random initialization, and it is effective in continuing the search after local minima. Three tests of the technique on feedforward and feedback systems are presented.
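The prohibition mechanism can be sketched with a basic tabu search on bit strings. Note the simplification: the Reactive Tabu Search adapts the prohibition period automatically, whereas this toy (with made-up names and objective) uses a fixed tenure:

```python
import random

def tabu_search(f, n_bits, iters=200, tenure=5, seed=0):
    # Basic tabu search sketch: flip one bit per move, and forbid
    # reversing a recent move for a fixed prohibition period (tenure).
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n_bits)]
    best, best_f = x[:], f(x)
    tabu = {}                        # bit index -> iteration until which it is tabu
    for t in range(iters):
        # Evaluate all single-bit flips; keep the non-tabu ones, plus any
        # tabu move that beats the best value found so far (aspiration).
        candidates = []
        for i in range(n_bits):
            y = x[:]
            y[i] ^= 1
            fy = f(y)
            if tabu.get(i, -1) < t or fy < best_f:
                candidates.append((fy, i, y))
        fy, i, x = min(candidates)   # best admissible move, even if uphill
        tabu[i] = t + tenure         # the reversing move is now prohibited
        if fy < best_f:
            best, best_f = x[:], fy
    return best, best_f

# Toy objective: number of bits differing from a hidden target string.
target = [1, 0, 1, 1, 0, 0, 1, 0]
f = lambda x: sum(a != b for a, b in zip(x, target))
print(tabu_search(f, 8))   # finds the target, value 0
```

Accepting the best admissible move even when it is uphill is what lets the search continue past local minima, and the tabu list is what keeps it from immediately undoing that escape.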
Approximation Accuracy, Gradient Methods, and Error Bound for Structured Convex Optimization
, 2009
Abstract

Cited by 34 (1 self)
Convex optimization problems arising in applications, possibly as approximations of intractable problems, are often structured and large scale. When the data are noisy, it is of interest to bound the solution error relative to the (unknown) solution of the original noiseless problem. Related to this is an error bound for the linear convergence analysis of first-order gradient methods for solving these problems. Example applications include compressed sensing, variable selection in regression, TV-regularized image denoising, and sensor network localization.
Incremental Gradient Algorithms with Stepsizes Bounded Away From Zero
 Computational Opt. and Appl
, 1998
Abstract

Cited by 33 (2 self)
We consider the class of incremental gradient methods for minimizing a sum of continuously differentiable functions. An important novel feature of our analysis is that the stepsizes are kept bounded away from zero. We derive the first convergence results of any kind for this computationally important case. In particular, we show that a certain ε-approximate solution can be obtained and establish the linear dependence of ε on the stepsize limit. Incremental gradient methods are particularly well suited for large neural network training problems, where obtaining an approximate solution is typically sufficient and is often preferable to computing an exact solution. Thus, in the context of neural networks, the approach presented here is related to the principle of tolerant training. Our results justify numerous stepsize rules that were derived on the basis of extensive numerical experimentation but for which no theoretical analysis was previously available. In addition, convergence to (exact) stationary points is established when the gradient satisfies a certain growth property.
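The "linear dependence of ε on the stepsize limit" can be observed numerically on a two-component toy problem (an illustration under assumed data, not the paper's analysis): halving the stepsize roughly halves the residual error of the limit cycle.

```python
def incremental_gradient(y, alpha, epochs=400):
    # Cycle through the components f_i(x) = 0.5 * (x - y_i)^2 using a
    # stepsize that stays bounded away from zero.
    x = 0.0
    for _ in range(epochs):
        for y_i in y:
            x -= alpha * (x - y_i)
    return x

y = [0.0, 10.0]
mean = sum(y) / len(y)              # exact minimizer of the sum
err_big = abs(incremental_gradient(y, 0.10) - mean)
err_small = abs(incremental_gradient(y, 0.01) - mean)
print(err_big, err_small)           # residual shrinks roughly linearly in alpha
```

For this example the limit of the cycle can be computed in closed form as 10/(2 - alpha), so the asymptotic error is alpha * 5/(2 - alpha): linear in the stepsize, exactly the ε-versus-stepsize trade-off the abstract describes.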