Results 1-10 of 19
Efficient BackProp
1998
Cited by 125 (24 self)
The convergence of backpropagation learning is analyzed so as to explain common phenomena observed by practitioners. Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks, and offers explanations of why they work. Many authors have suggested that second-order optimization methods are advantageous for neural net training. It is shown that most "classical" second-order methods are impractical for large neural networks. A few methods are proposed that do not have these limitations.
1 Introduction
Backpropagation is a very popular neural network learning algorithm because it is conceptually simple, computationally efficient, and because it often works. However, getting it to work well, and sometimes to work at all, can seem more of an art than a science. Designing and training a network using backprop requires making many seemingly arbitrary choices such as the number ...
Fast Exact Multiplication by the Hessian
Neural Computation, 1994
Cited by 70 (4 self)
Just storing the Hessian $H$ (the matrix of second derivatives $\partial^2 E / \partial w_i \partial w_j$ of the error $E$ with respect to each pair of weights) of a large neural network is difficult. Since a common use of a large matrix like $H$ is to compute its product with various vectors, we derive a technique that directly calculates $Hv$, where $v$ is an arbitrary vector. This allows $H$ to be treated as a generalized sparse matrix. To calculate $Hv$, we first define a differential operator $\mathcal{R}\{f(w)\} = \left.\frac{\partial}{\partial r} f(w + rv)\right|_{r=0}$, note that $\mathcal{R}\{\nabla_w\} = Hv$ and $\mathcal{R}\{w\} = v$, and then apply $\mathcal{R}\{\cdot\}$ to the equations used to compute $\nabla_w$. The result is an exact and numerically stable procedure for computing $Hv$, which takes about as much computation, and is about as local, as a gradient evaluation. We then apply the technique to backpropagation networks, recurrent backpropagation, and stochastic Boltzmann Machines. Finally, we show that this technique can be used at the heart of many iterative techniques for computing various properties of $H$, obviating the need for direct methods.
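The operator-based procedure in this abstract can be illustrated on a very small case. The sketch below assumes a single tanh unit with squared error E = 0.5 (tanh(w.x) - t)^2; this model and all function names are illustrative assumptions, not taken from the paper. It applies the operator line by line to the gradient computation and provides a central finite difference of the gradient as an independent check.

```python
import math

def grad(w, x, t):
    """Gradient of E = 0.5 * (tanh(w.x) - t)^2 with respect to w."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    y = math.tanh(a)
    delta = (y - t) * (1.0 - y * y)          # dE/da
    return [delta * xi for xi in x]

def hessian_vector_product(w, v, x, t):
    """Exact H v, obtained by applying R{.} to each line of grad()."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    y = math.tanh(a)
    r_a = sum(vi * xi for vi, xi in zip(v, x))   # R{a} = v.x because R{w} = v
    r_y = (1.0 - y * y) * r_a                    # R{tanh(a)}
    # Product rule on delta = (y - t) * (1 - y^2)
    r_delta = r_y * (1.0 - y * y) + (y - t) * (-2.0 * y * r_y)
    return [r_delta * xi for xi in x]            # R{delta * x}; x is constant

def finite_difference_hv(w, v, x, t, eps=1e-5):
    """Approximate H v from two gradient evaluations (for checking only)."""
    w_plus = [wi + eps * vi for wi, vi in zip(w, v)]
    w_minus = [wi - eps * vi for wi, vi in zip(w, v)]
    g_plus, g_minus = grad(w_plus, x, t), grad(w_minus, x, t)
    return [(p - m) / (2.0 * eps) for p, m in zip(g_plus, g_minus)]
```

Note that the exact product costs roughly one extra pass through the same computation as the gradient and never forms the full matrix H.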
Improving the Convergence of the Backpropagation Algorithm Using Learning Rate Adaptation Methods
1999
Cited by 27 (15 self)
This article focuses on gradient-based backpropagation algorithms that use either a common adaptive learning rate for all weights or an individual adaptive learning rate for each weight and apply the Goldstein/Armijo line search. The learning-rate adaptation is based on descent techniques and estimates of the local Lipschitz constant that are obtained without additional error function and gradient evaluations. The proposed algorithms improve the backpropagation training in terms of both convergence rate and convergence characteristics, such as stable learning and robustness to oscillations. Simulations are conducted to compare and evaluate the convergence behavior of these gradient-based training algorithms with several popular training methods.
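A rough sketch of a common adaptive learning rate follows. It assumes that the local Lipschitz constant is estimated as the ratio of the gradient change to the weight change between consecutive epochs, that the next trial step is 1/(2 * estimate), and that an Armijo test with step halving guards each update. These specifics, the quadratic test function, and all names are assumptions made for illustration; the abstract does not state the article's exact rules.

```python
import math

def f(w, c):
    """Toy error function: 0.5 * sum(c_i * w_i^2)."""
    return 0.5 * sum(ci * wi * wi for ci, wi in zip(c, w))

def grad(w, c):
    return [ci * wi for ci, wi in zip(c, w)]

def norm(u):
    return math.sqrt(sum(ui * ui for ui in u))

def train(w, c, epochs=200, eta0=0.1, sigma=0.5):
    """Gradient descent with a Lipschitz-based step and Armijo backtracking."""
    g = grad(w, c)
    eta = eta0
    history = [f(w, c)]
    for _ in range(epochs):
        fw = history[-1]
        g_sq = sum(gi * gi for gi in g)
        if g_sq == 0.0:
            break
        step = eta
        for _ in range(50):                       # Armijo test, halve on failure
            w_new = [wi - step * gi for wi, gi in zip(w, g)]
            f_new = f(w_new, c)
            if f_new <= fw - sigma * step * g_sq:
                break
            step *= 0.5
        g_new = grad(w_new, c)
        dw = norm([a - b for a, b in zip(w_new, w)])
        dg = norm([a - b for a, b in zip(g_new, g)])
        if dw > 0.0 and dg > 0.0:
            eta = 0.5 * dw / dg                   # 1 / (2 * Lipschitz estimate)
        w, g = w_new, g_new
        history.append(f_new)
    return w, history
```

The estimate reuses the gradients that training already computes, so no extra error or gradient evaluations are needed beyond those of the backtracking test.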
Fast and accurate text classification via multiple linear discriminant projections
In VLDB, 2002
Cited by 25 (0 self)
Support vector machines (SVMs) have shown superb performance for text classification tasks. They are accurate, robust, and quick to apply to test instances. Their only potential drawback is their training time and memory requirement. For n training instances held in memory, the best-known SVM implementations take time proportional to n^a, where a is typically between 1.8 and 2.1. SVMs have been trained on data sets with several thousand instances, but Web directories today contain millions of instances that are valuable for mapping billions of Web pages into Yahoo!-like directories. We present SIMPL, a nearly linear-time classification algorithm that mimics the strengths of SVMs while avoiding the training bottleneck. It uses Fisher’s linear discriminant, a classical tool from statistical pattern recognition, to project training instances to a carefully selected low-dimensional subspace before inducing a decision tree on the projected instances. SIMPL uses efficient sequential scans and sorts and is comparable in speed and memory scalability to widely used naive Bayes (NB) classifiers, but it beats NB accuracy decisively. It not only approaches and sometimes exceeds SVM accuracy, but also beats the running time of a popular SVM implementation by orders of magnitude. While describing SIMPL, we make a detailed experimental comparison of SVM-generated discriminants with Fisher’s discriminants, and we also report on an analysis of the cache performance of a popular SVM implementation. Our analysis shows that SIMPL has the potential to be the method of choice for practitioners who want the accuracy of SVMs and the simplicity and speed of naive Bayes classifiers.
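The projection step can be sketched for two classes in two dimensions. The code below is not SIMPL itself: it computes a single Fisher direction (the inverse within-class scatter applied to the difference of class means) and uses one threshold on the projected values as a stand-in for the decision tree. The data and all names are made up for illustration.

```python
def mean(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

def scatter(points, m):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix around mean m."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy

def fisher_direction(class0, class1):
    """w = S_W^{-1} (m1 - m0), using the closed-form 2x2 inverse."""
    m0, m1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sxx, sxy, syy = (s0[i] + s1[i] for i in range(3))
    det = sxx * syy - sxy * sxy
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    w = [(syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det]
    return w, m0, m1

def project(w, p):
    return w[0] * p[0] + w[1] * p[1]

def train(class0, class1):
    """Return the projection direction and a midpoint threshold."""
    w, m0, m1 = fisher_direction(class0, class1)
    threshold = 0.5 * (project(w, m0) + project(w, m1))
    return w, threshold

def predict(w, threshold, p):
    return 1 if project(w, p) > threshold else 0

# Hypothetical toy data
CLASS0 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
CLASS1 = [(6.0, 5.0), (7.0, 8.0), (8.0, 6.0), (7.0, 7.0)]
```

Computing the means and scatter entries takes only sequential passes over the training instances, in line with the abstract's emphasis on scans rather than quadratic-time optimization.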
Improved neural network training of inter-word context units for connected digit recognition
In ICASSP'98, 1998
Cited by 13 (0 self)
For connected digit recognition the relative frequency of occurrence for context-dependent phonetic units at inter-word boundaries depends on the ordering of the spoken digits and may or may not include silence or pause. If these units represent classes in a model this means that the distribution of samples between classes (the class prior) may be extremely nonuniform and that the distribution over many utterances in a training set may be very different from the rather flat distribution over any single test utterance. Using a neural network to model context-dependent phonetic units we show how to compensate for this problem. We do this by roughly flattening the class prior for infrequently occurring context units by a suitable weighting of the neural network cost function. This is based entirely on training set statistics. We show that this leads to improved classification of infrequent classes and translates into improved overall recognition performance. We give results for telephone speech...
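One simple way to weight a cost function so that the effective class prior becomes flat is sketched below. The inverse-frequency rule, the `power` knob for partial ("rough") flattening, and the function names are assumptions for illustration; the abstract does not give the paper's actual weighting.

```python
import math

def flattening_weights(class_counts, power=1.0):
    """Per-class cost weights from training-set counts.

    power=1.0 makes weight * count equal for every class (a flat prior);
    0 < power < 1 flattens only partially. Both choices are hypothetical.
    """
    total = sum(class_counts.values())
    k = len(class_counts)
    return {c: (total / (k * n)) ** power for c, n in class_counts.items()}

def weighted_nll(probs, labels, weights):
    """Weighted negative log-likelihood over a batch.

    probs  : list of dicts mapping class -> predicted probability
    labels : list of true classes
    """
    total_weight = sum(weights[y] for y in labels)
    loss = -sum(weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    return loss / total_weight
```

With counts of 90 and 10 for two classes, for example, the rare class receives nine times the weight of the frequent one, so both contribute equally to the total cost.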
Nonmonotone Methods for Backpropagation Training with Adaptive Learning Rate
In: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN’99), Washington D.C., 1999
Cited by 13 (9 self)
In this paper, we present nonmonotone methods for feedforward neural network training, i.e. training methods in which error function values are allowed to increase at some iterations. More specifically, at each epoch we impose that the current error function value must satisfy an Armijo-type criterion with respect to the maximum error function value of M previous epochs. A strategy to dynamically adapt M is suggested, and two training algorithms with adaptive learning rates that successfully employ the above-mentioned acceptability criterion are proposed. Experimental results show that the nonmonotone learning strategy improves the convergence speed and the success rate of the methods considered.
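The acceptability test can be written in a few lines. The sketch assumes a steepest-descent step of size `eta`, so the Armijo decrease term is `sigma * eta * ||g||^2`; the value of `sigma`, the window handling, and the names are illustrative assumptions, and the paper's strategy for adapting M is not reproduced.

```python
from collections import deque

def nonmonotone_accept(e_new, recent_errors, eta, grad_sq_norm, sigma=1e-4):
    """Armijo-type test against the largest of the stored recent errors."""
    reference = max(recent_errors)
    return e_new <= reference - sigma * eta * grad_sq_norm

def make_window(initial_error, m):
    """Sliding window that keeps the error values of the last m epochs."""
    return deque([initial_error], maxlen=m)
```

With stored errors 1.0, 0.8, 0.9 and a trial error of 0.95, the step is accepted because 0.95 is below the window maximum 1.0, although a monotone test against the latest value 0.9 would reject it.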
Using curvature information for fast stochastic search
In Advances in Neural Information Processing Systems 9, 1996
Cited by 13 (1 self)
We present an algorithm for fast stochastic gradient descent that uses a nonlinear adaptive momentum scheme to optimize the late time convergence rate. The algorithm makes effective use of curvature information, requires only O(n) storage and computation, and delivers convergence rates close to the theoretical optimum. We demonstrate the technique on linear and large nonlinear backprop networks.
Improving Stochastic Search
Learning algorithms that perform gradient descent on a cost function can be formulated in either stochastic (online) or batch form. The stochastic version takes the form
$\omega_{t+1} = \omega_t + \mu_t\, G(\omega_t, x_t)$ (1)
where $\omega_t$ is the current weight estimate, $\mu_t$ is the learning rate, $G$ is minus the instantaneous gradient estimate, and $x_t$ is the input at time $t$. One obtains the corresponding batch mode learning rule by taking $\mu$ constant and averaging $G$ over ...
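The difference between the stochastic update (1) and its batch counterpart can be made concrete with a one-weight linear unit. The model, the data, the constant learning rate `mu`, and the function names are assumptions chosen for illustration; the adaptive momentum scheme of the paper is not shown.

```python
def g(omega, x, y):
    """Minus the instantaneous gradient of 0.5 * (y - omega * x)^2."""
    return (y - omega * x) * x

def online_epoch(omega, data, mu):
    """Stochastic form: update after every single example, as in (1)."""
    for x, y in data:
        omega = omega + mu * g(omega, x, y)
    return omega

def batch_step(omega, data, mu):
    """Batch form: one update using G averaged over all examples."""
    avg = sum(g(omega, x, y) for x, y in data) / len(data)
    return omega + mu * avg

# Hypothetical noise-free data generated by y = 2 * x
DATA = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def run(update, epochs=100, mu=0.05, omega=0.0):
    for _ in range(epochs):
        omega = update(omega, DATA, mu)
    return omega
```

Calling `run(online_epoch)` and `run(batch_step)` trains the same weight with the two rules; on this noise-free data both approach the generating slope.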
Efficient Training of Feed-Forward Neural Networks
1997
Cited by 12 (0 self)
A.2 Introduction
A.2.1 Motivation
A.3 Optimization strategy
A.4 The Backpropagation algorithm
A.5 Conjugate direction methods
A.5.1 Conjugate gradients
A.5.2 The CGL algorithm
A.5.3 The BFGS algorithm
A.6 The SCG algorithm
A.7 Test results
A.7.1 Comparison metric ...
Adaptive Computational Chemotaxis in Bacterial Foraging Optimization: An Analysis
IEEE Computer Society Press, ISBN 0769531091, 2008
Cited by 10 (5 self)
Some researchers have illustrated how individual bacteria and groups of bacteria forage for nutrients, and have modeled this behavior as a distributed optimization process called the Bacterial Foraging Optimization Algorithm (BFOA). One of the major driving forces of BFOA is the chemotactic movement of a virtual bacterium, which models a trial solution of the optimization problem. In this article, we analyze the chemotactic step of a one-dimensional BFOA in the light of the classical Gradient Descent Algorithm (GDA). Our analysis points out that the chemotaxis employed in BFOA may result in sustained oscillation, especially for a flat fitness landscape, when a bacterium cell is very near the optimum. To accelerate convergence near the optimum, we have made the chemotactic step size C adaptive. Computer simulations over several numerical benchmarks indicate that BFOA with the new chemotactic operation shows better convergence behavior than the classical BFOA.
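The oscillation near the optimum can be reproduced with a deliberately simplified one-dimensional model. In the sketch below the random tumble is replaced by a deterministic move in the downhill direction, the cost is J(theta) = theta^2, and the adaptive rule C = |J| / (|J| + lambda) is one plausible choice assumed here for illustration; none of these details are confirmed by the abstract.

```python
def cost(theta):
    """Toy one-dimensional fitness landscape with its minimum at 0."""
    return theta * theta

def downhill_direction(theta, probe=1e-6):
    """Return -1.0 or +1.0, whichever side has the lower cost nearby."""
    return -1.0 if cost(theta + probe) > cost(theta - probe) else 1.0

def fixed_step(theta):
    return 0.1                         # classical constant step size C

def adaptive_step(theta, lam=1.0):
    j = abs(cost(theta))
    return j / (j + lam)               # shrinks as the cost approaches zero

def chemotaxis(theta, steps, step_size):
    """Repeatedly move by step_size(theta) in the downhill direction."""
    for _ in range(steps):
        theta += step_size(theta) * downhill_direction(theta)
    return theta
```

Starting from theta = 1.05, the fixed step keeps jumping across the minimum between roughly +0.05 and -0.05, whereas the adaptive step shrinks with the cost and keeps approaching zero.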
Backpropagation Convergence Via Deterministic Nonmonotone Perturbed Minimization
1994
Cited by 9 (5 self)
The fundamental backpropagation (BP) algorithm for training artificial neural networks is cast as a deterministic nonmonotone perturbed gradient method. Under certain natural assumptions, such as the series of learning rates diverging while the series of their squares converges, it is established that every accumulation point of the online BP iterates is a stationary point of the BP error function. The results presented cover serial and parallel online BP, modified BP with a momentum term, and BP with weight decay.
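The learning-rate condition is satisfied, for example, by the schedule 1/k, since the sum of 1/k diverges while the sum of 1/k^2 converges. The sketch below uses a one-parameter toy problem (not a neural network) with two training targets presented in a fixed cyclic order; the problem and the names are assumptions for illustration.

```python
def online_descent(targets, steps, learning_rate):
    """Online gradient descent on sum_i 0.5 * (w - t_i)^2, one target per step."""
    w = 0.0
    for k in range(1, steps + 1):
        t = targets[(k - 1) % len(targets)]   # serial (cyclic) presentation
        w -= learning_rate(k) * (w - t)       # gradient of the current term
    return w

def diminishing(k):
    return 1.0 / k      # series diverges, series of squares converges

def constant(k):
    return 0.5          # violates the condition on the squares
```

For targets 1 and 3 the total error has its stationary point at w = 2. With the 1/k schedule the iterate is the running mean of the targets seen so far and settles at 2, while the constant rate keeps cycling between two values on either side of it.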