Results 1-10 of 28
Efficient BackProp
1998
Cited by 209 (31 self)
The convergence of backpropagation learning is analyzed so as to explain common phenomena observed by practitioners. Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks, and offers explanations of why they work. Many authors have suggested that second-order optimization methods are advantageous for neural net training. It is shown that most "classical" second-order methods are impractical for large neural networks. A few methods are proposed that do not have these limitations.
1 Introduction
Backpropagation is a very popular neural network learning algorithm because it is conceptually simple, computationally efficient, and because it often works. However, getting it to work well, and sometimes to work at all, can seem more of an art than a science. Designing and training a network using backprop requires making many seemingly arbitrary choices such as the number ...
Fast Exact Multiplication by the Hessian
Neural Computation, 1994
Cited by 91 (5 self)
Just storing the Hessian H (the matrix of second derivatives d^2 E/dw_i dw_j of the error E with respect to each pair of weights) of a large neural network is difficult. Since a common use of a large matrix like H is to compute its product with various vectors, we derive a technique that directly calculates Hv, where v is an arbitrary vector. This allows H to be treated as a generalized sparse matrix. To calculate Hv, we first define a differential operator R{f(w)} = (d/dr) f(w + rv) |_{r=0}, note that R{grad_w} = Hv and R{w} = v, and then apply R{} to the equations used to compute grad_w. The result is an exact and numerically stable procedure for computing Hv, which takes about as much computation, and is about as local, as a gradient evaluation. We then apply the technique to backpropagation networks, recurrent backpropagation, and stochastic Boltzmann Machines. Finally, we show that this technique can be used at the heart of many iterative techniques for computing various properties of H, obviating the need for direct methods.
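As an illustration of the R{} technique described in this abstract, the following is a minimal sketch for one small case only. The network (one tanh hidden layer, linear output, squared error), the function names and the variable names are all assumptions made for this example and are not taken from the paper. Each line of the forward and backward pass is paired with its R{}-image, using R{w} = v, so that R{gradient} = Hv.

```python
import numpy as np


def gradient(W1, W2, x, t):
    """Backprop gradient of E = 0.5 * ||W2 tanh(W1 x) - t||^2 (assumed toy network)."""
    z = np.tanh(W1 @ x)                  # hidden activations
    dy = W2 @ z - t                      # dE/dy
    da = (W2.T @ dy) * (1.0 - z ** 2)    # dE/da, where a = W1 x
    return np.outer(da, x), np.outer(dy, z)


def hessian_vector_product(W1, W2, V1, V2, x, t):
    """Exact Hv for the toy network; (V1, V2) is the direction v, shaped like (W1, W2)."""
    # Forward pass and its R{}-image (R{W} = V, and R{x} = 0 because x is constant).
    a = W1 @ x
    z = np.tanh(a)
    y = W2 @ z
    Ra = V1 @ x
    Rz = (1.0 - z ** 2) * Ra
    Ry = V2 @ z + W2 @ Rz

    # Backward pass and its R{}-image (the target t is constant, so R{t} = 0).
    dy = y - t
    dz = W2.T @ dy
    Rdy = Ry
    Rdz = V2.T @ dy + W2.T @ Rdy
    Rda = Rdz * (1.0 - z ** 2) - 2.0 * dz * z * Rz

    # R{gradient} = Hv, split into the same two blocks as the weights.
    HV1 = np.outer(Rda, x)
    HV2 = np.outer(Rdy, z) + np.outer(dy, Rz)
    return HV1, HV2
```

One way to check such a sketch is to compare its output with a central finite difference of `gradient` along the direction (V1, V2). The cost is one extra forward and backward sweep, in line with the abstract's remark that Hv takes about as much computation as a gradient evaluation.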
Improving the convergence of the backpropagation algorithm using learning rate adaptation methods
Neural Computation, 1999
Fast and accurate text classification via multiple linear discriminant projections
In VLDB, 2002
Cited by 33 (0 self)
Support vector machines (SVMs) have shown superb performance for text classification tasks. They are accurate, robust, and quick to apply to test instances. Their only potential drawback is their training time and memory requirement. For n training instances held in memory, the best-known SVM implementations take time proportional to n^a, where a is typically between 1.8 and 2.1. SVMs have been trained on data sets with several thousand instances, but Web directories today contain millions of instances that are valuable for mapping billions of Web pages into Yahoo!-like directories. We present SIMPL, a nearly linear-time classification algorithm that mimics the strengths of SVMs while avoiding the training bottleneck. It uses Fisher's linear discriminant, a classical tool from statistical pattern recognition, to project training instances to a carefully selected low-dimensional subspace before inducing a decision tree on the projected instances. SIMPL uses efficient sequential scans and sorts and is comparable in speed and memory scalability to widely used naive Bayes (NB) classifiers, but it beats NB accuracy decisively. It not only approaches and sometimes exceeds SVM accuracy, but also beats the running time of a popular SVM implementation by orders of magnitude. While describing SIMPL, we make a detailed experimental comparison of SVM-generated discriminants with Fisher's discriminants, and we also report on an analysis of the cache performance of a popular SVM implementation. Our analysis shows that SIMPL has the potential to be the method of choice for practitioners who want the accuracy of SVMs and the simplicity and speed of naive Bayes classifiers.
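The projection step mentioned in this abstract can be illustrated with the textbook two-class Fisher discriminant. This is a minimal sketch of that classical formula only, not of SIMPL itself: the function names, the small ridge term and the dense in-memory matrices are assumptions of the example, whereas the abstract describes sequential scans and multiple projections.

```python
import numpy as np


def fisher_direction(X, y):
    """Unit vector proportional to Sw^{-1} (m1 - m0) for binary labels y in {0, 1}."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the two per-class scatter matrices.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    # A small ridge term (an assumption of this sketch) keeps Sw invertible.
    Sw = Sw + 1e-8 * np.eye(X.shape[1])
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)


def project(X, w):
    """Project each instance onto the discriminant, giving one feature per instance."""
    return X @ w
```

After projection every instance is reduced to a single number, so a simple threshold on `project(X, w)` (the one-dimensional analogue of a decision-tree split) is enough to separate two well-separated classes.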
Adaptive Computational Chemotaxis in Bacterial Foraging Optimization: An Analysis
IEEE Computer Society Press, ISBN 0769531091, 2008
Cited by 18 (5 self)
Some researchers have illustrated how individual bacteria and groups of bacteria forage for nutrients, and how to model this behavior as a distributed optimization process called the Bacterial Foraging Optimization Algorithm (BFOA). One of the major driving forces of BFOA is the chemotactic movement of a virtual bacterium, which models a trial solution of the optimization problem. In this article, we analyze the chemotactic step of a one-dimensional BFOA in the light of the classical Gradient Descent Algorithm (GDA). Our analysis points out that the chemotaxis employed in BFOA may result in sustained oscillation, especially for a flat fitness landscape, when a bacterium cell is very near the optimum. To accelerate convergence near optima, we have made the chemotactic step size C adaptive. Computer simulations over several numerical benchmarks indicate that BFOA with the new chemotactic operation shows better convergence behavior than the classical BFOA.
Efficient Training of Feed-Forward Neural Networks
1997
Cited by 18 (0 self)
A.2 Introduction
A.2.1 Motivation
A.3 Optimization strategy
A.4 The Backpropagation algorithm
A.5 Conjugate direction methods
A.5.1 Conjugate gradients
A.5.2 The CGL algorithm
A.5.3 The BFGS algorithm
A.6 The SCG algorithm
A.7 Test results
A.7.1 Comparison metric ...
Improved neural network training of inter-word context units for connected digit recognition
In ICASSP'98, 1998
Cited by 14 (0 self)
For connected digit recognition, the relative frequency of occurrence for context-dependent phonetic units at inter-word boundaries depends on the ordering of the spoken digits and may or may not include silence or pause. If these units represent classes in a model, this means that the distribution of samples between classes (the class prior) may be extremely nonuniform, and that the distribution over many utterances in a training set may be very different from the rather flat distribution over any single test utterance. Using a neural network to model context-dependent phonetic units, we show how to compensate for this problem. We do this by roughly flattening the class prior for infrequently occurring context units by a suitable weighting of the neural network cost function. This is based entirely on training set statistics. We show that this leads to improved classification of infrequent classes and translates into improved overall recognition performance. We give results for telephone speech...
Nonmonotone Methods for Backpropagation Training with Adaptive Learning Rate
In: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN'99), Washington D.C., 1999
Cited by 13 (9 self)
In this paper, we present nonmonotone methods for feedforward neural network training, i.e. training methods in which error function values are allowed to increase at some iterations. More specifically, at each epoch we impose that the current error function value must satisfy an Armijo-type criterion with respect to the maximum error function value of M previous epochs. A strategy to dynamically adapt M is suggested, and two training algorithms with adaptive learning rates that successfully employ the above mentioned acceptability criterion are proposed. Experimental results show that the nonmonotone learning strategy improves the convergence speed and the success rate of the methods considered.
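The acceptability test described in this abstract can be sketched as a generic nonmonotone Armijo-type rule: a step is accepted when the new error lies sufficiently below the largest of the last M error values, rather than below the current one. The code below is a hedged illustration under assumptions that are not from the paper: M is held fixed (the paper adapts it dynamically), the learning rate is simply halved until the test passes, and the names `nonmonotone_descent`, `lr0` and `sigma` are hypothetical.

```python
import numpy as np


def nonmonotone_descent(f, grad, w0, lr0=1.0, M=5, sigma=1e-4, epochs=50):
    """Gradient descent with an Armijo-type test against the max of the last M errors."""
    w = np.asarray(w0, dtype=float)
    errors = [f(w)]                        # error value after each epoch
    for _ in range(epochs):
        g = grad(w)
        reference = max(errors[-M:])       # worst error among the last M epochs
        lr = lr0
        # Halve the learning rate (assumed rule) until the nonmonotone test passes.
        while lr > 1e-12 and f(w - lr * g) > reference - sigma * lr * (g @ g):
            lr *= 0.5
        w = w - lr * g
        errors.append(f(w))
    return w, errors
```

Because the reference value is a maximum over recent epochs, an accepted step may raise the error relative to the previous epoch while still staying below the recent worst case, which is the sense in which the method is nonmonotone.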
Computer-aided diagnostic scheme for distinction between benign and malignant nodules in thoracic low-dose CT by use of massive training artificial neural network
IEEE Transactions on Medical Imaging, 2005
Cited by 13 (4 self)
Low-dose helical computed tomography (LDCT) is being applied as a modality for lung cancer screening. It may be difficult, however, for radiologists to distinguish malignant from benign nodules in LDCT. Our purpose in this study was to develop a computer-aided diagnostic (CAD) scheme for distinction between benign and malignant nodules in LDCT scans by use of a massive training artificial neural network (MTANN). The MTANN is a trainable, highly nonlinear filter based on an artificial neural network. To distinguish malignant nodules from six different types of benign nodules, we developed multiple MTANNs (multi-MTANN) consisting of six expert MTANNs that are arranged in parallel. Each of the MTANNs was trained by use of input CT images and teaching images containing the estimate of the distribution for the "likelihood of being a malignant nodule," i.e., the teaching image for a malignant nodule contains a two-di...
Using curvature information for fast stochastic search
In Advances in Neural Information Processing Systems 9, 1996
Cited by 13 (1 self)
We present an algorithm for fast stochastic gradient descent that uses a nonlinear adaptive momentum scheme to optimize the late-time convergence rate. The algorithm makes effective use of curvature information, requires only O(n) storage and computation, and delivers convergence rates close to the theoretical optimum. We demonstrate the technique on linear and large nonlinear backprop networks.
Improving Stochastic Search
Learning algorithms that perform gradient descent on a cost function can be formulated in either stochastic (online) or batch form. The stochastic version takes the form
$$\omega_{t+1} = \omega_t + \mu_t \, G(\omega_t, x_t) \qquad (1)$$
where $\omega_t$ is the current weight estimate, $\mu_t$ is the learning rate, $G$ is minus the instantaneous gradient estimate, and $x_t$ is the input at time $t$. One obtains the corresponding batch mode learning rule by taking $\mu$ constant and averaging $G$ over ...
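The two forms of the learning rule in this excerpt (one update per input versus a constant learning rate with G averaged) can be contrasted in a short sketch. Everything specific here is an assumption made for illustration: the cost is a squared error on a linear model, the learning rate is held constant in both variants, and the function names are hypothetical. The adaptive momentum scheme of the paper is not reproduced.

```python
import numpy as np


def minus_gradient(w, x, target):
    """G(w, x): minus the instantaneous gradient of 0.5 * (w.x - target)^2."""
    return -(w @ x - target) * x


def stochastic_descent(X, targets, mu=0.05, epochs=50, seed=0):
    """Online rule: one update per input, w <- w + mu * G(w, x_t)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            w = w + mu * minus_gradient(w, X[i], targets[i])
    return w


def batch_descent(X, targets, mu=0.5, steps=200):
    """Batch rule: constant mu, with G averaged over the whole training set."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        G_mean = np.mean([minus_gradient(w, x, t) for x, t in zip(X, targets)], axis=0)
        w = w + mu * G_mean
    return w
```

On noise-free linear data both variants recover the generating weights; the online form does so with one cheap update per input, while the batch form needs a full pass over the data for every step.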