Results 1 - 10
of
14
First and Second-Order Methods for Learning: between Steepest Descent and Newton's Method
- Neural Computation
, 1992
"... On-line first order backpropagation is sufficiently fast and effective for many large-scale classification problems but for very high precision mappings, batch processing may be the method of choice. This paper reviews first- and second-order optimization methods for learning in feedforward neura ..."
Abstract
-
Cited by 108 (6 self)
- Add to MetaCart
On-line first order backpropagation is sufficiently fast and effective for many large-scale classification problems but for very high precision mappings, batch processing may be the method of choice. This paper reviews first- and second-order optimization methods for learning in feedforward neural networks. The viewpoint is that of optimization: many methods can be cast in the language of optimization techniques, allowing the transfer to neural nets of detailed results about computational complexity and safety procedures to ensure convergence and to avoid numerical problems. The review is not intended to deliver detailed prescriptions for the most appropriate methods in specific applications, but to illustrate the main characteristics of the different methods and their mutual relations.
Efficient Back Prop
, 1996
"... HINE Parameters X0, X1, ....Xp Output E0, E1,....Ep Error Desired Output D0, D1,...Dp Y0, Y1,...Yp Input w w0 w1 AT&T Laboratories (c) COST FUNCTION Output E0, E1,....Ep Error Desired Output D0, D1,...Dp Y0, Y1,...Yp X0, X1, ....Xp Input Parameters w B R A COMPUTING THE GRADIENT WITH BACKPROPAGATIO ..."
Abstract
-
Cited by 94 (17 self)
- Add to MetaCart
HINE Parameters X0, X1, ....Xp Output E0, E1,....Ep Error Desired Output D0, D1,...Dp Y0, Y1,...Yp Input w w0 w1 AT&T Laboratories (c) COST FUNCTION Output E0, E1,....Ep Error Desired Output D0, D1,...Dp Y0, Y1,...Yp X0, X1, ....Xp Input Parameters w B R A COMPUTING THE GRADIENT WITH BACKPROPAGATION O = A(I1, I2) dI1 = dO ¶ A ¶ I1 dI2 = dO ¶ A ¶ I2 - The learning machine is composed of modules (e.g. layers) - Each module can do two things: 1- compute its outputs from its inputs (FPROP) 2- compute gradient vectors at its inputs from gradient vectors at its outputs (BPROP) A O, dO I1, dI1 I2, dI2 AT&T Laboratories (c) AN INTERESTING SPECIAL CASE: MULTILAYER NETWORKS X0, X1, ....Xp Output Desired Output D0, D1,...Dp Y0, Y1,...Yp Input || D - Y || 2 2 1 WX F() WX F() Mean Square Error Parameters (weights + biases) w Weight matrix E0, E1,....Ep Sigmoids + Biase
Local Gain Adaptation in Stochastic Gradient Descent
- In Proc. Intl. Conf. Artificial Neural Networks
, 1999
"... Gain adaptation algorithms for neural networks typically adjust learning rates by monitoring the correlation between successive gradients. Here we discuss the limitations of this approach, and develop an alternative by extending Sutton's work on linear systems to the general, nonlinear case. The res ..."
Abstract
-
Cited by 42 (9 self)
- Add to MetaCart
Gain adaptation algorithms for neural networks typically adjust learning rates by monitoring the correlation between successive gradients. Here we discuss the limitations of this approach, and develop an alternative by extending Sutton's work on linear systems to the general, nonlinear case. The resulting online algorithms are computationally little more expensive than other acceleration techniques, do not assume statistical independence between successive training patterns, and do not require an arbitrary smoothing parameter. In our benchmark experiments, they consistently outperform other acceleration methods, and show remarkable robustness when faced with noni. i.d. sampling of the input space.
Comparison of Optimized Backpropagation Algorithms
- Proc. of ESANN'93, Brussels
, 1993
"... Backpropagation is one of the most famous training algorithms for multilayer perceptrons. Unfortunately it can be very slow for practical applications. Over the last years many improvement strategies have been developed to speed up backpropagation. It's very difficult to compare these different tech ..."
Abstract
-
Cited by 36 (1 self)
- Add to MetaCart
Backpropagation is one of the most famous training algorithms for multilayer perceptrons. Unfortunately it can be very slow for practical applications. Over the last years many improvement strategies have been developed to speed up backpropagation. It's very difficult to compare these different techniques, because most of them have been tested on various specific data sets. Most of the reported results are based on some kind of tiny and artificial training sets like XOR, encoder or decoder. It's very doubtful if these results hold for more complicate practical application. In this report an overview of many different speedup techniques is given. All of them were assessed by a very hard practical classification task, which consists of a big medical data set. As you will see many of these optimized algorithms fail in learning the data set. 1 Introduction This report is intended to summarize our experience using many different speedup techniques for the backpropagation algorithm. We have...
Transferring Previously Learned Back-Propagation Neural Networks To New Learning Tasks
, 1993
"... ..."
Improving the Convergence of the Backpropagation Algorithm Using Learning Rate Adaptation Methods
, 1999
"... This article focuses on gradient-based backpropagation algorithms that use either a common adaptive learning rate for all weights or an individual adaptive learning rate for each weight and apply the Goldstein/Armijo line search. The learning-rate adaptation is based on descent techniques and estima ..."
Abstract
-
Cited by 19 (13 self)
- Add to MetaCart
This article focuses on gradient-based backpropagation algorithms that use either a common adaptive learning rate for all weights or an individual adaptive learning rate for each weight and apply the Goldstein/Armijo line search. The learning-rate adaptation is based on descent techniques and estimates of the local Lipschitz constant that are obtained without additional error function and gradient evaluations. The proposed algorithms improve the backpropagation training in terms of both convergence rate and convergence characteristics, such as stable learning and robustness to oscillations. Simulations are conducted to compare and evaluate the convergence behavior of these gradient-based training algorithms with several popular training methods.
Efficient Training of Feed-Forward Neural Networks
, 1997
"... : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 61 A.2 Introduction : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 61 A.2.1 Motivation : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 61 A.3 Optimization strategy : : : : : : : : : : : : ..."
Abstract
-
Cited by 9 (0 self)
- Add to MetaCart
: : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 61 A.2 Introduction : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 61 A.2.1 Motivation : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 61 A.3 Optimization strategy : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 62 A.4 The Backpropagation algorithm : : : : : : : : : : : : : : : : : : : : : : : : 63 A.5 Conjugate direction methods : : : : : : : : : : : : : : : : : : : : : : : : : : 63 A.5.1 Conjugate gradients : : : : : : : : : : : : : : : : : : : : : : : : : : 65 A.5.2 The CGL algorithm : : : : : : : : : : : : : : : : : : : : : : : : : : : 67 A.5.3 The BFGS algorithm : : : : : : : : : : : : : : : : : : : : : : : : : : 67 A.6 The SCG algorithm : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 67 A.7 Test results : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 70 A.7.1 Comparison metric : : : : : : : : : : : : : : : : : : : : : : : :...
Globally Convergent Algorithms With Local Learning Rates
- IEEE Tr. Neural Networks
"... In this paper, a new generalized theoretical result is presented that underpins the development of globally convergent first-order batch training algorithms which employ local learning rates. This result allows us to equip algorithms of this class with a strategy for adapting the overall direction o ..."
Abstract
-
Cited by 4 (2 self)
- Add to MetaCart
In this paper, a new generalized theoretical result is presented that underpins the development of globally convergent first-order batch training algorithms which employ local learning rates. This result allows us to equip algorithms of this class with a strategy for adapting the overall direction of search to a descent one. In this way, a decrease of the batch-error measure at each training iteration is ensured, and convergence of the sequence of weight iterates to a local minimizer of the batch error function is obtained from remote initial weights. The effectiveness of the theoretical result is illustrated in three application examples by comparing two well-known training algorithms with local learning rates to their globally convergent modifications.
Deterministic Nonmonotone Strategies for Effective Training of Multilayer Perceptrons
"... In this paper, we present deterministic nonmonotone learning strategies for multilayer perceptrons (MLPs), i.e., deterministic training algorithms in which error function values are allowed to increase at some epochs. To this end, we argue that the current error function value must satisfy a nonmono ..."
Abstract
-
Cited by 3 (3 self)
- Add to MetaCart
In this paper, we present deterministic nonmonotone learning strategies for multilayer perceptrons (MLPs), i.e., deterministic training algorithms in which error function values are allowed to increase at some epochs. To this end, we argue that the current error function value must satisfy a nonmonotone criterion with respect to the maximum error function value of the previous epochs, and we propose a subprocedure to dynamically compute . The nonmonotone strategy can be incorporated in any batch training algorithm and provides fast, stable, and reliable learning. Experimental results in different classes of problems show that this approach improves the convergence speed and success percentage of first-order training algorithms and alleviates the need for fine-tuning problem-depended heuristic parameters.
A neural network for classification with incomplete data: application to robust ASR
- Proc. of ICSLP (International Conference on Spoken Language Processing
, 2000
"... There are many situations in data classification where the data vector to be classified is partially corrupted, or otherwise incomplete. In this case the optimal estimate for each class probability output, for any given set of missing data components, can be obtained by calculating its expected valu ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
There are many situations in data classification where the data vector to be classified is partially corrupted, or otherwise incomplete. In this case the optimal estimate for each class probability output, for any given set of missing data components, can be obtained by calculating its expected value. However, this means that classifiers whose expected outputs do not have a closed form expression in terms of the original function parameters, such as the commonly used multi-layer perceptron (MLP), cannot be used for classification with missing data. No classifier can compete with the performance of an MLP on complete data unless it is discriminatively trained. In this paper we present a particular form of RBF classifier which can be discriminatively trained and whose expected outputs are a simple function of the original classifier parameters, even though the output unit function is non-linear. This provides us with an incomplete data classifier network (IDCN) which combines the discriminative classification performance normally associated with artificial neural networks, with the ability to deal gracefully with missing data. We describe two ways in which this IDCN can be applied to robust automatic speech recognition (ASR), depending on whether or not the position of missing data is known. We compare the performance of one of these models with an existing system for ASR with missing data.

