## Globally Convergent Algorithms With Local Learning Rates

Venue: IEEE Transactions on Neural Networks

Citations: 5 (3 self)

### BibTeX

```bibtex
@ARTICLE{Magoulas_globallyconvergent,
  author  = {George D. Magoulas and Vassilis P. Plagianakos and Michael N. Vrahatis},
  title   = {Globally Convergent Algorithms With Local Learning Rates},
  journal = {IEEE Transactions on Neural Networks},
  year    = {},
  volume  = {13},
  pages   = {774--779}
}
```

### Abstract

In this paper, a new generalized theoretical result is presented that underpins the development of globally convergent first-order batch training algorithms which employ local learning rates. This result allows us to equip algorithms of this class with a strategy for adapting the overall direction of search to a descent one. In this way, a decrease of the batch-error measure at each training iteration is ensured, and convergence of the sequence of weight iterates to a local minimizer of the batch error function is obtained from remote initial weights. The effectiveness of the theoretical result is illustrated in three application examples by comparing two well-known training algorithms with local learning rates to their globally convergent modifications.
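The strategy the abstract describes can be sketched as follows: take the heuristically adapted local learning rates, force the resulting diagonally scaled direction to be a descent direction, and accept a step only when it sufficiently decreases the batch error. This is an illustrative sketch under those assumptions, not the authors' exact algorithm; the Armijo constant `c1`, the fallback rule, and the `max_halvings` bound are hypothetical choices.

```python
import numpy as np

def globally_convergent_step(w, error_fn, grad_fn, local_rates,
                             q=2.0, c1=1e-4, max_halvings=30):
    """One batch iteration of a first-order scheme with one learning rate per
    weight: the scaled direction is forced to be a descent direction, and a
    backtracking loop enforces a sufficient (Armijo-type) decrease of the
    batch error. local_rates would come from any heuristic (Qprop, Rprop, ...);
    the fallback rule and the constants c1, q are illustrative choices."""
    g = grad_fn(w)
    d = -local_rates * g                 # diagonally scaled search direction
    if g @ d >= 0:                       # not a descent direction:
        d = -np.min(local_rates) * g     # fall back to scaled steepest descent
    t, E0, slope = 1.0, error_fn(w), g @ d
    for _ in range(max_halvings):        # shrink the global scale t by 1/q
        if error_fn(w + t * d) <= E0 + c1 * t * slope:
            return w + t * d             # sufficient decrease: accept step
        t /= q
    return w                             # no acceptable step found; keep w
```

On a simple quadratic error, the step decreases the error monotonically even when one of the supplied local rates is far too large, which is the behavior the abstract claims for the modified algorithms.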

### Citations

2732 | Learning internal representations by error propagation - Rumelhart, Hinton, et al. - 1986

Citation Context: ...to an undesired local minimum (undesired local minimizers are those having error function values higher than the desired error goal). In a more difficult problem, learning the three-bit parity [7], [25], a typical run for the Qprop method and its globally convergent modification is shown in Fig. 2. Starting with the same initial weights and learning parameters, the modified Qprop (dotted line) succe...

944 | Numerical methods for unconstrained optimization and nonlinear equations - Dennis, Schnabel - 1983

Citation Context: ...[12], [14], [24]. Moreover, it seems that using heuristics it is not possible to develop globally convergent algorithms and, thus, guarantee convergence to a local minimizer from any initial condition [5]. In the context of optimization theory, the issue of making an unconstrained minimization iterative scheme globally convergent is treated as will be described below. Suppose that E : D ⊂ ℝⁿ → ℝ is the ob...

644 | A direct adaptive method for faster backpropagation learning: The RPROP algorithm - Riedmiller, Braun - 1993

Citation Context: ...RATE ADAPTATION STRATEGIES ... a local learning rate ηᵢ (i = 1, 2, ..., n), i.e., η₁, η₂, ..., ηₙ, which increases if the successive corrections of the weights are in the same direction and decreases otherwise [8], [19], [23], [27]. This paper focuses on the last approach and particularly on the special class of first-order adaptive training algorithms that employ local learning rates. These algorithms employ heuristic st...

510 | Iterative Solution of Nonlinear Equations - Ortega, Rheinboldt - 2000

Citation Context: ...To this end, a simple backtracking strategy could be used to decrease the step by a reduction factor 1/q, where q > 1. This has the effect that the step is decreased by the largest number in the sequence q⁻ᵐ, m = 0, 1, 2, ... [18]. We remark here that the selection of q is not critical for successful learning; however, it has an influence on the number of error function evaluations required to satisfy the Wolfe conditions. A ...
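The backtracking rule quoted above (shrink the step by 1/q until sufficient decrease holds) is easy to demonstrate, and the influence of q on the number of error-function evaluations falls out directly. A minimal sketch with an Armijo-type acceptance test; the constant `c1 = 1e-4` is an illustrative choice, not a value from the paper:

```python
def backtrack(error_fn, w, d, slope, q=2.0, c1=1e-4):
    """Accept the largest step t in the sequence 1, 1/q, 1/q^2, ... that gives
    a sufficient decrease of the error; also count error-function evaluations,
    since that count is what the choice of q influences."""
    E0, t, evals = error_fn(w), 1.0, 1
    while True:
        evals += 1
        if error_fn(w + t * d) <= E0 + c1 * t * slope:
            return t, evals
        t /= q
```

With E(w) = w²/2 at w = 1 and a deliberately long direction d = -10, a larger q backtracks in fewer error evaluations but accepts a smaller (more conservative) step, illustrating the trade-off the snippet mentions.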

339 | Increased rates of convergence through learning rate adaptation. Neural Networks 1(4):295--308 - Jacobs - 1988

Citation Context: ...LEARNING RATE ADAPTATION STRATEGIES ... a local learning rate ηᵢ (i = 1, 2, ..., n), i.e., η₁, η₂, ..., ηₙ, which increases if the successive corrections of the weights are in the same direction and decreases otherwise [8], [19], [23], [27]. This paper focuses on the last approach and particularly on the special class of first-order adaptive training algorithms that employ local learning rates. These algorithms employ ...

300 | A scaled conjugate gradient algorithm for fast supervised learning - Møller - 1993

252 | Fast-learning variations on back-propagation: An empirical study - Fahlman - 1989

Citation Context: ...and its gradient, ∇E(w), by the backpropagation (BP) algorithm, consists of using a different adaptive learning rate for each direction in weight space. Batch-type BP training algorithms of this class [6], [8], [19], [23], [27], follow the iterative scheme ... decrease at each iteration and that the weight sequence will converge to a minimizer of the batch error function E. To alleviate this situation, ...

162 | Minimization of functions having Lipschitz continuous first partial derivatives - Armijo - 1966

Citation Context: ...al for successful learning; however, it has an influence on the number of error function evaluations required to satisfy the Wolfe conditions. A value of q = 2 is generally suggested in the literature [1], [18] and, indeed, it has been found to work without problems in our experiments. In reference to the Wolfe conditions (3)–(4), (3) ensures that the error is reduced sufficiently, while (4) prevents ...

128 | First- and second-order methods for learning: Between steepest descent and Newton's method - Battiti - 1992

Citation Context: ...of multilayer feedforward neural networks can be considered as a highly nonlinear minimization problem, involving sigmoid functions that have infinitely broad regions with arbitrarily small derivative [3], [26]. First-order training algorithms that follow the iterative scheme (1) usually evaluate the local learning rates by means of heuristic procedures that exploit information regarding the history o...

123 | Accelerating the convergence of the back-propagation method. Biological Cybernetics 59 - Vogl, Mangis, et al. - 1988

Citation Context: ...to 1) start with a small learning rate, η₀, and increase it at the next iteration, k+1, if successive iterations reduce the error, or rapidly decrease it if a significant error increase occurs [2], [29]; 2) start with a small η₀ and increase it at the k+1 iteration, if successive iterations keep the gradient direction fairly constant, or rapidly decrease it if the direction of the gradient varies greatly...
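Heuristic 1) in the snippet above (often called the "bold driver" schedule) is simple to state in code. The growth and shrink factors and the tolerance below are illustrative values, not taken from the paper:

```python
def adapt_global_rate(eta, prev_error, new_error, inc=1.05, dec=0.5, tol=1e-4):
    """'Bold driver'-style global learning-rate schedule: grow eta slightly
    while the error keeps falling, cut it sharply when the error rises
    significantly, and keep it when the error is essentially unchanged.
    inc, dec, and tol are illustrative constants."""
    if new_error < prev_error:
        return eta * inc              # reward: error was reduced
    if new_error > prev_error * (1 + tol):
        return eta * dec              # punish: significant error increase
    return eta                        # error roughly unchanged: keep eta
```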

83 | Theory of algorithms for unconstrained optimization - Nocedal - 1992

Citation Context: ...lim_{k→∞} ‖∇E(wₖ)‖ = 0 (7), which means that the sequence of gradients converges to zero. For an iterative scheme of the form (2), the limit (7) is the best type of global convergence result that can be obtained (see [17] for a detailed discussion). From the above, it is evident that no guarantee is provided that the iterative scheme (2) will converge to a global minimizer, w*, but only that it possesses the global ...

73 | Convergence conditions for ascent methods - Wolfe

Citation Context: ...a local minimizer of E [6], [8], [14], [19], [23], [27]. ∇E(wₖ + ηₖdₖ)ᵀdₖ ≥ σ₂∇E(wₖ)ᵀdₖ, where ∇E(w) is the gradient of E at w, and 0 < σ₁ < σ₂ < 1. Then, the following theorem, due to Wolfe [32], [33] and Zoutendijk [34], can be used to obtain global convergence results. Theorem 1 [32], [34]: Suppose that E : D ⊂ ℝⁿ → ℝ is bounded below in ℝⁿ and that E is continuously differentiable in a neighbo...
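In standard line-search notation (writing E for the batch error, wₖ for the weights, dₖ for the search direction, and ηₖ for the step), the Wolfe conditions (3)–(4) quoted above and the Zoutendijk conclusion take the following textbook form; this is a reconstruction in standard notation, since the scanned symbols are unreadable:

```latex
% Sufficient decrease (Armijo) condition (3):
E(w_k + \eta_k d_k) \le E(w_k) + \sigma_1 \eta_k \, \nabla E(w_k)^{\top} d_k

% Curvature condition (4):
\nabla E(w_k + \eta_k d_k)^{\top} d_k \ge \sigma_2 \, \nabla E(w_k)^{\top} d_k,
\qquad 0 < \sigma_1 < \sigma_2 < 1

% Zoutendijk's conclusion, with \theta_k the angle between d_k and -\nabla E(w_k):
\sum_{k \ge 0} \cos^2\theta_k \, \lVert \nabla E(w_k) \rVert^2 < \infty
```

The final sum being finite forces the gradient norms to vanish whenever cos θₖ stays bounded away from zero, which is exactly the role of keeping the (scaled) search direction a descent direction.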

47 | Accelerated backpropagation learning: two optimization methods - Battiti - 1989

Citation Context: ...ample, to 1) start with a small learning rate, η₀, and increase it at the next iteration, k+1, if successive iterations reduce the error, or rapidly decrease it if a significant error increase occurs [2], [29]; 2) start with a small η₀ and increase it at the k+1 iteration, if successive iterations keep the gradient direction fairly constant, or rapidly decrease it if the direction of the gradient varies g...

41 | Ill-conditioning in neural network training problems - Saarinen, Bramley, et al. - 1993

Citation Context: ...multilayer feedforward neural networks can be considered as a highly nonlinear minimization problem, involving sigmoid functions that have infinitely broad regions with arbitrarily small derivative [3], [26]. First-order training algorithms that follow the iterative scheme (1) usually evaluate the local learning rates by means of heuristic procedures that exploit information regarding the history of the ...

38 | Acceleration techniques for the back-propagation algorithm - Silva, Almeida - 1990

Citation Context: ...ADAPTATION STRATEGIES ... a local learning rate ηᵢ (i = 1, 2, ..., n), i.e., η₁, η₂, ..., ηₙ, which increases if the successive corrections of the weights are in the same direction and decreases otherwise [8], [19], [23], [27]. This paper focuses on the last approach and particularly on the special class of first-order adaptive training algorithms that employ local learning rates. These algorithms employ heuristic strategi...

25 | An adaptive training algorithm for back-propagation networks - Chan, Fallside - 1987

Citation Context: ...2) start with a small η₀ and increase it at the k+1 iteration, if successive iterations keep the gradient direction fairly constant, or rapidly decrease it if the direction of the gradient varies greatly [4]; 3) use a local learning rate for each weight wᵢ ... II. LOCAL LEARNING RATE ADAPTATION STRATEGIES ... a local learning rate ηᵢ (i = 1, 2, ..., n), i.e., η₁, η₂, ..., ηₙ, which increases if the successive corrections of t...

21 | Image recognition and neuronal networks: Intelligent systems for the improvement of imaging information, Minimal Invasive Therapy & Allied Technologies - Karkanis, Magoulas, et al. - 2000

Citation Context: ...normal and ten abnormal tissue samples have been randomly chosen from each image and used for training the network to discriminate between malignant and normal regions with 3% classification error (see [10] for further technical details). We have used G-Qprop. Fig. 4(b) shows a plot of the average percentage of success with respect to six initial weight ranges (−a, a), where a ∈ {0.2, 0.6, 1, 1.4, 1.8, ...

21 | A class of gradient unconstrained minimization algorithms with adaptive step-size - Vrahatis, Androulakis, et al. - 2000 |

16 | Effective back-propagation with variable stepsize - Magoulas, Vrahatis, et al. - 1997

Citation Context: ...at this point, it is useful to illustrate the behavior of the proposed strategy by means of a simple example, which concerns the case of a single node with two weights and logistic activation function [13]. This minimal architecture is trained using the classic Qprop method and its globally convergent modification, which uses a positive learning rate value η₁ computed by the Qprop formula and η₂ give...

13 | Simulated annealing and weight decay in adaptive learning: The SARPROP algorithm - Treadgold, Gedeon - 1998

12 | An analysis of premature saturation in backpropagation learning - Lee, Oh, et al. - 1993

Citation Context: ...ng inappropriate values for the heuristic learning parameters can considerably slow the rate of training, or even lead to divergence and to premature saturation, as has been observed in certain cases [12], [14], [24]. Moreover, it seems that using heuristics it is not possible to develop globally convergent algorithms and, thus, guarantee convergence to a local minimizer from any initial condition [5]...

9 | Rescaling of variables in backpropagation learning - Rigler, Irvine, et al. - 1991

Citation Context: ...inappropriate values for the heuristic learning parameters can considerably slow the rate of training, or even lead to divergence and to premature saturation, as has been observed in certain cases [12], [14], [24]. Moreover, it seems that using heuristics it is not possible to develop globally convergent algorithms and, thus, guarantee convergence to a local minimizer from any initial condition [5]. In the con...

6 | Globally convergent modification of the quickprop method - Vrahatis, Magoulas, et al. - 2000

Citation Context: ...E(w) with respect to the ith weight and/or the history of each learning rate, depending on the algorithm. For example, the Qprop [6] performs independent secant steps in the direction of each weight [31], while the Rprop algorithm [23] updates the weights using the learning rate and the sign of the partial derivative of the error function with respect to each weight. Clearly, the weight vector in (...
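The sign-based Rprop update described in the snippet above can be sketched as follows. The increase/decrease factors and the bounds are the commonly used Rprop defaults, and the sketch deliberately omits Rprop's weight-backtracking refinements:

```python
import numpy as np

def rprop_step(w, g, g_prev, eta, eta_plus=1.2, eta_minus=0.5,
               eta_max=50.0, eta_min=1e-6):
    """Sign-based local learning-rate update in the spirit of Rprop: each
    weight's rate grows when its partial derivative keeps its sign across
    iterations and shrinks when the sign flips; the weight then moves by
    eta * sign(g), ignoring the gradient's magnitude.  A simplified sketch
    with the usual default constants."""
    same = g * g_prev                     # >0: same sign, <0: sign flipped
    eta = np.where(same > 0, np.minimum(eta * eta_plus, eta_max),
          np.where(same < 0, np.maximum(eta * eta_minus, eta_min), eta))
    return w - eta * np.sign(g), eta
```

Because only the sign of each partial derivative is used, a rate appropriate for one weight direction does not have to suit the others, which is exactly the motivation for local rates given in the paper.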

5 | On complexity analysis of supervised MLP-learning for algorithmic comparisons - Mizutani, Dreyfus - 2001

Citation Context: ...of success with the original methods. Note that in training practice, a gradient evaluation is considered three times more costly than an error function evaluation for certain classes of problems [14]–[16]. Our experience from the simulations is that the proposed strategy behaves predictably and reliably. Below, we exhibit results from 100 runs for the SA method, the Qprop algorithm, and their globally...

5 | Learning in multilayer perceptrons using global optimization strategies, Nonlinear Analysis: Theory, Methods and Applications - Plagianakos, Magoulas, et al. - 2001

Citation Context: ...we are interested in developing algorithms that will converge to a local minimizer with certainty. The interesting topic of finding global minimizers in training neural networks is described elsewhere [20]–[22], [28]. This paper is organized as follows. In Section II, local learning rate training algorithms are presented, and their advantages and disadvantages are discussed. The proposed approach and t...

4 | Solving the n-bit parity problem using neural networks - Hohil, Liu, et al. - 1999 |

3 | Construction of a large scale neural network: Simulation of handwritten Japanese character recognition - Joe, Mori, et al.

Citation Context: ...al learning rates are mainly motivated by the need to train neural networks in situations when a learning rate appropriate for one weight direction is not necessarily appropriate for other directions [9]. Moreover, in certain cases a learning rate may not be appropriate for all of the portions of the error surface. To this end, a common approach to avoid slow convergence in flat directions and oscill...

3 | Early Colorectal Cancer - Kudo - 1996

Citation Context: ...different types of abnormalities in colonoscopy images taken from two different colons. Image 1 [Fig. 4(a), left] is considered histologically as a low-grade cancer (a Type-III lesion macroscopically [11]). Image 2 [Fig. 4(b), right] is considered histologically as a moderately differentiated carcinoma (a Type-V lesion macroscopically). Textures from ten normal and ten abnormal tissue samples have been...

3 | Speeding-up backpropagation—A comparison of orthogonal techniques - Pfister, Rojas - 1993

Citation Context: ...ING RATE ADAPTATION STRATEGIES ... a local learning rate ηᵢ (i = 1, 2, ..., n), i.e., η₁, η₂, ..., ηₙ, which increases if the successive corrections of the weights are in the same direction and decreases otherwise [8], [19], [23], [27]. This paper focuses on the last approach and particularly on the special class of first-order adaptive training algorithms that employ local learning rates. These algorithms employ heuris...

3 | Nonlinear programming, computational methods, Integer and Nonlinear Programming - Zoutendijk - 1970

Citation Context: ...[6], [8], [14], [19], [23], [27]. ∇E(wₖ + ηₖdₖ)ᵀdₖ ≥ σ₂∇E(wₖ)ᵀdₖ, where ∇E(w) is the gradient of E at w, and 0 < σ₁ < σ₂ < 1. Then, the following theorem, due to Wolfe [32], [33] and Zoutendijk [34], can be used to obtain global convergence results. Theorem 1 [32], [34]: Suppose that E : D ⊂ ℝⁿ → ℝ is bounded below in ℝⁿ and that E is continuously differentiable in a neighborhood N of the level set L...

1 | Supervised training using global search methods, in Advances in Convex Analysis and Global Optimization

1 | Improved learning of neural nets through global search, in Global Optimization—Selected Case Studies, Edited Volume of Case...

Citation Context: ...interested in developing algorithms that will converge to a local minimizer with certainty. The interesting topic of finding global minimizers in training neural networks is described elsewhere [20]–[22], [28]. This paper is organized as follows. In Section II, local learning rate training algorithms are presented, and their advantages and disadvantages are discussed. The proposed approach and the co...