## Linear-Least-Squares Initialization of Multilayer Perceptrons Through Backpropagation of the Desired Response

Citations: 6 (1 self)

### BibTeX

```bibtex
@ARTICLE{Erdogmus_linear-least-squaresinitialization,
  author  = {Deniz Erdogmus and Oscar Fontenla-Romero and Jose C. Principe and Amparo Alonso-Betanzos and Enrique Castillo},
  title   = {Linear-Least-Squares Initialization of Multilayer Perceptrons Through Backpropagation of the Desired Response},
  journal = {IEEE Transactions on Neural Networks},
  volume  = {16},
  number  = {2},
  year    = {2005}
}
```


### Abstract

Training multilayer neural networks is typically carried out using descent techniques such as the gradient-based backpropagation (BP) of error or quasi-Newton approaches, including the Levenberg–Marquardt algorithm. This is basically due to the fact that there are no analytical methods to find the optimal weights, so iterative local or global optimization techniques are necessary. The success of iterative optimization procedures depends strongly on the initial conditions; therefore, in this paper, we devise a principled novel method of backpropagating the desired response through the layers of a multilayer perceptron (MLP), which enables us to accurately initialize these neural networks in the minimum mean-square-error sense using the analytic linear least squares solution. The generated solution can be used as an initial condition for standard iterative optimization algorithms. However, simulations demonstrate that in most cases the performance achieved through the proposed initialization scheme leaves little room for further improvement in the mean-square error (MSE) over the training set. In addition, the performance of the network optimized with the proposed approach also generalizes well to testing data. A rigorous derivation of the initialization algorithm is presented, and its high performance is verified on a number of benchmark training problems, including chaotic time-series prediction, classification, and nonlinear system identification with MLPs.

Index Terms: Approximate least-squares training of multilayer perceptrons (MLPs), backpropagation (BP) of desired response, neural network initialization.
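The core step the abstract describes can be sketched for the output layer of a two-layer MLP: backpropagate the desired response through the invertible output nonlinearity, then solve for the weights analytically by linear least squares. Below is a minimal NumPy sketch under assumed details (tanh nonlinearities, a small ridge term for conditioning; `lls_output_init` and all constants are ours), not the authors' full layer-by-layer algorithm.

```python
import numpy as np

def lls_output_init(H, d, eps=1e-3, reg=1e-6):
    """Least-squares initialization of an MLP output layer (sketch).

    H   : (N, n_h) hidden-layer activations for N training samples
    d   : (N, n_o) desired network outputs, assumed to lie in (-1, 1)
    eps : margin keeping targets strictly inside tanh's invertible range
    reg : small ridge term keeping the normal equations well conditioned

    Returns W of shape (n_h + 1, n_o), the last row being the bias.
    """
    N = H.shape[0]
    # Backpropagate the desired response through the output nonlinearity:
    # if y = tanh(z), the pre-activation target is z = atanh(d).
    z_target = np.arctanh(np.clip(d, -1 + eps, 1 - eps))
    # Append a bias column and solve the regularized linear least-squares
    # problem  min_W ||Ha W - z_target||^2 + reg ||W||^2  analytically.
    Ha = np.hstack([H, np.ones((N, 1))])
    return np.linalg.solve(Ha.T @ Ha + reg * np.eye(Ha.shape[1]),
                           Ha.T @ z_target)

# Toy usage: fit the output layer of a randomly initialized 1-5-1 MLP.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200).reshape(-1, 1)
d = 0.8 * np.sin(np.pi * x)                     # desired response
W1, b1 = rng.normal(size=(1, 5)), rng.normal(size=5)
H = np.tanh(x @ W1 + b1)                        # hidden activations
W2 = lls_output_init(H, d)
y = np.tanh(np.hstack([H, np.ones((len(x), 1))]) @ W2)
mse = np.mean((y - d) ** 2)
```

Because tanh is a contraction, the output-domain MSE is bounded by the least-squares residual in the pre-activation domain, which the analytic solve minimizes.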

### Citations

851 | Approximations by superpositions of sigmoidal functions - Cybenko
Citation Context: ... MLPs with more layers. However, it is known that an MLP with only a single hidden layer that contains a sufficiently large number of hidden PEs can approximate any continuous-differentiable function [33], [34], thus, this topology is sufficient, in general. Consider a two-layer MLP with inputs, hidden PEs, and output PEs. For the sake of generality, we also assume that the output layer contains nonli...

793 | The perceptron: A probabilistic model for information storage and organization in the brain - Rosenblatt - 1958
Citation Context: ...ural network initialization. I. INTRODUCTION ALTHOUGH examples of neural networks have appeared in the literature as possible function approximation and adaptive filtering tools as early as 1950s [1]–[5], due to the Manuscript received June 10, 2003; revised March 19, 2004. This work was supported in part by the National Science Foundation under Grant ECS-0300340. The work of D. Erdogmus and O. Fonte...

758 | Learning representations by back-propagating errors - Rumelhart, Hinton, et al. - 1986
Citation Context: ...et al. proposed the well-known and widely appreciated error backpropagation (BP) algorithm for training multilayer perceptrons (MLPs) and other neural networks belonging to Grossberg’s additive model [6]. For the last two decades, while appreciating this brilliant algorithm, researchers have focused their efforts on improving the convergence properties of BP, the main concern being the slow convergen...

343 | Increased rates of convergence through learning rate adaptation - Jacobs - 1988
Citation Context: ...mproving the convergence properties of BP, the main concern being the slow convergence speed due to its gradient-descent nature. Among these, we mention the momentum term [6], [7], adaptive stepsizes [8], Amari’s natural gradient [9], and more advanced search methods based on line search (conjugate gradient) [10], [11], and pseudo-Newton methods [10], [12]–[16], and even exact evaluation of the Hessi...

293 | Natural gradient works efficiently in learning - Amari - 1998
Citation Context: ...rties of BP, the main concern being the slow convergence speed due to its gradient-descent nature. Among these, we mention the momentum term [6], [7], adaptive stepsizes [8], Amari’s natural gradient [9], and more advanced search methods based on line search (conjugate gradient) [10], [11], and pseudo-Newton methods [10], [12]–[16], and even exact evaluation of the Hessian [17], [18]. A clear problem...

284 | Training feedforward networks with the Marquardt algorithm - Hagan, Menhaj - 1994 |

248 | Time series prediction: Forecasting the future and understanding the past - Weigend, Gershenfeld - 1994
Citation Context: ... are benchmark problem types that are usually considered in the literature. Therefore, our case studies include the single-step prediction of two real-world chaotic time-series (the laser time series [35], [36], and the Dow Jones closing index series [36]). In addition, the identification of the realistic nonlinear engine manifold dynamics (based on a car engine [37], [38]) is carried out using an MLP...

208 | Approximation capabilities of multilayer feedforward neural networks (Neural Networks 4) - Hornik - 1991
Citation Context: ...with more layers. However, it is known that an MLP with only a single hidden layer that contains a sufficiently large number of hidden PEs can approximate any continuous-differentiable function [33], [34], thus, this topology is sufficient, in general. Consider a two-layer MLP with inputs, hidden PEs, and output PEs. For the sake of generality, we also assume that the output layer contains nonlinearit...

188 | Steps toward artificial intelligence - Minsky - 1961 |

126 | Accelerating the convergence of the back-propagation method - Vogl, Mangis, et al. - 1988
Citation Context: ...cused their efforts on improving the convergence properties of BP, the main concern being the slow convergence speed due to its gradient-descent nature. Among these, we mention the momentum term [6], [7], adaptive stepsizes [8], Amari’s natural gradient [9], and more advanced search methods based on line search (conjugate gradient) [10], [11], and pseudo-Newton methods [10], [12]–[16], and even exact...

56 | Learning Algorithms for Connectionist Networks: Applied Gradient Methods of Nonlinear Optimization - Watrous - 1987
Citation Context: ...nt-descent nature. Among these, we mention the momentum term [6], [7], adaptive stepsizes [8], Amari’s natural gradient [9], and more advanced search methods based on line search (conjugate gradient) [10], [11], and pseudo-Newton methods [10], [12]–[16], and even exact evaluation of the Hessian [17], [18]. A clear problem with all of these iterative optimization techniques is their susceptibility to l...

49 | Exact Calculation of the Hessian Matrix for the Multilayer - Bishop - 1992
Citation Context: ...mari’s natural gradient [9], and more advanced search methods based on line search (conjugate gradient) [10], [11], and pseudo-Newton methods [10], [12]–[16], and even exact evaluation of the Hessian [17], [18]. A clear problem with all of these iterative optimization techniques is their susceptibility to local minima. In order to conquer this difficulty global search procedures have to be introduced....

39 | First and second order methods for learning: Between steepest descent and Newton's method - Battiti - 1992
Citation Context: ...mentum term [6], [7], adaptive stepsizes [8], Amari’s natural gradient [9], and more advanced search methods based on line search (conjugate gradient) [10], [11], and pseudo-Newton methods [10], [12]–[16], and even exact evaluation of the Hessian [17], [18]. A clear problem with all of these iterative optimization techniques is their susceptibility to local minima. In order to conquer this difficulty ...

36 | Experiments in nonconvex optimization: Stochastic approximation with function smoothing and simulated annealing - Styblinski, Tang - 1990
Citation Context: ...ive optimization techniques is their susceptibility to local minima. In order to conquer this difficulty global search procedures have to be introduced. From simple methods using random perturbations [19] to more principled global search algorithms based on genetic algorithms [20] or simulated annealing [21], such proposals have also appeared in the literature; most without a striking success. A less ...

34 | Optimisation for Training Neural Nets - Barnard |

29 | Tests on a cell assembly theory of the action of the brain, using a large digital computer - Rochester, Holland, et al. - 1956
Citation Context: ..., neural network initialization. I. INTRODUCTION ALTHOUGH examples of neural networks have appeared in the literature as possible function approximation and adaptive filtering tools as early as 1950s [1]–[5], due to the Manuscript received June 10, 2003; revised March 19, 2004. This work was supported in part by the National Science Foundation under Grant ECS-0300340. The work of D. Erdogmus and O. F...

28 | Computing Second Derivatives in Feed-Forward Networks: A Review - Buntine, Weigend - 1994
Citation Context: ... natural gradient [9], and more advanced search methods based on line search (conjugate gradient) [10], [11], and pseudo-Newton methods [10], [12]–[16], and even exact evaluation of the Hessian [17], [18]. A clear problem with all of these iterative optimization techniques is their susceptibility to local minima. In order to conquer this difficulty global search procedures have to be introduced. From ...

27 | Alternative Neural Network Training Methods [Active Sonar Processing] - Porto, Fogel - 1995
Citation Context: ...global search procedures have to be introduced. From simple methods using random perturbations [19] to more principled global search algorithms based on genetic algorithms [20] or simulated annealing [21], such proposals have also appeared in the literature; most without a striking success. A less researched area deals with the search for good initialization algorithms. An interesting approach by Husk...

14 | Initialization and optimization of multilayered perceptrons - Duch, Adamczak, et al. - 1997
Citation Context: ...he momentum term [6], [7], adaptive stepsizes [8], Amari’s natural gradient [9], and more advanced search methods based on line search (conjugate gradient) [10], [11], and pseudo-Newton methods [10], [12]–[16], and even exact evaluation of the Hessian [17], [18]. A clear problem with all of these iterative optimization techniques is their susceptibility to local minima. In order to conquer this diffic...

14 | Improving the learning speed of two-layer neural networks by choosing initial values of the adaptive weights - Nguyen, Widrow - 1990
Citation Context: ...) from a predetermined library according to the output mean-square-error (MSE) [24]. Nguyen and Widrow proposed assigning each hidden PE to an approximate portion of the range of the desired response [25]. Drago and Ridella, on the other hand, proposed the statistically controlled activation weight initialization, which aimed to prevent PEs from saturating during adaptation by estimating the maximum v...
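The scheme in [25] is usually implemented by normalizing random hidden weight vectors and scaling them so that each unit's active region covers a distinct slice of the input range. A hedged sketch follows: the 0.7 scale factor is the commonly quoted constant for inputs in [-1, 1], and the function name and details are assumptions, not taken from [25] directly.

```python
import numpy as np

def nguyen_widrow_init(n_in, n_hidden, rng=None):
    """Nguyen-Widrow-style hidden-layer initialization (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    # Random directions, one column per hidden unit, normalized to unit length.
    W = rng.uniform(-1, 1, size=(n_in, n_hidden))
    W /= np.linalg.norm(W, axis=0, keepdims=True)
    # Scale so the units' near-linear regions jointly tile the input range.
    beta = 0.7 * n_hidden ** (1.0 / n_in)
    W *= beta
    # Spread the biases so the units' transition regions are staggered.
    b = rng.uniform(-beta, beta, size=n_hidden)
    return W, b

# Toy usage: a 2-input, 10-hidden-unit layer.
W, b = nguyen_widrow_init(2, 10, rng=np.random.default_rng(1))
```

The normalization-then-scale step is what distinguishes this from plain random initialization: coverage of the input range is engineered rather than left to chance.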

13 | Statistically Controlled Activation Weight Initialization (SCAWI) - Drago, Ridella - 1992
Citation Context (quoting p. 326, IEEE Transactions on Neural Networks, vol. 16, no. 2, March 2005): ...sed the statistically controlled activation weight initialization, which aimed to prevent PEs from saturating during adaptation by estimating the maximum values that the weights should take initially [26]. There have been heuristic proposals of least-squares initialization and training approaches [27]–[29], which did not rigorously c...

12 | Efficient Training of Feed-Forward Neural Networks - Moller - 1993
Citation Context: ...cent nature. Among these, we mention the momentum term [6], [7], adaptive stepsizes [8], Amari’s natural gradient [9], and more advanced search methods based on line search (conjugate gradient) [10], [11], and pseudo-Newton methods [10], [12]–[16], and even exact evaluation of the Hessian [17], [18]. A clear problem with all of these iterative optimization techniques is their susceptibility to local m...

12 | Training neural networks with additive noise in the desired signal - Wang, Principe - 1999
Citation Context: ...tem equations that yield the least squares solutions for the weight matrices may become ill-conditioned. In such case, introducing a small additive noise to the desired output values, as suggested in [39], helps to improve the conditioning of these matrices as well as reduce the number of iterations required to get a very good initializing solution. The training performance using such a noisy desired ...
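In the proposed scheme the backpropagated desired values enter the layer-wise least-squares system matrices, which can become ill-conditioned; [39] suggests dithering the desired output with small noise. For a single linear solve, a small ridge term on the Gram matrix has a closely related stabilizing effect. A toy illustration (all names and constants below are ours, not from [39]):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two nearly collinear activation columns -> ill-conditioned normal equations.
h1 = rng.normal(size=(100, 1))
H = np.hstack([h1, h1 + 1e-8 * rng.normal(size=(100, 1))])
d = H @ np.array([[1.0], [2.0]])                 # consistent desired output

gram = H.T @ H
cond_plain = np.linalg.cond(gram)                # astronomically large

# Dither the desired signal slightly (the spirit of [39]) and add a tiny
# ridge term; the ridge directly lifts the Gram matrix's smallest eigenvalue.
d_noisy = d + 1e-4 * rng.normal(size=d.shape)
gram_reg = gram + 1e-6 * np.eye(2)
cond_reg = np.linalg.cond(gram_reg)              # far better conditioned

w = np.linalg.solve(gram_reg, H.T @ d_noisy)     # stable solution
```

The solution is slightly perturbed by the noise and the ridge, but it still reproduces the desired output closely, whereas solving the raw normal equations would amplify rounding error enormously.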

10 | High-order and multilayer perceptron initialization - Thimm, Fiesler - 1997
Citation Context: ...is not trivial. Since typically random weight initialization is utilized, Thimm and Fiesler investigated the effects of the distribution type and its variance on the speed and performance of training [23], while Colla et al. discussed using the orthogonal least squares method for selecting weights for the processing elements (PEs) from a predetermined library according to the output mean-square-error ...

10 | A Global Optimum Approach for One-Layer Neural Networks - Castillo, Fontenla-Romero, et al. - 2002
Citation Context: ...so proposed a linear least squares training approach for single-layer nonlinear neural networks; the nonlinearity of the neurons were also adapted besides the weights in order to optimize performance [30]. Along these lines, Cho and Chow proposed optimizing the output layer using linear least squares (assuming a linear output layer) and optimizing the preceding layers using gradient descent [31]. In t...

9 | A learning algorithm for multilayered neural networks based on linear least squares problems (Neural Networks 6(1)) - Biegler-Konig, Barmann - 1993 |

8 | Training multilayer neural networks using fast global learning algorithm - least squares and penalized optimization methods (Neurocomputing 25) - Cho, Chow - 1999
Citation Context: ...mance [30]. Along these lines, Cho and Chow proposed optimizing the output layer using linear least squares (assuming a linear output layer) and optimizing the preceding layers using gradient descent [31]. In this paper, we propose a new initialization algorithm that is based on a simple linear approximation to the nonlinear solution that can be computed analytically with linear least squares. We reas...
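The hybrid described in this context can be sketched as alternating an exact least-squares solve for the linear output layer with a gradient-descent step on the hidden layer. The following is an illustrative sketch under assumed details (tanh hidden layer, ridge-regularized solve, all names and constants ours), not the algorithm from [31]:

```python
import numpy as np

def hybrid_step(x, d, W1, b1, lr=0.02, reg=1e-3):
    """One iteration of a hybrid scheme (sketch): the linear output layer
    is solved exactly by ridge-regularized least squares, while the hidden
    (tanh) layer takes an ordinary gradient-descent step on the error.
    """
    N = x.shape[0]
    H = np.tanh(x @ W1 + b1)                      # hidden activations
    Ha = np.hstack([H, np.ones((N, 1))])          # append bias column
    # Exact least-squares solve for the linear output layer.
    W2 = np.linalg.solve(Ha.T @ Ha + reg * np.eye(Ha.shape[1]), Ha.T @ d)
    e = Ha @ W2 - d                               # output error
    # Standard backpropagated gradient for the hidden layer (tanh' = 1 - H^2).
    delta = (e @ W2[:-1].T) * (1 - H ** 2)
    W1 = W1 - lr * x.T @ delta / N
    b1 = b1 - lr * delta.mean(axis=0)
    return W1, b1, W2, np.mean(e ** 2)

# Toy usage: hidden layer refined by gradient descent, output layer by LS.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200).reshape(-1, 1)
d = 0.8 * np.sin(np.pi * x)
W1 = rng.uniform(-2, 2, size=(1, 5))
b1 = rng.uniform(-1, 1, size=5)
mses = []
for _ in range(30):
    W1, b1, W2, mse = hybrid_step(x, d, W1, b1)
    mses.append(mse)
```

Because the output layer is re-solved optimally at every step, each recorded MSE is the best achievable for the current hidden weights, so the gradient refinement only has to improve the hidden-layer features.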

7 | Use of genetic programming for the search of a new learning rule for neural networks - Bengio, Bengio, et al. - 1994
Citation Context: ... to conquer this difficulty global search procedures have to be introduced. From simple methods using random perturbations [19] to more principled global search algorithms based on genetic algorithms [20] or simulated annealing [21], such proposals have also appeared in the literature; most without a striking success. A less researched area deals with the search for good initialization algorithms. An ...

6 | A new method in determining the initial weights of feedforward neural networks for training enhancement - Yam, Chow - 1997
Citation Context: ...he weights should take initially [26]. There have been heuristic proposals of least-squares initialization and training approaches [27]–[29], which did not rigorously consider the transformation of desired output through the layers. Castillo et al. also proposed a linear least squares training approach for single-layer nonlinear ne...

5 | A Universal Non-linear Filter, Predictor, and Simulator which Optimizes Itself by a Learning Process - Gabor, Wilby, et al. - 1959 |

5 | Observer-Based Air-Fuel Ratio Control - Powell, Fekete, et al. - 1998
Citation Context: ...time-series (the laser time series [35], [36], and the Dow Jones closing index series [36]). In addition, the identification of the realistic nonlinear engine manifold dynamics (based on a car engine [37], [38]) is carried out using an MLP. The engine manifold model assumes the manifold pressure and manifold temperature as the states, and the pressure as the system output. The input is the angle of th...

3 | Fast learning for problem classes using knowledge based network initialization - Husken, Goerick - 2000
Citation Context: ...en and Goerick was to utilize evolutionary algorithms to select a good set of initialization weights for a neural network from a set of optimal weight solutions obtained a priori for similar problems [22]. After all, if we could approximately choose the weights close to the global optimum, then BP or any other descent algorithm (e.g., Levenberg–Marquardt) could take the network weights toward the opti...

2 | Temporal and Spatial Patterns in a Conditional Probability Machine - Uttley - 1956 |

2 | Optimization in Companion Search Spaces: The Case of Cross-Entropy and the Levenberg-Marquardt Algorithm - Fancourt, Principe - 2001 |

2 | Orthogonal least squares algorithm applied to the initialization of multi-layer perceptrons - Colla, Reyneri, et al. - 1999
Citation Context: ...e Colla et al. discussed using the orthogonal least squares method for selecting weights for the processing elements (PEs) from a predetermined library according to the output mean-square-error (MSE) [24]. Nguyen and Widrow proposed assigning each hidden PE to an approximate portion of the range of the desired response [25]. Drago and Ridella, on the other hand, proposed the statistically controlled a...

2 | A neural network perspective to extended Luenberger observers - Erdogmus, Genc, et al. - 2002
Citation Context: ...eries (the laser time series [35], [36], and the Dow Jones closing index series [36]). In addition, the identification of the realistic nonlinear engine manifold dynamics (based on a car engine [37], [38]) is carried out using an MLP. The engine manifold model assumes the manifold pressure and manifold temperature as the states, and the pressure as the system output. The input is the angle of the thro...

1 | Exploring and comparing the best "direct methods" for the efficient training of MLP-networks - Martino, Prolasi - 1996
Citation Context: ...ights should take initially [26]. There have been heuristic proposals of least-squares initialization and training approaches [27]–[29], which did not rigorously consider the transformation of desired output through the layers. Castillo et al. also proposed a linear least squares training approach for single-layer nonlinear neural ...