## A delay damage model selection algorithm for NARX neural networks (1997)

Venue: IEEE Transactions on Signal Processing

Citations: 9 (1 self)

### BibTeX

```bibtex
@ARTICLE{Lin97adelay,
  author  = {Tsung-Nan Lin and C. Lee Giles and Bill G. Horne and Sun-Yuan Kung},
  title   = {A delay damage model selection algorithm for NARX neural networks},
  journal = {IEEE Transactions on Signal Processing},
  year    = {1997},
  volume  = {45},
  number  = {11},
  pages   = {2719--2730}
}
```

### Abstract

Recurrent neural networks have become popular models for system identification and time series prediction. Nonlinear autoregressive models with exogenous inputs (NARX) neural network models are a popular subclass of recurrent networks and have been used in many applications. Although embedded memory can be found in all recurrent network models, it is particularly prominent in NARX models. We show that using intelligent memory order selection through pruning and good initial heuristics significantly improves the generalization and predictive performance of these nonlinear systems on problems as diverse as grammatical inference and time series prediction.
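The NARX model class the abstract refers to is easy to sketch. The following is an illustrative toy, not the paper's implementation: the small fixed-weight MLP stands in for a trained network, and the memory orders, hidden width, and input series are arbitrary choices of ours.

```python
from math import tanh
import random

random.seed(0)

# A tiny fixed-weight MLP stands in for the trained nonlinear map f.
DU, DY, HIDDEN = 2, 2, 4    # input-memory order, output-memory order, width
W1 = [[random.uniform(-1, 1) for _ in range(DU + DY)] for _ in range(HIDDEN)]
W2 = [random.uniform(-1, 1) for _ in range(HIDDEN)]

def mlp(x):
    hidden = [tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return sum(w * h for w, h in zip(W2, hidden))

def narx_step(f, u_hist, y_hist, du, dy):
    """One NARX prediction: y(t) = f(u(t-1..t-du), y(t-1..t-dy))."""
    return f(list(u_hist[-du:]) + list(y_hist[-dy:]))

# Free-run the model: each prediction is fed back as output memory,
# which is exactly the feedback-from-the-output-neuron structure of NARX.
u = [0.5, -0.2, 0.1, 0.8, 0.3, -0.4]
y = [0.0, 0.0]                       # initial output memory (dy = 2)
for t in range(DY, len(u)):
    y.append(narx_step(mlp, u[:t], y, DU, DY))
```

The tapped delays `u_hist[-du:]` and `y_hist[-dy:]` are the "embedded memory" whose orders the paper's delay damage algorithm selects.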

### Citations

1840 | A new look at the statistical model identification - Akaike - 1974
Citation Context: ...ffectiveness of problem solving. When there is no prior knowledge about the model of the underlying process, traditional statistical tests can be used, for example, Akaike information criterion (AIC) [1] and the minimum description length (MDL) principle [53]. Such models are judged on their “goodness-of-fit,” which is a function of the likelihood of the data given the hypothesized model and its asso...
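For a least-squares model with Gaussian errors, the AIC cited in this context reduces to a trade-off between residual error and parameter count. A minimal sketch, with made-up RSS numbers purely for illustration:

```python
from math import log

def aic(n, rss, k):
    """Akaike information criterion for a least-squares fit with Gaussian
    errors: AIC = n*ln(RSS/n) + 2k (additive constants dropped)."""
    return n * log(rss / n) + 2 * k

# A model with more taps must cut residual error enough to pay for the
# extra parameters; with these invented numbers the barely-better fit
# of the larger model does not justify its four extra taps.
n = 200
small = aic(n, rss=12.0, k=5)
large = aic(n, rss=11.8, k=9)
best = "small" if small < large else "large"
```

Lower AIC is better, so here `best` selects the smaller model.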

1225 | Multilayer feedforward networks are universal approximators. Neural Netw - Hornik, Stinchcombe, et al. - 1989
Citation Context: ...ximate the mapping function . It has been shown that a feedforward neural network with enough neurons is capable of approximating any nonlinear function to an arbitrary degree of accuracy [12], [19], [28], [29]. Neural networks thus can provide a good approximation to the function . These arguments provide the basic motivation for the use of NARX networks to the nonlinear time series prediction. It is...

1191 | System Identification Theory for the User - Ljung - 1999
Citation Context: ... handwritten ZIP codes by pruning the weights of feedforward networks [10], [11]. II. NARX NEURAL NETWORK An important and useful class of discrete-time nonlinear systems is the NARX model [6], [34], [39], [57], [58] where and represent input and output of the model at time , and are the input-memory and output-memory order, and the function is a nonlinear function. When the function can be approximat...

841 | Approximation by superpositions of a sigmoidal function - Cybenko - 1989
Citation Context: ... is to approximate the mapping function . It has been shown that a feedforward neural network with enough neurons is capable of approximating any nonlinear function to an arbitrary degree of accuracy [12], [19], [28], [29]. Neural networks thus can provide a good approximation to the function . These arguments provide the basic motivation for the use of NARX networks to the nonlinear time series predi...

509 | Detecting strange attractors in turbulence, Dynamical Systems and Turbulence - Takens - 1981
Citation Context: ...el includes as few regressors as possible because the variance of the model’s predictions increases along with the increasing number of regressors [41]. According to the embedding theorem [48], [54], [60], the memory orders need to be large enough in order to provide a sufficient embedding. The problem of choosing the proper memory architecture corresponds to giving a good representation of input data...
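The delay-coordinate embedding behind the embedding theorem mentioned here is simple to sketch; the helper name and the toy series below are ours, not the paper's:

```python
def delay_embed(series, dim, tau=1):
    """Map a scalar series into dim-dimensional delay vectors
    [x(t), x(t-tau), ..., x(t-(dim-1)*tau)]."""
    start = (dim - 1) * tau
    return [[series[t - j * tau] for j in range(dim)]
            for t in range(start, len(series))]

x = [0.1, 0.4, 0.9, 0.2, 0.7, 0.5]
vecs = delay_embed(x, dim=3)
# Each row is one reconstructed state vector; the input and output
# taps of a NARX network play exactly this role, which is why the
# memory orders must be large enough to give a sufficient embedding.
```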

457 | Identification and Control of Dynamical Systems using Neural Networks - Narendra, Parthasarathy - 1990
Citation Context: ...uning, recurrent neural networks, tapped-delay lines, temporal sequences, time series. I. INTRODUCTION NONLINEAR autoregressive models with exogenous inputs (NARX) recurrent neural architectures [6], [44], as opposed to other recurrent neural models, have limited feedback architectures that come only from the output neuron instead of from hidden neurons. It has been shown that in theory, one can use N...

419 | Optimal brain damage - LeCun, Denker, et al. - 1990

339 | On the Approximate Realization of Continuous Mappings by Neural Networks - Funahashi - 1989
Citation Context: ... approximate the mapping function . It has been shown that a feedforward neural network with enough neurons is capable of approximating any nonlinear function to an arbitrary degree of accuracy [12], [19], [28], [29]. Neural networks thus can provide a good approximation to the function . These arguments provide the basic motivation for the use of NARX networks to the nonlinear time series prediction....

323 | Phoneme recognition using time-delay neural networks - Waibel, Hanazawa, et al. - 1989
Citation Context: ... time series [9], and various artificial nonlinear systems [6], [44], [51]. When the output-memory order of NARX network is zero, a NARX network becomes a time delay neural network (TDNN) [32], [33], [63], which is simply a tapped delay line input into a MLP. In general, the TDNN implements a function of the form Tapped delay lines can be implementations of delay space embedding and can form the basis...

285 | Universal coding, information, prediction, and estimation - Rissanen - 1984
Citation Context: ... knowledge about the model of the underlying process, traditional statistical tests can be used, for example, Akaike information criterion (AIC) [1] and the minimum description length (MDL) principle [53]. Such models are judged on their “goodness-of-fit,” which is a function of the likelihood of the data given the hypothesized model and its associated degrees of freedom. Fogel [16] applied the modifi...

246 | Time Series Prediction: Forecasting the Future and Understanding the Past - Weigend, Gershenfeld - 1994
Citation Context: ...pturing the global behavior of NSAR neural networks, we also tested them on the laser data of the Santa Fe competition. The data set consists of laser intensity collected from a laboratory experiment [65]. Although deterministic, its behavior is chaotic as seen in Fig. 7. Fig. 8 shows the normalized autocorrelation function of the first 1000 training data points...

233 | The determination of the order of an autoregression - Hannan, Quinn - 1979
Citation Context: ... hypothesized model and its associated degrees of freedom. Fogel [16] applied the modification of AIC to select a “best” network. However, the AIC method is complex and can be troubled by imprecision [25], [55]. Such model complexity and regularization methods are readily used for nonlinear models such as neural networks; see, for example [24], [42], and [67]. Evolutionary programming [2], [18] is ano...

188 | Predicting the Future: Connectionist Approach - Weigend, Huberman, et al. - 1990
Citation Context: ...edding before discussing the results of time series prediction. In order to also optimize the architecture of the MLP of a NARX network or NSAR, several methods of weight elimination [5], [31], [47], [64], [66] can be incorporated into the training algorithm. In the following experiments, networks are trained using weight decay [31]. All experiments were trained using back-propagation through time (BP...

186 | Handwritten digit recognition with a back-propagation network - LeCun, Boser, et al. - 1992
Citation Context: ...y estimating the second-order derivative for each weight. The success of their algorithm had been implemented in identification of handwritten ZIP codes by pruning the weights of feedforward networks [10], [11]. II. NARX NEURAL NETWORK An important and useful class of discrete-time nonlinear systems is the NARX model [6], [34], [39], [57], [58] where and represent input and output of the model at time...

170 | Second order derivatives for network pruning: Optimal Brain Surgeon - Stork, Hassibi - 1993
Citation Context: ...rks by calculating the sensitivity of the error to each memory order after the network is trained by gradient-based learning algorithm. Several methods for sensitivity calculations have been proposed [26], [30], [43]; for details, see the survey paper by Reed [52]. Our method of calculating sensitivity is based on evaluating the second-order derivative of the cost function with respect to each memory ...
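The second-order sensitivity idea referenced here has the same flavor as optimal-brain-damage saliency. A minimal sketch, with weights and diagonal-Hessian entries invented for illustration (the paper itself evaluates sensitivity per memory order, not per individual weight):

```python
def saliencies(weights, hessian_diag):
    """OBD-style saliency s_i = h_ii * w_i^2 / 2: a second-order Taylor
    estimate, at a trained minimum, of how much the cost rises if
    parameter i is deleted. Small saliency means safe to prune."""
    return [h * w * w / 2.0 for w, h in zip(weights, hessian_diag)]

# Hypothetical per-tap weights and diagonal Hessian entries.
w = [0.8, 0.05, -1.2, 0.3]
h = [2.0, 1.5, 0.5, 0.1]
s = saliencies(w, h)
prune_order = sorted(range(len(s)), key=lambda i: s[i])
# Taps would be deleted in increasing order of saliency.
```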

170 | A time-delay neural network architecture for isolated word recognition - Lang, Waibel, et al. - 1990
Citation Context: ...ystems [61], time series [9], and various artificial nonlinear systems [6], [44], [51]. When the output-memory order of NARX network is zero, a NARX network becomes a time delay neural network (TDNN) [32], [33], [63], which is simply a tapped delay line input into a MLP. In general, the TDNN implements a function of the form Tapped delay lines can be implementations of delay space embedding and can fo...

169 | The Effective Number of Parameters: An Analysis of generalization and regularization in nonlinear learning systems - Moody - 1992
Citation Context: ...method is complex and can be troubled by imprecision [25], [55]. Such model complexity and regularization methods are readily used for nonlinear models such as neural networks; see, for example [24], [42], and [67]. Evolutionary programming [2], [18] is another search mechanism. This algorithm operates on a population of models. Offspring models are created by randomly mutating parents models. Competi...

165 | Nonlinear signal processing using neural networks: prediction and system modeling - Lapedes, Farber
Citation Context: ... [61], time series [9], and various artificial nonlinear systems [6], [44], [51]. When the output-memory order of NARX network is zero, a NARX network becomes a time delay neural network (TDNN) [32], [33], [63], which is simply a tapped delay line input into a MLP. In general, the TDNN implements a function of the form Tapped delay lines can be implementations of delay space embedding and can form the...

137 | Generalization by Weight-elimination with Application to Forecasting - Weigend, Huberman, et al. - 1991
Citation Context: ...complex and can be troubled by imprecision [25], [55]. Such model complexity and regularization methods are readily used for nonlinear models such as neural networks; see, for example [24], [42], and [67]. Evolutionary programming [2], [18] is another search mechanism. This algorithm operates on a population of models. Offspring models are created by randomly mutating parents models. Competition betwe...

133 | Introduction to linear regression analysis - Montgomery, Peck, et al. - 2012
Citation Context: ...e; on the other hand, it is also desired that the model includes as few regressors as possible because the variance of the model’s predictions increases along with the increasing number of regressors [41]. According to the embedding theorem [48], [54], [60], the memory orders need to be large enough in order to provide a sufficient embedding. The problem of choosing the proper memory architecture corr...

123 | Skeletonization: A Technique for Trimming the Fat from a Network via Relevance Assessment - Smolensky, Mozer - 1989
Citation Context: ...lating the sensitivity of the error to each memory order after the network is trained by gradient-based learning algorithm. Several methods for sensitivity calculations have been proposed [26], [30], [43]; for details, see the survey paper by Reed [52]. Our method of calculating sensitivity is based on evaluating the second-order derivative of the cost function with respect to each memory order [50]. ...

121 | Simplifying neural networks by soft weight-sharing - Nowlan, Hinton - 1992
Citation Context: ...ic embedding before discussing the results of time series prediction. In order to also optimize the architecture of the MLP of a NARX network or NSAR, several methods of weight elimination [5], [31], [47], [64], [66] can be incorporated into the training algorithm. In the following experiments, networks are trained using weight decay [31]. All experiments were trained using back-propagation through ti...

116 | Gradient-based learning algorithms for recurrent networks and their computational complexity - Williams, Zipser - 1995
Citation Context: ...6] can be incorporated into the training algorithm. In the following experiments, networks are trained using weight decay [31]. All experiments were trained using back-propagation through time (BPTT) [68]. A. Grammatical Inference: Learning a 512-State Finite Memory Machine NARX networks have been shown to be able to simulate and learn a class of finite state machines [8], [21] called, respectively, d...

96 | Networks and the best approximation property - Girosi, Poggio - 1990
Citation Context: ...e AIC method is complex and can be troubled by imprecision [25], [55]. Such model complexity and regularization methods are readily used for nonlinear models such as neural networks; see, for example [24], [42], and [67]. Evolutionary programming [2], [18] is another search mechanism. This algorithm operates on a population of models. Offspring models are created by randomly mutating parents models. C...

87 | A simple weight-decay can improve generalization - Krogh, Hertz - 1992
Citation Context: ...thm (the delay damage algorithm) to determine the optimal memory order of NARX and input time delay neural networks. This algorithm can also incorporate several useful heuristics, such as weight decay [31], which are used extensively in static networks to optimize the nonlinear function. (For a survey of pruning methods for feedforward neural networks, see [52].) The procedure of the algorithm starts w...
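The weight-decay heuristic mentioned in this context just adds a quadratic penalty to the training cost; a minimal sketch, with the function name and λ value chosen by us for illustration:

```python
def decayed_loss(errors, weights, lam=1e-3):
    """Sum-of-squares error plus a weight-decay penalty lam * sum(w^2),
    which pushes weights the data does not support toward zero and so
    complements pruning of whole memory taps."""
    sse = sum(e * e for e in errors)
    penalty = lam * sum(w * w for w in weights)
    return sse + penalty

# Larger lam trades fit quality for smaller weights.
loss = decayed_loss([0.1, -0.2], [0.5, -1.0, 0.25], lam=0.01)
```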

76 | Input-Output Parametric Models for Nonlinear Systems Part I: Deterministic Nonlinear Systems - Leontaritis, Billings - 1985
Citation Context: ...ion of handwritten ZIP codes by pruning the weights of feedforward networks [10], [11]. II. NARX NEURAL NETWORK An important and useful class of discrete-time nonlinear systems is the NARX model [6], [34], [39], [57], [58] where and represent input and output of the model at time , and are the input-memory and output-memory order, and the function is a nonlinear function. When the function can be appr...

69 | Predicting Sunspots and Exchange Rates with Connectionist Networks - Weigend, Huberman, et al. - 1992
Citation Context: ... before discussing the results of time series prediction. In order to also optimize the architecture of the MLP of a NARX network or NSAR, several methods of weight elimination [5], [31], [47], [64], [66] can be incorporated into the training algorithm. In the following experiments, networks are trained using weight decay [31]. All experiments were trained using back-propagation through time (BPTT) [6...

67 | A simple procedure for pruning back-propagation trained neural networks - Karnin - 1990
Citation Context: ... calculating the sensitivity of the error to each memory order after the network is trained by gradient-based learning algorithm. Several methods for sensitivity calculations have been proposed [26], [30], [43]; for details, see the survey paper by Reed [52]. Our method of calculating sensitivity is based on evaluating the second-order derivative of the cost function with respect to each memory order ...

65 | Non-linear system identification using neural networks - Chen, Billings, et al. - 1990
Citation Context: ...s, pruning, recurrent neural networks, tapped-delay lines, temporal sequences, time series. I. INTRODUCTION NONLINEAR autoregressive models with exogenous inputs (NARX) recurrent neural architectures [6], [44], as opposed to other recurrent neural models, have limited feedback architectures that come only from the output neuron instead of from hidden neurons. It has been shown that in theory, one can...

63 | Pruning algorithms—a survey - Reed - 1993
Citation Context: ...l useful heuristics, such as weight decay [31], which are used extensively in static networks to optimize the nonlinear function. (For a survey of pruning methods for feedforward neural networks, see [52].) The procedure of the algorithm starts with a NARX network with enough degrees of freedom in both input and output memory or taps and then deleting those memory orders with small sensitivity measure...
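The procedure this context outlines (start with generous memory orders, delete the taps with the smallest sensitivity, retrain, repeat) can be sketched as a loop. The `train` and `sensitivity` callables below are user-supplied stand-ins, not APIs from the paper, and the dummy versions in the usage example exist only to exercise the loop:

```python
def delay_damage(train, sensitivity, taps, n_rounds=3, drop_per_round=1):
    """Skeleton of the prune-and-retrain loop: fit the model on the
    surviving taps, rank taps by sensitivity, drop the least sensitive
    ones, and refit. `train(taps)` returns a fitted model and
    `sensitivity(model, taps)` one score per surviving tap."""
    taps = list(taps)
    for _ in range(n_rounds):
        model = train(taps)
        scores = sensitivity(model, taps)
        ranked = sorted(range(len(taps)), key=lambda i: scores[i])
        keep = sorted(ranked[drop_per_round:])   # preserve tap order
        taps = [taps[i] for i in keep]
    return train(taps), taps

# Dummy stand-ins: "training" just records the taps, and each tap's
# sensitivity equals its own value, so small-valued taps get pruned.
model, kept = delay_damage(lambda t: list(t), lambda m, t: list(t),
                           taps=[5, 1, 4, 2, 3])
```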

59 | On the Dimension of the Compact Invariant Sets of Certain Nonlinear Maps - Mañé - 1981
Citation Context: ...ion to reconstruct the states of the dynamical system. The purpose of time-delay embedding is to unfold the projection back to a multivariate state space that is representative of the original system [46], [49], [60]. It was shown that if the dynamical system and the observed quantity were generic, then the delay coordinate map from a - dimensional smooth compact manifold to -dimensional reconstructio...

49 | Capabilities of Three-Layered Perceptrons - Irie, Miyake - 1988
Citation Context: ... the mapping function . It has been shown that a feedforward neural network with enough neurons is capable of approximating any nonlinear function to an arbitrary degree of accuracy [12], [19], [28], [29]. Neural networks thus can provide a good approximation to the function . These arguments provide the basic motivation for the use of NARX networks to the nonlinear time series prediction. It is still...

46 | Learning long-term dependencies in NARX recurrent neural networks. Neural Networks - Lin, Horne, et al. - 1996
Citation Context: ...ating gradient information more efficiently when the networks are unfolded in time to backpropagate the error signal and thus reduce the network’s sensitivity to the problem of long-term dependencies [36], [38]. Recently, it has been shown Manuscript received June 17, 1997. The associate editor coordinating the review of this paper and approving it for publication was Prof. Yu-Hen Hu. T.-N. Lin was wi...

41 | Geometry from a time series - Packard, Crutchfield, et al. - 1980
Citation Context: ... reconstruct the states of the dynamical system. The purpose of time-delay embedding is to unfold the projection back to a multivariate state space that is representative of the original system [46], [49], [60]. It was shown that if the dynamical system and the observed quantity were generic, then the delay coordinate map from a - dimensional smooth compact manifold to -dimensional reconstruction spac...

36 | An experimental comparison of recurrent neural networks - Horne, Giles - 1995
Citation Context: ...have several advantages in practice. For example, it has been reported that gradient-descent learning can be more effective in NARX networks than in other recurrent architectures with “hidden states” [27]. Part of the reason can be attributed to the embedded memory of NARX networks. This embedded memory will appear as jump-ahead connections that provide shorter paths for propagating gradient informati...

32 | Coping with chaos - Ott, Sauer, et al. - 1994
Citation Context: ...that the model includes as few regressors as possible because the variance of the model’s predictions increases along with the increasing number of regressors [41]. According to the embedding theorem [48], [54], [60], the memory orders need to be large enough in order to provide a sufficient embedding. The problem of choosing the proper memory architecture corresponds to giving a good representation o...

31 | Computational Capabilities of Recurrent NARX Neural Networks - Siegelmann, Horne, et al. - 1997
Citation Context: ...rons. It has been shown that in theory, one can use NARX networks, rather than conventional recurrent networks, without any computational loss and that they are at least equivalent to Turing machines [56]. Not only are NARX neural networks computationally powerful in theory, but they have several advantages in practice. For example, it has been reported that gradient-descent learning can be more effec...

28 | Notes on the Simulation of Evolution - Atmar - 1994
Citation Context: ...imprecision [25], [55]. Such model complexity and regularization methods are readily used for nonlinear models such as neural networks; see, for example [24], [42], and [67]. Evolutionary programming [2], [18] is another search mechanism. This algorithm operates on a population of models. Offspring models are created by randomly mutating parents models. Competition between offspring models for surviv...

22 | An Information Criterion for Optimal Neural Network Selection - Fogel - 1991
Citation Context: ...th (MDL) principle [53]. Such models are judged on their “goodness-of-fit,” which is a function of the likelihood of the data given the hypothesized model and its associated degrees of freedom. Fogel [16] applied the modification of AIC to select a “best” network. However, the AIC method is complex and can be troubled by imprecision [25], [55]. Such model complexity and regularization methods are read...

20 | Learning a class of large finite state machines with a recurrent neural network. Neural Networks - Giles, Lawrence, et al. - 1995
Citation Context: ...y chosen from the complete set. The complete set, which consists of all strings of length from 1 to (ten in this case) are shown to be able to sufficiently identify a finite memory machine with depth [20]. The strings were encoded such that input values of 0’s and 1’s and target output labels “negative” and “positive” corresponded to floating-point values of 0.0 and 1.0, respectively. Initially, befor...

19 | Continuous-Time Temporal Back-Propagation with Adaptable Time Delays - Day, Davenport - 1993
Citation Context: ...proposed by Etter, it was used as an “adaptive delay filter,” which included variable delays taps as well as variable gains, for modeling several sparse systems [7], [15]. Recently, others [4], [13], [14], [35] have also extended neural networks to include adaptable time delays. Because the error function of the adaptable time delays depends on the autocorrelation function of input signals [7], [15], ...

19 | Adaptive estimation of time delays in sampled data systems - Etter, Stearns - 1981
Citation Context: ... gradient information. Originally proposed by Etter, it was used as an “adaptive delay filter,” which included variable delays taps as well as variable gains, for modeling several sparse systems [7], [15]. Recently, others [4], [13], [14], [35] have also extended neural networks to include adaptable time delays. Because the error function of the adaptable time delays depends on the autocorrelation fun...

18 | Learning long-term dependencies is not as difficult with NARX networks - Lin, Horne, et al. - 1996
Citation Context: ...gradient information more efficiently when the networks are unfolded in time to backpropagate the error signal and thus reduce the network’s sensitivity to the problem of long-term dependencies [36], [38]. Recently, it has been shown Manuscript received June 17, 1997. The associate editor coordinating the review of this paper and approving it for publication was Prof. Yu-Hen Hu. T.-N. Lin was with NEC...

16 | Pruning recurrent neural networks for improved generalization performance - Giles, Omlin - 1994
Citation Context: ...s retrained. Of course, this procedure can be iterated. This method should be contrasted with other recurrent neural network pruning procedures where recurrent nodes are pruned based on output values [23] and where second-order methods are used to prune input taps and single order feedback taps for fully recurrent neural networks [50]. The sensitive measure of each memory order is calculated by estima...

16 | ECG compression using long-term prediction - Nave, Cohen - 1993
Citation Context: ...d are the th coefficient and th order, respectively, and is the white noise innovation with zero mean. SAR models have demonstrated their long-term prediction capability in various applications [40], [45] and can easily be extended into nonlinear models. A nonlinear version of a SAR is the NSAR. 2 A primary problem associated with the nonlinear subset model is how to optimally select the subset orders...

16 | Evolutionary programming: an introduction and some current directions - Fogel - 1994

15 | The Tempo 2 algorithm: Adjusting time delays by supervised learning - Bodenhausen, Waibel - 1991
Citation Context: ...Originally proposed by Etter, it was used as an “adaptive delay filter,” which included variable delays taps as well as variable gains, for modeling several sparse systems [7], [15]. Recently, others [4], [13], [14], [35] have also extended neural networks to include adaptable time delays. Because the error function of the adaptable time delays depends on the autocorrelation function of input signals...

15 | Time-Delay Neural Networks: Representation and Induction of Finite State Machines - Clouse, Giles, et al. - 1997
Citation Context: ...pagation through time (BPTT) [68]. A. Grammatical Inference: Learning a 512-State Finite Memory Machine NARX networks have been shown to be able to simulate and learn a class of finite state machines [8], [21] called, respectively, definite and finite memory machines. When being trained on strings that are encoded as temporal sequences, NARX networks are able to “learn” rather large (hundreds to thou...

15 | Long-term predictions of chemical processes using recurrent neural networks: a parallel training approach - Su, McAvoy, et al.
Citation Context: ... ZIP codes by pruning the weights of feedforward networks [10], [11]. II. NARX NEURAL NETWORK An important and useful class of discrete-time nonlinear systems is the NARX model [6], [34], [39], [57], [58] where and represent input and output of the model at time , and are the input-memory and output-memory order, and the function is a nonlinear function. When the function can be approximated by a mult...

13 | Comparison of four neural net learning methods for dynamic system identification - Qin, Su, et al. - 1992
Citation Context: ...ng systems in a petroleum refinery [58], nonlinear oscillations associated with multilegged locomotion in biological systems [61], time series [9], and various artificial nonlinear systems [6], [44], [51]. When the output-memory order of NARX network is zero, a NARX network becomes a time delay neural network (TDNN) [32], [33], [63], which is simply a tapped delay line input into a MLP. In general, th...