## Constructive Algorithms for Structure Learning in Feedforward Neural Networks for Regression Problems (1997)

Venue: IEEE Transactions on Neural Networks

Citations: 74 (2 self)

### BibTeX

```bibtex
@ARTICLE{Kwok97constructivealgorithms,
  author  = {Tin-Yau Kwok and Dit-Yan Yeung},
  title   = {Constructive Algorithms for Structure Learning in Feedforward
             Neural Networks for Regression Problems},
  journal = {IEEE Transactions on Neural Networks},
  year    = {1997},
  volume  = {8},
  pages   = {630--645}
}
```

### Abstract

In this survey paper, we review the constructive algorithms for structure learning in feedforward neural networks for regression problems. The basic idea is to start with a small network, then add hidden units and weights incrementally until a satisfactory solution is found. By formulating the whole problem as a state space search, we first describe the general issues in constructive algorithms, with special emphasis on the search strategy. A taxonomy, based on the differences in the state transition mapping, the training algorithm and the network architecture, is then presented.

Keywords: constructive algorithm, structure learning, state space search, dynamic node creation, projection pursuit regression, cascade-correlation, resource-allocating network, group method of data handling.

I. Introduction, A. Problems with Fixed Size Networks: In recent years, many neural network models have been proposed for pattern classification, function approximation and regression problems. Among...

### Citations

4120 | Pattern Classification and Scene Analysis
- Duda, Hart
- 1973

Citation context: ...f the training process, which is especially significant for constructive algorithms, as mentioned in Section III. Memorization has been used in methods like k-nearest neighbors and the Parzen windows [126]. However, these methods tend to produce networks that potentially grow linearly in the training set size, thus demanding large space and long time in computing network output. Algorithms in this sect...

3127 | An Introduction to the Bootstrap
- Efron, Tibshirani
- 1994
Citation context: ...original question of estimating err(f_{n,N}). There are a number of approaches. For example, the estimate may be based on a separate validation set [35], cross-validation [36], [37], or bootstrapping [38], [39]. Alternatively, it may be obtained from a number of information criteria like Akaike's information criterion (AIC) [40], Bayesian information criterion (BIC) [41], final prediction error (FPE)...

2290 | A new look at statistical model identification
- Akaike
- 1974

Citation context: ...parate validation set [35], cross-validation [36], [37], or bootstrapping [38], [39]. Alternatively, it may be obtained from a number of information criteria like Akaike's information criterion (AIC) [40], Bayesian information criterion (BIC) [41], final prediction error (FPE) [42], generalized cross-validation (GCV) [43], predicted square error (PSE) [44], minimum description length (MDL) [45], and ge...
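The information criteria named in this context trade training fit against model size. As a rough illustration (not taken from the paper; the function name and the Gaussian-residual likelihood form are assumptions), AIC and BIC for a regression model can be computed from the residual sum of squares:

```python
import math

def aic_bic(rss, n, k):
    """AIC and BIC for a regression model with Gaussian residuals.

    rss: residual sum of squares on the training set
    n:   number of training patterns
    k:   number of free parameters (e.g., network weights)
    """
    # Maximized Gaussian log-likelihood, including the usual constants.
    log_lik = -0.5 * n * (math.log(2 * math.pi * rss / n) + 1)
    aic = 2 * k - 2 * log_lik            # Akaike's information criterion
    bic = k * math.log(n) - 2 * log_lik  # Bayesian information criterion
    return aic, bic
```

For n > e^2 (about 7.4) training patterns, BIC penalizes each extra parameter more heavily than AIC, so it tends to favor smaller networks.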

1748 | A theory of the learnable
- Valiant
- 1984
Citation context: ...nce, in [11], one of the possibilities suggested to get around the intractability of this loading problem is to alter the network architecture as learning proceeds. Using Valiant's learning framework [16], Baum [17] presented an existence proof showing that if the learning algorithm is allowed to add hidden units and weights to the network, it can solve in polynomial time any learning problem that can...

1235 | Modeling by shortest data description
- Rissanen
- 1978

Citation context: ...criterion (AIC) [40], Bayesian information criterion (BIC) [41], final prediction error (FPE) [42], generalized cross-validation (GCV) [43], predicted square error (PSE) [44], minimum description length (MDL) [45], and generalized prediction error (GPE) [46]. By formulating neural network training in a Bayesian framework, evidence may also be used [5], [10]. However, these methods are not completely satisfacto...

910 | Approximation by Superposition of a Sigmoidal Function
- Cybenko
- 1989

Citation context: ...requirement of universal approximation is also a fundamental concern for neural networks of fixed architecture and pruning algorithms. It is affirmative for multi-layer perceptrons (see, for example, [53]--[56]) and RBF networks [57], [58], [59]. However, this point is still emphasized here because, as we will see in Section IV, there are algorithms that construct architectures which lack this importa...

804 | Cross-validatory choice and assessment of statistical predictions
- Stone
- 1974

Citation context: ...one must return to the original question of estimating err(f_{n,N}). There are a number of approaches. For example, the estimate may be based on a separate validation set [35], cross-validation [36], [37], or bootstrapping [38], [39]. Alternatively, it may be obtained from a number of information criteria like Akaike's information criterion (AIC) [40], Bayesian information criterion (BIC) [41], final...

642 | Bayesian Learning for Neural Networks - Neal - 1996

639 | Neural networks and the bias-variance dilemma
- Geman, Bienenstock, et al.
- 1992

Citation context: ...moment of the Fourier magnitude distribution of f. The first term in (3) comes from the approximation error while the second term comes from the estimation error. Thus, as has also been discussed in [34], when the number of hidden units increases, bias falls but variance increases. Hence, for good generalization performance, it is important to have a proper tradeoff by stopping network growth appropr...

574 | Bayesian interpolation
- MacKay
- 1992

Citation context: ...hers [5]--[8] have incorporated Bayesian methods into neural network learning. Regularization can then be accomplished by using appropriate priors that favor small network weights (such as the normal [6] or Laplace [9] distribution), and the regularization parameter can be automatically set. This approach is promising, though the relationship between generalization performance and the Bayesian eviden...

477 | Smoothing noisy data with spline functions
- Craven, Wahba
- 1979

Citation context: ...d from a number of information criteria like Akaike's information criterion (AIC) [40], Bayesian information criterion (BIC) [41], final prediction error (FPE) [42], generalized cross-validation (GCV) [43], predicted square error (PSE) [44], minimum description length (MDL) [45], and generalized prediction error (GPE) [46]. By formulating neural network training in a Bayesian framework, evidence may al...

443 | Optimal brain damage
- LeCun, Denker, et al.
- 1990
Citation context: ...e algorithms, will be presented in Section IV. The last section will be a discussion with some concluding remarks. Typically, approximation involves using up to the first [19], [20] or second [21], [22] term in the Taylor series expansion for the change in error. Further approximation is possible by computing these values as weighted averages during the course of learning or by assuming that the Hes...
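The first- and second-order approximations this footnote refers to can be written out explicitly. Expanding the change in training error E around a trained weight vector (a standard sketch, not quoted from the paper):

```latex
\delta E \approx \sum_i g_i\,\delta w_i
  + \tfrac{1}{2}\sum_i h_{ii}\,\delta w_i^2
  + \tfrac{1}{2}\sum_{i\neq j} h_{ij}\,\delta w_i\,\delta w_j,
\qquad
g_i = \frac{\partial E}{\partial w_i},\quad
h_{ij} = \frac{\partial^2 E}{\partial w_i\,\partial w_j}.
```

At a local minimum the gradient terms g_i vanish; optimal brain damage [22] further assumes a diagonal Hessian, giving the saliency s_i = h_{ii} w_i^2 / 2 for removing weight w_i, while optimal brain surgeon [21] retains the off-diagonal Hessian terms.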

436 | Projection pursuit regression
- Friedman, Stuetzle
- 1981
Citation context: ...much residual error as possible, and is then installed permanently into the network. In general, such a greedy approach may not result in an optimal set of weights for the whole network. Back-fitting [81] may be used for fine adjustment. This amounts to cyclically adjusting the weights associated with each previously installed hidden unit, while keeping the parameters (weights and other parameters def...
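The cyclic adjustment described here can be illustrated with a toy additive model in which each "unit" contributes a constant, and each pass refits one unit to the residual left by the others. This is a sketch of the cyclic idea only; the function name is made up, and real back-fitting refits hidden-unit weights rather than constants:

```python
def backfit_constants(y, m, passes=5):
    """Cyclic back-fitting on targets y with m constant 'units'."""
    c = [0.0] * m
    for _ in range(passes):
        for j in range(m):
            # Residual of the additive fit with unit j held out.
            others = sum(c) - c[j]
            resid = [yi - others for yi in y]
            c[j] = sum(resid) / len(resid)  # refit unit j to the residual
    return c
```

After a few passes the units jointly account for the mean of the targets, which is the optimum for this toy model.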

426 | A practical Bayesian framework for backpropagation networks
- MacKay
- 1992
Citation context: ...antee to load any given training set in polynomial time. However, this problem may not be that severe in practice. As mentioned in [15], in applications of neural networks, we... As an example, MacKay [10] reported that the Gaussian approximation in the evidence framework seemed to break down significantly for N/k < 3 ± 1, where N is the number of training patterns and k is the number of weights i...

393 | Universal approximation bounds for superposition of a sigmoidal function
- Barron
- 1993

Citation context: ...rsal approximation capability is a prerequisite for the convergence property. The convergence issue also concerns pruning algorithms. This issue has only been studied for some constructive algorithms [62], [63]. Discussion on the convergence properties of individual constructive algorithms will be postponed to Section IV. Also, note that the norm used in the convergence definition must be based on the...

390 | Estimating the Dimension of a Model
- Schwarz
- 1978

Citation context: ...n [36], [37], or bootstrapping [38], [39]. Alternatively, it may be obtained from a number of information criteria like Akaike's information criterion (AIC) [40], Bayesian information criterion (BIC) [41], final prediction error (FPE) [42], generalized cross-validation (GCV) [43], predicted square error (PSE) [44], minimum description length (MDL) [45], and generalized prediction error (GPE) [46]. By f...

295 | Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter
- Golub, Heath, et al.
- 1979

Citation context: ...rror term and the penalty term. Early attempts either set this regularization parameter manually [1], [2] or by ad hoc procedures [3]. A more disciplined and long respected method is cross-validation [4]. However, this approach is usually very slow for nonlinear models like neural networks, because a large number of nonlinear optimization problems must be repeated. Recently, several researchers [5]--...
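Cross-validation, the "long respected method" cited here, repeats training k times on complementary subsets of the data. A minimal index-splitting sketch (the helper name is ours, not from any cited work):

```python
def k_fold_splits(n, k):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    fold = [i % k for i in range(n)]  # assign each pattern to one of k folds
    for j in range(k):
        train = [i for i in range(n) if fold[i] != j]
        val = [i for i in range(n) if fold[i] == j]
        yield train, val
```

The slowness noted in the context follows directly: for a neural network, each of the k folds requires its own full nonlinear optimization run.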

270 | Growing cell structures: a self-organizing network for unsupervised and supervised learning
- Fritzke
- 1994

Citation context: ...n that will get averaged out. This effectively makes it impossible to appropriately model functions that vary on a finer scale. These problems are alleviated in the supervised growing cell structures [123]. Here, error information is accumulated in counters associated with each hidden unit. A new unit is inserted in regions with high local error only after presentation of a number of training patterns...

267 | Faster-learning variations on back-propagation: an empirical study
- Fahlman
- 1988

Citation context: ...f great value for very large problems. Similarly, simple gradient descent methods, though they only require O(k) space and time for each iteration, are notoriously slow when the network size is large [79]. It has been argued in [80] that these scale-up problems are less important in constructive algorithms, because they always start with small networks. This may be true in simple problems. But in comp...

261 | Exploratory projection pursuit
- Friedman
- 1987
Citation context: ...Other parametric forms may also be used in place of the Hermite functions in (6), such as functions mentioned in [100], the normalized Legendre polynomial expansion in exploratory projection pursuit [101], basis function expansion in [89], B-splines in multi-dimensional additive spline approximation [90], sigmoidal networks in connectionist projection pursuit regression [96], RBF networks in [94], or...

237 | Designing neural networks using genetic algorithms
- Miller, Todd, et al.
- 1989

Citation context: ...[67], [68]. By viewing the search for the optimal architecture as searching a surface defined by levels of trained network performance above the space of possible network architectures, Miller et al. [69] mentioned that such a surface is typically: infinitely large, as the number of possible hidden units and connections is unbounded; non-differentiable, since changes in the number of units or c...

232 | Approximation capabilities of multilayer feedforward networks - Hornik - 1991

207 | Training a 3-node neural network is NP-complete
- Blum, Rivest
- 1992

Citation context: ...lem of determining an appropriate initial network for use in regularization. Another motivation for this type of algorithm is related to the time complexity of learning. Judd [11] and others (such as [12], [13], [14]) showed that the loading problem is in general NP-complete. The loading problem is phrased as follows [11]: Input: A network architecture and a training set. Output: Determination of t...

206 | Pruning algorithms - a survey
- Reed
- 1993

Citation context: ...training it until an acceptable solution is found. After this, some hidden units or weights are removed if they are no longer actively used. Methods using this approach are called pruning algorithms [18]. The other approach, which corresponds to constructive algorithms, attempts to search for a good network in the other direction. These methods start with a small network and then add additional hidde...

181 | Second order derivatives for network pruning: Optimal brain surgeon
- Hassibi, Stork
- 1993
Citation context: ...ntative algorithms, will be presented in Section IV. The last section will be a discussion with some concluding remarks. Typically, approximation involves using up to the first [19], [20] or second [21], [22] term in the Taylor series expansion for the change in error. Further approximation is possible by computing these values as weighted averages during the course of learning or by assuming that t...

179 | Universal Approximation Using Radial-Basis-Function Networks
- Park, Sandberg
- 1991

Citation context: ...ion is also a fundamental concern for neural networks of fixed architecture and pruning algorithms. It is affirmative for multi-layer perceptrons (see, for example, [53]--[56]) and RBF networks [57], [58], [59]. However, this point is still emphasized here because, as we will see in Section IV, there are algorithms that construct architectures which lack this important property. The second requirement...

177 | A resource-allocating network for function interpolation
- Platt
- 1991
Citation context: ...However, for the criterion function in [65], the convergence property is not known. D. Resource-Allocating Network: Similar to the algorithms in Sections IV-A and IV-B, algorithms in this class [122]--[125] also add hidden units to the same layer one at a time. However, the major difference is that memorization of training patterns is used to reduce the computational requirement of the training process...
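The memorization-based growth rule of resource-allocating networks [125] can be sketched as a two-part novelty test: a pattern is memorized as a new unit only if its prediction error is large and it lies far from every stored center. The thresholds, the 1-D distance, and the predict stub below are all illustrative assumptions, not the paper's parameters:

```python
def ran_step(centers, x, y, predict, dist_thresh=0.5, err_thresh=0.1):
    """One presentation of pattern (x, y) to a RAN-style learner.

    centers: list of stored (input, target) pairs, one per hidden unit.
    predict: the current network's output function.
    Returns True if a new unit was allocated for this pattern.
    """
    err = abs(y - predict(x))
    nearest = min((abs(x - c) for c, _ in centers), default=float("inf"))
    if err > err_thresh and nearest > dist_thresh:
        centers.append((x, y))  # memorize the pattern as a new RBF unit
        return True
    # Otherwise a real RAN would instead adjust existing parameters (e.g. by LMS).
    return False
```

Because units are allocated per novel pattern, the network can grow with the training set size, which is exactly the space/time concern raised in the context above.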

173 | The upstart algorithm: A method for constructing and training feedforward neural networks - Frean - 1990

169 | Introductory real analysis
- Kolmogorov, Fomin
- 1975

Citation context: ...hich lack this important property. The second requirement is that the network sequence produced by the algorithm should converge to the target function. A sequence {f_n} is said to converge (strongly) [60] to f if lim_{n→∞} ||f - f_n|| = 0. Note that ||f - f_n|| is closely related to the approximation error defined in Section II-C but with a subtle difference. In Section II-C, f_n refers to the net...

168 | Evolutionary artificial neural networks
- Yao
- 1995
Citation context: ...his paper, and so we just illustrate the ideas by giving a few examples. One example of constructive algorithms that keep a pool of networks is the genetic algorithm based evolutionary approach [67], [68]. By viewing the search for the optimal architecture as searching a surface defined by levels of trained network performance above the space of possible network architectures, Miller et al. [69] menti...

166 | Neural networks: a review from a statistical perspective - Cheng, Titterington - 1994

154 | Probable Networks and Plausible Predictions - A Review of Practical Bayesian Methods for Supervised Neural Networks
- MacKay
- 1986

Citation context: ...ng algorithms, should not be treated as independent rivals. There are algorithms combining constructive and pruning algorithms [36], [85], [86], [91], [130], combining pruning and regularization [8], [140], combining regularization and constructive algorithms [104], and even combining all three [117]. Some of these have been mentioned in previous sections. Thus, these approaches are complementary to ea...

146 | Neural networks and related methods for classification - Ripley - 1994

144 | Generalisation by weight elimination with application to forecasting
- Rumelhart, Huberman
- 1991

Citation context: ...essary to match the complexity of the model to the problem being solved. Algorithms that can find an appropriate network architecture automatically are thus highly desirable. Regularization [1], [2], [3] is sometimes used to alleviate this problem. It encourages smoother network mappings by adding a penalty term to the error term being minimized. However, it cannot alter the network topology, which m...

144 | A simple lemma on greedy approximation in Hilbert space and convergence rate for projection pursuit regression
- Jones
- 1992

Citation context: ...pproximation capability is a prerequisite for the convergence property. The convergence issue also concerns pruning algorithms. This issue has only been studied for some constructive algorithms [62], [63]. Discussion on the convergence properties of individual constructive algorithms will be postponed to Section IV. Also, note that the norm used in the convergence definition must be based on the whole...

142 | First- and second-order methods for learning: between steepest descent and Newton's method
- Battiti
- 1992

Citation context: ...on algorithms used. Optimization routines that are applicable to neural networks with fixed architecture should be equally applicable to constructive algorithms. Interested readers may see reviews in [71]--[74]. Note that there is also a convergence issue in training algorithms, namely, when to stop the training (optimization). Readers should not confuse this with the convergence issue of constructive...

140 | Statistical Predictor Identification
- Akaike
- 1970

Citation context: ...[39]. Alternatively, it may be obtained from a number of information criteria like Akaike's information criterion (AIC) [40], Bayesian information criterion (BIC) [41], final prediction error (FPE) [42], generalized cross-validation (GCV) [43], predicted square error (PSE) [44], minimum description length (MDL) [45], and generalized prediction error (GPE) [46]. By formulating neural network training...

129 | Learning with Localized Receptive Fields
- Moody, Darken
- 1988

Citation context: ...y form smeared hyperplanes in the input space, and radial basis function (RBF) units can only form local bumps. While sigmoid units and RBF units give rise to networks with different properties [50], [51], [52], the limited flexibility of these simple hidden units may be problematic when they are added one at a time in a greedy manner (details in Section III-B). On the other hand, there are algorithms...

127 | Bayesian back-propagation
- Buntine, Weigend
- 1991

Citation context: ...n [4]. However, this approach is usually very slow for nonlinear models like neural networks, because a large number of nonlinear optimization problems must be repeated. Recently, several researchers [5]--[8] have incorporated Bayesian methods into neural network learning. Regularization can then be accomplished by using appropriate priors that favor small network weights (such as the normal [6] or L...

126 | Skeletonization: a technique for trimming the fat from a network via relevance assessment
- Mozer, Smolensky
- 1991

Citation context: ...of the representative algorithms, will be presented in Section IV. The last section will be a discussion with some concluding remarks. Typically, approximation involves using up to the first [19], [20] or second [21], [22] term in the Taylor series expansion for the change in error. Further approximation is possible by computing these values as weighted averages during the course of learning or by...

123 | Dynamic Node Creation in Backpropagation Networks
- Ash
- 1988

Citation context: ...ategories will be discussed in the following sections, which are named after their representative algorithms. A. Dynamic Node Creation: Constructive algorithms in this category [36], [64], [77], [80], [84]--[88] are variants of the dynamic node creation (DNC) network proposed by Ash [84]. Here, the state transition mapping is single-valued. Sigmoid hidden units are added one at a time, and are always a...
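The DNC control loop adds one sigmoid hidden unit whenever training stalls above the target error, then retrains the whole network. The sketch below replaces actual backpropagation with a stub whose error shrinks with network size; everything except the grow-and-retrain loop is a placeholder, not Ash's algorithm verbatim:

```python
def dynamic_node_creation(train, target_error=0.1, max_hidden=20):
    """Grow hidden units one at a time until the error target is met."""
    n_hidden = 1
    error = train(n_hidden)
    while error > target_error and n_hidden < max_hidden:
        n_hidden += 1            # add one sigmoid hidden unit
        error = train(n_hidden)  # retrain the entire network
    return n_hidden, error

def stub_train(n_hidden):
    # Placeholder for full-network training; error decays with size.
    return 1.0 / (n_hidden + 1)
```

Because the whole network is retrained after each addition, the per-step cost grows with the current network size, which is why the state transition mapping being single-valued (one candidate successor per state) matters for efficiency.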

117 | Approximation and estimation bounds for artificial neural networks
- Barron
- 1994

Citation context: ...ing as more hidden units are added?" The answer is no, because of a bias-variance tradeoff. The error in (2) comes from two sources, the approximation error (bias) and the estimation error (variance) [33]. Approximation error, ||f - f_n||^2, refers to the distance between the target function, f, and the closest neural network function, f_n, of a given architecture. Estimation error, E||f_n - f...
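The decomposition described in this context can be stated compactly. With f the target function, f_n the best network of the given architecture, and f̂_{n,N} the network actually estimated from N training patterns (notation following the context; the two-term split is the standard sketch, stated up to constants):

```latex
\underbrace{\mathrm{E}\,\|f - \hat f_{n,N}\|^{2}}_{\text{generalization error}}
\;\lesssim\;
\underbrace{\|f - f_{n}\|^{2}}_{\text{approximation error (bias)}}
\;+\;
\underbrace{\mathrm{E}\,\|f_{n} - \hat f_{n,N}\|^{2}}_{\text{estimation error (variance)}}
```

As the number of hidden units n grows, the first term falls while the second rises, which is why network growth must be stopped at an appropriate point.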

111 | Neural networks and statistical models
- Sarle
- 1994
Citation context: ...hin statistics, and have been practised by statisticians for decades. The close relationship between various statistical methodologies and neural network models has been discussed widely [23], [134]--[139], and there is still ongoing research to see how to borrow strength from each other. Finally, note that the various approaches to the control of network complexity, namely, regularization, and constru...

94 | Layered neural networks with Gaussian hidden units as universal approximations
- Hartman, Keeler, et al.
- 1990

Citation context: ...oximation is also a fundamental concern for neural networks of fixed architecture and pruning algorithms. It is affirmative for multi-layer perceptrons (see, for example, [53]--[56]) and RBF networks [57], [58], [59]. However, this point is still emphasized here because, as we will see in Section IV, there are algorithms that construct architectures which lack this important property. The second requi...

89 | Bayesian regularization and pruning using a Laplace prior
- Williams
- 1995

Citation context: ...ave incorporated Bayesian methods into neural network learning. Regularization can then be accomplished by using appropriate priors that favor small network weights (such as the normal [6] or Laplace [9] distribution), and the regularization parameter can be automatically set. This approach is promising, though the relationship between generalization performance and the Bayesian evidence deserves fur...

89 | Back-propagation algorithm which varies the number of hidden units
- Hirose, Yamashita, et al.
- 1991

Citation context: ...omplexity, namely, regularization, and constructive and pruning algorithms, should not be treated as independent rivals. There are algorithms combining constructive and pruning algorithms [36], [85], [86], [91], [130], combining pruning and regularization [8], [140], combining regularization and constructive algorithms [104], and even combining all three [117]. Some of these have been mentioned in pre...

77 | Prediction risk and architecture selection for neural networks
- Moody
- 1993

Citation context: ...one must return to the original question of estimating err(f_{n,N}). There are a number of approaches. For example, the estimate may be based on a separate validation set [35], cross-validation [36], [37], or bootstrapping [38], [39]. Alternatively, it may be obtained from a number of information criteria like Akaike's information criterion (AIC) [40], Bayesian information criterion (BIC) [41],...

74 | The Recurrent Cascade-Correlation Architecture
- Fahlman
- 1991
Citation context: ...at any time, only one layer of weights is optimized while all other weights are kept fixed. This is similar to the univariant search method [75]. This decomposition may also lead to faster training [83], though theoretical justifications are still lacking. Examples of algorithms using this approach will be discussed in Sections IV-B and IV-C. IV. A Taxonomy of Constructive Algorithms: Figure 3 gives...

73 | A simple procedure for pruning back-propagation trained neural networks
- Karnin
- 1990

Citation context: ...ssions of the representative algorithms, will be presented in Section IV. The last section will be a discussion with some concluding remarks. Typically, approximation involves using up to the first [19], [20] or second [21], [22] term in the Taylor series expansion for the change in error. Further approximation is possible by computing these values as weighted averages during the course of learning...

71 | An overview of predictive learning and function approximation
- Friedman
- 1994

Citation context: ...s by simply treating the approximation of each target function as a different (unrelated) regression problem. Other methods that utilize the relationship among these target functions are discussed in [23]. II. Viewing the Problem as a State Space Search: In this section, we formulate the problem of constructing a neural network for regres...