## Introduction to Radial Basis Function Networks (1996)

Citations: 92 (3 self)

### BibTeX

```bibtex
@MISC{Orr96introductionto,
  author = {Mark J. L. Orr},
  title  = {Introduction to Radial Basis Function Networks},
  year   = {1996}
}
```


### Citations

4675 |
Matrix Analysis
- Horn, Johnson
- 1990
Citation Context: ...Squares. Forward selection is a relatively fast algorithm but it can be speeded up even further using a technique called orthogonal least squares [4]. This is a Gram-Schmidt orthogonalisation process [12] which ensures that each new column added to the design matrix of the growing subset is orthogonal to all previous columns. This simplifies the equation for the change in sum-squared-error and results...
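The Gram-Schmidt orthogonalisation described in this context can be sketched in a few lines. This is the classical (not the numerically preferred modified) variant, and `gram_schmidt_columns` is an illustrative name, not code from either cited work:

```python
import numpy as np

def gram_schmidt_columns(H):
    """Orthogonalise the columns of a design matrix H, left to right,
    so each new column is orthogonal to all previous ones, as in
    orthogonal least squares."""
    H = np.asarray(H, dtype=float)
    Q = np.empty_like(H)
    for j in range(H.shape[1]):
        q = H[:, j].copy()
        for k in range(j):
            # subtract the projection of column j onto each earlier
            # orthogonalised column
            q -= (Q[:, k] @ H[:, j]) / (Q[:, k] @ Q[:, k]) * Q[:, k]
        Q[:, j] = q
    return Q
```

Because later columns are orthogonal to earlier ones, adding a column to the growing subset no longer disturbs the contribution of the columns already selected, which is what simplifies the sum-squared-error update.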

2534 |
An Introduction to the Bootstrap
- Efron, Tibshirani
- 1993
Citation Context: ...parameters to zero and remembering that the projection matrix is idempotent (P^2 = P). For reviews of model selection see the two articles by Hocking [9, 10] and chapter 17 of Efron and Tibshirani's book [14]. 5.1 Cross-Validation. If data is not scarce then the set of available input-output measurements can be divided into two parts: one part for training and one part for testing. In this way several different...

2307 |
Estimating the dimension of a model
- Schwarz
- 1978
Citation Context: ...and also a special case of Akaike's information criterion is final prediction error (FPE),

$$\hat{\sigma}^2_{FPE} = \frac{1}{p}\left(\frac{p+\gamma}{p-\gamma}\right)\hat{y}^\top P^2 \hat{y} = \frac{\hat{y}^\top P^2 \hat{y} + 2\gamma\,\hat{\sigma}^2_{UEV}}{p}. \quad (5.4)$$

Schwarz's Bayesian information criterion (BIC) [28] is

$$\hat{\sigma}^2_{BIC} = \frac{1}{p}\left(\frac{p+(\ln(p)-1)\gamma}{p-\gamma}\right)\hat{y}^\top P^2 \hat{y} = \frac{\hat{y}^\top P^2 \hat{y} + \ln(p)\,\gamma\,\hat{\sigma}^2_{UEV}}{p}. \quad (5.5)$$

Generalised cross-validation can also be written in terms of $\gamma$ instead of $\mathrm{trace}(P)$, using the equation...

1774 |
Introduction to the Theory of Neural Computation
- Hertz, Palmer
- 1991
Citation Context: ...are often used in signal processing applications. Logistic functions, of the sort $h(x) = \frac{1}{1+\exp(b^\top x - b_0)}$, are popular in artificial neural networks, particularly in multi-layer perceptrons (MLPs) [8]. A familiar example, almost the simplest polynomial, is the straight line $f(x) = ax + b$, which is a linear model whose two basis functions are $h_1(x) = 1$, $h_2(x) = x$, and whose weights are $w_1 = b$ and...

824 |
Solution of Ill-posed Problems
- Tikhonov, Arsenin
- 1977
Citation Context: ...(or assumptions), and the mathematical technique Tikhonov developed for this is known as regularisation. Tikhonov's work only became widely known in the West after the publication in 1977 of his book [29]. Meanwhile, two American statisticians, Arthur Hoerl and Robert Kennard, published a paper in 1970 [11] on ridge regression, a method for solving badly conditioned linear regression problems. Bad con...

609 |
Neural networks and the bias/variance dilemma
- Geman, Bienenstock, et al.
- 1992
Citation Context: ...where the expectation (averaging) indicated by $\langle\cdot\rangle$ is taken over the training sets. This score, which tells us how good the average prediction is, can be broken down into two components [5], namely

$$MSE = (y(x) - \langle f(x)\rangle)^2 + \langle(f(x) - \langle f(x)\rangle)^2\rangle.$$

The first part is the bias and the second part is the variance. If $\langle f(x)\rangle = y(x)$ for all $x$ then the model is unbiased (the bias is zero). However, an u...
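The decomposition in this context can be checked numerically: over a sample of predictions at a single input, the mean squared error equals the squared bias plus the variance exactly. The numbers below are made up for illustration:

```python
import numpy as np

# Check MSE = bias^2 + variance at one input x: y_x is the true value,
# f_x simulates the model's prediction varying over training sets.
rng = np.random.default_rng(0)
y_x = 2.0                                        # true value y(x)
f_x = 2.5 + 0.3 * rng.standard_normal(100_000)   # predictions over training sets

mse = np.mean((y_x - f_x) ** 2)
bias_sq = (y_x - f_x.mean()) ** 2                # (y(x) - <f(x)>)^2
variance = np.mean((f_x - f_x.mean()) ** 2)      # <(f(x) - <f(x)>)^2>

assert abs(mse - (bias_sq + variance)) < 1e-9
```

The identity is algebraic, not statistical: the cross term vanishes because it contains the mean deviation of `f_x` from its own average.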

522 |
Bayesian interpolation
- MacKay
- 1992
Citation Context: ...ridge regression is used. Although there are still m weights in the model, what John Moody calls the effective number of parameters [18] (and David MacKay calls the number of good parameter measurements [17]) is less than m and depends on the size of the regularisation parameter(s). The simplest formula for this number, $\gamma$, is

$$\gamma = p - \mathrm{trace}(P), \quad (4.10)$$

which is consistent with both Moody's and MacKay's form...

495 |
Ridge regression: Biased estimation for nonorthogonal problems
- Hoerl, Kennard
- 1970
Citation Context: ...Tikhonov's work only became widely known in the West after the publication in 1977 of his book [29]. Meanwhile, two American statisticians, Arthur Hoerl and Robert Kennard, published a paper in 1970 [11] on ridge regression, a method for solving badly conditioned linear regression problems. Bad conditioning means numerical difficulties in performing the matrix inverse necessary to obtain the variance m...

434 |
Multivariable functional interpolation and adaptive networks
- Broomhead, Lowe
- 1988
Citation Context: ...ctions. In principle, they could be employed in any sort of model (linear or nonlinear) and any sort of network (single-layer or multi-layer). However, since Broomhead and Lowe's 1988 seminal paper [3], radial basis function networks (RBF networks) have traditionally been associated with radial functions in a single-layer network such as shown in figure 3. [Figure 3: schematic of a single-layer network with inputs x1...xn, basis functions h1(x)...hm(x), and output f(x).]

273 |
Generalized cross-validation as a method for choosing a good ridge parameter
- Golub, Heath, et al.
- 1979
Citation Context: ...alternative to training and testing p times. 5.2 Generalised Cross-Validation. The matrix diag(P) makes LOO slightly awkward to handle mathematically. Its cousin, generalised cross-validation (GCV) [6], is more convenient and is

$$\hat{\sigma}^2_{GCV} = \frac{p\,\hat{y}^\top P^2 \hat{y}}{(\mathrm{trace}(P))^2}. \quad (5.2)$$

The similarity with leave-one-out cross-validation (equation 5.1) is apparent. Just replace diag(P) in the equation for LOO wit...
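Equation 5.2 is straightforward to compute once the ridge projection matrix is formed. A minimal sketch, assuming the standard ridge form $P = I - H(H^\top H + \lambda I)^{-1}H^\top$ (the function name is illustrative):

```python
import numpy as np

def gcv_score(H, y_hat, lam):
    """Generalised cross-validation (equation 5.2):
    sigma^2_GCV = p * y^T P^2 y / trace(P)^2,
    with P the projection matrix of ridge regression."""
    p, m = H.shape
    A = H.T @ H + lam * np.eye(m)
    # solve(A, H.T) avoids forming an explicit inverse of A
    P = np.eye(p) - H @ np.linalg.solve(A, H.T)
    return p * (y_hat @ P @ P @ y_hat) / np.trace(P) ** 2
```

Scanning this score over a grid of `lam` values and keeping the minimiser is the usual way the criterion is used to pick a ridge parameter.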

249 |
Growing cell structures: a self-organizing network for unsupervised and supervised learning
- Fritzke
- 1994
Citation Context: ...7) for building networks. Most of the mathematical details are put in an appendix (section A). For alternative approaches see, for example, the work of Platt [24] and associates [21] and of Fritzke [15]. 2 Supervised Learning. A ubiquitous problem in statistics with applications in many areas is to guess or estimate a function from some example input-output pairs with little or no knowledge of the...

209 |
Some comments on Cp
- Mallows
- 1973
Citation Context: ...e unbiased estimate of variance (UEV), which we met in a previous section (4.4), is

$$\hat{\sigma}^2_{UEV} = \frac{\hat{y}^\top P^2 \hat{y}}{p-\gamma}, \quad (5.3)$$

where $\gamma$ is the effective number of parameters (equation 4.10). A version of Mallows's Cp [22] and also a special case of Akaike's information criterion is final prediction error (FPE),

$$\hat{\sigma}^2_{FPE} = \frac{1}{p}\left(\frac{p+\gamma}{p-\gamma}\right)\hat{y}^\top P^2 \hat{y} = \frac{\hat{y}^\top P^2 \hat{y} + 2\gamma\,\hat{\sigma}^2_{UEV}}{p}. \quad (5.4)$$

Schwarz's Bayesian information criterion (BIC) [...
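The three variance estimates in equations 5.3-5.5 share the same ingredients, so they can be computed together. A sketch assuming the ridge projection matrix $P = I - H(H^\top H + \lambda I)^{-1}H^\top$ and $\gamma = p - \mathrm{trace}(P)$; the function name is illustrative, not from the paper:

```python
import numpy as np

def selection_criteria(H, y_hat, lam):
    """Model-selection variance estimates: UEV (5.3), FPE (5.4), BIC (5.5),
    with gamma = p - trace(P) the effective number of parameters."""
    p, m = H.shape
    A = H.T @ H + lam * np.eye(m)
    P = np.eye(p) - H @ np.linalg.solve(A, H.T)
    gamma = p - np.trace(P)
    yPPy = y_hat @ P @ P @ y_hat          # y^T P^2 y
    uev = yPPy / (p - gamma)
    fpe = (yPPy + 2 * gamma * uev) / p
    bic = (yPPy + np.log(p) * gamma * uev) / p
    return uev, fpe, bic
```

Since each criterion inflates the same residual term by a different penalty on $\gamma$, FPE is never smaller than UEV, and BIC penalises complexity more heavily once $\ln(p) > 2$.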

169 |
The Effective Number of Parameters: An Analysis of Generalization and Regularization in Nonlinear Learning Systems
- Moody
- 1992
Citation Context: ...e of variance (equation 5.3). However, things are not as simple when ridge regression is used. Although there are still m weights in the model, what John Moody calls the effective number of parameters [18] (and David MacKay calls the number of good parameter measurements [17]) is less than m and depends on the size of the regularisation parameter(s). The simplest formula for this number, $\gamma$, is $\gamma = p - \mathrm{trace}(P)$...
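The formula $\gamma = p - \mathrm{trace}(P)$ from equation 4.10 is a one-liner once $P$ is available. A sketch under the same ridge-regression assumption as above (illustrative name):

```python
import numpy as np

def effective_parameters(H, lam):
    """Effective number of parameters, gamma = p - trace(P),
    with P = I - H (H^T H + lam I)^{-1} H^T.
    As lam -> 0 this tends to m; large lam shrinks it toward 0."""
    p, m = H.shape
    A = H.T @ H + lam * np.eye(m)
    P = np.eye(p) - H @ np.linalg.solve(A, H.T)
    return p - np.trace(P)
```

This makes concrete the claim in the context: with no regularisation the model spends all m weights, while increasing the regularisation parameter reduces the number it effectively uses.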

165 |
A resource-allocating network for function interpolation
- Platt
- 1991
Citation Context: ...y we look at forward selection (section 7) for building networks. Most of the mathematical details are put in an appendix (section A). For alternative approaches see, for example, the work of Platt [24] and associates [21] and of Fritzke [15]. 2 Supervised Learning. A ubiquitous problem in statistics with applications in many areas is to guess or estimate a function from some example input-output p...

132 |
The relationship between variable selection and data augmentation and a method for prediction
- Allen
- 1974
Citation Context: ...of this is to split the p patterns into a training set of size p - 1 and a test set of size 1 and average the squared error on the left-out pattern over the p possible ways of obtaining such a partition [1]. This is called leave-one-out (LOO) cross-validation. The advantage is that all the data can be used for training: none has to be held back in a separate test set. The beauty of LOO for linear model...
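The LOO procedure described here can be written directly as a loop over the p partitions. This is the brute-force version (for linear models there is a closed-form shortcut via diag(P), which the snippet is about to mention); `loo_mse` is an illustrative name using ordinary least squares:

```python
import numpy as np

def loo_mse(H, y_hat):
    """Leave-one-out cross-validation by brute force: fit a linear model
    on p-1 patterns, test on the held-out one, average the squared error
    over the p possible partitions."""
    p = H.shape[0]
    errs = []
    for i in range(p):
        keep = np.arange(p) != i          # leave pattern i out
        w, *_ = np.linalg.lstsq(H[keep], y_hat[keep], rcond=None)
        errs.append((y_hat[i] - H[i] @ w) ** 2)
    return np.mean(errs)
```

On data that an available model fits exactly, the LOO error is (numerically) zero, which is a convenient sanity check.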

99 |
Neural networks and statistical models
- Sarle
- 1994
Citation Context: ...which are just one particular type of linear model. However, the fashion for neural networks, which started in the mid-80's, has given rise to new names for concepts already familiar to statisticians [27]. Table 1 gives some examples. Such terms are used interchangeably in this document.

| statistics | neural networks |
|---|---|
| model | network |
| estimation | learning |
| regression | supervised learning |
| interpolation | generalis... |

92 |
The analysis and selection of variables in linear regression
- Hocking
Citation Context: ...obtained simply by setting all the regularisation parameters to zero and remembering that the projection matrix is idempotent (P^2 = P). For reviews of model selection see the two articles by Hocking [9, 10] and chapter 17 of Efron and Tibshirani's book [14]. 5.1 Cross-Validation. If data is not scarce then the set of available input-output measurements can be divided into two parts: one part for trainin...

81 |
Nonlinear Gated Experts for Time Series: Discovering Regimes and Avoiding Overfitting
- Weigend, Mangeas, et al.
- 1995
Citation Context: ...t. This is one of a number of complications which make time series prediction a more difficult problem than straight regression or classification. Others include regime switching and asynchronous sampling [13]. 3 Linear Models. A linear model for a function y(x) takes the form

$$f(x) = \sum_{j=1}^{m} w_j h_j(x). \quad (3.1)$$

The model f is expressed as a linear combination of a set of m fixed functions (often called basis fu...
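Equation 3.1 is just a weighted sum of fixed basis functions. A tiny illustration using the straight-line example from the Hertz context above ($h_1(x)=1$, $h_2(x)=x$, $w_1=b$, $w_2=a$); the values are made up:

```python
import numpy as np

# f(x) = sum_j w_j h_j(x): the basis functions are fixed, only the
# weights are fitted.  Here b = 1, a = 2, so f is the line 2x + 1.
basis = [lambda x: np.ones_like(x), lambda x: x]   # h_1, h_2
weights = [1.0, 2.0]                               # w_1 = b, w_2 = a

def f(x):
    return sum(w * h(x) for w, h in zip(weights, basis))
```

Because f is linear in the weights (not necessarily in x), fitting reduces to linear algebra even when the basis functions themselves are nonlinear, which is the point the paper builds on.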

71 |
Predicting multivariate responses in multiple linear regression
- Breiman, Friedman
- 1997
Citation Context: ...the special case of univariate output so, for simplicity, we will confine our attention to the latter. Note, however, that multiple outputs can be treated in a special way in order to reduce redundancy [2]. The training set, in which there are p pairs (indexed by i running from 1 up to p), is represented by

$$T = \{(x_i, \hat{y}_i)\}_{i=1}^{p}. \quad (2.1)$$

The reason for the hat over the letter y (ano...

63 |
A Function Estimation Approach to Sequential Learning with Neural Networks
- Kadirkamanathan, Niranjan
- 1993
Citation Context: Neural Computation, 5, pp. 954-975.

43 |
Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method
- Gu, Wahba
- 1991
Citation Context: ...odel selection criteria depend nonlinearly on $\lambda$, we need a method of nonlinear optimisation. We could use any of the standard techniques for this, such as the Newton method, and in fact that has been done [7]. Alternatively [20], we can exploit the fact that when the derivative of the GCV error prediction is set to zero, the resulting equation can be manipulated so that only $\hat{\lambda}$ appears on the left-hand sid...

31 |
Regularisation in the Selection of Radial Basis Function Centers
- Orr
- 1995
Citation Context: ...le, the best $\lambda$) by picking out the one with the lowest predicted error. Two words of warning: they don't always work as effectively as in this example, and UEV is inferior to GCV as a selection criterion [20] (and probably to the others as well). 6 Ridge Regression. Around the middle of the 20th century the Russian theoretician Andrei Tikhonov was working on the solution of ill-posed problems. These...

23 |
Applied regression analysis
- Rawlings
- 1988
Citation Context: ...all zero except for the regularisation parameters along its diagonal, and $\hat{y} = [\hat{y}_1\ \hat{y}_2\ \ldots\ \hat{y}_p]^\top$ is the vector of training set (equation 2.1) outputs. The solution is the so-called normal equation [26],

$$\hat{w} = A^{-1} H^\top \hat{y}, \quad (4.5)$$

and $\hat{w} = [\hat{w}_1\ \hat{w}_2\ \ldots\ \hat{w}_m]^\top$ is the vector of weights which minimises the cost function (equation 4.2). The use of these equations is illustrated with a simple example (sec...
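Equation 4.5 maps directly onto a linear solve. A sketch assuming a single shared regularisation parameter, so $A = H^\top H + \lambda I$ (the function name is illustrative):

```python
import numpy as np

def ridge_weights(H, y_hat, lam):
    """Solve the normal equation (4.5): w = A^{-1} H^T y,
    with A = H^T H + lam * I.  np.linalg.solve is used rather
    than forming the inverse of A explicitly."""
    m = H.shape[1]
    A = H.T @ H + lam * np.eye(m)
    return np.linalg.solve(A, H.T @ y_hat)
```

With `lam = 0` this reduces to ordinary least squares and recovers the generating weights exactly on noise-free data.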

22 |
Orthogonal least squares learning for radial basis function networks
- Chen, Cowan, et al.
- 1991
Citation Context: ...section 7.4 for an illustration. 7.1 Orthogonal Least Squares. Forward selection is a relatively fast algorithm but it can be speeded up even further using a technique called orthogonal least squares [4]. This is a Gram-Schmidt orthogonalisation process [12] which ensures that each new column added to the design matrix of the growing subset is orthogonal to all previous columns. This simplifies the eq...

13 |
Local smoothing of radial basis function networks
- Orr
- 1995
Citation Context: ...j-th column of the design matrix (equation 4.3). In contrast to the case of standard ridge regression (section 6.2), there is an analytic solution for the optimal value of $\lambda_j$ based on GCV minimisation [19]: no re-estimation is necessary (see appendix A.11). The trouble is that there are m - 1 other parameters to optimise, and each time one $\lambda_j$ is optimised it changes the optimal value of each of the o...

11 |
Developments in linear regression methodology
- Hocking
- 1983
Citation Context: ...obtained simply by setting all the regularisation parameters to zero and remembering that the projection matrix is idempotent (P^2 = P). For reviews of model selection see the two articles by Hocking [9, 10] and chapter 17 of Efron and Tibshirani's book [14]. 5.1 Cross-Validation. If data is not scarce then the set of available input-output measurements can be divided into two parts: one part for trainin...

10 |
Improving the generalisation properties of radial basis function neural networks
- Bishop
- 1991
Citation Context: ...ally convenient, and consequently other forms of regularisation are rather ignored here. If the reader is interested in higher-order regularisation I suggest looking at [25] for a general overview and [16] for a specific example (second-order regularisation in RBF networks). We next describe ridge regression from the perspective of bias and variance (section 6.1) and how it affects the equations for the...
