## Prediction risk and architecture selection for neural networks (1994)

Citations: 73 (2 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Moody94predictionrisk,
  author    = {John Moody},
  title     = {Prediction risk and architecture selection for neural networks},
  booktitle = {},
  year      = {1994},
  pages     = {147--165},
  publisher = {Springer-Verlag}
}
```

### Abstract

We describe two important sets of tools for neural network modeling: prediction risk estimation and network architecture selection. Prediction risk is defined as the expected performance of an estimator in predicting new observations. Estimated prediction risk can be used both for estimating the quality of model predictions and for model selection. Prediction risk estimation and model selection are especially important for problems with limited data. Techniques for estimating prediction risk include data resampling algorithms such as nonlinear cross-validation (NCV) and algebraic formulae such as the predicted squared error (PSE) and generalized prediction error (GPE). We show that exhaustive search over the space of network architectures is computationally infeasible even for networks of modest size. This motivates the use of heuristic strategies that dramatically reduce the search complexity. These strategies employ directed search algorithms, such as selecting the number of nodes via sequential network construction (SNC), and pruning inputs and weights via sensitivity-based pruning (SBP) and optimal brain damage (OBD), respectively.
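The abstract defines prediction risk as the expected error of an estimator on new observations and names cross-validation as one way to estimate it. A minimal sketch of the general idea, using K-fold cross-validation on a generic fit/predict pair (function names are illustrative; this is not the paper's NCV algorithm, which refits a nonlinear network on each fold):

```python
import numpy as np

def cv_prediction_risk(model_fit, model_predict, X, y, n_folds=5):
    """Estimate prediction risk (expected squared error on new data)
    as the average held-out squared error over K folds."""
    n = len(y)
    idx = np.arange(n)
    folds = np.array_split(idx, n_folds)
    fold_errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)           # all indices not in this fold
        params = model_fit(X[train], y[train])    # refit on the training split
        resid = y[fold] - model_predict(params, X[fold])
        fold_errors.append(np.mean(resid ** 2))   # held-out squared error
    return np.mean(fold_errors)
```

For example, with `model_fit` a least-squares solver and `model_predict` a linear predictor, the returned value estimates how the fitted model would perform on fresh data, which is exactly the quantity used for model selection below.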

### Citations

1892 | A new look at the statistical model identification - Akaike - 1974 |

1339 | Generalized Additive Models - Hastie, Tibshirani - 1990 |

1284 | Spline Models for Observational Data - Wahba - 1990 |

1256 | Information theory as an extension of the maximum likelihood principle - Akaike - 1973 |
Citation Context: ..., for example generalized cross-validation (GCV) (Craven and Wahba, 1979; Golub, Heath and Wahba, 1979), Akaike's final prediction error (FPE) (Akaike, 1970), Akaike's information criterion A (AIC) (Akaike, 1973), and predicted squared error (PSE) (see discussion in Barron (1984)), and the recently proposed generalized prediction error (GPE) for nonlinear models (Moody (1991; 1992; 1995)). 2 Prediction Risk ... |

1174 | Modeling by shortest data description - Rissanen - 1978 |
Citation Context: ...sely defined via an objective criterion, such as maximum a posteriori probability (MAP), minimum Bayesian information criterion (BIC) (Akaike, 1977; Schwartz, 1978), minimum description length (MDL) (Rissanen, 1978), or minimum prediction risk (P). In this paper, we focus on the prediction risk as our selection criterion for two reasons. First, it is straightforward to compute, and second, it provides more info... |

727 | Cross-validation choice and assessment of statistical predictions - Stone - 1974 |

614 | Neural networks and the bias/variance dilemma - Geman, Bienenstock, et al. - 1992 |

429 | Smoothing noisy data with spline functions - Craven, Wahba - 1979 |
Citation Context: ...and Utans (1992), and Moody and Yarvin (1992), while algebraic estimates in the regression context include various formulae derived for linear models, for example generalized cross-validation (GCV) (Craven and Wahba, 1979; Golub, Heath and Wahba, 1979), Akaike's final prediction error (FPE) (Akaike, 1970), Akaike's information criterion A (AIC) (Akaike, 1973), and predicted squared error (PSE) (see discussion in Barro... |

423 | Optimal Brain Damage - LeCun, Denker, et al. - 1990 |
Citation Context: ...has stopped at a local minimum. The second derivatives required for s_k can be efficiently computed by a method similar to the backpropagation of first derivatives for weight updates during training (LeCun et al., 1990). The procedure for eliminating weights as described by LeCun et al. (1990) consists of ranking the weights in the network according to increasing s_k, removing first one weight or a few weights, th... |
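The context above describes OBD's pruning loop: compute a saliency s_k for each weight from second derivatives, rank weights by increasing s_k, and remove the least salient ones. A minimal sketch of the ranking-and-removal step, assuming a diagonal Hessian approximation is already available (as in LeCun et al., 1990, where s_k = H_kk w_k^2 / 2); the function name and array layout are illustrative:

```python
import numpy as np

def obd_prune(weights, hess_diag, n_prune):
    """Zero out the n_prune weights with smallest OBD saliency
    s_k = H_kk * w_k**2 / 2, using a diagonal Hessian approximation."""
    saliency = 0.5 * hess_diag * weights ** 2
    order = np.argsort(saliency)       # indices sorted by increasing saliency
    pruned = weights.copy()
    pruned[order[:n_prune]] = 0.0      # remove the least salient weights
    return pruned, saliency
```

In the full procedure this step alternates with retraining: prune a few weights, retrain the smaller network, and repeat until performance degrades.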

329 | Estimating the dimension of a model - Schwarz - 1978 |
Citation Context: ...s" the data. The notion of "best fits" can be precisely defined via an objective criterion, such as maximum a posteriori probability (MAP), minimum Bayesian information criterion (BIC) (Akaike, 1977; Schwartz, 1978), minimum description length (MDL) (Rissanen, 1978), or minimum prediction risk (P). In this paper, we focus on the prediction risk as our selection criterion for two reasons. First, it is straightfo... |

275 | Generalized cross-validation as a method for choosing a good ridge parameter - Golub, Heath, et al. - 1979 |
Citation Context: ...squared error loss function, a number of useful algebraic estimates for the prediction risk have been derived. These include the well known generalized cross-validation (GCV) (Craven and Wahba, 1979; Golub et al., 1979) and Akaike's final prediction error (FPE) (Akaike, 1970) formulas:

GCV(λ) = ASE(λ) / [1 − Q(λ)/N]²    FPE(λ) = ASE(λ) · (1 + Q(λ)/N) / (1 − Q(λ)/N)    (8)

Q(λ) denotes the number of weights of model λ... |
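The GCV and FPE formulas quoted above are simple algebraic corrections of the training error. A direct transcription (ASE is the average squared error on the training set, Q the number of weights of model λ, N the number of training examples; function names are illustrative):

```python
def gcv(ase, q, n):
    """Generalized cross-validation: GCV = ASE / (1 - Q/N)**2."""
    return ase / (1.0 - q / n) ** 2

def fpe(ase, q, n):
    """Akaike's final prediction error: FPE = ASE * (1 + Q/N) / (1 - Q/N)."""
    return ase * (1.0 + q / n) / (1.0 - q / n)
```

Both estimates inflate the training error as Q/N grows, so among models with similar training error the one with fewer weights relative to the data size is preferred.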

227 | Spline Smoothing and Nonparametric Regression - Eubank - 1988 |

173 | A Leisurely Look at the Bootstrap, the Jackknife and Cross-Validation - Efron, Gong - 1983 |

173 | Second Order Derivatives for Network Pruning: Optimal Brain Surgeon - Hassibi, Stork - 1993 |
Citation Context: ...work models. These include the irrelevant hidden unit and irrelevant input hypothesis tests (White, 1989), pruning of units via skeletonization (Mozer and Smolensky, 1990), optimal brain surgeon OBS (Hassibi and Stork, 1993), and principal components pruning PCP (Levin et al., 1994). It is important to note that all these methods, along with OBD and our method of input pruning via SBP, are closely related to the Wald hy... |

172 | The effective number of parameters: an analysis of generalization and regularization in nonlinear learning systems - Moody - 1992 |

123 | Skeletonization: A Technique for Trimming the Fat from a Network via Relevance Assessment - Mozer, Smolensky - 1989 |
Citation Context: ...e and should be considered when constructing neural network models. These include the irrelevant hidden unit and irrelevant input hypothesis tests (White, 1989), pruning of units via skeletonization (Mozer and Smolensky, 1990), optimal brain surgeon OBS (Hassibi and Stork, 1993), and principal components pruning PCP (Levin et al., 1994). It is important to note that all these methods, along with OBD and our method of inpu... |

112 | Statistical predictor identification - Akaike - 1970 |
Citation Context: ...text include various formulae derived for linear models, for example generalized cross-validation (GCV) (Craven and Wahba, 1979; Golub, Heath and Wahba, 1979), Akaike's final prediction error (FPE) (Akaike, 1970), Akaike's information criterion A (AIC) (Akaike, 1973), and predicted squared error (PSE) (see discussion in Barron (1984)), and the recently proposed generalized prediction error (GPE) for nonlinea... |

105 | The predictive sample reuse method with applications - Geisser - 1975 |

57 | A Completely Automatic French Curve: Fitting Spline Functions by Cross-Validation - Wahba, Wold - 1975 |

55 | On entropy maximization principle - Akaike - 1977 |
Citation Context: ...hich "best fits" the data. The notion of "best fits" can be precisely defined via an objective criterion, such as maximum a posteriori probability (MAP), minimum Bayesian information criterion (BIC) (Akaike, 1977; Schwartz, 1978), minimum description length (MDL) (Rissanen, 1978), or minimum prediction risk (P). In this paper, we focus on the prediction risk as our selection criterion for two reasons. First, ... |

55 | Note on generalization, regularization, and architecture selection in nonlinear learning systems - Moody - 1991 |

46 | Principled architecture selection for neural networks: Application to corporate bond rating prediction - Moody, Utans - 1992 |

34 | Fast Learning in Multi-Resolution Hierarchies - Moody - 1989 |

33 | Architecture selection strategies for neural networks: Application to corporate bond rating prediction - Moody, Utans - 1994 |

33 | Selecting Neural Network Architectures via the Prediction Risk: Application to Corporate Bond Rating Prediction - Utans, Moody - 1991 |

31 | Stock performance modeling using neural networks: a comparative study with regression models. Neural Networks 7:375–388 - Refenes, Zapranis, et al. - 1994 |

29 | Cross-validation: A review - Stone - 1978 |

28 | Fast pruning using principal components - Levin, Leen, et al. - 1994 |
Citation Context: ...vant input hypothesis tests (White, 1989), pruning of units via skeletonization (Mozer and Smolensky, 1990), optimal brain surgeon OBS (Hassibi and Stork, 1993), and principal components pruning PCP (Levin et al., 1994). It is important to note that all these methods, along with OBD and our method of input pruning via SBP, are closely related to the Wald hypothesis testing procedure (see for example Buse (1982)). I... |

27 | The recurrent cascade-correlation learning algorithm - Fahlman - 1991 |
Citation Context: ...efore proceeding, we would like to note that many authors have independently proposed iterative network construction algorithms. Probably the best known of these is the cascade correlation algorithm (Fahlman and Lebiere, 1990). Cascade correlation was preceded by Ash (1989); see also Moody (1989). We have not attempted to exhaustively review this area, nor do we claim that SNC is necessarily unique or optimal. 6.1.1 The S... |

25 | Predicted squared error: A criterion for automatic model selection, Self-Organizing Methods in Modeling - Barron - 1984 |

15 | The likelihood ratio, Wald, and Lagrange multiplier test: An expository note, Am Stat 36:153–157 - Buse - 1982 |

14 | Data analysis, including statistics - Mosteller, Tukey - 1968 |

10 | Networks with learned unit response functions - Moody, Yarvin - 1992 |

9 | A generalization error estimate for nonlinear systems - Larsen - 1992 |

5 | Dynamic node creation in backpropagation neural networks - Ash - 1989 |

2 | Neural network model selection using asymptotic jackknife estimator and cross-validation method, in - Liu - 1993 |

1 | The effective number of parameters and generalized prediction error for nonlinear regression, manuscript in preparation - Moody - 1995 |