Prediction risk and architecture selection for neural networks
1994
Cited by 75 (2 self)
We describe two important sets of tools for neural network modeling: prediction risk estimation and network architecture selection. Prediction risk is defined as the expected performance of an estimator in predicting new observations. Estimated prediction risk can be used both for estimating the quality of model predictions and for model selection. Prediction risk estimation and model selection are especially important for problems with limited data. Techniques for estimating prediction risk include data resampling algorithms such as nonlinear cross-validation (NCV) and algebraic formulae such as the predicted squared error (PSE) and generalized prediction error (GPE). We show that exhaustive search over the space of network architectures is computationally infeasible even for networks of modest size. This motivates the use of heuristic strategies that dramatically reduce the search complexity. These strategies employ directed search algorithms, such as selecting the number of nodes via sequential network construction (SNC) and pruning inputs and weights via sensitivity-based pruning (SBP) and optimal brain damage (OBD), respectively.
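A resampling estimate of prediction risk can be illustrated with ordinary k-fold cross-validation. The sketch below is not the NCV procedure from the paper: it assumes a one-dimensional least-squares line as the model, and the names `fit_line` and `kfold_prediction_risk` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def kfold_prediction_risk(xs, ys, k=5):
    """Mean squared error on held-out points, averaged over k folds.

    Assumption: plain k-fold cross-validation with interleaved folds,
    used here as a stand-in for the resampling estimators in the abstract.
    """
    n = len(xs)
    squared_errors = []
    for fold in range(k):
        held_out = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        intercept, slope = fit_line(train_x, train_y)
        for i in held_out:
            prediction = intercept + slope * xs[i]
            squared_errors.append((ys[i] - prediction) ** 2)
    return sum(squared_errors) / n


xs = [float(i) for i in range(10)]
clean = [2.0 * x + 1.0 for x in xs]
noisy = [y + (1.0 if i % 2 == 0 else -1.0) for i, y in enumerate(clean)]
risk_clean = kfold_prediction_risk(xs, clean)
risk_noisy = kfold_prediction_risk(xs, noisy)
```

Every point is held out exactly once, so the estimate only uses predictions on data the fitted model has not seen, which is the property that matters when data are limited.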
Flat Minima
1997
Cited by 32 (14 self)
... this paper (available on the World Wide Web; see our home pages) contains pseudocode of an efficient implementation. It is based on fast multiplication of the Hessian and a vector due to Pearlmutter (1994) and Møller (1993).
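The idea of multiplying the Hessian by a vector without forming the Hessian can be sketched as follows. This is a central-difference approximation built from two gradient evaluations, not the exact method of Pearlmutter or Møller; the function name `hessian_vector_product` and the toy quadratic are assumptions made for illustration.

```python
def hessian_vector_product(grad, w, v, eps=1e-5):
    """Approximate H(w) @ v using two gradient evaluations.

    Assumption: central finite differences, Hv ~ (g(w + eps*v) - g(w - eps*v)) / (2*eps).
    The exact methods cited in the text avoid this approximation error.
    """
    w_plus = [wi + eps * vi for wi, vi in zip(w, v)]
    w_minus = [wi - eps * vi for wi, vi in zip(w, v)]
    g_plus = grad(w_plus)
    g_minus = grad(w_minus)
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(g_plus, g_minus)]


# Toy error function E(w) = 0.5 * w^T A w, whose gradient is A w and Hessian is A.
A = [[2.0, 1.0], [1.0, 3.0]]


def quadratic_grad(w):
    return [sum(a * wi for a, wi in zip(row, w)) for row in A]


hv = hessian_vector_product(quadratic_grad, w=[0.5, -0.25], v=[1.0, -1.0])
```

The cost is two gradient passes regardless of the number of weights, so the full Hessian is never stored.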
A Principal Components Approach to Combining Regression Estimates
Machine Learning, 1998
Cited by 28 (0 self)
The goal of combining the predictions of multiple learned models is to form an improved estimator. A combining strategy must be able to robustly handle the inherent correlation, or multicollinearity, of the learned models while identifying the unique contributions of each. A progression of existing approaches and their limitations with respect to these two issues are discussed. A new approach, PCR*, based on principal components regression is proposed to address these limitations. An evaluation of the new approach on a collection of domains reveals that 1) PCR* was the most robust combining method, 2) correlation could be handled without eliminating any of the learned models, and 3) the principal components of the learned models provided a continuum of "regularized" weights from which PCR* could choose. Keywords: Regression, principal components, multiple models, combining estimates. 1. Introduction Combining a set of learned models to improve classification and regression estimat...
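Principal components regression over model predictions can be sketched in a few lines. This is not PCR* itself: it keeps only the first principal component (found by power iteration) and does not choose the number of components, and `combine_first_pc` is a hypothetical name.

```python
def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values], mean


def combine_first_pc(preds, y, iters=200):
    """Combine model predictions by regressing y on their first principal component.

    preds: one list of predictions per model, all of length len(y).
    Returns (weights, intercept) so that the combined prediction is
    intercept + sum(weights[i] * preds[i][t]).
    Assumption: the start vector of all ones is not orthogonal to the
    leading component, which holds when models are positively correlated.
    """
    k, n = len(preds), len(y)
    centered, means = [], []
    for p in preds:
        c, m = center(p)
        centered.append(c)
        means.append(m)
    y_centered, y_mean = center(y)
    cov = [[sum(a * b for a, b in zip(centered[i], centered[j])) / n
            for j in range(k)] for i in range(k)]
    # Power iteration for the leading eigenvector of the covariance matrix.
    v = [1.0] * k
    for _ in range(iters):
        u = [sum(cov[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = sum(x * x for x in u) ** 0.5
        v = [x / norm for x in u]
    # Project predictions onto the component and regress the target on the scores.
    scores = [sum(v[i] * centered[i][t] for i in range(k)) for t in range(n)]
    beta = sum(s * t for s, t in zip(scores, y_centered)) / sum(s * s for s in scores)
    weights = [beta * vi for vi in v]
    intercept = y_mean - sum(w * m for w, m in zip(weights, means))
    return weights, intercept


# Two perfectly collinear models: ordinary least squares would be singular here.
y = [1.0, 2.0, 4.0, 7.0]
weights, intercept = combine_first_pc([list(y), list(y)], y)
```

With two identical models the weights are shared equally, which shows how the component-based weights stay well defined under multicollinearity without discarding either model.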
Automatic Early Stopping Using Cross Validation: Quantifying the Criteria
Neural Networks, 1997
Cited by 25 (0 self)
Cross validation can be used to detect when overfitting starts during supervised training of a neural network; training is then stopped before convergence to avoid the overfitting ("early stopping"). The exact criterion used for cross-validation-based early stopping, however, is chosen in an ad hoc fashion by most researchers, or training is stopped interactively. To aid a more well-founded selection of the stopping criterion, 14 different automatic stopping criteria from 3 classes were evaluated empirically for their efficiency and effectiveness in 12 different classification and approximation tasks using multilayer perceptrons with RPROP training. The experiments show that on average slower stopping criteria allow for small improvements in generalization (on the order of 4%), but cost about a factor of 4 longer training time. 1 Training for generalization When training a neural network, one is usually interested in obtaining a network with optimal generalization performance. Genera...
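An automatic stopping criterion of this kind can be sketched as a threshold on how far the validation error has risen above its best value so far. The abstract does not list the 14 criteria, so the rule below, the threshold `alpha`, and the name `stop_epoch` are illustrative assumptions.

```python
def stop_epoch(val_errors, alpha=5.0):
    """Return the first epoch at which the generalization loss exceeds alpha.

    Generalization loss (in percent) is taken here as
    100 * (current validation error / lowest validation error so far - 1).
    Assumption: validation errors are strictly positive.
    Returns None if the threshold is never exceeded.
    """
    best = float("inf")
    for epoch, error in enumerate(val_errors):
        best = min(best, error)
        loss = 100.0 * (error / best - 1.0)
        if loss > alpha:
            return epoch
    return None


history = [1.0, 0.8, 0.7, 0.72, 0.9]
late_stop = stop_epoch(history, alpha=5.0)
early_stop = stop_epoch(history, alpha=2.0)
```

A larger `alpha` makes the criterion slower: it tolerates more of a rise in validation error before stopping, which is the trade-off between training time and generalization that the abstract quantifies.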
Investigation of the CasCor Family of Learning Algorithms
Neural Networks, 1996
Cited by 17 (0 self)
Six learning algorithms are investigated and compared empirically. All of them are based on variants of the candidate training idea of the Cascade Correlation method. The comparison was performed using 42 different datasets from the Proben1 benchmark collection. The results indicate: (1) for these problems it is slightly better not to cascade the hidden units, (2) error minimization candidate training is better than covariance maximization for regression problems but may be a little worse for classification problems, (3) for most learning tasks, considering validation set errors during the selection of the best candidate will not lead to improved networks, but for a few tasks it will.
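The covariance maximization objective used in candidate training can be illustrated with a small score function. The formula below follows the usual description of Cascade Correlation (sum over outputs of the absolute covariance between candidate activation and residual error); it is not taken from this abstract, and `covariance_score` is a hypothetical name.

```python
def covariance_score(candidate_outputs, residual_errors):
    """Score a candidate unit by its covariance with the residual errors.

    candidate_outputs: activation of the candidate for each training pattern.
    residual_errors: per pattern, a list with one residual error per output unit.
    Score = sum over outputs of |sum over patterns of (V_p - mean V) * (E_po - mean E_o)|.
    """
    n = len(candidate_outputs)
    mean_v = sum(candidate_outputs) / n
    num_outputs = len(residual_errors[0])
    score = 0.0
    for o in range(num_outputs):
        mean_e = sum(err[o] for err in residual_errors) / n
        covariance = sum((v - mean_v) * (err[o] - mean_e)
                         for v, err in zip(candidate_outputs, residual_errors))
        score += abs(covariance)
    return score


tracking = covariance_score([0.0, 1.0], [[0.0], [1.0]])
constant = covariance_score([0.5, 0.5], [[0.0], [1.0]])
```

A candidate whose activation follows the residual error scores high, while a constant candidate scores zero. Error minimization candidate training, the alternative compared in the abstract, would instead judge each candidate by the output error after it is added.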
Early Stopping - but when?
Neural Networks: Tricks of the Trade, volume 1524 of LNCS, chapter 2, 1997
Cited by 16 (0 self)
Validation can be used to detect when overfitting starts during supervised training of a neural network; training is then stopped before convergence to avoid the overfitting ("early stopping"). The exact criterion used for validation-based early stopping, however, is usually chosen in an ad hoc fashion or training is stopped interactively. This trick describes how to select a stopping criterion in a systematic fashion; it is a trick for either speeding learning procedures or improving generalization, whichever is more important in the particular situation. An empirical investigation on multilayer perceptrons shows that there exists a tradeoff between training time and generalization: From the given mix of 1296 training runs using 12 different problems and 24 different network architectures, I conclude that slower stopping criteria allow for small improvements in generalization (here: about 4% on average), but cost much more training time (here: about a factor of 4 longer on average). 1 Early ...
A Smoothing Regularizer for Feedforward and Recurrent Neural Networks
1996
Cited by 12 (1 self)
We derive a smoothing regularizer for dynamic network models by requiring robustness in prediction performance to perturbations of the training data. The regularizer can be viewed as a generalization of the first order Tikhonov stabilizer to dynamic models. For two layer networks with recurrent connections described by $Y(t) = f\left(W Y(t - \tau) + V X(t)\right)$, $\hat{Z}(t) = U Y(t)$, the training criterion with the regularizer is $D = \frac{1}{N} \sum_{t=1}^{N} \|Z(t) - \hat{Z}(\Phi, I(t))\|^2 + \lambda \rho_\tau^2(\Phi)$, where $\Phi = \{U, V, W\}$ is the network parameter set, $Z(t)$ are the targets, $I(t) = \{X(s);\ s = 1, 2, \cdots, t\}$ represents the current and all historical input information, $N$ is the size of the training data set, $\rho_\tau^2(\Phi)$ is the regularizer, and $\lambda$ is a regularization parameter. The closed-form expression for the regularizer for time-lagged recurrent networks is: $\rho_\tau(\Phi) = \frac{\gamma \|U\| \|V\|}{1 - \gamma \|W\|} \left[ 1 - e^{\frac{\gamma \|W\| - 1}{\tau}} \right]$; ...
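The closed-form regularizer can be evaluated numerically once the symbols are fixed. The sketch below reads the expression as rho_tau = gamma*||U||*||V|| / (1 - gamma*||W||) * (1 - exp((gamma*||W|| - 1) / tau)). The abstract is cut off before the norm and gamma are defined, so the Frobenius norm and a user-supplied `gamma` are assumptions, as are the function names.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries (assumed choice of norm)."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def smoothing_regularizer(U, V, W, gamma, tau):
    """Evaluate rho_tau for weight matrices U, V, W given as lists of rows.

    Assumption: gamma * ||W|| != 1, otherwise the expression is undefined
    as written and would need a limit argument.
    """
    gamma_w = gamma * frobenius_norm(W)
    if gamma_w == 1.0:
        raise ValueError("gamma * ||W|| must differ from 1")
    prefactor = gamma * frobenius_norm(U) * frobenius_norm(V) / (1.0 - gamma_w)
    return prefactor * (1.0 - math.exp((gamma_w - 1.0) / tau))


def penalty(U, V, W, gamma, tau, lam):
    """Penalty term added to the mean squared error: lam * rho_tau**2."""
    return lam * smoothing_regularizer(U, V, W, gamma, tau) ** 2


rho = smoothing_regularizer([[1.0]], [[2.0]], [[0.0]], gamma=1.0, tau=1.0)
```

With no recurrent weights (W = 0), gamma = 1 and tau = 1, the value reduces to ||U|| ||V|| (1 - 1/e), which gives a quick sanity check on the reading of the formula.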
Economic Forecasting: Challenges and Neural Network Solutions
In Proceedings of the International Symposium on Artificial Neural Networks, 1995
Cited by 7 (0 self)
Macroeconomic forecasting is a very difficult task due to the lack of an accurate, convincing model of the economy. The most accurate models for economic forecasting, "black box" time series models, assume little about the structure of the economy. Constructing reliable time series models is challenging due to short data series, high noise levels, nonstationarities, and nonlinear effects. This paper describes these challenges and surveys some neural network solutions to them. Important issues include balancing the bias/variance tradeoff and the noise/nonstationarity tradeoff. The methods surveyed include hyperparameter selection (regularization parameter and training window length), input variable selection and pruning, network architecture selection and pruning, new smoothing regularizers, and committee forecasts. Empirical results are presented for forecasting the U.S. Index of Industrial Production. These demonstrate that, relative to conventional linear time series and regression m...
Variable Selection Using Neural-Network Models
2000
Cited by 5 (0 self)
In this paper we propose an approach to variable selection that uses a neural-network model as the tool to determine which variables are to be discarded. The method performs a backward selection by successively removing input nodes in a network trained with the complete set of variables as inputs. Input nodes are removed, along with their connections, and remaining weights are adjusted in such a way that the overall input-output behavior learnt by the network is kept approximately unchanged. A simple criterion to select input nodes to be removed is developed. The proposed method is tested on a famous example of system identification. Experimental results show that the removal of input nodes from the neural network model improves its generalization ability. In addition, the method compares favorably with respect to other feature reduction methods.
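The backward selection loop itself can be sketched generically. The paper's own removal criterion and weight adjustment are not given in the abstract, so the sketch swaps in a caller-supplied `error_fn` that scores a set of kept variables; `backward_select`, `error_fn`, `max_error`, and the toy importances are all hypothetical.

```python
def backward_select(variables, error_fn, max_error):
    """Greedy backward elimination of input variables.

    At each step, drop the variable whose removal gives the lowest error,
    and stop when even the best removal would push the error above max_error.
    error_fn(kept) must return the model error when only `kept` are used.
    """
    kept = list(variables)
    while len(kept) > 1:
        candidates = [
            (error_fn([v for v in kept if v != drop]), drop) for drop in kept
        ]
        error, drop = min(candidates)
        if error > max_error:
            break
        kept.remove(drop)
    return kept


# Toy stand-in for a trained network: error grows by the importance of removed inputs.
importance = {"a": 5.0, "b": 0.1, "c": 0.0}


def toy_error(kept):
    return sum(value for name, value in importance.items() if name not in kept)


selected = backward_select(["a", "b", "c"], toy_error, max_error=0.5)
```

In the toy case the two unimportant inputs are removed first and the loop stops before dropping the one that carries the signal.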
Flat Minimum Search Finds Simple Nets
1994
Cited by 3 (2 self)
We present a new algorithm for finding low complexity neural networks with high generalization capability. The algorithm searches for a "flat" minimum of the error function. A flat minimum is a large connected region in weight space where the error remains approximately constant. An MDL-based argument shows that flat minima correspond to low expected overfitting. Although our algorithm requires the computation of second order derivatives, it has backprop's order of complexity. It automatically and effectively prunes units, weights, and input lines. Various experiments with feedforward and recurrent nets are described. In an application to stock market prediction, flat minimum search outperforms (1) conventional backprop, (2) weight decay, (3) "optimal brain surgeon" / "optimal brain damage".