## Bootstrapping with Noise: An Effective Regularization Technique (1996)

Venue: Connection Science

Citations: 59 (15 self)

### BibTeX

```bibtex
@ARTICLE{Raviv96bootstrappingwith,
  author  = {Yuval Raviv and Nathan Intrator},
  title   = {Bootstrapping with Noise: An Effective Regularization Technique},
  journal = {Connection Science},
  year    = {1996},
  volume  = {8},
  pages   = {355--372}
}
```

### Abstract

Bootstrap samples with noise are shown to be an effective smoothness and capacity control technique for training feed-forward networks and for other statistical methods such as generalized additive models. It is shown that the noisy bootstrap performs best in conjunction with weight-decay regularization and ensemble averaging. The two-spiral problem, a highly non-linear, noise-free dataset, is used to demonstrate these findings. The combination of noisy bootstrap and ensemble averaging is also shown to be useful for generalized additive modeling, and is demonstrated on the well-known Cleveland Heart Data [7]. Keywords: Noise Injection, Combining Estimators, Pattern Classification, Two Spiral Problem, Clinical Data Analysis. 1 Introduction The bootstrap technique has become one of the major tools for producing empirical confidence intervals of estimated parameters or predictors [8]. One way to view the bootstrap is as a method to simulate the noise inherent in the data, and thus effectively increase t...
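The core procedure the abstract describes, resampling the training set with replacement and perturbing each resampled input with zero-mean Gaussian noise, can be sketched as follows. This is a minimal illustration; the function name, noise level, and dataset are our own choices, not taken from the paper.

```python
import numpy as np

def noisy_bootstrap(X, y, n_samples=None, noise_sd=0.1, rng=None):
    """Draw a bootstrap sample of (X, y) and add zero-mean Gaussian
    noise to the resampled inputs (noise injection)."""
    rng = np.random.default_rng(rng)
    n = len(X) if n_samples is None else n_samples
    idx = rng.integers(0, len(X), size=n)   # sample with replacement
    X_b = X[idx] + rng.normal(0.0, noise_sd, size=X[idx].shape)
    return X_b, y[idx]

# Example: inflate a tiny 2-D dataset to 5x its size with noisy copies
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 1, 1])
X_b, y_b = noisy_bootstrap(X, y, n_samples=15, noise_sd=0.05, rng=0)
print(X_b.shape, y_b.shape)  # (15, 2) (15,)
```

Note that labels are copied unchanged; only the inputs are perturbed, which matches the paper's framing of noise injection as simulating the noise inherent in the data.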

### Citations

2542 | An introduction to the bootstrap - Efron, Tibshirani - 1993
Citation Context: ... Two Spiral Problem, Clinical Data Analysis. 1 Introduction The bootstrap technique has become one of the major tools for producing empirical confidence intervals of estimated parameters or predictors [8]. One way to view the bootstrap is as a method to simulate the noise inherent in the data, and thus effectively increase the number of training patterns. A simple bootstrap procedure amounts to sampling with...

2503 | Bagging predictors - Breiman - 1996
Citation Context: ...$\{f_i - E[\bar{f}]\}\{f_j - E[\bar{f}]\}$; and is thus interpreted as independence of the prediction variation around a common mean. The success of ensemble averaging of neural networks in the past [15, 31, 4, 26] is due to the fact that neural networks have in general many local minima, and thus even with the same training set, different local minima are found when starting from different random initial condi...

1328 | Generalized Additive Models - Hastie, Tibshirani - 1990
Citation Context: ...a different approach. Instead of analyzing a method that has a hard time with the spiral data, we study a model that is very natural for it. We apply bootstrapping to a generalized additive model (GAM) [16, 17] with a polynomial fit of degree 1 on the same data. We had to optimize the degree of the polynomial and the span degree, which determines the smoothness and the degree of locality of the estimation...

1116 | Pattern recognition and neural networks - Ripley - 1996
Citation Context: ... regularization such as weight decay [21, 27, for review], but again, the estimation of the optimal regularization factor should be done on the ensemble-averaged performance. Breiman [4] and Ripley [27] show compelling empirical evidence for the importance of weight decay as a single-network stabilizer. Our results confirm this fact under the BEN model. The BEN algorithm: • Let $\{(x_i, y_i)\}$ be a ...
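The BEN (bootstrap ensemble with noise) scheme this excerpt begins to state, namely fitting each ensemble member on its own noisy bootstrap sample under weight-decay regularization and then averaging the members' predictions, can be sketched as follows. Ridge regression stands in for the paper's feed-forward network here, with the ridge penalty playing the role of weight decay; all names and parameter values are illustrative assumptions rather than the paper's.

```python
import numpy as np

def ben_fit_predict(X, y, X_test, n_members=10, noise_sd=0.1,
                    weight_decay=1e-2, rng=None):
    """Sketch of a bootstrap ensemble with noise (BEN): each member is
    fit on a noisy bootstrap sample with a weight-decay (ridge) penalty,
    and the member predictions are averaged."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    preds = []
    for _ in range(n_members):
        idx = rng.integers(0, n, size=n)                 # bootstrap resample
        Xb = X[idx] + rng.normal(0.0, noise_sd, (n, d))  # noise injection
        yb = y[idx]
        # ridge solution (Xb'Xb + lambda*I)^-1 Xb'yb, the weight-decay analog
        w = np.linalg.solve(Xb.T @ Xb + weight_decay * np.eye(d), Xb.T @ yb)
        preds.append(X_test @ w)
    return np.mean(preds, axis=0)                        # ensemble average

# Noisy linear data: the averaged ensemble should recover the trend
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(0, 0.1, 200)
y_hat = ben_fit_predict(X, y, X[:5], rng=2)
print(y_hat.shape)  # (5,)
```

The averaging step is what exploits the independence of the members' prediction variation around a common mean that the excerpt above discusses.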

741 | UCI repository of machine learning databases (www.ics.uci.edu/∼mlearn/MLRepository.html) - Murphy, Aha - 1992
Citation Context: ...e demonstrate our method on another well-known machine learning problem, the prediction of coronary artery disease based on the Cleveland Heart data, which resides in the UCI machine learning repository [25]. 4 Results on the spiral data 4.1 Feed-forward network architecture We used Ripley's S-Plus 'nnet' package [27], which implements back-propagation. The minimization criterion is mean squared error wi...

658 | The cascade-correlation learning architecture - Fahlman, Lebiere - 1990
Citation Context: ...istent with the training set; however, the single-layer feed-forward architecture trained with error back-propagation was unable to find any of them when starting with random initial weights. Fahlman [9] used the Cascade-Correlation architecture for this problem. He got better results, but still little "spiralness". Recently Deffuant [5] suggested the "Perceptron Membrane" method that uses piecewise ...

609 | Neural networks and the bias/variance dilemma - Geman, Bienenstock, et al. - 1992
Citation Context: ...a number of factors that have to be applied carefully when trying to regularize an estimator. The regularization is aimed at finding an optimal tradeoff between the variance and bias of the estimator [11], and for best performance, one has to utilize this decomposition of the error function. The motivation for our approach follows from a key observation regarding the bias-variance decomposition, name...

554 | Stacked generalization - Wolpert - 1992
Citation Context: ...$\{f_i - E[\bar{f}]\}\{f_j - E[\bar{f}]\}$; and is thus interpreted as independence of the prediction variation around a common mean. The success of ensemble averaging of neural networks in the past [15, 31, 4, 26] is due to the fact that neural networks have in general many local minima, and thus even with the same training set, different local minima are found when starting from different random initial condi...

543 | The meaning and use of the area under the receiver operating characteristic (ROC) curve - Hanley, McNeil - 1982

504 | Neural Network Ensembles - Hansen, Salamon - 1990
Citation Context: ...$\{f_i - E[\bar{f}]\}\{f_j - E[\bar{f}]\}$; and is thus interpreted as independence of the prediction variation around a common mean. The success of ensemble averaging of neural networks in the past [15, 31, 4, 26] is due to the fact that neural networks have in general many local minima, and thus even with the same training set, different local minima are found when starting from different random initial condi...

252 | Fast-learning variations on back-propagation: An empirical study - Fahlman - 1989
Citation Context: ...ion of linear separators. Lang and Witbrock [22] proposed a 2-5-5-5-1 network with short-cuts using 138 weights. They used a variant of the quick-prop learning algorithm [10] with weight decay. They claimed that the problem could not be solved with a simpler architecture (i.e. fewer layers or without short-cuts). Their result on the same data-set seems to give poor generali...

229 | Nonparametrics: Statistical Methods Based on Ranks - Lehmann - 1974

196 | Model of Incremental Concept Formation - Gennari, Langley, et al. - 1992

175 | Introduction to Mathematical Statistics - Hogg, Craig - 1965
Citation Context: ...GAM-ROC = 0.903 ± 0.001; NNET-ROC = 0.91 ± 0.002; t = 1.766; DF = 21; P < 0.045; Z = 1.691; P < 0.045) or with the optimal 3-hidden-unit network. We have been using both the t statistic [20] and the Z statistic of the Wilcoxon test [23], which uses a nonparametric rank to test difference in the medians, as it is more robust to outliers. The ROC results suggest that the classification error of thi...

144 | Understanding robust and exploratory data analysis - Hoaglin, Mosteller, et al. - 1983
Citation Context: ...e is suboptimal. Noise levels represent the standard deviation (SD) of the zero-mean noise Gaussian. on the public domain version of Tibshirani in Statlib. The results are summarized by boxplots [19]. Each boxplot is based on 500-900 single-network runs. As the ratio between the two classes is different from one, classification results are not a very robust measure for model comparison, since the...

130 | Learning to Tell Two Spirals Apart - Lang, Witbrock - 1989
Citation Context: ...d for back-propagation networks due to its high non-linearity. It is easy to see that the 2D points of the spirals could not be separated by a small combination of linear separators. Lang and Witbrock [22] proposed a 2-5-5-5-1 network with short-cuts using 138 weights. They used a variant of the quick-prop learning algorithm [10] with weight decay. They claimed that the pr...

112 | Training with noise is equivalent to Tikhonov regularization - Bishop - 1995
Citation Context: ... the form y = f(x + ε) may be more appropriate. In this case, using noise injection into the inputs during training can improve the generalization properties of the estimator [28]. Recently, Bishop [2] has shown that training with small amounts of noise is locally equivalent to smoothness regularization. In this paper, we give a different interpretation to noise added to the input during training,...

88 | A simple weight decay can improve generalization - Krogh, Hertz - 1995

81 | Improving Regression Estimation: Averaging Methods for Variance Reduction with Extensions to General Convex Measure Optimization - Perrone - 1993

56 | Creating Artificial Neural Networks that Generalize - Sietsma, Dow - 1991
Citation Context: ...lassification problems, the form y = f(x + ε) may be more appropriate. In this case, using noise injection into the inputs during training can improve the generalization properties of the estimator [28]. Recently, Bishop [2] has shown that training with small amounts of noise is locally equivalent to smoothness regularization. In this paper, we give a different interpretation to noise added to the ...

50 | International application of a new probability algorithm for the diagnosis of coronary artery disease - Detrano, Janosi, et al. - 1989
Citation Context: ...trate these findings. The combination of noisy bootstrap and ensemble averaging is also shown to be useful for generalized additive modeling, and is demonstrated on the well-known Cleveland Heart Data [7]. Keywords: Noise Injection, Combining Estimators, Pattern Classification, Two Spiral Problem, Clinical Data Analysis. 1 Introduction The bootstrap technique has become one of the major tools for produ...

31 | Constructing hidden units using examples and queries - Baum, Lang - 1991
Citation Context: ...ed that the problem could not be solved with a simpler architecture (i.e. fewer layers or without short-cuts). Their result on the same data-set seems to give poor generalization results. Baum and Lang [1] demonstrated that there are many sets of weights that would cause a 2-50-1 network to be consistent with the training set; however, the single-layer feed-forward architecture trained wi...

12 | Nonparametrics: Statistical Methods Based on Ranks - Lehmann - 1975
Citation Context: ...ROC = 0.91 ± 0.002; t = 1.766; DF = 21; P < 0.045; Z = 1.691; P < 0.045) or with the optimal 3-hidden-unit network. We have been using both the t statistic [20] and the Z statistic of the Wilcoxon test [23], which uses a nonparametric rank to test difference in the medians, as it is more robust to outliers. The ROC results suggest that the classification error of this model could be improved, possibly by ave...

11 | Analysis of results - Brazdil, Henery - 1994

8 | The meaning and use of the area under a receiver operating characteristic (ROC) curve - Hanley, McNeil - 1982
Citation Context: ...e data, then setting the threshold to 1 will result in a trivial classifier that will produce zero regardless of the input and will have only 10% error. The Receiver Operating Characteristic (ROC) [13, 14] is frequently used in such model comparisons, especially in clinical data [18, for review]. This measure has been used by the contributor of the data [6] and in assessing neural network performance o...


6 | An algorithm for building regularized, piecewise linear discrimination surfaces: the perceptron membrane - Deffuant - 1995
Citation Context: ... any of them when starting with random initial weights. Fahlman [9] used the Cascade-Correlation architecture for this problem. He got better results, but still little "spiralness". Recently Deffuant [5] suggested the "Perceptron Membrane" method that uses piecewise linear surfaces as discriminators, and applied it to the spiral problem. He used 29 perceptrons but had difficulties capturing the struc...

6 | Assessing test accuracy and its clinical consequences: a primer for receiver operating characteristic curve analysis - Henderson - 1993

5 | Radiographic applications of receiver operating characteristic (ROC) curves - Goodenough, Rossmann, et al. - 1974
Citation Context: ...e data, then setting the threshold to 1 will result in a trivial classifier that will produce zero regardless of the input and will have only 10% error. The Receiver Operating Characteristic (ROC) [13, 14] is frequently used in such model comparisons, especially in clinical data [18, for review]. This measure has been used by the contributor of the data [6] and in assessing neural network performance o...

5 | Adaptive Automated Diagnosis - Stensmo - 1995
Citation Context: ...nce logistic regression was able to obtain a 9-fold cross-validation error of about 15.2%. A similar error was obtained by using extensive preprocessing and Temporal-Difference Reinforcement Learning [29]. Both results are consistent with our feed-forward architecture results with no noise injection and are (as far as we know) the current best results on this data. In this case, the model amounts...

4 | Predicting the risk of complications in coronary artery bypass operations using neural networks - Lippmann, Kukolich, et al. - 1995
Citation Context: ...odel comparisons, especially in clinical data [18, for review]. This measure has been used by the contributor of the data [6] and in assessing neural network performance on another heart disease data set [24]. Figure 6 implies that the performance of neural networks (without noise injection) as measured by error rate and ROC values is slightly worse (not statistically significant) compared with logistic ...


1 | Analysis of results (ch. 10) - Brazdil, Henery - 1994
Citation Context: ...allenging problem for neural networks, as the deviation from linear structure is very small, and highly nonlinear estimators such as CART, Radial-Basis Functions and K-NN did not do so well on this data [3]. The problem is complementary to the spiral problem considered before: there, we attempted to improve performance on highly nonlinear data which required a large-capacity network, while he...

1 | Accuracy curves: An alternative graphical representation of probability data - Detrano - 1989
Citation Context: ... The Receiver Operating Characteristic (ROC) [13, 14] is frequently used in such model comparisons, especially in clinical data [18, for review]. This measure has been used by the contributor of the data [6] and in assessing neural network performance on another heart disease data set [24]. Figure 6 implies that the performance of neural networks (without noise injection) as measured by error rate and ROC va...


1 | Introduction to Mathematical Statistics (3rd edn) - Hogg, Craig - 1970