## A Simple Trick for Estimating the Weight Decay Parameter (1998)

Venue: | Neural Networks: Tricks of the Trade |

Citations: | 2 - 0 self |

### BibTeX

@INPROCEEDINGS{Rögnvaldsson98asimple,
    author = {Thorsteinn S. Rögnvaldsson},
    title = {A Simple Trick for Estimating the Weight Decay Parameter},
    booktitle = {Neural Networks: Tricks of the Trade},
    year = {1998},
    pages = {71--93},
    publisher = {Springer}
}

### Abstract

We present a simple trick to get an approximate estimate of the weight decay parameter $\lambda$. The method combines early stopping and weight decay into the estimate $\hat{\lambda} = \|\nabla E(W_{es})\| / \|2 W_{es}\|$, where $W_{es}$ is the set of weights at the early stopping point, and $E(W)$ is the training data fit error. The estimate is demonstrated and compared to the standard cross-validation procedure for $\lambda$ selection on one synthetic and four real life data sets. The result is that $\hat{\lambda}$ is as good an estimator for the optimal weight decay parameter value as the standard search estimate, but orders of magnitude quicker to compute. The results also show that weight decay can produce solutions that are significantly superior to committees of networks trained with early stopping.

1 Introduction

A regression problem which does not put constraints on the model used is ill-posed [21], because there are infinitely many functions that can fit a finite set of training data perfectly. Furthermore, real life data sets tend to h...
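As a rough illustration of the estimate stated in the abstract, the sketch below computes the ratio of the norm of the fit-error gradient to the norm of twice the weights at an early stopping point. One way to read the formula is that at a stationary point of a fit error plus a penalty $\lambda \|W\|^2$, the fit-error gradient equals $-2\lambda W$. Everything else here is an assumption made for illustration and is not taken from the paper: the linear model, the synthetic data, the plain gradient descent loop, the validation split, and the helper names `fit_error_gradient`, `early_stopped_weights`, and `weight_decay_estimate`.

```python
import numpy as np

def fit_error_gradient(w, X, y):
    """Gradient of the fit error E(w) = 1/(2N) * sum((y - X @ w)**2) for a linear model."""
    n = X.shape[0]
    return -(X.T @ (y - X @ w)) / n

def weight_decay_estimate(grad, w):
    """Ratio ||grad E(W_es)|| / ||2 W_es|| from the abstract."""
    return np.linalg.norm(grad) / np.linalg.norm(2.0 * w)

def early_stopped_weights(X_tr, y_tr, X_val, y_val, lr=0.05, max_epochs=2000):
    """Plain gradient descent from small weights; keep the weights with the lowest validation error."""
    rng = np.random.default_rng(0)
    w = 1e-3 * rng.standard_normal(X_tr.shape[1])
    best_w, best_val = w.copy(), np.inf
    for _ in range(max_epochs):
        w = w - lr * fit_error_gradient(w, X_tr, y_tr)
        val = np.mean((y_val - X_val @ w) ** 2)
        if val < best_val:
            best_val, best_w = val, w.copy()
    return best_w

# Hypothetical noisy regression data (illustrative only).
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + 0.5 * rng.standard_normal(60)
X_tr, y_tr, X_val, y_val = X[:40], y[:40], X[40:], y[40:]

w_es = early_stopped_weights(X_tr, y_tr, X_val, y_val)
lam_hat = weight_decay_estimate(fit_error_gradient(w_es, X_tr, y_tr), w_es)
print(f"estimated weight decay parameter: {lam_hat:.4g}")
```

The abstract describes neural networks and does not say which data the gradient is evaluated on; this sketch uses the training portion only, which is a simplification.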

### Citations

1212 |
Pattern Recognition and Neural Networks
- Ripley
- 1996
Citation Context ...ke a judgement on where the optimal $\lambda$ is, but this adds an undesired subjectiveness to the choice. Another is to take a weighted average over the different $\lambda$ values, which is what we use here (see Ripley [19] for a discussion on variants of $\lambda$ selection methods). Our estimate for the optimal $\lambda$ is the value $\lambda_{opt} = \sum_{k=1}^{K} n_k \lambda_k \big/ \sum_{k=1}^{K} n_k$ (12) where $n_k$ is the number of times $\lambda_k$ corresponds to the minimum valida... |

945 |
Solution of Ill-posed Problems
- Tikhonov, Arsenin
- 1977
Citation Context ...oduce solutions that are significantly superior to committees of networks trained with early stopping. 1 Introduction A regression problem which does not put constraints on the model used is ill-posed [21], because there are infinitely many functions that can fit a finite set of training data perfectly. Furthermore, real life data sets tend to have noisy inputs and/or outputs, which is why models that ... |

928 |
Approximation by superposition of a sigmoidal function
- Cybenko
- 1989
Citation Context ...layer perceptron that learns the training data perfectly by using many internal units, since any continuous function can be constructed with a single hidden layer network with sigmoid units (see e.g. [6]), and we may be happy with any solution and ignore questions on uniqueness. However, a network that has learned the training data perfectly will be very sensitive to changes in the training data. Ful... |

730 | A direct adaptive method for faster back-propagation learning: The RPROP algorithm
- Riedmiller, Braun
- 1993
Citation Context ...egin by estimating $\lambda$ in the "traditional" way by searching over the region $\log \lambda \in [-6.5, 1.0]$ in steps of $\Delta \log \lambda = 0.5$. For each $\lambda$ value, we train 10 networks using the Rprop training algorithm³ [18]. Each network is trained until the total error (7) is minimized, measured by $\log \left[ \frac{1}{100} \sum_{i=1}^{100} \frac{|\Delta E_i|}{\|\Delta W_i\|} \right] < -5$, (23) where the sum runs over the most recent 100 epochs, or u... |

646 |
Neural networks and the bias/variance dilemma
- Geman, Bienenstock, et al.
- 1992
Citation Context ...ce The benefit of regularization is often described in the context of model bias and model variance. This originates from the separation of the expected generalization error $\langle E_{gen} \rangle$ into three terms [8] [Footnote 1: Called "stabilizers" by Tikhonov [21].] $\langle E_{gen} \rangle = \langle \int [y(x) - f(x)]^2 p(x)\,dx \rangle = \int [\phi(x) - \langle f(x) \rangle]^2 p(x)\,dx + \langle \int [f(x) - \langle f(x) \rangle]^2 p(x)\,dx \rangle + \langle \int [y(x) - \phi(x)]^2 p(x)\,dx \rangle$ = Bias... |

582 |
Ridge regression: biased estimation for nonorthogonal problems
- Hoerl, Kennard
- 1970
Citation Context ...or the model prior p(f), and selecting a value for $\lambda$ corresponds to estimating the parameters for the prior. 2.5 Weight Decay Weight decay [17] is the neural network equivalent to the Ridge Regression [11] method. In this case $R(W) = \|W\|^2 = \sum_k w_k^2$ and the error functional is $E(W) = E_0(W) + \lambda R(W) = \frac{1}{2N} \sum_{n=1}^{N} [y(n) - f(W; x(n))]^2 + \lambda \|W\|^2$, (7) and $\lambda$ is usually referred to as the weight ... |

557 |
The advanced theory of statistics
- Kendall, Stuart, et al.
- 1983
Citation Context ... use all the training data. Figure 3 shows the histograms corresponding to the problem presented in Figure 2. When comparing test errors achieved with different methods, we use the Wilcoxon rank test [13], also called the Mann-Whitney test, and report differences at 95% confidence level. [Figure: histogram "Weight decay, Wolf sunspots (12-8-1)"; x-axis "Estimated log10(lambda)"; y-axis "Entries / 0.1 bin"; 1:st e...] |

501 |
Nonlinear Time Series: A Dynamic System Approach
- Tong
- 1990
Citation Context ...rs of internal units are tried on this task: 15, 10, and 5, and we refer to these experiments as B1, B2, and B3 below. Predicting daily riverflow in two Icelandic rivers. This problem is tabulated in [22], and the task is to model tomorrow's average flow of water in one of two Icelandic rivers, knowing today's and previous days' waterflow, temperature, and precipitation. The training set consists of 7... |

332 | Regularization theory and neural networks architectures
- Girosi, Jones, et al.
- 1995
Citation Context ...to the error measure to avoid overfitting. This includes e.g. weight decay [17], weight elimination [26], soft weight sharing [15], Laplacian weight decay [12] [27], and smoothness regularization [2] [9] [14]. Certain forms of "hints" [1] can also be called regularization. 2.3 Bias and Variance The benefit of regularization is often described in the context of model bias and model variance. This ori... |

308 | When Networks Disagree: Ensemble Methods for Hybrid
- Perrone, Cooper
- 1993
Citation Context ...od estimate when tested empirically. We also demonstrate in this paper that the arduous process of selecting $\lambda$ can be rewarding compared to simpler methods, like e.g. combining networks into committees [16]. The paper is organized as follows: In section 2 we present the background of how and why weight decay or early stopping should be used. In section 3 we review the standard method for selecting $\lambda$ and a... |

127 |
Bayesian back-propagation
- Buntine, Weigend
- 1991
Citation Context ... 2.4 Bayesian Framework From a Bayesian and maximum likelihood perspective, prior information about the model (f) is weighed against the likelihood of the training data (D) through Bayes' theorem (see [4] for a discussion on this). Denote the probability for observing data set D by p(D), the prior distribution of models f by p(f), and the likelihood for observing the data D, if f is the correct model... |

126 |
Simplifying neural networks by soft weight sharing
- Nowlan, Hinton
- 1992
Citation Context ...ization" encompasses all techniques which make use of penalty terms added to the error measure to avoid overfitting. This includes e.g. weight decay [17], weight elimination [26], soft weight sharing [15], Laplacian weight decay [12] [27], and smoothness regularization [2] [9] [14]. Certain forms of "hints" [1] can also be called regularization. 2.3 Bias and Variance The benefit of regularization is ... |

89 | Bayesian regularization and pruning using a Laplace prior - Williams - 1995 |

84 | Experiments on learning by back-propagation
- Plaut, Nowlan, et al.
- 1986
Citation Context ... on criteria which include other qualities besides their fit to the training data. In the neural network community the two most common methods to avoid overfitting are early stopping and weight decay [17]. Early stopping has the advantage of being quick, since it shortens the training time, but the disadvantage of being poorly defined and not making full use of the available data. Weight decay, on the... |

48 |
Backpropagation, weight-elimination and time series prediction, in Connectionist Models
- Weigend, Rumelhart, et al.
- 1991
Citation Context ...ng data. The term "regularization" encompasses all techniques which make use of penalty terms added to the error measure to avoid overfitting. This includes e.g. weight decay [17], weight elimination [26], soft weight sharing [15], Laplacian weight decay [12] [27], and smoothness regularization [2] [9] [14]. Certain forms of "hints" [1] can also be called regularization. 2.3 Bias and Variance The ben... |

40 | Curvature-driven smoothing: a learning algorithm for feedforward networks
- Bishop
- 1993
Citation Context ...ded to the error measure to avoid overfitting. This includes e.g. weight decay [17], weight elimination [26], soft weight sharing [15], Laplacian weight decay [12] [27], and smoothness regularization [2] [9] [14]. Certain forms of "hints" [1] can also be called regularization. 2.3 Bias and Variance The benefit of regularization is often described in the context of model bias and model variance. This... |

33 | Selecting Neural Network Architectures via the Prediction Risk: Application to Corporate Bond Rating Prediction
- Utans, Moody
- 1991
Citation Context ...ach is to try several values of $\lambda$ and estimate the out-of-sample error, either by correcting the training error, with some factor or term, or by using cross-validation. The former is done in e.g. [10], [23], and [24] (see also references therein). The latter is done by e.g. [25]. The method of using validation data for estimating the out-of-sample error is robust but slow since it requires training seve... |

24 | Soft classification, a.k.a. risk estimation, via penalized log likelihood and smoothing spline analysis of variance
- Wahba, Gu, et al.
- 1993
Citation Context ...try several values of $\lambda$ and estimate the out-of-sample error, either by correcting the training error, with some factor or term, or by using cross-validation. The former is done in e.g. [10], [23], and [24] (see also references therein). The latter is done by e.g. [25]. The method of using validation data for estimating the out-of-sample error is robust but slow since it requires training several models... |

22 | Overtraining, Regularization, and Searching for Minimum with Application to Neural Nets
- Sjoberg, Ljung
- 1995
Citation Context ...nection between early stopping and weight decay, if learning starts from small weights, since weight decay applies a potential which forces all weights towards zero. For instance, Sjöberg and Ljung [20] show that, if a constant learning rate $\eta$ is used, the number of iterations $n$ at which training is stopped is related to the weight decay parameter $\lambda$ roughly as $\lambda \approx \frac{1}{2 \eta n}$. (9) This does not, however, me... |

22 |
A completely automatic French curve
- Wahba, Wold
- 1975
Citation Context ...ause much time is spent with selecting a suitable value for the weight decay parameter ($\lambda$), by searching over several values of $\lambda$ and estimating the out-of-sample performance using e.g. cross validation [25]. In this paper, we present a very simple method for estimating the weight decay parameter, for the standard weight decay case. This method combines early stopping with weight decay, thus merging the ... |

10 | Smoothing regularizers for projective basis function networks
- Moody, Rögnvaldsson
- 1997
Citation Context ...he error measure to avoid overfitting. This includes e.g. weight decay [17], weight elimination [26], soft weight sharing [15], Laplacian weight decay [12] [27], and smoothness regularization [2] [9] [14]. Certain forms of "hints" [1] can also be called regularization. 2.3 Bias and Variance The benefit of regularization is often described in the context of model bias and model variance. This originat... |

8 |
On Bayesian model selection
- Cheeseman
- 1995
Citation Context ...section to estimate the weight decay parameter $\lambda$. 3 Estimating $\lambda$ From a pure Bayesian point of view, the prior is something we know/assume in advance and do not use the training data to select (see e.g. [5]). There is consequently no such thing as "$\lambda$ selection" in the pure Bayesian model selection scheme. This is of course perfectly fine if the prior is correct. However, if we suspect that our choice of ... |

8 | Adaptive regularization
- Hansen, Rasmussen, et al.
- 1994
Citation Context ... approach is to try several values of $\lambda$ and estimate the out-of-sample error, either by correcting the training error, with some factor or term, or by using cross-validation. The former is done in e.g. [10], [23], and [24] (see also references therein). The latter is done by e.g. [25]. The method of using validation data for estimating the out-of-sample error is robust but slow since it requires trainin... |

8 |
A structural learning algorithm with forgetting of link weights, Electrotechnical
- Ishikawa
- 1990
Citation Context ...niques which make use of penalty terms added to the error measure to avoid overfitting. This includes e.g. weight decay [17], weight elimination [26], soft weight sharing [15], Laplacian weight decay [12] [27], and smoothness regularization [2] [9] [14]. Certain forms of "hints" [1] can also be called regularization. 2.3 Bias and Variance The benefit of regularization is often described in the contex... |

5 |
A comparison of the Forecasting Accuracy of Neural Networks with Other Established
- Brace, Schmidt, et al.
- 1991
Citation Context ...ar output). Predicting Puget Sound Power and Light Co. power load between 7 and 8 a.m. the following day. This data set is taken from the Puget Sound Power and Light Co.'s power prediction competition [3]. The winner of this competition used a set of linear models, one for each hour of the day. We have selected the subproblem of predicting the load between 7 and 8 a.m. 24 hrs. in advance. This hour sh... |

1 |
Construction of the Puget Sound forecasting model
- Engle, Clive, et al.
- 1991
Citation Context ...rks, using $\lambda_{opt}$, are significantly better than what a human expert produces, and also significantly better than the results by the winner of the Puget Sound Power and Light Co. Power Load Competition [7], although the difference is small. The test results are summarized in Figure 5. The performance of the sunspot D1 weight decay [Table 2. Relative performance of single networks trained using the estim...] |
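The weighted average quoted in the Ripley citation context (equation 12) can be sketched as follows: each candidate weight decay value is weighted by the number of times it gave the minimum validation error. The function name `weighted_lambda` and the example grid and counts are hypothetical and are not taken from the paper.

```python
import numpy as np

def weighted_lambda(lambdas, counts):
    """Weighted average sum_k n_k * lambda_k / sum_k n_k over candidate values."""
    lambdas = np.asarray(lambdas, dtype=float)
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total <= 0:
        raise ValueError("at least one candidate must have a positive count")
    return float(np.sum(counts * lambdas) / total)

# Hypothetical candidates and the number of runs in which each one was best.
candidate_lambdas = [0.01, 0.1, 1.0]
times_best = [2, 5, 3]
lam_opt = weighted_lambda(candidate_lambdas, times_best)
print(lam_opt)
```

With these made-up counts the result is (2·0.01 + 5·0.1 + 3·1.0) / 10 = 0.352.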