## Automatic Early Stopping Using Cross Validation: Quantifying the Criteria (1997)

### Download Links

- [wwwipd.ira.uka.de]
- [page.mi.fu-berlin.de]
- DBLP

### Other Repositories/Bibliography

- Venue: Neural Networks
- Citations: 38 (0 self)

### BibTeX

@ARTICLE{Prechelt97automaticearly,
  author  = {Lutz Prechelt},
  title   = {Automatic Early Stopping Using Cross Validation: Quantifying the Criteria},
  journal = {Neural Networks},
  year    = {1997},
  volume  = {11},
  pages   = {761--767}
}

### Abstract

Cross validation can be used to detect when overfitting starts during supervised training of a neural network; training is then stopped before convergence to avoid the overfitting ("early stopping"). However, the exact criterion used for cross-validation-based early stopping is chosen in an ad hoc fashion by most researchers, or training is stopped interactively. To support a better-founded selection of the stopping criterion, 14 different automatic stopping criteria from 3 classes were evaluated empirically for their efficiency and effectiveness on 12 different classification and approximation tasks using multilayer perceptrons with RPROP training. The experiments show that, on average, slower stopping criteria allow for small improvements in generalization (on the order of 4%) but cost about a factor of 4 in additional training time. 1 Training for generalization When training a neural network, one is usually interested in obtaining a network with optimal generalization performance. Genera...
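The abstract's idea can be sketched as a training loop that monitors the validation error and stops once the generalization loss, i.e. the relative increase of the current validation error over the best validation error seen so far, exceeds a threshold alpha. This follows the general shape of the paper's GL-style criteria; the function names and the simulated error sequence are illustrative, not from the paper.

```python
def generalization_loss(val_errors):
    """GL(t) in percent: relative increase of the current validation
    error over the lowest validation error observed so far."""
    e_opt = min(val_errors)   # best validation error up to now
    e_now = val_errors[-1]    # validation error after the current epoch
    return 100.0 * (e_now / e_opt - 1.0)

def train_with_early_stopping(train_step, val_error, max_epochs=1000, alpha=5.0):
    """Run training epochs until GL(t) exceeds alpha (a GL_alpha-style
    criterion) or the epoch budget is exhausted. Returns the number of
    epochs run and the best validation error seen."""
    val_errors = []
    for epoch in range(max_epochs):
        train_step()                      # one epoch of supervised training
        val_errors.append(val_error())    # evaluate on the validation set
        if generalization_loss(val_errors) > alpha:
            break                         # overfitting detected: stop early
    return epoch + 1, min(val_errors)
```

For example, with a validation-error curve that first falls and then rises, the loop stops at the first epoch whose error exceeds the best-so-far error by more than alpha percent, rather than training to convergence.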

### Citations

737 | A direct adaptive method for faster backpropagation learning: The RPROP algorithm
- Riedmiller, Braun
- 1993

Citation Context: ...multaneously, i.e., each single training run returned one result for each of the criteria. This approach reduces the variance of the estimation. All runs were done using the RPROP training algorithm (Riedmiller & Braun, 1993) using the squared error function and the parameters η⁺ = 1.1, η⁻ = 0.5, Δ₀ ∈ 0.05…0.2 randomly per weight, Δmax = 50, Δmin = 0, initial weights −0.5…0.5 ran...
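The per-weight step-size adaptation this context refers to can be sketched as follows. This is an illustrative RPROP-style update (sign-based, one update per epoch, without error backtracking), not the authors' code; the default parameter values mirror the ones quoted above.

```python
import numpy as np

def rprop_update(w, grad, prev_grad, step,
                 eta_plus=1.1, eta_minus=0.5, step_max=50.0, step_min=0.0):
    """One RPROP-style weight update. Each weight carries its own step
    size: grow it when the gradient keeps its sign, shrink it when the
    sign flips, then move each weight against its gradient sign."""
    sign_change = grad * prev_grad
    # gradient kept its sign: accelerate, capped at step_max
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    # gradient flipped sign: we overshot a minimum, so decelerate
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    # after a sign flip, suppress the update for this weight this epoch
    eff_grad = np.where(sign_change < 0, 0.0, grad)
    w = w - np.sign(eff_grad) * step
    return w, eff_grad, step
```

Because only the sign of the gradient is used, the step sizes, not the gradient magnitudes, determine how far each weight moves, which is why RPROP needs epoch learning.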

715 | The cascade-correlation learning architecture
- Fahlman, Lebiere
- 1990

Citation Context: ...network. The corresponding techniques used in neural network training to reduce the number of parameters, i.e., the number of dimensions of the parameter space, are greedy constructive learning (e.g. Fahlman & Lebiere, 1990), pruning (e.g. Le Cun, Denker & Solla, 1990; Hassibi & Stork, 1992; Levin, Leen & Moody, 1994), or weight sharing (e.g. Nowlan & Hinton, 1992). The corresponding NN techniques for reducing the size ...

650 | Neural Networks and the bias/variance dilemma - Geman, Bienenstock, et al.

449 | Optimal brain damage - LeCun - 1990

241 | An empirical study of learning speed in back propagation networks
- Fahlman
- 1988

Citation Context: ..., Δ₀ ∈ 0.05…0.2 randomly per weight, Δmax = 50, Δmin = 0, initial weights −0.5…0.5 randomly. RPROP is a fast backpropagation variant similar in spirit to quickprop (Fahlman, 1988). It is about as fast as quickprop but more stable without adjustment of the parameters. RPROP requires epoch learning, i.e., the weights are updated only once per epoch. Therefore, the algorithm is ...

212 | Pruning algorithms - a survey
- Reed
- 1993

Citation Context: ...ach parameter dimension are regularization such as weight decay (e.g. Krogh & Hertz, 1992) and others (e.g. Weigend, Rumelhart & Huberman, 1991) or early stopping (Morgan & Bourlard, 1990). See also (Reed, 1993; Fiesler, 1994) for an overview and (Finnoff, Hergert & Zimmermann, 1993) for an experimental comparison. Early stopping is widely used because it is simple to understand and implement and has been r...

184 | Second order derivatives for network pruning: Optimal brain surgeon - Hassibi, Stork, et al. - 1993

126 | Simplifying neural networks by soft weight-sharing
- Nowlan, Hinton
- 1992

Citation Context: ...arameter space, are greedy constructive learning (e.g. Fahlman & Lebiere, 1990), pruning (e.g. Le Cun, Denker & Solla, 1990; Hassibi & Stork, 1992; Levin, Leen & Moody, 1994), or weight sharing (e.g. Nowlan & Hinton, 1992). The corresponding NN techniques for reducing the size of each parameter dimension are regularization such as weight decay (e.g. Krogh & Hertz, 1992) and others (e.g. Weigend, Rumelhart & Huberman, ...

106 | PROBEN1 - A Set of Benchmarks and Benchmarking Rules for Neural Network Training Algorithms
- Prechelt
- 1994

Citation Context: ...r to obtain pure stopping criteria results. In a real application this would be a waste of training data and should be changed. 12 different problems were used, all from the Proben1 NN benchmark set (Prechelt, 1994). These problems form a sample of a quite broad class of domains, but are of course not universally representative of learning; see (Prechelt, 1994) for a discussion of how to characterize the Proben...

93 | A simple weight decay can improve generalization
- Krogh, Hertz
- 1992

Citation Context: ...n, Leen & Moody, 1994), or weight sharing (e.g. Nowlan & Hinton, 1992). The corresponding NN techniques for reducing the size of each parameter dimension are regularization such as weight decay (e.g. Krogh & Hertz, 1992) and others (e.g. Weigend, Rumelhart & Huberman, 1991) or early stopping (Morgan & Bourlard, 1990). See also (Reed, 1993; Fiesler, 1994) for an overview and (Finnoff, Hergert & Zimmermann, 1993) for ...
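The weight decay this context mentions as an alternative to early stopping amounts to shrinking every weight toward zero in proportion to its size at each gradient step. A minimal illustrative sketch (names and values are not from any of the cited papers):

```python
import numpy as np

def sgd_step_weight_decay(w, grad, lr=0.1, decay=1e-4):
    """One gradient step with L2 weight decay: the decay term pulls
    each weight toward zero, penalizing large weights and thereby
    limiting the effective network complexity."""
    return w - lr * (grad + decay * w)
```

With a zero gradient, repeated steps simply shrink the weights geometrically, which is the regularizing effect early stopping is often said to approximate.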

62 | Improving model selection by nonconvergent methods - Finnoff, Hergert, et al. - 1993

38 | Generalization and parameter estimation in feedforward nets: Some experiments
- Morgan, Bourlard
- 1990

Citation Context: ...echniques for reducing the size of each parameter dimension are regularization such as weight decay (e.g. Krogh & Hertz, 1992) and others (e.g. Weigend, Rumelhart & Huberman, 1991) or early stopping (Morgan & Bourlard, 1990). See also (Reed, 1993; Fiesler, 1994) for an overview and (Finnoff, Hergert & Zimmermann, 1993) for an experimental comparison. Early stopping is widely used because it is simple to understand and i...

30 | Fast pruning using principal components - Levin, Leen, et al. - 1994

28 | Comparative bibliography of ontogenic neural networks
- Fiesler
- 1994

Citation Context: ...r dimension are regularization such as weight decay (e.g. Krogh & Hertz, 1992) and others (e.g. Weigend, Rumelhart & Huberman, 1991) or early stopping (Morgan & Bourlard, 1990). See also (Reed, 1993; Fiesler, 1994) for an overview and (Finnoff, Hergert & Zimmermann, 1993) for an experimental comparison. Early stopping is widely used because it is simple to understand and implement and has been reported to be s...

17 | Optimal stopping and effective machine complexity in learning - Wang, Venkatesh, et al. - 1993

16 | Temporal evolution of generalization during learning in linear networks, Neural Computation 3
- Baldi, Chauvin
- 1991

Citation Context: ...t set (or in real use), assuming that the error on both will be similar. However, the real situation is a lot more complex. Real generalization curves almost always have more than one local minimum. (Baldi & Chauvin, 1991) showed for linear networks with n inputs and n outputs that up to n such local minima are possible; for multilayer networks, the situation is even worse. Thus, it is impossible in general to tell f...
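Because the validation curve can have several local minima, a practical consequence of this context is that an implementation should checkpoint the weights at the lowest validation error seen so far rather than trust the first upturn. A small illustrative helper (not from the paper):

```python
import copy

class BestCheckpoint:
    """Remember the weights with the lowest validation error observed
    so far, so that when training finally stops (possibly several local
    minima later) the best intermediate network can be restored."""

    def __init__(self):
        self.best_error = float("inf")
        self.best_weights = None

    def update(self, weights, val_error):
        """Record a snapshot if this epoch's validation error is a new best."""
        if val_error < self.best_error:
            self.best_error = val_error
            self.best_weights = copy.deepcopy(weights)
        return self.best_error
```

Calling `update` once per epoch keeps exactly one extra copy of the weights, which is the usual cost of cross-validation-based early stopping.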
