## A New Metric-Based Approach to Model Selection (1997)

Venue: Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-97)

Citations: 42 (5 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Schuurmans97anew,
  author    = {Dale Schuurmans},
  title     = {A New Metric-Based Approach to Model Selection},
  booktitle = {Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-97)},
  year      = {1997},
  pages     = {552--558}
}
```



### Abstract

We introduce a new approach to model selection that performs better than the standard complexity-penalization and hold-out error estimation techniques in many cases. The basic idea is to exploit the intrinsic metric structure of a hypothesis space, as determined by the natural distribution of unlabeled training patterns, and to use this metric as a reference to detect whether the empirical error estimates derived from a small (labeled) training sample can be trusted in the region around an empirically optimal hypothesis. Using simple metric intuitions we develop new geometric strategies for detecting overfitting and performing robust yet responsive model selection in spaces of candidate functions. These new metric-based strategies dramatically outperform previous approaches in experimental studies of classical polynomial curve fitting. Moreover, the technique is simple, efficient, and can be applied to most function learning tasks. The only requirement is access to an auxiliary collection ...
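The abstract's core idea (using distances between hypotheses, computed from unlabeled data, as a sanity check on training-sample error estimates) can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the paper's exact procedure: the function names are hypothetical, the L1 metric is one natural choice, and the stopping test is the triangle-inequality intuition described in the abstract, where the true distances d(h_j, f) and d(h_k, f) must satisfy d(h_j, h_k) <= d(h_j, f) + d(h_k, f), so a violation by the empirical error estimates signals that they cannot be trusted.

```python
import numpy as np

def dist(h1, h2, unlabeled_x):
    """Metric between two hypotheses, estimated from unlabeled data.
    Here: L1 distance under the empirical input distribution (an assumption)."""
    return np.mean(np.abs(h1(unlabeled_x) - h2(unlabeled_x)))

def emp_err(h, x, y):
    """Empirical distance from a hypothesis to the target on the labeled sample."""
    return np.mean(np.abs(h(x) - y))

def tri_select(hypotheses, x, y, unlabeled_x):
    """Scan a nested sequence of fitted hypotheses (e.g. best polynomials of
    degree 0, 1, 2, ...) and stop at the last index k for which the
    triangle inequality
        dist(h_j, h_k) <= emp_err(h_j) + emp_err(h_k)   for all j < k
    still holds for the empirical error estimates. Since the inequality
    must hold for the true distances, a violation suggests the estimates
    around h_k are no longer trustworthy (overfitting)."""
    chosen = 0
    for k in range(1, len(hypotheses)):
        consistent = all(
            dist(hypotheses[j], hypotheses[k], unlabeled_x)
            <= emp_err(hypotheses[j], x, y) + emp_err(hypotheses[k], x, y)
            for j in range(k)
        )
        if not consistent:
            break
        chosen = k
    return chosen
```

For example, one could fit `np.poly1d(np.polyfit(x, y, d))` for degrees d = 0, 1, 2, ... and pass the resulting callables to `tri_select` together with a pool of unlabeled inputs drawn from the same domain distribution.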

### Citations

8980 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ... H1 ⊆ ... into polynomials of degree 0, 1, ..., etc. The motivation for studying this task is that it is a classical, well-studied problem that still attracts a lot of interest (Galarza, Rietman, & Vapnik 1996; Cherkassky, Mulier, & Vapnik 1996; Vapnik 1996). Moreover, polynomials create a difficult model selection problem that has a strong tendency to produce catastrophic overfitting effects (Figure 3). ...

2492 | Bagging predictors
- Breiman
- 1996
Citation Context: ... of minimizing prediction error. One could consider more elaborate strategies that choose hypotheses from outside the sequence; e.g., by averaging several hypotheses together (Opitz & Shavlik 1996; Breiman 1994). However, we will not pursue this idea here.

Percentiles of approximation ratios (last row truncated in the source):

| method | 25   | 50   | 75   | 95   | 100   |
|--------|------|------|------|------|-------|
| TRI    | 1.00 | 1.03 | 1.17 | 1.44 | 2.42  |
| 10CV   | 1.07 | 1.24 | 1.51 | 7.38 | 854.3 |
| SRM    | 1.05 | 1.24 | 1.44 | 4.24 | 58.3  |
| GCV    | 1... |      |      |      |       |

803 | Estimation of Dependencies Based on Empirical Data
- Vapnik
- 1979
Citation Context: ...ory which supports this intuition by saying that for h to be reliably near the best function in H we require a training sample size that is proportional to the "complexity" of the hypothesis class H (Vapnik 1982; Pollard 1984; Haussler 1992). This suggests that we must restrict the complexity of our hypothesis class somehow. Of course, this can introduce the opposite problem of underfitting. That is, we migh...

752 | A study of cross-validation and bootstrap for accuracy estimation and model selection
- Kohavi
- 1995
Citation Context: ...nd structural risk minimization SRM (Vapnik 1996) (under the formulations reported in (Cherkassky, Mulier, & Vapnik 1996)), and 10-fold cross validation 10CV, a standard hold-out method (Efron 1979; Kohavi 1995). We conducted a simple series of experiments by fixing a uniform domain distribution P_X on the unit interval [0, 1], and then fixing various target functions f : [0, 1] → IR. To generate training s...

567 | Convergence of Stochastic Processes
- Pollard
- 1984
Citation Context: ...ports this intuition by saying that for h to be reliably near the best function in H we require a training sample size that is proportional to the "complexity" of the hypothesis class H (Vapnik 1982; Pollard 1984; Haussler 1992). This suggests that we must restrict the complexity of our hypothesis class somehow. Of course, this can introduce the opposite problem of underfitting. That is, we might restrict H s...

423 | Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation
- Craven, Wahba
- 1979
Citation Context: ...roach, including the minimum description length principle (Rissanen 1986), "Bayesian" maximum a posteriori selection, structural risk minimization (Vapnik 1982; 1996), "generalized" cross validation (Craven & Wahba 1979) (different from real cross validation; below), and regularization (Moody 1992). These strategies differ in the specific complexity values they assign and the particular tradeoff function they optimi...

372 | Decision theoretic generalizations of the PAC model for neural net and other learning applications
- Haussler
- 1992
Citation Context: ...uition by saying that for h to be reliably near the best function in H we require a training sample size that is proportional to the "complexity" of the hypothesis class H (Vapnik 1982; Pollard 1984; Haussler 1992). This suggests that we must restrict the complexity of our hypothesis class somehow. Of course, this can introduce the opposite problem of underfitting. That is, we might restrict H so severely as t...

249 | Stochastic complexity and modeling
- Rissanen
- 1986
Citation Context: ...ior combination of complexity and empirical error (e.g., the additive combination c_i + err(h_i)). There are many variants of this basic approach, including the minimum description length principle (Rissanen 1986), "Bayesian" maximum a posteriori selection, structural risk minimization (Vapnik 1982; 1996), "generalized" cross validation (Craven & Wahba 1979) (different from real cross validation; below), and ...

169 | The Effective Number of Parameters: An Analysis of generalization and regularization in nonlinear learning systems
- Moody
- 1992
Citation Context: ...ximum a posteriori selection, structural risk minimization (Vapnik 1982; 1996), "generalized" cross validation (Craven & Wahba 1979) (different from real cross validation; below), and regularization (Moody 1992). These strategies differ in the specific complexity values they assign and the particular tradeoff function they optimize, but the basic idea is the same. The other most common strategy is hold-out ...

146 | A Conservation Law for Generalization Performance
- Schaffer
- 1994
Citation Context: ...ethods implicitly take some of this information into account, but do so indirectly and less effectively than the metric-based strategies introduced here. Although there is no "free lunch" in general (Schaffer 1994) and we cannot claim to obtain a universal improvement for every model selection problem (Schaffer 1993), we claim that one should be able to exploit additional information about the task (here knowl...

109 | An experimental and theoretical comparison of model selection methods
- Kearns, Mansour, et al.
- 1995
Citation Context: ...istances greater than 1). Nevertheless, applying our techniques to classification tasks is another important direction for future research. Here we hope to compare our results with the earlier study (Kearns et al. 1995). Acknowledgements: Much of this work was performed at the National Research Council Canada. I would like to thank Rob Holte, Joel Martin and Peter Turney for their help in developing the nascent idea...

105 | Generating accurate and diverse members of a neural-network ensemble
- Opitz, Shavlik
- 1996
Citation Context: ... machine learning goal of minimizing prediction error. One could consider more elaborate strategies that choose hypotheses from outside the sequence; e.g., by averaging several hypotheses together (Opitz & Shavlik 1996; Breiman 1994). However, we will not pursue this idea here.

Percentiles of approximation ratios (last row truncated in the source):

| method | 25   | 50   | 75   | 95   | 100   |
|--------|------|------|------|------|-------|
| TRI    | 1.00 | 1.03 | 1.17 | 1.44 | 2.42  |
| 10CV   | 1.07 | 1.24 | 1.51 | 7.38 | 854.3 |
| SRM    | 1.05 | 1.24 | 1.44 | 4... |       |

95 | Neural networks and the bias/variance dilemma. Neural Comp - Geman, Bienenstock, et al. - 1992

19 | A comparison of scientific and engineering criteria for Bayesian model selection
- Heckerman, Chickering
- 1996
Citation Context: ...erested in finding a simple model of the underlying phenomenon that gives some insight into its fundamental nature, rather than simply producing a function that predicts well on future test examples (Heckerman & Chickering 1996). However, we will focus on the traditional machine learning goal of minimizing prediction error. One could consider more elaborate strategies that choose hypotheses from outside the sequence; e.g....

15 | Characterizing the generalization performance of model selection strategies - Schuurmans, Ungar - 1997

10 | Computers and the Theory of Statistics
- Efron
- 1979
Citation Context: ...ith repeating the pseudo-train/pseudo-test split many times and averaging the results to choose the final hypothesis class; e.g., 10-fold cross validation, leave-one-out testing, bootstrapping, etc. (Efron 1979; Weiss & Kulikowski 1991). Repeated testing in this manner does introduce some bias in the error estimates, but the results are still generally better than considering a single hold-out partition (We...

7 | Comparison of VC method with classical methods for model selection - Cherkassky, Mulier, et al. - 1997

4 | Overfitting avoidance as
- Schaffer
- 1993
Citation Context: ...than the metric-based strategies introduced here. Although there is no "free lunch" in general (Schaffer 1994) and we cannot claim to obtain a universal improvement for every model selection problem (Schaffer 1993), we claim that one should be able to exploit additional information about the task (here knowledge of P_X) to obtain significant improvements across a wide range of problem types and conditions. Ou...

3 | Applications of model selection techniques to polynomial approximation - Galarza, Rietman, et al. - 1996