## Characterizing the Generalization Performance of Model Selection Strategies (1997)

### Download Links

- [linc2.cis.upenn.edu]
- [www.cis.upenn.edu]
- [cs.ualberta.ca]
- [www.cs.ualberta.ca]
- [papersdb.cs.ualberta.ca]
- DBLP

### Other Repositories/Bibliography

Venue: ICML-97

Citations: 15 (4 self)

### BibTeX

    @INPROCEEDINGS{Schuurmans97characterizingthe,
      author    = {Dale Schuurmans and Lyle H. Ungar and Dean P. Foster},
      title     = {Characterizing the Generalization Performance of Model Selection Strategies},
      booktitle = {ICML-97},
      year      = {1997},
      pages     = {340--348},
      publisher = {Morgan Kaufmann}
    }

### Abstract

We investigate the structure of model selection problems via the bias/variance decomposition. In particular, we characterize the essential structure of a model selection task by the bias and variance profiles it generates over the sequence of hypothesis classes. This leads to a new understanding of complexity-penalization methods: First, the penalty terms in effect postulate a particular profile for the variances as a function of model complexity; if the postulated and true profiles do not match, then systematic under-fitting or over-fitting results, depending on whether the penalty terms are too large or too small. Second, it is usually best to penalize according to the true variances of the task, and therefore no fixed penalization strategy is optimal across all problems. We then use this bias/variance characterization to identify the notion of easy and hard model selection problems. In particular, we show that if the variance profile grows too rapidly in relation to the biases t...
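The complexity-penalization scheme the abstract refers to selects the hypothesis class minimizing the additive combination c_i + err(h_i; S). A minimal sketch of that selection rule; the error and penalty profiles below are invented for illustration and are not data from the paper:

```python
# Sketch of additive complexity penalization: choose the hypothesis class
# h_i minimizing c_i + err(h_i; S).  The error and penalty profiles below
# are invented for illustration; they are not data from the paper.

def select_by_penalty(empirical_errors, penalties):
    """Return the index i minimizing penalties[i] + empirical_errors[i]."""
    scores = [c + e for c, e in zip(penalties, empirical_errors)]
    return min(range(len(scores)), key=scores.__getitem__)

emp_err = [0.30, 0.18, 0.12, 0.10, 0.09]   # err(h_i; S): falls with complexity
penalty = [0.00, 0.04, 0.08, 0.12, 0.16]   # c_i: grows with complexity

print(select_by_penalty(emp_err, penalty))  # prints 2: the middle class wins
```

If the postulated penalty profile is too steep or too shallow relative to the true variances, this argmin shifts systematically toward under- or over-fitting, which is the paper's central observation.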

### Citations

8980 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ...ve combination c_i + err(h_i; S)). There are many variants of this basic approach, including generalized cross validation [2], minimum description length principle [10], structural risk minimization [12, 13], "Bayesian" maximum a posteriori selection, and regularization [8]. These strategies differ in the specific complexity values they assign and the particular tradeoff function they optimize, but the b...

1491 | Probability inequalities for sums of bounded random variables
- Hoeffding
- 1963
Citation Context: ...n strategies: (1) if the penalization profile is much 4 This argument could be formalized into precise quantitative statements, for example, by an elementary application of Hoeffding-Chernoff bounds (Hoeffding 1963), but we do not pursue this here. The intuition is clear in any case. [Figure showing hypothesis classes h_i, distribution P_XY, optimal hypotheses h_opt_i, sample S, and their errors; caption truncated]
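The Hoeffding bound that the quoted footnote appeals to states that for n i.i.d. observations bounded in [0, 1], the empirical mean deviates from its expectation by more than eps with probability at most 2·exp(−2n·eps²). A quick numeric sketch (the sample sizes here are arbitrary choices, not from the paper):

```python
import math

def hoeffding_bound(n, eps):
    """Two-sided Hoeffding bound for n i.i.d. observations in [0, 1]:
    P(|empirical mean - true mean| >= eps) <= 2 * exp(-2 * n * eps**2)."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2)

# With 1000 training examples, an error estimate is off by more than 0.05
# with probability at most about 1.3%:
print(hoeffding_bound(1000, 0.05))
```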

803 | Estimation of Dependencies Based on Empirical Data
- Vapnik
- 1979
Citation Context: ...ve combination c_i + err(h_i; S)). There are many variants of this basic approach, including generalized cross validation [2], minimum description length principle [10], structural risk minimization [12, 13], "Bayesian" maximum a posteriori selection, and regularization [8]. These strategies differ in the specific complexity values they assign and the particular tradeoff function they optimize, but the b...

752 | A study of cross-validation and bootstrap for accuracy estimation and model selection
- Kohavi
- 1995
Citation Context: ...ootnote 4. Alternative hold-out methods An obvious idea in these situations is to consider alternative hold-out-based methods, like 10-fold cross-validation (10CV) or some other resampling procedure [6, 14]. However, it turns out that these strategies are prone to the very same mistakes suffered by penalty-based methods, as Table 4 clearly demonstrates for 10CV. The strikingly bad performance obtained b...

609 | Neural networks and the bias/variance dilemma
- Geman, Bienenstock, et al.
- 1992
Citation Context: ...pproaches to the problem raises the question of which techniques are best and when. We attempt to answer this question by appealing to the standard bias/variance decomposition of generalization error [4]. In particular, we characterize model selection problems by the bias and variance profiles they generate over the sequence of hypothesis classes. Given this characterization, we address a number of t...

423 | Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation
- Craven, Wahba
- 1979
Citation Context: ...es some prior combination of complexity and empirical error (e.g., the additive combination c_i + err(h_i; S)). There are many variants of this basic approach, including generalized cross validation [2], minimum description length principle [10], structural risk minimization [12, 13], "Bayesian" maximum a posteriori selection, and regularization [8]. These strategies differ in the specific complexit...

365 | Computer Systems that Learn
- Weiss, Kulikowski
- 1995

249 | Stochastic complexity and modeling
- Rissanen
- 1986
Citation Context: ...d empirical error (e.g., the additive combination c_i + err(h_i; S)). There are many variants of this basic approach, including generalized cross validation [2], minimum description length principle [10], structural risk minimization [12, 13], "Bayesian" maximum a posteriori selection, and regularization [8]. These strategies differ in the specific complexity values they assign and the particular tra...

173 | Bias plus variance decomposition for zero-one loss functions
- Kohavi, Wolpert
- 1996

169 | The Effective Number of Parameters: An Analysis of generalization and regularization in nonlinear learning systems
- Moody
- 1992
Citation Context: ...c approach, including generalized cross validation [2], minimum description length principle [10], structural risk minimization [12, 13], "Bayesian" maximum a posteriori selection, and regularization [8]. These strategies differ in the specific complexity values they assign and the particular tradeoff function they optimize, but the basic idea is still the same. The other most common strategy is hold...

157 | Real Analysis and Probability
- Ash
- 1972
Citation Context: ...s (as well as Cauchy sequences). This is sufficient to ensure that H is a closed linear subspace of a Hilbert space defined by the inner product ⟨f, g⟩ ≜ ∫_X (f(x) − g(x))² dP_X; see, e.g., (Ash 1972, Chapter 3). Given these conditions, we can apply the relevant projection theorem to obtain h_opt, and the subsequent analysis becomes a simple consequence of generalized Pythagorean relations. Fort...
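The quantity ∫_X (f(x) − g(x))² dP_X appearing in this context can be estimated by sampling from P_X. A sketch assuming P_X is uniform on [0, 1]; the functions f and g are arbitrary illustrative choices, not from the paper:

```python
import random

def sq_distance(f, g, sample_x, n=20000, seed=0):
    """Monte Carlo estimate of the integral of (f(x) - g(x))^2 under P_X,
    the squared-distance quantity behind the projection argument."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample_x(rng)
        total += (f(x) - g(x)) ** 2
    return total / n

# P_X uniform on [0, 1]; distance between f(x) = x and g(x) = x^2.
# The exact integral of (x - x^2)^2 over [0, 1] is 1/30, about 0.0333.
d = sq_distance(lambda x: x, lambda x: x * x, lambda rng: rng.random())
print(d)
```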

126 | The risk inflation criterion for multiple regression
- Foster, George
- 1994
Citation Context: ...In fact, we tried an entire suite of penalization methods on this task and obtained uniformly poor performance. These included Akaike's AIC, Schwarz's BIC, and Mallows' C_p, among others; see, e.g., (Foster & George 1994; Cherkassky, Mulier, & Vapnik 1996) for a discussion of several such methods. These results lead us to conclude that complexity penalization can be an inherently risky strategy. There seems to be a po...

119 | Overfitting avoidance as bias
- Schaffer
- 1993
Citation Context: ...esult, but it follows from the fact that VAR does not pay explicit attention to the inter-hypothesis distances, and can therefore be fooled from time to time. Of course, we do not expect a free lunch [11] and there are certainly model selection problems where ADJ does not dominate (Table 3). However, the claim is that one should be able to exploit additional information about the task (here knowledge ...

109 | An experimental and theoretical comparison of model selection methods
- Kearns, Mansour, et al.
- 1995
Citation Context: ...to the bias profile, then disaster results for any penalization strategy that does not use the exact variance profile for the task. 8 Note that this is similar to an observation made by Kearns et al. [5] in the context of learning classifications. However, they do not explicitly invoke a bias/variance characterization of model selection problems to explain their results. 9 We note that VAR performed ...

46 | Principled architecture selection for neural networks: Application to corporate bond rating prediction
- Moody, Utans
- 1992
Citation Context: ...hese strategies, let r = i/t be the number of complexity levels being considered per training example. 2 The first penalization strategy we consider is Generalized Cross Validation GCV [2]. Following [9] we can write the adjusted error estimate of this strategy as err_GCV(h_i) = err(h_i; S) + ((2r − r²) / (1 − r)²) · err(h_i; S). (3) The other penalization strategy we consider is Vapnik's S...
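The GCV adjustment in equation (3) multiplies the empirical error by 1/(1 − r)², since err + ((2r − r²)/(1 − r)²)·err = err/(1 − r)². A sketch with illustrative numbers (the choice of 5 complexity levels and 100 examples is an assumption for the example, not from the paper):

```python
def gcv_adjusted_error(emp_err, i, t):
    """GCV-adjusted error estimate (eq. 3): with r = i / t, compute
    err + ((2r - r^2) / (1 - r)^2) * err, which equals err / (1 - r)^2."""
    r = i / t
    return emp_err + (2 * r - r ** 2) / (1 - r) ** 2 * emp_err

# Hypothetical numbers: complexity level 5 over 100 examples, err = 0.10.
print(gcv_adjusted_error(0.10, 5, 100))  # 0.10 / 0.95^2, about 0.1108
```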

42 | A new metric-based approach to model selection
- Schuurmans
- 1997
Citation Context: ...he question of whether it is possible to do better on hard problems, or whether we have to live with the potential of making disastrous mistakes. 5 A new model selection technique In a recent paper (Schuurmans 1997), one of the authors introduces a new strategy for model selection that takes a fundamentally different approach to the problem than previous techniques. This new strategy seems to avoid many of the ...

10 | Computers and the Theory of Statistics
- Efron
- 1979
Citation Context: ...with repeating the pseudo-train/pseudo-test split many times and averaging the results to choose the final hypothesis class; e.g., 10-fold cross validation, leave-one-out testing, bootstrapping, etc. [3, 14]. The abundance of model selection strategies and different approaches to the problem raises the question of which techniques are best and when. We attempt to answer this question by appealing to the ...

7 | Comparison of VC method with classical methods for model selection
- Cherkassky, Mulier, et al.
- 1997
Citation Context: ...err_GCV(h_i) = err(h_i; S) + ((2r − r²) / (1 − r)²) · err(h_i; S). (3) The other penalization strategy we consider is Vapnik's Structural Risk Minimization procedure SRM [13], which following [1] can be formulated as err_SRM(h_i) = err(h_i; S) + (√r̃ / (1 − √r̃))₊ · err(h_i; S), (4) where r̃ = r(1 + ln(1/r)) + (ln t)/2t. For our purposes, the key difference between these two policies...

4 | Overfitting avoidance as bias
- Schaffer
- 1993
Citation Context: ...the reason for VAR's failure is that it does not pay explicit attention to the inter-hypothesis distances, and can therefore sometimes be fooled. Of course, we do not expect a free lunch in general (Schaffer 1993), and there are certainly model selection problems where ADJ does not dominate, e.g., Table 3. However, one should be able to exploit additional information about the task (here knowledge of P_X) to...