Results 11–20 of 63
On Overfitting Avoidance As Bias
 SFI TR, 1993
"... In supervised learning it is commonly believed that penalizing complex functions helps one avoid "overfitting" functions to data, and therefore improves generalization. It is also commonly believed that crossvalidation is an effective way to choose amongst algorithms for fitting functions ..."
Abstract

Cited by 35 (7 self)
In supervised learning it is commonly believed that penalizing complex functions helps one avoid "overfitting" functions to data, and therefore improves generalization. It is also commonly believed that cross-validation is an effective way to choose amongst algorithms for fitting functions to data. In a recent paper, Schaffer (1993) presents experimental evidence disputing these claims. The current paper consists of a formal analysis of these contentions of Schaffer's. It proves that his contentions are valid, although some of his experiments must be interpreted with caution.
A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training/Test Split
 Neural Computation, 1996
"... : We give an analysis of the generalization error of cross validation in terms of two natural measures of the difficulty of the problem under consideration: the approximation rate (the accuracy to which the target function can be ideally approximated as a function of the number of hypothesis paramet ..."
Abstract

Cited by 26 (0 self)
We give an analysis of the generalization error of cross validation in terms of two natural measures of the difficulty of the problem under consideration: the approximation rate (the accuracy to which the target function can be ideally approximated as a function of the number of hypothesis parameters), and the estimation rate (the deviation between the training and generalization errors as a function of the number of hypothesis parameters). The approximation rate captures the complexity of the target function with respect to the hypothesis model, and the estimation rate captures the extent to which the hypothesis model suffers from overfitting. Using these two measures, we give a rigorous and general bound on the error of cross validation. The bound clearly shows the tradeoffs involved in making γ, the fraction of data saved for testing, too large or too small. By optimizing the bound with respect to γ, we then argue (through a combination of formal analysis, plotting, and ...
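The tradeoff governed by the test fraction can be illustrated with a small hold-out experiment. The sketch below is purely illustrative and is not the paper's construction: hypothetical polynomial models of increasing degree are compared on a held-out fraction γ of synthetic data, for several values of γ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: noisy cubic target.
n = 200
x = rng.uniform(-1, 1, n)
y = x**3 - 0.5 * x + rng.normal(0, 0.1, n)

def holdout_select(x, y, gamma, max_degree=10):
    """Hold out a fraction `gamma` of the data for testing, fit polynomial
    models of increasing degree on the rest, and return the degree with
    the lowest held-out squared error."""
    n = len(x)
    n_test = max(1, int(gamma * n))
    idx = rng.permutation(n)
    test, train = idx[:n_test], idx[n_test:]
    best_deg, best_err = None, np.inf
    for d in range(1, max_degree + 1):
        coefs = np.polyfit(x[train], y[train], d)
        err = np.mean((np.polyval(coefs, x[test]) - y[test]) ** 2)
        if err < best_err:
            best_deg, best_err = d, err
    return best_deg, best_err

# A large gamma leaves little data for training (poor approximation);
# a small gamma leaves little data for testing (poor estimation).
for gamma in (0.05, 0.2, 0.5, 0.9):
    deg, err = holdout_select(x, y, gamma)
    print(f"gamma={gamma:.2f}: selected degree {deg}, held-out MSE {err:.4f}")
```

The bound in the paper formalizes exactly this tension and is optimized over γ.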
Asymptotic optimality of likelihood-based cross-validation
 Statistical Applications in Genetics and Molecular Biology, 2003
"... Likelihoodbased crossvalidation is a statistical tool for selecting a density estimate based on n i.i.d. observations from the true density among a collection of candidate density estimators. General examples are the selection of a model indexing a maximum likelihood estimator, and the selection o ..."
Abstract

Cited by 26 (5 self)
Likelihood-based cross-validation is a statistical tool for selecting a density estimate based on n i.i.d. observations from the true density among a collection of candidate density estimators. General examples are the selection of a model indexing a maximum likelihood estimator, and the selection of a bandwidth indexing a nonparametric (e.g. kernel) density estimator. In this article, we establish a finite sample result for a general class of likelihood-based cross-validation procedures (as indexed by the type of sample splitting used, e.g. V-fold cross-validation). This result implies that the cross-validation selector performs asymptotically as well (w.r.t. the Kullback-Leibler distance to the true density) as a benchmark model selector which is optimal for each given dataset and depends on the true density. Crucial conditions of our theorem are that the size of the validation sample converges to infinity, which excludes leave-one-out cross-validation, and that the candidate density estimates are bounded away from zero and infinity. We illustrate these asymptotic results and the practical performance of likelihood-based cross-validation for the purpose of bandwidth selection with a simulation study. Moreover, we use likelihood-based cross-validation in the context of regulatory motif detection in DNA sequences.
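A minimal sketch of likelihood-based V-fold cross-validation for kernel bandwidth selection, the use case discussed above. The Gaussian kernel, the data, and all names are illustrative assumptions, not taken from the paper; the clipping step reflects the requirement that candidate density estimates stay bounded away from zero.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(0, 1, 300)  # i.i.d. sample from the (unknown) true density

def gaussian_kde(train, points, h):
    """Gaussian kernel density estimate at `points` with bandwidth h."""
    diffs = (points[:, None] - train[None, :]) / h
    return np.exp(-0.5 * diffs**2).sum(axis=1) / (len(train) * h * np.sqrt(2 * np.pi))

def vfold_log_likelihood(data, h, V=5):
    """Average held-out log-likelihood of a bandwidth-h KDE under V-fold splitting."""
    folds = np.array_split(rng.permutation(len(data)), V)
    total = 0.0
    for v in range(V):
        test = data[folds[v]]
        train = data[np.concatenate([folds[u] for u in range(V) if u != v])]
        dens = np.clip(gaussian_kde(train, test, h), 1e-300, None)  # bounded away from 0
        total += np.log(dens).sum()
    return total / len(data)

# The cross-validation selector: maximize held-out log-likelihood over bandwidths.
bandwidths = np.linspace(0.05, 1.0, 20)
scores = [vfold_log_likelihood(data, h) for h in bandwidths]
best_h = bandwidths[int(np.argmax(scores))]
print(f"selected bandwidth: {best_h:.3f}")
```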
Towards Perceptual Intelligence: Statistical Modeling of Human Individual and Interactive Behaviors
 Prediction of Human Behavior, IEEE Intelligent Vehicles, 1995
"... This thesis presents a computational framework for the automatic recognition and prediction of different kinds of human behaviors from video cameras and other sensors, via perceptually intelligent systems that automatically sense and correctly classify human behaviors, by means of Machine Perception ..."
Abstract

Cited by 17 (6 self)
This thesis presents a computational framework for the automatic recognition and prediction of different kinds of human behaviors from video cameras and other sensors, via perceptually intelligent systems that automatically sense and correctly classify human behaviors by means of Machine Perception and Machine Learning techniques. In the thesis I develop the statistical machine learning algorithms (dynamic graphical models) necessary for detecting and recognizing individual and interactive behaviors. In the case of interactions, two Hidden Markov Models (HMMs) are coupled in a novel architecture called Coupled Hidden Markov Models (CHMMs) that explicitly captures the interactions between them. The algorithms for learning the parameters from data as well as for doing inference with those models are developed and described. Four systems that experimentally evaluate the proposed paradigm are presented: (1) LAFTER, an automatic face detection and tracking system with facial expression recognition; (2) a Tai-Chi gesture recognition system; (3) a pedestrian surveillance system that recognizes typical human-to-human interactions; and (4) a SmartCar for driver maneuver recognition. These systems capture human behaviors of different nature and increasing complexity: first, isolated, single-user facial expressions; then, two-hand gestures and human-to-human interactions, ...
Evaluating Machine Learning Models for Engineering Problems
 Artificial Intelligence in Engineering, 1999
"... : The use of machine learning (ML), and in particular, artificial neural networks (ANN), in engineering applications has increased dramatically over the last years. However, by and large, the development of such applications or their report lack proper evaluation. Deficient evaluation practice was o ..."
Abstract

Cited by 17 (6 self)
The use of machine learning (ML), and in particular artificial neural networks (ANN), in engineering applications has increased dramatically over recent years. However, by and large, the development of such applications, and the reports on them, lack proper evaluation. Deficient evaluation practice was observed in the general neural networks community, and again in engineering applications, through a survey we conducted of articles published in AI in Engineering and elsewhere. This deficient status hinders understanding and prevents progress. This paper's goal is to remedy this situation. First, several evaluation methods are discussed along with their relative qualities. Second, these qualities are illustrated by using the methods to evaluate ANN performance in two engineering problems. Third, a systematic evaluation procedure for ML is discussed. This procedure will lead to better evaluation of studies, and consequently to improved research and practice in the area of ML in engineering applications...
On Overfitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
 Journal of Machine Learning Research, 2010
"... Model selection strategies for machine learning algorithms typically involve the numerical optimisation of an appropriate model selection criterion, often based on an estimator of generalisation performance, such as kfold crossvalidation. The error of such an estimator can be broken down into bia ..."
Abstract

Cited by 17 (1 self)
Model selection strategies for machine learning algorithms typically involve the numerical optimisation of an appropriate model selection criterion, often based on an estimator of generalisation performance, such as k-fold cross-validation. The error of such an estimator can be broken down into bias and variance components. While unbiasedness is often cited as a beneficial quality of a model selection criterion, we demonstrate that a low variance is at least as important, as a non-negligible variance introduces the potential for overfitting in model selection as well as in training the model. While this observation is in hindsight perhaps rather obvious, the degradation in performance due to overfitting the model selection criterion can be surprisingly large, an observation that appears to have received little attention in the machine learning literature to date. In this paper, we show that the effects of this form of overfitting are often of comparable magnitude to differences in performance between learning algorithms, and thus cannot be ignored in empirical evaluation. Furthermore, we show that some common performance evaluation practices are susceptible to a form of selection bias as a result of this form of overfitting and hence are unreliable. We discuss methods to avoid overfitting in model selection and subsequent selection bias in performance evaluation, which we hope will be incorporated into best practice. While this study concentrates on cross-validation based model selection, the findings are quite general and apply to any model selection practice involving the optimisation of a model selection criterion evaluated over a finite sample of data, including maximisation of the Bayesian evidence and optimisation of performance bounds.
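One widely used remedy for the selection bias described above is nested cross-validation: an inner loop optimizes the model selection criterion, while an outer loop estimates the performance of the complete procedure, selection included. The sketch below is a toy illustration under assumed details (a regularized least-squares classifier on synthetic data), not the paper's own experimental setup.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy binary classification data.
n, p = 120, 10
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, n) > 0).astype(int)

def ridge_classifier_error(X_tr, y_tr, X_te, y_te, lam):
    """Fit a regularized least-squares classifier; return test error rate."""
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(X_tr.shape[1]),
                        X_tr.T @ (2 * y_tr - 1))
    return np.mean((X_te @ w > 0).astype(int) != y_te)

def kfold_indices(n, k):
    return np.array_split(rng.permutation(n), k)

def inner_select(X, y, lambdas, k=5):
    """Inner loop: pick the regularization strength minimizing k-fold CV error."""
    folds = kfold_indices(len(y), k)
    errs = []
    for lam in lambdas:
        e = 0.0
        for v in range(k):
            te = folds[v]
            tr = np.concatenate([folds[u] for u in range(k) if u != v])
            e += ridge_classifier_error(X[tr], y[tr], X[te], y[te], lam)
        errs.append(e / k)
    return lambdas[int(np.argmin(errs))]

# Outer loop: estimate the error of the whole procedure (including selection),
# so the optimized criterion is never reused as the reported performance.
lambdas = [0.01, 0.1, 1.0, 10.0]
outer = kfold_indices(n, 5)
outer_errs = []
for v in range(5):
    te = outer[v]
    tr = np.concatenate([outer[u] for u in range(5) if u != v])
    lam = inner_select(X[tr], y[tr], lambdas)
    outer_errs.append(ridge_classifier_error(X[tr], y[tr], X[te], y[te], lam))
print(f"unbiased error estimate: {np.mean(outer_errs):.3f}")
```

Reporting the inner-loop minimum directly would exhibit exactly the optimistic bias the abstract warns about.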
Model selection using Rademacher Penalization
 In Proceedings of the Second ICSC Symposium on Neural Computation (NC2000), ICSC Academic, 2000
"... In this paper we describe the use of Rademacher penalization for model selection. As in Vapnik's Guaranteed Risk Minimization (GRM), Rademacher penalization attemps to balance the complexity of the model with its t to the data by minimizing the sum of the training error and a penalty term, whic ..."
Abstract

Cited by 15 (0 self)
In this paper we describe the use of Rademacher penalization for model selection. As in Vapnik's Guaranteed Risk Minimization (GRM), Rademacher penalization attempts to balance the complexity of the model with its fit to the data by minimizing the sum of the training error and a penalty term, which is an upper bound on the absolute difference between the training error and the generalization error. However, while the GRM penalty is universal, the computation of the Rademacher penalty is data driven, which means that it depends on the distribution of the data, and hence one can expect better performance for particular instances of learning problems. We present experimental evidence that shows that Rademacher penalization can be used as an effective method of model selection in learning problems. In particular, we have shown that for the intervals model selection problem, Rademacher penalization outperforms GRM and cross validation (CV) over a wide range of sample sizes. Our experiments also sho...
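A data-driven penalty of the kind described can be sketched roughly as follows. This is a Monte Carlo approximation over small, discretized classes of interval-union classifiers; the grid, class sizes, and sign-correlation form of the penalty are illustrative assumptions and not the paper's exact construction.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)

# Toy 1-D data: the label is 1 inside two intervals, observed with 10% label noise.
n = 200
x = rng.uniform(0, 1, n)
y = (((x > 0.2) & (x < 0.4)) | ((x > 0.6) & (x < 0.8))).astype(float)
flip = rng.random(n) < 0.1
y[flip] = 1 - y[flip]

def class_predictions(d):
    """Predictions of every classifier in a discretized class H_d: indicators
    of unions of d cells from a fixed 8-cell grid on [0, 1]."""
    grid = np.linspace(0, 1, 9)
    preds = []
    for combo in combinations(range(8), d):
        mask = np.zeros(n)
        for c in combo:
            mask = np.maximum(mask, ((x >= grid[c]) & (x < grid[c + 1])).astype(float))
        preds.append(mask)
    return np.stack(preds)  # shape (|H_d|, n)

def rademacher_penalty(preds, n_draws=50):
    """Data-driven penalty: Monte Carlo estimate of the empirical Rademacher
    complexity of the finite class `preds` (one row per classifier)."""
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # random Rademacher signs
        total += np.max(np.abs(preds @ sigma)) / n  # sup of signed correlation
    return total / n_draws

# Penalized model selection: minimize training error + penalty over complexity d.
best_d, best_score = None, np.inf
for d in range(1, 5):
    preds = class_predictions(d)
    err = np.min(np.mean(preds != y, axis=1))
    score = err + rademacher_penalty(preds)
    if score < best_score:
        best_d, best_score = d, score
print(f"selected complexity (number of cells): {best_d}")
```

The penalty grows with the richness of H_d, so it counterbalances the monotone decrease of training error in d.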
Performance Prediction for Exponential Language Models
"... We investigate the task of performance prediction for language models belonging to the exponential family. First, we attempt to empirically discover a formula for predicting test set crossentropy for ngram language models. We build models over varying domains, data set sizes, and ngram orders, an ..."
Abstract

Cited by 12 (3 self)
We investigate the task of performance prediction for language models belonging to the exponential family. First, we attempt to empirically discover a formula for predicting test set cross-entropy for n-gram language models. We build models over varying domains, data set sizes, and n-gram orders, and perform linear regression to see whether we can model test set performance as a simple function of training set performance and various model statistics. Remarkably, we find a simple relationship that predicts test set performance with a correlation of 0.9997. We analyze why this relationship holds and show that it holds for other exponential language models as well, including class-based models and minimum discrimination information models. Finally, we discuss how this relationship can be applied to improve language model performance.
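The regression methodology described above can be mimicked on synthetic stand-in numbers. Everything below is fabricated for illustration: the data, the choice of size statistic, and the coefficients are assumptions, and the paper's actual formula is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-in: for each of 40 hypothetical models, record training
# cross-entropy, a model-size statistic, and the resulting test cross-entropy.
m = 40
train_ce = rng.uniform(4.0, 7.0, m)
size_stat = rng.uniform(0.0, 1.0, m)  # e.g. a parameters-per-token statistic
test_ce = train_ce + 0.9 * size_stat + rng.normal(0, 0.02, m)

# Linear regression: predict test cross-entropy from training cross-entropy
# and the size statistic, mirroring the empirical methodology sketched above.
A = np.column_stack([np.ones(m), train_ce, size_stat])
coef, *_ = np.linalg.lstsq(A, test_ce, rcond=None)
pred = A @ coef
corr = np.corrcoef(pred, test_ce)[0, 1]
print(f"fit: test ≈ {coef[0]:.2f} + {coef[1]:.2f}*train + {coef[2]:.2f}*size, r={corr:.4f}")
```

A near-unit coefficient on training cross-entropy plus a positive size term is the shape of relationship such a regression would surface.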
Five principles for studying people’s use of heuristics
 Acta Psychologica Sinica, 2010
"... Abstract: The fast and frugal heuristics framework assumes that people rely on an adaptive toolbox of simple decision strategies—called heuristics—to make inferences, choices, estimations, and other decisions. Each of these heuristics is tuned to regularities in the structure of the task environment ..."
Abstract

Cited by 12 (4 self)
Abstract: The fast and frugal heuristics framework assumes that people rely on an adaptive toolbox of simple decision strategies—called heuristics—to make inferences, choices, estimations, and other decisions. Each of these heuristics is tuned to regularities in the structure of the task environment and each is capable of exploiting the ways in which basic cognitive capacities work. In doing so, heuristics enable adaptive behavior. In this article, we give an overview of the framework and formulate five principles that should guide the study of people’s adaptive toolbox. We emphasize that models of heuristics should be (i) precisely defined; (ii) tested comparatively; (iii) studied in line with theories of strategy selection; (iv) evaluated by how well they predict new data; and (v) tested in the real world in addition to the laboratory.
Key words: fast and frugal heuristics; experimental design; model testing
As we write this article, international financial markets are in turmoil. Large banks are going bankrupt almost daily. It is a difficult situation for financial decision makers — regardless of whether they are lay investors trying to make small-scale profits here and there or professionals employed by the finance industry. To safeguard their investments, these decision makers need to be able to foresee uncertain future economic developments, such as which investments are likely to be the safest and which companies are likely to crash next. In times of rapid waves of potentially devastating financial crashes, these informed bets must often be made quickly, with little time for extensive information search or computationally demanding calculations of likely future returns. Lay stock traders in particular have to trust the contents of their memories, relying on incomplete, imperfect...
VC Theory of Large Margin Multi-Category Classifiers
"... In the context of discriminant analysis, Vapnik’s statistical learning theory has mainly been developed in three directions: the computation of dichotomies with binaryvalued functions, the computation of dichotomies with realvalued functions, and the computation of polytomies with functions taking ..."
Abstract

Cited by 12 (4 self)
In the context of discriminant analysis, Vapnik’s statistical learning theory has mainly been developed in three directions: the computation of dichotomies with binary-valued functions, the computation of dichotomies with real-valued functions, and the computation of polytomies with functions taking their values in finite sets, typically the set of categories itself. The case of classes of vector-valued functions used to compute polytomies has seldom been considered independently, which is unsatisfactory, for three main reasons. First, this case encompasses the other ones. Second, it cannot be treated appropriately through a naïve extension of the results devoted to the computation of dichotomies. Third, most of the classification problems met in practice involve multiple categories. In this paper, a VC theory of large margin multi-category classifiers is introduced. Central in this theory are generalized VC dimensions called the γ-Ψ-dimensions. First, a uniform convergence bound on the risk of the classifiers of interest is derived. The capacity measure involved in this bound is a covering number. This covering number can be upper bounded in terms of the γ-Ψ-dimensions thanks to generalizations of Sauer’s lemma, as is illustrated in the specific case of the scale-sensitive Natarajan dimension. A bound on this latter dimension is then computed for the class of functions on which multi-class SVMs are based. This makes it possible to apply the structural risk minimization inductive principle to those machines.