Results 1-10 of 21
Improving predictive inference under covariate shift by weighting the log-likelihood function
 JOURNAL OF STATISTICAL PLANNING AND INFERENCE
, 2000
Akaike’s information criterion and recent developments in information complexity
 Journal of Mathematical Psychology
Cited by 67 (5 self)
criterion (AIC). Then, we present some recent developments on a new entropic or information complexity (ICOMP) criterion of Bozdogan (1988a, 1988b, 1990, 1994d, 1996, 1998a, 1998b) for model selection. A rationale for ICOMP as a model selection criterion is that it combines a badness-of-fit term (such as minus twice the maximum log likelihood) with a measure of complexity of a model differently than AIC, or its variants, by taking into account the interdependencies of the parameter estimates as well as the dependencies of the model residuals. We operationalize the general form of ICOMP based on the quantification of the concept of overall model complexity in terms of the estimated inverse-Fisher information matrix. This approach results in an approximation to the sum of two Kullback-Leibler distances. Using the correlational form of the complexity, we further provide yet another form of ICOMP to take into account the interdependencies (i.e., correlations) among the parameter estimates of the model. Later, we illustrate the practical utility and the importance of this new model selection criterion by providing several
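The abstract above describes criteria that add a complexity penalty to a badness-of-fit term such as minus twice the maximum log likelihood. As a minimal sketch of that structure, the following computes plain AIC (not ICOMP, whose complexity term is not spelled out here) for a least-squares fit, assuming Gaussian noise. The function name `gaussian_aic` and the toy data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gaussian_aic(X, y):
    """AIC = -2 * (maximum log likelihood) + 2 * k for a Gaussian linear model.

    Hypothetical helper: k counts the regression coefficients plus the
    noise variance, which is estimated by maximum likelihood (RSS / n).
    """
    n = y.shape[0]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n  # ML estimate of the noise variance
    # Gaussian log likelihood evaluated at the ML estimates
    max_loglik = -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
    k = X.shape[1] + 1
    return -2.0 * max_loglik + 2.0 * k

# Toy data (assumed): a straight line plus noise, fitted by polynomials
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 60)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=x.size)
for degree in (1, 3, 6):
    X = np.vander(x, degree + 1, increasing=True)
    print(degree, gaussian_aic(X, y))
```

The candidate with the smallest value would be selected; the penalty 2k grows by exactly 2 for every added parameter, whatever the correlations among the estimates.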
Self-concordant analysis for logistic regression
Cited by 22 (12 self)
Most of the non-asymptotic theoretical work in regression is carried out for the square loss, where estimators can be obtained through closed-form expressions. In this paper, we use and extend tools from the convex optimization literature, namely self-concordant functions, to provide simple extensions of theoretical results for the square loss to the logistic loss. We apply the extension techniques to logistic regression with regularization by the ℓ2-norm and regularization by the ℓ1-norm, showing that new results for binary classification through logistic regression can be easily derived from corresponding results for least-squares regression.
Bootstrap estimate of Kullback-Leibler information for model selection
 Statistica Sinica
, 1997
Cited by 18 (0 self)
Estimation of the Kullback-Leibler amount of information is a crucial part of deriving a statistical model selection procedure that is based on the likelihood principle, like AIC. To discriminate nested models, we have to estimate it up to the order of a constant, while the Kullback-Leibler information itself is of the order of the number of observations. A correction term employed in AIC is an example that fulfills this requirement, but it is a simple-minded bias correction to the log maximum likelihood. Therefore, there is no assurance that such a bias correction yields a good estimate of Kullback-Leibler information. In this paper, bootstrap-type estimation is considered as an alternative. We will first show that both bootstrap estimates proposed by Efron (1983, 1986, 1993) and Cavanaugh and Shumway (1994) are at least asymptotically equivalent and that there exist many other equivalent bootstrap estimates. We also show that all such methods are asymptotically equivalent to a non-bootstrap method, known as TIC (Takeuchi's Information Criterion), which is a generalization of AIC.
Optimal design of regularization term and regularization parameter by subspace information criterion
 Neural Networks
, 2000
Cited by 14 (7 self)
The problem of designing the regularization term and regularization parameter for linear regression models is discussed. Previously, we derived an approximation to the generalization error called the subspace information criterion (SIC), which is an unbiased estimator of the generalization error with finite samples under certain conditions. In this paper, we apply SIC to regularization learning and use it for (a) choosing the optimal regularization term and regularization parameter from given candidates, and (b) obtaining the closed form of the optimal regularization parameter for a fixed regularization term. The effectiveness of SIC is demonstrated through computer simulations with artificial and real data.
Keywords: supervised learning, generalization error, linear regression, regularization learning, ridge regression, model selection, regularization parameter, subspace information criterion
Nomenclature:
f(x): learning target function
D: domain of f(x)
x_m: mth sample point
y_m: mth sample value
ε_m: mth noise
(x_m, y_m): mth training example
M: the number of training examples
y: M-dimensional vector consisting of {y_m}, m = 1, ..., M
ε: M-dimensional vector consisting of {ε_m}, m = 1, ..., M
ϕ_p(x): pth basis function
θ_p: pth coefficient
µ: the number of basis functions
J_G: generalization error
J_TE: training error
J_R: regularized training error
T: regularization matrix
α: regularization parameter
A: design matrix
X_{T,α}: regularization learning matrix
U: µ-dimensional matrix
θ: true parameter
θ̂_{T,α}: regularization estimate
θ̂_u: unbiased estimate
σ²: noise variance
Model Selection
 In The Handbook Of Financial Time Series
, 2008
Cited by 11 (0 self)
Model selection has become a ubiquitous statistical activity in the last decades, not least due to the computational ease with which many statistical models can be fitted to data with the help of modern computing equipment. In this article we provide an introduction to the statistical aspects and implications of model selection and we review the relevant literature.
1.1 A General Formulation
When modeling data Y, a researcher often has available a menu of competing candidate models which could be used to describe the data. Let 𝓜 denote the collection of these candidate models. Each model M, i.e., each element of 𝓜, can, from a mathematical point of view, be viewed as a collection of probability distributions for Y implied by the model. That is, M is given by M = {P_η : η ∈ H}, where P_η denotes a probability distribution for Y and H represents the ‘parameter’ space (which can be different across different models M). The ‘parameter’ space H need not be finite-dimensional. Often, the ‘parameter’ η will be partitioned into (η_1, η_2), where η_1 is a finite-dimensional parameter whereas η_2 is infinite-dimensional. In case the parameterization is identified, i.e., the map η → P_η is injective on H, we will often not distinguish between M and H and will use them synonymously. The model selection problem is now to select, based on the data Y, a model M̂ = M̂(Y) in 𝓜 such that M̂ is a ‘good’ model for the data Y. Of course, the sense in which the selected model should be a ‘good’ model needs to be made precise and is a crucial point in the analysis. This is particularly important if, as is usually the case, selecting the model M̂ is not the final
Model Selection for Variable Length Markov Chains and Tuning the Context Algorithm
, 2000
Cited by 11 (3 self)
We consider the model selection problem in the class of stationary variable length Markov chains (VLMC) on a finite space. The processes in this class are still Markovian of higher order, but with memory of variable length. Various aims in selecting a VLMC can be formalized with different non-equivalent risks, such as final prediction error or expected Kullback-Leibler information. We consider the asymptotic behavior of different risk functions and show how they can be generally estimated with the same resampling strategy. Such estimated risks then yield new model selection criteria. In particular, we obtain a data-driven tuning of Rissanen's tree-structured context algorithm, which is a computationally feasible procedure for selection and estimation of a VLMC.
Key words and phrases. Bootstrap, zero-one loss, final prediction error, finite-memory source, FSMX model, Kullback-Leibler information, L2 loss, optimal tree pruning, resampling, tree model.
Short title: Selecting variable length Mar...
Information and Posterior Probability Criteria for Model Selection in Local Likelihood Estimation
 J Amer. Stat. Ass
, 1998
Cited by 4 (0 self)
this paper we propose a modification to the methods used to motivate many information and posterior probability criteria for the weighted likelihood case. We derive weighted versions for two of the most widely known criteria, namely the AIC and BIC. Via a simple modification, the criteria are also made useful for window span selection. The usefulness of the weighted versions of these criteria is demonstrated through a simulation study and an application to three data sets.
KEY WORDS: Information Criteria; Posterior Probability Criteria; Model Selection; Local Likelihood.
1. INTRODUCTION
Local regression has become a popular method for smoothing scatterplots and for nonparametric regression in general. It has proven to be a useful tool in finding structure in datasets (Cleveland and Devlin 1988). Local regression estimation is a method for smoothing scatterplots (x_i, y_i), i = 1, ..., n, in which the fitted value at x_0 is the value of a polynomial fit to the data using weighted least squares, where the weight given to (x_i, y_i) is related to the distance between x_i and x_0. Stone (1977) shows that estimates obtained using the local regression methods have desirable theoretical properties. Recently, Fan (1993) has studied minimax properties of local linear regression. Tibshirani and Hastie (1987) extend the ideas of local regression to a local likelihood procedure. This procedure is designed for nonparametric regression modeling in situations where weighted least squares is inappropriate as an estimation method, for example binary data. Local regression may be viewed as a special case of local likelihood estimation. Tibshirani and Hastie (1987), Staniswalis (1989), and Loader (1999) apply local likelihood estimation to several types of data where local regressio...
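The local regression estimate described in the introduction above (a polynomial fitted by weighted least squares, with weights that depend on the distance to x_0) can be sketched as follows. The Gaussian kernel, the bandwidth `h`, and the function name `local_poly_fit` are assumptions made for illustration; the paper's own weight function and span selection are not reproduced here.

```python
import numpy as np

def local_poly_fit(x, y, x0, h, degree=1):
    """Fitted value at x0 from a locally weighted polynomial fit.

    Hypothetical helper: weights come from a Gaussian kernel of
    bandwidth h (an assumed choice of weight function).
    """
    # Weight of (x_i, y_i) decreases with the distance between x_i and x0
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    # Polynomial basis centred at x0, so the intercept is the fit at x0
    A = np.vander(x - x0, degree + 1, increasing=True)
    sw = np.sqrt(w)
    # Weighted least squares via ordinary least squares on scaled rows
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[0]

# Toy scatterplot (assumed): a noisy sine curve smoothed on a grid
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 100))
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
grid = np.linspace(0.1, 0.9, 5)
print([round(float(local_poly_fit(x, y, g, h=0.1)), 3) for g in grid])
```

The bandwidth `h` plays the role of the window span: smaller values follow the data more closely, larger values smooth more, which is why a criterion for choosing it is useful.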
Dynamic Adaptive Partitioning for Nonlinear Time Series
, 1998
Cited by 3 (0 self)
Introduction
Nonparametric methods which are able to adapt to local sparseness of the data are often substantially better than non-adaptive procedures because of the curse of dimensionality, and estimation of the mean as a function of predictor variables with adaptive partitioning schemes has attracted much attention (Breiman et al., 1984; Friedman, 1991; Gersho & Gray, 1992). Some of these schemes have been studied also in the case of stationary time series (Lewis & Stevens, 1991; Nobel, 1997), but none of the schemes use the simple fact that, in the case of a time series, the partition cells themselves typically have a dynamic characteristic. Consider a stationary real-valued pth-order Markov chain Y_t (t ∈ ℤ) with state vector S_{t-1} = (Y_{t-1}, ..., Y_{t-p}) being the first p
Learning under Nonstationarity: Covariate Shift Adaptation by Importance Weighting
 IN J. E. GENTLE, W. HÄRDLE, Y. MORI (EDS), HANDBOOK OF COMPUTATIONAL STATISTICS: CONCEPTS AND METHODS, 2ND EDITION. CHAPTER 31, PP. 927–952, SPRINGER, BERLIN
, 2012
Cited by 1 (0 self)
The goal of supervised learning is to estimate an underlying input-output function from its input-output training samples so that output values for unseen test input points can be predicted. A common assumption in supervised learning is that the training input points follow the same probability distribution as the test input points. However, this assumption is not satisfied, for example, when extrapolating outside the training region. The situation where the training and test input points follow different distributions while the conditional distribution of output values given input points is unchanged is called covariate shift. Since almost all existing learning methods assume that the training and test samples are drawn from the same distribution, their fundamental theoretical properties such as consistency or efficiency no longer hold under covariate shift. In this chapter, we review recently proposed techniques for covariate shift adaptation.
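The chapter title refers to importance weighting: each training loss term is weighted by the ratio of the test input density to the training input density. A minimal sketch follows, assuming both densities are known Gaussians (in practice the ratio usually has to be estimated) and a straight-line model. The target function, the density parameters, and the helper names are all illustrative assumptions rather than the chapter's own setup.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2), written out to avoid extra dependencies."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def importance_weighted_ls(x, y, weights):
    """Fit y ~ a + b * x by least squares with per-sample weights."""
    A = np.column_stack([np.ones_like(x), x])
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef  # [intercept, slope]

# Assumed toy setting: training inputs centred at 1, test inputs at 2,
# same conditional distribution of outputs given inputs in both
rng = np.random.default_rng(0)
x_train = rng.normal(1.0, 0.5, size=200)
y_train = np.sinc(x_train) + rng.normal(scale=0.1, size=x_train.size)

# Importance weight = test input density / training input density
w = normal_pdf(x_train, 2.0, 0.25) / normal_pdf(x_train, 1.0, 0.5)

coef_unweighted = importance_weighted_ls(x_train, y_train, np.ones_like(x_train))
coef_weighted = importance_weighted_ls(x_train, y_train, w)
print("unweighted:", coef_unweighted)
print("weighted:  ", coef_weighted)
```

With a misspecified model such as this straight line, the weighted fit targets the region where test inputs concentrate, at the cost of higher variance because few training points carry most of the weight.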