## Covariate shift adaptation by importance weighted cross validation (2000)

Citations: 69 (37 self)

### BibTeX

```bibtex
@MISC{Sugiyama00covariateshift,
  author = {Masashi Sugiyama and Matthias Krauledat and Klaus-Robert Müller},
  title  = {Covariate shift adaptation by importance weighted cross validation},
  year   = {2000}
}
```

### Abstract

A common assumption in supervised learning is that the input points in the training set follow the same probability distribution as the input points that will be given in the future test phase. This assumption is not satisfied, however, for example when we extrapolate outside the training region. The situation where the training input points and test input points follow different distributions while the conditional distribution of output values given input points is unchanged is called the covariate shift. Under covariate shift, standard model selection techniques such as cross validation do not work as desired, since their unbiasedness is no longer maintained. In this paper, we propose a new method called importance weighted cross validation (IWCV) and prove that it remains unbiased even under covariate shift. The IWCV procedure is the only one that can be applied for unbiased classification under covariate shift, whereas alternatives to IWCV exist for regression. The usefulness of the proposed method is illustrated by simulations, and further demonstrated in a brain-computer interface, where strong non-stationarity effects can be seen between training and test sessions. © 2000 Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller.
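The IWCV procedure the abstract describes can be sketched in a few lines: run ordinary k-fold cross validation, but multiply each held-out loss by the importance weight. This is an illustrative reading, not the paper's reference implementation; `fit` and `loss` are placeholder callables, and the weights `w` are assumed given.

```python
import numpy as np

def iwcv_risk(fit, loss, X, y, w, k=5, seed=0):
    """k-fold importance weighted cross validation (IWCV).

    Each held-out loss is multiplied by the importance weight
    w_i = p_test(x_i) / p_train(x_i), which restores the (almost)
    unbiasedness of the CV risk estimate under covariate shift.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    weighted_losses = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        predict = fit(X[train], y[train])   # train on the remaining k-1 folds
        weighted_losses.extend(w[fold] * loss(y[fold], predict(X[fold])))
    return float(np.mean(weighted_losses))
```

With `w` identically 1 this reduces to ordinary cross validation, which is the unbiasedness statement the abstract generalizes.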

### Citations

8980 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ...supervised learning, it is commonly assumed that the input points in the training set and the input points used for testing follow the same probability distribution (e.g., Wahba, 1990; Bishop, 1995; Vapnik, 1998; Duda et al., 2001; Hastie et al., 2001; Schölkopf and Smola, 2002). However, this common assumption is not fulfilled, for example, when we extrapolate outside of the training region or when traini...

4828 | Neural Networks for Pattern Recognition
- Bishop
- 1995
Citation Context: ...ng a method of supervised learning, it is commonly assumed that the input points in the training set and the input points used for testing follow the same probability distribution (e.g., Wahba, 1990; Bishop, 1995; Vapnik, 1998; Duda et al., 2001; Hastie et al., 2001; Schölkopf and Smola, 2002). However, this common assumption is not fulfilled, for example, when we extrapolate outside of the training region ...

2534 | An Introduction to the Bootstrap
- Efron, Tibshirani
- 1993
Citation Context: ...ighting idea which was originally used in importance sampling (e.g., Fishman, 1996) could be applied to various statistical procedures, including resampling techniques such as bootstrap (Efron, 1979; Efron and Tibshirani, 1993). An interesting future direction is therefore to develop a family of importance-weighted algorithms following the spirit of this paper and to investigate their statistical properties. Acknowledgment...
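The closing remark above suggests carrying the importance-weighting idea over to other resampling schemes. One possible, purely illustrative reading for the bootstrap is to resample training points with probabilities proportional to their importance weights, so each replicate mimics a draw from the test distribution; this is our sketch, not a method from the paper.

```python
import numpy as np

def iw_bootstrap(data, w, stat, B=500, seed=0):
    """Importance weighted bootstrap (illustrative sketch).

    Resamples points with probabilities proportional to the importance
    weights w_i = p_test(x_i)/p_train(x_i) and recomputes `stat` on each
    replicate, returning the mean and standard deviation of the replicates.
    """
    rng = np.random.default_rng(seed)
    p = np.asarray(w, dtype=float)
    p = p / p.sum()                       # normalize weights to probabilities
    n = len(data)
    reps = np.array([stat(data[rng.choice(n, size=n, p=p)]) for _ in range(B)])
    return reps.mean(), reps.std()
```

With uniform weights this collapses to the ordinary bootstrap of Efron (1979).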

2162 | Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability
- Silverman
- 1986
Citation Context: ...ies; they are estimated by maximum likelihood fitting of a single Gaussian model or a Gaussian kernel density estimator with variance determined by Silverman’s rule-of-thumb bandwidth selection rule (Silverman, 1986; Härdle et al., 2004). For estimating the test input density, we draw 100 unlabeled samples following Ptest(x). The simulation results had very similar trends to the case with known densities (theref...
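Silverman's rule-of-thumb bandwidth mentioned in this context has a simple closed form. The sketch below uses the common robust variant h = 0.9 · min(σ̂, IQR/1.34) · n^(−1/5) for a univariate Gaussian kernel density estimator; the paper's exact variant may differ, and the function names are ours.

```python
import numpy as np

def silverman_bandwidth(x):
    """Silverman's rule-of-thumb bandwidth for a univariate Gaussian KDE,
    h = 0.9 * min(sigma, IQR/1.34) * n**(-1/5) (robust variant)."""
    x = np.asarray(x, dtype=float)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    scale = min(x.std(ddof=1), iqr / 1.34)
    return 0.9 * scale * len(x) ** (-0.2)

def gaussian_kde(x, grid, h):
    """Evaluate the Gaussian kernel density estimate on `grid`."""
    z = (grid[:, None] - x[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(x) * h * np.sqrt(2 * np.pi))
```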

2030 | Learning with Kernels
- Scholkopf, Smola
- 2002
Citation Context: ...input points in the training set and the input points used for testing follow the same probability distribution (e.g., Wahba, 1990; Bishop, 1995; Vapnik, 1998; Duda et al., 2001; Hastie et al., 2001; Schölkopf and Smola, 2002). However, this common assumption is not fulfilled, for example, when we extrapolate outside of the training region or when training input points are designed by an active learning (experimental de...

1923 | Pattern Classification
- Duda, Hart, et al.
- 2000
Citation Context: ...arning, it is commonly assumed that the input points in the training set and the input points used for testing follow the same probability distribution (e.g., Wahba, 1990; Bishop, 1995; Vapnik, 1998; Duda et al., 2001; Hastie et al., 2001; Schölkopf and Smola, 2002). However, this common assumption is not fulfilled, for example, when we extrapolate outside of the training region or when training input points are...

1840 | A new look at the statistical model identification
- Akaike
- 1974
Citation Context: ...e shift. The model selection problem under the covariate shift has been studied so far. For example, a risk estimator in the context of density estimation called Akaike’s information criterion (AIC) (Akaike, 1974) was modified to be still asymptotically unbiased (Shimodaira, 2000) and a risk estimator in linear regression called the subspace information criterion (SIC) (Sugiyama and Ogawa, 2001) was similarly extende...

1273 | Spline models for observational data - Wahba - 1990

1269 | Sample Selection Bias as a Specification Error
- Heckman
- 1979
Citation Context: ...Scheffer, 2007), bioinformatics (Baldi et al., 1998; Borgwardt et al., 2006) or brain-computer interfacing (Wolpaw et al., 2002), the covariate shift phenomenon is conceivable. Sample selection bias (Heckman, 1979) in economics may also include a form of the covariate shift. Illustrative examples of covariate shift situations are depicted in Figures 1 and 3. In this paper, we develop a new learning method and ...

1240 | On information and sufficiency
- Kullback, Leibler
- 1951
Citation Context: ...10-fold CV, and AIWLDA with optimal λ. The value of λ is selected from {0, 0.1, 0.2, ..., 1.0}; chosen values are also described in the table. Table 2 also contains the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951) from the estimated training input distribution to the estimated test input distribution. Since we want to have an accurate estimate of the KL divergence, we used the test samples for estimating the ...
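The KL divergence between the two fitted Gaussians used in this comparison has a closed form. A minimal univariate sketch (function names are ours, not the paper's):

```python
import numpy as np

def kl_gauss(mu0, var0, mu1, var1):
    """Closed-form KL(N(mu0,var0) || N(mu1,var1)) for univariate Gaussians."""
    return 0.5 * (np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def kl_train_to_test(x_train, x_test):
    """Fit one Gaussian to each sample and return the KL divergence from the
    fitted training input distribution to the fitted test input distribution,
    mimicking the comparison described in the context above."""
    return kl_gauss(x_train.mean(), x_train.var(), x_test.mean(), x_test.var())
```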

966 | The use of multiple measurements in taxonomic problems
- Fisher
- 1936
Citation Context: ...unction: û = sgn( f̂(t; θ̂_AIWLS) ), where sgn(·) denotes the sign of a scalar. Note that, if Ptrain(x) = Ptest(x), this classification method is equivalent to linear discriminant analysis (LDA) (Fisher, 1936; Duda et al., 2001), given that the class labels are yi ∈ {1/n+, −1/n−}, where n+ and n− are the numbers of positive and negative training samples, respectively. In the following, we rescale the train...
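The equivalence noted in this context, least squares with class labels rescaled to 1/n₊ and −1/n₋ followed by taking the sign, can be sketched as follows. This is an illustrative reconstruction, not the paper's AIWLS code.

```python
import numpy as np

def lda_via_least_squares(X, y):
    """Least-squares fit with targets rescaled to 1/n+ and -1/n-; taking the
    sign of the fitted linear function then matches Fisher's LDA decision
    (the classical equivalence the context above refers to)."""
    n_pos = int((y > 0).sum())
    n_neg = int((y <= 0).sum())
    t = np.where(y > 0, 1.0 / n_pos, -1.0 / n_neg)   # rescaled class labels
    Xb = np.c_[X, np.ones(len(X))]                    # append a bias column
    theta, *_ = np.linalg.lstsq(Xb, t, rcond=None)
    return lambda Xq: np.sign(np.c_[Xq, np.ones(len(Xq))] @ theta)
```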

863 | The Elements of Statistical Learning
- Hastie, Tibshirani, et al.
- 2009
Citation Context: ...nly assumed that the input points in the training set and the input points used for testing follow the same probability distribution (e.g., Wahba, 1990; Bishop, 1995; Vapnik, 1998; Duda et al., 2001; Hastie et al., 2001; Schölkopf and Smola, 2002). However, this common assumption is not fulfilled, for example, when we extrapolate outside of the training region or when training input points are designed by an activ...

805 | Bootstrap methods: another look at the jackknife
- Efron
- 1979
Citation Context: ...importance-weighting idea which was originally used in importance sampling (e.g., Fishman, 1996) could be applied to various statistical procedures, including resampling techniques such as bootstrap (Efron, 1979; Efron and Tibshirani, 1993). An interesting future direction is therefore to develop a family of importance-weighted algorithms following the spirit of this paper and to investigate their statistica...

721 | Cross-Validatory Choices and Assessment of Statistical Prediction (with Discussion)
- Stone
- 1974
Citation Context: ...approach will be evaluated. Model selection is one of the key ingredients in machine learning. However, under the covariate shift, a standard model selection technique such as cross validation (CV) (Stone, 1974; Wahba, 1990) does not work as desired; more specifically, the unbiasedness that guarantees the accuracy of CV does not hold under the covariate shift anymore. To cope with this problem, we propose a...

561 | A stochastic approximation method
- Robbins, Monro
- 1951
Citation Context: ...ng significant attention recently (e.g., Bickel, 2006; Candela et al., 2006); note also that a large body of work exists in online learning, where the distribution is subject to continuous change (e.g., Robbins and Monro, 1951; Saad, 1998; LeCun et al., 1998; Murata et al., 2002). For further developing learning methods under the changing environment, it is essential to establish and share standard benchmark data sets, for...

528 | Active learning with statistical models
- Cohn, Ghahramani, et al.
- 1996
Citation Context: ...ing (Bickel and Scheffer, 2007), and bioinformatics (Baldi et al., 1998). Applying IWCV in these application areas would be an interesting direction to be investigated. Active learning (MacKay, 1992; Cohn et al., 1996; Fukumizu, 2000)—also referred to as experimental design in statistics (Kiefer, 1959; Fedorov, 1972; Pukelsheim, 1993)—is the problem of determining the location of training input points {x_i}_{i=1}^n s...

355 | Theory of optimal experiments
- Fedorov
- 1972
Citation Context: ...tion areas would be an interesting direction to be investigated. Active learning (MacKay, 1992; Cohn et al., 1996; Fukumizu, 2000)—also referred to as experimental design in statistics (Kiefer, 1959; Fedorov, 1972; Pukelsheim, 1993)—is the problem of determining the location of training input points {x_i}_{i=1}^n so that the risk is minimized. The covariate shift naturally occurs in the active learning scenario s...

348 | Brain-computer interfaces for communication and control - Wolpaw, Birbaumer, et al.

324 | Monte Carlo: Concepts, Algorithms and Applications
- Fishman
- 1995
Citation Context: ...me that the ratio of test and training input densities at training input points, ptest(xi)/ptrain(xi), (3) is finite and known. We refer to the expression (3) as importance, à la importance sampling (Fishman, 1996). In practical situations where the importance is unknown, we may replace them by empirical estimates (see Sections 4 and 5). 2.2 Empirical Risk Minimization and Its Importance Weighted Variants A st...
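Importance sampling in the sense of Fishman (1996) reweights samples drawn from ptrain so that expectations under ptest can be estimated without ever sampling from ptest. A minimal univariate sketch (function names are ours):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Univariate Gaussian density, used here only to build the example."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def importance_sampling_mean(g, x, p_test, p_train):
    """Estimate E_{p_test}[g(x)] from samples x drawn under p_train by
    weighting each term with the importance p_test(x_i)/p_train(x_i)."""
    w = p_test(x) / p_train(x)
    return float(np.mean(w * g(x)))
```

For instance, the mean of N(1, 1) can be estimated from samples of N(0, 1) by weighting each sample with the density ratio of the two Gaussians.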

324 | Information-based objective functions for active data selection
- MacKay
- 1992
Citation Context: ...), spam filtering (Bickel and Scheffer, 2007), and bioinformatics (Baldi et al., 1998). Applying IWCV in these application areas would be an interesting direction to be investigated. Active learning (MacKay, 1992; Cohn et al., 1996; Fukumizu, 2000)—also referred to as experimental design in statistics (Kiefer, 1959; Fedorov, 1972; Pukelsheim, 1993)—is the problem of determining the location of training input ...

320 | Bioinformatics: The Machine Learning Approach
- Baldi et al.
- 1998
Citation Context: ...s called the covariate shift (Shimodaira, 2000). For data from many applications such as off-policy reinforcement learning (Shelton, 2001), spam filtering (Bickel and Scheffer, 2007), bioinformatics (Baldi et al., 1998; Borgwardt et al., 2006) or brain-computer interfacing (Wolpaw et al., 2002), the covariate shift phenomenon is conceivable. Sample selection bias (Heckman, 1979) in economics may also include a form...

205 | Locality preserving projections - He, Niyogi - 2004

166 | Optimal Design of Experiments
- Pukelsheim
- 1993
Citation Context: ...d be an interesting direction to be investigated. Active learning (MacKay, 1992; Cohn et al., 1996; Fukumizu, 2000)—also referred to as experimental design in statistics (Kiefer, 1959; Fedorov, 1972; Pukelsheim, 1993)—is the problem of determining the location of training input points {x_i}_{i=1}^n so that the risk is minimized. The covariate shift naturally occurs in the active learning scenario since the training ...

156 | Semi-supervised learning on Riemannian manifolds - Belkin, Niyogi - 2004

145 | Domain adaptation for statistical classifiers - Daumé, Marcu - 2006

133 | Optimal spatial filtering of single trial EEG during imagined hand movement
- Ramoser, Müller-Gerking, et al.
- 2000
Citation Context: ...d to the second alternative. By this means, it is possible to operate devices which are connected to the computer. For classification of bandpower estimates of appropriately preprocessed EEG signals (Ramoser et al., 2000; Pfurtscheller and da Silva, 1999; Lemm et al., 2005), LDA has been shown to work very well (Wolpaw et al., 2002; Dornhege et al., 2004; Babiloni et al., 2000). On the other hand, strong non-stationarity ...

130 | Correcting sample selection bias by unlabeled data
- Huang, Smola, et al.
Citation Context: ...odology we propose in this paper is valid for any parameter learning method; this means that, e.g., an importance weighted variant of support vector machines (Vapnik, 1998; Schölkopf and Smola, 2002; Huang et al., 2007) or graph regularization techniques (Bousquet et al., 2004; Belkin and Niyogi, 2004; Hein, 2006) can also be employed. 2. For a correctly specified model, an estimator is said to be consistent if it ...

125 | Efficient backprop
- LeCun, Bottou, et al.
- 1998
Citation Context: ...g., Bickel, 2006; Candela et al., 2006); note also that a large body of work exists in online learning, where the distribution is subject to continuous change (e.g., Robbins and Monro, 1951; Saad, 1998; LeCun et al., 1998; Murata et al., 2002). For further developing learning methods under the changing environment, it is essential to establish and share standard benchmark data sets, for example, the projects supported...

111 | Improving predictive inference under covariate shift by weighting the log-likelihood function - Shimodaira

109 | Analysis of representations for domain adaptation
- Ben-David, Blitzer, et al.
- 2007
Citation Context: ...rning is correctly specified. When this is not true, we may need to reasonably restrict the type of distribution change for meaningful estimations (see, for example, Zadrozny, 2004; Fan et al., 2005; Ben-David et al., 2007; Yamazaki et al., 2007, for theoretical analyses). The covariate shift setting which we discussed in this paper could be regarded as one of such restrictions. Another interesting restriction on the d...

95 | Semiparametric Models
- Härdle, Müller, et al.
- 2004
Citation Context: ...imated by maximum likelihood fitting of a single Gaussian model or a Gaussian kernel density estimator with variance determined by Silverman’s rule-of-thumb bandwidth selection rule (Silverman, 1986; Härdle et al., 2004). For estimating the test input density, we draw 100 unlabeled samples following Ptest(x). The simulation results had very similar trends to the case with known densities (therefore we omit the detai...

84 | Support vector machines and the Bayes rule in classification - Lin - 1999

69 | Event-related EEG/MEG synchronization and desynchronization: basic principles - Pfurtscheller, Lopes da Silva - 1999

58 | Boosting bit rates in noninvasive EEG single-trial classifications by feature combination and multiclass paradigms
- Dornhege, Blankertz, et al.
- 2004
Citation Context: ...ion of bandpower estimates of appropriately preprocessed EEG signals (Ramoser et al., 2000; Pfurtscheller and da Silva, 1999; Lemm et al., 2005), LDA has been shown to work very well (Wolpaw et al., 2002; Dornhege et al., 2004; Babiloni et al., 2000). On the other hand, strong non-stationarity effects have often been observed in brain signals between training and test sessions (Vidaurre et al., 2004; Millán, 2004; Shenoy e...

52 | Integrating structured biological data by kernel maximum mean discrepancy
- Borgwardt, Gretton, et al.
- 2006
Citation Context: ...te shift (Shimodaira, 2000). For data from many applications such as off-policy reinforcement learning (Shelton, 2001), spam filtering (Bickel and Scheffer, 2007), bioinformatics (Baldi et al., 1998; Borgwardt et al., 2006) or brain-computer interfacing (Wolpaw et al., 2002), the covariate shift phenomenon is conceivable. Sample selection bias (Heckman, 1979) in economics may also include a form of the covariate shift....

51 | Optimum experimental designs
- Kiefer
- 1959
Citation Context: ...these application areas would be an interesting direction to be investigated. Active learning (MacKay, 1992; Cohn et al., 1996; Fukumizu, 2000)—also referred to as experimental design in statistics (Kiefer, 1959; Fedorov, 1972; Pukelsheim, 1993)—is the problem of determining the location of training input points {x_i}_{i=1}^n so that the risk is minimized. The covariate shift naturally occurs in the active lear...

51 | On estimation of characters obtained in statistical procedure of recognition - Luntz, Brailovsky - 1969

50 | Introduction, in On-line Learning in Neural Networks - Saad - 1998

47 | Active learning in multilayer perceptrons - Fukumizu - 1996

45 | Spatio-spectral filters for improving the classification of single trial EEG
- Lemm, Blankertz, et al.
- 2005
Citation Context: ...le to operate devices which are connected to the computer. For classification of bandpower estimates of appropriately preprocessed EEG signals (Ramoser et al., 2000; Pfurtscheller and da Silva, 1999; Lemm et al., 2005), LDA has been shown to work very well (Wolpaw et al., 2002; Dornhege et al., 2004; Babiloni et al., 2000). On the other hand, strong non-stationarity effects have often been observed in brain signals bet...

45 | Algebraic Analysis for Nonidentifiable Learning - Watanabe - 2001

43 | The Berlin Brain-Computer Interface: EEG-based communication without subject training - Blankertz, Dornhege, et al. - 2006

43 | Asymptotics for and against cross-validation - Stone - 1977

43 | Subspace Information Criterion for Model Selection - Sugiyama, Ogawa - 2001

42 | The non-invasive Berlin Brain-Computer Interface: fast acquisition of effective performance in untrained subjects
- Blankertz, Dornhege, et al.
- 2007
Citation Context: ...ets obtained from 5 different subjects (see Table 1 for specification), where the task is binary classification of EEG signals. The experimental setting is described in more detail in the references (Blankertz et al., 2007, 2006; Sugiyama et al., 2006). Note that training samples and unlabeled/test samples are gathered in different recording sessions, so the nonstationarity in brain signals may change the distributions...

42 | Input-dependent estimation of generalization error under covariate shift
- Sugiyama, Müller
- 2005
Citation Context: ...t IWCV gives an almost unbiased estimate of the risk even under the covariate shift. Model selection under the covariate shift has been studied so far only by a few researchers (e.g., Shimodaira, 2000; Sugiyama and Müller, 2005)—existing methods have a number of limitations, for example, in the loss function, parameter learning method, and model. In particular, the existing methods cannot be applied to classification scena...

34 | Classifier ensembles for changing environments
- Kuncheva
- 2004
Citation Context: ...ng learning methods under the changing environment, it is essential to establish and share standard benchmark data sets, for example, the projects supported by PASCAL (Candela et al., 2005) or EPSRC (Kuncheva, 2006). Common benchmark data sets can be used to evaluate the experimental performance of proposed and related methods. Finally, the importance-weighting idea which was originally used in importance sampl...

33 | Dirichlet-enhanced spam filtering based on biased samples
- Bickel, Scheffer
- 2007
Citation Context: ...ut values given input points are unchanged is called the covariate shift (Shimodaira, 2000). For data from many applications such as off-policy reinforcement learning (Shelton, 2001), spam filtering (Bickel and Scheffer, 2007), bioinformatics (Baldi et al., 1998; Borgwardt et al., 2006) or brain-computer interfacing (Wolpaw et al., 2002), the covariate shift phenomenon is conceivable. Sample selection bias (Heckman, 1979)...

33 | Pool-based active learning in approximate linear regression - Sugiyama, Nakajima - 2009

32 | Measure Based Regularization
- Bousquet, Chapelle, et al.
- 2004
Citation Context: ...ter learning method; this means that, e.g., an importance weighted variant of support vector machines (Vapnik, 1998; Schölkopf and Smola, 2002; Huang et al., 2007) or graph regularization techniques (Bousquet et al., 2004; Belkin and Niyogi, 2004; Hein, 2006) can also be employed. 2.3 Cross Validation Estimate of Risk The value of the tuning parameter, say λ in Eq. (4), controls the trade-off between the consistency a...