## A kernel method for the two sample problem (2007)

### Download Links

- [www.kyb.tuebingen.mpg.de]
- [www.kyb.mpg.de]
- [www.cs.cmu.edu]
- [www.gatsby.ucl.ac.uk]
- [arxiv.org]
- [www.dbs.informatik.uni-muenchen.de]
- [www.dbs.ifi.lmu.de]
- [books.nips.cc]

Venue: Advances in Neural Information Processing Systems 19

Citations: 40 (13 self)

### BibTeX

@INPROCEEDINGS{Gretton07akernel,
    author = {Arthur Gretton and Karsten Borgwardt and Malte Rasch and Bernhard Schölkopf and Alexander Smola},
    title = {A kernel method for the two sample problem},
    booktitle = {ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 19},
    year = {2007},
    pages = {513--520},
    publisher = {MIT Press}
}

### Abstract

We propose a framework for analyzing and comparing distributions, allowing us to design statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS). We present two tests based on large deviation bounds for the test statistic, while a third is based on the asymptotic distribution of this statistic. The test statistic can be computed in quadratic time, although efficient linear time approximations are available. Several classical metrics on distributions are recovered when the function space used to compute the difference in expectations is allowed to be more general (e.g., a Banach space). We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.
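To make the quadratic-time statistic and its linear-time approximation concrete, the following is a minimal sketch in plain Python. It assumes a Gaussian kernel with bandwidth `sigma = 1.0`, which the abstract does not specify. The function names (`gaussian_kernel`, `mmd2_biased`, `mmd2_unbiased`, `mmd2_linear`, `sample_gaussian`) are hypothetical and are not taken from the authors' code. The estimators are the standard kernel-mean forms of squared MMD, and the thresholds needed to turn them into a test are not shown.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel between two equal-length sequences of floats.

    The kernel choice and bandwidth are assumptions made for this sketch.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2_biased(xs, ys, kernel=gaussian_kernel):
    """Biased estimate of squared MMD; quadratic in the sample sizes."""
    m, n = len(xs), len(ys)
    k_xx = sum(kernel(a, b) for a in xs for b in xs) / (m * m)
    k_yy = sum(kernel(a, b) for a in ys for b in ys) / (n * n)
    k_xy = sum(kernel(a, b) for a in xs for b in ys) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy


def mmd2_unbiased(xs, ys, kernel=gaussian_kernel):
    """Unbiased estimate: within-sample sums skip the i == j terms."""
    m, n = len(xs), len(ys)
    k_xx = sum(kernel(xs[i], xs[j])
               for i in range(m) for j in range(m) if i != j)
    k_yy = sum(kernel(ys[i], ys[j])
               for i in range(n) for j in range(n) if i != j)
    k_xy = sum(kernel(a, b) for a in xs for b in ys)
    return (k_xx / (m * (m - 1)) + k_yy / (n * (n - 1))
            - 2.0 * k_xy / (m * n))


def mmd2_linear(xs, ys, kernel=gaussian_kernel):
    """Linear-time estimate: average over disjoint pairs of observations."""
    pairs = min(len(xs), len(ys)) // 2
    total = 0.0
    for i in range(pairs):
        x1, x2 = xs[2 * i], xs[2 * i + 1]
        y1, y2 = ys[2 * i], ys[2 * i + 1]
        total += (kernel(x1, x2) + kernel(y1, y2)
                  - kernel(x1, y2) - kernel(x2, y1))
    return total / pairs


def sample_gaussian(rng, mean, size):
    """Draw `size` one-dimensional points from N(mean, 1)."""
    return [[rng.gauss(mean, 1.0)] for _ in range(size)]


if __name__ == "__main__":
    rng = random.Random(0)
    same_a = sample_gaussian(rng, 0.0, 200)
    same_b = sample_gaussian(rng, 0.0, 200)
    shifted = sample_gaussian(rng, 3.0, 200)
    print("same distribution:   ", mmd2_unbiased(same_a, same_b))
    print("shifted distribution:", mmd2_unbiased(same_a, shifted))
    print("linear-time estimate:", mmd2_linear(same_a, shifted))
```

The quadratic estimators evaluate the kernel on every pair of points, while the linear-time variant touches each observation once, trading a higher variance for a lower cost.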

### Citations

3028 | UCI repository of machine learning databases - Blake, Merz - 1998 |
Citation Context: ...sets: the census income dataset from the UCI KDD archive (CNUM), the protein homology dataset from the 2004 KDD Cup (BIO) (Caruana and Joachims, 2004), and the forest dataset from the UCI ML archive (Blake and Merz, 1998). For the final dataset, we performed univariate matching of attributes (FOREST) and multivariate matching of tables (FOREST10D) from two different databases, where each table represents one type of ...

2519 | Density Estimation for Statistics and Data Analysis - Silverman - 1986 |
Citation Context: ...rges more slowly than an RKHS-based test, also following Anderson et al. (1994). Before proceeding, we motivate this discussion with a short overview of the Parzen window estimate and its properties (Silverman, 1986). We assume a distribution $p$ on $\mathbb{R}^d$, which has an associated density function also written $p$ to minimise notation. The Parzen window estimate of this density from an i.i.d. sample $X$ of size $m$ is $\hat{p}$(...

2196 | Learning with Kernels - Schölkopf, Smola - 2002 |
Citation Context: ... $- 2\epsilon > \frac{D}{2} - 2\,\frac{D}{8} = \frac{D}{4} > 0$. $\left[\mathbf{E}_p[f^*] - \mathbf{E}_q[f^*]\right] / \|f^*\|_{\mathcal{H}} \ge D / (4\,\|f^*\|_{\mathcal{H}}) > 0$. We now review some properties of $\mathcal{H}$ that will allow us to express the MMD in a more easily computable form (Schölkopf and Smola, 2002). Since $\mathcal{H}$ is an RKHS, the operator of evaluation $\delta_x$ mapping $f \in \mathcal{H}$ to $f(x) \in \mathbb{R}$ is continuous. Thus, by the Riesz representation theorem, there is a feature mapping $\phi(x)$ from $\mathcal{X}$ to $\mathbb{R}$ such that $f(x) = \langle f$...

2107 | Numerical Recipes in C: The Art of Scientific Computing - Press, Flannery, et al. - 1992 |
Citation Context: ... for $x \ge 0$. (9) This allows for an efficient characterization of the distribution under the null hypothesis $H_0$. Efficient numerical approximations to (9) can be found in numerical analysis handbooks (Press et al., 1994). The distribution under the alternative, $p \neq q$, however, is unknown. The Kolmogorov metric is, in fact, a special instance of $\mathrm{MMD}[\mathcal{F}, p, q]$ for a certain Banach space (Müller, 1997, Theorem 5.2). Pro...

2076 | An Introduction to Probability Theory and its Applications - Feller - 1971 |
Citation Context: ...y estimates (Anderson et al., 1994) is a special case of the biased MMD in equation (6). Denote by $D_r(p, q) := \|p - q\|_r$ the $L_r$ distance. For $r = 1$ the distance $D_r(p, q)$ is known as the Lévy distance (Feller, 1971), and for $r = 2$ we encounter distance measures derived from the Rényi entropy (Gokcay and Principe, 2002). Assume that $\hat{p}$ and $\hat{q}$ are given as kernel density estimates with kernel $\kappa(x - x')$, that ...

1551 | Probability inequalities for sums of bounded random variables - HOEFFDING - 1963 |

1430 | Independent component analysis, a new concept - Comon - 1994 |
Citation Context: ... $:= \Pr_{x,y}$. We wish to determine whether this distribution factorizes, i.e. whether $q := \Pr_x \Pr_y$ is the same as $p$. One application of such an independence measure is in independent component analysis (Comon, 1994), where the goal is to find a linear mapping of the observations $x_i$ to obtain mutually independent outputs. Kernel methods were employed to solve this problem by Bach and Jordan (2002); Gretton et al...

849 | Kernel Methods for Pattern Analysis - Shawe-Taylor, Cristianini - 2004 |

818 | The Hungarian method for the assignment problem - Kuhn - 1955 |
Citation Context: ...$B_{\pi(i)})\|^2$. If we define $C_{ij} = \|\mu_i(A_i) - \mu_i(B_j)\|^2$, then this is the same as minimizing the sum over $C_{i,\pi(i)}$. This is the linear assignment problem, which costs $O(n^3)$ time using the Hungarian method (Kuhn, 1955). While this may appear to be a crude heuristic, it nonetheless defines a semi-metric on the sample spaces $\mathcal{X}$ and $\mathcal{Y}$ and the corresponding distributions $p$ and $q$. This follows from the fact that matchin...

670 | Approximation Theorems of Mathematical Statistics - Serfling - 1980 |
Citation Context: ...estimate of $\mathrm{MMD}^2$, although it does not have minimum variance, since we are ignoring the cross-terms $k(x_i, y_i)$ of which there are only $O(n)$. The minimum variance estimate is almost identical, though (Serfling, 1980, Section 5.1.4). The biased statistic in (2) may also be easily computed following the above reasoning. Substituting the empirical estimates $\mu[X] := \frac{1}{m} \sum_{i=1}^{m} \phi(x_i)$ and $\mu[Y] := \frac{1}{n} \sum_{i=1}^{n} \phi(y_i)$ of ...

546 | Estimating the support of a high-dimensional distribution - Schölkopf, Platt, et al. - 2001 |

495 | Real Analysis and Probability - Dudley - 1989 |
Citation Context: ...tics literature can also be considered in defining the MMD. Indeed, Lemma 1 defines an MMD with $\mathcal{F}$ the space of bounded continuous real-valued functions, which is a Banach space with the supremum norm (Dudley, 2002, p. 158). We now describe two further metrics on the space of probability distributions, the Kolmogorov-Smirnov and Earth Mover's distances, and their associated function classes. 2.5.1 Kolmogo...

491 | The earth mover’s distance as a metric for image retrieval - Rubner, Tomasi, et al. - 2000 |

472 | Statistical Inference - Casella, Berger - 1990 |
Citation Context: ...to zero at rate $O(m^{-1/2})$, assuming $m = n$. To put this convergence rate in perspective, consider a test of whether two normal distributions have equal means, given they have unknown but equal variance (Casella and Berger, 2002, Exercise 8.41). In this case, the test statistic has a Student-t distribution with $n + m - 2$ degrees of freedom, and its error probability converges at the same rate as our test. It is worth noting ...

401 | On the method of bounded differences - MCDIARMID - 1989 |

340 | Kernel independent component analysis - Bach, Jordan - 2002 |

274 | Rademacher and Gaussian complexities: risk bounds and structural results - Bartlett, Mendelson - 2002 |

271 | Functions of positive and negative type and their connection with the theory of integral equations - Mercer - 1909 |
Citation Context: ... $\big]^2 \, dz$ (15) $= \frac{1}{m^2} \sum_{i,j=1}^{m} k(x_i - x_j) + \frac{1}{n^2} \sum_{i,j=1}^{n} k(y_i - y_j) - \frac{2}{mn} \sum_{i,j=1}^{m,n} k(x_i - y_j)$, (16) where $k(x - y) = \int \kappa(x - z)\,\kappa(y - z)\,dz$. Note that by its definition $k(x - y)$ is a Mercer kernel (Mercer, 1909), as it can be viewed as inner product between $\kappa(x - z)$ and $\kappa(y - z)$ on the domain $\mathcal{X}$. A disadvantage of the Parzen window interpretation is that when the Parzen window estimates are consistent (which...

205 | A class of statistics with asymptotically normal distribution - Hoeffding - 1948 |

195 | Support vector machines for multiple-instance learning - Andrews, Tsochantaridis, et al. - 2002 |
Citation Context: ... whether some instances in the domain have the desired property, rather than making a statement regarding the distribution of those instances. Taking this into account leads to an improved algorithm (Andrews et al., 2003). 7.3 Kernel Measures of Independence. We next demonstrate the application of MMD in determining whether two random variables $x$ and $y$ are independent. In other words, assume that pairs of random varia...

170 | On the influence of the kernel on the consistency of support vector machines - Steinwart - 2002 |

118 | Multi-instance kernels - Gärtner, Flach, et al. - 2002 |

113 | Analysis of representations for domain adaptation - Ben-David, Blitzer, et al. - 2006 |

102 | Detecting change in data stream - BEN-DAVID, GEHRKE, et al. - 2004 |

99 | Measuring statistical dependence with Hilbert-Schmidt norms - Gretton, Bousquet, et al. - 2005 |

76 | Protein function prediction via graph kernels - Borgwardt, Ong, et al. - 2005 |

70 | A kernel method for the two-sample-problem - Gretton, Borgwardt, et al. - 2006 |

60 | Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests - Friedman, Rafsky - 1979 |
Citation Context: ...nates their computing time; we return to this point in our experiments (Section 8). Two possible generalisations of the Kolmogorov-Smirnov test to the multivariate case were studied in (Bickel, 1969; Friedman and Rafsky, 1979). The approach of Friedman and Rafsky (FR Smirnov) in this case again requires a minimal spanning tree, and has a similar cost to their multivariate runs test. A more recent multivariate test was int...

57 | Estimating divergence functionals and the likelihood ratio by convex risk minimization - Nguyen, Wainwright, et al. |

57 | A Hilbert space embedding for distributions - Smola, Gretton, et al. - 2007 |

54 | Integrating structured biological data by kernel maximum mean discrepancy - Borgwardt, Gretton, et al. - 2006 |

53 | Kernel measures of conditional dependence - Fukumizu, Gretton, et al. - 2008 |

52 | Information Theoretic Clustering - Gokcay, Príncipe - 2002 |

50 | Data domain description by support vectors - Tax, Duin - 1999 |

45 | A kernel statistical test of independence - Gretton, Fukumizu, et al. - 2007 |

44 | Theory of classification: a survey of recent advances - Boucheron, Bousquet, et al. - 2004 |

39 | Unifying divergence minimization and statistical inference via convex duality - Altun, Smola - 2006 |
Citation Context: ...e. a second set of observations drawn from the same distribution. While not the key focus of the present paper, such bounds can be used in the design of inference principles based on moment matching (Altun and Smola, 2006; Dudík and Schapire, 2006; Dudík et al., 2004). 4.2 Bound on the Unbiased Statistic and Test. While the previous bounds are of interest since the proof strategy can be used for general function classe...

37 | Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates - Anderson, Hall, et al. - 1994 |
Citation Context: ...nd Tajvidi (2002) consider only tens of points in their experiments. Yet another approach is to use some distance (e.g. $L_1$ or $L_2$) between Parzen window estimates of the densities as a test statistic (Anderson et al., 1994; Biau and Györfi, 2005), based on the asymptotic distribution of this distance given $p = q$. When the $L_2$ norm is used, the test statistic is related to those we present here, although it is arrived at...

36 | Injective Hilbert space embeddings of probability measures - Sriperumbudur, Gretton, et al. - 2008 |

34 | On the Estimation of the Discrepancy between Empirical Curves of Distribution for Two Independent Samples. Bulletin Mathématique de l'Université de Moscou - Smirnov - 1939 |

30 | Support Vector Learning. R. Oldenbourg - Schölkopf - 1997 |

28 | Nonparametric quantile estimation - Takeuchi, Le, et al. - 2007 |

27 | Bhattacharyya and expected likelihood kernels - Jebara, Kondor - 2003 |

23 | Universal kernels - Micchelli, Xu, et al. |

20 | A consistent test for bivariate dependence - Feuerverger - 1993 |
Citation Context: ...KHSs with Gaussian kernels, the empirical $\Delta^2$ may also be interpreted in terms of a smoothed difference between the joint empirical characteristic function (ECF) and the product of the marginal ECFs (Feuerverger, 1993; Kankainen, 1995). This interpretation does not hold in all cases, however, e.g. for kernels on strings, graphs, and other structured spaces. An illustration of the witness function $f \in \mathcal{F}$ from Defini...

20 | A Distribution Free Version of the Smirnov Two Sample Test in the p-Variate Case - Bickel - 1969 |

19 | On the asymptotic properties of a nonparametric L1-test of homogeneity - Biau, Györfi - 2005 |

19 | Consistent Testing of Total Independence Based on the Empirical Characteristic Function - Kankainen - 1995 |
Citation Context: ...kernels, the empirical $\Delta^2$ may also be interpreted in terms of a smoothed difference between the joint empirical characteristic function (ECF) and the product of the marginal ECFs (Feuerverger, 1993; Kankainen, 1995). This interpretation does not hold in all cases, however, e.g. for kernels on strings, graphs, and other structured spaces. An illustration of the witness function $f \in \mathcal{F}$ from Definition 2 is provide...

17 | Permutation Tests for Equality of Distributions in High-Dimensional Settings - Hall, Tajvidi - 2002 |

17 | Integral probability metrics and their generating classes of functions - Müller - 1997 |
Citation Context: ... class in the finite sample setting. We thus define a more general class of statistic, for as yet unspecified function classes $\mathcal{F}$, to measure the disparity between $p$ and $q$ (Fortet and Mourier, 1953; Müller, 1997). Definition 2 Let $\mathcal{F}$ be a class of functions $f : \mathcal{X} \to \mathbb{R}$ and let $p, q, X, Y$ be defined as above. We define the maximum mean discrepancy (MMD) as $\mathrm{MMD}[\mathcal{F}, p, q] := \sup_{f \in \mathcal{F}} \left( \mathbf{E}_{x \sim p}[f(x)] - \mathbf{E}_{y \sim q}[f(y)] \right)$. (1) ...
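
The attribute-matching step mentioned in the abstract and in the Kuhn (1955) citation context pairs each attribute of one table with an attribute of another by minimizing a sum of squared distances between kernel mean embeddings. The sketch below illustrates that cost matrix and the resulting assignment under several assumptions: one-dimensional attributes, a Gaussian kernel with bandwidth `sigma = 1.0`, and hypothetical function names (`mmd2_biased_1d`, `match_attributes`). For brevity it searches all permutations by brute force, which is only practical for a handful of attributes. It is a stand-in for the Hungarian method and not an implementation of it.

```python
import itertools
import math
import random


def mmd2_biased_1d(xs, ys, sigma=1.0):
    """Squared distance between empirical kernel means of two 1-D samples.

    Uses a Gaussian kernel; kernel and bandwidth are assumptions.
    """
    def k(a, b):
        return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))

    m, n = len(xs), len(ys)
    k_xx = sum(k(a, b) for a in xs for b in xs) / (m * m)
    k_yy = sum(k(a, b) for a in ys for b in ys) / (n * n)
    k_xy = sum(k(a, b) for a in xs for b in ys) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy


def match_attributes(columns_a, columns_b, sigma=1.0):
    """Return (perm, cost) where column i of A is matched to column perm[i] of B.

    cost[i][j] is the squared kernel-mean distance between A's column i and
    B's column j. The minimizing permutation is found by exhaustive search,
    which is factorial in the number of columns.
    """
    n = len(columns_a)
    if len(columns_b) != n:
        raise ValueError("both tables need the same number of columns")
    cost = [[mmd2_biased_1d(a, b, sigma) for b in columns_b]
            for a in columns_a]
    best = min(itertools.permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return list(best), cost


if __name__ == "__main__":
    rng = random.Random(0)
    means = [0.0, 5.0, 10.0]
    table_a = [[rng.gauss(mu, 1.0) for _ in range(40)] for mu in means]
    # Table B holds the same attributes in a scrambled column order.
    order = [2, 0, 1]
    table_b = [[rng.gauss(means[k], 1.0) for _ in range(40)] for k in order]
    perm, _ = match_attributes(table_a, table_b)
    print("column i of A is matched to column perm[i] of B:", perm)
```

For realistic numbers of attributes, a dedicated linear assignment solver would replace the permutation search while keeping the same cost matrix.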