## Integrating structured biological data by kernel maximum mean discrepancy (2006)

Venue: ISMB

Citations: 54 (15 self)

### BibTeX

@INPROCEEDINGS{Borgwardt06integratingstructured,

author = {Karsten M. Borgwardt and Arthur Gretton and Malte J. Rasch and Hans-Peter Kriegel and Bernhard Schölkopf and Alex J. Smola},

title = {Integrating structured biological data by kernel maximum mean discrepancy},

booktitle = {IN ISMB},

year = {2006},

pages = {2006},

publisher = {}

}

### Abstract

Motivation: Many problems in data integration in bioinformatics can be posed as one common question: Are two sets of observations generated by the same distribution? We propose a kernel-based statistical test for this problem, based on the fact that two distributions are different if and only if there exists at least one function having different expectation on the two distributions. Consequently we use the maximum discrepancy between function means as the basis of a test statistic. The Maximum Mean Discrepancy (MMD) can take advantage of the kernel trick, which allows us to apply it not only to vectors, but strings, sequences, graphs, and other common structured data types arising in molecular biology.

Results: We study the practical feasibility of an MMD-based test on three central data integration tasks: Testing cross-platform comparability of microarray data, cancer diagnosis, and data-content based schema matching for two different protein function classification schemas. In all of these experiments, including high-dimensional ones, MMD is very accurate in finding samples that were generated from the same distribution, and outperforms its best competitors.

Conclusions: We have defined a novel statistical test of whether two samples are from the same distribution, compatible with both multivariate and structured data, that is fast, easy to implement, and works well, as confirmed by our experiments.
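
As a rough illustration of the statistic described in the abstract, the sketch below computes an unbiased estimate of the squared MMD between two samples of vectors and turns it into a two-sample test. Several things here are assumptions rather than details taken from this record: the Gaussian RBF kernel and its bandwidth `sigma`, the function names `rbf_kernel`, `mmd2_unbiased`, and `permutation_p_value`, and above all the permutation-based calibration, which is a generic stand-in and not the acceptance threshold the authors derive. Any positive definite kernel on strings or graphs could be passed in place of `rbf_kernel`.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel matrix between rows of a and rows of b (assumed kernel choice)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def mmd2_unbiased(x, y, kernel=rbf_kernel):
    """Unbiased estimate of squared MMD; needs at least two points per sample."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m, n = len(x), len(y)
    kxx, kyy, kxy = kernel(x, x), kernel(y, y), kernel(x, y)
    # Within-sample averages exclude the diagonal terms k(x_i, x_i).
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * kxy.mean()

def permutation_p_value(x, y, n_perm=200, seed=0):
    """Permutation p-value for H0: p = q (illustrative calibration, not the paper's)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = mmd2_unbiased(x, y)
    pooled = np.vstack([x, y])
    m = len(x)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        # Recompute the statistic after randomly reassigning sample labels.
        stat = mmd2_unbiased(pooled[idx[:m]], pooled[idx[m:]])
        count += int(stat >= observed)
    return (count + 1) / (n_perm + 1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    same = permutation_p_value(rng.normal(size=(50, 5)), rng.normal(size=(50, 5)))
    diff = permutation_p_value(rng.normal(size=(50, 5)), rng.normal(loc=1.0, size=(50, 5)))
    print("p-value, same distribution:", same)
    print("p-value, shifted distribution:", diff)
```

A small p-value suggests rejecting the hypothesis that both samples come from the same distribution. The fixed bandwidth is a simplification; in practice the kernel and its parameters would need to be chosen for the data at hand.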

### Citations

2210 | Learning with kernels - Schölkopf, Smola - 2002 |

1551 | Probability Inequalities for Sums of Bounded Random Variables - Hoeffding - 1963 |

681 | Approximation Theorems of Mathematical Statistics - Serfling - 1980 |
Citation Context: ...ount by which one density exceeds the other, insofar as the smoothness constraint permits it. Although the expression of $\mathrm{MMD}^2(\mathcal{F}, X, Y)$ in Corollary 2.3 is the minimum variance unbiased estimate (Serfling, 1980), a more tractable unbiased expression can be found in the case where $m = n$, with a slightly higher variance (the distinction is in practice irrelevant, since the terms that differ decay much faster ...

497 | Real Analysis and Probability - Dudley - 1989 |
Citation Context: ...re space (assuming that it exists). Then one may rewrite MMD as $\mathrm{MMD}[\mathcal{F}, p, q] = \|\mu_p - \mu_q\|_{\mathcal{H}}$ (2). The main ideas for the proof can be summarized as follows. It is known from probability theory (Dudley, 2002, Lemma 9.3.2) that under the stated conditions, a sufficient condition for $p = q$ is that for all continuous functions $f$, we have $\int f\,dp = \int f\,dq$. Such functions $f$, however, can be arbitrarily well ap...

488 | Statistical Inference - Casella, Berger - 2002 |

288 | Gene expression correlates of clinical prostate cancer behavior - Singh - 2002 |

207 | A class of statistics with asymptotically normal distribution - Hoeffding - 1948 |

170 | On the influence of the kernel on the consistency of support vector machines - Steinwart - 2001 |

77 | Protein function prediction via graph kernels - Borgwardt, Ong, et al. |
Citation Context: ...ng $\|\phi(x) - \phi(x')\|^2 = k(x, x) + k(x', x') - 2k(x, x')$. Graph matching, however, is NP-hard, hence no such kernel can exist. That said, there exists a number of useful graph kernels. See e.g. (Borgwardt et al., 2005) for further details. 2.4 Kernel Choice. So far, we have focused on the case of universal kernels. These kernels have various favorable properties, including that universal kernels are strictly posi...

61 | Multivariate Generalizations of the Wald-Wolfowitz and Smirnov Two-Sample Tests - Friedman, Rafsky - 1979 |

55 | Hilbertian metrics and positive definite kernels on probability measures - Hein, Bousquet - 2005 |

46 | Getting the noise out of gene arrays - Marshall - 2004 |

39 | Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates - Anderson, Hall, et al. - 1994 |
Citation Context: ...to compute; Hall and Tajvidi (2002) consider only tens of points in their experiments. Another approach is to use some distance (e.g. L1 or L2) between estimates of the densities as a test statistic (Anderson et al., 1994; Biau and Gyorfi, 2005), based on the asymptotic distribution of this distance given $p = q$. One problem with the approach of Biau and Gyorfi (2005), however, is that it requires the space to be parti...

37 | Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns. Cancer Res - Gruvberger, Ringner, et al. - 2001 |

32 | Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes - Warnat, Eils, et al. - 2005 |

25 | Semigroup kernels on measures - Cuturi, Fukumizu, et al. - 2005 |
Citation Context: ...roduct between vectors obtained by connecting a point from one distribution to a point from the other. For detailed discussions of the problem of defining kernels between distributions and sets, see (Cuturi et al., 2005; Hein and Bousquet, 2005). 2.2 MMD Tests. We now propose a two-sample test based on the asymptotic distribution of an unbiased estimate of $\mathrm{MMD}^2$, which applies in the case where $\mathcal{F}$ is a unit ball in a...

20 | A Distribution Free Version of the Smirnov Two Sample Test in the p-Variate Case - Bickel - 1969 |

19 | On the asymptotic properties of a nonparametric L1-test of homogeneity - Biau, Györfi - 2005 |

17 | Permutation Tests for Equality of Distributions in High-Dimensional Settings - Hall, Tajvidi - 2002 |

16 | SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms - Bulcke, Leemput, et al. - 2006 |

14 | A Generalized T Test and Measure of Multivariate Dispersion - Hotelling - 1951 |
Citation Context: ...ethods Various empirical methods have been proposed to determine whether two distributions are different. The first test we consider, and the simplest, is a multivariate generalization of the t-test (Hotelling, 1951), which assumes both distributions are multivariate Gaussian with unknown, identical covariance structure. This test is not model-free in the sense of MMD (and the tests described below); indeed, it...

14 | Molecular profiling of diffuse large B-cell lymphoma identifies robust subtypes including one characterized by host inflammatory response - Monti, Savage, et al. - 2005 |
Citation Context: ...Wolf Smirnov. Same accepted: 100, 100, 95, 96. Same rejected: 0, 0, 5, 4. Different accepted: 0, 100, 0, 22. Different rejected: 100, 0, 100, 78. Table caption: Comparing samples from different and identical tumor subtypes of lymphoma (Monti et al., 2005). H0 is hypothesis that $p = q$. Repetitions 100, sample size (each) 25, dimension of sample vectors: 2,118. [...] lymphoma subtypes by using a combination of several clustering algorithms. Hence MMD confirms...

14 | Cross-platform comparability of microarray technology: Intra-platform consistency and appropriate data analysis procedures are essential - Shi, Tong, et al. - 2005 |
Citation Context: ...an extremely negative picture of cross-platform comparability (and hence the reliability and reproducibility) of microarray results, due to the various platforms and data analysis methods employed (Shi et al., 2005). It is therefore crucial for bioinformatics to develop computational methods that allow us to determine whether results achieved across platforms are comparable. In this article, we present a novel ...

11 | Convergence de la répartition empirique vers la répartition théorique - Fortet, Mourier - 1953 |

4 | Evaluation of the similarity of gene expression data estimated with SAGE and Affymetrix genechips - Ruissen, Ruijter, et al. - 2005 |

2 | Comparison of the predictive accuracy of DNA array-based multigene classifiers across cDNA arrays and Affymetrix genechips - Stec, Wang, et al. - 2005 |
Citation Context: ...sing given the high dimensionality of the data. As inter-platform comparability of microarray data is reported to be modest in many recent publications (van Ruissen et al., 2005; Carter et al., 2005; Stec et al., 2005), MMD is very successful in detecting these differences in our experiments. We also note that our sample sizes are relatively small, which makes problematic the assumption of both the MMD and Friedma...

1 | On the asymptotic properties of a nonparametric L1-test statistic of homogeneity - Biau, Györfi |

1 | Scaling to very very large corpora for natural language disambiguation - unknown authors |

1 | Accentuation automatique des textes par des méthodes probabilistes. Techniques et sciences informatique - unknown authors - 1994 |