## A Hilbert space embedding for distributions (2007)


### Download Links

- [www.kyb.tuebingen.mpg.de]
- [www.kyb.mpg.de]
- [www.cs.cmu.edu]
- [www.gatsby.ucl.ac.uk]
- [eprints.pascal-network.org]
- DBLP

### Other Repositories/Bibliography

Venue: Algorithmic Learning Theory: 18th International Conference (ALT 2007)

Citations: 57 (28 self)

### BibTeX

```bibtex
@inproceedings{Smola07ahilbert,
  author    = {Alex Smola and Arthur Gretton and Le Song and Bernhard Schölkopf},
  title     = {A Hilbert space embedding for distributions},
  booktitle = {Algorithmic Learning Theory: 18th International Conference},
  year      = {2007},
  pages     = {13--31},
  publisher = {Springer-Verlag}
}
```


### Abstract

We describe a technique for comparing distributions without the need for density estimation as an intermediate step. Our approach relies on mapping the distributions into a reproducing kernel Hilbert space. Applications of this technique can be found in two-sample tests (determining whether two sets of observations arise from the same distribution), covariate shift correction, local learning, measures of independence, and density estimation. Kernel methods are widely used in supervised learning [1, 2, 3, 4]; however, they are much less established in the areas of testing, estimation, and analysis of probability distributions, where information-theoretic approaches [5, 6] have long been dominant. Recent examples include [7] in the context of constructing graphical models, [8] in the context of feature extraction, and [9] in the context of independent component analysis. These methods by and large share a common issue: to compute quantities such as the mutual information, entropy, or Kullback-Leibler divergence, they require sophisticated space partitioning and/or …
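The core idea the abstract describes — map each sample to its mean element in the RKHS and compare the means — can be sketched as a biased empirical estimate of the resulting two-sample statistic (the maximum mean discrepancy). This is an illustrative reconstruction, not the paper's code: the Gaussian kernel, the bandwidth, and all names here are our choices.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian RBF kernel matrix: k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :]
                - 2.0 * X @ Y.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased estimate of ||mu[X] - mu[Y]||^2 in the RKHS:
    mean(k(x, x')) - 2 mean(k(x, y)) + mean(k(y, y'))."""
    return (gaussian_kernel(X, X, sigma).mean()
            - 2.0 * gaussian_kernel(X, Y, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
mmd2_same = mmd2_biased(X, rng.normal(size=(200, 2)))          # both samples N(0, I)
mmd2_shift = mmd2_biased(X, rng.normal(loc=2.0, size=(200, 2)))  # second sample shifted
# mmd2_shift should come out clearly larger than mmd2_same
```

With samples from the same distribution the statistic is close to zero (up to the O(1/m) bias of this estimator); a mean shift drives the cross-term toward zero and the statistic up, which is what the two-sample test thresholds.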

### Citations

- 9738 | The Nature of Statistical Learning Theory - Vapnik - 2000
- 2196 | Learning with Kernels - Schölkopf, Smola - 2002
- 1551 | Probability inequalities for sums of bounded random variables - Hoeffding - 1963
- 1430 | Independent component analysis, a new concept - Comon - 1994
- 1225 | Spatial interaction and the statistical analysis of lattice systems - Besag - 1974
- 971 | On the uniform convergence of relative frequencies of events to their probabilities - Vapnik, Chervonenkis - 1971
- 849 | Kernel Methods for Pattern Analysis - Shawe-Taylor, Cristianini - 2004
- 670 | Approximation Theorems of Mathematical Statistics - Serfling - 1980
- 577 | Linear models and empirical Bayes methods for assessing differential expression in microarray experiments - Smyth - 2004
- 470 | Graphical Models, Exponential Families, and Variational Inference - Wainwright, Jordan - 2008
- 421 | Learning to Classify Text Using Support Vector Machines - Joachims - 2002
- 401 | On the method of bounded differences - McDiarmid - 1989
- 340 | Kernel independent component analysis - Bach, Jordan - 2002
- 308 | Gaussian Processes for Machine Learning - Rasmussen, Williams - 2006
- 296 | Methods of Information Geometry - Amari, Nagaoka - 2000
- 274 | Rademacher and Gaussian complexities: risk bounds and structural results - Bartlett, Mendelson - 2002
- 170 | On the influence of the kernel on the consistency of support vector machines - Steinwart - 2002
- 160 | Agglomerative information bottleneck - Slonim, Tishby - 1999
- 137 | Correcting sample selection bias by unlabeled data - Huang, Smola, et al.
- 136 | Support Vector Learning - Schölkopf - 1997
- 125 | Local learning algorithms - Bottou, Vapnik - 1992
- 121 | Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces - Fukumizu, Bach, et al. - 2004
- 118 | Multi-instance kernels - Gärtner, Flach, et al. - 2002
- 115 | Markov Fields on Finite Graphs and Lattices (unpublished manuscript) - Hammersley, Clifford - 1971
- 111 | Elements of Information Theory - Cover, Thomas - 1991
- 99 | Measuring statistical dependence with Hilbert-Schmidt norms - Gretton, Bousquet, et al. - 2005
- 96 | Near-optimal nonmyopic value of information in graphical models - Krause, Guestrin - 2005
- 74 | Rademacher penalties and structural risk minimization - Koltchinskii - 2001
- 70 | A kernel method for the two-sample problem - Gretton, Borgwardt, et al. - 2006
- 67 | Necessary and sufficient conditions for the uniform convergence of means to their expectations - Vapnik, Chervonenkis - 1981
- 66 | Thousands of samples are needed to generate a robust gene list for predicting outcome - Ein-Dor, Zuk, Domany
- 55 | Hilbertian metrics and positive definite kernels on probability measures - Hein, Bousquet - 2005
- 54 | Integrating structured biological data by kernel maximum mean discrepancy - Borgwardt, Gretton, et al. - 2006
- 39 | Kernel methods in machine learning - Hofmann, Schölkopf, et al. - 2008
- 39 | Unifying divergence minimization and statistical inference via convex duality - Altun, Smola - 2006
- 37 | Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates - Anderson, Hall, et al. - 1994
- 37 | Funktionen von beschränkter Variation in der Theorie der Gleichverteilung [Functions of bounded variation in the theory of uniform distribution] - Hlawka - 1961
- 35 | Supervised feature selection via dependence estimation - Song, Smola, et al. - 2007
- 32 | Replicated microarray data - Lönnstedt, Speed - 2002
- 29 | Maximum entropy distribution estimation with generalized regularization - Dudík, Schapire - 2006
- 27 | Correcting sample selection bias in maximum entropy density estimation - Dudík, Schapire, et al. - 2006
- 27 | Bhattacharyya and expected likelihood kernels - Jebara, Kondor - 2003
- 23 | Exponential families for conditional random fields - Altun, Hofmann, et al. - 2004
- 20 | A consistent test for bivariate dependence - Feuerverger - 1993
- 20 | Gene expression profiling predicts clinical outcome of breast cancer - van 't Veer, Dai, et al. - 2002
- 19 | Consistent Testing of Total Independence Based on the Empirical Characteristic Function - Kankainen - 1995
- 18 | Least dependent component analysis based on mutual information - Stögbauer, Kraskov, et al. - 2004
- 18 | A unifying framework for independent component analysis - Lee, Girolami, et al. - 1999
- 15 | Continuous Univariate Distributions, Volume 1 (2nd ed.) - Johnson, Kotz, et al. - 1994
- 13 | An efficient alternative to SVM based recursive feature elimination with applications in natural language processing and bioinformatics - Bedo, Sanderson, et al.