## Positive Definite Kernels in Machine Learning (2009)

### BibTeX

```bibtex
@MISC{Cuturi09positivedefinite,
  author = {Marco Cuturi},
  title  = {Positive Definite Kernels in Machine Learning},
  year   = {2009}
}
```

### Abstract

This survey is an introduction to positive definite kernels and the set of methods they have inspired in the machine learning literature, namely kernel methods. We first discuss some properties of positive definite kernels as well as reproducing kernel Hilbert spaces, the natural extension of the set of functions {k(x, ·), x ∈ X} associated with a kernel k defined on a space X. We discuss at length the construction of kernel functions that take advantage of well-known statistical models. We provide an overview of numerous data-analysis methods which take advantage of reproducing kernel Hilbert spaces, and discuss the idea of combining several kernels to improve the performance on certain tasks. We also provide a short cookbook of different kernels which are particularly useful for certain data types, such as images, graphs or speech segments.

**Remark:** This report is a draft. Comments and suggestions will be highly appreciated.

**Summary:** We provide in this survey a short introduction to positive definite kernels and the set of methods they have inspired in machine learning, also known as kernel methods. The main idea behind kernel methods is the following. Most data-inference tasks aim at defining an appropriate decision function f on a set of objects of interest X. When X is a vector space of dimension d, say R^d, linear functions f_a(x) = aᵀx are among the simplest and best understood choices, notably for regression, classification or dimensionality reduction. Given a positive definite kernel k on X, that is, a real-valued function on X × X which quantifies effectively how similar two points x and y are through the value k(x, y), kernel methods are algorithms which estimate functions f of the form

f : x ∈ X ↦ f(x) = ∑_{i∈I} αi k(xi, x).   (1)
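To make Equation (1) concrete, here is a minimal sketch of evaluating such a kernel expansion, using a Gaussian kernel; the anchor points and weights are arbitrary illustrative values, not taken from the survey.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def kernel_expansion(anchors, alphas, sigma=1.0):
    """Return the function f(x) = sum_i alpha_i k(x_i, x) of Equation (1)."""
    def f(x):
        return sum(a * rbf_kernel(xi, x, sigma) for a, xi in zip(alphas, anchors))
    return f

# toy usage: three anchor points in R^2 with fixed weights
anchors = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
alphas = [1.0, -0.5, 0.25]
f = kernel_expansion(anchors, alphas, sigma=0.5)
print(f(np.array([0.0, 0.0])))  # dominated by the first anchor
```

Any algorithm that only touches the data through values k(x, y) can output a function of this form, which is the common thread of the methods the survey covers.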

### Citations

9002 | The Nature of Statistical Learning Theory
- Vapnik
- 1995
Citation context: ...pper, 2002) or the popular support vector machine (Cortes and Vapnik, 1995). The theoretical justifications for such tools can be found in the statistical learning literature (Cucker and Smale, 2002; Vapnik, 1998) but also in subsequent convergence and consistency analysis carried out for specific techniques (Fukumizu et al., 2007; Vert and Vert, 2005; Bach, 2008b). Kernel design embodies the research trend p...

3701 | Convex Optimization - Boyd, Vandenberghe - 2004

2175 | Support-vector networks
- Cortes, Vapnik
- 1995
Citation context: ...el machines as introduced in Section 4. Examples of the latter include algorithms such as gaussian processes with sparse representations (Csató and Opper, 2002) or the popular support vector machine (Cortes and Vapnik, 1995). The theoretical justifications for such tools can be found in the statistical learning literature (Cucker and Smale, 2002; Vapnik, 19...

2026 |
Principal Component Analysis
- Jolliffe
- 1986
Citation context: ...ncipal components of a data sample. This approach can be naturally generalized to a “kernelized algorithm”. Principal component analysis can be used as a novelty detection tool for multivariate data (Jolliffe, 2002, §10.1) assuming the underlying data can be reasonably approximated by a Gaussian distribution. Given a sample X = (x1, · · · , xn) of n points drawn i.i.d. from the distribution of interest in R^d ...

1858 | Regression shrinkage and selection via the Lasso
- Tibshirani
- 1996
Citation context: ...= X∗, the dual of X, and c(f(x), y) = (y − f(x))², minimizing Rλc is known as least-squares regression when λ = 0; ridge regression (Hoerl, 1962) when λ > 0 and J is the Euclidean 2-norm; the lasso (Tibshirani, 1996) when λ > 0 and J is the 1-norm. • When X = [0, 1], Y = R, F is the space of m-times differentiable functions on [0, 1] and J = ∫_[0,1] (f⁽ᵐ⁾(t))² dt, we obtain regression by natural splines of o...
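The regularized risks listed in this snippet have a particularly simple instance in the kernel setting: with squared loss and an rkHs-norm penalty, the representer theorem gives f = ∑ᵢ αi k(xi, ·) with α = (K + λI)⁻¹y in closed form. A sketch under that convention (scaling of λ differs across texts); the data and parameter values are illustrative only.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_ridge_fit(X, y, lam=0.1, sigma=1.0):
    """Minimize sum_i (y_i - f(x_i))^2 + lam ||f||_Hk^2 over f in Hk.
    The minimizer is f = sum_i alpha_i k(x_i, .) with
    alpha = (K + lam I)^{-1} y."""
    K = gaussian_gram(X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

# toy 1D regression: fit a sine curve sampled on [0, 1]
X = np.linspace(0, 1, 20)[:, None]
y = np.sin(2 * np.pi * X[:, 0])
alpha = kernel_ridge_fit(X, y, lam=1e-3, sigma=0.2)
K = gaussian_gram(X, sigma=0.2)
print(np.max(np.abs(K @ alpha - y)))  # small in-sample residual
```

Setting lam = 0 recovers least-squares regression in the span of the kernel functions; the lasso case replaces the rkHs norm by a 1-norm on α and no longer has a closed form.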

1281 |
Spline Models for Observational Data
- Wahba
- 1990
Citation context: ...l studied field of mathematics known as approximation theory, rooted a few centuries ago in polynomial interpolation of given couples of points, and developed in statistics through spline regression (Wahba, 1990) and basis expansions (Hastie et al., 2001, §5). empirical risk minimization: statistical learning theory starts its course when a probabilistic knowledge about the generation of the points (x, y) is...

1052 | Nonlinear component analysis as a kernel Eigenvalue problem - Smola, J, et al. - 1998

970 | The use of multiple measurements in taxonomic problems - Fisher - 1936

784 |
Theory of reproducing kernels
- Aronszajn
- 1950
Citation context: ...pace H exists is a reproducing kernel, and we usually write Hk for this space, which is unique. It turns out that both Definitions 1 and 2 are equivalent, a result known as the Moore-Aronszajn theorem (Aronszajn, 1950). First, a reproducing kernel is p.d., since it suffices to write the expansion of Equation (2) to obtain the squared norm of the function ∑_{i=1}^n ci k(xi, ·), that is ∑_{i,j=1}^n ci cj k(xi, xj) = ‖∑_i ci k(xi, ...
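On a finite sample, the positive definiteness condition — ∑_{i,j} ci cj k(xi, xj) ≥ 0 for all coefficients — says exactly that the Gram matrix has no negative eigenvalues, which is easy to check numerically. A small sketch (the tolerance and the example kernels are illustrative):

```python
import numpy as np

def is_positive_semidefinite(K, tol=1e-10):
    """Check that c^T K c >= 0 for all c, i.e. that all eigenvalues of the
    symmetric matrix K are >= 0 (up to numerical tolerance)."""
    return bool(np.min(np.linalg.eigvalsh(K)) >= -tol)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)  # squared distances

# Gram matrix of the Gaussian kernel: always positive semidefinite
K_gauss = np.exp(-sq)
print(is_positive_semidefinite(K_gauss))

# the function -||x - y||^2 alone is NOT a p.d. kernel
print(is_positive_semidefinite(-sq))
```

This eigenvalue test only certifies positive definiteness on the sample at hand; proving it for the kernel function itself requires an argument like the expansion quoted above.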

590 |
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond
- Schölkopf, Smola
- 2001
Citation context: ... a brief cookbook of kernels in Section 6, that is a short description of kernels for complex objects such as strings, texts, graphs and images. This survey is built on earlier references, notably (Schölkopf and Smola, 2002; Schölkopf et al., 2004; Shawe-Taylor and Cristianini, 2004). Whenever adequate we have tried to enrich this presentation with slightly more theoretical insights from (Berg et al., 1984; Berlinet and...

396 | Exploiting generative models in discriminative classifiers - Jaakkola, Haussler - 1999

387 |
Learning to classify text using support vector machines
- Joachims
- 2002
Citation context: ...ts: most kernels used in practice on texts stem from the use of the popular bag-of-words (BoW) representations, that is sparse word count vectors taken against very large dictionaries. The monograph (Joachims, 2002) shows how the variations of the BoW can be used in conjunction with simple kernels such as the ones presented in Section 3.1.2. From a methodological point of view, much of the approach relies rathe...

375 | An introduction to kernel-based learning algorithms
- Müller, Mika, et al.
- 2001
Citation context: ...esentation with slightly more theoretical insights from (Berg et al., 1984; Berlinet and Thomas-Agnan, 2003), notably in Section 2. Topics covered in this survey overlap with some of the sections of (Müller et al., 2001) and more recently (Hofmann et al., 2008). The latter references cover in more detail kernel machines, such as the support vector machine for binary or multi-class classification. This presentation i...

368 | Convolution kernels on discrete structures - Haussler - 1999

366 | The pyramid match kernel: discriminative classification with sets of image features - Grauman, Darrell - 2005

324 | Kernel independent component analysis
- Bach, Jordan
- 2002
Citation context: ... kernel CCA: the second optimization, first coined kernel-CCA by Akaho (2001), is ill-posed if Equation (13) is used directly with a finite sample, and requires a regularization as explained in (Bach and Jordan, 2002; Fukumizu et al., 2007). Namely, the direct maximization (f, g) = argmax_{f∈HX, g∈HY} corrⁿ_{X,Y}[f, g] / √(varⁿX[f] varⁿY[g]) is likely to result in degenerated directions where varⁿX[f] or varⁿY[g] i...

319 | Fisher discriminant analysis with kernels - Mika, Rätsch, et al. - 1999

313 | Regularization theory and neural networks architectures
- Girosi, Jones, et al.
- 1995
Citation context: ... dt, we obtain regression by natural splines of order m. This setting actually corresponds to the usage of thin-plate splines, which can also be regarded as an rkHs-type method (Wahba, 1990); see (Girosi et al., 1995, Table 3) for other examples. • When X is an arbitrary set endowed with a kernel k and Y = {−1, 1}, F = Hk, J = ‖ · ‖Hk and the hinge loss c(f(x), y) = (1 − yf(x))₊ is used, we obtain the support vec...

299 | Choosing multiple parameters for support vector machines
- Chapelle, Vapnik, et al.
- 2002
Citation context: ...tion error over a grid of acceptable parameters is a reasonable method which often yields satisfactory results. This approach becomes intractable when the number of parameters exceeds but a few values. (Chapelle et al., 2002; Bousquet and Herrmann, 2003; Frölich et al., 2004) and more recently (Keerthi et al., 2007) have proposed different schemes to tune the parameters of a Gaussian kernel on R^d. The authors usually as...

289 |
The spectrum kernel: a string kernel for SVM protein classification
- Leslie, Eskin, et al.
- 2002
Citation context: ...crete and few, the easiest approach is arguably to map them as histograms of shorter substrings, also known as n-grams, and compare those histograms directly. This approach was initially proposed by (Leslie et al., 2002) with subsequent refinements to either incorporate more knowledge about token transitions (Cuturi and Vert, 2005; Leslie et al., 2003) or improve computational speed (Teo and Vishwanathan, 20...
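The spectrum-kernel idea described in this snippet — map strings to n-gram histograms and take an inner product between them — fits in a few lines. A deliberately naive sketch (no sparse data structures or suffix-tree speedups):

```python
from collections import Counter

def ngram_counts(s, n=3):
    """Histogram of all contiguous substrings of length n in s."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def spectrum_kernel(s, t, n=3):
    """k(s, t) = sum over n-grams u of count_s(u) * count_t(u),
    i.e. a linear kernel between the two n-gram histograms."""
    cs, ct = ngram_counts(s, n), ngram_counts(t, n)
    return sum(c * ct[u] for u, c in cs.items())

print(spectrum_kernel("GATTACA", "ATTACCA", n=3))  # 3 shared 3-grams
```

Because the kernel is an explicit inner product of histogram vectors, its positive definiteness is immediate; the refinements cited above mainly change how n-grams are counted or matched.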

280 |
Some results on Tchebycheffian spline functions
- Kimeldorf, Wahba
- 1971
Citation context: ...inite sample of points. This result is known as the representer theorem and explains why so many linear algorithms can be “kernelized” when trained on finite datasets. Theorem 5 (Representer Theorem (Kimeldorf and Wahba, 1971)) Let X be a set endowed with a kernel k and Hk its corresponding rkHs. Let {xi}1≤i≤n be a finite set of points of X and let Ψ : R^{n+1} → R be any function that is strictly increasing with respect to ...

277 | Multiple kernel learning, conic duality, and the SMO algorithm
- Bach, Lanckriet, et al.
- 2004
Citation context: ...te programming to compute optimal linear combinations of kernels, the shift of study has progressively evolved towards computationally efficient alternatives to define useful additive mixtures as in (Bach et al., 2004; Sonnenburg et al., 2006; Rakotomamonjy et al., 2007). A theoretical foundation for this line of research can be found in Micchelli and Pontil (2006). We follow the exposition used in (Rakotomamonjy ...
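The additive mixtures in this snippet rest on the fact that a nonnegative combination ∑_m dm Km of valid Gram matrices is again a valid Gram matrix; multiple kernel learning algorithms then optimize the weights dm jointly with the classifier. A sketch of the combination step only — the optimization itself is what the cited papers differ on — with illustrative kernels and weights:

```python
import numpy as np

def combine_kernels(grams, weights):
    """Return sum_m d_m K_m for nonnegative weights d_m.
    Any such combination of p.s.d. Gram matrices is again p.s.d."""
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0), "weights must be nonnegative"
    return sum(d * K for d, K in zip(weights, grams))

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K_linear = X @ X.T            # linear kernel
K_gauss = np.exp(-sq / 2)     # Gaussian kernel
K = combine_kernels([K_linear, K_gauss], [0.3, 0.7])
print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)  # still p.s.d.
```

Each Km can encode a different modality or similarity notion, which is what makes the mixture viewpoint attractive for heterogeneous data.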

255 |
Functions of positive and negative type and their connection with the theory of integral equations
- Mercer
- 1909
Citation context: ... which is non-negative. To prove the opposite in a general setting, that is not limited to the case where X is compact, which is the starting hypothesis of the Mercer representation theorem (Mercer, 1909) reported in (Schölkopf and Smola, 2002, p.37), we refer the reader to the progressive construction of the rkHs associated with a kernel k and its index set X presented in (Berlinet and Thomas-Agnan,...

224 | On the mathematical foundations of learning
- Cucker, Smale
- 2001
Citation context: ...esentations (Csató and Opper, 2002) or the popular support vector machine (Cortes and Vapnik, 1995). The theoretical justifications for such tools can be found in the statistical learning literature (Cucker and Smale, 2002; Vapnik, 1998) but also in subsequent convergence and consistency analysis carried out for specific techniques (Fukumizu et al., 2007; Vert and Vert, 2005; Bach, 2008b). Kernel design embodies the re...

224 | Large scale multiple kernel learning
- Sonnenburg, Rätsch, et al.
- 2006
Citation context: ...ompute optimal linear combinations of kernels, the shift of study has progressively evolved towards computationally efficient alternatives to define useful additive mixtures as in (Bach et al., 2004; Sonnenburg et al., 2006; Rakotomamonjy et al., 2007). A theoretical foundation for this line of research can be found in Micchelli and Pontil (2006). We follow the exposition used in (Rakotomamonjy et al., 2007). Recall, as...

199 | Anomaly detection: A survey
- Chandola, Banerjee, et al.
- 2009
Citation context: ...) f ∈ Hn, f(xi) ≤ ρ − ξi, ξi ≥ 0. novelty detection and kernel-PCA: novelty detection refers to the task of detecting patterns in a given data set that do not conform to an established normal behavior (Chandola et al., 2009). Novelty detection can be implemented in practice by using the level sets of a density estimator. A new observation is intuitively labelled as abnormal if it lies within a region of low density of t...
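As a baseline illustration of the density level-set view described in this snippet, here is a sketch that models the "normal" class with a single Gaussian and flags points whose squared Mahalanobis distance exceeds a threshold; this is the simple parametric baseline, not the one-class kernel machine itself, and the threshold and data are illustrative.

```python
import numpy as np

def fit_gaussian(X):
    """Sample mean and inverse covariance of the training data."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X.T))
    return mu, cov_inv

def is_novel(x, mu, cov_inv, threshold=9.0):
    """Flag x as novel if its squared Mahalanobis distance to the fitted
    Gaussian exceeds the threshold, i.e. x falls outside a density level set."""
    d = x - mu
    return bool(d @ cov_inv @ d > threshold)

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))          # observed "normal" behaviour
mu, cov_inv = fit_gaussian(X)
print(is_novel(np.array([0.1, -0.2]), mu, cov_inv))  # typical point
print(is_novel(np.array([8.0, 8.0]), mu, cov_inv))   # far outlier
```

Replacing the Gaussian density by a kernel-based estimate of the support of the distribution yields the one-class formulation whose constraints are quoted at the start of this snippet.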

192 | A discriminative framework for detecting remote protein homologies - Jaakkola, Diekhans, et al. - 2000

184 | Diffusion kernels on graphs and other discrete input spaces
- Kondor, Lafferty
- 2002
Citation context: ...he square map r : M → M² can be used on a matrix (Schölkopf et al., 2002). More complex constructions are the computation of the diffusion kernel on elements of a graph through its Laplacian matrix (Kondor and Lafferty, 2002), or direct transformations of the kernel matrix through unlabelled data (Sindhwani et al., 2005). strict and semi-definite positiveness: functions for which the sum in Equation (2) is (strictly) pos...
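The diffusion kernel mentioned in this snippet, K = e^{−βL} with L the graph Laplacian, can be computed directly by eigendecomposition of L. A sketch on a small path graph (the graph and the value of β are illustrative):

```python
import numpy as np

def diffusion_kernel(adjacency, beta=0.5):
    """K = exp(-beta * L) where L = D - A is the graph Laplacian,
    computed via the eigendecomposition of the symmetric L."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    w, V = np.linalg.eigh(L)
    return (V * np.exp(-beta * w)) @ V.T

# path graph on 4 nodes: 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
K = diffusion_kernel(A, beta=0.5)
print(K[0, 1] > K[0, 3])  # similarity decays with graph distance
```

Since e^{−βL} has eigenvalues e^{−βλ} > 0, the resulting matrix is positive definite by construction, which is what makes it usable as a kernel on the graph's nodes.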

159 | Using the Fisher kernel method to detect remote protein homologies - Jaakkola, Diekhans, et al. - 1999

159 | The context-tree weighting method: basic properties
- Willems, Shtarkov, et al.
- 1995
Citation context: ... structure of the chain and mixtures of Dirichlet priors for the transition parameters. This setting yields closed computational formulas for the kernel through previous work led in universal coding (Willems et al., 1995; Catoni, 2004). The computations can be carried in a number of elementary operations that is linear in the lengths of the inputs x and y. marginalized kernels: in the framework of sequence analysis f...

155 | Consistency of the group lasso and multiple kernel learning
- Bach
Citation context: ...learning literature (Cucker and Smale, 2002; Vapnik, 1998) but also in subsequent convergence and consistency analysis carried out for specific techniques (Fukumizu et al., 2007; Vert and Vert, 2005; Bach, 2008b). Kernel design embodies the research trend pioneered in Jaakkola and Haussler (1999); Haussler (1999); Watkins (2000) of incorporating contextual knowledge on the objects of interest to define ker...

144 | Marginalized kernels between labelled graphs
- Kashima, Tsuda, et al.
- 2003
Citation context: ... elementary operations that is linear in the lengths of the inputs x and y. marginalized kernels: in the framework of sequence analysis first (Tsuda et al., 2002b), and then in comparisons of graphs (Kashima et al., 2003), further attention was given to latent variable models to define kernels in a way that also generalized the Fisher kernel. In a latent variable model, the probability of emission of an element x is ...

135 | Sparseness of support vector machines
- Steinwart
- 2003
Citation context: ...licitly replaced by a finite expansion f = ∑_{i=1}^n ai k(xi, ·), (17) and the corresponding set of feasible solutions f ∈ H by f ∈ Hn and more simply a ∈ R^n using Equation (17). The reader may consult (Steinwart and Christmann, 2008) for an exhaustive treatment. kernel graph inference: we quote another example of a supervised rkHs method. In the context of supervised graph inference, Vert and Yamanishi (2005) consider a set o...

128 | Sparse on-line Gaussian processes
- Csató, Opper
- 2002
Citation context: ...h as the one we cover in Section 6, paired with efficient kernel machines as introduced in Section 4. Examples of the latter include algorithms such as gaussian processes with sparse representations (Csató and Opper, 2002) or the popular support vector machine (Cortes and Vapnik, 1995). The theoretical justifications for such tools can be found in the statistical learning literature (Cucker and Smale, 2002; Vapnik, 19...

126 | Transformation invariance in pattern recognition - tangent distance and tangent propagation
- Simard, Cun, et al.
- 1998
Citation context: ...nted, as above, as a 3D histogram of red/green/blue intensities. Although this representation incurs considerable loss of information, it is often used for image retrieval. The tangent distance (Simard et al., 1998) is a distance computed between two shapes x and y to assess how different they are from each other by finding an optimal series of elementary transformations (rotations, translations) that produces y ...

125 | Mismatch string kernels for svm protein classification
- Leslie, Eskin, et al.
Citation context: ...ograms directly. This approach was initially proposed by (Leslie et al., 2002) with subsequent refinements to either incorporate more knowledge about token transitions (Cuturi and Vert, 2005; Leslie et al., 2003) or improve computational speed (Teo and Vishwanathan, 2006). higher level transition modeling with HMMs: rather than using simple n-gram count descriptors, Jaakkola et al. (2000) use more elaborat...

122 | Dynamic alignment kernels
- Watkins
- 1999
Citation context: ...ntify their similarity. Since the computation of the tangent distance requires the optimization of a criterion, the framework is related to other attempts at designing a kernel from a distance, e.g. (Watkins, 2000; Shimodaira et al., 2002a; Vert et al., 2004; Cuturi, 2007; Cuturi et al., 2007). Figure 5: A complex image such as the monkey above can be summarized through color histograms, represented, as abo...

118 | Dimensionality reduction for supervised learning with reproducing kernel hilbert spaces
- Fukumizu, Bach, et al.
- 2004
Citation context: ... ψi(·) = kY(yi, ·) − (1/n) ∑_{j=1}^n kY(yj, ·) are the centered projections of (xi) and (yj) in HX and HY respectively. The topic of supervised dimensionality reduction, explored in (Fukumizu et al., 2004), is also linked to the kernel-CCA approach. The authors look for a sparse representation of the data that will select an effective subspace for X and delete all directions in X that are not correlate...

108 | Beyond the point cloud: from transductive to semi-supervised learning
- Sindhwani, Niyogi, et al.
- 2005
Citation context: ...gate decision functions trained on the separated modalities. A wide range of techniques have been designed to do so through convex optimization and the use of unlabelled data (Lanckriet et al., 2004; Sindhwani et al., 2005). Kernels can thus be seen as atomic elements that focus on certain types of similarities for the objects, which can be combined through so-called multiple kernel learning methods as will be exposed ...

104 | Probability product kernels - Jebara, Kondor, et al.

99 |
Kernel Methods for Pattern Analysis
- Shawe-Taylor, Cristianini
- 2004
Citation context: ...a short description of kernels for complex objects such as strings, texts, graphs and images. This survey is built on earlier references, notably (Schölkopf and Smola, 2002; Schölkopf et al., 2004; Shawe-Taylor and Cristianini, 2004). Whenever adequate we have tried to enrich this presentation with slightly more theoretical insights from (Berg et al., 1984; Berlinet and Thomas-Agnan, 2003), notably in Section 2. Topics covered i...

98 | Learning the kernel function via regularization - Micchelli, Pontil

97 | Reproducing Kernel Hilbert Spaces in Probability and Statistics - Berlinet, Thomas-Agnan - 2004

96 |
Branching processes with biological applications
- Jagers
Citation context: ...dtkt. t∈S The weight dt penalizes the complexity of a given subtree t by considering its number of nodes. In practice the weights dt are defined with an analogy to branching process priors for trees (Jagers, 1975). Bach (2008a) proposed a similar setting to optimize directly the weights dt using a variation of the Multiple Kernel Learning framework. The originality of the approach is to take advantage of t...

87 | Diffusion kernels on statistical manifolds - Lafferty, Lebanon - 2005

81 |
Statistical learning theory and stochastic optimization, ser
- Catoni
Citation context: ...n and mixtures of Dirichlet priors for the transition parameters. This setting yields closed computational formulas for the kernel through previous work led in universal coding (Willems et al., 1995; Catoni, 2004). The computations can be carried in a number of elementary operations that is linear in the lengths of the inputs x and y. marginalized kernels: in the framework of sequence analysis first (Tsuda et...

80 | Positive definite functions on spheres - Schoenberg - 1942

77 | Exploring Large Feature Spaces with Hierarchical Multiple Kernel - Bach

76 |
Marginalized kernels for biological sequences
- Tsuda, Kin, et al.
Citation context: ...kθ̂(x, y) = exp(−(1/σ²) (∇θ̂ ln pθ(x) − ∇θ̂ ln pθ(y))ᵀ Jθ̂⁻¹ (∇θ̂ ln pθ(x) − ∇θ̂ ln pθ(y))), (6) extensions to the Fisher kernel: the proposal of the Fisher kernel fostered further research, notably in (Tsuda et al., 2002a; Smith and Gales, 2002). The motivation behind these contributions was to overcome the limiting assumption that the parameter θ̂ on which the score vectors are evaluated is unique and fits the whol...

74 |
Matrix Mathematics: Theory, Facts and Formulas
- Bernstein
- 2009
Citation context: ...erties of the set of kernel functions such as its closure under pointwise and tensor products are directly inherited from well known results in Kronecker and Schur (or Hadamard) algebras of matrices (Bernstein, 2005, §7). kernel matrices created using other kernel matrices: kernel matrices for a sample X can be obtained by applying transformations r that conserve positive definiteness to a prior Gram matrix KX. ...