## Bayesian kernel methods (2003)

Venue: LNAI 2600

Citations: 4 (0 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Smola03bayesiankernel,
  author    = {Alexander J. Smola and Bernhard Schölkopf},
  title     = {Bayesian kernel methods},
  booktitle = {LNAI 2600},
  year      = {2003},
  pages     = {65--117},
  publisher = {Springer}
}
```

### Abstract

Bayesian methods allow for a simple and intuitive representation of the function spaces used by kernel methods. This chapter describes the basic principles of Gaussian Processes, their implementation and their connection to other kernel-based Bayesian estimation methods, such as the Relevance Vector Machine.

### Citations

9021 | The Nature of Statistical Learning Theory
- Vapnik
- 1995
Citation Context: ... lie within an interval of ±σ around θML(Y) and where equal amounts of observations yi exceed θML(Y) by more than σ from above and below. ε-insensitive Density: For computational convenience Vapnik [75] introduced another variant of density model, based on the ε-insensitive loss function. It is essentially a Laplacian distribution, where in a neighborhood of size ε around its mean all data is equall...
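The ε-insensitive loss mentioned in this context translates directly into code. A minimal sketch (function names and the default ε are my own; the normalizer 2(1 + ε) follows from integrating exp(−loss) over the real line):

```python
import numpy as np

def eps_insensitive_loss(r, eps=0.1):
    """Vapnik's eps-insensitive loss: zero inside the tube |r| <= eps, linear outside."""
    return np.maximum(np.abs(r) - eps, 0.0)

def eps_insensitive_density(r, eps=0.1):
    """Density p(r) proportional to exp(-loss); the normalizer is 1 / (2 * (1 + eps))."""
    return np.exp(-eps_insensitive_loss(r, eps)) / (2.0 * (1.0 + eps))

# Inside the tube the density is flat, i.e. Laplacian-like with a plateau around the mean.
assert eps_insensitive_density(0.05) == eps_insensitive_density(-0.08)
```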

8603 | Elements of information theory
- Cover, Thomas
- 1991
Citation Context: ...ributions enjoy a high degree of popularity in Bayesian methods. Besides, the normal distribution is the least informative distribution (largest entropy) among all distributions with bounded variance [7]. As Figure 3 indicates, a single Gaussian may not always be sufficient to capture the important properties of p(Y,Y′|θ)p(θ). A more elaborate parametric model qφ(θ) of p(θ|Y,Y′), such as a mixture...

8142 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...imple enough q(θ). Furthermore, by maximizing L via a suitable choice of q, we maximize a lower bound on ln p(θ). Expectation Maximization: Another method for approximating p(Y,Y′) was suggested in [10], namely that one maximizes the integrand in (19) jointly over the unknown variable Y′ and the latent variable θ. While this is clearly not equivalent to solving the integral, there are many cases wh...

4855 | Neural Networks for Pattern Recognition
- Bishop
- 1994
Citation Context: ...t of observations X′ := {x′1, ..., x′m′}. For notational convenience we sometimes use Z := {(x1,y1), ..., (xm,ym)} instead of X, Y. We begin with an overview over the fundamental ideas (see also [3, 40, 46, 63, 56, 62] for more details). 2.1 Maximum Likelihood and Bayes Rule: Assume that we are given a set of observations Y which are drawn from a probability distribution pθ(Y). It then follows that for a given va...

4692 | Topics in Matrix Analysis
- Horn, Johnson
- 1991
Citation Context: ...add the kernel function chosen by positive diagonal pivoting [12] to the selected subset, in order to ensure that the n × n sub-matrix remains invertible. See numerical mathematics textbooks, such as [28], for more detail on update rules. 4.6 Hardness and Approximation Results: It is worthwhile to study the theoretical guarantees on the performance of the algorithm (as descri...

2044 | Online learning with kernels
- Kivinen, Smola, et al.
- 2004
Citation Context: ...es, and Relevance Vector Machines (Section 6), which assume that the contribution of each kernel function is governed by a normal distribution with its own variance. ⋆ The present article is based on [62]. S. Mendelson, A.J. Smola (Eds.): Advanced Lectures on Machine Learning, LNAI 2600, pp. 65–117, 2003. © Springer-Verlag Berlin Heidelberg 2003. Readers interested in a ...

1963 | Matrix Computations
- Golub, Loan
Citation Context: ...by $\tilde{K} = U^\top K_{\mathrm{sub}} U$ where $U \in \mathbb{R}^{n \times m}$ and $K_{\mathrm{sub}} \in \mathbb{R}^{n \times n}$ (73) with n ≪ m, however, we may compute (72) much more efficiently. For instance, it follows immediately from the Sherman-Morrison-Woodbury formula [22], $(V + RHR^\top)^{-1} = V^{-1} - V^{-1}R(H^{-1} + R^\top V^{-1}R)^{-1}R^\top V^{-1}$ (74), that we obtain the following update rule for $\tilde{K}$: $\alpha_{\mathrm{new}} = \big(\mathbf{1} - U^\top(K_{\mathrm{sub}}^{-1} + UCU^\top)^{-1}UC\big)\,(U^\top K_{\mathrm{sub}} U C \alpha_{\mathrm{old}} - c)$ (75). Strict...
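The Sherman-Morrison-Woodbury identity invoked in this context is easy to check numerically. A small sketch (matrix sizes and random test data are arbitrary choices for illustration, not from the chapter):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 3
V = np.diag(rng.uniform(1.0, 2.0, m))   # invertible m x m matrix
R = rng.standard_normal((m, n))
H = np.diag(rng.uniform(1.0, 2.0, n))   # invertible n x n matrix

# (V + R H R^T)^{-1} = V^{-1} - V^{-1} R (H^{-1} + R^T V^{-1} R)^{-1} R^T V^{-1}
Vinv, Hinv = np.linalg.inv(V), np.linalg.inv(H)
lhs = np.linalg.inv(V + R @ H @ R.T)
rhs = Vinv - Vinv @ R @ np.linalg.inv(Hinv + R.T @ Vinv @ R) @ R.T @ Vinv
assert np.allclose(lhs, rhs)
```

The practical point made by the context: inverting the left-hand side costs O(m³), while the right-hand side only requires an n × n inverse, which is cheap when n ≪ m.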

1675 | Atomic decomposition by basis pursuit
- Chen, Donoho, et al.
- 1998
Citation Context: ...) care about the behavior of the estimator on inputs x looking like faces. The specific benefit of this strategy is that it provides us with a correspondence between linear programming regularization [43, 2, 65, 6] and Bayesian priors over function spaces, by analogy to regularization in Reproducing Kernel Hilbert Spaces and Gaussian Processes. 5.1 Examples of Factorizing Priors: Let us now study some of the ...

1352 | Practical optimization
- Gill, Murray, et al.
- 1981
Citation Context: ...up to their expected value. This means that in order to obtain p(Y,Y′) we need to integrate out the latent variable θ. This is achieved as follows: $p(Y,Y') = \int p(Y,Y',\theta)\,d\theta = \int p(Y,Y'|\theta)\,p(\theta)\,d\theta$ (19). Eq. (19) may or may not be computable in closed form. Hence there exist various strategies to deal with the problem of obtaining p(Y,Y′). We list some of them below. Exact Solution: If we can solve...
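The marginalization integral in (19) can be illustrated with a toy Monte Carlo sketch. The conjugate normal model below is my own choice, picked only because the marginal is available in closed form for comparison:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: theta ~ N(0, 1) prior, y | theta ~ N(theta, 1).
# The marginal p(y) = integral of p(y|theta) p(theta) dtheta is then N(0, 2).
def p_y_given_theta(y, theta):
    return np.exp(-0.5 * (y - theta) ** 2) / np.sqrt(2 * np.pi)

y = 0.7
theta_samples = rng.standard_normal(200_000)            # draws from the prior
mc_estimate = p_y_given_theta(y, theta_samples).mean()  # Monte Carlo estimate of p(y)
exact = np.exp(-0.25 * y ** 2) / np.sqrt(4 * np.pi)     # N(0, 2) density at y
assert abs(mc_estimate - exact) < 2e-3
```

When no such closed form exists, this sampling view is exactly what the MCMC strategies listed later in the contexts fall back on.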

1250 | Bayesian Data Analysis
- Gelman, Carlin, et al.
- 1995
Citation Context: ...s the special and somewhat simpler case of regression with additive Gaussian noise, since here the latent variables θ, θ′ can be integrated out. We refer the reader to [84, 15, 11, 54] and the references therein for integration methods based on Markov Chain Monte Carlo approximations (see also [63] for a more recent overview). More specifically, assume that both K and σ² (the addit...

1116 | Pattern Recognition and Neural Networks
- Ripley
- 1996
Citation Context: ...t of observations X′ := {x′1, ..., x′m′}. For notational convenience we sometimes use Z := {(x1,y1), ..., (xm,ym)} instead of X, Y. We begin with an overview over the fundamental ideas (see also [3, 40, 46, 63, 56, 62] for more details). 2.1 Maximum Likelihood and Bayes Rule: Assume that we are given a set of observations Y which are drawn from a probability distribution pθ(Y). It then follows that for a given va...

1082 | Practical Methods of Optimization
- Fletcher
- 1987
Citation Context: ...izer of the log posterior, it is far from clear that this update rule is always convergent (to prove the latter, we would need to show that the initial guess of α lies within the radius of attraction [53, 13, 19, 38]). Nonetheless, this approximation turns out to work in practice, and the implementation of the update rule is relatively simple. The major stumbling block if we want to apply (72) to large problems is...

1045 | Introduction to Linear and Nonlinear Programming
- Luenberger
- 1973
Citation Context: ...izer of the log posterior, it is far from clear that this update rule is always convergent (to prove the latter, we would need to show that the initial guess of α lies within the radius of attraction [53, 13, 19, 38]). Nonetheless, this approximation turns out to work in practice, and the implementation of the update rule is relatively simple. The major stumbling block if we want to apply (72) to large problems is...

929 | Emergence of simple-cell receptive field properties by learning a sparse code for natural images
- Olshausen, Field
- 1996
Citation Context: ...e call the latter an improper prior). The Laplacian prior corresponds to the regularization functional employed in sparse coding approaches, such as wavelet dictionaries [6], coding of natural images [48], independent component analysis [37], and linear programming regression [66, 65]. In the following, we focus on (100). It is straightforward to see that the MAP estimate can be obtained by minimizing...

835 | An Introduction to Variational Methods for Graphical Models
- Jordan, Ghahramani, et al.
- 1998
Citation Context: ...mprove the approximation of (19). A common strategy is to resort to variational methods. The details are rather technical and go beyond the scope of this section. The interested reader is referred to [36] for an overview, and to [4] for an application to the Relevance Vector Machine of Section 6. The following theorem describes the basic idea. Theorem 1 (Variational Approximation of Densities). Denote...

743 | Table of integrals, series, and products
- Gradshteyn, Ryzhik
- 1965
Citation Context: ...e inverse of σ and σ̄. Figure 12 depicts the scaling behavior for non-informative priors. Further choices are possible and can be obtained by consulting tables of Legendre transformations, such as in [23]. Fig. 12. p(θi) for a Gamma hyperprior (a = b = 10⁻⁴). Left: p(θi); Right: −log...

610 | Introduction to Numerical Analysis
- Stoer, Bulirsch
- 1992
Citation Context: ...ons of the negative log posterior, and minimize the latter iteratively. This strategy is referred to as the Laplace approximation [71, 84, 63]; the Newton-Raphson method, in numerical analysis (see [72, 53]); or the Fisher scoring method, in statistics. A necessary condition for the minimum of a differentiable function g is that its first derivative be 0. For convex functions, this requirement is also s...
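The successive-quadratic-approximation strategy described in this context is plain Newton-Raphson. A minimal sketch (the quadratic test objective is a hypothetical example, chosen so the exact minimizer is known; on a true quadratic, Newton converges in one step):

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Minimize a smooth convex function by successive quadratic approximations:
    repeat x <- x - H(x)^{-1} g(x) until the step is negligible."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(x), grad(x))
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

# Toy convex objective g(x) = 0.5 x^T A x - b^T x, minimized where A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_star = newton_minimize(lambda x: A @ x - b, lambda x: A, np.zeros(2))
assert np.allclose(A @ x_star, b)
```

For a negative log posterior the same loop applies with `grad` and `hess` replaced by its first and second derivatives, which is exactly the Laplace/Fisher-scoring recipe the context names.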

608 | Bayesian Learning for Neural Networks
- Neal
- 1996
Citation Context: ...t of observations X′ := {x′1, ..., x′m′}. For notational convenience we sometimes use Z := {(x1,y1), ..., (xm,ym)} instead of X, Y. We begin with an overview over the fundamental ideas (see also [3, 40, 46, 63, 56, 62] for more details). 2.1 Maximum Likelihood and Bayes Rule: Assume that we are given a set of observations Y which are drawn from a probability distribution pθ(Y). It then follows that for a given va...

581 | A Treatise on the Theory of Bessel Functions
- Watson
- 1962
Citation Context: ... (121). Performing a Laplace transform leads to $p(\theta_i) = \frac{1}{\sqrt{2\pi\theta_i}}\,\mathrm{BesselK}_0\!\big(\sqrt{8/\theta_i}\big)$ (122), where BesselK is the modified Bessel function of the second kind [79]. See Figure 11 for more properties of this function. Fig. 11. p(θi) for a normal hyperprior. Left: p(θi); Rig...

564 | Probabilistic inference using Markov chain Monte Carlo methods
- Neal
- 1993
Citation Context: ...θ) from p(Y,Y′,θ) and thereby performing inference about the distribution of Y′. Various methods to carry out such samplings exist. Typically one uses Markov Chain Monte Carlo (MCMC) methods. See [47] for details and further references. The advantage of such methods is that given sufficient computational resources, we are able to obtain a very good estimate on the distribution of Y′. Furthermore...

556 | Sparse Bayesian learning and the relevance vector machine
- Tipping
- 2001
Citation Context: ...roduce individual (hyper)parameters si. The resulting prior, $p(\theta|X,s) = (2\pi)^{-\frac{m}{2}} \left(\prod_{i=1}^m s_i\right)^{\frac{1}{2}} \exp\!\left(-\frac{1}{2}\sum_{i=1}^m s_i \theta_i^2\right)$ (101), leads to the construction of the Relevance Vector Machine [73] and very sparse function expansions. Finally, the assumption underlying the Laplacian prior (100) is that only very few basis functions will be nonzero. The specific form of the prior is why we will ...

549 | Mathematical Methods of Statistics
- Cramer
- 1946
Citation Context: ...Proof. To prove the first part, we need only check that y(x) and any linear combination $\sum_j y(x_j)$ (for arbitrary $x'_j \in X$) converge to a normal distribution. By application of a theorem of Cramér [8], this is sufficient to prove that y(x) is distributed according to a Gaussian Process. The random variable y(x) is a sum of m independent random variables with bounded variance (since k(x, x′) is b...

397 | Exploiting generative models in discriminative classifiers
- Jaakkola, Haussler
- 1998
Citation Context: ...$k(x,x') = \exp(-\omega\|x - x'\|)$ (Laplacian kernel) (42); $k(x,x') = (\langle x, x'\rangle + c)^p$ with $c > 0$ (Polynomial kernel) (43); and the Gaussian RBF kernel of (39). For further details on the choice of kernels see [62, 25, 78, 31, 52, 77] and the references therein. 3.4 Regression: Let us put the previous discussion to practical use. For the sake of simplicity, we begin with regression (we study the classification setting in the next s...
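The kernels in (42) and (43) are one-liners. A sketch, assuming the Gaussian RBF referred to as Eq. (39) has the common form exp(−ω‖x − x′‖²) (the chapter's exact parameterization may differ):

```python
import numpy as np

def laplacian_kernel(x, xp, omega=1.0):
    return np.exp(-omega * np.linalg.norm(x - xp))        # Eq. (42)

def polynomial_kernel(x, xp, c=1.0, p=2):
    return (np.dot(x, xp) + c) ** p                       # Eq. (43), requires c > 0

def gaussian_rbf_kernel(x, xp, omega=1.0):
    # Assumed parameterization of the Gaussian RBF kernel (39).
    return np.exp(-omega * np.linalg.norm(x - xp) ** 2)

x, xp = np.array([1.0, 0.0]), np.array([0.0, 1.0])
assert laplacian_kernel(x, x) == 1.0          # k(x, x) = 1 for the Laplacian kernel
assert polynomial_kernel(x, xp) == 1.0        # <x, xp> = 0, so (0 + 1)^2 = 1
```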

369 | Convolution kernels on discrete structures
- Haussler
- 1999
Citation Context: ...$k(x,x') = \exp(-\omega\|x - x'\|)$ (Laplacian kernel) (42); $k(x,x') = (\langle x, x'\rangle + c)^p$ with $c > 0$ (Polynomial kernel) (43); and the Gaussian RBF kernel of (39). For further details on the choice of kernels see [62, 25, 78, 31, 52, 77] and the references therein. 3.4 Regression: Let us put the previous discussion to practical use. For the sake of simplicity, we begin with regression (we study the classification setting in the next s...

318 | Sparse approximate solutions to linear systems
- Natarajan
Citation Context: ...heoretical guarantees on the performance of the algorithm (as described in Algorithm 1.1). It turns out that our technique closely resembles a Sparse Linear Approximation problem studied by Natarajan [44]: Given $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and $\epsilon > 0$, find $x \in \mathbb{R}^n$ with a minimal number of nonzero entries such that $\|Ax - b\|_2 \le \epsilon$. If we define $A = (\sigma^2 K + K^\top K)^{\frac{1}{2}}$ and $b := A^{-1}Ky$ (90), we may write L(α) = ...

284 | Using the Nyström method to speed up kernel machines, in Advances in Neural Information Processing Systems
- Williams, Seeger
Citation Context: ...y is to project k(xi,x) on a random subset of dimensions, and express the missing terms as a linear combination of the resulting sub-matrix (this is the Nyström method proposed by Seeger and Williams [83]). We might also construct a randomized sparse greedy algorithm to select the dimensions (see [68] for details), or resort to a positive diagonal pivoting strategy [12]. An approximation of K by its l...
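The Nyström idea sketched in this context — express the full kernel matrix through a randomly chosen sub-matrix — can be illustrated as follows. Kernel choice, data, and subset size are arbitrary; this is a sketch of the general technique, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 2))

def rbf(A, B, gamma=0.5):
    # Pairwise Gaussian RBF kernel matrix between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K = rbf(X, X)                                   # full m x m kernel matrix
idx = rng.choice(100, size=20, replace=False)   # random subset of n columns
C = K[:, idx]                                   # m x n cross block
W = K[np.ix_(idx, idx)]                         # n x n sub-matrix
K_nystrom = C @ np.linalg.pinv(W) @ C.T         # rank-n Nystrom approximation

rel_err = np.linalg.norm(K - K_nystrom) / np.linalg.norm(K)
assert rel_err < 0.5
```

Storage and arithmetic now scale with n ≪ m rather than m, which is the entire point of the approximation (73) discussed in the neighboring contexts.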

265 | Estimation with quadratic loss
- James, Stein
- 1961
Citation Context: ...ce we also plot the width of the confidence interval. If σ² were 0 we would obtain yi; however, with σ² > 0 we end up shrinking yi towards 0, similarly to the shrinkage estimator of James and Stein [33]. Other Additive Noise: While we may in general not be able to integrate out θ from p(Y|θ), the noise models of regression typically allow one to perform several simplifications when it comes to esti...
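The shrinkage effect described here is visible in the standard GP regression posterior mean at the training inputs, K(K + σ²1)⁻¹y; with the degenerate choice K = 1 (my own, purely for clarity) it reduces to y/(1 + σ²), i.e. plain shrinkage toward 0:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 5
K = np.eye(m)                 # toy kernel matrix k(xi, xj) = delta_ij, for clarity
y = rng.standard_normal(m)

# Posterior mean of GP regression at the training inputs: K (K + sigma^2 I)^{-1} y.
for sigma2 in (0.0, 1.0):
    mean = K @ np.linalg.solve(K + sigma2 * np.eye(m), y)
    if sigma2 == 0.0:
        assert np.allclose(mean, y)              # no noise: reproduce yi exactly
    else:
        assert np.all(np.abs(mean) < np.abs(y))  # noise: every yi shrunk toward 0
```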

223 | Gaussian processes for regression
- Williams, Rasmussen
- 1996
Citation Context: ...hem takes place by means of the covariance matrix of a normal distribution. It turns out that this is a convenient way of extending Bayesian modeling of linear estimators to nonlinear situations (cf. [82, 80, 63]). Furthermore, it represents the counterpart of the “kernel trick” in methods minimizing the regularized risk. We present the basic ideas, and relegate details on efficient implementation of the opti...

215 | The Relevance Vector Machine
- Tipping
- 2000
Citation Context: ...19). A common strategy is to resort to variational methods. The details are rather technical and go beyond the scope of this section. The interested reader is referred to [36] for an overview, and to [4] for an application to the Relevance Vector Machine of Section 6. The following theorem describes the basic idea. Theorem 1 (Variational Approximation of Densities). Denote by θ, Y random variables wi...

210 | Robust linear programming discrimination of two linearly inseparable sets
- Bennett, Mangasarian
- 1992
Citation Context: ...) care about the behavior of the estimator on inputs x looking like faces. The specific benefit of this strategy is that it provides us with a correspondence between linear programming regularization [43, 2, 65, 6] and Bayesian priors over function spaces, by analogy to regularization in Reproducing Kernel Hilbert Spaces and Gaussian Processes. 5.1 Examples of Factorizing Priors: Let us now study some of the ...

196 | Sequential updating of conditional probabilities on directed graphical structures. Networks
- Spiegelhalter, Lauritzen
- 1990
Citation Context: ...her. A possible solution is to make successive quadratic approximations of the negative log posterior, and minimize the latter iteratively. This strategy is referred to as the Laplace approximation [71, 84, 63]; the Newton-Raphson method, in numerical analysis (see [72, 53]); or the Fisher scoring method, in statistics. A necessary condition for the minimum of a differentiable function g is that its first d...

195 | Prediction with Gaussian processes: From linear regression to linear prediction and beyond
- Williams
- 1999
Citation Context: ...hem takes place by means of the covariance matrix of a normal distribution. It turns out that this is a convenient way of extending Bayesian modeling of linear estimators to nonlinear situations (cf. [82, 80, 63]). Furthermore, it represents the counterpart of the “kernel trick” in methods minimizing the regularized risk. We present the basic ideas, and relegate details on efficient implementation of the opti...

194 | Efficient svm training using low-rank kernel representations
- Fine, Scheinberg
- 2001
Citation Context: ...rather than O(m³). Methods that are numerically more stable than the Sherman-Morrison-Woodbury formula, yet efficient and easily implementable, exist; however, their discussion would be somewhat technical. See [69, 12, 21] for further details and references. There are several ways to obtain a good approximation of (73). One way is to project k(xi,x) on a random subset of dimensions, and express the missing terms as a l...

193 | Support vector method for function approximation, regression estimation, and signal processing
- Vapnik, Golowich, et al.
- 1997
Citation Context: ...e whole set. The advantage of this somewhat peculiar estimator is that optimization problems arising from it have a lower number of active constraints. This was exploited in Support Vector regression [75, 76]. Besides estimating the mean of a real-valued random variable y we may also have to deal with discrete-valued ones. For simplicity we consider only binary y, that is y ∈ {±1}. In this case it is most ...

190 | Feature selection via concave minimization and support vector machines
- Bradley, Mangasarian
- 1998
Citation Context: ...on the locations xi include $\gamma(\theta) = 1 - e^{-p|\theta|}$ with $p > 0$ (feature selection prior) (98); $\gamma(\theta) = \theta^2$ (weight decay prior) (99); $\gamma(\theta) = |\theta|$ (Laplacian prior) (100). The prior given by (98) was introduced in [5, 14] and is not log-concave. While the latter characteristic is unfavorable in general, since the corresponding optimization problem exhibits many local minima, the negative log-posterior becomes strictly con...

179 | Kernel Principal Component Analysis
- Schölkopf, Müller
- 1999
Citation Context: ...ocess. The use of (105) is impossible for GP priors, unless we diagonalize the matrix K explicitly and render it positive definite by replacing λi with |λi|. This is a very costly procedure (see also [61, 24]) as it involves computing the eigensystem of K. 5.3 Estimation: Since one of the aims of using a Laplacian prior on the coefficients θi is to achieve sparsity of the expansion, it does not appear sens...

178 | Sparse greedy matrix approximation for machine learning
- Smola, Schölkopf
- 2000
Citation Context: ...r combination of the resulting sub-matrix (this is the Nyström method proposed by Seeger and Williams [83]). We might also construct a randomized sparse greedy algorithm to select the dimensions (see [68] for details), or resort to a positive diagonal pivoting strategy [12]. An approximation of K by its leading principal components, as often done in machine learning, is usually undesirable, since the ...

154 | The evidence framework applied to classification networks
- MacKay
- 1992
Citation Context: ...tly normal, hence stem from an underlying Gaussian Process. We can calculate the covariance function k as follows. Let f(x) := (f1(x), ..., fn(x)); then $k(x,x') = \mathrm{Cov}(y(x), y(x')) = f(x)^\top \Sigma f(x')$ (41). In other words, starting from a parametric model, where we would want to estimate the coefficients β, we arrived at a Gaussian Process with covariance function f(x)⊤Σf(x′). One special case is of ...

151 | Bayesian methods for adaptive models
- MacKay
- 1991

149 | Handbook of Matrices
- Lütkepohl
- 1996
Citation Context: ...n Y′. We know that p(Y,Y′) is given by $(2\pi)^{-\frac{m+m'}{2}} (\det \Sigma)^{-\frac{1}{2}} \exp\!\left(-\frac{1}{2} \begin{pmatrix} Y - \mu_Y \\ Y' - \mu_{Y'} \end{pmatrix}^{\!\top} \begin{pmatrix} \Sigma_{YY} & \Sigma_{YY'} \\ \Sigma_{Y'Y} & \Sigma_{Y'Y'} \end{pmatrix}^{-1} \begin{pmatrix} Y - \mu_Y \\ Y' - \mu_{Y'} \end{pmatrix}\right)$. Writing out the inverse of Σ (see e.g. [39]) and collecting terms yields the above result. In the next section we will use the above example to perform Gaussian Process prediction. For the moment just note that once we know p(Y,Y′) it is very e...

149 | Spline models for observational data, volume 59
- Wahba
- 1990
Citation Context: ... $Y := K\alpha$ (36); $(\det K)^{-\frac{1}{2}} \exp\!\left(-\frac{1}{2}\alpha^\top K\alpha\right)$ (37). Taking logs, we see that this term is identical to the penalty term arising from the regularized risk framework (cf. the chapter on Support Vectors and [62, 77, 75, 26]). This result thus connects Gaussian process priors and estimators using the Reproducing Kernel Hilbert Space framework: kernels favoring smooth functions translate immediately into covariance kernel...

142 |
Matching pursuit in a time-frequency dictionary
- Mallat, Zhang
Citation Context: ...e expand the projection operator P into the matrix Pnew := [Pold, ei] ∈ R^{m×(n+1)} and seek the best ei such that Pnew minimizes minβ L(Pnewβ). Note that this method is very similar to Matching Pursuit [42] and to iterative reduced set Support Vector algorithms [60], with the difference that the target to be approximated (the full solution α) is only given implicitly via L(α). Recently Zhang [85] proved...

140 | Evaluation of Gaussian Processes and Other Methods for Non-Linear Regression
- Rasmussen
- 1996
Citation Context: ...s the special and somewhat simpler case of regression with additive Gaussian noise, since here the latent variables θ, θ′ can be integrated out. We refer the reader to [84, 15, 11, 54] and the references therein for integration methods based on Markov Chain Monte Carlo approximations (see also [63] for a more recent overview). More specifically, assume that both K and σ² (the addit...

133 | Bayesian classification with gaussian processes
- Williams, Barber
- 1998
Citation Context: ...s the special and somewhat simpler case of regression with additive Gaussian noise, since here the latent variables θ, θ′ can be integrated out. We refer the reader to [84, 15, 11, 54] and the references therein for integration methods based on Markov Chain Monte Carlo approximations (see also [63] for a more recent overview). More specifically, assume that both K and σ² (the addit...

123 | Maximum entropy discrimination
- Jaakkola, Meila, et al.
- 1999
Citation Context: ...erforming Bayesian inference. These work by sampling from the posterior distribution rather than computing an approximation of the mode. On the model side, the maximum entropy discrimination paradigm [64, 7, 30] is a worthy concept in its own right, powerful enough to spawn a whole family of new inference algorithms both with [30] and without [34] kernels. The main idea is to seek the least informative estim...

122 | Dynamic alignment kernels
- Watkins
- 1999
Citation Context: ...$k(x,x') = \exp(-\omega\|x - x'\|)$ (Laplacian kernel) (42); $k(x,x') = (\langle x, x'\rangle + c)^p$ with $c > 0$ (Polynomial kernel) (43); and the Gaussian RBF kernel of (39). For further details on the choice of kernels see [62, 25, 78, 31, 52, 77] and the references therein. 3.4 Regression: Let us put the previous discussion to practical use. For the sake of simplicity, we begin with regression (we study the classification setting in the next s...

113 | Input space vs. feature space in kernel-based methods
- Schölkopf, Mika, et al.
- 1999
Citation Context: ...Pold, ei] ∈ R^{m×(n+1)} and seek the best ei such that Pnew minimizes minβ L(Pnewβ). Note that this method is very similar to Matching Pursuit [42] and to iterative reduced set Support Vector algorithms [60], with the difference that the target to be approximated (the full solution α) is only given implicitly via L(α). Recently Zhang [85] proved lower bounds on the rate of sparse approximation schemes. I...

111 | Sparse greedy gaussian process regression
- Smola, Bartlett
- 2000
Citation Context: ...bound on the approximation quality of minima of quadratic forms and is thus applicable to (67). For convenience we rewrite (67) in terms of θ = Kα. Theorem 2 (Approximation Bounds for Quadratic Forms [67]). Denote by $K \in \mathbb{R}^{m \times m}$ a symmetric positive definite matrix, $y, \alpha \in \mathbb{R}^m$, and define the two quadratic forms $L(\alpha) := -y^\top K\alpha + \frac{1}{2}\alpha^\top(\sigma^2 K + K^\top K)\alpha$ (78); $L^*(\alpha) := -y^\top \alpha + \frac{1}{2}\alpha^\top(\sigma^2 \mathbf{1} + K)\alpha$ (79)...

109 | Probabilities for SV machines
- Platt
- 2000
Citation Context: ...probit model. Furthermore, it indicates that, should one use the linear soft margin as a cheap proxy for optimization purposes, the logistic is a more adequate model to fit the densities subsequently [51], whereas for the quadratic soft margin the probit model is to be preferred. 2.3 Inference: Besides the problem of finding suitable estimates of θ for pθ(Y), which can be used to understand the way o...

89 | An introduction to probabilistic graphical models
- Jordan
- 2002
Citation Context: ...iously chosen constant. Proponents of this strategy claim rapid convergence due to the good mixing properties of the dynamical system. Finally, we left the field of graphical models (see for instance [71, 32, 36, 35] and the references therein) completely untouched. These algorithms model the dependency structure between different random variables in a rather explicit fashion and use efficient approximate inferen...