Results 1–10 of 22
Sparse Bayesian Learning and the Relevance Vector Machine
, 2001
Abstract

Cited by 552 (5 self)
This paper introduces a general Bayesian framework for obtaining sparse solutions to regression and classification tasks utilising models linear in the parameters. Although this framework is fully general, we illustrate our approach with a particular specialisation that we denote the `relevance vector machine' (RVM), a model of identical functional form to the popular and state-of-the-art `support vector machine' (SVM). We demonstrate that by exploiting a probabilistic Bayesian learning framework, we can derive accurate prediction models which typically utilise dramatically fewer basis functions than a comparable SVM while offering a number of additional advantages. These include the benefits of probabilistic predictions, automatic estimation of `nuisance' parameters, and the facility to utilise arbitrary basis functions (e.g. non-`Mercer' kernels).
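The evidence-maximisation loop behind the RVM can be sketched in a few lines. This is a hedged illustration of the standard re-estimation rules (function names, the pruning cap, and the jitter constants are illustrative, not the paper's code):

```python
import numpy as np

def rvm_regression(Phi, t, n_iter=50, alpha_prune=1e6):
    """Sketch of sparse Bayesian (RVM) regression via evidence maximisation.

    Phi: (N, M) design matrix of basis functions; t: (N,) targets.
    Iterates the classic re-estimation rules for the per-weight precisions
    alpha_i (the ARD prior) and the noise precision beta; weights whose
    alpha diverges are effectively pruned, which is the source of sparsity.
    """
    N, M = Phi.shape
    alpha = np.ones(M)   # per-weight prior precisions
    beta = 1.0           # noise precision
    for _ in range(n_iter):
        A = np.diag(alpha)
        Sigma = np.linalg.inv(beta * Phi.T @ Phi + A)  # posterior covariance
        mu = beta * Sigma @ Phi.T @ t                  # posterior mean
        gamma = 1.0 - alpha * np.diag(Sigma)           # "well-determined" measure
        alpha = gamma / (mu ** 2 + 1e-12)              # alpha re-estimate
        beta = (N - gamma.sum()) / (np.sum((t - Phi @ mu) ** 2) + 1e-12)
        alpha = np.minimum(alpha, alpha_prune)         # cap diverging alphas
    relevant = alpha < alpha_prune                     # surviving basis functions
    return mu, relevant
```

Basis functions whose `alpha` hits the cap contribute (numerically) nothing to predictions; the survivors play the role of the "relevance vectors".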
Fast Marginal Likelihood Maximisation for Sparse Bayesian Models
 Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics
, 2003
Abstract

Cited by 65 (0 self)
The 'sparse Bayesian' modelling approach, as exemplified by the 'relevance vector machine', enables sparse classification and regression functions to be obtained by linearly weighting a small number of fixed basis functions from a large dictionary of potential candidates. Such a model conveys a number of advantages over the related and very popular 'support vector machine', but the necessary 'training' procedure, optimisation of the marginal likelihood function, is typically much slower. We describe a new and highly accelerated algorithm which exploits recently elucidated properties of the marginal likelihood function to enable maximisation via a principled and efficient sequential addition and deletion of candidate basis functions.
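The sequential add/delete step hinges on two per-basis quantities from the paper, a sparsity factor s_i and a quality factor q_i. A hedged sketch of just the decision rule (computing s_i and q_i from the current model is omitted; the function name is illustrative):

```python
def update_basis(alpha_i, s_i, q_i, in_model):
    """Decision rule of the fast marginal-likelihood algorithm (sketch).

    If q_i^2 > s_i the marginal likelihood has a finite maximum in alpha_i,
    so basis i is added (or its alpha re-estimated) at alpha = s^2/(q^2 - s);
    otherwise the maximum is at alpha -> infinity and basis i is deleted.
    """
    if q_i ** 2 > s_i:
        action = 're-estimate' if in_model else 'add'
        return (action, s_i ** 2 / (q_i ** 2 - s_i))
    if in_model:
        return ('delete', float('inf'))
    return ('skip', float('inf'))
```

Cycling this rule over all candidate basis functions, each step provably increasing the marginal likelihood, is what makes the procedure both principled and fast.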
Moderating the Outputs of Support Vector Machine Classifiers
 IEEE Transactions on Neural Networks
, 1999
Abstract

Cited by 42 (3 self)
In this paper, we extend the use of moderated outputs to the support vector machine (SVM) by making use of a relationship between the SVM and the evidence framework. The moderated output is more in line with the Bayesian idea that the posterior weight distribution should be taken into account upon prediction, and it also alleviates the usual tendency of assigning overly high confidence to the estimated class memberships of the test patterns. Moreover, the moderated output derived here can be taken as an approximation to the posterior class probability. Hence, meaningful rejection thresholds can be assigned and outputs from several networks can be directly compared. Experimental results on both artificial and real-world data are also discussed. Keywords: support vector machine, evidence framework, moderated output, Bayesian.
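In the evidence framework, a standard way to moderate a discriminant output is MacKay's approximation for squashing a Gaussian-distributed activation through a logistic sigmoid. A minimal sketch of that formula (illustrative of the idea, not the paper's exact SVM derivation):

```python
import numpy as np

def moderated_output(mean_a, var_a):
    """MacKay-style moderated class probability (sketch).

    mean_a, var_a: posterior mean and variance of the activation a(x).
    The factor kappa shrinks the activation toward zero as the posterior
    variance grows, so uncertain inputs get probabilities closer to 0.5
    instead of overconfident hard labels.
    """
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var_a / 8.0)
    return 1.0 / (1.0 + np.exp(-kappa * mean_a))
```

Because the output approximates a posterior class probability, it supports the rejection thresholds and cross-network comparisons the abstract mentions.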
Bayesian support vector regression using a unified loss function
 IEEE Transactions on Neural Networks
, 2004
Abstract

Cited by 20 (2 self)
In this paper, we use a unified loss function, called the soft insensitive loss function, for Bayesian support vector regression. We follow standard Gaussian processes for regression to set up the Bayesian framework, in which the unified loss function is used in the likelihood evaluation. Under this framework, the maximum a posteriori estimate of the function values corresponds to the solution of an extended support vector regression problem. The overall approach has the merits of support vector regression such as convex quadratic programming and sparsity in solution representation. It also has the advantages of Bayesian methods for model adaptation and error bars of its predictions. Experimental results on simulated and real-world data sets indicate that the approach works well even on large data sets.
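The soft insensitive loss function interpolates between the epsilon-insensitive and Huber-style losses: zero inside a shrunken tube, quadratic in a smoothing band around the tube edge, linear beyond it. A minimal sketch of its piecewise form (parameter names and defaults are illustrative):

```python
def soft_insensitive_loss(delta, eps=0.1, beta=0.3):
    """Soft insensitive loss (sketch) for a residual delta.

    Zero for |delta| <= (1-beta)*eps, quadratic in the band up to
    (1+beta)*eps, and linear (|delta| - eps) beyond; the pieces join
    with matching value and slope, giving a differentiable likelihood.
    """
    a = abs(delta)
    if a <= (1 - beta) * eps:
        return 0.0                                        # inside the tube
    if a < (1 + beta) * eps:
        return (a - (1 - beta) * eps) ** 2 / (4 * beta * eps)  # smoothing band
    return a - eps                                        # linear tail
```

The smooth quadratic band is what makes the negative log-likelihood differentiable everywhere, which the Gaussian-process treatment in the paper relies on.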
Model Selection for Support Vector Machine Classification
 Neurocomputing
, 2003
Abstract

Cited by 19 (2 self)
We address the problem of model selection for Support Vector Machine (SVM) classification. For a fixed functional form of the kernel, model selection amounts to tuning kernel parameters and the slack penalty coefficient C. We begin by reviewing a recently developed probabilistic framework for SVM classification. An extension to the case of SVMs with quadratic slack penalties is given and a simple approximation for the evidence is derived, which can be used as a criterion for model selection. We also derive the exact gradients of the evidence in terms of posterior averages and describe how they can be estimated numerically using Hybrid Monte Carlo techniques. Though computationally demanding, the resulting gradient ascent algorithm is a useful baseline tool for probabilistic SVM model selection, since it can locate maxima of the exact (unapproximated) evidence. We then perform extensive experiments on several benchmark data sets. The aim of these experiments is to compare the performance of probabilistic model selection criteria with alternatives based on estimates of the test error, namely the so-called “span estimate” and Wahba’s Generalized Approximate Cross-Validation (GACV) error. We find that all the “simple” model selection criteria (Laplace evidence approximations, and the Span and GACV error estimates) exhibit multiple local optima with respect to the hyperparameters. While some of these give performance that is competitive with results from other approaches in the literature, a significant fraction lead to rather higher test errors. The results for the evidence gradient ascent method show that the exact evidence also exhibits local optima, but these give test errors which are much less variable and also consistently lower than for the simpler model selection criteria.
Bayesian framework for least squares support vector machine classifiers, Gaussian processes and kernel Fisher discriminant analysis
 Neural Computation
, 2002
Abstract

Cited by 19 (7 self)
The Bayesian evidence framework has been successfully applied to the design of multilayer perceptrons (MLPs) in the work of MacKay. Nevertheless, the training of MLPs suffers from drawbacks like the non-convex optimization problem and the choice of the number of hidden units. In Support Vector Machines (SVMs) for classification, as introduced by Vapnik, a nonlinear decision boundary is obtained by mapping the input vector first in a nonlinear way to a high-dimensional kernel-induced feature space in which a linear large-margin classifier is constructed. Practical expressions are formulated in the dual space in terms of the related kernel function, and the solution follows from a (convex) quadratic programming (QP) problem. In Least Squares SVMs (LS-SVMs), the SVM problem formulation is modified by introducing a least squares cost function and equality instead of inequality constraints, and the solution follows from a linear system in the dual space. Implicitly, the least squares formulation corresponds to a regression formulation and is also related to kernel
Bayesian Support Vector Regression
 In Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics
, 2001
Abstract

Cited by 12 (2 self)
We show that the Bayesian evidence framework can be applied to both epsilon-support vector regression (epsilon-SVR) and nu-support vector regression (nu-SVR) algorithms. Standard SVR training can be regarded as performing level one inference of the evidence framework, while levels two and three allow automatic adjustments of the regularization and kernel parameters respectively, without the need of a validation set.
Bayesian Gaussian Process Classification with the EM-EP Algorithm
Abstract

Cited by 11 (1 self)
Gaussian process classifiers (GPCs) are Bayesian probabilistic kernel classifiers. In GPCs, the probability of belonging to a certain class at an input location is monotonically related to the value of some latent function at that location. Starting from a Gaussian process prior over this latent function, data are used to infer both the posterior over the latent function and the values of hyperparameters that determine various aspects of the function. Recently, the expectation propagation (EP) approach has been proposed to infer the posterior over the latent function. Based on this work, we present an approximate EM algorithm, the EM-EP algorithm, to learn both the latent function and the hyperparameters. This algorithm is found to converge in practice and provides an efficient Bayesian framework for learning hyperparameters of the kernel. A multi-class extension of the EM-EP algorithm for GPCs is also derived. In the experimental results, the EM-EP algorithms are as good as or better than other methods for GPCs or Support Vector Machines (SVMs) with cross-validation. Index Terms: Gaussian process classification, Bayesian methods, kernel methods, expectation propagation, EM-EP algorithm.
Bayesian approach to feature selection and parameter tuning for Support Vector Machine classifiers
Model-based transductive learning of the kernel matrix
 Machine Learning
, 2006
Abstract

Cited by 7 (1 self)
This paper addresses the problem of transductive learning of the kernel matrix from a probabilistic perspective. We place a Wishart process prior on the kernel matrix and construct a hierarchical generative model for kernel matrix learning. Specifically, we consider the target kernel matrix as a random matrix following the Wishart distribution with a positive definite parameter matrix and a degree of freedom. This parameter matrix, in turn, has the inverted Wishart distribution (with a positive definite hyperparameter matrix) as its conjugate prior, and the degree of freedom is equal to the dimensionality of the feature space induced by the target kernel. Formulating this as a missing-data problem, we devise an expectation-maximization (EM) algorithm to infer the missing data, the parameter matrix and the feature dimensionality in a maximum a posteriori (MAP) manner. Using different settings for the target kernel and hyperparameter matrices, our model can be applied to different types of learning problems. In particular, we consider its application in a semi-supervised learning setting and present two classification methods. Classification experiments are reported on some benchmark data sets with encouraging results. In addition, we also devise an EM algorithm for kernel matrix completion.
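The generative view of the target kernel matrix can be illustrated by sampling from a Wishart distribution: a draw with parameter matrix Sigma and dof degrees of freedom is K = G Gᵀ, where the columns of G are i.i.d. N(0, Sigma). A hedged sketch (this illustrates the prior only, not the paper's EM algorithm; names are illustrative):

```python
import numpy as np

def sample_wishart_kernel(Sigma, dof, rng):
    """Draw a kernel matrix K ~ Wishart(Sigma, dof) (sketch).

    Sigma: (n, n) positive definite parameter matrix; dof: degrees of
    freedom (>= n for an a.s. full-rank draw). E[K] = dof * Sigma, so
    Sigma sets the expected kernel up to scale. Returns a symmetric
    positive semidefinite (n, n) matrix.
    """
    L = np.linalg.cholesky(Sigma)               # Sigma = L @ L.T
    G = L @ rng.standard_normal((Sigma.shape[0], dof))  # columns ~ N(0, Sigma)
    return G @ G.T
```

In the paper's hierarchy, Sigma itself carries an inverted Wishart prior, which is the conjugacy the EM inference exploits.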