Gaussian processes for machine learning
, 2006
Cited by 152 (2 self)
Abstract. We give a basic introduction to Gaussian Process regression models. We focus on understanding the role of the stochastic process and how it is used to define a distribution over functions. We present the simple equations for incorporating training data and examine how to learn the hyperparameters using the marginal likelihood. We explain the practical advantages of Gaussian Process and end with conclusions and a look at the current trends in GP work.
Supervised learning in the form of regression (for continuous outputs) and classification (for discrete outputs) is an important constituent of statistics and machine learning, either for analysis of data sets, or as a subgoal of a more complex problem. Traditionally parametric¹ models have been used for this purpose. These have a possible advantage in ease of interpretability, but for complex data sets, simple parametric models may lack expressive power, and their more complex counterparts (such as feed forward neural networks) may not be easy to work with in practice. The advent of kernel machines, such as Support Vector Machines and Gaussian Processes has opened the possibility of flexible models which are practical to work with. In this short tutorial we present the basic idea on how Gaussian Process models can be used to formulate a Bayesian framework for regression. We will focus on understanding the stochastic process and how it is used in supervised learning. Secondly, we will discuss practical matters regarding the role of hyperparameters in the covariance function, the marginal likelihood and the automatic Occam’s razor. For broader introductions to Gaussian processes, consult [1], [2].
1 Gaussian Processes
In this section we define Gaussian Processes and show how they can very naturally be used to define distributions over functions. In the following section we continue to show how this distribution is updated in the light of training examples.
¹ By a parametric model, we here mean a model which during training “absorbs” the information from the training data into the parameters; after training the data can be discarded.
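As a rough illustration of the "simple equations for incorporating training data" and the marginal likelihood mentioned in this abstract, the NumPy sketch below computes a GP regression posterior and the log marginal likelihood. The squared-exponential covariance, the function names, and the default hyperparameter values are assumptions made for this example; the abstract does not specify them.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance between 1-D input arrays a and b (assumed kernel)."""
    d = a[:, None] - b[None, :]
    return signal_var * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_regression(x_train, y_train, x_test, noise_var=0.1, **kernel_args):
    """Return posterior mean, posterior covariance, and log marginal likelihood."""
    n = len(x_train)
    # Covariance of the noisy training targets.
    K = sq_exp_kernel(x_train, x_train, **kernel_args) + noise_var * np.eye(n)
    K_s = sq_exp_kernel(x_train, x_test, **kernel_args)
    K_ss = sq_exp_kernel(x_test, x_test, **kernel_args)

    # Cholesky factor K = L L^T, used in place of an explicit inverse.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # K^{-1} y

    mean = K_s.T @ alpha                 # posterior mean at the test inputs
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v                 # posterior covariance of the latent function

    # log p(y | X) = -1/2 y^T K^{-1} y - 1/2 log|K| - n/2 log(2*pi)
    log_ml = (-0.5 * y_train @ alpha
              - np.sum(np.log(np.diag(L)))
              - 0.5 * n * np.log(2.0 * np.pi))
    return mean, cov, log_ml
```

Hyperparameters such as `length_scale` and `noise_var` could then be chosen by maximizing `log_ml`, which is one common way to use the marginal likelihood.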
Implementing approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations: A manual for the inla-program
, 2008
Cited by 44 (13 self)
Structured additive regression models are perhaps the most commonly used class of models in statistical applications. It includes, among others, (generalised) linear models, (generalised) additive models, smoothing-spline models, state-space models, semiparametric regression, spatial and spatio-temporal models, log-Gaussian Cox-processes, geostatistical and geoadditive models. In this paper we consider approximate Bayesian inference in a popular subset of structured additive regression models, latent Gaussian models, where the latent field is Gaussian, controlled by a few hyperparameters and with non-Gaussian response variables. The posterior marginals are not available in closed form due to the non-Gaussian response variables. For such models, Markov chain Monte Carlo methods can be implemented, but they are not without problems, both in terms of convergence and computational time. In some practical applications, the extent of these problems is such that Markov chain Monte Carlo is simply not an appropriate tool for routine analysis. We show that, by using an integrated nested Laplace approximation and its simplified version, we can directly compute very accurate approximations to the posterior marginals. The main benefit of these approximations
Bayesian Model Assessment and Comparison Using Cross-Validation Predictive Densities
- Neural Computation
, 2002
Cited by 21 (9 self)
In this work, we discuss practical methods for the assessment, comparison, and selection of complex hierarchical Bayesian models. A natural way to assess the goodness of the model is to estimate its future predictive capability by estimating expected utilities. Instead of just making a point estimate, it is important to obtain the distribution of the expected utility estimate, as it describes the uncertainty in the estimate. The distributions of the expected utility estimates can also be used to compare models, for example, by computing the probability of one model having a better expected utility than some other model. We propose an approach using cross-validation predictive densities to obtain expected utility estimates and Bayesian bootstrap to obtain samples from their distributions. We also discuss the probabilistic assumptions made and properties of two practical cross-validation methods, importance sampling and k-fold cross-validation. As illustrative examples, we use MLP neural networks and Gaussian Processes (GP) with Markov chain Monte Carlo sampling in one toy problem and two challenging real-world problems.
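The Bayesian bootstrap step described in this abstract can be sketched as follows. This is a minimal example that assumes per-observation utilities (for instance, cross-validated log predictive densities) have already been computed. The function names are hypothetical, and the Dirichlet(1, ..., 1) weighting is the standard Bayesian bootstrap, not necessarily the exact variant used in the paper.

```python
import numpy as np

def bayesian_bootstrap_means(utilities, n_draws=4000, rng=None):
    """Draw samples from the distribution of the expected utility estimate."""
    rng = np.random.default_rng() if rng is None else rng
    utilities = np.asarray(utilities, dtype=float)
    # Each draw reweights the observations with Dirichlet(1, ..., 1) weights.
    weights = rng.dirichlet(np.ones(len(utilities)), size=n_draws)
    return weights @ utilities

def prob_first_better(utilities_a, utilities_b, n_draws=4000, rng=None):
    """Estimate the probability that model A has higher expected utility than model B.

    Assumes both utility arrays refer to the same observations (paired comparison).
    """
    diff = np.asarray(utilities_a, dtype=float) - np.asarray(utilities_b, dtype=float)
    return np.mean(bayesian_bootstrap_means(diff, n_draws, rng) > 0)
```

Working with the paired differences keeps the comparison on the same observations, so variation shared by both models cancels.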
Bayesian Approach for Neural Networks - Review and Case Studies
- Neural Networks
, 2001
Cited by 16 (9 self)
We give a short review on the Bayesian approach for neural network learning and demonstrate the advantages of the approach in three real applications. We discuss the Bayesian approach with emphasis on the role of prior knowledge in Bayesian models and in classical error minimization approaches. The generalization capability of a statistical model, classical or Bayesian, is ultimately based on the prior assumptions. The Bayesian approach permits propagation of uncertainty in quantities which are unknown to other assumptions in the model, which may be more generally valid or easier to guess in the problem. The case problems studied in this paper include a regression, a classification, and an inverse problem. In the most thoroughly analyzed regression problem, the best models were those with less restrictive priors. This emphasizes the major advantage of the Bayesian approach, that we are not forced to guess attributes that are unknown, such as the number of degrees of freedom in the model, non-linearity of the model with respect to each input variable, or the exact form for the distribution of the model residuals.
Bayesian Inference for the Uncertainty Distribution
, 2000
Cited by 16 (3 self)
In this paper we are interested in the case where the computer model is computationally expensive, to the extent that the Monte Carlo approach is not practical. This is because the sample of outputs will usually need to be large, to be certain of obtaining accurate inferences about the distribution of Y. Thus we need to find a way of learning about the uncertainty distribution without having to run the algorithm a large number of times. We consider a Bayesian approach, which uses the information from each single evaluation to learn about the algorithm as a whole, and so reduces the total number of evaluations needed. One immediate question is whether deriving the distribution of Y should be of interest when the model is unlikely to predict reality correctly. Firstly we note that even a good model can be rendered ineffective by an unknown input, if the resulting uncertainty in Y is high. In general our goal is simply to quantify the information that is lost by not knowing the exact value of an input in a model. A decision to invest more resources in learning the true value of an input could follow from an uncertainty analysis. However, we have made the simplification here that the user of the model would only evaluate the model at X given the value of X. Even if X is known, the user may choose to run the model at a range of inputs, dependent on X,
Variational Bayesian multinomial probit regression with Gaussian process priors
- Neural Computation
, 2005
Cited by 15 (6 self)
It is well known in the statistics literature that augmenting binary and polychotomous response models with Gaussian latent variables enables exact Bayesian analysis via Gibbs sampling from the parameter posterior. By adopting such a data augmentation strategy, dispensing with priors over regression coefficients in favour of Gaussian Process (GP) priors over functions, and employing variational approximations to the full posterior we obtain efficient computational methods for Gaussian Process classification in the multi-class setting. The model augmentation with additional latent variables ensures full a posteriori class coupling whilst retaining the simple a priori independent GP covariance structure from which sparse approximations, such as multi-class Informative Vector Machines (IVM), emerge in a very natural and straightforward manner. This is the first time that a fully Variational Bayesian treatment for multi-class GP classification has been developed without having to resort to additional explicit approximations to the non-Gaussian likelihood term. Empirical comparisons with exact analysis via MCMC and Laplace approximations illustrate the utility of the variational approximation as a computationally economic alternative to full MCMC and it is shown to be more accurate than the Laplace approximation.
Transformation-Invariant Clustering and Dimensionality Reduction Using EM
- IEEE Trans. Pattern Analysis and Machine Intelligence
, 2000
Cited by 12 (0 self)
Clustering and dimensionality reduction are simple, effective ways to derive useful representations of data, such as images. These procedures often are used as preprocessing steps for more sophisticated pattern analysis techniques. (In fact, these procedures often perform as well as or better than more sophisticated pattern analysis techniques.) However, in situations where each input has been randomly transformed (e.g., by translation, rotation and shearing in images), these methods tend to extract cluster centers and submanifolds that account for variations in the input due to transformations, instead of more interesting and potentially useful structure. For example, if images of a human face are clustered, it would be more useful for the different clusters to represent different poses and expressions, instead of different translations and rotations. We describe a way to add transformation invariance to mixture models, factor analyzers and mixtures of factor analyzers by approximatin...
Assessing approximations for Gaussian process classification
- In Advances in Neural Information Processing Systems
, 2005
Cited by 9 (2 self)
Gaussian processes are attractive models for probabilistic classification but unfortunately exact inference is analytically intractable. We compare Laplace’s method and Expectation Propagation (EP) focusing on marginal likelihood estimates and predictive performance. We explain theoretically and corroborate empirically that EP is superior to Laplace. We also compare to a sophisticated MCMC scheme and show that EP is surprisingly accurate. In recent years models based on Gaussian process (GP) priors have attracted much attention in the machine learning community. Whereas inference in the GP regression model with Gaussian noise can be done analytically, probabilistic classification using GPs is analytically intractable. Several approaches to approximate Bayesian inference have been suggested, including Laplace’s approximation, Expectation Propagation (EP), variational approximations and Markov chain Monte Carlo (MCMC) sampling, some of these in conjunction with generalisation bounds, online learning schemes and sparse approximations.
Estimating ratios of normalizing constants using linked importance sampling
, 2005
Cited by 8 (0 self)
Abstract. Ratios of normalizing constants for two distributions are needed in both Bayesian statistics, where they are used to compare models, and in statistical physics, where they correspond to differences in free energy. Two approaches have long been used to estimate ratios of normalizing constants. The ‘simple importance sampling’ (SIS) or ‘free energy perturbation’ method uses a sample drawn from just one of the two distributions. The ‘bridge sampling’ or ‘acceptance ratio’ estimate can be viewed as the ratio of two SIS estimates involving a bridge distribution. For both methods, difficult problems must be handled by introducing a sequence of intermediate distributions linking the two distributions of interest, with the final ratio of normalizing constants being estimated by the product of estimates of ratios for adjacent distributions in this sequence. Recently, work by Jarzynski, and independently by Neal, has shown how one can view such a product of estimates, each based on simple importance sampling using a single point, as an SIS estimate on an extended state space. This ‘Annealed Importance Sampling’ (AIS) method produces an exactly unbiased estimate for the ratio of normalizing constants even when the Markov transitions used do not reach equilibrium. In this paper, I show how a corresponding ‘Linked Importance Sampling’ (LIS) method can be constructed in which the estimates for individual ratios are similar to bridge sampling estimates. As a further
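The simple importance sampling (SIS) estimate described in this abstract can be illustrated with a small sketch: the ratio Z1/Z0 is estimated by averaging q1(x)/q0(x) over draws from the normalized version of q0. The two Gaussian-shaped unnormalized densities, the scale `s`, and the function names are assumptions chosen so that the true ratio is known; they do not come from the paper.

```python
import numpy as np

def sis_ratio_estimate(log_q0, log_q1, samples_p0):
    """Estimate Z1/Z0 as the average of q1(x)/q0(x) over draws x from p0 = q0/Z0."""
    log_w = log_q1(samples_p0) - log_q0(samples_p0)
    return np.mean(np.exp(log_w))

# Toy unnormalized densities (assumed for illustration):
#   q0(x) = exp(-x^2 / 2),         Z0 = sqrt(2*pi)
#   q1(x) = exp(-x^2 / (2*s^2)),   Z1 = s * sqrt(2*pi), so Z1/Z0 = s
s = 0.5

def log_q0(x):
    return -0.5 * x ** 2

def log_q1(x):
    return -0.5 * (x / s) ** 2

rng = np.random.default_rng(0)
samples = rng.standard_normal(100_000)   # exact draws from p0 = N(0, 1)
estimate = sis_ratio_estimate(log_q0, log_q1, samples)
print(estimate)                          # should be close to the true ratio s = 0.5
```

The estimate is well behaved here because q1 is narrower than q0. When the two distributions overlap poorly, the weights become highly variable, which is the situation the intermediate-distribution methods in the abstract are meant to address.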
Model-based transductive learning of the kernel matrix
- Machine Learning
, 2006
Cited by 7 (1 self)
This paper addresses the problem of transductive learning of the kernel matrix from a probabilistic perspective. We define the kernel matrix as a Wishart process prior and construct a hierarchical generative model for kernel matrix learning. Specifically, we consider the target kernel matrix as a random matrix following the Wishart distribution with a positive definite parameter matrix and a degree of freedom. This parameter matrix, in turn, has the inverted Wishart distribution (with a positive definite hyperparameter matrix) as its conjugate prior and the degree of freedom is equal to the dimensionality of the feature space induced by the target kernel. Resorting to a missing data problem, we devise an expectation-maximization (EM) algorithm to infer the missing data, parameter matrix and feature dimensionality in a maximum a posteriori (MAP) manner. Using different settings for the target kernel and hyperparameter matrices, our model can be applied to different types of learning problems. In particular, we consider its application in a semi-supervised learning setting and present two classification methods. Classification experiments are reported on some benchmark data sets with encouraging results. In addition, we also devise the EM algorithm for kernel matrix completion.

