## University of Cambridge.

### BibTeX

@MISC{_universityof,

author = {},

title = {University of Cambridge.},

year = {}

}

### Abstract

Bayesian inference offers us a powerful tool with which to tackle the problem of data modelling. However, the performance of Bayesian methods is crucially dependent on being able to find good models for our data. The principal focus of this thesis is the development of models based on Gaussian process priors. Such models, which can be thought of as the infinite extension of several existing finite models, have the flexibility to model complex phenomena while being mathematically simple. In this thesis, I present a review of the theory of Gaussian processes and their covariance functions and demonstrate how they fit into the Bayesian framework. The efficient implementation of a Gaussian process is discussed with particular reference to approximate methods for matrix inversion based on the work of Skilling (1993). Several regression problems are examined. Nonstationary covariance functions are developed for the regression of neuron spike data, and the use of Gaussian processes to model the potential energy surfaces of weakly bound molecules is discussed. Classification methods based on Gaussian processes are implemented using variational methods. Existing bounds (Jaakkola and Jordan 1996) for the sigmoid function are used to tackle binary problems, and multi-dimensional bounds on the softmax function are presented for the multiple class case. The performance of the variational classifier is compared with that of other methods using the CRABS and PIMA datasets (Ripley 1996), and the problem of predicting the cracking of welds based on their chemical composition is also investigated. The theoretical calculation of the density of states of crystal structures is discussed in detail. Three possible approaches to the problem are described based on free energy minimization, Gaussian processes and the theory of random matrices. Results from these approaches are compared with the state-of-the-art techniques (Pickard 1997).
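As a rough illustration of the regression setting the abstract describes, the sketch below computes the predictive mean of a zero-mean Gaussian process without forming an explicit matrix inverse. It is a minimal sketch under stated assumptions, not the thesis's implementation: the squared-exponential covariance, the hyperparameter values, the toy sine data, and the function names `sq_exp_cov`, `conjugate_gradient` and `gp_predict_mean` are all hypothetical choices made here, and the solver is a plain textbook conjugate gradient iteration rather than the approximate method of Skilling (1993).

```python
import math


def sq_exp_cov(xa, xb, length_scale=1.0, signal_var=1.0):
    """Squared-exponential (stationary) covariance between two scalar inputs."""
    return signal_var * math.exp(-0.5 * ((xa - xb) / length_scale) ** 2)


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive definite A, given only v -> A v."""
    n = len(b)
    max_iter = max_iter or 10 * n
    x = [0.0] * n
    r = list(b)                      # residual b - A x, starting from x = 0
    p = list(r)                      # current search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if math.sqrt(rs_old) < tol:  # residual norm small enough: stop
            break
        ap = matvec(p)
        step = rs_old / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * api for ri, api in zip(r, ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def gp_predict_mean(x_train, t_train, x_test, noise_var=1e-2):
    """Predictive mean k^T C_N^{-1} t of a zero-mean Gaussian process."""
    n = len(x_train)
    # C_N = covariance of the training inputs plus noise on the diagonal.
    c_n = [[sq_exp_cov(x_train[i], x_train[j]) + (noise_var if i == j else 0.0)
            for j in range(n)] for i in range(n)]

    def matvec(v):
        return [sum(cij * vj for cij, vj in zip(row, v)) for row in c_n]

    alpha = conjugate_gradient(matvec, t_train)   # alpha approximates C_N^{-1} t
    return [sum(sq_exp_cov(xs, xi) * ai for xi, ai in zip(x_train, alpha))
            for xs in x_test]


# Toy usage (assumed data): noise-free samples of sin(x) on [0, 5].
x_train = [0.25 * i for i in range(21)]
t_train = [math.sin(x) for x in x_train]
mean = gp_predict_mean(x_train, t_train, [1.1, 2.6])
```

The solver only ever touches the covariance matrix through matrix-vector products, which is the property that makes iterative schemes attractive when the number of training points is large; the predictive variance and the hyperparameter search are left out to keep the sketch short.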

### Citations

7493 |
Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988
Citation Context ...ch input. Many other models have been devised where the parameters and hyperparameters can be given physical meaning: Hierarchical mixtures of experts (Waterhouse and Robinson 1995), Belief networks (Pearl 1988) and certain fuzzy logic architectures (Mendel 1994) to name a few. Using models with interpretable parameters and hyperparameters also helps us express our priors easily as we can target the desired... |

2100 |
Matrix computations
- Golub, Van Loan
- 1983
Citation Context ... small the departure from perfect orthogonalization becomes far more pronounced. The error in the orthogonalization can be expressed in terms of the dot product of the normalized vectors $e_k$ and $e_{k+1}$ (Golub and Loan 1990): $|e_{k+1}^T e_k| = \frac{|g_{k+1}^T e_k| + 10^{1-p} \|C_N\|_2}{|g_{k+1}|}$ (3.58) where $p$ is the precision. $\|C_N\|_2$ is the 2-norm of $C_N$ which is equal to the largest eigenvalue of $C_N$. Thus the error in orthogonalizatio... |

935 | Reversible jump Markov chain Monte Carlo computation and Bayesian model determination
- Green
- 1995
Citation Context ...ve continuous searches over the hyperparameters and discrete searches over the architecture parameters. The continuous searches are preferable because, while automated discrete searches are possible (Green 1995), more sophisticated gradient based algorithms are available for continuous searches which are not appropriate for the discrete case. Ideally we would like to remove any uncertainty in architecture a... |

779 |
Methods of conjugate gradients for solving linear systems
- Hestenes, Stiefel
- 1952
Citation Context ...d a way to invert a matrix which does not exhibit such poor scaling with N. Many algorithms exist for the approximate inversion of matrices. Most of these are based on conjugate gradient techniques (Hestenes and Stiefel 1952) in the form of the Lanczos algorithm (Lanczos 1950). The goal of these algorithms is to find good approximations to a function of the inverse of a matrix (often the result of applying the inverse to... |

594 | Probabilistic inference using Markov chain Monte Carlo methods
- Neal
- 1993
Citation Context ...log posterior with respect to $\Theta$. A gradient based optimization algorithm can then be used to find the most probable hyperparameters or a Monte Carlo sampling method (such as Hybrid Monte Carlo (Neal 1993)) can be used to integrate over the hyperparameters. I shall refer to this as a direct implementation. Direct implementations can run into numerical problems when the ratio of the highest and lowest ... |

429 | A Practical Bayesian Framework for Backpropagation Networks
- MacKay
- 1992
Citation Context ...ariance function. Having given examples of both stationary and non-stationary covariance functions, I show how the hyperparameters of a Gaussian process can be determined using Evidence maximization (MacKay 1992a). I also discuss the pros and cons of Evidence maximization in comparison to a Markov chain Monte Carlo approach. Previous work on Gaussian processes is then briefly reviewed and the relationship of... |

230 | Gaussian processes for regression
- Williams, Rasmussen
- 1996
Citation Context ...he investigation of Gaussian processes is far from new, only recently have people viewed them from a Bayesian perspective with the aim of using them as models for regression and classification tasks (Williams and Rasmussen 1996; Barber and Williams 1996; Neal 1997). It is the aim of this thesis to extend this work with special emphasis on non-stationary forms of covariance functions, efficient implementation of the Gaussian... |

189 | Statistical theory of the energy levels of complex systems, I, II and III - Dyson - 1962 |

178 | Probability, Frequency and Reasonable Expectation - Cox - 1946 |

155 | Probable Networks and Plausible Predictions - A Review of Practical Bayesian Methods for Supervised Neural Networks
- MacKay
- 1995
Citation Context ...eural network do not tell us a great deal about the data but the hyperparameters, if defined in a sensible manner, can lead to a greater understanding. For example, automatic relevance determination (MacKay 1995b; Neal 1996), the process of determining whether individual inputs cause significant variation in the outputs, can be performed by specifying different regularization classes for weights connected to... |

147 | Evaluation of Gaussian processes and other methods for non-linear regression
- Rasmussen
- 1996
Citation Context ...ion of Gaussian processes is far from new, only recently have people viewed them from a Bayesian perspective with the aim of using them as models for regression and classification tasks (Williams and Rasmussen 1996; Barber and Williams 1996; Neal 1997). It is the aim of this thesis to extend this work with special emphasis on non-stationary forms of covariance functions, efficient implementation of the Gaussian... |

132 | Keeping neural networks simple by minimizing the description length of the weights - Hinton, Camp - 1993 |

128 | Monte Carlo implementation of Gaussian process models for Bayesian regression and classification
- Neal
- 1997
Citation Context ... only recently have people viewed them from a Bayesian perspective with the aim of using them as models for regression and classification tasks (Williams and Rasmussen 1996; Barber and Williams 1996; Neal 1997). It is the aim of this thesis to extend this work with special emphasis on non-stationary forms of covariance functions, efficient implementation of the Gaussian process framework for large training... |

115 | Autoencoders, minimum description length and Helmholtz free energy - Hinton, Zemel - 1993 |

85 | On the Distribution of Spacing Between Zeros of Zeta Function - Odlyzko - 1987 |

83 |
Iterative minimization techniques for ab initio total-energy calculations: molecular dynamics and conjugate gradients
- Payne, Teter, et al.
- 1992
Citation Context ...least, infeasibly computationally expensive have been tackled successfully. Within solid state physics, there has been much use of such ab initio techniques to investigate the properties of crystals (Payne et al. 1992). One area that has provoked interest has been that of calculating core-edge Electron Energy Loss Spectra (EELS). EELS are experimentally generated using Scanning Transmission Electron Microscopes fi... |

82 |
Harmonic Analysis and the Theory of Probability
- Bochner
- 1955
Citation Context ... a positive bounded symmetric measure $G(\cdot)$, $C_{\text{stat}}(h; \Theta) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp(i w^T h)\, G(dw)$ (2.13) This is known as Bochner's theorem (Bochner 1979). It can easily be demonstrated by considering the following expression of positive definiteness: $r^T C r \geq 0$ (2.14) for any arbitrary real vector $r$. In terms of a stationary covariance function this c... |

81 |
Bayesian inductive inference and maximum entropy
- Gull
- 1989
Citation Context ...e simplest model which accounts for the data rather than including unnecessary complications. As well as being intuitive this is also computationally attractive. Bayesian theory embodies Occam's Razor (Gull 1988) which states that we should not assign a higher plausibility to a complex model than to a simpler model when the extra complexity is not required to explain the data. Thus application of Bayesian te... |

64 | Bayesian methods for backpropagation networks - MacKay - 1994 |

52 | Principles of Geostatistics. Economic Geology 58 - Matheron - 1963 |

41 | Computing upper and lower bounds on likelihoods in intractable networks
- Jaakkola, Jordan
- 1996
Citation Context ...processes to model the potential energy surfaces of weakly bound molecules is discussed. Classification methods based on Gaussian processes are implemented using variational methods. Existing bounds (Jaakkola and Jordan 1996) for the sigmoid function are used to tackle binary problems and multi-dimensional bounds on the softmax function are presented for the multiple class case. The performance of the variational classif... |

38 | Flexible non-linear approaches to classification - Ripley - 1994 |

35 | Gaussian Processes for Bayesian classification via hybrid Monte Carlo
- Barber, Williams
- 1997
Citation Context ...processes is far from new, only recently have people viewed them from a Bayesian perspective with the aim of using them as models for regression and classification tasks (Williams and Rasmussen 1996; Barber and Williams 1996; Neal 1997). It is the aim of this thesis to extend this work with special emphasis on non-stationary forms of covariance functions, efficient implementation of the Gaussian process framework for lar... |

34 |
Bayesian Learning for Neural Networks, Number 118
- Neal
- 1996
Citation Context ...onstruct a neural network with an infinite number of hidden neurons while still retaining a tractable model with a low level of complexity provided that we have the appropriate priors on the weights (Neal 1996) (see Section 2.5 for details). Another point to emphasise is that Occam's Razor does not suggest that we should vary the size of our model dependent on the amount of data we receive. Even if we have... |

33 |
Parameter uncertainty in estimation of spatial functions: Bayesian analysis
- Kitanidis
- 1986
Citation Context ...ssian process approach to regression. 'Kriging' has been developed considerably in the last thirty years (see Cressie (1993) for an excellent review) including several Bayesian treatments (Omre 1987; Kitanidis 1986). However the geostatistics approach to the Gaussian process model has concentrated mainly on low-dimensional input spaces and has largely ignored any probabilistic interpretation of the model an... |

29 | Analysis of Linsker’s simulation of Hebbian rules - MacKay, Miller - 1990 |

28 |
Ars Conjectandi
- Bernoulli
- 1713
Citation Context ...roblems. 1.1.1 A Brief History of Bayesian Inference Bernoulli was one of the first to raise the question of how we might use deductive logic to solve inductive problems. In his book Ars Conjectandi (Bernoulli 1713), he discussed the convergence of Binomial distributions and the relationship of uncertainty to probability but he did not formulate any corresponding mathematical structure. Such a structure was pro... |

22 |
Information-Based Objective Functions for Active Data
- MacKay
- 1992
Citation Context ...lation. All the ab-initio data was generated using the CADPAC package (Amos 1982). In the investigation performed by Brown, an MLP with 32 hidden nodes (the number chosen using Evidence maximization (MacKay 1992d)) was trained on the data set using BACKPROB. In order to improve accuracy of the solution, Brown added a further 60 points to the data set and re-trained the network. The whole procedure took about... |

20 | Nonlinear prediction of acoustic vectors using hierarchical mixture of experts
- Waterhouse, Robinson
- 1994
Citation Context ...gularization classes for weights connected to each input. Many other models have been devised where the parameters and hyperparameters can be given physical meaning: Hierarchical mixtures of experts (Waterhouse and Robinson 1995), Belief networks (Pearl 1988) and certain fuzzy logic architectures (Mendel 1994) to name a few. Using models with interpretable parameters and hyperparameters also helps us express our priors easil... |

19 |
Influence of the eigenvalue spectrum on the convergence rate of the conjugate gradient method
- Jennings
- 1977
Citation Context ...K iterations where $K$ is significantly less than $N$. In this section we will discuss the rate of convergence of the conjugate gradient algorithm and how this relates to the eigenvalue structure of $C_N$ (Jennings 1977). In the next section we will go on to derive a set of bounds which will allow us to monitor the convergence of the algorithm without detailed knowledge of the eigenstructure. Let Q = [q1; q2; ... |

18 | Hyperparameters: Optimize or integrate out
- MacKay
- 1996
Citation Context ... $(t_{N+1} \mid x_{N+1}; D; C(\cdot); \Theta)$ (see Figure 2.4). This approximation is generally good and Evidence maximization predictions are often very close to those found using the true predictive distribution (MacKay 1996). Figure 2.4: The Evidence Approximation: This figure illustrates the assumptio... |

18 | On curve fitting and optimal design for regression - O'Hagan - 1978 |

16 | Free-energy minimization algorithm for decoding and cryptoanalysis. Electron Letters 31:445–47
- MacKay
- 1995
Citation Context ...eural network do not tell us a great deal about the data but the hyperparameters, if defined in a sensible manner, can lead to a greater understanding. For example, automatic relevance determination (MacKay 1995b; Neal 1996), the process of determining whether individual inputs cause significant variation in the outputs, can be performed by specifying different regularization classes for weights connected to... |

13 |
Similarity metric learning for a variable kernel classifier
- Lowe
- 1995
Citation Context ...8) introduced an approach which is essentially similar to Gaussian processes. Generalized radial basis functions (Poggio and Girosi 1989), ARMA models (Wahba 1990) and variable metric kernel methods (Lowe 1995) are all closely related to Gaussian processes. The present interest in the area has been initiated by the work of Neal (1996) on priors for infinite networks. Neal showed that the prior over functio... |

13 | Interpolation models with multiple hyperparameters - Takeuchi - 1996 |

12 | Theory of Probability, Oxford Univ - Jeffreys - 1939 |

12 |
Bayesian kriging: merging observations and qualified guesses in kriging
- Omre
- 1987
Citation Context ... to the Gaussian process approach to regression. 'Kriging' has been developed considerably in the last thirty years (see Cressie (1993) for an excellent review) including several Bayesian treatments (Omre 1987; Kitanidis 1986). However the geostatistics approach to the Gaussian process model has concentrated mainly on low-dimensional input spaces and has largely ignored any probabilistic interpretation... |

9 | Statistical properties of atomic and nuclear spectra, Ann - Porter, Rosenzweig - 1960 |

9 | Spectral statistics in elastodynamics - Weaver - 1989 |

9 | Statistical properties of real symmetric matrices with many dimensions - Wigner - 1965 |

8 |
A random-walk simulation of the Schrödinger equation
- Anderson
- 1975
Citation Context ...ential energy surface as a whole. Once we have obtained a model for the potential energy surface, we then need to find a method to determine the properties of the system. Diffusion Monte Carlo (DMC) (Anderson 1975; Suhm and Watts 1991) is a general and exact method for determining the ground state of the time-independent Schrödinger equation. It takes into account the couplings and anharmonicities embodied by... |

8 |
Bayesian interpolation. Neural Computation 4(3):415–447
- MacKay
- 1991
Citation Context ...ariance function. Having given examples of both stationary and non-stationary covariance functions, I show how the hyperparameters of a Gaussian process can be determined using Evidence maximization (MacKay 1992a). I also discuss the pros and cons of Evidence maximization in comparison to a Markov chain Monte Carlo approach. Previous work on Gaussian processes is then briefly reviewed and the relationship of... |

6 |
The Cambridge Analytic Derivative Package, version 4.0
- Amos, Rice
- 1987
Citation Context ...as towards the region containing the bottom of the potential well as greater accuracy is required in this region for the DMC simulation. All the ab-initio data was generated using the CADPAC package (Amos 1982). In the investigation performed by Brown, an MLP with 32 hidden nodes (the number chosen using Evidence maximization (MacKay 1992d)) was trained on the data set using BACKPROB. In order to improve a... |

5 |
Model for hot cracking in low-alloy steel weld metals. Science and Technology of Welding and Joining 1
- Ichikawa, Bhadeshia, et al.
- 1996
Citation Context ...keuchi (1994), inference of potential energy surfaces (Brown et al. 1996), classification of the PIMA and CRABS datasets by Barber and Williams (1996) and Ripley (1994), weld strength classification (Ichikawa et al. 1996), theoretical calculations of densities of states by Pickard (1997)) these are done as illustrations and are not meant as definitive comparisons between Gaussian processes and other methods. Chapter ... |

4 |
Pattern Recognition and Neural
- Ripley
- 2008
Citation Context ...ional bounds on the softmax function are presented for the multiple class case. The performance of the variational classifier is compared with that of other methods using the CRABS and PIMA datasets (Ripley 1996) and the problem of predicting the cracking of welds based on their chemical composition is also investigated. The theoretical calculation of the density of states of crystal structures is discussed ... |

3 |
Combining ab initio computations, neural networks, and diffusion Monte Carlo: An efficient method to treat weakly bound molecules
- Brown, Gibbs, et al.
- 1996
Citation Context ... While several comparisons are made in this thesis with the work of others on specific datasets (regression of neuron spike data by MacKay and Takeuchi (1994), inference of potential energy surfaces (Brown et al. 1996), classification of the PIMA and CRABS datasets by Barber and Williams (1996) and Ripley (1994), weld strength classification (Ichikawa et al. 1996), theoretical calculations of densities of states b... |

3 |
A general method for constructing multidimensional molecular potential energy surfaces from ab initio calculations
- Ho, Rabitz
Citation Context ...However most of these schemes have either been specific to certain groups of systems (Tully 1980; Bowman and Kuppermann 1975; Connor et al. 1975), been viable only for a limited number of dimensions (Ho and Rabitz 1996) or required a large amount of ab initio data (Gregory and Clary 1995). We are principally interested in modelling the potential energy surface of large weakly bound systems. Large systems have many ... |

2 | Discoveries of the Faraday Society 73: 45. Bayes, T. (1763) An essay towards solving a problem in the doctrine of chances - Barton, Howard - 1982 |

2 | An analytical 6-dimensional potential energy surface for (HF)2 from ab initio calculations - Bunker, Kofranek, et al. - 1988 |

2 |
Exact quantum transition probabilities by the state path sum method: colinear F + H2 reaction
- Connor, Jakubetz, et al.
- 1975
Citation Context ...n the past to model the potential energy surface of weakly bound systems. However most of these schemes have either been specific to certain groups of systems (Tully 1980; Bowman and Kuppermann 1975; Connor et al. 1975), been viable only for a limited number of dimensions (Ho and Rabitz 1996) or required a large amount of ab initio data (Gregory and Clary 1995). We are principally interested in modelling the potent... |