Results 1-10 of 16
A Hierarchical Dirichlet Language Model
Natural Language Engineering, 1994
Cited by 79 (3 self)
Abstract
We discuss a hierarchical probabilistic model whose predictions are similar to those of the popular language modelling procedure known as 'smoothing'. A number of interesting differences from smoothing emerge. The insights gained from a probabilistic view of this problem point towards new directions for language modelling. The ideas of this paper are also applicable to other problems such as the modelling of triphones in speech, and of DNA and protein sequences in molecular biology. The new algorithm is compared with smoothing on a two-million-word corpus. The methods prove to be about equally accurate, with the hierarchical model using fewer computational resources.
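As a rough illustration of the idea in this abstract, a hierarchical Dirichlet-style predictor shrinks a context's empirical counts toward a parent distribution. This is a minimal sketch, not the paper's exact algorithm; the function names, the single concentration parameter `alpha`, and the two-level hierarchy are illustrative assumptions:

```python
from collections import Counter

def dirichlet_predict(word, context_counts, parent_probs, alpha=1.0):
    """Predict P(word | context) by interpolating the context's counts
    with a parent (e.g. unigram) distribution, Dirichlet-prior style."""
    total = sum(context_counts.values())
    prior = parent_probs.get(word, 0.0)
    return (context_counts.get(word, 0) + alpha * prior) / (total + alpha)

# Toy counts of words following one particular context
context_counts = Counter({"cat": 3, "dog": 1})
# Parent (unigram) distribution over the whole vocabulary
parent_probs = {"cat": 0.2, "dog": 0.2, "fish": 0.6}

p_cat = dirichlet_predict("cat", context_counts, parent_probs)
p_fish = dirichlet_predict("fish", context_counts, parent_probs)  # unseen word
```

Note that an unseen word still receives nonzero probability through the parent level, which is the role smoothing plays in the classical procedure.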
Ensemble Learning and Evidence Maximization
Proc. NIPS, 1995
Cited by 18 (2 self)
Abstract
Ensemble learning by variational free energy minimization is a tool introduced to neural networks by Hinton and van Camp, in which learning is described in terms of the optimization of an ensemble of parameter vectors. The optimized ensemble is an approximation to the posterior probability distribution of the parameters. This tool has now been applied to a variety of statistical inference problems. In this paper I study a linear regression model with both parameters and hyperparameters. I demonstrate that the evidence approximation for the optimization of regularization constants can be derived in detail from a free energy minimization viewpoint.
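The evidence-based optimization of a regularization constant mentioned here can be sketched for a linear regression model. This is a hedged illustration of MacKay-style evidence re-estimation under simplifying assumptions (fixed noise precision `beta`, synthetic data); the function and variable names are not from the paper:

```python
import numpy as np

def evidence_alpha(Phi, t, beta=25.0, alpha=1.0, iters=50):
    """Re-estimate the regularization constant alpha of Bayesian linear
    regression by a fixed-point evidence-maximization update.
    The noise precision beta is held fixed for simplicity."""
    N, M = Phi.shape
    eig = beta * np.linalg.eigvalsh(Phi.T @ Phi)  # eigenvalues of beta * Phi^T Phi
    for _ in range(iters):
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m = beta * np.linalg.solve(A, Phi.T @ t)   # posterior mean of the weights
        gamma = np.sum(eig / (eig + alpha))        # effective number of parameters
        alpha = gamma / (m @ m)                    # evidence re-estimation formula
    return alpha, m

rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
t = Phi @ w_true + 0.2 * rng.normal(size=100)    # noise std 0.2 matches beta = 25
alpha, m = evidence_alpha(Phi, t)
```

The quantity `gamma` counts the well-determined parameters, so the update shrinks `alpha` until it balances the fitted weight magnitudes against the data's demands.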
Bayesian Methods for Neural Networks: Theory and Applications
1995
Cited by 13 (0 self)
Abstract
... this document. Before these are discussed, however, perhaps we should have a tutorial on Bayesian probability theory and its application to model comparison problems ...
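The model comparison idea this tutorial fragment alludes to can be made concrete with a toy example that is not from the document itself: comparing the marginal likelihood (evidence) of a fixed fair coin against a coin with a uniform prior on its bias. The automatic penalty the flexible model pays on unsurprising data is the Bayesian form of the razor:

```python
from math import comb

def evidence_fair(heads, tails):
    """Marginal likelihood of the data under a fixed fair coin."""
    return 0.5 ** (heads + tails)

def evidence_uniform(heads, tails):
    """Marginal likelihood under a uniform prior on the coin's bias:
    the integral of p^h (1-p)^t dp equals 1 / ((N+1) * C(N, h))."""
    n = heads + tails
    return 1.0 / ((n + 1) * comb(n, heads))

# Balanced data favours the simpler (fair) model ...
e_fair_bal = evidence_fair(5, 5)
e_unif_bal = evidence_uniform(5, 5)
# ... while lopsided data favours the more flexible model.
e_fair_skew = evidence_fair(9, 1)
e_unif_skew = evidence_uniform(9, 1)
```

No free parameter is tuned anywhere: the evidence integrals alone decide which model is better supported.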
Interpolation Models with Multiple Hyperparameters
1997
Cited by 13 (2 self)
Abstract
A traditional interpolation model is characterized by the choice of regularizer applied to the interpolant, and the choice of noise model. Typically, the regularizer has a single regularization constant, and the noise model has a single parameter. The ratio of these two constants alone is responsible for determining globally all these attributes of the interpolant: its 'complexity', 'flexibility', 'smoothness', 'characteristic scale length', and 'characteristic amplitude'. We suggest that interpolation models should be able to capture more than just one flavour of simplicity and complexity. We describe Bayesian models in which the interpolant has a smoothness that varies spatially. We emphasize the importance, in practical implementation, of the concept of 'conditional convexity' when designing models with many hyperparameters.
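One crude way to picture a model with more than one regularization constant, sketched here under assumptions of my own rather than taken from the paper, is a linear model whose basis functions are split into groups, each group carrying its own constant `alpha_g`:

```python
import numpy as np

def fit_grouped_ridge(Phi, t, groups, alphas, beta=1.0):
    """Posterior mean of a linear model whose basis functions are split
    into groups, each with its own regularization constant alpha_g --
    a stand-in for an interpolant whose smoothness varies across inputs."""
    A_diag = np.array([alphas[g] for g in groups], dtype=float)
    A = np.diag(A_diag) + beta * Phi.T @ Phi
    return np.linalg.solve(A, beta * Phi.T @ t)

rng = np.random.default_rng(1)
Phi = rng.normal(size=(50, 4))
t = Phi @ np.array([2.0, 2.0, 0.0, 0.0]) + 0.1 * rng.normal(size=50)
# First two basis functions lightly regularized, last two heavily
w = fit_grouped_ridge(Phi, t, groups=[0, 0, 1, 1], alphas={0: 0.01, 1: 100.0})
```

With a single constant, every coefficient would be shrunk by the same global amount; the grouped version can leave one part of the model flexible while keeping another smooth.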
An Empirical Evaluation of Bayesian Sampling with Hybrid Monte Carlo for Training Neural Network Classifiers
Neural Networks, 1998
Cited by 12 (4 self)
Abstract
This article gives a concise overview of Bayesian sampling for neural networks, and then presents an extensive evaluation on a set of various benchmark classification problems. The main objective is to study the sensitivity of this scheme to changes in the prior distribution of the parameters and hyperparameters, and to evaluate the efficiency of the so-called automatic relevance determination (ARD) method. The paper concludes with a comparison of the achieved classification results with those obtained with (i) the evidence scheme and (ii) non-Bayesian methods. Keywords: Bayesian statistics, prior and posterior distribution, parameters and hyperparameters, Gibbs sampling, hybrid Monte Carlo, automatic relevance determination (ARD), evidence approximation, classification problems, benchmarking.
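The hybrid (Hamiltonian) Monte Carlo sampler at the heart of this scheme can be sketched on a toy target. This is a plain textbook leapfrog implementation sampling a 2-D Gaussian as a stand-in for a network's weight posterior; step size, trajectory length, and all names are illustrative choices, not the paper's settings:

```python
import numpy as np

def hmc_sample(log_post_grad, log_post, x0, n_samples=2000, eps=0.1, L=20, seed=0):
    """Hybrid Monte Carlo with leapfrog integration: propose by simulating
    Hamiltonian dynamics, then accept/reject on the total energy."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_samples):
        p = rng.normal(size=x.shape)                  # resample momentum
        x_new, p_new = x.copy(), p.copy()
        p_new += 0.5 * eps * log_post_grad(x_new)     # half step for momentum
        for _ in range(L - 1):
            x_new += eps * p_new                      # full step for position
            p_new += eps * log_post_grad(x_new)       # full step for momentum
        x_new += eps * p_new
        p_new += 0.5 * eps * log_post_grad(x_new)     # final half step
        h_old = -log_post(x) + 0.5 * p @ p            # total energy before
        h_new = -log_post(x_new) + 0.5 * p_new @ p_new
        if rng.random() < np.exp(h_old - h_new):      # Metropolis correction
            x = x_new
        samples.append(x.copy())
    return np.array(samples)

# Toy target: standard 2-D Gaussian posterior
log_post = lambda x: -0.5 * x @ x
grad = lambda x: -x
samples = hmc_sample(grad, log_post, x0=[3.0, -3.0])
mean = samples[500:].mean(axis=0)                     # discard burn-in
```

The momentum resampling and energy-based accept/reject are what let the chain take long, low-rejection moves through a high-dimensional weight posterior.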
Bayesian Regression Filters and the Issue of Priors
Cited by 9 (5 self)
Abstract
We propose a Bayesian framework for regression problems, which covers areas usually dealt with by function approximation. An online learning algorithm is derived which solves regression problems with a Kalman filter. Its solution always improves with increasing model complexity, without the risk of overfitting. In the infinite-dimension limit it approaches the true Bayesian posterior. The issues of prior selection and overfitting are also discussed, showing that some commonly held beliefs are misleading. The practical implementation is summarised. Simulations using 13 popular publicly available data sets are used to demonstrate the method and highlight important issues concerning the choice of priors. Keywords: regression, Bayesian method, Kalman filter, approximation, prior selection, radial basis functions, online learning. Running title: Bayesian Regression Filter.
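The Kalman-filter view of online Bayesian regression can be sketched as follows; this is the standard recursive update for a linear model with a static weight state, written under my own assumed names and a fixed known noise variance, not the paper's exact algorithm:

```python
import numpy as np

def kalman_regression_step(m, P, phi, t, noise_var=0.01):
    """One online update of a Bayesian linear model: the weight posterior
    N(m, P) is conditioned on a single observation t at features phi,
    exactly as a Kalman filter with a static (non-evolving) state."""
    phi = np.asarray(phi, float)
    S = phi @ P @ phi + noise_var          # predictive variance of t
    K = P @ phi / S                        # Kalman gain
    m = m + K * (t - phi @ m)              # corrected posterior mean
    P = P - np.outer(K, phi @ P)           # corrected posterior covariance
    return m, P

rng = np.random.default_rng(2)
m, P = np.zeros(2), 10.0 * np.eye(2)       # broad prior over the weights
w_true = np.array([0.5, -1.5])
for _ in range(200):
    phi = rng.normal(size=2)
    t = phi @ w_true + 0.1 * rng.normal()  # noise std 0.1 matches noise_var
    m, P = kalman_regression_step(m, P, phi, t)
```

Each observation is processed once and discarded, so the posterior is maintained online without ever refitting on the full data set.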
Divide and Conquer: Pattern Recognition using Mixtures of Experts
1997
Cited by 1 (0 self)
Abstract
... speech recognition task. The mixture of experts is shown to be a superior method for speaker adaptation of connectionist models to new conditions. In addition, a significant improvement in the performance of an ensemble of classifiers via the mixture framework is demonstrated. Beyond these applications, a number of theoretical extensions of the mixture of experts have been made in this thesis. The link between hierarchical mixtures of experts (HME) and other tree-based models is described and used to motivate a new training algorithm for the HME, known as tree growing. Tree growing is a constructive algorithm which results in faster training and a more efficient use of parameters than standard training methods. The second extension described is path pruning, a fast training and evaluation algorithm for deep hierarchies in which paths through the tree that have low probability are ignored. A stabilising method for the algorithm based on weight decay regularisation is ...
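The basic divide-and-conquer mechanism behind this thesis can be sketched with a flat (non-hierarchical) mixture: a softmax gating network weights the outputs of simple linear experts. This is a minimal forward pass under illustrative weights, not the thesis's architecture:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                               # numerical stability
    e = np.exp(z)
    return e / e.sum()

def mixture_of_experts(x, gate_W, expert_Ws):
    """Forward pass of a flat mixture of experts: the gating network
    assigns each input a soft responsibility over the experts."""
    g = softmax(gate_W @ x)                       # gating probabilities
    preds = np.array([W @ x for W in expert_Ws])  # each expert's prediction
    return g @ preds, g

x = np.array([1.0, 2.0])
gate_W = np.array([[5.0, 0.0], [-5.0, 0.0]])      # gate strongly prefers expert 0
experts = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
y, g = mixture_of_experts(x, gate_W, experts)
```

Path pruning, as described above, amounts to skipping experts (or subtrees, in the hierarchical case) whose gating probability `g` falls below a threshold.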
Bayesian Multioutput Feedforward Neural Network Comparison: A Conjugate Prior Approach
Cited by 1 (1 self)
Abstract
A Bayesian method for the comparison and selection of multi-output feedforward neural network topology, based on predictive capability, is proposed. As a measure of the prediction fitness potential, an expected utility criterion is considered, which is consistently estimated by a sample-reuse computation. As opposed to classic point-prediction-based cross-validation methods, this expected utility is defined from the logarithmic score of the neural model's predictive probability density. It is shown how the advocated choice of a conjugate probability distribution as prior for the parameters of a competing network allows a consistent approximation of the network's posterior predictive density. A comparison of the performance of the proposed method with that of usual selection procedures based on classic cross-validation and information-theoretic criteria is performed, first on a simulated case study and then on a well-known food analysis dataset.
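The combination of a conjugate prior, a posterior predictive density, and a sample-reuse log score can be illustrated on a much simpler model than a neural network: a Gaussian with known noise variance and a conjugate Normal prior on its mean. All names and prior settings here are illustrative assumptions, not the paper's:

```python
import numpy as np
from math import log, pi

def log_predictive(x, data, prior_mean=0.0, prior_var=100.0, noise_var=1.0):
    """Log density of x under the posterior predictive of a Gaussian model
    with known noise variance and a conjugate Normal prior on its mean."""
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var + np.sum(data) / noise_var)
    pred_var = post_var + noise_var                  # predictive = posterior + noise
    return -0.5 * log(2 * pi * pred_var) - 0.5 * (x - post_mean) ** 2 / pred_var

def loo_log_score(data):
    """Sample-reuse (leave-one-out) estimate of the expected log score:
    each point is scored under the predictive fitted on the others."""
    n = len(data)
    scores = [log_predictive(data[i], np.delete(data, i)) for i in range(n)]
    return float(np.mean(scores))

rng = np.random.default_rng(3)
data = rng.normal(loc=2.0, scale=1.0, size=100)
score = loo_log_score(data)
```

Because the score is the log of a full predictive density rather than a squared point-prediction error, it rewards models that are well calibrated, not merely accurate on average.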