## Choice of Basis for Laplace Approximation (1998)

### Other Repositories/Bibliography

Venue: Machine Learning

Citations: 28 (1 self)

### BibTeX

@TECHREPORT{MacKay98choiceof,
  author      = {David J.C. MacKay},
  title       = {Choice of Basis for Laplace Approximation},
  institution = {Machine Learning},
  year        = {1998}
}


### Abstract

Maximum a posteriori optimization of parameters and the Laplace approximation for the marginal likelihood are both basis-dependent methods. This note compares two choices of basis for models parameterized by probabilities, showing that it is possible to improve on the traditional choice, the probability simplex, by transforming to the 'softmax' basis.

### Citations

1508 | Bayesian data analysis
- Gelman, Carlin, et al.
- 1995
Citation Context: ...re performed, with I = 20 in all cases. In the first experiment, all u_i were set to 1, and a probability vector p was drawn from the corresponding Dirichlet distribution using the method described by Gelman et al., 1995. The vector F was then set to Np for a range of values of N, the effective number of data points (this fake data set thus has non-integer 'counts'). The three methods of evaluating P(F|u) were compar...
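The sampling step in this experiment can be reproduced with the standard Gamma-normalisation construction of a Dirichlet draw (the method described by Gelman et al.). A minimal sketch: I = 20 and u_i = 1 come from the snippet, while N = 50 is one arbitrary value from the "range of values of N" it mentions:

```python
import random

def sample_dirichlet(u, rng=random):
    """Draw p ~ Dirichlet(u) by normalising independent Gamma(u_i, 1) draws."""
    g = [rng.gammavariate(ui, 1.0) for ui in u]
    total = sum(g)
    return [gi / total for gi in g]

I = 20
u = [1.0] * I                 # uniform Dirichlet measure, as in the first experiment
p = sample_dirichlet(u)

N = 50.0                      # one choice from the range of effective data sizes
F = [N * pi for pi in p]      # fake data F = N p (non-integer 'counts' in general)
```

Because F = N p exactly, the fake counts sum to N even though no individual F_i is an integer.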

1212 | Pattern Recognition and Neural Networks
- Ripley
- 1996
Citation Context: ...f(w*)(2π)^{k/2} |-∇∇ log f(w)|^{-1/2}. (1) This method is widely used in probabilistic modelling to approximate the value of marginal likelihoods, which are of interest for model comparison (Ripley, 1996; Lindley, 1980; Smith and Spiegelhalter, 1980; MacKay, 1992; Chickering and Heckerman, 1996). In this paper I consider the case of models whose parameters are probabilities, for example, hidden Marko...
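Equation (1) in this snippet is the standard Laplace formula, Z ≈ f(w*)(2π)^{k/2} |-∇∇ log f(w*)|^{-1/2}, where w* maximizes f. A one-dimensional numerical check on a made-up integrand (the quartic tilt is my choice for illustration, not from the paper):

```python
import math

def f(w):
    """Toy unnormalised density: Gaussian with a quartic tilt (an assumption)."""
    return math.exp(-w**2 / 2 - w**4 / 10)

# Laplace approximation with k = 1: the maximum is at w* = 0 by symmetry,
# and -d^2/dw^2 log f at 0 is 1 (the quartic term has zero curvature there).
w_star = 0.0
curvature = 1.0
z_laplace = f(w_star) * math.sqrt(2 * math.pi) * curvature ** -0.5

# Numerical reference by simple trapezoidal quadrature on [-10, 10].
n, a, b = 20001, -10.0, 10.0
h = (b - a) / (n - 1)
z_quad = h * (sum(f(a + i * h) for i in range(n)) - 0.5 * (f(a) + f(b)))
```

Here z_laplace slightly overestimates z_quad, because f has lighter tails than the Gaussian fitted at its peak; this kind of basis-dependent mismatch is exactly what the note is about.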

429 | A Practical Bayesian Framework for Backpropagation Networks
- Mackay
- 1992
Citation Context: ...hod is widely used in probabilistic modelling to approximate the value of marginal likelihoods, which are of interest for model comparison (Ripley, 1996; Lindley, 1980; Smith and Spiegelhalter, 1980; MacKay, 1992; Chickering and Heckerman, 1996). In this paper I consider the case of models whose parameters are probabilities, for example, hidden Markov models, mixture models, belief networks and certain langua...

242 | Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition
- Bridle
- 1990
Citation Context: ...sis. 2 A change of basis I suggest that maximum a posteriori parameter estimation and Laplace approximations would be better conducted in the 'softmax' representation (widely used in neural networks (Bridle, 1989)) in which the parameters p are replaced by parameters a: p_i(a) = exp(a_i) / Σ_{i'} exp(a_{i'}). (9) [Please do not confuse p(a), the function defined in equation (9), with the probability density P(a).] The...
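Equation (9) can be sketched directly. The max-shift below is a standard numerical-stability device rather than anything the snippet specifies, and it also exhibits the map's degeneracy: adding a constant to every a_i leaves p unchanged.

```python
import math

def softmax(a):
    """p_i(a) = exp(a_i) / sum_{i'} exp(a_{i'}), with a max-shift for stability."""
    m = max(a)
    exps = [math.exp(ai - m) for ai in a]
    s = sum(exps)
    return [e / s for e in exps]

a = [2.0, -1.0, 0.5]
p = softmax(a)

# Shift invariance: a_i -> a_i + c gives the same probability vector.
shifted = softmax([ai + 7.3 for ai in a])
```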

183 | Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables
- Chickering, Heckerman
- 1997
Citation Context: ...used in probabilistic modelling to approximate the value of marginal likelihoods, which are of interest for model comparison (Ripley, 1996; Lindley, 1980; Smith and Spiegelhalter, 1980; MacKay, 1992; Chickering and Heckerman, 1996). In this paper I consider the case of models whose parameters are probabilities, for example, hidden Markov models, mixture models, belief networks and certain language models. I examine the neglect...

83 | A Hierarchical Dirichlet Language Model
- MacKay, Peto
- 1994
Citation Context: ...hump, so if this happens the traditional method is in trouble. The traditional solution to this problem is to forbid the use of Dirichlet priors with any u_i < 1. However, as argued in (Jeffreys, 1939; MacKay and Peto, 1995; Gelman, 1996), there may be good reasons for expecting priors with u_i < 1 to be appropriate for many problems. I would argue that the '-1' terms in the traditional posterior probability are ar...

64 | Bayes factors and choice criteria for linear models
- Smith, Spiegelhalter
- 1980
Citation Context: ...|-∇∇ log f(w)|^{-1/2}. (1) This method is widely used in probabilistic modelling to approximate the value of marginal likelihoods, which are of interest for model comparison (Ripley, 1996; Lindley, 1980; Smith and Spiegelhalter, 1980; MacKay, 1992; Chickering and Heckerman, 1996). In this paper I consider the case of models whose parameters are probabilities, for example, hidden Markov models, mixture models, belief networks and ...

33 | Speaker adaptation based on MAP estimation of HMM parameters
- Lee, Gauvain
- 1993
Citation Context: ...ables and mixture models have a likelihood function obtained by summation over the hidden variables; Dirichlet priors are also widely used for such models. The traditional MAP method for such models (Lee and Gauvain, 1993) is to maximize the posterior probability of the parameters p, and the traditional Laplace method for such a model is, after maximizing in the p basis, to make the Gaussian approximation in the same ...

30 | Bayesian mixture modeling
- Neal
- 1992
Citation Context: ...curate. 4 Discussion This paper's aim is not to advocate the use of Laplace approximations; indeed a good case can be made for using other methods such as Markov chain Monte Carlo (see, for example, (Neal, 1992)). And deterministic Bayesian approximations that are basis independent are under development (MacKay, 1997). But if MAP methods are used, this paper offers a way of evaluating marginal likelihoods w...

29 | Bayesian Inference: Volume 2B, Kendall's Advanced Theory of Statistics
- O’Hagan
- 1994
Citation Context: ...ch face i came up F_i times, and we wish to calculate the marginal likelihood, which depends on the prior distribution over p. A popular prior for a probability vector p is the Dirichlet distribution (O'Hagan, 1994) parameterized by a measure u (a vector with all coefficients u_i > 0): P(p|u) = (1/Z_Dir(u)) ∏_{i=1}^{I} p_i^{u_i - 1} δ(Σ_i p_i - 1) ≡ Dirichlet^(I)(p|u). (3) The function δ(x) is the Dirac delta function which si...
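The Dirichlet density in equation (3) has the standard normaliser Z_Dir(u) = ∏_i Γ(u_i) / Γ(Σ_i u_i). A sketch of its log density on the simplex; the test point below is an arbitrary choice of mine:

```python
import math

def log_dirichlet_density(p, u):
    """log Dirichlet(p | u) on the simplex, using
    Z_Dir(u) = prod_i Gamma(u_i) / Gamma(sum_i u_i)."""
    log_z = sum(math.lgamma(ui) for ui in u) - math.lgamma(sum(u))
    return sum((ui - 1.0) * math.log(pi) for pi, ui in zip(p, u)) - log_z

# With u = (1, 1, 1) the density is uniform on the 2-simplex, whose area
# factor gives a constant density of Gamma(3) = 2 at every interior point.
val = log_dirichlet_density([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])
```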


12 | Theory of Probability, Oxford University Press - Jeffreys - 1939

9 | Bayesian model-building by pure thought: some principles and examples, Statistica Sinica 6
- Gelman
- 1996
Citation Context: ...is only valid at the maximum of a smooth hump, so if this happens the traditional method is in trouble. The traditional solution to this problem is to forbid the use of Dirichlet priors with any u_i < 1. However, as argued in (Jeffreys, 1939; MacKay and Peto, 1995; Gelman, 1996), there may be good reasons for expecting priors with u_i < 1 to be appropriate for many problems. I would argue that the '-1' terms in the traditional posterior probability are artefacts of the choice of basis. 2 A change...

9 | Approximate Bayesian Methods
- Lindley
- 1980
Citation Context: ...f(w*)(2π)^{k/2} |-∇∇ log f(w)|^{-1/2}. (1) This method is widely used in probabilistic modelling to approximate the value of marginal likelihoods, which are of interest for model comparison (Ripley, 1996; Lindley, 1980; Smith and Spiegelhalter, 1980; MacKay, 1992; Chickering and Heckerman, 1996). In this paper I consider the case of models whose parameters are probabilities, for example, hidden Markov models, mixtu...

6 | Ensemble learning for hidden Markov models. Available from http://wol.ra.phy.cam.ac.uk/mackay
- MacKay
- 1997
Citation Context: ...case can be made for using other methods such as Markov chain Monte Carlo (see, for example, (Neal, 1992)). And deterministic Bayesian approximations that are basis independent are under development (MacKay, 1997). But if MAP methods are used, this paper offers a way of evaluating marginal likelihoods which satisfies these two desiderata: 1. We can make a Laplace approximation for any Dirichlet priors and any...

