## Variational Inference for Bayesian Mixtures of Factor Analysers (2000)

Venue: Advances in Neural Information Processing Systems 12

Citations: 148 (16 self)

### BibTeX

@INPROCEEDINGS{Ghahramani00variationalinference,
  author    = {Zoubin Ghahramani and Matthew J. Beal},
  title     = {Variational Inference for Bayesian Mixtures of Factor Analysers},
  booktitle = {Advances in Neural Information Processing Systems 12},
  year      = {2000},
  pages     = {449--455},
  publisher = {MIT Press}
}

### Abstract

We present an algorithm that infers the model structure of a mixture of factor analysers using an efficient and deterministic variational approximation to full Bayesian integration over model parameters. This procedure can automatically determine the optimal number of components and the local dimensionality of each component (i.e. the number of factors in each factor analyser). Alternatively it can be used to infer posterior distributions over the number of components and dimensionalities. Since all parameters are integrated out, the method is not prone to overfitting. Using a stochastic procedure for adding components, it is possible to perform the variational optimisation incrementally and to avoid local maxima. Results show that the method works very well in practice and correctly infers the number and dimensionality of nontrivial synthetic examples. By importance sampling from the variational approximation we show how to obtain unbiased estimates of the true evidence, the exa...
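To make the role of the variational bound concrete, here is a toy sketch (invented discrete example, not the paper's MFA model): for a discrete parameter theta, the free energy F(q) = Σ_θ q(θ)[log p(x, θ) − log q(θ)] lower-bounds the log evidence log p(x) for every distribution q, with equality when q is the exact posterior. This is why F can both be monitored for convergence and compared across model structures.

```python
import math

# Toy model: discrete parameter theta in {0, 1, 2}.
# Joint probabilities p(x, theta) for one fixed observation x (made-up numbers).
joint = [0.05, 0.20, 0.10]
log_evidence = math.log(sum(joint))  # log p(x), tractable here by enumeration

def free_energy(q):
    """F(q) = sum_theta q(theta) * [log p(x, theta) - log q(theta)]."""
    return sum(qi * (math.log(pi) - math.log(qi))
               for qi, pi in zip(q, joint) if qi > 0)

# Any normalised q gives F(q) <= log p(x) ...
q_rough = [1 / 3, 1 / 3, 1 / 3]
# ... and the exact posterior q(theta) = p(x, theta) / p(x) attains the bound.
posterior = [pi / sum(joint) for pi in joint]

print(free_energy(q_rough), free_energy(posterior), log_evidence)
```

Picking the model structure with the highest F is then a tractable stand-in for comparing intractable evidences.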

### Citations

440 | On Bayesian analysis of mixtures with an unknown number of components
- Richardson, Green
- 1997
Citation Context ...ally and analytically intractable to perform the required integrals. For Gaussian mixture models Markov chain Monte Carlo (MCMC) methods have been developed to approximate these integrals by sampling [8, 7]. The main criticism of MCMC methods is that they are slow and ... (² Technically, the log likelihood is not bounded above if no constraints are put on the determinant of the component covariances. So the ...)

397 | Mixtures of probabilistic principal component analysers
- Tipping, Bishop
- 1999
Citation Context ...iple of the identity the model becomes a mixture of probabilistic PCAs. Tractable maximum likelihood procedures for fitting MFA and MPCA models can be derived from the Expectation Maximisation algorithm [4, 11]. The maximum likelihood (ML) approach to MFA can easily get caught in local maxima.² Ueda et al. [12] provide an effective deterministic procedure for avoiding local maxima by considering splitting a factor analyser in one part of space and merging two in another part. But splits and merges have to ...

263 | A unifying review of linear Gaussian models, Neural Computation 11
- Roweis, Ghahramani
- 1999
Citation Context ...ctor analyser can be seen as a reduced parametrisation of a full-covariance Gaussian.¹ (¹ Factor analysis and its relationship to principal components analysis (PCA) and mixture models is reviewed in [10].) A mixture of factor analysers (MFA) models the density for y as a weighted average of factor analyser densities

P(y | π, Λ, Ψ) = Σ_{s=1}^{S} P(s | π) P(y | s, Λ^s, Ψ^s),   (1)

where π is the vector of mixing pro...
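A minimal numerical sketch of Eq. (1), with hypothetical values (2-D observations, one factor per analyser): each component s contributes a Gaussian whose covariance is the low-rank-plus-diagonal matrix Λ^s (Λ^s)^T + Ψ^s, and the mixture density is their weighted sum.

```python
import math

def gauss2d(y, mu, C):
    """Density of a 2-D Gaussian N(y; mu, C), with C as [[a, b], [b, d]]."""
    (a, b), (_, d) = C
    det = a * d - b * b
    dy = [y[0] - mu[0], y[1] - mu[1]]
    # Mahalanobis term via the explicit 2x2 inverse (1/det) [[d, -b], [-b, a]].
    quad = (d * dy[0] ** 2 - 2 * b * dy[0] * dy[1] + a * dy[1] ** 2) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

def mfa_density(y, mix, mus, lams, psis):
    """Eq. (1): P(y) = sum_s P(s) N(y; mu_s, lam_s lam_s^T + diag(psi_s))."""
    total = 0.0
    for pi_s, mu, lam, psi in zip(mix, mus, lams, psis):
        C = [[lam[0] * lam[0] + psi[0], lam[0] * lam[1]],
             [lam[0] * lam[1], lam[1] * lam[1] + psi[1]]]
        total += pi_s * gauss2d(y, mu, C)
    return total

# Two components, one factor each (all numbers illustrative).
mix  = [0.6, 0.4]                 # mixing proportions pi
mus  = [[0.0, 0.0], [4.0, 4.0]]   # component means
lams = [[1.0, 0.5], [0.3, 1.2]]   # one factor-loading column per component
psis = [[0.2, 0.2], [0.1, 0.3]]   # diagonal noise covariances
```

With a single factor in 2-D, each component stores 4 numbers for the covariance instead of the full 3, which is the "reduced parametrisation" the snippet refers to; the saving grows with dimensionality.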

225 | The EM algorithm for mixtures of factor analyzers
- Ghahramani, Hinton
- 1996
Citation Context ...iple of the identity the model becomes a mixture of probabilistic PCAs. Tractable maximum likelihood procedures for fitting MFA and MPCA models can be derived from the Expectation Maximisation algorithm [4, 11]. The maximum likelihood (ML) approach to MFA can easily get caught in local maxima.² Ueda et al. [12] provide an effective deterministic procedure for avoiding local maxima by considering splitting a factor analyser in one part of space and merging two in another part. But splits and merges have to ...

158 | The infinite Gaussian mixture model
- Rasmussen
- 2000
Citation Context ...ally and analytically intractable to perform the required integrals. For Gaussian mixture models Markov chain Monte Carlo (MCMC) methods have been developed to approximate these integrals by sampling [8, 7]. The main criticism of MCMC methods is that they are slow and ... (² Technically, the log likelihood is not bounded above if no constraints are put on the determinant of the component covariances. So the ...)

136 | Inferring parameters and structure of latent variable models by variational Bayes
- Attias
- 1999
Citation Context ... be used as an approximation to the intractable posterior. This approach draws its roots from one way of deriving mean-field approximations in physics, and has been used recently for Bayesian inference [13, 5, 1]. The variational method has several advantages over MCMC and Laplace approximations. Unlike MCMC, convergence can be assessed easily by monitoring F. The approximate posterior is encoded efficiently in...

99 | SMEM algorithm for mixture models
- Ueda, Nakano, et al.
- 1999
Citation Context ...dure for fitting MFA and MPCA models can be derived from the Expectation Maximisation algorithm [4, 11]. The maximum likelihood (ML) approach to MFA can easily get caught in local maxima.² Ueda et al. [12] provide an effective deterministic procedure for avoiding local maxima by considering splitting a factor analyser in one part of space and merging two in another part. But splits and merges have to ...

79 | Ensemble learning for hidden Markov models
- MacKay
- 1997
Citation Context ... be used as an approximation to the intractable posterior. This approach draws its roots from one way of deriving mean-field approximations in physics, and has been used recently for Bayesian inference [13, 5, 1]. The variational method has several advantages over MCMC and Laplace approximations. Unlike MCMC, convergence can be assessed easily by monitoring F. The approximate posterior is encoded efficiently in...

73 | Bayesian approaches to Gaussian mixture modeling
- Roberts, Husmeier, et al.
- 1998
Citation Context ...y difficult to assess convergence. Furthermore, the posterior density over parameters is stored as a set of samples, which can be inefficient. Another approach to Bayesian integration for Gaussian mixtures [9] is the Laplace approximation, which makes a local Gaussian approximation around a maximum a posteriori parameter estimate. These approximations are based on large data limits and can be poor, particul...

60 | Bayesian methods for mixtures of experts
- Waterhouse, MacKay, Robinson
- 1996
Citation Context ... be used as an approximation to the intractable posterior. This approach draws its roots from one way of deriving mean-field approximations in physics, and has been used recently for Bayesian inference [13, 5, 1]. The variational method has several advantages over MCMC and Laplace approximations. Unlike MCMC, convergence can be assessed easily by monitoring F. The approximate posterior is encoded efficiently in...

44 | Bayesian PCA
- Bishop
- 1999
Citation Context ...o zero, which allows the model to reduce the intrinsic dimensionality of x if the data does not warrant this added dimension. This method of intrinsic dimensionality reduction has been used by Bishop [2] for Bayesian PCA, and is closely related to MacKay and Neal's method for automatic relevance determination (ARD) for inputs to a neural network. To avoid overfitting it is important that we integrate...

41 | Assessing relevance determination methods using DELVE
- Neal
- 1998
Citation Context ...ic dimensionality reduction has been used by Bishop [2] for Bayesian PCA, and is closely related to MacKay and Neal's method for automatic relevance determination (ARD) for inputs to a neural network [6]. To avoid overfitting it is important to integrate out all parameters whose cardinality scales with model complexity (i.e. number of components and their dimensionalities). We therefore also integrat...
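The ARD mechanism these snippets describe can be caricatured in a few lines (invented numbers and a simplified point-estimate update, not the paper's fully variational treatment): each factor-loading column k gets a precision hyperparameter ν_k, and columns the data does not support are driven to large ν_k, shrinking them to zero and lowering that analyser's dimensionality.

```python
# Hypothetical ARD pruning sketch; the threshold and data are illustrative only.
d = 4  # observation dimensionality

# Candidate factor-loading columns for one analyser.
columns = [
    [1.2, -0.8, 0.5, 0.9],        # strongly supported factor
    [0.01, -0.02, 0.015, 0.005],  # barely supported factor
]

def ard_precision(col):
    # Simplified update nu_k = d / ||lambda_k||^2:
    # a column the data does not use gets a huge precision.
    return d / sum(x * x for x in col)

nus = [ard_precision(c) for c in columns]
pruned = [nu > 100.0 for nu in nus]  # large precision => drop that dimension
```

Run on these numbers, the weak second column gets a precision three orders of magnitude above the first and is pruned, which is the per-component dimensionality selection the abstract advertises.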

9 | Variational PCA
- Bishop
- 1999
Citation Context ...o zero, which allows the model to reduce the intrinsic dimensionality of x if the data does not warrant this added dimension. This method of intrinsic dimensionality reduction has been used by Bishop [2] for Bayesian PCA, and is closely related to MacKay and Neal's method for automatic relevance determination (ARD) for inputs to a neural network [6]. To avoid overfitting it is important to integrate o...

2 | The countably infinite Bayesian Gaussian mixture density model
- Rasmussen
- 1999
Citation Context ...ally and analytically intractable to perform the required integrals. For Gaussian mixture models Markov chain Monte Carlo (MCMC) methods have been developed to approximate these integrals by sampling [6, 5]. The main criticism of MCMC methods is that they are slow and it is usually difficult to assess convergence. Furthermore, the posterior density over parameters is stored as a set of samples, which can...

1 | Learning model structure
- Ghahramani, Attias, et al.
- 1999
Citation Context ...ck Q to be a good approximation to P and therefore hopefully a good proposal distribution. Third, this procedure can be applied to any variational approximation. A detailed exposition can be found in [3]. (Section 6, Results) Experiment 1: Discovering the number of components. We tested the model on synthetic data generated from a mixture of 18 Gaussians with 50 points per cluster (Figure 2, top left). The vari...
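The importance-sampling idea this snippet and the abstract refer to can be sketched on a discrete toy problem (illustrative numbers, not the paper's MFA setting): draw θ from the variational approximation Q and average the weights p(x, θ)/Q(θ). The expectation of that average is exactly the evidence p(x), which is what makes the estimator unbiased; a Q close to the posterior keeps its variance small.

```python
import random

# Discrete toy: joint p(x, theta) for theta in {0, 1, 2} (made-up values).
joint = [0.05, 0.20, 0.10]
evidence = sum(joint)  # exact p(x), available here by enumeration

# A variational approximation Q(theta), used as the proposal distribution.
q = [0.2, 0.5, 0.3]

# Identity behind unbiasedness:
# E_Q[p(x, theta) / Q(theta)] = sum_theta p(x, theta) = p(x).
exact_expectation = sum(qi * (pi / qi) for qi, pi in zip(q, joint))

# Monte Carlo version of the same expectation.
rng = random.Random(0)
n = 200_000
draws = rng.choices(range(3), weights=q, k=n)
estimate = sum(joint[t] / q[t] for t in draws) / n
```

The same recipe applies to any variational approximation, since it only needs samples from Q and pointwise evaluations of the joint and of Q.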

