## Inferring Parameters and Structure of Latent Variable Models by Variational Bayes (1999)

Citations: 136 (1 self)

### BibTeX

@INPROCEEDINGS{Attias99inferringparameters,

author = {Hagai Attias},

title = {Inferring Parameters and Structure of Latent Variable Models by Variational Bayes},

booktitle = {},

year = {1999},

pages = {21--30},

publisher = {Morgan Kaufmann Publishers}

}

### Abstract

Current methods for learning graphical models with latent variables and a fixed structure estimate optimal values for the model parameters. Whereas this approach usually produces overfitting and suboptimal generalization performance, carrying out the Bayesian program of computing the full posterior distributions over the parameters remains a difficult problem. Moreover, learning the structure of models with latent variables, for which the Bayesian approach is crucial, is yet a harder problem. In this paper I present the Variational Bayes framework, which provides a solution to these problems. This approach approximates full posterior distributions over model parameters and structures, as well as latent variables, in an analytical manner without resorting to sampling methods. Unlike in the Laplace approximation, these posteriors are generally non-Gaussian and no Hessian needs to be computed. The resulting algorithm generalizes the standard Expectation Maximization a...
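As a concrete illustration of the approach the abstract describes, here is a minimal sketch of factorized Variational Bayes for the simplest latent-free case: a single Gaussian with unknown mean and precision under a conjugate Normal-Gamma prior. The function name, prior hyperparameters, and update schedule are illustrative choices (following the standard textbook treatment), not taken from the paper, which develops the framework for general graphical models.

```python
import numpy as np

def vb_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, iters=50):
    """Factorized VB, q(mu, tau) = q(mu) q(tau), for x_i ~ N(mu, 1/tau)
    with prior mu | tau ~ N(mu0, 1/(lam0 tau)), tau ~ Gamma(a0, b0).
    Returns variational parameters (mu_n, lam_n, a_n, b_n)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = np.mean(x)
    # q(mu) mean and Gamma shape do not depend on E[tau]; fix them once.
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
    a_n = a0 + (n + 1) / 2.0
    e_tau = a0 / b0                       # initial guess for E[tau]
    for _ in range(iters):                # coordinate-ascent sweeps
        lam_n = (lam0 + n) * e_tau        # precision of q(mu)
        e_mu, e_mu2 = mu_n, mu_n**2 + 1.0 / lam_n
        # b_n = b0 + (1/2) E_q(mu)[ sum_i (x_i - mu)^2 + lam0 (mu - mu0)^2 ]
        b_n = b0 + 0.5 * (np.sum(x**2) - 2 * e_mu * np.sum(x) + n * e_mu2
                          + lam0 * (e_mu2 - 2 * mu0 * e_mu + mu0**2))
        e_tau = a_n / b_n                 # update E[tau] for the next sweep
    return mu_n, lam_n, a_n, b_n
```

Each sweep updates one factor holding the other's expectations fixed, which (as the abstract notes for the general case) yields non-Gaussian posteriors over the precision with no Hessian computation; the marginal q(tau) is a Gamma, not a Laplace/Gaussian fit.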

### Citations

8089 | Maximum likelihood from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977 |

1075 | A Bayesian method for the induction of probabilistic networks from data - Cooper, Herskovits - 1992 |

1070 | An Information-Maximization Approach to Blind Separation and Blind Deconvolution - Bell, Sejnowski - 1995 |

903 | Learning Bayesian networks: The combination of knowledge and statistical data - Heckerman, Geiger, et al. - 1995 |

624 | Statistical Analysis of Finite Mixture Distributions - Titterington, Smith, et al. - 1985 |

522 | Bayesian interpolation - MacKay - 1992 |

Citation context: "...earn the structure of the graph, since more complicated graphs assign a higher likelihood to the data. Third, it is computationally tractable only for a small class of models. The Bayesian framework (MacKay 1992a, 1992b; Cooper and Herskovits 1992; Heckerman et al. 1995) provides, in principle, a solution to the first two problems. In this framework one considers an ensemble of models, characterized by a proba..."

440 | On Bayesian analysis of mixtures with an unknown number of components - Richardson, Green - 1997 |

433 | Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture - Jutten, Herault - 1991 |

399 | A practical Bayesian framework for backpropagation networks - MacKay - 1992 |

Citation context: "...earn the structure of the graph, since more complicated graphs assign a higher likelihood to the data. Third, it is computationally tractable only for a small class of models. The Bayesian framework (MacKay 1992a, 1992b; Cooper and Herskovits 1992; Heckerman et al. 1995) provides, in principle, a solution to the first two problems. In this framework one considers an ensemble of models, characterized by a proba..."

378 | Equivariant adaptive source separation - Cardoso, Laheld - 1996 |

323 | Estimating the dimension of a model - Schwartz - 1978 |

Citation context: "...reduces in this limit to a term that is linear in the number of the ML model parameters, plus a simple regularizer log p(θ₀). Finally, we point out that the Bayesian information criterion (BIC) (Schwartz 1978) and the minimum description length criterion (MDL) (Rissanen 1987) both emerge as a special case of our large sample expression (7), corresponding to using a flat prior p(θ) and exact (rather than varia..."

221 | Independent factor analysis - Attias - 1999 |

Citation context: "...matrices, as well as the source distributions, from noisy data. Since the computational complexity of the algorithm increases exponentially with the number of sources, the large m case is treated in (Attias 1999a) by a structured variational approximation (Ghahramani and Jordan 1997). However, in realistic cases the observed data is generated by an unknown number of sources m. Here we exploit the VB approach..."

220 | The Bayesian Structural EM Algorithm - Friedman - 1998 |

Citation context: "...Bayesian framework can seldom be performed exactly, due to the need to integrate over models. Approximations therefore must be made (see, e.g., Cheeseman and Stutz 1995; Chickering and Heckerman 1997; Friedman 1998), the major schemes being Markov chain Monte Carlo methods and Laplace approximation. The former attempts to achieve exact results but typically requires vast computational resources. The latter has..."

96 | EM algorithms for ML factor analysis - Rubin, Thayer - 1982 |

Citation context: "...p(x | m) (26). We point out that in the large sample limit, the covariance of A vanishes and its mean becomes A = C_yx (C_xx)^{-1}, a form appearing in the ordinary EM algorithms for factor analysis (Rubin and Thayer 1982) and independent factor analysis (Attias 1999a). However, the source posterior cannot be obtained by directly optimizing F (see (12)), due to the non-Gaussian nature of the sources. Instead we use two..."

52 | A view of the EM algorithm that justifies incremental, sparse, and other variants - Neal, Hinton - 1998 |

39 | Blind source separation and deconvolution: the dynamic component analysis algorithm - Attias, Schreiner - 1998 |

38 | Learning nonlinear overcomplete representation for efficient coding - Lewicki, Sejnowski - 1997 |

30 | Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables - Chickering, Heckerman - 1997 |

30 | Stochastic complexity (with discussion) - Rissanen - 1987 |

Citation context: "...the ML model parameters, plus a simple regularizer log p(θ₀). Finally, we point out that the Bayesian information criterion (BIC) (Schwartz 1978) and the minimum description length criterion (MDL) (Rissanen 1987) both emerge as a special case of our large sample expression (7), corresponding to using a flat prior p(θ) and exact (rather than variational) posterior q(H). 2.4 Optimal Posteriors and Relation to EM..."
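The Schwartz and Rissanen citation contexts note that BIC and MDL fall out of the paper's large-sample expression under a flat prior and exact posterior. For reference, a minimal sketch of BIC-based model comparison (the toy models, data, and sign convention, where higher BIC is better, are illustrative, not from the paper):

```python
import numpy as np

def gaussian_loglik(x, mu, var):
    """Log-likelihood of data x under N(mu, var)."""
    return float(np.sum(-0.5 * np.log(2 * np.pi * var)
                        - (x - mu) ** 2 / (2 * var)))

def bic(loglik, k, n):
    """BIC as a log-marginal-likelihood approximation:
    maximized log-likelihood minus (k/2) log n, with k free parameters."""
    return loglik - 0.5 * k * np.log(n)

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=200)

# Model A: mean fixed at 0, variance fitted (k = 1)
bic_a = bic(gaussian_loglik(x, 0.0, np.mean(x ** 2)), k=1, n=len(x))
# Model B: mean and variance both fitted (k = 2)
bic_b = bic(gaussian_loglik(x, np.mean(x), np.var(x)), k=2, n=len(x))
# The extra mean parameter in B must earn back 0.5 * log(200) ≈ 2.65
# nats of likelihood over A to be preferred.
```

The penalty grows with sample size, so spurious extra parameters are increasingly disfavored; this is the complexity/fit trade-off that the variational free energy handles without the large-sample assumption.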

23 | Choice of Basis for Laplace Approximation - MacKay - 1998 |

Citation context: "...of parameters and N the dataset (sample) size, but is a good approximation only in the limit N/m → ∞; in particular, it assumes that all posterior distributions are Normal (but see the discussion in MacKay 1998a). Naturally, the situation becomes worse when hidden variables exist. In this paper I present the Variational Bayes framework for computations in graphical models. This framework facilitates analyti..."

15 | Bayesian logistic regression: A variational approach - Jaakkola, Jordan - 1997 |

14 | Mean field theory for sigmoid belief networks - Saul, Jaakkola, et al. - 1996 |

11 | Bayesian classification (AutoClass): Theory and results - Cheeseman, Stutz - 1996 |

11 | Factorial hidden Markov models - Ghahramani, Jordan - 1997 |

2 | The countably infinite Bayesian Gaussian mixture density model - Rasmussen - 1999 |

Citation context: "...asses. While the Bayesian approach provides the solution in principle, no satisfactory practical algorithm has emerged from the application of involved sampling techniques (Richardson and Green 1997; Rasmussen 1999) and approximation methods (e.g., Cheeseman and Stutz 1995) to this problem. We now show that an elegant solution is provided by the VB approach. We consider models of the form p(y | θ, m) = Σ_{s=1}^{m} ..."

1 | Hierarchical IFA belief networks - Attias - 1999 |

Citation context: "...matrices, as well as the source distributions, from noisy data. Since the computational complexity of the algorithm increases exponentially with the number of sources, the large m case is treated in (Attias 1999a) by a structured variational approximation (Ghahramani and Jordan 1997). However, in realistic cases the observed data is generated by an unknown number of sources m. Here we exploit the VB approach..."

1 | Ensemble learning for hidden Markov models - MacKay - 1998 |