## Graphical Models and Variational Methods (2001)

Citations: 38 (2 self)

### BibTeX

@MISC{Ghahramani01graphicalmodels,
  author = {Zoubin Ghahramani and Matthew J. Beal},
  title  = {Graphical Models and Variational Methods},
  year   = {2001}
}

### Abstract

We review the use of variational methods for approximating inference and learning in probabilistic graphical models. In particular, we focus on variational approximations to the integrals required for Bayesian learning. For models in the conjugate-exponential family, a generalisation of the EM algorithm is derived that iterates between optimising hyperparameters of the distribution over parameters and inferring the hidden variable distributions. These approximations make use of available propagation algorithms for probabilistic graphical models. We give two case studies of how the variational Bayesian approach can be used to learn model structure: inferring the number of clusters and dimensionalities in a mixture of factor analysers, and inferring the dimension of the state space of a linear dynamical system. Finally, importance sampling corrections to the variational approximations are discussed, along with their limitations.
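As a concrete (and deliberately minimal) illustration of the variational Bayesian updates the abstract describes, the sketch below applies a factorised approximation q(mu)q(tau) to the simplest conjugate-exponential model: a one-dimensional Gaussian with unknown mean mu and precision tau. This is a standard textbook construction rather than code from the paper; all names, priors, and settings are illustrative.

```python
import numpy as np

def vb_gaussian(y, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, iters=50):
    """Variational Bayes for a 1-D Gaussian with unknown mean mu and
    precision tau (a conjugate-exponential model).  The posterior is
    approximated by a factorised q(mu) q(tau); each update below is the
    free-form optimum of the variational lower bound given the other
    factor, so the loop mirrors the EM-like alternation in the paper."""
    N, ybar = len(y), np.mean(y)
    E_tau = a0 / b0                          # initial guess for <tau>
    for _ in range(iters):
        # q(mu) = N(mu_n, 1/lam_n): depends on tau only through <tau>
        lam_n = (lam0 + N) * E_tau
        mu_n = (lam0 * mu0 + N * ybar) / (lam0 + N)
        E_mu, E_mu2 = mu_n, mu_n ** 2 + 1.0 / lam_n
        # q(tau) = Gamma(a_n, b_n): depends on mu through <mu>, <mu^2>
        a_n = a0 + (N + 1) / 2.0
        b_n = b0 + 0.5 * (lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0 ** 2)
                          + np.sum(y ** 2) - 2 * np.sum(y) * E_mu
                          + N * E_mu2)
        E_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n

rng = np.random.default_rng(0)
y = rng.normal(2.0, 0.5, size=500)           # true mean 2, precision 4
mu_n, lam_n, a_n, b_n = vb_gaussian(y)
```

Because each update is a coordinate-ascent step on the same lower bound, the iteration cannot decrease the bound, which is the property the chapter's variational EM generalisation relies on.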

### Citations

8919 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...ity known in statistical physics as the free energy: the expected energy under Q minus the entropy of Q [26], where we use Q to mean the set of all Q_{x_i}. The Expectation-Maximization (EM) algorithm [2, 5] alternates between maximising F with respect to the Q_{x_i} and θ, respectively, holding the others fixed. Starting from some initial parameters θ^0: E step: Q_{x_i}^{k+1} := arg max_{Q_{x_i}} F(Q; θ^k), ∀ i (5). M s...
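The EM alternation quoted in this context can be made concrete with a small sketch (not taken from the chapter): EM for a two-component one-dimensional Gaussian mixture, where the E step sets Q over the hidden component labels to their exact posterior (maximising F over Q with the parameters held fixed) and the M step re-estimates the parameters with Q held fixed. All names are illustrative.

```python
import numpy as np

def em_mixture(y, iters=100):
    """EM for a two-component 1-D Gaussian mixture, viewed as coordinate
    ascent on the free energy F(Q, theta)."""
    mu = np.array([np.min(y), np.max(y)])    # crude initialisation
    var = np.array([np.var(y), np.var(y)])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E step: responsibilities Q(s_i = k) ~ pi_k N(y_i | mu_k, var_k)
        logp = (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
                - 0.5 * (y[:, None] - mu) ** 2 / var)
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M step: weighted maximum-likelihood updates of pi, mu, var
        nk = r.sum(axis=0)
        pi = nk / len(y)
        mu = (r * y[:, None]).sum(axis=0) / nk
        var = (r * (y[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 300)])
pi, mu, var = em_mixture(y)
```

Here the E step is tractable, so Q is the exact posterior; the chapter's variational methods address exactly the models where this step is intractable.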

7441 | Probabilistic Reasoning in Intelligent Systems
- Pearl
- 1988
Citation Context ... act on the real world we need to represent uncertainty. Probability theory provides a language for representing uncertain beliefs and a calculus for manipulating these beliefs in a consistent manner [4, 28, 16]. However, the real world problems a machine may be faced with might involve hundreds or thousands of variables, and at first it may seem daunting to represent and manipulate joint distributions over al...

1343 | Local Computations with Probabilities on Graphical Structures and their Application to Expert Systems
- Lauritzen, Spiegelhalter
- 1988
Citation Context ...g the conditional independence relationships, also provide a backbone upon which it has been possible to derive efficient message-propagating algorithms for updating the uncertain beliefs of the machine [28, 21, 18, 12]. This chapter focuses on learning and belief updating in models for which these are intractable despite the use of these efficient propagation algorithms. For such models one has to resort to approximat...

969 | An introduction to Bayesian Networks
- Jensen
- 1996
Citation Context ...g the conditional independence relationships, also provide a backbone upon which it has been possible to derive efficient message-propagating algorithms for updating the uncertain beliefs of the machine [28, 21, 18, 12]. This chapter focuses on learning and belief updating in models for which these are intractable despite the use of these efficient propagation algorithms. For such models one has to resort to approximat...

895 | A tutorial on learning with Bayesian Networks
- Heckerman
- 1995
Citation Context ...g the conditional independence relationships, also provide a backbone upon which it has been possible to derive efficient message-propagating algorithms for updating the uncertain beliefs of the machine [28, 21, 18, 12]. This chapter focuses on learning and belief updating in models for which these are intractable despite the use of these efficient propagation algorithms. For such models one has to resort to approximat...

866 | An introduction to variational methods for graphical models. Learning in Graphical Models
- Jordan
- 1999
Citation Context ...hysics. Variational methods have been developed both for maximum likelihood (ML) learning and Bayesian learning. In section 3 we describe their use in ML learning, which is reviewed in more detail in [20]. Readers familiar with the lower-bound derivation of EM and the use of variational methods in ML learning can skip this section. In section 4, we motivate how the Bayesian approach of integrating ove...

837 | A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains
- Baum, Petrie, et al.
- 1972
Citation Context ...ity known in statistical physics as the free energy: the expected energy under Q minus the entropy of Q [26], where we use Q to mean the set of all Q_{x_i}. The Expectation-Maximization (EM) algorithm [2, 5] alternates between maximising F with respect to the Q_{x_i} and θ, respectively, holding the others fixed. Starting from some initial parameters θ^0: E step: Q_{x_i}^{k+1} := arg max_{Q_{x_i}} F(Q; θ^k), ∀ i (5). M s...

650 | Learning in Graphical Models
- Jordan
- 1998
Citation Context ...nal bounds. Finally, we conclude with section 9. We assume that the reader is familiar with the basics of inference in probabilistic graphical models. For relevant tutorials he or she is referred to: [18, 12, 19, 30]. 3 Variational methods for maximum likelihood learning Variational methods have been used for approximate maximum likelihood learning in probabilistic graphical models with hidden variables. To under...

590 | Probabilistic inference using Markov chain Monte Carlo methods
- Neal
- 1993
Citation Context ...in practice it is often computationally and analytically intractable to perform the required integrals. Markov chain Monte Carlo (MCMC) methods can be used to approximate these integrals by sampling [25]. The main criticism of MCMC methods is that they are slow and it is usually difficult to assess convergence. Furthermore, the posterior density over parameters, P(θ|Y, M), which captures all information...
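The sampling approach this context contrasts with variational methods can be shown in miniature. Below is a random-walk Metropolis sampler that approximates a posterior expectation by averaging over samples; the target (a standard normal log-density) and all names are illustrative, not from the paper.

```python
import numpy as np

def metropolis(logp, x0=0.0, steps=20000, scale=2.5, seed=0):
    """Random-walk Metropolis: approximate integrals under an
    unnormalised density exp(logp(x)) by averaging over a Markov
    chain whose stationary distribution is that density."""
    rng = np.random.default_rng(seed)
    x, lp = x0, logp(x0)
    samples = []
    for _ in range(steps):
        prop = x + rng.normal(0, scale)          # symmetric proposal
        lp_prop = logp(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept/reject
            x, lp = prop, lp_prop
        samples.append(x)
    return np.array(samples[steps // 2:])        # discard burn-in

# toy "posterior": standard normal log-density (up to a constant)
samples = metropolis(lambda x: -0.5 * x * x)
post_mean = samples.mean()
```

The criticism quoted above shows up directly here: the chain's samples are correlated, so assessing how long to run it (and how much burn-in to discard) is the hard part in practice.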

572 | Probability Theory: The Logic of Science
- Jaynes
- 2007
Citation Context ... act on the real world we need to represent uncertainty. Probability theory provides a language for representing uncertain beliefs and a calculus for manipulating these beliefs in a consistent manner [4, 28, 16]. However, the real world problems a machine may be faced with might involve hundreds or thousands of variables, and at first it may seem daunting to represent and manipulate joint distributions over al...

511 | Factorial hidden Markov models
- Ghahramani
- 1997
Citation Context ...te data case. For many models, especially those with multiple hidden variables forming a distributed representation of the observed variables, even these sufficient statistics are intractable to compute [24, 37, 13, 11, 10]. In the E step, rather than optimising F over all Q, we constrain Q to be of a particular form, for example factorised. We can still optimise F as a functional of constrained distributions Q using ca...

419 | Mixtures of probabilistic principal component analysers
- Tipping, Bishop
- 1999
Citation Context ...e identity the model becomes a mixture of probabilistic principal components analysis (PCA). Tractable maximum likelihood procedures for fitting MFA and MPCA models can be derived from the EM algorithm [9, 35]. Since P(s|·) is multinomial, and both P(x) and P(y|x, s, ·, ·) are Gaussian, the model satisfies condition (1), that is, it has a complete data likelihood in the exponential family. Note that if we...
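The probabilistic PCA building block this context refers to admits a compact EM algorithm. The sketch below implements it for a single latent dimension; it is an illustration of the PPCA component only (not the paper's mixture-model code), and all names and the simulated data are assumptions.

```python
import numpy as np

def em_ppca_1d(Y, iters=200, seed=0):
    """EM for probabilistic PCA with one latent dimension:
    y = w x + eps, x ~ N(0, 1), eps ~ N(0, sigma2 I).  The E step is
    the exact Gaussian posterior over each latent x_n; the M step
    updates w and sigma2 in closed form."""
    N, d = Y.shape
    rng = np.random.default_rng(seed)
    w = rng.normal(size=d)                   # random initialisation
    sigma2 = 1.0
    for _ in range(iters):
        m = w @ w + sigma2                   # posterior scale factor
        Ex = Y @ w / m                       # E[x_n | y_n]
        Ex2 = sigma2 / m + Ex ** 2           # E[x_n^2 | y_n]
        w = (Y * Ex[:, None]).sum(axis=0) / Ex2.sum()
        sigma2 = np.mean(np.sum(Y ** 2, axis=1)
                         - 2 * Ex * (Y @ w) + Ex2 * (w @ w)) / d
    return w, sigma2

rng = np.random.default_rng(1)
w_true = np.array([2.0, 1.0])
x = rng.normal(size=2000)
Y = x[:, None] * w_true + rng.normal(0, 0.5, size=(2000, 2))
w_est, s2 = em_ppca_1d(Y)
```

In the mixture setting of the paper, one such factor model per component is combined with a multinomial over component labels, and the variational Bayesian treatment additionally integrates over w and the mixing proportions.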

275 | A Unifying Review of Linear Gaussian Models
- Roweis, Ghahramani
- 1999
Citation Context ...nal bounds. Finally, we conclude with section 9. We assume that the reader is familiar with the basics of inference in probabilistic graphical models. For relevant tutorials he or she is referred to: [18, 12, 19, 30]. 3 Variational methods for maximum likelihood learning Variational methods have been used for approximate maximum likelihood learning in probabilistic graphical models with hidden variables. To under...

257 | An approach to time series smoothing and forecasting using the EM algorithm
- Shumway, Stoffer
- 1982
Citation Context ...cally manageable. However, since the model is conjugate-exponential we can apply Theorem 1 to derive a variational EM algorithm for state-space models analogous to the maximum-likelihood EM algorithm [33]. Writing out the expression for ln P(A, C, ρ, x_{1:T}, y_{1:T}), one sees that it contains interaction terms between ρ and C, but none between A and either ρ or C. This observation implies a further f...

232 | The EM algorithm for mixtures of factor analyzers
- Ghahramani, Hinton
- 1996
Citation Context ...e identity the model becomes a mixture of probabilistic principal components analysis (PCA). Tractable maximum likelihood procedures for fitting MFA and MPCA models can be derived from the EM algorithm [9, 35]. Since P(s|·) is multinomial, and both P(x) and P(y|x, s, ·, ·) are Gaussian, the model satisfies condition (1), that is, it has a complete data likelihood in the exponential family. Note that if we...

198 | A variational Bayesian framework for graphical models
- Attias
- 2000
Citation Context ... with its own parameters. It has since been applied to various other models with hidden states and no restrictions on Q(θ) and Q_{x_i}(x_i) other than the assumption that they factorise in some way [36, 23, 3, 1, 8]. With only these factorisation assumptions, free-form optimisation with respect to the distributions Q(θ) and Q_{x_i}(x_i) is done using calculus of variations, and often results in a modified EM-li...

190 | Connectionist learning of belief networks
- Neal
- 1992
Citation Context ...te data case. For many models, especially those with multiple hidden variables forming a distributed representation of the observed variables, even these sufficient statistics are intractable to compute [24, 37, 13, 11, 10]. In the E step, rather than optimising F over all Q, we constrain Q to be of a particular form, for example factorised. We can still optimise F as a functional of constrained distributions Q using ca...

177 | Probability, frequency and reasonable expectation
- Cox
- 1946
Citation Context ... act on the real world we need to represent uncertainty. Probability theory provides a language for representing uncertain beliefs and a calculus for manipulating these beliefs in a consistent manner [4, 28, 16]. However, the real world problems a machine may be faced with might involve hundreds or thousands of variables, and at first it may seem daunting to represent and manipulate joint distributions over al...

155 | Variational inference for Bayesian mixture of factor analysers
- Ghahramani, Beal
- 2000
Citation Context ... with its own parameters. It has since been applied to various other models with hidden states and no restrictions on Q(θ) and Q_{x_i}(x_i) other than the assumption that they factorise in some way [36, 23, 3, 1, 8]. With only these factorisation assumptions, free-form optimisation with respect to the distributions Q(θ) and Q_{x_i}(x_i) is done using calculus of variations, and often results in a modified EM-li...

147 | Variational learning for switching state-space models
- Ghahramani, Hinton
- 2000
Citation Context ...te data case. For many models, especially those with multiple hidden variables forming a distributed representation of the observed variables, even these sufficient statistics are intractable to compute [24, 37, 13, 11, 10]. In the E step, rather than optimising F over all Q, we constrain Q to be of a particular form, for example factorised. We can still optimise F as a functional of constrained distributions Q using ca...

132 | Keeping neural networks simple by minimizing the description length of the weights
- Hinton, Camp
- 1993
Citation Context ...proach is actually fit to the data. Having more parameters imparts an advantage in terms of the ability to model the data, but this is offset by the cost of having to code that parameter under the prior [14]. Along with the prior over parameters, a Bayesian approach to learning starts with some prior knowledge or assumptions about the model structure: the set of arcs in the Bayesian network. This initial ...

126 | Mean field theory for sigmoid belief networks
- Saul, Jaakkola, et al.
- 1996
Citation Context ...x_i) = ∏_{j=1}^m Q_{x_{ij}}(x_{ij}), these fixed-point equations are called mean-field equations by analogy to such methods in statistical physics. Examples of these variational approximations can be found in [31, 6, 15, 11]. 4 Variational methods for Bayesian learning Maximum likelihood methods suffer from the problem that they fail to take into account model complexity, which is, from an information theoretic view,...
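The mean-field fixed-point equations this context names can be written down explicitly for the classic case of a Boltzmann machine / Ising model with ±1 spins. The sketch below is a standard illustration under that assumption, not code from the chapter; the couplings and biases are made up for the example.

```python
import numpy as np

def mean_field_ising(J, h, iters=200):
    """Mean-field equations for an Ising model over spins s_i in
    {-1, +1} with couplings J (zero diagonal) and biases h.  The fully
    factorised Q(s) = prod_i Q_i(s_i) is parameterised by the means
    m_i = <s_i>; maximising the free energy over each Q_i in turn
    gives the fixed-point update m_i = tanh(h_i + sum_j J_ij m_j)."""
    m = np.zeros(len(h))
    for _ in range(iters):
        for i in range(len(h)):          # asynchronous sweeps
            m[i] = np.tanh(h[i] + J[i] @ m)
    return m

# two ferromagnetically coupled spins with a positive field on spin 0
J = np.array([[0.0, 0.5], [0.5, 0.0]])
h = np.array([1.0, 0.0])
m = mean_field_ising(J, h)
```

The field on spin 0 pulls its mean positive, and the coupling drags spin 1 in the same direction, exactly the self-consistent behaviour the statistical-physics analogy refers to.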

115 | Autoencoders, minimum description length and Helmholtz free energy
- Hinton, Zemel
- 1993

111 | Propagation algorithms for variational Bayesian learning
- Ghahramani, Beal
- 2000
Citation Context ...he model replaced with the following corresponding expectations under the Q distribution: ⟨ρ_i c_i⟩, ⟨ρ_i c_i c_i^T⟩, ⟨A⟩, ⟨A^T A⟩. We omit the details here. Results from this model are presented in [7]. 5 It is straightforward to extend the following derivations to SSMs with inputs. 6 More generally, if we let R be a full covariance matrix, for conjugacy we would give its inverse V = R^{-1} a Wishart d...

83 | Ensemble Learning for Hidden Markov Models
- MacKay
- 1997
Citation Context ... with its own parameters. It has since been applied to various other models with hidden states and no restrictions on Q(θ) and Q_{x_i}(x_i) other than the assumption that they factorise in some way [36, 23, 3, 1, 8]. With only these factorisation assumptions, free-form optimisation with respect to the distributions Q(θ) and Q_{x_i}(x_i) is done using calculus of variations, and often results in a modified EM-li...

71 | Ockham's razor and Bayesian analysis
- Jeffreys, Berger
- 1992
Citation Context ...dels can a priori model a larger range of data sets. This property of Bayesian integration has been called Ockham's razor, since it favors simpler explanations (models) for the data over complex ones [17, 22]. The overfitting problem is avoided simply because no parameter in the pure Bayesian approach is actually fit to the data. Having more parameters imparts an advantage in terms of the ability to model th...

61 | Bayesian Methods for Mixtures of Experts
- Waterhouse, MacKay, et al.
- 1996

59 | Variational methods for inference and estimation in graphical models. Doctoral dissertation
- Jaakkola
- 1997
Citation Context ...x_i) = ∏_{j=1}^m Q_{x_{ij}}(x_{ij}), these fixed-point equations are called mean-field equations by analogy to such methods in statistical physics. Examples of these variational approximations can be found in [31, 6, 15, 11]. 4 Variational methods for Bayesian learning Maximum likelihood methods suffer from the problem that they fail to take into account model complexity, which is, from an information theoretic view,...

58 | Solutions to the linear smoothing problem
- Rauch
- 1963
Citation Context ...E step: computing Q(x_{1:T}). Since SSMs are singly connected belief networks, Corollary 1 tells us that we can make use of belief propagation, which in the case of SSMs is known as the Kalman smoother [29]. We therefore run the Kalman smoother with every appearance of the natural parameters of the model replaced with the following corresponding expectations under the Q distribution: ⟨ρ_i c_i⟩, ⟨ρ_i c_i...
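As an illustrative sketch of the propagation step this context describes, the following implements a one-dimensional Rauch-Tung-Striebel (Kalman) smoother with fixed parameters; in the variational Bayesian E step those fixed parameters would be replaced by the corresponding expectations under Q. The function name and the simulated model are assumptions for the example, not the paper's code.

```python
import numpy as np

def rts_smoother_1d(y, a, c, q, r, mu0=0.0, v0=1.0):
    """Rauch-Tung-Striebel smoother for the 1-D linear dynamical system
    x_t = a x_{t-1} + w,  y_t = c x_t + v,  w ~ N(0, q), v ~ N(0, r).
    This is belief propagation on the singly connected SSM graph:
    a forward (filtering) pass followed by a backward pass."""
    T = len(y)
    mu_f, v_f = np.zeros(T), np.zeros(T)     # filtered means/variances
    mu_p, v_p = mu0, v0                      # one-step predictions
    for t in range(T):
        k = v_p * c / (c * c * v_p + r)      # Kalman gain
        mu_f[t] = mu_p + k * (y[t] - c * mu_p)
        v_f[t] = (1 - k * c) * v_p
        mu_p, v_p = a * mu_f[t], a * a * v_f[t] + q
    mu_s, v_s = mu_f.copy(), v_f.copy()      # backward (smoothing) pass
    for t in range(T - 2, -1, -1):
        g = v_f[t] * a / (a * a * v_f[t] + q)
        mu_s[t] = mu_f[t] + g * (mu_s[t + 1] - a * mu_f[t])
        v_s[t] = v_f[t] + g * g * (v_s[t + 1] - (a * a * v_f[t] + q))
    return mu_s, v_s

rng = np.random.default_rng(0)
a, c, q, r = 0.9, 1.0, 0.1, 1.0
x, xs = 0.0, []
for _ in range(200):                          # simulate the SSM
    x = a * x + rng.normal(0, np.sqrt(q))
    xs.append(x)
xs = np.array(xs)
y = c * xs + rng.normal(0, np.sqrt(r), size=200)
mu_s, v_s = rts_smoother_1d(y, a, c, q, r)
```

Because the smoother uses the whole observation sequence, its state estimates are substantially more accurate than the raw observations, which is what makes it a suitable exact subroutine inside the variational E step.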

57 | A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models
- Neal, Hinton
- 1998
Citation Context ...bal configuration (x, y) to be −ln P(x, y|θ), the lower bound F ≤ L(θ) is the negative of a quantity known in statistical physics as the free energy: the expected energy under Q minus the entropy of Q [26], where we use Q to mean the set of all Q_{x_i}. The Expectation-Maximization (EM) algorithm [2, 5] alternates between maximising F with respect to the Q_{x_i} and θ, respectively, holding the others fixed....

53 | Factorial learning and the EM algorithm
- Ghahramani
- 1995
Citation Context ...x_i) = ∏_{j=1}^m Q_{x_{ij}}(x_{ij}), these fixed-point equations are called mean-field equations by analogy to such methods in statistical physics. Examples of these variational approximations can be found in [31, 6, 15, 11]. 4 Variational methods for Bayesian learning Maximum likelihood methods suffer from the problem that they fail to take into account model complexity, which is, from an information theoretic view,...

43 | Assessing relevance determination methods using DELVE
- Neal
- 1998
Citation Context ...n [3] for Bayesian PCA. These Gaussian priors are called automatic relevance determination (ARD) priors as they were used by MacKay and Neal to do relevant input variable selection in neural networks [27]. To avoid overfitting it is important to integrate out all parameters whose cardinality scales with model complexity (i.e. number of components and their dimensionalities). We therefore also integrate...

32 | Probable networks and plausible predictions: a review of practical Bayesian methods for supervised neural networks
- MacKay
- 1995
Citation Context ...dels can a priori model a larger range of data sets. This property of Bayesian integration has been called Ockham's razor, since it favors simpler explanations (models) for the data over complex ones [17, 22]. The overfitting problem is avoided simply because no parameter in the pure Bayesian approach is actually fit to the data. Having more parameters imparts an advantage in terms of the ability to model th...

9 | Variational PCA
- Bishop
- 1999

4 | Computation on Bayesian graphical models. Bayesian Statistics, 5:407-425 (see www.mrc-bsu.cam.ac.uk/bugs)
- Spiegelhalter, Thomas, et al.
- 1996
Citation Context ...e to automate the derivation of variational Bayesian learning procedures for a large family of models much in the same way as Gibbs sampling and propagation algorithms have been automated in the BUGS [34] and HUGIN [32] software systems, respectively. Through combining sampling, exact propagation algorithms, and variational methods, Bayesian inference in very large domains should be possible, opening ...

2 | Mean field networks that learn to discriminate temporally distorted strings
- Williams, Hinton
- 1991