## A variational approximation for Bayesian networks with discrete and continuous latent variables (1999)

### Download Links

- [www.ics.uci.edu]
- [www.cs.ubc.ca]
- [www.ai.mit.edu]
- [http.cs.berkeley.edu]
- [www.cs.berkeley.edu]
- DBLP

### Other Repositories/Bibliography

Venue: UAI

Citations: 43 (6 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Murphy99avariational,
  author    = {Kevin P. Murphy},
  title     = {A variational approximation for Bayesian networks with discrete and continuous latent variables},
  booktitle = {UAI},
  year      = {1999},
  pages     = {457--466},
  publisher = {Morgan Kaufmann}
}
```

### Abstract

We show how to use a variational approximation to the logistic function to perform approximate inference in Bayesian networks containing discrete nodes with continuous parents. Essentially, we convert the logistic function to a Gaussian, which facilitates exact inference, and then iteratively adjust the variational parameters to improve the quality of the approximation. We demonstrate experimentally that this approximation is much faster than sampling, but comparable in accuracy. We also introduce a simple new technique for handling evidence, which allows us to handle arbitrary distributions on observed nodes, as well as achieving a significant speedup in networks with discrete variables of large cardinality.
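The iterative scheme the abstract describes can be illustrated on the smallest possible case: a single Gaussian variable X with one logistic child R observed to be 1. The following Python sketch uses the Jaakkola-Jordan quadratic bound on the sigmoid with illustrative prior and weight values; it is not the paper's junction-tree implementation, and the function names are made up.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lam(xi):
    # lambda(xi) = (1/2 - sigma(xi)) / (2 xi); the limit at xi = 0 is -1/8
    return -0.125 if xi == 0 else (0.5 - sigmoid(xi)) / (2.0 * xi)

def variational_posterior(mu0, var0, w, b, iters=20):
    """Approximate p(x | R=1) for x ~ N(mu0, var0), P(R=1|x) = sigmoid(w*x + b).

    The quadratic bound makes the logistic likelihood look Gaussian in x,
    so each iteration does exact Gaussian conditioning, then re-tightens xi.
    """
    xi = 1.0  # initial guess for the variational parameter
    for _ in range(iters):
        l = lam(xi)
        prec = 1.0 / var0 - 2.0 * l * w * w          # bound adds -2*lam*w^2 precision
        var = 1.0 / prec
        mu = var * (mu0 / var0 + w / 2.0 + 2.0 * l * w * b)
        xi = math.sqrt(w * w * var + (w * mu + b) ** 2)  # xi^2 = E[(w x + b)^2]
    return mu, var

mu, var = variational_posterior(0.0, 1.0, 2.0, 0.0)
```

Observing R = 1 pulls the posterior mean positive and shrinks the variance, as expected for a unimodal posterior.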

### Citations

1363 |
Generalized linear models
- McCullagh, Nelder
- 1990
Citation Context: …step function; in the limit as |w_i| → 0, the sigmoid approaches a uniform distribution. It turns out that linear Gaussians and softmax are both special cases of Generalized Linear Models (GLIMs): see [MN83] or [JJ94b] for details. Although we can use GLIMs as CPDs for observed nodes (see Section 8), in general it is difficult to use them for hidden nodes, at least if we restrict ourselves to exact infer…

1103 | Numerical Modelling of the - S, ERRAUD, et al. - 2000 |

831 | An introduction to variational methods for graphical models
- Jordan, Ghahramani, et al.
- 1999
Citation Context: …models (ones which have small clique size). This is in contrast to the more common use of variational methods, which is to approximate inference in models which are too dense to solve exactly (see [JGJS98] for a review). With any approximation method, it is natural to ask how good the approximation is. Although a quadratic function is a poor approximation to a sigmoid, the joint probability P(X; R) (w…

764 | A view of the EM algorithm that justifies incremental sparse and other variants
- Neal, Hinton
- 1998
Citation Context: …(2r − 1)(w′x + b). If X is hidden, the optimal value of ξ cannot be computed. However, we can guess an initial value, and then iteratively adjust it to increase the quality of the approximation. As in EM [NH98], at each iteration we set ξ to the value that maximizes the expected complete-data log-likelihood, where the expectation is computed using the parameter values of the previous iteration. This results i…

724 | Hierarchical mixtures of experts and EM algorithm
- Jordan, Jacobs
- 1994
Citation Context: …function; in the limit as |w_i| → 0, the sigmoid approaches a uniform distribution. It turns out that linear Gaussians and softmax are both special cases of Generalized Linear Models (GLIMs): see [MN83] or [JJ94b] for details. Although we can use GLIMs as CPDs for observed nodes (see Section 8), in general it is difficult to use them for hidden nodes, at least if we restrict ourselves to exact inference. 2.1 E…

375 |
Evaluating influence diagrams
- Shachter
- 1986
Citation Context: …is the normalizing constant (n is the number of rows/columns in S), which ensures ∫_y N(y; μ, S) = 1. Networks in which all the variables have this kind of linear Gaussian distribution were studied in [SK89]. If the continuous child (also) has discrete parents, we can specify a Gaussian for each value of the discrete parents; this is called a Conditional Gaussian (CG) distribution. Note that a CG distrib…

293 |
The generalized distributive law
- Aji, McEliece
- 2000
Citation Context: …where C is the set of cliques, |x| is the number of values node x can take on (if it is discrete) or its length (if it is a continuous-valued vector). (See [MA98] for a more detailed discussion of the complexity of the junction tree algorithm for discrete networks.) 10 Experimental results: To see how accurate the variational approximation is, we compared the j…
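The complexity expression in this excerpt is garbled in the scrape, but the idea it refers to is standard: junction-tree cost is driven by the size of each clique's potential table. The sketch below is a simplified, discrete-only illustration of that count (clique sets and cardinalities are made-up examples), not the paper's exact formula.

```python
from math import prod

def junction_tree_table_size(cliques, card):
    """Total potential-table size: for each clique, the product of the
    cardinalities of its (discrete) member variables."""
    return sum(prod(card[v] for v in clique) for clique in cliques)

# Two cliques sharing B: tables of size 2*3 = 6 and 3*4 = 12, total 18.
cliques = [{"A", "B"}, {"B", "C"}]
card = {"A": 2, "B": 3, "C": 4}
total = junction_tree_table_size(cliques, card)  # 18
```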

288 | Bucket elimination: a unifying framework for probabilistic inference
- Dechter
- 1996
Citation Context: …marginals on all N families --- a prerequisite for efficient parameter and structure learning --- in two passes over the graph, whereas other, query-driven (goal-directed) algorithms, such as bucket elimination [Dec98] and SPI [CF91, CF95], would take N passes. In addition, the junction tree algorithm allows us to handle graphs with undirected cycles, unlike some previous work on networks with continuous variables…

164 | Graphical models for associations between variables, some of which are qualitative and some quantitative’, Annals of Statistics - Lauritzen, Wermuth - 1989 |

158 | Adaptive probabilistic networks with hidden variables
- Binder, Koller, et al.
- 1997
Citation Context: …Note that softmax for binary variables is equivalent to the logistic function when w = w_1 − w_0 and b = b_1 − b_0, since Pr(R = 1|X = x) = e^{w_1′x + b_1} / (e^{w_0′x + b_0} + e^{w_1′x + b_1}) = 1 / (1 + e^{(w_0 − w_1)′x + (b_0 − b_1)}). In the soft… [Figure 1: The crop network. Circles represent continuous (scalar) nodes, squares represent discrete (binary) nodes. This example is from [BKRK97].]
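The two-class softmax/logistic equivalence quoted above can be checked numerically. A small sketch with illustrative parameter values (the function names are made up):

```python
import math

def softmax2(x, w0, b0, w1, b1):
    """Two-class softmax P(R = 1 | x) with per-class weights and biases."""
    e0 = math.exp(w0 * x + b0)
    e1 = math.exp(w1 * x + b1)
    return e1 / (e0 + e1)

def logistic(x, w, b):
    """The equivalent logistic, with w = w1 - w0 and b = b1 - b0."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

# The two parameterizations agree for any input.
for x in (-2.0, 0.0, 3.5):
    p_soft = softmax2(x, w0=0.5, b0=-1.0, w1=2.0, b1=0.3)
    p_log = logistic(x, w=1.5, b=1.3)
    assert abs(p_soft - p_log) < 1e-12
```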

156 |
Simulation approaches to general probabilistic inference on belief networks
- Shachter, Peot
- 1990
Citation Context: …burn-in of 2000 iterations, and then sampled for 10,000 iterations. (Similar results were achieved using a burn-in of just 1000 plus 1000 iterations, and also using 1000 samples from likelihood weighting [SP90].) For the junction tree, we updated the variational parameters until the relative change in log-likelihood dropped below 0.001; when S was observed, so P had a unimodal distribution, this took 2--3 i…

149 | A.: Inference in belief networks: A procedural guide
- Huang, Darwiche
- 1996
Citation Context: …of 5. We plot σ(wx + b) and F(wx + b), where x is the price, w = −1 and b = 5. 3 The junction tree algorithm: In this section, we give a brief overview of the junction tree algorithm (see e.g., [HD94] for details), before discussing the aspects of it which are specific to hybrid networks. This summary is meant to provide a road map for the rest of the paper. In the junction tree algorithm, we firs…

140 | Propagation of probabilities, means, and variances in mixed graphical association models
- Lauritzen
- 1992
Citation Context: …al characteristics to g = −(1/2)μ′S⁻¹μ + log c(S), h = (−W′S⁻¹μ; S⁻¹μ), K = (W′S⁻¹W, −W′S⁻¹; −S⁻¹W, S⁻¹). This generalizes the result in [Lau92] to the case of vector-valued nodes. In the scalar case, S⁻¹ = 1/σ², W = w, and n = 1, so the above becomes g = −μ²/(2σ²) − (1/2) log(2πσ²), h = (μ/σ²)(−w; 1), K = (1/σ²)(ww′, −w; −w′, 1)…
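The scalar-case characteristics quoted above can be sanity-checked in a few lines. This sketch (scalar case only, with illustrative values) confirms that the canonical form exp(g + h′z − ½ z′Kz) over z = (x, y) reproduces the conditional density N(y; wx + μ, σ²):

```python
import math

def canonical_potential(x, y, w, mu, var):
    """Canonical (g, h, K) form of the linear Gaussian p(y|x) = N(y; w*x + mu, var)."""
    g = -mu * mu / (2.0 * var) - 0.5 * math.log(2.0 * math.pi * var)
    h = (-w * mu / var, mu / var)                         # h = (mu/var) * (-w; 1)
    K = ((w * w / var, -w / var), (-w / var, 1.0 / var))  # K = (1/var) * [[w^2, -w], [-w, 1]]
    z = (x, y)
    quad = sum(K[i][j] * z[i] * z[j] for i in range(2) for j in range(2))
    return math.exp(g + h[0] * x + h[1] * y - 0.5 * quad)

def gaussian_pdf(y, mean, var):
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# The canonical potential reproduces N(y; w*x + mu, var) exactly.
for x, y in ((0.0, 1.0), (2.0, -0.5)):
    assert abs(canonical_potential(x, y, w=1.5, mu=0.2, var=0.7)
               - gaussian_pdf(y, 1.5 * x + 0.2, 0.7)) < 1e-12
```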

133 |
An algebra of Bayesian belief universes for knowledge-based systems
- Jensen, Olsen, et al.
- 1990
Citation Context: …(top) or w = 4 (bottom), and the optimal ξ value. On the right, we plot Pr(R = 1|x) Pr(x), where Pr(x) = N(x; 0, 1). …the entries to zero. The technique of evidence shrinkage [HD94] and zero compression [JA90] can help reduce the inefficiency of manipulating such sparse potentials, but it would be better not to create them in the first place. • We need to have a way of converting the CPD of each node int…

81 | Optimal Junction Trees
- Jensen, Jensen
- 1994
Citation Context: …call them C. • Build an undirected weighted graph G_J whose nodes are the cliques C and where the weight of the edge from clique i to clique j is |C_i ∩ C_j|. Let T be a maximal spanning tree of G_J [JJ94a]. • Add a separator node S to each edge (i, j) of T such that S = C_i ∩ C_j. • Pick an arbitrary node in T as root. In Section 5, we discuss the changes that need to be made to the above steps in…

74 | A sufficiently fast algorithm for finding close to optimal junction trees
- Becker, Geiger
- 1996
Citation Context: …the heuristics discussed in [Kja90]. • Let all nodes be initially unmarked. For each node in order, mark it and join all its unmarked neighbors. This will result in a triangulated graph, G_T. (See [BG96] for more effective ways to triangulate a graph.) • Find the maximal cliques in G_T; call them C. • Build an undirected weighted graph G_J whose nodes are the cliques C and where the weight of the…

64 | Discretizing Continuous Attributes While Learning Bayesian Networks - Friedman, Goldszmidt - 1996 |

64 | Nonuniform dynamic discretization in hybrid networks - Kozlov, Koller - 1997 |

59 |
Triangulation of Graphs – Algorithms Giving Small Total State Space
- Kjaerulff
- 1990
Citation Context: …share a common child, and then drop the directionality of the arcs. This will result in an undirected graph, G_M. • Choose an elimination ordering, e.g., according to the heuristics discussed in [Kja90]. • Let all nodes be initially unmarked. For each node in order, mark it and join all its unmarked neighbors. This will result in a triangulated graph, G_T. (See [BG96] for more effective ways to…
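The mark-and-join triangulation step described in this excerpt can be sketched directly (plain adjacency-dict representation; the example graph and elimination ordering are made up):

```python
def triangulate(adj, order):
    """Triangulate an undirected graph (dict: node -> set of neighbors) given an
    elimination ordering: mark each node in turn and join its unmarked neighbors."""
    adj = {v: set(ns) for v, ns in adj.items()}  # copy so the input is untouched
    marked = set()
    for v in order:
        nbrs = [u for u in adj[v] if u not in marked]
        for i, a in enumerate(nbrs):             # join all unmarked neighbors
            for b in nbrs[i + 1:]:
                adj[a].add(b)
                adj[b].add(a)
        marked.add(v)
    return adj

# 4-cycle A-B-C-D: eliminating A first adds the chord B-D, triangulating the cycle.
cycle = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}
tri = triangulate(cycle, ["A", "B", "C", "D"])
```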

57 | Variational Methods for Inference and Estimation in Graphical Models
- Jaakkola
- 1997
Citation Context: …1-0941. A Derivation of the quadratic lower bound to the logistic function: In this section, we derive a quadratic lower bound on the sigmoid function σ(x) = (1 + e^{−x})^{−1}. For details, see [Jaa97]. Consider first 1 + e^x = e^{x/2}(e^{−x/2} + e^{x/2}) = e^{x/2 + f(x)}, where f(x) = log(e^{−x/2} + e^{x/2}) is symmetric, and a concave function of x². Now, fo…

57 | Variational probabilistic inference and the QMR-DT network
- Jaakkola, Jordan
- 1999
Citation Context: …lower bound on the likelihood [NH98]. However, the upper bound can be used in conjunction with the lower bound to filter out runs of MCMC which result in marginals which fall outside the bounds, as in [JJ99]. Note that we can also exploit the quadratic approximation to fit the parameters of the logistic node, w and b, using linear regression, instead of the slower IRLS (Iteratively Reweighted Least Squar…

54 | Why the logistic function? A tutorial discussion on probabilities and neural networks
- Jordan
- 1995
Citation Context: …softmax. Although probit has a nice interpretation as a noisy threshold unit (R = 1 iff y > Z), the logistic distribution has several advantages: • It can be well-motivated from a statistical viewpoint [Jor95]. • There is an efficient method for fitting its parameters, called the Iteratively Reweighted Least Squares (IRLS) algorithm [MN83, JJ94b] (a form of Newton-Raphson). • There is a good approximatio…

46 | A variational approach to Bayesian logistic regression models and their extensions, August 13 1996. [JMJ99
- Jaakkola, Jordan
Citation Context: …nodes with Gaussian parents in the next section. 7 The variational approximation: We can convert the logistic function to a canonical Gaussian potential by using the following variational lower bound [JJ96] (see Appendix A for the derivation): Pr(R = r|X = x) = σ(A) ≥ σ(ξ) exp[(A − ξ)/2 + λ(ξ)(A² − ξ²)], where A = (2r − 1)(w′x + b), λ(ξ) = (1/2 − σ(ξ))/(2ξ), and r ∈ {0, 1}. No…
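The role of A = (2r − 1)(w′x + b) in this bound is to fold both outcomes into one expression, since σ(−z) = 1 − σ(z). A two-line check (the weight values are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_r_given_x(r, x, w, b):
    """P(R=r|X=x) = sigmoid((2r-1)(w*x+b)): r=1 gives sigmoid(w*x+b), r=0 its complement."""
    return sigmoid((2 * r - 1) * (w * x + b))

# The two outcomes sum to one for any x.
for x in (-1.0, 0.0, 2.5):
    p1 = p_r_given_x(1, x, w=1.2, b=-0.4)
    p0 = p_r_given_x(0, x, w=1.2, b=-0.4)
    assert abs(p0 + p1 - 1.0) < 1e-12
```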

45 | Global conditioning for probabilistic inference in belief networks
- Shachter, Szolovits
- 1994
Citation Context: …on observed nodes. We present our results in the context of the junction tree algorithm, which is widely considered to be the most efficient and most general inference algorithm for graphical models [SAS94]. In particular, it allows us to compute the marginals on all N families --- a prerequisite for efficient parameter and structure learning --- in two passes over the graph, whereas other, query-driven…

21 | Implementation of continuous Bayesian networks usings sums of weighted Gaussians - Driver, Morrel - 1995 |

18 | Causal probabilistic networks with both discrete and continuous variables - Olesen - 1993 |

14 | On the impact of causal independence
- Rish, Dechter
- 1998
Citation Context: …such a transformation might lose some local conditional independence information, which might have been exploited to speed up inference. For some kinds of CPDs, such as noisy causal independence models [RD98], there are ways to expose the local structure graphically, which makes it easier to exploit in the junction tree framework, but we don't discuss this issue here. Finally, we discuss the case of discr…

13 | Symbolic probabilistic inference with both discrete and continuous variables - Chang, Fung - 1995 |

11 |
Triangulated graphs with marked vertices
- Leimer
- 1989
Citation Context: …does not have any paths between two discrete vertices passing through only continuous vertices (i.e., a "forbidden path" of the form D − C − D), then there is always at least one strong root [Lei89]; such graphs are called decomposable, marked graphs (marked just means there are two types of nodes). For example, consider Figure 1. Moralization adds an arc between S and C; the resulting graph is…

11 | Variational methods and the QMR-DT database
- Jaakkola, Jordan
- 1999
Citation Context: …lower bound on the likelihood [NH98]. However, the upper bound can be used in conjunction with the lower bound to filter out runs of MCMC which result in marginals which fall outside the bounds, as in [JJ97]. Note that we can exploit the quadratic approximation to fit the parameters of the logistic node, w and b, using linear regression, instead of the slower IRLS procedure, as noted in [Tip98]. Finding…

7 |
Probabilistic visualization of high-dimensional binary data
- Tipping
- 1998
Citation Context: …exploit the quadratic approximation to fit the parameters of the logistic node, w and b, using linear regression, instead of the slower IRLS (Iteratively Reweighted Least Squares) procedure, as noted in [Tip98]. Finding a good variational approximation for the softmax distribution is a problem we are currently working on. In this paper, we only consider the logistic distribution (i.e., binary nodes). Howeve…

5 | Inference Using Message Propogation and Topology Transformation in Vector Gaussian Continuous Networks - Alag - 1996 |

2 | Symbolic probabilistic inference with continuous variables - Chang, Fung - 1991 |
