Loopy belief propagation for approximate inference: An empirical study (1999)
Venue: Proceedings of Uncertainty in AI
Citations: 676 (15 self)
BibTeX
@INPROCEEDINGS{Murphy99loopybelief,
author = {Kevin P. Murphy and Yair Weiss and Michael I. Jordan},
title = {Loopy belief propagation for approximate inference: An empirical study},
booktitle = {Proceedings of Uncertainty in AI},
year = {1999},
pages = {467--475}
}
Abstract
Recently, researchers have demonstrated that "loopy belief propagation" (the use of Pearl's polytree algorithm in a Bayesian network with loops) can perform well in the context of error-correcting codes. The most dramatic instance of this is the near Shannon-limit performance of "Turbo Codes", codes whose decoding algorithm is equivalent to loopy belief propagation in a chain-structured Bayesian network. In this paper we ask: is there something special about the error-correcting code context, or does loopy propagation work as an approximate inference scheme in a more general setting? We compare the marginals computed using loopy propagation to the exact ones in four Bayesian network architectures, including two real-world networks: ALARM and QMR. We find that the loopy beliefs often converge and, when they do, they give a good approximation to the correct marginals. However, on the QMR network, the loopy beliefs oscillated and had no obvious relationship to the correct posteriors. We present some initial investigations into the cause of these oscillations, and show that some simple methods of preventing them lead to the wrong results.

Introduction

The task of calculating posterior marginals on nodes in an arbitrary Bayesian network is known to be NP-hard. In this paper we investigate the approximation performance of "loopy belief propagation". This refers to using the well-known Pearl polytree algorithm [12] on a Bayesian network with loops (undirected cycles). The algorithm is an exact inference algorithm for singly connected networks: the beliefs converge to the correct marginals in a number of iterations equal to the diameter of the graph. However, as Pearl noted, the same algorithm will not give the correct beliefs for multiply connected networks: "When loops are present, the network is no longer singly connected and local propagation schemes will invariably run into trouble."

Nevertheless, researchers working on error-correcting codes have conjectured that loopy propagation should work far more generally: "We believe there are general undiscovered theorems about the performance of belief propagation on loopy DAGs. These theorems, which may have nothing directly to do with coding or decoding, will show that in some sense belief propagation 'converges with high probability to a near-optimum value' of the desired belief on a class of loopy DAGs."

Progress in the analysis of loopy belief propagation has been made for the case of networks with a single loop:

• Unless all the conditional probabilities are deterministic, belief propagation will converge.
• There is an analytic expression relating the correct marginals to the loopy marginals. The approximation error is related to the convergence rate of the messages: the faster the convergence, the more exact the approximation.
• If the hidden nodes are binary, then thresholding the loopy beliefs is guaranteed to give the most probable assignment, even though the numerical value of the beliefs may be incorrect. This result only holds for nodes in the loop.

Related analyses exist for the max-product (or "belief revision") version of the algorithm (Weiss) and for the case of networks with multiple loops (Richardson).

To summarize, what is currently known about loopy propagation is that (1) it works very well in an error-correcting code setting and (2) there are conditions for a single-loop network under which it can be guaranteed to work well. In this paper we investigate loopy propagation empirically under a wider range of conditions. Is there something special about the error-correcting code setting, or does loopy propagation work as an approximation scheme for a wider range of networks?

In Pearl's algorithm, each node X with parents U = (U_1, ..., U_n) and children Y_1, ..., Y_m combines the messages it receives from its neighbors into a belief

BEL(x) = \alpha \, \lambda(x) \, \pi(x)    (1)

where

\lambda(x) = \prod_j \lambda_{Y_j}(x)

and

\pi(x) = \sum_{u} P(x \mid u) \prod_i \pi_X(u_i).

The message X passes to its parent U_i is given by

\lambda_X(u_i) = \alpha \sum_x \lambda(x) \sum_{u_k : k \neq i} P(x \mid u) \prod_{k \neq i} \pi_X(u_k)

and the message X sends to its child Y_j is given by

\pi_{Y_j}(x) = \alpha \, \pi(x) \prod_{k \neq j} \lambda_{Y_k}(x).
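As an illustration of how these update rules are applied at a single node, here is a minimal Python sketch (our own, not the authors' implementation; the array layout, function name, and normalization placement are assumptions) for a node with tabular CPTs and discrete parents and children. Evidence clamping and the parallel iteration schedule described below are omitted.

    import itertools
    import numpy as np

    def node_update(cpt, parent_msgs, child_msgs):
        # One application of the update rules at a single node X.
        # cpt[u1, ..., un, x] holds P(X = x | U = u); parent_msgs[i] is the
        # message pi_X(u_i) received from parent U_i; child_msgs[j] is the
        # message lambda_{Y_j}(x) received from child Y_j.
        n_states = cpt.shape[-1]
        parent_ranges = [range(len(m)) for m in parent_msgs]

        # lambda(x) = prod_j lambda_{Y_j}(x)
        lam = np.ones(n_states)
        for m in child_msgs:
            lam = lam * m

        # pi(x) = sum_u P(x | u) prod_i pi_X(u_i)
        pi = np.zeros(n_states)
        for u in itertools.product(*parent_ranges):
            w = np.prod([parent_msgs[i][ui] for i, ui in enumerate(u)])
            pi += w * cpt[u]

        # BEL(x) = alpha * lambda(x) * pi(x), with alpha a normalizing constant.
        belief = lam * pi
        belief = belief / belief.sum()

        # lambda_X(u_i): message to parent U_i, normalized at each iteration.
        msgs_to_parents = []
        for i, pm in enumerate(parent_msgs):
            out = np.zeros(len(pm))
            for u in itertools.product(*parent_ranges):
                w = np.prod([parent_msgs[k][uk] for k, uk in enumerate(u) if k != i])
                out[u[i]] += w * np.dot(cpt[u], lam)
            msgs_to_parents.append(out / out.sum())

        # pi_{Y_j}(x): message to child Y_j, normalized at each iteration.
        msgs_to_children = []
        for j in range(len(child_msgs)):
            out = pi.copy()
            for k, m in enumerate(child_msgs):
                if k != j:
                    out = out * m
            msgs_to_children.append(out / out.sum())

        return belief, msgs_to_parents, msgs_to_children

In the experiments described below, every node applies such an update in parallel at each iteration, and beliefs from successive iterations are compared against a convergence threshold.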
For noisy-or links between parents and children, there exists an analytic expression for \pi(x) and \lambda_X(u_i) that avoids the exhaustive enumeration over parent configurations. We made a slight modification to the update rules in that we normalized both the \lambda and \pi messages at each iteration; as Pearl notes, such normalization does not change the final beliefs. Nodes were updated in parallel: at each iteration all nodes calculated their outgoing messages based on the incoming messages of their neighbors from the previous iteration. The messages were said to converge if none of the beliefs in successive iterations changed by more than a small threshold (10^-4). All messages were initialized to a vector of ones; random initialization yielded similar results, since the initial conditions rapidly get "washed out". For comparison, we also implemented likelihood weighting.

The PYRAMID network

All nodes were binary and the conditional probabilities were represented by tables; entries in the conditional probability tables (CPTs) were chosen uniformly in the range (0, 1].

The toyQMR network

All nodes were binary and the conditional probabilities of the leaves were represented by a noisy-or:

P(\mathrm{Child} = 0 \mid \mathrm{Parents}) = e^{-\theta_0 - \sum_i \theta_i \mathrm{Parent}_i}

where \theta_0 represents the "leak" term.

Figure 2: The structure of a toyQMR network. This is a bipartite structure where the conditional distributions of the leaves are noisy-or's. The network shown represents one sample from randomly generated structures where the parents of each symptom were a random subset of the diseases.

The QMR-DT network

The QMR-DT is a bipartite network whose structure is the same as that shown in figure 2, but the size is much larger. There are approximately 600 diseases and approximately 4000 finding nodes, with a number of observed findings that varies per case. Due to the form of the noisy-or CPTs, the complexity of inference is exponential in the number of positive findings.

Results

Initial experiments

The experimental protocol for the PYRAMID network was as follows. For each experimental run, we first generated random CPTs. We then sampled from the joint distribution defined by the network and clamped the observed nodes (all nodes in the bottom layer) to their sampled values. Given a structure and observations, we then ran three inference algorithms: junction tree, loopy belief propagation, and sampling. We found that loopy belief propagation always converged in this case, with the average number of iterations equal to 10.2.

The experimental protocol for the toyQMR network was similar to that of the PYRAMID network, except that we randomized over structure as well. Again we found that loopy belief propagation always converged, with the average number of iterations equal to 8.65.

The protocol for the ALARM network experiments differed from the previous two in that the structure and parameters were fixed; only the observed evidence differed between experimental runs. We assumed that all leaf nodes were observed and calculated the posterior marginals of all other nodes. Again we found that loopy belief propagation always converged, with the average number of iterations equal to 14.55.

The results presented up until now show that loopy propagation performs well for a variety of architectures involving multiple loops. We now present results for the QMR-DT network, which are not as favorable. In the QMR-DT network there was no randomization.
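To make the noisy-or parameterization of the toyQMR and QMR-DT leaves concrete, the following small Python sketch (parameter values are hypothetical, chosen only for illustration) evaluates the conditional probability of a finding given its parent diseases:

    import numpy as np

    def noisy_or_p_negative(theta0, thetas, parent_states):
        # P(Child = 0 | Parents) = exp(-theta0 - sum_i theta_i * Parent_i),
        # where theta0 is the "leak" term and parent_states[i] is 0 or 1.
        return np.exp(-theta0 - np.dot(thetas, parent_states))

    # Hypothetical finding with two parent diseases, the first of which is present.
    p_neg = noisy_or_p_negative(theta0=0.01,
                                thetas=np.array([2.0, 0.5]),
                                parent_states=np.array([1, 0]))
    p_pos = 1.0 - p_neg  # probability that the finding is positive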
We used the fixed structure and calculated posteriors for the four cases for which the posteriors have been calculated exactly by Heckerman.

What causes convergence versus oscillation?

What our initial experiments show is that loopy propagation does a good job of approximating the correct posteriors if it converges. Unfortunately, on the most challenging case, the QMR-DT network, the algorithm did not converge. We wanted to see if this oscillatory behavior in the QMR-DT case was related to the size of the network: does loopy propagation tend to converge less for large networks than for small networks?

To investigate this question, we tried to cause oscillation in the toyQMR network. We first asked what, besides the size, is different between toyQMR and real QMR. An obvious difference is in the parameter values: while the CPTs for toyQMR are random, the real QMR parameters are not. In particular, the prior probability of a disease node being on is extremely low in the real QMR (typically of the order of 10^-3). Would low priors cause oscillations in the toyQMR case? To answer this question, we repeated the experiments reported in the previous section, but rather than having the prior probability of each node be randomly selected in the range [0, 1], we selected the prior uniformly in the range [0, U] and varied U. Unlike the previous simulations, we did not set the observed nodes by sampling from the joint; for low priors all the findings would be negative and inference would be trivial. Rather, each finding was independently set to positive or negative.

If indeed small priors are responsible for the oscillation, then we would expect the real QMR network to converge if the priors were sampled randomly in the range [0, 1]. Small priors are not the only thing that causes oscillation; small weights can, too.

[Figure: the exact marginals are represented by circles; the ends of the "error bars" represent the loopy marginals at the last two iterations. Only the diseases with non-negligible posterior probability are plotted.]

[Figure: convergence of loopy belief propagation as a function of the range of the prior.]

Another possible explanation for the oscillations is that the evidence in the QMR cases is untypical, i.e., has very low probability under the model. To test this hypothesis, we reparameterized the pyramid network as follows: we set the prior probability of the "1" state of the root nodes to 0.9, and we utilized the noisy-OR model for the other nodes with a small (0.1) inhibition probability (apart from the leak term, which we inhibited with probability 0.9). This parameterization has the effect of propagating 1's from the top layer to the bottom. Thus the true marginal at each leaf is approximately (0.1, 0.9), i.e., the leaf is 1 with high probability. We then generated untypical evidence at the leaves by sampling from the uniform distribution, (0.5, 0.5), or from the skewed distribution (0.9, 0.1). We found that loopy propagation still converged (more precisely, with a convergence threshold of 10^-4, 98 out of 100 cases converged; with a threshold of 10^-3, all 100 cases converged) and that, as before, the marginals to which it converged were highly correlated with the correct marginals. Thus there must be some other explanation, besides untypicality of the evidence, for the oscillations observed in QMR.

Can we fix oscillations easily?

When loopy propagation oscillates between two steady states, it seems reasonable to try to find a way to combine the two values. The simplest thing to do is to average them. Unfortunately, this gave very poor results, since the correct posteriors do not usually lie at the midpoint of the interval.
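The following Python sketch (our own illustration, not the paper's code) shows the averaging heuristic on a single node's belief trajectory: if the belief settles into a two-cycle, the last two values are averaged. As noted above, this midpoint is usually not close to the correct posterior.

    import numpy as np

    def two_cycle_average(belief_history, tol=1e-4):
        # belief_history is a list of belief vectors from successive iterations.
        # If the trajectory oscillates between two steady values, return their
        # midpoint; otherwise return the final belief.
        last, prev, prev2 = belief_history[-1], belief_history[-2], belief_history[-3]
        oscillating = (np.abs(last - prev).max() > tol and
                       np.abs(last - prev2).max() <= tol)
        if oscillating:
            return 0.5 * (last + prev)
        return last

    # Toy example: a binary belief flipping between two values.
    history = [np.array([0.9, 0.1]), np.array([0.2, 0.8])] * 5
    print(two_cycle_average(history))  # -> [0.55 0.45]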
We also tried to avoid oscillations by using "momentum": replacing the messages that were sent at time t with a weighted average of the messages at times t and t-1. That is, each new message \lambda^{(t)} computed by the update rules was replaced by \mu \lambda^{(t-1)} + (1 - \mu) \lambda^{(t)}, and similarly for the \pi^{(t)} messages, where 0 \le \mu \le 1 is the momentum term. It is easy to show that if the modified system of equations converges to a fixed point F, then F is also a fixed point of the original system (since if \lambda^{(t)} = \lambda^{(t-1)}, the damped update leaves the message unchanged).

In the experiments for which loopy propagation converged (PYRAMID, toyQMR and ALARM), we found that adding the momentum term did not change the results: the beliefs that resulted were the same beliefs found without momentum. In the experiments which did not converge (toyQMR with small priors and real QMR), we found that momentum significantly reduced the chance of oscillation. However, in several cases the beliefs to which the algorithm converged were quite inaccurate.

Discussion

The experimental results presented here suggest that loopy propagation can yield accurate posterior marginals in a more general setting than that of error-correcting coding: the PYRAMID, toyQMR and ALARM networks are quite different from the error-correcting coding graphs, yet the loopy beliefs show high correlation with the correct marginals.

In error-correcting codes the posterior is typically highly peaked, and one might think that this feature is necessary for the good performance of loopy propagation. Our results suggest that this is not the case: in none of our simulations were the posteriors highly peaked around a single joint configuration. If the probability mass were concentrated at a single point, the marginal probabilities would all be near zero or one; this is clearly not the case, as can be seen in the figures.

It might be expected that loopy propagation would only work well for graphs with large loops. However, our results, and previous results on turbo codes, show that loopy propagation can also work well for graphs with many small loops.

At the same time, our experimental results suggest a cautionary note about loopy propagation, showing that the marginals may exhibit oscillations that have very little correlation with the correct marginals. We presented some preliminary results investigating the cause of the oscillations and showed that it is not simply a matter of the size of the network or the number of parents. Rather, the same structure with different parameter values may oscillate or exhibit stable behavior.

For all our simulations, we found that when loopy propagation converges, it gives a surprisingly good approximation to the correct marginals. Since the distinction between convergence and oscillation is easy to make after a small number of iterations, this may suggest a way of checking whether loopy propagation is appropriate for a given problem.

Acknowledgements

We thank Tommi Jaakkola, David Heckerman and David MacKay for useful discussions. We also thank Randy Miller and the University of Pittsburgh for the use of the QMR-DT database. Supported by MURI ARO DAAH04-96-1-0341.