## Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables (1997)


Venue: Machine Learning

Citations: 183 (12 self)

### BibTeX

@ARTICLE{Chickering97efficientapproximations,
  author  = {David Maxwell Chickering and David Heckerman},
  title   = {Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables},
  journal = {Machine Learning},
  volume  = {29},
  pages   = {181--244},
  year    = {1997}
}


### Abstract

We discuss Bayesian methods for learning Bayesian networks when data sets are incomplete. In particular, we examine asymptotic approximations for the marginal likelihood of incomplete data given a Bayesian network. We consider the Laplace approximation and the less accurate but more efficient BIC/MDL approximation. We also consider approximations proposed by Draper (1993) and by Cheeseman and Stutz (1995). These approximations are as efficient as BIC/MDL, but their accuracy has not been studied in any depth. We compare the accuracy of these approximations under the assumption that the Laplace approximation is the most accurate. In experiments using synthetic data generated from discrete naive-Bayes models having a hidden root node, we find that (1) the BIC/MDL measure is the least accurate, having a bias in favor of simple models, and (2) the Draper and Cheeseman-Stutz (CS) measures are the most accurate.
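The BIC/MDL measure compared in the abstract penalizes the maximized log-likelihood by (d/2) log N, where d is the number of free parameters and N the sample size. A minimal sketch; the function name and the example numbers are illustrative, not taken from the paper:

```python
from math import log

def bic_score(log_likelihood: float, d: int, n: int) -> float:
    """BIC/MDL approximation to the log marginal likelihood:
    log p(D|S) ~ log p(D | theta_hat, S) - (d/2) log N,
    where d is the number of free parameters and N the sample size."""
    return log_likelihood - 0.5 * d * log(n)

# Hypothetical example: a naive-Bayes model with a 2-state hidden root over
# 4 binary observed variables has d = (2-1) + 2*4*(2-1) = 9 free parameters.
score = bic_score(log_likelihood=-5500.0, d=9, n=1000)
```

Because the penalty grows with d, BIC systematically favors smaller models, which is consistent with the bias the abstract reports.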

### Citations

9054 | Maximum likelihood from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977

Citation Context: ...995) discuss how to compute derivatives of the likelihood for a Bayesian network with discrete variables. A more efficient technique for identifying a local MAP or ML value of θ_s is the EM algorithm (Dempster, Laird, & Rubin, 1977). Applied to Bayesian networks for discrete variables, the EM algorithm works as follows. First, we assign values to θ_s somehow (e.g., at random). Next, we compute the expected sufficient ...
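The EM scheme sketched in this excerpt (assign parameters, compute expected sufficient statistics, re-estimate) can be illustrated for the paper's naive-Bayes setting with a hidden root node. This is a sketch under assumptions: the function name, random initialization, and smoothing constants are mine, not the authors' implementation.

```python
import numpy as np

def em_naive_bayes(data, k=2, iters=50, seed=0):
    """EM for a naive-Bayes model with a k-state hidden root H and binary
    observed variables X_1..X_m. data: (N, m) array of 0/1 values.
    Returns (pi, theta) with pi[h] = P(H=h), theta[h, i] = P(X_i=1 | H=h)."""
    rng = np.random.default_rng(seed)
    n, m = data.shape
    pi = np.full(k, 1.0 / k)
    theta = rng.uniform(0.25, 0.75, size=(k, m))  # random initialization
    for _ in range(iters):
        # E-step: responsibilities r[j, h] proportional to
        # pi[h] * prod_i P(x_ji | H=h), computed in log space for stability
        log_r = (np.log(pi) + data @ np.log(theta).T
                 + (1 - data) @ np.log(1 - theta).T)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate from the expected sufficient statistics,
        # with a light Dirichlet-style smoothing prior on theta
        nh = r.sum(axis=0)
        pi = nh / n
        theta = (r.T @ data + 1.0) / (nh[:, None] + 2.0)
    return pi, theta
```

Each iteration is guaranteed not to decrease the (smoothed) likelihood, so the procedure converges to a local MAP/ML value of the parameters, as the excerpt describes.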

4055 | Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images - Geman, Geman - 1984

3085 | UCI repository of machine learning databases - Blake, Merz - 1998

2771 | Estimating the dimension of a model - Schwarz - 1978

Citation Context: ...arge models. In this paper, we examine other large-sample approximations that are more efficient than the Laplace approximation. These approximations include the Bayesian Information Criterion (BIC) (Schwarz, 1978), which is equivalent to Rissanen's (1987) Minimum Description Length (MDL) measure, diagonal and block-diagonal approximations for the Hessian term in the Laplace approximation (Becker and LeCun, 198...

1400 | Statistical Decision Theory and Bayesian Analysis (2nd ed.) - Berger - 1980

1176 | Bayes factors - Kass, Raftery - 1995

1171 | Bayesian Theory - Bernardo, Smith - 1994

1140 | A Bayesian Method for the Induction of Probabilistic Networks from Data - Cooper, Herskovits - 1992

Citation Context: ...cture prior as the Bayesian Dirichlet (BD) scoring function. When the random sample D is incomplete, the exact computation of the marginal likelihood is intractable for real-world problems (e.g., see Cooper & Herskovits, 1992). Thus, approximations are required. In this paper, we consider asymptotic approximations. One well-known asymptotic approximation is the Laplace or Gaussian approximation (Kass et al., 1988; Kass & ...

953 | Learning Bayesian networks: The combination of knowledge and statistical data - Heckerman, Geiger, et al. - 1995

Citation Context: ...lihoods. Specifically, we computed ∆m ≡ log p(D|S^h_hidden) − log p(D|S^h_nohide). We used the Laplace approximation to compute the first term, and the exact expression for marginal likelihood (e.g., Heckerman et al., 1995) to compute the second term. Repeating this experiment five times, we obtained ∆m = 26 ± 33, indicating that the hidden-variable model better predicted the data. In additional experiments, we found t...

639 | Markov Chain Monte Carlo in practice - Gilks, Richardson, et al. - 1996

594 | Probabilistic inference using Markov chain Monte Carlo methods - Neal - 1993

Citation Context: ...own closed-form methods cannot be used to determine marginal likelihood. Approximations for computing marginal likelihood include Monte Carlo approaches such as Gibbs sampling and importance sampling (Neal, 1993; Chib, 1995; Raftery, 1996) and large-sample approximations (Kass et al., 1988; Kass and Raftery, 1995). As mentioned in the introduction, Monte Carlo methods are accurate but typically inefficient, ...

582 | Bayesian interpolation - MacKay - 1991

570 | Theory of Probability - Jeffreys - 1961

535 | Causation, Prediction, and Search - Spirtes, Glymour, et al. - 1993

515 | Bayesian Classification (AutoClass): Theory and Results - Cheeseman, Stutz - 1996

429 | A Practical Bayesian Framework for Backpropagation Networks - MacKay - 1992

392 | Marginal Likelihood from the Gibbs Output - Chib - 1995

Citation Context: ...orm methods cannot be used to determine marginal likelihood. Approximations for computing marginal likelihood include Monte Carlo approaches such as Gibbs sampling and importance sampling (Neal, 1993; Chib, 1995; Raftery, 1996) and large-sample approximations (Kass et al., 1988; Kass and Raftery, 1995). As mentioned in the introduction, Monte Carlo methods are accurate but typically inefficient, whereas larg...

375 | Inference and missing data - Rubin - 1976

330 | A tutorial on learning Bayesian networks - Heckerman - 1995

323 | Bayesian Model Selection in Social Research - Raftery - 1995

Citation Context: ...y more efficient than Monte Carlo techniques. One large-sample approximation, known as a Laplace approximation, is widely used by Bayesian statisticians (Haughton, 1988; Kass et al., 1988; Kass and Raftery, 1995). Although this approximation is efficient relative to Monte Carlo methods, it has a computational complexity of O(d²N) (or greater), where d is the dimension of the model and N is the sample size o...
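The Laplace approximation discussed here expands log p(D|θ)p(θ) to second order around the MAP value and integrates the resulting Gaussian; the Hessian term is what drives the O(d²N) cost. A sketch on a toy Beta-Bernoulli model (d = 1), where the exact log marginal likelihood is available for comparison; the functions are illustrative, not from the paper:

```python
from math import lgamma, log, pi

def exact_log_marginal(h, t, a=1.0, b=1.0):
    """Exact log marginal likelihood of h heads and t tails under a
    Beta(a, b) prior on the coin bias (a toy one-parameter 'network')."""
    return (lgamma(a + b) - lgamma(a) - lgamma(b)
            + lgamma(a + h) + lgamma(b + t) - lgamma(a + b + h + t))

def laplace_log_marginal(h, t, a=1.0, b=1.0):
    """Laplace approximation: evaluate the log joint at the MAP value and
    add the Gaussian volume term (d/2) log(2*pi) - (1/2) log |A|, d = 1."""
    theta = (h + a - 1) / (h + t + a + b - 2)  # MAP value
    log_joint = (h * log(theta) + t * log(1 - theta)
                 + lgamma(a + b) - lgamma(a) - lgamma(b)
                 + (a - 1) * log(theta) + (b - 1) * log(1 - theta))
    # A = negative second derivative of the log joint at the MAP
    hess = (h + a - 1) / theta**2 + (t + b - 1) / (1 - theta)**2
    return log_joint + 0.5 * log(2 * pi) - 0.5 * log(hess)
```

With 60 heads and 40 tails the two values agree to about 0.01 nats, illustrating the O(1/N) accuracy of the Laplace approximation; in the paper's networks the analogous Hessian is a d × d matrix, which is where the expense arises.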

293 | Model selection and accounting for model uncertainty in graphical models using Occam's window - Madigan, Raftery - 1994

Citation Context: ...table. Consequently, approximate techniques for computing the marginal likelihood are used. These techniques include Monte Carlo approaches such as Gibbs sampling and importance sampling (Neal, 1993; Madigan & Raftery, 1994), sequential updating methods (Spiegelhalter & Lauritzen, 1990; Cowell, Dawid, & Sebastiani, 1995), and asymptotic approximations (Kass, Tierney, & Kadane, 1988; Kass & Raftery, 1995; Draper, 1993). ...

258 | Bayesian graphical models for discrete data - Madigan, York - 1995

253 | Operations for learning with graphical models - Buntine - 1994

Citation Context: ...ly and in closed form. Many researchers who have addressed Bayesian-network learning have adopted at least some of these assumptions (e.g., Cooper and Herskovits, 1992; Spiegelhalter et al., 1993; Buntine, 1994; Heckerman et al., 1995). The assumptions are as follows. 1. Every variable is discrete, having a finite number of states. We use x_i^k and pa_i^j to denote the kth possible state of X_i and the jth ...

206 | Sequential updating of conditional probabilities on directed graphical structures - Spiegelhalter, Lauritzen - 1990

Citation Context: ...ng the marginal likelihood are used. These techniques include Monte Carlo approaches such as Gibbs sampling and importance sampling (Neal, 1993; Madigan & Raftery, 1994), sequential updating methods (Spiegelhalter & Lauritzen, 1990; Cowell, Dawid, & Sebastiani, 1995), and asymptotic approximations (Kass, Tierney, & Kadane, 1988; Kass & Raftery, 1995; Draper, 1993). In this paper, we examine asymptotic approximations, comparing ...

199 | Bayesian analysis in expert systems - Spiegelhalter, Dawid, et al. - 1993

Citation Context: ...ection can be done efficiently and in closed form. Many researchers who have addressed Bayesian-network learning have adopted at least some of these assumptions (e.g., Cooper and Herskovits, 1992; Spiegelhalter et al., 1993; Buntine, 1994; Heckerman et al., 1995). The assumptions are as follows. 1. Every variable is discrete, having a finite number of states. We use x_i^k and pa_i^j to denote the kth possible state of X...

174 | Graphical Models for Association Between Variables, Some of Which Are Qualitative and Some Quantitative - Lauritzen, Wermuth - 1989

153 | Bayesian updating in recursive graphical models by local computation - Jensen, Lauritzen, et al. - 1990

149 | A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion - Kass, Wasserman - 1995

Citation Context: ...increases. Thus, if we use BIC to select one of a set of models, we will select a model whose posterior probability is a maximum when N becomes sufficiently large. (Under some conditions, the BIC is accurate to O(N^{-1/2}) (Kass & Wasserman, 1996); these conditions do not apply to the models we examine in our experiments.) We say that BIC is asymptotically correct. By this definition, the Laplace approximation is...

148 | Assessment and propagation of model uncertainty - Draper - 1995

Citation Context: ...& Raftery, 1994), sequential updating methods (Spiegelhalter & Lauritzen, 1990; Cowell, Dawid, & Sebastiani, 1995), and asymptotic approximations (Kass, Tierney, & Kadane, 1988; Kass & Raftery, 1995; Draper, 1993). In this paper, we examine asymptotic approximations, comparing their accuracy and efficiency. We consider the Laplace approximation (Kass et al., 1988; Kass & Raftery, 1995; Azevedo-Filho & Shachte...

129 | Assessment and propagation of model uncertainty (with discussion) - Draper - 1995

126 | Mean field theory for sigmoid belief networks - Saul, Jaakkola, et al. - 1996

116 | Learning Gaussian networks - Geiger, Heckerman - 1994

115 | Learning By Being Told and Learning From Examples: An Experimental Comparison of the Two Methods of Knowledge Acquisition in the Context of Developing an Expert System for Soybean Disease Diagnosis - Michalski, Chilausky - 1980

110 | Approximate Bayes factors and accounting for model uncertainty in generalised linear models - Raftery - 1996

Citation Context: ...of these approximations. For example, both theoretical and empirical studies have shown that the Laplace approximation is more accurate than is the BIC/MDL approximation (see, e.g., Draper, 1993, and Raftery, 1994). Also, Becker and LeCun (1989) and MacKay (1992b) report successful and unsuccessful applications of the diagonal approximation, respectively, in the context of parameter learning for probabilistic ...

103 | Improving the convergence of back-propagation learning with second order methods - Becker, LeCun - 1988

81 | Local learning in probabilistic networks with hidden variables - Russell, Binder, et al. - 1995

79 | Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm - Meng, Rubin - 1991

74 | Kinds of probability - Good - 1959

69 | Present position and potential developments: Some personal views, statistical theory, the prequential approach - Dawid - 1984

64 | PROTOS: an exemplar-based learning apprentice - Bareiss, Porter - 1987

64 | On the choice of a model to fit data from an exponential family - Haughton - 1988

Citation Context: ...under certain assumptions, and are typically more efficient than Monte Carlo techniques. One large-sample approximation, known as a Laplace approximation, is widely used by Bayesian statisticians (Haughton, 1988; Kass et al., 1988; Kass and Raftery, 1995). Although this approximation is efficient relative to Monte Carlo methods, it has a computational complexity of O(d²N) (or greater), where d is the dimens...

51 | Hypothesis testing and model selection - Raftery - 1996

Citation Context: ...cannot be used to determine marginal likelihood. Approximations for computing marginal likelihood include Monte Carlo approaches such as Gibbs sampling and importance sampling (Neal, 1993; Chib, 1995; Raftery, 1996) and large-sample approximations (Kass et al., 1988; Kass and Raftery, 1995). As mentioned in the introduction, Monte Carlo methods are accurate but typically inefficient, whereas large-sample method...

49 | Asymptotic model selection for directed networks with hidden variables - Geiger, Heckerman, et al. - 1996

41 | Stochastic Complexity (with discussion) - Rissanen - 1987

34 | Bayesian mixture modeling by Monte Carlo simulation (Tech - Neal

Citation Context: ...ters once a model or set of models has been selected. If computation time is not an issue and we are concerned primarily with prediction, then a Monte Carlo average over parameters is probably best (Neal, 1991). Nonetheless, we sometimes need a fast model for prediction, or we may want point values for the parameters to facilitate an understanding of the domain. What is best in these circumstances is an ope...

31 | Optimal discriminant plane for a small number of samples and design method of classifier on the plane - Hong, Yang - 1991

31 | Computing Second Derivatives in Feed-Forward Networks: A Review - Buntine, Weigend - 1994

Citation Context: ...ently and in closed form. Many researchers who have addressed Bayesian-network learning have adopted at least some of these assumptions (e.g., Cooper and Herskovits, 1992; Spiegelhalter et al., 1993; Buntine, 1994; Heckerman et al., 1995). The assumptions are as follows. 1. Every variable is discrete, having a finite number of states. We use x_i^k and pa_i^j to denote the kth possible state of X_i and the jth ...

28 | Choice of basis for Laplace approximation - MacKay - 1998

22 | A guide to the literature on learning graphical models - Buntine - 1996

20 | A comparison of sequential learning methods for incomplete data - Cowell, Dawid, et al. - 1996

Citation Context: ...ed. These techniques include Monte Carlo approaches such as Gibbs sampling and importance sampling (Neal, 1993; Madigan & Raftery, 1994), sequential updating methods (Spiegelhalter & Lauritzen, 1990; Cowell, Dawid, & Sebastiani, 1995), and asymptotic approximations (Kass, Tierney, & Kadane, 1988; Kass & Raftery, 1995; Draper, 1993). In this paper, we examine asymptotic approximations, comparing their accuracy and efficiency. We c...