## Efficient Approximations for the Marginal Likelihood of Bayesian Networks with Hidden Variables (1996)


Venue: Machine Learning

Citations: 184 (12 self)

### BibTeX

@ARTICLE{Chickering96efficientapproximations,
  author = {David Maxwell Chickering and David Heckerman},
  title = {Efficient Approximations for the Marginal Likelihood of Bayesian Networks with Hidden Variables},
  journal = {Machine Learning},
  year = {1996},
  pages = {181--212}
}


### Abstract

We discuss Bayesian methods for model averaging and model selection among Bayesian-network models with hidden variables. In particular, we examine large-sample approximations for the marginal likelihood of naive-Bayes models in which the root node is hidden. Such models are useful for clustering or unsupervised learning. We consider a Laplace approximation and the less accurate but more computationally efficient approximation known as the Bayesian Information Criterion (BIC), which is equivalent to Rissanen's (1987) Minimum Description Length (MDL). Also, we consider approximations that ignore some off-diagonal elements of the observed information matrix and an approximation proposed by Cheeseman and Stutz (1995). We evaluate the accuracy of these approximations using a Monte-Carlo gold standard. In experiments with artificial and real examples, we find that (1) none of the approximations are accurate when used for model averaging, and (2) all of the approximations, with the exception of BI...
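As a concrete illustration of the BIC/MDL score the abstract refers to, here is a minimal sketch (the function names are ours, not the paper's): the maximized log likelihood penalized by (d/2) log N. The dimension helper counts raw parameters of a naive-Bayes model with a hidden C-state root and binary leaves; note that companion work (Geiger, Heckerman, et al., 1996, cited below) shows the effective dimension of hidden-variable models can be smaller than this count.

```python
import math

def bic_score(max_log_likelihood, num_free_params, sample_size):
    """Large-sample BIC/MDL approximation to the log marginal likelihood:
    log p(D|S) ~ log p(D|theta_hat, S) - (d/2) log N."""
    return max_log_likelihood - 0.5 * num_free_params * math.log(sample_size)

def naive_bayes_dimension(num_classes, num_features):
    """Raw parameter count of a naive-Bayes model with a hidden root:
    (C - 1) mixture weights plus one Bernoulli parameter per class/feature."""
    return (num_classes - 1) + num_features * num_classes
```

For example, a two-class model over three binary features has 1 + 6 = 7 raw parameters, so its BIC score subtracts 3.5 log N from the maximized log likelihood.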

### Citations

9193 | Maximum likelihood from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977
Citation Context: ...995) discuss how to compute derivatives of the likelihood for a Bayesian network with discrete variables. A more efficient technique for identifying a local MAP or ML value of θ_s is the EM algorithm (Dempster, Laird, & Rubin, 1977). Applied to Bayesian networks for discrete variables, the EM algorithm works as follows. First, we assign values to θ_s somehow (e.g., at random). Next, we compute the expected sufficient ...
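The EM procedure this context describes can be illustrated for the paper's naive-Bayes clustering setting. The following is a minimal sketch, not the authors' code: a K-component mixture of independent binary features (a naive-Bayes model whose root is hidden), with illustrative names and a small pseudo-count in the M-step that the paper does not prescribe.

```python
import math
import random

def em_naive_bayes(data, K, iters=50, seed=0):
    """data: list of binary feature vectors. Returns (weights, thetas)."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    # Assign values to the parameters somehow (here, at random).
    weights = [1.0 / K] * K
    thetas = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(K)]
    for _ in range(iters):
        # E-step: responsibilities r[i][k] proportional to
        # p(class k) * p(x_i | class k), computed in log space.
        resp = []
        for x in data:
            logp = []
            for k in range(K):
                lp = math.log(weights[k])
                for j, xj in enumerate(x):
                    p = thetas[k][j]
                    lp += math.log(p if xj else 1.0 - p)
                logp.append(lp)
            m = max(logp)
            unnorm = [math.exp(lp - m) for lp in logp]
            z = sum(unnorm)
            resp.append([u / z for u in unnorm])
        # M-step: re-estimate parameters from expected sufficient statistics
        # (a small pseudo-count keeps parameters away from 0 and 1).
        for k in range(K):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            for j in range(d):
                s = sum(r[k] * x[j] for r, x in zip(resp, data))
                thetas[k][j] = (s + 0.5) / (nk + 1.0)
    return weights, thetas
```

A usage sketch: `em_naive_bayes([[1, 1, 0]] * 10 + [[0, 0, 1]] * 10, K=2)` returns mixture weights summing to one and per-class feature probabilities strictly inside (0, 1).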

4100 | Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images - Geman, Geman - 1984 |

3110 | UCI repository of machine learning databases - Blake, Keogh, et al. - 1998 |

2831 | Estimating the dimension of a model - Schwarz - 1978
Citation Context: ...arge models. In this paper, we examine other large-sample approximations that are more efficient than the Laplace approximation. These approximations include the Bayesian Information Criterion (BIC) (Schwarz, 1978), which is equivalent to Rissanen's (1987) Minimum Description Length (MDL) measure, diagonal and block-diagonal approximations for the Hessian term in the Laplace approximation (Becker and LeCun, 198...

1429 | Statistical Decision Theory and Bayesian Analysis - Berger - 1985 |

1205 | Bayes factors - Kass, Raftery - 1995 |

1194 | Bayesian Theory - Bernardo, Smith - 2000 |

1148 | A Bayesian method for the induction of probabilistic networks from data - Cooper, Herskovits - 1992
Citation Context: ...cture prior as the Bayesian Dirichlet (BD) scoring function. When the random sample D is incomplete, the exact computation of the marginal likelihood is intractable for real-world problems (e.g., see Cooper & Herskovits, 1992). Thus, approximations are required. In this paper, we consider asymptotic approximations. One well-known asymptotic approximation is the Laplace or Gaussian approximation (Kass et al., 1988; Kass & ...

962 | Learning Bayesian networks: The combination of knowledge and statistical data - Heckerman, Geiger, et al. - 1995
Citation Context: ...lihoods. Specifically, we computed ∆m ≡ log p(D|S^h_hidden) − log p(D|S^h_nohide). We used the Laplace approximation to compute the first term, and the exact expression for marginal likelihood (e.g., Heckerman et al., 1995) to compute the second term. Repeating this experiment five times, we obtained ∆m = 26 ± 33, indicating that the hidden-variable model better predicted the data. In additional experiments, we found t...

650 | Markov Chain Monte Carlo in practice - Gilks, Richardson, et al. - 1996 |

597 | Probabilistic inference using Markov chain Monte Carlo methods - Neal - 1993
Citation Context: ...own closed-form methods cannot be used to determine marginal likelihood. Approximations for computing marginal likelihood include Monte-Carlo approaches such as Gibbs sampling and importance sampling (Neal, 1993; Chib, 1995; Raftery, 1996) and large-sample approximations (Kass et al., 1988; Kass and Raftery, 1995). As mentioned in the introduction, Monte-Carlo methods are accurate but typically inefficient, ...

588 | Bayesian interpolation - MacKay |

579 | Theory of Probability - Jeffreys - 1961 |

542 | Causation, Prediction and Search - Spirtes, Glymour, et al. - 2000 |

519 | Bayesian classification (AutoClass): theory and results - Cheeseman, Stutz - 1996 |

432 | A practical Bayesian framework for backpropagation networks - MacKay - 1992 |

399 | Marginal likelihood from the Gibbs output - Chib - 1995
Citation Context: ...orm methods cannot be used to determine marginal likelihood. Approximations for computing marginal likelihood include Monte-Carlo approaches such as Gibbs sampling and importance sampling (Neal, 1993; Chib, 1995; Raftery, 1996) and large-sample approximations (Kass et al., 1988; Kass and Raftery, 1995). As mentioned in the introduction, Monte-Carlo methods are accurate but typically inefficient, whereas larg...

389 | Inference and missing data - Rubin - 1976 |

337 | Bayesian model selection in social research - Raftery - 1995
Citation Context: ...y more efficient than Monte-Carlo techniques. One large-sample approximation, known as a Laplace approximation, is widely used by Bayesian statisticians (Haughton, 1988; Kass et al., 1988; Kass and Raftery, 1995). Although this approximation is efficient relative to Monte-Carlo methods, it has a computational complexity of O(d^2 N) (or greater), where d is the dimension of the model and N is the sample size o...
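The Laplace approximation this context describes can be illustrated in one dimension, where the Hessian of the log posterior is a scalar. The sketch below is our own illustration, not the paper's code: a Bernoulli likelihood with a uniform Beta(1,1) prior, chosen because the exact log marginal likelihood is then available in closed form for comparison.

```python
import math

def laplace_log_marginal(heads, tails):
    """Second-order expansion of log p(D|theta)p(theta) around the MAP.
    Requires heads > 0 and tails > 0 so the MAP is interior."""
    n = heads + tails
    theta = heads / n  # MAP (equals the ML estimate under a uniform prior)
    loglik = heads * math.log(theta) + tails * math.log(1.0 - theta)
    # Negative second derivative of the log posterior at the MAP
    # (the 1x1 observed information "matrix"):
    neg_hess = heads / theta ** 2 + tails / (1.0 - theta) ** 2
    # log p(D) ~ log p(D|theta^) + log p(theta^)
    #            + (d/2) log(2 pi) - (1/2) log |H|,  with d = 1 and
    #            log p(theta^) = 0 for the uniform prior.
    return loglik + 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(neg_hess)

def exact_log_marginal(heads, tails):
    """Exact log marginal under Beta(1,1): B(heads+1, tails+1)."""
    return (math.lgamma(heads + 1) + math.lgamma(tails + 1)
            - math.lgamma(heads + tails + 2))
```

With 6 heads and 4 tails the two values already agree to within about 0.1 nat, consistent with the large-sample character of the approximation.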

331 | A tutorial on learning bayesian networks - Heckerman - 1996 |

297 | Model Selection and Accounting for Model Uncertainty in Graphical Models Using Occam’s Window - Madigan, Raftery - 1994
Citation Context: ...table. Consequently, approximate techniques for computing the marginal likelihood are used. These techniques include Monte Carlo approaches such as Gibbs sampling and importance sampling (Neal, 1993; Madigan & Raftery, 1994), sequential updating methods (Spiegelhalter & Lauritzen, 1990; Cowell, Dawid, & Sebastiani, 1995), and asymptotic approximations (Kass, Tierney, & Kadane, 1988; Kass & Raftery, 1995; Draper, 1993). ...

263 | Bayesian Graphical Models for Discrete Data - Madigan, York - 1995 |

254 | Operations for learning with graphical models - Buntine - 1994
Citation Context: ...ly and in closed form. Many of the researchers who have addressed Bayesian-network learning have adopted at least some of these assumptions (e.g., Cooper and Herskovits, 1992; Spiegelhalter et al., 1993; Buntine, 1994; Heckerman et al., 1995). The assumptions are as follows. 1. Every variable is discrete, having a finite number of states. We use x_i^k and pa_i^j to denote the kth possible state of X_i and the jth ...

208 | Sequential updating of conditional probabilities on directed graphical structures - Spiegelhalter, Lauritzen - 1990
Citation Context: ...ng the marginal likelihood are used. These techniques include Monte Carlo approaches such as Gibbs sampling and importance sampling (Neal, 1993; Madigan & Raftery, 1994), sequential updating methods (Spiegelhalter & Lauritzen, 1990; Cowell, Dawid, & Sebastiani, 1995), and asymptotic approximations (Kass, Tierney, & Kadane, 1988; Kass & Raftery, 1995; Draper, 1993). In this paper, we examine asymptotic approximations, comparing ...

200 | Bayesian analysis in expert systems - Spiegelhalter, Dawid, et al. - 1993
Citation Context: ...ection can be done efficiently and in closed form. Many of the researchers who have addressed Bayesian-network learning have adopted at least some of these assumptions (e.g., Cooper and Herskovits, 1992; Spiegelhalter et al., 1993; Buntine, 1994; Heckerman et al., 1995). The assumptions are as follows. 1. Every variable is discrete, having a finite number of states. We use x_i^k and pa_i^j to denote the kth possible state of X...

176 | Graphical models for associations between variables, some of which are qualitative and some quantitative - Lauritzen, Wermuth - 1989 |

154 | Bayesian updating in recursive graphical models by local computations. Computational Statistics Quarterly - Jensen, Lauritzen, et al. - 1990 |

154 | Assessment and Propagation of Model Uncertainty - Draper - 1995
Citation Context: ...& Raftery, 1994), sequential updating methods (Spiegelhalter & Lauritzen, 1990; Cowell, Dawid, & Sebastiani, 1995), and asymptotic approximations (Kass, Tierney, & Kadane, 1988; Kass & Raftery, 1995; Draper, 1993). In this paper, we examine asymptotic approximations, comparing their accuracy and efficiency. We consider the Laplace approximation (Kass et al., 1988; Kass & Raftery, 1995; Azevedo-Filho & Shachte...

153 | A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion - Kass, Wasserman - 1994
Citation Context: ... increases. Thus, if we use BIC to select one of a set of models, we will select a model whose posterior probability is a maximum, when N becomes sufficiently large. We say that BIC is asymptotically correct. By this definition, the Laplace approximation is... (Footnote: Under some conditions, the BIC is accurate to O(N^(-1/2)) (Kass & Wasserman, 1996). These conditions do not apply to the models we examine in our experiments.)

132 | Assessment and propagation of model uncertainty (with discussion) - Draper - 1995 |

126 | Mean field theory for sigmoid belief networks - Saul, Jaakkola, et al. - 1996 |

117 | Learning Gaussian Networks - Geiger, Heckerman - 1994 |

116 | Learning by being told and learning from examples: An experimental comparison of the two methods of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis - Michalski, Chilausky - 1980 |

110 | Approximate Bayes Factors and Accounting for Model Uncertainty in Generalized Linear Models - Raftery - 1996
Citation Context: ...of these approximations. For example, both theoretical and empirical studies have shown that the Laplace approximation is more accurate than is the BIC/MDL approximation (see, e.g., Draper, 1993, and Raftery, 1994). Also, Becker and LeCun (1989) and MacKay (1992b) report successful and unsuccessful applications of the diagonal approximation, respectively, in the context of parameter learning for probabilistic ...

103 | Improving the convergence of back-propagation learning with second order methods - Becker, LeCun - 1989 |

82 | Using EM to Obtain Asymptotic Variance-Covariance Matrices: The SEM Algorithm - Meng, Rubin - 1991 |

81 | Local learning in probabilistic networks with hidden variables - Russell, Binder, et al. - 1995 |

74 | Kinds of probability - Good - 1959 |

69 | Present Position and Potential Development: Some Personal Views: Statistical Theory: The Prequential Approach - Dawid - 1984 |

64 | PROTOS: an exemplar-based learning apprentice - Bareiss, Porter - 1988 |

64 | On the choice of a model to fit data from an exponential family - Haughton - 1988
Citation Context: ...under certain assumptions, and are typically more efficient than Monte-Carlo techniques. One large-sample approximation, known as a Laplace approximation, is widely used by Bayesian statisticians (Haughton, 1988; Kass et al., 1988; Kass and Raftery, 1995). Although this approximation is efficient relative to Monte-Carlo methods, it has a computational complexity of O(d^2 N) (or greater), where d is the dimens...

52 | Hypothesis testing and model selection - Raftery - 1996
Citation Context: ...cannot be used to determine marginal likelihood. Approximations for computing marginal likelihood include Monte-Carlo approaches such as Gibbs sampling and importance sampling (Neal, 1993; Chib, 1995; Raftery, 1996) and large-sample approximations (Kass et al., 1988; Kass and Raftery, 1995). As mentioned in the introduction, Monte-Carlo methods are accurate but typically inefficient, whereas large-sample method...

49 | Asymptotic model selection for directed networks with hidden variables - Geiger, Heckerman, et al. - 1996 |

41 | Stochastic complexity (with discussion) - Rissanen - 1987 |

34 | Bayesian mixture modeling by Monte Carlo simulation (Tech - Neal - 1991
Citation Context: ...ters once a model or set of models has been selected. If computation time is not an issue and we are concerned primarily with prediction, then a Monte-Carlo average over parameters is probably best (Neal, 1991). Nonetheless, we sometimes need a fast model for prediction or we may want point values for the parameters to facilitate an understanding of the domain. What is best in these circumstances is an ope...

32 | Optimal Discriminant Plane for a Small Number of Samples and - Hong, Yang - 1991 |

31 | Computing second derivatives in feed-forward networks: A review - Buntine, Weigend - 1994
Citation Context: ...ently and in closed form. Many researchers who have addressed Bayesian-network learning have adopted at least some of these assumptions (e.g., Cooper and Herskovits, 1992; Spiegelhalter et al., 1993; Buntine, 1994; Heckerman et al., 1995). The assumptions are as follows. 1. Every variable is discrete, having a finite number of states. We use x_i^k and pa_i^j to denote the kth possible state of X_i and the jth ...

28 | Choice of basis for the Laplace approximation - MacKay - 1996 |

22 | A guide to the literature on learning graphical models - Buntine - 1996 |

20 | A comparison of sequential learning methods for incomplete data - Cowell, Dawid, et al. - 1996
Citation Context: ...ed. These techniques include Monte Carlo approaches such as Gibbs sampling and importance sampling (Neal, 1993; Madigan & Raftery, 1994), sequential updating methods (Spiegelhalter & Lauritzen, 1990; Cowell, Dawid, & Sebastiani, 1995), and asymptotic approximations (Kass, Tierney, & Kadane, 1988; Kass & Raftery, 1995; Draper, 1993). In this paper, we examine asymptotic approximations, comparing their accuracy and efficiency. We c...