## Learning Bayesian networks: The combination of knowledge and statistical data (1995)


Venue: Machine Learning

Citations: 977 (35 self)

### BibTeX

@ARTICLE{Heckerman95learningbayesian,
  author = {David Heckerman and Dan Geiger and David M. Chickering},
  title = {Learning Bayesian networks: The combination of knowledge and statistical data},
  journal = {Machine Learning},
  year = {1995},
  volume = {20},
  pages = {197--243}
}


### Abstract

We describe scoring metrics for learning Bayesian networks from a combination of user knowledge and statistical data. We identify two important properties of metrics, which we call event equivalence and parameter modularity. These properties have been mostly ignored, but when combined, greatly simplify the encoding of a user’s prior knowledge. In particular, a user can express his knowledge—for the most part—as a single prior Bayesian network for the domain.

### Citations

9409 | Maximum likelihood from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977 |

7644 |
Probabilistic reasoning in intelligent systems: networks of plausible inference
- Pearl
- 1988
Citation Context: ...1. Introduction A Bayesian network is an annotated directed graph that encodes probabilistic relationships among distinctions of interest in an uncertain-reasoning problem (Howard & Matheson, 1981; Pearl, 1988). The representation formally encodes the joint probability distribution for its domain, yet includes a human-oriented qualitative structure that facilitates communication between a user and a system ... |
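The excerpt above notes that a Bayesian network "formally encodes the joint probability distribution for its domain." As a minimal illustration (not code from the paper), a chain x1 → x2 → x3 with made-up binary probability tables factors its joint as p(x1) p(x2|x1) p(x3|x2):

```python
# Hypothetical CPTs for a three-node chain x1 -> x2 -> x3 (all binary).
# The numbers are illustrative only, not taken from the paper.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

def joint(x1, x2, x3):
    """p(x1, x2, x3) = p(x1) * p(x2 | x1) * p(x3 | x2)."""
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

# The factored joint is a proper distribution: it sums to 1.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

The directed structure supplies the qualitative, human-oriented part (which variables condition on which), while the local tables supply the numbers.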

1159 | A Bayesian method for the induction of probabilistic networks from data
- Cooper, Herskovits
- 1992
Citation Context: ...the same set of parents (the empty set). Consequently, by parameter modularity, p(Θ_x | B_s1^h, ξ) = p(Θ_x | B_s2^h, ξ). We note that CH, Buntine, and SDLC implicitly make the assumption of parameter modularity (Cooper & Herskovits, 1992, Equation A6, p. 340; Buntine, 1991, p. 55; Spiegelhalter et al., 1993, pp. 243-244). The fourth assumption restricts each parameter set Θ_ij to have a Dirichlet distribution: ASSUMPTION 4 (DIRICHLET) ... |

708 | Approximating discrete probability distributions with dependence trees - Chow, Liu - 1968 |

548 | Causation, prediction and search - Spirtes, Glymour, et al. - 1993 |

451 |
Lectures on Functional Equations and Their Applications
- Aczel
- 1966
Citation Context: ... (39), where the f's are unknown density functions. Equations 38 and 39 define a functional equation. Methods for solving such equations have been well studied (see, e.g., Aczel, 1966). In our case, Geiger and Heckerman (1995) show that, if each function is positive, then the only solution to Equations 38 and 39 is for p(Θ_xy | B^h, ξ) to be a Dirichlet distribution. In fa... |

380 |
Influence diagrams
- Howard, Matheson
- 1984
Citation Context: ...branching, heuristic search 1. Introduction A Bayesian network is an annotated directed graph that encodes probabilistic relationships among distinctions of interest in an uncertain-reasoning problem (Howard & Matheson, 1981; Pearl, 1988). The representation formally encodes the joint probability distribution for its domain, yet includes a human-oriented qualitative structure that facilitates communication between a user ... |

334 | Learning Bayesian Networks - Heckerman, Geiger, et al. - 1995 |

302 | Model Selection and Accounting for Model Uncertainty in Graphical Models Using Occam's Window
- Madigan, Raftery
- 1994
Citation Context: ...cribe a methodology for evaluating learning algorithms. We use this methodology to compare various scoring metrics and search methods. We note that several researchers (e.g., Dawid & Lauritzen, 1993; Madigan & Raftery, 1994) have developed methods for learning undirected network structures as described in (e.g.) Lauritzen (1982). In this paper, we concentrate on learning directed models, because we can sometimes use the... |

251 |
The ALARM Monitoring System: A Case Study with two Probabilistic Inference Techniques for Belief Networks
- Beinlich, Suermondt, et al.
- 1989
Citation Context: ...new Bayesian networks as shown in Figure 1d. To appreciate the effectiveness of the approach, note that the database was generated from the Bayesian network in Figure 1a known as the Alarm network (Beinlich et al., 1989). Comparing the three network structures, we see that the structure of the learned network is much closer to that of the Alarm network than is the structure of the prior network. In effect, our learn... |

224 |
Equivalence and synthesis of causal models
- Verma, Pearl
- 1990
Citation Context: ...al independence--that is, every joint probability distribution encoded by one structure can be encoded by the other, and vice versa. In this case, the two network structures are said to be equivalent (Verma & Pearl, 1990). For example, the structures x1 → x2 → x3 and x1 ← x2 ← x3 both represent the assertion that x1 and x3 are conditionally independent given x2, and are equivalent. In some of the technical d... |

218 | A theory of inferred causation
- Pearl, Verma
- 1991
Citation Context: ...Spiegelhalter et al., 1993; Dawid & Lauritzen, 1993; Heckerman et al., 1994), quasi-Bayesian methods (Lam & Bacchus, 1993; Suzuki, 1993), and non-Bayesian methods (Pearl & Verma, 1991; Spirtes et al., 1993). In this paper, we concentrate on the Bayesian approach, which takes prior knowledge and combines it with data to produce one or more Bayesian networks. Our approach is illustr... |

209 | How easy is local search
- Johnson, Papadimitriou, et al.
- 1988
Citation Context: ...s(x_i | Π_i) need be evaluated to determine Δ(e). If an arc between x_i and x_j is reversed, then only s(x_i | Π_i) and s(x_j | Π_j) need be evaluated. One simple heuristic search algorithm is local search (e.g., Johnson, 1985). First, we choose a graph. Then, we evaluate Δ(e) for all e ∈ E, and make the change e for which Δ(e) is a maximum, provided it is positive. We terminate search when there is no e with a positive va... |
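The local-search procedure quoted above (choose a graph, apply the best positive-scoring change, stop when none remains) can be sketched generically. This is a hedged toy, not the paper's implementation: the score is a simple integer function standing in for a network-scoring metric, and `neighbors` stands in for the set of edge additions, deletions, and reversals:

```python
def local_search(state, neighbors, score):
    """Greedy local search: repeatedly move to the best-scoring neighbor
    while it strictly improves the score; stop at a local maximum."""
    current = state
    while True:
        best = max(neighbors(current), key=score, default=current)
        if score(best) <= score(current):
            return current
        current = best

# Toy usage: hill-climb an integer toward the maximum of -(x - 3)^2.
result = local_search(0, lambda x: [x - 1, x + 1], lambda x: -(x - 3) ** 2)
```

As the excerpt notes, each step is cheap in the network setting because only the local terms s(x_i | Π_i) affected by a change need re-evaluation.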

208 |
Sequential updating of conditional probabilities on directed graphical structures
- Spiegelhalter, Lauritzen
- 1990
Citation Context: ...these probabilities can be approximated for incomplete databases by well-known statistical methods. Such methods include filling in missing data based on the data that is present (Titterington, 1976; Spiegelhalter & Lauritzen, 1990), the EM algorithm (Dempster, 1977), and Markov chain Monte Carlo methods (e.g., Gibbs sampling) (York, 1992; Madigan & Raftery, 1994). Let us now explore the consequences of these assumptions. First... |
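Of the missing-data methods listed above, the EM algorithm is the easiest to sketch. The following is a generic two-coin mixture example, a standard textbook setting rather than the paper's procedure: each trial's heads count is observed, but which coin produced it is the missing data, and EM alternates between soft-assigning trials to coins and re-estimating each coin's bias:

```python
def em_two_coins(counts, n, theta_a, theta_b, iters=50):
    """EM for a two-coin mixture. counts[i] = heads in n flips of trial i;
    the identity of the coin used in each trial is the missing data."""
    for _ in range(iters):
        num_a = den_a = num_b = den_b = 0.0
        for h in counts:
            # E-step: binomial likelihood of h heads under each coin
            # (the shared binomial coefficient cancels in the ratio).
            la = theta_a ** h * (1 - theta_a) ** (n - h)
            lb = theta_b ** h * (1 - theta_b) ** (n - h)
            resp_a = la / (la + lb)  # posterior responsibility of coin A
            num_a += resp_a * h
            den_a += resp_a * n
            num_b += (1 - resp_a) * h
            den_b += (1 - resp_a) * n
        # M-step: re-estimate each coin's heads probability.
        theta_a, theta_b = num_a / den_a, num_b / den_b
    return theta_a, theta_b

# Made-up data: five trials of 10 flips each.
theta_a, theta_b = em_two_coins([9, 8, 1, 2, 9], 10, 0.6, 0.4)
```

On this data the estimates separate, with coin A absorbing the high-heads trials and coin B the low-heads ones.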

204 | Theory refinement on Bayesian networks
- Buntine
- 1991
Citation Context: ...ks to encode expert knowledge. More recently, AI researchers and statisticians have begun to investigate methods for learning Bayesian networks, including Bayesian methods (Cooper & Herskovits, 1991; Buntine, 1991; Spiegelhalter et al., 1993; Dawid & Lauritzen, 1993; Heckerman et al., 1994), quasi-Bayesian methods (Lam & Bacchus, 1993; Suzuki, 1993), and non-Bayesian methods... |

201 | Bayesian analysis in expert systems - Spiegelhalter, Dawid, et al. - 1993 |

191 | Reasoning about beliefs and actions under computational resource constraints
- Horvitz
- 1987
Citation Context: ...gold-standard and learned networks, and note the difference (Heckerman & Nathwani, 1992). This utility function may include not only domain utility, but the costs of probabilistic inference as well (Horvitz, 1987). Unfortunately, it is difficult if not impossible to construct utility functions and decision scenarios in practice. For example, a particular set of learned network structures may be used for a col... |

154 |
Linear-Space Best-First Search
- Korf
- 1993
Citation Context: ...m graph. Alternatively, we may start with a lower temperature, and use one of the initialization methods described for local search. Other methods for escaping local maxima include best-first search (Korf, 1993) and Gibbs sampling (e.g., Madigan & Raftery, 1994). 8. Evaluation Methodology Our methodology for measuring the learning accuracy of scoring metrics and search procedures is as follows. We start wi... |

127 |
Hyper Markov laws in the statistical analysis of decomposable graphical models
- Dawid, Lauritzen
- 1993
Citation Context: ...y, AI researchers and statisticians have begun to investigate methods for learning Bayesian networks, including Bayesian methods (Cooper & Herskovits, 1991; Buntine, 1991; Spiegelhalter et al., 1993; Dawid & Lauritzen, 1993; Heckerman et al., 1994), quasi-Bayesian methods (Lam & Bacchus, 1993; Suzuki, 1993), and non-Bayesian methods (Pearl & Verma, 1991; Spirtes et al., 1993). In this... |

119 |
Learning gaussian networks
- Geiger, Heckerman
- 1994
Citation Context: ...priors from our construction to create the likelihood-equivalent BDe metric for complete databases. We note that our metrics and methods for constructing priors may be extended to nondiscrete domains (Geiger & Heckerman, 1994; Heckerman & Geiger, 1995). Third, we described search methods for identifying network structures with high posterior probabilities. We described polynomial algorithms for finding the highest-scoring... |

96 | A transformational characterization of equivalent Bayesian network structures
- Chickering
- 1995
Citation Context: ...cal discussions in this paper, we shall require the following characterization of equivalent networks, proved in Chickering (1995a) and also in the Appendix. THEOREM 1 (Chickering, 1995a) Let B_s1 and B_s2 be two Bayesian-network structures, and R_{B_s1,B_s2} be the set of edges by which B_s1 and B_s2 differ in directionality. Then, B_s1 and B_s2 are equivalent if and only if there exists a se... |

88 | Causal Diagrams for Empirical Research.” Biometrika 82(4):669–710 - Pearl - 1995 |

81 | Optimization algorithms for networks and graphs - Evans, Minieka - 1992 |

79 | Bayesian network structures from data - Singh, Valtorta |

77 | Learning Bayesian networks: Search methods and experimental results - Chickering, Geiger, et al. - 1995 |

75 |
The Estimation of Probabilities
- Good
- 1965
Citation Context: ...er θ_{x=k} is positive (i.e., greater than zero). A sequence that satisfies these conditions is a particular type of random sample known as an (r - 1)-dimensional multinomial sample with parameters Θ_x (Good, 1965). When r = 2, the sequence is said to be a binomial sample. One example of a binomial sample is the outcome of repeated flips of a thumbtack. If we knew the long-run fraction of "heads" (point down)... |
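The binomial (thumbtack) example above has a standard conjugate treatment: with a Beta prior on the long-run heads fraction, the posterior after observing a binomial sample is again Beta. A small sketch under that textbook assumption (the paper works with the more general Dirichlet case):

```python
def beta_posterior(alpha, beta, heads, tails):
    """Conjugate update: a Beta(alpha, beta) prior plus binomial data
    yields a Beta(alpha + heads, beta + tails) posterior."""
    return alpha + heads, beta + tails

def predictive_heads(alpha, beta):
    """Posterior-mean probability that the next flip lands heads."""
    return alpha / (alpha + beta)

# Uniform prior Beta(1, 1); observe 7 heads and 3 tails in 10 flips.
a, b = beta_posterior(1, 1, 7, 3)
p = predictive_heads(a, b)  # (1 + 7) / (1 + 1 + 10)
```

When r > 2, the same counting update applies coordinate-wise to a Dirichlet prior over the multinomial parameters.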

63 | A Bayesian approach to learning causal networks
- Heckerman
- 1995
Citation Context: ...to be a Dirichlet distribution. In fact, they show that, even when x and/or y have more than two states, the only solution consistent with likelihood equivalence is the Dirichlet. THEOREM 6 (Geiger & Heckerman, 1995) Let Θ_xy, Θ_x ∪ Θ_{y|x}, and Θ_y ∪ Θ_{x|y} be (positive) multinomial parameters related by the rules of probability. If ... (40) ... |

56 | Learning Bayesian networks with discrete variables from data - Spirtes, Meek - 1995 |

46 | Causality in Bayesian belief networks
- Druzdzel, Simon
- 1993
Citation Context: ...cause and effect. Recently, several researchers have begun to explore a formal causal semantics for Bayesian networks (e.g., Pearl & Verma, 1991, Spirtes et al., 1993, Druzdzel & Simon, 1993, and Heckerman & Shachter, 1995). They argue that the representation of causal knowledge is important not only for assessment, but for prediction as well. In particular, they argue that causal knowled... |

44 | Learning Bayesian networks: a unification for discrete and Gaussian domains
- Heckerman, Geiger
- 1995
Citation Context: ...on to create the likelihood-equivalent BDe metric for complete databases. We note that our metrics and methods for constructing priors may be extended to nondiscrete domains (Geiger & Heckerman, 1994; Heckerman & Geiger, 1995). Third, we described search methods for identifying network structures with high posterior probabilities. We described polynomial algorithms for finding the highest-scoring network structures in the... |

40 | The assessment of prior distributions in Bayesian analysis - Winkler - 1967 |

35 | Efficient implementation of graph algorithms using contraction - Gabow, Galil, et al. - 1984 |

35 | Using causal information and local measures to learn Bayesian belief networks - Lam, Bacchus - 1993 |

34 |
A construction of Bayesian networks from databases based on an MDL principle
- Suzuki
- 1993
Citation Context: ...ooper & Herskovits, 1991; Buntine, 1991; Spiegelhalter et al., 1993; Dawid & Lauritzen, 1993; Heckerman et al., 1994), quasi-Bayesian methods (Lam & Bacchus, 1993; Suzuki, 1993), and non-Bayesian methods (Pearl & Verma, 1991; Spirtes et al., 1993). In this paper, we concentrate on the Bayesian approach, which takes prior knowledge and combines it with data to produce one or ... |

23 | A characterization of the Dirichlet distribution with application to learning Bayesian networks
- Geiger, Heckerman
- 1995
Citation Context: ...to be a Dirichlet distribution. In fact, they show that, even when x and/or y have more than two states, the only solution consistent with likelihood equivalence is the Dirichlet. THEOREM 6 (Geiger & Heckerman, 1995) Let Θ_xy, Θ_x ∪ Θ_{y|x}, and Θ_y ∪ Θ_{x|y} be (positive) multinomial parameters related by the rules of probability. If ... (40) ... |

20 | A simple derivation of Edmonds’ algorithm for optimum branching - Karp - 1972 |

18 | Lectures on Contingency Tables - Lauritzen - 1982 |

18 |
Updating a diagnostic system using unconfirmed cases. Applied Statistics
- Titterington
- 1976
Citation Context: ...abase. In practice, these probabilities can be approximated for incomplete databases by well-known statistical methods. Such methods include filling in missing data based on the data that is present (Titterington, 1976; Spiegelhalter & Lauritzen, 1990), the EM algorithm (Dempster, 1977), and Markov chain Monte Carlo methods (e.g., Gibbs sampling) (York, 1992; Madigan & Raftery, 1994). Let us now explore the consequ... |

17 |
A definition and graphical representation for causality
- Heckerman, Shachter
- 1995
Citation Context: ...several researchers have begun to explore a formal causal semantics for Bayesian networks (e.g., Pearl & Verma, 1991, Spirtes et al., 1993, Druzdzel & Simon, 1993, and Heckerman & Shachter, 1995). They argue that the representation of causal knowledge is important not only for assessment, but for prediction as well. In particular, they argue that causal knowledge--unlike knowledge of correlat... |

15 | A Decision-Based View of Causality - Heckerman, Shachter - 1994 |

8 | An evaluation of the diagnostic accuracy
- Heckerman, Nathwani
- 1992
Citation Context: ...nd a model of that uncertainty (i.e., one or more Bayesian networks for U), we evaluate the expected utility of these decisions using the gold-standard and learned networks, and note the difference (Heckerman & Nathwani, 1992). This utility function may include not only domain utility, but the costs of probabilistic inference as well (Horvitz, 1987). Unfortunately, it is difficult if not impossible to construct utility fu... |

7 | A characterization of the Dirichlet distribution applicable to learning Bayesian networks - Geiger, Heckerman - 1995 |

6 | Deriving a Minimal I-Map of a Belief Network Relative to a Target Ordering of its Nodes - Matzkevich, Abramson - 1993 |

6 |
Bayesian Methods for the Analysis of Misclassified and Incomplete Multivariate Discrete Data
- York
- 1992
Citation Context: ...n missing data based on the data that is present (Titterington, 1976; Spiegelhalter & Lauritzen, 1990), the EM algorithm (Dempster, 1977), and Markov chain Monte Carlo methods (e.g., Gibbs sampling) (York, 1992; Madigan & Raftery, 1994). Let us now explore the consequences of these assumptions. First, from the multinomial-sample assumption and the assumption of no missing data, we obtain ... |

5 | The k best spanning arborescences of a network. Networks - Camerini, Fratta, et al. - 1980 |

5 |
Search operators for learning equivalent classes of Bayesian network structures
- Chickering
- 1995
Citation Context: ...cal discussions in this paper, we shall require the following characterization of equivalent networks, proved in Chickering (1995a) and also in the Appendix. THEOREM 1 (Chickering, 1995a) Let B_s1 and B_s2 be two Bayesian-network structures, and R_{B_s1,B_s2} be the set of edges by which B_s1 and B_s2 differ in directionality. Then, B_s1 and B_s2 are equivalent if and only if there exists a se... |

5 | La prévision: ses lois logiques, ses sources subjectives. Annales de l'Institut Henri Poincaré - de Finetti - 1937 |

2 | Learning causal networks - Heckerman - 1995 |

2 | Finding optimal branchings. Networks - Tarjan - 1977 |
