## Improved learning of Bayesian networks (2001)

Venue: Proc. of the Conf. on Uncertainty in Artificial Intelligence

Citations: 37 (6 self)

### BibTeX

@INPROCEEDINGS{Castelo01improvedlearning,

author = {Robert Castelo and Craig Boutilier},

title = {Improved learning of Bayesian networks},

booktitle = {Proc. of the Conf. on Uncertainty in Artificial Intelligence},

year = {2001},

pages = {269--276},

publisher = {Morgan Kaufmann}

}

### Abstract

Two or more Bayesian network structures are Markov equivalent when the corresponding acyclic digraphs encode the same set of conditional independencies. Therefore, the search space of Bayesian network structures may be organized in equivalence classes, where each of them represents a different set of conditional independencies. The collection of sets of conditional independencies obeys a partial order, the so-called “inclusion order.” This paper discusses in depth the role that the inclusion order plays in learning the structure of Bayesian networks. In particular, this role involves the way a learning algorithm traverses the search space. We introduce a condition for traversal operators, the inclusion boundary condition, which, when it is satisfied, guarantees that the search strategy can avoid local maxima. This is proved under the assumptions that the data is sampled from a probability distribution which is faithful to an acyclic digraph, and the length of the sample is unbounded. The previous discussion leads to the design of a new traversal operator and two new learning algorithms in the context of heuristic search and the Markov Chain Monte Carlo method. We carry out a set of experiments with synthetic and real-world data that show empirically the benefit of striving for the inclusion order when learning Bayesian networks from data.
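Markov equivalence, as described in the abstract, can be checked mechanically. The following is a minimal sketch, assuming the characterization attributed to Verma and Pearl (1990): two acyclic digraphs are Markov equivalent exactly when they share the same skeleton and the same v-structures (a → c ← b with a and b non-adjacent). The edge-set representation and the function names `skeleton`, `v_structures`, and `markov_equivalent` are illustrative choices, not taken from the paper.

```python
from itertools import combinations


def skeleton(edges):
    """Return the undirected adjacencies of a DAG given as (parent, child) pairs."""
    return {frozenset(edge) for edge in edges}


def v_structures(edges):
    """Return all v-structures a -> c <- b in which a and b are non-adjacent."""
    adjacent = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, pars in parents.items():
        for a, b in combinations(sorted(pars), 2):
            if frozenset((a, b)) not in adjacent:
                found.add((a, child, b))
    return found


def markov_equivalent(edges1, edges2):
    """Same skeleton and same v-structures (assumes both graphs share one vertex set)."""
    return (skeleton(edges1) == skeleton(edges2)
            and v_structures(edges1) == v_structures(edges2))


chain = {("A", "B"), ("B", "C")}      # A -> B -> C
fork = {("B", "A"), ("B", "C")}       # A <- B -> C
collider = {("A", "B"), ("C", "B")}   # A -> B <- C

print(markov_equivalent(chain, fork))      # True
print(markov_equivalent(chain, collider))  # False
```

The chain and the fork encode the same single independence (A and C given B), so they fall in one equivalence class; the collider encodes a different one and therefore belongs to a different class.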

### Citations

7052 | Probabilistic Reasoning in Intelligent Systems - Pearl - 1988 |
Citation Context: ...s to converge to an asymptotic value smaller than 3.7. This was observed up to 10 vertices. 2. Frydenberg (1990) also proved it but under the additional condition of the fifth graphoid axiom CI5 (see Pearl, 1988). In Figure 1 we see the cardinalities of DAG-space and EG-space plotted, up to 10 vertices. From a non-causal perspective one is interested in learning equivalence classes of Bayesian networks f...

2243 | Equation of state calculations by fast computing machines - Metropolis, Rosenbluth, et al. - 1953 |

1217 | Monte Carlo sampling methods using Markov chains and their applications - Hastings - 1970 |
Citation Context: ...the paper of Smith and Roberts (1993). Given output $M(t, q) = \{M_{t=1}, M_{t=2}, \ldots, M_{t=n}\}$ of the Markov chain, the regularity conditions allow us to derive the following asymptotic results (Chung, 1967, Hastings, 1970, Smith and Roberts, 1993, Madigan and York, 1995): $M_{t=n} \xrightarrow{n \to \infty} M \sim p(M|D)$ (7) and $\frac{1}{n} \sum_{t=1}^{n} f(M(t, q)) \xrightarrow{n \to \infty} E(f(M))$ (8). These imply that, when the Markov chain M(t, q) converges: • the draws from ...

1102 | Graphical Models - Lauritzen - 1996 |

1075 | A Bayesian Method for the Induction - Cooper, Herskovits - 1992 |
Citation Context: ...ctorization in (1) allows us to obtain a closed formula for the marginal likelihood of the data D given a Bayesian network M ≡ D(G), p(D|M), under a certain set of assumptions about D (Buntine, 1991, Cooper and Herskovits, 1992, Heckerman et al., 1995). The logarithm of the marginal likelihood and the prior of the model, log[p(D|M)p(M)], is often used as a scoring metric for Bayesian networks. Throughout this paper we have ...

1019 | Empirical analysis of predictive algorithms for collaborative filtering - Breese, Heckerman, et al. - 1998 |

983 | Bayes factors - Kass, Raftery - 1995 |

903 | Learning Bayesian networks: The combination of knowledge and statistical data - Heckerman, Geiger, et al. - 1995 |
Citation Context: ...to obtain a closed formula for the marginal likelihood of the data D given a Bayesian network M ≡ D(G), p(D|M), under a certain set of assumptions about D (Buntine, 1991, Cooper and Herskovits, 1992, Heckerman et al., 1995). The logarithm of the marginal likelihood and the prior of the model, log[p(D|M)p(M)], is often used as a scoring metric for Bayesian networks. Throughout this paper we have used the BDeu scoring me...

496 | Causation, Prediction, and Search - Spirtes, Glymour, et al. - 1993 |
Citation Context: ...ns. Each equivalence class has a canonical representation in the form of an acyclic partially directed graph where the edges may be directed and undirected and satisfy some characterizing conditions (Spirtes et al., 1993, Chickering, 1995, Andersson et al., 1997a). This representation has been introduced independently by several authors under different names: pattern (Spirtes et al., 1993), completed PDAG (Chickering...

382 | Understanding the Metropolis-Hastings algorithm - Chib, Greenberg - 1995 |
Citation Context: ...the ratio of the cardinalities of the neighborhoods |N(G)|/|N(G′)| which is known as the candidate-generating ratio. In our experimentation we have assumed a symmetric candidate-generating density (Chib and Greenberg, 1995), where |N(G)| = |N(G′)|. This is reasonable in our context since G and G′ will differ in a single adjacency. The eMC³ algorithm of Figure 7 needs the specification of some Bayesian network as a ...

351 | Bayesian computation via the Gibbs Sampler and related Markov Chain Monte Carlo Method - Smith, Roberts - 1993 |
Citation Context: ...th and Roberts (1993). Given output $M(t, q) = \{M_{t=1}, M_{t=2}, \ldots, M_{t=n}\}$ of the Markov chain, the regularity conditions allow us to derive the following asymptotic results (Chung, 1967, Hastings, 1970, Smith and Roberts, 1993, Madigan and York, 1995): $M_{t=n} \xrightarrow{n \to \infty} M \sim p(M|D)$ (7) and $\frac{1}{n} \sum_{t=1}^{n} f(M(t, q)) \xrightarrow{n \to \infty} E(f(M))$ (8). These imply that, when the Markov chain M(t, q) converges: • the draws from the Markov chain mimic a ...

239 | The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks - Beinlich, Suermondt, et al. - 1989 |
Citation Context: ... the results for heuristic learning, and in Subsection 5.3 we show the results for MCMC learning. 5.1 Synthetic and Real-World data We have used two kinds of synthetic data. One is the Alarm dataset (Beinlich et al., 1989), which has become a standard benchmark dataset for the assessment of learning algorithms for Bayesian networks on discrete data. The Alarm dataset was sampled from the Bayesian network in Figure 8, ...

226 | Bayesian graphical models for discrete data - Madigan, York - 1995 |
Citation Context: ...ven output $M(t, q) = \{M_{t=1}, M_{t=2}, \ldots, M_{t=n}\}$ of the Markov chain, the regularity conditions allow us to derive the following asymptotic results (Chung, 1967, Hastings, 1970, Smith and Roberts, 1993, Madigan and York, 1995): $M_{t=n} \xrightarrow{n \to \infty} M \sim p(M|D)$ (7) and $\frac{1}{n} \sum_{t=1}^{n} f(M(t, q)) \xrightarrow{n \to \infty} E(f(M))$ (8). These imply that, when the Markov chain M(t, q) converges: • the draws from the Markov chain mimic a random sample from p(M|D...

217 | Equivalence and synthesis of causal models - Verma, Pearl - 1990 |

202 | Being Bayesian about network structure - Friedman, Koller - 2000 |
Citation Context: ... is known (Cooper and Herskovits, 1992), or they search for a good causal ordering that may help in providing later a better result (Bouckaert, 1992, Singh and Valtorta, 1993, Larrañaga et al., 1996, Friedman and Koller, 2000). However, the causal ordering reduces the already small part of the inclusion boundary that was reachable from the NR and AR neighborhoods. Therefore, errors in the ordering may easily lead to very ...

183 | Theory refinement on Bayesian networks - Buntine - 1991 |
Citation Context: ...he recursive factorization in (1) allows us to obtain a closed formula for the marginal likelihood of the data D given a Bayesian network M ≡ D(G), p(D|M), under a certain set of assumptions about D (Buntine, 1991, Cooper and Herskovits, 1992, Heckerman et al., 1995). The logarithm of the marginal likelihood and the prior of the model, log[p(D|M)p(M)], is often used as a scoring metric for Bayesian networks. T...

171 | Markov Chains with Stationary Transition Probabilities - Chung - 1960 |
Citation Context: ...onditions in the paper of Smith and Roberts (1993). Given output $M(t, q) = \{M_{t=1}, M_{t=2}, \ldots, M_{t=n}\}$ of the Markov chain, the regularity conditions allow us to derive the following asymptotic results (Chung, 1967, Hastings, 1970, Smith and Roberts, 1993, Madigan and York, 1995): $M_{t=n} \xrightarrow{n \to \infty} M \sim p(M|D)$ (7) and $\frac{1}{n} \sum_{t=1}^{n} f(M(t, q)) \xrightarrow{n \to \infty} E(f(M))$ (8). These imply that, when the Markov chain M(t, q) converges: •...

159 | Optimal structure identification with greedy search - Chickering - 2002 |

140 | Independence properties of directed Markov fields - Lauritzen, Dawid, et al. - 1990 |
Citation Context: ...alized subgraph induced by the smallest ancestral set of A ∪ B ∪ S. The DGMP is the sharpest possible graphical criterion that permits reading CI restrictions from a given DAG (Pearl and Verma, 1987, Lauritzen et al., 1990). An alternative way of reading conditional independencies in a DAG is using the d-separation criterion of Pearl and Verma (1987), which we review now. A vertex vi in a path v0, v1, ..., vn, n > 1, i...

129 | Learning equivalence classes of Bayesian-network structure - Chickering - 2002 |
Citation Context: ...and AR neighborhoods. Therefore, errors in the ordering may easily lead to very bad local maxima, as shown by Chickering et al. (1995). Heuristic algorithms that use EG-space (Spirtes and Meek, 1995, Chickering, 1996, 2002a,b) do not assume that any form of causal ordering is known probably because, in general, they can work better with complex domains. We introduce here a new heuristic algorithm which works in D...

111 | Assessment and propagation of model uncertainty - Draper - 1995 |
Citation Context: ... may learn different models through different runs, and this permits trading time for multiple local maxima. 4.3 The Markov Chain Monte Carlo Method The need to account for the uncertainty of models (Draper, 1995) has led to the development of computational methods that implement the full Bayesian approach to modeling. Recall the Bayes’ theorem: $p(M|D) = \frac{p(D|M)\,p(M)}{p(D)}$ (3), where p(D) is known as the normal...

105 | The chain graph Markov property - Frydenberg - 1990 |

100 | Conditional Independence in Statistical Theory (with Discussion) - Dawid - 1979 |

91 | A transformational characterization of equivalent Bayesian network structures - Chickering - 1995 |

91 | A characterization of Markov equivalence classes for acyclic digraphs - Andersson, Madigan, et al. - 1997 |
Citation Context: ...al representation in the form of an acyclic partially directed graph where the edges may be directed and undirected and satisfy some characterizing conditions (Spirtes et al., 1993, Chickering, 1995, Andersson et al., 1997a). This representation has been introduced independently by several authors under different names: pattern (Spirtes et al., 1993), completed PDAG (Chickering, 1995) and essential graph (Andersson et ...

77 | An algorithm for the construction of Bayesian network structures from data - Singh, Valtorta - 1993 |
Citation Context: ...ssume that a causal ordering between the variables is known (Cooper and Herskovits, 1992), or they search for a good causal ordering that may help in providing later a better result (Bouckaert, 1992, Singh and Valtorta, 1993, Larrañaga et al., 1996, Friedman and Koller, 2000). However, the causal ordering reduces the already small part of the inclusion boundary that was reachable from the NR and AR neighborhoods. Therefo...

73 | Learning Bayesian networks: Search methods and experimental results - Chickering, Geiger, et al. - 1995 |

63 | Decomposable graphical Gaussian model determination - Giudici, Green - 1999 |
Citation Context: ...t, the inclusion boundary condition has been implicitly taken into consideration by most of the learning algorithms for undirected and decomposable models (Havránek, 1984, Edwards and Havránek, 1985, Giudici and Green, 1999) and surprisingly ignored by most authors in the context of Bayesian networks. 4. Inclusion-driven structure learning In this section we describe an efficient implementation of the ENR and ENCR neigh...

56 | Learning Bayesian networks with discrete variables from data - Spirtes, Meek - 1995 |
Citation Context: ...s reachable from the NR and AR neighborhoods. Therefore, errors in the ordering may easily lead to very bad local maxima, as shown by Chickering et al. (1995). Heuristic algorithms that use EG-space (Spirtes and Meek, 1995, Chickering, 1996, 2002a,b) do not assume that any form of causal ordering is known probably because, in general, they can work better with complex domains. We introduce here a new heuristic algorith...

54 | Learning Bayesian network structures by searching for the best ordering with genetic algorithms - Larrañaga, Kuijpers, et al. - 1996 |
Citation Context: ...ng between the variables is known (Cooper and Herskovits, 1992), or they search for a good causal ordering that may help in providing later a better result (Bouckaert, 1992, Singh and Valtorta, 1993, Larrañaga et al., 1996, Friedman and Koller, 2000). However, the causal ordering reduces the already small part of the inclusion boundary that was reachable from the NR and AR neighborhoods. Therefore, errors in the orderi...

48 | A fast procedure for model search in multidimensional contingency tables - Edwards, Havránek - 1985 |
Citation Context: ...en class of GMMs. In fact, the inclusion boundary condition has been implicitly taken into consideration by most of the learning algorithms for undirected and decomposable models (Havránek, 1984, Edwards and Havránek, 1985, Giudici and Green, 1999) and surprisingly ignored by most authors in the context of Bayesian networks. 4. Inclusion-driven structure learning In this section we describe an efficient implementation ...

47 | Improving Markov Chain Monte Carlo Model Search for Data Mining - Giudici, Castelo - 2003 |

44 | Graphical models, selecting causal and statistical models - Meek - 1997 |
Citation Context: ...th, that ends in the true Bayesian network, the score will increase as it is shown in Theorem 3.3. This result is equivalent to Lemmas 8 and 9 from Chickering (2002b) where the optimality of the GES (Meek, 1997) algorithm for structure learning of Bayesian networks is proved. However, the inclusion boundary condition provides us with a general policy for the design of effective traversal operators for any g...

39 | Counting labeled acyclic digraphs - Robinson - 1971 |

38 | Bayesian model averaging and model selection for Markov equivalence classes of acyclic digraphs - Madigan, Andersson, et al. - 1996 |
Citation Context: ...one uses a score equivalent scoring metric. In such situation it makes sense to use EG-space instead of DAG-space. This argument has been further supported by several authors (Heckerman et al., 1995, Madigan et al., 1996) who cite the following advantages: 1. The cardinality of EG-space is smaller than in DAG-space. 2. The scoring metric is no longer constrained to give equal scores to Markov equivalent Bayesian netw...

30 | On the Markov equivalence of chain graphs, undirected graphs and acyclic digraphs - Andersson, Madigan, et al. - 1997 |
Citation Context: ...al representation in the form of an acyclic partially directed graph where the edges may be directed and undirected and satisfy some characterizing conditions (Spirtes et al., 1993, Chickering, 1995, Andersson et al., 1997a). This representation has been introduced independently by several authors under different names: pattern (Spirtes et al., 1993), completed PDAG (Chickering, 1995) and essential graph (Andersson et ...

27 | Computer-based probabilistic networks construction - Herskovits - 1991 |

26 | Computer-Based Probabilistic-Network Construction, Doctoral Dissertation - Herskovits - 1991 |

25 | The ALARM monitoring system - Beinlich, Suermondt, et al. - 1989 |

25 | Random Generation of Bayesian Networks - Ide, Cozman - 2002 |

20 | The Logic of Representing Dependencies by Directed Graphs - Pearl, Verma - 1987 |
Citation Context: ...ates A and B in the moralized subgraph induced by the smallest ancestral set of A ∪ B ∪ S. The DGMP is the sharpest possible graphical criterion that permits reading CI restrictions from a given DAG (Pearl and Verma, 1987, Lauritzen et al., 1990). An alternative way of reading conditional independencies in a DAG is using the d-separation criterion of Pearl and Verma (1987), which we review now. A vertex vi in a path v0...

13 | Enumerating Markov equivalence classes of acyclic digraph models - Gillispie, Perlman - 2001 |
Citation Context: ...tive. [Figure 1: Cardinalities of DAG-space and EG-space; number of graphs (log scale, 1 to 1e+20) against number of vertices (1 to 10).] Observation 2.1 (Gillispie and Perlman, 2001) The average ratio of DAGs per equivalence class seems to converge to an asymptotic value smaller than 3.7. This was observed up to 10 vertices. 2. Frydenberg (1990) also proved it but under the addi...

11 | Optimizing causal orderings for generating DAGs from data - Bouckaert - 1992 |
Citation Context: ...some algorithms assume that a causal ordering between the variables is known (Cooper and Herskovits, 1992), or they search for a good causal ordering that may help in providing later a better result (Bouckaert, 1992, Singh and Valtorta, 1993, Larrañaga et al., 1996, Friedman and Koller, 2000). However, the causal ordering reduces the already small part of the inclusion boundary that was reachable from the NR and...

11 | A fast procedure for model search in multidimensional contingency tables - Havránek - 1987 |
Citation Context: ...tors for any given class of GMMs. In fact, the inclusion boundary condition has been implicitly taken into consideration by most of the learning algorithms for undirected and decomposable models (Havránek, 1984, Edwards and Havránek, 1985, Giudici and Green, 1999) and surprisingly ignored by most authors in the context of Bayesian networks. 4. Inclusion-driven structure learning In this section we describe ...

8 | Mambo: Discovering association rules based on conditional independencies - Castelo, Feelders, et al. - 2001 |

7 | Enumeration of labelled chain graphs and labelled essential directed acyclic graphs - Steinsky - 2003 |

7 | Influence diagrams and d-separation - Verma, Pearl - 1988 |

5 | Association Models for Web Mining - Giudici, Heckerman, et al. |

4 | On characterizing inclusion of Bayesian networks - Kočka, Bouckaert, et al. - 2001 |

4 | Graphical models: learning and application - Kočka - 2001 |
Citation Context: ...lies to every type of GMM. As we shall see throughout the paper, this concept is the key to understanding the relevance of the inclusion order in the learning task. Definition 3.1 Inclusion boundary (Kočka, 2001) Let M(H), M(L) be two GMMs determined by the graphs H and L. Let $M^I(H) \prec M^I(L)$ denote that $M^I(H) \subset M^I(L)$ and for no graph K, $M^I(H) \subset M^I(K) \subset M^I(L)$. The inclusion boundary of the GMM M(G...
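
One of the citation contexts above compares the cardinalities of DAG-space and EG-space for up to 10 vertices. The DAG-space side of that comparison can be reproduced with the standard recurrence for counting labeled acyclic digraphs, $a_n = \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} 2^{k(n-k)} a_{n-k}$ with $a_0 = 1$, usually attributed to Robinson. The sketch below is an illustration under that assumption and is not code from the paper; the function name `count_dags` is hypothetical. Counting essential graphs (EG-space) requires a different enumeration and is not attempted here.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled acyclic digraphs on n vertices (inclusion-exclusion recurrence)."""
    if n == 0:
        return 1
    total = 0
    for k in range(1, n + 1):
        # k vertices are chosen as sources (no incoming edges); each may point
        # to any of the remaining n - k vertices, giving 2^(k(n-k)) edge choices.
        sign = 1 if k % 2 == 1 else -1
        total += sign * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
    return total


for n in range(1, 6):
    print(n, count_dags(n))
```

For 1 to 5 vertices this prints 1, 3, 25, 543, and 29281, which shows how quickly DAG-space grows even for small domains.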