## Active Learning of Causal Bayes Net Structure (2001)

Citations: 37 (2 self)

### BibTeX

@TECHREPORT{Murphy01activelearning,
  author      = {Kevin P. Murphy},
  title       = {Active Learning of Causal Bayes Net Structure},
  institution = {},
  year        = {2001}
}

### Abstract

We propose a decision theoretic approach for deciding which interventions to perform so as to learn the causal structure of a model as quickly as possible. Without such interventions, it is impossible to distinguish between Markov equivalent models, even given infinite data. We perform online MCMC to estimate the posterior over graph structures, and use importance sampling to find the best action to perform at each step. We assume the data is discrete-valued and fully observed.

### Citations

1284 |
Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems (with Discussion)
- Lauritzen, Spiegelhalter
- 1988
Citation Context: ...other nodes. A test set consists of M = 20 such cases. To be comparable with [TK01], we test our algorithm on three commonly-used networks: the 5 node cancer network [FMR98], the 8 node Asia network [LS88], and the 12 node car trouble-shooter network [HBR94]. All networks have binary nodes with multinomial conditional probability distributions (CPDs). For the Asia network, we used the published paramet...

1117 |
Causality: Models, Reasoning, and Inference
- Pearl
- 2000
Citation Context: ...t the effect on Y of changing X). The only way to distinguish members of the same Markov equivalence class is to perform experiments. By "experiments" we mean ideal interventions in the sense of Pearl [Pea00], i.e., the learning agent can clamp a subset of the variables to fixed values. For example, in the genetics domain, an experiment might consist of "knocking out" a gene, which we can think of as clampi...

1075 | A Bayesian Method for the Induction of Probabilistic Networks from Data
- Cooper, Herskovits
- 1992
Citation Context: ...].) In other words, Q(G′ | G) = 1/|nbd(G)| for G′ ∈ nbd(G), and Q(G′ | G) = 0 for G′ ∉ nbd(G), so

R = |nbd(G)| P(G′) P(D | G′) / ( |nbd(G′)| P(G) P(D | G) )

where the marginal likelihood is given by [CH92]:

P(D | G) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} Γ(α_ij) / Γ(α_ij + N_ij) ∏_{k=1}^{r_i} Γ(α_ijk + N_ijk) / Γ(α_ijk)

The main advantage of this proposal distribution is that it is efficient to compute the Bayes factor P(D | G′)/P(...
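The closed-form marginal likelihood quoted from [CH92] above translates directly into log-space code. A minimal sketch (the function name and data layout are my own, not from the paper): it scores one node's family given its counts N_ijk and Dirichlet pseudo-counts α_ijk; summing the score over all families gives log P(D | G).

```python
from math import lgamma

def log_marginal_likelihood(counts, alpha):
    """Log-space Cooper-Herskovits score for one family (node i plus its
    parents).  counts[j][k] = N_ijk, the number of cases in which node i
    takes its k-th value while its parents are in their j-th configuration;
    alpha[j][k] = the matching Dirichlet pseudo-count (e.g. 1/(r_i * q_i)
    for the BDeu prior).  Returns log P(D_i | G)."""
    logp = 0.0
    for n_ij, a_ij in zip(counts, alpha):
        A, N = sum(a_ij), sum(n_ij)
        logp += lgamma(A) - lgamma(A + N)        # Gamma(a_ij) / Gamma(a_ij + N_ij)
        for n, a in zip(n_ij, a_ij):
            logp += lgamma(a + n) - lgamma(a)    # Gamma(a_ijk + N_ijk) / Gamma(a_ijk)
    return logp
```

Working in log space avoids the overflow the raw Gamma ratios would cause on even modest datasets; the Bayes factor mentioned in the excerpt is then just a difference of these scores for the one or two families that change.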

903 | Learning Bayesian networks: The combination of knowledge and statistical data
- Heckerman, Geiger, et al.
- 1995
Citation Context: ...s θ_ijk, the posterior mean parameters are given by

θ̂_ijk = E[θ_ijk | g, D] = (α_ijk + N_ijk) / Σ_{l=1}^{r_i} (α_ijl + N_ijl)

where N_ijk = Σ_{y∈D} 1_ijk(y). In this paper, we use the priors α_ijk = 1/(r_i q_i); [HGC95] call this the BDeu metric. This ensures that Markov equivalent models have the same marginal likelihood given observational data alone, unlike using α_ijk = 1. Since many of the graphs in G will shar...

890 |
Sequential Monte Carlo Methods in Practice
- Doucet, Freitas, et al.
- 2001
Citation Context: ...lgorithm as described above is appropriate for offline (batch) computation. However, we need to compute P(G | D_{1:t}) online (sequentially). We therefore combine the ideas of particle filtering (see e.g., [DdFG01]) with MH as follows. The belief state, P(G | D_{1:t}), is represented as a set of weighted particles (samples). When a new observation arrives, we apply a small number, B, of MH moves to each particle...
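The sequential scheme described in this excerpt can be sketched as a resample-move step. This is an assumption-laden reading of the excerpt (the helper names `loglik` and `mh_move` are mine, and the explicit resampling step is a standard addition the excerpt does not spell out):

```python
import math
import random

def sequential_update(particles, y_t, loglik, mh_move, B=5):
    """One online step for the belief state P(G | D_{1:t}), kept as a list
    of graph particles.  loglik(G, y) and mh_move(G) are caller-supplied
    (hypothetical) helpers: the first scores the new case y under G, the
    second performs one Metropolis-Hastings move in graph space.
    Reweight by the new case's likelihood, resample, then rejuvenate
    each particle with B MH moves."""
    weights = [math.exp(loglik(g, y_t)) for g in particles]
    resampled = random.choices(particles, weights=weights, k=len(particles))
    rejuvenated = []
    for g in resampled:
        for _ in range(B):
            g = mh_move(g)
        rejuvenated.append(g)
    return rejuvenated
```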

854 | A tutorial on learning with Bayesian networks - Heckerman - 1995 |

647 |
Queries and concept learning
- Angluin
- 1988
Citation Context: ...se any point in the input space. By contrast, in the query filtering paradigm, the learner can choose to see the label of certain items from a stream of inputs (see e.g., [FSST97]). In the PAC setting, [Ang88] showed how the ability to ask questions reduces the problem of identifying certain kinds of boolean functions from NP-complete to polynomial time. [BHH95] and [TR98] have extended this to active learn...

528 | Active learning with statistical models
- Cohn, Ghahramani, et al.
- 1996
Citation Context: ...n and non-Bayesian setting (see e.g., [CV95] for a review). There has also been work on active learning for non-linear regression models (e.g., neural networks [Mac92] and locally weighted regression [CGJ96]), where the objective is to minimize the expected variance of the predictor. In the above works, the active learner can choose any point in the input space. By contrast, in the query filtering paradigm...

334 | Selective sampling using the query by committee algorithm
- Freund, Seung, et al.
- 1997
Citation Context: ...ks, the active learner can choose any point in the input space. By contrast, in the query filtering paradigm, the learner can choose to see the label of certain items from a stream of inputs (see e.g., [FSST97]). In the PAC setting, [Ang88] showed how the ability to ask questions reduces the problem of identifying certain kinds of boolean functions from NP-complete to polynomial time. [BHH95] and [TR98] hav...

324 | Information-based objective functions for active data selection
- MacKay
- 1992
Citation Context: ...near regression model, in both a Bayesian and non-Bayesian setting (see e.g., [CV95] for a review). There has also been work on active learning for non-linear regression models (e.g., neural networks [Mac92] and locally weighted regression [CGJ96]), where the objective is to minimize the expected variance of the predictor. In the above works, the active learner can choose any point in the input space. By...

247 | Operations for learning with graphical models - Buntine - 1994 |

226 | Bayesian graphical models for discrete data
- Madigan, York
- 1995
Citation Context: ...s from P(G | D). The algorithm is summarized below, where Q(G′ | G) is the probability of proposing a move from G to G′, B is the burn-in period, and N is the number of samples we want to draw. (See [MY95] for details.)

Choose G_1 somehow, e.g., at random
For t = 1, ..., B + N
  Sample G′ ∼ Q(· | G_t)
  Compute R = P(G′ | D) Q(G_t | G′) / ( P(G_t | D) Q(G′ | G_t) )
  Sample u ∼ Unif(0, 1)
  If u < min{1, R} then G_t...
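The quoted MH loop is easy to state generically. In this sketch (my own generic form, not the paper's code), `log_post(G)` evaluates log P(G | D) up to a constant, `propose(G)` samples from Q(· | G), and `q_ratio(G, Gp)` returns Q(G | Gp)/Q(Gp | G), which for the neighbourhood proposal described elsewhere on this page is |nbd(G)|/|nbd(Gp)|:

```python
import math
import random

def metropolis_hastings(G1, log_post, propose, q_ratio, B, N):
    """Generic Metropolis-Hastings following the pseudocode quoted from
    [MY95]: burn in for B steps, then collect N samples from P(G | D)."""
    G, samples = G1, []
    for t in range(B + N):
        Gp = propose(G)
        # Acceptance ratio R = P(Gp|D) Q(G|Gp) / ( P(G|D) Q(Gp|G) )
        R = math.exp(log_post(Gp) - log_post(G)) * q_ratio(G, Gp)
        if random.random() < min(1.0, R):
            G = Gp
        if t >= B:
            samples.append(G)
    return samples
```

As a sanity check, running it on a two-state target with P(1) = 2·P(0) and a deterministic flip proposal yields a sample mean near 2/3.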

219 | Learning the Structure of Dynamic Probabilistic Networks
- Friedman
- 1998
Citation Context: ...dom values, and then sampling the other nodes. A test set consists of M = 20 such cases. To be comparable with [TK01], we test our algorithm on three commonly-used networks: the 5 node cancer network [FMR98], the 8 node Asia network [LS88], and the 12 node car trouble-shooter network [HBR94]. All networks have binary nodes with multinomial conditional probability distributions (CPDs). For the Asia networ...

217 |
Equivalence and synthesis of causal models
- Verma, Pearl
- 1990
Citation Context: ...nd X ← Y → Z are Markov equivalent, since they all represent X ⊥ Z | Y. In general, two graphs are Markov equivalent iff they have the same structure ignoring arc directions, and have the same v-structures [VP90]. (A v-structure consists of converging directed edges into the same node, such as X → Y ← Z.) Two BNs might be Markov equivalent and yet make different predictions about the consequences of interventions...
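The [VP90] criterion is mechanical to check. A small sketch (the dict-of-parent-sets representation is my own choice) that compares skeletons and v-structures:

```python
def skeleton(dag):
    """Undirected edge set of a DAG given as {node: set_of_parents}."""
    return {frozenset((u, v)) for v, parents in dag.items() for u in parents}

def v_structures(dag):
    """Converging pairs u -> v <- w with no edge between u and w."""
    skel = skeleton(dag)
    vs = set()
    for v, parents in dag.items():
        ps = sorted(parents)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                if frozenset((ps[i], ps[j])) not in skel:  # "unmarried" parents
                    vs.add((ps[i], ps[j], v))
    return vs

def markov_equivalent(g1, g2):
    """[VP90]: same skeleton and same v-structures."""
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)
```

For the chains in the excerpt, reversing every edge preserves equivalence, while introducing the collider X → Y ← Z breaks it.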

202 | Being Bayesian about network structure
- Friedman, Koller
- 2000
Citation Context: ...l) terms in the marginal likelihood ratio cancel. [MAPV96] suggested searching the (smaller) space of (Markov) equivalence classes of DAGs, but this is inappropriate when we have interventional data. [FK00] suggested searching the (even smaller) space of total orderings of the nodes, marginalizing out the actual structure. Although this converges much faster, we chose not to pursue this technique, since...

197 | Sequential updating of conditional probabilities on directed graphical structures - Spiegelhalter, Lauritzen - 1990 |

172 | Bayesian experimental design: a review
- CHALONER, VERDINELLI
- 1995
Citation Context: ...can be maximized in closed form. Many other results (such as A-optimality, G-optimality, etc.) have been derived for the linear regression model, in both a Bayesian and non-Bayesian setting (see e.g., [CV95] for a review). There has also been work on active learning for non-linear regression models (e.g., neural networks [Mac92] and locally weighted regression [CGJ96]), where the objective is to minimize...

101 | Hoeffding races: Accelerating model selection search for classification and function approximation
- Maron, Moore
- 1994
Citation Context: ...t is that, for action selection, it is the relative values of V(a) that matter. This idea has been exploited in [OK00] to reduce the number of samples used (see also the "Hoeffding races" approach of [MM93]). However, in this paper, we just use a fixed number of samples. 5 Results We compare the behavior of the active learner with two other algorithms: passive observation, and random interventions. We a...

85 | Discovery of regulatory interactions through perturbation: Inference and experimental design
- Ideker, Thorsson, et al.
- 2000
Citation Context: ...oolean functions, where the internal nodes are hidden. [AKMM98] have some results concerning upper and lower bounds on the number of experiments necessary to learn (possibly cyclic) boolean networks. [ITK00] discusses active learning techniques for learning boolean networks using an entropy-based cost function. The most closely related work is that of Tong and Koller [TK01], who also use a decision theor...

71 |
Expected information as expected utility
- Bernardo
- 1979
Citation Context: ...ow that maximizing log P(g | a, y, D) is equivalent to maximizing the expected KL divergence between our posterior and our prior, or equivalently, to minimizing the conditional entropy H(G | Y, a, D) [Ber79], which are perhaps more familiar criteria.

V = max_a E_Y KL( P(G | a, Y, D) || P(G | D) )
  = max_a Σ_y P(y | a, D) [ Σ_g P(g | a, y, D) log ( P(g | a, y, D) / P(g | D) ) ]

Since the denominator P(g | D) is independe...
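The expected-KL value of an action in this excerpt reduces to a double sum over outcomes y and graphs g. A sketch with explicit (hypothetical) dictionary inputs for the three distributions involved:

```python
import math

def value_of_action(p_y_given_a, p_g_given_ay, p_g):
    """V(a) = E_Y KL( P(G|a,Y,D) || P(G|D) ) from the formula above.
    p_y_given_a[y]     = P(y | a, D)
    p_g_given_ay[y][g] = P(g | a, y, D)
    p_g[g]             = P(g | D)"""
    V = 0.0
    for y, py in p_y_given_a.items():
        for g, pg_y in p_g_given_ay[y].items():
            if pg_y > 0:  # 0 * log 0 = 0 by convention
                V += py * pg_y * math.log(pg_y / p_g[g])
    return V
```

If the posterior equals the prior for every outcome, V(a) = 0; if the outcome fully identifies the graph among two equally likely candidates, V(a) = log 2 nats, i.e. one bit.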

63 | Identification of gene regulatory networks by strategic gene disruptions and gene overexpressions
- Akutsu, Kuhara, et al.
- 1998
Citation Context: ...tain kinds of boolean functions from NP-complete to polynomial time. [BHH95] and [TR98] have extended this to active learning of tree-structured boolean functions, where the internal nodes are hidden. [AKMM98] have some results concerning upper and lower bounds on the number of experiments necessary to learn (possibly cyclic) boolean networks. [ITK00] discusses active learning techniques for learning boole...

62 | Causal discovery from a mixture of experimental and observational data
- Cooper, Yoo
- 1999
Citation Context: ...ods to handle data obtained by interventional studies (in addition to the usual passive observational data). Specifically, we simply refrain from updating the parameters of the nodes that were clamped [CY99]. (The intuitive justification for this is that observing that a clamped node has a certain value does not tell us anything about how likely it is that that value would occur had we not forced it.) Wha...
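The [CY99] rule in this excerpt (do not update the parameters of clamped nodes) amounts to skipping clamped nodes when accumulating sufficient statistics. A minimal sketch with an assumed count layout (the data structures are mine, not from the paper):

```python
def update_counts(counts, case, clamped):
    """Increment the sufficient statistic N_ijk for every node's observed
    (parent-configuration, value) pair, EXCEPT for nodes clamped by the
    intervention: their values were forced, so they carry no information
    about the node's natural mechanism.
    counts[node] maps (parent_cfg, value) -> count."""
    for node, (parent_cfg, value) in case.items():
        if node in clamped:
            continue  # [CY99]: skip intervened-on nodes
        key = (parent_cfg, value)
        counts[node][key] = counts[node].get(key, 0) + 1
    return counts
```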

58 | A Bayesian approach for learning causal networks - Heckerman - 1995 |

55 | Active learning for structure in Bayesian networks
- Tong, Koller
- 2001
Citation Context: ...g that a clamped node has a certain value does not tell us anything about how likely it is that that value would occur had we not forced it.) What has not been studied (with the notable exception of [TK01], which we discuss in Section 6) is a way to decide which interventions to perform so as to learn the causal structure as quickly/cheaply as possible. This is the goal of this paper. We adopt a stan...

47 |
Improving Markov Chain Monte Carlo Model Search for Data Mining
- Giudici, Castelo
- 2003
Citation Context: ...be the set of all DAGs that differ from G by a single edge addition, deletion or reversal. (A way of quickly checking that the proposed graph is acyclic, based on the ancestor matrix, is described in [GC01].) In other words, Q(G′ | G) = 1/|nbd(G)| for G′ ∈ nbd(G), and Q(G′ | G) = 0 for G′ ∉ nbd(G), so R = |nbd(G)| P(G′) P(D | G′) / ( |nbd(G′)| P(G) P(D | G) ), where the marginal likelihood is given by [...

45 | Sequential update of Bayesian networks structure
- Friedman, Goldszmidt
- 1997
Citation Context: ...longer keep the whole dataset D, nor can we store the (expected) sufficient statistics for all possible models. One approach would be to keep the statistics just for a "fringe" of probable models, as in [FG97]. It is straightforward to adapt the above techniques to learn the structure of a dynamic Bayesian network (DBN) from time series data; cf. [FMR98]. If we only allow arcs between time-slices ("diachr...

41 | Troubleshooting under uncertainty
- Heckerman, Breese, et al.
- 1994
Citation Context: ...es. To be comparable with [TK01], we test our algorithm on three commonly-used networks: the 5 node cancer network [FMR98], the 8 node Asia network [LS88], and the 12 node car trouble-shooter network [HBR94]. All networks have binary nodes with multinomial conditional probability distributions (CPDs). For the Asia network, we used the published parameters. For the other networks, we sampled the CPD param...

38 | Bayesian model averaging and model selection for Markov equivalence classes of acyclic digraphs
- MADIGAN, ANDERSSON, et al.
- 1996
Citation Context: ...s proposal distribution is that it is efficient to compute the Bayes factor P(D | G′)/P(D | G), since all but one (or two, in the case of an edge reversal) terms in the marginal likelihood ratio cancel. [MAPV96] suggested searching the (smaller) space of (Markov) equivalence classes of DAGs, but this is inappropriate when we have interventional data. [FK00] suggested searching the (even smaller) space of tot...

25 |
An overview of the representation and discovery of causal relationships using Bayesian networks
- Cooper
- 1999
Citation Context: ...form formula is known for the number of DAGs on n nodes, f(n), but the first few values of f, for n = 1, ..., 10, are 1, 3, 25, 543, 29281, 3781503, 1.1 × 10^9, 7.8 × 10^11, 1.2 × 10^15 and 4.2 × 10^18 [Coo99]. A crude upper bound is O(2^{n^2}), the number of boolean matrices.

[Figure 1: An influence diagram for one-shot experiment design. The dotted line is an informational arc, and specifies that t...]
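The values of f(n) quoted from [Coo99] can be reproduced with Robinson's recurrence for the number of labelled DAGs (the recurrence itself is not in the excerpt): f(n) = Σ_{k=1}^{n} (−1)^{k+1} C(n,k) 2^{k(n−k)} f(n−k), with f(0) = 1.

```python
from math import comb

def num_dags(n):
    """Number of DAGs on n labelled nodes via Robinson's recurrence.
    The inclusion-exclusion runs over k, the number of nodes with no
    incoming edge; each such node may point to any of the other n-k
    nodes, giving the 2^(k*(n-k)) factor."""
    f = [1]  # f(0) = 1
    for m in range(1, n + 1):
        f.append(sum((-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * f[m - k]
                     for k in range(1, m + 1)))
    return f[n]
```

num_dags(10) has 19 digits, consistent with the 4.2 × 10^18 figure quoted above, and the super-exponential growth is why the excerpt resorts to MCMC rather than enumeration.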

10 | Learning boolean read-once formulas over generalized bases
- Bshouty, Hancock, Hellerstein
- 1995
Citation Context: ...ts (see e.g., [FSST97]). In the PAC setting, [Ang88] showed how the ability to ask questions reduces the problem of identifying certain kinds of boolean functions from NP-complete to polynomial time. [BHH95] and [TR98] have extended this to active learning of tree-structured boolean functions, where the internal nodes are hidden. [AKMM98] have some results concerning upper and lower bounds on the number o...

4 |
Learning from examples and membership queries with structured determinations
- Tadepalli, Russell
- 1998
Citation Context: ..., [FSST97]). In the PAC setting, [Ang88] showed how the ability to ask questions reduces the problem of identifying certain kinds of boolean functions from NP-complete to polynomial time. [BHH95] and [TR98] have extended this to active learning of tree-structured boolean functions, where the internal nodes are hidden. [AKMM98] have some results concerning upper and lower bounds on the number of experimen...

3 |
Decision analysis by augmented probability simulation
- Bielza, Müller, Ríos Insua
- 1999
Citation Context: ...the genetics domain, we might be able to clamp a node to its "wildtype" (mean) value μ, to "overexpress" it (clamp it to μ + σ, where σ is the standard deviation), or to "underexpress" it (clamp it to μ − σ). (See [BMI99] for an interesting MCMC approach to selecting continuous-valued actions.) 4.4 How many samples? The question of how many samples we need to take is an interesting one. The key insight is that, for ac...

2 |
Learning Bayesian networks: a unification for discrete and Gaussian domains
- Heckerman, Geiger
- 1995
Citation Context: ...les, we plan to use linear-Gaussian CPDs, possibly with non-linear basis functions, as discussed above. (The use of such nonlinearities makes this different from the global jointly Gaussian approach of [HG95].) For missing data, we plan to use sampling (data augmentation). The actions might now also consist of choosing to measure a hidden variable, as in classical value-of-information computations. For on...

1 |
Sampling methods for action selection in influence diagrams
- Ortiz, Kaelbling
- 2000
Citation Context: ...The question of how many samples we need to take is an interesting one. The key insight is that, for action selection, it is the relative values of V(a) that matter. This idea has been exploited in [OK00] to reduce the number of samples used (see also the "Hoeffding races" approach of [MM93]). However, in this paper, we just use a fixed number of samples. 5 Results We compare the behavior of the active...