## Learning Bayes net structure from sparse data sets (2001)

Citations: 15 (2 self)

### BibTeX

@TECHREPORT{Murphy01learningbayes,

author = {Kevin P. Murphy},

title = {Learning Bayes net structure from sparse data sets},

institution = {},

year = {2001}

}

### Abstract

There are essentially two kinds of approaches for learning the structure of Bayesian Networks (BNs) from data. The first approach tries to find a graph which satisfies all the constraints implied by the empirical conditional independencies measured in the data [PV91, SGS00a, Shi00]. The second approach searches through the space of models (either DAGs or PDAGs), and uses some scoring metric (typically Bayesian or some approximation, such as BIC/MDL) to evaluate the models [CH92, Hec95, Hec98, Kra98], typically returning the highest scoring model found. Our main interest is in learning BN structure from gene expression data [FLNP00, HGJY01, MM99, SGS00b]. In domains such as this, where the ratio of the number of observations to the number of variables is low (i.e., when we have sparse data), selecting a threshold for the conditional independence (CI) tests can be tricky, and repeated use of such tests can lead to inconsistencies [DD99]. Bayesian s...

### Citations

7634 |
Probabilistic Reasoning in Intelligent Systems
- Pearl
- 1988
Citation Context: ...$1/(1 + e^{-z})$ is the sigmoid (logistic) function, $w_{ij}$ is the weight on the arc from $X_j$ to $X_i$, and u is a bit-vector representing the parents' values. A closely related model is the noisy-OR function [Pea88]. In this case, the child is "on" if any of its parents are on, provided not all the "links" from the on parents are "broken". Define $q_{ij}$ as the probability that the link from $X_j$ to $X_i$ fails. (Failu...

5536 | Neural Networks for Pattern Recognition - Bishop - 1995 |

1460 |
Statistical Decision Theory and Bayesian Analysis. 2nd edn
- Berger
- 1985
Citation Context: ...meter independence, plus an additional assumption called likelihood equivalence, imply that the prior must be Dirichlet. Fortunately, the Dirichlet prior is the conjugate prior for the multinomial [Ber85], which makes analysis easier, as we will see below. (For this reason, the Dirichlet is often used even if the assumption of likelihood equivalence is violated.) Note that, in the case of binary nodes...

1326 | Causality: Models, Reasoning and Inference - Pearl - 2000 |

1221 | Bayesian Theory - Bernardo, Smith - 1994 |

1153 | A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9
- Cooper, Herskovits
- 1992
Citation Context: ...rameters (Equation 8), and plug these expected values into the sample likelihood equation: $P(D) = P(D \mid \bar\theta) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \bar\theta_{ijk}^{N_{ijk}}$ (11). Alternatively, this can be written as follows [CH92]: $P(D) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{B(\alpha_{ij1}+N_{ij1}, \ldots, \alpha_{ijr_i}+N_{ijr_i})}{B(\alpha_{ij1}, \ldots, \alpha_{ijr_i})} = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij}+N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk}+N_{ijk})}{\Gamma(\alpha_{ijk})}$ (12). For interve...
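The [CH92] closed form in the excerpt above (Equation 12) is straightforward to compute in log-space with the log-Gamma function. A minimal sketch, assuming a per-node layout where `alpha[j][k]` and `N[j][k]` hold the Dirichlet hyperparameters and observed counts for parent configuration j and value k (illustrative names, not from the paper):

```python
# Log of prod_j B(alpha_j + N_j) / B(alpha_j) for one node's CPT,
# i.e. the family term of Equation 12, computed stably in log-space.
from math import lgamma

def log_family_score(alpha, N):
    total = 0.0
    for a_j, n_j in zip(alpha, N):  # one row per parent configuration j
        total += lgamma(sum(a_j)) - lgamma(sum(a_j) + sum(n_j))
        total += sum(lgamma(a + n) - lgamma(a) for a, n in zip(a_j, n_j))
    return total
```

For a binary node with uniform prior (1, 1) and counts (2, 1), this returns log(1/12), the Polya-urn probability of that particular data sequence.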

1004 |
An introduction to Bayesian networks
- Jensen
- 1996
Citation Context: ...[Fri98]. In this case, instead of using the MAP value $\hat\theta_G$, he approximately integrates over all $\theta$. Note that G' might have families that are not in G. If we are using the junction tree algorithm [Jen96] for inference, this means we may have to compute joint probability distributions on sets of nodes that are not in any clique, which can be slow. 3.4 Searching over orderings: [FK00] claim that MCMC...

978 | Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination. Biometrika
- Green
- 1995
Citation Context: ...of these methods are applicable if we have experimental (interventional) as well as observational data. 3.6 Reversible jump MCMC: We mention, just for completeness, the reversible jump MCMC algorithm [Gre98]. This is necessary when the state space has variable dimension, as occurs when estimating parameters as well as model structure (since the number of parameters varies with the structure). For an appl...

974 | Learning Bayesian networks: The combination of knowledge and statistical data
- Heckerman, Geiger, et al.
- 1995
Citation Context: ...nal likelihood or Ockham factor [Gul88]. This matches our intuition that we trust a constrained (but correct!) model more than one that can predict anything. ...counting the number of mismatched edges [HGC95]. (For example, in [ITR + 01], they constructed a network based on databases listing known protein-protein and protein-DNA interactions.) However, most of the time, we have some "fuzzy" prior knowledg...

841 | Using Bayesian networks to analyze expression data. Journal of Computational Biology - Friedman, Linial, et al. - 2000 |

677 |
The Calculation of Posterior Distributions by Data Augmentation
- Tanner, Wong
- 1987
Citation Context: ...between sampling a new model given the current completed data set, and sampling a completed data set given the current model. (This is basically an extension of the IP algorithm for data augmentation [TW87].) At a high level, the algorithm cycles through the following steps (where Y is the observed data and Z is the hidden data): 1. Sample $G^{t+1} \sim P(G \mid Y, Z^t) \propto P(G) P(Y, Z^t \mid G)$. 2. Compute $P(\theta^{t+1} \mid ...$

661 |
Markov Chain Monte Carlo in Practice
- Richardson, Spiegelhalter, et al.
- 1996
Citation Context: ...because there are a superexponential number of graphs. To avoid this intractability, we plan to use MCMC (Markov Chain Monte Carlo) techniques to search the very large space of possible models (see e.g., [GRS96] for an introduction to MCMC). Specifically, we plan to use the Metropolis-Hastings (MH) algorithm, which only requires that we be able to compute the posterior odds between the current candidate model...

409 | Marginal likelihood from the Gibbs output - Chib - 1995 |

271 | Bayesian Graphical Models for Discrete Data
- Madigan, York
- 1995
Citation Context: ...evidences, $P(D \mid G_2)/P(D \mid G_1)$, is called the Bayes factor, and is the Bayesian equivalent of the likelihood ratio test. The idea of applying the MH algorithm to graphical models was first proposed in [MY95], who called the technique MC³, for MCMC Model Composition. The basic idea is to construct a Markov chain whose state space is the set of all DAGs and whose stationary distribution is $P(G \mid D)$. We ac...
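A single MH move of the MC³ flavor described in the excerpt can be sketched as follows; `log_score` stands in for log P(G) + log P(D|G), and `propose` for a single-edge change that also returns its log proposal ratio. Both are placeholders for illustration, not [MY95]'s implementation:

```python
import math
import random

def mh_step(G, log_score, propose):
    """One Metropolis-Hastings step over graph structures."""
    G_new, log_q_ratio = propose(G)  # candidate graph and log q(G|G')/q(G'|G)
    log_alpha = log_score(G_new) - log_score(G) + log_q_ratio
    # Accept with probability min(1, alpha); note only the posterior
    # *odds* between the two candidate models are ever needed.
    if log_alpha >= 0 or random.random() < math.exp(log_alpha):
        return G_new
    return G
```

Run for T iterations from any starting DAG, the chain's visit frequencies approximate P(G|D), provided the proposal gives non-zero probability to all edge changes.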

254 | Operations for learning with graphical models - Buntine - 2004 |

246 | Learning Bayesian networks with local structure
- Friedman, Goldszmidt
- 1998
Citation Context: ...; however, its accuracy on small samples has yet to be determined. A representation for CPDs for discrete nodes which is of variable complexity, ranging from O(1) to O(2^k) parameters, is the tree. [FG96b] show how allowing such "local structure" can enable the learning of denser global graph structures, without overfitting. For continuous-valued nodes, the most widely used CPD is linear-Gaussian: $P(X...$

223 | Being Bayesian about Network Structure
- Friedman, Koller
- 2000
Citation Context: ...tree algorithm [Jen96] for inference, this means we may have to compute joint probability distributions on sets of nodes that are not in any clique, which can be slow. 3.4 Searching over orderings: [FK00] claim that MCMC over structures does not mix well for large models (more than 10 variables, say). Instead they use MCMC to search over variable orderings. Given a total ordering, the likelihood dec...

223 | The Bayesian structural EM algorithm
- Friedman
- 1998
Citation Context: ...model averaging, and hence the above formula was embedded inside of the hill-climbing algorithm in Figure 3; the resulting algorithm is shown in Figure 5. A "Bayesian" version of this was proposed in [Fri98]. In this case, instead of using the MAP value $\hat\theta_G$, he approximately integrates over all $\theta$. Note that G' might have families that are not in G. If we are using the junction tree algorithm [Jen96...

223 |
Equivalence and synthesis of causal models
- Verma, Pearl
- 1990
Citation Context: ...X→Y→Z, X←Y←Z and X←Y→Z are Markov equivalent, since they all represent X ⊥ Z | Y. In general, two graphs are Markov equivalent iff they have the same structure ignoring arc directions, and the same v-structures [VP90]. (A v-structure consists of converging directed edges into the same node, such as X→Y←Z.) We can only distinguish members of the same equivalence class if we have interventional (experimental) data [...
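The [VP90] criterion quoted in the excerpt (same skeleton, same v-structures) is direct to check. A minimal sketch, with DAGs encoded as dicts mapping each node to its set of parents (an illustrative encoding, not from the paper):

```python
def skeleton(dag):
    """Set of undirected edges of a DAG given as {node: set_of_parents}."""
    return {frozenset((p, c)) for c, ps in dag.items() for p in ps}

def v_structures(dag):
    """Triples (p, child, q) with p -> child <- q and p, q non-adjacent."""
    vs, skel = set(), skeleton(dag)
    for child, parents in dag.items():
        for p in parents:
            for q in parents:
                if p < q and frozenset((p, q)) not in skel:
                    vs.add((p, child, q))
    return vs

def markov_equivalent(g1, g2):
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)
```

For example, the chains X→Y→Z and X←Y←Z come out equivalent, while the collider X→Y←Z does not, matching the excerpt.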

218 | A theory of inferred causation - Pearl, Verma - 1991 |

211 | A variational bayesian framework for graphical models - Attias - 2000 |

208 |
Sequential updating of conditional probabilities on directed graphical structures
- Spiegelhalter, Lauritzen
- 1990
Citation Context: ...e the same parameters. (Parameter nodes are shown inside dotted circles.) If all the data is observed, and the parameter priors are independent, then the parameter posteriors will also be independent [SL90]. This means the marginal likelihood decomposes into a product of terms, one for each node: $P(D \mid G) = \prod_i \mathrm{score}(X_i, \mathrm{Pa}_G(X_i) \mid D)$, where $\mathrm{score}(X_i, U \mid D) = \int \prod_{m=1}^{N} P(x_i[m] \mid u_i[m], \theta_i)\, P(\theta_i)\, d\theta_i$...

201 |
Bayesian analysis in expert systems
- Spiegelhalter, Dawid, et al.
- 1993
Citation Context: ...$G_c$: see [HGC95] for a discussion). In addition, computing the parameter priors for an arbitrary graph structure from such a prior network requires running an inference algorithm, which can be slow. [SDLC93] suggest a similar way of computing Dirichlet priors from a prior network. A much simpler alternative is to use a non-informative prior. A natural choice is $\alpha_{ijk} = 0$, which corresponds to maximum lik...

195 | Learning Bayesian Network Structure from Massive Datasets: The Sparse Candidate Algorithm
- Friedman, Nachman, et al.
- 1999
Citation Context: ...we expect the fan-in to be small). This reduces the number of parent sets we need to evaluate from $O(2^n)$ to $\binom{n}{k} \le n^k$. Some heuristics for choosing the set of k potential parents are given in [FNP99]. As long as we give non-zero probability to all possible edge changes in our proposal, we are guaranteed to get the correct answer (since we can get from any graph to any other by single edge changes...

181 | Modelling gene expression data using dynamic bayesian networks - Murphy, Mian - 1999 |

177 | Bayesian Model Averaging: A Tutorial
- Raftery, Volinsky
- 1999
Citation Context: ...hese issues, it does not focus on methods which are suitable for small data sets. The last problem is unique to Bayesian model averaging. Although there have been many general tutorials on BMA (e.g., [HMRV99]), few discuss BNs in any detail. This tutorial aims to fill in these gaps. 2 Priors. 2.1 Parameter priors: Following common practice, we will assume global parameter independence: $P(\theta \mid G) = \prod_{i=1}^{n} P(\theta_i \mid G)$...

151 | Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks
- Hartemink
- 2001
Citation Context: ...[Wel90].) We can encode this prior knowledge using a diagonal Gaussian prior on the weight vector $w_i$, with a mean at +1 for excitatory links, −1 for inhibitory links, and at 0 for links of unknown sign. [HGJY01] suggest modeling prior knowledge of signs with constrained Dirichlet distributions, but this requires numerical integration. See also [DvdG95, WJ00] for ways of imposing constraints on parameter prio...
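The diagonal Gaussian sign-prior described in the excerpt can be sketched as below; the '+' / '-' / '?' labels and the shared variance are assumptions for illustration, not [HGJY01]'s encoding:

```python
import math

# Prior mean per edge: +1 excitatory, -1 inhibitory, 0 unknown sign
# ('+', '-', '?' is a hypothetical label encoding).
SIGN_MEAN = {'+': 1.0, '-': -1.0, '?': 0.0}

def log_prior(w, signs, var=1.0):
    """Log-density of weight vector w under N(mu, var*I), with mu
    built from the per-edge sign labels."""
    total = 0.0
    for w_j, s in zip(w, signs):
        mu = SIGN_MEAN[s]
        total += -0.5 * math.log(2 * math.pi * var) - (w_j - mu) ** 2 / (2 * var)
    return total
```

A weight of +1 on an edge labeled excitatory then scores 1/(2·var) higher in log-prior than the same weight on an edge of unknown sign, gently biasing the search without forbidding anything.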

146 |
Efficient Metropolis jumping rules
- Gelman, Roberts, et al.
- 1994
Citation Context: ...structures, as in genetic programming. If one is not sure which proposal to use, one can always create a mixture distribution. The weights of this mixture are parameters that have to be tuned by hand. [GRG96] suggest that the kernel should be designed so that the average acceptance rate is 0.25. A natural way to speed up mixing is to reduce the size of the search space. Suppose that, for each node, we res...

140 | Estimating a Dirichlet distribution
- Minka
- 2003
Citation Context: ...hierarchical priors, and must resort to sampling, as in [GGT00, DF99]. A more efficient method, known as empirical Bayes or maximum likelihood type II, is to estimate the hyperparameters from data: see [Min00a] for details. A.3 Other CPDs in the exponential family: Many CPDs in the exponential family (e.g., multinomial, Gaussian) can be given a conjugate prior (Dirichlet, Normal-Wishart), for which the corre...

135 | Learning equivalence classes of Bayesian network structures
- Chickering
- 1996
Citation Context: ...of all DAGs. (Constraint-based algorithms [SGS00a] only work with essential graphs.) The Bayesian scoring metric can only be applied to DAGs; hence one has to convert the PDAG to a DAG to evaluate it [Chi96]. MCMC methods for finding high-scoring PDAGs are discussed in [MAPV95]. Hybrid methods, that start with constraint-based methods and then switch to greedy search using a Bayesian evaluation metric, are...

126 | Simplifying neural networks by soft weight sharing - Nowlan, Hinton - 1992 |

124 | Fundamental concepts of qualitative probabilistic networks
- Wellman
- 1990
Citation Context: ...or genetic networks, we often have prior knowledge about the "sign" of a connection, i.e., whether it is excitatory or inhibitory. (This is a special case of a qualitative probabilistic network (QPN) [Wel90].) We can encode this prior knowledge using a diagonal Gaussian prior on the weight vector $w_i$, with a mean at +1 for excitatory links, −1 for inhibitory links, and at 0 for links of unknown sign. [HG...

119 | Learning gaussian networks - Geiger, Heckerman - 1994 |

112 | Bayesian parameter estimation via variational methods
- Jaakkola, Jordan
Citation Context: ...meter priors. A major disadvantage of the logistic (and related) CPD is the inability to compute the marginal likelihood (Equation 2) exactly. One possible (variational) approximation is discussed in [JJ00]; however, its accuracy on small samples has yet to be determined. A representation for CPDs for discrete nodes which is of variable complexity, ranging from O(1) to O(2^k) parameters, is the tree. [...

97 | Discovery of regulatory interaction through perturbation: inference and experimental design
- Ideker, Thorsson, et al.
- 2000
Citation Context: ...but genes $X_4$ and $X_5$ do not, it suggests that $X_1$ is the ancestor of $X_2$ and $X_3$. This heuristic, together with the set covering algorithm, was used to learn boolean networks from interventional data [ITK00]. A Multinomial distributions and Dirichlet priors: For discrete nodes, it is very common to assume the local CPDs are multinomial, i.e., represented as a table of the form $\Pr(X_i = k \mid \Pi_i = j) = \theta_{ijk}$...

83 |
Bayesian inductive inference and maximum entropy
- Gull
- 1988
Citation Context: ...ations from this initial model by... (Such a structural penalty is not strictly necessary, since sparse networks will have fewer free parameters, and hence a larger marginal likelihood or Ockham factor [Gul88]. This matches our intuition that we trust a constrained (but correct!) model more than one that can predict anything.) ...counting the number of mismatched edges [HGC95]. (For example, in [ITR + 01], t...

79 | Bayesian network structures from data - Singh, Valtorta |

71 | Cause and Correlation in Biology. A User’s Guide to Path Analysis, Structural Equations and Causal Inference - Shipley - 2004 |

70 | Structure learning in conditional probability models via an entropic prior and parameter extinction
- Brand
- 1999
Citation Context: ...as not seen in the training data. If we set $0 \le \alpha_{ijk} \le 1$, we encourage the parameter values $\theta_{ijk}$ to be near 0 or 1, thus encoding near-deterministic distributions. This might be desirable in some domains. [Bra99] explicitly encodes this bias using an "entropic prior" of the form $P(\theta_{ij}) \propto e^{-H(\theta_{ij})} = \prod_k \theta_{ijk}^{\theta_{ijk}}$. Unfortunately, the entropic prior is not a conjugate prior. [CH92] suggest the uniform p...

70 | Discretization of continuous attributes while learning Bayesian networks
- Friedman, Goldszmidt
- 1996
Citation Context: ...se discretizing (binning) the data, and then using CPDs for discrete nodes. To reduce artifacts, we might be able to perform the discretization process simultaneously with the structure learning, cf. [FG96a]. 2.2 Structure priors: For domains in which we have little prior knowledge, it is common to use a uniform prior over possible models. Alternatively, we can impose penalties based on the number of arcs...

64 | Causal discovery from a mixture of experimental and observational data
- Cooper, Yoo
- 1999
Citation Context: ...$\prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk}+N_{ijk})}{\Gamma(\alpha_{ijk})}$ (12). For interventional data, Equation 12 is modified by defining $N_{ijk}$ to be the number of times $X_i = k$ is passively observed in the context $\Pi_i = j$, as shown by [CY99]. (The intuition is that setting $X_i = k$ does not tell us anything about how likely this event is to occur "by chance", and hence should not be counted.) Hence, in addition to D, we need to keep a rec...
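The [CY99] counting rule in the excerpt — only passively observed values of X_i contribute to N_ijk — can be sketched as follows; the tuple-per-case data layout and the `intervened` record are illustrative assumptions, not the paper's data structures:

```python
from collections import Counter

def passive_counts(i, data, intervened, parent_config):
    """Count the values of X_i over cases whose parents match
    parent_config, skipping cases where X_i itself was clamped.

    data[m][i]     -- value of X_i in case m
    intervened[m]  -- set of node indices set by intervention in case m
    parent_config  -- predicate selecting cases with Pi_i = j
    """
    counts = Counter()
    for m, case in enumerate(data):
        if i in intervened[m]:
            continue  # X_i was set externally: uninformative about theta_i
        if parent_config(case):
            counts[case[i]] += 1
    return counts
```

These counts then drop into Equation 12 in place of the ordinary N_ijk, which is why the record of which nodes were intervened on in each case must be kept alongside D.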

58 | Markov chain monte carlo model determination for hierarchical and graphical log-linear models - Dellaportas, Forster - 1999 |

54 | Elicitation of probabilities for belief networks: Combining qualitative and quantitative information - Druzdzel, van der Gaag - 1995 |

51 |
Improving Markov chain Monte Carlo model search for data mining
- Giudici, Castelo
- 2003
Citation Context: ...thms is discussed in [GRS96]. The number of samples needed after reaching convergence depends on how rapidly the chain "mixes" (i.e., moves around the posterior distribution). To get a ballpark figure, [GC01] use MC³ to find a distribution over the 3,781,503 DAGs with 6 binary nodes (of course, many of these are Markov equivalent), using a fully observed dataset with 1,841 cases. They used T = 100,000 ite...

41 | Learning Probabilistic Networks - Krause - 1998 |

39 | Bayesian model averaging and model selection for Markov equivalence classes of acyclic digraphs
- Madigan, Andersson, et al.
- 1996
Citation Context: ...sential graphs.) The Bayesian scoring metric can only be applied to DAGs; hence one has to convert the PDAG to a DAG to evaluate it [Chi96]. MCMC methods for finding high-scoring PDAGs are discussed in [MAPV95]. Hybrid methods, that start with constraint-based methods and then switch to greedy search using a Bayesian evaluation metric, are discussed in [SV93, SM95, DD99]. None of these methods are applicabl...

32 | Constructing Bayesian network models of gene expression networks from microarray data - Spirtes, C, et al. - 2000 |

31 |
Efficient approximations for the marginal likelihood of incomplete data given a Bayesian network
- Chickering, Heckerman
- 1996
Citation Context: ...$\frac{P(\theta_G \mid G)}{P(\theta_G \mid D, G)}$. Computing $P(\theta_G \mid G)$ is trivial, and computing $P(D \mid \theta_G, G)$ can be done using any BN inference algorithm. The denominator can be approximated using Gibbs sampling: see [CH97] for details. (See also [SNR00] for a related approach, based on the harmonic mean estimator.) Various large sample (e.g., Laplace) approximations to the marginal likelihood, which are computationally...

25 |
Learning Bayesian networks in the presence of missing values and hidden variables
- Friedman
- 1997
Citation Context: ...We therefore assume we have some good initial guess based on domain knowledge. 3.3.3 Structural EM: A deterministic approximation to the data augmentation scheme, called Structural EM, was proposed in [Fri97]. The basic idea is to compute the expected complete-data marginal likelihood, $P(Y \mid G') \approx E_Z\left[ P(Y, Z \mid G') \mid G, Y, \hat\theta_G \right]$, using a BN inference algorithm applied to the current model G and its cu...

24 | Hypothesis testing and model selection via posterior simulation - Raftery - 1996 |