## The Posterior Probability of Bayes Nets with Strong Dependences (1999)

Venue: Soft Computing

Citations: 14 (1 self)

### BibTeX

@ARTICLE{Kleiter99theposterior,
  author  = {Gernot D. Kleiter},
  title   = {The Posterior Probability of Bayes Nets with Strong Dependences},
  journal = {Soft Computing},
  year    = {1999},
  volume  = {3},
  pages   = {162--173}
}

### Abstract

Stochastic independence is an idealized relationship located at one end of a continuum of values measuring degrees of dependence. In modeling real-world systems, we are often not interested in the distinction between exact independence and any degree of dependence, but between weak, ignorable dependence and strong, substantial dependence. Good models map significant deviance from independence and neglect approximate independence or dependence weaker than a noise threshold. This intuition is applied to learning the structure of Bayes nets from data. We determine the conditional posterior probabilities of structures given that the degree of dependence at each of their nodes exceeds a critical noise level. Deviance from independence is measured by mutual information. Arc probabilities are determined by the probability that the amount of mutual information the neighbors contribute to a node is greater than a critical minimum deviance from independence. A χ² approximation for the probability density function of mutual info...
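As a rough illustration of the abstract's central quantity, the sketch below (plain Python; the function name and the contingency-table counts are made up for illustration, not taken from the paper) computes the empirical mutual information of a two-way table:

```python
import math

def mutual_information(table):
    """Empirical mutual information (in nats) of a 2-D contingency table."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, nij in enumerate(row):
            if nij:
                # p_ij * log(p_ij / (p_i+ * p_+j)), rewritten in raw counts
                mi += (nij / n) * math.log(nij * n / (row_tot[i] * col_tot[j]))
    return mi

# A table consistent with independence gives MI = 0; a strongly
# dependent table gives a clearly positive MI.
print(mutual_information([[50, 50], [50, 50]]))  # 0.0
print(mutual_information([[90, 10], [10, 90]]))
```

In the paper's setting such an MI value would then be compared against a critical noise threshold before an arc is accepted.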

### Citations

8564 |
Elements of Information Theory
- Cover, Thomas
- 2003
Citation Context: ...tly be known either. In our model θ_ij, θ_i+, and θ_+j follow Dirichlet distributions. In [29] we proposed a χ² approximation for the distribution of the mutual information that is slightly improved here. It is well known [9] that the relative entropy distance can be approximated by χ². This leads to the mean of the distribution. The variance was obtained by heuristic numerical methods. Let X_1 and X_2 be two discrete ran...
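The χ² approximation of relative entropy mentioned in this context can be checked numerically: under independence, the G statistic 2N·I(X;Y) (mutual information in nats) is asymptotically χ²-distributed. The simulation below (a sketch with hypothetical sample sizes, not the paper's own experiment) verifies that its mean is near 1, the mean of a χ² variate with 1 degree of freedom:

```python
import math
import random

def g_statistic(table):
    """G = 2 * N * I(X;Y): twice the sample size times the empirical
    mutual information (nats) of a contingency table."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    g = 0.0
    for i, row in enumerate(table):
        for j, nij in enumerate(row):
            if nij:
                g += 2.0 * nij * math.log(nij * n / (row_tot[i] * col_tot[j]))
    return g

random.seed(0)
n_samples, trials = 400, 2000
vals = []
for _ in range(trials):
    # Two independent fair binary variables: G should behave like a
    # chi-square variate with (2-1)*(2-1) = 1 degree of freedom.
    table = [[0, 0], [0, 0]]
    for _ in range(n_samples):
        table[random.random() < 0.5][random.random() < 0.5] += 1
    vals.append(g_statistic(table))

mean_g = sum(vals) / trials
print(mean_g)  # close to 1.0, the mean of chi-square(1)
```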

2307 |
Estimating the dimension of a model
- SCHWARZ
- 1978
Citation Context: ...ies of the observed cases into the conditional probability tables of m. The asymptotic marginal likelihood function is called the Bayesian Information Criterion (BIC). It was first derived by Schwarz [41] as an asymptotic approximation of the posterior distribution. The criterion is equivalent to the MDL. This is an important relationship showing that different approaches can converge to equivalent ...

1792 | Random graphs
- Bollobas
- 2001
Citation Context: ...more detail we will discuss the function of arc probabilities in Bayesian networks. 4 Random Directed Acyclic Graphs. While several different kinds of random graphs are distinguished in the literature [3, 22] we consider only random graphs in which the nodes are fixed (non-random) and the edges are random. Usually, random graphs are introduced with undirected graphs. Let V be a set of n (finite labelled) ...

1131 | Algorithmic Graph Theory and Perfect Graphs - Golumbic - 2004 |

1075 | Herskovitz: A Bayesian Method for the Induction
- Cooper, E
- 1992
Citation Context: ... and the same holds for frequentist confidence regions. The most prominent approach to the identification of Bayes nets is the Bayesian version of hypothesis testing introduced by Cooper & Herskovits [11] and further investigated by Heckerman, Geiger, & Chickering [17] and Heckerman [15]. A tutorial is provided by Heckerman [16]. A treatment of Bayesian hypothesis testing from the perspective of Mar...

903 | Learning Bayesian networks: The combination of knowledge and statistical data
- Heckerman, Geiger, et al.
- 1995
Citation Context: ...ominent approach to the identification of Bayes nets is the Bayesian version of hypothesis testing introduced by Cooper & Herskovits [11] and further investigated by Heckerman, Geiger, & Chickering [17] and Heckerman [15]. A tutorial is provided by Heckerman [16]. A treatment of Bayesian hypothesis testing from the perspective of Markov Chain Monte Carlo methods is given by Raftery [37]. As the meth...

854 | A tutorial on learning with bayesian networks
- Heckerman
- 1995
Citation Context: ...ical problem. Intuitively, we want to find those structures that, based on the observed data, are well justified. For a review of the literature and tutorials on learning probabilistic networks see [6, 15, 16]. This paper presents a new method to extract structures from frequency data and to evaluate their probability. We select networks containing strong links and substantial deviance from independence. T...

637 | Approximating discrete probability distributions with dependence trees
- Chow, Liu
- 1968
Citation Context: ...employed to approximate highly multidimensional joint probability distributions that are too complex to be processed or stored directly. An example of this approach is the seminal paper by Chow & Liu [10] in which the authors approximated a multidimensional probability distribution by a tree structure. They determined the mutual information at each of the branches of the tree and showed that a maximum...
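The Chow & Liu procedure described here (maximize the total mutual information over tree structures) can be sketched as a maximum-weight spanning tree over pairwise MI. The following is a minimal illustration with made-up data, not the authors' implementation:

```python
import math
from itertools import combinations

def pairwise_mi(data, i, j):
    """Empirical mutual information (nats) between columns i and j."""
    n = len(data)
    cij, ci, cj = {}, {}, {}
    for row in data:
        a, b = row[i], row[j]
        cij[(a, b)] = cij.get((a, b), 0) + 1
        ci[a] = ci.get(a, 0) + 1
        cj[b] = cj.get(b, 0) + 1
    return sum(c / n * math.log(c * n / (ci[a] * cj[b]))
               for (a, b), c in cij.items())

def chow_liu_tree(data, n_vars):
    """Maximum-weight spanning tree on pairwise mutual information
    (Kruskal's algorithm with union-find)."""
    edges = sorted(((pairwise_mi(data, i, j), i, j)
                    for i, j in combinations(range(n_vars), 2)), reverse=True)
    parent = list(range(n_vars))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j, w))
    return tree

# Hypothetical data: X1 and X2 always agree; X0 follows X1 with some noise.
data = [(0, 0, 0)] * 40 + [(1, 1, 1)] * 40 + [(0, 1, 1)] * 5 + [(1, 0, 0)] * 5
tree = chow_liu_tree(data, 3)
print(tree)  # the strongest edge, between variables 1 and 2, is picked first
```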

533 |
Probability and statistics
- DEGROOT, SCHERVISH
- 2004
Citation Context: ...sted under a continuous prior distribution." ([38], p. 184). "In order to take seriously the problem of testing a point hypothesis, one must use a prior distribution in which Pr(Θ = θ₀) > 0." ([40], p. 221) Jeffreys [19] proposed to assign a prior probability (a point mass) to the null hypothesis and distribute the remaining probability mass on the remaining parameter space. "Alternatively, one...

526 |
Theory of probability
- Jeffreys
- 1961
Citation Context: ...y of accepting H₀ when in fact it is false (β, type II error), a Bayesian test finds the posterior probability of θ₀ and θ₁. The Bayesian version of hypothesis testing was introduced by Jeffreys [19]. In Bayesian statistics, though, hypothesis testing never played the dominant role it plays in the sampling theory approach. The structure of a Bayes net may be stated as a null hypothesis and then b...

496 |
Causation, Prediction, and Search
- Spirtes, Glymour, et al.
- 1993
Citation Context: ...s, neither the significance level nor the power of tests used within the search algorithms to decide statistical dependence measures the long run frequency of anything interesting about the search." ([44], p. 130/1) Moreover, significance testing in Bayes nets involves multiple tests and controversial P values raising additional problems [32]. A systematic reference investigating the role of significa...

188 | Learning Bayesian Belief Networks : An Approach Based on
- Lam, Bacchus
- 1994
Citation Context: ...limbing procedure for computing solutions that are locally optimal. An important relationship between the cross-entropy measure and the minimum description length (MDL) was described by Lam & Bacchus [31]. The coding length of a Bayes net is a monotonically increasing function of the Kullback-Leibler divergence and can be used to fit simple nets to the data. Jirousek & Kleiter [21] have shown that the...

172 | A guide to the literature on learning probabilistic networks from data
- Buntine
- 1996
Citation Context: ...ical problem. Intuitively, we want to find those structures that, based on the observed data, are well justified. For a review of the literature and tutorials on learning probabilistic networks see [6, 15, 16]. This paper presents a new method to extract structures from frequency data and to evaluate their probability. We select networks containing strong links and substantial deviance from independence. T...

166 |
Introduction to graphical modelling
- Edwards
- 2000
Citation Context: ...y no connections will pass the effect size filter, while, on the other hand, if we admit very weak effect sizes, practically every node is connected with every other one. Recently, Edwards [12], p. 9, discussed the Florida murder data 1976-1977 originally published by Radelet (Table 2). There are three binary variables: the colors of victims (black, white), the color of murderers (black, whi...

146 |
Mathematical Statistics
- Wilks
- 1962
Citation Context: ...; r_1, …, r_d] ∼ Di(α₁, …, α_D) (Eq. 14). If the joint distribution on the simplex is a Dirichlet distribution then all marginals and conditional distributions are Dirichlet distributions [46]. We work with a second order probability density function on the simplex of first order probabilities. The distribution tells us which values in the parameter space, given the observed data, are plau...

137 |
Combinatorial algorithms
- Nijenhuis, Wilf
- 1979
Citation Context: ...following graph generating model for random DAGs: 1. Select one of the n! permutations with probability 1/(n!). A short algorithm generating random permutations with probability 1/(n!) is given in [34]. The adjacency matrix of a graph that corresponds to a total ordering can always be arranged such that the lower triangular matrix contains 1s, and that the diagonal and the upper triangular matrix c...
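The graph-generating model in this context (a uniform random permutation fixing a total order, then random arcs compatible with that order) can be sketched in a few lines. This is an illustrative reconstruction under the assumption of a fixed arc probability p, not the paper's exact model:

```python
import random

def random_dag(n, p, rng=random):
    """Draw a random DAG: pick one of the n! total orderings uniformly,
    then include each order-compatible arc independently with probability p.
    The result is acyclic by construction."""
    order = list(range(n))
    rng.shuffle(order)  # uniform over all n! permutations
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[order[i]][order[j]] = 1  # arc from earlier to later node
    return adj

random.seed(1)
print(random_dag(4, 0.5))
```

Relabeling the nodes by the permutation is what makes the adjacency matrix strictly lower (or upper) triangular, as the context notes.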

133 |
An algebra of Bayesian belief universes for knowledge-based systems
- Jensen, Olsen, et al.
- 1990
Citation Context: ...graph instead of nodes of an acyclic directed graph. Kjaerulff [23] proposed to simplify and approximate Bayesian networks and related structures by the removal of weak dependences. Jensen & Anderson [20] developed a complementary method that annihilates small probabilities by setting probabilities in the clique potential of a junction tree to zero below a predetermined threshold. They choose a th...

126 |
The Bayesian Choice
- ROBERT
- 2007
Citation Context: ...t the probability of the hypothesis is always zero. The marginal distribution of the statistic is continuous and "... a point null hypothesis H₀: θ = θ₀ cannot be tested under a continuous prior distribution." ([38], p. 184). "In order to take seriously the problem of testing a point hypothesis, one must use a prior distribution in which Pr(Θ = θ₀) > 0." ([40], p. 221) Jeffreys [19] proposed to assign a p...

62 |
Statistical Prediction Analysis
- Aitchison, Dunsmore
- 1975
Citation Context: ...erior distribution is also Dirichlet: p(Θ|D, m) ∼ Di(N″_ij1, N″_ij2, …, N″_ijr_i) (Eq. 5), where N″_ij = Σ_k N″_ijk. Cooper & Herskovits now invoke the predictive distribution [1]. Denote by x_{r+1} the configuration of the next case to be observed after having observed a sample of r previous cases. The probability of the next case to obtain a certain configuration given a sampl...
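The Dirichlet posterior-predictive step this context alludes to reduces to normalized posterior counts. A minimal sketch (the function name and the counts are illustrative, not from the paper):

```python
def dirichlet_predictive(prior, counts):
    """Posterior-predictive category probabilities for a multinomial with a
    Dirichlet prior: p(next = k | data) = (prior_k + counts_k) / total."""
    post = [a + c for a, c in zip(prior, counts)]
    total = sum(post)
    return [x / total for x in post]

# Uniform Dirichlet(1, 1, 1) prior and observed counts (7, 2, 1):
print(dirichlet_predictive([1, 1, 1], [7, 2, 1]))  # [8/13, 3/13, 2/13]
```

Multiplying such predictive probabilities case by case is what yields the Cooper-Herskovits marginal likelihood with its ratios of Gamma functions.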

61 |
Probabilistic Expert Systems
- Shafer
- 1996
Citation Context: ...l ordering may arise from several total or linear orderings. Any permutation of the nodes of a Bayes net corresponds to a total ordering. With n elements there are n! permutations. Following Shafer [43] we call a permutation that is compatible with the partial ordering of the actual Bayes net a construction sequence. In parts of the literature a permutation is also conceived as a complete transitive...

60 |
Bayesian Networks for Knowledge Discovery
- Heckerman
- 1996
Citation Context: ...ical problem. Intuitively, we want to find those structures that, based on the observed data, are well justified. For a review of the literature and tutorials on learning probabilistic networks see [6, 15, 16]. This paper presents a new method to extract structures from frequency data and to evaluate their probability. We select networks containing strong links and substantial deviance from independence. T...

58 |
Bayesian belief networks: from construction to evidence
- Bouckaert
- 1995
Citation Context: ...Γ(N″_ijk)/Γ(N′_ijk) (Eq. 9). This is the Cooper-Herskovits scoring function. The asymptotic approximation of the marginal likelihood as the number of observations N → ∞ has been shown by Bouckaert [4] to be P(D|m) ≈ H(m,D)·N − (1/2)·dim(m)·log(N) (Eq. 10), where dim(m) is the number of parameters in m and H(m,D) is the entropy of the probability distribution obtained by projecting the frequencies of ...
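The penalized form in this context is the familiar BIC trade-off: a log-likelihood term that grows with N against a (dim/2)·log N complexity penalty. A toy sketch for a single multinomial node (hypothetical data, not the paper's scoring code):

```python
import math
from collections import Counter

def bic_score(samples, k):
    """BIC for a single k-category multinomial node: maximized
    log-likelihood minus (dim/2) * log N, with dim = k - 1 free parameters."""
    n = len(samples)
    loglik = sum(c * math.log(c / n) for c in Counter(samples).values())
    return loglik - 0.5 * (k - 1) * math.log(n)

data = [0] * 70 + [1] * 30
print(bic_score(data, 2))  # about -63.39
```

For a full network the log-likelihood term decomposes over nodes given their parents, and dim(m) sums the free parameters of all conditional tables.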

47 | Model selection and accounting for model uncertainty in linear regression models
- Raftery, Madigan, et al.
- 1997
Citation Context: ...g run frequency of anything interesting about the search." ([44], p. 130/1) Moreover, significance testing in Bayes nets involves multiple tests and controversial P values raising additional problems [32]. A systematic reference investigating the role of significance testing in model selection within the domain of Bayes nets is not known to me, and the same holds for frequentist confidence regions. Th...

47 |
Counting unlabeled acyclic digraphs
- Robinson
- 1977
Citation Context: ...sterior probability in Eq. 11 cannot be determined as the number of structures in M is too large. The number of possible structures increases over-exponentially with the number of variables. Robinson [39] gives the following recursive function for the number of labelled acyclic directed graphs on n nodes: f(n) = Σ_{i=1}^{n} (−1)^{i+1} (n choose i) 2^{i(n−i)} f(n−i), with f(0) = 1. We obtain f(2) = 3, ...
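Robinson's recursion quoted here translates directly into code; the sketch below (memoized plain Python) reproduces the known counts of labelled DAGs:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def labelled_dags(n):
    """Robinson's recursion for the number of labelled DAGs on n nodes:
    f(n) = sum_{i=1}^{n} (-1)^(i+1) * C(n, i) * 2^(i(n-i)) * f(n-i), f(0) = 1."""
    if n == 0:
        return 1
    return sum((-1) ** (i + 1) * comb(n, i) * 2 ** (i * (n - i)) * labelled_dags(n - i)
               for i in range(1, n + 1))

print([labelled_dags(n) for n in range(6)])
# [1, 1, 3, 25, 543, 29281]
```

The explosive growth of these counts is exactly why the context says the posterior over all structures cannot be enumerated.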

46 | Asymptotic model selection for directed networks with hidden variables - Geiger, Heckerman, et al. - 1996 |

45 |
Counting linear extensions
- Brightwell, Winkler
- 1991
Citation Context: ...tal to the theory of ordered sets and it is also of practical importance in such areas as computer sciences, for example, because of its close relationship with sorting problems. Brightwell & Winkler [5] have shown that the problem is #P-complete. Pruesse & Ruskey presented an algorithm that generates the linear extensions in time proportional to the number of linear extensions (constant amortized ti...

44 |
Hypothesis testing and model selection
- Raftery
- 1996
Citation Context: ...ared by a Bayes factor. The Bayes factor is the ratio of posterior to prior odds. A "calibration" by linguistic labels such as "barely worth mentioning", "positive", or "strong" is sometimes proposed [37]. Interval judgment does not provide ready made decisions about the rejection or acceptance of hypotheses at a given significance level, but leaves the evaluation of the results of the statistical ana...

42 |
Principles of Combinatorics
- Berge
- 1971
Citation Context: ...X_1, X_2, …, X_n are denoted by (x_1, x_2, …, x_n). A nice way to visually represent the set of all permutations in their integer coding is to draw a permutohedron (a convex polyhedron) [2]. Each permutation is represented by a point in the n-dimensional Euclidean space. The permutohedron is a structure on which a probability function can easily be defined. To find the cardinality of ce...

37 | Generating linear extensions fast - Pruesse, Ruskey - 1994 |

34 |
Reduction of computational complexity in bayesian networks through removal of weak dependencies
- Kjaerulff
- 1994
Citation Context: ...n the conditional probability distributions. To exploit all the information, they proposed to assign probabilities to cliques of a moral graph instead of nodes of an acyclic directed graph. Kjaerulff [23] proposed to simplify and approximate Bayesian networks and related structures by the removal of weak dependences. Jensen & Anderson [20] developed a complementary method that annihilates small probab...

32 | The multi-information function as a tool for measuring stochastic dependence
- Vejnarová
- 1999
Citation Context: ...tance of any point in the cube from the paraboloid. There are many proposals how to quantify deviance from independence. Mutual information is the quantity that has been used most often in Bayes nets [45]. In the present example containing two binary variables ... (Figure 1: The cloud represents the probability density associated with the outcomes of two binary variables X and Y. The parameter space is ...)

27 |
Approximating discrete probability distributions with decomposable models
- Malvestuto
- 1991
Citation Context: ...Wong & Poon [47] considered a classification problem and showed that under certain assumptions the Chow & Liu criterion is equivalent to minimizing an upper bound of the Bayes error rate. Malvestuto [33] extended the Chow & Liu approach to decomposable models of a given complexity. He used a hill-climbing procedure for computing solutions that are locally optimal. An important relationship between th...

23 |
Gibbs sampling in Bayesian networks
- Hrycej
- 1990
Citation Context: ...it simultaneously may be a parent of X_i and a parent of a child of X_i. Moreover, the focus node is not itself a member of the Markov blanket, though it clearly is a parent of its children. Hrycej [18] has shown that stochastic simulation of the Markov blankets of a Bayes net is a Gibbs sampler of the involved probability distributions. A similar technique was employed to propagate probabilities an...

17 |
A review of random graphs
- Karoński
- 1982
Citation Context: ...more detail we will discuss the function of arc probabilities in Bayesian networks. 4 Random Directed Acyclic Graphs. While several different kinds of random graphs are distinguished in the literature [3, 22] we consider only random graphs in which the nodes are fixed (non-random) and the edges are random. Usually, random graphs are introduced with undirected graphs. Let V be a set of n (finite labelled) ...

15 | Propagating imprecise probabilities in Bayesian networks
- Kleiter
- 1996
Citation Context: ...tion of the Markov blankets of a Bayes net is a Gibbs sampler of the involved probability distributions. A similar technique was employed to propagate probabilities and their precisions in Bayes nets [26]. Before the simulation starts we fix the effect sizes, the number B of initial burn-in iterations, and the number T of iterations in the main stage. B = T = 1,000 usually lead to results that change...

14 |
Bayesian Diagnosis in Expert Systems
- Kleiter
- 1992
Citation Context: ...actual problem at hand. From a Bayesian point of view all the information in the data about the parameters of a model is contained in the posterior distribution. In the distributional way of thinking [25] probability distributions take on the role of a language to express partial knowledge in a flexible and coherent way. Probability intervals under the posterior distribution are an important criterion...

11 | Lindley’s paradox
- Shafer
- 1982
Citation Context: ...ex and only slightly more accurate. Given that we must perform learning with only a limited amount of data, this insistence on accuracy is questionable." ([31], p. 273) De Groot (in the discussion of [42]) recommends: "when diffuse prior distributions are used in Bayesian inference, they must be used with care. Although they can serve as convenient and useful approximations in some estimation problems...

10 | A loop-free algorithm for generating the linear extensions of a poset - Canfield, Williamson - 1995 |

6 | Learning Bayesian networks under the control of mutual information
- Kleiter, Jirousek
- 1996
Citation Context: ...robabilities on the right hand side are not perfectly known, the measure on the left hand side cannot perfectly be known either. In our model θ_ij, θ_i+, and θ_+j follow Dirichlet distributions. In [29] we proposed a χ² approximation for the distribution of the mutual information that is slightly improved here. It is well known [9] that the relative entropy distance can be approximated by χ². This leads to the mean of the ...

5 |
Comments on the Approximating Discrete Probability Distributions with Dependence Trees
- Wong, Poon
Citation Context: ...information at each of the branches of the tree and showed that a maximum likelihood estimator of the overall structure is obtained when the total sum of mutual information is maximized. Wong & Poon [47] considered a classification problem and showed that under certain assumptions the Chow & Liu criterion is equivalent to minimizing an upper bound of the Bayes error rate. Malvestuto [33] extended the...

3 |
The number of linear extensions of a directed acyclic graph. Institut fur Psychologie, Universitat Salzburg, in preparation
- Kleiter
- 1997
Citation Context: ...nian path through the transposition graph and this proof underlies their algorithm. We have proposed an algorithm that determines the number of linear extensions without generating all the extensions [27, 28]. We used a coding scheme that in the generation process allows us to jump over large subsets of linear extensions. As for the present purpose only the number of linear extensions is of interest and not ...

2 | A note on learning Bayesian networks
- Jirousek, Kleiter
- 1995
Citation Context: ...ed by Lam & Bacchus [31]. The coding length of a Bayes net is a monotonically increasing function of the Kullback-Leibler divergence and can be used to fit simple nets to the data. Jirousek & Kleiter [21] have shown that the Lam & Bacchus algorithm does not fully exploit the information contained in the conditional probability distributions. To exploit all the information, they proposed to assign prob...

1 | Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables (revised version) - Chickering - 1997 |

1 |
Structural uncertainty in Bayes nets. Institut fur Psychologie
- Kleiter
- 1998
Citation Context: ...nian path through the transposition graph and this proof underlies their algorithm. We have proposed an algorithm that determines the number of linear extensions without generating all the extensions [27, 28]. We used a coding scheme that in the generation process allows us to jump over large subsets of linear extensions. As for the present purpose only the number of linear extensions is of interest and not ...