## On the Sample Complexity of Learning Bayesian Networks (1996)

Citations: 50 (2 self)

### BibTeX

@INPROCEEDINGS{Friedman96onthe,

author = {Nir Friedman and Zohar Yakhini},

title = {On the Sample Complexity of Learning Bayesian Networks},

booktitle = {},

year = {1996},

pages = {274--282},

publisher = {Morgan Kaufmann}

}

### Abstract

In recent years there has been an increasing interest in learning Bayesian networks from data. One of the most effective methods for learning such networks is based on the minimum description length (MDL) principle. Previous work has shown that this learning procedure is asymptotically successful: with probability one, it will converge to the target distribution, given a sufficient number of samples. However, the rate of this convergence has been hitherto unknown. In this work we examine the sample complexity of MDL-based learning procedures for Bayesian networks. We show that the number of samples needed to learn an $\epsilon$-close approximation (in terms of entropy distance) with confidence $\delta$ is $O\left(\left(\frac{1}{\epsilon}\right)^{4/3} \log\frac{1}{\epsilon} \, \log\frac{1}{\delta} \, \log\log\frac{1}{\delta}\right)$. This means that the sample complexity is a low-order polynomial in the error threshold and sublinear in the confidence bound. We also discuss how the constants in this term depend on the complexity of the target distribution. F...
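As a rough illustration of the two quantities the abstract relies on, the sketch below computes the entropy (Kullback-Leibler) distance between two discrete distributions and evaluates the growth term of the stated bound, (1/ε)^(4/3) · log(1/ε) · log(1/δ) · log log(1/δ). The function names are hypothetical, natural logarithms are assumed, and the constant hidden by the O(·) notation is not given in the abstract, so only ratios between values of the growth term are meaningful.

```python
import math


def kl_divergence(p, q):
    """Entropy distance D(P||Q) = sum_x P(x) * log(P(x) / Q(x)), natural log.

    p and q are sequences of probabilities over the same outcomes.
    Terms with P(x) = 0 contribute nothing; if Q(x) = 0 while P(x) > 0
    the distance is infinite.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx)
    return total


def sample_bound_growth(eps, delta):
    """Growth term of the bound, WITHOUT the unspecified O(.) constant.

    Assumes 0 < eps < 1 and 0 < delta < 1/e so that every logarithm
    in the product is positive.
    """
    if not 0 < eps < 1:
        raise ValueError("eps must lie in (0, 1)")
    if not 0 < delta < 1 / math.e:
        raise ValueError("delta must lie in (0, 1/e)")
    inv_eps = 1.0 / eps
    log_inv_delta = math.log(1.0 / delta)
    return (inv_eps ** (4.0 / 3.0)
            * math.log(inv_eps)
            * log_inv_delta
            * math.log(log_inv_delta))


if __name__ == "__main__":
    # Halving the error threshold multiplies the term by a bit more than 2^(4/3).
    ratio = sample_bound_growth(0.05, 0.01) / sample_bound_growth(0.1, 0.01)
    print(f"growth ratio when eps is halved: {ratio:.3f}")
    print(f"D(P||Q) for a point mass vs. a fair coin: {kl_divergence([1.0, 0.0], [0.5, 0.5]):.4f}")
```

Comparing ratios rather than absolute values sidesteps the unknown constant, which, according to the abstract, depends on the complexity of the target distribution.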

### Citations

9231 | Elements of Information Theory - Cover, Thomas - 1990 |
Citation Context: ...able is connected to every other variable). To see this, suppose that $G \subset G'$ (i.e., every edge in $G$ appears in $G'$). Thus, $\Pi^{G}_{X} \subseteq \Pi^{G'}_{X}$ for every $X$. Using the data processing inequality [Cover and Thomas 1991], we get that $H(X \mid \Pi^{G}_{X}) \geq H(X \mid \Pi^{G'}_{X})$. It immediately follows that $LL_{\omega_N}(G') \geq LL_{\omega_N}(G)$. This preference is highly undesirable, since complete structures usually overfit the training data. ...

7495 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference - Pearl - 1988 |
Citation Context: ...$\| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$. This measure of distance is accepted as a standard measure of error in the Bayesian networks literature [Heckerman, Geiger, and Chickering 1995; Lam and Bacchus 1994; Pearl 1988]. (See Section 3.1 for a detailed motivation of this choice.) It is also clear that since we are learning from random samples, there is some probability of seeing unrepresentative sequences that migh...

2771 | Estimating the dimension of a model - Schwarz - 1978 |
Citation Context: ...ce is a logarithmic penalty function. The standard penalty examined in the MDL literature is $\psi(N) = \frac{1}{2} \log N$. The resulting scoring metric is also known as the Bayesian Information Criterion (BIC) [Schwarz 1978]. We now discuss in some detail the motivation for this penalty weight. We start with the MDL approach. Notice that frequencies in $\omega_N$ are fractions with precision of $\frac{1}{N}$. Thus, we need only $\log N$ bi...

1421 | On information and sufficiency - Kullback, Leibler - 1951 |
Citation Context: ... always necessary nor possible to learn the exact target distribution. Instead a close approximation suffices. In this paper we use the entropy distance (also known as the Kullback-Leibler distance) [Kullback and Leibler 1951] as the measure of approximation: $D(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$. This measure of distance is accepted as a standard measure of error in the Bayesian networks literature [Heckerman, Geiger, and...

1140 | A Bayesian Method for the Induction of Probabilistic Networks from Data - Cooper, Herskovits - 1992 |
Citation Context: ... $m$, the skewness of the target distribution. We are currently considering using a similar technique to reduce the $m$-order of our results. Another related issue is Bayesian approaches to learning BNs [Cooper and Herskovits 1992; Heckerman, Geiger, and Chickering 1995; Heckerman 1995]. In these approaches we assume that there is a prior over possible network structures and their associated parameters. Learning is done by sel...

953 | Learning Bayesian networks: The combination of knowledge and statistical data - Heckerman, Geiger, et al. - 1995 |

330 | A tutorial on learning Bayesian networks - Heckerman - 1995 |
Citation Context: ...s an $O(1)$ approximation of the logarithm of the posterior distribution $\log \Pr(G \mid \omega_N)$ for well-behaved priors [Schwarz 1978]. We refer the interested reader to [Heckerman, Geiger, and Chickering 1995; Heckerman 1995] for a description of the Bayesian approach to learning Bayesian networks and of Schwarz's result in the context of Bayesian networks. Finally, we might assign a polynomial penalty, i.e., take $\psi(N) =$...

245 | Learning Bayesian networks with local structure - Friedman, Goldszmidt - 1996 |
Citation Context: ...amples, $N$. The vertical axis measures the average cross-entropy error from 10 experiments multiplied by N log N. The dotted diagonal lines are lines of constant error. These results are taken from [Friedman and Goldszmidt 1996]. The learning curves are of Alarm (solid line), CTS (dashed line), and TJ (dot-dash line). three different networks with the cross-entropy error scaled by N log N. (These results are taken from [Fr...

214 | Minimum complexity density estimation - Barron, Cover - 1991 |

212 | Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy - Shore, Johnson - 1980 |

199 | Learning Bayesian belief networks: An approach based on the MDL principle - Lam, Bacchus - 1994 |
Citation Context: ...nce they fail to capture (and exploit) independencies in the domain and tend to overfit the training data. One possible solution to this problem is to use the minimum description length (MDL) metric [Lam and Bacchus 1994; Suzuki 1993], which adds a penalty term to the likelihood measure. This term penalizes complex networks, i.e., networks that embody large numbers of parameters. The size of the penalty depends on th...

152 | An Introduction to Probability Theory and Its Applications, 2nd edition - Feller - 1971 |
Citation Context: ...able penalty functions $\psi(N)$ lead to consistent learning procedures: Corollary 3.12: If $\lim_{N \to \infty} \frac{\psi(N)}{N} = 0$, then $\mathrm{Lrn}_{\psi}$ is asymptotically consistent. Proof: Use Theorem 3.9 and Borel-Cantelli's lemma ([Feller 1957]). This consistency result for MDL learning is by no means new. Barron and Cover [1991] treat such questions in a general setting, but without obtaining a confidence rate. When $\psi(N) = \frac{1}{2} \log N$, con...

56 | A new look at the statistical model identification - Akaike - 1974 |
Citation Context: ...n into account only when comparing two candidates that perform equally well (e.g., have approximately the same log-likelihood). This scoring metric is known as the Akaike Information Criterion (AIC) [Akaike 1974]. We note that a constant penalty weight does not constitute a proper MDL encoding scheme, since it assigns a fixed number of bits to the description of the network parameters. Another possible choic...

41 | Decision theoretic subsampling for induction on large databases - Musick, Catlett, et al. - 1993 |

33 | Learning and robust learning of product distributions - Hoffgen - 1993 |
Citation Context: ...cent results focused on the theoretical development of learning procedures and on empirical methods of learning. However, the complexity of the learning problem is generally unknown (an exception is [Hoffgen 1993], which we discuss below). A BN is composed of two parts. The first part is a directed acyclic graph that represents (in)dependencies among the random variables: X is independent of Y given Z, if Z s...