## Bayesian Network Classifiers (1997)

Citations: 594 (22 self)

### BibTeX

```bibtex
@MISC{Friedman97bayesiannetwork,
  author = {Nir Friedman and Dan Geiger and Moises Goldszmidt},
  title  = {Bayesian Network Classifiers},
  year   = {1997}
}
```

### Abstract

Recent work in supervised learning has shown that a surprisingly simple Bayesian classifier with strong assumptions of independence among features, called naive Bayes, is competitive with state-of-the-art classifiers such as C4.5. This fact raises the question of whether a classifier with less restrictive assumptions can perform even better. In this paper we evaluate approaches for inducing classifiers from data, based on the theory of learning Bayesian networks. These networks are factored representations of probability distributions that generalize the naive Bayesian classifier and explicitly represent statements about independence. Among these approaches we single out a method we call Tree Augmented Naive Bayes (TAN), which outperforms naive Bayes, yet at the same time maintains the computational simplicity (no search involved) and robustness that characterize naive Bayes. We experimentally tested these approaches, using problems from the University of California at Irvine repository, and compared them to C4.5, naive Bayes, and wrapper methods for feature selection.

### Citations

8609 | Elements of information theory - Cover, Thomas - 1991
Citation Context: ...$\pi(i) = 0$. Hence, we need to maximize the term $\sum_{i,\pi(i)>0} I_{\hat P_D}(A_i; A_{\pi(i)}, C) + \sum_{i,\pi(i)=0} I_{\hat P_D}(A_i; C)$ (8). We simplify this term by using the identity known as the chain law for mutual information (Cover & Thomas, 1991): $I_P(X; Y, Z) = I_P(X; Z) + I_P(X; Y \mid Z)$. Hence, we can rewrite expression (8) as $\sum_i I_{\hat P_D}(A_i; C) + \sum_{i,\pi(i)>0} I_{\hat P_D}(A_i; A_{\pi(i)} \mid C)$. Note that the first term is not affected by the choice of $\pi(i)$. Theref...
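The chain law quoted in this excerpt, $I_P(X; Y, Z) = I_P(X; Z) + I_P(X; Y \mid Z)$, can be checked numerically. Below is a minimal Python sketch; the joint table `P` and the helper names `marginal`, `mi`, and `cond_mi` are illustrative choices, not from the paper:

```python
import math
from itertools import product

# A hypothetical joint distribution P(x, y, z) over three binary variables;
# any normalized probability table would do.
P = dict(zip(product([0, 1], repeat=3),
             [0.10, 0.05, 0.20, 0.05, 0.15, 0.10, 0.05, 0.30]))

def marginal(axes):
    """Marginalize P onto the given tuple of axis indices."""
    out = {}
    for xs, p in P.items():
        key = tuple(xs[a] for a in axes)
        out[key] = out.get(key, 0.0) + p
    return out

def mi(a, b):
    """Mutual information I(A; B) in nats; a and b are tuples of axis indices."""
    Pab, Pa, Pb = marginal(a + b), marginal(a), marginal(b)
    return sum(p * math.log(p / (Pa[k[:len(a)]] * Pb[k[len(a):]]))
               for k, p in Pab.items() if p > 0)

def cond_mi(a, b, c):
    """Conditional mutual information I(A; B | C), computed from its definition:
    sum over (a,b,c) of p(a,b,c) * log[p(a,b,c) p(c) / (p(a,c) p(b,c))]."""
    Pabc, Pac, Pbc, Pc = (marginal(a + b + c), marginal(a + c),
                          marginal(b + c), marginal(c))
    return sum(p * math.log(p * Pc[k[len(a) + len(b):]]
                            / (Pac[k[:len(a)] + k[len(a) + len(b):]]
                               * Pbc[k[len(a):]]))
               for k, p in Pabc.items() if p > 0)

# Chain law: I(X; Y, Z) = I(X; Z) + I(X; Y | Z), with X on axis 0, Y on 1, Z on 2.
lhs = mi((0,), (1, 2))
rhs = mi((0,), (2,)) + cond_mi((0,), (1,), (2,))
```

The identity holds for any joint distribution, so the check succeeds regardless of the particular table chosen above.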

8557 | Introduction to Algorithms - Cormen, Leiserson, et al. - 2001
Citation Context: ...the sum of weights attached to the selected arcs is maximized. There are well-known algorithms for solving this problem of time complexity $O(n^2 \log n)$, where $n$ is the number of vertices in the graph (Cormen et al., 1990). The Construct-Tree procedure of CL consists of four steps: 1. Compute $I_{\hat P_D}(X_i; X_j)$ between each pair of variables, $i \neq j$, where $I_P(X; Y) = \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x)P(y)}$ is the mutual ...
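The step described in this excerpt — weight each pair of variables by empirical mutual information, then keep a maximum-weight spanning tree — can be sketched as follows. This is a sketch under stated assumptions: it uses Kruskal's algorithm with union-find rather than the routine cited above, and the data layout (a list of tuples of discrete values) is an illustrative choice:

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(data, i, j):
    """Empirical I(X_i; X_j) in nats from a list of discrete instances."""
    n = len(data)
    pij = Counter((row[i], row[j]) for row in data)
    pi = Counter(row[i] for row in data)
    pj = Counter(row[j] for row in data)
    # (c/n) * log[(c/n) / ((pi/n)(pj/n))] simplifies to (c/n) * log(c*n / (pi*pj)).
    return sum((c / n) * math.log(c * n / (pi[x] * pj[y]))
               for (x, y), c in pij.items())

def max_mi_spanning_tree(data, n_vars):
    """Maximum-weight spanning tree over the variables, with pairwise mutual
    information as edge weights (Kruskal with union-find)."""
    edges = sorted(((mutual_information(data, i, j), i, j)
                    for i, j in combinations(range(n_vars), 2)), reverse=True)
    parent = list(range(n_vars))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # adding (i, j) does not create a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Toy data: X1 tracks X0 closely, X2 is nearly independent of X0.
data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1),
        (0, 0, 1), (1, 1, 0), (1, 0, 1), (0, 1, 0)]
tree = max_mi_spanning_tree(data, 3)
```

Directing the resulting undirected tree outward from an arbitrary root then yields the tree-structured network the procedure returns.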

7074 | Probabilistic Reasoning in Intelligent Systems - Pearl - 1988
Citation Context: ...ence? In order to tackle this problem effectively, we need an appropriate language and efficient machinery to represent and manipulate independence assertions. Both are provided by Bayesian networks (Pearl, 1988). These networks are directed acyclic graphs that allow efficient and effective representation of the joint probability distribution over a set of random variables. Each vertex in the graph represent...

4957 | C4.5: Programs for Machine Learning - Quinlan - 1992
Citation Context: ...7% without smoothing. The complete results for the smoothed version of naive Bayes are reported in Table 3. Given that TAN performs better than naive Bayes and that naive Bayes is comparable to C4.5 (Quinlan, 1993), a state-of-the-art decision tree learner, we may infer that TAN should perform rather well in comparison to C4.5. To confirm this prediction, we performed experiments comparing TAN to C4.5, and als...

3928 | Pattern Classification and Scene Analysis - Duda, Hart - 1973

1250 | On information and sufficiency - Kullback, Leibler - 1951
Citation Context: ...maximizing the log likelihood we are minimizing the description of D. Another way of viewing this optimization process is to use cross entropy, which is also known as the Kullback-Leibler divergence (Kullback & Leibler, 1951). Cross entropy is a measure of distance between two probability distributions. Formally, $D(P(\mathbf{X}) \,\|\, Q(\mathbf{X})) = \sum_{\mathbf{x} \in \mathrm{Val}(\mathbf{X})} P(\mathbf{x}) \log \frac{P(\mathbf{x})}{Q(\mathbf{x})}$ (A.1). One information-theoretic interpretation of cross en...
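Equation (A.1) is direct to compute for finite distributions; a minimal Python sketch (the dict-based representation and the two example distributions are illustrative choices):

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q) in bits, for distributions
    given as dicts mapping outcomes to probabilities. Zero-probability
    outcomes of P contribute nothing; Q must support every outcome of P."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.75, "b": 0.25}
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)  # generally != d_pq: the divergence is asymmetric
```

The asymmetry is why the excerpt calls it a "measure of distance" rather than a metric: $D(P \| Q)$ and $D(Q \| P)$ need not agree.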

1165 | Modeling by shortest data description - Rissanen - 1978
Citation Context: ...eiger et al., 1996). An in-depth discussion of the pros and cons of each scoring function is beyond the scope of this paper. Henceforth, we concentrate on the MDL scoring function. The MDL principle (Rissanen, 1978) casts learning in terms of data compression. Roughly speaking, the goal of the learner is to find a model that facilitates the shortest description of the original data. The length of this descripti...

1116 | Pattern Recognition and Neural Networks - Ripley - 1996
Citation Context: ...assess only the second term, since it is the only one relevant to the classification process. In general, neither of these approaches dominates the other (Ripley, 1996). The naive Bayesian classifier and the extensions we have evaluated belong to the sampling paradigm. Although the unrestricted Bayesian networks (described in Section 3) do not strictly belong in ei...

1079 | A Bayesian method for the induction of probabilistic networks from data - Cooper, Herskovits - 1992
Citation Context: ...efficient algorithms in Section 4.1, where we propose a particular extension to naive Bayes. The two main scoring functions commonly used to learn Bayesian networks are the Bayesian scoring function (Cooper & Herskovits, 1992; Heckerman et al., 1995), and the function based on the principle of minimal description length (MDL) (Lam & Bacchus, 1994; Suzuki, 1993); see also Friedman and Goldszmidt (1996c) for a more recent a...

1036 | Wrappers for feature subset selection - Kohavi, John - 1997

905 | Learning Bayesian networks: The combination of knowledge and statistical data - Heckerman, Geiger, et al. - 1995
Citation Context: ...tion 4.1, where we propose a particular extension to naive Bayes. The two main scoring functions commonly used to learn Bayesian networks are the Bayesian scoring function (Cooper & Herskovits, 1992; Heckerman et al., 1995), and the function based on the principle of minimal description length (MDL) (Lam & Bacchus, 1994; Suzuki, 1993); see also Friedman and Goldszmidt (1996c) for a more recent account of this scoring f...

757 | A study of cross-validation and bootstrap for accuracy estimation and model selection - Kohavi - 1995

741 | UCI repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html - Murphy, Aha - 1992
Citation Context: ...lassifiers based on unrestricted networks) to that of the naive Bayesian classifier. We ran this experiment on 25 data sets, 23 of which were from the UCI repository (Murphy & Aha, 1995). Section 5 describes in detail the experimental setup, evaluation methods, and results. As the results in Figure 2 show, the classifier based on unrestricted networks performed significantly better ...

693 | Optimal Statistical Decisions - DeGroot - 1970

655 | Multi-interval discretization of continuous-valued attributes for classification learning - Fayyad, Irani - 1993

637 | Approximating discrete probability distributions with dependence trees - Chow, Liu - 1968
Citation Context: ...ed one by choosing a root variable and setting the direction of all edges to be outward from it. CL prove that this procedure finds the tree that maximizes the likelihood given the data D. Theorem 1 (Chow & Liu, 1968): Let D be a collection of N instances of $X_1, \dots, X_n$. The Construct-Tree procedure constructs a tree $B_T$ that maximizes $LL(B_T \mid D)$ and has time complexity $O(n^2 \cdot N)$. ...

411 | Supervised and unsupervised discretization of continuous features - Dougherty, Kohavi, et al. - 1995

336 | An analysis of Bayesian classifiers - Langley, Iba, et al. - 1992
Citation Context: ...of conditional probability, we get $\Pr(C \mid A_1, \dots, A_n) = \alpha \cdot \Pr(C) \cdot \prod_{i=1}^{n} \Pr(A_i \mid C)$, where $\alpha$ is a normalization constant. This is in fact the definition of naive Bayes commonly found in the literature (Langley et al., 1992). The problem of learning a Bayesian network can be informally stated as: Given a training set $D = \{u_1, \dots, u_N\}$ of instances of U, find a network B that best matches D. The common approach to this pr...
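The naive Bayes rule quoted in this excerpt, posterior proportional to prior times the product of per-attribute likelihoods, can be sketched directly. The model tables below (`prior`, `cond`, and the spam/ham labels) are hypothetical, purely for illustration:

```python
def naive_bayes_posterior(prior, cond, attrs):
    """Posterior Pr(C | a1..an) = alpha * Pr(C) * prod_i Pr(ai | C).
    prior: {class: Pr(C)}; cond: {class: [{value: Pr(Ai=value | C)}, ...]}."""
    scores = {}
    for c, pc in prior.items():
        score = pc
        for i, a in enumerate(attrs):
            score *= cond[c][i][a]
        scores[c] = score
    alpha = 1.0 / sum(scores.values())  # the normalization constant
    return {c: alpha * s for c, s in scores.items()}

# Hypothetical two-class model over two binary attributes.
prior = {"spam": 0.4, "ham": 0.6}
cond = {"spam": [{"yes": 0.8, "no": 0.2}, {"yes": 0.7, "no": 0.3}],
        "ham":  [{"yes": 0.1, "no": 0.9}, {"yes": 0.3, "no": 0.7}]}
post = naive_bayes_posterior(prior, cond, ("yes", "yes"))
```

Because alpha only rescales, the predicted class is the one with the largest unnormalized score; normalization matters only when the posterior probabilities themselves are reported.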

314 | Estimating Continuous Distributions in Bayesian Classifiers - John, Langley

298 | Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier - Domingos, Pazzani - 1996
Citation Context: ...actors such as numeric attributes and missing values. 6.1. Related work on naive Bayes: There has been recent interest in explaining the surprisingly good performance of the naive Bayesian classifier (Domingos & Pazzani, 1996; Friedman, 1997a). The analysis provided by Friedman (1997a) is particularly illustrative, in that it focuses on characterizing how the bias and variance components of the estimation error combine to...

298 | Learning Bayesian Networks: The ... - Heckerman, Geiger, et al. - 1995
Citation Context: ...size increases; furthermore, they are both asymptotically correct: with probability equal to one the learned distribution converges to the underlying distribution as the number of samples increases (Heckerman, 1995; Bouckaert, 1994; Geiger et al., 1996). An in-depth discussion of the pros and cons of each scoring function is beyond the scope of this paper. Henceforth, we concentrate on the MDL scoring function....

267 | Inference and missing data - Rubin - 1976
Citation Context: ...works but we leave this issue for future work. Regarding the problem of missing values, in theory, probabilistic methods provide a principled solution. If we assume that values are missing at random (Rubin, 1976), then we can use the marginal likelihood (the probability assigned to the parts of the instance that were observed) as the basis for scoring models. If the values are not missing at random, then mor...

234 | Learning Bayesian networks with local structure - Friedman, Goldszmidt - 1998

216 | The EM algorithm for graphical association models with missing data, Computational Statistics & Data Analysis - Lauritzen - 1995
Citation Context: ...written as the sum of local terms (as in Equation 4). Moreover, to evaluate the optimal choice of parameters for a candidate network structure, we must perform nonlinear optimization using either EM (Lauritzen, 1995) or gradient descent (Binder et al., 1997). The problem of selecting the best structure is usually intractable in the presence of missing values. Several recent efforts (Geiger et al., 1996; Chickeri...

213 | Induction of selective Bayesian classifiers - Langley, Sage - 1994
Citation Context: ...e may infer that TAN should perform rather well in comparison to C4.5. To confirm this prediction, we performed experiments comparing TAN to C4.5, and also to the selective naive Bayesian classifier (Langley & Sage, 1994; John & Kohavi, 1997). The latter approach searches for the subset of attributes over which naive Bayes has the best performance. The results, displayed in Figures 5 and 6 and in Table 2, show that T...

193 | On bias, variance, 0/1-loss, and the curse-of-dimensionality - Friedman - 1997
Citation Context: ...ributes and missing values. 6.1. Related work on naive Bayes: There has been recent interest in explaining the surprisingly good performance of the naive Bayesian classifier (Domingos & Pazzani, 1996; Friedman, 1997a). The analysis provided by Friedman (1997a) is particularly illustrative, in that it focuses on characterizing how the bias and variance components of the estimation error combine to influence class...

190 | Bayesian analysis in expert systems - Spiegelhalter, Dawid, et al. - 1993

188 | Learning Bayesian Belief Networks: An Approach Based on ... - Lam, Bacchus - 1994
Citation Context: ...didate network. We start by examining a straightforward application of current Bayesian networks techniques. We learn networks using the score based on the minimum description length (MDL) principle (Lam & Bacchus, 1994; Suzuki, 1993), and use them for classification. The results, which are analyzed in Section 3, are mixed: although the learned networks perform significantly better than naive Bayes on some data sets...

182 | Theory refinement on Bayesian networks - Buntine - 1991

176 | Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables - Chickering, Heckerman - 1997
Citation Context: ...en, 1995) or gradient descent (Binder et al., 1997). The problem of selecting the best structure is usually intractable in the presence of missing values. Several recent efforts (Geiger et al., 1996; Chickering & Heckerman, 1996) have examined approximations to the marginal score that can be evaluated efficiently. Additionally, Friedman (1997b) has proposed a variant of EM for selecting the graph structure that...

171 | A guide to the literature on learning probabilistic networks from data - Buntine - 1996

167 | Estimating probabilities: A crucial task in machine learning - Cestnik - 1990

157 | Adaptive probabilistic networks with hidden variables - Binder, Koller, et al. - 1997
Citation Context: ...mplies that, to maximize the choice of parameters for a fixed network structure, we must resort to search methods such as gradient descent over the space of parameters (e.g., using the techniques of Binder et al., 1997). When learning the network structure, this search must be repeated for each structure candidate, rendering the method computationally expensive. Whether we can find heuristic approaches that will a...

155 | Learning Bayesian networks is NP-complete - Chickering - 1996
Citation Context: ...oring function that evaluates each network with respect to the training data, and then to search for the best network according to this function. In general, this optimization problem is intractable (Chickering, 1995). Yet, for certain restricted classes of networks, there are efficient algorithms requiring polynomial time in the number of variables in the network. We indeed take advantage of these efficient algo...

136 | Probabilistic similarity networks - Heckerman - 1991
Citation Context: ...fic class, that is, $\hat P_D(A_1, \dots, A_n \mid C = c_i)$. The Bayesian network for $c_i$ is called a local network for $c_i$. The set of local networks combined with a prior on C, $P(C)$, is called a Bayesian multinet (Heckerman, 1991; Geiger & Heckerman, 1996). Formally, a multinet is a tuple $M = \langle P_C, B_1, \dots, B_k \rangle$ where $P_C$ is a distribution on C, and $B_i$ is a Bayesian network over $A_1, \dots, A_n$ for $1 \le i \le k = |\mathrm{Val}(C)|$. A multine...

120 | Learning belief networks in the presence of missing values and hidden variables - Friedman - 1997
Citation Context: ...ributes and missing values. 6.1. Related work on naive Bayes: There has been recent interest in explaining the surprisingly good performance of the naive Bayesian classifier (Domingos & Pazzani, 1996; Friedman, 1997a). The analysis provided by Friedman (1997a) is particularly illustrative, in that it focuses on characterizing how the bias and variance components of the estimation error combine to influence class...

110 | Semi-naive Bayesian classifier - Kononenko - 1991
Citation Context: ...tal results (see Figure 6) show that the methods we examine here are usually more accurate than the selective naive Bayesian classifier as used by John and Kohavi (1997). Work in the second category (Kononenko, 1991; Pazzani, 1995; Ezawa & Schuermann, 1995) are closer in spirit to our proposal, since they attempt to improve the predictive accuracy by removing some of the independence assumptions. The semi-naive ...

98 | MLC++: A machine learning library in C++ - Kohavi, John, et al. - 1994
Citation Context: ...The accuracy of each classifier is based on the percentage of successful predictions on the test sets of each data set. We used the MLC++ system (Kohavi et al., 1994) to estimate the prediction accuracy for each classifier, as well as the variance of this accuracy. Accuracy was measured via the holdout method for the larger data sets (that is, the learning proced...

93 | Knowledge representation and inference in similarity networks and Bayesian multinets - Geiger, Heckerman - 1996
Citation Context: ...s, $\hat P_D(A_1, \dots, A_n \mid C = c_i)$. The Bayesian network for $c_i$ is called a local network for $c_i$. The set of local networks combined with a prior on C, $P(C)$, is called a Bayesian multinet (Heckerman, 1991; Geiger & Heckerman, 1996). Formally, a multinet is a tuple $M = \langle P_C, B_1, \dots, B_k \rangle$ where $P_C$ is a distribution on C, and $B_i$ is a Bayesian network over $A_1, \dots, A_n$ for $1 \le i \le k = |\mathrm{Val}(C)|$. A multinet M defines a joint distri...
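The classification rule a multinet $M = \langle P_C, B_1, \dots, B_k \rangle$ induces — choose the class $c$ maximizing $P_C(c) \cdot P_{B_c}(a_1, \dots, a_n)$ — can be sketched with the local networks abstracted as callables. The lambdas below are illustrative stand-ins for real local Bayesian networks, and all numbers are hypothetical:

```python
def multinet_classify(p_c, local_nets, attrs):
    """Return argmax_c P_C(c) * P_{B_c}(attrs), the multinet's predicted class.
    p_c: {class: prior probability}; local_nets: {class: callable attrs -> prob}."""
    return max(p_c, key=lambda c: p_c[c] * local_nets[c](attrs))

p_c = {0: 0.5, 1: 0.5}
local_nets = {
    # Stand-in "local networks": fully factored likelihoods per class.
    0: lambda a: (0.9 if a[0] == 0 else 0.1) * (0.8 if a[1] == 0 else 0.2),
    1: lambda a: (0.2 if a[0] == 0 else 0.8) * (0.3 if a[1] == 0 else 0.7),
}
label = multinet_classify(p_c, local_nets, (0, 0))
```

The point of the multinet representation is that each $B_c$ may have a different structure, so attribute dependencies can vary by class; the shared-structure TAN classifier is the special case where they do not.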

78 | Building classifiers using Bayesian networks - Friedman, Goldszmidt - 1996

70 | Searching for dependencies in Bayesian classifiers - Pazzani - 1996
Citation Context: ...Figure 6) show that the methods we examine here are usually more accurate than the selective naive Bayesian classifier as used by John and Kohavi (1997). Work in the second category (Kononenko, 1991; Pazzani, 1995; Ezawa & Schuermann, 1995) are closer in spirit to our proposal, since they attempt to improve the predictive accuracy by removing some of the independence assumptions. The semi-naive Bayesian classi...

64 | Discretizing Continuous Attributes While Learning Bayesian Networks - Friedman, Goldszmidt - 1996

49 | Efficient learning of selective Bayesian network classifiers - Singh, Provan - 1995

43 | Learning Bayesian networks: A unification for discrete and Gaussian domains - Heckerman, Geiger - 1995

33 | Approximating probability distributions to reduce storage requirements - Lewis - 1959

31 | A construction of Bayesian networks from databases based on an MDL scheme - Suzuki - 1993
Citation Context: ...art by examining a straightforward application of current Bayesian networks techniques. We learn networks using the score based on the minimum description length (MDL) principle (Lam & Bacchus, 1994; Suzuki, 1993), and use them for classification. The results, which are analyzed in Section 3, are mixed: although the learned networks perform significantly better than naive Bayes on some data sets, they perform...

29 | An entropy-based learning algorithm of Bayesian conditional trees - Geiger - 1992

24 | A comparison of induction algorithms for selective and non-selective Bayesian classifiers - Singh, Provan - 1995
Citation Context: ...combine several feature subset selection strategies with an unsupervised Bayesian network learning routine. This procedure, however, can be computationally intensive (e.g., some of their strategies (Singh & Provan, 1995) involve repeated calls to the Bayesian network learning routine). 6.2. The conditional log likelihood: Even though the use of log likelihood is warranted by an asymptotic argument, as we have seen...

22 | Properties of diagnostic data distributions - Dawid - 1976

22 |
Fraud/uncollectable debt detection using Bayesian network based learning system: A rare binary outcome with mixed data structures
- Ezawa, T
- 1995
(Show Context)
Citation Context ...that the methods we examine here are usually more accurate than the selective naive Bayesian classifier as used by John and Kohavi (1997). Work in the second category (Kononenko, 1991; Pazzani, 1995; =-=Ezawa & Schuermann, 1995-=-) are closer in spirit to our proposal, since they attempt to improve the predictive accuracy by removing some of the independence assumptions. The semi-naive Bayesian classifier (Kononenko, 1991) is ... |