## Sisterhood of Classifiers: A Comparative Study of Naive Bayes and Noisy-or Networks

Citations: 1 (0 self)

### BibTeX

```bibtex
@MISC{Chen_sisterhoodof,
  author = {David Chen},
  title  = {Sisterhood of Classifiers: A Comparative Study of Naive Bayes and Noisy-or Networks},
  year   = {}
}
```


### Abstract

Classification is a task central to many machine learning problems. In this paper we examine two Bayesian network classifiers, the naive Bayes and the noisy-or models. They are of particular interest because of their simple structures. We compare them along two dimensions: expressive power and ability to learn. As it turns out, naive Bayes, noisy-or, and logistic regression classifiers all have equivalent expressiveness. We show mathematical derivations of how to transform a classifier in one model into the other two. The classifiers differ, however, in their ability to learn. We conducted an experiment confirming the intuition that naive Bayes performs better than noisy-or when the data fits its independence assumptions, and vice versa. However, we still do not have a clear set of criteria for determining under exactly what conditions each classifier excels. Further study of the strengths and weaknesses of each classifier should provide deeper insight into how to improve the current models. One possible extension would be to combine the naive Bayes and noisy-or models so that the network more closely depicts the actual relationship between the attributes.
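Both models named in the abstract admit very small implementations, which is the "simple structures" point. The sketch below is an illustrative rendering for binary attributes with invented parameters, not code from the paper:

```python
import math

def naive_bayes_posterior(attrs, prior_c1, p_attr_given_c):
    """P(C=1 | attrs) for binary attributes under the naive Bayes
    independence assumption. p_attr_given_c[c][j] is P(A_j=1 | C=c);
    any parameter values used with this are illustrative."""
    log_odds = math.log(prior_c1 / (1 - prior_c1))
    for j, a in enumerate(attrs):
        p1 = p_attr_given_c[1][j] if a else 1 - p_attr_given_c[1][j]
        p0 = p_attr_given_c[0][j] if a else 1 - p_attr_given_c[0][j]
        log_odds += math.log(p1 / p0)
    return 1 / (1 + math.exp(-log_odds))

def noisy_or_posterior(attrs, link_probs):
    """P(C=1 | attrs) under a noisy-or gate: each active attribute
    independently produces the class with its link probability, so
    the class is absent only if every active attribute fails."""
    p_fail = 1.0
    for a, p in zip(attrs, link_probs):
        if a:
            p_fail *= 1 - p
    return 1 - p_fail
```

For example, with two active causes of link probability 0.5 and 0.4, the noisy-or posterior is 1 − 0.5 · 0.6 = 0.7.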

### Citations

9039 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context ...d never solve a more general problem as an intermediate step [such as modeling p(A | C)]” [37]. [Figure 4: Combined naive Bayes and noisy-or classifier.] However, as mentioned earlier, there are indications that no one classifier is superior in all cases. The classification problem can be formulated formally as follows: Let A = {A1, A2, . . . , An} b... |

8166 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...he complete data case using the weighted data we obtained in the previous step. The process continues until a certain convergence criterion is met. The algorithm was first introduced by Dempster et al [7]. It was later extended for graphical models by Lauritzen [22]. Of the three classifiers discussed, logistic regression seems to perform the best overall in practice. However, since it is not a Bayesi... |

7089 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988
Citation Context ...n a number of different formulations of the noisy-or model, but the central idea is the same. For an early discussion of the simplest and most intuitive canonical model, the binary noisy-or gate, see [20, 30]. We know that the size of the CPT of a node grows exponentially with the size of its parents. Thus, a node with many parents can be very costly to model. This is especially true in the noisy-or class... |

908 | Learning Bayesian networks: the combination of knowledge and statistical data
- Heckerman, Geiger, et al.
- 1995
Citation Context ...xpert knowledge is not always readily available and handcrafting is infeasible for large networks with thousands of nodes. As a result, there has been a lot of research toward learning Bayesian networks [4, 14, 21, 15, 17]. These learning algorithms are unsupervised, in the sense that the learner does not differentiate between the attributes and the class variables. The objective is to find the network structure that b... |

765 | A comparison of event models for naive bayes text classification
- McCallum, Nigam
Citation Context ... network will more closely depict the actual relationship between the attributes. 1 Introduction Classification has applications in many different fields including biology [11], information retrieval [23, 24, 32], national security [1], spam filtering [3], etc. It is a basic task in performing data analysis or pattern recognition. The main goal of classification is to construct a function that will correctly ... |

597 | Bayesian network classifiers
- Friedman, Geiger, et al.
- 1997
Citation Context ...y done by employing a heuristic search over the possible network spaces and finding the highest scoring network. However, many scoring functions do not accurately measure the goodness of the networks [12]. Moreover, it was recently proven that in general, identifying high-scoring structures is NP-hard [6]. Thus, we will concentrate our attention on two particular types of Bayesian networks that have k... |

371 | On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes
- Ng, Jordan

348 | Naive Bayes at forty: The independence assumption in information retrieval
- Lewis
Citation Context ... network will more closely depict the actual relationship between the attributes. 1 Introduction Classification has applications in many different fields including biology [11], information retrieval [23, 24, 32], national security [1], spam filtering [3], etc. It is a basic task in performing data analysis or pattern recognition. The main goal of classification is to construct a function that will correctly ... |

299 | A tutorial on learning Bayesian networks
- Heckerman
- 1995
Citation Context ...xpert knowledge is not always readily available and handcrafting is infeasible for large networks with thousands of nodes. As a result, there has been a lot of research toward learning Bayesian networks [4, 14, 21, 15, 17]. These learning algorithms are unsupervised, in the sense that the learner does not differentiate between the attributes and the class variables. The objective is to find the network structure that b... |

234 | Learning Bayesian Networks with Local Structure
- Friedman, Goldszmidt
- 1996
Citation Context ...mputationally expensive. Finally, as Friedman and Goldszmidt noted, “learning many parameters is a liability, since a large number of parameters requires a large training set to be assessed reliably” [13]. Noisy-or solves this problem by assuming independences between the different parents or causes. In essence, the noisy-or model approximates the CPT for the common effect by using an OR function. Thi... |
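The OR-function approximation described in this passage can be sketched in a few lines. A node with n binary parents needs a full CPT with 2^n rows, while noisy-or needs only n link probabilities; the function name and numbers below are illustrative, not from the paper:

```python
def noisy_or_effect_prob(parent_states, link_probs):
    """P(Effect=1 | parents) under the noisy-or assumption: the
    effect fails to occur only if every active parent independently
    fails to produce it. Only len(link_probs) parameters, instead
    of a CPT with 2**len(link_probs) rows."""
    p_all_fail = 1.0
    for active, p in zip(parent_states, link_probs):
        if active:
            p_all_fail *= 1.0 - p
    return 1.0 - p_all_fail

# Illustrative: three causes, the first two active.
links = [0.8, 0.6, 0.3]
row = noisy_or_effect_prob([1, 1, 0], links)  # 1 - 0.2*0.4 = 0.92
```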

216 | The EM algorithm for graphical association models with missing data
- Lauritzen
- 1995
Citation Context ...the previous step. The process continues until a certain convergence criterion is met. The algorithm was first introduced by Dempster et al [7]. It was later extended for graphical models by Lauritzen [22]. Of the three classifiers discussed, logistic regression seems to perform the best overall in practice. However, since it is not a Bayesian network classifier, we will not dwell on its learning metho... |

181 | Learning Bayesian Network Structures from Massive Datasets: The Sparse Candidate Algorithm
- Friedman, Nachman, et al.
- 1999
Citation Context ...xpert knowledge is not always readily available and handcrafting is infeasible for large networks with thousands of nodes. As a result, there has been a lot of research toward learning Bayesian networks [4, 14, 21, 15, 17]. These learning algorithms are unsupervised, in the sense that the learner does not differentiate between the attributes and the class variables. The objective is to find the network structure that b... |

171 | A guide to the literature on learning probabilistic networks from data
- Buntine
- 1996

132 | Large-Sample Learning of Bayesian Networks is NP-Hard
- Chickering, Heckerman, et al.
- 2004
Citation Context ...g network. However, many scoring functions do not accurately measure the goodness of the networks [12]. Moreover, it was recently proven that in general, identifying high-scoring structures is NP-hard [6]. Thus, we will concentrate our attention on two particular types of Bayesian networks that have known structures: the naive Bayes and the noisy-or models, which are shown in Figures 2 and 3 respecti... |

121 | An evaluation of naive Bayesian anti-spam filtering
- Androutsopoulos, Koutsias, et al.
Citation Context ...nship between the attributes. 1 Introduction Classification has applications in many different fields including biology [11], information retrieval [23, 24, 32], national security [1], spam filtering [3], etc. It is a basic task in performing data analysis or pattern recognition. The main goal of classification is to construct a function that will correctly assign instances of events or objects to th... |

116 | Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base
- Shwe, Middleton, et al.
- 1991
Citation Context ...work. For an example of a noisy-or network, see Figure 3. Introduced by Kim and Pearl in 1983 [20], the noisy-or gate has become widely used in many different fields, most notably in medical contexts [35, 25, 28]. However, unlike naive Bayes networks, there has been relatively little work on applying the noisy-or model to the classification problem. While naive Bayes is usually a complete network in and of it... |

108 | Some practical issues in constructing belief networks
- Henrion
- 1987
Citation Context ...With noise, each cause only produces the effect with a certain probability. Henrion later extended the noisy-or model for situations where the effect is present even when all of its causes are absent [18]. The extended model, the leaky noisy-or gate, is applicable to situations where a model does not capture all possible causes. Arguably, almost all situations encountered in practice belong to this cl... |
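Henrion's leaky gate as described here can be sketched under one common parameterization, in which the leak acts as an always-present extra cause. The function name and numbers are illustrative assumptions, not Henrion's notation:

```python
def leaky_noisy_or(parent_states, link_probs, leak):
    """P(Effect=1 | parents) for a leaky noisy-or gate: the leak is
    an always-active hidden cause, so the effect can occur with
    probability `leak` even when every modeled cause is absent."""
    p_fail = 1.0 - leak
    for active, p in zip(parent_states, link_probs):
        if active:
            p_fail *= 1.0 - p
    return 1.0 - p_fail
```

With both modeled causes absent, only the leak remains: `leaky_noisy_or([0, 0], [0.7, 0.5], 0.05)` gives 0.05, whereas the plain noisy-or gate would give 0.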

108 | Tackling the poor assumptions of naive Bayes text classifiers
- Rennie, Shih, et al.
- 2003
Citation Context ... as separating housekeeping genes from tissue specific genes [11], filtering spam emails [3], analyzing Arabic text related to fanaticism [1], and probably the most prominently in text classification [23, 24, 32, 33]. Naive Bayes is so successful that Rennie called it the “de-facto standard text classifier” [32]. Despite its shortcomings stemming from its strong independence assumptions, naive Bayes is competitiv... |

105 | Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context ...r The naive Bayes classifier is a simple, efficient, and compact classifier that is still competitive with state-of-the-art classifiers today. An early description of it can be found in Duda and Hart [10]. The first thing to notice about the naive Bayes network is its simplicity. Unlike Bayesian networks in general, all of its nodes have at most one parent. This is crucial because the size of the CPTs gro... |

96 | An empirical study of the naive Bayes classifier
- Rish
Citation Context ...main reason why simple models such as naive Bayes and noisy-or networks can be competitive as classifiers even though they do not approximate the probability distributions of the data well in general [9, 34]. With sufficient background information and a clear statement of the problem, we will dive into the full details of the classifiers (naive Bayes, noisy-or, and logistic regression) in section 2. We will fo... |

81 | A computational model for causal and diagnostic reasoning in inference systems
- Kim, Pearl
- 1983
Citation Context ...oisy-or classifiers are another type of Bayesian network classifier that reduces the complexity of the network. For an example of a noisy-or network, see Figure 3. Introduced by Kim and Pearl in 1983 [20], the noisy-or gate has become widely used in many different fields, most notably in medical contexts [35, 25, 28]. However, unlike naive Bayes networks, there has been relatively little work on apply... |

72 | A generalization of the Noisy-Or model
- Srinivas
- 1993
Citation Context ...tances the leak is always present by definition. Some other extensions of the noisy-or model include the introduction of multi-valued variables [18, 8] as well as nodes that include multiple outcomes [36, 8]. In these models, the variables are no longer binary. Rather, their values represent the degree of intensity. The degree of the outcome is a maximum of the degrees produced by the causes. Thus, this ... |

69 | Parameter adjustment in Bayes networks: The generalized noisy or-gate
- Diez
- 1993
Citation Context ...roduces an additional parameter called the leak probability, which is the combined effect of all causes not in the network. Diez later proposed an alternative way to formulate the leaky noisy-or gate [8]. The two proposals differ in how they define their parameters. In Henrion’s model, the probability for each cause to produce the effect is a combined influence of the cause in question and the leak... |

65 | Causal independence for probability assessment and inference using Bayesian networks
- Heckerman, Breese
- 1994
Citation Context ...come is a maximum of the degrees produced by the causes. Thus, this model of interaction could also be called a noisy-max. Heckerman went further and introduced independence of causal influence (ICI) [16, 38]. In an ICI model, the function need not be an OR any more. Examples include noisy-and, noisy-max, noisy-min, noisy-add, etc. Causal independence is a collection of conditional independence asserti... |
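The noisy-max interaction described in this passage can be sketched via CDFs: the maximum of independently produced degrees is ≤ y exactly when every cause's contribution is ≤ y, so the per-cause CDFs multiply. Everything below (graded degrees 0–2, the probability tables) is an invented illustration:

```python
def noisy_max_cdf(contrib_cdfs, y):
    """P(Effect <= y) under noisy-max, where contrib_cdfs[i][y] is
    P(cause i's contribution <= y) given that cause's current state.
    Independence of causal influence lets the CDFs multiply."""
    p = 1.0
    for cdf in contrib_cdfs:
        p *= cdf[y]
    return p

# Two active causes over effect degrees {0, 1, 2} (invented numbers).
c1 = {0: 0.5, 1: 0.8, 2: 1.0}
c2 = {0: 0.3, 1: 0.9, 2: 1.0}
p_le = [noisy_max_cdf([c1, c2], y) for y in (0, 1, 2)]
# Point probabilities of each effect degree (differences of the CDF):
p_deg = [p_le[0], p_le[1] - p_le[0], p_le[2] - p_le[1]]
```

With a single binary degree this reduces to the ordinary noisy-or gate, which is why noisy-max is presented as a generalization.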

57 | Learning augmented Bayesian classifiers: a comparison of distribution-based and classification-based approaches
- Keogh, Pazzani
- 1999
Citation Context ... value of a third node is known. Thus, the goal is to remove the independence assumptions between nodes that are the most correlated. However, learning tree-structured Bayesian networks is not trivial [19, 40]. As a result, in 2004, Peng et al proposed a model in between called the Chain augmented Naive Bayes (CAN) [31]. Instead of finding a tree structure, the problem reduces down to the simpler one of fi... |

47 | Augmenting Naive Bayes classifiers with statistical language models
- Peng, Schuurmans, et al.
- 2004
Citation Context ... most correlated. However, learning tree-structured Bayesian networks is not trivial [19, 40]. As a result, in 2004, Peng et al proposed a model in between called the Chain augmented Naive Bayes (CAN) [31]. Instead of finding a tree structure, the problem reduces down to the simpler one of finding a chain structure to augment the network. While the most intuitive approach to improving the naive Bayes m... |

42 | Bayesian Networks: A Model of Self-Activated Memory for Evidential Reasoning
- Pearl
- 1985
Citation Context ...s the name implies, these classifiers all use Bayesian networks to represent the relationships between the attributes and the class label. The term Bayesian networks was first coined by Pearl in 1985 [29]. It refers to a type of model that uses directed acyclic graphs (DAGs) and conditional probability tables (CPTs) to represent causal or temporal relationships between the nodes of the graphs. The gra... |

34 | Learning Bayesian Network Parameters from Small Data Sets: Application of Noisy-OR
- Onisko, Druzdzel, et al.
- 2001
Citation Context ...work. For an example of a noisy-or network, see Figure 3. Introduced by Kim and Pearl in 1983 [20], the noisy-or gate has become widely used in many different fields, most notably in medical contexts [35, 25, 28]. However, unlike naive Bayes networks, there has been relatively little work on applying the noisy-or model to the classification problem. While naive Bayes is usually a complete network in and of it... |

26 | Logistic discrimination
- Anderson
- 1982
Citation Context ...estigation into noisy-or classifiers could provide insight into how to improve the other types of classifiers. 2.3 Logistic Regression Logistic regression, sometimes called logistic discrimination [2], has long been a tool for the statistics community. It is a statistical regression model for binary dependent variables. It attempts to learn functions of the form f : X → Y . In the context of our c... |

18 | Generative and discriminative classifiers: naive Bayes and logistic regression. http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
- Mitchell
Citation Context ... equivalent classifiers will assign exactly the same class labels to the same instances. Additionally, both of these classifiers have the same expressive power as the logistic regression classifier [26, 38]. Despite their equivalent expressiveness, the three classifiers differ in how well they can learn from data. Logistic regression usually holds a slight advantage because it is well studied in statis... |

12 | Mining housekeeping genes with a Naive Bayes classifier
- Ferrari, Aitken
Citation Context ...d noisy-or model so that the network will more closely depict the actual relationship between the attributes. 1 Introduction Classification has applications in many different fields including biology [11], information retrieval [23, 24, 32], national security [1], spam filtering [3], etc. It is a basic task in performing data analysis or pattern recognition. The main goal of classification is to const... |

12 | Noisy-or classifier
- Vomlel
- 2006
Citation Context ... equivalent classifiers will assign exactly the same class labels to the same instances. Additionally, both of these classifiers have the same expressive power as the logistic regression classifier [26, 38]. Despite their equivalent expressiveness, the three classifiers differ in how well they can learn from data. Logistic regression usually holds a slight advantage because it is well studied in statis... |

5 | Reasoning about Bayesian network classifiers
- Chan, Darwiche
- 2003
Citation Context ...tic regression 3.2.1 From naive Bayes to logistic regression Given a naive Bayes classifier, the instance is classified to C = 1 when Pr(C = 0 | a) < t. We can rewrite the equation in log-odds space [5]. The condition for C = 1 would now be: log O(C = 0 | a) < log(t / (1 − t)) (12). Following Chan et al’s derivation [5], we have log O(C = 0 | a) = log O(C = 0) + ∑ log [Pr(aj | C = 0) / Pr(aj | C = 1)] An... |
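The quoted derivation (classify to C = 1 when the log-odds of C = 0 fall below log(t/(1 − t))) can be checked numerically. The sketch below follows the quoted form with invented parameters; it is not Chan et al.'s code:

```python
import math

def nb_log_odds_c0(attrs, prior_c0, p_attr_given_c):
    """log O(C=0 | a) for a binary-attribute naive Bayes model:
    prior log-odds plus one log likelihood ratio per attribute,
    as in the quoted derivation. Parameters are illustrative."""
    lo = math.log(prior_c0 / (1 - prior_c0))
    for j, a in enumerate(attrs):
        p0 = p_attr_given_c[0][j] if a else 1 - p_attr_given_c[0][j]
        p1 = p_attr_given_c[1][j] if a else 1 - p_attr_given_c[1][j]
        lo += math.log(p0 / p1)
    return lo

# Decision rule from the context: C = 1 iff
# log O(C=0 | a) < log(t / (1 - t)).
t = 0.5
params = {0: [0.2, 0.6, 0.3], 1: [0.7, 0.5, 0.9]}
label = int(nb_log_odds_c0([1, 0, 1], 0.4, params) < math.log(t / (1 - t)))
```

Because the left-hand side is a linear function of the attribute indicators, this is exactly a logistic-regression decision boundary, which is the expressiveness equivalence the abstract refers to.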

4 | Learning naive Bayes classifier from noisy data
- Yang, Xia, et al.
- 2003

4 | Learnability of augmented naive Bayes in nominal domains
- Zhang, Ling
- 2001
Citation Context ... value of a third node is known. Thus, the goal is to remove the independence assumptions between nodes that are the most correlated. However, learning tree-structured Bayesian networks is not trivial [19, 40]. As a result, in 2004, Peng et al proposed a model in between called the Chain augmented Naive Bayes (CAN) [31]. Instead of finding a tree structure, the problem reduces down to the simpler one of fi... |

3 | Multiple Explanations Driven Naïve Bayes Classifier
- Almonayyes
Citation Context ...t the actual relationship between the attributes. 1 Introduction Classification has applications in many different fields including biology [11], information retrieval [23, 24, 32], national security [1], spam filtering [3], etc. It is a basic task in performing data analysis or pattern recognition. The main goal of classification is to construct a function that will correctly assign instances of eve... |