## Classification using Hierarchical Naïve Bayes models (2002)


### Download Links

- [www.math.ntnu.no]
- [www.idi.ntnu.no]
- [www.cs.auc.dk]
- [springerlink.metapress.com]
- DBLP

### Other Repositories/Bibliography

Venue: Machine Learning 2006

Citations: 11 (1 self)

### BibTeX

```
@INPROCEEDINGS{Langseth02classificationusing,
  author    = {Helge Langseth and Thomas D. Nielsen},
  title     = {Classification using Hierarchical Na{\"i}ve Bayes models},
  booktitle = {Machine Learning 2006},
  year      = {2002},
  pages     = {63--135}
}
```


### Abstract

Classification problems have a long history in the machine learning literature. One of the simplest, yet most consistently well-performing, families of classifiers is the Naïve Bayes model. However, an inherent problem with these classifiers is the assumption that all attributes used to describe an instance are conditionally independent given the class of that instance. When this assumption is violated (which is often the case in practice), classification accuracy can suffer from "information double-counting" and the omission of attribute interactions.
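The conditional-independence assumption described in the abstract can be made concrete in a few lines. Below is a minimal sketch of a categorical Naïve Bayes classifier; the function names and toy data are hypothetical illustrations, not taken from the paper.

```python
import math
from collections import Counter, defaultdict

# Minimal categorical Naive Bayes sketch. The factorization
# P(c) * prod_i P(a_i | c) is exactly the conditional-independence
# assumption the abstract refers to.

def train_nb(instances, labels):
    class_counts = Counter(labels)
    attr_counts = defaultdict(int)  # (attr_index, value, class) -> count
    for x, c in zip(instances, labels):
        for i, v in enumerate(x):
            attr_counts[(i, v, c)] += 1
    return class_counts, attr_counts

def predict_nb(x, class_counts, attr_counts, alpha=1.0):
    n = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, nc in class_counts.items():
        score = math.log(nc / n)  # log prior
        for i, v in enumerate(x):
            # Laplace-smoothed P(a_i = v | c); fixed pseudo-count denominator
            # is a simplification (a full version would use the domain size)
            score += math.log((attr_counts[(i, v, c)] + alpha) / (nc + 2 * alpha))
        if score > best_score:
            best, best_score = c, score
    return best
```

When two attributes are strongly correlated given the class, each contributes a nearly identical log-likelihood term to the sum, which is the "information double-counting" the abstract mentions.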

### Citations

8073 | Maximum Likelihood from Incomplete Data via the EM algorithm - Dempster, Laird, Rubin - 1977 |

7042 |
Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988
Citation Context ...and Section 5 is devoted to empirical results. We discuss some aspects of the algorithm in further detail in Section 6 and conclude in Section 7. 2 Bayesian classifiers A Bayesian network (BN) (Pearl 1988; Jensen 2001) is a powerful tool for knowledge representation, as it provides a compact representation of a joint probability distribution over a set of variables. Formally, a BN over a set of discre... |

3915 |
Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context ...decades, can therefore be seen as a model selection process where the task is to find the single model, from some set of models, with the highest classification accuracy. The Naïve Bayes (NB) models (Duda and Hart 1973) is a set of particularly simple models which has shown to offer very good classification accuracy. NB models assume that all attributes are conditionally independent given the class, but this assu... |

2862 |
UCI repository of machine learning databases
- Blake, Merz
- 1998
Citation Context ...will investigate the merits of the proposed learning algorithm by using it to learn classifiers for a number of different domains. All data-sets are taken from the Irvine Machine Learning Repository (Blake and Merz 1998), see Table 1 for a summary of the 22 datasets used in this empirical study. We have compared the results of the HNB classifier to those of the Naïve Bayes model (Duda and Hart 1973), the TAN model (... |

2300 |
Estimating the dimension of a model
- Schwarz
- 1978
Citation Context ...The idea to use HNBs in classification was first explored by Zhang et al. (2002). Zhang et al. (2002) search for the model maximizing the BIC score, which is a form of penalized log likelihood, see (Schwarz 1978); hence they look for a scientific model (Cowell et al. 1999) where the key is to find an interesting latent structure. In this paper we take the technological modelling approach: Our goal is mainly ... |

1282 | Local computations with probabilities on graphical structures and their application to expert systems - Lauritzen, Spiegelhalter - 1988 |

1237 |
On information and sufficiency
- Kullback, Leibler
- 1951
Citation Context ...P(C|X, Y) and the probability distribution P'(C|X, Y), where the latter is encoded by the model where X ⊥ Y | C. This distance can be described using the well-known Kullback-Leibler (KL) divergence (Kullback and Leibler 1951) averaged over the possible states of X and Y: E(KL(P; P') | X, Y) = Σ_{x,y} P(x, y) Σ_c P(c|x, y) log [P(c|x, y) / P'(c|x, y)]. In the context of classification, this distance measure can also be... |
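The averaged KL divergence quoted in the excerpt above is straightforward to compute once both class conditionals are tabulated. A sketch, with hypothetical toy distributions (not from the paper):

```python
import math

# E[KL(P; P') | X, Y] = sum_{x,y} P(x,y) * sum_c P(c|x,y) * log(P(c|x,y)/P'(c|x,y))
# p_joint maps (x, y) -> P(x, y); p_cond and q_cond map (x, y) -> {c: prob}.
# All inputs here are assumed/hypothetical; the formula is from the excerpt.

def expected_kl(p_joint, p_cond, q_cond):
    total = 0.0
    for xy, pxy in p_joint.items():
        for c, p in p_cond[xy].items():
            if p > 0.0:  # 0 * log(0/q) is taken as 0
                total += pxy * p * math.log(p / q_cond[xy][c])
    return total
```

The measure is zero exactly when the factored model reproduces the true class posterior for every attribute configuration, and positive otherwise.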

1158 |
Modeling by Shortest Data Description
- Rissanen
- 1978
Citation Context ...a is used to give each possible classifier a score which signals its appropriateness as a classification model. One such scoring function is based on the minimum description length (MDL) principle (Rissanen 1978; Lam and Bacchus 1994): MDL(B | D_N) = (log N / 2) ‖Θ_{B_S}‖ − Σ_{i=1}^{N} log P_B(c^(i), a^(i) | Θ̂_{B_S}). (2) That is, the best scoring model is the one that minimizes MDL(·|D_N), where Θ̂_{B_S} ... |
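Reading Equation (2) above as a penalized negative log-likelihood, the score is easy to sketch in code. The function name and inputs are assumptions for illustration; the per-case log-likelihoods would come from the candidate network's maximum-likelihood parameters.

```python
import math

# MDL(B | D_N) = (log N / 2) * dim(Theta_BS) - sum_i log P_B(c_i, a_i | ML params)
# num_free_params is the dimension of the parameter space; per_case_log_liks
# holds one log-likelihood per training case. Lower scores are better.

def mdl_score(num_free_params, per_case_log_liks):
    n = len(per_case_log_liks)
    structure_cost = (math.log(n) / 2.0) * num_free_params
    data_fit = sum(per_case_log_liks)
    return structure_cost - data_fit
```

Richer structures lower the data-fit term but pay for it in the structure-cost term, which is what makes the score a model-selection criterion rather than a pure goodness-of-fit measure.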

1025 | Wrappers for feature subset selection
- Kohavi, John
- 1997
Citation Context ...this problem, methods for handling the conditional dependence between the attributes have become a lively research area; these methods are typically grouped into three categories: Feature selection (Kohavi and John 1997), feature grouping (Kononenko 1991; Pazzani 1995), and correlation modelling (Friedman et al. 1997). The approach taken in this paper is based on correlation modelling using Hierarchical Naïve Bayes ... |

752 | A study of cross-validation and bootstrap for accuracy estimation and model selection
- Kohavi
- 1995
Citation Context ...ach classifier (in percentage of instances which are correctly classified), and give a standard deviation of this estimate. The standard deviations are the theoretical values calculated according to (Kohavi 1995), and are not necessarily the same as the empirical standard deviations observed during cross validation. For comparison of the algorithms 5 We used Clementine (SPSS Inc. 2002) to generate the C5.0 a... |

652 |
Multi-interval discretization of continuous-valued attributes for classification learning
- Fayyad, Irani
- 1993
Citation Context ...f neural networks with one hidden layer trained by backpropagation. 5 As some of the learning algorithms require discrete variables, the attributes were discretized using the entropy-based method of (Fayyad and Irani 1993). In addition, instances containing missing attribute-values were removed; all pre-processing was performed using MLC++ (Kohavi et al. 1994). The accuracy-results are given in Table 2. For each datas... |

637 | Approximating discrete probability distributions with dependence trees - Chow, Liu - 1968 |

624 |
Probabilistic networks and expert systems
- Cowell, Dawid, et al.
- 1999
Citation Context ...red by Zhang et al. (2002). Zhang et al. (2002) search for the model maximizing the BIC score, which is a form of penalized log likelihood, see (Schwarz 1978); hence they look for a scientific model (Cowell et al. 1999) where the key is to find an interesting latent structure. In this paper we take the technological modelling approach: Our goal is mainly to build an accurate classifier. As a spin-off we also provid... |

601 | On the Optimality of the Simple Bayesian Classifier under Zero-One Loss
- Domingos, Pazzani
- 1997
Citation Context ...isingly good classification results. Recent research into explaining the merits of the NB model has emphasized the difference between the 0/1-loss function and the log-loss, see e.g. (Friedman 1997; Domingos and Pazzani 1997). Friedman (1997, p. 76) concludes: Good probability estimates are not necessary for good classification; similarly, low classification error does not imply that the corresponding class probabilities... |

587 | Bayesian network classifiers
- Friedman, Geiger, et al.
- 1997
Citation Context ...lively research area; these methods are typically grouped into three categories: Feature selection (Kohavi and John 1997), feature grouping (Kononenko 1991; Pazzani 1995), and correlation modelling (Friedman et al. 1997). The approach taken in this paper is based on correlation modelling using Hierarchical Naïve Bayes (HNB) models, see (Zhang et al. 2002). HNBs are tree-shaped Bayesian networks, with latent variable... |

571 |
Bayesian Networks and Decision Graphs
- Jensen
- 2002
Citation Context ...ction 5 is devoted to empirical results. We discuss some aspects of the algorithm in further detail in Section 6 and conclude in Section 7. 2 Bayesian classifiers A Bayesian network (BN) (Pearl 1988; Jensen 2001) is a powerful tool for knowledge representation, as it provides a compact representation of a joint probability distribution over a set of variables. Formally, a BN over a set of discrete random var... |

496 | Causation, Prediction, and Search - Spirtes, Glymour, et al. - 1993 |

442 | The discipline of machine learning - Mitchell - 2006 |

441 |
Graphical models in applied multivariate statistics
- Whittaker
- 1990
Citation Context ...hildren having large domains. Instead we utilize that: 2N · I(X, Y | C) →_L χ² with |sp(C)| (|sp(X)| − 1) (|sp(Y)| − 1) degrees of freedom, where →_L means convergence in distribution as N → ∞, see e.g. (Whittaker 1990). Finally, we calculate Q(X, Y | D_N) = P(Z ≤ 2N · I(X, Y | C)), (4) where Z is χ²-distributed with |sp(C)| (|sp(X)| − 1) (|sp(Y)| − 1) degrees of freedom. The pairs {X, Y} are ordered according to ... |
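Equation (4) above only needs a chi-squared CDF evaluated at the statistic 2N · I(X, Y | C). A stdlib-only sketch, using the standard series expansion of the regularized lower incomplete gamma function (the function names and inputs are assumptions for illustration):

```python
import math

# chi2 CDF via the regularized lower incomplete gamma function P(a, z),
# computed with the series P(a, z) = z^a e^{-z} / Gamma(a) * sum_k z^k / (a)_{k+1}.

def chi2_cdf(x, dof):
    if x <= 0:
        return 0.0
    a = dof / 2.0
    z = x / 2.0
    term = 1.0 / a       # k = 0 term of the series
    total = term
    k = 1
    while True:
        term *= z / (a + k)
        total += term
        if term < 1e-12 * total:
            break
        k += 1
    log_cdf = a * math.log(z) - z - math.lgamma(a) + math.log(total)
    return math.exp(log_cdf)

def q_score(n, cond_mi, states_c, states_x, states_y):
    # Q(X, Y | D_N) = P(Z <= 2N * I(X, Y | C)), Z ~ chi^2 with
    # |sp(C)| * (|sp(X)| - 1) * (|sp(Y)| - 1) degrees of freedom.
    dof = states_c * (states_x - 1) * (states_y - 1)
    return chi2_cdf(2.0 * n * cond_mi, dof)
```

In practice a library routine (e.g. `scipy.stats.chi2.cdf`) would replace the hand-rolled CDF; the point here is only how the degrees of freedom and the statistic fit together.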

367 | On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes - Ng, Jordan - 2002 |

288 | Context-specific independence in Bayesian networks
- Boutilier, Friedman, et al.
- 1996
Citation Context ...ning induced angina has no effect in the model if chest pain is not of this particular type. Note that the classifier in this example uses the latent variable to encode context specific independence (Boutilier et al. 1996). 6 Discussion 6.1 Parameter learning The parameters in the model are estimated by their maximum likelihood values. This may not be optimal for classification, and recent research has shown some i... |

188 | Learning Bayesian Belief Networks : An Approach Based on
- Lam, Bacchus
- 1994
Citation Context ...ve each possible classifier a score which signals its appropriateness as a classification model. One such scoring function is based on the minimum description length (MDL) principle (Rissanen 1978; Lam and Bacchus 1994): MDL(B | D_N) = (log N / 2) ‖Θ_{B_S}‖ − Σ_{i=1}^{N} log P_B(c^(i), a^(i) | Θ̂_{B_S}). (2) That is, the best scoring model is the one that minimizes MDL(·|D_N), where Θ̂_{B_S} is the maximum lik... |

158 | Adaptive probabilistic networks with hidden variables - Binder, Koller, et al. - 1997 |

142 | Inference for the generalization error - Nadeau, Bengio - 2003 |


129 | Selection of relevant features in machine learning - Langley, Sage - 1994 |

109 |
Semi-naive Bayesian classifier
- Kononenko
- 1991
Citation Context ...onditional dependence between the attributes have become a lively research area; these methods are typically grouped into three categories: Feature selection (Kohavi and John 1997), feature grouping (Kononenko 1991; Pazzani 1995), and correlation modelling (Friedman et al. 1997). The approach taken in this paper is based on correlation modelling using Hierarchical Naïve Bayes (HNB) models, see (Zhang et al. 200... |

106 | Probability propagation - Shafer, Shenoy - 1990 |

98 |
Data mining using MLC++: A machine learning library in C++
- Kohavi, Sommerfield
- 1996
Citation Context ...utes were discretized using the entropy-based method of (Fayyad and Irani 1993). In addition, instances containing missing attribute-values were removed; all pre-processing was performed using MLC++ (Kohavi et al. 1994). The accuracy-results are given in Table 2. For each dataset we have estimated the accuracy of each classifier (in percentage of instances which are correctly classified), and give a standard deviat... |

70 | A new look at causal independence
- HECKERMAN, BREESE
- 1994
Citation Context ..., L = s) appears in case D_i, and 0 otherwise; N(s) = Σ_{c∈sp(C)} N(c, s). Note that Equation 5 is in fact an equality if the relationship between C and ch(C) satisfy independence of causal influence (Heckerman and Breese 1994). States are collapsed in a greedy manner, i.e., we find the pair of states with highest ∆L(l_i, l_j | D_N) and collapse those two states if ∆L(l_i, l_j | D_N) > 0. This is repeated (making use of local decomp... |

69 | Searching for dependencies in bayesian classifiers
- Pazzani
- 1996
Citation Context ...dence between the attributes have become a lively research area; these methods are typically grouped into three categories: Feature selection (Kohavi and John 1997), feature grouping (Kononenko 1991; Pazzani 1995), and correlation modelling (Friedman et al. 1997). The approach taken in this paper is based on correlation modelling using Hierarchical Naïve Bayes (HNB) models, see (Zhang et al. 2002). HNBs are t... |

67 | Hidden naive bayes - Zhang, Jiang, et al. - 2005 |

62 | Learning Bayesian network classifiers by maximizing conditional likelihood - Grossman, Domingos - 2004 |

52 |
On bias, variance, 0/1 loss and the curse of dimensionality
- Friedman
- 1997
Citation Context ...at is, model search based on MDLp is guaranteed to select the best classifier w.r.t. both log-loss and 0/1-loss when N → ∞. Unfortunately, though, the score may not be successful for finite data sets (Friedman 1997). To overcome this potential drawback, Kohavi and John (1997) describe the wrapper approach. Informally, this method amounts to estimating the accuracy of a given classifier by cross validation (based... |

49 | Induction of Recursive Bayesian Classifiers - Langley - 1993 |

47 | Learning Bayesian nets that perform well
- Greiner, Grove, et al.
- 1997
Citation Context ...maximum likelihood estimate of the parameters in the model, and ‖Θ_{B_S}‖ is the dimension of the parameter space (i.e., the number of free parameters in the model). However, as pointed out in (Greiner et al. 1997; Friedman et al. 1997) a "global" criteria like MDL may not be well suited for learning a classifier, as: Σ_{i=1}^{N} log P_B(c^(i), a^(i)) = Σ_{i=1}^{N} log P_B(c^(i) | a^(i)) + Σ_{i=1}^{N} log P_B... |

47 | Hierarchical latent class models for cluster analysis
- Zhang
- 2004
Citation Context ...+log(n))). 3 Hierarchical Naïve Bayes models A special class of Bayesian networks is the so-called Hierarchical Naïve Bayes (HNB) models, a concept first introduced by Zhang et al. (2002), see also (Zhang 2002; Kočka and Zhang 2002). An HNB is a tree-shaped Bayesian network, where the variables are partitioned into three disjoint sets: {C} is the class variable, A is the set of attributes, and L is a set o... |


37 | Constructive induction of Cartesian product attributes
- Pazzani
- 1996
Citation Context ...dence between the attributes have become a lively research area; these methods are typically grouped into three categories: Feature selection (Kohavi and John 1997), feature grouping (Kononenko 1991; Pazzani 1996a), and correlation modelling (Friedman et al. 1997). The approach taken in this paper is based on correlation modelling using Hierarchical Naïve Bayes (HNB) models (Zhang 2004; Zhang et al. 2003), ... |

36 | Lazy propagation in junction trees - Madsen, Jensen - 1998 |


24 | Learning the dimensionality of hidden variables
- Elidan, Friedman
- 2001
Citation Context ...P_H(c_D | a_D), where f(D, l_i, l_j) is true if case D includes either {L = l_i} or {L = l_j}; cases which do not include these states cancel out. This is also referred to as local decomposability in (Elidan and Friedman 2001), i.e., the gain of collapsing two states l_i and l_j is local to those states and it does not depend on whether or not other states have been collapsed. In order to avoid considering all possible comb... |


21 | Discrete factor analysis: Learning hidden variables in Bayesian network - Martin, Vanlehn - 1995 |

15 | Dimension correction for hierarchical latent class models, Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence
- Kocka, Zhang
- 2002
Citation Context ...3 Hierarchical Naïve Bayes models A special class of Bayesian networks is the so-called Hierarchical Naïve Bayes (HNB) models, a concept first introduced by Zhang et al. (2002), see also (Zhang 2002; Kočka and Zhang 2002). An HNB is a tree-shaped Bayesian network, where the variables are partitioned into three disjoint sets: {C} is the class variable, A is the set of attributes, and L is a set of latent (or hidden) v... |

13 | C5.0: An Informal Tutorial - Quinlan |


7 | Searching for dependencies in Bayesian classifiers, in: Learning from Data: Artificial Intelligence and Statistics V - Pazzani - 1997 |

6 | Probabilistic classifiers and the concepts they recognize - Jaeger |

6 | When discriminative learning of Bayesian network parameters is easy
- Wettig, Grünwald, et al.
- 2003
Citation Context ...umber of existing tools may be able to improve the classification accuracy even further. These include feature selection (Kohavi and John 1997), and supervised learning of the probability parameters (Wettig et al. 2003). Finally, the proposed learning algorithm also provides an explicit semantics for the latent structure of a model. This allows a decision maker to easily deduce the rules which govern the classifica... |