## Supervised classification with conditional Gaussian networks: Increasing the structure complexity from naive Bayes

Venue: International Journal of Approximate Reasoning

Citations: 5 (0 self)

### BibTeX

@ARTICLE{Pérez_supervisedclassification,

author = {Aritz Pérez and Pedro Larrañaga and Iñaki Inza},

title = {Supervised classification with conditional Gaussian networks: Increasing the structure complexity from naive Bayes},

journal = {International Journal of Approximate Reasoning},

year = {2006},

volume = {43}

}

### Abstract

Most Bayesian network-based classifiers can handle only discrete variables, yet most real-world domains involve continuous variables. A common practice is to discretize continuous variables, with a subsequent loss of information. This work shows how discrete classifier induction algorithms can be adapted to the conditional Gaussian network paradigm so that continuous variables are handled without discretization. In addition, three novel classifier induction algorithms and two new propositions about mutual information are introduced. The classifier induction algorithms presented are ordered and grouped by structural complexity: naive Bayes, tree augmented naive Bayes, k-dependence Bayesian classifiers and semi naive Bayes. All of the classifier induction algorithms are empirically evaluated using predictive accuracy and compared to linear discriminant analysis, a classic statistical benchmark classifier for continuous data. The accuracies of a set of state-of-the-art classifiers are also included in order to justify the use of linear discriminant analysis as the benchmark algorithm. To better understand the behavior of the conditional Gaussian network-based classifiers, the results include a bias-variance decomposition of the expected misclassification rate. The study suggests that the classifiers based on the semi naive Bayes structure, and especially the novel wrapper condensed semi naive Bayes backward, outperform the rest of the presented classifiers, and obtain quite competitive results compared to the state-of-the-art algorithms included. Key words: conditional Gaussian network, Bayesian network, naive Bayes, tree augmented naive Bayes, k-dependence Bayesian classifiers, semi naive Bayes, filter, wrapper.
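To make the simplest structure in the abstract concrete, here is a minimal sketch (not the authors' implementation) of a Gaussian naive Bayes classifier: each continuous predictor is modeled by a univariate Gaussian conditioned on the class, and prediction picks the class that maximizes the log-posterior under the conditional independence assumption.

```python
import math

def fit_gaussian_nb(X, y):
    """Estimate the class prior and a per-feature (mean, variance) pair per class."""
    stats = {}
    n = len(y)
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / n
        params = []
        for j in range(len(X[0])):
            col = [r[j] for r in rows]
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col)
            params.append((mu, max(var, 1e-9)))  # guard against zero variance
        stats[c] = (prior, params)
    return stats

def log_gauss(x, mu, var):
    """Log-density of N(mu, var) at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def predict(stats, x):
    """argmax_c [ log P(c) + sum_i log p(x_i | c) ]."""
    best, best_score = None, float("-inf")
    for c, (prior, params) in stats.items():
        score = math.log(prior) + sum(
            log_gauss(xi, mu, var) for xi, (mu, var) in zip(x, params))
        if score > best_score:
            best, best_score = c, score
    return best
```

The richer structures the paper studies (TAN, kDB, semi naive Bayes) relax the independence assumption by allowing arcs between predictors, at the cost of estimating more parameters.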

### Citations

8563 |
Elements of Information Theory
- Cover, Thomas
- 1991
Citation Context ...ed on intrinsic characteristics of the data [45]. The advantages of filter approaches are related to the time complexity needed to make the selection. For example, a score based on information theory [6] used to select variables in a filter manner (entropy and mutual information measures), is correlation based feature selection (CFS) [26,64]. More examples based on information theory are the approach...

7342 |
Genetic Algorithms and
- Goldberg, Holland
- 1988
Citation Context ...ure work line, related to the wrapper approach, consists in adapting more classifiers supported by BMN to directly operate with continuous variables. Randomized heuristics (such as genetic algorithms [23] or estimation of distribution algorithms [41]) could be used as the search engine in the space of classifier structures. 5 Acknowledgments This work was supported in part by a PhD purpose grant from the...

7052 | Probabilistic Reasoning in Intelligent Systems - Pearl - 1988

4934 |
C4.5: Programs for Machine Learning
- Quinlan
- 1993
Citation Context ... that LDA obtains competitive results compared with the following set of well-known state-of-the-art algorithms: kNN [7] with different k, discrete versions of NB [11] and TAN [17], ID3 [53] and C4.5 [54], and Multilayer Perceptron (MP) [56] (all of them implemented in the Weka 3.4.3 statistical package [62]). The estimated predictive accuracies summarized in Table 2 have been obtained, for each classifie...

3921 |
Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context ...inuous variable is present, it must be discretized, with a subsequent loss of information [63]. A battery of BMN-based classifier induction algorithms has been proposed in the literature: naive Bayes [11,39,46], tree augmented Bayesian network [17], k-dependence Bayesian classifier [57] and semi naive Bayes [37,49]. In the presence of continuous variables, another alternative is to assume that continuous va...

3354 | Induction of Decision Trees
- Quinlan
- 1986
Citation Context ... Table 2 shows that LDA obtains competitive results compared with the following set of well-known state-of-the-art algorithms: kNN [7] with different k, discrete versions of NB [11] and TAN [17], ID3 [53] and C4.5 [54], and Multilayer Perceptron (MP) [56] (all of them implemented in the Weka 3.4.3 statistical package [62]). The estimated predictive accuracies summarized in Table 2 have been obtained, for ...

2966 |
Data mining : practical machine learning tools and techniques with Java implementations
- Witten, Frank
- 2000
Citation Context ...algorithms: kNN [7] with different k, discrete versions of NB [11] and TAN [17], ID3 [53] and C4.5 [54], and Multilayer Perceptron (MP) [56] (all of them implemented in the Weka 3.4.3 statistical package [62]). The estimated predictive accuracies summarized in Table 2 have been obtained, for each classifier at each data set, by a 10-fold cross-validation process. In order to learn the discrete classifiers...

2014 |
Principal Component Analysis
- Jolliffe
- 2002
Citation Context ...ial variables which usually are mutually independent and capture much of the information of the original space. Standard transformation of the space of variables includes principal component analysis [32]. The variable selection techniques (see [25]) can be divided into two groups depending on the nature of the search score used by the selection process: filter [45] and wrapper approaches [35]. The sc...

1139 |
An Introduction to Multivariate Statistical Analysis
- Anderson
- 1985
Citation Context ...s, which lead to a decrease in the predictive accuracy. On the other hand, if a joint variable Y_k consists of a set of Gaussian variables, we propose that it follows a multidimensional normal distribution [1] conditioned to the class variable. The joint density is given by: p(y_k | c) = (2π)^(−m_k/2) |Σ_k^c|^(−1/2) exp(−(1/2)(y_k − µ_k^c)^T (Σ_k^c)^(−1)(y_k − µ_k^c)), where Σ_k^c is the covariance matrix conditioned t...
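The class-conditional multivariate normal density mentioned in the context above can be evaluated directly from the conditional mean vector and covariance matrix. As an illustration only (restricted to two dimensions so the determinant and inverse stay explicit), a sketch:

```python
import math

def mvn_density_2d(y, mu, cov):
    """Density of a 2-D Gaussian N(mu, cov) at y, with cov = [[a, b], [b, d]]."""
    a, b = cov[0]
    _, d = cov[1]
    det = a * d - b * b                      # determinant of the 2x2 covariance
    inv = [[d / det, -b / det],
           [-b / det, a / det]]              # explicit 2x2 inverse
    dy = [y[0] - mu[0], y[1] - mu[1]]
    # quadratic form (y - mu)^T cov^{-1} (y - mu)
    quad = (dy[0] * (inv[0][0] * dy[0] + inv[0][1] * dy[1])
            + dy[1] * (inv[1][0] * dy[0] + inv[1][1] * dy[1]))
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))
```

In a conditional Gaussian network classifier one such density would be estimated per class value c, with the class-conditional mean and covariance computed from the training cases of that class.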

1102 |
Graphical Models
- LAURITZEN
- 1996
Citation Context ... assigns a class label to instances described by a set of variables. There are numerous classifier paradigms, among which Bayesian networks (BN) [48,50], based on probabilistic graphical models (PGMs) [3,42], are very effective and well-known in domains with uncertainty. A Bayesian network is a directed acyclic graph of nodes representing variables and arcs representing conditional (in)dependence relatio...

1031 | Wrappers for feature subset selection
- Kohavi, John
- 1997
Citation Context ...nalysis [32]. The variable selection techniques (see [25]) can be divided into two groups depending on the nature of the search score used by the selection process: filter [45] and wrapper approaches [35]. The scores used in the filter approaches are based on intrinsic characteristics of the data [45]. The advantages of filter approaches are related to the time complexity needed to make the selection....

903 | Learning Bayesian networks: The combination of knowledge and statistical data
- Heckerman, Geiger, et al.
- 1995
Citation Context ...ctors included in the model. The structures of these classifiers range from the simplest naive Bayes structure to complete graphs. A classifier based on BNs can be constructed from a Bayesian approach [2,19,22,27]. It takes into account all possible models and all possible parameters, restricted to a special kind of structure and a family of probability functions. However, the classifiers included in this pape...

741 |
UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/~mlearn/MLRepository.html
- Murphy, Aha
- 1992
Citation Context ...er to study the nature of the error of the CGN-based classifiers, Kohavi and Wolpert’s bias-variance decomposition [36] is performed. The results have been obtained in eleven UCI repository data sets [47], which only contain continuous predictor variables. In order to interpret the results, we must take into account that most parts of the UCI repository data sets are already preprocessed [34]: in the...

693 |
Optimal Statistical Decisions
- DeGroot
- 1970
Citation Context ...tion, is symmetric. The classification process can be done in the following way with a CGN: P(c | x) ∝ p(c, x) = P(c) p(x | c) = P(c) ∏_{i=1}^{n} p(x_i | pa_i) (1), where pa_i denotes a value of Pa_i. Moreover [3,8,19], p(x_i | pa_i) ∼ N(m_{i|c}, v_{i|c}) (2), where m_{i|c} and v_{i|c} are defined as follows [19]: m_{i|c} = µ_{i|c} + Σ_{j=1}^{n_i} β_{ij|c}(x_j − µ_{j|c}) (3) and v_{i|c} = |Σ_{X_i,P_{X_i}|c}| / |Σ_{P_{X_i}|c}|, where P_{X_i} is the set of continuous predictors ...
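For a node with a single continuous parent, the conditional mean and variance in the snippet above reduce to ordinary linear-regression quantities of the jointly Gaussian pair. A small sketch (the parameter names are ours, not the paper's):

```python
def conditional_gaussian_1parent(mu_i, mu_j, var_i, var_j, cov_ij, x_j):
    """Parameters of p(x_i | x_j) when (X_i, X_j) is jointly Gaussian.

    beta is the regression coefficient of X_i on X_j; the conditional
    variance does not depend on the observed parent value x_j.
    """
    beta = cov_ij / var_j                 # regression coefficient
    m = mu_i + beta * (x_j - mu_j)        # conditional mean, cf. equation (3)
    v = var_i - cov_ij ** 2 / var_j       # conditional variance
    return m, v
```

With several parents the scalar quantities become the vector/matrix analogues shown in the snippet: the betas come from the covariance of X_i with its parents, and the conditional variance is the ratio of covariance-matrix determinants.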

690 | An Introduction to Variable and Feature Selection
- Guyon, Elisseeff
- 2003
Citation Context ...pendent and capture much of the information of the original space. Standard transformation of the space of variables includes principal component analysis [32]. The variable selection techniques (see [25]) can be divided into two groups depending on the nature of the search score used by the selection process: filter [45] and wrapper approaches [35]. The scores used in the filter approaches are based ...

653 |
Multi-interval discretization of continuous-valued attributes for classification learning
- Fayyad, Irani
- 1993
Citation Context ...ch data set, by a 10-fold cross-validation process. In order to learn the discrete classifiers presented in Table 2 (NB, TAN and ID3), data sets have been discretized with the Fayyad and Irani method [14]. The parameters for the fkDB, wkDB, wSemiF and wSemiB algorithms are the following: (1) fkDB with k = 1. We have checked that fkDB obtains the best scores at k = 1. (2) wkDB with k = n − 1. Bear i...

637 | Approximating discrete probability distributions with dependence trees
- Chow, Liu
- 1968
Citation Context ...tree structure that maximizes the likelihood given the data. Hence, fTAN is considered a pure filter algorithm. Friedman et al.’s algorithm [17] follows the general outline of Chow and Liu’s procedure [5], but instead of using the mutual information between two variables, it uses class conditional mutual information between predictors given the class variable to construct the maximal weighted spanning...
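For jointly Gaussian variables, the mutual information used as an edge weight in such a Chow-Liu style procedure has a standard closed form in the correlation coefficient, I(X; Y) = −(1/2) ln(1 − ρ²); this is a textbook identity, not necessarily the paper's exact proposition. A sketch (the class-conditional variant would apply this per class value, with class-conditional correlations):

```python
import math

def gaussian_mutual_information(rho):
    """I(X;Y) = -0.5 * ln(1 - rho^2) for jointly Gaussian X, Y
    with correlation coefficient rho in (-1, 1)."""
    return -0.5 * math.log(1.0 - rho * rho)
```

The weight is zero for uncorrelated (hence independent, in the Gaussian case) variables and grows without bound as |ρ| approaches 1, so the maximal weighted spanning tree links the most strongly correlated predictors.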

609 |
Neural networks and the bias/variance dilemma
- Geman, Bienenstock, et al.
- 1992
Citation Context ...position can be useful to explain the behaviors of the different algorithms [59]. The concept of bias-variance decomposition was introduced to machine learning for mean squared error by Geman et al. [21]. Later versions for zero-one-loss functions were given by Friedman [16], Kohavi and Wolpert [36], Domingos [9] and James [28]. The decompositions have been performed following Kohavi and Wolpert’s pr...
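A sketch of the per-test-point terms in Kohavi and Wolpert's zero-one-loss decomposition may help: given the predictions of classifiers trained on N different training sets, and assuming a deterministic true label (so the noise term is zero), the squared bias and variance at one test point are computed from the empirical distribution of predictions.

```python
from collections import Counter

def kw_bias_variance(preds, true_label):
    """Kohavi-Wolpert bias^2 and variance at one test point.

    preds: class labels predicted by classifiers trained on N training sets.
    Assumes the true label is deterministic (zero intrinsic noise).
    """
    n = len(preds)
    phat = {c: k / n for c, k in Counter(preds).items()}
    labels = set(phat) | {true_label}
    # bias^2: squared distance between the true (degenerate) distribution
    # and the distribution of predictions
    bias2 = 0.5 * sum(((c == true_label) - phat.get(c, 0.0)) ** 2
                      for c in labels)
    # variance: spread of the prediction distribution itself
    variance = 0.5 * (1.0 - sum(p * p for p in phat.values()))
    return bias2, variance
```

Averaging these terms over the test set gives the decomposition of the expected misclassification rate reported in the paper's tables (there with N = 20 training sets).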

601 | On the Optimality of the Simple Bayesian Classifier under Zero-One Loss
- Domingos, Pazzani
- 1997
Citation Context ...B structure. The accuracy obtained with this classifier in its discrete version is surprisingly high in some domains, even in data sets that do not obey the strong conditional independence assumption [10]. Thanks to the conditional independence assumption, the factorization of the joint probability is greatly simplified. A NB classifier structure example is shown in Figure 1(a), where each variable is...

589 | Bayesian network classifiers
- Friedman, Geiger, et al.
- 1997
Citation Context ...etized, with a subsequent loss of information [63]. A battery of BMN-based classifier induction algorithms has been proposed in the literature: naive Bayes [11,39,46], tree augmented Bayesian network [17], k-dependence Bayesian classifier [57] and semi naive Bayes [37,49]. In the presence of continuous variables, another alternative is to assume that continuous variables are sampled from a Gaussian di...

572 |
Applied Multivariate Statistical Analysis
- Johnson, Wichern
- 1998
Citation Context .... Therefore, a complete graph can be seen as the exact factorization of P(c, x). The wrapper condensed semi naive Bayes backward (wCSemiB) structure is shown in Figure 3. Quadratic discriminant analysis [31], taking into account the class distribution P(C), and a CSemi structure represent an equivalent discrimination rule, given the set of predictor variables included. The number of parameters necessary ...

334 | An analysis of Bayesian classifiers
- Langley, Iba, et al.
- 1992
Citation Context ...inuous variable is present, it must be discretized, with a subsequent loss of information [63]. A battery of BMN-based classifier induction algorithms has been proposed in the literature: naive Bayes [11,39,46], tree augmented Bayesian network [17], k-dependence Bayesian classifier [57] and semi naive Bayes [37,49]. In the presence of continuous variables, another alternative is to assume that continuous va...

312 | Estimating continuous distributions in Bayesian classifiers
- John, Langley
- 1995
Citation Context ...rete variable cannot have continuous parents. Although the Gaussian assumption for continuous variables is very strong, it usually provides a reasonable approximation to many real-world distributions [30]. The classifiers, induced by the algorithms presented in this paper, are restricted to CGN models with continuous predictor variables and a discrete class variable, which is the parent of all predicto...

286 |
Principles of Neurodynamics
- Rosenblatt
- 1962
Citation Context ... compared with the following set of well-known state-of-the-art algorithms: kNN [7] with different k, discrete versions of NB [11] and TAN [17], ID3 [53] and C4.5 [54], and Multilayer Perceptron (MP) [56] (all of them implemented in the Weka 3.4.3 statistical package [62]). The estimated predictive accuracies summarized in Table 2 have been obtained, for each classifier at each data set, by a 10-fold cros...

259 |
Feature Selection for knowledge Discovery and Data Mining
- Liu, Motoda
- 1999
Citation Context ...cludes principal component analysis [32]. The variable selection techniques (see [25]) can be divided into two groups depending on the nature of the search score used by the selection process: filter [45] and wrapper approaches [35]. The scores used in the filter approaches are based on intrinsic characteristics of the data [45]. The advantages of filter approaches are related to the time complexity n...

258 |
Estimation of distribution algorithms: a new tool for evolutionary computation
- Larrañaga, Lozano
- 2002
Citation Context ...ach, consists in adapting more classifiers supported by BMN to directly operate with continuous variables. Randomized heuristics (such as genetic algorithms [23] or estimation of distribution algorithms [41]) could be used as the search engine in the space of classifier structures. 5 Acknowledgments This work was supported in part by a PhD purpose grant from the Basque Government for the first author, by...

245 |
Learning Bayesian networks
- Neapolitan
- 2004
Citation Context ...e construction of a classifier, that is, a function that assigns a class label to instances described by a set of variables. There are numerous classifier paradigms, among which Bayesian networks (BN) [48,50], based on probabilistic graphical models (PGMs) [3,42], are very effective and well-known in domains with uncertainty. A Bayesian network is a directed acyclic graph of nodes representing variables a...

211 | Induction of selective Bayesian classifiers
- Langley, Sage
- 1994
Citation Context ...included in this paper are induced from a non-Bayesian point of view, which fixes a unique structure and its parameters. The structure is learned guided by a score function (likelihood [24], accuracy [33,40,49] or mutual information [17,57]). There are a lot of works in which the non-Bayesian approach for discrete variables is performed with different structure complexities [11,17,33,37,39,40,46,49,57]. The...

193 | On bias, variance, 0/1-loss, and the curse-of-dimensionality
- Friedman
- 1997
Citation Context ...data sets less than BMN-based classifiers. Therefore, in general, they should have a lower variance and higher bias components in their associated decomposition of the expected misclassification rate [16,36]. Besides, a lower number of parameters allows a more reliable and robust computation of the necessary statistics. Moreover, the parameters can be computed a priori, without taking into account the str...

182 | Steps Towards an Artificial Intelligence
- Minsky
- 1961
Citation Context ...inuous variable is present, it must be discretized, with a subsequent loss of information [63]. A battery of BMN-based classifier induction algorithms has been proposed in the literature: naive Bayes [11,39,46], tree augmented Bayesian network [17], k-dependence Bayesian classifier [57] and semi naive Bayes [37,49]. In the presence of continuous variables, another alternative is to assume that continuous va...

174 |
Expert Systems and Probabilistic Network Models
- Castillo, Gutierrez, et al.
- 1997
Citation Context ... assigns a class label to instances described by a set of variables. There are numerous classifier paradigms, among which Bayesian networks (BN) [48,50], based on probabilistic graphical models (PGMs) [3,42], are very effective and well-known in domains with uncertainty. A Bayesian network is a directed acyclic graph of nodes representing variables and arcs representing conditional (in)dependence relatio...

173 | Bias plus variance decomposition for zero-one loss functions
- Kohavi, Wolpert
- 1996
Citation Context ...ssary to design filter approaches, are introduced. The classifier induction algorithms presented are experimentally compared by means of estimated predictive accuracy. The bias-variance decomposition [36] of the expected misclassification cost is performed in order to analyze the behavior of the CGN-based classifiers presented in more detail. The paper is organized as follows. In Section 2, four kinds...

164 | Graphical models for associations between variables, some of which are qualitative and some quantitative, Annals of Statistics - Lauritzen, Wermuth - 1989

131 | Comparison of algorithms that select features for pattern classifiers
- Kudo, Sklansky
- 2000
Citation Context ...arallelly to the structural learning process (especially in the wrapper approaches). The search process depends on the score and search strategy used. For a review of different search strategies, see [38]. Although some of the methods proposed in this work perform an implicit selection of variables, it is not our purpose to treat this process of selection explicitly. Structural learning usually involv...

112 |
Learning gaussian networks
- Geiger, Heckerman
- 1994
Citation Context ...ctors included in the model. The structures of these classifiers range from the simplest naive Bayes structure to complete graphs. A classifier based on BNs can be constructed from a Bayesian approach [2,19,22,27]. It takes into account all possible models and all possible parameters, restricted to a special kind of structure and a family of probability functions. However, the classifiers included in this pape...

110 | Semi-naive Bayesian classifier - Kononenko - 1991

109 | Learning limited dependence Bayesian classifiers
- Sahami
- 1996
Citation Context ...mation [63]. A battery of BMN-based classifier induction algorithms has been proposed in the literature: naive Bayes [11,39,46], tree augmented Bayesian network [17], k-dependence Bayesian classifier [57] and semi naive Bayes [37,49]. In the presence of continuous variables, another alternative is to assume that continuous variables are sampled from a Gaussian distribution. This kind of Bayesian netwo...

107 | Wrappers for Performance Enhancement and Oblivious Decision Graphs
- Kohavi
- 1995
Citation Context ...ata sets [47], which only contain continuous predictor variables. In order to interpret the results, we must take into account that most parts of the UCI repository data sets are already preprocessed [34]: in the data sets included, there are few irrelevant or redundant variables, and little noise [59]. Thus, it is more difficult to obtain statistically significant differences between the results of t...

106 | Efficient feature selection via analysis of relevance and redundancy
- Yu, Liu
Citation Context ...the selection. For example, a score based on information theory [6] used to select variables in a filter manner (entropy and mutual information measures), is correlation based feature selection (CFS) [26,64]. More examples based on information theory are the approaches based on relevance concepts [60,61]. On the other hand, wrapper approaches use an estimated classification goodness measure as a score [3...

84 |
The Use of Multiple Measurements
- Fisher
- 1936
Citation Context ... Linear discriminant analysis (LDA) [15] is included in the study as a classic statistical benchmark to compare it with the CGN-based classifiers presented. LDA also assumes that the continuous data is sampled from a multivariate Gaussian d...

79 | Comparing Bayesian network classifiers
- Cheng, Greiner
- 1999
Citation Context ...search space, the algorithms explore, for example, naive Bayes like structures [11,39,46], tree augmented networks [17,33], k-dependence networks [57], semi naive Bayes [37,49], unrestricted networks [4,51], or Bayesian multinets [20]. Parametric learning consists in estimating parameters from the data. These parameters model the dependence relations between variables, represented by the classifier st...

79 |
Nearest Neighbour Pattern Classification
- Cover, Hart
- 1967
Citation Context ...inuous data is sampled from a multivariate Gaussian density function. Table 2 shows that LDA obtains competitive results compared with the following set of well-known state-of-the-art algorithms: kNN [7] with different k, discrete versions of NB [11] and TAN [17], ID3 [53] and C4.5 [54], and Multilayer Perceptron (MP) [56] (all of them implemented in the Weka 3.4.3 statistical package [62]). The estimate...

69 | Searching for dependencies in bayesian classifiers
- Pazzani
- 1996
Citation Context ...BMN-based classifier induction algorithms has been proposed in the literature: naive Bayes [11,39,46], tree augmented Bayesian network [17], k-dependence Bayesian classifier [57] and semi naive Bayes [37,49]. In the presence of continuous variables, another alternative is to assume that continuous variables are sampled from a Gaussian distribution. This kind of Bayesian network is known as a conditional ...

63 | Decomposable graphical Gaussian model determination
- Giudici, Green
- 1999
Citation Context ...ctors included in the model. The structures of these classifiers range from the simplest naive Bayes structure to complete graphs. A classifier based on BNs can be constructed from a Bayesian approach [2,19,22,27]. It takes into account all possible models and all possible parameters, restricted to a special kind of structure and a family of probability functions. However, the classifiers included in this pape...

63 | Learning Bayesian network classifiers by maximizing conditional likelihood
- Grossman, Dominigos
- 2004
Citation Context ...he classifiers included in this paper are induced from a non-Bayesian point of view, which fixes a unique structure and its parameters. The structure is learned guided by a score function (likelihood [24], accuracy [33,40,49] or mutual information [17,57]). There are a lot of works in which the non-Bayesian approach for discrete variables is performed with different structure complexities [11,17,33,37...

62 | Classification with hybrid generative/discriminative models
- Raina, Shen, et al.
- 2003
Citation Context ... other hand, discriminative classifiers [24,58] directly model the posterior probability of the class conditioned to the predictor variables. The learning can also be done in a mixed way, as shown in [55]. The present work is performed from the point of view of generative learning. This paper presents the CGN paradigm and a battery of classifier induction algorithms supported by it, many of them adapt...

57 | Learning augmented Bayesian classifiers: a comparison of distribution-based and classification-based approaches
- Keogh, Pazzani
- 1999
Citation Context ...included in this paper are induced from a non-Bayesian point of view, which fixes a unique structure and its parameters. The structure is learned guided by a score function (likelihood [24], accuracy [33,40,49] or mutual information [17,57]). There are a lot of works in which the non-Bayesian approach for discrete variables is performed with different structure complexities [11,17,33,37,39,40,46,49,57]. The...

44 | A Unified Bias-Variance Decomposition and its Applications
- Domingos
- 2000
Citation Context ...ecomposition was introduced to machine learning for mean squared error by Geman et al. [21]. Later versions for zero-one-loss functions were given by Friedman [16], Kohavi and Wolpert [36], Domingos [9] and James [28]. The decompositions have been performed following Kohavi and Wolpert’s proposal [36] with parameters N = 20 and m = 1/3|BD|, where N is the number of training sets, m is its size and |...

40 |
Modern Mathematical Statistics
- Dudewicz, Mishra
- 1988
Citation Context ...icance level in a non-paired Mann-Whitney test. The selected algorithm has obtained statistically significantly better results with respect to the rest of the algorithms using a non-paired Mann-Whitney test [12]. The study has been performed at α = 10% and α = 5% significance levels, represented in Table 3 by “◦” and “•” symbols, respectively. For example, in the HAYES data set, fTAN has obtained a predictiv...

35 | Feature subset selection: a correlation based filter approach
- Hall, Lloyd
- 1997
Citation Context ...the selection. For example, a score based on information theory [6] used to select variables in a filter manner (entropy and mutual information measures), is correlation based feature selection (CFS) [26,64]. More examples based on information theory are the approaches based on relevance concepts [60,61]. On the other hand, wrapper approaches use an estimated classification goodness measure as a score [3...