## Discriminative parameter learning of general Bayesian network classifiers (2003)

### Download Links

- [cs.ualberta.ca]
- [www.cs.ualberta.ca]
- [webdocs.cs.ualberta.ca]
- DBLP

### Other Repositories/Bibliography

Venue: Proceedings of the Fifteenth IEEE International Conference on Tools with Artificial Intelligence (ICTAI-03)

Citations: 5 (1 self)

### BibTeX

@INPROCEEDINGS{Shen03discriminativeparameter,

author = {Bin Shen and Petr Musilek and Xiaoyuan Su and Russell Greiner and Corrine Cheng},

title = {Discriminative parameter learning of general Bayesian network classifiers},

booktitle = {Proceedings of the Fifteenth IEEE International Conference on Tools with Artificial Intelligence (ICTAI-03)},

year = {2003},

publisher = {Society Press}

}

### Abstract

Greiner and Zhou [1] presented ELR, a discriminative parameter-learning algorithm that maximizes conditional likelihood (CL) for a fixed Bayesian belief network (BN) structure, and demonstrated that it often produces classifiers more accurate than those produced by the generative approach (OFE), which finds maximum-likelihood parameters. This is especially true when learning parameters for incorrect structures, such as naïve Bayes (NB). In searching for algorithms to learn better BN classifiers, this paper uses ELR to learn parameters of more nearly correct BN structures, e.g., of a general Bayesian network (GBN) learned by a structure-learning algorithm [2]. While OFE typically produces more accurate classifiers with GBN (vs. NB), we show that ELR does not, when the training data are not sufficient for the GBN structure learner to produce a good model. Our empirical studies also suggest that the better the BN structure is, the less advantage ELR has over OFE for classification purposes. ELR learning on NB (i.e., with little structural knowledge) still performs about the same as OFE on GBN in classification accuracy, over a large number of standard benchmark datasets.

### Citations

7489 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988
Citation Context: ...directed acyclic graph. Associated with each node ni∈N is a conditional probability distribution (CPtable), collectively represented by Θ={θi}, which quantifies how much a node depends on its parents [11]. A classifier is a function that assigns a class label to instances, typically described by a set of attributes. Over the last decade or so, Bayesian networks have been used more frequently for class...

4172 | Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context: ...arning and data mining. An increasing number of projects are using Bayesian belief net (BN) classifiers, whose wide use was motivated by the simplicity and accuracy of the naïve Bayes (NB) classifier [3]. While these NB learners find parameters that work well for a fixed structure, it is desirable to optimize structure as well as parameters, towards achieving an accurate Bayesian network classifier. ...

1885 | Generalized Linear Models
- McCullagh, Nelder
- 1989
Citation Context: ...the actual parameter learner attempts to optimize the log conditional likelihood of a belief net B. Given a sample S, it can be approximated as

$$\widehat{LCL}^{(S)}(B) \;=\; \frac{1}{|S|} \sum_{\langle e,\,c\rangle \in S} \log P_B(c \mid e) \quad (1)$$

[16] and [9] note that maximizing this score will typically produce a classifier that comes close to minimizing the classification error. Unfortunately, the complexity of finding the Bayesian network para...
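
The quoted passage defines the score that ELR optimizes. As a minimal illustration (not the paper's implementation), the average log conditional likelihood of equation (1) can be computed directly from a sample; the `model` callable and `lcl` name are hypothetical conveniences, assumed to return P_B(c | e) for each class:

```python
import math

def lcl(model, sample):
    """Equation (1): average log conditional likelihood of the class given
    the evidence, (1/|S|) * sum over <e, c> in S of log P_B(c | e).
    `model(e)` is assumed to return {class_label: P_B(c | e)}."""
    return sum(math.log(model(e)[c]) for e, c in sample) / len(sample)

# Toy conditional model over one binary attribute (made-up probabilities).
def toy_model(e):
    p1 = 0.8 if e == 1 else 0.3   # P(C=1 | E=e)
    return {1: p1, 0: 1.0 - p1}

sample = [(1, 1), (0, 0), (1, 1), (0, 1)]  # <e, c> pairs
score = lcl(toy_model, sample)             # closer to 0 is better
```

Maximizing this score, rather than the joint likelihood, is what makes ELR a discriminative learner.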

1141 | A Bayesian method for the induction of probabilistic networks from data, Machine Learning 9
- Cooper, Herskovits
- 1992
Citation Context: ...that ELR often produces better classifiers than generative learners: when the learner has complete data, ELR is often superior to the standard generative approach “Observed Frequency Estimate” (OFE) [6], and when given incomplete data, ELR is often better than the EM [7] and APN [8] systems. ELR appears especially beneficial in the common situations where the given BN-structure is incorrect. Optimiz...
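
For concreteness, a minimal sketch of the “Observed Frequency Estimate” idea [6]: each CPtable entry θ_{d|f} is set to the observed frequency of D=d among records matching the parent configuration F=f, here with an optional Laplace-style prior. The interface and names are hypothetical, not the paper's code:

```python
from collections import Counter

def ofe(records, child, parents, child_values, prior=1.0):
    """Observed Frequency Estimate for one CPtable:
    theta_{d|f} = (#[D=d, F=f] + prior) / (#[F=f] + prior * |values(D)|)."""
    joint, marg = Counter(), Counter()
    for r in records:
        f = tuple(r[p] for p in parents)
        joint[(r[child], f)] += 1
        marg[f] += 1
    k = len(child_values)
    return {(d, f): (joint[(d, f)] + prior) / (marg[f] + prior * k)
            for f in marg for d in child_values}

data = [{"C": 1, "X": 0}, {"C": 1, "X": 0}, {"C": 0, "X": 0}, {"C": 0, "X": 1}]
cpt = ofe(data, "C", ["X"], child_values=[0, 1])
# theta_{C=1 | X=0} = (2 + 1) / (3 + 2) = 0.6 with the default prior
```

Unlike ELR's gradient search, this estimator is a single counting pass, which is why it is the standard generative baseline.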

1132 | Wrappers for feature subset selection
- Kohavi, John
- 1997
Citation Context: ...binations. We evaluated various algorithms over the standard 25 benchmark datasets used by Friedman et al. [9]: 23 from the UCI repository [19], plus “mofn-3-7-10” and “corral”, which were developed by [20] to study feature selection. We also used the same 5-fold cross validation and Train/Test learning schemas. As part of data preparation, continuous data are discretized using the supervised entropy-bas...

901 | A tutorial on learning with Bayesian Networks
- Heckerman
- 1996
Citation Context: ...when the learner has complete data, ELR is often superior to the standard generative approach “Observed Frequency Estimate” (OFE) [6], and when given incomplete data, ELR is often better than the EM [7] and APN [8] systems. ELR appears especially beneficial in the common situations where the given BN-structure is incorrect. Optimization of BN structure is also an important learning task. Conceivably...

702 | Multi-interval discretization of continuous-valued attributes for classification learning
- Fayyad, Irani
- 1993
Citation Context: ...ature selection. We also used the same 5-fold cross validation and Train/Test learning schemas. As part of data preparation, continuous data are discretized using the supervised entropy-based approach [21]. As mentioned in Section 3, CI-based algorithms can effectively learn GBN structures from complete datasets, provided enough data instances are available. We used Cheng’s Power Constructor (which imp...
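
The supervised entropy-based approach [21] places cut points that minimize class entropy. A sketch of the inner step only (the full Fayyad-Irani method applies this recursively with an MDL stopping criterion); all names here are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def best_entropy_split(values, labels):
    """Single best binary cut point by class-entropy reduction."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    v = [values[i] for i in order]
    y = [labels[i] for i in order]
    n = len(v)
    base = entropy(y)
    best_gain, best_cut = 0.0, None
    for i in range(1, n):
        if v[i] == v[i - 1]:
            continue  # only cut between distinct attribute values
        gain = base - (i / n) * entropy(y[:i]) - ((n - i) / n) * entropy(y[i:])
        if gain > best_gain:
            best_gain, best_cut = gain, (v[i - 1] + v[i]) / 2
    return best_cut, best_gain

vals = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labs = [0, 0, 0, 1, 1, 1]
cut, gain = best_entropy_split(vals, labs)   # perfectly separating cut at 6.5
```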

686 | Approximating discrete probability distributions with dependence trees
- Chow, Liu
- 1968
Citation Context: ...e BN Augmented Naïve-Bayes (BAN) classifier, Bayesian Multi-net classifier, etc.; we will not consider them here. [A TAN structure] can be learned in polynomial time by using the Chow-Liu algorithm [12]. TAN classifiers are attractive as they embody a good tradeoff between the quality of the approximation of correlations among attributes, and the computational complexity in the learning stage [9]. G...
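
As a rough sketch of the idea behind [12] (not the authors' code): the Chow-Liu procedure builds a maximum-weight spanning tree over pairwise empirical mutual information; a TAN learner applies the same construction with class-conditional mutual information. Function names are hypothetical:

```python
import math
from collections import Counter

def mutual_information(sample, i, j):
    """Empirical I(X_i; X_j) from a list of tuples."""
    n = len(sample)
    pij, pi, pj = Counter(), Counter(), Counter()
    for row in sample:
        pij[(row[i], row[j])] += 1
        pi[row[i]] += 1
        pj[row[j]] += 1
    return sum((c / n) * math.log((c / n) / ((pi[a] / n) * (pj[b] / n)))
               for (a, b), c in pij.items())

def chow_liu_edges(sample, n_vars):
    """Maximum-weight spanning tree over pairwise MI (Prim's algorithm)."""
    in_tree = {0}
    edges = []
    while len(in_tree) < n_vars:
        best = max(((mutual_information(sample, u, v), u, v)
                    for u in in_tree
                    for v in range(n_vars) if v not in in_tree),
                   key=lambda t: t[0])
        edges.append((best[1], best[2]))
        in_tree.add(best[2])
    return edges

# X1 copies X0; X2 is nearly independent of both.
sample = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1), (0, 0, 0), (1, 1, 1)]
tree = chow_liu_edges(sample, 3)   # the strong X0-X1 edge is picked first
```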

637 | Bayesian network classifiers
- Friedman, Geiger, et al.
- 1997
Citation Context: ...ce (CI) based algorithm for learning BN structure, and the ELR algorithm for learning BN parameters. Section 4 presents our empirical experiments and analyses, based on 25 standard benchmark datasets [9] and the data generated from the Alarm [10] and Insurance [8] networks. We provide additional details, and additional data, in http://www.cs.ualberta.ca/~greiner/ELR.html. 2. Bayesian (network) clas...

313 | Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier
- Domingos, Pazzani
- 1996
Citation Context: ..., to determine if correct structures can further improve ELR learning in classification tasks. 4.1 GBN + OFE vs. NB, TAN + OFE Given a typically incorrect structure such as NB, OFE can perform poorly [22] in classification tasks. We were therefore surprised when our experimental results (Table 1) showed OFE parameter learning on NB structure (NB+OFE) performed just about the same as GBN+OFE in classif...

249 | The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks
- Beinlich, Suermondt, et al.
- 1989
Citation Context: ...ructure, and the ELR algorithm for learning BN parameters. Section 4 presents our empirical experiments and analyses, based on 25 standard benchmark datasets [9] and the data generated from the Alarm [10] and Insurance [8] networks. We provide additional details, and additional data, in http://www.cs.ualberta.ca/~greiner/ELR.html. 2. Bayesian (network) classifiers A Bayesian network (BN) is a probab...

162 | Adaptive Probabilistic Networks with Hidden Variables
- Binder, Koller, et al.
- 1997
Citation Context: ...arner has complete data, ELR is often superior to the standard generative approach “Observed Frequency Estimate” (OFE) [6], and when given incomplete data, ELR is often better than the EM [7] and APN [8] systems. ELR appears especially beneficial in the common situations where the given BN-structure is incorrect. Optimization of BN structure is also an important learning task. Conceivably, optimizing...

67 | Learning Bayesian network classifiers by maximizing conditional likelihood
- Grossman, Domingos
- 2004
Citation Context: ...from the true models and sampled uniformly. 5. Conclusions and Future Work These results suggest it would be helpful to find structures that maximize classification accuracy or conditional likelihood [25], especially when the data are insufficient for generative structure learning. While our experiments dealt with complete data cases, further studies are needed to learn GBN+ELR on incomplete/partial...

65 | Structural extension to logistic regression: Discriminative parameter learning of belief net classifiers
- Greiner, Zhou
- 2002
Citation Context: ...aoyuan@ee.ualberta.ca Russell Greiner 1 greiner@cs.ualberta.ca Corrine Cheng 1 corrine@cs.ualberta.ca 1 Computing Science University of Alberta Edmonton, AB, Canada, T6G 2E8 Abstract Greiner and Zhou [1] presented ELR, a discriminative parameter-learning algorithm that maximizes conditional likelihood (CL) for a fixed Bayesian Belief Network (BN) structure, and demonstrated that it often produces cla...

62 | Learning Bayesian Belief Network Classifiers: Algorithms and System - Cheng, Greiner - 2001

61 | A Bayesian approach to learning causal networks
- Heckerman
- 1995
Citation Context: ...of approaches to learn the structure of a Bayesian network: conditional independence (CI) based algorithms (using an information-theoretic dependency analysis), and search-and-scoring based algorithms [13]. We will focus on the first approach. 3.1.1 CI-based algorithm. The Bayesian network structure encodes a set of conditional independence relationships among the nodes, which suggests the structure ca...

55 | Why the logistic function? A tutorial discussion on probabilities and neural networks
- Jordan
- 1995
Citation Context: ...meters, towards achieving an accurate Bayesian network classifier. Most BN learners are generative, seeking parameters and structure that maximize likelihood [4]. By contrast, logistic regression (LR [5]) systems attempt to optimize the conditional likelihood (CL) of the class given the attributes; this typically produces better classification accuracy. Standard LR, however, makes the “naïve Bayes” a...

53 | Algorithms for maximum-likelihood logistic regression
- Minka
- 2001
Citation Context: ...set to their maximum-likelihood values using observed frequency estimates before gradient descent. ELR uses line-search and conjugate-gradient techniques, which are known to be effective for LR tasks [17]. Our empirical studies also show that the number of iterations is crucial. We therefore use a type of cross validation (called “cross tuning”) to determine this number. ELR also incorporates several...

50 | Learning Bayesian nets that perform well
- Greiner, Grove, et al.
- 1997
Citation Context: ...4}, etc. We also assume the probability of asking any “What is P(C | E=e)?” query corresponds directly to the natural frequency of the E=e event, which means we can infer this from the data sample S; see [15]. Our goal is to construct an effective Bayesian belief net (BN), B = 〈N, A, Θ〉, for this classification task. Here, given a node D∈N with immediate parents F⊂N, the parameter θd|f represents the netwo...

48 | AUC: A Statistically Consistent and More Discriminating Measure than Accuracy
- Ling, Huang, et al.
- 2003
Citation Context: ...om http://www.cs.ualberta.ca/~bshen/elr.htm. More comparative study results for evaluating classifiers based on probability estimation, in terms of conditional likelihood and area under the ROC curve (AUC) [26], will be posted onto the above website. Acknowledgement We thank Dr. J. Cheng (Siemens Corporate Research), W. Zhou (U. of Waterloo), Dr. H. Zhang (U. of New Brunswick), X. Wu (U. of Alberta) and Dr. ...

13 | Algorithms for Bayesian belief-network precomputation. Methods Inf Med
- Herskovits, Cooper
- 1991
Citation Context: ...the INSURANCE network, a 27-variable (3 query variables, 12 evidence variables) BN for evaluating car insurance risks [8]. Complete datasets with all variables are sampled from the original networks [24]. We generated queries from these datasets by fixing one query variable and including all the evidence variables (all the other variables were removed); e.g., a query generated from the ALARM network...

9 | On discriminative vs. generative classifiers
- Ng, Jordan
- 2002
Citation Context: ...ikelihood; it would be useful to instead use a learner that sought structures that optimized conditional likelihood, and then sought appropriate parameters for that structure, using either ELR or OFE [23]. We also notice that the performance gaps in classification error between ELR and OFE shrink with better structural models. NB+ELR consistently yields better classification results over NB+OFE at signific...

3 | Learning Bayesian networks from data: An information-theory based approach
- Cheng, Greiner, et al.
- 2002
Citation Context: ...dence relationships between the nodes. Using information theory, the conditional mutual information of two nodes X and Y, with respect to a (possibly empty) conditioning set of nodes C, is defined as [14]:

$$I(X, Y \mid C) \;=\; \sum_{x,\,y,\,c} P(x, y, c)\, \log \frac{P(x, y \mid c)}{P(x \mid c)\, P(y \mid c)}$$

Of course, we do not have access to the true distribution P, but only a training sample S, from which we can compute emp...
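
The conditional mutual information I(X, Y | C) described above can be estimated directly from counts over a sample. A minimal sketch (column indices and the function name are illustrative, not the paper's code):

```python
import math
from collections import Counter

def cond_mutual_information(sample, x, y, cond):
    """Empirical I(X; Y | C) = sum_{x,y,c} P(x,y,c) * log(P(x,y|c) / (P(x|c) P(y|c))).
    `sample` is a list of tuples; `x`, `y` are column indices and `cond` a
    (possibly empty) tuple of column indices for the conditioning set."""
    n = len(sample)
    pxyc, pxc, pyc, pc = Counter(), Counter(), Counter(), Counter()
    for row in sample:
        c = tuple(row[k] for k in cond)
        pxyc[(row[x], row[y], c)] += 1
        pxc[(row[x], c)] += 1
        pyc[(row[y], c)] += 1
        pc[c] += 1
    # P(x,y|c) / (P(x|c) P(y|c)) rewritten in raw counts: n_xyc * n_c / (n_xc * n_yc)
    return sum((cnt / n) * math.log(cnt * pc[c] / (pxc[(a, c)] * pyc[(b, c)]))
               for (a, b, c), cnt in pxyc.items())

pairs = [(0, 0), (1, 1), (0, 0), (1, 1)]
mi = cond_mutual_information(pairs, 0, 1, ())   # X determines Y here: I = log 2
```

A CI-based structure learner thresholds estimates like this one to decide which conditional dependencies to keep in the graph.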

1 | Pattern Recognition in Intelligence Systems: Networks of Plausible Inference
- Ripley
- 1998
Citation Context: ...able to optimize structure as well as parameters, towards achieving an accurate Bayesian network classifier. Most BN learners are generative, seeking parameters and structure that maximize likelihood [4]. By contrast, logistic regression (LR [5]) systems attempt to optimize the conditional likelihood (CL) of the class given the attributes; this typically produces better classification accuracy. Stand...
Citation Context ...able to optimize structure as well as parameters, towards achieving an accurate Bayesian network classifier. Most BN learners are generative, seeking parameters and structure that maximize likelihood =-=[4]-=-. By contrast, logistic regression (LR [5]) systems attempt to optimize the conditional likelihood (CL) of the class given the attributes; this typically produces better classification accuracy. Stand... |