## Order-based Discriminative Structure Learning for Bayesian Network Classifiers

Citations: 1 (1 self)

### BibTeX

@MISC{Pernkopf_order-baseddiscriminative,
  author = {Franz Pernkopf and Jeff Bilmes},
  title = {Order-based Discriminative Structure Learning for Bayesian Network Classifiers},
  year = {}
}

### Abstract

We introduce a simple empirical order-based greedy heuristic for learning discriminative Bayesian network structures. We propose two metrics, based on conditional mutual information, for establishing the ordering of N features. Given an ordering, we can find the discriminative classifier structure with O(N^q) score evaluations (where the constant q is the maximum number of parents per node). We present classification results on the UCI repository (Merz, Murphy, & Aha 1997), on a phonetic classification task using the TIMIT database (Lamel, Kassel, & Seneff 1986), and on the MNIST handwritten digit recognition task (LeCun et al. 1998). The discriminative structures found by our new procedures significantly outperform generatively produced structures, and achieve classification accuracy on par with the best discriminative (naive greedy) Bayesian network learning approach, but with roughly a factor of 10 speedup. We also show that the advantages of discriminatively structured generative Bayesian network classifiers still hold in the case of missing features.
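A minimal sketch of the two-step procedure the abstract describes: order the features, then let each feature choose parents only from its predecessors (which guarantees acyclicity). The ordering metric here is plain mutual information with the class — a simplified stand-in for the paper's CMI-based metrics — and `score` is a placeholder for any discriminative score such as the classification rate (CR); both simplifications are assumptions, not the authors' exact procedure.

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information I(X;Y) in nats for discrete 1-D arrays."""
    n = len(x)
    joint = Counter(zip(x.tolist(), y.tolist()))
    px, py = Counter(x.tolist()), Counter(y.tolist())
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        mi += p_ab * np.log(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

def order_based_search(X, c, score, q=2):
    """Step 1: order features by relevance to the class c.
    Step 2 (K2-style): each feature greedily adds up to q parents from
    its predecessors in the order, keeping any parent that improves the
    discriminative score."""
    N = X.shape[1]
    order = sorted(range(N), key=lambda i: -mutual_information(X[:, i], c))
    parents = {}
    for pos, i in enumerate(order):
        candidates = list(order[:pos])
        chosen, best = [], score(X, c, i, ())
        while len(chosen) < q and candidates:
            s, p = max((score(X, c, i, tuple(chosen) + (p,)), p)
                       for p in candidates)
            if s <= best:          # no candidate parent improves the score
                break
            chosen.append(p)
            candidates.remove(p)
            best = s
    return order, parents if parents else {i: () for i in order}
```

Restricting each node's parent candidates to its predecessors in the order is what makes the search cheap: the acyclicity constraint never has to be checked explicitly.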

### Citations

9359 |
Elements of information theory
- Cover, Thomas
- 1991
Citation Context: ...hm proposed in (Cooper & Herskovits 1992), however, we use a discriminative scoring metric and suggest approaches for establishing the variable ordering based on conditional mutual information (CMI) (Cover & Thomas 1991). We provide results showing that the order-based heuristic provides comparable results to the best procedure - the naive greedy heuristic using the CR score, but it requires only one tenth of the tim... |
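For reference, the conditional mutual information on which the ordering metrics are based is the standard quantity from Cover & Thomas:

```latex
I(X; Y \mid Z) \;=\; \sum_{x,\,y,\,z} P(x, y, z)\,
  \log \frac{P(x, y \mid z)}{P(x \mid z)\, P(y \mid z)}
```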

7556 |
Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988
Citation Context: ...minative structure can be desirable. Section 4 introduces our order-based greedy heuristic. Experiments are shown in Section 5. Section 6 concludes. 2 Bayesian network classifiers A Bayesian network (Pearl 1988) B = 〈G,Θ〉 is a directed acyclic graph G = (Z,E) consisting of a set of nodes Z and a set of directed edges E connecting the nodes. This graph represents factorization properties of the distribution ... |

3110 | UCI repository of machine learning databases - Blake, Keogh, et al. - 1998 |

1153 | Wrappers for feature subset selection
- Kohavi, John
- 1997
Citation Context: ...n is currently terminated after a specified number of iterations (specifically 20). 5.1 Data characteristics UCI Data: We use 25 data sets from the UCI repository (Merz, Murphy, & Aha 1997) and from (Kohavi & John 1997). The same data sets, 5-fold cross-validation, and train/test learning schemes as in (Friedman, Geiger, & Goldszmidt 1997) are employed. TIMIT-4/6 Data: This data set is extracted from the TIMIT spee... |

1148 | A Bayesian method for the induction of probabilistic networks from data
- Cooper, Herskovits
- 1992
Citation Context: ...be discriminatively optimized in O(N^2) using the CR. Our order-based structure learning is based on the observations in (Buntine 1991) and the framework is similar to the K2 algorithm proposed in (Cooper & Herskovits 1992), however, we use a discriminative scoring metric and suggest approaches for establishing the variable ordering based on conditional mutual information (CMI) (Cover & Thomas 1991). We provide results... |

666 |
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond
- Schölkopf, Smola
- 2002
Citation Context: ...t, a commitment has been made to use a generative model for classification purposes; the alternative being a “discriminative” classifier such as logistic regression or support vector machines (SVMs) (Schölkopf & Smola 2001). There are a number of reasons why one might, in certain contexts, prefer a generative to a discriminative model including: parameter tying and domain knowledge-based hierarchical decomposition is f... |

610 | Dynamic Bayesian Networks: Representation, Inference and - Murphy, Mian - 2002 |

397 | On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes - Ng, Jordan - 2002 |

200 | Theory refinement on bayesian networks
- Buntine
- 1991
Citation Context: ...the number of parents per node). We learn, e.g., a TAN classifier, which can be discriminatively optimized in O(N^2) using the CR. Our order-based structure learning is based on the observations in (Buntine 1991) and the framework is similar to the K2 algorithm proposed in (Cooper & Herskovits 1992), however, we use a discriminative scoring metric and suggest approaches for establishing the variable ordering... |

148 | Speech database development: design and analysis of the Acoustic-Phonetic Corpus - Lamel, Kassel, et al. - 1986 |

94 |
Knowledge representation and inference in similarity networks and Bayesian multinets
- Geiger, Heckerman
- 1996
Citation Context: ...sian network in a maximum likelihood (ML) sense is NP-complete (including paths (Meek 1995), polytrees (Dasgupta 1997), k-trees (Arnborg, Corneil, & Proskurowski 1987), and general Bayesian networks (Geiger & Heckerman 1996)). Learning the best “discriminative structure” is no less difficult, largely because the cost functions that need to be optimized do not in general decompose. As of yet, however, there has n... |

87 |
Causal Inference and Causal Explanation with Background Knowledge
- Meek
- 1995
Citation Context: ...been a number of negative results over the past years, showing that learning various forms of optimal constrained Bayesian network in a maximum likelihood (ML) sense is NP-complete (including paths (Meek 1995), polytrees (Dasgupta 1997), k-trees (Arnborg, Corneil, & Proskurowski 1987), and general Bayesian networks (Geiger & Heckerman 1996)). Learning the best “discriminative structure” is no less difficu... |

67 | Learning Bayesian network classifiers by maximizing conditional likelihood
- Grossman, Domingos
- 2004
Citation Context: ...likelihood (CL) using a conjugate gradient method. Similarly, in (Roos et al. 2005) conditions are provided for general Bayesian networks under which correspondence to logistic regression holds. In (Grossman & Domingos 2004) the CL function is used to learn a discriminative structure. The parameters are set using ML learning. They use a greedy hill-climbing search with the CL function as a scoring measure, where at each... |

65 | Structural extension to logistic regression: Discriminative parameter learning of belief net classifiers, 2002
- Greiner, Zhou
Citation Context: ...f discrete optimization is to minimize a cost function that is suitable for reducing classification errors, such as conditional likelihood (CL) or classification rate (CR). eral Bayesian networks in (Greiner et al. 2005) – they optimize parameters with respect to the conditional likelihood (CL) using a conjugate gradient method. Similarly, in (Roos et al. 2005) conditions are provided for general Bayesian networks u... |

61 | Dynamic Bayesian multinets - Bilmes - 2000 |

61 | Learning augmented Bayesian classifiers: A comparison of distribution-based and classification-based approaches
- Keogh, Pazzani
- 1999
Citation Context: ...g., tree augmented naive Bayes (TAN)) and the acyclicity property of Bayesian networks. In a similar algorithm, the classification rate (CR) has also been used for discriminative structure learning (Keogh & Pazzani 1999). This approach is computationally expensive, as a complete re-evaluation of the training set is needed for each considered edge. The CR (equivalently, empirical risk) is the discriminative criterion... |
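The context above notes why CR-based structure search is expensive: the score is just empirical accuracy, so every candidate edge requires re-classifying the entire training set. A minimal sketch (the `classify` callable stands in for inference under the current candidate structure and is hypothetical):

```python
def classification_rate(classify, X, y):
    """Empirical classification rate (1 - empirical risk): the fraction
    of training samples classified correctly. In CR-based structure
    search this is recomputed from scratch for each candidate edge,
    which is the source of the cost discussed above."""
    correct = sum(1 for x, t in zip(X, y) if classify(x) == t)
    return correct / len(y)
```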

56 | Bayesian network classifiers - Friedman, Geiger, Goldszmidt - 1997 |

53 | Natural Statistical Models for Automatic Speech Recognition - Bilmes - 1998 |

23 | The sample complexity of learning fixed-structure bayesian networks
- Dasgupta
- 1997
Citation Context: ...ive results over the past years, showing that learning various forms of optimal constrained Bayesian network in a maximum likelihood (ML) sense is NP-complete (including paths (Meek 1995), polytrees (Dasgupta 1997), k-trees (Arnborg, Corneil, & Proskurowski 1987), and general Bayesian networks (Geiger & Heckerman 1996)). Learning the best “discriminative structure” is no less difficult, largely because the cos... |

21 | Discriminative versus generative parameter and structure learning of Bayesian Network Classiers - Pernkopf, Bilmes - 2005 |

15 | On discriminative bayesian network classifiers and logistic regression. Machine Learning - Special Issue on Graphical Models for Classification
- Roos, Wettig, et al.
- 2005
Citation Context: ...classification rate (CR). eral Bayesian networks in (Greiner et al. 2005) – they optimize parameters with respect to the conditional likelihood (CL) using a conjugate gradient method. Similarly, in (Roos et al. 2005) conditions are provided for general Bayesian networks under which correspondence to logistic regression holds. In (Grossman & Domingos 2004) the CL function is used to learn a discriminative structu... |

7 |
Heterogeneous measurements for phonetic classification
- Halberstadt, Glass
- 1997
Citation Context: ...e phonetic transcription boundaries specify a set of frames belonging to a particular phoneme. From this set of frames - the phonetic segment - a single feature vector is derived. In accordance with (Halberstadt & Glass 1997) we combine the 61 phonetic labels into 39 classes, ignoring glottal stops. For training, 462 speakers from the standard NIST training set have been used. For testing, the remaining 168 speakers from... |

7 | Discriminative Learning of Bayesian Network Classifiers - Pernkopf |

6 |
A Supermodular-Submodular Procedure with Applications to Discriminative Structure Learning
- Narasimhan, Bilmes
- 2005
Citation Context: ...it is structured discriminatively) does not impair the generative model’s ability to easily deal with missing features (Figure 3). In the following, we present a simple synthetic example (similar to (Narasimhan & Bilmes 2005)) and results which indicate when a discriminative structure would be necessary for good classification performance in a generative model, regardless of the parameter learning method. The model consi... |

4 | Complexity of finding embeddings in a k-tree. SIAM Journal on Algebraic and Discrete Methods - Arnborg, Corneil, et al. - 1987 |

2 |
Multi-interval discretization of continuous-valued attributes for classification learning
- Fayyad, Irani
- 1993
Citation Context: ...nditioned on a single variable for ordering the variables (step 1) and CR for parent selection in step 2 of the order-based heuristic. Any continuous features were discretized using the procedure from (Fayyad & Irani 1993), where the codebook is produced using only the training data. Throughout our experiments, we use exactly the same data partitioning for each training procedure. We performed simple smoothing, where z... |

2 | Learning Bayesian network structure from massive datasets: The sparse candidate algorithm - Friedman, Nachman, et al. - 1999 |