## On supervised selection of Bayesian networks (1999)

Venue: | In UAI99 |

Citations: | 18 - 6 self |

### BibTeX

@INPROCEEDINGS{Kontkanen99onsupervised,

author = {Petri Kontkanen and Petri Myllymaki and Tomi Silander and Henry Tirri},

title = {On supervised selection of Bayesian networks},

booktitle = {In UAI99},

year = {1999},

pages = {334--342},

publisher = {Morgan Kaufmann Publishers}

}

### Abstract

Given a set of possible models (e.g., Bayesian network structures) and a data sample, in the unsupervised model selection problem the task is to choose the most accurate model with respect to the domain joint probability distribution. In contrast to this, in supervised model selection it is a priori known that the chosen model will be used in the future for prediction tasks involving more "focused" predictive distributions. Although focused predictive distributions can be produced from the joint probability distribution by marginalization, in practice the best model in the unsupervised sense does not necessarily perform well in supervised domains. In particular, the standard marginal likelihood score is a criterion for the unsupervised task, and, although frequently used for supervised model selection also, does not perform well in such tasks. In this paper we study the performance of the marginal likelihood score empirically in supervised Bayesian network selection tasks by using a large number of publicly available classification data sets, and compare the results to those obtained by alternative model selection criteria, including empirical crossvalidation methods, an approximation of a supervised marginal likelihood measure, and a supervised version of Dawid's prequential (predictive sequential) principle. The results demonstrate that the marginal likelihood score does not perform well for supervised model selection, while the best results are obtained by using Dawid's prequential approach.
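The distinction between the two kinds of score can be illustrated with a small sketch. It is not the authors' implementation: the function name `score_model`, the restriction to naive-Bayes-style structures (a class node that is the parent of a chosen subset of features, with the remaining features parentless), and the toy data are all assumptions made for illustration. With Dirichlet parameter priors, the unsupervised log marginal likelihood equals the sum of sequential log predictive probabilities of each full data vector, while a supervised prequential log score sums the sequential log predictive probabilities of the class given the features only.

```python
import math
from collections import defaultdict

def score_model(data, selected, n_values, n_classes, alpha=1.0):
    """Return (unsupervised log marginal likelihood,
               supervised prequential log score) for one structure.

    data:      list of (x, c); x is a tuple of ints, c an int class label
    selected:  set of feature indices whose parent is the class node
               (hypothetical structure family, assumed for illustration)
    n_values:  number of distinct values of each feature
    alpha:     Dirichlet hyperparameter; 1.0 gives uniform priors
    """
    class_counts = defaultdict(int)    # N(c)
    feat_counts = defaultdict(int)     # N(feature, parent value, value)
    parent_counts = defaultdict(int)   # N(feature, parent value)
    log_joint, log_cond = 0.0, 0.0

    for t, (x, c) in enumerate(data):
        def log_p(cls):
            # log predictive probability of (x, cls) given the first t vectors
            lp = math.log((class_counts[cls] + alpha) / (t + alpha * n_classes))
            for i, v in enumerate(x):
                pa = cls if i in selected else None
                num = feat_counts[(i, pa, v)] + alpha
                den = parent_counts[(i, pa)] + alpha * n_values[i]
                lp += math.log(num / den)
            return lp

        joint = [log_p(k) for k in range(n_classes)]
        m = max(joint)
        log_evidence = m + math.log(sum(math.exp(j - m) for j in joint))

        log_joint += joint[c]                 # predicts the whole vector
        log_cond += joint[c] - log_evidence   # predicts the class only

        # Update the sufficient statistics with the vector just scored.
        class_counts[c] += 1
        for i, v in enumerate(x):
            pa = c if i in selected else None
            feat_counts[(i, pa, v)] += 1
            parent_counts[(i, pa)] += 1

    return log_joint, log_cond

# Toy data (made up): feature 0 copies the class, feature 1 is noise.
toy = [((0, 1), 0), ((1, 0), 1), ((0, 0), 0), ((1, 1), 1), ((0, 1), 0)]
for subset in (set(), {0}, {1}, {0, 1}):
    print(sorted(subset), score_model(toy, subset, [2, 2], 2))
```

The unsupervised score does not depend on the order of the data, whereas the supervised prequential score generally does, so in practice one would fix or average over orderings. Ranking the candidate structures by the first and by the second returned value need not give the same winner, which is the phenomenon the abstract describes.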

### Citations

7052 |
Probabilistic Reasoning in Intelligent Systems
- Pearl
- 1988
Citation Context ...ake our presentation more concrete, for the remainder of the paper we assume the models M to represent different Bayesian network structures (for an introduction to Bayesian network models, see e.g., (Pearl, 1988)). Given a selection F = {M1, ..., Mm} of possible models (Bayesian network structures), and a data sample D, in the (unsupervised) model selection problem, the task is to choose a model M so that the ... |

2868 |
P.: UCI Repository of Machine Learning Databases
- Merz, Murphy
- 1996
Citation Context ...n by using a different factorization. 3 EMPIRICAL RESULTS 3.1 TEST SETUP For our supervised model selection experiments, we used publicly available classification datasets from the UCI data repository (Blake, Keogh, & Merz, 1998). As discussed earlier, we wanted to eliminate the effects of the model search procedure from our results, hence we restricted the possible models to Bayesian network structures sharing the property t... |

2307 |
Estimating the dimension of a model
- Schwarz
- 1978
Citation Context ... (Friedman et al., 1997) call the MDL score is actually only an approximation of the unsupervised marginal likelihood, also known as the Bayesian information criterion (BIC) or the Schwarz criterion (Schwarz, 1978). What is more, from the information theoretic point of view, the two-part MDL score used in (Friedman et al., 1997) can be regarded as a crude approximation of the stochastic complexity measure (Ris... |

1240 |
Statistical decision theory and Bayesian analysis. Springer series in Statistics
- Berger
- 1985
Citation Context ...Heckerman & Geiger, 1995; Cowell, 1992), or as a technical parameter representing no such information. In the latter case, it can be shown that a certain prior known as Jeffreys' prior (Jeffreys, 1946; Berger, 1985) can be given strong theoretical justification from the predictive performance point of view with respect to the so called minimax loss formulation (Rissanen, 1996; Grunwald, 1998). Some empirical res... |

1075 | Herskovitz: A Bayesian Method for the Induction
- Cooper, E
- 1992
Citation Context ...twork domain is the (unsupervised) marginal likelihood, sometimes also called the evidence measure. By making certain technical assumptions, this criterion can be computed efficiently, as described in (Cooper & Herskovits, 1992; Heckerman, Geiger, & Chickering, 1995). Although this score can be shown to possess some desirable theoretical properties (see (Bernardo & Smith, 1994; Merhav & Feder, 1998)), the results hold only i... |

1039 |
Bayesian Theory
- Bernardo, Smith
- 1994
Citation Context ... can be computed efficiently, as described in (Cooper & Herskovits, 1992; Heckerman, Geiger, & Chickering, 1995). Although this score can be shown to possess some desirable theoretical properties (see (Bernardo & Smith, 1994; Merhav & Feder, 1998)), the results hold only in specific situations. Regardless of this, marginal likelihood is typically used also in model selection tasks where the optimality results no longer hol... |

903 | Learning Bayesian networks: The combination of knowledge and statistical data
- Heckerman, Geiger, et al.
- 1995
Citation Context ...rvised) marginal likelihood, sometimes also called the evidence measure. By making certain technical assumptions, this criterion can be computed efficiently, as described in (Cooper & Herskovits, 1992; Heckerman, Geiger, & Chickering, 1995). Although this score can be shown to possess some desirable theoretical properties (see (Bernardo & Smith, 1994; Merhav & Feder, 1998)), the results hold only in specific situations. Regardless of thi... |

589 | Bayesian network classifiers - Friedman, Geiger, et al. - 1997 |

296 |
Stochastic Complexity in Statistical Inquiry
- Rissanen
- 1989
Citation Context ...h for model selection. Furthermore, as noted in (Dawid, 1992), from the information-theoretic point of view the prequential approach can be regarded as a predictive coding system discussed in, e.g., (Rissanen, 1989). However, it should be noted that in contrast to marginal likelihood and information-theoretic approaches, which are closely linked to a specific loss function, the logarithmic loss, Dawid's prequent... |

275 |
Fisher information and stochastic complexity
- Rissanen
- 1996
Citation Context ...nown as Jeffreys' prior (Jeffreys, 1946; Berger, 1985) can be given strong theoretical justification from the predictive performance point of view with respect to the so called minimax loss formulation (Rissanen, 1996; Grunwald, 1998). Some empirical results concerning the effect of Jeffreys' prior on predictive accuracy can be found in (Grunwald, Kontkanen, Myllymaki, Silander, & Tirri, 1998). In the remainder of t... |

194 |
An invariant form for the prior probability in estimation problems
- Jeffreys
- 1946
Citation Context ...erent priors (Heckerman & Geiger, 1995; Cowell, 1992), or as a technical parameter representing no such information. In the latter case, it can be shown that a certain prior known as Jeffreys' prior (Jeffreys, 1946; Berger, 1985) can be given strong theoretical justification from the predictive performance point of view with respect to the so called minimax loss formulation (Rissanen, 1996; Grunwald, 1998). Som... |

191 |
Bayesian analysis in expert systems
- Spiegelhalter, Dawid, et al.
- 1993
Citation Context ...arely used in practice. For our purposes, we modify the prequential score for classification domains by using Cox's partial marginal likelihood principle (Cox, 1975) as suggested in (Dawid, 1991). In (Spiegelhalter, Dawid, Lauritzen, & Cowell, 1993), the resulting criterion was called the conditional node monitor. The third criterion, the supervised modification of the marginal likelihood score, was called the class sequential criterion in (Heck... |

171 |
Statistical theory: the prequential approach
- Dawid
- 1984
Citation Context ...sed model selection tasks. As an alternative to the unsupervised marginal likelihood model selection score, we consider several other model selection criteria, including Dawid's prequential approach (Dawid, 1984), empirical crossvalidation methods (Stone, 1974; Geisser, 1975), and the supervised marginal likelihood approximation discussed in (Kontkanen, Myllymaki, Silander, & Tirri, 1998). As opposed to the ... |

136 | Universal prediction - Merhav, Feder - 1998 |

128 |
Partial likelihoods
- Cox
- 1975
(Show Context)
Citation Context ...es, but for some reason this method has been rarely used in practice. For our purposes, we modify the prequential score for classi cation domains by using Cox's partial marginal likelihood principle (=-=Cox, 1975-=-) as suggested in (Dawid, 1991). In (Spiegelhalter, Dawid, Lauritzen, & Cowell, 1993), the resulting criterion was called the conditional node monitor. The third criterion, the supervised modi cation ... |

102 | The predictive sample reuse method with applications - Geisser - 1975 |

72 | The minimum description length principle and reasoning under uncertainty
- Grünwald
- 1998
Citation Context ...dy be taken into account in the model selection decision (see the discussion in (Dawid, 1992)). Modifying the marginal likelihood approach for arbitrary loss functions is by no means straightforward (Grunwald, 1998). On the other hand, focused prediction tasks, such as classification, can be seen to define a focused loss function, so in general we can say that unsupervised learning deals with the logarithmic loss... |

50 |
Prequential Analysis, Stochastic Complexity and Bayesian Inference
- Dawid
- 1992
Citation Context ...served that if the loss function is known in advance before choosing the model, for optimal performance it should already be taken into account in the model selection decision (see the discussion in (Dawid, 1992)). Modifying the marginal likelihood approach for arbitrary loss functions is by no means straightforward (Grunwald, 1998). On the other hand, focused prediction tasks, such as classification, can be ... |

38 | On predictive distributions and Bayesian networks - Kontkanen, Myllymäki, et al. - 2000 |

23 | Likelihoods and parameter priors for Bayesian networks
- Heckerman, Geiger
- 1996
Citation Context ...M) defined on the model parameters. This prior can either be regarded as a formalization of our prior domain knowledge, which leads to interesting questions about the compatibility of different priors (Heckerman & Geiger, 1995; Cowell, 1992), or as a technical parameter representing no such information. In the latter case, it can be shown that a certain prior known as Jeffreys' prior (Jeffreys, 1946; Berger, 1985) can be giv... |

23 | Models and selection criteria for regression and classification
- Heckerman, Meek
- 1997
Citation Context ...of the different models. In this paper we focus on the scoring aspect of the model selection problem; in other words, we are interested in scoring functions, or (model selection) criteria in terms of (Heckerman & Meek, 1997), which define what networks are considered "good" models. The most commonly used model selection criterion in the Bayesian network domain is the (unsupervised) marginal likelihood, sometimes also cal... |

22 |
Fisherian inference in likelihood and prequential frames of reference
- Dawid, A
- 1991
Citation Context ... method has been rarely used in practice. For our purposes, we modify the prequential score for classification domains by using Cox's partial marginal likelihood principle (Cox, 1975) as suggested in (Dawid, 1991). In (Spiegelhalter, Dawid, Lauritzen, & Cowell, 1993), the resulting criterion was called the conditional node monitor. The third criterion, the supervised modification of the marginal likelihood sco... |

19 | Minimum encoding approaches for predictive modeling
- Grünwald, Kontkanen, et al.
- 1998
Citation Context ... of view with respect to the so called minimax loss formulation (Rissanen, 1996; Grunwald, 1998). Some empirical results concerning the effect of Jeffreys' prior on predictive accuracy can be found in (Grunwald, Kontkanen, Myllymaki, Silander, & Tirri, 1998). In the remainder of this paper we do not address the problem of choosing the prior distributions, but use uniform priors for the parameters. The marginal likelihood measure (1) is the most commonly... |

17 |
Bayesian network classifiers
- Friedman, Geiger, et al.
- 1997
Citation Context ...s in empirical results would be caused by the properties of different model selection criteria, or by the bias caused by the search algorithm used. When compared to a related empirical study reported in (Friedman, Geiger, & Goldszmidt, 1997), the work presented here differs in several aspects. First, the emphasis on this paper is strictly on the problem of selecting the Bayesian network structure, while (Friedman et al., 1997) consider t... |

15 |
The minimum description length principle in coding and modeling
- Barron, Rissanen, et al.
- 1998
Citation Context ... in (Friedman et al., 1997) can be regarded as a crude approximation of the stochastic complexity measure (Rissanen, 1989), for which much more elaborate formulations can be found in (Rissanen, 1996; Barron, Rissanen, & Yu, 1998). However, as discussed in (Kontkanen, Myllymaki, Silander, Tirri, & Grunwald, 1999; Friedman et al., 1997), these criteria are inherently unsupervised, and establishing supervised variants of them s... |

13 | BAYDA: Software for Bayesian classification and feature selection
- Kontkanen, Myllymaki, et al.
- 1998
Citation Context ...d marginal likelihood modification also performs poorly, but we argue that this is caused by the crude approximation method used, not by the properties of the criterion itself (see the discussion in (Kontkanen et al., 1998)). Empirical crossvalidation methods perform relatively well, which is not surprising as they can be viewed as approximations of the supervised marginal likelihood, as demonstrated in Section 2.2.3. ... |

7 | On compatible priors for Bayesian networks
- Cowell
- 1992
Citation Context ...ameters. This prior can either be regarded as a formalization of our prior domain knowledge, which leads to interesting questions about the compatibility of different priors (Heckerman & Geiger, 1995; Cowell, 1992), or as a technical parameter representing no such information. In the latter case, it can be shown that a certain prior known as Jeffreys' prior (Jeffreys, 1946; Berger, 1985) can be given strong theo... |

6 | An invariant form for the prior probability in estimation problems - Jeffreys, H - 1946 |

3 |
BAYDA: Software for Bayesian Classification and Feature Selection
- Kontkanen, Myllymaki, et al.
- 1998
Citation Context ...tion criteria, including Dawid's prequential approach (Dawid, 1984), empirical crossvalidation methods (Stone, 1974; Geisser, 1975), and the supervised marginal likelihood approximation discussed in (Kontkanen, Myllymaki, Silander, & Tirri, 1998). As opposed to the unsupervised marginal likelihood, these criteria can be easily modified for different loss functions. Dawid's prequential approach can also be shown to possess certain elegant asymp... |