## Comparing Prequential Model Selection Criteria in Supervised Learning of Mixture Models (2001)

Venue: Proceedings of the Eighth International Conference on Artificial Intelligence and Statistics

Citations: 6 (3 self)

### BibTeX

```bibtex
@inproceedings{Kontkanen01comparingprequential,
  author    = {Petri Kontkanen and Petri Myllym{\"a}ki and Henry Tirri},
  title     = {Comparing Prequential Model Selection Criteria in Supervised Learning of Mixture Models},
  booktitle = {Proceedings of the Eighth International Conference on Artificial Intelligence and Statistics},
  year      = {2001},
  pages     = {233--238},
  publisher = {Morgan Kaufmann Publishers}
}
```


### Abstract

In this paper we study prequential model selection criteria in supervised learning domains. The main problem with this approach is that the criterion is sensitive to the order in which the data is processed. We discuss several approaches for addressing this ordering problem, and empirically compare their performance in real-world supervised model selection tasks. The empirical results demonstrate that with the prequential approach it is quite easy to find predictive models that are significantly more accurate classifiers than the models found by the standard unsupervised marginal likelihood criterion. The results also suggest that averaging over random orderings may be a more sensible strategy for solving the ordering problem than trying to find the ordering that optimizes the prequential model selection criterion.
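The ordering-averaging strategy favoured by the abstract can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the `predict_prob` interface, the `LaplaceBernoulli` toy model, and all parameter names are assumptions. A prequential score sums the log predictive probability of each data point given the points processed before it; because that sum can depend on the processing order, the score is averaged over several random orderings.

```python
import math
import random

class LaplaceBernoulli:
    """Toy predictive model (hypothetical): Laplace-smoothed Bernoulli over
    binary labels, standing in for the paper's mixture models."""
    def predict_prob(self, history, x):
        ones = sum(history)
        p_one = (ones + 1) / (len(history) + 2)  # Laplace's rule of succession
        return p_one if x == 1 else 1.0 - p_one

def prequential_score(model, data, order):
    """Sum of log predictive probabilities when `data` is processed in `order`."""
    history, total = [], 0.0
    for i in order:
        total += math.log(model.predict_prob(history, data[i]))
        history.append(data[i])
    return total

def averaged_prequential_score(model, data, n_orderings=10, seed=0):
    """Average the prequential score over random orderings of the data.
    For this exchangeable toy model every ordering gives the same score;
    for the supervised criteria studied in the paper it generally does not."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    scores = []
    for _ in range(n_orderings):
        rng.shuffle(idx)
        scores.append(prequential_score(model, data, list(idx)))
    return sum(scores) / len(scores)

print(round(averaged_prequential_score(LaplaceBernoulli(), [1, 0, 1, 1]), 4))  # → -2.9957
```

Model selection then amounts to computing this averaged score for each candidate model and keeping the highest-scoring one.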

### Citations

3921 | Pattern Classification and Scene Analysis
- Duda, Hart
- 1973

Citation Context: ...nto two equal size sets, the training data and the test data. A pool of 40 candidate models (clusterings) z_1, ..., z_100 was then produced by running the K-means clustering algorithm (see, e.g., [8]) 40 times with the training data, starting from random initial points. The number of mixture components (the number of clusters, i.e., the number of possible values of Z) varied randomly between 3 an...

2868 | UCI Repository of Machine Learning Databases
- Merz, Murphy
- 1996

Citation Context: ...sterings z as our models M. The prequential model selection criterion alternatives discussed in Section 3 were empirically validated by using 14 classification data sets from the UCI data repository [2]. A single model selection experiment was performed in the following way. The data was first partitioned into two equal size sets, the training data and the test data. A pool of 40 candidate models (c...

1240 | Statistical Decision Theory and Bayesian Analysis (Springer Series in Statistics)
- Berger
- 1985

Citation Context: ...and consistency between different priors [12, 3], or only as a technical parameter representing no such information. In the latter case, it can be shown that a certain prior known as Jeffreys' prior [14, 1] can be given strong theoretical justification from the predictive performance point of view with respect to the so-called minimax loss formulation [21, 10]. Some empirical results concerning the effe...

721 | Cross-Validatory Choice and Assessment of Statistical Predictions (with Discussion)
- Stone
- 1974

Citation Context: ...be in practice a poor model selection criterion for classification domains, and that model selection criteria based on prequential (predictive sequential) approaches [5, 6, 7, 19] or cross-validation [23, 9] lead to more accurate predictive models. In this paper we extend and elaborate our previous work in two ways. First, instead of constraining ourselves to simple variants of the Naive Bayes model, her...

296 | Stochastic Complexity in Statistical Inquiry
- Rissanen
- 1989

275 | Fisher information and stochastic complexity
- Rissanen
- 1996

Citation Context: ...hat a certain prior known as Jeffreys' prior [14, 1] can be given strong theoretical justification from the predictive performance point of view with respect to the so-called minimax loss formulation [21, 10]. Some empirical results concerning the effect of Jeffreys' prior on predictive accuracy can be found in [16, 17, 11]. In the remainder of this paper we do not address the important problem of choosin...

249 | Stochastic complexity and modeling
- Rissanen
- 1986

Citation Context: ...ally that marginal likelihood can be in practice a poor model selection criterion for classification domains, and that model selection criteria based on prequential (predictive sequential) approaches [5, 6, 7, 19] or cross-validation [23, 9] lead to more accurate predictive models. In this paper we extend and elaborate our previous work in two ways. First, instead of constraining ourselves to simple variants o...

194 | An invariant form for the prior probability in estimation problems
- Jeffreys
- 1939

Citation Context: ...and consistency between different priors [12, 3], or only as a technical parameter representing no such information. In the latter case, it can be shown that a certain prior known as Jeffreys' prior [14, 1] can be given strong theoretical justification from the predictive performance point of view with respect to the so-called minimax loss formulation [21, 10]. Some empirical results concerning the effe...

191 | Bayesian analysis in expert systems
- Spiegelhalter, Dawid, et al.
- 1993

Citation Context: ...P(v_i | v^{i-1}, u^i, M) ∏_{i=1}^{N} P(u_i | v^{i-1}, u^{i-1}, M)  (3). Of these two products, the first one was called the partial (marginal) likelihood in [4] and the conditional node monitor in [22]. We now see that if we use the partial marginal likelihood as a basis for a prequential scoring function, this results in a sequential process where at time i, the classification predictive distribut...

171 | Statistical theory: the prequential approach
- Dawid
- 1984

Citation Context: ...ally that marginal likelihood can be in practice a poor model selection criterion for classification domains, and that model selection criteria based on prequential (predictive sequential) approaches [5, 6, 7, 19] or cross-validation [23, 9] lead to more accurate predictive models. In this paper we extend and elaborate our previous work in two ways. First, instead of constraining ourselves to simple variants o...

136 | Universal prediction
- Merhav, Feder
- 1998

Citation Context: ...om sample from some "true" but unknown probability distribution, it should be pointed out that the model selection problem can also be formalized without such an assumption, as demonstrated in, e.g., [5, 20, 18]. Given a set F = {M_1, ..., M_m} of possible models, and a data sample D, in the (unsupervised) model selection problem, the task is to choose a model M ∈ F so that the resulting predictive distri...

128 | Partial likelihoods
- Cox
- 1975

Citation Context: ...u^{i-1}, M) = ∏_{i=1}^{N} P(v_i | v^{i-1}, u^i, M) ∏_{i=1}^{N} P(u_i | v^{i-1}, u^{i-1}, M)  (3). Of these two products, the first one was called the partial (marginal) likelihood in [4] and the conditional node monitor in [22]. We now see that if we use the partial marginal likelihood as a basis for a prequential scoring function, this results in a sequential process where at time i, th...

102 | The predictive sample reuse method with applications
- Geisser
- 1975

Citation Context: ...be in practice a poor model selection criterion for classification domains, and that model selection criteria based on prequential (predictive sequential) approaches [5, 6, 7, 19] or cross-validation [23, 9] lead to more accurate predictive models. In this paper we extend and elaborate our previous work in two ways. First, instead of constraining ourselves to simple variants of the Naive Bayes model, her...

72 | The minimum description length principle and reasoning under uncertainty
- Grünwald
- 1998

Citation Context: ...hat a certain prior known as Jeffreys' prior [14, 1] can be given strong theoretical justification from the predictive performance point of view with respect to the so-called minimax loss formulation [21, 10]. Some empirical results concerning the effect of Jeffreys' prior on predictive accuracy can be found in [16, 17, 11]. In the remainder of this paper we do not address the important problem of choosin...

50 | Prequential Analysis, Stochastic Complexity and Bayesian Inference
- Dawid
- 1992

Citation Context: ...ally that marginal likelihood can be in practice a poor model selection criterion for classification domains, and that model selection criteria based on prequential (predictive sequential) approaches [5, 6, 7, 19] or cross-validation [23, 9] lead to more accurate predictive models. In this paper we extend and elaborate our previous work in two ways. First, instead of constraining ourselves to simple variants o...

38 | On predictive distributions and Bayesian networks
- Kontkanen, Myllymäki, et al.
- 2000

Citation Context: ...ctive performance point of view with respect to the so-called minimax loss formulation [21, 10]. Some empirical results concerning the effect of Jeffreys' prior on predictive accuracy can be found in [16, 17, 11]. In the remainder of this paper we do not address the important problem of choosing the prior distributions, but simply use uniform non-informative priors for the model parameters θ as well as for th...

23 | Likelihoods and parameter priors for Bayesian networks
- Heckerman, Geiger
- 1996

Citation Context: ...parameters. This prior can either be regarded as a formalization of our prior domain knowledge, in which case we are faced with the question of compatibility and consistency between different priors [12, 3], or only as a technical parameter representing no such information. In the latter case, it can be shown that a certain prior known as Jeffreys' prior [14, 1] can be given strong theoretical justifica...

23 | Models and selection criteria for regression and classification
- Heckerman, Meek
- 1997

Citation Context: ...ven the sample data. Assuming all the models to be equally probable a priori, this leads to choosing the model maximizing the marginal likelihood, or the evidence, of the data. However, as discussed in [13], the maximal-evidence model represents well the joint distribution of the domain variables, and is hence a solution for unsupervised model selection tasks. Nevertheless, this approach is frequently u...

22 | Fisherian inference in likelihood and prequential frames of reference
- Dawid
- 1991

19 | Minimum encoding approaches for predictive modeling
- Grünwald, Kontkanen, et al.
- 1998

Citation Context: ...ctive performance point of view with respect to the so-called minimax loss formulation [21, 10]. Some empirical results concerning the effect of Jeffreys' prior on predictive accuracy can be found in [16, 17, 11]. In the remainder of this paper we do not address the important problem of choosing the prior distributions, but simply use uniform non-informative priors for the model parameters θ as well as for th...

18 | On supervised selection of Bayesian networks
- Kontkanen, Myllymäki, et al.
- 1999

Citation Context: ...eless, this approach is frequently used also for supervised model selection tasks, such as the classification problem at hand. This issue is discussed in more detail in Section 2. In our earlier work [15] we demonstrated empirically that marginal likelihood can be in practice a poor model selection criterion for classification domains, and that model selection criteria based on prequential (predictive...

7 | On compatible priors for Bayesian networks
- Cowell
- 1992

Citation Context: ...parameters. This prior can either be regarded as a formalization of our prior domain knowledge, in which case we are faced with the question of compatibility and consistency between different priors [12, 3], or only as a technical parameter representing no such information. In the latter case, it can be shown that a certain prior known as Jeffreys' prior [14, 1] can be given strong theoretical justifica...

2 | Exploring the Robustness of Bayesian and Information-Theoretic Methods for Predictive Inference
- Kontkanen, Myllymäki, et al.
- 1999

Citation Context: ...ctive performance point of view with respect to the so-called minimax loss formulation [21, 10]. Some empirical results concerning the effect of Jeffreys' prior on predictive accuracy can be found in [16, 17, 11]. In the remainder of this paper we do not address the important problem of choosing the prior distributions, but simply use uniform non-informative priors for the model parameters θ as well as for th...

1 | Competitive on-line statistics
- unknown authors

Citation Context: ...requential approaches where the model selection criteria are typically computed predictively and sequentially ("prequentially"). Theoretical frameworks for prequential model selection can be found in [5, 6, 7, 19, 20, 24]. It is noteworthy that although these frameworks are motivated by various different considerations, all the suggested approaches lead to quite similar results if the predictive accuracy is measured b...