## An empirical study of minimum description length model selection with infinite parametric complexity (2006)

Venue: Journal of Mathematical Psychology

Citations: 10 (1 self)

### BibTeX

@ARTICLE{Rooij06anempirical,
  author  = {Steven de Rooij and Peter Grünwald},
  title   = {An empirical study of minimum description length model selection with infinite parametric complexity},
  journal = {Journal of Mathematical Psychology},
  year    = {2006},
  volume  = {50},
  pages   = {180--192}
}

### Abstract

Parametric complexity is a central concept in Minimum Description Length (MDL) model selection. In practice it often turns out to be infinite, even for quite simple models such as the Poisson and Geometric families. In such cases, MDL model selection based on NML and Bayesian inference based on Jeffreys’ prior cannot be used. Several ways to resolve this problem have been proposed. We conduct experiments to compare and evaluate their behaviour on small sample sizes. We find interestingly poor behaviour for the plug-in predictive code; a restricted NML model performs quite well, but it is questionable whether the results validate its theoretical motivation. A Bayesian marginal distribution with Jeffreys’ prior can still be used if one sacrifices the first observation to make a proper posterior; this approach turns out to be most dependable.

### Citations

9127 | Elements of Information Theory - Cover, Thomas - 1991

Citation Context: ...terion is helped by the circumstance that (1) one of the two hypotheses equals the generating distribution and (2) the sample consists of outcomes which are i.i.d. according to this distribution. In [Cover and Thomas 1991], Sanov’s Theorem is used to show that in such a situation, the probability that the criterion prefers the wrong model (“error probability”) decreases exponentially in the sample size. If the Bayesia...

2685 | Estimating the dimension of a model - Schwarz - 1978

1238 | Modeling by shortest data description - Rissanen - 1978

1148 | Bayesian Theory - Bernardo, Smith - 1994

Citation Context: ... then it is possible to compute a perfectly proper Bayesian posterior, after observing the first outcome, and use that as a prior to compute the marginal likelihood of the rest of the data. Refer to [Bernardo and Smith 1994] for more information on objective Bayesian theory. The resulting universal codes with lengths L_BAYES(x2, . . . , xn | x1) are, in fact, conditional on the first outcome. Recent work by [Liang and B...

325 | Stochastic Complexity in Statistical Inquiry - Rissanen - 1989

Citation Context: ...ypotheses, or models, for some phenomenon, based on the available data. Let M1 and M2 be two parametric models and D be the observed data. According to the Minimum Description Length (MDL) Principle [Rissanen 1989; Barron, Rissanen, and Yu 1998; Grünwald 2005], model selection should proceed by selecting the model that allows for the shortest description of the data. MDL was introduced to a psychological audie...

314 | The minimum description length principle in coding and modeling - Barron, Rissanen, et al. - 1998

Citation Context: ... sense that it codes the data in not many more bits than the best code in the set, whichever data sequence is actually observed. We thus define universal codes on an individual sequence basis, as in [Barron et al. 1998], rather than in an expected sense. The difference between the codelength of the universal code and the codelength of the shortest code in the set is called the regret, which is a function of a concr...

295 | Universal coding, information, prediction, and estimation - Rissanen - 1984

Citation Context: ...mplementation hardly requires any arbitrary decisions. Here the outcomes are coded sequentially using the probability distribution indexed by the ML estimator for the previous outcomes [Dawid 1984; Rissanen 1984]; for a general introduction see [Wagenmakers, Grünwald, and Steyvers 2006] or [Grünwald 2005]. L_PIPC(x^n) = Σ_{i=1}^{n} L(x_i | µ̂(x^{i−1})), where L(x_i | µ̂(x^{i−1})) = −ln P(x_i | µ̂(x^{i−1})) i...
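The plug-in predictive code quoted above can be made concrete. Below is a minimal Python sketch (our own illustration, not code from the paper) for the Poisson model, where the ML estimator µ̂ is the sample mean of the preceding outcomes; the first outcome is conditioned on rather than coded, and the guard against a zero ML estimate is an ad-hoc assumption of this sketch:

```python
import math

def plugin_codelength_poisson(xs):
    """Prequential plug-in codelength in nats: outcome x_i is coded with
    the Poisson distribution indexed by the ML estimate (the mean) of the
    preceding outcomes x_1..x_{i-1}.  The first outcome is conditioned on,
    since no ML estimate exists before it."""
    total = 0.0
    for i in range(1, len(xs)):
        mu = sum(xs[:i]) / i          # ML estimate from the first i outcomes
        if mu == 0.0:                 # degenerate start-up: P(x > 0 | mu = 0) = 0
            mu = 1e-9                 # ad-hoc guard, not part of the paper
        x = xs[i]
        # -ln Poisson(x | mu) = mu - x ln(mu) + ln(x!)
        total += mu - x * math.log(mu) + math.lgamma(x + 1)
    return total

print(plugin_codelength_poisson([1, 1, 1, 1]))  # 3.0: each coded outcome costs 1 nat
```

This sequential scheme is the one the paper finds, somewhat surprisingly, to behave poorly for model selection at small sample sizes.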

290 | Fisher information and stochastic complexity - Rissanen - 1996

Citation Context: ...ple space X^n, L_U(x^n) = −ln P_U(x^n). Many different constructions of universal codes have been proposed. Some are easy to implement, others have nice theoretical properties. The MDL philosophy [Rissanen 1996; Grünwald 2005] has it that the best universal code minimises the regret (1) in the worst case of all possible data sequences. This “minimax optimal” solution is called the “Normalised Maximum Likeli...
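The minimax-optimal NML code exists only when its normaliser — the parametric complexity — is finite. As an illustration (ours, not the paper's), the sketch below computes it for the Bernoulli model, where the sum over sequences is finite; for the Poisson and geometric families the analogous sum diverges, which is the problem the paper addresses:

```python
import math

def nml_complexity_bernoulli(n):
    """ln C_n, where C_n = sum over all binary sequences x^n of
    P(x^n | ML estimate).  Grouping sequences by their number of ones k,
    all C(n, k) of them share the ML estimate p = k/n."""
    c = 0.0
    for k in range(n + 1):
        p = k / n
        lik = (p ** k) * ((1 - p) ** (n - k))   # Python: 0**0 == 1, as needed here
        c += math.comb(n, k) * lik
    return math.log(c)

def nml_codelength_bernoulli(xs):
    """NML codelength in nats: -ln P(x^n | ML) + ln C_n.  The regret over
    the best-fitting Bernoulli distribution is ln C_n for every sequence,
    which is exactly the minimax property."""
    n, k = len(xs), sum(xs)
    p = k / n
    return -math.log((p ** k) * ((1 - p) ** (n - k))) + nml_complexity_bernoulli(n)
```

For n = 1 the complexity is ln 2, since both one-outcome sequences achieve likelihood 1 under their own ML estimate.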

68 | Present position and potential developments: Some personal views. Statistical theory. The prequential approach (with discussion) - Dawid - 1984

Citation Context: ...eover, its implementation hardly requires any arbitrary decisions. Here the outcomes are coded sequentially using the probability distribution indexed by the ML estimator for the previous outcomes [Dawid 1984; Rissanen 1984]; for a general introduction see [Wagenmakers, Grünwald, and Steyvers 2006] or [Grünwald 2005]. L_PIPC(x^n) = Σ_{i=1}^{n} L(x_i | µ̂(x^{i−1})), where L(x_i | µ̂(x^{i−1})) = −ln P(x_i | ...

57 | Statistical inference, Occam’s razor, and statistical mechanics on the space of probability distributions - Balasubramanian - 1997

Citation Context: ...sponds to a universal code L_BAYES(x^n) = −ln P_BAYES(x^n). Like the NML, this can be approximated with an asymptotic formula. For exponential families such as the models under consideration, we have [Balasubramanian 1997]: L_ABAYES(x^n) := L(x^n | θ̂) + (k/2) ln(n/2π) + ln(√det I(θ̂) / w(θ̂)), (12) where the asymptotic behaviour is the same as for the approximation of the NML codelength, roughly L_ABAYES(x^n) − L_BAYE...
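Approximation (12) can be sanity-checked numerically in a model where Jeffreys' prior is proper. The sketch below is our own illustration, under the assumption of a Bernoulli model: k = 1 parameter, I(p) = 1/(p(1−p)), and Jeffreys' prior is Beta(1/2, 1/2), so the last term of (12) reduces to ln π:

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def l_bayes_exact(k, n):
    """Exact Bayesian codelength (nats) of a Bernoulli sequence with k ones
    out of n, under Jeffreys' prior Beta(1/2, 1/2):
    -ln [ B(k + 1/2, n - k + 1/2) / B(1/2, 1/2) ]."""
    return -(log_beta(k + 0.5, n - k + 0.5) - log_beta(0.5, 0.5))

def l_bayes_asymptotic(k, n):
    """Formula (12) for this model: L(x^n | p_hat) + (1/2) ln(n / 2pi) + ln(pi).
    Requires 0 < k < n so that the ML codelength is finite."""
    p = k / n
    ml_codelength = -(k * math.log(p) + (n - k) * math.log(1 - p))
    return ml_codelength + 0.5 * math.log(n / (2 * math.pi)) + math.log(math.pi)
```

For k = 5, n = 10 the two differ by only a few hundredths of a nat, and the gap shrinks further as n grows, consistent with the approximation error vanishing asymptotically.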

53 | MDL Denoising - Rissanen - 2000

Citation Context: ...have the almost minimax optimality property. 4.5 Renormalised Maximum Likelihood Related to the two-part restricted ANML, but more elegant, is Rissanen’s renormalised maximum likelihood (RNML) code [Rissanen 2000; Grünwald 2005]. This is perhaps the most widely known approach to deal with infinite parametric complexity. The idea here is that the NML distribution is well-defined if the parameter range is restr...

46 | Model selection based on minimum description length - Grünwald - 2000

Citation Context: ...sanen, and Yu 1998; Grünwald 2005], model selection should proceed by selecting the model that allows for the shortest description of the data. MDL was introduced to a psychological audience in 2000 [Grünwald 2000] and since then, it has been successfully applied in a number of psychological contexts [Myung, Pitt, Zhang, and Balasubramanian 2001; Pitt, Myung, and Zhang 2002; Lee and Navarro 2005; Chater 2005]....

45 | Counting probability distributions: Differential geometry and model selection - Myung, Balasubramanian, et al. - 2000

44 | Objective Bayesian methods for model selection: Introduction and comparison - Berger, Pericchi - 2001

Citation Context: ...s of the data than to let it depend on arbitrary decisions of the scientist, such as the choice of a maximum value for µ* in the case of the restricted ANML criterion. As advocated for instance in [Berger and Pericchi 1997], arbitrariness can be reduced by conditioning on every outcome in turn and then using the mean or median codelength one so obtains. We have not gone to such lengths in this study. We compute Jeffrey...

27 | A Predictive Least-Squares Principle - Rissanen - 1986

Citation Context: ...rd it as an approximation”. There are also many results showing that the regret for the plug-in code grows as (k/2) ln n, the same as the regret for the NML code, for a variety of models. Examples are [Rissanen 1986; Gerencsér 1987; Wei 1990]. Finally, publications such as [Modha and Masry 1998; Kontkanen et al. 2001] show excellent behaviour of the plug-in criterion for model selection in regression and classi...

15 | Information theoretic asymptotics of Bayes methods - Clarke, Barron - 1990

Citation Context: ...hip to the real world. Moreover, in the objective Bayesian branch of Bayesian statistics, one does emphasise procedures with good frequentist behaviour [Berger 2004]. At least in restricted contexts [Clarke and Barron 1990; Clarke and Barron 1994], Jeffreys’ prior has the property that the Kullback-Leibler divergence between the true distribution and the posterior converges to zero quickly, no matter what the true dist...

15 | Accumulative prediction error and the selection of time series models - Wagenmakers, Grünwald, et al. - 2006

13 | Toward a method for selecting among computational models for cognition - Pitt, Myung, et al. - 2002

6 | Jeffreys’ Prior is Asymptotically Least Favourable under Entropy Risk - Clarke, Barron - 1994

Citation Context: ...oreover, in the objective Bayesian branch of Bayesian statistics, one does emphasise procedures with good frequentist behaviour [Berger 2004]. At least in restricted contexts [Clarke and Barron 1990; Clarke and Barron 1994], Jeffreys’ prior has the property that the Kullback-Leibler divergence between the true distribution and the posterior converges to zero quickly, no matter what the true distribution is. Consequentl...

6 | Comparing prequential model selection criteria in supervised learning of mixture models - Kontkanen, Myllymäki, et al. - 2001

Citation Context: ...code grows as (k/2) ln n, the same as the regret for the NML code, for a variety of models. Examples are [Rissanen 1986; Gerencsér 1987; Wei 1990]. Finally, publications such as [Modha and Masry 1998; Kontkanen et al. 2001] show excellent behaviour of the plug-in criterion for model selection in regression and classification based on Bayesian networks, respectively. So, we were extremely puzzled by these results at fir...

6 | Prequential and cross-validated regression estimation - Modha, Masry - 1998

Citation Context: ...t, maximum likelihood testing) lead to relatively poor results. Our most surprising finding is the fact that the plug-in code – which has been shown to perform remarkably well in some other contexts [Modha and Masry 1998; Kontkanen, Myllymäki, and Tirri 2001] – shows relatively poor behaviour. We analyze the reasons for this behaviour in Section 6. Since the codes used in these approaches no longer minimise the worst...

5 | Theory of Probability (Third Edition) - Jeffreys - 1961

Citation Context: ... one reason is that it is uniform over all ‘distinguishable’ elements of the model [Balasubramanian 1997], which implies that the obtained results are independent of the parametrisation of the model [Jeffreys 1961]. It is defined as follows: w(θ) = √det I(θ) / ∫_Θ √det I(θ) dθ. (13) Unfortunately, the normalisation factor in Jeffreys’ prior diverges for both the Poisson model and the geometric model. But if on...
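The divergence of the normaliser, and the fix of conditioning on the first outcome discussed earlier on this page, can be checked in closed form for the Poisson model, where I(µ) = 1/µ and hence the unnormalised Jeffreys density is µ^(−1/2). This Python sketch (our own, using standard calculus identities) shows the prior mass growing without bound while a single observed outcome already yields a proper Gamma-shaped posterior:

```python
import math

def jeffreys_mass_up_to(upper):
    """Integral of mu^(-1/2) over (0, upper] = 2*sqrt(upper): finite for
    every cut-off, but unbounded as the cut-off grows -- so the Jeffreys
    prior for the Poisson model is improper."""
    return 2.0 * math.sqrt(upper)

def posterior_normaliser(x1):
    """After one Poisson outcome x1 the unnormalised posterior is
    mu^(x1 - 1/2) * exp(-mu), a Gamma(x1 + 1/2, 1) shape whose normaliser
    Gamma(x1 + 1/2) is finite -- the posterior is proper."""
    return math.gamma(x1 + 0.5)

print(jeffreys_mass_up_to(1e2), jeffreys_mass_up_to(1e6))  # 20.0 2000.0
```

This is exactly why sacrificing the first observation, as the abstract puts it, makes the Jeffreys approach usable for these models.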

5 | Minimum description length and psychological clustering models - Lee, Navarro - 2005

Citation Context: ...cal audience in 2000 [Grünwald 2000] and since then, it has been successfully applied in a number of psychological contexts [Myung, Pitt, Zhang, and Balasubramanian 2001; Pitt, Myung, and Zhang 2002; Lee and Navarro 2005; Chater 2005]. A substantial problem when applying MDL in practice is that its optimal implementation based on the normalised maximum likelihood (NML) code does not exist for many (if not most!) sta...

4 | Comparing model selection criteria for belief networks (under submission) - Allen, Madani, et al. - 2003

4 | A minimum description length principle for perception - Chater - 2005

Citation Context: ...Grünwald 2000] and since then, it has been successfully applied in a number of psychological contexts [Myung, Pitt, Zhang, and Balasubramanian 2001; Pitt, Myung, and Zhang 2002; Lee and Navarro 2005; Chater 2005]. A substantial problem when applying MDL in practice is that its optimal implementation based on the normalised maximum likelihood (NML) code does not exist for many (if not most!) standard statist...

4 | Order estimation of stationary Gaussian ARMA processes using Rissanen’s complexity - Gerencsér - 1987

4 | Hypothesis testing for Poisson versus geometric distributions using stochastic complexity - Lanterman - 2004

Citation Context: ...? Acknowledgements The main idea for this article is not our own, but comes from Aaron D. Lanterman’s text “Hypothesis Testing for Poisson versus Geometric Distributions using Stochastic Complexity” [Lanterman 2005], which it is a pleasure to read. He deserves much credit. Furthermore we wish to thank the editors: their extensive, insightful and fair criticisms have led to significant improvements in the text. T...

4 | Exact minimax predictive density estimation and MDL - Liang, Barron - 2005

Citation Context: ... Smith 1994] for more information on objective Bayesian theory. The resulting universal codes with lengths L_BAYES(x2, . . . , xn | x1) are, in fact, conditional on the first outcome. Recent work by [Liang and Barron 2005] suggests that, at least asymptotically and for one-parameter models, the universal code achieving the minimal expected redundancy conditioned on the first outcome is given by the Bayesian universa...

2 | The use of MDL to select among computational models of cognition - Myung, Pitt, et al. - 2001

1 | Asymptotic log-loss of prequential maximum likelihood codes - Grünwald, de Rooij - 2005

1 | A note on the applied use of MDL approximations. Neural Computation 16 - Navarro - 2004

Citation Context: ...ise where {fθ} is a parameterised set of functions and noise is a 0-mean, normally distributed noise term. Such models often appear in psychological contexts [Myung, Balasubramanian, and Pitt 2000; Navarro 2004]. In such cases it is not clear how MDL model selection should be applied. A variety of remedies to this problem have been proposed. These amount to defining the codelength L_M(D) using a universal co...