## Model Selection based on Minimum Description Length (1999)

Venue: Journal of Mathematical Psychology

Citations: 39 (3 self)

### BibTeX

```bibtex
@ARTICLE{Grünwald99modelselection,
  author  = {P. Grünwald},
  title   = {Model Selection based on Minimum Description Length},
  journal = {Journal of Mathematical Psychology},
  year    = {1999},
  volume  = {44},
  pages   = {133--152}
}
```

### Abstract

This paper is, of necessity, quite technical. To get a first but much gentler glimpse, we advise the reader to just read the following section (Section 2) and the last section (Section 7), which discusses in what sense we may expect Occam's razor to actually work.

### Citations

8659 | Elements of information theory - Cover, Thomas - 1991

Citation Context: ...ded sequence C(x) retrieves the original sequence x. Moreover, in the encodings C(x), no commas are needed to separate the codewords. Codes with this latter property are called 'uniquely decodable' (Cover and Thomas, 1991). For various reasons (Rissanen, 1989), all codes considered in any application of MDL are uniquely decodable; henceforth whenever we speak of a code, we actually mean a uniquely decodable one. Compr...
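The property described in this excerpt can be sketched with a minimal example. The codewords below are a hypothetical prefix-free code of our own choosing (not from the paper): because no codeword is a prefix of another, concatenated codewords can be decoded without any separating commas.

```python
# Hypothetical prefix-free code chosen for illustration:
# no codeword is a prefix of another, so the code is uniquely decodable.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

def encode(symbols):
    """Concatenate codewords with no separators."""
    return "".join(code[s] for s in symbols)

def decode(bits):
    """Recover the symbols; codeword boundaries are recognized on the fly."""
    inv = {v: k for k, v in code.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inv:        # a complete codeword has just been read
            out.append(inv[buf])
            buf = ""
    return "".join(out)

msg = "abcdda"
encoded = encode(msg)         # no commas needed between codewords
assert decode(encoded) == msg
```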

1696 | An introduction to Kolmogorov complexity and its applications, 2nd edition - Li, Vitányi - 1997

Citation Context: ...It is easy to show that there exists quite a short program generating sequence (3) too, as we will make plausible in the next section. Kolmogorov Complexity We now define the Kolmogorov Complexity (Li and Vitányi, 1997) of a sequence as the length of the shortest program that prints the sequence and then halts. The lower the Kolmogorov complexity of a sequence, the more regular, or equivalently, the less random, or,...
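Kolmogorov complexity itself is uncomputable, but its flavor can be illustrated with an off-the-shelf compressor: a regular sequence admits a much shorter description than a random one. This is a crude sketch of our own, not a construction from the paper.

```python
import random
import zlib

# Compressed length as a computable (and very rough) stand-in for the
# uncomputable Kolmogorov complexity: regular data compresses well.
regular = b"01" * 5000                      # highly regular 10,000-symbol sequence
random.seed(0)
irregular = bytes(random.choice(b"01") for _ in range(10000))  # 'random' sequence

len_regular = len(zlib.compress(regular, 9))
len_irregular = len(zlib.compress(irregular, 9))
# The regular sequence has a far shorter description than the random one.
assert len_regular < len_irregular
```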

1262 | Statistical Decision Theory and Bayesian Analysis - Berger - 1985

Citation Context: ...1990) and several models arising in the analysis of DNA and protein structure (Dowe et al., 1996). 6 Comparison to other approaches Bayesian Evidence and Stochastic Complexity In Bayesian statistics (Berger, 1985) we always use a prior distribution w for all elements in the chosen model class M. We can then simply calculate the conditional probability of the data D given model class M as P(D|M) = Σ_{θ∈M} P(D|...
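The evidence formula in this excerpt, P(D|M) = Σ_{θ∈M} P(D|θ)w(θ), can be sketched for a small finite model class. The Bernoulli class and the uniform prior below are illustrative choices of ours, not the paper's.

```python
from math import comb

# Hypothetical finite model class M: a few Bernoulli parameters,
# with a uniform prior w over M (both chosen here for illustration).
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
w = {t: 1 / len(thetas) for t in thetas}

def likelihood(theta, n, k):
    """P(D | theta) for data D = n coin flips with k heads."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

n, k = 20, 14                               # observed data D
# Bayesian evidence: P(D|M) = sum over theta in M of P(D|theta) * w(theta)
evidence = sum(likelihood(t, n, k) * w[t] for t in thetas)
```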

1178 | Modeling by the shortest data description - Rissanen - 1978

500 | Stochastic Complexity and - Rissanen - 1989

Citation Context: ...will become clear later): The stochastic complexity of the data set D with respect to the model class M is the shortest code length of D obtainable when the encoding is done with the help of class M (Rissanen, 1987; Rissanen, 1996). Here 'with the help of' has a clear intuitive meaning: if there exists a model in M which captures the regularities in D well, or equivalently gives a good fit to D, then the code l...
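The idea of encoding data "with the help of" a model class can be sketched as a two-part code: first name a (discretized) parameter θ, then encode the data with the code induced by θ. The grid discretization below is our simplification, not Rissanen's exact construction.

```python
from math import log2

def two_part_length(n, k, grid_size=16):
    """Two-part code length L(theta) + L(D|theta), minimized over a
    discretized parameter grid (an illustrative construction)."""
    grid = [(i + 0.5) / grid_size for i in range(grid_size)]
    best = float("inf")
    for theta in grid:
        l_theta = log2(grid_size)            # bits to name theta on the grid
        # ideal code length for n binary outcomes with k ones under theta
        l_data = -(k * log2(theta) + (n - k) * log2(1 - theta))
        best = min(best, l_theta + l_data)
    return best

# A class member that fits D well compresses it below the raw n bits.
n, k = 100, 90
assert two_part_length(n, k) < n
```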

298 | Stochastic Complexity and Statistical Inquiry - Rissanen - 1989

Citation Context: ...ckpropagation neural networks - is better even than the class of all polynomials. We can define stochastic complexity even for such broad classes and select the one which allows for more compression (Rissanen, 1989) of the data at hand. 3 Codes, Probability Distributions and Hypotheses In the next section we give a formal definition of stochastic complexity. We now prepare this by first making precise the notio...

275 | Fisher information and stochastic complexity - Rissanen - 1996

Citation Context: ...gets replaced by an integral. In this case, again for many model classes there exists a prior w(θ) with which −log P_av(D|M) approximates I(D|M) extremely well, but it is not the uniform prior (Rissanen, 1996). For most other priors which give probability > 0 to all models in M, the approximation to I(D|M) will still be reasonable but only for large datasets. SC as a two-stage description of the data Anot...

207 | Minimum complexity density estimation - Barron, Cover - 1991

Citation Context: ...lidation seems to perform comparably to MDL (Kearns et al., 1997). 7 On Overfitting and Underfitting There have been several studies which indicate that MDL performs well both in theory and practice (Barron and Cover, 1991; Rissanen, 1989; Kearns et al., 1997; Kontkanen et al., 1996). We will not go into these studies any further but rather appeal to a common-sense argument, which, in a nutshell, says that there is a c...

110 | An experimental and theoretical comparison of model selection methods - Kearns, Mansour, et al. - 1995

Citation Context: ...'overfits' whereas MDL does not (Rissanen, 1989), it is not clear to the present author whether this implies much for practical settings, for which cross-validation seems to perform comparably to MDL (Kearns et al., 1997). 7 On Overfitting and Underfitting There have been several studies which indicate that MDL performs well both in theory and practice (Barron and Cover, 1991; Rissanen, 1989; Kearns et al., 1997; Kon...

82 | Complexity regularization with application to artificial neural networks. In: Nonparametric functional estimation and related topics (Spetses - Barron - 1991

Citation Context: ...in this paper are very hard to compute. Examples of 'strange' model classes for which two-part codes have already been constructed are context-free grammars (Grünwald, 1996), various neural networks (Barron, 1990) and several models arising in the analysis of DNA and protein structure (Dowe et al., 1996). 6 Comparison to other approaches Bayesian Evidence and Stochastic Complexity In Bayesian statistics (Berg...

68 | Reconciling simplicity and likelihood principles in perceptual organization - Chater - 1996

Citation Context: ...v and Chaitin has generated a large body of research on Kolmogorov Complexity (Li and Vitányi, 1997), which has found its way into psychology before in a context very different from model selection (Chater, 1996). 1 By this we mean that a universal Turing Machine can be implemented in it (Li and Vitányi, 1997). Unfortunately, the Kolmogorov Complexity as such cannot be computed -- there can be no computer p...

46 | A formal theory of inductive inference, part 1 and part 2 - Solomonoff - 1964

35 | A minimum description length approach to grammar inference - Grünwald - 1994

Citation Context: ...h the other approximations to SC mentioned in this paper are very hard to compute. Examples of 'strange' model classes for which two-part codes have already been constructed are context-free grammars (Grünwald, 1996), various neural networks (Barron, 1990) and several models arising in the analysis of DNA and protein structure (Dowe et al., 1996). 6 Comparison to other approaches Bayesian Evidence and Stochastic...

10 | Comparing Bayesian model class selection criteria by discrete finite mixtures - Kontkanen, Myllymaki, et al. - 1996

Citation Context: ...997). 7 On Overfitting and Underfitting There have been several studies which indicate that MDL performs well both in theory and practice (Barron and Cover, 1991; Rissanen, 1989; Kearns et al., 1997; Kontkanen et al., 1996). We will not go into these studies any further but rather appeal to a common-sense argument, which, in a nutshell, says that there is a crucial difference between overfitting and underfitting. Let u...

1 | Circular - Dowe, Allison, et al. - 1996

Citation Context: ...two-part codes have already been constructed are context-free grammars (Grünwald, 1996), various neural networks (Barron, 1990) and several models arising in the analysis of DNA and protein structure (Dowe et al., 1996). 6 Comparison to other approaches Bayesian Evidence and Stochastic Complexity In Bayesian statistics (Berger, 1985) we always use a prior distribution w for all elements in the chosen model class M....

1 | Issues in selecting mathematical models of cognition. To appear - Myung, Pitt - 1997

Citation Context: ...) + L(θ) (Rissanen, 1996). This version of SC may be of particular interest to psychologists since one can compute it for just about any model class one can think of - in psychological applications (Myung and Pitt, 1997), one more often than not uses complicated models for which the other approximations to SC mentioned in this paper are very hard to compute. Examples of 'strange' model classes for which two-part cod...