## A Tutorial Introduction to the Minimum Description Length Principle (2005)

### Other Repositories/Bibliography

Venue: Advances in Minimum Description Length: Theory and Applications

Citations: 68 (0 self)

### BibTeX

@INPROCEEDINGS{Grünwald05atutorial,
  author = {Peter Grünwald},
  title = {A Tutorial Introduction to the Minimum Description Length Principle},
  booktitle = {Advances in Minimum Description Length: Theory and Applications},
  year = {2005},
  publisher = {MIT Press}
}

### Citations

9811 | Statistical Learning Theory - Vapnik - 1998 |
Citation Context: ...ture. Figure 1.1: A simple, a complex and a trade-off (3rd degree) polynomial. This intuition is confirmed by numerous experiments on real-world data from a broad variety of sources [Rissanen 1989; Vapnik 1998; Ripley 1996]: if one naively fits a high-degree polynomial to a small sample (set of data points), then one obtains a very good fit to the data. Yet if one tests the inferred polynomial on a second s...

9138 | Elements of information theory - Cover, Thomas - 1991 |
Citation Context: ...ly, each distribution assigns a number to each z, where most z’s are assigned small numbers. It turns out that this correspondence can be made mathematically precise by means of the Kraft inequality [Cover and Thomas 1991]. We neither precisely state nor prove this inequality; rather, in Figure 2.1 we state an immediate and fundamental consequence: probability mass functions correspond to codelength functions. The fol...

2693 | Estimating the dimension of a model - Schwarz - 1978 |
Citation Context: ...ditions on M, we can perform a Laplace approximation of the integral in (2.12). For the special case that M is an exponential family, we obtain the following expression for the regret [Jeffreys 1961; Schwarz 1978; Kass and Raftery 1995; Balasubramanian 1997]: $-\log \bar{P}_{\mathrm{Bayes}}(x^n) - [-\log P(x^n \mid \hat\theta(x^n))] = \frac{k}{2}\log\frac{n}{2\pi} - \log w(\hat\theta) + \log\sqrt{|I(\hat\theta)|} + o(1)$. (2.23) Let us compare this with (2.21). Under t...

2279 | An introduction to Probability Theory and Its Applications Volume 1 Chapter V, Wiley, 3rd edition - Feller - 1968 |
Citation Context: ...among all possible codes for Z, the code with lengths $-\log P(Z)$ ‘on average’ gives the shortest encodings of outcomes of P. Why should we be interested in the average? The law of large numbers [Feller 1968] implies that, for large samples of data distributed according to P, with high P-probability, the code that gives the shortest expected lengths will also give the shortest actual codelengths, which...

1777 | An introduction to Kolmogorov complexity and its applications - Li, Vitányi - 1997 |
Citation Context: ...age in which to express properties of the data. The most general choice is a general-purpose computer language such as C or Pascal. This choice leads to the definition of the Kolmogorov Complexity [Li and Vitányi 1997] of a sequence as the length of the shortest program that prints the sequence and then halts. The lower the Kolmogorov complexity of a sequence, the more regular it is. This notion seems to be highly...

1590 | Information theory and an extension of the maximum likelihood principle - Akaike - 1973 |

1239 | Modeling by Shortest Data Description - Rissanen - 1978 |
Citation Context: ...der one code, but quite short under another, our procedure is in danger of becoming arbitrary. Instead, we need some additional principle for designing a code for H. In the first publications on MDL [Rissanen 1978; Rissanen 1983], it was advocated to choose some sort of minimax code for H, minimizing, in some precisely defined sense, the shortest worst-case total description length L(H) + L(D|H), where the wor...

1194 | Pattern Recognition and Neural Networks - Ripley - 1996 |
Citation Context: ...e 1.1: A simple, a complex and a trade-off (3rd degree) polynomial. This intuition is confirmed by numerous experiments on real-world data from a broad variety of sources [Rissanen 1989; Vapnik 1998; Ripley 1996]: if one naively fits a high-degree polynomial to a small sample (set of data points), then one obtains a very good fit to the data. Yet if one tests the inferred polynomial on a second set of data co...

1150 | Bayesian Theory - Bernardo, Smith - 1994 |
Citation Context: ...odelength to each of these. 3. Bayesian interpretation In many cases, refined MDL model selection coincides with Bayes factor model selection based on a non-informative prior such as Jeffreys’ prior [Bernardo and Smith 1994]. 4. Prequential interpretation Refined MDL model selection can be interpreted as selecting the model with the best predictive performance when sequentially predicting unseen test data, in the sense ...

1144 | Bayes factors - Kass, Raftery - 1995 |
Citation Context: ...max regret. 2.6.3 Bayesian Interpretation The Bayesian method of statistical inference provides several alternative approaches to model selection. The most popular of these is based on Bayes factors [Kass and Raftery 1995]. The Bayes factor method is very closely related to the refined MDL approach. Assuming uniform priors on models $M^{(1)}$ and $M^{(2)}$, it tells us to select the model with largest marginal likelihood ...

557 | Three approaches to the quantitative definition of information - Kolmogorov - 1965 |

533 | Stochastic Complexity - Rissanen - 1989 |
Citation Context: ...ferent - see Section 2.9. The first publications on MDL only mention two-part codes. Important progress was made by Rissanen [1984], in which prequential codes are employed for the first time and [Rissanen 1987], introducing the Bayesian mixture codes into MDL. This led to the development of the notion of stochastic complexity as the shortest codelength of the data given a model [Rissanen 1986; Rissanen 198...

488 | Statistical Inference - Casella, Berger - 2002 |
Citation Context: ...we will make precise the crude form of MDL informally presented in Section 1.3. We will freely use some convenient statistical concepts which we review in this section; for details see, for example, [Casella and Berger 1990]. We also describe the model class of Markov chains of arbitrary order, which we use as our running example. These admit a simpler treatment than the polynomials, to which we return in Section 2.8. S...

433 | A universal prior for integers and estimation by minimum description length, The Annals of Statistics 11(2 - Rissanen - 1983 |
Citation Context: ...ut quite short under another, our procedure is in danger of becoming arbitrary. Instead, we need some additional principle for designing a code for H. In the first publications on MDL [Rissanen 1978; Rissanen 1983], it was advocated to choose some sort of minimax code for H, minimizing, in some precisely defined sense, the shortest worst-case total description length L(H) + L(D|H), where the worst-case is over...

325 | Stochastic Complexity in Statistical Inquiry - Rissanen |
Citation Context: ...e rightmost picture. Figure 1.1: A simple, a complex and a trade-off (3rd degree) polynomial. This intuition is confirmed by numerous experiments on real-world data from a broad variety of sources [Rissanen 1989; Vapnik 1998; Ripley 1996]: if one naively fits a high-degree polynomial to a small sample (set of data points), then one obtains a very good fit to the data. Yet if one tests the inferred polynomial ...

322 | An information measure for classification - Wallace, Boulton - 1968 |
Citation Context: ... substantially different from MDL. MDL is much closer related to the Minimum Message Length Principle, developed by Wallace and his co-workers in a series of papers starting with the ground-breaking [Wallace and Boulton 1968]; other milestones are [Wallace and Boulton 1975] and [Wallace and Freeman 1987]. Remarkably, Wallace developed his ideas without being aware of the notion of Kolmogorov complexity. Although Rissanen...

314 | The minimum description length principle in coding and modeling, this issue - Barron, Rissanen, et al. |

305 | Inferring Decision Trees Using the Minimum Description Length Principle - Quinlan, Rivest - 1989 |
Citation Context: ...d be noted that Rissanen’s and Barron’s early theoretical papers on MDL already contain such principles, albeit in a slightly different form than in their recent papers. Early practical applications [Quinlan and Rivest 1989; Grünwald 1996] often do use ad hoc two-part codes which really are ‘crude’ in the sense defined here. 5. See the previous note. 6. For example, cross-validation cannot easily be interpreted in such ...

296 | Universal coding, information prediction and estimation - Rissanen - 1984 |
Citation Context: ...lds; we omit the details. We note that all general proofs of (2.30) that we are aware of show that (2.30) holds with probability 1 or in expectation for sequences generated by some distribution in M [Rissanen 1984; Rissanen 1986; Rissanen 1989]. Note that the expressions (2.21) and (2.25) for the regret of $\bar{P}_{\mathrm{nml}}$ and $\bar{P}_{\mathrm{Bayes}}$ hold for a much wider class of sequences; they also hold with probability 1 for i.i.d...

291 | Fisher Information and Stochastic Complexity - Rissanen - 1996 |

266 | Stochastic complexity and modeling - Rissanen - 1986 |
Citation Context: ... time and [Rissanen 1987], introducing the Bayesian mixture codes into MDL. This led to the development of the notion of stochastic complexity as the shortest codelength of the data given a model [Rissanen 1986; Rissanen 1987]. However, the connection to Shtarkov’s normalized maximum likelihood code was not made until 1996, and this prevented the full development of the notion of ‘parametric complexity’. In...

241 | On the length of programs for computing finite binary sequences: statistical considerations - Chaitin - 1969 |

213 | An invariant form for the prior probability in estimation problems - Jeffreys - 1946 |
Citation Context: ... two models $M^{(1)}$ and $M^{(2)}$, then for large enough n, Bayes and refined MDL select the same model. If we equip the Bayesian universal model with a special prior known as the Jeffreys-Bernardo prior [Jeffreys 1946; Bernardo and Smith 1994], $w_{\mathrm{Jeffreys}}(\theta) = \frac{\sqrt{|I(\theta)|}}{\int \sqrt{|I(\theta)|}\,d\theta}$, (2.24) then Bayes and refined NML become even more closely related: plugging in (2.24) into (2.23), we find that the right-hand side ...

212 | Minimum complexity density estimation - Barron, Cover - 1991 |
Citation Context: ..., we would obtain smaller code lengths - (2.8) would become $L(k, \theta^{(k)}) + L(D \mid k, \theta^{(k)}) = -\log P(D \mid k, \hat\theta^{(k)}) + \frac{k}{2}\log n + c_k$, (2.9) where $c_k$ is a small constant depending on k, but not n [Barron and Cover 1991]. In Section 2.6 we show that (2.9) is in some sense ‘optimal’. The Good News and the Bad News The good news is (1) we have found a principled, non-arbitrary manner to encode data D given a probabili...

208 | The similarity metrics - Li, Chen, et al. - 2003 |

197 | Estimation and inference by compact coding - Wallace, Freeman - 1987 |
Citation Context: ...sage Length Principle, developed by Wallace and his co-workers in a series of papers starting with the ground-breaking [Wallace and Boulton 1968]; other milestones are [Wallace and Boulton 1975] and [Wallace and Freeman 1987]. Remarkably, Wallace developed his ideas without being aware of the notion of Kolmogorov complexity. Although Rissanen became aware of Wallace’s work before the publication of [Rissanen 1978], he de...

185 | The selection of prior distributions by formal rules - Kass, Wasserman - 1996 |
Citation Context: ...nt behavior of predictive (prequential) coding in Bayesian network model selection and regression. Also, ‘objective Bayesian’ model selection methods are frequently and successfully used in practice [Kass and Wasserman 1996]. Since these are based on non-informative priors such as Jeffreys’, they often coincide with a version of refined MDL and thus indicate successful performance of MDL. 23. But see Viswanathan., Walla...

156 | Model selection and the principle of minimum description length - Hansen, Yu - 2001 |
Citation Context: ...ctive tasks: nonparametric inference, parameter estimation and regression and classification problems. We give a very brief overview of these - for details we refer to [Barron, Rissanen, and Yu 1998; Hansen and Yu 2001] and, for the classification case, [Grünwald and Langford 2004]. Non-Parametric Inference Sometimes the model class M is so large that it cannot be finitely parameterized. For example, let X = [0, 1]...

141 | Probability and Finance: It's Only a Game - Shafer, Vovk - 2001 |

131 | Universal sequential coding of single messages, Probl - Shtarkov - 1987 |
Citation Context: ...at the more sequences $x^n$ with large $P(x^n \mid \hat\theta(x^n))$, the larger $\mathrm{COMP}_n(M)$. In other words, the more sequences that can be fit well by an element of M, the larger M’s complexity. Proposition 2.14 [Shtarkov 1987] Suppose that $\mathrm{COMP}_n(M)$ is finite. Then the minimax regret (2.16) is uniquely achieved for the distribution $\bar{P}_{\mathrm{nml}}$ given by $\bar{P}_{\mathrm{nml}}(x^n) := \frac{P(x^n \mid \hat\theta(x^n))}{\sum_{y^n \in \mathcal{X}^n} P(y^n \mid \hat\theta(y^n))}$. (2.18) The dis...

130 | Complexity-based induction systems: comparisons and convergence theorems - Solomonoff - 1978 |
Citation Context: ...ate model for a sequence of data may be identified with the shortest program that prints the data. Solomonoff’s ideas were later extended by several authors, leading to an ‘idealized’ version of MDL [Solomonoff 1978; Li and Vitányi 1997; Gács, Tromp, and Vitányi 2001]. This idealized MDL is very general in scope, but not practically applicable, for the following two reasons: 1. uncomputability It can be shown th...

129 | Bayesian Statistics: An Introduction - Lee - 2004 |
Citation Context: ...t interpretations of MDL listed in Section 2.6 make clear that MDL applications can also be justified without adopting such a radical philosophy. 2.9.2 MDL and Bayesian Inference Bayesian statistics [Lee 1997; Bernardo and Smith 1994] is one of the most well-known, frequently and successfully applied paradigms of statistical inference. It is often claimed that ‘MDL is really just a special case of Bayes 1...

110 | An Experimental and Theoretical Comparison of Model Selection Methods - Kearns, Mansour, et al. - 1997 |

96 | Probability Theory: The Logic of Science; Cambridge - Jaynes - 2003 |

85 | Complexity regularization with application to artificial neural networks - Barron - 1991 |
Citation Context: ...se as being normally distributed. Alternatively, it has been tried to directly try to learn functions h ∈ H from the data, without making any probabilistic assumptions about the noise [Rissanen 1989; Barron 1990; Yamanishi 1998; Grünwald 1998; Grünwald 1999]. The idea is to learn a function h that leads to good predictions of future data from the same source in the spirit of Vapnik’s [1998] statistical learn...

85 | The Role of Occam's Razor in Knowledge Discovery, Data Mining and Knowledge Discovery, an International Journal - Domingos - 1999 |
Citation Context: ...hilosophy is quite agnostic about whether any of the models under consideration is ‘true’, or whether something like a ‘true distribution’ even exists. Nevertheless, it has been suggested [Webb 1996; Domingos 1999] that MDL embodies a naive belief that ‘simple models’ are ‘a priori more likely to be true’ than complex models. Below we explain why such claims are mistaken. 1.6 MDL and Occam’s Razor When two mod...

82 | Geometrical Foundations of Asymptotic Inference - Kass, Voss - 1997 |
Citation Context: ...Θ $-\log P(x^n \mid \hat\theta(x^n)) + \frac{k}{2}\log\frac{n}{2\pi} - \log w(\hat\theta) + \log\sqrt{|\hat{I}(x^n)|} + o(1)$. (2.25) Here $\hat{I}(x^n)$ is the so-called observed information, sometimes also called observed Fisher information; see [Kass and Voss 1997] for a definition. If M is an exponential family, then the observed Fisher information at $x^n$ coincides with the Fisher information at $\hat\theta(x^n)$, leading to (2.23). If M is not exponential, then if...

75 | The Minimum Description Length Principle and Reasoning under Uncertainty - Grünwald - 1998 |
Citation Context: ...ted. Alternatively, it has been tried to directly try to learn functions h ∈ H from the data, without making any probabilistic assumptions about the noise [Rissanen 1989; Barron 1990; Yamanishi 1998; Grünwald 1998; Grünwald 1999]. The idea is to learn a function h that leads to good predictions of future data from the same source in the spirit of Vapnik’s [1998] statistical learning theory. Here prediction qua...

69 | Present position and potential developments: Some personal views, statistical theory, the prequential approach - Dawid - 1984 |

66 | Logically smooth density estimation - Barron - 1985 |
Citation Context: ...nspirational role in Rissanen’s development of MDL. Over the last fifteen years, several ‘idealized’ versions of MDL have been proposed, which are more directly based on Kolmogorov complexity theory [Barron 1985; Barron and Cover 1991; Li and Vitányi 1997; Vereshchagin and Vitányi 2002]. These are all based on two-part codes, where hypotheses are described using a universal programming language such as C ...

64 | Strong optimality of the normalized ml models as universal codes and information in data - Rissanen - 2001 |
Citation Context: ...ian and minimax optimal (NML) universal models, but there several other types. We mention prequential universal models (Section 2.6.4), the Kolmogorov universal model, conditionalized two-part codes [Rissanen 2001] and Cesaro-average codes [Barron, Rissanen, and Yu 1998]. 2.6 Simple Refined MDL and its Four Interpretations In Section 2.4, we indicated that ‘crude’ MDL needs to be refined. In Section 2.5 we int...

58 | Theory of Probability: A Critical Introductory Treatment. Vol.1 - de Finetti - 1970 |

57 | Statistical inference, Occam’s razor, and statistical mechanics on the space of probability distributions - Balasubramanian - 1997 |
Citation Context: ...e approximation of the integral in (2.12). For the special case that M is an exponential family, we obtain the following expression for the regret [Jeffreys 1961; Schwarz 1978; Kass and Raftery 1995; Balasubramanian 1997]: $-\log \bar{P}_{\mathrm{Bayes}}(x^n) - [-\log P(x^n \mid \hat\theta(x^n))] = \frac{k}{2}\log\frac{n}{2\pi} - \log w(\hat\theta) + \log\sqrt{|I(\hat\theta)|} + o(1)$. (2.23) Let us compare this with (2.21). Under the regularity conditions needed for (2.21), t...

56 | The consistency of the BIC Markov order estimator - Csiszár, Shields |

56 | Further experimental evidence against the utility of Occam's razor - Webb - 1996 |
Citation Context: ..., the MDL philosophy is quite agnostic about whether any of the models under consideration is ‘true’, or whether something like a ‘true distribution’ even exists. Nevertheless, it has been suggested [Webb 1996; Domingos 1999] that MDL embodies a naive belief that ‘simple models’ are ‘a priori more likely to be true’ than complex models. Below we explain why such claims are mistaken. 1.6 MDL and Occam’s Raz...

53 | MDL Denoising - Rissanen - 2000 |
Citation Context: ..., this NML distribution is not well-defined. We can get reasonable alternative universal models after all using any of the methods described in Section 2.7.2; see [Barron, Rissanen, and Yu 1998] and [Rissanen 2000] for details. ‘Non-probabilistic’ Regression and Classification In the approach we just described, we modeled the noise as being normally distributed. Alternatively, it has been tried to directly ...

50 | Prequential Analysis, Stochastic Complexity and Bayesian Inference - Dawid - 1992 |

50 | Algorithmic statistics - Gács, Tromp, et al. |

50 | A Decision-Theoretic Extension of Stochastic Complexity and Its Application to Learning - Yamanishi - 1998 |
Citation Context: ...ormally distributed. Alternatively, it has been tried to directly try to learn functions h ∈ H from the data, without making any probabilistic assumptions about the noise [Rissanen 1989; Barron 1990; Yamanishi 1998; Grünwald 1998; Grünwald 1999]. The idea is to learn a function h that leads to good predictions of future data from the same source in the spirit of Vapnik’s [1998] statistical learning theory. Here...

49 | Density estimation by stochastic complexity - Rissanen, Speed, et al. - 1992 |