## Hypothesis Selection and Testing by the MDL Principle (1999)

Venue: The Computer Journal

Citations: 57 (3 self)

### BibTeX

```bibtex
@ARTICLE{Rissanen98hypothesisselection,
  author  = {J. Rissanen},
  title   = {Hypothesis Selection and Testing by the MDL Principle},
  journal = {The Computer Journal},
  year    = {1999},
  volume  = {42},
  number  = {4},
  pages   = {260--269}
}
```


### Abstract

…ses where the variance is known or taken as a parameter.

1. INTRODUCTION. Although the term 'hypothesis' in statistics is synonymous with that of a probability 'model' as an explanation of data, hypothesis testing is not quite the same problem as model selection. This is because usually a particular hypothesis, called the 'null hypothesis', has already been selected as a favorite model, and it will be abandoned in favor of another model only when it clearly fails to explain the currently available data. In model selection, by contrast, all the models considered are regarded on the same footing, and the objective is simply to pick the one that best explains the data. For the Bayesians certain models may be favored in terms of a prior probability, but in the minimum description length (MDL) approach to be outlined below, prior knowledge of any kind is to be used in selecting the tentative models, which in the end, unlike in the Bayesians' case, can and will be fitted to data…

### Citations

8563 | Elements of Information Theory - Cover, Thomas - 1991

1682 | An Introduction to Kolmogorov Complexity and its Applications - Li, Vitányi - 1997

Citation Context: …We can fix the problem by a normalization process, which is quite similar to that in the algorithmic theory of complexity when the programs are required to satisfy the prefix property, see e.g. [8]:

$$\bar f(x^n) = \frac{f(x^n; \hat\theta(x^n))}{\int_{\hat\theta(y^n) \in \Omega} f(y^n; \hat\theta(y^n))\, dy^n}, \tag{1}$$

where $\Omega$ denotes a subset of the estimates that makes the integral finite. Moreover, that set ought to be small but easy to defin…
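The normalization in the excerpt above — dividing the maximized likelihood by its integral (a sum, for discrete data) over all sequences — can be made concrete for the simplest case. The following is my illustrative sketch for the Bernoulli model, not code from the paper:

```python
from math import comb

def nml_bernoulli(n):
    """NML probabilities for binary sequences of length n, keyed by the
    count k of ones (all sequences with the same k get the same probability)."""
    # Maximized (ML) likelihood of a sequence with k ones: (k/n)^k ((n-k)/n)^(n-k)
    def max_lik(k):
        p = k / n
        return p ** k * (1 - p) ** (n - k)   # Python's 0**0 == 1 handles k = 0, n
    # Normalizer: sum of maximized likelihoods over all 2^n sequences,
    # grouped by k since comb(n, k) sequences share each value
    C = sum(comb(n, k) * max_lik(k) for k in range(n + 1))
    return {k: max_lik(k) / C for k in range(n + 1)}

probs = nml_bernoulli(10)
# The NML probabilities sum to 1 over all 2^10 sequences
total = sum(comb(10, k) * p for k, p in probs.items())
```

The prefix-code analogy in the excerpt is exactly this: the raw maximized likelihoods over-count (they sum to more than 1), and dividing by the normalizer restores a valid probability distribution.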

1240 | Statistical decision theory and Bayesian analysis. Springer Series in Statistics - Berger - 1985

Citation Context: …le to many, even in principle. Nevertheless, one can evaluate such ratios for various extreme priors, and the resulting lower and upper bounds do give useful information about the available confidence [4], which is more than can be obtained with the traditional non-Bayesian means. The fact that the MDL principle reduces the hypotheses in the test to simple ones provides a different way to assess the co…

1160 | Modeling by shortest data description - Rissanen - 1978

Citation Context: …a code for the data designed with the resulting model. This was done in the special case of classification models by Wallace and Boulton [1] as early as 1968, and for general parametric model classes later by Rissanen [2]. While quite crude, such a construct was shown to be asymptotically optimal in a strong sense [3]. We give below better constructs, which have certain optimum properties even non-asymptotically. Thes…

498 | Stochastic Complexity - Rissanen - 1989

311 | An Information Measure for Classification - Wallace, Boulton - 1968

Citation Context: …ed maximum-likelihood estimates of the parameters, followed by a code for the data designed with the resulting model. This was done in the special case of classification models by Wallace and Boulton [1] as early as 1968, and for general parametric model classes later by Rissanen [2]. While quite crude, such a construct was shown to be asymptotically optimal in a strong sense [3]. We give below bette…

285 | Universal coding, information, prediction, and estimation - Rissanen - 1984 |

275 | Fisher information and stochastic complexity - Rissanen - 1996

Citation Context: …plexity, defined by the universal density function $\bar f(x^n)$ in whatever form it is written, represent a quite different philosophy for model selection than the one underlying the Bayesian methods. In [10] we evaluated the denominator $C_k(n)$ for 'smooth' model classes and obtained the following sharp formula for the ideal code length of the normalized maximum likelihood, whether or not $\hat\theta(x^n)$ is a su…
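The 'sharp formula' of [10] referred to in the excerpt is, for a smooth class with $k$ parameters, $\log C_k(n) \approx \frac{k}{2}\log\frac{n}{2\pi} + \log\int\sqrt{|I(\theta)|}\,d\theta$, with $I(\theta)$ the Fisher information. For the one-parameter Bernoulli family, $I(p) = 1/(p(1-p))$ and the integral evaluates to $\pi$, which allows a quick numerical sanity check against the exact normalizer. This is my sketch, not code from the paper:

```python
from math import comb, log, pi

def log_C_exact(n):
    # Exact Bernoulli NML normalizer: sum of maximized likelihoods
    # over all 2^n binary sequences, grouped by the count k of ones
    def max_lik(k):
        p = k / n
        return p ** k * (1 - p) ** (n - k)
    return log(sum(comb(n, k) * max_lik(k) for k in range(n + 1)))

def log_C_asymptotic(n):
    # (k/2) log(n / 2*pi) + log of the Fisher-information integral,
    # with k = 1 and the Bernoulli integral equal to pi
    return 0.5 * log(n / (2 * pi)) + log(pi)

# The two should agree up to an o(1) term; at n = 500 the gap is ~0.02 nats
gap = abs(log_C_exact(500) - log_C_asymptotic(500))
```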

249 | Stochastic complexity and modeling - Rissanen - 1986

Citation Context: …Wallace and Boulton [1] as early as 1968, and for general parametric model classes later by Rissanen [2]. While quite crude, such a construct was shown to be asymptotically optimal in a strong sense [3]. We give below better constructs, which have certain optimum properties even non-asymptotically. These also appear to provide unsurpassed results in practice. The same principle can also be applied t…

160 | A universal data compression system - Rissanen - 1983 |

107 | Information-theoretic asymptotics of Bayes methods - Clarke, Barron - 1990

Citation Context: …$^{1/2}\,d\theta$. At least for classes of independent processes that satisfy suitable smoothness conditions [16], the negative logarithm $-\log f_\pi(x^n)$ agrees with Equation (9). Other priors than Jeffreys' can also be used, which may have computational advantage, although then the model complexity will no long…

66 | Minimum description length induction, Bayesianism, and Kolmogorov complexity - Vitányi, Li

Citation Context: …used. For a fixed but large n we may then require the above equality to hold to within a 'small' constant. A more refined variant of the above ideas was discussed in a recent paper by Vitányi and Li [6], for the purpose of elucidating what the authors call an 'ideal' MDL principle and its relationship with a similar Bayesian principle, where a universal prior is taken to describe the optimal model…

50 | Prequential Analysis, Stochastic Complexity and Bayesian Inference - Dawid - 1992

Citation Context: …pick, except for a set of the parameters of measure zero. Moreover, this length is also asymptotically the shortest possible for all typical sequences generated by almost all the models in the class [11]. Because of such theorems the name 'stochastic complexity' for the code length in Equation (9) seems justified. The important terms, other than the first, which represent the code length needed for t…

45 | A strong version of the redundancy–capacity theorem of universal coding - Merhav, Feder

Citation Context: …e is nothing like the Jeffreys prior in that theory. The important connection between the lower bound on the code length mentioned above and the channel capacity has been revealed by Merhav and Feder [17], where, moreover, it was shown that the mixture densities also reach the lower bound except for parameters in a vanishing set. Another important technique, which requires few conditions and is hence…

27 | Occam's Two Razors: The Sharp and the Blunt - Domingos - 1998

Citation Context: …ize that the models are in this case parametric. A failure to understand the qualifications in the concept of 'shortest code length' has sometimes led to attempts to invalidate the MDL principle, see [7], by deliberately assigning a shorter codeword to the more complex of the two models compared, the complexity measured for instance by the number of parameters. This, of course, is possible, but only i…

10 | MDL estimation for small sample sizes and its application to linear regression - Dom - 1996

Citation Context: …n criterion. A full derivation can be found in [13]. The first derivation of the exact formula for the NML density function in this special case, with a different region of integration, was reported in [14]. EXAMPLE. Consider the set of normal distributions $f(y; \mu, \sigma)$ with variance $\sigma$ and the mean written as a linear combination of a variable number of regressor variables, thus $\mu = \beta_1 x_1 + \dots + \beta_k x_k$…

8 | Universal sequential coding of single messages - Shtarkov - 1987

Citation Context: …take it as all of $\Omega^k$. We have more to say about the selection of such sets later. Quite interestingly, this normalized maximum likelihood (NML) model solves the following minimax problem due to Shtarkov [9]:

$$\min_q \max_{x^n} \log \frac{f(x^n; \hat\theta(x^n))}{q(x^n)}. \tag{2}$$

In words, it is the unique density function whose ideal code length exceeds the ideal optimal code length $-\log f(x^n; \hat\theta(x^n))$ by the least amount…
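A property of the minimax problem in the excerpt worth making explicit: the NML density attains the minimax value with equality for every sequence — its log-regret against the maximized likelihood is the same constant, $\log C(n)$, no matter which data were observed. A small check for the Bernoulli model (my illustration, not code from the paper):

```python
from math import comb, log

n = 10

def max_lik(k):
    # Maximized Bernoulli likelihood of a length-n sequence with k ones
    p = k / n
    return p ** k * (1 - p) ** (n - k)

# NML normalizer over all 2^n binary sequences
C = sum(comb(n, k) * max_lik(k) for k in range(n + 1))

# Regret log(f(x^n; theta_hat) / q(x^n)) with q = NML:
# it should equal log C for every k, i.e. for every sequence
regrets = [log(max_lik(k)) - log(max_lik(k) / C) for k in range(n + 1)]
spread = max(regrets) - min(regrets)
```

That the regret is flat across all sequences is exactly why NML solves the minimax problem: any other density would lower the regret somewhere only by raising it elsewhere.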

7 | Learning about the parameter of the Bernoulli model - Vovk - 1997

Citation Context: …their value is in shedding light on the central issues involved, the latter development, in particular, demonstrating the limitations of both the ideal MDL principle and the Bayesian approach; see also [7]. To fix the notation, let $M_k = \{f(x^n; \theta)\}$ be a parametric class of densities as models, where $\theta = \theta_1, \dots, \theta_k$ is a parameter vector ranging over a subset $\Omega^k$ of the k-dimensional Euclidean space…

6 | The MDL principle in modeling and coding - Barron, Rissanen, et al. - 1998

Citation Context: …accurately as desired. In the following example we describe the problem and give the result, which for small data sets provides a superior model selection criterion. A full derivation can be found in [13]. The first derivation of the exact formula for the NML density function in this special case, with a different region of integration, was reported in [14]. EXAMPLE. Consider the set of normal distribut…


2 | MDL denoising. http://www.cs.tut.fi/rissanen/ (submitted to… - Rissanen - 1999

Citation Context: …e then assumed that these have been sorted by declining importance, for instance by a 'greedy' algorithm. Actually, a more complete calculation gives a criterion which makes such sorting unnecessary [15]. We conclude this section with a brief account of other means to construct universal representatives for model classes. One of these is the so-called Jeffreys' mixture

$$f_\pi(x^n) = \int_\Omega f(x^n \mid \theta)\, d\pi(\theta)\,\dots$$
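For the Bernoulli model, the Jeffreys' mixture mentioned in the excerpt has a closed form: the Jeffreys prior is Beta(1/2, 1/2), so the mixture probability of a sequence with k ones is $B(k+\tfrac12, n-k+\tfrac12)/B(\tfrac12,\tfrac12)$, and the same value arises as a product of Krichevsky–Trofimov sequential predictions $(k_t + \tfrac12)/(t+1)$. A sketch illustrating both routes (my illustration, not code from the paper):

```python
from math import gamma, pi

def beta(a, b):
    # Beta function via the Gamma function
    return gamma(a) * gamma(b) / gamma(a + b)

def jeffreys_mixture_prob(k, n):
    # Integral of p^k (1-p)^(n-k) under the Jeffreys prior Beta(1/2, 1/2):
    # B(k + 1/2, n - k + 1/2) / B(1/2, 1/2), where B(1/2, 1/2) = pi
    return beta(k + 0.5, n - k + 0.5) / pi

def kt_sequential_prob(bits):
    # Same mixture computed as a product of Krichevsky–Trofimov predictors:
    # after t symbols with `ones` ones seen, predict P(1) = (ones + 1/2)/(t + 1)
    prob, ones = 1.0, 0
    for t, b in enumerate(bits):
        p_one = (ones + 0.5) / (t + 1)
        prob *= p_one if b else (1 - p_one)
        ones += b
    return prob

seq = [1, 1, 0, 1, 0, 0, 0, 1]          # 4 ones out of 8 symbols
closed = jeffreys_mixture_prob(4, 8)
sequential = kt_sequential_prob(seq)
```

The agreement of the two computations reflects exchangeability: the mixture assigns the same probability to every ordering with the same count of ones, which is what lets the sequential predictor recover it exactly.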
