
## The minimum description length principle in coding and modeling (1998)


### Download Links

- [stuff.mit.edu]
- [www.1stworks.com]
- [web.mit.edu]
- DBLP

Venue: IEEE Trans. Inform. Theory

Citations: 387 (17 self)

### Citations

12158 | Elements of information theory - Cover, Thomas - 1991
Citation Context: ...t gives the path from the root to the leaf labeled Given a code tree let denote the length of the codeword (or path) that describes According to the theory of Shannon, Kraft, and McMillan, see, e.g., [11], there exists a uniquely decodable code with lengths for if and only if the Kraft inequality (1) holds. Indeed, to each code there corresponds a subprobability mass function For a complete tree, in w...
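The Kraft inequality mentioned in this context can be made concrete. A minimal Python sketch, assuming binary codewords; the function names and the canonical-code construction are illustrative, not taken from the cited works:

```python
def kraft_sum(lengths, alphabet_size=2):
    """Sum of D^(-l_i); a uniquely decodable code with these
    codeword lengths exists iff this sum is <= 1 (Kraft-McMillan)."""
    return sum(alphabet_size ** (-l) for l in lengths)

def build_prefix_code(lengths):
    """Canonical binary prefix code witnessing the Kraft inequality:
    assign codewords in order of increasing length."""
    assert kraft_sum(lengths) <= 1 + 1e-12
    code, next_val, prev_len = {}, 0, 0
    for i, l in enumerate(sorted(lengths)):
        next_val <<= (l - prev_len)          # extend to the new length
        code[i] = format(next_val, f'0{l}b')  # fixed-width binary string
        next_val += 1
        prev_len = l
    return code

lengths = [1, 2, 3, 3]
print(kraft_sum(lengths))           # 1.0 — the inequality is tight
print(build_prefix_code(lengths))   # {0: '0', 1: '10', 2: '110', 3: '111'}
```

No codeword produced this way is a prefix of another, which is exactly the property that makes the code uniquely decodable symbol by symbol.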

4121 | Estimating the dimension of a model - Schwarz - 1978
Citation Context: ...eaking, the mixture version of the MDL is an approximate penalized likelihood criterion just as the two-stage MDL, which asymptotically behaves as the Bayesian Information Criterion (BIC) of Schwarz [49]. For large, in probability or almost surely From classical parametric estimation theory for regular families, such as nested exponential families, we have the following asymptotic expansion: This giv...
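The BIC penalty the context refers to is simply a maximized log-likelihood minus a dimension charge. A toy sketch under the usual convention BIC = -2·(max log-likelihood) + k·log(n); the Gaussian data and the two nested models are illustrative choices, not from the paper:

```python
import math

def gaussian_loglik(data, mu, sigma2):
    """Log-likelihood of i.i.d. N(mu, sigma2) observations."""
    n = len(data)
    return -0.5 * n * math.log(2 * math.pi * sigma2) \
           - sum((x - mu) ** 2 for x in data) / (2 * sigma2)

def bic(loglik, k, n):
    """Schwarz's criterion: smaller is better."""
    return -2.0 * loglik + k * math.log(n)

data = [0.1, -0.2, 0.05, 0.3, -0.1, 0.15, -0.25, 0.2]
n = len(data)
# Model 0: mean fixed at zero, one free parameter (the variance).
s2_0 = sum(x ** 2 for x in data) / n
bic0 = bic(gaussian_loglik(data, 0.0, s2_0), 1, n)
# Model 1: free mean and variance, two parameters.
mu_hat = sum(data) / n
s2_hat = sum((x - mu_hat) ** 2 for x in data) / n
bic1 = bic(gaussian_loglik(data, mu_hat, s2_hat), 2, n)
print(bic0 < bic1)  # True: the extra parameter does not pay for itself here
```

Because the sample mean is close to zero, the likelihood gain from fitting it is smaller than the log(n) charge for the extra parameter, so the simpler model wins, which is the penalized-likelihood behavior the context describes.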

2779 | Information theory and an extension of the maximum likelihood principle - Akaike - 1973
Citation Context: ...ugh subsets but we must also be able to compare the performance of subsets of different sizes. Traditionally, the selection is done by hypothesis testing or by a variety of criteria such as AIC, BIC, [1], [49], and cross-validation. They are approximations to prediction errors or to the Bayes model selection criterion, but they are not derived from any principle outside the model selection problem itself...

1523 | Modeling by shortest data description - Rissanen - 1978
Citation Context: ...ta. There are a number of ways to construct representatives of model classes or, what is sufficient, to compute their codelength. The first and the crudest of them, Wallace and Boulton [52], Rissanen [38], is to encode the data with a (parametric) model defined by the maximum-likelihood estimates, quantized optimally to a finite precision, and then encode the estimates by a prefix code. For a reader w...
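The crude two-part scheme this context describes can be sketched for a Bernoulli model: first pay for a quantized parameter estimate, then for the data under that parameter. The grid size, data, and function name are illustrative assumptions, not the paper's construction:

```python
import math

def two_part_codelength(bits, grid_size=16):
    """Total bits: describe theta on a uniform grid (log2(grid_size) bits),
    then describe the data with the ideal code -log2 P(data | theta).
    Returns the length under the best grid point."""
    n, k = len(bits), sum(bits)
    best = float('inf')
    for j in range(1, grid_size):            # skip theta = 0 and theta = 1
        theta = j / grid_size
        data_bits = -(k * math.log2(theta) + (n - k) * math.log2(1 - theta))
        best = min(best, math.log2(grid_size) + data_bits)
    return best

bits = [1, 0, 1, 1, 0, 1, 1, 1]              # 6 ones out of 8
print(two_part_codelength(bits))             # about 10.49 bits
```

With 6 ones in 8 symbols the maximum-likelihood estimate 0.75 lies on the grid, so the data cost is about 6.49 bits plus 4 bits for the parameter — less than the 12 bits of sending 8 raw bits plus the parameter, illustrating why compression measures fit.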

910 | Stochastic Processes - Doob - 1953
Citation Context: ... for where Because all the models in the summation are singular relative to must be mutually singular with It follows that the log-likelihood ratio or redundancy tends almost surely to infinity, Doob [20]. We find that with probability one, for large and may be called the nonparametric complexity per sample at sample size B. Consistency of the MDL Order Estimates A test for any model selection and est...

697 | Three approaches to the quantitative definition of information - Kolmogorov - 1965
Citation Context: ...y models for the data, where the goodness can be measured in terms of codelength. Such a view of statistics also conforms nicely with the theory of algorithmic complexity, Solomonoff [47], Kolmogorov [34], and can draw on its startling finding about the ultimate limitation on all statistical inference, namely, that there is no “mechanical,” i.e., algorithmic, way to find the “best” model of data among...

517 | A formal theory of inductive inference - Solomonoff - 1964
Citation Context: ...r good probability models for the data, where the goodness can be measured in terms of codelength. Such a view of statistics also conforms nicely with the theory of algorithmic complexity, Solomonoff [47], Kolmogorov [34], and can draw on its startling finding about the ultimate limitation on all statistical inference, namely, that there is no “mechanical,” i.e., algorithmic, way to find the “best” mo...

445 | The determination of the order of an autoregression - Hannan, Quinn - 1979
Citation Context: ... identify the correct model with probability tending to one as the sample size gets large. Consistency results for the predictive MDL principle can be found in [15], [17], and [32] for regression models, [23] and [31] for time-series models, and [53] for stochastic regression models. For exponential families, [27] gives a consistency result for BIC. Predictive, two-stage, and mixture forms of the MDL pri...

406 | On the mathematical foundations of theoretical statistics - Fisher - 1922
Citation Context: ...ate in a direct way some of the most fundamental, albeit elusive, ideas the founding fathers of statistical inference have been groping for, such as that the objective of statistics is to reduce data, Fisher [21], and that “we must not overfit data by too complex models.” Perhaps a statistician can take solace in the fact that, by the fundamental Kraft inequality, stated below, a codelength is just another wa...

372 | An information measure for classification - Wallace, Boulton - 1968
Citation Context: ...oding of the data. There are a number of ways to construct representatives of model classes or, what is sufficient, to compute their codelength. The first and the crudest of them, Wallace and Boulton [52], Rissanen [38], is to encode the data with a (parametric) model defined by the maximum-likelihood estimates, quantized optimally to a finite precision, and then encode the estimates by a prefix code...

243 | Minimum complexity density estimation - Barron, Cover - 1991
Citation Context: ...amilies of i.i.d., or Markov, distributions, where the parameter spaces are of different dimensions and absolutely continuous prior densities are assigned to each dimension. The proof is simple, [2], [5]. Let be the mixture of except for where Because all the models in the summation are singular relative to must be mutually singular with It follows that the log-likelihood ratio or redundancy tends al...

238 | A universal data compression system - Rissanen - 1983
Citation Context: ...hence the connection between the code redundancy and the number of bits needed to transmit the codebook was not made explicit until the emergence of a universal code based on context models, Rissanen [40]. We discuss briefly this latter type of universal codes. Fig. 1. Context tree for string 00100. An obvious way to design a universal code for data modeled by a finite-state machine is to estimate the...

226 | The context-tree weighting method: basic properties - Willems, Shtarkov, et al. - 1995
Citation Context: ... by the ideal codelength of the universal Context algorithm. The same of course is true of the class of Tree Machines. We conclude this subsection by mentioning another universal code (Willems et al. [55]), where no Tree Machine needs to be found. Instead, by an algorithm one can compute the weighted sum over all complete subtrees of of the probabilities assigned to by the leaves of the subtrees. When...

208 | A Course in Density Estimation - Devroye - 1987
Citation Context: ...o be developed, and NML’s connection to the mixture distributions in this context is yet to be explored. Fano’s inequality from Information Theory has always been used to derive the lower bounds [8], [18], [25], [26], [58]. MDL-based density estimators now provide refinement to the lower bound and a matching upper bound as shown in Yang and Barron [57], revealing a Kolmogorov capacity characterization...

170 | Theory of Probability, 3rd ed. - Jeffreys - 1961
Citation Context: ...n the special case of the class of all discrete memoryless (i.i.d.) sources on a given finite alphabet in Xie and Barron [56]. Here we show it holds more generally. The key property of Jeffreys prior [33] for statistics and information theory is that it is the locally invariant measure that makes small Kullback–Leibler balls have equal prior probability (see Hartigan [24, pp. 48–49]). We next study th...

169 | Universal sequential coding of single messages - Shtarkov - 1987
Citation Context: ...of we do not know which Shannon tree to decode. If we code data using a distribution , the excess codelength, sometimes called regret, over the target value is which has the worst case value Shtarkov [46] posed the problem of choosing to minimize the worst case regret, and he found the unique solution to be given by the maximized likelihood, normalized thus (2) maximized likelihood term by the additio...
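Shtarkov's normalized maximum likelihood can be computed exactly for small binary strings: NML(x) is the maximized likelihood of x divided by a normalizer that sums the maximized likelihood over all strings of that length, and the log of that normalizer is the worst-case minimax regret. The tiny n and function names below are illustrative assumptions for tractability:

```python
import math
from itertools import product

def max_likelihood(bits):
    """P(bits | theta_hat) with theta_hat the ML Bernoulli estimate k/n."""
    n, k = len(bits), sum(bits)
    theta = k / n
    return (theta ** k) * ((1 - theta) ** (n - k))  # 0**0 == 1 handles k=0,n

def nml_normalizer(n):
    """Shtarkov sum C_n over all 2^n binary strings of length n."""
    return sum(max_likelihood(x) for x in product((0, 1), repeat=n))

n = 4
C = nml_normalizer(n)
print(C)              # 3.21875 for n = 4
print(math.log2(C))   # worst-case regret in bits, about 1.69
```

Every string then gets codelength -log2(max_likelihood(x)) + log2(C), so the regret over the maximized likelihood is the same constant log2(C) for all strings — exactly the minimax property the context attributes to [46].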

139 | Information-theoretic asymptotics of Bayes methods - Clarke, Barron - 1990

132 | Universal modeling and coding - Rissanen, Langdon - 1981
Citation Context: ...data, including their number and the associated structure, and then use the result to encode the data. A particularly convenient way to do it is by an Arithmetic Code, see, e.g., Rissanen and Langdon [39], which is capable of encoding the individual symbols, even if they are binary, without the need to block them as required in conventional Huffman codes. However, a direct execution of this progra...

112 | Universal Noiseless Coding - Davisson - 1973
Citation Context: ...ual to Shannon’s mutual information when the Bayes optimal code is used. Consequently, the maximin average redundancy is which is recognized as the Shannon information capacity of the class (Davisson [12]). The minimax value and the maximin value of the redundancy (i.e., the capacity) are equal (see Davisson et al. [14] and Haussler [28]). In subsequent sections we will have more to say about the Kull...

94 | Approximation dans les espaces métriques et théorie de l’estimation - Birgé - 1983
Citation Context: ...yet to be developed, and NML’s connection to the mixture distributions in this context is yet to be explored. Fano’s inequality from Information Theory has always been used to derive the lower bounds [8], [18], [25], [26], [58]. MDL-based density estimators now provide refinement to the lower bound and a matching upper bound as shown in Yang and Barron [57], revealing a Kolmogorov capacity characteri...

94 | A universal finite memory source - Weinberger, Rissanen, et al. - 1995
Citation Context: ...be approximated with an arithmetic code as well as desired. Collectively, all the special “encoding” nodes carve out from the tree a complete subtree , which defines a Tree Machine (Weinberger et al. [54]). If the data are generated by some Tree Machine in a class large enough to include the set of Markov chains as special Tree Machines, then with a somewhat more elaborated rule for selecting the enco...

89 | Present position and potential developments: Some personal views, statistical theory, the prequential approach - Dawid - 1984
Citation Context: ...of coding or estimation to the underlying source. Moreover, it has an intimate connection with the prequential approach to statistical inference as advocated by Dawid [15], [16]. Let be built on the plug-in predictive distribution based on an estimator , which often is a suitably modified maximum-likelihood estimator to avoid singular probabilities Consider first a fam...
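The prequential plug-in idea in this context can be sketched for a Bernoulli source: encode each symbol with a predictive probability computed from past symbols only, using a smoothed estimator so no symbol ever gets probability zero. The Laplace rule and the toy data are illustrative assumptions, not Dawid's own construction:

```python
import math

def prequential_codelength(bits):
    """Total ideal codelength (bits) when each symbol is coded with the
    plug-in predictive probability from the past: a Laplace-smoothed
    Bernoulli estimate, which avoids singular (zero) probabilities."""
    total, ones = 0.0, 0
    for t, x in enumerate(bits):
        p_one = (ones + 1) / (t + 2)      # rule of succession on t past symbols
        p = p_one if x == 1 else 1 - p_one
        total += -math.log2(p)            # ideal codelength of this symbol
        ones += x
    return total

bits = [0, 1, 0, 0, 1, 0, 0, 0]
print(prequential_codelength(bits))       # about 7.98 bits for 8 symbols
```

For this estimator the product of the predictive probabilities telescopes to k!(n-k)!/(n+1)!, so the total codelength matches the mixture code under a uniform prior — one concrete sense in which predictive and mixture forms of MDL agree.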

88 | On some asymptotic properties of maximum likelihood estimates and related Bayes estimates - Le Cam - 1953
Citation Context: ... of order and the optimal cumulative risk or redundancy of order , we mention here that classic results on negligibility of the set of parameter values for which an estimator is superefficient (Le Cam [35] assuming bounded loss) are extended in Barron and Hengartner [6] to the Kullback–Leibler loss using results of Rissanen [42] on the negligibility of the set of parameter values with coding redundancy...

75 | Asymptotic Minimax Regret for Data Compression - Xie, Barron - 2000
Citation Context: ... Thus optimization over to yield a minimax and maximin procedure is equivalent to choosing a mixture closest to the normalized maximum likelihood in the sense of Kullback–Leibler divergence (see also [56]). Moreover, this divergence represents the gap between the minimax value of the regret and the minimax value of the expected regret. When the gap is small, optimization of the worst case value of the...

74 | Logically smooth density estimation - Barron - 1985
Citation Context: ...ter values with has volume equal to zero, and [6] shows how this conclusion may be used to strengthen classical statistical results on the negligibility of superefficient parameter estimation. Barron [2] obtained strong pointwise lower bounds that hold for almost every sequence and almost every Related almost sure results appear in Dawid [17]. Here we show that the technique in [2] yields asymptotic...

68 | Bayes Theory - Hartigan - 1983

57 | The estimation of the order of an ARMA process - Hannan - 1980
Citation Context: ...MDL principle that the model-selection criteria derived from it are consistent although there are obviously other ways to devise directly consistent model-selection criteria, see, for example, Hannan [22] and Merhav and Ziv [36]. The second inequality holds, because the sum is larger than the maximum of the summands. Thus the minimizing distribution is the distribution from the correct model class as...

54 | A strong version of the redundancy-capacity theorem of universal coding - Merhav, Feder - 1995
Citation Context: ...e boundary of has zero volume. If is any probability distribution for , then (Rissanen [56]) for each positive number and for all , except in a set whose volume goes to zero as Later Merhav and Feder [37] gave similar conclusions bounding the measure of the set of models for which the redundancy is a specified amount less than a target value. They use the minimax redundancy for the target value withou...

53 | Density estimation by stochastic complexity - Rissanen, Speed, et al. - 1992
Citation Context: ...ass of functions estimated at a better rate have a cover of asymptotically negligible size in comparison to This is shown in Barron and Hengartner [6], extending the arguments of Rissanen [42] and in [45], and can also be shown by the methods of Merhav and Feder [37]. In the case of a Lipschitz or Sobolev class of functions on a bounded set, with the order of smoothness, and several other function cla...

52 | Mutual information, metric entropy and cumulative relative entropy risk - Haussler, Opper - 1997
Citation Context: ...is taken over all joint probability densities on (which provide codes for in ). For a recent treatment of asymptotics and metric entropy characterization of the latter quantity see Haussler and Opper [30]. Here, following Yang and Barron [57], we focus on the relationship of the minimax risk to the nonparametric complexity and the Kolmogorov metric entropy as revealed through the resolvability and an...

50 | On predictive least squares principles - Wei - 1992
Citation Context: ...y tending to one as the sample size gets large. Consistency results for the predictive MDL principle can be found in [15], [17], and [32] for regression models, [23] and [31] for time-series models, and [53] for stochastic regression models. For exponential families, [27] gives a consistency result for BIC. Predictive, two-stage, and mixture forms of the MDL principle are studied and compared in [48] in...

44 | A general minimax result for relative entropy - Haussler - 1997
Citation Context: ...nized as the Shannon information capacity of the class (Davisson [12]). The minimax value and the maximin value of the redundancy (i.e., the capacity) are equal (see Davisson et al. [14] and Haussler [28]). In subsequent sections we will have more to say about the Kullback–Leibler divergence , including interpretations in coding and prediction, its asymptotics, and useful finite sample bounds. Both of...

40 | A source matching approach to finding minimax codes - Davisson, Leon-Garcia - 1980
Citation Context: ... is which is recognized as the Shannon information capacity of the class (Davisson [12]). The minimax value and the maximin value of the redundancy (i.e., the capacity) are equal (see Davisson et al. [14] and Haussler [28]). In subsequent sections we will have more to say about the Kullback–Leibler divergence , including interpretations in coding and prediction, its asymptotics, and useful finite samp...

39 | A lower bound on the risks of nonparametric estimates of densities - Hasminskii - 1978
Citation Context: ...eveloped, and NML’s connection to the mixture distributions in this context is yet to be explored. Fano’s inequality from Information Theory has always been used to derive the lower bounds [8], [18], [25], [26], [58]. MDL-based density estimators now provide refinement to the lower bound and a matching upper bound as shown in Yang and Barron [57], revealing a Kolmogorov capacity characterization of th...

21 | On density estimation in the view of Kolmogorov’s ideas in approximation theory - Hasminskii, Ibragimov - 1990
Citation Context: ...ed, and NML’s connection to the mixture distributions in this context is yet to be explored. Fano’s inequality from Information Theory has always been used to derive the lower bounds [8], [18], [25], [26], [58]. MDL-based density estimators now provide refinement to the lower bound and a matching upper bound as shown in Yang and Barron [57], revealing a Kolmogorov capacity characterization of the mini...

19 | Fisher information and stochastic complexity - Rissanen - 1996
Citation Context: ...ause this additional cost arises due to the unknown parameter, we call it the parametric complexity. Also in support of this terminology we note that other coding schemes, such as two-part codes as in [43] (which first describe parameter estimates to an optimal precision and then the data conditional on the parameter estimates), achieve a similar complexity term expressed in terms of the length of the...

18 | How well do Bayes methods work for on-line prediction of {+1, -1} values? - Haussler, Barron - 1992
Citation Context: ...of the sample, and is taking the expected value of as a function of the sample from Moreover, we are interested in the minimax nonparametric complexity (redundancy). As shown in [3] (see also [4], [6], [29], and [57]), this quantity provides for the mixture code an upper bound to the expected redundancy per sample and thereby it also provides an upper bound to the Cesaro average of the Kullback–Leibler...

18 | Model selection and prediction: normal regression - Speed, Yu - 1993
Citation Context: ...and [53] for stochastic regression models. For exponential families, [27] gives a consistency result for BIC. Predictive, two-stage, and mixture forms of the MDL principle are studied and compared in [48] in terms of misfit probabilities and in two prediction frameworks for the regression model. It is worth noting that searching through all the subsets to find codelengths can be a nontrivial task on i...

13 | Estimating the number of states of a finite-state source - Ziv, Merhav - 1992
Citation Context: ...odel-selection criteria derived from it are consistent although there are obviously other ways to devise directly consistent model-selection criteria, see, for example, Hannan [22] and Merhav and Ziv [36]. The second inequality holds, because the sum is larger than the maximum of the summands. Thus the minimizing distribution is the distribution from the correct model class as tends to infinity and un...

12 | Universal coding, information, prediction, and estimation - Rissanen - 1984
Citation Context: ... criteria, be they in terms of probability of errors or some distance measure such as the absolute or squared errors, can be expressed in terms of codelength, and there is no conflict between the two [41], [44]. According to this program, the problems of modeling and inference, then, are not to estimate any “true” data generating distribution with which to do inference, but to search for good probabil...

11 | MDL estimation for small sample sizes and its application to linear regression - Dom - 1996
Citation Context: ... be no larger than Put We then have the NML density function itself (32) (33) (34) The numerator in (23) has a very simple form (25) and the problem is to evaluate the integral in the denominator. In [19], Dom evaluated such an integral in a domain that also restricts the range of the estimates to a hypercube. He did the evaluation in a direct manner using a coordinate transformation with its Jacobian...

10 | Jeffreys’ prior is asymptotically least favorable under entropy risk - Clarke, Barron - 1994

10 | Asymptotically minimax regret for exponential families - Takeuchi, Barron - 1997
Citation Context: ... close to , it is possible to obtain codelengths under suitable conditions such that uniformly in they do not exceed the minimax regret asymptotically. See Xie and Barron [56] and Takeuchi and Barron [50]. E. Strong Optimality of Stochastic Complexity We have seen that the solutions to the two minimax optimization problems behave in a similar manner, and the expressions (9) and (10) for the asymptotic...

9 | Information theory and superefficiency - Barron, Hengartner - 1998
Citation Context: ...dent of the sample, and is taking the expected value of as a function of the sample from Moreover, we are interested in the minimax nonparametric complexity (redundancy). As shown in [3] (see also [4], [6], [29], and [57]), this quantity provides for the mixture code an upper bound to the expected redundancy per sample and thereby it also provides an upper bound to the Cesaro average of the Kullback–L...

9 | Asymptotically optimal function estimation by minimum complexity criteria - Barron, Yang, et al. - 1994
Citation Context: ...ce of to is bounded by the index of resolvability, in the sense that in probability, where (13) is the squared Hellinger norm between distributions with densities and These bounds are used in [5] and [7] to derive convergence rates in nonparametric settings with the use of sequences of parametric models of size selected by MDL...

9 | Information theoretic determination of minimax rates of convergence - Yang, Barron - 1999
Citation Context: ...ple, and is taking the expected value of as a function of the sample from Moreover, we are interested in the minimax nonparametric complexity (redundancy). As shown in [3] (see also [4], [6], [29], and [57]), this quantity provides for the mixture code an upper bound to the expected redundancy per sample and thereby it also provides an upper bound to the Cesaro average of the Kullback–Leibler risk of t...

9 | Data compression and histograms - Yu, Speed - 1992

8 | Size of the error in the choice of a model to fit data from an exponential family - Haughton - 1989
Citation Context: ...or the predictive MDL principle can be found in [15], [17], and [32] for regression models, [23] and [31] for time-series models, and [53] for stochastic regression models. For exponential families, [27] gives a consistency result for BIC. Predictive, two-stage, and mixture forms of the MDL principle are studied and compared in [48] in terms of misfit probabilities and in two prediction frameworks fo...

6 | Minimax noiseless universal coding for Markov sources - Davisson - 1983

3 | Are Bayes rules consistent in information? - Barron - 1987
Citation Context: ...espect to , independent of the sample, and is taking the expected value of as a function of the sample from Moreover, we are interested in the minimax nonparametric complexity (redundancy). As shown in [3] (see also [4], [6], [29], and [57]), this quantity provides for the mixture code an upper bound to the expected redundancy per sample and thereby it also provides an upper bound to the Cesaro averag...

3 | Contributions to information theory for abstract alphabets - Tulcea - 1960
Citation Context: ...ith a uniformity over all inside the probability. That is, remains not greater than , provided that the sequences of distributions for remain compatible as is increased (Barron [2, p. 28], Tulcea [51]). Consequently, setting is the probability function for the statistic As a consequence of the factorization, the maximum-likelihood estimate may be regarded as a function of the sufficient statistic...

2 | Prequential analysis, stochastic complexity and Bayesian inference - Dawid - 1991
Citation Context: ...of coding or estimation to the underlying source. Moreover, it has an intimate connection with the prequential approach to statistical inference as advocated by Dawid [15], [16]. Let be built on the plug-in predictive distribution based on an estimator , which often is a suitably modified maximum-likelihood estimator to avoid singular probabilities Consider first a family of...

1 | Prequential data analysis - Dawid - 1992
Citation Context: ...egligibility of superefficient parameter estimation. Barron [2] obtained strong pointwise lower bounds that hold for almost every sequence and almost every Related almost sure results appear in Dawid [17]. Here we show that the technique in [2] yields asymptotic pointwise regret lower bounds for general codes that coincide (to within a small amount) with the asymptotic minimax regret including the con...

1 | Strong consistency of the predictive least squares criterion for order determination of autoregressive processes - Hemerly, Davis - 1989
Citation Context: ...the correct model with probability tending to one as the sample size gets large. Consistency results for the predictive MDL principle can be found in [15], [17], and [32] for regression models, [23] and [31] for time-series models, and [53] for stochastic regression models. For exponential families, [27] gives a consistency result for BIC. Predictive, two-stage, and mixture forms of the MDL principle are...

1 | Model selection and forward validation - Hjorth - 1982
Citation Context: ... with the mixture, they also identify the correct model with probability tending to one as the sample size gets large. Consistency results for the predictive MDL principle can be found in [15], [17], and [32] for regression models, [23] and [31] for time-series models, and [53] for stochastic regression models. For exponential families, [27] gives a consistency result for BIC. Predictive, two-stage, and...

1 | Hypothesis selection and testing by the MDL principle - Rissanen - 1997
Citation Context: ...ria, be they in terms of probability of errors or some distance measure such as the absolute or squared errors, can be expressed in terms of codelength, and there is no conflict between the two [41], [44]. According to this program, the problems of modeling and inference, then, are not to estimate any “true” data generating distribution with which to do inference, but to search for good probability mo...

1 | Assouad, Fano, and Le Cam (Festschrift in Honor of L. Le Cam) - Yu - 1997