Results 1  10
of
46
The minimum description length principle in coding and modeling
 IEEE TRANS. INFORM. THEORY
, 1998
"... We review the principles of Minimum Description Length and Stochastic Complexity as used in data compression and statistical modeling. Stochastic complexity is formulated as the solution to optimum universal coding problems extending Shannon’s basic source coding theorem. The normalized maximized ..."
Abstract

Cited by 315 (12 self)
 Add to MetaCart
We review the principles of Minimum Description Length and Stochastic Complexity as used in data compression and statistical modeling. Stochastic complexity is formulated as the solution to optimum universal coding problems extending Shannon’s basic source coding theorem. The normalized maximized likelihood, mixture, and predictive codings are each shown to achieve the stochastic complexity to within asymptotically vanishing terms. We assess the performance of the minimum description length criterion both from the vantage point of quality of data compression and accuracy of statistical inference. Context tree modeling, density estimation, and model selection in Gaussian linear regression serve as examples.
Model Selection and the Principle of Minimum Description Length
 Journal of the American Statistical Association
, 1998
"... This paper reviews the principle of Minimum Description Length (MDL) for problems of model selection. By viewing statistical modeling as a means of generating descriptions of observed data, the MDL framework discriminates between competing models based on the complexity of each description. This ..."
Abstract

Cited by 156 (5 self)
 Add to MetaCart
This paper reviews the principle of Minimum Description Length (MDL) for problems of model selection. By viewing statistical modeling as a means of generating descriptions of observed data, the MDL framework discriminates between competing models based on the complexity of each description. This approach began with Kolmogorov's theory of algorithmic complexity, matured in the literature on information theory, and has recently received renewed interest within the statistics community. In the pages that follow, we review both the practical as well as the theoretical aspects of MDL as a tool for model selection, emphasizing the rich connections between information theory and statistics. At the boundary between these two disciplines, we find many interesting interpretations of popular frequentist and Bayesian procedures. As we will see, MDL provides an objective umbrella under which rather disparate approaches to statistical modeling can coexist and be compared. We illustrate th...
The Gaussian Watermarking Game
, 2000
"... Watermarking models a copyright protection mechanism where an original source sequence or "covertext" is modified before distribution to the public in order to embed some extra information. The embedding should be transparent (i.e., the modified data sequence or "stegotext" shoul ..."
Abstract

Cited by 112 (9 self)
 Add to MetaCart
Watermarking models a copyright protection mechanism where an original source sequence or "covertext" is modified before distribution to the public in order to embed some extra information. The embedding should be transparent (i.e., the modified data sequence or "stegotext" should be similar to the covertext) and robust (i.e., the extra information should be recoverable even if the stegotext is modified further, possibly by a malicious "attacker"). We compute the coding capacity of the watermarking game for a Gaussian covertext and squarederror distortions. Both the public version of the game (covertext known to neither attacker nor decoder) and the private version of the game (covertext unknown to attacker but known to decoder) are treated. While the capacity of the former cannot, of course, exceed the capacity of the latter, we show that the two are, in fact, identical. These capacities depend critically on whether the distortion constraints are required to be met in expectation or with probability one. In the former case the coding capacity is zero, whereas in the latter it coincides with the value of related zerosum dynamic mutual informations games of complete and perfect information. # Parts of this work were presented at the 2000 Conference on Information Sciences and Systems (CISS '00), Princeton University, Princeton, NJ, March 1517, 2000, and at the 2000 IEEE International Symposium on Information Theory (ISIT '00), Sorrento, Italy, June 2530, 2000.
Capacity bounds via duality with applications to multipleantenna systems on flatfading channels
 IEEE Trans. Inform. Theory
, 2003
"... A general technique is proposed for the derivation of upper bounds on channel capacity. The technique is based on a dual expression for channel capacity where the maximization (of mutual information) over distributions on the channel input alphabet is replaced with a minimization (of average relativ ..."
Abstract

Cited by 103 (36 self)
 Add to MetaCart
A general technique is proposed for the derivation of upper bounds on channel capacity. The technique is based on a dual expression for channel capacity where the maximization (of mutual information) over distributions on the channel input alphabet is replaced with a minimization (of average relative entropy) over distributions on the channel output alphabet. Every choice of an output distribution — even if not the channel image of some input distribution — leads to an upper bound on mutual information. The proposed approach is used in order to study multiantenna flat fading channels with memory where the realization of the fading process is unknown at the transmitter and unknown (or only partially known) at the receiver. It is demonstrated that, for high signaltonoise ratio (SNR), the capacity of such channels typically grows only doublelogarithmically in the SNR. This is in stark contrast to the case with perfect receiver side information where capacity grows logarithmically in the SNR. To better understand this phenomenon
Game Theory, Maximum Entropy, Minimum Discrepancy And Robust Bayesian Decision Theory
 ANNALS OF STATISTICS
, 2004
"... ..."
Minimum Description Length Induction, Bayesianism, and Kolmogorov Complexity
 IEEE Transactions on Information Theory
, 1998
"... The relationship between the Bayesian approach and the minimum description length approach is established. We sharpen and clarify the general modeling principles MDL and MML, abstracted as the ideal MDL principle and defined from Bayes's rule by means of Kolmogorov complexity. The basic conditi ..."
Abstract

Cited by 70 (7 self)
 Add to MetaCart
The relationship between the Bayesian approach and the minimum description length approach is established. We sharpen and clarify the general modeling principles MDL and MML, abstracted as the ideal MDL principle and defined from Bayes's rule by means of Kolmogorov complexity. The basic condition under which the ideal principle should be applied is encapsulated as the Fundamental Inequality, which in broad terms states that the principle is valid when the data are random, relative to every contemplated hypothesis and also these hypotheses are random relative to the (universal) prior. Basically, the ideal principle states that the prior probability associated with the hypothesis should be given by the algorithmic universal probability, and the sum of the log universal probability of the model plus the log of the probability of the data given the model should be minimized. If we restrict the model class to the finite sets then application of the ideal principle turns into Kolmogorov's mi...
Strong Optimality of the Normalized ML Models as Universal Codes
 IEEE Transactions on Information Theory
, 2000
"... We show that the normalized maximum likelihood (NML) distribution as a universal code for a parametric class of models is closest to the negative logarithm of the maximized likelihood in the mean code length distance, where the mean is taken with respect to the worst case model inside or outside ..."
Abstract

Cited by 64 (8 self)
 Add to MetaCart
We show that the normalized maximum likelihood (NML) distribution as a universal code for a parametric class of models is closest to the negative logarithm of the maximized likelihood in the mean code length distance, where the mean is taken with respect to the worst case model inside or outside the parametric class. We strengthen this result by showing that the same minmax bound results even when the data generating models are restricted to be most `benevolent' in minimizing the mean of the negative logarithm of the maximized likelihood. Further, we show for the class of exponential models that the bound cannot be beaten in essence by any code except when the mean is taken with respect to the most benevolent data generating models in a set of vanishing size. These results allow us to decompose the data into two parts, the first having all the useful information that can be extracted with the parametric models and the rest which has none. We also show that, if we change Ak...
On prediction using variable order Markov models
 JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH
, 2004
"... This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Cont ..."
Abstract

Cited by 61 (1 self)
 Add to MetaCart
This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Context Tree Weighting (CTW), Prediction by Partial Match (PPM) and Probabilistic Suffix Trees (PSTs). We discuss the properties of these algorithms and compare their performance using real life sequences from three domains: proteins, English text and music pieces. The comparison is made with respect to prediction quality as measured by the average logloss. We also compare classification algorithms based on these predictors with respect to a number of large protein classification tasks. Our results indicate that a “decomposed” CTW (a variant of the CTW algorithm) and PPM outperform all other algorithms in sequence prediction tasks. Somewhat surprisingly, a different algorithm, which is a modification of the LempelZiv compression algorithm, significantly outperforms all algorithms on the protein classification problems.
Hypothesis Selection and Testing by the MDL Principle
 The Computer Journal
, 1998
"... ses where the variance is known or taken as a parameter. 1. INTRODUCTION Although the term `hypothesis' in statistics is synonymous with that of a probability `model' as an explanation of data, hypothesis testing is not quite the same problem as model selection. This is because usually ..."
Abstract

Cited by 61 (3 self)
 Add to MetaCart
ses where the variance is known or taken as a parameter. 1. INTRODUCTION Although the term `hypothesis' in statistics is synonymous with that of a probability `model' as an explanation of data, hypothesis testing is not quite the same problem as model selection. This is because usually a particular hypothesis, called the `null hypothesis', has already been selected as a favorite model and it will be abandoned in favor of another model only when it clearly fails to explain the currently available data. In model selection, by contrast, all the models considered are regarded on the same footing and the objective is simply to pick the one that best explains the data. For the Bayesians certain models may be favored in terms of a prior probability, but in the minimum description length (MDL) approach to be outlined below, prior knowledge of any kind is to be used in selecting the tentative models, which in the end, unlike in the Bayesians' case, can and will be fitted to data
Mutual Information, Metric Entropy, and Cumulative Relative Entropy Risk
 Annals of Statistics
, 1996
"... Assume fP ` : ` 2 \Thetag is a set of probability distributions with a common dominating measure on a complete separable metric space Y . A state ` 2 \Theta is chosen by Nature. A statistician gets n independent observations Y 1 ; : : : ; Y n from Y distributed according to P ` . For each time ..."
Abstract

Cited by 43 (2 self)
 Add to MetaCart
Assume fP ` : ` 2 \Thetag is a set of probability distributions with a common dominating measure on a complete separable metric space Y . A state ` 2 \Theta is chosen by Nature. A statistician gets n independent observations Y 1 ; : : : ; Y n from Y distributed according to P ` . For each time t between 1 and n, based on the observations Y 1 ; : : : ; Y t\Gamma1 , the statistician produces an estimated distribution P t for P ` , and suffers a loss L(P ` ; P t ). The cumulative risk for the statistician is the average total loss up to time n. Of special interest in information theory, data compression, mathematical finance, computational learning theory and statistical mechanics is the special case when the loss L(P ` ; P t ) is the relative entropy between the true distribution P ` and the estimated distribution P t . Here the cumulative Bayes risk from time 1 to n is the mutual information between the random parameter \Theta and the observations Y 1 ; : : : ;...