Results 1-10 of 137
The minimum description length principle in coding and modeling
IEEE Trans. Inform. Theory, 1998
"... Abstract — We review the principles of Minimum Description Length and Stochastic Complexity as used in data compression and statistical modeling. Stochastic complexity is formulated as the solution to optimum universal coding problems extending Shannon’s basic source coding theorem. The normalized m ..."
Abstract

Cited by 305 (12 self)
 Add to MetaCart
We review the principles of Minimum Description Length and Stochastic Complexity as used in data compression and statistical modeling. Stochastic complexity is formulated as the solution to optimum universal coding problems extending Shannon’s basic source coding theorem. The normalized maximized likelihood, mixture, and predictive codings are each shown to achieve the stochastic complexity to within asymptotically vanishing terms. We assess the performance of the minimum description length criterion both from the vantage point of quality of data compression and accuracy of statistical inference. Context tree modeling, density estimation, and model selection in Gaussian linear regression serve as examples. Index Terms: Complexity, compression, estimation, inference, universal modeling.
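As a concrete illustration of the normalized maximized likelihood (NML) idea the abstract refers to, here is a minimal Python sketch, not taken from the paper, that computes the NML code length of a binary sequence under the Bernoulli model class; the function names are my own.

from math import comb, log2

def max_loglik_bernoulli(ones, n):
    # log2 of the maximized Bernoulli likelihood of a sequence with `ones` ones out of n
    if ones in (0, n):
        return 0.0
    p = ones / n
    return ones * log2(p) + (n - ones) * log2(1 - p)

def nml_code_length(bits):
    # NML code length in bits: -log2(maximized likelihood) + log2(normalizer C_n),
    # where C_n sums the maximized likelihoods over all binary sequences of length n
    n, ones = len(bits), sum(bits)
    C_n = sum(comb(n, k) * 2 ** max_loglik_bernoulli(k, n) for k in range(n + 1))
    return -max_loglik_bernoulli(ones, n) + log2(C_n)

print(nml_code_length([1, 0, 1, 1, 0, 1, 1, 1]))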
Universal prediction
IEEE Transactions on Information Theory, 1998
"... Abstract — This paper consists of an overview on universal prediction from an informationtheoretic perspective. Special attention is given to the notion of probability assignment under the selfinformation loss function, which is directly related to the theory of universal data compression. Both th ..."
Abstract

Cited by 136 (11 self)
 Add to MetaCart
This paper consists of an overview on universal prediction from an information-theoretic perspective. Special attention is given to the notion of probability assignment under the self-information loss function, which is directly related to the theory of universal data compression. Both the probabilistic setting and the deterministic setting of the universal prediction problem are described with emphasis on the analogy and the differences between results in the two settings. Index Terms: Bayes envelope, entropy, finite-state machine, linear prediction, loss function, probability assignment, redundancy-capacity, stochastic complexity, universal coding, universal prediction.
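As a toy illustration of probability assignment under the self-information (log) loss, here is a short Python sketch, not from the paper, that uses the Krichevsky-Trofimov estimator as its sequential probability assignment; its cumulative loss equals the KT code length of the sequence in bits.

from math import log2

def kt_sequential_loss(bits):
    # accumulate the self-information loss -log2 p_t(x_t) of the KT assignment
    counts = [0, 0]
    loss = 0.0
    for b in bits:
        p = (counts[b] + 0.5) / (counts[0] + counts[1] + 1.0)  # KT (add-1/2) rule
        loss += -log2(p)
        counts[b] += 1
    return loss

print(kt_sequential_loss([0, 1, 1, 0, 1, 1, 1, 0]))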
Universal Discrete Denoising: Known Channel
IEEE Trans. Inform. Theory, 2003
"... A discrete denoising algorithm estimates the input sequence to a discrete memoryless channel (DMC) based on the observation of the entire output sequence. For the case in which the DMC is known and the quality of the reconstruction is evaluated with a given singleletter fidelity criterion, we pr ..."
Abstract

Cited by 79 (32 self)
 Add to MetaCart
A discrete denoising algorithm estimates the input sequence to a discrete memoryless channel (DMC) based on the observation of the entire output sequence. For the case in which the DMC is known and the quality of the reconstruction is evaluated with a given single-letter fidelity criterion, we propose a discrete denoising algorithm that does not assume knowledge of statistical properties of the input sequence. Yet, the algorithm is universal in the sense of asymptotically performing as well as the optimum denoiser that knows the input sequence distribution, which is only assumed to be stationary and ergodic. Moreover, the algorithm is universal also in a semi-stochastic setting, in which the input is an individual sequence, and the randomness is due solely to the channel noise.
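For orientation only, a simplified Python sketch of a two-pass, context-counting denoiser of the kind the abstract describes: it counts the symbols observed in each two-sided context and then picks, per position, the reconstruction minimizing an estimated expected loss under the known channel. The context length, exact rule, and names are my own simplification, not the paper's specification.

import numpy as np
from collections import defaultdict

def denoise(z, Pi, Lam, k):
    # z: noisy sequence of symbol indices, Pi: channel matrix Pi[x, y] = P(y|x),
    # Lam: loss matrix Lam[x, xhat], k: one-sided context length
    n, A = len(z), Pi.shape[0]
    Pi_inv_T = np.linalg.inv(Pi).T
    # pass 1: count observed symbols per two-sided context
    m = defaultdict(lambda: np.zeros(A))
    for i in range(k, n - k):
        c = (tuple(z[i - k:i]), tuple(z[i + 1:i + 1 + k]))
        m[c][z[i]] += 1
    # pass 2: choose the reconstruction with the smallest estimated expected loss
    xhat = list(z)
    for i in range(k, n - k):
        c = (tuple(z[i - k:i]), tuple(z[i + 1:i + 1 + k]))
        scores = [m[c] @ Pi_inv_T @ (Lam[:, a] * Pi[:, z[i]]) for a in range(A)]
        xhat[i] = int(np.argmin(scores))
    return xhat

Pi = np.array([[0.9, 0.1], [0.1, 0.9]])   # binary symmetric channel, crossover 0.1
Lam = np.array([[0.0, 1.0], [1.0, 0.0]])  # Hamming loss
print(denoise([0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1], Pi, Lam, k=1))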
Lossy Source Coding
IEEE Trans. Inform. Theory, 1998
"... Lossy coding of speech, highquality audio, still images, and video is commonplace today. However, in 1948, few lossy compression systems were in service. Shannon introduced and developed the theory of source coding with a fidelity criterion, also called ratedistortion theory. For the first 25 year ..."
Abstract

Cited by 71 (1 self)
 Add to MetaCart
Lossy coding of speech, high-quality audio, still images, and video is commonplace today. However, in 1948, few lossy compression systems were in service. Shannon introduced and developed the theory of source coding with a fidelity criterion, also called rate-distortion theory. For the first 25 years of its existence, rate-distortion theory had relatively little impact on the methods and systems actually used to compress real sources. Today, however, rate-distortion theoretic concepts are an important component of many lossy compression techniques and standards. We chronicle the development of rate-distortion theory and provide an overview of its influence on the practice of lossy source coding. Index Terms: Data compression, image coding, speech coding, rate-distortion theory, signal coding, source coding with a fidelity criterion, video coding.
On prediction using variable order Markov models
Journal of Artificial Intelligence Research, 2004
"... This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Cont ..."
Abstract

Cited by 56 (1 self)
 Add to MetaCart
This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Context Tree Weighting (CTW), Prediction by Partial Match (PPM) and Probabilistic Suffix Trees (PSTs). We discuss the properties of these algorithms and compare their performance using real-life sequences from three domains: proteins, English text and music pieces. The comparison is made with respect to prediction quality as measured by the average log-loss. We also compare classification algorithms based on these predictors with respect to a number of large protein classification tasks. Our results indicate that a “decomposed” CTW (a variant of the CTW algorithm) and PPM outperform all other algorithms in sequence prediction tasks. Somewhat surprisingly, a different algorithm, which is a modification of the Lempel-Ziv compression algorithm, significantly outperforms all algorithms on the protein classification problems.
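The average log-loss used for the comparison can be made concrete with a small Python sketch; this is a plain fixed-order Markov predictor with add-alpha smoothing, a deliberately simpler baseline than the six variable-order algorithms the paper studies, and the names are my own.

from collections import defaultdict
from math import log2

def average_log_loss(train, test, order=2, alpha=1.0):
    # bits per symbol of an order-`order` Markov predictor with add-`alpha` smoothing
    alphabet = sorted(set(train) | set(test))
    counts = defaultdict(lambda: defaultdict(float))
    for i in range(order, len(train)):
        counts[train[i - order:i]][train[i]] += 1
    loss = 0.0
    for i in range(order, len(test)):
        ctx = test[i - order:i]
        total = sum(counts[ctx].values()) + alpha * len(alphabet)
        loss += -log2((counts[ctx][test[i]] + alpha) / total)
    return loss / (len(test) - order)

print(average_log_loss("abracadabraabracadabra", "abracadabra"))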
Mixing Strategies for Density Estimation
Ann. Statist.
"... General results on adaptive density estimation are obtained with respect to any countable collection of estimation strategies under KullbackLeibler and square L 2 losses. It is shown that without knowing which strategy works best for the underlying density, a single strategy can be constructed by m ..."
Abstract

Cited by 39 (9 self)
 Add to MetaCart
General results on adaptive density estimation are obtained with respect to any countable collection of estimation strategies under Kullback-Leibler and squared L2 losses. It is shown that without knowing which strategy works best for the underlying density, a single strategy can be constructed by mixing the proposed ones to be adaptive in terms of statistical risks. A consequence is that under some mild conditions, an asymptotically minimax-rate adaptive estimator exists for a given countable collection of density classes, i.e., a single estimator can be constructed to be simultaneously minimax-rate optimal for all the function classes being considered. A demonstration is given for high-dimensional density estimation on [0, 1]^d, where the constructed estimator adapts to smoothness and interaction order over some piecewise Besov classes, and is consistent for all the densities with finite entropy.
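A crude stand-in for the mixing construction, assuming my own weighting scheme rather than the paper's: fit each candidate strategy on half the sample and weight it by the likelihood it assigns to the other half; the mixed estimator is the resulting convex combination of the fitted densities.

import numpy as np

def mix_by_held_out_likelihood(sample, strategies, seed=0):
    # weight each fitted strategy by the likelihood of a held-out half of the data
    data = np.random.default_rng(seed).permutation(sample)
    half = len(data) // 2
    fit, held_out = data[:half], data[half:]
    fitted = [make(fit) for make in strategies]
    log_w = np.array([np.sum(np.log(f(held_out))) for f in fitted])
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return lambda x: sum(wi * f(x) for wi, f in zip(w, fitted))

def gaussian_kde(bandwidth):
    # one candidate strategy: a fixed-bandwidth Gaussian kernel density estimate
    def make(data):
        def density(x):
            d = (np.atleast_1d(x)[:, None] - data[None, :]) / bandwidth
            return np.exp(-0.5 * d ** 2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))
        return density
    return make

sample = np.random.default_rng(1).normal(size=500)
mixture = mix_by_held_out_likelihood(sample, [gaussian_kde(0.1), gaussian_kde(0.5)])
print(mixture(np.array([-1.0, 0.0, 1.0])))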
Universal Lossless Source Coding With the Burrows Wheeler Transform
IEEE Transactions on Information Theory, 2002
"... The Burrows Wheeler Transform (BWT) is a reversible sequence transformation used in a variety of practical lossless sourcecoding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWTbased compression schemes ar ..."
Abstract

Cited by 38 (3 self)
 Add to MetaCart
The Burrows-Wheeler Transform (BWT) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes are widely touted as low-complexity algorithms giving lossless coding rates better than those of the Ziv-Lempel codes (commonly known as LZ'77 and LZ'78) and almost as good as those achieved by prediction by partial matching (PPM) algorithms. To date, the coding performance claims have been made primarily on the basis of experimental results. This work gives a theoretical evaluation of BWT-based coding. The main results of this theoretical evaluation include: 1) statistical characterizations of the BWT output on both finite strings and sequences of length n, 2) a variety of very simple new techniques for BWT-based lossless source coding, and 3) proofs of the universality and bounds on the rates of convergence of both new and existing BWT-based codes for finite-memory and stationary ergodic sources. The end result is a theoretical justification and validation of the experimentally derived conclusions: BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in Ziv-Lempel style codes and, for some BWT-based codes, within a constant factor of the optimal rate of convergence for finite-memory sources.
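A minimal Python sketch of the transform itself, using the naive rotation-sorting construction rather than the suffix-array implementations practical coders rely on:

def bwt(s, sentinel="$"):
    # sort all rotations of s + sentinel and read off the last column
    s = s + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last, sentinel="$"):
    # rebuild the rotation table column by column, then take the row ending in the sentinel
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith(sentinel)).rstrip(sentinel)

print(bwt("banana"))                  # annb$aa
print(inverse_bwt(bwt("banana")))     # banana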
A Natural Law of Succession
1995
"... Consider the following problem. You are given an alphabet of k distinct symbols and are told that the i th symbol occurred exactly ni times in the past. On the basis of this information alone, you must now estimate the conditional probability that the next symbol will be i. In this report, we presen ..."
Abstract

Cited by 35 (3 self)
 Add to MetaCart
Consider the following problem. You are given an alphabet of k distinct symbols and are told that the i-th symbol occurred exactly n_i times in the past. On the basis of this information alone, you must now estimate the conditional probability that the next symbol will be i. In this report, we present a new solution to this fundamental problem in statistics and demonstrate that our solution outperforms standard approaches, both in theory and in practice.
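The standard approaches the abstract alludes to are add-constant rules; the paper's own estimator differs, so the sketch below, with my own function name, only shows those baselines.

def add_constant_estimate(counts, i, beta):
    # P(next symbol = i) under an add-beta rule:
    # beta = 1 is Laplace's law of succession, beta = 0.5 the Krichevsky-Trofimov estimator
    return (counts[i] + beta) / (sum(counts) + beta * len(counts))

counts = [3, 0, 1]                                   # symbol 0 seen 3 times, symbol 2 once
print(add_constant_estimate(counts, 1, beta=1.0))    # Laplace: 1/7
print(add_constant_estimate(counts, 1, beta=0.5))    # KT: 0.5/5.5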
The Mixture Approach to Universal Model Selection
École Normale Supérieure, 1997
"... We build a model selection algorithm which has a mean Kullback risk upper bounded by the mean risk for the best model applied to half the sample augmented by an explicit penalty term of order one over the sample size. This estimator is not a "true" selection rule, but instead an adaptive convex comb ..."
Abstract

Cited by 28 (2 self)
 Add to MetaCart
We build a model selection algorithm which has a mean Kullback risk upper bounded by the mean risk for the best model applied to half the sample, augmented by an explicit penalty term of order one over the sample size. This estimator is not a "true" selection rule, but instead an adaptive convex combination of the models: this mixture approach is inspired by the context tree weighting method of information theory by Willems, Shtarkov and Tjalkens [6, 7], which is a "universal" data compression algorithm for stationary binary sources. Our algorithm is "universal" with respect to the statistical mean Kullback risk in the sense that it is almost optimal for any exchangeable sample distribution. We give an application to the estimation of a probability measure by adaptive histograms. We detail a factorised computation of the mixture which weights M models performing a number of operations of order log(M), when the subdivisions underlying the histograms have nested cells.
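A toy version of the adaptive-histogram mixture, with my own choice of prior and weights rather than the paper's exact algorithm: dyadic histograms on [0, 1] are fitted on half the sample and convexly combined with weights proportional to a prior times the likelihood of the other half.

import numpy as np

def mixture_of_histograms(data, max_level=5, seed=0):
    # model j: 2**j equal bins on [0, 1], prior weight 2**-(j+1)
    data = np.random.default_rng(seed).permutation(data)
    half = len(data) // 2
    fit, weigh = data[:half], data[half:]
    densities, log_post = [], []
    for j in range(max_level + 1):
        bins = 2 ** j
        counts, _ = np.histogram(fit, bins=bins, range=(0.0, 1.0))
        dens = (counts + 1.0) * bins / (len(fit) + bins)   # Laplace-smoothed density values
        def density(x, dens=dens, bins=bins):
            idx = np.clip((np.asarray(x) * bins).astype(int), 0, bins - 1)
            return dens[idx]
        densities.append(density)
        log_post.append(-(j + 1) * np.log(2.0) + np.sum(np.log(density(weigh))))
    log_post = np.asarray(log_post)
    w = np.exp(log_post - log_post.max())
    w /= w.sum()
    return lambda x: sum(wi * d(x) for wi, d in zip(w, densities))

mix = mixture_of_histograms(np.random.default_rng(1).beta(2.0, 5.0, size=2000))
print(mix(np.array([0.05, 0.5, 0.95])))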
Context tree estimation for not necessarily finite memory processes, via BIC and MDL
IEEE Trans. Inf. Theory, 2006
"... The concept of context tree, usually defined for finite memory processes, is extended to arbitrary stationary ergodic processes (with finite alphabet). These context trees are not necessarily complete, and may be of infinite depth. The familiar BIC and MDL principles are shown to provide strongly co ..."
Abstract

Cited by 25 (1 self)
 Add to MetaCart
The concept of context tree, usually defined for finite memory processes, is extended to arbitrary stationary ergodic processes (with finite alphabet). These context trees are not necessarily complete, and may be of infinite depth. The familiar BIC and MDL principles are shown to provide strongly consistent estimators of the context tree, via optimization of a criterion for hypothetical context trees of finite depth, allowed to grow with the sample size n as o(log n). Algorithms are provided to compute these estimators in O(n) time, and to compute them online for all i ≤ n in o(n log n) time.
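The shape of the BIC criterion can be illustrated on the much simpler special case of choosing a uniform Markov order (the paper's estimators optimize over general, possibly incomplete context trees); the sketch and its names are my own.

from collections import Counter
from math import log

def bic_markov_order(seq, alphabet, max_order=4):
    # pick the order maximizing: maximized log-likelihood - (free parameters / 2) * log n
    n = len(seq)
    best = None
    for d in range(max_order + 1):
        ctx = Counter(seq[i - d:i] for i in range(d, n))
        pairs = Counter(seq[i - d:i + 1] for i in range(d, n))
        loglik = sum(c * log(c / ctx[p[:-1]]) for p, c in pairs.items())
        params = len(alphabet) ** d * (len(alphabet) - 1)
        score = loglik - 0.5 * params * log(n)
        if best is None or score > best[0]:
            best = (score, d)
    return best[1]

print(bic_markov_order("abababababababababababab", alphabet="ab"))   # expected: 1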