Results 1  10
of
49
Hypothesis Selection and Testing by the MDL Principle
 The Computer Journal
, 1998
"... ses where the variance is known or taken as a parameter. 1. INTRODUCTION Although the term `hypothesis' in statistics is synonymous with that of a probability `model' as an explanation of data, hypothesis testing is not quite the same problem as model selection. This is because usually a particul ..."
Abstract

Cited by 58 (3 self)
 Add to MetaCart
ses where the variance is known or taken as a parameter. 1. INTRODUCTION Although the term `hypothesis' in statistics is synonymous with that of a probability `model' as an explanation of data, hypothesis testing is not quite the same problem as model selection. This is because usually a particular hypothesis, called the `null hypothesis', has already been selected as a favorite model and it will be abandoned in favor of another model only when it clearly fails to explain the currently available data. In model selection, by contrast, all the models considered are regarded on the same footing and the objective is simply to pick the one that best explains the data. For the Bayesians certain models may be favored in terms of a prior probability, but in the minimum description length (MDL) approach to be outlined below, prior knowledge of any kind is to be used in selecting the tentative models, which in the end, unlike in the Bayesians' case, can and will be fitted to data
Algorithmic Statistics
 IEEE Transactions on Information Theory
, 2001
"... While Kolmogorov complexity is the accepted absolute measure of information content of an individual finite object, a similarly absolute notion is needed for the relation between an individual data sample and an individual model summarizing the information in the data, for example, a finite set (or ..."
Abstract

Cited by 50 (13 self)
 Add to MetaCart
While Kolmogorov complexity is the accepted absolute measure of information content of an individual finite object, a similarly absolute notion is needed for the relation between an individual data sample and an individual model summarizing the information in the data, for example, a finite set (or probability distribution) where the data sample typically came from. The statistical theory based on such relations between individual objects can be called algorithmic statistics, in contrast to classical statistical theory that deals with relations between probabilistic ensembles. We develop the algorithmic theory of statistic, sufficient statistic, and minimal sufficient statistic. This theory is based on twopart codes consisting of the code for the statistic (the model summarizing the regularity, the meaningful information, in the data) and the modeltodata code. In contrast to the situation in probabilistic statistical theory, the algorithmic relation of (minimal) sufficiency is an absolute relation between the individual model and the individual data sample. We distinguish implicit and explicit descriptions of the models. We give characterizations of algorithmic (Kolmogorov) minimal sufficient statistic for all data samples for both description modes in the explicit mode under some constraints. We also strengthen and elaborate earlier results on the "Kolmogorov structure function" and "absolutely nonstochastic objects" those rare objects for which the simplest models that summarize their relevant information (minimal sucient statistics) are at least as complex as the objects themselves. We demonstrate a close relation between the probabilistic notions and the algorithmic ones: (i) in both cases there is an "information nonincrease" law; (ii) it is shown that a function is a...
Simplicity: A unifying principle in cognitive science?
 Trends in Cognitive Sciences
, 2003
"... This article reviews research exploring the idea that simplicity does, indeed, drive a wide range of cognitive processes. We outline mathematical theory, computational results, and empirical data underpinning this viewpoint. Key words: simplicity, Kolmogorov complexity, codes, learning, induction, B ..."
Abstract

Cited by 48 (2 self)
 Add to MetaCart
This article reviews research exploring the idea that simplicity does, indeed, drive a wide range of cognitive processes. We outline mathematical theory, computational results, and empirical data underpinning this viewpoint. Key words: simplicity, Kolmogorov complexity, codes, learning, induction, Bayesian inference 30word summary:This article outlines the proposal that many aspects of cognition, from perception, to language acquisition, to highlevel cognition involve finding patterns that provide the simplest explanation of available data. 3 The cognitive system finds patterns in the data that it receives. Perception involves finding patterns in the external world, from sensory input. Language acquisition involves finding patterns in linguistic input, to determine the structure of the language. Highlevel cognition involves finding patterns in information, to form categories, and to infer causal relations. Simplicity and the problem of induction A fundamental puzzle is what we term the problem of induction: infinitely many patterns are compatible with any finite set of data (see Box 1). So, for example, an infinity of curves pass through any finite set of points (Box 1a); an infinity of symbol sequences are compatible with any subsequence of symbols (Box 1b); infinitely many grammars are compatible with any finite set of observed sentences (Box 1c); and infinitely many perceptual organizations can fit any specific visual input (Box 1d). What principle allows the cognitive system to solve the problem of induction, and choose appropriately from these infinite sets of possibilities? Any such principle must meet two criteria: (i) it must solve the problem of induction successfully; (ii) it must explain empirical data in cognition. We argue that the best approach to (i)...
The Fastest And Shortest Algorithm For All WellDefined Problems
, 2002
"... An algorithm M is described that solves any welldefined problem p as quickly as the fastest algorithm computing a solution to p, save for a factor of 5 and loworder additive terms. M optimally distributes resources between the execution of provably correct psolving programs and an enumeration of ..."
Abstract

Cited by 35 (7 self)
 Add to MetaCart
An algorithm M is described that solves any welldefined problem p as quickly as the fastest algorithm computing a solution to p, save for a factor of 5 and loworder additive terms. M optimally distributes resources between the execution of provably correct psolving programs and an enumeration of all proofs, including relevant proofs of program correctness and of time bounds on program runtimes. M avoids Blum's speedup theorem by ignoring programs without correctness proof. M has broader applicability and can be faster than Levin's universal search, the fastest method for inverting functions save for a large multiplicative constant. An extension of Kolmogorov complexity and two novel natural measures of function complexity are used to show that the most efficient program computing some function f is also among the shortest programs provably computing f.
Convergence and Loss Bounds for Bayesian Sequence Prediction
 In
, 2003
"... The probability of observing $x_t$ at time $t$, given past observations $x_1...x_{t1}$ can be computed with Bayes rule if the true generating distribution $\mu$ of the sequences $x_1x_2x_3...$ is known. If $\mu$ is unknown, but known to belong to a class $M$ one can base ones prediction on the Baye ..."
Abstract

Cited by 22 (21 self)
 Add to MetaCart
The probability of observing $x_t$ at time $t$, given past observations $x_1...x_{t1}$ can be computed with Bayes rule if the true generating distribution $\mu$ of the sequences $x_1x_2x_3...$ is known. If $\mu$ is unknown, but known to belong to a class $M$ one can base ones prediction on the Bayes mix $\xi$ defined as a weighted sum of distributions $ u\in M$. Various convergence results of the mixture posterior $\xi_t$ to the true posterior $\mu_t$ are presented. In particular a new (elementary) derivation of the convergence $\xi_t/\mu_t\to 1$ is provided, which additionally gives the rate of convergence. A general sequence predictor is allowed to choose an action $y_t$ based on $x_1...x_{t1}$ and receives loss $\ell_{x_t y_t}$ if $x_t$ is the next symbol of the sequence. No assumptions are made on the structure of $\ell$ (apart from being bounded) and $M$. The Bayesoptimal prediction scheme $\Lambda_\xi$ based on mixture $\xi$ and the Bayesoptimal informed prediction scheme $\Lambda_\mu$ are defined and the total loss $L_\xi$ of $\Lambda_\xi$ is bounded in terms of the total loss $L_\mu$ of $\Lambda_\mu$. It is shown that $L_\xi$ is bounded for bounded $L_\mu$ and $L_\xi/L_\mu\to 1$ for $L_\mu\to \infty$. Convergence of the instantaneous losses is also proven.
Convergence and Error Bounds for Universal Prediction of Nonbinary Sequences
 Proceedings of the 12th Eurpean Conference on Machine Learning (ECML2001
, 2001
"... Solomonoff's uncomputable universal prediction scheme ß allows to predict the next symbol x k of a sequence x 1 ...x k1 for any Turing computable, but otherwise unknown, probabilistic environment µ . This scheme will be generalized to arbitrary environmental classes, which, among others ..."
Abstract

Cited by 21 (15 self)
 Add to MetaCart
Solomonoff's uncomputable universal prediction scheme ß allows to predict the next symbol x k of a sequence x 1 ...x k1 for any Turing computable, but otherwise unknown, probabilistic environment µ . This scheme will be generalized to arbitrary environmental classes, which, among others, allows the construction of computable universal prediction schemes ß . Convergence of ß to µ in a conditional mean squared sense and with µ probability 1 is proven. It is shown that the average number of prediction errors made by the universal ß scheme rapidly converges to those made by the best possible informed µ scheme. The schemes, theorems and proofs are given for general finite alphabet, which results in additional complications as compared to the binary case. Several extensions of the presented theory and results are outlined. They include general loss functions and bounds, games of chance, infinite alphabet, partial and delayed prediction, classification, and more active systems.
Complexity distortion theory
, 2003
"... Complexity distortion theory (CDT) is a mathematical framework providing a unifying perspective on media representation. The key component of this theory is the substitution of the decoder in Shannon’s classical communication model with a universal Turing machine. Using this model, the mathematical ..."
Abstract

Cited by 21 (2 self)
 Add to MetaCart
Complexity distortion theory (CDT) is a mathematical framework providing a unifying perspective on media representation. The key component of this theory is the substitution of the decoder in Shannon’s classical communication model with a universal Turing machine. Using this model, the mathematical framework for examining the efficiency of coding schemes is the algorithmic or Kolmogorov complexity. CDT extends this framework to include distortion by defining the complexity distortion function. We show that despite their different natures, CDT and rate distortion theory (RDT) predict asymptotically the same results, under stationary and ergodic assumptions. This closes the circle of representation models, from probabilistic models of information proposed by Shannon in information and rate distortion theories, to deterministic algorithmic models, proposed by Kolmogorov in Kolmogorov complexity theory and its extension to lossy source coding, CDT.
Applying MDL to Learning Best Model Granularity
, 1994
"... The Minimum Description Length (MDL) principle is solidly based on a provably ideal method of inference using Kolmogorov complexity. We test how the theory behaves in practice on a general problem in model selection: that of learning the best model granularity. The performance of a model depends ..."
Abstract

Cited by 20 (8 self)
 Add to MetaCart
The Minimum Description Length (MDL) principle is solidly based on a provably ideal method of inference using Kolmogorov complexity. We test how the theory behaves in practice on a general problem in model selection: that of learning the best model granularity. The performance of a model depends critically on the granularity, for example the choice of precision of the parameters. Too high precision generally involves modeling of accidental noise and too low precision may lead to confusion of models that should be distinguished. This precision is often determined ad hoc. In MDL the best model is the one that most compresses a twopart code of the data set: this embodies "Occam's Razor." In two quite different experimental settings the theoretical value determined using MDL coincides with the best value found experimentally. In the first experiment the task is to recognize isolated handwritten characters in one subject's handwriting, irrespective of size and orientation. Base...