Results 1 - 10
of
45
Hypothesis Selection and Testing by the MDL Principle
- The Computer Journal
, 1998
"... ses where the variance is known or taken as a parameter. 1. INTRODUCTION Although the term `hypothesis' in statistics is synonymous with that of a probability `model' as an explanation of data, hypothesis testing is not quite the same problem as model selection. This is because usually a particul ..."
Abstract
-
Cited by 47 (2 self)
- Add to MetaCart
ses where the variance is known or taken as a parameter. 1. INTRODUCTION Although the term `hypothesis' in statistics is synonymous with that of a probability `model' as an explanation of data, hypothesis testing is not quite the same problem as model selection. This is because usually a particular hypothesis, called the `null hypothesis', has already been selected as a favorite model and it will be abandoned in favor of another model only when it clearly fails to explain the currently available data. In model selection, by contrast, all the models considered are regarded on the same footing and the objective is simply to pick the one that best explains the data. For the Bayesians certain models may be favored in terms of a prior probability, but in the minimum description length (MDL) approach to be outlined below, prior knowledge of any kind is to be used in selecting the tentative models, which in the end, unlike in the Bayesians' case, can and will be fitted to data
Algorithmic Statistics
- IEEE Transactions on Information Theory
, 2001
"... While Kolmogorov complexity is the accepted absolute measure of information content of an individual finite object, a similarly absolute notion is needed for the relation between an individual data sample and an individual model summarizing the information in the data, for example, a finite set (or ..."
Abstract
-
Cited by 41 (8 self)
- Add to MetaCart
While Kolmogorov complexity is the accepted absolute measure of information content of an individual finite object, a similarly absolute notion is needed for the relation between an individual data sample and an individual model summarizing the information in the data, for example, a finite set (or probability distribution) where the data sample typically came from. The statistical theory based on such relations between individual objects can be called algorithmic statistics, in contrast to classical statistical theory that deals with relations between probabilistic ensembles. We develop the algorithmic theory of statistic, sufficient statistic, and minimal sufficient statistic. This theory is based on two-part codes consisting of the code for the statistic (the model summarizing the regularity, the meaningful information, in the data) and the model-to-data code. In contrast to the situation in probabilistic statistical theory, the algorithmic relation of (minimal) sufficiency is an absolute relation between the individual model and the individual data sample. We distinguish implicit and explicit descriptions of the models. We give characterizations of algorithmic (Kolmogorov) minimal sufficient statistic for all data samples for both description modes in the explicit mode under some constraints. We also strengthen and elaborate earlier results on the "Kolmogorov structure function" and "absolutely non-stochastic objects" those rare objects for which the simplest models that summarize their relevant information (minimal sucient statistics) are at least as complex as the objects themselves. We demonstrate a close relation between the probabilistic notions and the algorithmic ones: (i) in both cases there is an "information non-increase" law; (ii) it is shown that a function is a...
Simplicity: A unifying principle in cognitive science?
- Trends in Cognitive Sciences
, 2003
"... This article reviews research exploring the idea that simplicity does, indeed, drive a wide range of cognitive processes. We outline mathematical theory, computational results, and empirical data underpinning this viewpoint. Key words: simplicity, Kolmogorov complexity, codes, learning, induction, B ..."
Abstract
-
Cited by 27 (2 self)
- Add to MetaCart
This article reviews research exploring the idea that simplicity does, indeed, drive a wide range of cognitive processes. We outline mathematical theory, computational results, and empirical data underpinning this viewpoint. Key words: simplicity, Kolmogorov complexity, codes, learning, induction, Bayesian inference 30-word summary:This article outlines the proposal that many aspects of cognition, from perception, to language acquisition, to high-level cognition involve finding patterns that provide the simplest explanation of available data. 3 The cognitive system finds patterns in the data that it receives. Perception involves finding patterns in the external world, from sensory input. Language acquisition involves finding patterns in linguistic input, to determine the structure of the language. High-level cognition involves finding patterns in information, to form categories, and to infer causal relations. Simplicity and the problem of induction A fundamental puzzle is what we term the problem of induction: infinitely many patterns are compatible with any finite set of data (see Box 1). So, for example, an infinity of curves pass through any finite set of points (Box 1a); an infinity of symbol sequences are compatible with any subsequence of symbols (Box 1b); infinitely many grammars are compatible with any finite set of observed sentences (Box 1c); and infinitely many perceptual organizations can fit any specific visual input (Box 1d). What principle allows the cognitive system to solve the problem of induction, and choose appropriately from these infinite sets of possibilities? Any such principle must meet two criteria: (i) it must solve the problem of induction successfully; (ii) it must explain empirical data in cognition. We argue that the best approach to (i)...
The Fastest And Shortest Algorithm For All Well-Defined Problems
, 2002
"... An algorithm M is described that solves any well-defined problem p as quickly as the fastest algorithm computing a solution to p, save for a factor of 5 and low-order additive terms. M optimally distributes resources between the execution of provably correct p-solving programs and an enumeration of ..."
Abstract
-
Cited by 23 (5 self)
- Add to MetaCart
An algorithm M is described that solves any well-defined problem p as quickly as the fastest algorithm computing a solution to p, save for a factor of 5 and low-order additive terms. M optimally distributes resources between the execution of provably correct p-solving programs and an enumeration of all proofs, including relevant proofs of program correctness and of time bounds on program runtimes. M avoids Blum's speed-up theorem by ignoring programs without correctness proof. M has broader applicability and can be faster than Levin's universal search, the fastest method for inverting functions save for a large multiplicative constant. An extension of Kolmogorov complexity and two novel natural measures of function complexity are used to show that the most efficient program computing some function f is also among the shortest programs provably computing f.
Convergence and Loss Bounds for Bayesian Sequence Prediction
- In
, 2003
"... The probability of observing $x_t$ at time $t$, given past observations $x_1...x_{t-1}$ can be computed with Bayes rule if the true generating distribution $\mu$ of the sequences $x_1x_2x_3...$ is known. If $\mu$ is unknown, but known to belong to a class $M$ one can base ones prediction on the Baye ..."
Abstract
-
Cited by 21 (20 self)
- Add to MetaCart
The probability of observing $x_t$ at time $t$, given past observations $x_1...x_{t-1}$ can be computed with Bayes rule if the true generating distribution $\mu$ of the sequences $x_1x_2x_3...$ is known. If $\mu$ is unknown, but known to belong to a class $M$ one can base ones prediction on the Bayes mix $\xi$ defined as a weighted sum of distributions $ u\in M$. Various convergence results of the mixture posterior $\xi_t$ to the true posterior $\mu_t$ are presented. In particular a new (elementary) derivation of the convergence $\xi_t/\mu_t\to 1$ is provided, which additionally gives the rate of convergence. A general sequence predictor is allowed to choose an action $y_t$ based on $x_1...x_{t-1}$ and receives loss $\ell_{x_t y_t}$ if $x_t$ is the next symbol of the sequence. No assumptions are made on the structure of $\ell$ (apart from being bounded) and $M$. The Bayes-optimal prediction scheme $\Lambda_\xi$ based on mixture $\xi$ and the Bayes-optimal informed prediction scheme $\Lambda_\mu$ are defined and the total loss $L_\xi$ of $\Lambda_\xi$ is bounded in terms of the total loss $L_\mu$ of $\Lambda_\mu$. It is shown that $L_\xi$ is bounded for bounded $L_\mu$ and $L_\xi/L_\mu\to 1$ for $L_\mu\to \infty$. Convergence of the instantaneous losses is also proven.
Convergence and Error Bounds for Universal Prediction of Nonbinary Sequences
- Proceedings of the 12th Eurpean Conference on Machine Learning (ECML-2001
, 2001
"... Solomonoff's uncomputable universal prediction scheme ß allows to predict the next symbol x k of a sequence x 1 ...x k-1 for any Turing computable, but otherwise unknown, probabilistic environment µ . This scheme will be generalized to arbitrary environmental classes, which, among others ..."
Abstract
-
Cited by 19 (14 self)
- Add to MetaCart
Solomonoff's uncomputable universal prediction scheme ß allows to predict the next symbol x k of a sequence x 1 ...x k-1 for any Turing computable, but otherwise unknown, probabilistic environment µ . This scheme will be generalized to arbitrary environmental classes, which, among others, allows the construction of computable universal prediction schemes ß . Convergence of ß to µ in a conditional mean squared sense and with µ probability 1 is proven. It is shown that the average number of prediction errors made by the universal ß scheme rapidly converges to those made by the best possible informed µ scheme. The schemes, theorems and proofs are given for general finite alphabet, which results in additional complications as compared to the binary case. Several extensions of the presented theory and results are outlined. They include general loss functions and bounds, games of chance, infinite alphabet, partial and delayed prediction, classification, and more active systems.
Quantum Kolmogorov complexity based on classical descriptions
- IEEE Trans. Inform. Theory
, 2001
"... Abstract—We develop a theory of the algorithmic information in bits contained in an individual pure quantum state. This extends classical Kolmogorov complexity to the quantum domain retaining classical descriptions. Quantum Kolmogorov complexity coincides with the classical Kolmogorov complexity on ..."
Abstract
-
Cited by 19 (1 self)
- Add to MetaCart
Abstract—We develop a theory of the algorithmic information in bits contained in an individual pure quantum state. This extends classical Kolmogorov complexity to the quantum domain retaining classical descriptions. Quantum Kolmogorov complexity coincides with the classical Kolmogorov complexity on the classical domain. Quantum Kolmogorov complexity is upper bounded and can be effectively approximated from above under certain conditions. With high probability a quantum object is incompressible. Upper- and lower bounds of the quantum complexity of multiple copies of individual pure quantum states are derived and may shed some light on the no-cloning properties of quantum states. In the quantum situation complexity is not sub-additive. We discuss some relations with “no-cloning ” and “approximate cloning ” properties. Keywords — Algorithmic information theory, quantum; classical descriptions of quantum states; information theory, quantum; Kolmogorov complexity, quantum; quantum cloning. I.
Applying MDL to Learning Best Model Granularity
, 1994
"... The Minimum Description Length (MDL) principle is solidly based on a provably ideal method of inference using Kolmogorov complexity. We test how the theory behaves in practice on a general problem in model selection: that of learning the best model granularity. The performance of a model depends ..."
Abstract
-
Cited by 17 (6 self)
- Add to MetaCart
The Minimum Description Length (MDL) principle is solidly based on a provably ideal method of inference using Kolmogorov complexity. We test how the theory behaves in practice on a general problem in model selection: that of learning the best model granularity. The performance of a model depends critically on the granularity, for example the choice of precision of the parameters. Too high precision generally involves modeling of accidental noise and too low precision may lead to confusion of models that should be distinguished. This precision is often determined ad hoc. In MDL the best model is the one that most compresses a two-part code of the data set: this embodies "Occam's Razor." In two quite different experimental settings the theoretical value determined using MDL coincides with the best value found experimentally. In the first experiment the task is to recognize isolated handwritten characters in one subject's handwriting, irrespective of size and orientation. Base...

