Shifting Inductive Bias with Success-Story Algorithm, Adaptive Levin Search, and Incremental Self-Improvement
- MACHINE LEARNING
, 1997
Cited by 58 (27 self)
We study task sequences that allow for speeding up the learner's average reward intake through appropriate shifts of inductive bias (changes of the learner's policy). To evaluate long-term effects of bias shifts setting the stage for later bias shifts we use the "success-story algorithm" (SSA). SSA is occasionally called at times that may depend on the policy itself. It uses backtracking to undo those bias shifts that have not been empirically observed to trigger long-term reward accelerations (measured up until the current SSA call). Bias shifts that survive SSA represent a lifelong success history. Until the next SSA call, they are considered useful and build the basis for additional bias shifts. SSA allows for plugging in a wide variety of learning algorithms. We plug in (1) a novel, adaptive extension of Levin search and (2) a method for embedding the learner's policy modification strategy within the policy itself (incremental self-improvement). Our inductive transfer case studies...
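The backtracking step described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation. It assumes one particular formalization of "reward acceleration": each surviving bias shift must show a strictly higher reward-per-time ratio since its own start than the shift before it, with the lifetime average as the baseline. The names `Checkpoint`, `criterion_holds`, and `ssa_call` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Checkpoint:
    """Record pushed when a bias shift is made (hypothetical structure)."""
    time: float               # time at which the bias shift happened (> 0)
    reward: float             # cumulative reward collected up to that time
    undo: Callable[[], None]  # restores the policy to its pre-shift state


def criterion_holds(stack: List[Checkpoint], now: float, total_reward: float) -> bool:
    """Assumed criterion: reward rates since each surviving shift strictly
    increase from the bottom of the stack (oldest) to the top (newest),
    starting from the lifetime average total_reward / now.
    Requires now > 0 and now > cp.time for every checkpoint."""
    previous_rate = total_reward / now
    for cp in stack:
        rate = (total_reward - cp.reward) / (now - cp.time)
        if rate <= previous_rate:
            return False
        previous_rate = rate
    return True


def ssa_call(stack: List[Checkpoint], now: float, total_reward: float) -> int:
    """Undo the most recent bias shifts until the assumed criterion holds.
    Returns the number of shifts that were undone."""
    undone = 0
    while stack and not criterion_holds(stack, now, total_reward):
        stack.pop().undo()  # backtrack: newest shift first
        undone += 1
    return undone
```

For example, with checkpoints at times 20 and 60 holding cumulative rewards 5 and 30, a call at time 100 with total reward 50 undoes only the second shift: its rate since time 60 is 0.5, which does not exceed the first shift's rate of 0.5625.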
Optimal Ordered Problem Solver
, 2002
Cited by 47 (12 self)
We present a novel, general, optimally fast, incremental way of searching for a universal algorithm that solves each task in a sequence of tasks. The Optimal Ordered Problem Solver (OOPS) continually organizes and exploits previously found solutions to earlier tasks, efficiently searching not only the space of domain-specific algorithms, but also the space of search algorithms. Essentially we extend the principles of optimal nonincremental universal search to build an incremental universal learner that is able to improve itself through experience.
The Speed Prior: A New Simplicity Measure Yielding Near-Optimal Computable Predictions
- Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), Lecture Notes in Artificial Intelligence
, 2002
Cited by 37 (13 self)
Solomonoff's optimal but noncomputable method for inductive inference assumes that observation sequences x are drawn from a recursive prior distribution p(x). Instead of using the unknown p(x) he predicts using the celebrated universal enumerable prior M(x) which for all x exceeds any recursive p(x), save for a constant factor independent of x. The simplicity measure M(x) naturally implements "Occam's razor" and is closely related to the Kolmogorov complexity of x. However, M assigns high probability to certain data that are extremely hard to compute. This does not match our intuitive notion of simplicity. Here we suggest a more plausible measure derived from the fastest way of computing data. In absence of contrarian evidence, we assume that the physical world is generated by a computational process, and that any possibly infinite sequence of observations is therefore computable in the limit (this assumption is more radical and stronger than Solomonoff's).
REINFORCEMENT LEARNING WITH SELF-MODIFYING POLICIES
, 1997
Cited by 28 (20 self)
A learner’s modifiable components are called its policy. An algorithm that modifies the policy is a learning algorithm. If the learning algorithm has modifiable components represented as part of the policy, then we speak of a self-modifying policy (SMP). SMPs can modify the way they modify themselves etc. They are of interest in situations where the initial learning algorithm itself can be improved by experience — this is what we call “learning to learn”. How can we force some (stochastic) SMP to trigger better and better self-modifications? The success-story algorithm (SSA) addresses this question in a lifelong reinforcement learning context. During the learner’s life-time, SSA is occasionally called at times computed according to SMP itself. SSA uses backtracking to undo those SMP-generated SMP-modifications that have not been empirically observed to trigger lifelong reward accelerations (measured up until the current SSA call — this evaluates the long-term effects of SMP-modifications setting the stage for later SMP-modifications). SMP-modifications that survive SSA represent a lifelong success history. Until the next SSA call, they build the basis for additional SMP-modifications. Solely by self-modifications our SMP/SSA-based learners solve a complex task in a partially observable environment (POE) whose state space is far bigger than most reported in the POE literature.
A computer scientist’s view of life, the universe, and everything
- Foundations of Computer Science: Potential - Theory - Cognition
, 1997
Cited by 27 (11 self)
Is the universe computable? If so, it may be much cheaper in terms of information requirements to compute all computable universes instead of just ours. I apply basic concepts of Kolmogorov complexity theory to the set of possible universes, and chat about perceived and true randomness, life, generalization, and learning in a given universe. Preliminaries. Assumptions. A long time ago, the Great Programmer wrote a program that runs all possible universes on His Big Computer. “Possible” means “computable”: (1) Each universe evolves on a discrete time scale. (2) Any universe’s state at a given time is describable by a finite number of bits. One of the many universes is ours, despite some who evolved in it and claim it is incomputable. Computable universes. Let TM denote an arbitrary universal Turing machine with unidirectional output tape. TM’s input and output symbols are “0”, “1”, and “,” (comma). TM’s possible input programs can be ordered
The Fastest And Shortest Algorithm For All Well-Defined Problems
, 2002
Cited by 23 (5 self)
An algorithm M is described that solves any well-defined problem p as quickly as the fastest algorithm computing a solution to p, save for a factor of 5 and low-order additive terms. M optimally distributes resources between the execution of provably correct p-solving programs and an enumeration of all proofs, including relevant proofs of program correctness and of time bounds on program runtimes. M avoids Blum's speed-up theorem by ignoring programs without correctness proof. M has broader applicability and can be faster than Levin's universal search, the fastest method for inverting functions save for a large multiplicative constant. An extension of Kolmogorov complexity and two novel natural measures of function complexity are used to show that the most efficient program computing some function f is also among the shortest programs provably computing f.
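Levin's universal search, which this abstract uses as its point of comparison, can be illustrated with a toy sketch. This is not the algorithm M from the abstract: it only shows the time-allocation idea that, in phase i, each candidate program of length l is run for at most 2^(i - l) steps, so shorter programs receive exponentially more time. The interpreter `toy_run` and its cost model are assumptions made up for the example, as are all function names.

```python
from itertools import product
from typing import Callable, Optional, Tuple


def levin_search(
    run: Callable[[str, int], Optional[int]],
    is_solution: Callable[[int], bool],
    max_phase: int = 20,
) -> Optional[Tuple[str, int]]:
    """Return (program, phase) for the first program whose output passes
    is_solution, or None if nothing is found within max_phase phases."""
    for phase in range(1, max_phase + 1):
        for length in range(1, phase + 1):
            budget = 2 ** (phase - length)  # step budget for this length
            for bits in product("01", repeat=length):
                program = "".join(bits)
                output = run(program, budget)
                if output is not None and is_solution(output):
                    return program, phase
    return None


def toy_run(program: str, max_steps: int) -> Optional[int]:
    """Hypothetical interpreter: a bit string encodes an integer x, and
    'running' it is assumed to cost x + 1 steps before it outputs x."""
    x = int(program, 2)
    return x if x + 1 <= max_steps else None


# Invert f(x) = x * x for the target value 9.
result = levin_search(toy_run, lambda x: x * x == 9)
```

Here `result` is `("11", 4)`: the program `11` encodes 3 and needs 4 steps under the assumed cost model, a budget that 2-bit programs first receive in phase 4.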
Blind source separation using algorithmic information theory
- Neurocomputing
, 1998
Cited by 21 (4 self)
Previous approaches for the blind source separation problem have used independent component analysis making the separated components statistically independent. In this paper, a new contrast for blind source separation of natural signals is proposed, which measures the algorithmic complexity of the sources and also the complexity of the mixing mapping. No assumptions about underlying probability distributions of the sources are necessary. Instead, it is required that the independent source signals have low complexity, which is generally true for natural signals. Connection to previous approaches is shown by demonstrating that minimum mutual information coincides with minimizing complexity in a special case. An experiment is presented, where a difficult problem of separating correlated signals is considered. The complexity minimization method is seen to give clearly more accurate results than the reference method utilizing ICA.
Algorithmic Theories Of Everything
, 2000
Cited by 21 (10 self)
The probability distribution P from which the history of our universe is sampled represents a theory of everything or TOE. We assume P is formally describable. Since most (uncountably many) distributions are not, this imposes a strong inductive bias. We show that P(x) is small for any universe x lacking a short description, and study the spectrum of TOEs spanned by two Ps, one reflecting the most compact constructive descriptions, the other the fastest way of computing everything. The former derives from generalizations of traditional computability, Solomonoff’s algorithmic probability, Kolmogorov complexity, and objects more random than Chaitin’s Omega, the latter from Levin’s universal search and a natural resource-oriented postulate: the cumulative prior probability of all x incomputable within time t by this optimal algorithm should be 1/t. Between both Ps we find a universal cumulatively enumerable measure that dominates traditional enumerable measures; any such CEM must assign low probability to any universe lacking a short enumerating program. We derive P-specific consequences for evolving observers, inductive reasoning, quantum physics, philosophy, and the expected duration of our universe.
Towards a universal theory of artificial intelligence based on algorithmic probability and sequential decisions
- Proceedings of the 12th European Conference on Machine Learning (ECML-2001)
, 2001
Cited by 20 (9 self)
Decision theory formally solves the problem of rational agents in uncertain worlds if the true environmental probability distribution is known. Solomonoff’s theory of universal induction formally solves the problem of sequence prediction for unknown distributions. We unify both theories and give strong arguments that the resulting universal AIξ model behaves optimally in any computable environment. The major drawback of the AIξ model is that it is uncomputable. To overcome this problem, we construct a modified algorithm AIξtl, which is still superior to any other time t and length l bounded agent. The computation time of AIξtl is of the order t·2^l.
Feature extraction through LOCOCODE
- NEURAL COMPUTATION
, 1998
Cited by 19 (3 self)
"Low-complexity coding and decoding" (Lococode) is a novel approach to sensory coding and unsupervised learning. Unlike previous methods it explicitly takes into account the information-theoretic complexity of the code generator: it computes lococodes that (1) convey information about the input data and (2) can be computed and decoded by low-complexity mappings. We implement Lococode by training autoassociators with Flat Minimum Search, a recent, general method for discovering low-complexity neural nets. It turns out that this approach can unmix an unknown number of independent data sources by extracting a minimal number of low-complexity features necessary for representing the data. Experiments show: unlike codes obtained with standard autoencoders, lococodes are based on feature detectors, never unstructured, usually sparse, sometimes factorial or local (depending on statistical properties of the data). Although Lococode is not explicitly designed to enforce sparse or factorial codes, it extracts optimal codes for difficult versions of the "bars" benchmark problem, whereas ICA and PCA do not. It also produces familiar, biologically plausible feature detectors when applied to real world images. As a preprocessor for a vowel recognition benchmark problem it sets the stage for excellent classification performance. Our results reveal an interesting, previously ignored connection between two important fields: regularizer research, and ICA-related research.

