Results 1–10 of 30
Toward a method of selecting among computational models of cognition
Psychological Review, 2002
Cited by 75 (4 self)
The question of how one should decide among competing explanations of data is at the heart of the scientific enterprise. Computational models of cognition are increasingly being advanced as explanations of behavior. The success of this line of inquiry depends on the development of robust methods to guide the evaluation and selection of these models. This article introduces a method of selecting among mathematical models of cognition known as minimum description length, which provides an intuitive and theoretically well-grounded understanding of why one model should be chosen. A central but elusive concept in model selection, complexity, can also be derived with the method. The adequacy of the method is demonstrated in 3 areas of cognitive modeling: psychophysics, information integration, and categorization.

How should one choose among competing theoretical explanations of data? This question is at the heart of the scientific enterprise, regardless of whether verbal models are being tested in an experimental setting or computational models are being evaluated in simulations. A number of criteria have been proposed to assist in this endeavor, summarized nicely by Jacobs and Grainger ...
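The two-part intuition behind minimum description length (total cost = cost of encoding the residuals plus cost of encoding the parameters) can be sketched numerically. The score below is a crude BIC-style stand-in, not the refined criterion the article develops, and the data, polynomial degrees, and constants are illustrative assumptions:

```python
import numpy as np

def mdl_score(y, y_hat, k):
    """Crude two-part description length: a data-fit term plus a
    complexity penalty growing with the number of free parameters k.
    Smaller is better.  (BIC-style stand-in, not the article's form.)"""
    n = len(y)
    rss = float(np.sum((y - y_hat) ** 2))
    return 0.5 * n * np.log(rss / n) + 0.5 * k * np.log(n)

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x + rng.normal(0.0, 0.1, size=x.size)  # data from a linear "model"

scores = {}
for degree in (1, 9):                     # simple vs. flexible polynomial
    coeffs = np.polyfit(x, y, degree)
    fit = np.polyval(coeffs, x)
    scores[degree] = mdl_score(y, fit, k=degree + 1)

best = min(scores, key=scores.get)        # the flexible model fits slightly
print(best, scores)                       # better but pays a larger penalty
```

On linear data the degree-9 polynomial achieves a marginally smaller residual, but its extra parameters cost more description length than they save, so the simpler model wins.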
Algorithmic Theories Of Everything
2000
Cited by 32 (15 self)
The probability distribution P from which the history of our universe is sampled represents a theory of everything or TOE. We assume P is formally describable. Since most (uncountably many) distributions are not, this imposes a strong inductive bias. We show that P(x) is small for any universe x lacking a short description, and study the spectrum of TOEs spanned by two Ps, one reflecting the most compact constructive descriptions, the other the fastest way of computing everything. The former derives from generalizations of traditional computability, Solomonoff’s algorithmic probability, Kolmogorov complexity, and objects more random than Chaitin’s Omega, the latter from Levin’s universal search and a natural resource-oriented postulate: the cumulative prior probability of all x incomputable within time t by this optimal algorithm should be 1/t. Between both Ps we find a universal cumulatively enumerable measure that dominates traditional enumerable measures; any such CEM must assign low probability to any universe lacking a short enumerating program. We derive P-specific consequences for evolving observers, inductive reasoning, quantum physics, philosophy, and the expected duration of our universe.
Feature extraction through LOCOCODE
Neural Computation, 1998
Cited by 21 (4 self)
"Low-complexity coding and decoding" (Lococode) is a novel approach to sensory coding and unsupervised learning. Unlike previous methods it explicitly takes into account the information-theoretic complexity of the code generator: it computes lococodes that (1) convey information about the input data and (2) can be computed and decoded by low-complexity mappings. We implement Lococode by training autoassociators with Flat Minimum Search, a recent, general method for discovering low-complexity neural nets. It turns out that this approach can unmix an unknown number of independent data sources by extracting a minimal number of low-complexity features necessary for representing the data. Experiments show: unlike codes obtained with standard autoencoders, lococodes are based on feature detectors, never unstructured, usually sparse, sometimes factorial or local (depending on statistical properties of the data). Although Lococode is not explicitly designed to enforce sparse or factorial codes, it extracts optimal codes for difficult versions of the "bars" benchmark problem, whereas ICA and PCA do not. It also produces familiar, biologically plausible feature detectors when applied to real-world images. As a preprocessor for a vowel recognition benchmark problem it sets the stage for excellent classification performance. Our results reveal an interesting, previously ignored connection between two important fields: regularizer research, and ICA-related research.
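Flat Minimum Search itself is involved; as a loose illustration of the "reconstruction cost plus code-generator complexity" trade-off only, the toy below trains a tied-weight linear autoassociator with a plain L2 weight penalty standing in for the low-complexity bias. All dimensions and constants are made up for the sketch, and this is not the FMS regularizer:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))            # toy input data
W = rng.normal(scale=0.1, size=(8, 3))   # tied encoder/decoder weights

def loss(W, lam=0.01):
    """Reconstruction error plus an L2 weight penalty standing in
    (very crudely) for the low-complexity bias of Flat Minimum Search."""
    X_hat = X @ W @ W.T                  # encode to 3 units, decode back
    return np.mean((X - X_hat) ** 2) + lam * np.sum(W ** 2)

def num_grad(f, W, eps=1e-5):
    """Central-difference gradient; fine at this toy scale."""
    g = np.zeros_like(W)
    for i in np.ndindex(W.shape):
        d = np.zeros_like(W)
        d[i] = eps
        g[i] = (f(W + d) - f(W - d)) / (2.0 * eps)
    return g

before = loss(W)
for _ in range(200):
    W = W - 0.05 * num_grad(loss, W)
after = loss(W)
print(before, after)
```

The penalized loss drops as the autoassociator learns a compressed 3-unit code; FMS differs in penalizing the required weight precision rather than raw weight magnitude.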
Exploring the Predictable
2002
Cited by 21 (10 self)
Details of complex event sequences are often not predictable, but their reduced abstract representations are. I study an embedded active learner that can limit its predictions to almost arbitrary computable aspects of spatiotemporal events. It constructs probabilistic algorithms that (1) control interaction with the world, (2) map event sequences to abstract internal representations (IRs), (3) predict IRs from IRs computed earlier. Its goal is to create novel algorithms generating IRs useful for correct IR predictions, without wasting time on those learned before. This requires an adaptive novelty measure which is implemented by a coevolutionary scheme involving two competing modules collectively designing (initially random) algorithms representing experiments. Using special instructions, the modules can bet on the outcome of IR predictions computed by algorithms they have agreed upon. If their opinions differ then the system checks who is right, punishes the loser (the surprised one), and rewards the winner. An evolutionary or reinforcement learning algorithm forces each module to maximize reward. This motivates both modules to lure each other into agreeing upon experiments involving predictions that surprise the other. Since each module essentially can veto experiments it does not consider profitable, the system is motivated to focus on those computable aspects of the environment where both modules still have confident but different opinions. Once both share the same opinion on a particular issue (via the loser's learning process, e.g., the winner is simply copied onto the loser), the winner loses a source of reward, an incentive to shift the focus of interest onto novel experiments. My simulations include an example where surprise generation of this kind helps to speed up ...
Initialization and Optimization of Multilayered Perceptrons
3rd Conf. on Neural Networks and Their Applications, 1997
Cited by 14 (10 self)
Despite all the progress in the neural networks field, the technology is brittle and sometimes difficult to apply. Good initialization of adaptive parameters in neural networks and optimization of architecture are the key factors in creating robust neural networks. Methods of initialization of MLPs are reviewed and new methods based on clustering techniques are suggested. A penalty term added to the error function leads to optimized, small, and accurate networks.

I. Introduction

Finding the global minimum of a nonlinear function with many parameters is an NP-hard problem [1]. Learning in neural networks is most frequently based on minimization of a cost function. Good initialization of adaptive parameters may enable finding solutions in complex, real-world problems and may significantly decrease learning time. Subsequent optimization should lead to compact networks capable of good generalization. In this paper methods of initialization and optimization of the multilayer perceptrons (MLPs) used f...
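The abstract mentions clustering-based initialization without detail. One common scheme (whether it matches this paper's method is an assumption) places each first-layer unit's incoming weight vector at a cluster center of the input data, so the units start out responsible for distinct regions rather than at random orientations:

```python
import numpy as np

def kmeans_centers(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns k cluster centers of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center, then recompute means
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

rng = np.random.default_rng(42)
# two well-separated toy blobs in 2-D input space
X = np.vstack([rng.normal(-3.0, 0.3, (50, 2)),
               rng.normal(3.0, 0.3, (50, 2))])

# hypothetical init: one hidden unit's weight vector per cluster center
W_hidden = kmeans_centers(X, k=2)
print(W_hidden.shape)
```

Compared with uniform random initialization, such data-driven starting points tend to shorten the early phase of training in which hidden units drift toward the populated regions of input space.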
Learning To Learn Using Gradient Descent
In Lecture Notes in Computer Science 2130, Proc. Intl. Conf. on Artificial Neural Networks (ICANN 2001), 2001
Cited by 13 (1 self)
This paper introduces the application of gradient descent methods to meta-learning. The concept of "meta-learning", i.e. of a system that improves or discovers a learning algorithm, has been of interest in machine learning for decades because of its appealing applications. Previous ...
An Efficient MDL-Based Construction of RBF Networks
1998
Cited by 12 (2 self)
We propose a method for optimizing the complexity of Radial Basis Function (RBF) networks. The method involves two procedures: adaptation (training) and selection. The first procedure adaptively changes the locations and widths of the basis functions and trains the linear weights. The selection procedure performs the elimination of the redundant basis functions using an objective function based on the Minimum Description Length (MDL) principle. By iteratively combining these two procedures we achieve a controlled way of training and modifying RBF networks, which balances accuracy, training time, and complexity of the resulting network. We test the proposed method on function approximation and classification tasks, and compare it to some other recently proposed methods.
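The selection step can be illustrated with a toy two-part score: a basis function is kept only if the fit improvement it buys outweighs its description cost. The score, data, and widths below are illustrative stand-ins for the paper's actual MDL objective; dropping an exactly redundant (duplicated) center must lower the score, because the residual cannot get worse while the parameter penalty shrinks:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 2.0, 60)
y = np.sin(2.0 * x) + rng.normal(0.0, 0.1, size=x.size)

def design(centers, width=0.5):
    """Gaussian RBF design matrix: one column per basis function."""
    c = np.asarray(centers)
    return np.exp(-((x[:, None] - c[None]) ** 2) / (2.0 * width ** 2))

def mdl_score(centers):
    """Two-part score: data-fit cost plus a per-parameter penalty.
    A crude stand-in for the paper's MDL objective; smaller is better."""
    Phi = design(centers)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    rss = float(np.sum((y - Phi @ w) ** 2))
    n, k = len(y), len(centers)
    return 0.5 * n * np.log(rss / n) + 0.5 * k * np.log(n)

full = [0.3, 1.0, 1.0, 1.7]   # third basis function duplicates the second
pruned = [0.3, 1.0, 1.7]
print(mdl_score(full), mdl_score(pruned))
```

Iterating "adapt centers and weights, then prune whichever basis function's removal lowers the score" gives the kind of controlled growth/shrinkage loop the abstract describes.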
The Vanishing Gradient Problem during Learning . . .
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems
Cited by 10 (0 self)
... In this article the decaying error flow is theoretically analyzed. Then, methods that try to overcome vanishing gradients are briefly discussed. Finally, experiments comparing conventional algorithms and alternative methods are presented. With advanced methods, long time lag problems can be solved in reasonable time.
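The decay analyzed in the article can be reproduced in a few lines: error backpropagated through a sigmoid unit is scaled by w·σ′(net) at every time step, and since σ′ ≤ 0.25, the product shrinks exponentially once |w| is moderate. The single-unit recurrence, weight, and horizon below are arbitrary toy choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Unroll a 1-unit recurrent net for 50 steps and track how much a change
# in the initial activation still influences the current one:
# d a_t / d a_0 = prod over steps of w * sigmoid'(net).
w = 1.0
a = 0.5
grad = 1.0
grads = []
for t in range(50):
    net = w * a
    a = sigmoid(net)
    grad *= w * a * (1.0 - a)   # chain rule through this step; <= 0.25 * w
    grads.append(abs(grad))

print(grads[0], grads[-1])       # the influence decays exponentially in t
```

After 50 steps the gradient magnitude is tens of orders of magnitude smaller than at the first step, which is why plain gradient descent fails on long time lags.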
Lococode Performs Nonlinear ICA Without Knowing The Number Of Sources
1999
Cited by 7 (1 self)
Low-complexity coding and decoding (Lococode), a novel approach to sensory coding, trains autoassociators (AAs) by Flat Minimum Search (FMS), a recent general method for finding low-complexity networks with high generalization capability. FMS works by minimizing both training error and required weight precision. We find that as a by-product Lococode separates nonlinear superpositions of sources without knowing their number. Assuming that the input data can be reduced to a few simple causes (this is often the case with visual data), according to our theoretical analysis the hidden layer of an FMS-trained AA tends to code each input by a sparse code based on as few simple, independent features as possible. In experiments Lococode extracts optimal codes for difficult, nonlinear versions of the "noisy bars" benchmark problem, while traditional ICA and PCA do not.