Results 1 - 10 of 176
Metalearning and neuromodulation, 2002
"... This paper presents a computational theory on the roles of the ascending neuromodulatory systems from the viewpoint that they mediate the global signals that regulate the distributed learning mechanisms in the brain. Based on the review of experimental data and theoretical models, it is proposed tha ..."
Abstract
-
Cited by 96 (4 self)
This paper presents a computational theory of the roles of the ascending neuromodulatory systems, from the viewpoint that they mediate the global signals that regulate the distributed learning mechanisms in the brain. Based on a review of experimental data and theoretical models, it is proposed that dopamine signals the error in reward prediction, serotonin controls the time scale of reward prediction, noradrenaline controls the randomness in action selection, and acetylcholine controls the speed of memory update. The possible interactions between those neuromodulators and the environment are predicted on the basis of the computational theory of metalearning.
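The proposed mapping can be read as a set of reinforcement-learning metaparameters. Below is a minimal tabular TD sketch in Python in which each metaparameter is named after the neuromodulator it is proposed to correspond to; the agent interface and all numerical values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Proposed correspondence: dopamine -> TD error (delta), serotonin -> discount
# factor (gamma), noradrenaline -> action randomness (inverse temperature beta),
# acetylcholine -> learning rate (alpha).
class MetaparamTDAgent:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9, beta=2.0):
        self.Q = np.zeros((n_states, n_actions))
        self.alpha = alpha      # acetylcholine: speed of memory update
        self.gamma = gamma      # serotonin: time scale of reward prediction
        self.beta = beta        # noradrenaline: randomness of action selection

    def act(self, s, rng):
        prefs = self.beta * self.Q[s]
        p = np.exp(prefs - prefs.max())
        p /= p.sum()
        return rng.choice(len(p), p=p)      # softmax (Boltzmann) action selection

    def update(self, s, a, r, s_next):
        # dopamine: error in reward prediction
        delta = r + self.gamma * self.Q[s_next].max() - self.Q[s, a]
        self.Q[s, a] += self.alpha * delta
        return delta
```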
Multiple model-based reinforcement learning - Neural Computation, 2002
"... We propose a modular reinforcement learning architecture for non-linear, non-stationary control tasks, which we call multiple model-based reinforcement learn-ing (MMRL). The basic idea is to decompose a complex task into multiple domains in space and time based on the predictability of the environme ..."
Abstract
-
Cited by 85 (5 self)
We propose a modular reinforcement learning architecture for non-linear, non-stationary control tasks, which we call multiple model-based reinforcement learning (MMRL). The basic idea is to decompose a complex task into multiple domains in space and time based on the predictability of the environmental dynamics. The system is composed of multiple modules, each of which consists of a state-prediction model and a reinforcement learning controller. The “responsibility signal,” which is given by the softmax function of the prediction errors, is used to weight the outputs of the multiple modules as well as to gate the learning of the prediction models and the reinforcement learning controllers. We formulate MMRL for both the discrete-time, finite-state case and the continuous-time, continuous-state case. The performance of MMRL is demonstrated in the discrete case on a non-stationary hunting task in a grid world, and in the continuous case on a non-linear, non-stationary control task of swinging up a pendulum with variable physical parameters.
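The central quantity in this architecture is the responsibility signal. A minimal sketch of how it can be computed and used, assuming a Gaussian form for converting prediction errors into responsibilities; the module interface and all numbers are illustrative, not the paper's.

```python
import numpy as np

def responsibilities(pred_errors, sigma=1.0):
    """Softmax over (negative squared) prediction errors: the module whose
    state-prediction model best matches the current dynamics gets a
    responsibility close to 1."""
    scores = -0.5 * (np.asarray(pred_errors, float) / sigma) ** 2
    scores -= scores.max()                  # numerical stability
    lam = np.exp(scores)
    return lam / lam.sum()

# Three modules; module 1 predicts the current environment best.
errors = [0.9, 0.1, 1.5]                    # prediction error per module
lam = responsibilities(errors)

module_actions = np.array([0.2, -0.4, 1.0])
action = float(lam @ module_actions)        # responsibility-weighted output
learning_rates = 0.1 * lam                  # responsibility also gates learning
```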
PILCO: A Model-Based and Data-Efficient Approach to Policy Search - In Proceedings of the International Conference on Machine Learning, 2011
"... In this paper, we introduce pilco, a practical, data-efficient model-based policy search method. Pilco reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way. By learning a probabilistic dynamics model and explicitly incorporating model uncertainty int ..."
Abstract
-
Cited by 84 (15 self)
In this paper, we introduce PILCO, a practical, data-efficient model-based policy search method. PILCO reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way. By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning, PILCO can cope with very little data and facilitates learning from scratch in only a few trials. Policy evaluation is performed in closed form using state-of-the-art approximate inference. Furthermore, policy gradients are computed analytically for policy improvement. We report unprecedented learning efficiency on challenging and high-dimensional control tasks.
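PILCO proper evaluates the policy in closed form by moment matching through a Gaussian-process dynamics model and differentiates that evaluation analytically. The sketch below only illustrates the first ingredient, a probabilistic dynamics model whose uncertainty is propagated into policy evaluation, and substitutes Monte Carlo sampling for the analytic moment matching; scikit-learn's GaussianProcessRegressor, the toy task, and all data are assumptions for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def cost(x):
    return float(np.sum(x ** 2))               # illustrative quadratic state cost

def evaluate_policy(gp, policy, x0, horizon=20, n_samples=50, rng=None):
    """Expected cost of a policy under a learned probabilistic dynamics model.
    PILCO does this analytically by moment matching; here we sample instead."""
    rng = rng or np.random.default_rng(0)
    total = 0.0
    for _ in range(n_samples):
        x = x0.copy()
        for _ in range(horizon):
            u = policy(x)
            z = np.concatenate([x, u])[None, :]
            mean, std = gp.predict(z, return_std=True)
            x = mean[0] + std[0] * rng.standard_normal(mean.shape[1])  # sample next state
            total += cost(x)
    return total / n_samples

# Fit the dynamics model on (state, action) -> next-state transitions (made-up data).
X = np.random.randn(100, 3)                    # [state(2), action(1)] inputs
Y = np.random.randn(100, 2)                    # observed next states
gp = GaussianProcessRegressor().fit(X, Y)

policy = lambda x: np.array([-0.5 * x[0]])     # linear policy, parameters to be improved
J = evaluate_policy(gp, policy, x0=np.array([1.0, 0.0]))
```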
Dopamine: generalization and bonuses, 2002
"... In the temporal difference model of primate dopamine neurons, their phasic activity reports a prediction error for future reward. This model is supported by a wealth of experimental data. However, in certain circumstances, the activity of the dopamine cells seems anomalous under the model, as they r ..."
Abstract
-
Cited by 52 (3 self)
In the temporal difference model of primate dopamine neurons, their phasic activity reports a prediction error for future reward. This model is supported by a wealth of experimental data. However, in certain circumstances, the activity of the dopamine cells seems anomalous under the model, as they respond in particular ways to stimuli that are not obviously related to predictions of reward. In this paper, we address two important sets of anomalies, those having to do with generalization and novelty. Generalization responses are treated as the natural consequence of partial information; novelty responses are treated by the suggestion that dopamine cells multiplex information about reward bonuses, including exploration bonuses and shaping bonuses. We interpret this additional role for dopamine in terms of its mechanistic attentional and psychomotor effects, which play the computational role of guiding exploration.
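The bonus idea can be written as an ordinary temporal-difference update on an augmented reward, with the augmented prediction error standing in for the phasic dopamine signal. A minimal sketch; the count-based form of the novelty/exploration bonus and all constants are illustrative assumptions, not the paper's.

```python
import numpy as np

n_states = 10
V = np.zeros(n_states)
visits = np.zeros(n_states)
alpha, gamma, bonus_scale = 0.1, 0.95, 0.5

def td_step(s, r, s_next):
    """One TD(0) update where the 'dopamine' signal reports the prediction
    error for reward plus a novelty/exploration bonus for the new state."""
    visits[s_next] += 1
    bonus = bonus_scale / np.sqrt(visits[s_next])   # large for novel states, decays
    delta = (r + bonus) + gamma * V[s_next] - V[s]  # TD error on augmented reward
    V[s] += alpha * delta
    return delta
```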
Reinforcement learning for imitating constrained reaching movements - RSJ Advanced Robotics, 2007
"... The goal of developing algorithms for programming robots by demonstration is to create an easy way of programming robots such that it can be accomplished by anyone. When a demonstrator teaches a task to a robot, he/she shows some ways of fulfilling the task, but not all the possibilities. The robot ..."
Abstract
-
Cited by 51 (10 self)
The goal of developing algorithms for programming robots by demonstration is to create a way of programming robots that is easy enough to be used by anyone. When a demonstrator teaches a task to a robot, he or she shows some ways of fulfilling the task, but not all the possibilities. The robot must then be able to reproduce the task even when unexpected perturbations occur, in which case it has to learn a new solution. In this paper, we describe a system for teaching constrained reaching tasks to a robot. Our system is based on a dynamical system generator modulated by a learned speed trajectory. This generator is combined with a reinforcement learning module that allows the robot to adapt the trajectory when facing a new situation, for example in the presence of obstacles.
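A rough sketch of the kind of architecture the abstract describes: a point-attractor dynamical system drives the end-effector toward the target, a learned speed profile modulates it, and a simple stochastic search (standing in for the reinforcement learning module) adapts that profile when the nominal trajectory would hit an obstacle. The spring-damper form, the cost function, and the search procedure are all assumptions for illustration.

```python
import numpy as np

def rollout(speed_profile, x0, goal, dt=0.01, k=25.0, d=10.0):
    """Spring-damper attractor toward the goal, time-modulated by speed_profile."""
    x = np.array(x0, float)
    v = np.zeros_like(x)
    traj = [x.copy()]
    for s in speed_profile:
        a = k * (goal - x) - d * v
        v += s * a * dt                     # learned speed gain scales the dynamics
        x += s * v * dt
        traj.append(x.copy())
    return np.array(traj)

def cost(traj, goal, obstacle, radius=0.2):
    hit = np.any(np.linalg.norm(traj - obstacle, axis=1) < radius)
    return np.linalg.norm(traj[-1] - goal) + (10.0 if hit else 0.0)

# Crude stochastic search over the speed profile (stand-in for the RL module).
rng = np.random.default_rng(0)
goal, obstacle = np.array([1.0, 1.0]), np.array([0.5, 0.5])
profile = np.ones(200)
best = cost(rollout(profile, [0.0, 0.0], goal), goal, obstacle)
for _ in range(200):
    cand = np.clip(profile + 0.05 * rng.standard_normal(profile.shape), 0.1, 2.0)
    c = cost(rollout(cand, [0.0, 0.0], goal), goal, obstacle)
    if c < best:
        profile, best = cand, c
```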
Temporal sequence learning, prediction and control - a review of different models and their relation to biological mechanisms - Neural Computation, 2004
"... In this article we compare methods for temporal sequence learning (TSL) across the disciplines machine-control, classical conditioning, neuronal models for TSL as well as spiketiming dependent plasticity. This review will briefly introduce the most influential models and focus on two questions: 1) T ..."
Abstract
-
Cited by 47 (4 self)
In this article we compare methods for temporal sequence learning (TSL) across the disciplines of machine control, classical conditioning, neuronal models of TSL, and spike-timing-dependent plasticity. This review briefly introduces the most influential models and focuses on two questions: 1) To what degree are reward-based (e.g., TD learning) and correlation-based (Hebbian) learning related? 2) How do the different models correspond to the biological mechanisms of synaptic plasticity that may underlie them? We first compare the different models in an open-loop condition, where behavioral feedback does not alter the learning. Here we observe that reward-based and correlation-based learning are indeed very similar. Machine control is then used to introduce the problem of closed-loop control (e.g., “actor-critic architectures”). Here the problem of evaluative (“rewards”) versus non-evaluative (“correlations”) feedback from the environment is discussed, showing that the two learning approaches are fundamentally different in the closed-loop condition. In trying to answer the second question, we compare neuronal versions of the different learning architectures to the anatomy of the involved brain structures (basal ganglia, thalamus and …
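The review's first question can be made concrete by writing the two update rules side by side: a reward-based TD-like rule driven by a prediction error, and a correlation-based (differential Hebbian) rule driven by the product of the input with the temporal change of the output. A minimal open-loop sketch; the signal shapes and constants are illustrative, and the signals are chosen to overlap in time so that a single weight suffices.

```python
import numpy as np

T, dt = 300, 0.01
t = np.arange(T) * dt
x = np.exp(-((t - 1.0) ** 2) / 0.08)        # earlier, "predictive" input
y = np.exp(-((t - 1.4) ** 2) / 0.08)        # later signal (reward / postsynaptic drive)
alpha = 0.05

# Reward-based (TD-like, gamma = 1): the weight is driven by a prediction error.
w_td = 0.0
for k in range(T - 1):
    delta = y[k] + w_td * x[k + 1] - w_td * x[k]   # one-step TD error
    w_td += alpha * delta * x[k]

# Correlation-based (differential Hebbian): the weight is driven by the
# correlation of the input with the temporal change of the output.
w_hebb = 0.0
for k in range(T - 1):
    out = w_hebb * x[k] + y[k]
    out_next = w_hebb * x[k + 1] + y[k + 1]
    w_hebb += alpha * x[k] * (out_next - out)

# In this open-loop setting both weights grow when x precedes y, which is the
# similarity the review points out; the differences emerge in closed loop.
```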
Spike-Timing-Dependent Hebbian Plasticity as Temporal Difference Learning, 2001
"... A spike-timing-dependent Hebbian mechanism governs the plasticity of recurrent excitatory synapses in the neocortex: synapses that are activated a few milliseconds before a postsynaptic spike are potentiated, while those that are activated a few milliseconds after are depressed. We show that such a ..."
Abstract
-
Cited by 43 (1 self)
A spike-timing-dependent Hebbian mechanism governs the plasticity of recurrent excitatory synapses in the neocortex: synapses that are activated a few milliseconds before a postsynaptic spike are potentiated, while those that are activated a few milliseconds after are depressed. We show that such a mechanism can implement a form of temporal difference learning for prediction of input sequences. Using a biophysical model of a cortical neuron, we show that a temporal difference rule used in conjunction with dendritic backpropagating action potentials reproduces the temporally asymmetric window of Hebbian plasticity observed physiologically. Furthermore, the size and shape of the window vary with the distance of the synapse from the soma. Using a simple example, we show how a spike-timing-based temporal difference learning rule can allow a network of neocortical neurons to predict an input a few milliseconds before the input’s expected arrival.
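The paper's central relationship can be illustrated with a toy calculation that leaves out the biophysical neuron model: correlate a presynaptic input with the temporal difference of the postsynaptic signal evoked by a (back-propagating) spike, and an asymmetric STDP-like window falls out. The exponential waveforms and time constants below are illustrative assumptions, not the authors' model.

```python
import numpy as np

dt = 1.0                                    # ms
t = np.arange(-100, 100, dt)

def kernel(t_onset, tau):
    """Stylized signal switched on at t_onset and decaying exponentially."""
    u = np.zeros_like(t)
    mask = t >= t_onset
    u[mask] = np.exp(-(t[mask] - t_onset) / tau)
    return u

def weight_change(delta_t, alpha=1.0):
    """TD-like rule: presynaptic input correlated with the temporal
    difference of the postsynaptic signal (spike at t = 0)."""
    post = kernel(0.0, tau=20.0)            # postsynaptic depolarization from the spike
    pre = kernel(delta_t, tau=5.0)          # presynaptic input arriving at delta_t
    d_post = np.diff(post, append=post[-1]) / dt
    return alpha * np.sum(pre * d_post) * dt

# Pre before post (delta_t < 0) -> potentiation; pre after post -> depression,
# i.e. a temporally asymmetric window.
window = [weight_change(dtau) for dtau in np.arange(-50, 50, 5)]
```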
Reconciling reinforcement learning models with behavioral extinction and renewal: Implications for addiction, relapse, and problem gambling - Psychological Review, 2007
"... Because learned associations are quickly renewed following extinction, the extinction process must include processes other than unlearning. However, reinforcement learning models, such as the temporal difference reinforcement learning (TDRL) model, treat extinction as an unlearning of associated val ..."
Abstract
-
Cited by 41 (4 self)
Because learned associations are quickly renewed following extinction, the extinction process must include processes other than unlearning. However, reinforcement learning models, such as the temporal difference reinforcement learning (TDRL) model, treat extinction as an unlearning of associated value and are thus unable to capture renewal. TDRL models are based on the hypothesis that dopamine carries a reward prediction error signal; these models predict reward by driving that reward error to zero. The authors construct a TDRL model that can accommodate extinction and renewal through two simple processes: (a) a TDRL process that learns the value of situation–action pairs and (b) a situation recognition process that categorizes the observed cues into situations. This model has implications for dysfunctional states, including relapse after addiction and problem gambling.
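A minimal sketch of the two-process idea, not the authors' full model: a TDRL-style learner attaches values to (situation, action) pairs, and a separate classifier decides, from the observed cues, whether the current observations still belong to a known situation or require a new one. Because extinction spawns a new situation rather than overwriting the old values, returning to the original context renews the old association. The distance-based classifier, thresholds, and update rule below are assumptions for illustration.

```python
import numpy as np

class SituationTDRL:
    def __init__(self, n_actions, new_situation_threshold=1.5, alpha=0.2):
        self.prototypes = []                 # one cue prototype per recognized situation
        self.Q = []                          # one value table per situation
        self.thr, self.alpha = new_situation_threshold, alpha
        self.n_actions = n_actions

    def classify(self, cues):
        """Assign the cue vector to the nearest known situation, or create one."""
        cues = np.asarray(cues, float)
        if self.prototypes:
            d = [np.linalg.norm(cues - p) for p in self.prototypes]
            i = int(np.argmin(d))
            if d[i] < self.thr:
                return i
        self.prototypes.append(cues)
        self.Q.append(np.zeros(self.n_actions))
        return len(self.Q) - 1

    def update(self, situation, action, reward):
        delta = reward - self.Q[situation][action]   # prediction error for this pair
        self.Q[situation][action] += self.alpha * delta
        return delta

# Acquisition happens in context A; extinction shifts the cues enough to spawn a
# new situation; returning to A's cues recovers the original, non-extinguished value.
```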
A mechanism for error detection in speeded response time tasks - Journal of Experimental Psychology: General, 2005
"... The concept of error detection plays a central role in theories of executive control. In this article, the authors present a mechanism that can rapidly detect errors in speeded response time tasks. This error monitor assigns values to the output of cognitive processes involved in stimulus categoriza ..."
Abstract
-
Cited by 37 (11 self)
The concept of error detection plays a central role in theories of executive control. In this article, the authors present a mechanism that can rapidly detect errors in speeded response time tasks. This error monitor assigns values to the output of cognitive processes involved in stimulus categorization and response generation and detects errors by identifying states of the system associated with negative value. The mechanism is formalized in a computational model based on a recent theoretical framework for understanding error processing in humans (C. B. Holroyd & M. G. H. Coles, 2002). The model is used to simulate behavioral and event-related brain potential data in a speeded response time task, and the results of the simulation are compared with empirical data.
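The monitoring idea described here, assigning value to the system's internal categorization/response states and flagging a sudden drop in that value as an internally detected error, can be sketched in a few lines. This is a loose illustration of the idea, not the authors' computational model; the state coding, learning rule, and threshold are assumptions.

```python
import numpy as np

class ErrorMonitor:
    """Flags an error when the value of the current internal state drops
    sharply relative to the previous one (a negative prediction error),
    before any external feedback arrives."""
    def __init__(self, n_states, alpha=0.1, threshold=-0.3):
        self.V = np.zeros(n_states)
        self.alpha = alpha
        self.threshold = threshold

    def step(self, prev_state, state):
        delta = self.V[state] - self.V[prev_state]   # change in expected outcome
        return delta < self.threshold                # True = internal error signal

    def learn(self, state, outcome):
        # Outcome is +1 for correct, -1 for error; value tracks the expected outcome.
        self.V[state] += self.alpha * (outcome - self.V[state])
```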
Isotropic Sequence Order Learning, 2003
"... In this article, we present an isotropic unsupervised algorithm for temporal sequence learning. Nospecial reward signal is used such that all inputs are completely isotropic. All input signals are bandpass filtered before converging onto a linear output neuron. All synaptic weights change according ..."
Abstract
-
Cited by 24 (15 self)
In this article, we present an isotropic unsupervised algorithm for temporal sequence learning. No special reward signal is used, so that all inputs are completely isotropic. All input signals are bandpass filtered before converging onto a linear output neuron. All synaptic weights change according to the correlation of the bandpass-filtered inputs with the derivative of the output. We investigate the algorithm in an open- and a closed-loop condition, the latter being defined by embedding the learning system in a behavioral feedback loop. In the open-loop condition, we find that the linear structure of the algorithm allows us to calculate analytically the shape of the weight change, which is strictly heterosynaptic and follows the shape of the weight-change curves found in spike-timing-dependent plasticity. Furthermore, we show that synaptic weights stabilize automatically, without additional normalizing measures, once no more temporal differences exist between the inputs. In the second part of this study, the algorithm is placed in an environment that leads to a closed sensorimotor loop. To this end, a robot is programmed with a prewired retraction reflex in response to collisions. Through isotropic sequence order (ISO) learning, the robot achieves collision avoidance by learning the correlation between its early range-finder signals and the later-occurring collision signal. Synaptic weights stabilize at the end of learning, as theoretically predicted. Finally, we discuss the relation of ISO learning to other drive-reinforcement models and to the commonly used temporal difference learning algorithm. This study is followed up by a mathematical analysis of the closed-loop situation in the companion article in this issue, “ISO Learning Approximates a Solution to the Inverse-Controller Problem in an Unsupervised Behavioral Paradigm” (pp. 865–884).
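The learning rule summarized here has a compact form: with bandpass-filtered inputs u_i and linear output v = Σ_i w_i u_i, each weight changes as dw_i/dt = μ u_i dv/dt. A minimal open-loop sketch, assuming a damped second-order resonator as the bandpass filter and a fixed-weight "reflex" input x0 preceded by a predictive input x1; all constants are illustrative.

```python
import numpy as np

def bandpass(x, f=0.01, q=0.6, dt=1.0):
    """Damped second-order resonator: smears each input pulse out in time."""
    y = np.zeros_like(x)
    v = 0.0
    w0 = 2 * np.pi * f
    for k in range(1, len(x)):
        a = x[k] - 2 * q * w0 * v - (w0 ** 2) * y[k - 1]
        v += a * dt
        y[k] = y[k - 1] + v * dt
    return y

T, dt, mu = 500, 1.0, 0.001
x0 = np.zeros(T); x0[100] = 1.0             # later "reflex" input, weight fixed at 1
x1 = np.zeros(T); x1[80] = 1.0              # earlier predictive input, weight learned

u0, u1 = bandpass(x0), bandpass(x1)
w1, v_prev = 0.0, 0.0
for k in range(T):
    v = 1.0 * u0[k] + w1 * u1[k]            # linear output neuron
    dv = (v - v_prev) / dt
    w1 += mu * u1[k] * dv                   # ISO rule: dw_i/dt = mu * u_i * dv/dt
    v_prev = v
# w1 ends up positive because x1 precedes x0, and it stops changing once the
# filtered inputs no longer produce temporal differences in the output.
```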