Reinforcement learning in continuous time and space (2000)

by K Doya
Venue: Neural Comput

Results 1 - 10 of 61

Multiple model-based reinforcement learning

by Kenji Doya, Kazuyuki Samejima - Neural Computation, 2002
Cited by 32 (1 self)
We propose a modular reinforcement learning architecture for non-linear, non-stationary control tasks, which we call multiple model-based reinforcement learning (MMRL). The basic idea is to decompose a complex task into multiple domains in space and time based on the predictability of the environmental dynamics. The system is composed of multiple modules, each of which consists of a state prediction model and a reinforcement learning controller. The “responsibility signal,” which is given by the softmax function of the prediction errors, is used to weight the outputs of multiple modules as well as to gate the learning of the prediction models and the reinforcement learning controllers. We formulate MMRL for both the discrete-time, finite-state case and the continuous-time, continuous-state case. The performance of MMRL was demonstrated for the discrete case in a non-stationary hunting task in a grid world and for the continuous case in a non-linear, non-stationary control task of swinging up a pendulum with variable physical parameters.
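
The responsibility signal described in this abstract can be illustrated with a minimal Python sketch. It assumes a Gaussian-style softmax over negative squared prediction errors with a hypothetical width parameter sigma; the function names, the exact scaling, and the scalar outputs are assumptions made for illustration, not details given in the listing.

```python
import math


def responsibility(errors, sigma=1.0):
    """Softmax over negative scaled squared prediction errors (assumed form).

    A module whose prediction error is smaller receives a larger weight.
    sigma is a hypothetical width parameter, not taken from the abstract.
    """
    scores = [-(e ** 2) / (2.0 * sigma ** 2) for e in errors]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def blend(outputs, weights):
    """Weight the (scalar) outputs of the modules by their responsibilities."""
    return sum(w * u for w, u in zip(weights, outputs))


# Three modules with different prediction errors and control outputs.
weights = responsibility([0.1, 1.0, 2.0])
action = blend([0.5, -0.2, 1.0], weights)
```

The same weights could also scale each module's learning rate, which is one way to read the "gating" of learning mentioned in the abstract.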

Spike-Timing-Dependent Hebbian Plasticity as Temporal Difference Learning

by Rajesh P.N. Rao, Terrence J. Sejnowski, 2001
Cited by 25 (0 self)
In this article, we explore the hypothesis that recurrent excitation in neocortical circuits subserves the function of prediction and generation of temporal sequences (for related ideas, see Jordan, 1986; Elman, 1990; Minai & Levy, 1993; Montague & Sejnowski, 1994; Abbott & Blum, 1996; Rao & Ballard, 1997; Barlow, 1998; Westerman, Northmore, & Elias, 1999). In particular, we show that a temporal-difference-based learning rule for prediction (Sutton, 1988), when applied to backpropagating action potentials in dendrites, reproduces the temporally asymmetric window of Hebbian plasticity obtained in physiological experiments (see section 3). We examine the stability of the learning rule in section 4 and discuss possible biophysical mechanisms for implementing this rule in section 5. We also provide a simple example demonstrating how such a learning mechanism may allow cortical networks to learn to predict their inputs using recurrent excitation. The model predicts that cortical neurons may employ different temporal windows of plasticity at different dendritic locations to allow them to capture correlations between pre- and postsynaptic activity at different timescales (see section 6). A preliminary report of this work appeared as Rao and Sejnowski (2000).

Reinforcement learning for imitating constrained reaching movements

by Florent Guenter, Micha Hersch, Sylvain Calinon, Aude Billard - RSJ Advanced Robotics, 2007
Cited by 24 (6 self)
The goal of developing algorithms for programming robots by demonstration is to create an easy way of programming robots such that it can be accomplished by anyone. When a demonstrator teaches a task to a robot, he/she shows some ways of fulfilling the task, but not all the possibilities. The robot must then be able to reproduce the task even when unexpected perturbations occur. In this case, it has to learn a new solution. In this paper, we describe a system for teaching the robot constrained reaching tasks. Our system is based on a dynamical system generator modulated by a learned speed trajectory. This system is combined with a reinforcement learning module to allow the robot to adapt the trajectory when facing a new situation, for example in the presence of obstacles.

Temporal sequence learning, prediction and control - a review of different models and their relation to biological mechanisms

by Florentin Wörgötter, Bernd Porr - Neural Computation, 2004
Cited by 17 (3 self)
In this article we compare methods for temporal sequence learning (TSL) across the disciplines of machine control, classical conditioning, neuronal models for TSL, and spike-timing-dependent plasticity. This review will briefly introduce the most influential models and focus on two questions: 1) To what degree are reward-based (e.g. TD-learning) and correlation-based (Hebbian) learning related? and 2) How do the different models correspond to possibly underlying biological mechanisms of synaptic plasticity? We will first compare the different models in an open-loop condition, where behavioral feedback does not alter the learning. Here we observe that reward-based and correlation-based learning are indeed very similar. Machine control is then used to introduce the problem of closed-loop control (e.g. “actor-critic architectures”). Here the problem of evaluative (“rewards”) versus nonevaluative (“correlations”) feedback from the environment will be discussed, showing that both learning approaches are fundamentally different in the closed-loop condition. In trying to answer the second question we will compare neuronal versions of the different learning architectures to the anatomy of the involved brain structures (basal-ganglia, thalamus and

epsilon-MDPs: Learning in Varying Environments

by István Szita, Bálint Takács, András Lörincz, Sridhar Mahadevan, 2002
Cited by 13 (4 self)
In this paper ε-MDP-models are introduced and convergence theorems are proven using the generalized MDP framework of Szepesvari and Littman. Using this model family, we show that Q-learning is capable of finding near-optimal policies in varying environments.

Isotropic Sequence Order Learning

by Bernd Porr, Florentin Wörgötter, 2003
Cited by 12 (8 self)
In this article, we present an isotropic unsupervised algorithm for temporal sequence learning. No special reward signal is used such that all inputs are completely isotropic. All input signals are bandpass filtered before converging onto a linear output neuron. All synaptic weights change according to the correlation of bandpass-filtered inputs with the derivative of the output. We investigate the algorithm in an open- and a closed-loop condition, the latter being defined by embedding the learning system into a behavioral feedback loop. In the open-loop condition, we find that the linear structure of the algorithm allows analytically calculating the shape of the weight change, which is strictly heterosynaptic and follows the shape of the weight change curves found in spike-time-dependent plasticity. Furthermore, we show that synaptic weights stabilize automatically when no more temporal differences exist between the inputs without additional normalizing measures. In the second part of this study, the algorithm is placed in an environment that leads to a closed sensorimotor loop. To this end, a robot is programmed with a prewired retraction reflex reaction in response to collisions. Through isotropic sequence order (ISO) learning, the robot achieves collision avoidance by learning the correlation between its early range-finder signals and the later occurring collision signal. Synaptic weights stabilize at the end of learning as theoretically predicted. Finally, we discuss the relation of ISO learning with other drive reinforcement models and with the commonly used temporal difference learning algorithm. This study is followed up by a mathematical analysis of the closed-loop situation in the companion article in this issue, “ISO Learning Approximates a Solution to the Inverse-Controller Problem in an Unsupervised Behavioral Paradigm” (pp. 865–884).
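
Read as equations, the learning rule described in this abstract might be sketched as follows. The symbols are assumed here for illustration and do not come from the listing: u_i denotes the i-th bandpass-filtered input, \rho_i its synaptic weight, v the output of the linear neuron, and \mu a small learning rate.

```latex
% Linear output neuron: weighted sum of the bandpass-filtered inputs
v(t) = \sum_i \rho_i(t)\, u_i(t)

% Each weight changes with the correlation between its own input
% and the time derivative of the output
\frac{d\rho_i}{dt} = \mu \, u_i(t) \, \frac{dv}{dt}
```

Under this reading, every input is treated by the same rule, which is one way to interpret "isotropic", and the weights stop changing once the inputs no longer lead or lag the output in time.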

PILCO: A Model-Based and Data-Efficient Approach to Policy Search

by Marc Peter Deisenroth, Carl Edward Rasmussen - In Proceedings of the International Conference on Machine Learning, 2011
Cited by 11 (4 self)
In this paper, we introduce PILCO, a practical, data-efficient model-based policy search method. PILCO reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way. By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning, PILCO can cope with very little data and facilitates learning from scratch in only a few trials. Policy evaluation is performed in closed form using state-of-the-art approximate inference. Furthermore, policy gradients are computed analytically for policy improvement. We report unprecedented learning efficiency on challenging and high-dimensional control tasks.

Poincaré-map-based reinforcement learning for biped walking

by Jun Morimoto, Jun Nakanishi, Gen Endo, Gordon Cheng - Proc. IEEE Int. Conf. Robotics and Automation, ICRA’05, 2005
Cited by 10 (2 self)
We propose a model-based reinforcement learning algorithm for biped walking in which the robot learns to appropriately modulate an observed walking pattern. Via-points are detected from the observed walking trajectories using the minimum jerk criterion. The learning algorithm modulates the via-points as control actions to improve walking trajectories. This decision is based on a learned model of the Poincaré map of the periodic walking pattern. The model maps from a state in the single support phase and the control actions to a state in the next single support phase. We applied this approach to both a simulated robot model and an actual biped robot. We show that successful walking policies are acquired.

Robust Reinforcement Learning

by Jun Morimoto, Kenji Doya - Advances in Neural Information Processing Systems 13, 2001
Cited by 9 (2 self)
This paper proposes a new reinforcement learning (RL) paradigm that explicitly takes into account input disturbance as well as modeling errors. The use of environmental models in RL is quite popular for both off-line learning by simulations and for on-line action planning. However, the difference between the model and the real environment can lead to unpredictable, often unwanted results.

Feedforward Neural Networks in Reinforcement Learning Applied to High-dimensional Motor Control

by Rémi Coulom - In 13th International Conference on Algorithmic Learning Theory, 2002
Cited by 8 (0 self)
Local linear function approximators are often preferred to feedforward neural networks to estimate value functions in reinforcement learning. Still, motor tasks usually solved by methods of this kind have a low-dimensional state space. This article demonstrates that feedforward neural networks can be applied successfully to high-dimensional problems. The main difficulties of using backpropagation networks in reinforcement learning are reviewed, and a simple method to perform gradient descent efficiently is proposed. It was tested successfully on an original task of learning to swim by a complex simulated articulated robot, with 4 control variables and 12 independent state variables.