## Reinforcement Learning In Continuous Time and Space (2000)

Venue: Neural Computation

Citations: 115 (4 self)

### BibTeX

```bibtex
@ARTICLE{Doya00reinforcementlearning,
  author  = {Kenji Doya},
  title   = {Reinforcement Learning In Continuous Time and Space},
  journal = {Neural Computation},
  year    = {2000},
  volume  = {12},
  pages   = {219--245}
}
```


### Abstract

This paper presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and for improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived, and their correspondences with the conventional residual gradient, TD(0), and TD(λ) algorithms are shown. For policy improvement, two methods, namely, a continuous actor-critic method and a value-gradient based greedy policy, are formulated. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived...
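The continuous-time TD error that the abstract refers to can be illustrated numerically. This is a minimal sketch, not code from the paper: for a discounted problem with time constant τ, the TD error is δ(t) = r(t) − V(t)/τ + dV/dt, and a backward Euler derivative recovers a discrete update. The names `tau`, `dt`, and the sample arrays are assumptions for illustration.

```python
# Minimal sketch (not the paper's implementation): continuous-time TD error
#   delta(t) = r(t) - V(t)/tau + dV/dt
# with the time derivative approximated by backward Euler.

def continuous_td_errors(values, rewards, dt=0.01, tau=1.0):
    """Backward-Euler estimate of delta(t) = r(t) - V(t)/tau + dV/dt."""
    deltas = []
    for t in range(1, len(values)):
        dv_dt = (values[t] - values[t - 1]) / dt  # backward Euler derivative
        deltas.append(rewards[t] - values[t] / tau + dv_dt)
    return deltas

# A value function consistent with a constant reward, V = tau * r,
# yields zero TD error everywhere:
print(continuous_td_errors([2.0, 2.0, 2.0], [1.0, 1.0, 1.0], dt=0.1, tau=2.0))
```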

### Citations

1331 | Learning from delayed rewards - Watkins - 1989 |

Citation Context: ...→ 0, the policy will be a "bang-bang" control law $u_j = u_j^{\max}\,\mathrm{sign}\!\left[\left(\frac{\partial f(x,u)}{\partial u_j}\right)^{T}\left(\frac{\partial V^{*}(x)}{\partial x}\right)^{T}\right]$ (25). 4.3 Advantage Updating: When the model of the dynamics is not available, as in Q-learning (Watkins, 1989), we can select a greedy action by directly learning the term to be maximized in the HJB equation, $r(x(t),u(t)) + \frac{\partial V^{*}(x)}{\partial x}\, f(x(t),u(t))$. This idea has been implemented in the "advantage updating"...
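The bang-bang law in equation (25) of the context above amounts to a sign computation once the input-gain Jacobian and value gradient are in hand. The following is a hedged sketch under assumed shapes, not the paper's implementation: `df_du` stands for the n-by-m Jacobian ∂f/∂u and `dV_dx` for the value gradient ∂V/∂x.

```python
import numpy as np

def bang_bang_action(u_max, df_du, dV_dx):
    """Sketch of u_j = u_max_j * sign( (df/du_j)^T (dV/dx)^T ) for each j.

    df_du : (n, m) Jacobian of the dynamics w.r.t. the action u (assumed name).
    dV_dx : (n,) gradient of the value function (assumed name).
    Column j of df_du dotted with dV_dx gives the scalar inside sign().
    """
    return u_max * np.sign(df_du.T @ dV_dx)
```

Each action component saturates at its bound, switching sign with the projection of the value gradient onto that input channel.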

1314 | Reinforcement learning: a survey - Kaelbling, Littman, et al. - 1996 |

1238 | Learning to predict by the methods of temporal differences - Sutton - 1988 |

Citation Context: ...al hundred trials using the value-gradient based policy with a learned dynamic model. 1 Introduction: The temporal difference (TD) family of reinforcement learning (RL) algorithms (Barto et al., 1983; Sutton, 1988; Sutton and Barto, 1998) provides an effective approach to control and decision problems for which optimal solutions are analytically unavailable or difficult to obtain. A number of successful applic...

863 | Reinforcement learning - Sutton, Barto - 1998 |

Citation Context: ...als using the value-gradient based policy with a learned dynamic model. 1 Introduction: The temporal difference (TD) family of reinforcement learning (RL) algorithms (Barto et al., 1983; Sutton, 1988; Sutton and Barto, 1998) provides an effective approach to control and decision problems for which optimal solutions are analytically unavailable or difficult to obtain. A number of successful applications to large-scale pr...

589 | Neurons with graded response have collective computational properties like those of two-state neurons - Hopfield - 1984 |

Citation Context: ...cept in the experiments of Figure 6. Both the value and policy functions were implemented by normalized Gaussian networks, as described in Appendix B. A sigmoid output function $s(x) = \frac{2}{\pi}\arctan\!\left(\frac{\pi}{2}x\right)$ (Hopfield, 1984) was used in both (19) and (24). In order to promote exploration, we incorporated a noise term $\sigma n(t)$ in both policies (19) and (24) (see equations (33) and (34) in Appendix B). We used low-pass filt...
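The saturating output function quoted above is easy to verify numerically. This sketch assumes the reading $s(x) = \frac{2}{\pi}\arctan(\frac{\pi}{2}x)$, which is bounded in (−1, 1), odd, and has unit slope at the origin.

```python
import math

def hopfield_sigmoid(x):
    """s(x) = (2/pi) * arctan((pi/2) * x): bounded in (-1, 1), unit slope at 0."""
    return (2.0 / math.pi) * math.atan((math.pi / 2.0) * x)

print(hopfield_sigmoid(0.0))   # 0.0 at the origin
print(hopfield_sigmoid(1e6))   # approaches 1.0 as x grows (saturation)
```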

479 | Neuronlike adaptive elements that can solve difficult learning control problems - Barto, Sutton, et al. - 1983 |

Citation Context: ...ccomplished in several hundred trials using the value-gradient based policy with a learned dynamic model. 1 Introduction: The temporal difference (TD) family of reinforcement learning (RL) algorithms (Barto et al., 1983; Sutton, 1988; Sutton and Barto, 1998) provides an effective approach to control and decision problems for which optimal solutions are analytically unavailable or difficult to obtain. A number of suc...

356 | Generalization in reinforcement learning: Successful examples using sparse coarse coding - Sutton - 1996 |

Citation Context: ...laborate partitioning of the variables has to be found using prior knowledge. Efforts have been made to eliminate some of these difficulties by using appropriate function approximators (Gordon, 1996; Sutton, 1996; Tsitsiklis and Van Roy, 1997), adaptive state partitioning and aggregation methods (Moore, 1994; Singh et al., 1995; Asada et al., 1996; Pareigis, 1998), and multiple time scale methods (Sutton, 199...

313 | Robot learning from demonstration - Atkeson, Schaal - 1997 |

Citation Context: ...ic examples (Tsitsiklis and Van Roy, 1997), methods that dynamically allocate or reshape basis functions have been successfully used with continuous RL algorithms, for example, in a swing-up task (Schaal, 1997) and in a stand-up task for a three-link robot (Morimoto and Doya, 1998). Elucidation of the conditions under which the proposed continuous RL algorithms work successfully, for example, the propertie...

282 | Improving elevator performance using reinforcement learning - Crites, Barto - 1996 |

Citation Context: ...s for which optimal solutions are analytically unavailable or difficult to obtain. A number of successful applications to large-scale problems, such as board games (Tesauro, 1994), dispatch problems (Crites and Barto, 1996; Zhang and Dietterich, 1996; Singh and Bertsekas, 1997), and robot navigation (Mataric, 1994) have been reported (see, e.g., Kaelbling et al. (1996) and Sutton and Barto (1998) for a review). The pro...

239 | Residual Algorithms: Reinforcement Learning with Function Approximation - Baird - 1995 |

Citation Context: ...me form of the TD error. The update algorithms are derived either by using a single step or exponentially weighted eligibility traces. The relationships of these algorithms with the residual gradient (Baird, 1995), TD(0), and TD(λ) algorithms (Sutton, 1988) for discrete cases are also shown. Next, we formulate methods for improving the policy using the value function, namely, the continuous actor-critic meth...

228 | TD-Gammon, a self-teaching backgammon program, achieves master-level play - Tesauro - 1994 |

Citation Context: ...ach to control and decision problems for which optimal solutions are analytically unavailable or difficult to obtain. A number of successful applications to large-scale problems, such as board games (Tesauro, 1994), dispatch problems (Crites and Barto, 1996; Zhang and Dietterich, 1996; Singh and Bertsekas, 1997), and robot navigation (Mataric, 1994) have been reported (see, e.g., Kaelbling et al. (1996) and Su...

224 | The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces - Moore, Atkeson - 1995 |

Citation Context: ...de to eliminate some of these difficulties by using appropriate function approximators (Gordon, 1996; Sutton, 1996; Tsitsiklis and Van Roy, 1997), adaptive state partitioning and aggregation methods (Moore, 1994; Singh et al., 1995; Asada et al., 1996; Pareigis, 1998), and multiple time scale methods (Sutton, 1995). In this paper, we consider an alternative approach in which learning algorithms are formula...

220 | An analysis of temporal-difference learning with function approximation - Tsitsiklis, Roy - 1997 |

211 | Stable function approximation in dynamic programming - Gordon - 1995 |

Citation Context: ...finite time step, as shown in sections 3.2 and 3.3, it becomes equivalent to a discrete-time TD algorithm, for which some convergent properties have been shown with the use of function approximators (Gordon, 1995; Tsitsiklis and Van Roy, 1997). For example, the convergence of TD algorithms has been shown with the use of a linear function approximator and on-line sampling (Tsitsiklis and Van Roy, 1997), which ...

170 | Reward functions for accelerated learning - Mataric - 1994 |

Citation Context: ...ful applications to large-scale problems, such as board games (Tesauro, 1994), dispatch problems (Crites and Barto, 1996; Zhang and Dietterich, 1996; Singh and Bertsekas, 1997), and robot navigation (Mataric, 1994) have been reported (see, e.g., Kaelbling et al. (1996) and Sutton and Barto (1998) for a review). The progress of RL research so far, however, has been mostly constrained to the discrete formulation...

122 | Reinforcement learning for dynamic channel allocation in cellular telephone systems - Singh, Bertsekas - 1997 |

Citation Context: ...ilable or difficult to obtain. A number of successful applications to large-scale problems, such as board games (Tesauro, 1994), dispatch problems (Crites and Barto, 1996; Zhang and Dietterich, 1996; Singh and Bertsekas, 1997), and robot navigation (Mataric, 1994) have been reported (see, e.g., Kaelbling et al. (1996) and Sutton and Barto (1998) for a review). The progress of RL research so far, however, has been mostly c...

115 | Reinforcement learning methods for continuous-time Markov decision problems - Bradtke, Duff - 1995 |

112 | Reinforcement learning with soft state aggregation - Singh, Jaakkola, et al. - 1995 |

Citation Context: ...te some of these difficulties by using appropriate function approximators (Gordon, 1996; Sutton, 1996; Tsitsiklis and Van Roy, 1997), adaptive state partitioning and aggregation methods (Moore, 1994; Singh et al., 1995; Asada et al., 1996; Pareigis, 1998), and multiple time scale methods (Sutton, 1995). In this paper, we consider an alternative approach in which learning algorithms are formulated for continuous-t...

65 | A stochastic reinforcement learning algorithm for learning real-valued functions - Gullapalli - 1990 |

Citation Context: ...n approximator with a parameter vector $w^A$, $n(t) \in \mathbb{R}^m$ is noise, and $s(\cdot)$ is a monotonically increasing output function. The parameters are updated by the stochastic real-valued (SRV) unit algorithm (Gullapalli, 1990) as $\dot{w}^A_i = \eta^A\, \delta(t)\, n(t)\, \frac{\partial A(x(t); w^A)}{\partial w^A_i}$ (20). 4.2 Value-Gradient Based Policy: In discrete problems, a greedy policy can be found by one-ply search for an action that maximizes the sum of th...
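An Euler-discretized form of the SRV actor update in equation (20) can be sketched as follows. This is an illustrative sketch, not the paper's code; `eta`, `td_error`, `noise`, and `dA_dw` are assumed placeholder names for the learning rate, TD error, exploration noise, and policy gradient.

```python
import numpy as np

def srv_update(w, eta, td_error, noise, dA_dw):
    """One SRV-style actor step: w_i += eta * delta(t) * n(t) * dA/dw_i.

    The update reinforces the exploration noise that preceded a positive
    TD error, nudging the policy toward better-than-expected actions.
    """
    return w + eta * td_error * noise * dA_dw
```

With a positive TD error the parameters move in the direction the noise perturbed the action; with a negative one they move away.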

63 | TD models: Modeling the world at a mixture of time scales - Sutton - 1995 |

Citation Context: ...utton, 1996; Tsitsiklis and Van Roy, 1997), adaptive state partitioning and aggregation methods (Moore, 1994; Singh et al., 1995; Asada et al., 1996; Pareigis, 1998), and multiple time scale methods (Sutton, 1995). In this paper, we consider an alternative approach in which learning algorithms are formulated for continuous-time dynamical systems without resorting to the explicit discretization of time, stat...

58 | A Menu of Designs for Reinforcement Learning Over Time - Werbos - 1990 |

Citation Context: ...he continuous framework has the following possible advantages: 1. A smooth control performance can be achieved. 2. An efficient control policy can be derived using the gradient of the value function (Werbos, 1990). 3. There is no need to guess how to partition the state, action, and time: it is the task of the function approximation and numerical integration algorithms to find the right granularity. There hav...

55 | Reinforcement learning Applied to Linear Quadratic Regulation - Bradtke - 1993 |

50 | Using local trajectory optimizers to speed up global optimization in dynamic programming - Atkeson - 1994 |

Citation Context: ...ionship with "advantage updating" (Baird, 1993) is also discussed. The performance of the proposed methods is first evaluated in nonlinear control tasks of swinging up a pendulum with limited torque (Atkeson, 1994; Doya, 1996) using normalized Gaussian basis function networks for representing the value function, the policy, and the model. We test: 1) the performance of the discrete actor-critic, continuous act...

45 | Advantage Updating - Baird - 1993 |

Citation Context: ...when a model is available for the input gain of the system dynamics, we derive a closed-form feedback policy that is suitable for real-time implementation. Its relationship with "advantage updating" (Baird, 1993) is also discussed. The performance of the proposed methods is first evaluated in nonlinear control tasks of swinging up a pendulum with limited torque (Atkeson, 1994; Doya, 1996) using normalized Ga...

27 | Action-based sensor space categorization for robot learning - Asada, Noda, et al. - 1996 |

Citation Context: ...ficulties by using appropriate function approximators (Gordon, 1996; Sutton, 1996; Tsitsiklis and Van Roy, 1997), adaptive state partitioning and aggregation methods (Moore, 1994; Singh et al., 1995; Asada et al., 1996; Pareigis, 1998), and multiple time scale methods (Sutton, 1995). In this paper, we consider an alternative approach in which learning algorithms are formulated for continuous-time dynamical system...

27 | Temporal difference learning in continuous time and space - Doya - 1996 |

Citation Context: ...dvantage updating" (Baird, 1993) is also discussed. The performance of the proposed methods is first evaluated in nonlinear control tasks of swinging up a pendulum with limited torque (Atkeson, 1994; Doya, 1996) using normalized Gaussian basis function networks for representing the value function, the policy, and the model. We test: 1) the performance of the discrete actor-critic, continuous actor-critic, a...

20 | Stable fitted reinforcement learning - Gordon - 1996 |

Citation Context: ...nageable, an elaborate partitioning of the variables has to be found using prior knowledge. Efforts have been made to eliminate some of these difficulties by using appropriate function approximators (Gordon, 1996; Sutton, 1996; Tsitsiklis and Van Roy, 1997), adaptive state partitioning and aggregation methods (Moore, 1994; Singh et al., 1995; Asada et al., 1996; Pareigis, 1998), and multiple time scale method...

15 | Reinforcement learning of dynamic motor sequence: Learning to stand up - Morimoto, Doya - 1998 |

Citation Context: ...mically allocate or reshape basis functions have been successfully used with continuous RL algorithms, for example, in a swing-up task (Schaal, 1997) and in a stand-up task for a three-link robot (Morimoto and Doya, 1998). Elucidation of the conditions under which the proposed continuous RL algorithms work successfully, for example, the properties of the function approximators and the methods for exploration, remains...

15 | Adaptive choice of grid and time in reinforcement learning - Pareigis - 1997 |

Citation Context: ...ppropriate function approximators (Gordon, 1996; Sutton, 1996; Tsitsiklis and Van Roy, 1997), adaptive state partitioning and aggregation methods (Moore, 1994; Singh et al., 1995; Asada et al., 1996; Pareigis, 1998), and multiple time scale methods (Sutton, 1995). In this paper, we consider an alternative approach in which learning algorithms are formulated for continuous-time dynamical systems without resort...

10 | Efficient nonlinear control with actor-tutor architecture - Doya - 1997 |

Citation Context: ...ects of the learning parameters, including the action cost, exploration noise, and landscape of the reward function. Then, we test the algorithms in a more challenging task, i.e., cart-pole swing-up (Doya, 1997), in which the state space is higher-dimensional and the system input gain is state-dependent. 2 The Optimal Value Function for a Discounted Reward Task: In this paper, we consider the continuous-time...

10 | Reinforcement learning applied to a differential game - Harmon, Baird, et al. - 1996 |

Citation Context: ...$r_t + \gamma V_t - V_{t-1}$ by taking the discount factor $\gamma = 1 - \frac{\Delta t}{\tau} \simeq e^{-\Delta t/\tau}$ and rescaling the values as $V_t = \frac{1}{\Delta t} V(t)$. The update schemes (14) and (15) correspond to the residual-gradient (Baird, 1995; Harmon et al., 1996) and TD(0) algorithms, respectively. Note that the time step $\Delta t$ of the Euler differentiation does not have to be equal to the control cycle of the physical system. 3.3 Exponential Eligibility Trace: TD(λ)...
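The correspondence between the Euler discount factor 1 − Δt/τ and its exponential counterpart e^(−Δt/τ), quoted in the context above, can be checked directly; `tau` and the `dt` values below are illustrative.

```python
import math

# The two discount factors agree to first order in dt; the gap shrinks
# like O((dt/tau)^2) as the time step goes to zero.
tau = 1.0
for dt in (0.1, 0.01, 0.001):
    gamma_euler = 1.0 - dt / tau
    gamma_exact = math.exp(-dt / tau)
    print(dt, gamma_euler, gamma_exact, abs(gamma_euler - gamma_exact))
```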

10 | A convergent reinforcement learning algorithm in the continuous case based on a finite difference method - Munos - 1997 |

Citation Context: ...ue function that satisfies the HJB equation have been studied using a grid-based discretization of space and time (Peterson, 1993), and convergence proofs have been shown for grid sizes taken to zero (Munos, 1997; Munos and Bourgine, 1998). However, the direct implementation of such methods is impractical in a high-dimensional state space. An HJB based method that uses function approximators was presented by ...

8 | Reinforcement learning for continuous stochastic control problems - Munos, Bourgine - 1998 |

Citation Context: ...hat satisfies the HJB equation have been studied using a grid-based discretization of space and time (Peterson, 1993), and convergence proofs have been shown for grid sizes taken to zero (Munos, 1997; Munos and Bourgine, 1998). However, the direct implementation of such methods is impractical in a high-dimensional state space. An HJB based method that uses function approximators was presented by Dayan and Singh (1996). T...

6 | Improving policies without measuring merits - Dayan, Singh - 1996 |

2 | On-line estimation of the optimal value function: HJB-estimators - Peterson - 1993 |

Citation Context: ...Bertsekas (1995) and Fleming and Soner (1993)). Methods for learning the optimal value function that satisfies the HJB equation have been studied using a grid-based discretization of space and time (Peterson, 1993), and convergence proofs have been shown for grid sizes taken to zero (Munos, 1997; Munos and Bourgine, 1998). However, the direct implementation of such methods is impractical in a high-dimensional s...