
## Reinforcement Learning In Continuous Time and Space (2000)

Venue: Neural Computation

Citations: 176 (7 self)

### Citations

1712 | Reinforcement learning: A survey - Kaelbling, Littman, et al. - 1996 |

1670 |
Learning from delayed rewards
- Watkins
- 1989
Citation Context ... $\to 0$, the policy will be a "bang-bang" control law $u_j = u_j^{\max}\,\mathrm{sign}\left[\frac{\partial f(x, u)}{\partial u_j}^T \frac{\partial V(x)}{\partial x}^T\right]$ (25). 4.3 Advantage Updating When the model of the dynamics is not available, like in Q-learning (Watkins, 1989), we can select a greedy action by directly learning the term to be maximized in the HJB equation $r(x(t), u(t)) + \frac{\partial V^*(x)}{\partial x} f(x(t), u(t))$. This idea has been implemented in the "advantage updating"... |

1520 | Learning to predict by the methods of temporal differences
- Sutton
- 1988
Citation Context ...al hundred trials using the value-gradient based policy with a learned dynamic model. 1 Introduction The temporal difference (TD) family of reinforcement learning (RL) algorithms (Barto et al., 1983; Sutton, 1988; Sutton and Barto, 1998) provides an effective approach to control and decision problems for which optimal solutions are analytically unavailable or difficult to obtain. A number of successful applic... |

1233 |
Reinforcement learning
- Sutton, Barto
- 1998
Citation Context ...als using the value-gradient based policy with a learned dynamic model. 1 Introduction The temporal difference (TD) family of reinforcement learning (RL) algorithms (Barto et al., 1983; Sutton, 1988; Sutton and Barto, 1998) provides an effective approach to control and decision problems for which optimal solutions are analytically unavailable or difficult to obtain. A number of successful applications to large-scale pr... |

829 |
Neurons with graded response have collective computational properties like those of two–state neurons
- Hopfield
- 1984
Citation Context ...cept in the experiments of Figure 6. Both the value and policy functions were implemented by normalized Gaussian networks, as described in Appendix B. A sigmoid output function $s(x) = \frac{2}{\pi} \arctan\left(\frac{\pi}{2} x\right)$ (Hopfield, 1984) was used in both (19) and (24). In order to promote exploration, we incorporated a noise term $\sigma n(t)$ in both policies (19) and (24) (see equations (33) and (34) in Appendix B). We used low-pass filt... |

751 | Dynamic Programming and Optimal Control. Athena Scientific - Bertsekas - 2005 |

604 |
Neuron-like adaptive elements that can solve difficult learning control problems
- Barto, Sutton, et al.
- 1983
Citation Context ...ccomplished in several hundred trials using the value-gradient based policy with a learned dynamic model. 1 Introduction The temporal difference (TD) family of reinforcement learning (RL) algorithms (Barto et al., 1983; Sutton, 1988; Sutton and Barto, 1998) provides an effective approach to control and decision problems for which optimal solutions are analytically unavailable or difficult to obtain. A number of suc... |

433 | Generalization in reinforcement learning: Successful examples using sparse coarse coding
- Sutton
- 1996
Citation Context ...laborate partitioning of the variables has to be found using prior knowledge. Efforts have been made to eliminate some of these difficulties by using appropriate function approximators (Gordon, 1996; Sutton, 1996; Tsitsiklis and Van Roy, 1997), adaptive state partitioning and aggregation methods (Moore, 1994; Singh et al., 1995; Asada et al., 1996; Pareigis, 1998), and multiple time scale methods (Sutton, 199... |

399 | Learning from demonstration
- Schaal
- 1997
Citation Context ...ic examples (Tsitsiklis and Van Roy, 1997), methods that dynamically allocate or reshape basis functions have been successfully used with continuous RL algorithms, for example, in a swing-up task (Schaal, 1997) and in a stand-up task for a three-link robot (Morimoto and Doya, 1998). Elucidation of the conditions under which the proposed continuous RL algorithms work successfully, for example, the propertie... |

323 | Improving elevator performance using reinforcement learning
- Crites, Barto
- 1996
Citation Context ...s for which optimal solutions are analytically unavailable or difficult to obtain. A number of successful applications to large-scale problems, such as board games (Tesauro, 1994), dispatch problems (Crites and Barto, 1996; Zhang and Dietterich, 1996; Singh and Bertsekas, 1997), and robot navigation (Mataric, 1994) have been reported (see, e.g., Kaelbling et al. (1996) and Sutton and Barto (1998) for a review). The pro... |

313 | An analysis of temporal-difference learning with function approximation - Tsitsiklis, Van Roy - 1997 |

306 | Residual algorithms: Reinforcement learning with function approximation
- Baird
- 1995
Citation Context ...me form of the TD error. The update algorithms are derived either by using a single step or exponentially weighted eligibility traces. The relationships of these algorithms with the residual gradient (Baird, 1995), TD(0), and TD($\lambda$) algorithms (Sutton, 1988) for discrete cases are also shown. Next, we formulate methods for improving the policy using the value function, namely, the continuous actor-critic meth... |

286 |
TD-Gammon, a self-teaching backgammon program, achieves master-level play
- Tesauro
- 1994
Citation Context ...ach to control and decision problems for which optimal solutions are analytically unavailable or difficult to obtain. A number of successful applications to large-scale problems, such as board games (Tesauro, 1994), dispatch problems (Crites and Barto, 1996; Zhang and Dietterich, 1996; Singh and Bertsekas, 1997), and robot navigation (Mataric, 1994) have been reported (see, e.g., Kaelbling et al. (1996) and Su... |

263 | Stable function approximation in dynamic programming.
- Gordon
- 1995
Citation Context ...finite time step, as shown in sections 3.2 and 3.3, it becomes equivalent to a discrete-time TD algorithm, for which some convergent properties have been shown with the use of function approximators (Gordon, 1995; Tsitsiklis and Van Roy, 1997). For example, the convergence of TD algorithms has been shown with the use of a linear function approximator and on-line sampling (Tsitsiklis and Van Roy, 1997), which ... |

255 | The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces
- Moore, Atkeson
- 1995
Citation Context ...de to eliminate some of these difficulties by using appropriate function approximators (Gordon, 1996; Sutton, 1996; Tsitsiklis and Van Roy, 1997), adaptive state partitioning and aggregation methods (Moore, 1994; Singh et al., 1995; Asada et al., 1996; Pareigis, 1998), and multiple time scale methods (Sutton, 1995). In this paper, we consider an alternative approach in which learning algorithms are formula... |

195 | Reward functions for accelerated learning.
- Mataric
- 1994
Citation Context ...ful applications to large-scale problems, such as board games (Tesauro, 1994), dispatch problems (Crites and Barto, 1996; Zhang and Dietterich, 1996; Singh and Bertsekas, 1997), and robot navigation (Mataric, 1994) have been reported (see, e.g., Kaelbling et al. (1996) and Sutton and Barto (1998) for a review). The progress of RL research so far, however, has been mostly constrained to the discrete formulation... |

193 |
Learning to predict by the methods of temporal differences
- Sutton
- 1988
Citation Context ...ral hundred trials using the value-gradient based policy with a learned dynamic model. 1 Introduction The temporal difference (TD) family of reinforcement learning (RL) algorithms (Barto et al., 1983; Sutton, 1988; Sutton and Barto, 1998) provides an effective approach to control and decision problems for which optimal solutions are analytically unavailable or difficult to obtain. A number of successful applicati... |

138 | Reinforcement learning for dynamic channel allocation in cellular telephone systems
- Singh, Bertsekas
- 1997
Citation Context ...ilable or difficult to obtain. A number of successful applications to large-scale problems, such as board games (Tesauro, 1994), dispatch problems (Crites and Barto, 1996; Zhang and Dietterich, 1996; Singh and Bertsekas, 1997), and robot navigation (Mataric, 1994) have been reported (see, e.g., Kaelbling et al. (1996) and Sutton and Barto (1998) for a review). The progress of RL research so far, however, has been mostly c... |

135 | Reinforcement learning methods for continuous-time Markov decision problems - Bradtke - 1995 |

125 | Reinforcement Learning with Soft State Aggregation
- Singh, Jaakkola, et al.
- 1995
Citation Context ...te some of these difficulties by using appropriate function approximators (Gordon, 1996; Sutton, 1996; Tsitsiklis and Van Roy, 1997), adaptive state partitioning and aggregation methods (Moore, 1994; Singh et al., 1995; Asada et al., 1996; Pareigis, 1998), and multiple time scale methods (Sutton, 1995). In this paper, we consider an alternative approach in which learning algorithms are formulated for continuous-t... |

89 |
A Menu of Designs for Reinforcement Learning Over Time
- Werbos
- 1990
Citation Context ...he continuous framework has the following possible advantages: 1. A smooth control performance can be achieved. 2. An efficient control policy can be derived using the gradient of the value function (Werbos, 1990). 3. There is no need to guess how to partition the state, action, and time: it is the task of the function approximation and numerical integration algorithms to find the right granularity. There hav... |

81 |
A stochastic reinforcement learning algorithm for learning real-valued functions.
- Gullapalli
- 1990
Citation Context ...n approximator with a parameter vector $w^A$, $n(t) \in R^m$ is noise, and $s()$ is a monotonically increasing output function. The parameters are updated by the stochastic real-valued (SRV) unit algorithm (Gullapalli, 1990) as $\dot{w}^A_i = \eta^A \delta(t) n(t) \frac{\partial A(x(t); w^A)}{\partial w^A_i}$ (20). 4.2 Value-Gradient Based Policy In discrete problems, a greedy policy can be found by one-ply search for an action that maximizes the sum of th... |

70 | TD models: Modeling the world at a mixture of time scales.
- Sutton
- 1995
Citation Context ...utton, 1996; Tsitsiklis and Van Roy, 1997), adaptive state partitioning and aggregation methods (Moore, 1994; Singh et al., 1995; Asada et al., 1996; Pareigis, 1998), and multiple time scale methods (Sutton, 1995). In this paper, we consider an alternative approach in which learning algorithms are formulated for continuous-time dynamical systems without resorting to the explicit discretization of time, stat... |

62 | Reinforcement learning applied to linear quadratic regulation - Bradtke - 1993 |

52 | Using local trajectory optimizers to speed up global optimization in dynamic programming
- Atkeson
- 1994
Citation Context ...ionship with "advantage updating" (Baird, 1993) is also discussed. The performance of the proposed methods is first evaluated in nonlinear control tasks of swinging up a pendulum with limited torque (Atkeson, 1994; Doya, 1996) using normalized Gaussian basis function networks for representing the value function, the policy, and the model. We test: 1) the performance of the discrete actor-critic, continuous act... |

51 | Advantage updating.
- Baird
- 1993
Citation Context ...when a model is available for the input gain of the system dynamics, we derive a closed-form feedback policy that is suitable for real-time implementation. Its relationship with "advantage updating" (Baird, 1993) is also discussed. The performance of the proposed methods is first evaluated in nonlinear control tasks of swinging up a pendulum with limited torque (Atkeson, 1994; Doya, 1996) using normalized Ga... |

34 | Temporal difference learning in continuous time and space
- Doya
- 1996
Citation Context ...dvantage updating" (Baird, 1993) is also discussed. The performance of the proposed methods is first evaluated in nonlinear control tasks of swinging up a pendulum with limited torque (Atkeson, 1994; Doya, 1996) using normalized Gaussian basis function networks for representing the value function, the policy, and the model. We test: 1) the performance of the discrete actor-critic, continuous actor-critic, a... |

30 | Action-Based Sensor Space Categorization for Robot Learning
- Asada, Noda, et al.
- 1996
Citation Context ...ficulties by using appropriate function approximators (Gordon, 1996; Sutton, 1996; Tsitsiklis and Van Roy, 1997), adaptive state partitioning and aggregation methods (Moore, 1994; Singh et al., 1995; Asada et al., 1996; Pareigis, 1998), and multiple time scale methods (Sutton, 1995). In this paper, we consider an alternative approach in which learning algorithms are formulated for continuous-time dynamical system... |

28 | Neurons with graded response have collective computational properties like those of two-state neurons - Hop - 1984 |

23 | Stable fitted reinforcement learning
- Gordon
- 1996
Citation Context ...nageable, an elaborate partitioning of the variables has to be found using prior knowledge. Efforts have been made to eliminate some of these difficulties by using appropriate function approximators (Gordon, 1996; Sutton, 1996; Tsitsiklis and Van Roy, 1997), adaptive state partitioning and aggregation methods (Moore, 1994; Singh et al., 1995; Asada et al., 1996; Pareigis, 1998), and multiple time scale method... |

22 | Reinforcement learning of dynamic motor sequence: Learning to stand up.
- Morimoto, Doya
- 1998
Citation Context ...mically allocate or reshape basis functions have been successfully used with continuous RL algorithms, for example, in a swing-up task (Schaal, 1997) and in a stand-up task for a three-link robot (Morimoto and Doya, 1998). Elucidation of the conditions under which the proposed continuous RL algorithms work successfully, for example, the properties of the function approximators and the methods for exploration, remains... |

17 | An analysis of temporal-difference learning with function approximation - Tsitsiklis, Van Roy - 1997 |

15 | Adaptive choice of grid and time in reinforcement learning
- Pareigis
- 1997
Citation Context ...ppropriate function approximators (Gordon, 1996; Sutton, 1996; Tsitsiklis and Van Roy, 1997), adaptive state partitioning and aggregation methods (Moore, 1994; Singh et al., 1995; Asada et al., 1996; Pareigis, 1998), and multiple time scale methods (Sutton, 1995). In this paper, we consider an alternative approach in which learning algorithms are formulated for continuous-time dynamical systems without resort... |

13 |
Reinforcement learning applied to a differential game
- Harmon, B, et al.
- 1995
Citation Context ... $r_t + \gamma V_t - V_{t-1}$ by taking the discount factor $\gamma = 1 - \frac{\Delta t}{\tau} \simeq e^{-\Delta t/\tau}$ and rescaling the values as $V_t = \frac{1}{\Delta t} V(t)$. The update schemes (14) and (15) correspond to the residual-gradient (Baird, 1995; Harmon et al., 1996) and TD(0) algorithms, respectively. Note that time step $\Delta t$ of the Euler differentiation does not have to be equal to the control cycle of the physical system. 3.3 Exponential Eligibility Trace: TD($\lambda$)... |

13 | A convergent reinforcement learning algorithm in the continuous case: the finite-element reinforcement learning.
- Munos
- 1996
Citation Context ...ue function that satisfies the HJB equation have been studied using a grid-based discretization of space and time (Peterson, 1993) and convergence proofs have been shown for grid sizes taken to zero (Munos, 1997; Munos and Bourgine, 1998). However, the direct implementation of such methods is impractical in a high-dimensional state space. An HJB based method that uses function approximators was presented by ... |

12 | Efficient Nonlinear Control with Actor-Tutor Architecture,
- Doya
- 1997
Citation Context ...ects of the learning parameters, including the action cost, exploration noise, and landscape of the reward function. Then, we test the algorithms in a more challenging task, i.e., cart-pole swing-up (Doya, 1997), in which the state space is higher-dimensional and the system input gain is state-dependent. 2 The Optimal Value Function for a Discounted Reward Task In this paper, we consider the continuous-time... |

10 |
Reinforcement learning for continuous stochastic control problems. Neural Information Processing Systems
- Munos, Bourgine
- 1997
Citation Context ...hat satisfies the HJB equation have been studied using a grid-based discretization of space and time (Peterson, 1993) and convergence proofs have been shown for grid sizes taken to zero (Munos, 1997; Munos and Bourgine, 1998). However, the direct implementation of such methods is impractical in a high-dimensional state space. An HJB based method that uses function approximators was presented by Dayan and Singh (1996). T... |

6 | Improving policies without measuring merits - Dayan, Singh - 1996 |

5 |
Temporal difference learning in continuous time and space
- Doya
- 1996
Citation Context ..."advantage updating" (Baird, 1993) is also discussed. The performance of the proposed methods is first evaluated in nonlinear control tasks of swinging up a pendulum with limited torque (Atkeson, 1994; Doya, 1996) using normalized Gaussian basis function networks for representing the value function, the policy, and the model. We test: 1) the performance of the discrete actor-critic, continuous actor-critic, a... |

4 |
Reinforcement learning applied to a differential game. Adaptive Behavior
- Harmon, Baird, et al.
- 1996
Citation Context ... $r_t + \gamma V_t - V_{t-1}$ by taking the discount factor $\gamma = 1 - \frac{\Delta t}{\tau} \simeq e^{-\Delta t/\tau}$ and rescaling the values as $V_t = \frac{1}{\Delta t} V(t)$. The update schemes (14) and (15) correspond to the residual-gradient (Baird, 1995; Harmon et al., 1996) and TD(0) algorithms, respectively. Note that time step $\Delta t$ of the Euler differentiation does not have to be equal to the control cycle of the physical system. 3.3 Exponential Eligibility Trace: TD($\lambda$)... |

3 |
Stable reinforcement learning
- Gordon
- 1996
Citation Context ... manageable, an elaborate partitioning of the variables has to be found using prior knowledge. Efforts have been made to eliminate some of these difficulties by using appropriate function approximators (Gordon, 1996; Sutton, 1996; Tsitsiklis and Van Roy, 1997), adaptive state partitioning and aggregation methods (Moore, 1994; Singh et al., 1995; Asada et al., 1996; Pareigis, 1998), and multiple time scale method... |

2 |
On-line estimation of the optimal value function: HJB-estimators
- Peterson
- 1993
Citation Context ... Bertsekas (1995) and Fleming and Soner (1993)). Methods for learning the optimal value function that satisfies the HJB equation have been studied using a grid-based discretization of space and time (Peterson, 1993) and convergence proofs have been shown for grid sizes taken to zero (Munos, 1997; Munos and Bourgine, 1998). However, the direct implementation of such methods is impractical in a high-dimensional s... |