## Direct gradient-based reinforcement learning: I. gradient estimation algorithms (1999)


Venue: National University

Citations: 68 (3 self)

### BibTeX

```bibtex
@techreport{Baxter99directgradient-based,
  author      = {Jonathan Baxter and Lex Weaver and Peter Bartlett},
  title       = {Direct gradient-based reinforcement learning: I. Gradient estimation algorithms},
  institution = {National University},
  year        = {1999}
}
```


### Abstract

In [2] we introduced GPOMDP, an algorithm for computing arbitrarily accurate approximations to the performance gradient of parameterized partially observable Markov decision processes (POMDPs). The algorithm's chief advantages are that it requires only a single sample path of the underlying Markov chain, it uses only one free parameter, which has a natural interpretation as a bias-variance trade-off, and it requires no knowledge of the underlying state. In addition, the algorithm can be applied to infinite state, control and observation spaces.
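The abstract's three claims (a single sample path, one bias-variance parameter β, no access to the underlying state) can be illustrated with a minimal sketch in the spirit of GPOMDP. All names here (`env_reset`, `env_step`, `grad_logp`, the toy one-observation "bandit") are hypothetical scaffolding, not from the paper:

```python
import numpy as np

def gpomdp(theta, env_reset, env_step, grad_logp, sample_action,
           beta=0.9, T=50000, seed=0):
    """Gradient estimate in the spirit of the algorithm in the abstract:
    one sample path, one free parameter beta (bias-variance trade-off),
    and no state knowledge -- only observations, actions and rewards."""
    rng = np.random.default_rng(seed)
    z = np.zeros_like(theta)        # eligibility trace of score functions
    delta = np.zeros_like(theta)    # running gradient estimate
    obs = env_reset(rng)
    for t in range(T):
        u = sample_action(theta, obs, rng)
        z = beta * z + grad_logp(theta, obs, u)  # discounted score trace
        obs, r = env_step(obs, u, rng)
        delta += (r * z - delta) / (t + 1)       # running average of r_t * z_t
    return delta

# Hypothetical toy problem: one dummy observation, Bernoulli policy that
# picks action 1 with probability sigmoid(theta); action 1 pays reward 1.
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
env_reset = lambda rng: 0
env_step = lambda obs, u, rng: (0, float(u))
sample_action = lambda theta, obs, rng: int(rng.random() < sigmoid(theta[0]))
grad_logp = lambda theta, obs, u: np.array([u - sigmoid(theta[0])])

g = gpomdp(np.array([0.0]), env_reset, env_step, grad_logp, sample_action)
# g[0] estimates d(average reward)/d(theta); it should be positive here,
# since increasing theta makes the rewarded action more likely
```

Raising β toward 1 lowers the bias of the estimate at the cost of higher variance in the trace `z`, which is the trade-off the abstract refers to.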

### Citations

4388 | Reinforcement learning: An introduction
- Sutton, Barto
- 1998
Citation Context: ...oximate the state (or state and action) values. Most algorithms then seek to minimize some form of error between the approximate value function and the true value function, usually by simulation (see [13] and [4] for comprehensive overviews). While there have been a multitude of empirical successes for this approach (see e.g. [10, 14, 15, 3, 18, 11] to name but a few), there are only weak theoretical g...

1359 | Learning to predict by the methods of temporal differences
- Sutton
- 1988
Citation Context: ...cy with around 100 iterations of the algorithm, while the on-line method requires approximately 1000 iterations. This should be contrasted with training a linear value-function for this system using TD(λ) [12], which can be shown to converge to a value function whose one-step lookahead policy is suboptimal [16]. In the second set of experiments, we consider a simple "puck-world" problem in which a sm...

889 | Denumerable Markov Chains
- Kemeny, Snell, et al.
- 1976
Citation Context: ...he system of equations (10) is underconstrained because I − P is not invertible (the balance equations show that I − P has a left eigenvector with zero eigenvalue). However, I − P + eπ′, where e = [1, …, 1]′, is invertible [16]. Since ∇π′e = ∇(π′e) = 0, we can rewrite (10) as ∇π′ [I − P + eπ′] = π′∇P (11), hence ∇π′ = π′∇P [I − P + eπ′]⁻¹ (12). Note that (11) is essentially a proof that ∇π exists under our assumptions. For MDPs with a sufficiently s...
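The invertibility fact quoted in this context, that the singular matrix I − P becomes invertible after the rank-one correction eπ′, is easy to check numerically. A minimal sketch with NumPy on a hypothetical random ergodic chain (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)          # random ergodic transition matrix

# Stationary distribution pi: left eigenvector of P for eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

e = np.ones((n, 1))
A = np.eye(n) - P + e @ pi[None, :]        # I - P + e pi'

# I - P is rank-deficient (pi' is a left null vector), but A has full rank.
assert np.linalg.matrix_rank(np.eye(n) - P) == n - 1
assert np.linalg.matrix_rank(A) == n
```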

679 | Some Studies in Machine Learning Using the Game of Checkers
- Samuel
- 1959
Citation Context: ...e value function and the true value function, usually by simulation (see [13] and [4] for comprehensive overviews). While there have been a multitude of empirical successes for this approach (see e.g. [10, 14, 15, 3, 18, 11] to name but a few), there are only weak theoretical guarantees on the performance of the policy generated by the approximate value function. In particular, there is no guarantee that the policy will ...

529 | Real Analysis and Probability
- Dudley
- 1989
Citation Context: ...h this case of densities on subsets as well as the finite case of Theorem 6. We allow the spaces to be general spaces satisfying the following topological assumption. (For definitions see, for example, [13].) Assumption 5. The control space has an associated topology that is separable, Hausdorff, and first-countable. For the corresponding Borel σ-algebra generated by this topology, there is a σ-finite...

437 | The Theory of Matrices
- Lancaster, Tismenetsky
- 1985

403 | Dynamic Programming and
- Bertsekas
- 2007
Citation Context: ... expected discounted reward: J_β(i) = E_θ[Σ_{t=0}^∞ β^t r(X_t) | X_0 = i] (5). We also consider average reward problems. Define the average reward by: η(θ) = lim_{T→∞} (1/T) E_θ[Σ_{t=0}^{T−1} r(X_t)]. It can be shown (see [6]) that η(θ) = (1 − β) π′J_β (6), where J_β = [J_β(1), …, J_β(n)]′. Somewhat surprisingly, for any β, optimizing the discounted reward (5) is equivalent to optimizing the average reward (6), as the following...
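The context above relates the average reward to the discounted reward; the identity in the paper's setting is η(θ) = (1 − β) π′J_β, which follows because π′(I − βP)⁻¹ = π′/(1 − β) for the stationary distribution π. A quick numerical check on a hypothetical random chain (names and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 4, 0.8
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)            # random transition matrix
r = rng.random(n)                            # per-state rewards

# Stationary distribution pi: left eigenvector of P for eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

eta = pi @ r                                  # average reward
J = np.linalg.solve(np.eye(n) - beta * P, r)  # discounted values J_beta
assert np.isclose((1 - beta) * (pi @ J), eta) # eta = (1 - beta) pi' J_beta
```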

369 | Simple statistical gradient-following algorithms for connectionist reinforcement learning
- Williams
- 1992
Citation Context: ...mations to the performance gradient of parameterized partially observable Markov decision processes (POMDPs). Our algorithm is essentially an extension of Williams' REINFORCE algorithm [17] and similar more recent algorithms [7, 5, 9, 8]. See [2, Section 1.1] for a more comprehensive discussion of this related work. More specifically, suppose θ are the parameters controlling the POMDP...

366 | Policy gradient methods for reinforcement learning with function approximation
- Sutton, McAllester, et al.
- 2000
Citation Context: ...r cost) for stochastic policies by following a gradient direction. Approaches that combine both value function (or Q-function) estimation and gradient estimation include [25] and more recently [1] and [28]. These approaches attempt to combine the advantages of gradient estimation and value function approaches, although as yet there has been little empirical or theoretical investigation of their propert...

252 | TD-Gammon, a self-teaching backgammon program, achieves master-level play
- Tesauro
- 1994
Citation Context: ...e value function and the true value function, usually by simulation (see [13] and [4] for comprehensive overviews). While there have been a multitude of empirical successes for this approach (see e.g. [10, 14, 15, 3, 18, 11] to name but a few), there are only weak theoretical guarantees on the performance of the policy generated by the approximate value function. In particular, there is no guarantee that the policy will ...

135 | Gradient descent for general reinforcement learning
- Baird, Moore
- 1999
Citation Context: ...eward (or cost) for stochastic policies by following a gradient direction. Approaches that combine both value function (or Q-function) estimation and gradient estimation include [25] and more recently [1] and [28]. These approaches attempt to combine the advantages of gradient estimation and value function approaches, although as yet there has been little empirical or theoretical investigation of thei...

134 | Reinforcement learning for dynamic channel allocation in cellular telephone systems
- Singh, Bertsekas
- 1997
Citation Context: ...e value function and the true value function, usually by simulation (see [13] and [4] for comprehensive overviews). While there have been a multitude of empirical successes for this approach (see e.g. [10, 14, 15, 3, 18, 11] to name but a few), there are only weak theoretical guarantees on the performance of the policy generated by the approximate value function. In particular, there is no guarantee that the policy will ...

123 | A Reinforcement Learning Approach to job-shop Scheduling
- Zhang, Dietterich
- 1995

122 | Reinforcement learning with soft state aggregation
- Singh, Jaakkola, et al.
- 1995
Citation Context: ...t to optimize average reward (or cost) for stochastic policies by following a gradient direction. Approaches that combine both value function (or Q-function) estimation and gradient estimation include [25] and more recently [1] and [28]. These approaches attempt to combine the advantages of gradient estimation and value function approaches, although as yet there has been little empirical or theoretical...

85 | Simulation-based optimization of markov reward processes
- Marbach, Tsitsiklis
- 1998
Citation Context: ...pproach 1. Formulae for the gradient in a Markov process setting were given in [10]. These formulae critically rely on estimates of the differential reward of each state, as do the algorithms given in [20]. One difficulty with estimating the differential reward is that it relies on the existence of a single recurrent state for all parameter settings. The variance of the estimate of the differential...

75 | An upper bound on the loss from approximate optimal-value functions
- Singh, Yee
- 1994
Citation Context: ...inimize (1) or a variant, one should minimize some form of relative error between state values [2, 7, 32]. 1 See Section 2 for definitions. 2 For a proof of (2) see [8, Proposition 6.1] or [34] or [23]. While this idea is promising, the approach we take in this paper is even more direct: search for a policy minimizing the expected discounted reward directly. We can view the average reward (2) as ...

61 | Feedforward Neural Networks Methodology
- Fine
- 1999

48 | Advantage updating
- Baird
- 1993
Citation Context: ... learning problems. The function that is approximated is invariably some measure of the value of a state or of the values of state and action pairs (e.g. TD(λ) [26], Q-learning [31], advantage updating [2]). We will refer to these approaches generically as value function based. The approximating architectures range from linear functions through to neural networks and decision trees. Once an approximate...

44 | Perturbation realization, potentials and sensitivity analysis of Markov processes
- Cao, Chen
- 1997
Citation Context: ...the stationary distribution itself depends on the parameters. The bias can be reduced by allowing the discount factor to approach 1. Formulae for the gradient in a Markov process setting were given in [10]. These formulae critically rely on estimates of the differential reward of each state, as do the algorithms given in [20]. One difficulty with estimating the differential reward is that it relies on ...

42 | A convergent iterative restricted complexity control design scheme
- Hjalmarsson, Gunnarson, et al.
- 1994
Citation Context: ... Work The approach we take in this paper is closely related to certain direct adaptive control schemes that are used to tune (deterministic) controllers for discrete time systems. A number of authors [14, 12, 15] have presented algorithms for the approximate computation in closed loop of derivatives of a quadratic cost function with respect to controller parameters. This information is then used to tune the c...

30 | Analysis of some incremental variants of policy iteration: first steps toward understanding actor-critic learning systems (Report NU-CCS-93-11)
- Williams, Baird
- 1993
Citation Context: ...ing to minimize (1) or a variant, one should minimize some form of relative error between state values [2, 7, 32]. 1 See Section 2 for definitions. 2 For a proof of (2) see [8, Proposition 6.1] or [34] or [23]. While this idea is promising, the approach we take in this paper is even more direct: search for a policy minimizing the expected discounted reward directly. We can view the average reward...

28 | Reinforcement learning in POMDPs with function approximation
- Kimura, Miyazaki, et al.
- 1997
Citation Context: ...arameterized partially observable Markov decision processes (POMDPs). Our algorithm is essentially an extension of Williams' REINFORCE algorithm [17] and similar more recent algorithms [7, 5, 9, 8]. See [2, Section 1.1] for a more comprehensive discussion of this related work. More specifically, suppose θ are the parameters controlling the POMDP. For example, θ could be the parameter...

26 | Integral, Measure and Derivative: A Unified Approach
- Shilov, Gurevich
- 1977

25 | Simulation-Based Methods for Markov Decision Processes
- Marbach
- 1998
Citation Context: ...arameterized partially observable Markov decision processes (POMDPs). Our algorithm is essentially an extension of Williams' REINFORCE algorithm [17] and similar more recent algorithms [7, 5, 9, 8]. See [2, Section 1.1] for a more comprehensive discussion of this related work. More specifically, suppose θ are the parameters controlling the POMDP. For example, θ could be the parameter...

24 | Differential training of rollout policies
- Bertsekas
- 1997
Citation Context: ...to the successor states in each state. This motivates an alternative approach: instead of seeking to minimize (1) or a variant, one should minimize some form of relative error between state values [2, 7, 32]. 1 See Section 2 for definitions. 2 For a proof of (2) see [8, Proposition 6.1] or [34] or [23]. While this idea is promising, the approach we take in this paper is even more direct: search for a p...

16 | Learning to play chess using temporal differences
- Baxter, Tridgell, et al.
- 2000

8 | Iterative controller optimization for nonlinear systems
- Bruyne, Anderson, et al.
- 1997
Citation Context: ... Work The approach we take in this paper is closely related to certain direct adaptive control schemes that are used to tune (deterministic) controllers for discrete time systems. A number of authors [14, 12, 15] have presented algorithms for the approximate computation in closed loop of derivatives of a quadratic cost function with respect to controller parameters. This information is then used to tune the c...

8 | Direct iterative tuning via spectral analysis
- Kammer, Bitmead, et al.
- 2000
Citation Context: ... Work The approach we take in this paper is closely related to certain direct adaptive control schemes that are used to tune (deterministic) controllers for discrete time systems. A number of authors [14, 12, 15] have presented algorithms for the approximate computation in closed loop of derivatives of a quadratic cost function with respect to controller parameters. This information is then used to tune the c...

7 | Algorithms for Sensitivity Analysis of Markov Chains Through Potentials and Perturbation Realization
- Cao, Wan
- 1998
Citation Context: ...arameterized partially observable Markov decision processes (POMDPs). Our algorithm is essentially an extension of Williams' REINFORCE algorithm [17] and similar more recent algorithms [7, 5, 9, 8]. See [2, Section 1.1] for a more comprehensive discussion of this related work. More specifically, suppose θ are the parameters controlling the POMDP. For example, θ could be the parameter...

6 | Reinforcement learning from state and temporal differences
- Weaver, Baxter
- 1999
Citation Context: ...ll be guaranteed to improve the average reward on each step. Except in the case of table-lookup, most value-function based approaches to reinforcement learning cannot make this guarantee. See [16] for some analysis and a demonstration of performance degradation during the course of training a neural network backgammon player. In this paper we present ...

3 | Reinforcement Learning From State Differences
- Weaver, Baxter
- 1999
Citation Context: ...to the successor states in each state. This motivates an alternative approach: instead of seeking to minimize (1) or a variant, one should minimize some form of relative error between state values [2, 7, 32]. 1 See Section 2 for definitions. 2 For a proof of (2) see [8, Proposition 6.1] or [34] or [23]. While this idea is promising, the approach we take in this paper is even more direct: search for a p...