## Incremental Multi-Step Q-Learning (1996)

Citations: 93 (2 self)

### BibTeX

@ARTICLE{Peng96incremental,
  author  = {Jing Peng and Ronald J. Williams},
  title   = {Incremental Multi-Step Q-Learning},
  journal = {Machine Learning},
  volume  = {22},
  pages   = {283--290},
  year    = {1996}
}

### Abstract

This paper presents a novel incremental algorithm that combines Q-learning, a well-known dynamic-programming-based reinforcement learning method, with the TD(λ) return estimation process, which is typically used in actor-critic learning, another well-known dynamic-programming-based reinforcement learning method. The parameter λ is used to distribute credit throughout sequences of actions, leading to faster learning and also helping to alleviate the non-Markovian effect of coarse state-space quantization. The resulting algorithm, Q(λ)-learning, thus combines some of the best features of the Q-learning and actor-critic learning paradigms. The behavior of this algorithm has been demonstrated through computer simulations.
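
The combination the abstract describes can be illustrated with a minimal tabular sketch using accumulating eligibility traces. This is the generic "naive" Q(λ) recipe, not necessarily the exact variant derived in the paper; `env_step`, `n_actions`, and all parameter values are illustrative assumptions.

```python
import numpy as np

def q_lambda_episode(Q, env_step, start_state, n_actions,
                     alpha=0.1, gamma=0.95, lam=0.7, eps=0.1,
                     max_steps=200, rng=np.random.default_rng(0)):
    """One episode of tabular Q(lambda) with accumulating eligibility traces.

    `env_step(s, a) -> (reward, next_state, done)` is assumed supplied by the
    caller; Q is a dict of per-state action-value arrays (the lookup table).
    """
    e = {s: np.zeros(n_actions) for s in Q}   # eligibility traces
    s = start_state
    for _ in range(max_steps):
        # epsilon-greedy exploration
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        r, s2, done = env_step(s, a)
        target = r if done else r + gamma * np.max(Q[s2])
        delta = target - Q[s][a]
        e[s][a] += 1.0                        # accumulate trace for (s, a)
        for st in Q:                          # propagate the TD error backwards
            Q[st] += alpha * delta * e[st]
            e[st] *= gamma * lam              # decay all traces
        if done:
            break
        s = s2
    return Q
```

The trace decay by γλ is what distributes credit throughout the preceding sequence of actions, rather than only to the most recent state-action pair as in one-step Q-learning.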

### Citations

1412 | Learning from Delayed Rewards - Watkins - 1989
Context: "...rning to balance a pole on a cart. 1 INTRODUCTION The incremental multi-step Q-learning (Q(λ)-learning) method is a new direct (or model-free) algorithm that extends the one-step Q-learning algorithm (Watkins 1989) by combining it with TD(λ) returns for general λ (Sutton 1988) in a natural way for delayed reinforcement learning. By allowing corrections to be made incrementally to the predictions of observations o..."

1328 | Learning to predict by the method of temporal differences - Sutton - 1988
Context: "...ntal multi-step Q-learning (Q(λ)-learning) method is a new direct (or model-free) algorithm that extends the one-step Q-learning algorithm (Watkins 1989) by combining it with TD(λ) returns for general λ (Sutton 1988) in a natural way for delayed reinforcement learning. By allowing corrections to be made incrementally to the predictions of observations occurring in the past, the Q(λ)-learning method propagates inf..."

498 | Integrated architectures for learning, planning and reacting based on approximating dynamic programming - Sutton - 1990
Context: "...dditional pleasing characteristic of the Q(λ)-learning method is that it achieves greater computational efficiency without having to learn and use a model of the world (Peng 1993, Peng & Williams 1993, Sutton 1990) and is well suited to parallel implementation. Finally, although look-up table representation has been our main focus so far, it can be shown without difficulty that Q(λ)-learning can be implemented ..."

336 | Prioritized sweeping: Reinforcement learning with less data and less time - Moore, Atkeson - 1993
Context: "...ns to past predictions throughout each trial. If this is the main benefit conferred by TD(λ), one might expect model-based, multiple-update methods like priority-Dyna (Peng 1993, Peng & Williams 1993, Moore & Atkeson 1994) to perform at least as well. However, additional experiments using such techniques showed that they performed significantly worse than Q(λ)-learning. We believe the reason for this is that the coars..."

305 | On-line Q-learning using connectionist systems - Rummery, Niranjan - 1994
Context: "...e visited. There are other possibilities, however. For example, the algorithm may estimate the TD(λ) returns by using the current exploration policy. This is the algorithm, called sarsa, described in (Rummery & Niranjan 1994). In this algorithm, the update rule is

\Delta w_t = \alpha \, (r_t + \gamma Q_{t+1} - Q_t) \sum_{k=0}^{t} (\gamma \lambda)^{t-k} \nabla_w Q_k \qquad (10)

where w denotes the weights of connectionist networks, and Q_{t+1} is associated with ..."
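
For a linear approximator Q(s, a) = w·φ(s, a), the gradient ∇_w Q_k in update (10) is just the feature vector φ_k, so the whole sum can be maintained incrementally as a single decayed trace vector. A minimal sketch under that linearity assumption (names and parameter values are illustrative):

```python
import numpy as np

def sarsa_lambda_step(w, trace, phi_t, phi_t1, r_t,
                      alpha=0.1, gamma=0.95, lam=0.8):
    """One application of update (10) for a linear Q(s, a) = w . phi(s, a).

    `trace` carries sum_{k<=t} (gamma*lam)^{t-k} grad_w Q_k; for a linear Q
    each gradient is simply the feature vector of the visited (state, action).
    """
    q_t = w @ phi_t    # Q_t
    q_t1 = w @ phi_t1  # Q_{t+1}, chosen by the exploration policy (sarsa)
    trace = gamma * lam * trace + phi_t
    w = w + alpha * (r_t + gamma * q_t1 - q_t) * trace
    return w, trace
```

Because Q_{t+1} is the value of the action the exploration policy actually selects, this estimate of the TD(λ) return is on-policy, unlike the max-based Q-learning target.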

283 | Stochastic Dynamic Programming - Ross - 1983
Context: "...gh & Sutton 1996, Watkins 1989). 3. One-Step Q-Learning One-step Q-learning of Watkins (1989), or simply Q-learning, is a simple incremental algorithm developed from the theory of dynamic programming (Ross 1983) for delayed reinforcement learning. In Q-learning, policies and the value function are represented by a two-dimensional lookup table indexed by state-action pairs. Formally, using notation consisten..."
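
The lookup-table backup described in this excerpt fits in a few lines; the dict-of-lists table layout and parameter values here are illustrative assumptions.

```python
def q_learning_update(Q, s, a, r, s_next, terminal, alpha=0.1, gamma=0.95):
    """One-step Q-learning backup on a lookup table Q[state][action].

    The max over next-state actions makes this off-policy: the target
    ignores whichever action the exploration policy takes next.
    """
    target = r if terminal else r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```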

257 | Temporal Credit Assignment in Reinforcement Learning - Sutton - 1984
Context: "...y had. 6 CONCLUSION The Q(λ)-learning algorithm is of interest because of its incrementality and its relationship to Q-learning (Watkins 1989) and actor-critic learning (Barto, Sutton & Anderson 1983, Sutton 1984). The experiments reported here demonstrate that the Q(λ)-learning algorithm inherits the best qualities of both the actor-critic learning algorithm and the Q-learning algorithm. However, this algorith..."

215 | On the convergence of stochastic iterative dynamic programming algorithms - Jaakkola, Jordan, et al. - 1994

196 | Reinforcement Learning for Robots Using Neural Networks - Lin - 1993

196 | Reinforcement learning with replacing eligibility traces - Singh, Sutton - 1996
Context: "...chieve near-optimal performance in certain prediction tasks by setting λ at each time step to the transition probability of the immediately preceding transition. For further details, see (Sutton 1988, Singh & Sutton 1996, Watkins 1989). 3. One-Step Q-Learning One-step Q-learning of Watkins (1989), or simply Q-learning, is a simple incremental algorithm developed from the theory of dynamic programming (Ross 1983) for ..."

191 | Neuron-like elements that can solve difficult learning control problems - Barto, Sutton, et al. - 1983 |

102 | Efficient learning and planning within the Dyna framework (Adaptive Behavior) - Peng, Williams - 1993
Context: "...ct of making alterations to past predictions throughout each trial. If this is the main benefit conferred by TD(λ), one might expect model-based, multiple-update methods like priority-Dyna (Peng 1993, Peng & Williams 1993, Moore & Atkeson 1994) to perform at least as well. However, additional experiments using such techniques showed that they performed significantly worse than Q(λ)-learning. We believe the reason for ..."

62 | The convergence of TD(λ) for general λ - Dayan - 1992
Context: "...ively, r^λ is a weighted average of corrected truncated returns in which the weight of r^(n) is proportional to λ^n, where 0 ≤ λ ≤ 1. As a result, r^λ has the error-reduction property. In fact, it is shown (Dayan 1992, Sutton 1988) that under certain conditions the expected value of r^λ converges to V^π. The TD(λ) return can also be written recursively as

r_t^{\lambda} = r_t + \gamma (1 - \lambda) V_t(x_{t+1}) + \gamma \lambda \, r_{t+1}^{\lambda} \qquad (2)

Then..."
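
The recursion in (2) lends itself to a single backwards pass over an episode. A small sketch under the assumption that the episode terminates after the last reward, so the terminal step contributes only r_t (variable names are illustrative):

```python
import numpy as np

def lambda_return(rewards, values, gamma=0.9, lam=0.7):
    """TD(lambda) returns via the recursion in (2), computed backwards.

    rewards[t] is r_t; values[t] is V(x_{t+1}), the current estimate for the
    state reached after step t. Returns the array of r_t^lambda for all t.
    """
    g = 0.0
    out = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        if t == len(rewards) - 1:
            g = rewards[t]  # terminal step: no successor value
        else:
            g = rewards[t] + gamma * ((1 - lam) * values[t] + lam * g)
        out[t] = g
    return out
```

Setting lam=1 recovers the full Monte Carlo return, and lam=0 recovers the one-step TD target r_t + γV(x_{t+1}), matching the two extremes of the weighted average described above.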

18 | On reinforcement learning of control actions in noisy and non-Markovian domains - Pendrith - 1994
Context: "...94) that the overall performance of Q(λ)-learning, including sarsa, shows less sensitivity to the choice of training parameters and exhibits more robust behavior than standard Q-learning. See also (Pendrith 1994). Experiments involving both Markovian and non-Markovian tasks, whose details we omit here, were carried out to validate the efficacy of the Q(λ)-learning algorithm. The results showed that Q(λ)-learni..."

15 | On step-size and bias in temporal-difference learning - Sutton, Singh - 1994
Context: "...trade-off between bias and variance. Sutton's empirical demonstration (Sutton, 1988) favors intermediate values of λ that are closer to 0. More recent analysis (Sutton & Singh, 1994) suggests that in certain prediction tasks near-optimal performance can be achieved by setting λ at each time step to the transition probability of the immediately preceding transition. For further ..."

13 | Efficient Dynamic Programming-Based Learning for Control - Peng - 1993
Context: "...ifficulty is that changes in Q at each time step may affect r^λ, which will in turn affect Q, and so on. However, these effects may not be significant for small α since they are proportional to α² (Peng 1993). At each time step, the Q(λ)-learning algorithm loops through a set of state-action pairs which grows linearly with time. In the worst case, this set could be the entire state-action space. However, ..."

9 | Fast and efficient reinforcement learning with truncated temporal differences - Cichosz, Mulawka - 1995 |
