## On-Line Q-Learning Using Connectionist Systems (1994)

Citations: 292 (1 self)

### BibTeX

@TECHREPORT{Rummery94on-lineq-learning,
  author      = {G. A. Rummery and M. Niranjan},
  title       = {On-Line Q-Learning Using Connectionist Systems},
  institution = {},
  year        = {1994}
}


### Abstract

Reinforcement learning algorithms are a powerful machine learning technique. However, much of the work on these algorithms has been developed with regard to discrete finite-state Markovian problems, which is too restrictive for many real-world environments. Therefore, it is desirable to extend these methods to high dimensional continuous state-spaces, which requires the use of function approximation to generalise the information learnt by the system. In this report, the use of back-propagation neural networks (Rumelhart, Hinton and Williams 1986) is considered in this context. We consider a number of different algorithms based around Q-Learning (Watkins 1989) combined with the Temporal Difference algorithm (Sutton 1988), including a new algorithm (Modified Connectionist Q-Learning), and Q(λ) (Peng and Williams 1994). In addition, we present algorithms for applying these updates on-line during trials, unlike backward replay used by Lin (1993) that requires waiting until the end of each t...
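The "Modified Connectionist Q-Learning" named in the abstract is the on-line update later known as SARSA: it bootstraps on the Q-value of the action actually selected next, rather than on the maximum over actions. A minimal tabular sketch of that update (the report itself trains a back-propagation network, and the chain environment below is a hypothetical illustration, not from the paper):

```python
import random

# Tabular sketch of the on-line (SARSA-style) update; the chain MDP
# and all constants here are illustrative assumptions.
random.seed(0)
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1
N_STATES, ACTIONS = 4, (0, 1)          # action 1 moves right, action 0 stays
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    """Reward 1.0 on reaching the terminal right-hand state."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else s
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

def policy(s):
    """Epsilon-greedy action selection."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

for episode in range(200):
    s = 0
    a = policy(s)
    while s != N_STATES - 1:
        s2, r = step(s, a)
        a2 = policy(s2)
        # On-line update: bootstraps on the action actually taken next (a2),
        # not on max_a Q -- the defining difference from standard Q-learning.
        Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s2, a2)] - Q[(s, a)])
        s, a = s2, a2

print(round(Q[(2, 1)], 2))  # learned value of moving right from the penultimate state
```

With a network in place of the table, the same temporal-difference error would instead drive a gradient step on the network weights.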

### Citations

1237 | Learning to predict by the methods of temporal differences
- Sutton
- 1988
Citation Context: ...umelhart, Hinton and Williams 1986) is considered in this context. We consider a number of different algorithms based around Q-Learning (Watkins 1989) combined with the Temporal Difference algorithm (Sutton 1988), including a new algorithm (Modified Connectionist Q-Learning), and Q(λ) (Peng and Williams 1994). In addition, we present algorithms for applying these updates on-line during trials, unlike backwar...

276 | Self-improving reactive agents based on reinforcement learning
- Lin
- 1992
Citation Context: .... These methods have been demonstrated on a realistic mobile robot problem, and have shown that on-line learning is in fact a more efficient method of performing updates than backward replay methods (Lin 1992), in terms both of storage requirements and sensitivity to training parameters. On-line learning also has the advantage that it could be used in continuously operating systems where no end of trial c...

189 | Reinforcement learning for robots using neural networks
- Lin
- 1992

122 | Efficient exploration in reinforcement learning
- Thrun
- 1992
Citation Context: ...system should only choose to perform a non-greedy action if it lacks confidence in the current greedy prediction. Although various methods have been suggested for use in discrete state-space systems (Thrun 1992), they are not generally applicable to systems using continuous function approximators. 3 Connectionist Q-Learning Learning the Q-function requires some method of storing the current predictions at e...

89 | Incremental multi-step Q-learning
- Peng, Williams
- 1996
Citation Context: ... of different algorithms based around Q-Learning (Watkins 1989) combined with the Temporal Difference algorithm (Sutton 1988), including a new algorithm (Modified Connectionist Q-Learning), and Q(λ) (Peng and Williams 1994). In addition, we present algorithms for applying these updates on-line during trials, unlike backward replay used by Lin (1993) that requires waiting until the end of each trial before updating can ...

63 | Issues in using function approximation for reinforcement learning
- Thrun, Schwartz
- 1993
Citation Context: ...ions from being 'seen' by earlier actions, but also mean that states see a continual overestimation of the payoffs available, as they are always trained on the maximum predicted Q-value at each step (Thrun and Schwartz 1993). However, in a connectionist system, generalisation occurs, which means that the effects of bad exploratory actions will be seen by nearby states even if λ is set to zero to try to prevent this, so ...
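The overestimation effect described in this context is easy to reproduce: when several actions have equal true value but noisy estimates, the maximum of the estimates is biased upwards, so targets built from max_a Q overshoot. A small sketch with illustrative numbers (not from the paper):

```python
import random

# All actions have the same true value (0.0); their estimates carry
# zero-mean uniform noise. Averaging max_a over many draws shows the
# maximum is biased upwards, so bootstrapping on it overestimates payoffs.
random.seed(0)
TRUE_VALUE, N_ACTIONS, NOISE, TRIALS = 0.0, 10, 0.5, 10000

bias = 0.0
for _ in range(TRIALS):
    noisy = [TRUE_VALUE + random.uniform(-NOISE, NOISE) for _ in range(N_ACTIONS)]
    bias += max(noisy) - TRUE_VALUE
bias /= TRIALS

print(round(bias, 2))  # clearly positive despite zero-mean noise
```

The bias grows with the number of actions and the noise level, which is why function-approximation error compounds with the max operator.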

8 | Implementation details of the TD(λ) procedure for the case of vector predictions and backpropagation
- Sutton
- 1989

3 | An Approach to Learning Robot Navigation
- Thrun
- 1994
Citation Context: ...l then, all state-action pairs must be stored, and then presented in a temporally backward order to propagate the prediction errors correctly. This is called backward replay. [Footnotes: In order to clarify the equations, Qt is used as a notational shorthand for Q(xt, at). This is also the algorithm used by Thrun in his work on connectionist Q-learning, e.g. (Thrun 1994).] In section 4 we present algorithms to implement...
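The backward replay scheme described in this context can be sketched in a few lines: store the trial's transitions, then apply temporal-difference updates in reverse time order, so each state's successor has already been corrected when the state itself is trained. A hypothetical tabular illustration (the report's version trains a network on stored state-action pairs):

```python
# Hypothetical tabular illustration of backward replay (Lin's scheme):
# store the whole trial, then train in temporally backward order so the
# final reward propagates through every state in a single pass.
ALPHA, GAMMA = 1.0, 0.9               # learning rate 1 gives a clean one-pass result
V = {s: 0.0 for s in range(4)}        # state 3 is terminal
trial = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 3)]  # (state, reward, next state)

for s, r, s_next in reversed(trial):
    # s_next was updated first, so its value already reflects the rest
    # of the trial; each prediction error propagates immediately.
    V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])

print(V)
```

Processed in forward order instead, the same updates would leave V[0] and V[1] untouched until later trials, which is why the on-line alternatives in the report rely on eligibility traces rather than stored trials.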

1 | The convergence of TD(λ) for general λ
- Dayan
- 1992