## Problem Solving With Reinforcement Learning (1995)

### Download Links

- [svr-www.eng.cam.ac.uk]
- [cdn.preterhuman.net]
- DBLP

### Other Repositories/Bibliography

Citations: 47 (0 self)

### BibTeX

    @MISC{Rummery95problemsolving,
      author = {Gavin Adrian Rummery},
      title  = {Problem Solving With Reinforcement Learning},
      year   = {1995}
    }

### Abstract

This dissertation is submitted for consideration for the degree of Doctor of Philosophy at the University of Cambridge. Summary: This thesis is concerned with practical issues surrounding the application of reinforcement learning techniques to tasks that take place in high-dimensional continuous state-space environments. In particular, the extension of on-line updating methods is considered, where the term implies systems that learn as each experience arrives, rather than storing the experiences for use in a separate off-line learning phase. Firstly, the use of alternative update rules in place of standard Q-learning (Watkins 1989) is examined to provide faster convergence rates. Secondly, the use of multi-layer perceptron (MLP) neural networks (Rumelhart, Hinton and Williams 1986) is investigated to provide suitable generalising function approximators. Finally, consideration is given to the combination of Adaptive Heuristic Critic (AHC) methods and Q-learning to produce systems combining the benefits of real-valued actions and discrete switching.

### Citations

2686 |
A robust layered control system for a mobile robot
- Brooks
- 1986
Citation Context ...ormation to decide on its next action. This process is repeated until the robot reaches the goal. Knowledge based systems (Agre and Chapman 1987, Schoppers 1987) and Brooks' subsumption architecture (Brooks 1986) rely on previously defined rules and behaviours to decide on the appropriate actions at each step. This means that the designer must take into account the dynamics of the robot in deciding on the rul...

2626 |
Dynamic Programming
- Bellman
- 1957
Citation Context ...and sufficient condition for a value function to be optimal is that, for each state x_i ∈ X, V(x_i) = max_{a∈A} Σ_{x_j∈X} P(x_j | x_i, a) [r(x_i, x_j) + γV(x_j)] (1.5). This is called Bellman's Optimality Equation (Bellman 1957). This equation forms the basis for reinforcement learning algorithms that make use of the principles of dynamic programming (Ross 1983, Bertsekas 1987), as it can be used to drive the learning of im...
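The optimality equation in this excerpt can be turned directly into a dynamic-programming backup. Below is a minimal value-iteration sketch; the two-state MDP, its rewards, and the discount factor γ = 0.9 are invented here purely for illustration and do not come from the thesis.

```python
# Value iteration: repeatedly apply the Bellman optimality backup
#   V(x_i) = max_a sum_{x_j} P(x_j | x_i, a) * (r(x_i, x_j) + gamma * V(x_j))
# The two-state MDP below (transitions, rewards, gamma) is a made-up example.
P = {  # P[state][action] -> list of (next_state, probability)
    0: {"stay": [(0, 1.0)], "go": [(1, 0.9), (0, 0.1)]},
    1: {"stay": [(1, 1.0)], "go": [(0, 1.0)]},
}
r = {(0, 0): 0.0, (0, 1): 1.0, (1, 1): 2.0, (1, 0): 0.0}  # r[(x_i, x_j)]
gamma = 0.9

V = {0: 0.0, 1: 0.0}
for _ in range(200):  # iterate the backup until numerically converged
    V = {
        i: max(
            sum(p * (r[(i, j)] + gamma * V[j]) for j, p in outcomes)
            for outcomes in P[i].values()
        )
        for i in V
    }
print(V)  # V[1] -> 2 / (1 - 0.9) = 20.0
```

Because the backup is a contraction with factor γ, repeated application converges to the unique fixed point of equation (1.5).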

1327 |
Learning from Delayed Rewards
- Watkins
- 1989
Citation Context ...that learn as each experience arrives, rather than storing the experiences for use in a separate off-line learning phase. Firstly, the use of alternative update rules in place of standard Q-learning (Watkins 1989) is examined to provide faster convergence rates. Secondly, the use of multi-layer perceptron (MLP) neural networks (Rumelhart, Hinton and Williams 1986) is investigated to provide suitable generalis...
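The standard one-step Q-learning rule this excerpt refers to updates a table of action values from each observed transition. A minimal tabular sketch; the four-state corridor task and the parameter values (α = 0.5, γ = 0.9, ε = 0.1) are illustrative assumptions, not taken from the thesis.

```python
import random

# One-step Q-learning (Watkins 1989):
#   Q(x, a) <- Q(x, a) + alpha * (r + gamma * max_b Q(x', b) - Q(x, a))
# Corridor of states 0..3, actions move -1/+1, reward 1 on reaching state 3.
random.seed(0)
N, GOAL = 4, 3
ACTIONS = (-1, +1)
alpha, gamma, eps = 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}

for _ in range(500):  # training episodes
    s = 0
    while s != GOAL:
        # epsilon-greedy action selection
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda b: Q[(s, b)])
        s2 = min(max(s + a, 0), N - 1)
        r = 1.0 if s2 == GOAL else 0.0
        # terminal states contribute no bootstrapped value
        target = r if s2 == GOAL else r + gamma * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

greedy = [max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(GOAL)]
print(greedy)  # learned greedy policy: move right towards the goal
```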

1233 | Multilayer Feed Forward Networks Are Universal Approximators - Hornik, Stinchcombe, et al. - 1989 |

1231 | Learning to predict by the methods of temporal differences. Machine Learning 3:9--44 - Sutton - 1988 |

913 |
Real-Time Obstacle Avoidance for Manipulators and Mobile Robots
- Khatib
- 1986
Citation Context ...lt engineering problems. Robot navigation tasks are a popular topic in the AI and control literature and many solutions have been proposed using methods developed in these fields (Kant and Zucker 1986, Khatib 1986, Barraquand and Latombe 1991, Agre and Chapman 1987, Schoppers 1987, Ram and Santamaria 1993, Zhu 1991). However, it is shown here that a reinforcement learning system can train a controller which can ...

846 |
Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems
- Cybenko
- 1989
Citation Context ...ning to its use, which are briefly explored in the following sections. ... 3.2.2 Layers: It has been shown by several authors (Hornik, Stinchcombe and White 1989, Cybenko 1989, Funahashi 1989) that one hidden layer is all that is required for an MLP to approximate an arbitrary function. However, the number of hidden units actually needed in this hidden layer could be huge...

608 | Bayesian Learning for Neural Networks
- Neal
- 1996
Citation Context ...being learnt. Steps in this direction have been made by considering Bayesian methods to select from the state-space of possible network weights on the basis of probability distributions (Mackay 1991, Neal 1995). However, these methods have relied on second order calculations based on data gathered from fixed training sets, which are not suitable for use with on-line reinforcement methods. 3.2.4 Choice of Per...

578 |
Pengi: An implementation of a theory of activity
- Agre, Chapman
- 1987
Citation Context ...asks are a popular topic in the AI and control literature and many solutions have been proposed using methods developed in these fields (Kant and Zucker 1986, Khatib 1986, Barraquand and Latombe 1991, Agre and Chapman 1987, Schoppers 1987, Ram and Santamaria 1993, Zhu 1991). However, it is shown here that a reinforcement learning system can train a controller which can deal with general obstacle layouts, new environments,...

532 | Learning to act using real-time dynamic programming
- Barto, Bradtke, et al.
- 1995
Citation Context ...rning a world model can either be treated as a separate task (system identification), or can be performed simultaneously with learning the value function (as in adaptive real-time dynamic programming (Barto et al. 1993)). Once a world model has been learnt, it can also be used to perform value function updates off-line (Sutton 1990, Peng and Williams 1993) or for planning ahead (Thrun and Moller 1992). Learning a mo...

476 | Neuron-like Adaptive Elements That Can Solve Difficult Learning Control Problems - Barto, Sutton, et al. - 1983 |

417 | A learning algorithm for continually running fully recurrent neural networks
- Williams, Zipser
- 1989
Citation Context ...ctions are possible, including direct links from the inputs to the output units (bypassing the hidden layer units), or feeding back outputs to previous layers to give recurrent networks (Werbos 1990, Williams and Zipser 1989). The exact function performed by the network is dependent on the current weight values at each unit, and it is by changing these values that the network can be trained to produce a particular input-o...

373 |
Dynamic Programming: Deterministic and Stochastic Models
- Bertsekas
- 1987
Citation Context ...(1.5) This is called Bellman's Optimality Equation (Bellman 1957). This equation forms the basis for reinforcement learning algorithms that make use of the principles of dynamic programming (Ross 1983, Bertsekas 1987), as it can be used to drive the learning of improved policies. The reinforcement learning algorithms considered in this section are applicable to systems where the state transition probabilities are...

368 | Practical Issues in Temporal Difference Learning - Tesauro - 1992 |

340 |
On the approximate realization of continuous mappings by neural networks
- Funahashi
- 1989
Citation Context ...e, which are briefly explored in the following sections. ... 3.2.2 Layers: It has been shown by several authors (Hornik, Stinchcombe and White 1989, Cybenko 1989, Funahashi 1989) that one hidden layer is all that is required for an MLP to approximate an arbitrary function. However, the number of hidden units actually needed in this hidden layer could be huge. For this reaso...

335 |
Robot motion planning: A distributed representation approach
- Barraquand, Latombe
- 1991
Citation Context ...g problems. Robot navigation tasks are a popular topic in the AI and control literature and many solutions have been proposed using methods developed in these fields (Kant and Zucker 1986, Khatib 1986, Barraquand and Latombe 1991, Agre and Chapman 1987, Schoppers 1987, Ram and Santamaria 1993, Zhu 1991). However, it is shown here that a reinforcement learning system can train a controller which can deal with general obstacle lay...

326 | Universal Plans for Reactive Robots in Unpredictable Environments
- Schoppers
- 1987
Citation Context ...c in the AI and control literature and many solutions have been proposed using methods developed in these fields (Kant and Zucker 1986, Khatib 1986, Barraquand and Latombe 1991, Agre and Chapman 1987, Schoppers 1987, Ram and Santamaria 1993, Zhu 1991). However, it is shown here that a reinforcement learning system can train a controller which can deal with general obstacle layouts, new environments, and moving goal...

305 |
Learning in embedded Systems
- Kaelbling
- 1990
Citation Context ...ation method used by the system is fundamental in determining the rate at which the system will gather information and thus improve its action policy. Various methods have been suggested (Thrun 1992, Kaelbling 1990) which are more sophisticated than functions based only on the current prediction levels like the Boltzmann distribution. However, most of these methods rely on explicitly storing information at each...
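The Boltzmann distribution referred to in this excerpt converts current action-value predictions into exploration probabilities via a temperature parameter. A small sketch; the action values and temperatures are invented for illustration.

```python
import math

def boltzmann(q_values, T):
    """Softmax action probabilities: P(a) proportional to exp(Q(a) / T)."""
    # subtract the max before exponentiating for numerical stability
    m = max(q_values)
    exps = [math.exp((q - m) / T) for q in q_values]
    z = sum(exps)
    return [x / z for x in exps]

q = [1.0, 2.0, 0.5]          # hypothetical action-value predictions
hot = boltzmann(q, T=10.0)   # high temperature: near-uniform exploration
cold = boltzmann(q, T=0.1)   # low temperature: near-greedy exploitation
print(hot)
print(cold)
```

Annealing T from high to low shifts the agent smoothly from exploration to exploitation, which is exactly the "function based only on the current prediction levels" that the more sophisticated methods cited here try to improve on.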

300 | A scaled conjugate gradient algorithm for fast supervised learning - Møller - 1993 |

289 | On-line Q-learning using connectionist systems - Rummery, Niranjan - 1994 |

279 |
Escaping brittleness: The possibilities of general-purpose learning algorithms applied to parallel rule-based systems
- Holland
- 1986
Citation Context ...u need to know State-Action-Reward-State-Action before performing an update (Singh and Sutton 1994). 4. Wilson (1994) noted the similarities between Q-learning and the bucket-brigade classifier system (Holland 1986). Using this interpretation, the bucket-brigade algorithm is equivalent to a TD(0) form of Modified Q-Learning. 5. In the specific form defined in section 2.1.2...

276 | Self-improving reactive agents based on reinforcement learning
- Lin
- 1992
Citation Context ...tinuous state inputs is the multi-layer perceptron (MLP) or back-propagation neural network. Although the use of neural networks in reinforcement problems has been examined before (Lin 1992, Sutton 1988, Anderson 1993, Thrun 1994, Tesauro 1992, Boyan 1992), the use of on-line training methods for performing Q-learning updates with λ > 0 has not been examined previously. These allow tempora...

274 | Backpropagation through time: what it does and how to do it - Werbos - 1990 |

248 |
Introduction to Stochastic Dynamic Programming
- Ross
- 1983
Citation Context ...+ γV(x_j) (1.5). This is called Bellman's Optimality Equation (Bellman 1957). This equation forms the basis for reinforcement learning algorithms that make use of the principles of dynamic programming (Ross 1983, Bertsekas 1987), as it can be used to drive the learning of improved policies. The reinforcement learning algorithms considered in this section are applicable to systems where the state transition p...

247 |
Temporal Credit Assignment in Reinforcement Learning (Doctoral Dissertation)
- Sutton
- 1984
Citation Context ...policy iteration is to update the value function and policy simultaneously, which results in the Adaptive Heuristic Critic class of methods. The original AHC system (Barto, Sutton and Anderson 1983, Sutton 1984) consists of two elements: the ASE (Associative Search Element), which chooses actions from a stochastic policy, and the ACE (Adaptive Critic Element), which learns the value function. These two elements are now more gene...

236 |
Learning Automata: An Introduction
- Narendra, Thathachar
- 1989
Citation Context ...vour producing actions that receive the best internal payoffs, the idea is that the actor should converge to producing the optimal policy. 5.1.1 Stochastic Hill-climbing: The stochastic learning automaton (Narendra and Thathachar 1989) is a general term defined to represent an automaton that generates actions randomly from a probability distribution, and which receives reinforcement signals to adjust these probabilities. Williams (1...

209 | On the convergence of stochastic iterative dynamic programming algorithms
- Jaakkola, Jordan, et al.
- 1994
Citation Context ...ernative Q-Learning Update Rules: The standard one-step Q-learning algorithm as introduced by Watkins (1989) was presented in the last chapter. This has been shown to converge (Watkins and Dayan 1992, Jaakkola et al. 1993) for a system operating in a fixed Markovian environment. However, these proofs give no indication as to the convergence rate. In fact, they require that every state is visited infinitely often, which mea...

194 |
Pruning algorithms - A survey
- Reed
- 1993
Citation Context ...aining times (both in terms of processing and number of updates required for convergence). Rather than fix the number of units in advance, methods of adjusting the number of hidden units automatically (Reed 1993, Hassibi and Stork 1993, Lee, Song and Kim 1991) have been suggested, which involve trying to remove units which perform no useful function. Ultimately, the requirement is for more sophisticated lear...

189 |
Reinforcement Learning for Robots Using Neural Networks
- Lin
- 1993
Citation Context ...ethods would have been impractical. In the remainder of this chapter, methods are examined for applying Q-learning updates utilising MLPs as function approximators. Unlike previous work in this area (Lin 1993b), the algorithms presented here can be applied on-line by the learning system. This results in a reinforcement learning system which fulfils the following goals: on-line training for autonomous syste...

189 | Reinforcement learning with replacing eligibility traces. Machine Learning 22:123--158
- Singh, Sutton
- 1996
Citation Context ...993, Tsitsiklis 1994) can be modified to provide these bounds is an open question. 3. Though Rich Sutton suggests SARSA, as you need to know State-Action-Reward-State-Action before performing an update (Singh and Sutton 1994). 4. Wilson (1994) noted the similarities between Q-learning and the bucket-brigade classifier system (Holland 1986). Using this interpretation, the bucket-brigade algorithm is equivalent to a TD(0) form ...
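SARSA, as named in this excerpt, needs the tuple (State, Action, Reward, next State, next Action) for each update, bootstrapping from the action actually selected rather than from the greedy maximum. A minimal sketch of one update; the states, actions, and values below are invented for illustration.

```python
# SARSA update: Q(s, a) <- Q(s, a) + alpha * (r + gamma * Q(s', a') - Q(s, a))
# Unlike one-step Q-learning, it bootstraps from the action a' actually
# chosen in s', not from max_b Q(s', b). All numbers here are made up.

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
    return Q

Q = {("A", "left"): 0.0, ("A", "right"): 0.0,
     ("B", "left"): 1.0, ("B", "right"): 5.0}

# The agent moved A --right--> B (reward 0), then chose "left" in B.
Q = sarsa_update(Q, "A", "right", 0.0, "B", "left")
print(Q[("A", "right")])  # uses Q(B, left) = 1, not Q(B, right) = 5
```

This on-policy target is why SARSA's estimates track the behaviour policy, including its exploratory actions, whereas Q-learning's max operator evaluates the greedy policy.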

180 | Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks - Jacobs, Jordan, et al. - 1991 |

170 | A resource-allocating network for function interpolation
- Platt
- 1991
Citation Context ...can be achieved using pure real-valued AHC learning. This is because the system tends to use either Q-learning or AHC learning to construct its final policy, rather than a mixture of both. [Footnote 4: This is similar to the idea used in Anderson (1993) to learn a Q-function using a resource-allocation network (RAN) (Platt 1991).] It was explained that this was due to each AHC elem...

161 | Transfer of Learning by Composing Solutions of Elemental Sequential Tasks
- Singh
- 1992
Citation Context ...e hierarchical approach. This would involve separate Q-learning modules being taught to deal with different tasks, and then training the system to choose between them based on the situation (Lin 1993a, Singh 1992). ... 4.5.2 Heuristic Parameters: There are several parameters that must be set in order to use the reinforcement learning methods presented in this chapter. α, γ, λ, and T must all be s...

153 |
Learning to predict by the methods of temporal di erences
- Sutton
- 1988
Citation Context ...ds and Q-learning to produce systems combining the benefits of real-valued actions and discrete switching. The different update rules examined are based on Q-learning combined with the TD(λ) algorithm (Sutton 1988). Several new algorithms, including Modified Q-Learning and Summation Q-Learning, are examined, as well as alternatives such as Q(λ) (Peng and Williams 1994). In addition, algorithms are presented for a...
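TD(λ), mentioned in this excerpt, spreads each temporal-difference error backwards over recently visited states using eligibility traces. A minimal tabular sketch for value prediction; the three-state episode and all parameter values are invented for illustration.

```python
# Tabular TD(lambda) with accumulating eligibility traces (Sutton 1988):
#   delta = r + gamma * V(x') - V(x);  e(x) += 1
#   for every state y: V(y) += alpha * delta * e(y);  e(y) *= gamma * lam
alpha, gamma, lam = 0.5, 0.9, 0.8
V = {0: 0.0, 1: 0.0, 2: 0.0}
e = {s: 0.0 for s in V}

# One episode 0 -> 1 -> 2, reward 1 on the final step (None = terminal).
episode = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, None)]
for s, r, s2 in episode:
    delta = r + gamma * (V[s2] if s2 is not None else 0.0) - V[s]
    e[s] += 1.0
    for y in V:
        V[y] += alpha * delta * e[y]  # earlier states get exponentially less credit
        e[y] *= gamma * lam
print(V)
```

With λ = 0 only the most recent state is updated (one-step TD), while λ = 1 approaches a Monte-Carlo return; intermediate values trade bias against variance.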

151 | Bayesian methods for adaptive models
- MacKay
- 1991
Citation Context ...the function being learnt. Steps in this direction have been made by considering Bayesian methods to select from the state-space of possible network weights on the basis of probability distributions (Mackay 1991, Neal 1995). However, these methods have relied on second order calculations based on data gathered from fixed training sets, which are not suitable for use with on-line reinforcement methods. 3.2.4 Ch...

122 | Efficient exploration in reinforcement learning
- Thrun
- 1992
Citation Context ...s The exploration method used by the system is fundamental in determining the rate at which the system will gather information and thus improve its action policy. Various methods have been suggested (Thrun 1992, Kaelbling 1990) which are more sophisticated than functions based only on the current prediction levels like the Boltzmann distribution. However, most of these methods rely on explicitly storing inf...

106 | ZCS: a zeroth level classifier system - Wilson - 1994 |

94 | Efficient Learning and Planning Within the Dyna Framework
- Peng, Williams
- 1993
Citation Context ...the value function (as in adaptive real-time dynamic programming (Barto et al. 1993)). Once a world model has been learnt, it can also be used to perform value function updates off-line (Sutton 1990, Peng and Williams 1993) or for planning ahead (Thrun and Moller 1992). Learning a model from experience is straightforward in a Markovian domain. The basic method is to keep counters of the individual state transitions th...
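The counter-based model learning described in this excerpt can be sketched directly: count observed (state, action) → next-state transitions and normalise to estimate the transition probabilities. The experience stream below is invented for illustration.

```python
from collections import Counter, defaultdict

# Keep counters of individual state transitions and derive a maximum-
# likelihood estimate of P(s' | s, a), as the excerpt describes.
counts = defaultdict(Counter)

# Hypothetical experience: (state, action, next_state) triples.
experience = [("s0", "go", "s1"), ("s0", "go", "s1"), ("s0", "go", "s0"),
              ("s0", "go", "s1"), ("s1", "go", "s1")]
for s, a, s2 in experience:
    counts[(s, a)][s2] += 1

def p_hat(s, a, s2):
    """Estimated P(s2 | s, a) from the transition counters."""
    total = sum(counts[(s, a)].values())
    return counts[(s, a)][s2] / total if total else 0.0

print(p_hat("s0", "go", "s1"))  # 3 of the 4 observed transitions
```

Such an estimated model can then drive off-line value-function updates or planning of the kind cited here (Dyna-style replay, look-ahead search).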

89 | Incremental multi-step q-learning
- Peng, Williams
- 1994
Citation Context ...e based on Q-learning combined with the TD(λ) algorithm (Sutton 1988). Several new algorithms, including Modified Q-Learning and Summation Q-Learning, are examined, as well as alternatives such as Q(λ) (Peng and Williams 1994). In addition, algorithms are presented for applying these Q-learning updates to train MLPs on-line during trials, as opposed to the backward-replay method used by Lin (1993b) that requires waiting un...

82 | Advanced Supervised Learning in Multi-Layer Perceptrons: From Back Propagation to Adaptive Learning Algorithms - Riedmiller - 1994 |

68 |
Hierarchies of adaptive experts
- Jordan, Jacobs
- 1992
Citation Context ...has been suggested that this problem could be dealt with by splitting the state-space into regions with a different neural network to learn the function in each region (Jacobs, Jordan and Barto 1991, Jordan and Jacobs 1992). Gating networks are then used to select which networks to use in each region, or even provide a weighted sum of the outputs of the networks. In chapter 5 a similar idea is explored when the Q-AHC a...

66 | Learning to Control an Unstable System with Forward Modeling - Jordan, Jacobs - 1990 |

62 | Issues in using function approximation for reinforcement learning
- Thrun, Schwartz
- 1993
Citation Context ...being `seen' by earlier actions, but also means that states see a continual over-estimation of the payoffs available, since they are always trained on the maximum predicted action value at each step (Thrun and Schwartz 1993). However, in a connectionist system, generalisation occurs, which means that the results of bad exploratory actions will affect nearby states even if the eligibilities are zeroed. So, this mechanism ...

62 |
Practical issues in temporal difference learning
- Tesauro
- 1992
Citation Context ...(MLP) or back-propagation neural network. Although the use of neural networks in reinforcement problems has been examined before (Lin 1992, Sutton 1988, Anderson 1993, Thrun 1994, Tesauro 1992, Boyan 1992), the use of on-line training methods for performing Q-learning updates with λ > 0 has not been examined previously. These allow temporal difference methods to be applied during the trial as...

61 | The convergence of TD(λ) for general λ - Dayan - 1992 |

60 |
Active exploration in dynamic environments
- Thrun, Moller
- 1992
Citation Context ...ynamic programming (Barto et al. 1993)). Once a world model has been learnt, it can also be used to perform value function updates off-line (Sutton 1990, Peng and Williams 1993) or for planning ahead (Thrun and Moller 1992). Learning a model from experience is straightforward in a Markovian domain. The basic method is to keep counters of the individual state transitions that occur and hence calculate the transition pr...

55 | Reinforcement learning Applied to Linear Quadratic Regulation - Bradtke - 1993 |

52 | Modified Policy Iteration Algorithms for Discounted Markov Decision Problems - Puterman, Shin - 1978 |

51 |
Scaling Up Reinforcement Learning for Robot Control
- LIN
- 1993
Citation Context ...ethods would have been impractical. In the remainder of this chapter, methods are examined for applying Q-learning updates utilising MLPs as function approximators. Unlike previous work in this area (Lin 1993b), the algorithms presented here can be applied on-line by the learning system. This results in a reinforcement learning system which fulfils the following goals: on-line training for autonomous syste...

47 |
Optimal brain surgeon and general network pruning
- Hassibi, Stork, et al.
- 1993
Citation Context ...s (both in terms of processing and number of updates required for convergence). Rather than fix the number of units in advance, methods of adjusting the number of hidden units automatically (Reed 1993, Hassibi and Stork 1993, Lee, Song and Kim 1991) have been suggested, which involve trying to remove units which perform no useful function. Ultimately, the requirement is for more sophisticated learning algorithms that can...