## Reinforcement learning: A survey (1996)

Venue: | Journal of Artificial Intelligence Research |

Citations: | 1314 - 22 self |

### BibTeX

@ARTICLE{Kaelbling96reinforcementlearning:,

author = {Leslie Pack Kaelbling and Michael L. Littman and Andrew W. Moore},

title = {Reinforcement learning: A survey},

journal = {Journal of Artificial Intelligence Research},

year = {1996},

volume = {4},

pages = {237--285}

}

### Abstract

This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
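
As a rough illustration of the trial-and-error loop, the exploration/exploitation trade-off, and learning from delayed reinforcement named in the abstract, the sketch below runs tabular Q-learning with epsilon-greedy action selection. The five-state chain environment, the function name `q_learning`, and every parameter value are illustrative assumptions made for this example; none of them is taken from the survey.

```python
import random


def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
               epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical deterministic chain.

    States are 0..n_states-1; action 0 moves left, action 1 moves right.
    The only reward (1.0) is delayed until the last state is reached.
    """
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[state][action]

    for _ in range(episodes):
        s = 0
        for _ in range(100):  # cap on episode length (assumed value)
            # Explore with probability epsilon (or on ties); else exploit.
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1

            # Environment step: deterministic move, reward only at the goal.
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s_next == goal else 0.0

            # One-step Q-learning update; the goal is terminal (value 0).
            future = 0.0 if s_next == goal else gamma * max(q[s_next])
            q[s][a] += alpha * (r + future - q[s][a])

            s = s_next
            if s == goal:
                break
    return q


if __name__ == "__main__":
    table = q_learning()
    for state, (left, right) in enumerate(table):
        print(state, round(left, 3), round(right, 3))
```

In this toy setting the learned values for moving right should end up above those for moving left in every non-terminal state, which is how the delayed reward at the end of the chain gets propagated back to earlier decisions.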

### Citations

2651 | Dynamic Programming - Bellman - 1957 |
Citation Context ...function of the current state and action. The model is Markov if the state transitions are independent of any previous environment states or agent actions. There are many good references to MDP models [10, 13, 48, 90]. Although general MDPs may have infinite (even uncountable) state and action spaces, we will only discuss methods for solving finite-state and finite-action problems. In section 6, we discuss methods for s... |

1837 | Genetic algorithms in search, optimization, and machine learning - Goldberg - 1989 |
Citation Context ...ed by a number of researchers [77, 62, 103]. It seems to work effectively on simple problems, but can suffer from convergence to local optima on more complex problems. Classifier Systems. Classifier systems [47, 41] were explicitly developed to solve problems with delayed reward, including those requiring short-term memory. The internal mechanism typically used to pass reward back through chains of decisions, cal... |

1708 | A theory of the learnable - Valiant - 1984 |
Citation Context ...orcement learning and conventional supervised learning. In the latter, expected future predictive accuracy or statistical efficiency are the prime concerns. For example, in the well-known PAC framework [127], there is a learning period during which mistakes do not count, then a performance period during which they do. The framework provides bounds on the necessary length of the learning period in order t... |

1469 | Theory of Linear and Integer Programming - Schrijver - 1986 |
Citation Context ...ittman et al., 1995b). Modified policy iteration (Puterman & Shin, 1978) seeks a trade-off between cheap and effective iterations and is preferred by some practitioners (Rust, 1996). Linear programming (Schrijver, 1986) is an extremely general problem, and MDPs can be solved by general-purpose linear-programming packages (Derman, 1970; D'Epenoux, 1963; Hoffman & Karp, 1966). An advantage of this approach is that com... |

1331 | Learning from delayed rewards - Watkins - 1989 |
Citation Context ...h the certainty-equivalent method [108], which is discussed in Section 5.1. 4.2 Q-learning. The work of the two components of AHC can be accomplished in a unified manner by Watkins' Q-learning algorithm [128, 129]. Q-learning is typically easier to implement. In order to understand Q-learning, we have to develop some additional notation. Let $Q^*(s, a)$ be the expected dis... |

1238 | Learning to predict by the methods of temporal differences - Sutton - 1988 |

626 | Parallel and Distributed Computation: Numerical Methods - Bertsekas, Tsitsiklis - 1989 |

617 | Some studies in machine learning using the game of checkers - Samuel - 1959 |
Citation Context ...ery general class of games [63] and many researchers have used reinforcement learning in these environments. One application, spectacularly far ahead of its time, was Samuel's checkers playing system [99]. This learned a value function represented by a linear function approximator, and employed a training scheme similar to the updates used in value iteration, temporal differences and Q-learning. More r... |

533 | Learning to act using real-time dynamic programming - Barto, Bradtke, et al. - 1993 |
Citation Context ...nal effort of Q-learning). 5.4 Other Model-Based Methods. Methods proposed for solving MDPs given a model can be used in the context of model-based methods as well. RTDP (real-time dynamic programming) [8] is another model-based method that uses Q-learning to concentrate computational effort on the areas of the state-space that the agent is most likely to occupy. It is specific to problems in which the age... |

522 | Dynamic Programming and Markov Processes - Howard - 1960 |
Citation Context ...function of the current state and action. The model is Markov if the state transitions are independent of any previous environment states or agent actions. There are many good references to MDP models [10, 13, 48, 90]. Although general MDPs may have infinite (even uncountable) state and action spaces, we will only discuss methods for solving finite-state and finite-action problems. In section 6, we discuss methods for s... |

499 | Markov Games as a Framework for Multi-agent Reinforcement Learning - Littman - 1994 |
Citation Context ...face of a fixed environment, but one of maximizing reward against an optimal adversary (minimax). Nonetheless, reinforcement-learning algorithms can be adapted to work for a very general class of games [63] and many researchers have used reinforcement learning in these environments. One application, spectacularly far ahead of its time, was Samuel's checkers playing system [99]. This learned a value func... |

488 | Parallel distributed processing: explorations in the microstructure of cognition, foundations - Rumelhart, McClelland - 1986 |
Citation Context ...n be handled using any of the wide variety of function-approximation techniques for supervised learning that support noisy training examples. Popular techniques include various neural-network methods (Rumelhart & McClelland, 1986), fuzzy logic (Berenji, 1991; Lee, 1991), CMAC (Albus, 1981), and local memory-based methods (Moore, Atkeson, & Schaal, 1995), such as generalizations of nearest neighbor methods. Other mappings, esp... |

480 | Integrated architectures for learning, planning, and reacting based on approximating dynamic programming - Sutton - 1990 |
Citation Context ...de arbitrarily small. Techniques like this have been used in several reinforcement learning algorithms including the interval exploration method [52] (described shortly), the exploration bonus in Dyna [116], curiosity-driven exploration [102], and the exploration mechanism in prioritized sweeping [83]. 2.2.2 Randomized Strategies. Another simple exploration strategy i... |

479 | Neuronlike adaptive elements that can solve difficult learning control problems - Barto, Sutton, et al. - 1983 |
Citation Context ...oral-difference learning strategies for the discounted infinite-horizon model. 4.1 Adaptive Heuristic Critic and TD($\lambda$). The adaptive heuristic critic algorithm is an adaptive version of policy iteration [9] in which the value-function computation is no longer implemented by solving a set of linear equations, but is instead computed by an algorithm called TD(0). A block diagram for this approach is given i... |

466 | Dynamic Programming and Optimal Control (Athena Scientific) - Bertsekas - 1995 |
Citation Context ...un average reward: $\lim_{h \to \infty} E\left(\frac{1}{h} \sum_{t=0}^{h} r_t\right)$. Such a policy is referred to as a gain optimal policy; it can be seen as the limiting case of the infinite-horizon discounted model as the discount factor approaches 1 [14]. One problem with this criterion is that there is no way to distinguish between two policies, one of which gains a large amount of reward in the initial phases and the other of which does not. Reward... |

393 | Empirical Model-Building and Response Surfaces - Box, Draper - 1987 |
Citation Context ...associated statistics or with nonparametric methods. The method works very well in empirical trials. It is also related to a certain class of statistical techniques known as experiment design methods [17], which are used for comparing multiple treatments (for example, fertilizers or drugs) to determine which treatment (if any) is best in as small a set of experiments as possible. 2.3 More General Prob... |

378 | Temporal difference learning and TD-gammon - Tesauro - 1995 |

377 | Dynamic programming: deterministic and stochastic models - Bertsekas - 1987 |
Citation Context ...function of the current state and action. The model is Markov if the state transitions are independent of any previous environment states or agent actions. There are many good references to MDP models [10, 13, 48, 90]. Although general MDPs may have infinite (even uncountable) state and action spaces, we will only discuss methods for solving finite-state and finite-action problems. In section 6, we discuss methods for s... |

368 | Practical Issues in Temporal Difference Learning - Tesauro - 1992 |

356 | Generalization in reinforcement learning: Successful examples using sparse coarse coding - Sutton - 1996 |
Citation Context ...nd Moore [18] report that their counter-examples can be made to work with problem-specific hand-tuning despite the unreliability of untuned algorithms that provably converge in discrete domains. Sutton [113] shows how modified versions of Boyan and Moore's examples can converge successfully. An open question is whether general principles, ideally supported by theory, can help us understand when value func... |

337 | Automatic programming of behavior-based robots using reinforcement learning - Mahadevan, Connell - 1991 |
Citation Context ... to linear control policies and locally linear transitions was used to improve the policy. The form of dynamic programming is known as linear-quadratic-regulator design [97]. 2. Mahadevan and Connell [71] discuss a task in which a mobile robot pushes large boxes for extended periods of time. Box-pushing is a well-known difficult robotics problem, characterized by immense uncertainty in the results of ac... |

322 | Simple statistical gradient-following algorithms for connectionist reinforcement learning - Williams - 1992 |
Citation Context ...ned in a standard supervised mode to estimate r as a function of the input state s. Variations of this approach have been used in a variety of applications [4, 9, 61, 114]. REINFORCE Algorithms. Williams [131, 132] studied the problem of choosing actions to maximize immediate reward. He identified a broad class of update rules that perform gradient descent on the expected reward and showed how to integrate these r... |

316 | Prioritized sweeping: reinforcement learning with less data and less time - Moore, Atkeson - 1994 |
Citation Context ...thms including the interval exploration method [52] (described shortly), the exploration bonus in Dyna [116], curiosity-driven exploration [102], and the exploration mechanism in prioritized sweeping [83]. 2.2.2 Randomized Strategies. Another simple exploration strategy is to take the action with the best estimated expected reward by default, but with probability p,... |

307 | Learning in Embedded Systems - Kaelbling - 1990 |
Citation Context ...mal but unlucky action, but the risk of this can be made arbitrarily small. Techniques like this have been used in several reinforcement learning algorithms including the interval exploration method (Kaelbling, 1993b) (described shortly), the exploration bonus in Dyna (Sutton, 1990), curiosity-driven exploration (Schmidhuber, 1991a), and the exploration mechanism in prioritized sweeping (Moore & Atkeson, 1993). ... |

292 | On-line Q-learning using connectionist systems - Rummery, Niranjan - 1994 |
Citation Context ...pects of the experiments: 1. Small changes to the task specifications. 2. A very different kind of function approximator (CMAC [2]) that has weak generalization. 3. A different learning algorithm: SARSA [95] instead of value iteration. 4. A different training regime. Boyan and Moore sampled states uniformly in state space, whereas Sutton's method sampled along empirical trajectories. There are intuitive r... |

287 | Classifier fitness based on accuracy - Wilson - 1995 |

282 | Improving elevator performance using reinforcement learning - Crites, Barto - 1996 |
Citation Context ...s of the underlying sensors. The performance of the Q-learned policies was almost as good as a simple hand-crafted controller for the job. 4. Q-learning has been used in an elevator dispatching task [29]. The problem, which has been implemented in simulation only at this stage, involved four elevators servicing ten floors. The objective was to minimize the average squared wait time for passengers, disco... |

279 | Reinforcement learning with selective perception and hidden state - McCallum - 1996 |
Citation Context ...he Q function in the presence of an overwhelming number of irrelevant, noisy state attributes. It outperformed Q-learning with backpropagation in a simple video-game environment and was used by McCallum [74] (in conjunction with other techniques for dealing with partial observability) to learn behaviors in a complex driving-simulator. It cannot, however, acquire partitions in which attributes are only si... |

276 | Acting optimally in partially observable stochastic domains - Cassandra, Kaelbling, et al. - 1994 |
Citation Context ...nother strategy consists of using hidden Markov model (HMM) techniques to learn a model of the environment, including the hidden state, then to use that model to construct a perfect memory controller [20, 67, 79]. Chrisman [22] showed how the forward-backward algorithm for learning HMMs could be adapted to learning POMDPs. He, and later McCallum [75], also gave heuristic state-splitting rules to attempt to le... |

269 | A New Approach to Manipulator Control: The Cerebellar Model Articulation Controller (CMAC) - Albus - 1974 |
Citation Context ...tive experiments with Boyan and Moore's counterexamples, he changes four aspects of the experiments: 1. Small changes to the task specifications. 2. A very different kind of function approximator (CMAC [2]) that has weak generalization. 3. A different learning algorithm: SARSA [95] instead of value iteration. 4. A different training regime. Boyan and Moore sampled states uniformly in state space, whereas... |

260 | Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting - Cleveland, Devlin - 1988 |
Citation Context ...iency at mere tens of hits. The juggling robot learned a world model from experience, which was generalized to unvisited states by a function approximation scheme known as locally weighted regression [25, 82]. Figure 11: Schaal and Atkeson's devil-sticking robot. The tapered stick is hit alternately by each of the two hand sticks. The task is to keep the devil stick from falling for as many hits as pos... |

255 | Feudal reinforcement learning - Dayan, Hinton - 1993 |
Citation Context ...is approach, first training the behaviors and then training the gating function. Many of the other hierarchical learning methods can be cast in this framework. 6.3.1 Feudal Q-learning. Feudal Q-learning [31, 128] involves a hierarchy of learning modules. In the simplest case, there is a high-level master and a low-level slave. The master receives reinforcement from the external environment. Its actions consis... |

253 | Generalization in reinforcement learning: Safely approximating the value function - Boyan, Moore - 1995 |
Citation Context ...n approximator is used to represent the value function by mapping a state description to a value. Many researchers have experimented with this approach: Boyan and Moore [18] used local memory-based methods in conjunction with value iteration; Lin [59] used backpropagation networks for Q-learning; Watkins [128] used CMAC for Q-learning; Tesauro [118, 120] used backpropagati... |

248 | Temporal credit assignment in reinforcement learning - Sutton - 1984 |
Citation Context ...xt of a delayed reinforcement, for instance, as the RL component in the AHC architecture described in Section 4.1. They can also be generalized to real-valued reward through reward comparison methods [114]. CRBP. The complementary reinforcement backpropagation algorithm [1] (CRBP) consists of a feed-forward network mapping an encoding of the state to an encoding of the action. The action is determined p... |

245 | Neural network perception for mobile robotic guidance - Pomerleau - 1992 |
Citation Context ...r real robots, this requires perceptual abilities that are not yet available. But another strategy is to have a human supply appropriate motor commands to a robot through a joystick or steering wheel [89]. problem decomposition: Decomposing a huge learning problem into a collection of smaller ones, and providing useful reinforcement signals for the subproblems is a very powerful technique for biasing ... |

239 | Residual Algorithms: Reinforcement learning with Function Approximation - Baird - 1995 |
Citation Context ...on. Several recent results [42, 126] show how the appropriate choice of function approximator can guarantee convergence, though not necessarily to the optimal values. Baird's residual gradient technique [6] provides guaranteed convergence to locally optimal solutions. Perhaps the gloominess of these counter-examples is misplaced. Boyan and Moore [18] report that their counter-examples can be made to wor... |

236 | Learning Automata: an introduction - Narendra, Thathachar - 1989 |

232 | Learning policies for partially observable environments: Scaling up - Littman, Cassandra, et al. - 1995 |
Citation Context ...s representation as a piecewise-linear and convex function over the belief space. This method is computationally intractable, but may serve as inspiration for methods that make further approximations [20, 65]. 8. Reinforcement Learning Applications. One reason that reinforcement learning is popular is that it serves as a theoretical tool for studying the principles of agents learning to act. But it is unsu... |

228 | TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation 6(2):215-219 - Tesauro - 1994 |
Citation Context ...value function represented by a linear function approximator, and employed a training scheme similar to the updates used in value iteration, temporal differences and Q-learning. More recently, Tesauro [118, 119, 120] applied the temporal difference algorithm to backgammon. Backgammon has approximately $10^{20}$ states, making table-based reinforcement learning impossible. Instead, Tesauro used a backpropagation-based t... |

224 | The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces - Moore, Atkeson - 1995 |
Citation Context ... high-resolution arrays would have been impractical. It has the disadvantage of requiring a guess at an initially valid trajectory through state-space. PartiGame Algorithm. Moore's PartiGame algorithm [81] is another solution to the problem of learning to achieve goal configurations in deterministic high-dimensional continuous spaces by learning an adaptive-resolution model. It also divides the environm... |

218 | Bandit Problems: Sequential Allocation of Experiments - Berry, Fristedt - 1985 |
Citation Context ...ore appropriate measure, then, is the expected decrease in reward gained due to executing the learning algorithm instead of behaving optimally from the very beginning. This measure is known as regret [12]. It penalizes mistakes wherever they occur during the run. Unfortunately, results concerning the regret of algorithms are quite hard to obtain. 1.4 Reinforcement Learn... |

211 | Stable function approximation in dynamic programming - Gordon - 1995 |
Citation Context ...alue functions is also dangerous because the errors in value functions due to generalization can become compounded by the "max" operator in the definition of the value function. Several recent results [42, 126] show how the appropriate choice of function approximator can guarantee convergence, though not necessarily to the optimal values. Baird's residual gradient technique [6] provides guaranteed convergence ... |

210 | On the Convergence of Stochastic Iterative Dynamic Programming Algorithms - Jaakkola, Jordan, et al. - 1994 |
Citation Context ...erience tuple as described earlier. If each action is executed in each state an infinite number of times on an infinite run and $\alpha$ is decayed appropriately, the Q values will converge with probability 1 to $Q^*$ [128, 125, 49]. Q-learning can also be extended to update states that occurred more than one step previously, as in TD($\lambda$) [88]. When the Q values are nearly converged to their optimal values, it is appropriate for the... |

209 | Learning to Coordinate Behaviors - Maes, Brooks - 1990 |
Citation Context ...ent states into low-level actions and a gating function that decides, based on the state of the environment, which behavior's actions should be switched through and actually executed. Maes and Brooks [68] used a version of this architecture in which the individual behaviors were fixed a priori and the gating function was learned from reinforcement. Mahadevan and Connell [72] used the dual approach: they... |

195 | Reinforcement learning with perceptual aliasing: the perceptual distinction approach - Chrisman - 1992 |
Citation Context ...ts of using hidden Markov model (HMM) techniques to learn a model of the environment, including the hidden state, then to use that model to construct a perfect memory controller [20, 67, 79]. Chrisman [22] showed how the forward-backward algorithm for learning HMMs could be adapted to learning POMDPs. He, and later McCallum [75], also gave heuristic state-splitting rules to attempt to learn the smalles... |

194 | A survey of partially observable Markov decision processes: Theory, models, and algorithms. Manag - Monahan - 1982 |
Citation Context ...nother strategy consists of using hidden Markov model (HMM) techniques to learn a model of the environment, including the hidden state, then to use that model to construct a perfect memory controller [20, 67, 79]. Chrisman [22] showed how the forward-backward algorithm for learning HMMs could be adapted to learning POMDPs. He, and later McCallum [75], also gave heuristic state-splitting rules to attempt to le... |

189 | Reinforcement learning for robots using neural networks - Lin - 1992 |
Citation Context ...e second network. The second network is trained in a standard supervised mode to estimate r as a function of the input state s. Variations of this approach have been used in a variety of applications [4, 9, 61, 114]. REINFORCE Algorithms. Williams [131, 132] studied the problem of choosing actions to maximize immediate reward. He identified a broad class of update rules that perform gradient descent on the expected ... |

189 | Reinforcement learning with replacing eligibility traces - Singh, Sutton - 1996 |
Citation Context ...ably faster for large $\lambda$ [30, 32]. There has been some recent work on making the updates more efficient [24] and on changing the definition to make TD($\lambda$) more consistent with the certainty-equivalent method [108], which is discussed in Section 5.1. 4.2 Q-learning. The work of the two components of AHC can be accomplished in a unified manner by Watkins' Q-learning algorithm [128, 129]. Q-learning is typically ea... |

177 | A survey of algorithmic methods for partially observable Markov decision processes - Lovejoy - 1991 |
Citation Context ...nother strategy consists of using hidden Markov model (HMM) techniques to learn a model of the environment, including the hidden state, then to use that model to construct a perfect memory controller [20, 67, 79]. Chrisman [22] showed how the forward-backward algorithm for learning HMMs could be adapted to learning POMDPs. He, and later McCallum [75], also gave heuristic state-splitting rules to attempt to le... |

171 | Adaptation in Natural and Artificial Systems. The - Holland - 1975 |
Citation Context ...ed by a number of researchers [77, 62, 103]. It seems to work effectively on simple problems, but can suffer from convergence to local optima on more complex problems. Classifier Systems. Classifier systems [47, 41] were explicitly developed to solve problems with delayed reward, including those requiring short-term memory. The internal mechanism typically used to pass reward back through chains of decisions, cal... |