This article introduces a class of incremental learning procedures specialized for prediction – that is, for using past experience with an incompletely known system to predict its future behavior. Whereas conventional prediction-learning methods assign credit by means of the difference between predicted and actual outcomes, the new methods assign credit by means of the difference between temporally successive predictions. Although such temporal-difference methods have been used in Samuel's checker player, Holland's bucket brigade, and the author's Adaptive Heuristic Critic, they have remained poorly understood. Here we prove their convergence and optimality for special cases and relate them to supervised-learning methods. For most real-world prediction problems, temporal-difference methods require less memory and less peak computation than conventional methods and they produce more accurate predictions. We argue that most problems to which supervised learning is currently applied are really prediction problems of the sort to which temporal-difference methods can be applied to advantage.
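The contrast drawn above, between assigning credit from the final outcome and assigning credit from the next prediction, can be illustrated with a minimal sketch. Everything concrete here is an assumption chosen for illustration and not taken from the abstract: the task (a bounded random walk over five states that ends with outcome 0 on the left and 1 on the right), the table-lookup predictions `w`, the step size `alpha`, and the function names `run_walk` and `learn`. The sketch replays each finished walk for brevity, although the temporal-difference update only ever needs the current and the next prediction and so could equally be applied during the walk.

```python
import random

def run_walk(rng, num_states):
    """Simulate one bounded random walk (hypothetical task).

    Returns the list of visited non-terminal states and the final
    outcome: 1.0 if the walk exits on the right, 0.0 on the left.
    """
    s = num_states // 2          # start in the middle state
    visited = []
    while 0 <= s < num_states:
        visited.append(s)
        s += rng.choice((-1, 1))
    return visited, (1.0 if s >= num_states else 0.0)

def learn(method, num_states=5, episodes=5000, alpha=0.02, seed=0):
    """Learn per-state predictions of the outcome.

    method="td":         move each prediction toward the NEXT prediction
                         (or toward the outcome on the last step).
    method="supervised": move each prediction toward the ACTUAL outcome.
    """
    rng = random.Random(seed)
    w = [0.5] * num_states       # one prediction per non-terminal state
    for _ in range(episodes):
        visited, z = run_walk(rng, num_states)
        for t, s in enumerate(visited):
            if method == "td":
                last_step = (t + 1 == len(visited))
                target = z if last_step else w[visited[t + 1]]
            else:
                target = z
            # Error-correction update toward the chosen target.
            w[s] += alpha * (target - w[s])
    return w

if __name__ == "__main__":
    print("td        :", [round(v, 2) for v in learn("td")])
    print("supervised:", [round(v, 2) for v in learn("supervised")])
```

For this particular walk the probability of ending on the right from state `i` (counting from 0) is `(i + 1) / 6`, so both variants should settle near those values. The sketch only shows the two update rules side by side; it is not meant to reproduce the article's comparisons of accuracy, memory, or computation.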
|
2044
|
Learning internal representations by error propagation
– Rumelhart, Hinton, et al.
- 1986
|
|
488
|
Some Studies in Machine Learning using the Game of Checkers
– Samuel
- 1959
|
|
479
|
Finite Markov Chains
– Kemeny, Snell
- 1983
|
|
406
|
Matrix Iterative Analysis
– Varga
- 1962
|
|
379
|
Adaptive switching circuits
– Widrow, Hoff
- 1960
|
|
310
|
A learning algorithm for Boltzmann machines
– Ackley, Hinton
- 1985
|
|
244
|
Escaping brittleness: the possibilities of general-purpose learning algorithms applied to parallel rule-based systems
– Holland
- 1986
|
|
208
|
Temporal Credit Assignment in Reinforcement Learning
– Sutton
- 1984
|
|
147
|
Neuronlike elements that can solve difficult learning control problems
– Barto, Sutton, et al.
- 1983
|
|
138
|
Adaptive signal processing
– Widrow, Stearns
- 1985
|
|
130
|
Toward a modern theory of adaptive networks: Expectation and prediction
– Sutton, Barto
- 1981
|
|
67
|
Intelligent Behavior as an Adaptation to the Task Environment
– Booker
- 1982
|
|
64
|
Strategy learning with multilayer connectionist representations
– Anderson
|
|
47
|
Learning and Problem Solving with Multilayer Connectionist Systems
– Anderson
- 1986
|
|
40
|
Learning by statistical cooperation of self-interested neuronlike adaptive elements. Human Neurobiology
– Barto
- 1985
|
|
34
|
The learning of world models by connectionist networks
– Sutton, Pinette
- 1985
|
|
30
|
Dynamic Programming: models and applications
– Denardo
- 1982
|
|
29
|
A temporal-difference model of classical conditioning
– Sutton, Barto
- 1987
|
|
27
|
Learning to predict sequences
– Dietterich
- 1986
|
|
24
|
Neuronlike elements that can solve difficult learning control problems
– Barto, Sutton, et al.
- 1983
|
|
19
|
Reinforcement learning in connectionist networks: A mathematical analysis
– Williams
- 1986
|
|
14
|
An adaptive network that constructs and uses an internal model of its world
– Sutton, Barto
- 1981
|
|
9
|
Simulation of the classically conditioned nictitating membrane response by a neuron-like adaptive element: Response topography, neuronal firing and interstimulus intervals
– Moore, Desmond, et al.
- 1986
|
|
8
|
The logic of Limax learning
– Gelperin, Hopfield, et al.
- 1985
|
|
6
|
Disjunctive models of boolean category learning
– Hampson, Volper
- 1987
|
|
4
|
Temporal primacy overrides prior training in serial compound conditioning of the rabbit's nictitating membrane response
– Kehoe, Schreurs, et al.
- 1987
|
|
3
|
Adaptive switching circuits
– unknown authors
- 1960
|
|
2
|
Learning static evaluation functions by linear regression
– Christensen
- 1986
|
|
2
|
Dynamic Programming: Models and Applications. Englewood Cliffs, NJ
– Denardo
- 1982
|
|
1
|
A neural model of adaptive behavior. Doctoral dissertation
– Hampson
- 1983
|
|
1
|
A neuronal model of classical conditioning (Air Force Wright Aeronautical Laboratories
– Klopf
- 1987
|
|
1
|
Learning static evaluation functions by linear regression
– Christensen
- 1986
|
|
1
|
A unified theory of heuristic evaluation functions and its application to learning
– Christensen, Korf
- 1986
|
|
1
|
Dynamic programming: Models and applications. Englewood Cliffs
– Denardo
- 1982
|
|
1
|
The logic of Limax learning
– Gelperin, Hopfield, et al.
- 1985
|
|
1
|
A neuronal model of classical conditioning (Technical Report 87-1139). OH: Wright-Patterson Air Force Base, Wright Aeronautical Laboratories
– Klopf
- 1987
|
|
1
|
An adaptive network that constructs and uses an internal model of its environment
– unknown authors
- 1981
|
|
1
|
Reinforcement learning in connectionist networks: A mathematical analysis
– Williams
- 1986
|