## Learning to predict by the methods of temporal differences (1988)

Venue: Machine Learning

Citations: 1328 (46 self)

### BibTeX

@ARTICLE{Sutton88learningto,
  author = {Richard S. Sutton},
  title = {Learning to predict by the methods of temporal differences},
  journal = {Machine Learning},
  year = {1988},
  pages = {9--44},
  publisher = {Kluwer Academic Publishers}
}

### Abstract

This article introduces a class of incremental learning procedures specialized for prediction – that is, for using past experience with an incompletely known system to predict its future behavior. Whereas conventional prediction-learning methods assign credit by means of the difference between predicted and actual outcomes, the new methods assign credit by means of the difference between temporally successive predictions. Although such temporal-difference methods have been used in Samuel's checker player, Holland's bucket brigade, and the author's Adaptive Heuristic Critic, they have remained poorly understood. Here we prove their convergence and optimality for special cases and relate them to supervised-learning methods. For most real-world prediction problems, temporal-difference methods require less memory and less peak computation than conventional methods and they produce more accurate predictions. We argue that most problems to which supervised learning is currently applied are really prediction problems of the sort to which temporal-difference methods can be applied to advantage.
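The credit-assignment idea in the abstract can be illustrated with a minimal sketch. This is not the article's own code: the bounded random walk with five non-terminal states, the one-hot state features, the function names, the step size `alpha`, and the decay parameter `lam` (which weights earlier predictions in the sequence) are all assumptions chosen for illustration. Each weight change is driven by the difference between temporally successive predictions, and the actual outcome `z` is used only at the final step.

```python
import random

N_STATES = 5  # hypothetical: five non-terminal states, absorbing exits on both sides


def random_walk(rng, n_states=N_STATES):
    """Return (visited states, outcome z): z is 1.0 for a right exit, 0.0 for a left exit."""
    s = n_states // 2              # start in the middle state
    visited = []
    while 0 <= s < n_states:
        visited.append(s)
        s += rng.choice((-1, 1))   # step left or right with equal probability
    return visited, (1.0 if s >= n_states else 0.0)


def td_lambda_update(w, visited, z, alpha, lam):
    """Accumulate one sequence's temporal-difference weight changes and return new weights.

    With one-hot features the prediction for state s is w[s], so the gradient
    of the prediction with respect to w is the one-hot vector for s.
    """
    trace = [0.0] * len(w)         # decayed sum of past gradients
    delta = [0.0] * len(w)         # weight changes accumulated over the sequence
    for t, s in enumerate(visited):
        trace = [lam * e for e in trace]
        trace[s] += 1.0
        # Target: the next prediction, or the actual outcome at the end of the sequence.
        nxt = w[visited[t + 1]] if t + 1 < len(visited) else z
        err = nxt - w[s]           # difference between successive predictions
        delta = [d + alpha * err * e for d, e in zip(delta, trace)]
    return [wi + di for wi, di in zip(w, delta)]


def train(n_sequences=5000, alpha=0.01, lam=0.3, seed=0):
    """Learn predictions of the outcome from repeated random walks (assumed settings)."""
    rng = random.Random(seed)
    w = [0.5] * N_STATES
    for _ in range(n_sequences):
        visited, z = random_walk(rng)
        w = td_lambda_update(w, visited, z, alpha, lam)
    return w


if __name__ == "__main__":
    print([round(x, 2) for x in train()])
```

In this sketch, setting `lam=1.0` makes the successive differences telescope, so every visited state is effectively paired with the final outcome, as in a conventional supervised update. Setting `lam=0.0` changes only the prediction for the state that was just left. For this walk the ideal predictions are 1/6, 2/6, ..., 5/6, which gives a simple reference point for checking the learned weights.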

### Citations

3021 |
Learning internal representations by error propagation
- Rumelhart, Hinton, et al.
- 1986
Citation Context ...ustable parameters or "weights." This and other representational assumptions are detailed in Section 2. Given the current interest in learning procedures for multi-layer connectionist networks (e.g., Rumelhart, Hinton, & Williams, 1985; Ackley, Hinton, & Sejnowski, 1985; Barto, 1985; Anderson, 1986; Williams, 1986; Hampson & Volper, 1987), we note that here we are concerned with a different set of issues. The work with multi-layer ... |

857 |
Finite Markov Chains
- Kemeny, Snell
- 1960
Citation Context ...each term above, yielding [...] where $d_i$ is the expected number of times the Markov chain is in state $i$ in one sequence, so that $d_i p_{ij}$ is the expected value of $n_{ij}$. For an absorbing Markov chain (e.g., see Kemeny & Snell, 1976, p. 46): [...] where $[d]_i = d_i$ and $[\mu]_i = \mu_i$, $i \in N$. Each $d_i$ is strictly positive, because any state for which $d_i = 0$ has no probability of being visited and can be discarded. Let $w_n$ denote the expected va... |

689 |
Matrix Iterative Analysis
- Varga
- 1962
Citation Context ...− Q) are less than 1 in modulus, which assures us that its powers converge. We show that $D(I - Q)$ is positive definite by applying the following lemma (see Varga, 1962, p. 23, for a proof): Lemma: If $A$ is a real, symmetric, and strictly diagonally dominant matrix with positive diagonal entries, then $A$ is positive definite. We cannot apply this lemma directly to D(I... |

663 | Some studies in machine learning using the game of checkers
- Samuel
- 1959
Citation Context ... prediction-learning problems. 3.2 A random-walk example The game-playing example is too complex to analyze in great detail. Previous experiments with TD methods have also used complex domains (e.g., Samuel, 1959; Sutton, 1984; Barto, Sutton & Anderson, 1983; Anderson, 1986, 1987). Which aspects of these domains can be simplified or eliminated, and which aspects are essential in order for TD methods to be eff... |

604 |
Adaptive switching circuits
- Widrow, Hoff
- 1960
Citation Context ...by including $t$ as a component of $x_t$. ... the $i$th components of $w$ and $x_t$ respectively. In this case we have $\nabla_w P_t = x_t$, and (2) reduces to the well-known Widrow-Hoff rule (Widrow & Hoff, 1960): $\Delta w_t = \alpha (z - w^T x_t) x_t$. This linear learning method is also known as the "delta rule," the ADALINE, and the LMS filter. It is widely used in connectionism, pattern recognition, signal... |

464 | A learning algorithm for Boltzmann machines
- Ackley, Hinton, et al.
- 1985
Citation Context ... and other representational assumptions are detailed in Section 2. Given the current interest in learning procedures for multi-layer connectionist networks (e.g., Rumelhart, Hinton, & Williams, 1985; Ackley, Hinton, & Sejnowski, 1985; Barto, 1985; Anderson, 1986; Williams, 1986; Hampson & Volper, 1987), we note that here we are concerned with a different set of issues. The work with multi-layer networks focuses on learning input-... |

290 | Escaping brittleness: The possibilities of general-purpose learning algorithms applied to parallel rule-based systems - Holland - 1986 |

257 |
Temporal Credit Assignment in Reinforcement Learning
- Sutton
- 1984
Citation Context ...aluations assigned to the two positions to modify the evaluation of the earlier one. Similar methods have also been used in Holland's (1986) bucket brigade, in the author's Adaptive Heuristic Critic (Sutton, 1984; Barto, Sutton & Anderson, 1983), and in learning systems studied by Witten (1977), Booker (1982), and Hampson (1983). TD methods have also been proposed as models of classical conditioning (Sutton &... |

226 |
Adaptive signal processing
- Widrow, Stearns
- 1985
Citation Context ...sed, increasing $w^T x_t$ and reducing the error. The Widrow-Hoff rule is simple, effective, and robust. Its theory is also better developed than that of any other learning method (e.g., see Widrow and Stearns, 1985). Another instance of the prototypical supervised-learning procedure is the "generalized delta rule," or backpropagation procedure, of Rumelhart, Hinton and Williams (1985). In this case, $P_t$ is comp... |

225 | Toward a modern theory of adaptive networks: Expectation and prediction
- Sutton, Barto
- 1981
Citation Context ...on, 1984; Barto, Sutton & Anderson, 1983), and in learning systems studied by Witten (1977), Booker (1982), and Hampson (1983). TD methods have also been proposed as models of classical conditioning (Sutton & Barto, 1981a, 1987; Gelperin, Hopfield & Tank, 1985; Moore et al., 1986; Klopf, 1987). Nevertheless, TD methods have remained poorly understood. Although they have performed well, there has been no theoretical ... |

191 |
Neuron-like elements that can solve difficult learning control problems
- Barto, Sutton, et al.
- 1983
Citation Context .... 3.2 A random-walk example The game-playing example is too complex to analyze in great detail. Previous experiments with TD methods have also used complex domains (e.g., Samuel, 1959; Sutton, 1984; Barto et al., 1983; Anderson, 1986, 1987). Which aspects of these domains can be simplified or eliminated, and which aspects are essential in order for TD methods to be effective? In this paper, we propose that the onl... |

79 | Intelligent behavior as an adaptation to the task environment - Booker - 1982 |

77 | Strategy learning with multilayer connectionist representations - Anderson - 1987 |

64 |
Dynamic Programming Models and Applications
- Denardo
- 1982
Citation Context ...that uses the mismatch in the recursive equations to drive weight changes towards a better match. These three steps are very similar to those taken in formulating a dynamic programming problem (e.g., Denardo, 1982). 6. Related Research Although temporal-difference methods have never previously been identified or studied on their own, we can view some previous machine learning research as having used them. In t... |

56 |
Learning by statistical cooperation of self-interested neuron-like computing elements
- Barto
- 1985
Citation Context ...ptions are detailed in Section 2. Given the current interest in learning procedures for multi-layer connectionist networks (e.g., Rumelhart, Hinton & Williams, 1985; Ackley, Hinton & Sejnowski, 1985; Barto, 1985; Anderson, 1986; Williams, 1986; Hampson & Volper, 1987), we note that here we are concerned with a different set of issues. The work with multi-layer networks is concerned with learning input-output... |

55 | Learning and Problem solving with multilayer connectionist systems
- Anderson
- 1986
Citation Context ...tailed in Section 2. Given the current interest in learning procedures for multi-layer connectionist networks (e.g., Rumelhart, Hinton & Williams, 1985; Ackley, Hinton & Sejnowski, 1985; Barto, 1985; Anderson, 1986; Williams, 1986; Hampson & Volper, 1987), we note that here we are concerned with a different set of issues. The work with multi-layer networks is concerned with learning input-output mappings of mor... |

45 |
A temporal-difference model of classical conditioning
- Sutton, Barto
- 1987
Citation Context ...ven just the bell, as evidenced by salivation to the bell alone. Some of the detailed features of this learning process suggest that animals may be using a TD method (Kehoe, Schreurs, & Graham, 1987; Sutton & Barto, 1987). Acknowledgements The author acknowledges especially the assistance of Andy Barto, Martha Steenstrup, Chuck Anderson, John Moore, and Harry Klopf. I also thank Oliver Selfridge, Pat Langley, Ron Riv... |

36 | The learning of world models by connectionist networks - Sutton, Pinette - 1985 |

35 | Learning to predict sequences - Michalski, Carbonell, et al. - 1986 |

22 |
Reinforcement Learning in Connectionist Networks: A Mathematical analysis
- Williams
- 1986
Citation Context ...n 2. Given the current interest in learning procedures for multi-layer connectionist networks (e.g., Rumelhart, Hinton & Williams, 1985; Ackley, Hinton & Sejnowski, 1985; Barto, 1985; Anderson, 1986; Williams, 1986; Hampson & Volper, 1987), we note that here we are concerned with a different set of issues. The work with multi-layer networks is concerned with learning input-output mappings of more complex functi... |

18 |
Simulation of the classically conditioned nictitating membrane response by a neuron-like adaptive element: Response topography, neuronal firing, and interstimulus intervals
- Moore, Desmond, et al.
- 1986
Citation Context ...tems studied by Witten (1977), Booker (1982), and Hampson (1983). TD methods have also been proposed as models of classical conditioning (Sutton & Barto, 1981a, 1987; Gelperin, Hopfield & Tank, 1985; Moore et al., 1986; Klopf, 1987). Nevertheless, TD methods have remained poorly understood. Although they have performed well, there has been no theoretical understanding of how or why they worked. One reason is that t... |

18 | A unified theory of heuristic evaluation functions and its applications to learning
- Christensen, Korf
- 1986
Citation Context ...d Korf have investigated a simplification of Samuel’s procedure that also does not constrain the evaluations of terminal positions, and have obtained promising preliminary results (Christensen, 1986; Christensen & Korf, 1986). Thus, although a terminal constraint may be critical to good temporal-difference theory, apparently it is not strictly necessary to obtain good performance. 6.2 Backpropagation in connectionist net... |

17 |
An adaptive network that constructs and uses an internal model of its world
- Sutton, Barto
- 1981
Citation Context ...on, 1984; Barto, Sutton & Anderson, 1983), and in learning systems studied by Witten (1977), Booker (1982), and Hampson (1983). TD methods have also been proposed as models of classical conditioning (Sutton & Barto, 1981a, 1987; Gelperin, Hopfield & Tank, 1985; Moore et al., 1986; Klopf, 1987). Nevertheless, TD methods have remained poorly understood. Although they have performed well, there has been no theoretical u... |

11 |
Temporal primacy overrides prior training in serial compound conditioning of the rabbit’s nictitating membrane response
- Kehoe, Schreurs, et al.
- 1987
Citation Context ...will learn to predict the meal given just the bell, as evidenced by salivation to the bell alone. Some of the detailed features of this learning process suggest that animals may be using a TD method (Kehoe, Schreurs, & Graham, 1987; Sutton & Barto, 1987). Acknowledgements The author acknowledges especially the assistance of Andy Barto, Martha Steenstrup, Chuck Anderson, John Moore, and Harry Klopf. I also thank Oliver Selfridge... |

10 | The logic of Limax learning - Gelperin, Hopfield, et al. - 1985 |

9 |
Disjunctive models of boolean category learning
- Hampson, Volper
- 1987
Citation Context ...rent interest in learning procedures for multi-layer connectionist networks (e.g., Rumelhart, Hinton, & Williams, 1985; Ackley, Hinton, & Sejnowski, 1985; Barto, 1985; Anderson, 1986; Williams, 1986; Hampson & Volper, 1987), we note that here we are concerned with a different set of issues. The work with multi-layer networks focuses on learning input-output mappings of more complex functional forms. Most of that work r... |

3 | A neural model of adaptive behavior. Doctoral dissertation - Hampson - 1983 |

2 |
Learning static evaluation functions by linear regression
- Christensen
- 1986
Citation Context ...ers. Christensen and Korf have investigated a simplification of Samuel's procedure that also does not constrain the evaluations of terminal positions, and have obtained promising preliminary results (Christensen, 1986; Christensen & Korf, 1986). Thus, although a terminal constraint may be critical to good temporal-difference theory, apparently it is not strictly necessary to obtain good performance. 6.2 Backpropag... |

1 | An adaptive network that constructs and uses an internal model of its environment - unknown authors - 1981 |

1 |
A neuronal model of classical conditioning (Technical Report 87-1139). OH: Wright-Patterson Air Force Base, Wright Aeronautical Laboratories
- Klopf
- 1987
Citation Context ...n (1977), Booker (1982), and Hampson (1983). TD methods have also been proposed as models of classical conditioning (Sutton & Barto, 1981a, 1987; Gelperin, Hopfield & Tank, 1985; Moore et al., 1986; Klopf, 1987). Nevertheless, TD methods have remained poorly understood. Although they have performed well, there has been no theoretical understanding of how or why they worked. One reason is that they were neve... |