## Technical update: Least-squares temporal difference learning (2002)

### Other Repositories/Bibliography

Venue: Machine Learning

Citations: 96 (2 self)

### BibTeX

@ARTICLE{Boyan02technicalupdate:,

author = {Justin A. Boyan},

title = {Technical update: Least-squares temporal difference learning},

journal = {Machine Learning},

year = {2002},

pages = {233--246}

}

### Abstract

TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it may make inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto (1996, Machine learning, 22:1–3, 33–57) eliminates all stepsize parameters and improves data efficiency. This paper updates Bradtke and Barto’s work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting new algorithm is shown to be a practical, incremental formulation of supervised linear regression. Third, it presents a novel and intuitive interpretation of LSTD as a model-based reinforcement learning technique.
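The generalization to arbitrary λ described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's code, and it assumes an undiscounted episodic task in which terminal states have an all-zero feature vector. The function name `lstd_lambda`, the `(state, reward, next_state)` episode format, and the `featurize` callback are hypothetical choices made for this example. The pseudo-inverse, which NumPy computes through a singular value decomposition, stands in for a robust inversion of `A`.

```python
import numpy as np


def lstd_lambda(trajectories, featurize, num_features, lam):
    """Estimate linear value-function weights with an LSTD(lambda)-style update.

    trajectories: iterable of episodes; each episode is a list of
        (state, reward, next_state) tuples, with next_state=None at termination.
    featurize: hypothetical callback mapping a state to a 1-D feature array.
    lam: the lambda parameter in [0, 1].
    """
    A = np.zeros((num_features, num_features))
    b = np.zeros(num_features)
    for episode in trajectories:
        z = None  # eligibility vector, reset at the start of every episode
        for state, reward, next_state in episode:
            phi = np.asarray(featurize(state), dtype=float)
            if next_state is None:
                # Assumption: terminal states are represented by zero features.
                phi_next = np.zeros(num_features)
            else:
                phi_next = np.asarray(featurize(next_state), dtype=float)
            if z is None:
                z = phi.copy()
            # Accumulate the statistics; no stepsize parameter is involved.
            A += np.outer(z, phi - phi_next)
            b += z * reward
            # Decay the eligibility vector and add the next state's features.
            z = lam * z + phi_next
    # Solve A @ beta = b with an SVD-based pseudo-inverse.
    return np.linalg.pinv(A) @ b
```

With `lam=1.0` the returned weights should agree with ordinary least-squares regression of the observed Monte-Carlo returns on the features, which is the λ = 1 connection the abstract points out.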

### Citations

4165 | Reinforcement Learning - An Introduction
- Sutton, Barto
- 1998
Citation Context ...ce learning, value function approximation, linear least-squares methods. 1. Background. We address the problem of approximating the value function V^π of a fixed policy π in a large Markov decision process [2, 10]. This is an important subproblem of several algorithms for sequential decision making, including optimistic policy iteration [2] and STAGE [3]. V^π(x) simply predicts the expected long-term sum of fut...

2196 | Numerical Recipes in C. The Art of Scientific Computation
- Press, Teukolsky, et al.
- 1994
Citation Context ...imate of nd, and A is an unbiased estimate of −nC. Thus, β_λ can be estimated as A^{-1} b. As is standard in least-squares algorithms, Singular Value Decomposition is used to invert A robustly [8]. The complete LSTD(λ) algorithm is specified in Figure 2. LSTD(λ) for approximate policy evaluation: Given: a simulation model, featurizer, and λ as in ordinary TD(λ). (No stepsize schedules or initial es...

1328 | Learning to predict by the method of temporal differences
- Sutton
- 1988
Citation Context ... must compute V^π or an approximation thereof (denoted Ṽ) solely from a collection of trajectories sampled from the chain. This is where the TD(λ) family of algorithms applies. TD(λ) was introduced in [11]; excellent summaries may now be found in several books [2, 10]. For each state on each observed trajectory, TD(λ) incrementally adjusts the coefficients of Ṽ toward new target values. The target val...

829 | Neuro-dynamic programming
- Bertsekas, Tsitsiklis
- 1996
Citation Context ...arning, value function approximation, linear least-squares methods. 1. Background. We address the problem of approximating the value function V^π of a fixed policy π in a large Markov decision process (Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 1998). This is an important subproblem of several algorithms for sequential decision making, including optimistic policy iteration (Bertsekas & Tsitsiklis, 1996) and STAGE (Boyan & M...

498 | Integrated architectures for learning, planning and reacting based on approximating dynamic programming
- Sutton
- 1990
Citation Context ...ical model-based method is that it makes the most of the available training data. The relative advantages of model-based and model-free reinforcement learning methods have been investigated in, e.g., [12, 7, 1]. Where does LSTD(λ) fit in? In fact, for the case of λ = 0, it precisely duplicates the classical model-based method sketched above. The assumed lookup-table representation for Ṽ means that we have on...

336 | Prioritized sweeping: Reinforcement learning with less data and less time
- Moore, Atkeson
- 1993
Citation Context ...ical model-based method is that it makes the most of the available training data. The relative advantages of model-based and model-free reinforcement learning methods have been investigated in, e.g., [12, 7, 1]. Where does LSTD(λ) fit in? In fact, for the case of λ = 0, it precisely duplicates the classical model-based method sketched above. The assumed lookup-table representation for Ṽ means that we have on...

249 | TD-Gammon, a self-teaching backgammon program, achieves master-level play
- Tesauro
Citation Context ...ms of overall computational efficiency as well as data efficiency. We generate trajectories by opposing two fixed, pre-trained backgammon policies: "Fiona," a neural network patterned after TD-Gammon [15], and "pubeval," a benchmark backgammon evaluator. We sidestep the two-player aspect of backgammon by defining a Markov chain consisting of only those states in which it is Fiona's turn to play. Tha...

243 | An analysis of temporal-difference learning with function approximation
- Tsitsiklis, Van Roy
- 1997
Citation Context ...best [11, 10]. TD(λ) has been shown to converge to a good approximation of V^π when linear architectures are used, assuming a suitable decreasing schedule of stepsizes for the incremental weight updates [16]. Linear architectures, which include lookup tables, state aggregation methods, CMACs, radial basis function networks with fixed bases, and multi-dimensional polynomial regression, approximate V^π(x)...

196 | Reinforcement Learning for Robots Using Neural Networks
- Lin
- 1993
Citation Context ...ch, while requiring little computation per iteration, wastes data and may require sampling many trajectories to reach convergence. One technique for using data more efficiently is "experience replay" [6]: explicitly remember all trajectories ever seen, and whenever asked to produce an updated set of coefficients, perform repeated passes of TD(λ) over all the saved trajectories until convergence. This ...

191 | Linear least-squares algorithms for temporal difference learning
- Bradtke, Barto
- 1996
Citation Context ...t requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto [5] eliminates all stepsize parameters and improves data efficiency. This paper updates Bradtke and Barto's work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. ...

130 | Reinforcement learning for dynamic channel allocation in cellular telephone systems
- Singh, Bertsekas
- 1997
Citation Context ...h as TD(λ) may have better real-time performance than least-squares methods [13]. On the other hand, some reinforcement learning applications have been successful with small numbers of features (e.g., [9, 3]), and in these situations LSTD(λ) should be superior. LSTD(λ) has been applied successfully in the context of STAGE, a reinforcement learning algorithm for combinatorial optimization [4]. An exciting p...

63 | TD models: Modeling the world at a mixture of time scales
- Sutton
- 1995
Citation Context ...ectly to absorption, and β then simply computes the average Monte-Carlo return at each state. For general λ, we produce the statistics of the "simple beta-model" of multi-scale reinforcement learning [14]. In short, if we assume a lookup-table representation for the function Ṽ, we can view the LSTD(λ) algorithm as performing these two steps: 1. It implicitly uses the observed simulation data to buil...

62 | Learning evaluation functions for global optimization and boolean satisfiability
- Boyan, Moore
- 1998
Citation Context ...of a fixed policy π in a large Markov decision process [2, 10]. This is an important subproblem of several algorithms for sequential decision making, including optimistic policy iteration [2] and STAGE [3]. V^π(x) simply predicts the expected long-term sum of future rewards obtained when the process starts in state x and follows policy π until termination. For simplicity we will assume that π is proper (gua...

45 | A comparison of direct and model-based reinforcement learning
- Atkeson, Santamaria
- 1997
Citation Context ... in several books [2, 10]. For each state on each observed trajectory, TD(λ) incrementally adjusts the coefficients of Ṽ toward new target values. The target values depend on the parameter λ ∈ [0, 1]. At λ = 1, the target at each visited state x_t is the Monte-Carlo return, i.e., the actual observed sum of future rewards R_t + R_{t+1} + ⋯ + R_end. This is an unbiased sample of V^π(x_t...

43 | Gain adaptation beats least squares
- Sutton
- 1992
Citation Context ... the application domain. If a domain has many features and simulation data is available cheaply, then incremental methods such as TD(λ) may have better real-time performance than least-squares methods [13]. On the other hand, some reinforcement learning applications have been successful with small numbers of features (e.g., [9, 3]), and in these situations LSTD(λ) should be superior. LSTD(λ) has been app...

30 | Learning Evaluation Functions for Global Optimization
- Boyan
- 1998
Citation Context ...ures (e.g., [9, 3]), and in these situations LSTD(λ) should be superior. LSTD(λ) has been applied successfully in the context of STAGE, a reinforcement learning algorithm for combinatorial optimization [4]. An exciting possibility for future work is to apply LSTD(λ) in the context of approximation algorithms for general Markov decision problems. LSTD(λ) provides an alternative to TD(λ) for the inner loop ...
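
Several of the citation contexts above state that, for λ = 0 and a lookup-table representation, LSTD precisely duplicates the classical model-based method. The sketch below illustrates that claim on a toy example. It assumes an undiscounted episodic chain with integer state indices, that every state is visited at least once, and that the accumulated matrix is invertible. The function names and the `(state, reward, next_state)` episode format are hypothetical and chosen only for this example.

```python
import numpy as np


def lstd0_lookup_table(trajectories, num_states):
    """LSTD with lambda = 0 and one-hot (lookup-table) features."""
    A = np.zeros((num_states, num_states))
    b = np.zeros(num_states)
    for episode in trajectories:
        for s, r, s_next in episode:
            # With one-hot features, outer(phi(s), phi(s) - phi(s_next))
            # touches only row s of A.
            A[s, s] += 1.0
            if s_next is not None:  # None marks termination
                A[s, s_next] -= 1.0
            b[s] += r
    return np.linalg.solve(A, b)


def empirical_model_values(trajectories, num_states):
    """Build an empirical Markov model, then solve it for the values."""
    visits = np.zeros(num_states)
    transition_counts = np.zeros((num_states, num_states))
    reward_sums = np.zeros(num_states)
    for episode in trajectories:
        for s, r, s_next in episode:
            visits[s] += 1.0
            reward_sums[s] += r
            if s_next is not None:
                transition_counts[s, s_next] += 1.0
    P_hat = transition_counts / visits[:, None]  # estimated transition matrix
    R_hat = reward_sums / visits                 # estimated expected reward
    # Solve the Bellman equations V = R_hat + P_hat @ V of the estimated model.
    return np.linalg.solve(np.eye(num_states) - P_hat, R_hat)


# Three hand-written episodes on a four-state chain (illustrative data only).
episodes = [
    [(0, -1.0, 1), (1, -1.0, 2), (2, -1.0, 3), (3, -1.0, None)],
    [(0, -2.0, 2), (2, -2.0, None)],
    [(0, -1.0, 1), (1, -2.0, 3), (3, -1.0, None)],
]
values_lstd = lstd0_lookup_table(episodes, 4)
values_model = empirical_model_values(episodes, 4)
```

Both functions solve the same linear system up to a row scaling by the visit counts, so `values_lstd` and `values_model` should agree whenever the stated assumptions hold.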