Least-Squares Temporal Difference Learning (1999) [41 citations — 0 self]
http://www.cs.cmu.edu/afs/cs/user/jab/web/cv/pubs/
http://www.cs.cmu.edu/~jab/pubs/boyan.lstdl.ps
http://www.research.rutgers.edu/~lihong/project/ah
Abstract:
TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it makes inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto [5] eliminates all stepsize parameters and improves data efficiency. This paper extends Bradtke and Barto's work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting algorithm is shown to be a practical formulation of supervised linea...
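As an illustration of the idea the abstract describes, the following is a minimal sketch of a least-squares TD(λ) estimator for linear value functions. It is not taken from the paper: the names `lstd_lambda` and `solve`, the explicit discount `gamma`, the convention that a terminal successor is `None` with all-zero features, and the dense Gaussian-elimination solver are all assumptions made for this example. The sketch accumulates a matrix A and a vector b from observed transitions using an eligibility trace z, then solves A β = b once, so no stepsize appears anywhere.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("A is singular; more data or regularization needed")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def lstd_lambda(episodes, phi, k, lam=0.0, gamma=1.0):
    """Estimate linear value-function weights from whole trajectories.

    episodes: list of trajectories, each a list of (state, reward, next_state);
              next_state is None when the trajectory terminates (assumption).
    phi:      feature function mapping a state to a list of k floats.
    lam:      the trace parameter lambda in [0, 1].
    gamma:    discount factor (hypothetical parameter; 1.0 means undiscounted).
    """
    A = [[0.0] * k for _ in range(k)]
    b = [0.0] * k
    for episode in episodes:
        z = None  # eligibility trace, reset at the start of each trajectory
        for state, reward, next_state in episode:
            f = phi(state)
            f_next = [0.0] * k if next_state is None else phi(next_state)
            if z is None:
                z = list(f)
            for i in range(k):
                for j in range(k):
                    A[i][j] += z[i] * (f[j] - gamma * f_next[j])
                b[i] += z[i] * reward
            # Decay the trace and add the features of the state just entered.
            z = [lam * gamma * zi + fi for zi, fi in zip(z, f_next)]
    return solve(A, b)


# Hypothetical 3-state chain 0 -> 1 -> 2 -> terminal, reward -1 per step.
episode = [(0, -1.0, 1), (1, -1.0, 2), (2, -1.0, None)]
one_hot = lambda s: [1.0 if i == s else 0.0 for i in range(3)]
weights = lstd_lambda([episode], one_hot, 3, lam=0.0)
```

With one-hot features on this toy chain, the recovered weights should match the true values of roughly -3, -2 and -1 for any λ in [0, 1]. With λ = 1 the accumulated system reduces to ordinary least-squares regression of the features onto the observed returns, which is in line with the abstract's remark about the λ = 1 extreme.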

