## Linear least-squares algorithms for temporal difference learning (1996)

### Download Links

- [www-anw.cs.umass.edu]
- [www-all.cs.umass.edu]
- DBLP

### Other Repositories/Bibliography

Venue: Machine Learning

Citations: 191 (0 self)

### BibTeX

@ARTICLE{Bradtke96linearleast-squares,
  author  = {Steven J. Bradtke and Andrew G. Barto},
  title   = {Linear least-squares algorithms for temporal difference learning},
  journal = {Machine Learning},
  volume  = {22},
  year    = {1996},
  pages   = {33--57}
}

### Abstract

We introduce two new temporal difference (TD) algorithms based on the theory of linear least-squares function approximation. We define an algorithm we call Least-Squares TD (LS TD) for which we prove probability-one convergence when it is used with a function approximator linear in the adjustable parameters. We then define a recursive version of this algorithm, Recursive Least-Squares TD (RLS TD). Although these new TD algorithms require more computation per time-step than do Sutton's TD(λ) algorithms, they are more efficient in a statistical sense because they extract more information from training experiences. We describe a simulation experiment showing the substantial improvement in learning rate achieved by RLS TD in an example Markov prediction problem. To quantify this improvement, we introduce the TD error variance of a Markov chain, σ_TD, and experimentally conclude that the convergence rate of a TD algorithm depends linearly on σ_TD. In addition to converging more rapidly, LS TD and RLS TD do not have control parameters, such as a learning rate parameter, thus eliminating the possibility of achieving poor performance by an unlucky choice of parameters.
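The batch LS TD estimator the abstract describes can be sketched as follows. This is an illustrative reconstruction under the usual linear-TD setup (features φ, discount γ, solving the normal equations A·θ = b); the function and variable names are made up for this sketch, not taken from the paper.

```python
import numpy as np

def lstd(transitions, gamma=0.9):
    """Batch Least-Squares TD value estimate (illustrative sketch).

    transitions: list of (phi, reward, phi_next) tuples, where phi and
    phi_next are feature vectors of the current and successor states.
    Accumulates A = sum phi (phi - gamma*phi_next)^T and b = sum phi*r,
    then solves A theta = b in one shot (A must be invertible, which
    requires enough linearly independent transition data).
    """
    k = len(transitions[0][0])
    A = np.zeros((k, k))
    b = np.zeros(k)
    for phi, r, phi_next in transitions:
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * r
    return np.linalg.solve(A, b)
```

Because θ is obtained by solving a linear system rather than by stochastic gradient steps, there is no step-size parameter to tune, which is the point the abstract makes about eliminating unlucky parameter choices.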

### Citations

1412 | Learning from Delayed Rewards - Watkins - 1989 |

Citation Context ...l n, and is updated only at the end of each trial. Less restrictive theorems have been obtained for the TD(0) algorithm by considering it as a special case of Watkins' (Watkins, 1989) Q-learning algorithm. Watkins and Dayan (Watkins & Dayan, 1992), Jaakkola, Jordan, and Singh (Jaakkola, et al., 1994), and Tsitsiklis (Tsitsiklis, 1993) note that since the TD(0) learning rule is a... |
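The on-line, lookup-table TD(0) rule whose convergence this context discusses is a one-line update toward the one-step bootstrap target. A minimal sketch (names illustrative, not from the paper):

```python
import numpy as np

def td0_update(V, alpha, gamma, s, r, s_next):
    """One on-line TD(0) update on a lookup-table value function.

    V: array of state values; alpha: step size; the update moves V[s]
    toward the one-step target r + gamma * V[s_next] by the TD error.
    """
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return V
```

Viewing this rule as a degenerate Q-learning update (a single action per state) is what lets the Q-learning convergence proofs cited here carry over to TD(0).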

857 | Finite Markov Chains - Kemeny, Snell - 1960 |

Citation Context ...ed proportion of state transitions that take the Markov chain out of state z. For an ergodic Markov chain, π_z is the invariant, or steady-state, distribution associated with the stochastic matrix P (Kemeny & Snell, 1976). For an absorbing Markov chain, π_z is the expected number of visits out of state z during one transition sequence from a start state to a goal state (Kemeny & Snell, 1976). Since there are no trans... |

608 | A stochastic approximation method - Robbins, Monro - 1951 |

Citation Context ...Appendix C Selecting Step-Size Parameters: The convergence theorem for NTD(0) (Bradtke, 1994) requires a separate step-size parameter, a(x), for each state x, that satisfies the Robbins and Monro (Robbins & Monro, 1951) criteria Σ_{k=1}^∞ a_k(x) = ∞ and Σ_{k=1}^∞ a_k(x)² < ∞ with probability 1, where a_k(x) is the step-size parameter for the k-th visitation of state x. Instead of a separate step-size parameter for each st... |
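The classic sequence satisfying the Robbins-Monro criteria quoted in this context is a_k = 1/k: its partial sums diverge (harmonic series) while the partial sums of its squares converge (to π²/6). A small numerical illustration, not code from the paper:

```python
def partial_sums(n):
    """Partial sums of a_k = 1/k and of a_k^2 up to n terms.

    The first sum grows without bound (satisfying sum a_k = infinity)
    while the second stays bounded (satisfying sum a_k^2 < infinity).
    """
    s1 = sum(1.0 / k for k in range(1, n + 1))
    s2 = sum(1.0 / k**2 for k in range(1, n + 1))
    return s1, s2
```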

436 | Theory and Practice of Recursive Identification - Ljung, Söderström - 1983 |

384 | Practical issues in temporal difference learning - Tesauro - 1992 |

Citation Context .... They allow a system to learn to predict the total amount of reward expected over time, and they can be used for other prediction problems as well (Anderson, 1988, Barto, et al., 1983, Sutton, 1984, Tesauro, 1992). We introduce two new TD algorithms based on the theory of linear least-squares function approximation. The recursive least-squares function approximation algorithm is commonly used in adaptive cont... |

246 | Adaptive Filtering Prediction and Control - Goodwin, Sin - 1984 |

Citation Context ... introduce two new TD algorithms based on the theory of linear least-squares function approximation. The recursive least-squares function approximation algorithm is commonly used in adaptive control (Goodwin & Sin, 1984) because it can converge many times more rapidly than simpler algorithms. Unfortunately, extending this algorithm to the case of TD learning is not straightforward. We define an algorithm we call Lea... |
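The recursive least-squares machinery this context refers to avoids re-inverting a matrix at each step by propagating the inverse with the Sherman-Morrison identity. A generic RLS sketch (standard textbook form, with illustrative names; this is not the paper's RLS TD, which additionally handles TD targets):

```python
import numpy as np

def rls_step(theta, P, x, y, lam=1.0):
    """One generic recursive least-squares update.

    theta: current parameter estimate; P: current inverse of the
    (regularized) input covariance; x: input vector; y: scalar target;
    lam: forgetting factor (1.0 = no forgetting).  Sherman-Morrison
    updates P in O(k^2) instead of an O(k^3) matrix inversion.
    """
    Px = P @ x
    k = Px / (lam + x @ Px)              # gain vector
    theta = theta + k * (y - x @ theta)  # correct by the prediction error
    P = (P - np.outer(k, Px)) / lam
    return theta, P
```

Initializing P to a large multiple of the identity makes the early estimates track the data quickly, which is one source of the fast convergence claimed here.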

215 | On the convergence of stochastic iterative dynamic programming algorithms - Jaakkola, Jordan, et al. - 1994 |

Citation Context ...ve been obtained for the TD(0) algorithm by considering it as a special case of Watkins' (Watkins, 1989) Q-learning algorithm. Watkins and Dayan (Watkins & Dayan, 1992), Jaakkola, Jordan, and Singh (Jaakkola, et al., 1994), and Tsitsiklis (Tsitsiklis, 1993) note that since the TD(0) learning rule is a special case of Q-learning, their probability-one convergence proofs for Q-learning can be used to show that on-line u... |

191 | Neuron-like elements that can solve difficult learning control problems - Barto, Sutton, et al. - 1983 |

169 | Asynchronous stochastic approximation and Q-learning - Tsitsiklis - 1994 |

Citation Context ...by considering it as a special case of Watkins' (Watkins, 1989) Q-learning algorithm. Watkins and Dayan (Watkins & Dayan, 1992), Jaakkola, Jordan, and Singh (Jaakkola, et al., 1994), and Tsitsiklis (Tsitsiklis, 1993) note that since the TD(0) learning rule is a special case of Q-learning, their probability-one convergence proofs for Q-learning can be used to show that on-line use of the TD(0) learning rule (i.e.... |

89 | Generalization of backpropagation with application to a recurrent gas market model - Werbos - 1988 |

Citation Context ... is not invertible. The LS TD algorithm has some similarity to an algorithm Werbos (Werbos, 1990) proposed as a linear version of his Heuristic Dynamic Programming (Lukes, et al., 1990, Werbos, 1987, Werbos, 1988, Werbos, 1992). However, Werbos' algorithm is not amenable to a recursive formulation, as is LS TD, and does not converge for arbitrary initial parameter vectors, as does LS TD. See ref. (Bradtke, 19... |

88 | Approximate dynamic programming for real-time control and neural modeling - Werbos - 1992 |

Citation Context ...ible. The LS TD algorithm has some similarity to an algorithm Werbos (Werbos, 1990) proposed as a linear version of his Heuristic Dynamic Programming (Lukes, et al., 1990, Werbos, 1987, Werbos, 1988, Werbos, 1992). However, Werbos' algorithm is not amenable to a recursive formulation, as is LS TD, and does not converge for arbitrary initial parameter vectors, as does LS TD. See ref. (Bradtke, 1994). It remain... |

88 | Recursive Estimation and Time-Series Analysis - Young - 1984 |

Citation Context ...hen it is used with a function approximator linear in the adjustable parameters. To obtain this result, we use the instrumental variable approach (Ljung & Söderström, 1983, Söderström & Stoica, 1983, Young, 1984) which provides a way to handle least-squares estimation with training data that is noisy on both the input and output observations. We then define a recursive version of this algorithm, Re... |
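The instrumental-variable estimate this context invokes replaces ordinary least squares θ = (X′X)⁻¹X′y with θ = (Z′X)⁻¹Z′y, where the instruments Z are correlated with the regressors but not with the noise. A generic sketch (textbook form, illustrative names; the paper's specific choice of instruments for TD learning is not reproduced here):

```python
import numpy as np

def iv_estimate(Z, X, y):
    """Basic instrumental-variable estimate theta = (Z'X)^{-1} Z'y.

    Z: instrument matrix (one row per sample), X: regressor matrix,
    y: target vector.  When Z = X this reduces to the least-squares
    normal equations; distinct instruments remove the bias caused by
    noise on the regressors.
    """
    return np.linalg.solve(Z.T @ X, Z.T @ y)
```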

77 | Strategy learning with multilayer connectionist representations - Anderson - 1987 |

Citation Context ...uences of actions unfold over extended time periods. They allow a system to learn to predict the total amount of reward expected over time, and they can be used for other prediction problems as well (Anderson, 1988, Barto, et al., 1983, Sutton, 1984, Tesauro, 1992). We introduce two new TD algorithms based on the theory of linear least-squares function approximation. The recursive least-squares function approxi... |

60 | TD(λ) converges with probability 1 - Dayan, Sejnowski - 1994 |

Citation Context ...ction of θ when the feature vectors are linearly independent. Sutton (Sutton, 1988) and Dayan (Dayan, 1992) proved parameter convergence in the mean under these conditions, and Dayan and Sejnowski (Dayan & Sejnowski, 1994) proved parameter convergence with probability 1 under these conditions for TD(λ) applied to absorbing Markov chains in a trial-based manner, i.e., with parameter updates only at the end of every tri... |

50 | Instrumental Variable Methods for System Identification - Söderström, Stoica - 1983 |

50 | Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research - Werbos - 1987 |

Citation Context ... − γφ_{k+1})′] is not invertible. The LS TD algorithm has some similarity to an algorithm Werbos (Werbos, 1990) proposed as a linear version of his Heuristic Dynamic Programming (Lukes, et al., 1990, Werbos, 1987, Werbos, 1988, Werbos, 1992). However, Werbos' algorithm is not amenable to a recursive formulation, as is LS TD, and does not converge for arbitrary initial parameter vectors, as does LS TD. See ref... |

47 | Consistency of HDP Applied to a Simple Reinforcement Learning Problem - Werbos - 1990 |

Citation Context ... transition. The parameter vector θ_t is not well defined when t is small since the matrix [Σ_{k=1}^t φ_k(φ_k − γφ_{k+1})′] is not invertible. The LS TD algorithm has some similarity to an algorithm Werbos (Werbos, 1990) proposed as a linear version of his Heuristic Dynamic Programming (Lukes, et al., 1990, Werbos, 1987, Werbos, 1988, Werbos, 1992). However, Werbos' algorithm is not amenable to a recursive formulati... |

46 | Learning rate schedules for faster stochastic gradient search - Darken, Chang, et al. - 1992 |

Citation Context ...eter since each subsequence will contain fewer large step sizes. The step-size parameter sequence {α_t} was generated using the "search then converge" algorithm described by Darken, Chang, and Moody (Darken, et al., 1992): α_t = α_0 · (1 + (c/α_0)(t/τ)) / (1 + (c/α_0)(t/τ) + τ(t/τ)²). The choice of parameters α_0, c, and τ determines the transition of learning from "search mode" to "converge mode". Search mode describes the time during ... |
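The Darken-Chang-Moody "search then converge" schedule quoted in this context behaves like a constant α_0 for t much smaller than τ and decays like c/t for t much larger than τ. A sketch of the schedule as commonly stated (parameter names follow the citation; this is an illustrative rendering, not code from the paper):

```python
def stc_schedule(t, alpha0, c, tau):
    """Search-then-converge step size (after Darken, Chang & Moody, 1992).

    Approximately constant at alpha0 while t << tau ("search mode"),
    then decays like c/t once t >> tau ("converge mode"), which keeps
    the Robbins-Monro conditions satisfied asymptotically.
    """
    ratio = (c / alpha0) * (t / tau)
    return alpha0 * (1.0 + ratio) / (1.0 + ratio + tau * (t / tau) ** 2)
```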

20 | Incremental Dynamic Programming for On-Line Adaptive Control - Bradtke - 1994 |

Citation Context ...e proofs for Q-learning can be used to show that on-line use of the TD(0) learning rule (i.e., not trial-based) with a lookup-table function representation converges to V with probability 1. Bradtke (Bradtke, 1994) extended Tsitsiklis' proof to show that on-line use of TD(0) with a function approximator that is linear in the parameters and in which the feature vectors are linearly independent also converges to... |

11 | The Convergence of TD(λ) for General λ - Dayan - 1992 |

Citation Context ...y been proven for cases in which the value function is represented as a lookup table or as a linear function of θ when the feature vectors are linearly independent. Sutton (Sutton, 1988) and Dayan (Dayan, 1992) proved parameter convergence in the mean under these conditions, and Dayan and Sejnowski (Dayan & Sejnowski, 1994) proved parameter convergence with probability 1 under these conditions for TD(λ) ap... |

7 | Expectation driven learning with an associative memory - Lukes, Thompson, et al. - 1990 |

Citation Context ...e matrix [Σ_{k=1}^t φ_k(φ_k − γφ_{k+1})′] is not invertible. The LS TD algorithm has some similarity to an algorithm Werbos (Werbos, 1990) proposed as a linear version of his Heuristic Dynamic Programming (Lukes, et al., 1990, Werbos, 1987, Werbos, 1988, Werbos, 1992). However, Werbos' algorithm is not amenable to a recursive formulation, as is LS TD, and does not converge for arbitrary initial parameter vectors, as does ... |

2 | Temporal Credit Assignment in Reinforcement Learning - Sutton - 1984 |