## An analysis of temporal-difference learning with function approximation (1997)

Venue: IEEE Transactions on Automatic Control

Citations: 217 (7 self)

### BibTeX

@ARTICLE{Tsitsiklis97ananalysis,
  author = {John N. Tsitsiklis and Benjamin Van Roy},
  title = {An analysis of temporal-difference learning with function approximation},
  journal = {IEEE Transactions on Automatic Control},
  year = {1997}
}

### Abstract

We discuss the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain. The algorithm we analyze updates parameters of a linear function approximator on-line, during a single endless trajectory of an irreducible aperiodic Markov chain with a finite or infinite state space. We present a proof of convergence (with probability 1), a characterization of the limit of convergence, and a bound on the resulting approximation error. Furthermore, our analysis is based on a new line of reasoning that provides new intuition about the dynamics of temporal-difference learning. In addition to proving new and stronger positive results than those previously available, we identify the significance of on-line updating and potential hazards associated with the use of nonlinear function approximators. First, we prove that divergence may occur when updates are not based on trajectories of the Markov chain. This fact reconciles positive and negative results that have been discussed in the literature, regarding the soundness of temporal-difference learning. Second, we present an example illustrating the possibility of divergence when temporal-difference learning is used in the presence of a nonlinear function approximator.
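
As a concrete illustration of the algorithm the abstract describes, the sketch below runs on-line TD(λ) with a linear approximator along a single trajectory of a small irreducible, aperiodic Markov chain. The chain, costs, features, step size, and the iterate averaging are illustrative assumptions for this sketch, not the paper's own setup; the features here can represent the true cost-to-go exactly, so the algorithm should recover it.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states = 5
P = np.full((n_states, n_states), 1.0 / n_states)  # irreducible, aperiodic chain
cost = np.arange(n_states, dtype=float)            # per-state cost g(i)
gamma, lam, alpha = 0.9, 0.7, 0.01                 # discount, trace decay, step size

# Linear approximation V(i) ~= phi(i) @ theta.
phi = np.column_stack([np.ones(n_states), np.arange(n_states, dtype=float)])
theta = np.zeros(phi.shape[1])
z = np.zeros_like(theta)                           # eligibility trace

# Average the iterates after a burn-in to smooth the stochastic fluctuations
# left by the constant step size (an assumed convenience, not part of TD(lambda)).
n_steps, burn_in = 150_000, 50_000
theta_avg = np.zeros_like(theta)

s = 0
for t in range(n_steps):
    s_next = rng.choice(n_states, p=P[s])
    # temporal difference for the observed transition s -> s_next
    delta = cost[s] + gamma * phi[s_next] @ theta - phi[s] @ theta
    z = gamma * lam * z + phi[s]                   # decay and accumulate trace
    theta += alpha * delta * z                     # on-line update along the trajectory
    if t >= burn_in:
        theta_avg += theta
    s = s_next

theta_avg /= n_steps - burn_in
# For this uniform chain the true cost-to-go is V(i) = i + gamma * mean(V),
# which gives V(i) = i + 18, so theta_avg should be close to [18, 1].
print(theta_avg)
```

Since the true cost-to-go lies in the span of the features, the limit characterized in the paper coincides with it; with features that cannot represent V exactly, TD(λ) instead converges to the approximation whose error bound the paper establishes.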

### Citations

1227 | Learning to predict by the methods of temporal differences
- Sutton
- 1988
Citation Context ...ce learning is simple and elegant, a rigorous analysis of its behavior requires significant sophistication. Several previous papers have presented positive results about the algorithm. These include (Sutton, 1988), (Watkins and Dayan, 1992), (Tsitsiklis, 1994), (Jaakkola et al., 1994), (Dayan and Sejnowski, 1994), and (Gurvits et al., 1994), all of which only deal with cases where the number of tunable parame...

457 | Dynamic Programming and Optimal Control, Athena Scientific, 3rd edition
- Bertsekas
- 2007
Citation Context ...te enables the ranking of alternative states in order to guide decision-making. Indeed, such predictions constitute the cost-to-go function that is central to dynamic programming and optimal control (Bertsekas, 1995). Temporal-difference learning, originally proposed by Sutton (1988), is a method for approximating long-term future cost as a function of current state. The algorithm is recursive, efficient, and si...

368 | Practical issues in temporal difference learning - Tesauro - 1992 |

354 | Generalization in reinforcement learning: successful examples using sparse coding
- Sutton
- 1995
Citation Context ...sekas (1994). The sensitivity of the error bound to λ raises the question of whether or not it ever makes sense to set λ to values less than 1. Experimental results (Sutton, 1988; Singh and Sutton, 1994; Sutton, 1996) suggest that setting λ to values less than one can often lead to significant gains in the rate of convergence. Such acceleration may be critical when computation time and/or data (in the event that th...

339 | Adaptive algorithms and stochastic approximations - Benveniste, Métivier, et al. - 1990 |

251 | Generalization in reinforcement learning: Safely approximating the value function - Boyan, Moore - 1995 |

237 | Residual Algorithms: Reinforcement Learning with Function Approximation
- Baird
- 1995
Citation Context ...s or interpretable characterizations of the limit of convergence. In addition to the positive results, counterexamples to variants of the algorithm have been offered in several papers. These include (Baird, 1995), (Boyan and Moore, 1995), (Gordon, 1995), and (Tsitsiklis and Van Roy, 1996). As suggested by Sutton (1995), the key feature that distinguishes these negative results from their positive counterpart...
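
The hazard this context alludes to, divergence when updates are not applied along trajectories of the chain, can be seen in a well-known two-state illustration (in the spirit of the counterexamples cited here, not a construction taken verbatim from any of them). Two states have features φ(s1) = 1 and φ(s2) = 2, the transition s1 → s2 has zero cost, and the discount factor exceeds 1/2. Repeatedly applying the TD(0) update to this single transition alone, rather than sampling transitions along trajectories, makes the parameter blow up:

```python
gamma, alpha = 0.99, 0.1
theta = 1.0
for _ in range(100):
    # TD(0) update for the transition s1 -> s2 only:
    # V(s1) = 1 * theta, V(s2) = 2 * theta, cost = 0
    delta = 0.0 + gamma * (2 * theta) - (1 * theta)  # TD error
    theta += alpha * 1.0 * delta                     # update at s1, never at s2
print(theta)  # grows by the factor 1 + alpha*(2*gamma - 1) per step, ~1.1e4 here
```

Updating along trajectories would visit s2 as well, and the corrections there pull theta back; restricting updates to a favorable subset of transitions removes that counterweight, which is exactly the trajectory-based/non-trajectory distinction the paper formalizes.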

207 | Stable Function Approximation in Dynamic Programming
- Gordon
- 1995
Citation Context ...e not practical when state spaces are large or infinite. The more general case, involving the use of function approximation, is addressed by results in (Dayan, 1992), (Tsitsiklis and Van Roy, 1996), (Gordon, 1995), and (Singh et al., 1995). The latter three establish convergence with probability 1. However, their results only apply to a very limited class of function approximators and involve variants of a co...

207 | Convergence of stochastic iterative dynamic programming algorithms - Jaakkola, Jordan, et al. - 1994 |

187 | Reinforcement learning with replacing eligibility traces - Singh, Sutton - 1996 |

153 | Asynchronous stochastic approximation and Q-learning
- Tsitsiklis
- 1994
Citation Context ...us analysis of its behavior requires significant sophistication. Several previous papers have presented positive results about the algorithm. These include (Sutton, 1988), (Watkins and Dayan, 1992), (Tsitsiklis, 1994), (Jaakkola et al., 1994), (Dayan and Sejnowski, 1994), and (Gurvits et al., 1994), all of which only deal with cases where the number of tunable parameters is the same as the cardinality of the stat...

134 | Feature-based methods for large scale dynamic programming
- Tsitsiklis, Roy
- 1996
Citation Context ...ning. Furthermore, the ideas used in our analysis can be applied to the simpler context of absorbing Markov chains. Though this extension is omitted from this paper, it can be found in (Bertsekas and Tsitsiklis, 1996), which also contains a more accessible version of the results in this paper, for the case of finite state spaces. The contributions in this paper are as follows: 1. Convergence (with probability 1)...

111 | Reinforcement learning with soft state aggregation
- Singh, Jaakkola, et al.
- 1995
Citation Context ... state spaces are large or infinite. The more general case, involving the use of function approximation, is addressed by results in (Dayan, 1992), (Tsitsiklis and Van Roy, 1996), (Gordon, 1995), and (Singh et al., 1995). The latter three establish convergence with probability 1. However, their results only apply to a very limited class of function approximators and involve variants of a constrained version of tempo...

61 | The Convergence of TD(λ) for General λ
- Dayan
- 1992
Citation Context ...egant, a rigorous analysis of its behavior requires significant sophistication. Several previous papers have presented positive results about the algorithm. These include (Sutton, 1988), (Watkins and Dayan, 1992), (Tsitsiklis, 1994), (Jaakkola et al., 1994), (Dayan and Sejnowski, 1994), and (Gurvits et al., 1994), all of which only deal with cases where the number of tunable parameters is the same as the car...

55 | TD(λ) converges with probability 1 - Dayan, Sejnowski - 1994 |

11 | Incremental learning of evaluation functions for absorbing Markov chains: New methods and theorems - Gurvits, Lin, et al. - 1994 |

8 | Private communication - Gurvits - 1998 |

8 | On the virtues of linear learning and trajectory distributions - Sutton - 1995 |

6 | On the Cut-Off Phenomenon in Some Queueing Systems - Konstantopoulos, Baccelli - 1991 |

5 | A Counterexample to Temporal-Difference Learning - Bertsekas - 1994 |

5 | Mean-field analysis for batched TD(λ) - Pineda - 1997 |

4 | An Introduction to Queueing Networks - Walrand - 1988 |

3 | On the settling time of the congested GI/G/1 queue
- Stamoulis, Tsitsiklis
- 1990
Citation Context ...e speed of convergence of D_t(i_0) to D. Starting from state i_0, with i_0 large, the Markov chain has a negative drift, and requires O(i_0) steps to enter (with high probability) the vicinity of state 0 (Stamoulis and Tsitsiklis, 1990; Konstantopoulos and Baccelli, 1990). Once the vicinity of state 0 is reached, it quickly reaches steady-state. Thus, if we concentrate on φ_3(i) = i^3, the difference E[φ(i_t)φ(i_{t+m}) | i_0] − E_0[φ(i_t)φ(i_{t+m})] is of the order of i_0^6 for ...
