## Temporal Difference Methods for General Projected Equations

Citations: 5 (4 self)

### BibTeX

@MISC{Bertsekas_temporaldifference,
  author = {Dimitri P. Bertsekas},
  title = {Temporal Difference Methods for General Projected Equations},
  year = {}
}

### Abstract

We consider projected equations for the approximate solution of high-dimensional fixed point problems within low-dimensional subspaces. We introduce an analytical framework based on an equivalence with variational inequalities, and algorithms that may be implemented with low-dimensional simulation. These algorithms originated in approximate dynamic programming (DP), where they are collectively known as temporal difference (TD) methods. Even when specialized to DP, our methods include extensions/new versions of TD methods, which offer special implementation advantages and reduced overhead over the standard LSTD and LSPE methods, and can deal with near singularity in the associated matrix inversion. We develop deterministic iterative methods and their simulation-based versions, and we discuss a sharp qualitative distinction between them: the performance of the former is greatly affected by direction and feature scaling, yet the latter have the same asymptotic convergence rate regardless of scaling, because of their common simulation-induced performance bottleneck.

Index Terms: Dynamic programming, Markov decision processes, approximation methods, temporal difference methods, reinforcement learning.

### Citations

1518 | Iterative methods for sparse linear systems - Saad - 2003
Citation Context: ... least squares formulation. The vector that minimizes ‖x − Ax − b‖² is approximated by an x ∈ S such that the residual (x − Ax − b) is orthogonal to U (this is known as the Petrov-Galerkin condition [24]). If U = ΞS, where Ξ is a positive definite symmetric matrix, then the orthogonality condition is written as y′Ξ(x − Ax − b) = 0 for all y ∈ S, which together with the condition x ∈ S, is equivalen... |
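The Petrov-Galerkin condition quoted in this context reduces the high-dimensional equation x = Ax + b to a low-dimensional linear system: with S = range(Φ) and U = ΞS, orthogonality of the residual to U gives Φ′Ξ(I − A)Φ r = Φ′Ξ b. A minimal numerical sketch (all problem data below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 50, 5                       # high-dimensional problem, low-dimensional subspace

# Affine map x -> A x + b with small spectral radius, so a fixed point exists.
A = 0.9 * rng.random((n, n)) / n
b = rng.random(n)

Phi = rng.random((n, s))           # basis of the approximation subspace S = range(Phi)
Xi = np.diag(rng.random(n) + 0.1)  # positive definite diagonal weighting

# Petrov-Galerkin condition: residual of Phi r orthogonal to U = Xi * S, i.e.
#   Phi' Xi (Phi r - A Phi r - b) = 0   =>   C r = d
C = Phi.T @ Xi @ (Phi - A @ Phi)
d = Phi.T @ Xi @ b
r = np.linalg.solve(C, d)

# The residual is orthogonal to every column of Xi * Phi (up to roundoff).
residual = Phi @ r - A @ (Phi @ r) - b
assert np.allclose(Phi.T @ Xi @ residual, 0)
```

Note the solve is s × s (here 5 × 5) regardless of n, which is the point of the projected-equation approach.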

1227 | Learning to predict by the methods of temporal differences - Sutton - 1988
Citation Context: ...y use the temporal differences (TD) defined by q_{k,t} = φ(i_t)′r_k − αφ(i_{t+1})′r_k − b_{i_t}, t ≤ k, (5) where φ(i)′ denotes the ith row of the matrix Φ. The original method known as TD(0), due to Sutton [6], is r_{k+1} = r_k − γ_k φ(i_k) q_{k,k}, (6) where γ_k is a stepsize sequence that diminishes to 0. It may be viewed as a stochastic approximation/Robbins-Monro scheme for solving the equation Φ′Ξ(Φr − AΦr − b) =... |
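The TD(0) iteration (6) quoted in this context can be sketched for the discounted case T(x) = αPx + b. The chain, features, and stepsize schedule below are invented for illustration; the iterate is compared against the exact solution of the projected equation, computed via the stationary distribution:

```python
import numpy as np

rng = np.random.default_rng(2)
n, s, alpha = 10, 2, 0.9

# Invented Markov chain, cost vector, and feature matrix.
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
b = rng.random(n)
Phi = rng.random((n, s))

# Stationary distribution xi (power iteration), used only to compute the exact
# solution of the projected equation Phi' Xi (Phi r - alpha P Phi r - b) = 0.
xi = np.ones(n) / n
for _ in range(5000):
    xi = xi @ P
Xi = np.diag(xi)
C = Phi.T @ Xi @ (Phi - alpha * P @ Phi)
d = Phi.T @ Xi @ b
r_star = np.linalg.solve(C, d)

# TD(0): simulate the chain, update along the latest temporal difference.
r = np.zeros(s)
i = 0
for k in range(200_000):
    j = rng.choice(n, p=P[i])                    # next state i_{k+1}
    q = Phi[i] @ r - alpha * Phi[j] @ r - b[i]   # temporal difference q_{k,k}
    gamma = 10.0 / (k + 100)                     # diminishing stepsize
    r = r - gamma * Phi[i] * q
    i = j

# The iterate should have moved from 0 toward r_star.
assert np.linalg.norm(r - r_star) < np.linalg.norm(r_star)
```

The slow, noise-limited progress of this scheme is exactly the behavior the surrounding contexts criticize, which motivates the scaled and least-squares variants (FPKF, LSPE, LSTD).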

1198 | Markov Decision Processes: Discrete Stochastic Dynamic Programming - Puterman - 1994 |

848 | Introduction to Reinforcement Learning - Sutton, Barto - 1998
Citation Context: ... is described in detail in the literature, has been extensively tested in practice, and is one of the major methods for approximate DP (see the books by Bertsekas and Tsitsiklis [3], Sutton and Barto [4], and Powell [5]; Bertsekas [1] provides a recent textbook treatment and up-to-date references). For problems of very high dimension, classical matrix inversion methods cannot be used to solve the pro... |

745 | Nonlinear Programming - Bertsekas - 2004
Citation Context: ... scheme. This approach is described in detail in the literature, has been extensively tested in practice, and is one of the major methods for approximate DP (see the books by Bertsekas and Tsitsiklis [3], Sutton and Barto [4], and Powell [5]; Bertsekas [1] provides a recent textbook treatment and up-to-date references). For problems of very high dimension, classical matrix inversion methods cannot be... |

619 | Parallel and Distributed Computation: Numerical Methods - Bertsekas, Tsitsiklis - 1989 |

266 | Monotone operators and the proximal point algorithm - Rockafellar - 1976
Citation Context: ...this type of sensitivity, we may use a regularization approach, which is well-known in the theory of the proximal point algorithm for monotone variational inequalities (see Martinet [29], Rockafellar [30], or the text by Facchinei and Pang [25], Section 12.3). In particular, we approximate the equation C_k r = d_k by (C_k + βI)r = d_k + β r̄, (41) where β is a positive scalar and r̄ is some guess of the sol... |
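The regularized system (41) quoted above is designed to tame a near-singular C_k. A tiny sketch with invented data, contrasting a direct solve against the regularized solve (the values of β and r̄ are arbitrary illustrations):

```python
import numpy as np

# Nearly singular 2x2 system C r = d, perturbed by small simulation noise,
# as arises with simulation-generated estimates C_k, d_k.
C = np.array([[1.0, 0.0],
              [0.0, 1e-10]])
d = np.array([1.0, 1e-10]) + 1e-6

# Direct solve: the tiny singular value amplifies the noise enormously.
r_direct = np.linalg.solve(C, d)

# Regularized solve (C + beta I) r = d + beta r_bar, cf. Eq. (41).
beta = 1e-3
r_bar = np.zeros(2)   # prior guess of the solution
r_reg = np.linalg.solve(C + beta * np.eye(2), d + beta * r_bar)

# The regularized solution stays bounded while the direct one blows up.
assert np.linalg.norm(r_reg) < np.linalg.norm(r_direct)
```

The price of the regularization is a bias toward r̄, which is why the context describes it as a proximal-point-style approximation rather than an exact solve.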

217 | An Analysis of Temporal-Difference Learning with Function Approximation - Tsitsiklis, Van Roy - 1997
Citation Context: ...llman’s equation, they become unreliable because in the nonlinear context, ΠT need not be a contraction [3], [16] (a notable exception is optimal stopping problems, as shown by Tsitsiklis and Van Roy [17], [18]; see also Yu and Bertsekas [19]). The alternatives to iterative methods are matrix inversion methods, a prime example of which is the Least Squares Temporal Differences method (LSTD, proposed b... |

179 | Linear Least-Squares algorithms for temporal difference learning - Bradtke, Barto - 1996
Citation Context: ...and Bertsekas [19]). The alternatives to iterative methods are matrix inversion methods, a prime example of which is the Least Squares Temporal Differences method (LSTD, proposed by Bradtke and Barto [20], and followed up by Boyan [21], and Nedić and Bertsekas [13]). It writes the projected equation (4) in an equivalent linear form Cr = d, where C is an s × s matrix, and d ∈ ℜ^s, then uses the type of... |

175 | Actor-critic algorithms - Konda, Tsitsiklis - 1999
Citation Context: ...and side Φ′Ξ(Φr − AΦr − b) of the equation. Because TD(0) is often slow and unreliable (this is well-known in practice and typical of stochastic approximation schemes; see also the analysis by Konda [10]), alternative iterative methods have been proposed. One of them is the Fixed Point Kalman Filter (FPKF), proposed by Choi and Van Roy [11] and given by r_{k+1} = r_k − γ_k D_k^{-1} φ(i_k) q_{k,k}, (7) where D_k is... |

166 | Approximate Dynamic Programming: Solving the Curses of Dimensionality - Powell - 2007
Citation Context: ... detail in the literature, has been extensively tested in practice, and is one of the major methods for approximate DP (see the books by Bertsekas and Tsitsiklis [3], Sutton and Barto [4], and Powell [5]; Bertsekas [1] provides a recent textbook treatment and up-to-date references). For problems of very high dimension, classical matrix inversion methods cannot be used to solve the projected equation,... |

154 | Iterative Methods for Sparse Linear Systems, 2nd edition - Saad - 2003
Citation Context: ...oach uses two subspaces, S and U, and a least squares formulation. The vector that minimizes ‖x − Ax − b‖² is approximated by an x ∈ S such that the residual (x − Ax − b) is orthogonal to U (this is known as the Petrov-Galerkin condition [24]). If U = ΞS, where Ξ is a positive definite symmetric matrix, then the orthogonality condition is written as y′Ξ(x − Ax − b) = 0 for all y ∈ S, which together with... 2130 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 56, NO. 9, SEPTEMB... |

140 | Positive Solutions of Operator Equations - Krasnoselskii - 1964
Citation Context: ...e form (2), and a (possibly weighted) Euclidean projection operator Π from ℜ^n to S. Then we approximate a fixed point with a vector Φr ∈ S that solves the projected equation Φr = Π(AΦr + b) (see e.g., [22], [23]). Thus the projected equation framework of approximate DP is a special case of Galerkin approximation. This connection, which is potentially significant, does not seem to have been mentioned in... |

88 | Technical Update: Least-Squares Temporal Difference Learning - Boyan - 1999
Citation Context: ...atives to iterative methods are matrix inversion methods, a prime example of which is the Least Squares Temporal Differences method (LSTD, proposed by Bradtke and Barto [20], and followed up by Boyan [21], and Nedić and Bertsekas [13]). It writes the projected equation (4) in an equivalent linear form Cr = d, where C is an s × s matrix, and d ∈ ℜ^s, then uses the type of simulation described earlier t... |

75 | Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Transactions on Automatic Control, 44(10):1840–1851 - Tsitsiklis, Van Roy - 1999
Citation Context: ...an’s equation, they become unreliable because in the nonlinear context, ΠT need not be a contraction [3], [16] (a notable exception is optimal stopping problems, as shown by Tsitsiklis and Van Roy [17], [18]; see also Yu and Bertsekas [19]). The alternatives to iterative methods are matrix inversion methods, a prime example of which is the Least Squares Temporal Differences method (LSTD, proposed by Brad... |

71 | Stochastic Approximation: A Dynamical Systems Viewpoint - Borkar
Citation Context: ...nce rate, which is fast relative to the slow convergence rate of the simulation-generated D_k, C_k, and d_k. As a result the iterations (34), (42), and (44) operate on two time scales (see, e.g., Borkar [32], Ch. 6): the slow time scale at which D_k, C_k, and d_k change, and the fast time scale at which r_k adapts to changes in D_k, C_k, and d_k. It follows that there is convergence in the fast time scale befor... |

63 | Least Squares Policy Evaluation Algorithms with Linear Function Approximation. Discrete Event Dynamic Systems: Theory and Applications - Nedić, Bertsekas - 2003
Citation Context: ...l proof of convergence rate superiority over TD(0). An alternative to TD(0) is the Least Squares Policy Evaluation algorithm (LSPE, proposed by Bertsekas and Ioffe [12]; see also Nedić and Bertsekas [13], Bertsekas, Borkar, and Nedić [14], Yu and Bertsekas [15]): r_{k+1} = r_k − (1/(k+1)) D_k^{-1} ∑_{t=0}^{k} φ(i_t) q_{k,t}, (9) where D_k is given by Eq. (8). While this method resembles the FPKF iteration (7), it is dif... |
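A deterministic counterpart of the LSPE iteration (9), with the simulation averages replaced by their exact limits D = Φ′ΞΦ, C = Φ′Ξ(I − αP)Φ, d = Φ′Ξb, is the iteration r ← r − D⁻¹(Cr − d), which can be rewritten as projected value iteration Φr ← Π(αPΦr + b). A sketch with invented problem data, where Ξ is the stationary distribution so that ΠT is a contraction:

```python
import numpy as np

rng = np.random.default_rng(1)
n, s, alpha = 20, 3, 0.9

# Invented Markov chain, cost vector, and features.
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
b = rng.random(n)
Phi = rng.random((n, s))

# Stationary distribution via power iteration; with this weighting,
# the projected mapping Pi T is a contraction of modulus <= alpha.
xi = np.ones(n) / n
for _ in range(2000):
    xi = xi @ P
Xi = np.diag(xi)

D = Phi.T @ Xi @ Phi                        # limit of D_k
C = Phi.T @ Xi @ (Phi - alpha * P @ Phi)    # limit of C_k
d = Phi.T @ Xi @ b                          # limit of d_k

# Deterministic LSPE-style step: r <- r - D^{-1} (C r - d).
r = np.zeros(s)
for _ in range(500):
    r = r - np.linalg.solve(D, C @ r - d)

r_star = np.linalg.solve(C, d)              # exact projected-equation solution
assert np.allclose(r, r_star)
```

This illustrates the deterministic/simulation distinction the abstract emphasizes: here the scaling matrix D directly determines the convergence rate, whereas in the simulation-based version the rate is dominated by how fast D_k, C_k, d_k converge.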

46 | Computational Galerkin Methods - Fletcher - 1984
Citation Context: ...(2), and a (possibly weighted) Euclidean projection operator Π from ℜ^n to S. Then we approximate a fixed point with a vector Φr ∈ S that solves the projected equation Φr = Π(AΦr + b) (see e.g., [22], [23]). Thus the projected equation framework of approximate DP is a special case of Galerkin approximation. This connection, which is potentially significant, does not seem to have been mentioned in the l... |

40 | Temporal differences-based policy iteration and applications in neuro-dynamic programming - Bertsekas, Ioffe - 1996
Citation Context: ... reported, albeit without theoretical proof of convergence rate superiority over TD(0). An alternative to TD(0) is the Least Squares Policy Evaluation algorithm (LSPE, proposed by Bertsekas and Ioffe [12]; see also Nedić and Bertsekas [13], Bertsekas, Borkar, and Nedić [14], Yu and Bertsekas [15]): r_{k+1} = r_k − (1/(k+1)) D_k^{-1} ∑_{t=0}^{k} φ(i_t) q_{k,t}, (9) where D_k is given by Eq. (8). While this method resembl... |

34 | Projection Methods for Variational Inequalities with Application to the Traffic Assignment Problem - Bertsekas, Gafni - 1982
Citation Context: ...of strong monotonicity of F, it turns out that this iteration is convergent in a way similar to the case where F is strongly monotone. In particular, in a paper devoted to the case F(r) = Φ′f(Φr) [27], it was shown that there exists γ̄ > 0 such that r_k → r∗ linearly for each γ ∈ (0, γ̄], where r∗ is some solution of f(Φr∗)′Φ(r − r∗) ≥ 0, ∀ r ∈ R̂, ... |

33 | A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning. Discrete Event Dynamic Systems - Choi, Van Roy - 2006
Citation Context: ...hastic approximation schemes; see also the analysis by Konda [10]), alternative iterative methods have been proposed. One of them is the Fixed Point Kalman Filter (FPKF), proposed by Choi and Van Roy [11] and given by r_{k+1} = r_k − γ_k D_k^{-1} φ(i_k) q_{k,k}, (7) where D_k is a positive definite symmetric scaling matrix, selected to speed up convergence. It is a scaled (by the matrix D_k) version of TD(0), so it ... |

27 | Dynamic Programming and Optimal Control, 3rd ed - Bertsekas
Citation Context: ...icy, b is a given cost vector of the policy, and α ∈ (0, 1) is a discount factor. Other cases where α = 1 include the classical average cost and stochastic shortest path problems; see e.g., Bertsekas [1], Puterman [2]. An approximate/projected solution of Bellman’s equation can be used to generate an (approximately) improved policy through an (approximate) policy iteration scheme. This approach is de... |

25 | Improved temporal difference methods with linear function approximation - Bertsekas, Borkar, et al. - 2004
Citation Context: ...rity over TD(0). An alternative to TD(0) is the Least Squares Policy Evaluation algorithm (LSPE, proposed by Bertsekas and Ioffe [12]; see also Nedić and Bertsekas [13], Bertsekas, Borkar, and Nedić [14], Yu and Bertsekas [15]): r_{k+1} = r_k − (1/(k+1)) D_k^{-1} ∑_{t=0}^{k} φ(i_t) q_{k,t}, (9) where D_k is given by Eq. (8). While this method resembles the FPKF iteration (7), it is different in a fundamental way because... |

24 | On the existence of fixed points for approximate value iteration and temporal-difference learning - de Farias, Van Roy - 2000
Citation Context: ...s of the Markov chain. When these algorithms are extended to solve nonlinear versions of Bellman’s equation, they become unreliable because in the nonlinear context, ΠT need not be a contraction [3], [16] (a notable exception is optimal stopping problems, as shown by Tsitsiklis and Van Roy [17], [18]; see also Yu and Bertsekas [19]). The alternatives to iterative methods are matrix inversion methods, ... |

21 | Projected Equation Methods for Approximate Solution of Large Linear Systems - Bertsekas, Yu
Citation Context: ...he case λ > 0 in Section 4.E. For the unconstrained case (Ŝ = S) and λ > 0, analogs of TD(λ), LSPE(λ), and LSTD(λ) for general projected equations and their convergence properties are discussed in [8] and [9]. ...and uses the time average (k+1)^{-1} ∑_{t=0}^{k} φ(i_t) q_{k,t} of the TD term in its right-hand side in place of φ(i_k) q_{k,k}, the latest sample of the TD term [cf. Eqs. (6) and (7)]. This resul... |

18 | Average cost temporal-difference learning - Tsitsiklis, Van Roy - 1999
Citation Context: ...ion property of ΠT. This shows that f is strongly monotone on S. There are well-known cases in approximate DP where ΠT is a contraction with respect to ‖·‖_Ξ, with Ξ a diagonal matrix (see [3], [17], [28], [1], [15]). An example is discounted or average cost DP, where T(x) = αPx + b, with α ∈ (0, 1], P is a transition probability matrix of an ergodic Markov chain, and Ξ is a diagonal matrix with the... |

17 | Convergence results for some temporal difference methods based on least squares - Yu, Bertsekas - 2006
Citation Context: ...ernative to TD(0) is the Least Squares Policy Evaluation algorithm (LSPE, proposed by Bertsekas and Ioffe [12]; see also Nedić and Bertsekas [13], Bertsekas, Borkar, and Nedić [14], Yu and Bertsekas [15]): r_{k+1} = r_k − (1/(k+1)) D_k^{-1} ∑_{t=0}^{k} φ(i_t) q_{k,t}, (9) where D_k is given by Eq. (8). While this method resembles the FPKF iteration (7), it is different in a fundamental way because it is not a stochastic... |

17 | A least squares Q-learning algorithm for optimal stopping problems - Yu, Bertsekas - 2007
Citation Context: ...le because in the nonlinear context, ΠT need not be a contraction [3], [16] (a notable exception is optimal stopping problems, as shown by Tsitsiklis and Van Roy [17], [18]; see also Yu and Bertsekas [19]). The alternatives to iterative methods are matrix inversion methods, a prime example of which is the Least Squares Temporal Differences method (LSTD, proposed by Bradtke and Barto [20], and followed... |

7 | Least squares temporal difference methods: An analysis under general conditions - Yu |

7 | Approximate simulation-based solution of large-scale least squares problems - Wang, Polydorides, et al. - 2009
Citation Context: ...se an alternative regularization approach, based on a conversion to a least squares problem (also used in a related simulation-based equation approximation context by Wang, Polydorides, and Bertsekas [31]). We introduce a positive definite symmetric matrix Σ_k and replace the equation C_k r = d_k with minimization of (C_k r − d_k)′Σ_k^{-1}(C_k r − d_k) over r ∈ ℜ^s. We then iterate according to r_{k+1} = arg min... |
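The least-squares replacement of C_k r = d_k described above minimizes (C_k r − d_k)′Σ_k⁻¹(C_k r − d_k); its normal equations are C_k′Σ_k⁻¹C_k r = C_k′Σ_k⁻¹d_k, which involve a symmetric positive semidefinite matrix even when C_k itself is unsymmetric. A sketch with invented data (here C is well conditioned, so the minimizer coincides with the solution of Cr = d):

```python
import numpy as np

rng = np.random.default_rng(3)
s = 4

# Stand-ins for the simulation estimates C_k, d_k and the weighting Sigma_k.
C = rng.random((s, s)) + s * np.eye(s)   # well-conditioned for this illustration
d = rng.random(s)
Sigma = np.diag(rng.random(s) + 0.5)     # positive definite symmetric weighting

# Normal equations of  min_r (C r - d)' Sigma^{-1} (C r - d):
#   C' Sigma^{-1} C r = C' Sigma^{-1} d
Si = np.linalg.inv(Sigma)
r = np.linalg.solve(C.T @ Si @ C, C.T @ Si @ d)

# With C invertible, the least-squares minimizer solves C r = d exactly.
assert np.allclose(C @ r, d)
```

When C is near-singular, the quadratic form remains minimizable; the iterative scheme the context goes on to describe (truncated here) builds on this reformulation.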

6 | Projected Equations, Variational Inequalities, and Temporal Difference Methods - Bertsekas - 2009
Citation Context: ...mely large and Φ cannot be chosen favorably. [25]), and the connection with projected equations can be used as a basis for an approximate solution approach. We discuss these connections in the report [7], which is in effect an extended version of the present paper. C. Approximate Solution of Variational Inequalities This context is more general than the preceding two because R̂ may be a strict subset... |

6 | Approximate Solution of Operator Equations - Krasnoselskii et al. - 1972
Citation Context: ...s a vector, a subspace of the form (2), and a (possibly weighted) Euclidean projection operator Π from ℜ^n to S. Then we approximate a fixed point with a vector Φr ∈ S that solves the projected equation Φr = Π(AΦr + b) (see e.g., [22], [23]). Thus, the projected equation framework of approximate DP is a special case of Galerkin approximation. This connection, which is potentially significant, does not seem to have been mentioned i... |

5 | Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing financial derivatives - Tsitsiklis, Van Roy - 1999
Citation Context: ...s equation, they become unreliable because in the nonlinear context, ΠT need not be a contraction [3], [16] (a notable exception is optimal stopping problems, as shown by Tsitsiklis and Van Roy [17], [18]; see also Yu and Bertsekas [19]). The alternatives to iterative methods are matrix inversion methods, a prime example of which is the Least Squares Temporal Differences method (LSTD, proposed by Brad... |

1 | Régularisation d’inéquations variationnelles par approximations successives. Revue Française d’Informatique et de... - Martinet - 1970
Citation Context: ...y fast. To reduce this type of sensitivity, we may use a regularization approach, which is well-known in the theory of the proximal point algorithm for monotone variational inequalities (see Martinet [29], Rockafellar [30], or the text by Facchinei and Pang [25], Section 12.3). In particular, we approximate the equation C_k r = d_k by (C_k + βI)r = d_k + β r̄, (41) where β is a positive scalar and r̄ is som... |