## A unified analysis of value-function-based reinforcement-learning algorithms. Neural Computation (1997)

Citations: 34 (7 self)

### BibTeX

@MISC{Szepesvari97aunified,
  author = {Csaba Szepesvari and Michael L. Littman},
  title = {A unified analysis of value-function-based reinforcement-learning algorithms. Neural Computation},
  year = {1997}
}


### Abstract

Reinforcement learning is the problem of generating optimal behavior in a sequential decision-making environment given the opportunity of interacting with it. Many algorithms for solving reinforcement-learning problems work by computing improved estimates of the optimal value function. We extend prior analyses of reinforcement-learning algorithms and present a powerful new theorem that can provide a unified analysis of value-function-based reinforcement-learning algorithms. The usefulness of the theorem lies in how it allows the convergence of a complex asynchronous reinforcement-learning algorithm to be proven by verifying that a simpler synchronous algorithm converges. We illustrate the application of the theorem by analyzing the convergence of Q-learning, model-based reinforcement learning, Q-learning with multi-state updates, Q-learning for Markov games, and risk-sensitive reinforcement learning.
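To make the abstract's central object concrete, here is a minimal tabular Q-learning sketch of the kind the paper analyzes: the learner improves an estimate of the optimal Q function from sampled transitions, without ever building a model of the transition probabilities or costs. The environment interface (`reset`/`step`), the cost-minimization convention, and all hyperparameters are assumptions of this sketch, not details from the paper.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, gamma=0.9, alpha=0.1, epsilon=0.1):
    """Tabular Q-learning for a cost-minimization MDP (a sketch).

    `env.reset()` returns a start state; `env.step(x, a)` returns
    (next_state, cost, done). Both are assumed interfaces.
    """
    Q = defaultdict(float)  # Q[(state, action)], initialized to 0
    for _ in range(episodes):
        x = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration; greedy = argmin, since we minimize cost
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = min(actions, key=lambda act: Q[(x, act)])
            y, cost, done = env.step(x, a)
            # One-step lookahead target: cost plus discounted best value at y
            target = cost + (0.0 if done else gamma * min(Q[(y, b)] for b in actions))
            Q[(x, a)] += alpha * (target - Q[(x, a)])
            x = y
    return Q
```

On a small deterministic chain with unit costs, the learned values settle near the discounted costs-to-go, as the paper's convergence theorem would predict for this synchronous special case.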

### Citations

4165 | Reinforcement Learning - An Introduction - Sutton, Barto - 1998 |

1412 | Learning from Delayed Rewards - Watkins - 1989 |
Citation Context: ... to approximate T (the optimal value operator given in Equation (1)) by T_t = T(p_t, c_t), and then uses the operator sequence T_t to build an estimate of v*. In a model-free approach, such as Q-learning (Watkins 1989), for example, the decision maker directly estimates v* without ever estimating p or c. We describe an abstract version of Q-learning next, as it provides a framework and vocabulary for summariz... |

1405 | Reinforcement learning: A survey - Kaelbling, Littman, et al. - 1996 |

829 | Neuro-dynamic programming - Bertsekas, Tsitsiklis - 1996 |

608 | A stochastic approximation method - Robbins, Monro |

563 | Learning to act using real-time dynamic programming - Barto, Bradtke, et al. - 1995 |

533 | Markov games as a framework for multi-agent reinforcement learning - Littman - 1994 |

436 | Real-time heuristic search - Korf - 1990 |

401 | Stochastic Approximation Algorithms and Applications - Kushner, Yin - 1997 |
Citation Context: ...mponents under which each component has a positive probability (assuming a finite number of components). This latter type of process can be handled using ODE (ordinary differential equation) methods (Kushner & Yin 1997), although this is not the approach taken here. It would be possible, nevertheless, to extend the theorem such that in the Lipschitz conditions we used a conditional expectation with respect to an ap... |

383 | Adaptive Algorithms and Stochastic Approximations - Benveniste, Metivier, et al. - 1991 |

336 | Prioritized sweeping: Reinforcement learning with less data and less time - Moore, Atkeson - 1993 |

305 | On-line Q-learning using connectionist systems - Rummery, Niranjan - 1994 |

299 | Multiagent reinforcement learning: Theoretical framework and an algorithm - Hu, Wellman - 1998 |

288 | Stochastic Approximation Methods for Constrained and Unconstrained Systems - Kushner, Clark - 1978 |

227 | Stable function approximation in dynamic programming - Gordon - 1995 |

215 | On the convergence of stochastic iterative dynamic programming algorithms - Jaakkola, Jordan, et al. - 1994 |

200 | Learning and sequential decision making - Barto, Sutton, et al. - 1990 |

196 | Reinforcement learning with replacing eligibility traces - Singh, Sutton - 1996 |
Citation Context: ...antees convergence to an optimal Q function, and the case in which learned values are a function of the chosen exploratory actions (the so-called SARSA algorithm) (John 1994; Rummery & Niranjan 1994; Singh & Sutton 1996; Singh et al. 1998). 3.4 Q-learning for Markov Games. In an MDP, a single decision maker selects actions to minimize its expected discounted cost in a stochastic environment. A generalization of t... |

182 | Algorithms for Sequential Decision Making - Littman - 1996 |

169 | Asynchronous stochastic approximation and Q-learning - Tsitsiklis - 1994 |

149 | Analysis of Recursive Stochastic Algorithms - Ljung - 1977 |

124 | Convergence results for single-step on-policy reinforcement-learning algorithms - Singh, Jaakkola, et al. |
Citation Context: ...ions, however, are not needed for the applications presented in this paper and introduce unneeded complications. These extensions are needed, and have been made, in the convergence analysis of SARSA (Singh et al. 1998). See also the work of Szepesvári (1998). 3 Analysis of Reinforcement-Learning Algorithms. In this section, we apply the results described in Section 2 to prove the convergence of a variety of reinfor... |

121 | Reinforcement learning with soft state aggregation - Singh, Jaakkola, et al. - 1996 |

104 | Average reward reinforcement learning: foundations, algorithms and empirical results - Mahadevan - 1996 |

103 | A reinforcement learning method for maximizing undiscounted rewards - Schwartz - 1993 |

97 | Markov Decision Processes—Discrete Stochastic Dynamic Programming - Puterman - 1994 |

52 | Consideration of risk in reinforcement learning - Heger - 1994 |
Citation Context: ...this article). As a consequence of this, model-based methods can be used to find optimal policies in MDPs, alternating Markov games, Markov games (Littman 1994), risk-sensitive models (Heger 1994), and exploration-sensitive (i.e., SARSA) models (John 1994; Rummery & Niranjan 1994). Also, if we fix c_t(x, a, y) = c(x, a, y) and p_t(x, a, y) = Pr(y|x, a) for all t, x, y ∈ X and a ∈ A, this result... |

46 | Adaptive aggregation methods for infinite horizon dynamic programming - Bertsekas, Castanon - 1989 |

44 | A convergence theorem for nonnegative almost supermartingales and some applications, in: Optimizing Methods - Robbins, Siegmund - 1971 |

43 | A generalized reinforcement-learning model: Convergence and applications - Littman, Szepesvári - 1996 |

33 | Game Theory, second edition - Owen - 1982 |
Citation Context: ...non-expansions (Littman 1996). Markov games are a generalization of both MDPs and alternating Markov games in which the two players simultaneously choose actions at each step in the process (Owen 1982; Littman 1994). The basic model is defined by the tuple (X, A, B, Pr(·|·,·), c) (states, min actions, max actions, transitions, and costs) and discount factor γ. As in alternating Markov games, the... |
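The Markov-game operator that this excerpt leads into is cut off; the standard form (following Littman 1994, reconstructed here as an assumption rather than quoted from the paper) replaces the MDP's minimization with the minimax value of a matrix game at each state:

```latex
(Tv)(x) \;=\; \operatorname{val}_{a \in A,\; b \in B}
  \sum_{y \in X} \Pr(y \mid x, a, b)\,\bigl(c(x, a, b, y) + \gamma\, v(y)\bigr),
```

where \(\operatorname{val}\) denotes the value of the zero-sum matrix game whose entries are the bracketed expected costs, the min player choosing \(a \in A\) and the max player choosing \(b \in B\).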

30 | Aggregation methods for large Markov chains - Schweitzer - 1984 |
Citation Context: ...s(z, a, x) as the degree of expected difference between Q*(z, a) and Q*(x, a). Note that the above learning process is closely related to learning on aggregated states (Bertsekas & Castanon 1989; Schweitzer 1984; Singh, Jaakkola, & Jordan 1995). An aggregated state is simply a subset X_i of X. The idea is that the size of the Q table (which stores the Q_t(x, a) values) could be reduced if we assigned a common ... |

27 | Convergence of indirect adaptive asynchronous value iteration algorithms - Gullapalli, Barto - 1994 |

27 | A learning algorithm for Markov Decision Processes with adaptive state aggregation - Baras, Borkar - 1999 |
Citation Context: ...then there does not seem to be any simple way to ensure the convergence of Q_t unless randomized policies are used during learning whose rate of change is slower than that of the estimation process (Konda & Borkar 1997). Note that Corollary 5 could also be applied directly to this rule. Another way to deduce the above convergence result is to consider the learning rule over the aggregated states as a standard Q-le... |

26 | Generalized markov decision processes: Dynamic-programming and reinforcement-learning algorithms - Szepesvári, Littman - 1996 |

23 | Embedding Fields: A Theory of Learning with Physiological Implications - Grossberg - 1969 |

14 | When the best move isn't optimal: Q-learning with exploration - John - 1994 |
Citation Context: ...ods can be used to find optimal policies in MDPs, alternating Markov games, Markov games (Littman 1994), risk-sensitive models (Heger 1994), and exploration-sensitive (i.e., SARSA) models (John 1994; Rummery & Niranjan 1994). Also, if we fix c_t(x, a, y) = c(x, a, y) and p_t(x, a, y) = Pr(y|x, a) for all t, x, y ∈ X and a ∈ A, this result implies that asynchronous dynamic programming converges to... |

9 | Q-Learning combined with spreading: Convergence and results - Ribeiro, Szepesvári - 1996 |

9 | Static and Dynamic Aspects of Optimal Sequential Decision Making - Szepesvári - 1998 |

8 | Fictitious play applied to sequences of games and discounted stochastic games - Vrieze, Tijs - 1982 |

5 | Attentional mechanisms as a strategy for generalisation in the Q-learning algorithm - Ribeiro - 1995 |

5 | The asymptotic convergence rates for Q-learning - Szepesvári - 1997 |

3 | Algorithms for Sequential Decision Making - unknown authors - 1996 |

2 | On the asymptotic convergence rate of Q-learning - Szepesvári - 1997 |
Citation Context: ...standard stochastic approximation theory.) Note also that the methods developed in this paper can be used to obtain asymptotic convergence-rate results for averaging-type asynchronous algorithms (Szepesvári 1997). Similarly to Jaakkola, Jordan, & Singh (1994) and Tsitsiklis (1994), we develop the connection between stochastic approximation theory and reinforcement learning in MDPs. Our work is similar in ... |

1 | Learning and sequential decision making - Barto, Sutton, et al. - 1989 |

1 | Adaptive Algorithms and Stochastic Approximations - Benveniste, Metivier, et al. - 1990 |

1 | On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation - Jaakkola, Jordan, et al. - 1994 |

1 | Real-time heuristic search - Korf - 1990 |

1 | Average reward reinforcement learning: Foundations, algorithms, and empirical results - Mahadevan - 1996 |

1 | Markov Decision Processes: Discrete Stochastic Dynamic Programming - Puterman - 1994 |
Citation Context: ...immediate cost received on discrete time step t. Consider a finite MDP with the objective criterion of minimizing total discounted expected cost. The optimal value function v*, as is well known (Puterman 1994), is the fixed point of the optimal value operator T: B(X) → B(X), (Tv)(x) = min_{a ∈ A} Σ_{y ∈ X} Pr(y|x, a) (c(x, a, y) + γ v(y)) (1), 0 ≤ γ < 1, where Pr(y|x, a) is the probability of going to state y... |
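The operator T quoted in this last excerpt can be turned directly into value iteration: apply T repeatedly until the value function stops changing. A small sketch, where the dictionary encodings of Pr and c are illustrative assumptions:

```python
def apply_T(v, P, c, gamma):
    """One application of the optimal value operator T:
    (Tv)(x) = min_a sum_y Pr(y|x,a) * (c(x,a,y) + gamma * v(y)).

    P[x][a] maps next states y to probabilities Pr(y|x,a);
    c[(x, a, y)] is the transition cost (both assumed encodings).
    """
    return {
        x: min(
            sum(p * (c[(x, a, y)] + gamma * v[y]) for y, p in outcomes.items())
            for a, outcomes in actions.items()
        )
        for x, actions in P.items()
    }

def value_iteration(P, c, gamma, tol=1e-10):
    """Iterate T from v = 0 until successive iterates agree within tol."""
    v = {x: 0.0 for x in P}
    while True:
        v_next = apply_T(v, P, c, gamma)
        if max(abs(v_next[x] - v[x]) for x in v) < tol:
            return v_next
        v = v_next
```

Since 0 ≤ γ < 1 makes T a contraction in the max norm, the iteration converges to the fixed point v* regardless of the starting guess.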