Results 1 – 3 of 3
Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
, 1996
"... This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms is described, ranging from synchronous dyna ..."
Abstract

Cited by 99 (12 self)
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms is described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies, and can fall into suboptimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
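The contrast the abstract draws between R-learning and Q-learning comes down to what is subtracted from the sampled reward before bootstrapping: a discount on future value versus an independently estimated average reward ρ. A minimal sketch of the two tabular update rules, assuming a hypothetical two-state, two-action table and illustrative step sizes (none of these constants are taken from the paper):

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Discounted Q-learning: bootstrap on the gamma-discounted future value.
    Q[s, a] += alpha * (r + gamma * max(Q[s_next, b] for b in (0, 1)) - Q[s, a])

def r_update(Q, rho, s, a, r, s_next, alpha=0.1, beta=0.01):
    # R-learning: subtract the average-reward estimate rho instead of discounting.
    Q[s, a] += alpha * (r - rho + max(Q[s_next, b] for b in (0, 1)) - Q[s, a])
    # rho is estimated independently of the relative values Q -- the "key
    # similarity" the overview identifies across the asynchronous algorithms.
    # (In Schwartz's formulation this step is typically applied only on
    # greedy actions; that detail is omitted here.)
    rho += beta * (r + max(Q[s_next, b] for b in (0, 1))
                   - max(Q[s, b] for b in (0, 1)) - rho)
    return rho
```

Both rules touch only one table entry per sample; the difference is that R-learning carries a second, scalar learning process for ρ alongside the relative values.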
The Asymptotic Convergence-Rate of Q-learning
, 1998
"... In this paper we show that for discounted MDPs with discount factor γ > 1/2 the asymptotic rate of convergence of Q-learning is O(1/t^(R(1−γ))) if R(1−γ) < 1/2 and O(√(log log t / t)) otherwise, provided that the state-action pairs are sampled from a fixed probability distribution. He ..."
Abstract

Cited by 15 (3 self)
In this paper we show that for discounted MDPs with discount factor γ > 1/2 the asymptotic rate of convergence of Q-learning is O(1/t^(R(1−γ))) if R(1−γ) < 1/2 and O(√(log log t / t)) otherwise, provided that the state-action pairs are sampled from a fixed probability distribution. Here R = p_min/p_max is the ratio of the minimum and maximum state-action occupation frequencies. The results extend to convergent online learning provided that p_min > 0, where p_min and p_max now become the minimum and maximum state-action occupation frequencies corresponding to the stationary distribution. 1 INTRODUCTION Q-learning is a popular reinforcement learning (RL) algorithm whose convergence is well demonstrated in the literature (Jaakkola et al., 1994; Tsitsiklis, 1994; Littman and Szepesvári, 1996; Szepesvári and Littman, 1996). Our aim in this paper is to provide an upper bound for the convergence rate of (lookup-table based) Q-learning algorithms. Although this upper bound i...
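The rate statement above can be read mechanically: compute R from the sampling distribution, then check which side of 1/2 the exponent R(1−γ) falls on. A small illustrative helper, with hypothetical numeric inputs not taken from the paper:

```python
def predicted_rate(p_min, p_max, gamma):
    """Illustrative reading of the abstract's claim: the asymptotic
    Q-learning rate is O(1/t^(R(1-gamma))) when R(1-gamma) < 1/2,
    and O(sqrt(log log t / t)) otherwise."""
    R = p_min / p_max            # ratio of min and max state-action
                                 # occupation frequencies
    exponent = R * (1.0 - gamma)
    if exponent < 0.5:
        return f"O(1/t^{exponent:.3f})"
    return "O(sqrt(log log t / t))"
```

For example, uniform sampling (R = 1) with γ = 0.9 gives the slow rate O(1/t^0.100). Note that under the abstract's own condition γ > 1/2, and with R ≤ 1 by definition, the exponent R(1−γ) is always below 1/2, so the polynomial regime is the relevant one there.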
The Asymptotic Convergence-Rate of Q-learning
"... szepes@math.u-szeged.hu In this paper we show that for discounted MDPs with discount factor γ > 1/2 the asymptotic rate of convergence of Q-learning is O(1/t^(R(1−γ))) if R(1−γ) < 1/2 and O(√(log log t / t)) otherwise, provided that the state-action pairs are sampled from a fixed probability distri ..."
Abstract
szepes@math.u-szeged.hu In this paper we show that for discounted MDPs with discount factor γ > 1/2 the asymptotic rate of convergence of Q-learning is O(1/t^(R(1−γ))) if R(1−γ) < 1/2 and O(√(log log t / t)) otherwise, provided that the state-action pairs are sampled from a fixed probability distribution. Here R = p_min/p_max is the ratio of the minimum and maximum state-action occupation frequencies. The results extend to convergent online learning provided that p_min > 0, where p_min and p_max now become the minimum and maximum state-action occupation frequencies corresponding to the stationary distribution.