Results 1–10 of 25
Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
, 1996
Abstract

Cited by 104 (12 self)
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms is described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies and can fall into suboptimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
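The R-learning update studied in this paper can be sketched as follows. This is a minimal illustration, not the paper's testbed: the two-state MDP, the step sizes, and the `step`/`r_learning` names are all assumptions made for the example.

```python
import random

def step(state, action):
    """Toy deterministic MDP (an illustrative assumption, not the paper's
    grid-world domain): the action chooses the next state, and entering
    state 1 yields reward 1 while entering state 0 yields 0."""
    return action, (1.0 if action == 1 else 0.0)

def r_learning(steps=5000, alpha=0.1, beta=0.01, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0], [0.0, 0.0]]  # relative action values Q[s][a]
    rho = 0.0                     # separate estimate of the average reward
    s = 0
    for _ in range(steps):
        greedy = 0 if Q[s][0] >= Q[s][1] else 1
        a = rng.randrange(2) if rng.random() < epsilon else greedy
        s2, r = step(s, a)
        # R-learning update: the relative values and the average reward
        # are estimated independently -- the property the survey
        # identifies as crucial for convergence.
        Q[s][a] += alpha * (r - rho + max(Q[s2]) - Q[s][a])
        if a == greedy:  # adjust rho only on non-exploratory actions
            rho += beta * (r - rho + max(Q[s2]) - max(Q[s]))
        s = s2
    return Q, rho
```

On this toy problem the greedy policy settles on the higher-average-reward cycle (always move to state 1), with rho approaching the optimal average reward of 1.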
An analysis of stochastic shortest path problems
 Mathematics of Operations Research
, 1991
The NSF Workshop on Reinforcement Learning: Summary and Observations
 AI Magazine
, 1996
Abstract

Cited by 10 (3 self)
Reinforcement learning (RL) has become one of the most actively studied learning ...
Q-Learning and Policy Iteration Algorithms for Stochastic Shortest Path Problems
 Lab. for Information and Decision Systems Report LIDS-P-2871, MIT
, 2011
Abstract

Cited by 9 (8 self)
We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in [BY10b]. The main difference from the standard policy iteration approach is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm solves an optimal stopping problem inexactly with a finite number of value iterations. The main advantage over the standard Q-learning approach is lower overhead: most iterations do not require a minimization over all controls, in the spirit of modified policy iteration. We prove the convergence of asynchronous deterministic and stochastic lookup table implementations of our method for undiscounted, total cost stochastic shortest path problems. These implementations overcome some of the traditional convergence difficulties of asynchronous modified policy iteration, and provide policy-iteration-like alternative Q-learning schemes with as reliable convergence as classical Q-learning. We also discuss methods that use basis function approximations of Q-factors, and we give an associated error bound.
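For contrast with the report's policy-iteration-based schemes, here is a minimal sketch of the classical Q-factor value iteration baseline for an SSP. The 3-state instance, the action names, and `q_value_iteration` are illustrative assumptions, not taken from the report.

```python
# SSP model (an assumed toy instance): states 0..2 plus an absorbing,
# zero-cost termination state "T". Each entry model[s][a] is a
# deterministic (cost, next_state) pair.
model = {
    0: {"right": (1.0, 1), "jump": (5.0, "T")},
    1: {"right": (1.0, 2), "back": (1.0, 0)},
    2: {"exit": (1.0, "T"), "back": (1.0, 1)},
}

def q_value_iteration(model, sweeps=100):
    """Classical Q-factor value iteration for the undiscounted
    total-cost SSP: every iteration minimizes over all controls,
    the overhead the report's methods aim to reduce."""
    Q = {s: {a: 0.0 for a in acts} for s, acts in model.items()}
    for _ in range(sweeps):
        for s, acts in model.items():
            for a, (cost, s2) in acts.items():
                # Bellman update; the termination state contributes
                # zero future cost.
                future = 0.0 if s2 == "T" else min(Q[s2].values())
                Q[s][a] = cost + future
    return Q
```

On this instance the optimal costs-to-go are 3, 2, and 1 for states 0, 1, and 2 (walk right and exit), which the iteration recovers after a few sweeps.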
The Simplex and Policy-Iteration Methods are Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate
, 2010
Abstract

Cited by 8 (0 self)
We prove that the classic policy-iteration method (Howard 1960), including the Simplex method (Dantzig 1947) with the most-negative-reduced-cost pivoting rule, is a strongly polynomial-time algorithm for solving the Markov decision problem (MDP) with a fixed discount rate. Furthermore, the computational complexity of the policy-iteration method (including the Simplex method) is superior to that of the only known strongly polynomial-time interior-point algorithm ([28] 2005) for solving this problem. The result is surprising, since the Simplex method with the same pivoting rule was shown to be exponential for solving a general linear programming (LP) problem, the Simplex (or simple policy-iteration) method with the smallest-index pivoting rule was shown to be exponential for solving an MDP regardless of discount rates, and the policy-iteration method was recently shown to be exponential for solving an undiscounted MDP. We also extend the result to solving MDPs with substochastic and transient state transition probability matrices.
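The policy-iteration method analyzed above can be sketched on a tiny discounted MDP. The two-state instance, the action names, and the use of fixed-point iteration in place of an exact linear solve for policy evaluation are all simplifying assumptions for this example.

```python
GAMMA = 0.9
# Assumed toy MDP: P[s][a] = (reward, next_state), deterministic
# transitions for clarity.
P = {
    0: {"stay": (0.0, 0), "go": (0.0, 1)},
    1: {"stay": (1.0, 1), "go": (0.0, 0)},
}

def evaluate(policy, iters=2000):
    """Policy evaluation: Howard's method solves a linear system
    exactly; here we approximate its fixed point iteratively to keep
    the sketch stdlib-only."""
    V = {s: 0.0 for s in P}
    for _ in range(iters):
        V = {s: P[s][policy[s]][0] + GAMMA * V[P[s][policy[s]][1]]
             for s in P}
    return V

def policy_iteration():
    """Howard (1960): alternate evaluation and greedy improvement
    until the policy is stable."""
    policy = {s: "stay" for s in P}
    while True:
        V = evaluate(policy)
        improved = {
            s: max(P[s], key=lambda a: P[s][a][0] + GAMMA * V[P[s][a][1]])
            for s in P
        }
        if improved == policy:
            return policy, V
        policy = improved
```

Here the method stabilizes after one improvement step: stay in the rewarding state 1 (value 1/(1 - 0.9) = 10) and move there from state 0 (value 0.9 * 10 = 9), consistent with the small, discount-dependent iteration counts the paper's bound describes.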
An Average-Reward Reinforcement Learning Algorithm for Computing Bias-Optimal Policies
 In Proceedings of the Thirteenth AAAI
, 1996
Abstract

Cited by 5 (3 self)
Recently, there has been growing interest in average-reward reinforcement learning ...
Sensitive Discount Optimality: Unifying Discounted and Average Reward Reinforcement Learning
 In L. Saitta (Ed.), Machine Learning: Proc. of the Thirteenth Int. Conf.
, 1996
Abstract

Cited by 4 (0 self)
Research in reinforcement learning (RL) has thus far concentrated on two optimality ...
A Mixed Value and Policy Iteration Method for Stochastic Control with Universally Measurable Policies," Lab. for Info. and Decision Systems Report LIDS-P-2905
, 2013
Abstract

Cited by 3 (3 self)
We consider the stochastic control model with Borel spaces and universally measurable policies. For this model the standard policy iteration is known to have difficult measurability issues and cannot be carried out in general. We present a mixed value and policy iteration method that circumvents this difficulty. The method allows the use of stationary policies in computing the optimal cost function, in a manner that resembles policy iteration. It can also be used to address similar difficulties of policy iteration in the context of upper and lower semicontinuous models. We analyze the convergence of the method in infinite horizon total cost problems, for the discounted case where the one-stage costs are bounded, and for the undiscounted case where the one-stage costs are nonpositive or nonnegative. For the undiscounted total cost problems with nonnegative one-stage costs, we also give a new convergence theorem for value iteration, which shows that value iteration converges whenever it is initialized with a function that is above the optimal cost function and yet bounded by a multiple of the optimal cost function. This condition resembles Whittle's bridging condition and is partly motivated by it. The theorem is also partly motivated by a result of Maitra ...
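The value iteration initialization condition described in the abstract can be written compactly. The symbols here ($J^*$ for the optimal cost function, $T$ for the Bellman operator, $J_0$ for the initial function, $c$ for the bounding multiple) are our notation for this sketch, not necessarily the report's:

```latex
% Sketch of the stated condition (nonnegative one-stage costs):
% if the initial function J_0 dominates J^* yet is bounded by a
% multiple of it, value iteration converges to J^*.
J^* \le J_0 \le c\, J^* \ \text{for some } c \ge 1
\quad \Longrightarrow \quad
\lim_{k \to \infty} (T^k J_0)(x) = J^*(x) \quad \text{for all states } x.
```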
Recursive Stochastic Games with Positive Rewards
Abstract

Cited by 3 (0 self)
We study the complexity of a class of Markov decision processes and, more generally, stochastic games, called 1-exit Recursive Markov Decision Processes (1-RMDPs) and Simple Stochastic Games (1-RSSGs) with strictly positive rewards. These are a class of finitely presented countable-state zero-sum stochastic games, with total expected reward objective. They subsume standard finite-state MDPs and Condon's simple stochastic games, and correspond to optimization and game versions of several classic stochastic models, with rewards. Such stochastic models arise naturally as models of probabilistic procedural programs with recursion, and the problems we address are motivated by the goal of analyzing the optimal/pessimal expected running time in such a setting. We give polynomial-time algorithms for 1-exit Recursive Markov Decision Processes (1-RMDPs) with positive rewards. Specifically, we show that the exact optimal value of both maximizing and minimizing 1-RMDPs with positive rewards can be computed in polynomial time (this value may be ∞). For two-player 1-RSSGs with positive rewards, we prove a "stackless and memoryless" determinacy result, and show that deciding whether the game value is at least a given value r is in NP ∩ coNP. We also prove that a simultaneous strategy improvement algorithm converges to the value and optimal strategies for these stochastic games. We observe that 1-RSSG positive reward games are "harder" than finite-state SSGs in several senses.