Results 1–10 of 24
Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
, 1996
Abstract

Cited by 99 (12 self)
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms is described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies, and can fall into suboptimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
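The R-learning method studied in this paper maintains relative action values together with a separate running estimate of the average reward, which is the "independent estimation" property the overview highlights. A minimal tabular sketch (the two-state cyclic task, step sizes, and epsilon-greedy schedule are illustrative assumptions, not the paper's testbeds):

```python
import random

def r_learning(step_fn, states, actions, n_steps=20000,
               alpha=0.1, beta=0.01, epsilon=0.1, seed=0):
    """Tabular R-learning: learn relative action values Q together with an
    estimate rho of the average reward; no discount factor is used."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in states for a in actions}
    rho = 0.0
    s = states[0]
    for _ in range(n_steps):
        greedy = max(actions, key=lambda b: Q[(s, b)])
        # epsilon-greedy exploration; the paper reports that R-learning
        # is quite sensitive to this choice
        a = greedy if rng.random() >= epsilon else rng.choice(actions)
        s2, r = step_fn(s, a, rng)
        best_next = max(Q[(s2, b)] for b in actions)
        if a == greedy:
            # average reward is estimated independently of the relative
            # values, and only from greedy steps
            rho += beta * (r - rho + best_next - Q[(s, greedy)])
        # relative-value update: rewards are measured against rho
        Q[(s, a)] += alpha * (r - rho + best_next - Q[(s, a)])
        s = s2
    return Q, rho

# Hypothetical two-state cyclic task: action 0 earns reward 1, action 1 earns 0.
def step(s, a, rng):
    return (s + 1) % 2, (1.0 if a == 0 else 0.0)

Q, rho = r_learning(step, states=[0, 1], actions=[0, 1])
print(round(rho, 2))  # estimate of the optimal average reward, close to 1
```

On this cyclic task the learned rho approaches the optimal average reward of 1, and the greedy policy selects the rewarding action in both states.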
An analysis of stochastic shortest path problems
 Mathematics of Operations Research
, 1991
The NSF Workshop on Reinforcement Learning: Summary and Observations
 AI Magazine
, 1996
Abstract

Cited by 10 (3 self)
Reinforcement learning (RL) has become one of the most actively studied learning
The Simplex and Policy-Iteration Methods are Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate
, 2010
Abstract

Cited by 8 (0 self)
We prove that the classic policy-iteration method (Howard 1960), including the Simplex method (Dantzig 1947) with the most-negative-reduced-cost pivoting rule, is a strongly polynomial-time algorithm for solving the Markov decision problem (MDP) with a fixed discount rate. Furthermore, the computational complexity of the policy-iteration method (including the Simplex method) is superior to that of the only known strongly polynomial-time interior-point algorithm ([28] 2005) for solving this problem. The result is surprising since the Simplex method with the same pivoting rule was shown to be exponential for solving a general linear programming (LP) problem, the Simplex (or simple policy-iteration) method with the smallest-index pivoting rule was shown to be exponential for solving an MDP regardless of discount rates, and the policy-iteration method was recently shown to be exponential for solving an undiscounted MDP. We also extend the result to solving MDPs with substochastic and transient state transition probability matrices.
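Howard's policy-iteration method analyzed here alternates exact policy evaluation (a linear solve) with greedy improvement, terminating when the policy stops changing. A minimal dense-matrix sketch for the discounted case (the two-state MDP and its numbers are made up for illustration):

```python
import numpy as np

def policy_iteration(P, r, gamma=0.9):
    """Howard's policy iteration for a finite discounted MDP.
    P[a] is the |S|x|S| transition matrix of action a; r[a] is its reward vector."""
    n_actions, n_states = len(P), P[0].shape[0]
    pi = np.zeros(n_states, dtype=int)           # start from an arbitrary policy
    while True:
        # policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly
        P_pi = np.array([P[pi[s]][s] for s in range(n_states)])
        r_pi = np.array([r[pi[s]][s] for s in range(n_states)])
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # policy improvement: act greedily with respect to v
        q = np.array([r[a] + gamma * P[a] @ v for a in range(n_actions)])
        new_pi = q.argmax(axis=0)
        if np.array_equal(new_pi, pi):
            return pi, v                         # greedy policy is optimal
        pi = new_pi

# Tiny 2-state, 2-action example (illustrative numbers only)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.5, 2.0]])
pi, v = policy_iteration(P, r)
```

The returned value vector satisfies the Bellman optimality equation, and the number of improvement rounds is what the paper bounds polynomially for fixed gamma.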
Q-Learning and Policy Iteration Algorithms for Stochastic Shortest Path Problems
 Lab. for Information and Decision Systems Report LIDS-P-2871, MIT
, 2011
Abstract

Cited by 7 (6 self)
We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in [BY10b]. The main difference from the standard policy iteration approach is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm solves an optimal stopping problem inexactly with a finite number of value iterations. The main advantage over the standard Q-learning approach is lower overhead: most iterations do not require a minimization over all controls, in the spirit of modified policy iteration. We prove the convergence of asynchronous deterministic and stochastic lookup table implementations of our method for undiscounted, total cost stochastic shortest path problems. These implementations overcome some of the traditional convergence difficulties of asynchronous modified policy iteration, and provide policy-iteration-like alternative Q-learning schemes with convergence as reliable as that of classical Q-learning. We also discuss methods that use basis function approximations of Q-factors and we give an associated error bound.
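A rough sense of the overhead saving can be given by a synchronous Q-factor sketch: evaluate the current greedy policy with several cheap sweeps that involve no minimization over controls, then recompute the greedy policy. This is a simplified modified-policy-iteration illustration under assumed data, not the authors' asynchronous optimal-stopping algorithm:

```python
import numpy as np

def ssp_modified_policy_iteration(P, c, m=10, outer=50):
    """Synchronous Q-factor modified policy iteration for a stochastic
    shortest path problem.  P[a] is substochastic (missing probability
    mass goes to the cost-free termination state); c[a] are expected
    one-stage costs."""
    n_actions, n_states = c.shape
    Q = np.zeros((n_actions, n_states))
    for _ in range(outer):
        mu = Q.argmin(axis=0)                    # greedy policy from Q
        for _ in range(m):
            # evaluation sweeps: the policy mu is fixed, so no
            # minimization over all controls is needed here
            v_mu = Q[mu, np.arange(n_states)]
            Q = c + np.einsum('aij,j->ai', P, v_mu)
    return Q, Q.argmin(axis=0)

# Deterministic 3-state chain toward termination (illustrative numbers):
# action 0 moves one step toward termination at cost 1; action 1 stays at cost 2.
P = np.zeros((2, 3, 3))
P[0, 0, 1] = P[0, 1, 2] = 1.0     # from state 2, action 0 terminates
P[1] = np.eye(3)
c = np.array([[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]])
Q, mu = ssp_modified_policy_iteration(P, c)
print(Q[0])   # -> [3. 2. 1.]
```

On this chain the sketch recovers the optimal costs-to-go 3, 2, 1 and the proper policy that always moves toward termination.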
An Average-Reward Reinforcement Learning Algorithm for Computing Bias-Optimal Policies
 In Proceedings of the Thirteenth AAAI
, 1996
Abstract

Cited by 5 (3 self)
Recently, there has been growing interest in average-reward reinforcement learning
Sensitive Discount Optimality: Unifying Discounted and Average Reward Reinforcement Learning
 In L.Saitta (Ed.), Machine Learning: Proc. of the Thirteenth Int. Conf
, 1996
Abstract

Cited by 4 (0 self)
Research in reinforcement learning (RL) has thus far concentrated on two optimality
A Mixed Value and Policy Iteration Method for Stochastic Control with Universally Measurable Policies
 Lab. for Info. and Decision Systems Report LIDS-P-2905
, 2013
Abstract

Cited by 3 (3 self)
We consider the stochastic control model with Borel spaces and universally measurable policies. For this model the standard policy iteration is known to have difficult measurability issues and cannot be carried out in general. We present a mixed value and policy iteration method that circumvents this difficulty. The method allows the use of stationary policies in computing the optimal cost function, in a manner that resembles policy iteration. It can also be used to address similar difficulties of policy iteration in the context of upper and lower semicontinuous models. We analyze the convergence of the method in infinite horizon total cost problems, for the discounted case where the one-stage costs are bounded, and for the undiscounted case where the one-stage costs are nonpositive or nonnegative. For the undiscounted total cost problems with nonnegative one-stage costs, we also give a new convergence theorem for value iteration, which shows that value iteration converges whenever it is initialized with a function that is above the optimal cost function and yet bounded by a multiple of the optimal cost function. This condition resembles Whittle’s bridging condition and is partly motivated by it. The theorem is also partly motivated by a result of Maitra
MODELING SHORTEST PATH GAMES WITH PETRI NETS: A LYAPUNOV BASED THEORY
Abstract

Cited by 2 (1 self)
In this paper we introduce a new modeling paradigm for representing shortest path games with Petri nets. Whereas previous works have restricted attention to tracking the net using Bellman’s equation as a utility function, this work uses a Lyapunov-like function. In this sense, we replace the traditional cost function with a trajectory-tracking function which is also an optimal cost-to-target function. This makes a significant difference in the conceptualization of the problem domain, allowing the replacement of the Nash equilibrium point by the Lyapunov equilibrium point in game theory. We show that the Lyapunov equilibrium point coincides with the Nash equilibrium point. As a consequence, all properties of equilibrium and stability are preserved in game theory. This is the most important contribution of this work. The potential of this approach lies in the formal simplicity of its proof of the existence of an equilibrium point.