Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
, 1996
Abstract
Cited by 99 (12 self)
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies, and can fall into suboptimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
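The "independent estimation of the average reward and the relative values" that this abstract highlights is the heart of R-learning (Schwartz, 1993). Below is a minimal tabular sketch; the update rule is the standard one, but the environment interface `step(s, a) -> (next_state, reward)` and the parameter names are my own assumptions, not from the paper:

```python
import random
from collections import defaultdict

def r_learning(step, actions, start, steps=20000,
               alpha=0.1, beta=0.01, eps=0.1, seed=0):
    """Tabular R-learning: relative action values R(s, a) plus a
    separately maintained average-reward estimate rho."""
    rng = random.Random(seed)
    R = defaultdict(float)   # relative action values
    rho = 0.0                # running estimate of the average reward
    s = start
    for _ in range(steps):
        # epsilon-greedy; the abstract notes R-learning's sensitivity
        # to the exploration strategy
        if rng.random() < eps:
            a = rng.choice(actions)
        else:
            a = max(actions, key=lambda b: R[(s, b)])
        s2, r = step(s, a)
        best_cur = max(R[(s, b)] for b in actions)
        best_next = max(R[(s2, b)] for b in actions)
        greedy = R[(s, a)] == best_cur
        R[(s, a)] += alpha * (r - rho + best_next - R[(s, a)])
        # rho is adjusted only on greedy steps, independently of R
        if greedy:
            rho += beta * (r - rho + best_next - best_cur)
        s = s2
    return R, rho
```

On a deterministic two-state cycle that pays reward 2 every second step, `rho` converges to the true average reward of 1, and the relative values settle so that R at the pre-reward state exceeds R at the other state by 1.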
An analysis of stochastic shortest path problems
 Mathematics of Operations Research
, 1991
The NSF Workshop on Reinforcement Learning: Summary and Observations
 AI Magazine
, 1996
Abstract
Cited by 10 (3 self)
Reinforcement learning (RL) has become one of the most actively studied learning ...
The Simplex and Policy-Iteration Methods are Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate
, 2010
Abstract
Cited by 8 (0 self)
We prove that the classic policy-iteration method (Howard 1960), including the Simplex method (Dantzig 1947) with the most-negative-reduced-cost pivoting rule, is a strongly polynomial-time algorithm for solving the Markov decision problem (MDP) with a fixed discount rate. Furthermore, the computational complexity of the policy-iteration method (including the Simplex method) is superior to that of the only known strongly polynomial-time interior-point algorithm ([28] 2005) for solving this problem. The result is surprising since the Simplex method with the same pivoting rule was shown to be exponential for solving a general linear programming (LP) problem, the Simplex (or simple policy-iteration) method with the smallest-index pivoting rule was shown to be exponential for solving an MDP regardless of discount rates, and the policy-iteration method was recently shown to be exponential for solving an undiscounted MDP. We also extend the result to solving MDPs with substochastic and transient state transition probability matrices.
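For context, the policy-iteration loop this abstract analyzes can be sketched in a few lines. This is a generic textbook version for a tabular discounted MDP, not the paper's complexity machinery; the array-based encoding (`P[a]` as the transition matrix of action `a`, `r[a]` as its reward vector) is my own choice:

```python
import numpy as np

def policy_iteration(P, r, gamma=0.9):
    """Howard's (1960) policy iteration for a discounted MDP.

    Policy evaluation solves (I - gamma * P_pi) v = r_pi exactly;
    policy improvement is a greedy one-step lookahead."""
    nA, nS = len(P), len(r[0])
    pi = np.zeros(nS, dtype=int)           # start from an arbitrary policy
    while True:
        # policy evaluation: exact linear solve for the current policy
        P_pi = np.array([P[pi[s]][s] for s in range(nS)])
        r_pi = np.array([r[pi[s]][s] for s in range(nS)])
        v = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
        # policy improvement: greedy with respect to v
        Q = np.array([r[a] + gamma * P[a] @ v for a in range(nA)])
        new_pi = Q.argmax(axis=0)
        if np.array_equal(new_pi, pi):
            return pi, v                   # no change: policy is optimal
        pi = new_pi
```

Each improvement step strictly increases the value of at least one state, so the loop terminates after finitely many policies; the paper's contribution is bounding that number polynomially when the discount rate is fixed.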
Q-Learning and Policy Iteration Algorithms for Stochastic Shortest Path Problems
 Lab. for Information and Decision Systems Report LIDS-P-2871, MIT
, 2011
Abstract
Cited by 7 (6 self)
We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in [BY10b]. The main difference from the standard policy iteration approach is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm solves an optimal stopping problem inexactly with a finite number of value iterations. The main advantage over the standard Q-learning approach is lower overhead: most iterations do not require a minimization over all controls, in the spirit of modified policy iteration. We prove the convergence of asynchronous deterministic and stochastic lookup table implementations of our method for undiscounted, total cost stochastic shortest path problems. These implementations overcome some of the traditional convergence difficulties of asynchronous modified policy iteration, and provide policy-iteration-like alternative Q-learning schemes with as reliable convergence as classical Q-learning. We also discuss methods that use basis function approximations of Q-factors and we give an associated error bound.
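As a point of comparison, the classical Q-learning baseline for stochastic shortest path problems (the approach whose per-iteration minimization overhead the authors reduce) can be sketched as follows. This is the standard algorithm, not the paper's policy-iteration hybrid; the environment interface `step(s, a) -> (next_state, cost)` is a hypothetical minimal one:

```python
import random
from collections import defaultdict

def q_learning_ssp(step, actions, start, terminal,
                   episodes=2000, alpha=0.1, eps=0.2, seed=0):
    """Classical Q-learning for an undiscounted total-cost SSP.
    Q-values at the termination state are pinned at zero, and every
    backup minimizes over all controls -- the per-iteration cost the
    paper's modified-policy-iteration flavor avoids most of the time."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        s = start
        while s != terminal:
            if rng.random() < eps:
                a = rng.choice(actions)                    # explore
            else:
                a = min(actions, key=lambda b: Q[(s, b)])  # greedy (min cost)
            s2, c = step(s, a)
            target = 0.0 if s2 == terminal else min(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (c + target - Q[(s, a)])
            s = s2
    return Q
```

On a two-step chain to the goal (cost 1 per step) with an alternative direct jump of cost 3, the learned Q-values converge to the true costs-to-go: 2 for walking from the start, 3 for jumping.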
An Average-Reward Reinforcement Learning Algorithm for Computing Bias-Optimal Policies
 In Proceedings of the Thirteenth AAAI
, 1996
Abstract
Cited by 4 (3 self)
Recently, there has been growing interest in average-reward reinforcement learning ...
Recursive Stochastic Games with Positive Rewards
Abstract
Cited by 3 (0 self)
We study the complexity of a class of Markov decision processes and, more generally, stochastic games, called 1-exit Recursive Markov Decision Processes (1-RMDPs) and Simple Stochastic Games (1-RSSGs) with strictly positive rewards. These are a class of finitely presented countable-state zero-sum stochastic games, with total expected reward objective. They subsume standard finite-state MDPs and Condon’s simple stochastic games and correspond to optimization and game versions of several classic stochastic models, with rewards. Such stochastic models arise naturally as models of probabilistic procedural programs with recursion, and the problems we address are motivated by the goal of analyzing the optimal/pessimal expected running time in such a setting. We give polynomial time algorithms for 1-exit Recursive Markov decision processes (1-RMDPs) with positive rewards. Specifically, we show that the exact optimal value of both maximizing and minimizing 1-RMDPs with positive rewards can be computed in polynomial time (this value may be ∞). For two-player 1-RSSGs with positive rewards, we prove a “stackless and memoryless” determinacy result, and show that deciding whether the game value is at least a given value r is in NP ∩ coNP. We also prove that a simultaneous strategy improvement algorithm converges to the value and optimal strategies for these stochastic games. We observe that 1-RSSG positive reward games are “harder” than finite-state SSGs in several senses.
Optimality Criteria in Reinforcement Learning
 In AAAI Fall Symposium on Learning Complex Behaviors for Intelligent Adaptive Systems
, 1996
Abstract
Cited by 2 (0 self)
Embedded autonomous agents, such as robots or softbots, ...
An asymptotic simplex method for singularly perturbed linear programs
, 1998
Abstract
Cited by 2 (2 self)
We study singularly perturbed linear programs. These are linear programs whose constraints and objective coefficients depend on a small perturbation parameter, and furthermore the constraints become linearly dependent when the perturbation parameter goes to zero. Problems like that were studied by Jeroslow in the 1970s. He proposed a simplex-like method, which works over the field of rational functions. Here we develop an alternative asymptotic simplex method based on Laurent series expansions. This approach appears to be more computationally efficient. In addition, we point out several possible generalizations of our method and provide new simple updating formulae for the perturbed solution. Key words: asymptotic simplex method, singular perturbations, Laurent series. AMS subject classifications: 90C05, 90C31, 41A58. Abbreviated title: Asymptotic simplex method.