Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
, 1996
Cited by 80 (12 self)
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies, and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
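As a rough illustration of the idea highlighted in this abstract, namely estimating the average reward and the relative action values with separate updates, the following is a minimal sketch of an R-learning-style update loop in Python. The two-state task in `step`, the learning rates `beta` and `alpha`, the exploration rate `epsilon`, and all function names are assumptions chosen for illustration; none of them come from the paper, and the update shown is one common formulation of R-learning, not necessarily the exact variant the authors study.

```python
import random


def step(state, action):
    """Hypothetical deterministic two-state cyclic task (an assumption, not from the paper).

    Returns (next_state, reward).
    """
    if state == 0:
        return (0, 1.0) if action == 0 else (1, 0.0)
    return (0, 0.0) if action == 0 else (1, 2.0)


def r_learning(n_steps=100_000, beta=0.1, alpha=0.05, epsilon=0.1, seed=0):
    """Sketch of R-learning with epsilon-greedy exploration.

    R[s][a] estimates relative (average-adjusted) action values;
    rho estimates the average reward per step. The two are updated
    by separate rules with separate step sizes.
    """
    rng = random.Random(seed)
    R = [[0.0, 0.0], [0.0, 0.0]]
    rho = 0.0
    state = 0
    for _ in range(n_steps):
        # Greedy action under the current estimates (ties go to action 0).
        greedy = max((0, 1), key=lambda a: R[state][a])
        action = rng.randrange(2) if rng.random() < epsilon else greedy
        next_state, reward = step(state, action)

        # Values are read before the update so both rules use the same estimates.
        best_next = max(R[next_state])
        best_here = max(R[state])

        # Relative-value update: the reward is measured against rho.
        R[state][action] += beta * (reward - rho + best_next - R[state][action])

        # Average-reward update, applied only when the greedy action was taken.
        if action == greedy:
            rho += alpha * (reward + best_next - best_here - rho)

        state = next_state
    return R, rho


if __name__ == "__main__":
    R, rho = r_learning()
    print("estimated average reward:", rho)
    print("relative action values:", R)
```

In this toy task the gain-optimal behaviour is to move to state 1 and stay there, which earns an average reward of 2 per step, so `rho` would be expected to approach 2 given enough exploration. Lowering `epsilon` makes it easier to observe the sensitivity to exploration that the abstract reports.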
An analysis of stochastic shortest path problems
- Mathematics of Operations Research
, 1991
The NSF Workshop on Reinforcement Learning: Summary and Observations
- AI Magazine
, 1996
Cited by 10 (3 self)
Reinforcement learning (RL) has become one of the most actively studied learning
Analytic Perturbation Theory and its Applications
, 1999
Cited by 7 (6 self)
this paper, we are mainly concerned with the characterization of the fundamental matrix $Z(\varepsilon)$ of the perturbed chain. In the case of a singular perturbation, $Z(\varepsilon)$ also has a discontinuity at $\varepsilon = 0$. Moreover, in this case, $\|Z(\varepsilon)\| \to \infty$ as $\varepsilon \to 0$ and $Z(\varepsilon)$ admits a Laurent series expansion [132, 134, 66, 67] $Z(\varepsilon) = \frac{1}{\varepsilon^{s}} Z_{-s} + \cdots + \frac{1}{\varepsilon} Z_{-1} + Z_{0} + \varepsilon Z_{1} + \cdots$, $\varepsilon \neq 0$ (3.47). Note that $s$, the order of the pole, is finite and $s \le N$. We denote the singular and the regular parts of the fundamental matrix expansion (3.47) by $Z^{S}(\varepsilon)$ and $Z^{R}(\varepsilon)$, respectively. Schweitzer has also obtained formulae for the matrices $Z_{k}$. However, they are rather complicated and their computation requires handling large matrices (cf. [132]). In the sequel, we propose a different approach that allows us to compute the coefficients of the Laurent series (3.47) more efficiently. This approach can be considered a particular realisation of the general scheme proposed in Section 2.4 for the perturbation analysis of group inverses. Our method leads to operations with matrices of small dimension. For example, immediately after the first stage of the reduction process one handles aggregated Markov chains with no more than $m$ states ($m$ being the number of ergodic classes). This would typically constitute a drastic reduction of the dimension. In addition, for the case of a linear perturbation we provide a simple formula for the regular part $Z^{R}(\varepsilon)$, which readily simplifies to the usual formula [131] if the perturbation is regular. We introduce the deviation matrices (or reduced resolvents) $H$ and $H(\varepsilon)$ for the original and perturbed chains: $H \overset{\mathrm{def}}{=} Z - P^{*}$ and $H(\varepsilon) \overset{\mathrm{def}}{=} Z(\varepsilon) - P^{*}(\varepsilon)$. We immediately have $H^{S}(\varepsilon) = Z^{S}(\varepsilon)$ and $H^{R}(\varepsilon) = Z^{R}(\varepsilon) - P^{*}(\varepsilon)$ (3.48). The fundamental matri...
An Average-Reward Reinforcement Learning Algorithm for Computing Bias-Optimal Policies
- In Proceedings of the Thirteenth AAAI
, 1996
Cited by 4 (3 self)
Recently, there has been growing interest in average-reward reinforcement learning
Optimality Criteria in Reinforcement Learning
- In AAAI Fall Symposium on Learning Complex Behaviors for Intelligent Adaptive Systems
, 1996
Cited by 2 (0 self)
Embedded autonomous agents, such as robots or softbots,
An asymptotic simplex method for singularly perturbed linear programs
, 1998
Cited by 2 (2 self)
We study singularly perturbed linear programs. These are linear programs whose constraints and objective coefficients depend on a small perturbation parameter, and whose constraints become linearly dependent when the perturbation parameter goes to zero. Such problems were studied by Jeroslow in the 1970s. He proposed a simplex-like method that works over the field of rational functions. Here we develop an alternative asymptotic simplex method based on Laurent series expansions. This approach appears to be more computationally efficient. In addition, we point out several possible generalizations of our method and provide new simple updating formulae for the perturbed solution. Key words. asymptotic simplex method, singular perturbations, Laurent series. AMS subject classifications. 90C05, 90C31, 41A58. Abbreviated title. Asymptotic simplex method
Sensitive Discount Optimality Via Nested Linear Programs For Ergodic Markov Decision Processes
Cited by 1 (0 self)
In this paper we discuss sensitive discount optimality for Markov decision processes. The n-discount optimality is a refined selective criterion, which is a generalization of average optimality and bias optimality. Our approach is based on a system of nested linear programs. In the last section we provide an algorithm for the computation of the Blackwell optimal policy. The n-discount optimal policies are obtained as a by-product of this algorithm. Here we restrict ourselves to the case of completely ergodic Markov decision processes. Key Words: Markov decision processes, sensitive discount optimality, Blackwell optimality, nested linear programs.
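For orientation, the criterion named in this abstract is usually stated as follows. This is the standard textbook form (due to Veinott), written in notation assumed here and not taken from the abstract: $V_\gamma^{\pi}(s)$ denotes the expected discounted return of policy $\pi$ from state $s$ with discount factor $\gamma$.

```latex
A policy $\pi^{*}$ is $n$-discount optimal if, for every state $s$ and every policy $\pi$,
\[
  \liminf_{\gamma \uparrow 1} \, (1-\gamma)^{-n}
  \left[ V_\gamma^{\pi^{*}}(s) - V_\gamma^{\pi}(s) \right] \;\ge\; 0 .
\]
% n = -1 : gain (average reward) optimality
% n =  0 : bias optimality
% n-discount optimal for all n >= -1 : Blackwell optimality (finite MDPs)
```

Read this way, the special cases in the comments match the abstract's claim that n-discount optimality generalizes both average optimality and bias optimality.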
Controlled Markov processes with arbitrary numerical criteria
- Theory Probab. Appl.
, 1982
Cited by 1 (1 self)
In the theory of controlled Markov processes with discrete time we study, as a rule, controlled processes either with the total reward criterion or with criteria for mean reward per unit time.
Recursive Stochastic Games with Positive Rewards
Cited by 1 (0 self)
We study the complexity of a class of Markov decision processes and, more generally, stochastic games, called 1-exit Recursive Markov Decision Processes (1-RMDPs) and Simple Stochastic Games (1-RSSGs) with strictly positive rewards. These are a class of finitely presented countable-state zero-sum stochastic games, with total expected reward objective. They subsume standard finite-state MDPs and Condon’s simple stochastic games and correspond to optimization and game versions of several classic stochastic models, with rewards. Such stochastic models arise naturally as models of probabilistic procedural programs with recursion, and the problems we address are motivated by the goal of analyzing the optimal/pessimal expected running time in such a setting. We give polynomial time algorithms for 1-exit Recursive Markov Decision Processes (1-RMDPs) with positive rewards. Specifically, we show that the exact optimal value of both maximizing and minimizing 1-RMDPs with positive rewards can be computed in polynomial time (this value may be ∞). For two-player 1-RSSGs with positive rewards, we prove a “stackless and memoryless” determinacy result, and show that deciding whether the game value is at least a given value r is in NP ∩ coNP. We also prove that a simultaneous strategy improvement algorithm converges to the value and optimal strategies for these stochastic games. We observe that 1-RSSG positive reward games are “harder” than finite-state SSGs in several senses.

