Results 1 -
9 of
9
Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
, 1996
"... This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dyna ..."
Abstract
-
Cited by 80 (12 self)
- Add to MetaCart
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies, and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
Reinforcement learning is direct adaptive optimal control
- In Proceedings of the American Control Conference
, 1991
"... optimal controls are estimated directly more attractive. We view reinforcement learning methods as a computationally simple, direct approach to the adaptive optimal control of nonlinear systems. For concreteness, we focus on one reinforcement learning method (Q-learning) and on its analytically prov ..."
Abstract
-
Cited by 39 (4 self)
- Add to MetaCart
optimal controls are estimated directly more attractive. We view reinforcement learning methods as a computationally simple, direct approach to the adaptive optimal control of nonlinear systems. For concreteness, we focus on one reinforcement learning method (Q-learning) and on its analytically proven capabilities for one class of adaptive optimal control problems (markov decision problems with unknown transition probabilities).
Auto-exploratory Average Reward Reinforcement Learning
- Artificial Intelligence
, 1996
"... We introduce a model-based average reward Reinforcement Learning method called H-learning and compare it with its discounted counterpart, Adaptive Real-Time Dynamic Programming, in a simulated robot scheduling task. We also introduce an extension to H-learning, which automatically explores the unexp ..."
Abstract
-
Cited by 28 (8 self)
- Add to MetaCart
We introduce a model-based average reward Reinforcement Learning method called H-learning and compare it with its discounted counterpart, Adaptive Real-Time Dynamic Programming, in a simulated robot scheduling task. We also introduce an extension to H-learning, which automatically explores the unexplored parts of the state space, while always choosing greedy actions with respect to the current value function. We show that this "Auto-exploratory H-learning" performs better than the original H-learning under previously studied exploration methods such as random, recency-based, or counter-based exploration. Introduction Reinforcement Learning (RL) is the study of learning agents that improve their performance at some task by receiving rewards and punishments from the environment. Most approaches to reinforcement learning, including Q-learning (Watkins and Dayan 92) and Adaptive Real-Time Dynamic Programming (ARTDP) (Barto, Bradtke, & Singh 95), optimize the total discounted reward the ...
Scaling Up Average Reward Reinforcement Learning by Approximating the Domain Models and the Value Function
- In Saitta
, 1996
"... Almost all the work in Average-reward Reinforcement Learning (ARL) so far has focused on table-based methods which do not scale to domains with large state spaces. In this paper, we propose two extensions to a model-based ARL method called H-learning to address the scale-up problem. We extend H-lear ..."
Abstract
-
Cited by 25 (1 self)
- Add to MetaCart
Almost all the work in Average-reward Reinforcement Learning (ARL) so far has focused on table-based methods which do not scale to domains with large state spaces. In this paper, we propose two extensions to a model-based ARL method called H-learning to address the scale-up problem. We extend H-learning to learn action models and reward functions in the form of Bayesian networks, and approximate its value function using local linear regression. We test our algorithms on several scheduling tasks for a simulated Automatic Guided Vehicle (AGV) and show that they are effective in significantly reducing the space requirement of H-learning and making it converge faster. To the best of our knowledge, our results are the first in applying function approximation to ARL. 1 Introduction Most Reinforcement Learning (RL) methods optimize the discounted total reward received by an agent (Barto, Bradtke, & Singh, 1995; Watkins & Dayan, 1992). However, in many real-world domains, the natural criterio...
Incremental Dynamic Programming for On-Line Adaptive Optimal Control
, 1994
"... Reinforcement learning algorithms based on the principles of Dynamic Programming (DP) have enjoyed a great deal of recent attention both empirically and theoretically. These algorithms have been referred to generically as Incremental Dynamic Programming (IDP) algorithms. IDP algorithms are intended ..."
Abstract
-
Cited by 20 (2 self)
- Add to MetaCart
Reinforcement learning algorithms based on the principles of Dynamic Programming (DP) have enjoyed a great deal of recent attention both empirically and theoretically. These algorithms have been referred to generically as Incremental Dynamic Programming (IDP) algorithms. IDP algorithms are intended for use in situations where the information or computational resources needed by traditional dynamic programming algorithms are not available. IDP algorithms attempt to find a global solution to a DP problem by incrementally improving local constraint satisfaction properties as experience is gained through interaction with the environment. This class of algorithms is not new, going back at least as far as Samuel's adaptive checkers-playing programs,...
To Discount or not to Discount in Reinforcement Learning: A Case Study Comparing R Learning and Q Learning
- In Proceedings of the Eleventh International Conference on Machine Learning
, 1994
"... Most work in reinforcement learning (RL) ..."
H-learning: A Reinforcement Learning Method to Optimize Undiscounted Average Reward
, 1994
"... In this paper, we introduce a model-based reinforcement learning method called H-learning, which optimizes undiscounted average reward. We compare it with three other reinforcement learning methods in the domain of scheduling Automatic Guided Vehicles, transportation robots used in modern manufactu ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
In this paper, we introduce a model-based reinforcement learning method called H-learning, which optimizes undiscounted average reward. We compare it with three other reinforcement learning methods in the domain of scheduling Automatic Guided Vehicles, transportation robots used in modern manufacturing plants and facilities. The four methods differ along two dimensions. They are either model-based or model-free, and optimize discounted total reward or undiscounted average reward. Our experimental results indicate that H-learning is more robust with respect to changes in the domain parameters, and in many cases, converges in fewer steps to better average reward per time step than all the other methods. An added advantage is that unlike the other methods it does not have any parameters to tune. 1 Introduction Automatic Guided Vehicles are robots which are used for routine transportation tasks in modern manufacturing plants, hospitals and office buildings [Minoura et al. 1993]. These r...
An Average-Reward Reinforcement Learning Algorithm for Computing Bias-Optimal Policies
- In Proceedings of the Thirteenth AAAI
, 1996
"... Recently, there has been growing interest in average-reward reinforcement learning ..."
Abstract
-
Cited by 4 (3 self)
- Add to MetaCart
Recently, there has been growing interest in average-reward reinforcement learning

