Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
1996
Cited by 80 (12 self)
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies, and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
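The abstract above singles out independent estimation of the average reward and of the relative values as the property shared by the convergent asynchronous algorithms. A minimal sketch of one common formulation of the R-learning update illustrates that separation. It is not taken from the paper: the function name `r_learning`, the environment callback `step`, and the parameters `beta`, `alpha`, and `epsilon` are all assumptions chosen for illustration.

```python
import random


def r_learning(step, n_states, n_actions, n_steps=20000,
               beta=0.1, alpha=0.01, epsilon=0.1, seed=0):
    """Tabular R-learning sketch (one common formulation, assumed here).

    step(s, a, rng) is a hypothetical environment callback returning
    (next_state, reward). R holds relative action values; rho is the
    separate running estimate of the average reward.
    """
    rng = random.Random(seed)
    R = [[0.0] * n_actions for _ in range(n_states)]
    rho = 0.0
    s = 0
    for _ in range(n_steps):
        # Epsilon-greedy action selection.
        explore = rng.random() < epsilon
        if explore:
            a = rng.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda act: R[s][act])
        s_next, r = step(s, a, rng)
        best_here = max(R[s])
        best_next = max(R[s_next])
        # Relative-value update: reward is measured against rho.
        R[s][a] += beta * (r - rho + best_next - R[s][a])
        # Average-reward update, applied only on non-exploratory steps.
        if not explore:
            rho += alpha * (r - rho + best_next - best_here)
        s = s_next
    return R, rho
```

Because `rho` is updated only on greedy steps, exploratory actions do not bias the average-reward estimate directly. This is also one place where the sensitivity to exploration strategy reported in the abstract can enter.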
Solving Semi-Markov Decision Problems using Average Reward Reinforcement Learning
Management Science, 1999
Cited by 17 (4 self)
A large class of problems of sequential decision making under uncertainty, of which the underlying probability structure is a Markov process, can be modeled as stochastic dynamic programs (referred to, in general, as Markov decision problems or MDPs). However, the computational complexity of the classical MDP algorithms, such as value iteration and policy iteration, is prohibitive and can grow intractably with the size of the problem and its related data. Furthermore, these techniques require for each action the one step transition probability and reward matrices, obtaining which is often unrealistic for large and complex systems. Recently, there has been much interest in a simulation-based stochastic approximation framework called reinforcement learning (RL), for computing near optimal policies for MDPs. RL has been successfully applied to very large problems, such as elevator scheduling, and dynamic channel allocation of cellular telephone systems. In this paper, we exten...
Pricing of Dialup Services: an Example of Congestion-Dependent Pricing in the Internet
Proceedings of the 39th IEEE Conference on Decision and Control, 2000
Cited by 8 (0 self)
Recent research on pricing multiclass loss networks [19] has shown that the performance of optimal static pricing approaches that of optimal dynamic (congestion-dependent) pricing in the many small sources limit. In our own work with similar models, we have found it difficult to obtain large gains over static pricing in realistic settings, even when the many small sources assumption is violated. In this paper we give an example which is a stochastic control model for congestion-dependent pricing of Internet services. The model describes a local Internet service provider (ISP) with a single link to a peer network and two types of customers: (1) large institutions who are refunded for loss-rate violations and (2) small dialup users who "pay per click" on the world wide web according to prices set by the ISP. To understand the limits of performance, we assume that price information can be communicated instantaneously to the users. Our formulation captures the basic tradeoff in allocating bandwidth to the two classes of users in maximizing average net revenue. Optimal pricing requires that the ISP anticipate and respond to changes in bandwidth consumption. Our goal is to quantify the gain that can be achieved through dynamic pricing over open loop pricing strategies which may or may not account for time-of-day effects. We frame the problem as a continuous-time Markov decision process for which we numerically compute optimal solutions. We interpret the results for a wide range of parameter settings to isolate scenaria where real-time price feedback can substantially improve upon time of day pricing.
Key Words: Network Pricing, Quality-of-Service, Discrete Stochastic Control, Markov Decision Processes
This work is supported by the National Science Foundation through grants E...
An Average-Reward Reinforcement Learning Algorithm for Computing Bias-Optimal Policies
In Proceedings of the Thirteenth AAAI, 1996
Cited by 4 (3 self)
Recently, there has been growing interest in average-reward reinforcement learning
Finite Memory Estimation and Control of Finite Probabilistic Systems
1977
Cited by 4 (0 self)
A finite probabilistic system (FPS) is a stationary discrete-time controlled stochastic dynamical process, having finite input, output, and (internal) state sets. The partially-observable Markov decision process is an example of such a system. FPS formulations provide a convenient framework for the study of problems of state estimation, statistical decision, or control, where state information is available only through a finite memoryless channel, and observation dynamics may depend on the inputs selected. Notions of reachability and detectability in FPS's (similar to controllability and observability in linear systems) are made precise. It is shown that every FPS can be reduced to components that are either reachable and detectable, or transient, or null-recurrent. It is well known that the information vector (whose i-th entry is the a posteriori probability that the system is in state i) is a sufficient
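The information vector mentioned in the abstract above holds the a posteriori probability of each state. For a partially observable Markov decision process it can be maintained with a standard Bayes filter step, sketched below. This is a textbook update under assumed conventions, not the paper's FPS formulation: the names `belief_update`, `P`, and `O` are hypothetical, with `P[a][i][j]` the probability of moving from state i to j under input a, and `O[a][j][y]` the probability of output y in state j after input a (so that observations may depend on the input selected).

```python
def belief_update(b, a, y, P, O):
    """One Bayes-filter step for the information (belief) vector.

    b: current belief, b[i] = probability the system is in state i.
    a: input (action) applied; y: output (observation) received.
    P and O are hypothetical model tables, indexed as described above.
    """
    n = len(b)
    # Prediction: push the belief through the transition model.
    predicted = [sum(b[i] * P[a][i][j] for i in range(n)) for j in range(n)]
    # Correction: weight each state by the likelihood of the observation.
    unnormalized = [O[a][j][y] * predicted[j] for j in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / total for u in unnormalized]
```

For example, with two states, identity transitions, and an output that is seen with probability 0.9 in state 0 and 0.2 in state 1, a uniform belief moves to 0.45/0.55 for state 0 after that output is observed.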
A Robust Robot Navigation Architecture Using Partially Observable Semi-Markov Decision Processes
1997
Cited by 2 (2 self)
CHAPTER 1. INTRODUCTION
  1.1. Robot Navigation
  1.2. The POSMDP Approach
  1.3. Thesis Overview
  1.4. Chapter Outline
CHAPTER 2. THEORETICAL FOUNDATIONS
  2.1. Markov Decision Processes
    2.1.1. Policies and the Value Function
    2.1.2. Dynamic Programming
  2.2. Semi-Markov Decision Processes
  2.3. Partially Observable Markov Decision Processes
    2.3.1. State Estimation
    2.3.2. Planning
  2.4. Occupancy Grids
  2.5. Artificial Neural Networks
  2.6. Related Work
CHAPTER 3. A ROBOT NAVIGATION ARCHITECTURE
  3.1. The POSMDP Planning Layer
    3.1.1. Abstract Actions
    3.1.2. Abstract Observations
    3.1.3. Probabilistic Planning in the POSMDP
    3.1.4. Temporal Models of Robot Actions
  3.2. The Reactive Behavior Layer
  3.3. Feature Detectors with ANNs
CHAPTER 4. RESULTS
  4.1. Feature Detection
  4.2. Navigation Results
    4.2.1. Odometric Uncertainty
    4.2.2. Temporal Modeling
    4.2.3. Learning Transition Times
CHAPTER 5. CONCLUSION
  5.1. Con...
A New Value Iteration Method For The Average Cost Dynamic Programming Problem
1996
Cited by 2 (1 self)
We propose a new value iteration method for the classical average cost Markovian Decision problem, under the assumption that all stationary policies are unichain and furthermore there exists a state that is recurrent under all stationary policies. This method is motivated by a relation between the average cost problem and an associated stochastic shortest path problem. Contrary to the standard relative value iteration, our method involves a weighted sup norm contraction and for this reason it admits a Gauss-Seidel implementation. Computational tests indicate that the Gauss-Seidel version of the new method substantially outperforms the standard method for difficult problems.
Research supported by NSF under Grant 9300494-DMI. Dept. of Electrical Engineering and Computer Science, M.I.T., Cambridge, Mass., 02139.
1. Introduction
We consider a controlled discrete-time dynamic system with n states, denoted 1, ..., n. At each time, if the state is i, a control u i...
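For orientation, the standard relative value iteration that this entry's abstract uses as its baseline can be sketched as follows. This is the textbook scheme, not the new method proposed in the paper. The names `relative_value_iteration`, `P`, `g`, and `ref` are assumptions: `g[i][u]` is the cost of control u in state i, `P[i][u][j]` the transition probability to state j, and `ref` the reference state whose value is pinned to zero. Convergence of this plain form is assumed to require aperiodicity in addition to the unichain condition.

```python
def relative_value_iteration(P, g, ref=0, tol=1e-10, max_iter=10000):
    """Standard relative value iteration for an average-cost MDP (sketch).

    Returns (gain, h): the estimated optimal average cost per stage and
    the relative (differential) cost vector with h[ref] == 0.
    """
    n = len(g)
    h = [0.0] * n
    gain = 0.0
    for _ in range(max_iter):
        # Apply the dynamic programming operator T to h.
        Th = [
            min(
                g[i][u] + sum(P[i][u][j] * h[j] for j in range(n))
                for u in range(len(g[i]))
            )
            for i in range(n)
        ]
        # Subtract the value at the reference state to keep iterates bounded.
        gain = Th[ref]
        h_new = [v - gain for v in Th]
        converged = max(abs(x - y) for x, y in zip(h_new, h)) < tol
        h = h_new
        if converged:
            break
    return gain, h
```

Each sweep here uses only values from the previous sweep (a Jacobi-style update). The abstract's point is that the proposed method, unlike this one, also admits a Gauss-Seidel implementation.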
Distributed Dynamic Programming
with two switch-over levels for a class of M/G/1 queueing systems with variable arrival and service rate," Stochastic Processes and Their Appl., vol. 6, pp. 213-222, 1978. [11] P. Varaiya, Notes on Optimization. New York: Van Nostrand
Pravin P. Varaiya (M'68-SM'78-F'80), for a photograph and biography, see this issue, p. 655.

