Results 1–10 of 11
Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
, 1996
Abstract

Cited by 99 (12 self)
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms is described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies and can fall into suboptimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
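The R-learning update studied in this paper can be sketched as follows. This is a minimal tabular sketch (Schwartz-style average-reward Q-learning with epsilon-greedy exploration), not the paper's exact experimental setup: the deterministic-MDP interface, learning rates, and step counts are assumptions. Note the property the overview highlights, that the average reward `rho` is estimated independently of the relative values `Q`.

```python
import random

def r_learning(transitions, rewards, n_states, n_actions,
               alpha=0.1, beta=0.01, eps=0.1, steps=20000, seed=0):
    """Tabular R-learning sketch for a deterministic MDP (illustrative).

    transitions[s][a] gives the next state, rewards[s][a] the immediate
    reward; this interface is an assumption for illustration only.
    """
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    rho = 0.0                      # independent estimate of the average reward
    s = 0
    for _ in range(steps):
        # epsilon-greedy exploration
        if rng.random() < eps:
            a = rng.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda x: Q[s][x])
        greedy = Q[s][a] == max(Q[s])      # was the chosen action greedy?
        s2, r = transitions[s][a], rewards[s][a]
        best_next = max(Q[s2])
        # relative-value update: reward is measured against rho, no discounting
        Q[s][a] += alpha * (r - rho + best_next - Q[s][a])
        # update rho only on greedy steps, at a slower rate beta
        if greedy:
            rho += beta * (r + best_next - max(Q[s]) - rho)
        s = s2
    return Q, rho
```

On a two-state cycle where one action earns reward 1 per step, `rho` converges near the optimal gain of 1.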
Solving Semi-Markov Decision Problems using Average Reward Reinforcement Learning
 Management Science
, 1999
Abstract

Cited by 23 (4 self)
A large class of problems of sequential decision making under uncertainty, in which the underlying probability structure is a Markov process, can be modeled as stochastic dynamic programs (referred to, in general, as Markov decision problems or MDPs). However, the computational complexity of the classical MDP algorithms, such as value iteration and policy iteration, is prohibitive and can grow intractably with the size of the problem and its related data. Furthermore, these techniques require, for each action, the one-step transition probability and reward matrices, which are often unrealistic to obtain for large and complex systems. Recently, there has been much interest in a simulation-based stochastic approximation framework called reinforcement learning (RL) for computing near-optimal policies for MDPs. RL has been successfully applied to very large problems, such as elevator scheduling and dynamic channel allocation in cellular telephone systems. In this paper, we exten...
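The kind of average-reward RL step used for semi-Markov problems can be sketched as a SMART-style update, in which the sojourn time tau scales the reward foregone at gain rho. This is a hedged one-step sketch; the function name, the single-step interface, and the learning rate are assumptions for illustration, not the paper's exact algorithm:

```python
def smdp_q_step(Q, rho, s, a, r, tau, s2, alpha=0.1):
    """One average-reward Q-update for a semi-Markov decision process.

    r is the reward accumulated over the transition and tau its sojourn
    time; rho (the gain estimate) converts elapsed time into foregone
    reward. A SMART-style sketch; the interface is an assumption.
    """
    target = r - rho * tau + max(Q[s2])    # time-adjusted relative value
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```

Setting tau = 1 for every transition recovers the ordinary average-reward MDP update.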
Pricing of Dial-up Services: an Example of Congestion-Dependent Pricing in the Internet
 Proceedings of the 39th IEEE Conference on Decision and Control
, 2000
Abstract

Cited by 8 (0 self)
Recent research on pricing multiclass loss networks [19] has shown that the performance of optimal static pricing approaches that of optimal dynamic (congestion-dependent) pricing in the many-small-sources limit. In our own work with similar models, we have found it difficult to obtain large gains over static pricing in realistic settings, even when the many-small-sources assumption is violated. In this paper we give an example which is a stochastic control model for congestion-dependent pricing of Internet services. The model describes a local Internet service provider (ISP) with a single link to a peer network and two types of customers: (1) large institutions, who are refunded for loss-rate violations, and (2) small dial-up users, who "pay per click" on the world wide web according to prices set by the ISP. To understand the limits of performance, we assume that price information can be communicated instantaneously to the users. Our formulation captures the basic tradeoff in allocating bandwidth to the two classes of users to maximize average net revenue. Optimal pricing requires that the ISP anticipate and respond to changes in bandwidth consumption. Our goal is to quantify the gain that can be achieved through dynamic pricing over open-loop pricing strategies, which may or may not account for time-of-day effects. We frame the problem as a continuous-time Markov decision process for which we numerically compute optimal solutions. We interpret the results for a wide range of parameter settings to isolate scenarios where real-time price feedback can substantially improve upon time-of-day pricing. Key Words: Network Pricing, Quality-of-Service, Discrete Stochastic Control, Markov Decision Processes. This work is supported by the National Science Foundation through grants E...
A New Value Iteration Method For The Average Cost Dynamic Programming Problem
, 1996
Abstract

Cited by 5 (1 self)
We propose a new value iteration method for the classical average cost Markovian Decision problem, under the assumption that all stationary policies are unichain and furthermore there exists a state that is recurrent under all stationary policies. This method is motivated by a relation between the average cost problem and an associated stochastic shortest path problem. Contrary to the standard relative value iteration, our method involves a weighted sup-norm contraction and for this reason it admits a Gauss-Seidel implementation. Computational tests indicate that the Gauss-Seidel version of the new method substantially outperforms the standard method for difficult problems. Research supported by NSF under Grant 9300494-DMI. Dept. of Electrical Engineering and Computer Science, M.I.T., Cambridge, Mass., 02139. 1. INTRODUCTION. We consider a controlled discrete-time dynamic system with n states, denoted 1, . . . , n. At each time, if the state is i, a control u i...
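The standard relative value iteration that this paper contrasts against can be sketched in a few lines. This is a minimal NumPy sketch of the classical baseline, not the paper's new weighted sup-norm method; the `P`/`c` array interface and the reference-state convention are assumptions for illustration:

```python
import numpy as np

def relative_value_iteration(P, c, ref=0, tol=1e-8, max_iter=10000):
    """Standard relative value iteration for the average-cost MDP.

    P: (A, n, n) array of per-action transition matrices; c: (A, n)
    array of per-action stage costs. Returns the average-cost estimate
    g and the relative values h, normalized so h[ref] == 0.
    """
    A, n, _ = P.shape
    h = np.zeros(n)
    for _ in range(max_iter):
        Th = (c + P @ h).min(axis=0)   # one-step Bellman backup, shape (n,)
        g = Th[ref]                    # average-cost estimate at the reference state
        h_new = Th - g                 # subtract to keep the iterates bounded
        if np.max(np.abs(h_new - h)) < tol:
            return g, h_new
        h = h_new
    return g, h
```

On a two-action problem where the cheaper action costs 1 per stage from every state, the method returns an average cost of 1.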
An Average-Reward Reinforcement Learning Algorithm for Computing Bias-Optimal Policies
 In Proceedings of the Thirteenth AAAI
, 1996
Abstract

Cited by 4 (3 self)
Recently, there has been growing interest in average-reward reinforcement learning
Finite Memory Estimation and Control of Finite Probabilistic Systems
, 1977
Abstract

Cited by 4 (0 self)
A finite probabilistic system (FPS) is a stationary discrete-time controlled stochastic dynamical process, having finite input, output, and (internal) state sets. The partially observable Markov decision process is an example of such a system. FPS formulations provide a convenient framework for the study of problems of state estimation, statistical decision, or control, where state information is available only through a finite memoryless channel, and observation dynamics may depend on the inputs selected. Notions of reachability and detectability in FPSs (similar to controllability and observability in linear systems) are made precise. It is shown that every FPS can be reduced to components that are either reachable and detectable, or transient, or null-recurrent. It is well known that the information vector (whose ith entry is the a posteriori probability that the system is in state i) is a sufficient
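The information vector mentioned at the end of the abstract can be maintained with a standard Bayes filter. A minimal sketch, assuming the belief is a row vector, `P_u` is the transition matrix under input u, and column y of `O_u` holds per-state likelihoods of output y (this interface is an assumption, not the paper's notation):

```python
import numpy as np

def belief_update(b, P_u, O_u, y):
    """Bayes update of the information vector of a finite probabilistic system.

    b: belief over n states; P_u: n x n transition matrix under input u;
    O_u[i, y]: probability of observing output y from state i under u.
    """
    predicted = b @ P_u              # propagate the belief through the dynamics
    unnorm = predicted * O_u[:, y]   # weight by the observation likelihood
    return unnorm / unnorm.sum()     # renormalize to a probability vector
```

The returned vector is again a probability distribution, so the update can be iterated over an input-output sequence.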
A Robust Robot Navigation Architecture Using Partially Observable Semi-Markov Decision Processes
, 1997
Abstract

Cited by 2 (2 self)
CHAPTER 1. INTRODUCTION
1.1. Robot Navigation
1.2. The POSMDP Approach
1.3. Thesis Overview
1.4. Chapter Outline
CHAPTER 2. THEORETICAL FOUNDATIONS
2.1. Markov Decision Processes
2.1.1. Policies and the Value Function
2.1.2. Dynamic Programming
2.2. Semi-Markov Decision Processes
2.3. Partially Observable Markov Decision Processes
2.3.1. State Estimation
2.3.2. Planning
2.4. Occupancy Grids
2.5. Artificial Neural Networks
2.6. Related Work
CHAPTER 3. A ROBOT NAVIGATION ARCHITECTURE
3.1. The POSMDP Planning Layer
3.1.1. Abstract Actions
3.1.2. Abstract Observations
3.1.3. Probabilistic Planning in the POSMDP
3.1.4. Temporal Models of Robot Actions
3.2. The Reactive Behavior Layer
3.3. Feature Detectors with ANNs
CHAPTER 4. RESULTS
4.1. Feature Detection
4.2. Navigation Results
4.2.1. Odometric Uncertainty
4.2.2. Temporal Modeling
4.2.3. Learning Transition Times
CHAPTER 5. CONCLUSION
5.1. Con...
Optimal Dynamic Server Allocation in Systems with On/Off Sources
Abstract

Cited by 1 (0 self)
Index Terms — Resource allocation, dynamic optimisation, bursty arrival sources
Recent developments in distributed and grid computing have facilitated the hosting of service provisioning systems on clusters of computers. Users do not have to specify the server on which their requests (or ‘jobs’) are going to be
Distributed Dynamic Programming
Abstract
with two switchover levels for a class of M/G/1 queueing systems with variable arrival and service rate," Stochastic Processes and their Appl., vol. 6, pp. 213–222, 1978. [11] P. Varaiya, Notes on Optimization. New York: Van Nostrand. Pravin P. Varaiya (M’68–SM’78–F’80): for a photograph and biography, see this issue, p. 655.
ON HABITUAL INSTRUMENTAL BEHAVIOR
Abstract
This thesis provides a normative computational analysis of how motivation affects decision making. More specifically, we provide a reinforcement learning model of optimal self-paced (free-operant) learning and behavior, and use it to address three broad classes of questions: (1) Why do animals work harder in some instrumental tasks than in others? (2) How do motivational states affect responding in such tasks, particularly in those cases in which behavior is habitual, that is, when responding is insensitive to changes in the specific worth of its goals, such as a higher value of food when hungry rather than sated? and (3) Why do dopaminergic manipulations cause global changes in the vigor of responding, and how is this related to prominent accounts of the role of dopamine in providing basal ganglia and frontal cortical areas with a reward prediction error signal that can be used for learning to choose between actions? A fundamental question in behavioral neuroscience concerns the decision-making processes by which animals and humans select actions in the face of reward and punishment. In Chapter 1 we provide a brief overview of the current status of this research, focused on three themes: behavior, computation and neural substrates. In behavioral psychology, this question has been investigated through the paradigms of Pavlovian (classical) and instrumental (operant) conditioning, and much evidence has accumulated regarding the associations that control different aspects of learned behavior. The computational field of reinforcement