Results 1-10 of 13
Asynchronous Stochastic Approximation and Q-Learning
Machine Learning, 1994
Cited by 149 (3 self)
We provide some general results on the convergence of a class of stochastic approximation algorithms and their parallel and asynchronous variants. We then use these results to study the Q-learning algorithm, a reinforcement learning method for solving Markov decision problems, and establish its convergence under conditions more general than previously available.
Keywords: Reinforcement learning, Q-learning, dynamic programming, stochastic approximation
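As a rough illustration of the algorithm this abstract analyzes, the sketch below runs tabular Q-learning with asynchronous updates (one state-action pair at a time) on a toy two-state problem. The function name `q_learning`, the toy problem, the uniform random choice of which pair to update, and the 1/n step size are all assumptions made for illustration; none of them comes from the paper.

```python
import random

# Hypothetical toy problem: TOY_TRANSITIONS[s][a] is a list of
# (probability, next_state) pairs and TOY_REWARDS[s][a] is the reward.
TOY_TRANSITIONS = [[[(1.0, 0)], [(1.0, 1)]],
                   [[(1.0, 0)], [(1.0, 1)]]]
TOY_REWARDS = [[0.0, 1.0], [0.0, 2.0]]


def q_learning(transitions, rewards, gamma=0.5, steps=20000, seed=0):
    """Tabular Q-learning; the known model is used only as a simulator."""
    rng = random.Random(seed)
    n_states = len(transitions)
    n_actions = len(transitions[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    visits = [[0] * n_actions for _ in range(n_states)]
    for _ in range(steps):
        # Asynchronous update: touch a single (state, action) pair per step.
        s = rng.randrange(n_states)
        a = rng.randrange(n_actions)
        probs, nexts = zip(*transitions[s][a])
        s_next = rng.choices(nexts, weights=probs)[0]
        visits[s][a] += 1
        # Assumed step size: 1/n per pair (sum diverges, sum of squares converges).
        alpha = 1.0 / visits[s][a]
        target = rewards[s][a] + gamma * max(q[s_next])
        q[s][a] += alpha * (target - q[s][a])
    return q


if __name__ == "__main__":
    for row in q_learning(TOY_TRANSITIONS, TOY_REWARDS):
        print([round(v, 2) for v in row])
```

For this toy problem with a discount factor of 0.5, the optimal action values worked out by hand are 1.5 and 3 in state 0, and 1.5 and 4 in state 1, so the learned table can be compared against them.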
Efficient Learning and Planning Within the Dyna Framework
Adaptive Behavior, 1993
Cited by 92 (3 self)
Sutton's Dyna framework provides a novel and computationally appealing way to integrate learning, planning, and reacting in autonomous agents. Examined here is a class of strategies designed to enhance the learning and planning power of Dyna systems by increasing their computational efficiency. The benefit of using these strategies is demonstrated on some simple abstract learning tasks.
1 Introduction
Many problems faced by an autonomous agent in an unknown environment can be cast in the form of reinforcement learning tasks. Recent work in this area has led to a clearer understanding of the relationship between algorithms found useful for such tasks and asynchronous approaches to dynamic programming (Bertsekas & Tsitsiklis, 1989), and this understanding has led in turn to both new results relevant to the theory of dynamic programming (Barto, Bradtke, & Singh, 1991; Watkins & Dayan, 1991; Williams & Baird, 1990) and the creation of new reinforcement learning algorithms, such as Q-learn...
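One way to picture how a Dyna-style agent interleaves reacting, learning, and planning is the generic sketch below. It is not the class of efficiency strategies the paper examines (the abstract does not spell those out); it is a plain Dyna-Q-style loop with uniformly sampled planning updates. The chain environment `chain_step`, the function name `dyna_q`, and every parameter value are hypothetical.

```python
import random


def chain_step(state, action):
    """Hypothetical 4-state chain: action 1 moves right, action 0 moves left.
    Arriving at (or staying in) state 3 pays reward 1."""
    nxt = min(state + 1, 3) if action == 1 else max(state - 1, 0)
    return (1.0 if nxt == 3 else 0.0), nxt


def dyna_q(step, n_states, n_actions, gamma=0.9, alpha=0.5, epsilon=0.2,
           real_steps=500, planning_steps=10, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    model = {}  # (state, action) -> (reward, next_state), filled from experience
    s = 0
    for _ in range(real_steps):
        # Reacting: epsilon-greedy choice, with ties broken at random.
        if rng.random() < epsilon:
            a = rng.randrange(n_actions)
        else:
            best = max(q[s])
            a = rng.choice([b for b in range(n_actions) if q[s][b] == best])
        r, s_next = step(s, a)
        # Learning: one-step backup from the real transition.
        q[s][a] += alpha * (r + gamma * max(q[s_next]) - q[s][a])
        model[(s, a)] = (r, s_next)
        # Planning: extra backups replayed from the learned model.
        for _ in range(planning_steps):
            (ps, pa), (pr, pn) = rng.choice(list(model.items()))
            q[ps][pa] += alpha * (pr + gamma * max(q[pn]) - q[ps][pa])
        s = s_next
    return q, model


if __name__ == "__main__":
    table, _ = dyna_q(chain_step, n_states=4, n_actions=2)
    print([max(range(2), key=lambda b: table[s][b]) for s in range(4)])
```

Sampling planning updates uniformly from the model is the simplest choice; deciding which remembered transitions to replay first is the kind of design decision that affects computational efficiency.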
Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions
1993
Cited by 83 (1 self)
Consider a given value function on states of a Markov decision problem, as might result from applying a reinforcement learning algorithm. Unless this value function equals the corresponding optimal value function, at some states there will be a discrepancy, which is natural to call the Bellman residual, between what the value function specifies at that state and what is obtained by a one-step lookahead along the seemingly best action at that state using the given value function to evaluate all succeeding states. This paper derives a tight bound on how far from optimal the discounted return for a greedy policy based on the given value function will be as a function of the maximum norm magnitude of this Bellman residual. A corresponding result is also obtained for value functions defined on state-action pairs, as are used in Q-learning. One significant application of these results is to problems where a function approximator is used to learn a value function, with training of the approxi...
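The quantities named in this abstract can be made concrete with a small sketch: a one-step lookahead, the greedy policy it induces, and the maximum-norm Bellman residual of a given value function. The toy two-state problem and the function names are hypothetical, and the sketch computes only the residual; it does not reproduce the paper's bound.

```python
# Hypothetical toy problem: TOY_TRANSITIONS[s][a] is a list of
# (probability, next_state) pairs and TOY_REWARDS[s][a] is the reward.
TOY_TRANSITIONS = [[[(1.0, 0)], [(1.0, 1)]],
                   [[(1.0, 0)], [(1.0, 1)]]]
TOY_REWARDS = [[0.0, 1.0], [0.0, 2.0]]


def one_step_lookahead(values, transitions, rewards, gamma):
    """Return (backed-up values, greedy actions) for the given value function."""
    backed_up, greedy = [], []
    for s in range(len(values)):
        action_values = [
            rewards[s][a] + gamma * sum(p * values[n] for p, n in transitions[s][a])
            for a in range(len(transitions[s]))
        ]
        best = max(range(len(action_values)), key=action_values.__getitem__)
        backed_up.append(action_values[best])
        greedy.append(best)
    return backed_up, greedy


def bellman_residual(values, transitions, rewards, gamma):
    """Largest gap, over states, between the values and their one-step lookahead."""
    backed_up, _ = one_step_lookahead(values, transitions, rewards, gamma)
    return max(abs(b - v) for b, v in zip(backed_up, values))


if __name__ == "__main__":
    print(bellman_residual([0.0, 0.0], TOY_TRANSITIONS, TOY_REWARDS, 0.5))
    print(bellman_residual([3.0, 4.0], TOY_TRANSITIONS, TOY_REWARDS, 0.5))
```

With a discount factor of 0.5, the all-zero value function has residual 2, while the hand-computed optimal values [3, 4] have residual 0, which matches the abstract's remark that the residual vanishes only at the optimal value function.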
A Hierarchy of Qualitative Representations for Space
In Working Papers of the Tenth International Workshop on Qualitative Reasoning about Physical Systems (QR96), 1996
Cited by 35 (7 self)
Research in Qualitative Reasoning builds and uses discrete symbolic models of the continuous world. Inference methods such as qualitative simulation are grounded in the theory of ordinary differential equations. We argue here that cognitive mapping (building and using symbolic models of the large-scale spatial environment) is a highly appropriate domain for qualitative reasoning research. We describe the Spatial Semantic Hierarchy (SSH), a set of distinct representations for space, each with its own ontology, each with its own mathematical foundation, and each abstracted from the levels below it. At the control level, the robot and its environment are modeled as a continuous dynamical system, whose stable equilibrium points are abstracted to a discrete set of "distinctive states." Trajectories linking these states can be abstracted to actions, giving a discrete causal graph level of representation for the state space. Depending on the properties of the actions, th...
Analysis of Some Incremental Variants of Policy Iteration: First Steps Toward Understanding Actor-Critic Learning Systems
1993
Cited by 28 (0 self)
This paper studies algorithms based on an incremental dynamic programming abstraction of one of the key issues in understanding the behavior of actor-critic learning systems. The prime example of such a learning system is the ASE/ACE architecture introduced by Barto, Sutton, and Anderson (1983). Also related are Witten's adaptive controller (1977) and Holland's bucket brigade algorithm (1986). The key feature of such a system is the presence of separate adaptive components for action selection and state evaluation, and the key issue focused on here is the extent to which their joint adaptation is guaranteed to lead to optimal behavior in the limit. In the incremental dynamic programming point of view taken here, these questions are formulated in terms of the use of separate data structures for the current best choice of policy and current best estimate of state values, with separate operations used to update each at individual states. Particular emphasis here is on the effect of comple...
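The formulation described here, with one data structure for the policy, another for the state values, and separate per-state update operations, might look like the sketch below. The toy two-state problem, the function names, and the particular interleaving (evaluate, then improve, at each state in turn) are assumptions for illustration. Which interleavings are guaranteed to reach optimal behavior is exactly the question the paper studies, so the sketch claims nothing beyond this toy case.

```python
# Hypothetical toy problem: TOY_TRANSITIONS[s][a] is a list of
# (probability, next_state) pairs and TOY_REWARDS[s][a] is the reward.
TOY_TRANSITIONS = [[[(1.0, 0)], [(1.0, 1)]],
                   [[(1.0, 0)], [(1.0, 1)]]]
TOY_REWARDS = [[0.0, 1.0], [0.0, 2.0]]


def action_value(s, a, values, transitions, rewards, gamma):
    return rewards[s][a] + gamma * sum(p * values[n] for p, n in transitions[s][a])


def evaluate_at(s, policy, values, transitions, rewards, gamma):
    """State-evaluation operation: back up values[s] under the current policy."""
    values[s] = action_value(s, policy[s], values, transitions, rewards, gamma)


def improve_at(s, policy, values, transitions, rewards, gamma):
    """Action-selection operation: make policy[s] greedy for the current values."""
    policy[s] = max(
        range(len(transitions[s])),
        key=lambda a: action_value(s, a, values, transitions, rewards, gamma),
    )


def incremental_policy_iteration(transitions, rewards, gamma, sweeps=100):
    n_states = len(transitions)
    policy = [0] * n_states    # current best choice of policy
    values = [0.0] * n_states  # current best estimate of state values
    for _ in range(sweeps):
        for s in range(n_states):
            evaluate_at(s, policy, values, transitions, rewards, gamma)
            improve_at(s, policy, values, transitions, rewards, gamma)
    return policy, values


if __name__ == "__main__":
    print(incremental_policy_iteration(TOY_TRANSITIONS, TOY_REWARDS, gamma=0.5))
```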
Exploration and Inference in Learning from Reinforcement
1997
Cited by 22 (2 self)
Recently there has been a good deal of interest in using techniques developed for learning from reinforcement to guide learning in robots. Motivated by the desire to find better robot learning methods, this thesis presents a number of novel extensions to existing techniques for controlling exploration and inference in reinforcement learning. First I distinguish between the well-known exploration-exploitation tradeoff and what I term exploration for future exploitation. It is argued that there are many tasks where it is more appropriate to maximise this latter measure. In particular it is appropriate when we want to employ learning algorithms as part of the process of designing a controller. Informed by this insight I develop a number of novel measures of the agent's task knowledge. The first of these is a measure of the probability of a particular course of action being the optimal course of action. Estimators are developed for this measure for boolean and non-boolean processes. These...
Biasing exploration in an anticipatory learning classifier system
2002
Cited by 4 (2 self)
The chapter investigates how model and behavioral learning can be improved in an anticipatory learning classifier system by biasing exploration. First, the applied system ACS2 is explained. Next, an overview of the possibilities of applying exploration biases in an anticipatory learning classifier system, and specifically in ACS2, is provided. In ACS2, a recency bias termed action delay bias as well as an error bias termed knowledge array bias is implemented. The system is applied in a dynamic maze task and a hand-eye coordination task to validate the biases. The experiments show that biased exploration enables ACS2 to evolve and adapt its internal environmental model faster. Adaptive behavior is also improved.
Fuzzy Model-Based Reinforcement Learning
Cited by 4 (0 self)
Model-based reinforcement learning methods are known to be highly efficient with respect to the number of trials required for learning optimal policies. In this article, a novel fuzzy model-based reinforcement learning approach, fuzzy prioritized sweeping (FPS), is presented. The approach is capable of learning strategies for Markov decision problems with continuous state and action spaces. The output of the algorithm is a Takagi-Sugeno fuzzy system with linear terms in the consequents of the rules. From the Q-function approximated by this fuzzy system an optimal control strategy can be easily derived. The proposed method is applied to the problem of selecting optimal framework signal plans in urban traffic networks. It is shown that the method outperforms existing model-based approaches.
KEYWORDS: reinforcement learning, model-based learning, fuzzy prioritized sweeping, Takagi-Sugeno fuzzy systems, framework signal plans
INTRODUCTION
Reinforcement learning means learning from experiences...
ATM Scheduling with Queuing Delay Predictions
1993
Cited by 3 (0 self)
Efficient utilization of cell switched networks supporting diverse applications will require service disciplines that are well designed for the particular quality of service constraints and traffic mix, a difficult task in view of the paucity of information about the expected traffic. We demonstrate the use of online dynamic programming in an adaptive cell scheduling mechanism that can easily be engineered to meet arbitrary quality of service constraints. When the objective is to minimize the total cell loss rate, our algorithm, urgency scheduling, compares favorably with the optimal earliest deadline first algorithm. For more complex quality of service constraints where optimal scheduling algorithms are unavailable, the simulations show urgency scheduling can provide significant increases in the usable bandwidth of a link. The learning techniques we develop are quite general and should be readily applicable to other network control problems.
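For readers unfamiliar with the baseline named in this abstract, the sketch below shows earliest deadline first (EDF) scheduling on a single link that sends one cell per time slot. It illustrates only that comparison algorithm, not the paper's urgency scheduling, whose details the abstract does not give. The slot model, the tuple layout, and the function name `edf_schedule` are assumptions.

```python
import heapq


def edf_schedule(arrivals, horizon):
    """Earliest deadline first on a link that sends one cell per slot.

    arrivals: list of (arrival_slot, deadline_slot, cell_id). A cell may be
    sent in any slot t with arrival_slot <= t <= deadline_slot; otherwise it
    counts as lost. Returns (sent, lost).
    """
    pending = sorted(arrivals)
    queue, sent, lost = [], [], []
    i = 0
    for t in range(horizon):
        # Admit every cell that has arrived by slot t.
        while i < len(pending) and pending[i][0] <= t:
            _, deadline, cell = pending[i]
            heapq.heappush(queue, (deadline, cell))
            i += 1
        # Drop cells whose deadline has already passed.
        while queue and queue[0][0] < t:
            lost.append(heapq.heappop(queue)[1])
        # Send the queued cell with the earliest deadline, if any.
        if queue:
            sent.append((t, heapq.heappop(queue)[1]))
    # Anything still queued or not yet admitted at the horizon is lost.
    lost.extend(cell for _, cell in queue)
    lost.extend(cell for _, _, cell in pending[i:])
    return sent, lost


if __name__ == "__main__":
    cells = [(0, 2, "a"), (0, 0, "b"), (0, 1, "c"), (1, 1, "d")]
    print(edf_schedule(cells, horizon=4))
```

In the usage example, four cells compete for the first three slots, so one cell must miss its deadline whatever order is chosen.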
C-Trace: A new algorithm for reinforcement learning of robotic control.
1996
Cited by 2 (0 self)
There has been much recent interest in the potential of using reinforcement learning techniques for control in autonomous robotic agents. How to implement effective reinforcement learning in a real-world robotic environment still involves many open questions. Are standard reinforcement learning algorithms like Watkins' Q-learning appropriate, or are other approaches more suitable? Some specific issues to be considered are noise/disturbance and the possibly non-Markovian aspects of the control problem. These are the particular issues we focus upon in this paper. The testbed for the experiments described in this paper is a real six-legged insectoid walking robot; the task set is to learn an effectively coordinated walking gait. The performance of a new algorithm we call C-Trace is compared to Watkins' well-known 1-step Q-learning reinforcement learning algorithm. We discuss the markedly superior performance of this new algorithm in the context of both theoretical and existing empirical ...