Results 1 - 9 of 9
Reinforcement learning: a survey
Journal of Artificial Intelligence Research, 1996
Abstract

Cited by 1714 (25 self)
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
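As a toy illustration of two of the central issues the survey names, the exploration/exploitation trade-off and learning from delayed reinforcement, here is a minimal tabular Q-learning update with an ε-greedy action choice. This is our own sketch, not code from the paper; the state/action encoding and parameter values are arbitrary assumptions.

```python
import random

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update: move Q(s,a) toward the
    one-step bootstrapped target r + gamma * max_a' Q(s',a')."""
    best_next = max(Q[s_next].values())
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

def epsilon_greedy(Q, s, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the
    current value estimates."""
    if random.random() < epsilon:
        return random.choice(list(Q[s]))
    return max(Q[s], key=Q[s].get)
```

With ε = 0 the agent is purely greedy; raising ε trades short-term reward for information about under-sampled actions.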
Incremental multi-step Q-learning
, 1996
Abstract

Cited by 111 (2 self)
Abstract. This paper presents a novel incremental algorithm that combines Q-learning, a well-known dynamic-programming based reinforcement learning method, with the TD(λ) return estimation process, which is typically used in actor-critic learning, another well-known dynamic-programming based reinforcement learning method. The parameter λ is used to distribute credit throughout sequences of actions, leading to faster learning and also helping to alleviate the non-Markovian effect of coarse state-space quantization. The resulting algorithm, Q(λ)-learning, thus combines some of the best features of the Q-learning and actor-critic learning paradigms. The behavior of this algorithm has been demonstrated through computer simulations.
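To make the idea of distributing credit with λ concrete, here is one common tabular variant of Q-learning with eligibility traces (the trace-cutting rule below is Watkins' convention, shown as a sketch; it is not claimed to match this paper's exact algorithm, and all names are ours).

```python
import numpy as np

def q_lambda_update(Q, E, s, a, r, s2, a2_greedy, took_greedy,
                    alpha=0.1, gamma=0.9, lam=0.8):
    """One step of tabular Q(lambda): accumulate a trace for (s,a),
    back up the TD error along all traced pairs, and cut the traces
    after a non-greedy (exploratory) action, as in Watkins' Q(lambda)."""
    delta = r + gamma * Q[s2, a2_greedy] - Q[s, a]
    E[s, a] += 1.0                      # accumulating eligibility trace
    Q += alpha * delta * E              # credit the whole traced sequence
    if took_greedy:
        E *= gamma * lam                # decay traces toward the past
    else:
        E[:] = 0.0                      # exploratory step: cut traces
```

With λ = 0 this reduces to one-step Q-learning; larger λ spreads each TD error further back along the action sequence.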
Truncated Temporal Differences with Function Approximation: Successful Examples Using CMAC
In Proceedings of the Thirteenth European Symposium on Cybernetics and Systems Research (EMCSR-96), 1996
Abstract

Cited by 2 (1 self)
Combining reinforcement learning algorithms with function approximators in order to generalize over the state space has recently received particular interest and is widely believed to be one of the crucial issues for scaling reinforcement learning to practically interesting domains. This paper examines the combination of the TTD procedure, a computationally efficient approximate implementation of TD(λ) methods, with CMAC, a function approximator especially suitable for reinforcement learning due to its computational efficiency and on-line learning capability. Most previous studies have investigated the combination of CMAC with either TD(0)-based algorithms, which usually learn much more slowly than for λ > 0, or with the traditional implementation of TD(λ) based on eligibility traces, associated with high computational costs. This study, by combining CMAC with TTD, attempts to reconcile fast learning with computational efficiency and generalization capabilities. The presented experimental re...
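The CMAC idea the abstract relies on can be sketched as tile coding: several overlapping, offset tilings each activate one tile, and the value estimate is the sum of the active tiles' weights. A minimal 1-D version, with tile counts and offsets chosen arbitrarily for illustration (not the paper's configuration):

```python
def cmac_tiles(x, n_tilings=4, n_tiles=10, lo=0.0, hi=1.0):
    """Map a scalar state in [lo, hi) to one active tile index per
    tiling; each tiling is shifted by a fraction of a tile width,
    so nearby states share most (but not all) active tiles."""
    width = (hi - lo) / n_tiles
    idx = []
    for t in range(n_tilings):
        offset = t * width / n_tilings
        i = int((x - lo + offset) / width)
        i = min(i, n_tiles)             # clamp overflow caused by offsets
        idx.append(t * (n_tiles + 1) + i)
    return idx

def value(weights, x):
    """CMAC value estimate: sum of the weights of the active tiles."""
    return sum(weights[i] for i in cmac_tiles(x))
```

Learning adjusts only the handful of active weights per step, which is what makes CMAC cheap enough for on-line reinforcement learning.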
Faster Temporal Credit Assignment in Learning Classifier Systems
In Proceedings of the First Polish Conference on Evolutionary Algorithms, 1996
Abstract

Cited by 1 (0 self)
Classifier systems are genetics-based learning systems using the paradigm of reinforcement learning. In the most challenging case of delayed reinforcement, it involves a difficult temporal credit assignment problem. Standard classifier systems solve this problem using the bucket brigade algorithm. In this paper we show how to make the temporal credit assignment process faster by augmenting this algorithm with some refinements borrowed from a related field of reinforcement learning algorithms based on the methods of temporal differences (TD). These algorithms usually converge significantly faster if they are used in combination with TD(λ) for λ > 0. As a natural consequence of the easily noticeable similarity between the bucket brigade and TD(0), the BB(λ) algorithm is derived, using the standard technique of eligibility traces. The TTD(λ; m) procedure, which eliminates eligibility traces and implements an approximation of TD(λ) in a computationally efficient way, has also been ported to the contex...
Improving reinforcement learning through a better exploration strategy and an adjustable representation
XCS with Eligibility Traces
Abstract
The development of the XCS Learning Classifier System has produced a robust and stable implementation that performs competitively in direct-reward environments. Although investigations in delayed-reward (i.e. multi-step) environments have shown promise, XCS still struggles to efficiently find optimal solutions in environments with long action-chains. This paper highlights the strong relation of XCS to reinforcement learning and identifies some of the major differences. This makes it possible to add eligibility traces to XCS, a method taken from reinforcement learning to update the prediction of the whole action-chain on each step, which should make prediction updates faster and more accurate. However, it is shown that the discrete nature of the condition representation of a classifier and the operation of the genetic algorithm cause traces to propagate back incorrect prediction values, and in some cases this results in a decrease of system performance. As a result, further investigation of the existing approach to generalisation is proposed.
Truncated Temporal Differences and Sequential Replay: Comparison, Integration, and Experiments
Abstract
This paper examines two techniques for speeding up reinforcement learning algorithms based on the methods of temporal differences (TD). The first of them, recently developed by the author and known as the TTD procedure, is an approximate implementation of TD(λ) for λ > 0, significantly more computationally efficient than the traditional eligibility traces implementation. The second technique, previously used by other authors and called here sequential replay (SR), relies on performing at each step several temporally chained TD(0) updates. In this paper the SR technique is shown to yield roughly mathematically equivalent overall effects as TTD with certain parameter values. It is also demonstrated how sequential replay can be integrated with the TTD procedure, leading to the TTD-SR procedure, covering both of the basic techniques as special cases and making it possible to use sequential replay with λ > 0. The results of experimental studies carried out with the combination of TTD-SR an...
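The truncation idea behind TTD can be sketched as a backward recursion over a stored window of experiences: each step's target blends the one-step backup with the later truncated return, and the return beyond the window is replaced by a value estimate. This is a hedged sketch of the recursion G = r + γ((1-λ)V(s') + λG), not the TTD procedure's exact bookkeeping; argument names are ours.

```python
def truncated_td_lambda_return(rewards, values, v_end, gamma=0.9, lam=0.8):
    """Truncated TD(lambda) target over a window of experiences,
    computed backwards.  rewards[k] is the reward at step k of the
    window, values[k] the estimated value of the state reached at
    step k, and v_end the value substituted for everything beyond
    the truncation horizon."""
    G = v_end
    for r, v in zip(reversed(rewards), reversed(values)):
        G = r + gamma * ((1 - lam) * v + lam * G)
    return G
```

Setting λ = 0 recovers the one-step TD(0) target, while λ = 1 defers entirely to the truncated return; intermediate λ interpolates, which is what lets a single window of stored steps stand in for per-state eligibility traces.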
GBQL: A Novel Genetics-Based Reinforcement Learning Architecture
1995
Abstract
This research attempts to integrate the existing ideas in two fields: reinforcement learning algorithms based on the methods of temporal differences (TD), in particular Q-learning, and genetics-based machine learning, in particular classifier systems (CS). Close relations between the bucket brigade credit assignment algorithm used in classifier systems and TD methods, several widely recognized drawbacks of CS, and good theoretical properties of TD gave the initial motivation for developing a learning architecture that would combine TD-based temporal credit assignment algorithms with genetics-based adaptive knowledge representation. This paper presents a simple instantiation of this idea, called GBQL (Genetics-Based Q-Learning). This learning architecture may be expected to be a promising alternative for stimulus-response classifier systems on the one hand, and for implementations of Q-learning using other knowledge representation methods (e.g., connectionist networks) on the other....
Integrated Learning and Planning Based on Truncating Temporal Differences
Abstract
Reinforcement learning systems learn to act in an uncertain environment by executing actions and observing their long-term effects. A large number of time steps may be required before this trial-and-error process converges to a satisfactory policy. It is highly desirable that the number of experiences needed by the system to learn to perform its task be minimized, particularly if making errors is costly. One approach to achieving this goal is to use hypothetical experiences, which requires some additional computation, but may reduce the necessary number of much more costly real experiences. This well-known idea of augmenting reinforcement learning with planning is revisited in this paper in the context of truncated TD(λ), or TTD, a simple computational technique which allows reinforcement learning algorithms based on the methods of temporal differences to learn considerably faster with essentially no additional computational expense. Two different ways of combining TTD with planning are ...
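The "hypothetical experiences" idea is commonly illustrated with a Dyna-style loop: each real step updates the value estimates, is recorded in a learned model, and is followed by several replayed (imagined) updates drawn from that model. The sketch below shows that generic pattern with a deterministic one-step model; it is an illustration of the idea, not the paper's TTD-based planning method, and all names are ours.

```python
import random

def dyna_q(Q, model, s, a, r, s2, n_planning=5, alpha=0.1, gamma=0.9):
    """One Dyna-style step: a real Q-learning update, record the
    transition in a deterministic model, then replay n hypothetical
    experiences sampled from the model."""
    def backup(s, a, r, s2):
        Q[s][a] += alpha * (r + gamma * max(Q[s2].values()) - Q[s][a])
    backup(s, a, r, s2)                 # learn from the real experience
    model[(s, a)] = (r, s2)             # remember what happened
    for _ in range(n_planning):         # learn from imagined experiences
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        backup(ps, pa, pr, ps2)
```

The extra backups cost only computation, so when real experiences are expensive (e.g. when errors are costly) the trade is usually worthwhile.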