Results 1-10 of 55
Reinforcement learning: a survey
Journal of Artificial Intelligence Research, 1996
Cited by 1405 (23 self)
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
Learning to predict by the methods of temporal differences
Machine Learning, 1988
Cited by 1328 (46 self)
This article introduces a class of incremental learning procedures specialized for prediction – that is, for using past experience with an incompletely known system to predict its future behavior. Whereas conventional prediction-learning methods assign credit by means of the difference between predicted and actual outcomes, the new methods assign credit by means of the difference between temporally successive predictions. Although such temporal-difference methods have been used in Samuel's checker player, Holland's bucket brigade, and the author's Adaptive Heuristic Critic, they have remained poorly understood. Here we prove their convergence and optimality for special cases and relate them to supervised-learning methods. For most real-world prediction problems, temporal-difference methods require less memory and less peak computation than conventional methods and they produce more accurate predictions. We argue that most problems to which supervised learning is currently applied are really prediction problems of the sort to which temporal-difference methods can be applied to advantage.
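The core idea in this abstract, assigning credit from the difference between temporally successive predictions, can be illustrated with a minimal TD(0) sketch. The bounded random-walk task, the function name `td0_random_walk`, and the step size are assumptions chosen for illustration; they are not taken from this listing.

```python
import random


def td0_random_walk(num_states=5, episodes=5000, alpha=0.02, seed=0):
    """Estimate the probability of terminating on the right with TD(0).

    Hypothetical setup: a walk over states 0..num_states-1 that steps
    left or right with equal probability and ends with outcome 0 off the
    left edge and outcome 1 off the right edge.
    """
    rng = random.Random(seed)
    values = [0.5] * num_states  # initial predictions
    for _ in range(episodes):
        state = num_states // 2  # every episode starts in the middle
        while True:
            next_state = state + rng.choice((-1, 1))
            if next_state < 0:
                target = 0.0  # left terminal outcome
            elif next_state >= num_states:
                target = 1.0  # right terminal outcome
            else:
                target = values[next_state]  # the next prediction
            # Move the current prediction toward the successive prediction.
            values[state] += alpha * (target - values[state])
            if next_state < 0 or next_state >= num_states:
                break
            state = next_state
    return values


if __name__ == "__main__":
    print([round(v, 2) for v in td0_random_walk()])
```

For this assumed task the true values are (i + 1) / (num_states + 1), so the estimates can be checked against a known answer. Note that no update waits for the final outcome of the episode unless the next step is terminal.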
Learning and Sequential Decision Making
Learning and Computational Neuroscience, 1989
Cited by 200 (11 self)
In this report we show how the class of adaptive prediction methods that Sutton called "temporal difference," or TD, methods are related to the theory of sequential decision making. TD methods have been used as "adaptive critics" in connectionist learning systems, and have been proposed as models of animal learning in classical conditioning experiments. Here we relate TD methods to decision tasks formulated in terms of a stochastic dynamical system whose behavior unfolds over time under the influence of a decision maker's actions. Strategies are sought for selecting actions so as to maximize a measure of long-term payoff gain. Mathematically, tasks such as this can be formulated as Markovian decision problems, and numerous methods have been proposed for learning how to solve such problems. We show how a TD method can be understood as a novel synthesis of concepts from the theory of stochastic dynamic programming, which comprises the standard method for solving such tasks when a model of the dynamical system is available, and the theory of parameter estimation, which provides the appropriate context for studying learning rules in the form of equations for updating associative strengths in behavioral models, or connection weights in connectionist networks. Because this report is oriented primarily toward the non-engineer interested in animal learning, it presents tutorials on stochastic sequential decision tasks, stochastic dynamic programming, and parameter estimation.
Efficient Exploration In Reinforcement Learning
1992
Cited by 126 (3 self)
Exploration plays a fundamental role in any active learning system. This study evaluates the role of exploration in active learning and describes several local techniques for exploration in finite, discrete domains, embedded in a reinforcement learning framework (delayed reinforcement). This paper distinguishes between two families of exploration schemes: undirected and directed exploration. While the former family is closely related to random walk exploration, directed exploration techniques memorize exploration-specific knowledge which is used for guiding the exploration search. In many finite deterministic domains, any learning technique based on undirected exploration is inefficient in terms of learning time, i.e. learning time is expected to scale exponentially with the size of the state space (Whitehead, 1991b). We prove that for all these domains, reinforcement learning using a directed technique can always be performed in polynomial time, demonstrating the important role of e...
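The distinction between undirected and directed exploration can be sketched on a toy deterministic chain. Everything below is an illustrative assumption rather than the paper's own algorithm: the chain domain, the helper name `steps_to_goal`, the use of visit counts as the memorized exploration-specific knowledge, and the assumption that the agent knows each action's successor state.

```python
import random


def steps_to_goal(n_states, directed, seed=0, max_steps=100_000):
    """Count the steps needed to reach the last state of a chain.

    Hypothetical domain: states 0..n_states-1 with actions "left" and
    "right"; moving left from state 0 leaves the agent in state 0.
    """
    rng = random.Random(seed)
    visits = [0] * n_states  # exploration-specific memory
    state = 0
    visits[state] = 1
    for step in range(1, max_steps + 1):
        successors = [max(state - 1, 0), min(state + 1, n_states - 1)]
        if directed:
            # Directed: prefer the successor visited least often so far.
            next_state = min(successors, key=lambda s: visits[s])
        else:
            # Undirected: a plain random walk over the two actions.
            next_state = rng.choice(successors)
        visits[next_state] += 1
        state = next_state
        if state == n_states - 1:
            return step
    return max_steps


if __name__ == "__main__":
    print("directed:", steps_to_goal(20, directed=True))
    print("undirected:", steps_to_goal(20, directed=False))
```

On this chain the count-based rule walks straight to the goal, while the random walk revisits states many times. The chain only shows the mechanism; it is not one of the domains for which the abstract reports exponential scaling.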
Strategy Learning with Multilayer Connectionist Representations
In Proceedings of the Fourth International Workshop on Machine Learning, 1987
Cited by 77 (4 self)
Results are presented that demonstrate the learning and fine-tuning of search strategies using connectionist mechanisms. Previous studies of strategy learning within the symbolic, production-rule formalism have not addressed fine-tuning behavior. Here a two-layer connectionist system is presented that develops its search from a weak to a task-specific strategy and fine-tunes its performance. The system is applied to a simulated, real-time, balance-control task. We compare the performance of one-layer and two-layer networks, showing that the ability of the two-layer network to discover new features and thus enhance the original representation is critical to solving the balancing task.
An upper bound on the loss from approximate optimal-value functions
Machine Learning, 1994
Cited by 68 (4 self)
Many reinforcement learning approaches can be formulated using the theory of Markov decision processes and the associated method of dynamic programming (DP). The value of this theoretical understanding, however, is tempered by many practical concerns. One important question is whether DP-based approaches that use function approximation rather than lookup tables can avoid catastrophic effects on performance. This note presents a result of Bertsekas (1987) which guarantees that small errors in the approximation of a task's optimal value function cannot produce arbitrarily bad performance when actions are selected by a greedy policy. We derive an upper bound on performance loss that is slightly tighter than that in Bertsekas (1987), and we show the extension of the bound to Q-learning (Watkins, 1989). These results provide a partial theoretical rationale for the approximation of value functions, an issue of great practical importance in reinforcement learning.
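The claim that small value-function errors cannot produce arbitrarily bad greedy performance can be checked numerically on a small example. The sketch below assumes the commonly quoted form of the bound, 2γε/(1−γ) for a sup-norm error of at most ε and discount factor γ; the abstract itself does not state the expression. The random MDP and all function names are hypothetical.

```python
import random


def backup(P, R, V, gamma, s, a):
    """One-step lookahead value of action a in state s under values V."""
    return R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))


def value_iteration(P, R, gamma, sweeps=2000):
    V = [0.0] * len(P)
    for _ in range(sweeps):
        V = [max(backup(P, R, V, gamma, s, a) for a in range(len(P[s])))
             for s in range(len(P))]
    return V


def evaluate_policy(P, R, policy, gamma, sweeps=2000):
    V = [0.0] * len(P)
    for _ in range(sweeps):
        V = [backup(P, R, V, gamma, s, policy[s]) for s in range(len(P))]
    return V


def greedy_loss_demo(n_states=6, n_actions=3, gamma=0.9, eps=0.1, seed=0):
    """Return (observed loss, assumed bound) for a random MDP."""
    rng = random.Random(seed)
    P, R = [], []
    for _ in range(n_states):
        rows, rewards = [], []
        for _ in range(n_actions):
            w = [rng.random() for _ in range(n_states)]
            total = sum(w)
            rows.append([x / total for x in w])  # normalised transition row
            rewards.append(rng.random())
        P.append(rows)
        R.append(rewards)
    v_star = value_iteration(P, R, gamma)
    # Perturb the optimal values by at most eps in every state.
    v_approx = [v + rng.uniform(-eps, eps) for v in v_star]
    # Greedy policy with respect to the perturbed values.
    policy = [max(range(n_actions),
                  key=lambda a: backup(P, R, v_approx, gamma, s, a))
              for s in range(n_states)]
    v_pi = evaluate_policy(P, R, policy, gamma)
    loss = max(vs - vp for vs, vp in zip(v_star, v_pi))
    bound = 2 * gamma * eps / (1 - gamma)  # assumed form of the bound
    return loss, bound


if __name__ == "__main__":
    print(greedy_loss_demo())
```

The greedy step here uses the true transition model for its one-step lookahead, which is the setting the assumed bound applies to.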
Optimal Ordered Problem Solver
2002
Cited by 61 (20 self)
We present a novel, general, optimally fast, incremental way of searching for a universal algorithm that solves each task in a sequence of tasks. The Optimal Ordered Problem Solver (OOPS) continually organizes and exploits previously found solutions to earlier tasks, efficiently searching not only the space of domain-specific algorithms, but also the space of search algorithms. Essentially we extend the principles of optimal non-incremental universal search to build an incremental universal learner that is able to improve itself through experience.
Reinforcement Learning in Markovian and Non-Markovian Environments
1991
Cited by 54 (35 self)
This work addresses three problems with reinforcement learning and adaptive neurocontrol: 1. Non-Markovian interfaces between learner and environment. 2. Online learning based on system realization. 3. Vector-valued adaptive critics. An algorithm is described which is based on system realization and on two interacting fully recurrent continually running networks which may learn in parallel. Problems with parallel learning are attacked by 'adaptive randomness'. It is also described how interacting model/controller systems can be combined with vector-valued 'adaptive critics' (previous critics have been scalar).
Reinforcement Learning And Its Application To Control
1992
Cited by 53 (2 self)
Learning control involves modifying a controller's behavior to improve its performance as measured by some predefined index of performance (IP). If control actions that improve performance as measured by the IP are known, supervised learning methods, or methods for learning from examples, can be used to train the controller. But when such control actions are not known a priori, appropriate control behavior has to be inferred from observations of the IP. One can distinguish between two classes of methods for training controllers under such circumstances. Indirect methods involve constructing a model of the problem's IP and using the model to obtain training information for the controller. On the other hand, direct, or model-free,...
Learning to Solve Markovian Decision Processes
1994
Cited by 49 (3 self)
This dissertation is about building learning control architectures for agents embedded in finite, stationary, and Markovian environments. Such architectures give embedded agents the ability to improve autonomously the efficiency with which they can achieve goals. Machine learning researchers have developed reinforcement learning (RL) algorithms based on dynamic programming (DP) that use the agent's experience in its environment to improve its decision policy incrementally. This is achieved by adapting an evaluation function in such a way that the decision policy that is "greedy" with respect to it improves with experience. This dissertation focuses on finite, stationary and Markovian environments for two reasons: it allows the develop...