Results 1  10
of
19
Prioritized sweeping: Reinforcement learning with less data and less time
 Machine Learning
, 1993
"... We present a new algorithm, Prioritized Sweeping, for e cient prediction and control of stochastic Markov systems. Incremental learning methods such asTemporal Di erencing and Qlearning have fast real time performance. Classical methods are slower, but more accurate, because they make full use of ..."
Abstract

Cited by 316 (5 self)
 Add to MetaCart
We present a new algorithm, Prioritized Sweeping, for e cient prediction and control of stochastic Markov systems. Incremental learning methods such asTemporal Di erencing and Qlearning have fast real time performance. Classical methods are slower, but more accurate, because they make full use of the observations. Prioritized Sweeping aims for the best of both worlds. It uses all previous experiences both to prioritize important dynamic programming sweeps and to guide the exploration of statespace. We compare Prioritized Sweeping with other reinforcement learning schemes for a number of di erent stochastic optimal control problems. It successfully solves large statespace real time problems with which other methods have di culty. 1 1
Learning and Sequential Decision Making
 LEARNING AND COMPUTATIONAL NEUROSCIENCE
, 1989
"... In this report we show how the class of adaptive prediction methods that Sutton called "temporal difference," or TD, methods are related to the theory of squential decision making. TD methods have been used as "adaptive critics" in connectionist learning systems, and have been proposed as models of ..."
Abstract

Cited by 195 (10 self)
 Add to MetaCart
In this report we show how the class of adaptive prediction methods that Sutton called "temporal difference," or TD, methods are related to the theory of squential decision making. TD methods have been used as "adaptive critics" in connectionist learning systems, and have been proposed as models of animal learning in classical conditioning experiments. Here we relate TD methods to decision tasks formulated in terms of a stochastic dynamical system whose behavior unfolds over time under the influence of a decision maker's actions. Strategies are sought for selecting actions so as to maximize a measure of longterm payoff gain. Mathematically, tasks such as this can be formulated as Markovian decision problems, and numerous methods have been proposed for learning how to solve such problems. We show how a TD method can be understood as a novel synthesis of concepts from the theory of stochastic dynamic programming, which comprises the standard method for solving such tasks when a model of the dynamical system is available, and the theory of parameter estimation, which provides the appropriate context for studying learning rules in the form of equations for updating associative strengths in behavioral models, or connection weights in connectionist networks. Because this report is oriented primarily toward the nonengineer interested in animal learning, it presents tutorials on stochastic sequential decision tasks, stochastic dynamic programming, and parameter estimation.
Efficient Exploration In Reinforcement Learning
, 1992
"... Exploration plays a fundamental role in any active learning system. This study evaluates the role of exploration in active learning and describes several local techniques for exploration in finite, discrete domains, embedded in a reinforcement learning framework (delayed reinforcement). This paper d ..."
Abstract

Cited by 122 (4 self)
 Add to MetaCart
Exploration plays a fundamental role in any active learning system. This study evaluates the role of exploration in active learning and describes several local techniques for exploration in finite, discrete domains, embedded in a reinforcement learning framework (delayed reinforcement). This paper distinguishes between two families of exploration schemes: undirected and directed exploration. While the former family is closely related to random walk exploration, directed exploration techniques memorize explorationspecific knowledge which is used for guiding the exploration search. In many finite deterministic domains, any learning technique based on undirected exploration is inefficient in terms of learning time, i.e. learning time is expected to scale exponentially with the size of the state space (Whitehead, 1991b) . We prove that for all these domains, reinforcement learning using a directed technique can always be performed in polynomial time, demonstrating the important role of e...
Learning to Solve Markovian Decision Processes
, 1994
"... This dissertation is about building learning control architectures for agents embedded in finite, stationary, and Markovian environments. Such architectures give embedded agents the ability to improve autonomously the efficiency with which they can achieve goals. Machine learning researchers have d ..."
Abstract

Cited by 48 (3 self)
 Add to MetaCart
This dissertation is about building learning control architectures for agents embedded in finite, stationary, and Markovian environments. Such architectures give embedded agents the ability to improve autonomously the efficiency with which they can achieve goals. Machine learning researchers have developed reinforcement learning (RL) algorithms based on dynamic programming (DP) that use the agent's experience in its environment to improve its decision policy incrementally. This is achieved by adapting an evaluation function in such a way that the decision policy that is "greedy" with respect to it improves with experience. This dissertation focuses on finite, stationary and Markovian environments for two reasons: it allows the develop...
Reinforcement learning is direct adaptive optimal control
 In Proceedings of the American Control Conference
, 1991
"... optimal controls are estimated directly more attractive. We view reinforcement learning methods as a computationally simple, direct approach to the adaptive optimal control of nonlinear systems. For concreteness, we focus on one reinforcement learning method (Qlearning) and on its analytically prov ..."
Abstract

Cited by 47 (4 self)
 Add to MetaCart
optimal controls are estimated directly more attractive. We view reinforcement learning methods as a computationally simple, direct approach to the adaptive optimal control of nonlinear systems. For concreteness, we focus on one reinforcement learning method (Qlearning) and on its analytically proven capabilities for one class of adaptive optimal control problems (markov decision problems with unknown transition probabilities).
Exploration bonuses and dual control
 MACHINE LEARNING
, 1996
"... Finding the Bayesian balance between exploration and exploitation in adaptive optimal control is in general intractable. This paper shows how to compute suboptimal estimates based on a certainty equivalence approximation (Cozzolino, GonzalezZubieta & Miller, 1965) arising from a form of dual contr ..."
Abstract

Cited by 34 (1 self)
 Add to MetaCart
Finding the Bayesian balance between exploration and exploitation in adaptive optimal control is in general intractable. This paper shows how to compute suboptimal estimates based on a certainty equivalence approximation (Cozzolino, GonzalezZubieta & Miller, 1965) arising from a form of dual control. This systematizes and extends existing uses of exploration bonuses in reinforcement learning (Sutton, 1990). The approach has two components: a statistical model of uncertainty in the world and a way of turning this into exploratory behavior. This general approach is applied to twodimensional mazes with moveable barriers and its performance is compared with Suttonâ€™s DYNA system.
Biped dynamic walking using reinforcement learning
 Robotics and Autonomous Systems
, 1997
"... biped robot, legged robot. This paper presents some results from a study of biped dynamic walking using reinforcement learning. During this study a hardware biped robot was built, a new reinforcement learning algorithm as well as a new learning architecture were developed. The biped learned dynamic ..."
Abstract

Cited by 31 (0 self)
 Add to MetaCart
biped robot, legged robot. This paper presents some results from a study of biped dynamic walking using reinforcement learning. During this study a hardware biped robot was built, a new reinforcement learning algorithm as well as a new learning architecture were developed. The biped learned dynamic walking without any previous knowledge about its dynamic model. The Self Scaling Reinforcement learning algorithm was developed in order to deal with the problem of reinforcement learning in continuous action domains. The learning architecture was developed in order to solve complex control problems. It uses different modules that consist of simple controllers and small neural networks. The architecture allows for easy incorporation of new modules that represent new knowledge, or new requirements for the desired task. 1
On the Computational Economics of Reinforcement Learning
, 1990
"... Following terminology used in adaptive control, we distinguish between indirect learning methods, which learn explicit models of the dynamic structure of the system to be controlled, and direct learning methods, which do not. We compare an existing indirect method, which uses a conventional dynamic ..."
Abstract

Cited by 26 (6 self)
 Add to MetaCart
Following terminology used in adaptive control, we distinguish between indirect learning methods, which learn explicit models of the dynamic structure of the system to be controlled, and direct learning methods, which do not. We compare an existing indirect method, which uses a conventional dynamic programming algorithm, with a closely related direct reinforcement learning method by applying both methods to an infinite horizon Markov decision problem with unknown statetransition probabilities. The simulations show that although the direct method requires much less space and dramatically less computation per control action, its learning ability in this task is superior to, or compares favorably with, that of the more complex indirect method. Although these results do not address how the methods' performances compare as problems become more difficult, they suggest that given a fixed amount of computational power available per control action, it may be better to use a direct reinforcemen...
Incremental Dynamic Programming for OnLine Adaptive Optimal Control
, 1994
"... Reinforcement learning algorithms based on the principles of Dynamic Programming (DP) have enjoyed a great deal of recent attention both empirically and theoretically. These algorithms have been referred to generically as Incremental Dynamic Programming (IDP) algorithms. IDP algorithms are intended ..."
Abstract

Cited by 20 (2 self)
 Add to MetaCart
Reinforcement learning algorithms based on the principles of Dynamic Programming (DP) have enjoyed a great deal of recent attention both empirically and theoretically. These algorithms have been referred to generically as Incremental Dynamic Programming (IDP) algorithms. IDP algorithms are intended for use in situations where the information or computational resources needed by traditional dynamic programming algorithms are not available. IDP algorithms attempt to find a global solution to a DP problem by incrementally improving local constraint satisfaction properties as experience is gained through interaction with the environment. This class of algorithms is not new, going back at least as far as Samuel's adaptive checkersplaying programs,...
LargeScale Dynamic Optimization Using Teams of Reinforcement Learning Agents
, 1996
"... Recent algorithmic and theoretical advances in reinforcement learning (RL) are attracting widespread interest. RL algorithms have appeared that approximate dynamic programming (DP) on an incremental basis. Unlike traditional DP algorithms, these algorithms do not require knowledge of the state trans ..."
Abstract

Cited by 18 (0 self)
 Add to MetaCart
Recent algorithmic and theoretical advances in reinforcement learning (RL) are attracting widespread interest. RL algorithms have appeared that approximate dynamic programming (DP) on an incremental basis. Unlike traditional DP algorithms, these algorithms do not require knowledge of the state transition probabilities or reward structure of a system. This allows them to be trained using real or simulated experiences, focusing their computations on the areas of state space that are actually visited during control, making them computationally tractable on very large problems. RL algorithms can be used as components of multiagent algorithms. If each member of a team of agents employs one of these algorithms, a new collective learning algor...