Results 1 - 10 of 401
Reinforcement learning: a survey
Journal of Artificial Intelligence Research, 1996
Cited by 1309 (22 self)
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
Reinforcement Learning
1998
Cited by 854 (7 self)
How should a robot decide what to do?
- It should plan for each move (Planning)
- It should plan for all moves and compile its results into a set of rapid reactions (Reactive Systems)
- It should learn a set of reactions by trial-and-error
Prioritized sweeping: Reinforcement learning with less data and less time
Machine Learning, 1993
Cited by 316 (5 self)
We present a new algorithm, Prioritized Sweeping, for efficient prediction and control of stochastic Markov systems. Incremental learning methods such as Temporal Differencing and Q-learning have fast real time performance. Classical methods are slower, but more accurate, because they make full use of the observations. Prioritized Sweeping aims for the best of both worlds. It uses all previous experiences both to prioritize important dynamic programming sweeps and to guide the exploration of state-space. We compare Prioritized Sweeping with other reinforcement learning schemes for a number of different stochastic optimal control problems. It successfully solves large state-space real time problems with which other methods have difficulty.
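The abstract above does not spell out the algorithm, so the following is only a minimal sketch of the general idea of prioritizing dynamic programming backups, written in the common textbook form rather than taken from the paper. It assumes a prediction-only setting (no actions), a transition model `P` and reward table `R` that are already known rather than learned from experience, and hypothetical parameter names (`theta`, `max_backups`) chosen for illustration.

```python
import heapq
from collections import defaultdict


def prioritized_sweeping(P, R, gamma=0.9, theta=1e-6, max_backups=10_000):
    """Estimate state values of a Markov chain with prioritized backups.

    P[s] maps each successor state to its probability; R[s] is the expected
    immediate reward in s. Both are assumed given (hypothetical inputs).
    """
    V = {s: 0.0 for s in P}

    # Predecessor map: preds[s2][s] is the probability of moving from s to s2.
    preds = defaultdict(dict)
    for s, successors in P.items():
        for s2, p in successors.items():
            preds[s2][s] = p

    def backup_value(s):
        return R[s] + gamma * sum(p * V[s2] for s2, p in P[s].items())

    # Seed the queue with each state's initial Bellman error.
    # heapq is a min-heap, so priorities are stored negated.
    queue = [(-abs(backup_value(s) - V[s]), s) for s in P]
    heapq.heapify(queue)

    backups = 0
    while queue and backups < max_backups:
        neg_priority, s = heapq.heappop(queue)
        if -neg_priority < theta:
            break  # the largest pending priority is negligible
        new_value = backup_value(s)
        delta = abs(new_value - V[s])
        V[s] = new_value
        backups += 1
        # A change in V[s] matters to predecessors in proportion to how
        # likely they are to reach s.
        for s_pred, p in preds[s].items():
            priority = p * delta
            if priority >= theta:
                heapq.heappush(queue, (-priority, s_pred))
    return V, backups


# Three-state chain A -> B -> C, with C terminal and a reward of 1 in B.
P = {"A": {"B": 1.0}, "B": {"C": 1.0}, "C": {}}
R = {"A": 0.0, "B": 1.0, "C": 0.0}
V, backups = prioritized_sweeping(P, R, gamma=0.5)
# V == {"A": 0.5, "B": 1.0, "C": 0.0} after two backups
```

In this toy chain the queue backs up B first (the only state with a nonzero initial error) and then its predecessor A, so no backup is spent on states whose values cannot change.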
Learning from demonstration
Advances in Neural Information Processing Systems 9, 1997
Cited by 312 (30 self)
By now it is widely accepted that learning a task from scratch, i.e., without any prior knowledge, is a daunting undertaking. Humans, however, rarely attempt to learn from scratch. They extract initial biases as well as strategies how to approach a learning problem from instructions and/or demonstrations of other humans. For learning control, this paper investigates how learning from demonstration can be applied in the context of reinforcement learning. We consider priming the Q-function, the value function, the policy, and the model of the task dynamics as possible areas where demonstrations can speed up learning. In general nonlinear learning problems, only model-based reinforcement learning shows significant speedup after a demonstration, while in the special case of linear quadratic regulator (LQR) problems, all methods profit from the demonstration. In an implementation of pole balancing on a complex anthropomorphic robot arm, we demonstrate that, when facing the complexities of real signal processing, model-based reinforcement learning offers the most robustness for LQR problems. Using the suggested methods, the robot learns pole balancing in just a single trial after a 30 second long demonstration of the human instructor.
Forward models: Supervised learning with a distal teacher
Cognitive Science, 1992
Cited by 299 (7 self)
Internal models of the environment have an important role to play in adaptive systems in general and are of particular importance for the supervised learning paradigm. In this paper we demonstrate that certain classical problems associated with the notion of the "teacher" in supervised learning can be solved by judicious use of learned internal models as components of the adaptive system. In particular, we show how supervised learning algorithms can be utilized in cases in which an unknown dynamical system intervenes between actions and desired outcomes. Our approach applies to any supervised learning algorithm that is capable of learning in multilayer networks.
Self-improving reactive agents based on reinforcement learning, planning and teaching
Machine Learning, 1992
Cited by 276 (2 self)
To date, reinforcement learning has mostly been studied solving simple learning tasks. Reinforcement learning methods that have been studied so far typically converge slowly. The purpose of this work is thus twofold: 1) to investigate the utility of reinforcement learning in solving much more complicated learning tasks than previously studied, and 2) to investigate methods that will speed up reinforcement learning. This paper compares eight reinforcement learning frameworks: adaptive heuristic critic (AHC) learning due to Sutton, Q-learning due to Watkins, and three extensions to both basic methods for speeding up learning. The three extensions are experience replay, learning action models for planning, and teaching. The frameworks were investigated using connectionism as an approach to generalization. To evaluate the performance of different frameworks, a dynamic environment was used as a testbed. The environment is moderately complex and nondeterministic. This paper describes these frameworks and algorithms in detail and presents empirical evaluation of the frameworks.
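To illustrate how Q-learning and experience replay can fit together, here is a minimal sketch under several assumptions that are not in the abstract: a lookup table stands in for the connectionist (neural network) function approximators the paper uses, transitions are stored as hypothetical `(state, action, reward, next_state, done)` tuples, and stored transitions are replayed newest-first. It is an illustration of the idea, not the paper's implementation.

```python
from collections import defaultdict


def q_learning_update(Q, transition, actions, alpha, gamma):
    """Apply one tabular Q-learning update for a single stored transition."""
    s, a, r, s_next, done = transition
    # No bootstrapping from the successor when the episode has ended.
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


def replay(Q, buffer, actions, alpha=0.5, gamma=0.9, passes=1):
    """Re-apply updates for stored transitions, newest first (a sketch choice)."""
    for _ in range(passes):
        for transition in reversed(buffer):
            q_learning_update(Q, transition, actions, alpha, gamma)


# Two-step episode: state 0 -> state 1 -> terminal state 2, reward 1 at the end.
Q = defaultdict(float)
buffer = [(0, "a", 0.0, 1, False), (1, "a", 1.0, 2, True)]
replay(Q, buffer, actions=["a"], alpha=1.0, gamma=0.5)
# Q[(1, "a")] == 1.0 and Q[(0, "a")] == 0.5 after a single backward pass
```

Replaying backward lets the final reward reach the first state in one pass; replaying the same buffer in forward order would need two passes to do so.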
Acting Optimally in Partially Observable Stochastic Domains
1994
Cited by 275 (16 self)
In this paper, we describe the partially observable Markov decision process (POMDP) approach to finding optimal or near-optimal control strategies for partially observable stochastic environments, given a complete model of the environment. The POMDP approach was originally developed in the operations research community and provides a formal basis for planning problems that have been of interest to the AI community. We found the existing algorithms for computing optimal control strategies to be highly computationally inefficient and have developed a new algorithm that is empirically more efficient. We sketch this algorithm and present preliminary results on several small problems that illustrate important properties of the POMDP approach.
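The abstract does not describe the new algorithm itself, so no attempt is made to reproduce it here. As background, the sketch below shows only the standard Bayesian belief-state update that POMDP methods generally build on, assuming a complete model as the abstract states. The table layouts `T[action][s][s_next]` and `O[action][s_next][observation]` and all names are hypothetical conventions chosen for this example.

```python
def belief_update(belief, T, O, action, observation):
    """Return the posterior belief after taking `action` and seeing `observation`.

    belief[s]              : prior probability of being in state s
    T[action][s][s_next]   : transition probability (assumed layout)
    O[action][s_next][obs] : observation probability (assumed layout)
    """
    n = len(belief)
    unnormalized = [
        O[action][s_next][observation]
        * sum(T[action][s][s_next] * belief[s] for s in range(n))
        for s_next in range(n)
    ]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [x / total for x in unnormalized]


# Two hidden states; action 0 leaves the state unchanged and yields an
# observation that matches the true state with probability 0.85.
T = {0: [[1.0, 0.0], [0.0, 1.0]]}
O = {0: [[0.85, 0.15], [0.15, 0.85]]}
posterior = belief_update([0.5, 0.5], T, O, action=0, observation=0)
# posterior is approximately [0.85, 0.15]
```

Because the agent cannot see the state directly, it plans over such belief vectors, which is what makes computing optimal strategies expensive.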
The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces
Machine Learning, 1995
Cited by 224 (7 self)
Parti-game is a new algorithm for learning feasible trajectories to goal regions in high dimensional continuous state-spaces. In high dimensions it is essential that learning does not plan uniformly over a state-space. Parti-game maintains a decision-tree partitioning of state-space and applies techniques from game-theory and computational geometry to efficiently and adaptively concentrate high resolution only on critical areas. The current version of the algorithm is designed to find feasible paths or trajectories to goal regions in high dimensional spaces. Future versions will be designed to find a solution that optimizes a real-valued criterion. Many simulated problems have been tested, ranging from two-dimensional to nine-dimensional state-spaces, including mazes, path planning, nonlinear dynamics, and planar snake robots in restricted spaces. In all cases, a good solution is found in less than ten trials and a few minutes.
Learning to coordinate behaviors
In Proceedings of AAAI-90, 1990
Cited by 207 (3 self)
We describe an algorithm which allows a behavior-based robot to learn on the basis of positive and negative feedback when to activate its behaviors. In accordance with the philosophy of behavior-based robots, the algorithm is completely distributed: each of the behaviors independently tries to find out (i) whether it is relevant (i.e. whether it is at all correlated to positive feedback) and (ii) what the conditions are under which it becomes reliable (i.e. the conditions under which it maximizes the probability of receiving positive feedback and minimizes the probability of receiving negative feedback). The algorithm has been tested successfully on an autonomous 6-legged robot which had to learn how to coordinate its legs so as to walk forward.
Situation of the Problem
Since 1985, the MIT Mobile Robot group has advocated a radically different architecture for autonomous intelligent agents (Brooks, 1986). Instead of decomposing the architecture into functional modules, such as perception, modeling, and planning (figure 1), the architecture is decomposed into task-achieving modules, also called behaviors (figure 2). This novel approach has already demonstrated to be very successful and similar approaches have become more