Results 1–10 of 52
Reinforcement learning: a survey
Journal of Artificial Intelligence Research, 1996
Cited by 1297 (22 self)
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
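The exploration/exploitation trade-off the survey discusses is most simply handled by epsilon-greedy action selection. A minimal sketch (the function name, value estimates, and parameters below are illustrative, not taken from the survey):

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one.
    q_values, epsilon, and rng are illustrative names, not from the survey."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit

# Toy usage: action 1 has the highest estimated value.
rng = random.Random(0)
q = [0.2, 0.9, 0.4]
actions = [epsilon_greedy(q, 0.1, rng) for _ in range(1000)]
```

With epsilon = 0.1 the greedy action is chosen about 90% of the time while every action keeps a nonzero chance of being tried.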
Simple statistical gradient-following algorithms for connectionist reinforcement learning
Machine Learning, 1992
Cited by 318 (0 self)
This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms, called REINFORCE algorithms, are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates or even storing information from which such estimates could be computed. Specific examples of such algorithms are presented, some of which bear a close relationship to certain existing algorithms while others are novel but potentially interesting in their own right. Also given are results that show how such algorithms can be naturally integrated with backpropagation. We close with a brief discussion of a number of additional issues surrounding the use of such algorithms, including what is known about their limiting behaviors as well as further considerations that might be used to help develop similar but potentially more powerful reinforcement learning algorithms.
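The REINFORCE weight adjustment, theta += alpha * r * grad log pi(a | theta), can be sketched on a two-armed Bernoulli bandit (a toy setup of our own, not one of Williams' examples; no reinforcement baseline is subtracted):

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_bandit(steps=2000, alpha=0.1, seed=0):
    """REINFORCE on a 2-armed Bernoulli bandit (illustrative parameters).
    Update: theta[i] += alpha * r * d/dtheta[i] log pi(a | theta)."""
    rng = random.Random(seed)
    payout = [0.8, 0.2]        # assumed success probability of each arm
    theta = [0.0, 0.0]         # action preferences (softmax parameters)
    for _ in range(steps):
        pi = softmax(theta)
        a = 0 if rng.random() < pi[0] else 1
        r = 1.0 if rng.random() < payout[a] else 0.0
        for i in range(2):     # grad of log softmax: one_hot(a) - pi
            theta[i] += alpha * r * ((1.0 if i == a else 0.0) - pi[i])
    return softmax(theta)

probs = reinforce_bandit()     # policy should now strongly favor arm 0
```

Note that no gradient of the expected reward is ever computed explicitly; the sampled reward times the score function is an unbiased estimate of it, which is the article's central point.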
Reinforcement Learning In Continuous Time and Space
Neural Computation, 2000
Cited by 112 (5 self)
This paper presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and for improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived and their correspondences with the conventional residual gradient, TD(0), and TD(λ) algorithms are shown. For policy improvement, two methods, namely, a continuous actor-critic method and a value-gradient based greedy policy, are formulated. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived....
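The continuous-time TD error mentioned in the abstract can be motivated in a few lines (our reconstruction of the standard derivation, with $\tau$ the discount time constant):

$$V(t) = \int_t^{\infty} e^{-(s-t)/\tau}\, r(x(s))\, ds \quad\Rightarrow\quad \dot V(t) = \frac{1}{\tau} V(t) - r(t),$$

so a consistent value estimate makes the residual

$$\delta(t) \;=\; r(t) - \frac{1}{\tau} V(t) + \dot V(t)$$

vanish. Approximating $\dot V(t) \approx \bigl(V(t) - V(t-\Delta t)\bigr)/\Delta t$ (backward Euler) gives $\Delta t\,\delta(t) = r(t)\,\Delta t + (1 - \Delta t/\tau)\,V(t) - V(t-\Delta t)$, i.e. the conventional TD(0) error with discount factor $\gamma = 1 - \Delta t/\tau$.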
Reinforcement Learning And Its Application To Control
1992
Cited by 51 (2 self)
Learning control involves modifying a controller's behavior to improve its performance as measured by some predefined index of performance (IP). If control actions that improve performance as measured by the IP are known, supervised learning methods, or methods for learning from examples, can be used to train the controller. But when such control actions are not known a priori, appropriate control behavior has to be inferred from observations of the IP. One can distinguish between two classes of methods for training controllers under such circumstances. Indirect methods involve constructing a model of the problem's IP and using the model to obtain training information for the controller. On the other hand, direct, or model-free,...
Advantage Updating
1993
Cited by 44 (0 self)
A new algorithm for reinforcement learning, advantage updating, is proposed. Advantage updating is a direct learning technique; it does not require a model to be given or learned. It is incremental, requiring only a constant amount of calculation per time step, independent of the number of possible actions, possible outcomes from a given action, or number of states. Analysis and simulation indicate that advantage updating is applicable to reinforcement learning systems working in continuous time (or discrete time with small time steps) for which Q-learning is not applicable. Simulation results are presented indicating that for a simple linear quadratic regulator (LQR) problem with no noise and large time steps, advantage updating learns slightly faster than Q-learning. When there is noise or small time steps, advantage updating learns more quickly than Q-learning by a factor of more than 100,000. Convergence properties and implementation issues are discussed. New convergence results...
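The small-time-step failure of Q-learning that motivates advantage updating can be seen from the Bellman relation (a sketch under our reading of the abstract; $\Delta t$ is the time step):

$$Q^*(x,u) = r(x,u)\,\Delta t + \gamma^{\Delta t}\, V^*(x'), \qquad V^*(x) = \max_u Q^*(x,u),$$

so as $\Delta t \to 0$ every $Q^*(x,u)$ collapses to $V^*(x)$ and the action-dependent part of $Q^*$ shrinks like $O(\Delta t)$, drowning in approximation error. Advantage updating instead learns the rescaled quantity $A(x,u) \approx \bigl(Q(x,u) - \max_{u'} Q(x,u')\bigr)/\Delta t$ alongside $V(x)$, keeping comparisons between actions well-conditioned for small $\Delta t$.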
A Comparison of Direct and Model-Based Reinforcement Learning
In International Conference on Robotics and Automation, 1997
Cited by 42 (1 self)
This paper compares direct reinforcement learning (no explicit model) and model-based reinforcement learning on a simple task: pendulum swing up. We find that in this task model-based approaches support reinforcement learning from smaller amounts of training data and efficient handling of changing goals.

1 Introduction

Many proposed reinforcement learning algorithms require large amounts of training data before achieving acceptable performance. This paper explores the training data requirements of two kinds of reinforcement learning algorithms, direct (model-free) and indirect (model-based), when continuous actions are available. Direct reinforcement learning algorithms learn a policy or value function without explicitly representing a model of the controlled system (Sutton et al., 1992). Model-based approaches learn an explicit model of the system simultaneously with a value function and policy (Sutton, 1990, 1991a,b; Barto et al., 1995; Kaelbling et al., 1996). We find that in the p...
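The indirect, model-based idea credited to Sutton (1990) is the Dyna architecture: each real transition also trains a model, and extra planning backups are replayed from that model, so more learning is squeezed out of each real sample. A minimal tabular sketch on a toy chain (our own example; the chain task and all parameters are illustrative, not the paper's pendulum setup):

```python
import random

def dyna_q(n_states=6, episodes=10, planning_steps=20, alpha=0.5, gamma=0.9, seed=0):
    """Dyna-Q sketch: Q-learning plus planning backups replayed from a
    learned deterministic model. Toy chain; reward 1 on reaching the end."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(n_states) for a in (0, 1)}  # 0: left, 1: right
    model = {}                                # (s, a) -> (r, s'), learned from experience

    def step(s, a):                           # the (unknown-to-the-agent) environment
        s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        return (1.0 if s2 == n_states - 1 else 0.0), s2

    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy action selection on the real environment
            a = rng.choice((0, 1)) if rng.random() < 0.1 else max((0, 1), key=lambda x: Q[(s, x)])
            r, s2 = step(s, a)
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, 0)], Q[(s2, 1)]) - Q[(s, a)])
            model[(s, a)] = (r, s2)           # record the transition in the model
            for _ in range(planning_steps):   # planning: replay remembered transitions
                ps, pa = rng.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, 0)], Q[(ps2, 1)]) - Q[(ps, pa)])
            s = s2
    return Q
```

Setting `planning_steps=0` recovers plain direct Q-learning; with planning enabled, far fewer real transitions are needed before the greedy policy points toward the goal, which mirrors the paper's data-efficiency finding.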
Biped dynamic walking using reinforcement learning
Robotics and Autonomous Systems, 1997
Cited by 31 (0 self)
Keywords: biped robot, legged robot.

This paper presents some results from a study of biped dynamic walking using reinforcement learning. During this study a hardware biped robot was built, and a new reinforcement learning algorithm as well as a new learning architecture were developed. The biped learned dynamic walking without any previous knowledge about its dynamic model. The Self Scaling Reinforcement learning algorithm was developed in order to deal with the problem of reinforcement learning in continuous action domains. The learning architecture was developed in order to solve complex control problems. It uses different modules that consist of simple controllers and small neural networks. The architecture allows for easy incorporation of new modules that represent new knowledge, or new requirements for the desired task.
Reinforcement Learning of Motor Skills in High Dimensions: A Path Integral Approach
Cited by 28 (4 self)
Reinforcement learning (RL) is one of the most general approaches to learning control. Its applicability to complex motor systems, however, has been largely impossible so far due to the computational difficulties that reinforcement learning encounters in high-dimensional continuous state-action spaces. In this paper, we derive a novel approach to RL for parameterized control policies based on the framework of stochastic optimal control with path integrals. While solidly grounded in optimal control theory and estimation theory, the update equations for learning are surprisingly simple and have no danger of numerical instabilities, as neither matrix inversions nor gradient learning rates are required. Empirical evaluations demonstrate significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems. Finally, a learning experiment on a robot dog illustrates the functionality of our algorithm in a real-world scenario. We believe that our new algorithm, Policy Improvement with Path Integrals (PI^2), offers currently one of the most efficient, numerically robust, and easy-to-implement algorithms for RL in robotics.
Temporal Difference Learning in Continuous Time and Space
Advances in Neural Information Processing Systems 8, 1996
Cited by 27 (6 self)
A continuous-time, continuous-state version of the temporal difference (TD) algorithm is derived in order to facilitate the application of reinforcement learning to real-world control tasks and neurobiological modeling. An optimal nonlinear feedback control law was also derived using the derivatives of the value function. The performance of the algorithms was tested in a task of swinging up a pendulum with limited torque. Both the "critic" that specifies the paths to the upright position and the "actor" that works as a nonlinear feedback controller were successfully implemented by radial basis function (RBF) networks.

1 Introduction

The temporal-difference (TD) algorithm (Sutton, 1988) for delayed reinforcement learning has been applied to a variety of tasks, such as robot navigation, board games, and biological modeling (Houk et al., 1994). Elucidation of the relationship between TD learning and dynamic programming (DP) has provided good theoretical insights (Barto et al., 1995). How...
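A key correspondence in Doya's continuous-time formulation is that sampling the TD error at interval Δt yields ordinary discrete TD(0) with discount factor γ = 1 − Δt/τ, where τ is the discount time constant (our recollection of the derivation). A tabular toy check (our own illustration, not the paper's RBF-network setup):

```python
def continuous_td_chain(n=5, dt=0.1, tau=1.0, alpha=0.1, episodes=500):
    """Tabular TD(0) with gamma = 1 - dt/tau, the Euler approximation of
    the continuous discount exp(-dt/tau). Deterministic chain: states
    0..n-1, move right each step, reward 1 on entering terminal n-1.
    All parameters are illustrative."""
    gamma = 1.0 - dt / tau
    V = [0.0] * n                                # V[n-1] is terminal, stays 0
    for _ in range(episodes):
        for s in range(n - 1):
            r = 1.0 if s == n - 2 else 0.0       # reward on entering terminal
            V[s] += alpha * (r + gamma * V[s + 1] - V[s])
    return V, gamma

V, gamma = continuous_td_chain()
# V[i] converges toward gamma ** (n - 2 - i), i.e. exponentially
# discounted reward, approximating exp(-(elapsed time) / tau).
```

As dt shrinks, gamma approaches exp(-dt/tau) and the learned values approximate the continuous-time discounted return ever more closely.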