Results 1-10 of 429
Reinforcement Learning I: Introduction
, 1998
Cited by 5500 (120 self)
In which we try to give a basic intuitive sense of what reinforcement learning is and how it differs and relates to other fields, e.g., supervised learning and neural networks, genetic algorithms and artificial life, control theory. Intuitively, RL is trial and error (variation and selection, search) plus learning (association, memory). We argue that RL is the only field that seriously addresses the special features of the problem of learning from interaction to achieve long-term goals.
Reinforcement learning: a survey
 Journal of Artificial Intelligence Research
, 1996
Cited by 1690 (26 self)
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
Least-Squares Policy Iteration
 Journal of Machine Learning Research
, 2003
Cited by 461 (12 self)
We propose a new approach to reinforcement learning for control problems which combines value-function approximation with linear architectures and approximate policy iteration. This new approach
An analysis of temporal-difference learning with function approximation
 IEEE Transactions on Automatic Control
, 1997
Cited by 311 (8 self)
We discuss the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain. The algorithm we analyze updates parameters of a linear function approximator online, during a single endless trajectory of an irreducible aperiodic Markov chain with a finite or infinite state space. We present a proof of convergence (with probability 1), a characterization of the limit of convergence, and a bound on the resulting approximation error. Furthermore, our analysis is based on a new line of reasoning that provides new intuition about the dynamics of temporal-difference learning. In addition to proving new and stronger positive results than those previously available, we identify the significance of online updating and potential hazards associated with the use of nonlinear function approximators. First, we prove that divergence may occur when updates are not based on trajectories of the Markov chain. This fact reconciles positive and negative results that have been discussed in the literature, regarding the soundness of temporal-difference learning. Second, we present an example illustrating the possibility of divergence when temporal-difference learning is used in the presence of a nonlinear function approximator.
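As a rough illustration of the kind of algorithm this abstract describes, the sketch below runs online TD(0) with a linear function approximator along a single simulated trajectory. It is a minimal sketch, not the paper's own formulation: the function name `td0_linear`, its parameters, and the constant step size `alpha` are assumptions made for the example (convergence analyses of this kind typically assume diminishing step sizes).

```python
import random


def td0_linear(features, transition, cost, gamma=0.9, alpha=0.1, steps=5000, seed=0):
    """Online TD(0) with a linear approximator along one simulated trajectory.

    features[s]   -- feature vector of state s (hypothetical input format)
    transition[s] -- row of transition probabilities out of state s
    cost[s]       -- one-step cost incurred in state s
    Returns the weight vector theta, where the estimated cost-to-go of
    state s is the dot product of theta and features[s].
    """
    rng = random.Random(seed)
    n_states = len(features)
    theta = [0.0] * len(features[0])
    state = 0
    for _ in range(steps):
        # Sample the successor state from the chain itself (online updating).
        nxt = rng.choices(range(n_states), weights=transition[state])[0]
        v = sum(w * f for w, f in zip(theta, features[state]))
        v_next = sum(w * f for w, f in zip(theta, features[nxt]))
        # Temporal-difference error for the observed transition.
        delta = cost[state] + gamma * v_next - v
        # Gradient of the linear approximator w.r.t. theta is the feature vector.
        theta = [w + alpha * delta * f for w, f in zip(theta, features[state])]
        state = nxt
    return theta
```

With one-hot features the approximator is exact, so the weights can be checked against the closed-form cost-to-go of a small chain.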
Near-optimal reinforcement learning in polynomial time
 Machine Learning
, 1998
Cited by 305 (5 self)
We present new algorithms for reinforcement learning, and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the Exploration-Exploitation trade-off.
Recent advances in hierarchical reinforcement learning
, 2003
Cited by 225 (25 self)
A preliminary unedited version of this paper was incorrectly published as part of Volume
Tree-based batch mode reinforcement learning
 Journal of Machine Learning Research
, 2005
Cited by 222 (40 self)
Reinforcement learning aims to determine an optimal control policy from interaction with a system or from observations gathered from a system. In batch mode, it can be achieved by approximating the so-called Q-function based on a set of four-tuples (x_t, u_t, r_t, x_{t+1}) where x_t denotes the system state at time t, u_t the control action taken, r_t the instantaneous reward obtained and x_{t+1} the successor state of the system, and by determining the control policy from this Q-function. The Q-function approximation may be obtained from the limit of a sequence of (batch mode) supervised learning problems. Within this framework we describe the use of several classical tree-based supervised learning methods (CART, Kd-tree, tree bagging) and two newly proposed ensemble algorithms, namely extremely and totally randomized trees. We study their performances on several examples and find that the ensemble methods based on regression trees perform well in extracting relevant information about the optimal control policy from sets of four-tuples. In particular, the totally randomized trees give good results while ensuring the convergence of the sequence, whereas by relaxing the convergence constraint even better accuracy results are provided by the extremely randomized trees.
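The batch-mode scheme described in this abstract, a sequence of supervised learning problems over four-tuples, can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the names `fitted_q_iteration`, `fit_mean_regressor` and `greedy_policy` are hypothetical, and a simple per-input averaging table stands in for the tree-based regressors the paper studies.

```python
from collections import defaultdict


def fit_mean_regressor(inputs, targets):
    """Stand-in supervised learner (NOT a tree method): mean target per input."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in zip(inputs, targets):
        sums[x] += y
        counts[x] += 1
    table = {x: sums[x] / counts[x] for x in sums}
    # Inputs never seen in the batch default to 0.0.
    return lambda x: table.get(x, 0.0)


def fitted_q_iteration(four_tuples, actions, gamma=0.9, iterations=50,
                       fit=fit_mean_regressor):
    """Approximate Q from a batch of (x_t, u_t, r_t, x_next) tuples."""
    q = lambda state_action: 0.0  # Q_0 is identically zero
    inputs = [(x, u) for x, u, _, _ in four_tuples]
    for _ in range(iterations):
        # Regression targets use the previous iterate of Q.
        targets = [
            r + gamma * max(q((x_next, a)) for a in actions)
            for _, _, r, x_next in four_tuples
        ]
        q = fit(inputs, targets)
    return q


def greedy_policy(q, actions):
    """Control policy derived from the approximated Q-function."""
    return lambda x: max(actions, key=lambda a: q((x, a)))
```

Any regressor with the same `fit(inputs, targets)` shape could be passed in place of the averaging table; the choice of regressor is what the paper's comparison is about.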
Reinforcement Learning In Continuous Time and Space
 Neural Computation
, 2000
Cited by 176 (7 self)
This paper presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and for improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived and their correspondences with the conventional residual gradient, TD(0), and TD(λ) algorithms are shown. For policy improvement, two methods, namely, a continuous actor-critic method and a value-gradient based greedy policy, are formulated. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived....
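For orientation, a continuous-time TD error of the kind mentioned in this abstract can be sketched as below. The derivation assumes an exponentially discounted value function with a time constant written here as \tau; the symbols \tau, \delta, and \Delta t are assumptions for this sketch and are not taken from the abstract.

```latex
% Assumed discounted value along a trajectory x(t) with reward r(t):
V(x(t)) = \int_{t}^{\infty} e^{-(s-t)/\tau}\, r(s)\, ds
% Differentiating with respect to t gives the consistency condition:
\dot{V}(x(t)) = \frac{1}{\tau} V(x(t)) - r(t)
% The continuous-time TD error is the violation of that condition:
\delta(t) = r(t) - \frac{1}{\tau} V(x(t)) + \dot{V}(x(t))
% Backward Euler, \dot{V}(t) \approx \bigl(V(t) - V(t-\Delta t)\bigr)/\Delta t, yields:
\delta(t) = r(t) + \frac{1}{\Delta t}\left[\left(1 - \frac{\Delta t}{\tau}\right) V(t) - V(t - \Delta t)\right]
```

Up to the scaling by \Delta t, the last line has the same shape as the discrete-time TD(0) error with a discount factor of roughly 1 - \Delta t/\tau.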
Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms
 Machine Learning
, 1998
Cited by 154 (7 self)
An important application of reinforcement learning (RL) is to finite-state control problems and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning that result in convergence to both optimal values and optimal policies.
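A single-step on-policy method with decaying exploration, of the general kind this abstract discusses, might look like the sketch below. It is an illustration only: tabular Sarsa with an epsilon-greedy policy and the schedule epsilon = 1/k is one common choice, and the function `sarsa`, the toy environment `corridor`, and all parameter values are hypothetical.

```python
import random


def corridor(state, action):
    """Hypothetical two-state task: 'left' ends the episode with reward -1,
    'right' advances, and moving right from state 1 reaches the goal (+1).
    Returns (next_state, reward, done)."""
    if action == "left":
        return state, -1.0, True
    if state == 1:
        return state, 1.0, True
    return state + 1, 0.0, False


def sarsa(step, start, actions, episodes=2000, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Sarsa with epsilon-greedy exploration decaying as 1/episode."""
    rng = random.Random(seed)
    q = {}

    def choose(state, epsilon):
        # Explore with probability epsilon, otherwise act greedily.
        if rng.random() < epsilon:
            return rng.choice(actions)
        return max(actions, key=lambda a: q.get((state, a), 0.0))

    for k in range(1, episodes + 1):
        epsilon = 1.0 / k  # decaying exploration
        state = start
        action = choose(state, epsilon)
        done = False
        while not done:
            nxt, reward, done = step(state, action)
            if done:
                target = reward
            else:
                # On-policy: bootstrap from the action actually taken next.
                nxt_action = choose(nxt, epsilon)
                target = reward + gamma * q.get((nxt, nxt_action), 0.0)
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (target - old)
            if not done:
                state, action = nxt, nxt_action
    return q
```

Because the update bootstraps from the exploratory action itself, the learned values depend on the exploration schedule, which is why on-policy methods have to confront exploration directly.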
Reinforcement learning for RoboCup-soccer keepaway
 Adaptive Behavior
, 2005
Cited by 133 (35 self)
RoboCup simulated soccer presents many challenges to reinforcement learning methods, including a large state space, hidden and uncertain state, multiple independent agents learning simultaneously, and long and variable delays in the effects of actions. We describe our application of episodic SMDP Sarsa(λ) with linear tile-coding function approximation and variable λ to learning higher-level decisions in a keepaway subtask of RoboCup soccer. In keepaway, one team, “the keepers,” tries to keep control of the ball for as long as possible despite the efforts of “the takers.” The keepers learn individually when to hold the ball and when to pass to a teammate. Our agents learned policies that significantly outperform a range of benchmark policies. We demonstrate the generality of our approach by applying it to a number of task variations including different field sizes and different numbers of players on each team.