Results 1  10
of
57
Reinforcement learning: a survey
 Journal of Artificial Intelligence Research
, 1996
"... This paper surveys the field of reinforcement learning from a computerscience perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem ..."
Abstract

Cited by 1298 (23 self)
 Add to MetaCart
This paper surveys the field of reinforcement learning from a computerscience perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trialanderror interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding
 Advances in Neural Information Processing Systems 8
, 1996
"... On large problems, reinforcement learning systems must use parameterized function approximators such as neural networks in order to generalize between similar situations and actions. In these cases there are no strong theoretical results on the accuracy of convergence, and computational results have ..."
Abstract

Cited by 355 (18 self)
 Add to MetaCart
On large problems, reinforcement learning systems must use parameterized function approximators such as neural networks in order to generalize between similar situations and actions. In these cases there are no strong theoretical results on the accuracy of convergence, and computational results have been mixed. In particular, Boyan and Moore reported at last year's meeting a series of negative results in attempting to apply dynamic programming together with function approximation to simple control problems with continuous state spaces. In this paper, we present positive results for all the control tasks they attempted, and for one that is significantly larger. The most important differences are that we used sparsecoarsecoded function approximators (CMACs) whereas they used mostly global function approximators, and that we learned online whereas they learned offline. Boyan and Moore and others have suggested that the problems they encountered could be solved by using actual outcomes (...
Prioritized sweeping: Reinforcement learning with less data and less time
 Machine Learning
, 1993
"... We present a new algorithm, Prioritized Sweeping, for e cient prediction and control of stochastic Markov systems. Incremental learning methods such asTemporal Di erencing and Qlearning have fast real time performance. Classical methods are slower, but more accurate, because they make full use of ..."
Abstract

Cited by 316 (5 self)
 Add to MetaCart
We present a new algorithm, Prioritized Sweeping, for e cient prediction and control of stochastic Markov systems. Incremental learning methods such asTemporal Di erencing and Qlearning have fast real time performance. Classical methods are slower, but more accurate, because they make full use of the observations. Prioritized Sweeping aims for the best of both worlds. It uses all previous experiences both to prioritize important dynamic programming sweeps and to guide the exploration of statespace. We compare Prioritized Sweeping with other reinforcement learning schemes for a number of di erent stochastic optimal control problems. It successfully solves large statespace real time problems with which other methods have di culty. 1 1
An analysis of temporaldifference learning with function approximation
 IEEE Transactions on Automatic Control
, 1997
"... We discuss the temporaldifference learning algorithm, as applied to approximating the costtogo function of an infinitehorizon discounted Markov chain. The algorithm weanalyze updates parameters of a linear function approximator online, duringasingle endless trajectory of an irreducible aperiodi ..."
Abstract

Cited by 218 (7 self)
 Add to MetaCart
We discuss the temporaldifference learning algorithm, as applied to approximating the costtogo function of an infinitehorizon discounted Markov chain. The algorithm weanalyze updates parameters of a linear function approximator online, duringasingle endless trajectory of an irreducible aperiodic Markov chain with a finite or infinite state space. We present a proof of convergence (with probability 1), a characterization of the limit of convergence, and a bound on the resulting approximation error. Furthermore, our analysis is based on a new line of reasoning that provides new intuition about the dynamics of temporaldifference learning. In addition to proving new and stronger positive results than those previously available, we identify the significance of online updating and potential hazards associated with the use of nonlinear function approximators. First, we prove that divergence may occur when updates are not based on trajectories of the Markov chain. This fact reconciles positive and negative results that have been discussed in the literature, regarding the soundness of temporaldifference learning. Second, we present anexample illustrating the possibility of divergence when temporaldifference learning is used in the presence of a nonlinear function approximator.
Convergence of Stochastic Iterative Dynamic Programming Algorithms
 Neural Computation
, 1994
"... Increasing attention has recently been paid to algorithms based on dynamic programming (DP) due to the suitability of DP for learning problems involving control. In stochastic environments where the system being controlled is only incompletely known, however, a unifying theoretical account of th ..."
Abstract

Cited by 207 (8 self)
 Add to MetaCart
Increasing attention has recently been paid to algorithms based on dynamic programming (DP) due to the suitability of DP for learning problems involving control. In stochastic environments where the system being controlled is only incompletely known, however, a unifying theoretical account of the behavior of these methods has been missing. In this paper we relate DPbased learning algorithms to powerful techniques of stochastic approximation via a new convergence theorem, enabling us to establish a class of convergent algorithms to which both TD() and Qlearning belong. 1
Reinforcement Learning with Replacing Eligibility Traces
 MACHINE LEARNING
, 1996
"... The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it theoretically, and show that it results in faster, more reliable learning than the conventional ..."
Abstract

Cited by 186 (11 self)
 Add to MetaCart
The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it theoretically, and show that it results in faster, more reliable learning than the conventional trace. Both kinds of trace assign credit to prior events according to how recently they occurred, but only the conventional trace gives greater credit to repeated events. Our analysis is for conventional and replacetrace versions of the offline TD(1) algorithm applied to undiscounted absorbing Markov chains. First, we show that these methods converge under repeated presentations of the training set to the same predictions as two well known Monte Carlo methods. We then analyze the relative efficiency of the two Monte Carlo methods. We show that the method corresponding to conventional TD is biased, whereas the method corresponding to replacetrace TD is unbiased. In addition, we show that t...
Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems
 Advances in Neural Information Processing Systems 7
, 1995
"... Increasing attention has been paid to reinforcement learning algorithms in recent years, partly due to successes in the theoretical analysis of their behavior in Markov environments. If the Markov assumption is removed, however, neither generally the algorithms nor the analyses continue to be usable ..."
Abstract

Cited by 151 (7 self)
 Add to MetaCart
Increasing attention has been paid to reinforcement learning algorithms in recent years, partly due to successes in the theoretical analysis of their behavior in Markov environments. If the Markov assumption is removed, however, neither generally the algorithms nor the analyses continue to be usable. We propose and analyze a new learning algorithm to solve a certain class of nonMarkov decision problems. Our algorithm applies to problems in which the environment is Markov, but the learner has restricted access to state information. The algorithm involves a MonteCarlo policy evaluation combined with a policy improvement method that is similar to that of Markov decision problems and is guaranteed to converge to a local maximum. The algorithm operates in the space of stochastic policies, a space which can yield a policy that performs considerably better than any deterministic policy. Although the space of stochastic policies is continuouseven for a discrete action spaceour algorith...
The linear programming approach to approximate dynamic programming
 Operations Research
, 2001
"... The curse of dimensionality gives rise to prohibitive computational requirements that render infeasible the exact solution of largescale stochastic control problems. We study an efficient method based on linear programming for approximating solutions to such problems. The approach “fits ” a linear ..."
Abstract

Cited by 140 (16 self)
 Add to MetaCart
The curse of dimensionality gives rise to prohibitive computational requirements that render infeasible the exact solution of largescale stochastic control problems. We study an efficient method based on linear programming for approximating solutions to such problems. The approach “fits ” a linear combination of preselected basis functions to the dynamic programming costtogo function. We develop error bounds that offer performance guarantees and also guide the selection of both basis functions and “staterelevance weights ” that influence quality of the approximation. Experimental results in the domain of queueing network control provide empirical support for the methodology. (Dynamic programming/optimal control: approximations/largescale problems. Queues, algorithms: control of queueing networks.)
Learning Without StateEstimation in Partially Observable Markovian Decision Processes
 In Proceedings of the Eleventh International Conference on Machine Learning
, 1994
"... Reinforcement learning (RL) algorithms provide a sound theoretical basis for building learning control architectures for embedded agents. Unfortunately all of the theory and much of the practice (see Barto et al., 1983, for an exception) of RL is limited to Markovian decision processes (MDPs). Many ..."
Abstract

Cited by 128 (5 self)
 Add to MetaCart
Reinforcement learning (RL) algorithms provide a sound theoretical basis for building learning control architectures for embedded agents. Unfortunately all of the theory and much of the practice (see Barto et al., 1983, for an exception) of RL is limited to Markovian decision processes (MDPs). Many realworld decision tasks, however, are inherently nonMarkovian, i.e., the state of the environment is only incompletely known to the learning agent. In this paper we consider only partially observable MDPs (POMDPs), a useful class of nonMarkovian decision processes. Most previous approaches to such problems have combined computationally expensive stateestimation techniques with learning control. This paper investigates learning in POMDPs without resorting to any form of state estimation. We present results about what TD(0) and Qlearning will do when applied to POMDPs. It is shown that the conventional discounted RL framework is inadequate to deal with POMDPs. Finally we develop a new fr...