Results 1-10 of 23
Learning to predict by the methods of temporal differences
 MACHINE LEARNING
, 1988
"... This article introduces a class of incremental learning procedures specialized for prediction – that is, for using past experience with an incompletely known system to predict its future behavior. Whereas conventional predictionlearning methods assign credit by means of the difference between predi ..."
Abstract

Cited by 1226 (45 self)
 Add to MetaCart
This article introduces a class of incremental learning procedures specialized for prediction – that is, for using past experience with an incompletely known system to predict its future behavior. Whereas conventional prediction-learning methods assign credit by means of the difference between predicted and actual outcomes, the new methods assign credit by means of the difference between temporally successive predictions. Although such temporal-difference methods have been used in Samuel's checker player, Holland's bucket brigade, and the author's Adaptive Heuristic Critic, they have remained poorly understood. Here we prove their convergence and optimality for special cases and relate them to supervised-learning methods. For most real-world prediction problems, temporal-difference methods require less memory and less peak computation than conventional methods and they produce more accurate predictions. We argue that most problems to which supervised learning is currently applied are really prediction problems of the sort to which temporal-difference methods can be applied to advantage.
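The core idea above – updating toward the next prediction rather than the final outcome – can be illustrated with a minimal tabular TD(0) sketch. The state names, episode format, and constants below are invented for illustration, not taken from the paper:

```python
def td0(episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0): move each state's value toward the successive
    prediction, reward + gamma * V(next), not toward the final outcome."""
    V = {}
    for episode in episodes:          # episode: list of (state, reward, next_state)
        for s, r, s_next in episode:
            v = V.get(s, 0.0)
            v_next = 0.0 if s_next is None else V.get(s_next, 0.0)
            V[s] = v + alpha * (r + gamma * v_next - v)   # TD(0) update
    return V

# A two-state chain A -> B -> terminal, with reward 1 on termination;
# both values should approach 1 over repeated trials.
values = td0([[("A", 0.0, "B"), ("B", 1.0, None)]] * 200)
```

Each update needs only the current transition, which is the source of the memory and computation advantage claimed in the abstract.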
Policy Gradient Methods for Reinforcement Learning with Function Approximation
, 1999
"... Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by i ..."
Abstract

Cited by 319 (18 self)
 Add to MetaCart
Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams’s REINFORCE method and actor–critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
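A minimal sketch of the policy-gradient idea the paper builds on, using Williams's REINFORCE on a hypothetical two-armed bandit. The arm payoff probabilities, step size, and logistic parameterization are illustrative assumptions, not the paper's setting:

```python
import math
import random

def reinforce_bandit(p_reward, steps=5000, alpha=0.1, seed=0):
    """REINFORCE: adjust a policy parameter along reward * grad log pi(a).
    A softmax policy over two arms reduces to a logistic in one parameter."""
    rng = random.Random(seed)
    theta = 0.0                                   # preference for arm 1 over arm 0
    for _ in range(steps):
        p1 = 1.0 / (1.0 + math.exp(-theta))       # pi(a = 1)
        a = 1 if rng.random() < p1 else 0
        r = 1.0 if rng.random() < p_reward[a] else 0.0
        theta += alpha * r * (a - p1)             # grad log pi(a) = a - p1 here
    return theta

# Arm 1 pays off more often, so the learned preference turns positive.
theta = reinforce_bandit([0.2, 0.8])
```

The policy is improved directly in parameter space; no value function is fit first, which is exactly the alternative the abstract describes.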
Motivated Reinforcement Learning
, 2001
"... The standard reinforcement learning view of the involvement of neuromodulatory systems in instrumental conditioning includes a rather straightforward conception of motivation as prediction of sum future reward. Competition between actions is based on the motivating characteristics of their consequen ..."
Abstract

Cited by 252 (9 self)
 Add to MetaCart
The standard reinforcement learning view of the involvement of neuromodulatory systems in instrumental conditioning includes a rather straightforward conception of motivation as prediction of summed future reward. Competition between actions is based on the motivating characteristics of their consequent states in this sense. Substantial, careful experiments into the neurobiology and psychology of motivation, reviewed by Dickinson & Balleine, show that this view is incomplete. In many cases, animals are faced with the choice not between many different actions at a given state, but rather of whether a single response is worth executing at all. Evidence suggests that the motivational process underlying this choice has different psychological and neural properties from that underlying action choice. We describe and model these motivational systems, and consider the way they interact.
Actor-Critic Algorithms
 SIAM JOURNAL ON CONTROL AND OPTIMIZATION
, 2001
"... In this paper, we propose and analyze a class of actorcritic algorithms. These are twotimescale algorithms in which the critic uses temporal difference (TD) learning with a linearly parameterized approximation architecture, and the actor is updated in an approximate gradient direction based on in ..."
Abstract

Cited by 174 (1 self)
 Add to MetaCart
In this paper, we propose and analyze a class of actor-critic algorithms. These are two-timescale algorithms in which the critic uses temporal difference (TD) learning with a linearly parameterized approximation architecture, and the actor is updated in an approximate gradient direction based on information provided by the critic. We show that the features for the critic should ideally span a subspace prescribed by the choice of parameterization of the actor. We study actor-critic algorithms for Markov decision processes with general state and action spaces. We state and prove two results regarding their convergence.
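A toy sketch of the two-timescale structure: a fast critic tracks the value of the current policy, and a slow actor follows the gradient signal the critic provides. The one-state environment, reward scheme, and step sizes are assumptions chosen for illustration, not the paper's algorithm:

```python
import math
import random

def actor_critic(steps=3000, alpha_critic=0.1, alpha_actor=0.01, seed=1):
    """Two-timescale actor-critic on a one-state, two-action toy problem."""
    rng = random.Random(seed)
    theta, v = 0.0, 0.0                          # actor parameter, critic estimate
    for _ in range(steps):
        p1 = 1.0 / (1.0 + math.exp(-theta))      # softmax policy over two actions
        a = 1 if rng.random() < p1 else 0
        r = 1.0 if a == 1 else 0.0               # action 1 is better by construction
        delta = r - v                            # TD error, used as the advantage
        v += alpha_critic * delta                # critic: fast timescale
        theta += alpha_actor * delta * (a - p1)  # actor: slow timescale
    return theta, v

theta, v = actor_critic()
```

Because the critic learns an order of magnitude faster, the actor effectively sees a nearly converged baseline, which is the separation the two-timescale analysis exploits.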
Learning Without State-Estimation in Partially Observable Markovian Decision Processes
 In Proceedings of the Eleventh International Conference on Machine Learning
, 1994
"... Reinforcement learning (RL) algorithms provide a sound theoretical basis for building learning control architectures for embedded agents. Unfortunately all of the theory and much of the practice (see Barto et al., 1983, for an exception) of RL is limited to Markovian decision processes (MDPs). Many ..."
Abstract

Cited by 128 (5 self)
 Add to MetaCart
Reinforcement learning (RL) algorithms provide a sound theoretical basis for building learning control architectures for embedded agents. Unfortunately all of the theory and much of the practice (see Barto et al., 1983, for an exception) of RL is limited to Markovian decision processes (MDPs). Many real-world decision tasks, however, are inherently non-Markovian, i.e., the state of the environment is only incompletely known to the learning agent. In this paper we consider only partially observable MDPs (POMDPs), a useful class of non-Markovian decision processes. Most previous approaches to such problems have combined computationally expensive state-estimation techniques with learning control. This paper investigates learning in POMDPs without resorting to any form of state estimation. We present results about what TD(0) and Q-learning will do when applied to POMDPs. It is shown that the conventional discounted RL framework is inadequate to deal with POMDPs. Finally we develop a new fr...
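To make the perceptual-aliasing issue concrete: a memoryless learner that applies the Q-learning update directly to observations cannot separate two underlying states that look alike, so their values blur together. This toy contrasts a state-aware learner with an observation-level one; the states, rewards, and parameters are invented for illustration, not from the paper:

```python
import random

def run(aliased, episodes=500, alpha=0.05, eps=0.2, seed=0):
    """One-step episodes: the hidden state is s0 or s1; both appear as
    observation "o" when aliased. Action 0 pays +1 in s0 but -1 in s1;
    action 1 always pays 0 and the episode then ends (no bootstrap term)."""
    rng = random.Random(seed)
    Q = {}
    for _ in range(episodes):
        state = rng.choice(["s0", "s1"])
        obs = "o" if aliased else state
        if rng.random() < eps:
            a = rng.choice([0, 1])
        else:
            a = max([0, 1], key=lambda x: Q.get((obs, x), 0.0))
        r = (1.0 if state == "s0" else -1.0) if a == 0 else 0.0
        q = Q.get((obs, a), 0.0)
        Q[(obs, a)] = q + alpha * (r - q)
    return Q

q_state = run(aliased=False)   # sees true states: action 0 splits into +1 / -1
q_alias = run(aliased=True)    # sees one observation: the two values average out
```

The aliased learner's estimate for action 0 sits near the mixture of the two hidden rewards, illustrating why fixed points of memoryless TD methods need separate analysis in POMDPs.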
Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces
, 1996
"... A key element in the solution of reinforcement learning problems is the value function. The purpose of this function is to measure the longterm utility or value of any given state and it is important because an agent can use it to decide what to do next. A common problem in reinforcement learning w ..."
Abstract

Cited by 92 (6 self)
 Add to MetaCart
A key element in the solution of reinforcement learning problems is the value function. The purpose of this function is to measure the long-term utility or value of any given state and it is important because an agent can use it to decide what to do next. A common problem in reinforcement learning when applied to systems having continuous state and action spaces is that the value function must operate with a domain consisting of real-valued variables, which means that it should be able to represent the value of infinitely many state and action pairs. For this reason, function approximators are used to represent the value function when a closed-form solution of the optimal policy is not available. In this paper, we extend a previously proposed reinforcement learning algorithm so that it can be used with function approximators that generalize the value of individual experiences across both state and action spaces. In particular, we discuss the benefits of using sparse coarse-coded funct...
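A minimal sketch of sparse coarse coding for a continuous variable: each input activates a few overlapping binary features, and a linear function of those features generalizes between nearby inputs. The feature count, receptive-field width, and input range below are illustrative assumptions:

```python
def coarse_code(x, n_features=10, width=0.3, lo=0.0, hi=1.0):
    """Binary coarse coding of a scalar in [lo, hi]: evenly spaced
    receptive fields, each active when x lies within width/2 of its center."""
    centers = [lo + (hi - lo) * i / (n_features - 1) for i in range(n_features)]
    return [1.0 if abs(x - c) <= width / 2 else 0.0 for c in centers]

def linear_value(features, weights):
    """Approximate value: a linear combination of the sparse features."""
    return sum(f * w for f, w in zip(features, weights))

f_a = coarse_code(0.50)
f_b = coarse_code(0.52)   # nearby inputs share active features, so a value
                          # update at one input also adjusts its neighbors
```

Because only a handful of features are active at once, updates stay cheap while still generalizing locally, which is the appeal of sparse coarse coding for continuous spaces.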
Strategy Learning with Multilayer Connectionist Representations
 In Proceedings of the Fourth International Workshop on Machine Learning
, 1987
"... Results are presented that demonstrate the learning and finetuning of search strategies using connectionist mechanisms. Previous studies of strategy learning within the symbolic, productionrule formalism have not addressed finetuning behavior. Here a twolayer connectionist system is presented th ..."
Abstract

Cited by 69 (4 self)
 Add to MetaCart
Results are presented that demonstrate the learning and fine-tuning of search strategies using connectionist mechanisms. Previous studies of strategy learning within the symbolic, production-rule formalism have not addressed fine-tuning behavior. Here a two-layer connectionist system is presented that develops its search from a weak to a task-specific strategy and fine-tunes its performance. The system is applied to a simulated, real-time, balance-control task. We compare the performance of one-layer and two-layer networks, showing that the ability of the two-layer network to discover new features and thus enhance the original representation is critical to solving the balancing task.
Using Eligibility Traces to Find the Best Memoryless Policy in Partially Observable Markov Decision Processes
 In Proceedings of the Fifteenth International Conference on Machine Learning
, 1998
"... Recent research on hiddenstate reinforcement learning (RL) problems has concentrated on overcoming partial observability by using memory to estimate state. However, such methods are computationally extremely expensive and thus have very limited applicability. This emphasis on state estimation has c ..."
Abstract

Cited by 55 (1 self)
 Add to MetaCart
Recent research on hidden-state reinforcement learning (RL) problems has concentrated on overcoming partial observability by using memory to estimate state. However, such methods are computationally extremely expensive and thus have very limited applicability. This emphasis on state estimation has come about because it has been widely observed that the presence of hidden state or partial observability renders popular RL methods such as Q-learning and Sarsa useless. However, this observation is misleading in two ways: first, the theoretical results supporting it only apply to RL algorithms that do not use eligibility traces, and second, these results are worst-case results, which leaves open the possibility that there may be large classes of hidden-state problems in which RL algorithms work well without any state estimation. In this paper we show empirically that Sarsa(λ), a well-known family of RL algorithms that use eligibility traces, can work very well on hidden state problems that ...
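A compact sketch of Sarsa(λ) with replacing eligibility traces on a hypothetical corridor task. The environment, trace style, and constants are illustrative choices, not the paper's benchmarks:

```python
import random

ACTIONS = (1, -1)   # step right, step left

def epsilon_greedy(Q, s, eps, rng):
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def sarsa_lambda(n=5, episodes=300, alpha=0.2, gamma=0.9, lam=0.9,
                 eps=0.1, seed=0):
    """Sarsa(lambda) with replacing traces on a corridor 0..n-1; reward 1
    for reaching the rightmost state. Traces let one TD error update every
    recently visited state-action pair, not just the most recent one."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(n) for a in ACTIONS}
    for _ in range(episodes):
        e = dict.fromkeys(Q, 0.0)                 # eligibility traces
        s = 0
        a = epsilon_greedy(Q, s, eps, rng)
        while True:
            s2 = max(0, min(n - 1, s + a))
            done = s2 == n - 1
            r = 1.0 if done else 0.0
            a2 = epsilon_greedy(Q, s2, eps, rng)
            delta = r + (0.0 if done else gamma * Q[(s2, a2)]) - Q[(s, a)]
            e[(s, a)] = 1.0                       # replacing, not accumulating
            for k in Q:
                Q[k] += alpha * delta * e[k]
                e[k] *= gamma * lam               # traces decay each step
            if done:
                break
            s, a = s2, a2
    return Q

Q = sarsa_lambda()
```

The trace decay `gamma * lam` is what propagates credit several steps back in a single update, which is the mechanism the paper argues mitigates hidden state without explicit state estimation.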
Q-Learning with Hidden-Unit Restarting
 Advances in Neural Information Processing Systems 5
, 1993
"... Platt's resourceallocation network (RAN) (Platt, 1991a, 1991b) is modified for a reinforcementlearning paradigm and to "restart" existing hidden units rather than adding new units. After restarting, units continue to learn via backpropagation. The resulting restart algorithm is tested in a Qlear ..."
Abstract

Cited by 27 (7 self)
 Add to MetaCart
Platt's resource-allocation network (RAN) (Platt, 1991a, 1991b) is modified for a reinforcement-learning paradigm and to "restart" existing hidden units rather than adding new units. After restarting, units continue to learn via backpropagation. The resulting restart algorithm is tested in a Q-learning network that learns to solve an inverted pendulum problem. Solutions are found faster on average with the restart algorithm than without it.

1 Introduction

The goal of supervised learning is the discovery of a compact representation that generalizes well. Such representations are typically found by incremental, gradient-based search, such as error backpropagation. However, in the early stages of learning a control task, we are more concerned with fast learning than a compact representation. This implies a local representation with the extreme being the memorization of each experience. An initially local representation is also advantageous when the learning component is operating in par...
Analytical mean squared error curves in temporal difference learning
, 1997
"... We have calculated analytical expressions for how the bias and variance of the estimators provided by various temporal di erence value estimation algorithms change with o ine updates over trials in absorbing Markovchains using lookup table representations. We illustrate classes of learning curve beh ..."
Abstract

Cited by 22 (3 self)
 Add to MetaCart
We have calculated analytical expressions for how the bias and variance of the estimators provided by various temporal difference value estimation algorithms change with offline updates over trials in absorbing Markov chains using lookup table representations. We illustrate classes of learning curve behavior in various chains, and show the manner in which TD is sensitive to the choice of its step-size and eligibility trace parameters.