Learning to predict by the methods of temporal differences
- MACHINE LEARNING, 1988
Cited by 1060 (33 self)
This article introduces a class of incremental learning procedures specialized for prediction – that is, for using past experience with an incompletely known system to predict its future behavior. Whereas conventional prediction-learning methods assign credit by means of the difference between predicted and actual outcomes, the new methods assign credit by means of the difference between temporally successive predictions. Although such temporal-difference methods have been used in Samuel's checker player, Holland's bucket brigade, and the author's Adaptive Heuristic Critic, they have remained poorly understood. Here we prove their convergence and optimality for special cases and relate them to supervised-learning methods. For most real-world prediction problems, temporal-difference methods require less memory and less peak computation than conventional methods and they produce more accurate predictions. We argue that most problems to which supervised learning is currently applied are really prediction problems of the sort to which temporal-difference methods can be applied to advantage.
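The contrast between the two credit-assignment schemes can be sketched in a few lines. The Python snippet below is an illustration only, not code from the article: the function names and the example numbers are assumptions. For one sequence of predictions that ends in an observed outcome, it computes the supervised errors (outcome minus each prediction) and the temporal-difference errors (next prediction minus current prediction, with the outcome standing in as the final prediction).

```python
import math


def supervised_errors(predictions, outcome):
    """Error of each prediction against the final observed outcome."""
    return [outcome - p for p in predictions]


def td_errors(predictions, outcome):
    """Difference between temporally successive predictions.

    The observed outcome is treated as the prediction that follows the last one.
    """
    successors = list(predictions[1:]) + [outcome]
    return [nxt - p for p, nxt in zip(predictions, successors)]


if __name__ == "__main__":
    predictions = [0.5, 0.25, 0.75]  # hypothetical predictions made at t = 0, 1, 2
    outcome = 1.0                    # hypothetical observed outcome

    sup = supervised_errors(predictions, outcome)
    td = td_errors(predictions, outcome)

    # The TD errors from step t onward add up to the supervised error at step t.
    for t in range(len(predictions)):
        assert math.isclose(sum(td[t:]), sup[t])

    print("supervised:", sup)
    print("temporal-difference:", td)
```

The temporal-difference errors from any step onward sum to the supervised error at that step, which is one way to see how the two kinds of method are related. A TD learner can use each error as soon as the next prediction is available instead of waiting for the outcome.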
Dyna, an Integrated Architecture for Learning, Planning, and Reacting
- WORKING NOTES OF THE 1991 AAAI SPRING SYMPOSIUM, 1991
Cited by 427 (13 self)
Dyna is an AI architecture that integrates learning, planning, and reactive execution. Learning methods are used in Dyna both for compiling planning results and for updating a model of the effects of the agent's actions on the world. Planning is incremental and can use the probabilistic and ofttimes incorrect world models generated by learning processes. Execution is fully reactive in the sense that no planning intervenes between perception and action. Dyna relies on machine learning methods for learning from examples -- these are among the basic building blocks making up the architecture -- yet is not tied to any particular method. This paper briefly introduces Dyna and discusses its strengths and weaknesses with respect to other architectures.
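A minimal sketch of the loop described above, under stated assumptions: the abstract says Dyna is not tied to any particular learning method, so one-step Q-learning is swapped in here as one possible choice (the arrangement usually called Dyna-Q). The five-state corridor world, the parameter values, and all function names are hypothetical and chosen only for illustration.

```python
import random

N_STATES = 5        # hypothetical corridor: states 0..4, state 4 is the goal
ACTIONS = (-1, +1)  # move left or move right


def step(state, action):
    """Toy deterministic world: reward 1.0 on reaching the goal, else 0.0."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return (1.0 if done else 0.0), nxt, done


def q_update(q, s, a, r, s2, done, alpha, gamma):
    """One-step Q-learning backup, applied to real and simulated experience alike."""
    target = r if done else r + gamma * max(q[(s2, b)] for b in ACTIONS)
    q[(s, a)] += alpha * (target - q[(s, a)])


def choose(q, s, epsilon, rng):
    """Reactive action selection: look up current values, no lookahead."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[(s, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(s, a)] == best])


def dyna_q(episodes=200, planning_steps=10, alpha=0.5, gamma=0.9,
           epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}  # learned world model: (state, action) -> (reward, next state, done)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = choose(q, s, epsilon, rng)              # reacting
            r, s2, done = step(s, a)
            q_update(q, s, a, r, s2, done, alpha, gamma)  # learning from real experience
            model[(s, a)] = (r, s2, done)               # updating the world model
            for _ in range(planning_steps):             # incremental planning
                ps, pa = rng.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                q_update(q, ps, pa, pr, ps2, pdone, alpha, gamma)
            s = s2
    return q


if __name__ == "__main__":
    q = dyna_q()
    for s in range(N_STATES - 1):
        print(s, {a: round(q[(s, a)], 3) for a in ACTIONS})
```

Planning here is just the same learning update replayed on transitions drawn from the learned model, so it can be interrupted after any number of steps, and acting never waits for it.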
Motivated Reinforcement Learning
- 2001
Cited by 222 (8 self)
The standard reinforcement learning view of the involvement of neuromodulatory systems in instrumental conditioning includes a rather straightforward conception of motivation as prediction of sum future reward. Competition between actions is based on the motivating characteristics of their consequent states in this sense. Substantial, careful experiments into the neurobiology and psychology of motivation, reviewed in Dickinson & Balleine, show that this view is incomplete. In many cases, animals are faced with the choice not between many different actions at a given state, but rather whether a single response is worth executing at all. Evidence suggests that the motivational process underlying this choice has different psychological and neural properties from that underlying action choice. We describe and model these motivational systems, and consider the way they interact.
Reinforcement Learning And Its Application To Control
- 1992
Cited by 49 (2 self)
Learning control involves modifying a controller's behavior to improve its performance as measured by some predefined index of performance (IP). If control actions that improve performance as measured by the IP are known, supervised learning methods, or methods for learning from examples, can be used to train the controller. But when such control actions are not known a priori, appropriate control behavior has to be inferred from observations of the IP. One can distinguish between two classes of methods for training controllers under such circumstances. Indirect methods involve constructing a model of the problem's IP and using the model to obtain training information for the controller. On the other hand, direct, or model-free,...
TD Models of Reward Predictive Responses in Dopamine Neurons
Cited by 21 (0 self)
This article focuses on recent modeling studies of dopamine neuron activity and their influence on behavior. Activity of midbrain dopamine neurons is phasically increased by stimuli that increase the animal's reward expectation and is decreased below baseline levels when the reward fails to occur. These characteristics resemble the reward prediction error signal of the temporal difference (TD) model, which is a model of reinforcement learning. Computational modeling studies show that such a dopamine-like reward prediction error can serve as a powerful teaching signal for learning with delayed reinforcement, in particular for learning of motor sequences.
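The resemblance described above can be reproduced with a very small TD simulation. The sketch below is an assumption-laden illustration, not a model taken from the article: each time step after cue onset gets its own predicted value, the prediction before the cue is held at zero because cue onset is taken to be unpredictable, and the learning rate, discount factor, and trial length are arbitrary choices.

```python
def run_trial(values, reward, alpha=0.3, gamma=1.0):
    """Run one cue-then-reward trial and return the TD error at each step.

    values[i] is the predicted future reward i steps after cue onset and is
    updated in place. The returned list starts with the error at cue onset
    and ends with the error at the moment the reward is due.
    """
    n = len(values)
    # Pre-cue prediction is fixed at zero (assumption: the cue cannot be anticipated).
    deltas = [gamma * values[0] - 0.0]
    for i in range(n):
        r = reward if i == n - 1 else 0.0          # reward arrives after the last step
        v_next = values[i + 1] if i + 1 < n else 0.0
        delta = r + gamma * v_next - values[i]     # reward prediction error
        values[i] += alpha * delta
        deltas.append(delta)
    return deltas


if __name__ == "__main__":
    values = [0.0] * 5
    first = run_trial(values, reward=1.0)
    for _ in range(200):
        run_trial(values, reward=1.0)
    trained = run_trial(values, reward=1.0)
    omitted = run_trial(values, reward=0.0)

    print("first trial   cue / reward time:", first[0], first[-1])
    print("after training cue / reward time:", trained[0], trained[-1])
    print("reward omitted at reward time:   ", omitted[-1])
```

In this sketch the error appears at reward delivery on the first trial, moves to cue onset once the reward is predicted, and goes negative at the expected reward time when the reward is withheld, which is the pattern the abstract attributes to midbrain dopamine neurons.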
Learning To Do Without Cognition
- In [57
Cited by 5 (0 self)
In this paper we show that a phenomenon in animal learning theory (the outcome devaluation effect), for which there is dispute over whether explicit representations and symbolic reasoning are required for its performance, does not require such things. This is done using a reactive motivational model, previously inspired by ethological thought, to which some simple reinforcement learning rules are attached. An instantiation of the model is used as the control system of an animat in a spatial computer simulation, and it succeeds in learning the necessary parameters to allow the behaviour sequencing system to exhibit the phenomenon.
1 Introduction
How complex can a reactive animat's behaviours get before some begin to appeal for a return to the well established rational techniques in classical artificial intelligence? This paper offers an analysis and performance of a phenomenon in animal learning theory that provokes controversy about the type and complexity of the cognitive machinery ...
The Control of Instrumental Action Following Outcome Devaluation in Young Children Aged Between 1 and 4 Years
Cited by 2 (1 self)
To determine the role of action–outcome learning in the control of young children’s instrumental behavior, the authors trained 18- to 48-month-olds to manipulate visual icons on a touch-sensitive display to obtain different types of video clips as outcomes. Subsequently, one of the outcomes was devalued by repeated exposure, and children’s propensity to perform the trained actions was tested in extinction. On test, children with a mean age greater than 2.5 years performed the action trained with the devalued outcome less than those trained with the still-valued outcome, thereby demonstrating that their actions were mediated by action–outcome learning. By contrast, the instrumental responses of younger children (mean age �2 years) were resistant to outcome devaluation and may have been elicited directly by the icons associated with each response, rather than mediated by a specific action–outcome expectation.
Heuristic Speed-Ups for Learning in Complex Stochastic Environments
- 2005
Cited by 1 (0 self)
We describe a novel methodology by which a software agent can learn to predict future events in complex stochastic environments together with an important heuristic-based acceleration technique for computing the prediction. This speed-up enables us to use much more context in our predictions than was previously possible [Darken, 2005].
Explorations of the Practical Issues of Learning Prediction-Control Tasks Using Temporal Difference Learning Methods
- Master’s thesis, MIT, 1992
-
Cited by 1 (0 self)
- Add to MetaCart
There has been recent interest in using a class of incremental learning algorithms called temporal difference learning methods to attack problems of prediction. These algorithms have been brought to bear on various prediction problems in the past, but have remained poorly understood. It is the purpose of this thesis to further explore this class of algorithms, particularly the TD(λ) algorithm. A number of practical issues are raised and discussed from a general theoretical perspective and then explored in the context of several case studies. The thesis presents a framework for viewing these algorithms independent of the particular task at hand and uses this framework to explore not only tasks of prediction, but also prediction tasks that require control, whether complete or partial. This includes applying the TD(λ) algorithm to two tasks: 1) learning to play tic-tac-toe from the outcome of self-play and the outcome of play against a perfectly-playing opponent and 2) learning two sim...
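For readers unfamiliar with the algorithm named in this abstract, here is a hedged sketch of its usual form for a linear predictor, with weights held fixed during the episode and the accumulated change applied afterwards. It is not taken from the thesis; the function names, feature vectors, and parameter values are assumptions made for illustration.

```python
import math


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def td_lambda_episode(w, features, outcome, alpha, lam):
    """Total weight change from one episode of offline linear TD(lambda).

    features[t] is the feature vector observed at step t; the prediction at
    step t is dot(w, features[t]); the observed outcome plays the role of
    the prediction after the last step.
    """
    preds = [dot(w, x) for x in features] + [outcome]
    trace = [0.0] * len(w)  # eligibility trace: decayed sum of past feature vectors
    dw = [0.0] * len(w)
    for t, x in enumerate(features):
        trace = [lam * e + xi for e, xi in zip(trace, x)]
        delta = preds[t + 1] - preds[t]  # temporal-difference error
        dw = [d + alpha * delta * e for d, e in zip(dw, trace)]
    return dw


def supervised_episode(w, features, outcome, alpha):
    """Total weight change from the outcome-based (Widrow-Hoff style) rule."""
    dw = [0.0] * len(w)
    for x in features:
        err = outcome - dot(w, x)
        dw = [d + alpha * err * xi for d, xi in zip(dw, x)]
    return dw


if __name__ == "__main__":
    w = [0.5, 0.25]                    # hypothetical initial weights
    features = [[1, 0], [0, 1], [1, 0]]  # hypothetical one-hot state features
    td1 = td_lambda_episode(w, features, outcome=1.0, alpha=0.5, lam=1.0)
    sup = supervised_episode(w, features, outcome=1.0, alpha=0.5)
    assert all(math.isclose(a, b) for a, b in zip(td1, sup))
    print("lambda = 1:", td1)
    print("lambda = 0:", td_lambda_episode(w, features, 1.0, 0.5, 0.0))
```

With `lam=1.0` the accumulated change matches the outcome-based supervised rule, while `lam=0.0` credits only the most recent state; intermediate values trade off between the two.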
Approximately as appeared in: Learning and Computational Neuroscience: Foundations of Adaptive Networks, M. Gabriel and J. Moore, Eds., pp. 497--537. MIT Press, 1990.
- Learning and Computational Neuroscience: Foundations of Adaptive Networks, 1990
this paper, however, we analyze it from the point of view of animal learning theory. Our intended audience is both animal learning researchers interested in computational theories of behavior and machine learning researchers interested in how their learning algorithms relate to, and may be constrained by, animal learning studies. For an exposition of the TD model from an engineering point of view, see Chapter 13 of this volume

