Results 1–10 of 13
Long-Term Reward Prediction in TD Models of the Dopamine System
, 2002
Abstract

Cited by 26 (2 self)
This article addresses the relationship between long-term reward predictions and slow-timescale neural activity in temporal difference (TD) models of the dopamine system. Such models attempt to explain how the activity of dopamine (DA) neurons relates to errors in the prediction of future rewards. Previous models have been mostly restricted to short-term predictions of rewards expected during a single, somewhat artificially defined trial. Also, the models focused exclusively on the phasic pause-and-burst activity of primate DA neurons; the neurons' slower, tonic background activity was assumed to be constant. This has led to difficulty in explaining the results of neurochemical experiments that measure indications of DA release on a slow timescale, results that seem at first glance inconsistent with a reward prediction model. In this article, we investigate a TD model of DA activity modified so as to enable it to make longer-term predictions about rewards expected far in the future. We show that these predictions manifest themselves as slow changes in the baseline error signal, which we associate with tonic DA activity. Using this model, we make new predictions about the behavior of the DA system in a number of experimental situations. Some of these predictions suggest new computational explanations for previously puzzling data, such as indications from microdialysis studies of elevated DA activity triggered by aversive events.
Performance loss bounds for approximate value iteration with state aggregation
 Mathematics of Operations Research
, 2005
Abstract

Cited by 14 (1 self)
We consider approximate value iteration with a parameterized approximator in which the state space is partitioned and the optimal cost-to-go function over each partition is approximated by a constant. We establish performance loss bounds for policies derived from approximations associated with fixed points. These bounds identify benefits to using invariant distributions of appropriate policies as projection weights. Such projection weighting relates to what is done by temporal-difference learning. Our analysis also leads to the first performance loss bound for approximate value iteration with an average-cost objective. Key words: approximate value iteration; state aggregation; temporal-difference learning
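The fixed-point scheme the abstract describes can be sketched in a few lines: back up exactly at every state, then project the result onto the piecewise-constant architecture using projection weights. The chain, partition, and uniform weights below are illustrative choices, not from the paper:

```python
import numpy as np

# Toy single-action chain with 4 states; state 3 is absorbing and cost-free.
P = np.array([[0.9, 0.1, 0.0, 0.0],
              [0.0, 0.9, 0.1, 0.0],
              [0.0, 0.0, 0.9, 0.1],
              [0.0, 0.0, 0.0, 1.0]])
c = np.array([1.0, 1.0, 1.0, 0.0])     # per-state cost
gamma = 0.9

# Partition the states into two aggregates; the cost-to-go function is
# approximated by one constant per partition.
member = np.array([0, 0, 1, 1])        # partition index of each state
w = np.array([0.5, 0.5, 0.5, 0.5])     # projection weights over states

theta = np.zeros(2)                    # one parameter per partition
for _ in range(500):
    V = theta[member]                  # piecewise-constant value function
    TV = c + gamma * (P @ V)           # exact Bellman backup at each state
    # Project TV back onto the aggregated architecture: each parameter
    # becomes the w-weighted average of TV over its partition.
    theta = np.array([np.average(TV[member == k], weights=w[member == k])
                      for k in range(2)])
```

With these uniform weights the iteration converges to theta ≈ (8.45, 5.0); the paper's bounds concern how the choice of w (in particular, using an invariant distribution) affects the quality of policies derived from such fixed points.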
Prospective and Retrospective Temporal Difference Learning
Abstract

Cited by 3 (1 self)
A striking recent finding is that monkeys behave maladaptively in a class of tasks in which they know that reward is going to be systematically delayed. This may be explained by a malign Pavlovian influence arising from states with low predicted values. However, by very carefully analyzing behavioural data from such tasks, La Camera & Richmond (PLoS Computational Biology, doi:10.1371/journal.pcbi.1000131) observed the additional important characteristic that subjects perform differently on states in the task that are equal distances from the future reward, depending on what has happened in the recent past. The authors pointed out that this violates the definition of state value in the standard reinforcement learning models that are ubiquitous as accounts of operant and classical conditioned behavior; they suggested and analyzed an alternative temporal difference model in which past and future are melded. Here, we show that, in fact, a standard temporal difference model can exhibit the same behavior, and that this avoids deleterious consequences for choice. At the heart of the model is the average reward per step, which acts as a baseline for measuring immediate rewards. Relatively subtle changes to this baseline occasioned by the past can markedly influence predictions and thus behavior.
Author Summary: When monkeys perform a sequence of identical tasks before getting a reward, they have been found to make many errors when they know the reward is far away. Oddly, these errors do not depend only on the number of trials to the future reward; they also depend, retrospectively, on the length of the sequence. A recent account of this result proposes a heuristic modification to an otherwise normative model. Here, we show an alternative way that the retrospective dependence could arise in a model based on estimating the average reward per step.
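The average-reward baseline mechanism at the heart of this account can be sketched as an average-reward TD(0) learner. The two-state cycle below is a made-up minimal task, not the paper's experiment:

```python
import numpy as np

# Two states alternate deterministically; reward 1.0 arrives only on
# re-entering state 0, so the true average reward per step is 0.5.
def step(s):
    s_next = 1 - s
    return s_next, (1.0 if s_next == 0 else 0.0)

V = np.zeros(2)          # differential (relative) state values
rho = 0.0                # running estimate of average reward per step
alpha, beta = 0.1, 0.01  # value and baseline learning rates

s = 0
for _ in range(20000):
    s_next, r = step(s)
    delta = r - rho + V[s_next] - V[s]  # TD error with average-reward baseline
    V[s] += alpha * delta
    rho += beta * delta                 # the baseline tracks recent history
    s = s_next
```

Because rho is a running quantity, recent history shifts the baseline and hence the effective prediction at a state, which is the kind of retrospective dependence the entry discusses.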
NEW REPRESENTATIONS AND APPROXIMATIONS FOR SEQUENTIAL DECISION MAKING UNDER UNCERTAINTY
, 2007
Abstract

Cited by 2 (1 self)
This dissertation research addresses the challenge of scaling up algorithms for sequential decision making under uncertainty. In my dissertation, I developed new approximation strategies for planning and learning in the presence of uncertainty while maintaining useful theoretical properties that allow larger problems to be tackled than is practical with exact methods. In particular, my research tackles three outstanding issues in sequential decision making in uncertain environments: performing stable generalization during off-policy updates, balancing exploration with exploitation, and handling partial observability of the environment. The first key contribution of my thesis is the development of novel dual representations and algorithms for planning and learning in stochastic environments. This dual view I have developed offers a coherent and comprehensive approach to optimal sequential decision making problems, provides an alternative to standard value function based techniques, and opens new avenues for solving sequential decision making problems. In particular, I have shown that dual dynamic program
Hyperbolically discounted temporal difference learning. Neural Comput
 Science
, 2010
Abstract

Cited by 2 (1 self)
Hyperbolic discounting of future outcomes is widely observed to underlie choice behavior in animals. Additionally, recent studies (Kobayashi & Schultz, 2008) have reported that hyperbolic discounting is observed even in neural systems underlying choice. However, the most prevalent models of temporal discounting, such as temporal difference learning, assume that future outcomes are discounted exponentially. Exponential discounting has been preferred largely because it can be expressed recursively, whereas hyperbolic discounting has heretofore been thought not to have a recursive definition. In this letter, we define a learning algorithm, hyperbolically discounted temporal difference (HDTD) learning, which constitutes a recursive formulation of the hyperbolic model.
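The HDTD algorithm itself is not reproduced here, but the mathematical fact that makes a recursive formulation plausible can be checked numerically: a hyperbolic discount 1/(1+t) is exactly a uniform mixture over γ ∈ [0, 1] of exponential discounts γ^t, and each exponential component admits the usual recursive TD formulation. This is an illustration of the mixture identity, not the paper's algorithm:

```python
import numpy as np

# Check numerically that the integral over gamma in [0, 1] of gamma**t
# equals 1/(1+t): a hyperbolic discount is a uniform mixture of
# exponential discounts, each of which is recursive.
gammas = np.linspace(0.0, 1.0, 20001)     # dense grid over [0, 1]
errors = []
for t in [0, 1, 2, 5, 10]:
    mixture = (gammas ** t).mean()        # grid average of gamma**t
    errors.append(abs(mixture - 1.0 / (1.0 + t)))
max_error = max(errors)                   # should be near zero
```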
Hyperbolically Discounted Temporal Difference Learning (letter communicated by David S. Touretzky)
Abstract
Hyperbolic discounting of future outcomes is widely observed to underlie choice behavior in animals. Additionally, recent studies (Kobayashi & Schultz, 2008) have reported that hyperbolic discounting is observed even in neural systems underlying choice. However, the most prevalent models of temporal discounting, such as temporal difference learning, assume that future outcomes are discounted exponentially. Exponential discounting has been preferred largely because it can be expressed recursively, whereas hyperbolic discounting has heretofore been thought not to have a recursive definition. In this letter, we define a learning algorithm, hyperbolically discounted temporal difference (HDTD) learning, which constitutes a recursive formulation of the hyperbolic model.
TD(0) Leads to Better Policies than Approximate Value Iteration
Abstract
We consider approximate value iteration with a parameterized approximator in which the state space is partitioned and the optimal cost-to-go function over each partition is approximated by a constant. We establish performance loss bounds for policies derived from approximations associated with fixed points. These bounds identify benefits to having projection weights equal to the invariant distribution of the resulting policy. Such projection weighting leads to the same fixed points as TD(0). Our analysis also leads to the first performance loss bound for approximate value iteration with an average-cost objective.
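For reference, the TD(0) update the title refers to is the one-step bootstrapped backup. A minimal tabular sketch on a hypothetical two-step chain (not an example from the paper):

```python
import numpy as np

gamma, alpha = 0.9, 0.1
V = np.zeros(3)                       # V[2] is terminal and stays 0

# Deterministic chain 0 -> 1 -> 2 (terminal); reward 1.0 on the last step,
# so the true values are V[1] = 1.0 and V[0] = gamma * 1.0 = 0.9.
transitions = [(0, 0.0, 1), (1, 1.0, 2)]

for _ in range(500):                  # repeated episodes
    for s, r, s_next in transitions:
        delta = r + gamma * V[s_next] - V[s]   # one-step TD error
        V[s] += alpha * delta                  # move V[s] toward the target
```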
Representation and Timing in Theories of the Dopamine System (letter communicated by Mark Ungless)
Abstract
Although the responses of dopamine neurons in the primate midbrain are well characterized as carrying a temporal difference (TD) error signal for reward prediction, existing theories do not offer a credible account of how the brain keeps track of past sensory events that may be relevant to predicting future reward. Empirically, these shortcomings of previous theories are particularly evident in their account of experiments in which animals were exposed to variation in the timing of events. The original theories mispredicted the results of such experiments due to their use of a representational device called a tapped delay line. Here we propose that a richer understanding of history representation and a better account of these experiments can be given by considering TD algorithms for a formal setting that incorporates two features not originally considered in theories of the dopaminergic response: partial observability (a distinction between the animal’s sensory experience and the true underlying state of the world) and semi-Markov dynamics (an explicit
Dynamic Routing and Wavelength Assignment in WDM Optical Networks Using Neuro-Dynamic Programming
Abstract
We consider a dynamic call admission/routing and wavelength assignment (CA/RWA) problem in wavelength division multiplexed optical networks. The goal in CA/RWA is taken to be minimization of the total blocking rate in the network. A dynamic programming problem is formulated under the assumption of memoryless call interarrival and holding times. Exact solution of this problem is typically not feasible; therefore, neuro-dynamic programming (NDP) is used to obtain approximate solutions. A linear approximation architecture is employed together with the TD(0) parameter-training algorithm. The features used in NDP are borrowed from previously proposed heuristic rules for the RWA problem. Simulations indicate that NDP-based solutions provide significant reductions in the blocking rate in comparison to the heuristics. The numerical results also verify the flexible and powerful nature of the NDP approach employed in the CA/RWA problem.
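The linear-architecture-plus-TD(0) training the abstract describes can be sketched generically. The one-dimensional state, features, reward, and dynamics below are made up for illustration; the paper's actual features come from RWA heuristics:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(s):
    return np.array([1.0, s])          # linear features: bias + state

theta = np.zeros(2)                    # weights of the linear architecture
gamma, alpha = 0.9, 0.02

# Toy dynamics: the next "state" (e.g. a scalar load) is drawn uniformly,
# independent of the current one, and the per-step reward equals the state.
s = 0.5
for _ in range(50000):
    s_next = rng.uniform(0.0, 1.0)
    r = s
    # Semi-gradient TD(0) update of the linear value estimate phi(s) @ theta.
    delta = r + gamma * (phi(s_next) @ theta) - phi(s) @ theta
    theta += alpha * delta * phi(s)
    s = s_next
```

Because the next state here is independent of the current one, the true value function is V(s) = s + gamma·E[V] = s + 4.5, which these features can represent exactly, so theta should settle near (4.5, 1.0).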