Results 1–10 of 3,023
On average versus discounted reward temporal-difference learning
Machine Learning, 2002
"... Abstract. We provide an analytical comparison between discounted and average reward temporal-difference (TD) learning with linearly parameterized approximations. We first consider the asymptotic behavior of the two algorithms. We show that as the discount factor approaches 1, the value function prod ..."
Cited by 13 (2 self)
A Neurocomputational Model for Cocaine Addiction (Letter communicated by A. David Redish)
"... Based on the dopamine hypotheses of cocaine addiction and the assumption of decrement of brain reward system sensitivity after long-term drug exposure, we propose a computational model for cocaine addiction. Utilizing average reward temporal difference reinforcement learning, we incorporate the el ..."
Policy gradient methods for reinforcement learning with function approximation. In NIPS, 1999
"... Abstract Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly repres ..."
Cited by 439 (20 self)
"... output is action selection probabilities, and whose weights are the policy parameters. Let θ denote the vector of policy parameters and ρ the performance of the corresponding policy (e.g., the average reward per step). Then, in the policy gradient approach, the policy parameters are updated approximately ..."
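The update described in this excerpt, adjusting θ in the direction of the gradient of the performance ρ, can be illustrated with a minimal REINFORCE-style sketch. The two-armed bandit, step size, and arm rewards below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Minimal policy-gradient sketch: a softmax policy over per-action
# preferences theta, updated along an unbiased estimate of the
# performance gradient (reward times grad of log-probability).

rng = np.random.default_rng(0)
theta = np.zeros(2)            # policy parameters
alpha = 0.1                    # step size
true_means = [0.2, 0.8]        # hypothetical arm rewards (an assumption)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = rng.normal(true_means[a], 0.1)
    grad_log = -probs                  # grad of log pi(a) for a softmax policy
    grad_log[a] += 1.0
    theta += alpha * r * grad_log      # ascend the performance gradient

print(softmax(theta))  # probability mass should concentrate on the better arm
```

Because the expected update equals α∇ρ, the sampled updates drift toward the higher-reward arm even though individual steps are noisy.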
Average cost temporal-difference learning, 1999
"... We propose a variant of temporal-difference learning that approximates average and differential costs of an irreducible aperiodic Markov chain. Approximations are comprised of linear combinations of fixed basis functions whose weights are incrementally updated during a single endless trajectory of t ..."
Cited by 27 (4 self)
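The scheme this abstract describes can be sketched concretely: alongside a weight vector for the differential cost, the algorithm tracks a running estimate of the average cost, both updated incrementally along one long trajectory. The two-state chain, basis functions, and step sizes below are illustrative assumptions.

```python
import numpy as np

# Sketch of average-cost TD(0) with linear approximation: mu estimates the
# average cost per step; w parameterizes the differential cost. Both are
# updated from a single endless trajectory of the chain.

rng = np.random.default_rng(1)
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])         # irreducible, aperiodic 2-state chain
cost = np.array([1.0, 3.0])        # per-state costs (an assumption)
phi = np.eye(2)                    # fixed basis functions (here: tabular)

w = np.zeros(2)                    # differential-cost weights
mu = 0.0                           # average-cost estimate
s = 0
for _ in range(100_000):
    s_next = 1 if rng.random() < P[s, 1] else 0
    c = cost[s]
    mu += 0.001 * (c - mu)                         # average-cost update
    delta = c - mu + w @ phi[s_next] - w @ phi[s]  # differential TD error
    w += 0.01 * delta * phi[s]                     # TD update of the weights
    s = s_next

print(mu)  # stationary distribution is (2/7, 5/7), so average cost ~ 17/7
```

The differential-cost weights are only determined up to an additive constant, which is why the average-cost estimate mu is tracked separately.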
Temporal-difference networks. In Advances in Neural Information Processing Systems 17, 2005
"... We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set ..."
Cited by 44 (8 self)
"... world knowledge in entirely predictive, grounded terms. Temporal-difference (TD) learning is widely used in reinforcement learning methods to learn moment-to-moment predictions of total future reward (value functions). In this setting, TD learning is often simpler and more data-efficient than other ..."
Conditional skewness in asset pricing tests. Journal of Finance, 2000
"... If asset returns have systematic skewness, expected returns should include rewards for accepting this risk. We formalize this intuition with an asset pricing model that incorporates conditional skewness. Our results show that conditional skewness helps explain the cross-sectional variation of expect ..."
Cited by 342 (6 self)
Temporal difference models and reward-related learning in the human brain. Neuron, 2003
"... John P. O’Doherty, V(tUCS) and V(tUCS − 1) generates a positive prediction error that, in the simplest form of TD learning, is used to increment the value at time tUCS − 1 (in proportion to ..."
Cited by 200 (11 self)
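The mechanism in this excerpt (a positive prediction error at reward delivery that increments the value of the preceding time step) can be sketched with a minimal tabular TD(0) toy. The trial structure, learning rate, and reward below are illustrative assumptions, not taken from the paper.

```python
# Tabular TD(0) sketch of the prediction-error idea: a reward (UCS) at a
# fixed within-trial time step produces a positive error that increments
# the value of the preceding step, and over trials the prediction
# propagates back to earlier steps.

T = 6                 # time steps per trial; reward 1.0 on the final transition
V = [0.0] * T         # value of each within-trial time step
alpha, gamma = 0.1, 1.0

for trial in range(500):
    for t in range(T - 1):
        r = 1.0 if t == T - 2 else 0.0        # UCS on the last transition
        delta = r + gamma * V[t + 1] - V[t]   # TD prediction error
        V[t] += alpha * delta                 # increment value at time t

print(V)  # values at earlier steps approach 1.0 as the prediction propagates back
```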
R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning, 2001
"... R-max is a very simple model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time. In R-max, the agent always maintains a complete, but possibly inaccurate model of its environment and acts based on the optimal policy derived from this model. The mod ..."
Cited by 297 (10 self)
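The idea summarized in this abstract (maintain a complete model, treat insufficiently visited state-action pairs optimistically, and act on the optimal policy of that model) can be sketched on a toy MDP. The environment, visit threshold, and discounting below are illustrative assumptions; the paper's analysis is in the average-reward setting, while this sketch uses a discounted planner for simplicity.

```python
import numpy as np

# Toy R-max-style sketch: unknown (state, action) pairs are modeled as
# yielding the maximal reward R_MAX forever (a fictitious absorbing state),
# which drives systematic exploration; known pairs use the empirical model.

R_MAX, K, GAMMA = 1.0, 5, 0.95
N_STATES, N_ACTIONS = 3, 2

# Hypothetical true environment: deterministic transitions, one rewarding pair.
true_next = np.array([[1, 2], [2, 0], [0, 1]])
true_reward = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])

counts = np.zeros((N_STATES, N_ACTIONS), dtype=int)
r_sum = np.zeros((N_STATES, N_ACTIONS))
next_obs = np.zeros((N_STATES, N_ACTIONS), dtype=int)

def plan():
    """Value iteration on the optimistic empirical model."""
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(300):
        V = Q.max(axis=1)
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                if counts[s, a] >= K:   # known pair: empirical reward + transition
                    Q[s, a] = r_sum[s, a] / counts[s, a] + GAMMA * V[next_obs[s, a]]
                else:                   # unknown pair: optimistic fictitious value
                    Q[s, a] = R_MAX / (1.0 - GAMMA)
    return Q

s = 0
for _ in range(400):
    a = int(plan()[s].argmax())         # act greedily w.r.t. the optimistic model
    counts[s, a] += 1
    r_sum[s, a] += true_reward[s, a]
    next_obs[s, a] = true_next[s, a]
    s = true_next[s, a]

print(counts)   # optimism drives every pair to be visited at least K times
```

Because unvisited pairs look maximally rewarding, the greedy policy either exploits or is steered toward an unknown pair, which is the implicit explore-or-exploit guarantee the abstract alludes to.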
The reward circuit: linking primate anatomy and human imaging. Neuropsychopharmacology, 2010
"... Although cells in many brain regions respond to reward, the cortical-basal ganglia circuit is at the heart of the reward system. The key structures in this network are the anterior cingulate cortex, the orbital prefrontal cortex, the ventral striatum, the ventral pallidum, and the midbrain dopamine ..."
Cited by 220 (3 self)
"... between these areas forms a complex neural network that mediates different aspects of reward processing. Advances in neuroimaging techniques allow better spatial and temporal resolution. These studies now demonstrate that human functional and structural imaging results map increasingly close to primate ..."