Results 1 - 10 of 1,345
Behavioral considerations suggest an average reward TD model ... - Neurocomputing, 2000
"... Recently there has been much interest in modeling the activity of primate midbrain dopamine neurons as signalling reward prediction error. But since the models are based on temporal difference (TD) learning, they assume an exponential decline with time in the value of delayed reinforcers, an assumpt ..."
Abstract
-
Cited by 9 (2 self)
- Add to MetaCart
, an assumption long known to conflict with animal behavior. We show that a variant of TD learning that tracks variations in the average reward per timestep rather than cumulative discounted reward preserves the models' success at explaining neurophysiological data while significantly increasing
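A minimal sketch of the variant described in the snippet above, assuming a tabular value function: the discount factor is dropped and a running estimate of the average reward per timestep is subtracted from each reward before computing the prediction error. The names (V, avg_reward, alpha, beta) are illustrative, not taken from the paper.

```python
# Sketch of an average-reward TD(0) update (illustrative, not the paper's code).
def average_reward_td_update(V, s, r, s_next, avg_reward, alpha=0.1, beta=0.01):
    """One average-reward TD(0) step.

    Instead of discounting, the running estimate of the average reward per
    timestep is subtracted from the immediate reward, so delayed reinforcers
    are not devalued exponentially.
    """
    delta = r - avg_reward + V[s_next] - V[s]   # prediction error
    V[s] += alpha * delta                       # update the value estimate
    avg_reward += beta * delta                  # track the mean reward per step
    return V, avg_reward
```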
On average versus discounted reward temporal-difference learning - Machine Learning, 2002
"... Abstract. We provide an analytical comparison between discounted and average reward temporal-difference (TD) learning with linearly parameterized approximations. We first consider the asymptotic behavior of the two algorithms. We show that as the discount factor approaches 1, the value function prod ..."
Cited by 13 (2 self)
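The asymptotic comparison referred to in the snippet is usually stated through the standard decomposition of the discounted value function; as a textbook illustration (not the paper's exact result), for an ergodic chain with average reward rho and bias h:

```latex
% Decomposition of the discounted value function as the discount factor gamma -> 1:
% rho is the average reward per step, h is the differential (bias) value function.
V_\gamma(s) \;=\; \frac{\rho}{1-\gamma} \;+\; h(s) \;+\; o(1) \qquad \text{as } \gamma \to 1,
% so the state-dependent part of the discounted values converges to the bias h,
% which is the sense in which the discounted and average-reward algorithms can be compared.
```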
Policy gradient methods for reinforcement learning with function approximation - In NIPS, 1999
"... Abstract Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly repres ..."
Cited by 439 (20 self)
output is action selection probabilities, and whose weights are the policy parameters. Let θ denote the vector of policy parameters and ρ the performance of the corresponding policy (e.g., the average reward per step). Then, in the policy gradient approach, the policy parameters are updated approximately
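A hedged sketch of the update the snippet describes, with theta the policy parameters and the performance gradient estimated from a sampled trajectory; the linear-softmax policy and the REINFORCE-style estimator here are illustrative choices, not the construction used in the paper.

```python
import numpy as np

def softmax_policy(theta, phi):
    """phi: (n_actions, n_params) feature matrix; returns action probabilities."""
    prefs = phi @ theta
    prefs -= prefs.max()                # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def policy_gradient_step(theta, trajectory, step_size=0.01):
    """trajectory: list of (phi, action, reward) tuples.

    Nudges theta along a sampled estimate of d(performance)/d(theta),
    here a simple likelihood-ratio (REINFORCE-style) estimate.
    """
    rewards = [r for _, _, r in trajectory]
    returns = np.cumsum(rewards[::-1])[::-1]        # reward-to-go from each step
    grad = np.zeros_like(theta)
    for (phi, a, _), G in zip(trajectory, returns):
        probs = softmax_policy(theta, phi)
        grad += (phi[a] - probs @ phi) * G          # grad log pi(a|s) for linear softmax
    return theta + step_size * grad
```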
Conditional skewness in asset pricing tests - Journal of Finance, 2000
"... If asset returns have systematic skewness, expected returns should include rewards for accepting this risk. We formalize this intuition with an asset pricing model that incorporates conditional skewness. Our results show that conditional skewness helps explain the cross-sectional variation of expect ..."
Cited by 342 (6 self)
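As a rough illustration of the kind of pricing relation the snippet refers to (a generic conditional three-moment form, not necessarily the authors' exact specification), expected excess returns carry premia for both covariance and coskewness with the market:

```latex
% Generic conditional three-moment pricing relation (illustrative form only):
E_t\!\left[ r_{i,t+1} \right]
  \;=\; \lambda_{1,t}\, \operatorname{cov}_t\!\left( r_{i,t+1},\, r_{M,t+1} \right)
  \;+\; \lambda_{2,t}\, \operatorname{cov}_t\!\left( r_{i,t+1},\, r_{M,t+1}^{2} \right),
% where r_i and r_M are excess returns on the asset and the market, and the
% second term prices conditional coskewness (systematic skewness) risk.
```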
R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning, 2001
"... R-max is a very simple model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time. In R-max, the agent always maintains a complete, but possibly inaccurate model of its environment and acts based on the optimal policy derived from this model. The mod ..."
Cited by 297 (10 self)
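A compact sketch of the optimism that drives the algorithm in the snippet: state-action pairs with too few observations are modelled as paying the maximum reward R_max, so the optimal policy of the (possibly inaccurate) model either earns near-optimal reward or quickly visits unknown pairs. The class name, visit threshold, and bookkeeping below are illustrative, not the paper's notation.

```python
from collections import defaultdict

class RMaxSketch:
    """Illustrative R-max-style bookkeeping (not the paper's pseudocode)."""

    def __init__(self, r_max, known_threshold=10):
        self.r_max = r_max
        self.known_threshold = known_threshold
        self.counts = defaultdict(int)                             # visits to (s, a)
        self.reward_sum = defaultdict(float)                       # summed reward for (s, a)
        self.next_counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> s' counts

    def is_known(self, s, a):
        return self.counts[(s, a)] >= self.known_threshold

    def model_reward(self, s, a):
        # Unknown pairs are optimistically assumed to pay r_max,
        # which is what pushes the derived policy to explore them.
        if not self.is_known(s, a):
            return self.r_max
        return self.reward_sum[(s, a)] / self.counts[(s, a)]

    def observe(self, s, a, r, s_next):
        self.counts[(s, a)] += 1
        self.reward_sum[(s, a)] += r
        self.next_counts[(s, a)][s_next] += 1
        # Whenever a pair crosses the threshold, the agent would re-solve the
        # optimistic model (e.g. by value iteration) and act greedily on it.
```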
Temporal difference models and reward-related learning in the human brain - Neuron, 2003
"... John P. O’Doherty, V(tUCS) and V(tUCS � 1) generates a positive prediction error that, in the simplest form of TD learning, is used to increment the value at time tUCS – 1 (in proportion to ..."
Cited by 200 (11 self)
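The quantity discussed in the snippet is the standard TD prediction error evaluated at the time of the unconditioned stimulus; a minimal sketch, with illustrative names (V indexed by timestep, t_ucs, gamma):

```python
# Minimal sketch of the TD prediction error described above (illustrative names).
def td_prediction_error(V, reward, t_ucs, gamma=1.0):
    """Prediction error at the time of the unconditioned stimulus (UCS).

    A positive delta is used, in the simplest form of TD learning, to
    increment the value at the preceding timestep t_ucs - 1.
    """
    return reward + gamma * V[t_ucs] - V[t_ucs - 1]
```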
Reinforcement Learning with Replacing Eligibility Traces - Machine Learning, 1996
"... The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it theoretically, and show that it results in faster, more reliable learning than the conventional ..."
Cited by 241 (14 self)
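The distinction the snippet introduces fits in a few lines: on a visit, an accumulating trace adds 1 to the state's eligibility, while a replacing trace resets it to 1. The function names and the NumPy representation are illustrative, not the paper's code.

```python
import numpy as np

def decay_and_mark(traces, state, gamma, lam, replacing=True):
    """Decay all eligibility traces, then mark the visited state.

    Accumulating trace: e[s] <- e[s] + 1  (can grow past 1 on revisits)
    Replacing trace:    e[s] <- 1         (capped; the paper argues this gives
                                           faster, more reliable learning)
    """
    traces *= gamma * lam
    if replacing:
        traces[state] = 1.0
    else:
        traces[state] += 1.0
    return traces

def td_lambda_update(V, traces, delta, alpha=0.1):
    """Credit the TD error delta to all states in proportion to their traces."""
    return V + alpha * delta * traces
```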
Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task - J. Neurosci, 1993
"... The present investigation had two aims: (1) to study responses of dopamine neurons to stimuli with attentional and motivationai significance during several steps of learning a behavioral task, and (2) to study the activity of dopamine neurons during the performance of cognitive tasks known to be imp ..."
Cited by 200 (6 self)
of primary liquid reward, whereas only 9 % of 163 neurons responded to this event once task performance was established. This produced an average population response during but not after learning of each task. Reward responses during learning were significantly
Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results, 1996
"... This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dyna ..."
Cited by 130 (13 self)
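For reference alongside this survey snippet, the undiscounted optimality criterion it studies is usually written as the average-reward (gain) optimality equation; this is the textbook form, not a result specific to the paper:

```latex
% Average-reward optimality equation: rho* is the optimal average reward per
% step (gain) and h is the bias (relative value) function.
\rho^{*} + h(s) \;=\; \max_{a} \Big[ \, r(s,a) + \sum_{s'} P(s' \mid s, a)\, h(s') \, \Big]
  \qquad \text{for all states } s.
```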
Infinite-horizon policy-gradient estimation - Journal of Artificial Intelligence Research, 2001
"... Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce � � , a si ..."
Abstract
-
Cited by 208 (5 self)
- Add to MetaCart
simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes ( � s) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm’s chief
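A hedged sketch of the kind of single-trajectory estimator the snippet describes: an eligibility vector accumulates discounted gradients of the log-policy, and a running average of reward-weighted eligibilities gives a biased estimate of the average-reward gradient. Names, the averaging scheme, and the parameter beta are illustrative, not the paper's exact algorithm.

```python
import numpy as np

def estimate_average_reward_gradient(grad_log_pi, rewards, beta=0.9):
    """grad_log_pi: per-step gradients of log pi(a_t | o_t; theta) (arrays).
    rewards: per-step scalar rewards.
    beta: discount controlling the bias/variance trade-off of the estimate.
    """
    z = np.zeros_like(grad_log_pi[0])        # eligibility of recent actions
    grad_estimate = np.zeros_like(z)
    for t, (g, r) in enumerate(zip(grad_log_pi, rewards), start=1):
        z = beta * z + g
        grad_estimate += (r * z - grad_estimate) / t   # running average of r_t * z_t
    return grad_estimate
```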