Results 1 - 10 of 1,345

Behavioral considerations suggest an average reward TD model . . .

by Nathaniel Daw, D. Touretzky - Neurocomputing, 2000
"... Recently there has been much interest in modeling the activity of primate midbrain dopamine neurons as signalling reward prediction error. But since the models are based on temporal difference (TD) learning, they assume an exponential decline with time in the value of delayed reinforcers, an assumption long known to conflict with animal behavior. We show that a variant of TD learning that tracks variations in the average reward per timestep rather than cumulative discounted reward preserves the models' success at explaining neurophysiological data while significantly increasing ..."
Abstract - Cited by 9 (2 self)
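
As a rough illustration of the average reward TD variant described in this abstract, here is a minimal tabular sketch; the function name and the step sizes alpha and eta are illustrative, not taken from the paper.

```python
# Sketch of an average-reward (differential) TD(0) update: a running
# estimate rho of the average reward per timestep replaces discounting.

def average_reward_td_update(V, rho, s, r, s_next, alpha=0.1, eta=0.01):
    """V: dict of tabular state values; rho: current average-reward estimate.
    Returns the updated rho (V is updated in place)."""
    delta = r - rho + V[s_next] - V[s]   # differential TD error, no discount
    V[s] += alpha * delta                # move V(s) toward the differential return
    rho += eta * delta                   # track the average reward itself
    return rho
```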

On average versus discounted reward temporal-difference learning

by John N. Tsitsiklis, Benjamin Van Roy, Satinder Singh - Machine Learning, 2002
"... Abstract. We provide an analytical comparison between discounted and average reward temporal-difference (TD) learning with linearly parameterized approximations. We first consider the asymptotic behavior of the two algorithms. We show that as the discount factor approaches 1, the value function prod ..."
Abstract - Cited by 13 (2 self)
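
For contrast with the average-reward update sketched above, the discounted counterpart with a linearly parameterized value function V_w(s) = w . phi(s) looks roughly like the following; names are illustrative, not the paper's notation.

```python
import numpy as np

# Sketch: discounted TD(0) with a linear approximation V_w(s) = w . phi(s).
# The comparison in the abstract concerns the regime where gamma approaches 1.

def discounted_td_update(w, phi_s, phi_next, r, gamma=0.99, alpha=0.05):
    delta = r + gamma * np.dot(w, phi_next) - np.dot(w, phi_s)  # TD error
    return w + alpha * delta * phi_s                            # semi-gradient step
```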

Policy gradient methods for reinforcement learning with function approximation.

by Richard S. Sutton, David McAllester, Satinder Singh, Yishay Mansour - In NIPS, 1999
"... Abstract Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly repres ..."
Abstract - Cited by 439 (20 self)
output is action selection probabilities, and whose weights are the policy parameters. Let θ denote the vector of policy parameters and ρ the performance of the corresponding policy (e.g., the average reward per step). Then, in the policy gradient approach, the policy parameters are updated approximately
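
As a rough sketch of the policy gradient idea in this excerpt (parameters θ updated in the direction of an estimate of ∇ρ), here is a REINFORCE-style step for a softmax policy over discrete actions; this is not necessarily the paper's exact algorithm, and the names are illustrative.

```python
import numpy as np

# Sketch of a policy-gradient step for a softmax policy pi_theta(a|s) with
# per-action features feats[a] of the same dimension as theta.

def softmax_policy(theta, feats):
    prefs = feats @ theta                   # one preference per action
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def reinforce_step(theta, trajectory, alpha=0.01):
    """trajectory: list of (feats, action, return_estimate) tuples.
    Moves theta along an estimate of the performance gradient."""
    for feats, a, G in trajectory:
        p = softmax_policy(theta, feats)
        grad_log_pi = feats[a] - p @ feats  # d/dtheta log pi(a|s) for softmax
        theta = theta + alpha * G * grad_log_pi
    return theta
```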

Conditional skewness in asset pricing tests

by Campbell R. Harvey, Akhtar Siddique - Journal of Finance, 2000
"... If asset returns have systematic skewness, expected returns should include rewards for accepting this risk. We formalize this intuition with an asset pricing model that incorporates conditional skewness. Our results show that conditional skewness helps explain the cross-sectional variation of expect ..."
Abstract - Cited by 342 (6 self)
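
As a small illustration of the kind of quantity involved, one commonly used standardized coskewness measure can be estimated from return residuals as below; this is a sketch of an unconditional version, not necessarily the paper's conditional specification.

```python
import numpy as np

# Sketch: standardized coskewness of asset residuals e_i with market residuals
# e_m, i.e. E[e_i * e_m^2] / (sqrt(E[e_i^2]) * E[e_m^2]). Illustration only.

def standardized_coskewness(r_i, r_m):
    e_i = r_i - r_i.mean()
    e_m = r_m - r_m.mean()
    return np.mean(e_i * e_m**2) / (np.sqrt(np.mean(e_i**2)) * np.mean(e_m**2))
```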

R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning

by Ronen I. Brafman, Moshe Tennenholtz, 2001
"... R-max is a very simple model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time. In R-max, the agent always maintains a complete, but possibly inaccurate model of its environment and acts based on the optimal policy derived from this model. The mod ..."
Abstract - Cited by 297 (10 self) - Add to MetaCart
R-max is a very simple model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time. In R-max, the agent always maintains a complete, but possibly inaccurate model of its environment and acts based on the optimal policy derived from this model
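
A compressed sketch of the R-max construction described here, in which insufficiently visited state-action pairs are modeled optimistically so that the derived policy explores them. The parameters m and R_MAX and the use of discounted value iteration are my own simplifications; the paper itself targets average reward over a horizon.

```python
import numpy as np

# Sketch: unknown (s, a) pairs are modeled as giving R_MAX reward and leading
# to an absorbing, maximally rewarding fictitious state (index n_states).

def rmax_model(counts, trans_counts, reward_sums, n_states, n_actions,
               R_MAX=1.0, m=5):
    S = n_states + 1
    P = np.zeros((S, n_actions, S))
    R = np.full((S, n_actions), R_MAX)       # optimistic by default
    P[:, :, n_states] = 1.0                  # default: jump to fictitious state
    for s in range(n_states):
        for a in range(n_actions):
            if counts[s, a] >= m:            # "known" pair: empirical model
                P[s, a, :n_states] = trans_counts[s, a] / counts[s, a]
                P[s, a, n_states] = 0.0
                R[s, a] = reward_sums[s, a] / counts[s, a]
    return P, R

def value_iteration(P, R, gamma=0.95, iters=200):
    Q = np.zeros(R.shape)
    for _ in range(iters):
        V = Q.max(axis=1)
        Q = R + gamma * (P @ V)              # expectation over next states
    return Q                                 # act greedily w.r.t. Q
```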

Temporal difference models and reward-related learning in the human brain

by John P. O’Doherty, Peter Dayan, Karl Friston, Hugo Critchley - Neuron, 2003
"... John P. O’Doherty, V(tUCS) and V(tUCS � 1) generates a positive prediction error that, in the simplest form of TD learning, is used to increment the value at time tUCS – 1 (in proportion to ..."
Abstract - Cited by 200 (11 self) - Add to MetaCart
John P. O’Doherty, V(tUCS) and V(tUCS � 1) generates a positive prediction error that, in the simplest form of TD learning, is used to increment the value at time tUCS – 1 (in proportion to
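
A bare-bones sketch of the mechanism described, using simplest-form TD learning over the time steps of a trial; the variable names are mine, not the paper's.

```python
# Sketch: TD(0) over the time steps of a trial. A reward delivered at t_ucs
# creates a positive prediction error that increments the value of the
# preceding time step, so value gradually propagates back toward the cue.

def run_trial(V, rewards, alpha=0.1, gamma=1.0):
    """V: list of values, one per time step; rewards: reward at each time step
    (e.g. nonzero only at t_ucs). Both lists have the same length."""
    for t in range(len(V) - 1):
        delta = rewards[t + 1] + gamma * V[t + 1] - V[t]  # prediction error
        V[t] += alpha * delta                             # increment value at t
    return V
```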

Reinforcement Learning with Replacing Eligibility Traces

by Satinder Singh, Richard S. Sutton - Machine Learning, 1996
"... The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it theoretically, and show that it results in faster, more reliable learning than the conventional ..."
Abstract - Cited by 241 (14 self)
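
A minimal tabular TD(λ) sketch contrasting the conventional accumulating trace with the replacing trace introduced in this paper; step sizes and data structures are illustrative.

```python
# Sketch: one TD(lambda) step. With an accumulating trace a revisited state's
# trace can grow past 1; a replacing trace resets it to exactly 1.

def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.99, lam=0.9,
                   replacing=True):
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    for x in e:
        e[x] *= gamma * lam              # decay all traces
    if replacing:
        e[s] = 1.0                       # replacing trace
    else:
        e[s] = e.get(s, 0.0) + 1.0       # conventional accumulating trace
    for x in e:
        V[x] = V.get(x, 0.0) + alpha * delta * e[x]
    return V, e
```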

Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task

by Wolfram Schultz, Paul Apicella, Tomas Ljungberg - J. Neurosci, 1993
"... The present investigation had two aims: (1) to study responses of dopamine neurons to stimuli with attentional and motivationai significance during several steps of learning a behavioral task, and (2) to study the activity of dopamine neurons during the performance of cognitive tasks known to be imp ..."
Abstract - Cited by 200 (6 self)
of primary liquid reward, whereas only 9 % of 163 neurons responded to this event once task performance was established. This produced an average population response during but not after learning of each task. Reward responses during learning were significantly

Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results

by Sridhar Mahadevan, 1996
"... This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dyna ..."
Abstract - Cited by 130 (13 self)
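
One representative algorithm from this family is R-learning; the tabular sketch below uses illustrative step sizes and is not necessarily the survey's exact presentation.

```python
# Sketch of an R-learning update: action values are kept relative to a running
# average-reward estimate rho rather than discounted.

def r_learning_update(Q, rho, s, a, r, s_next, actions, alpha=0.1, beta=0.01):
    """Q: dict mapping (state, action) to relative value. Returns updated rho."""
    def q(state, act):
        return Q.get((state, act), 0.0)
    best_next = max(q(s_next, b) for b in actions)
    best_here = max(q(s, b) for b in actions)
    was_greedy = q(s, a) >= best_here - 1e-12
    Q[(s, a)] = q(s, a) + alpha * (r - rho + best_next - q(s, a))
    if was_greedy:                       # rho is typically updated only on greedy actions
        rho += beta * (r - rho + best_next - best_here)
    return rho
```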

Infinite-horizon policy-gradient estimation

by Jonathan Baxter, Peter L. Bartlett - Journal of Artificial Intelligence Research, 2001
"... Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm’s chief ..."
Abstract - Cited by 208 (5 self)
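
A minimal sketch of the eligibility-trace form of this kind of estimator, applied to a recorded trajectory generated by the current policy; the interface and names are assumptions, not the paper's code.

```python
import numpy as np

# Sketch: an eligibility trace z of log-policy gradients, discounted by beta,
# is correlated with observed rewards. The running average Delta gives a
# biased estimate of the gradient of the average reward; beta in [0, 1)
# trades bias against variance.

def gpomdp_estimate(trajectory, grad_log_policy, theta, beta=0.9):
    """trajectory: iterable of (observation, action, reward) triples.
    grad_log_policy(theta, obs, action) is assumed to return
    d/dtheta log pi_theta(action | observation)."""
    z = np.zeros_like(theta)
    Delta = np.zeros_like(theta)
    for t, (obs, action, reward) in enumerate(trajectory):
        z = beta * z + grad_log_policy(theta, obs, action)
        Delta += (reward * z - Delta) / (t + 1)   # running average
    return Delta
```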