Results 11–20 of 27
Combining Configural and TD Learning on a Robot
 In ICDL 2
, 2002
Abstract

Cited by 4 (2 self)
We combine configural and temporal difference learning in a classical conditioning model. The model is able to solve the negative patterning problem, discriminate sequences of stimuli, and exhibit second-order conditioning. We have implemented the algorithm on the Sony AIBO entertainment robot, allowing us to interact with the conditioning model in real time.
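The negative patterning problem mentioned above (respond to A alone and B alone, but not to the AB compound) is a useful way to see why a configural component helps a delta-rule/TD learner. A minimal sketch, assuming a plain linear predictor with one added configural unit — an illustration, not the paper's actual architecture:

```python
import numpy as np

# Negative patterning: A+, B+, AB-. No purely linear combination of the
# elemental features A and B can fit these targets; adding a single
# "configural" unit that fires only on the AB compound makes the
# problem solvable by an ordinary delta-rule update.

def features(A, B):
    # elemental units for A and B, plus one configural unit for the compound
    return np.array([A, B, A * B], dtype=float)

w = np.zeros(3)
alpha = 0.1  # learning rate (assumed value)

trials = [((1, 0), 1.0), ((0, 1), 1.0), ((1, 1), 0.0)]  # A+, B+, AB-
for _ in range(2000):
    for (A, B), reward in trials:
        x = features(A, B)
        delta = reward - w @ x      # one-step prediction error
        w += alpha * delta * x

# After training the model predicts reward for A alone and B alone,
# but not for the AB compound.
print(w @ features(1, 0), w @ features(0, 1), w @ features(1, 1))
```

The learned configural weight is strongly negative, cancelling the summed elemental predictions on compound trials.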
A Function Approximation Approach to Estimation of Policy Gradient for POMDP with Structured Policies
 In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI'05)
, 2005
Abstract

Cited by 4 (1 self)
We consider the estimation of the policy gradient in partially observable Markov decision processes (POMDPs) with a special class of structured policies that are finite-state controllers. We show that the gradient estimation can be done in the Actor-Critic framework, by making the critic compute a “value” function that does not depend on the states of the POMDP. This function is the conditional mean of the true value function, which does depend on the states. We show that the critic can be implemented using temporal difference (TD) methods with linear function approximation, and that the analytical results on TD and Actor-Critic transfer to this case. Although Actor-Critic algorithms have been used extensively in Markov decision processes (MDPs), up to now they have not been proposed for POMDPs as an alternative to the earlier GPOMDP algorithm, an actor-only method. Furthermore, we show that the same idea applies to semi-Markov problems with a subset of finite-state controllers.
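The critic described above learns a value defined on controller nodes alone: the conditional mean, under the stationary distribution, of the true value function on the joint (environment state, controller node) Markov chain. A toy numerical sketch of that quantity — the joint chain and rewards below are invented for illustration, not taken from the paper:

```python
import numpy as np

# For a POMDP driven by a fixed finite-state controller, the pair
# (environment state, controller node) evolves as a Markov chain.
# Here: 2 env states x 2 controller nodes = 4 joint states, with an
# assumed joint transition matrix and reward vector.
gamma = 0.9
P = np.array([[0.5, 0.2, 0.2, 0.1],
              [0.1, 0.6, 0.1, 0.2],
              [0.3, 0.1, 0.4, 0.2],
              [0.2, 0.2, 0.2, 0.4]])
r = np.array([1.0, 0.0, 0.5, 0.0])

# True value function on the joint chain: V = (I - gamma P)^{-1} r
V = np.linalg.solve(np.eye(4) - gamma * P, r)

# Stationary distribution of the joint chain (left eigenvector for 1)
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu /= mu.sum()

# Controller node n corresponds to joint indices {2n, 2n+1}; the critic's
# target is the conditional mean of V given n under mu.
for n in range(2):
    idx = [2 * n, 2 * n + 1]
    v_n = np.average(V[idx], weights=mu[idx])
    print(f"node {n}: critic value {v_n:.3f}")
```

The point of the construction is that this node-only function is well defined even though the agent never observes the environment state.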
Adaptive Data-Aware Utility-Based Scheduling in Resource-Constrained Systems
, 2007
Abstract

Cited by 4 (0 self)
This paper addresses the problem of dynamic scheduling of data-intensive multiprocessor jobs. Each job requires some number of CPUs and some amount of data that needs to be downloaded into a local storage space before starting the job. The completion of each job brings some benefit (utility) to the system, and the goal is to find the optimal scheduling policy that maximizes the average utility per unit of time obtained from all completed jobs. A coevolutionary solution methodology is proposed, where the utility-based policies for managing local storage and for scheduling jobs onto the available CPUs mutually affect each other’s environments, with both policies being adaptively tuned using the Reinforcement Learning methodology. Our simulation results demonstrate the feasibility of this approach and show that it performs better than the best heuristic scheduling policy we could find for this domain.
New Representations and Approximations for Sequential Decision Making under Uncertainty
, 2007
Abstract

Cited by 2 (1 self)
This dissertation research addresses the challenge of scaling up algorithms for sequential decision making under uncertainty. In my dissertation, I developed new approximation strategies for planning and learning in the presence of uncertainty while maintaining useful theoretical properties that allow larger problems to be tackled than is practical with exact methods. In particular, my research tackles three outstanding issues in sequential decision making in uncertain environments: performing stable generalization during off-policy updates, balancing exploration with exploitation, and handling partial observability of the environment. The first key contribution of my thesis is the development of novel dual representations and algorithms for planning and learning in stochastic environments. This dual view offers a coherent and comprehensive approach to optimal sequential decision making problems, provides an alternative to standard value-function-based techniques, and opens new avenues for solving sequential decision making problems. In particular, I have shown that dual dynamic program
Hyperbolically Discounted Temporal Difference Learning
 In Neural Computation
, 2010
Abstract

Cited by 2 (1 self)
Hyperbolic discounting of future outcomes is widely observed to underlie choice behavior in animals. Additionally, recent studies (Kobayashi & Schultz, 2008) have reported that hyperbolic discounting is observed even in neural systems underlying choice. However, the most prevalent models of temporal discounting, such as temporal difference learning, assume that future outcomes are discounted exponentially. Exponential discounting has been preferred largely because it can be expressed recursively, whereas hyperbolic discounting has heretofore been thought not to have a recursive definition. In this letter, we define a learning algorithm, hyperbolically discounted temporal difference (HDTD) learning, which constitutes a recursive formulation of the hyperbolic model.
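One way to see why a recursive formulation is plausible at all: the hyperbolic discount d(t) = 1/(1 + kt) satisfies a nonlinear one-step recursion, in contrast to the linear recursion d(t+1) = γ·d(t) of exponential discounting. A small numerical check — my illustration of this identity, not the paper's HDTD derivation:

```python
# The hyperbolic discount d(t) = 1/(1 + k t) obeys the nonlinear
# recursion d(t+1) = d(t) / (1 + k d(t)):
#   d(t) / (1 + k d(t)) = [1/(1+kt)] / (1 + k/(1+kt)) = 1/(1 + k(t+1)).
# It is the value-dependence of this recursion that a TD-style
# formulation of hyperbolic discounting has to accommodate.

k = 0.5  # hyperbolic discount rate (assumed value)

def hyperbolic(t, k):
    return 1.0 / (1.0 + k * t)

d = hyperbolic(0, k)  # = 1.0 at delay zero
for t in range(10):
    d = d / (1.0 + k * d)                       # one-step recursion
    assert abs(d - hyperbolic(t + 1, k)) < 1e-12  # matches closed form
print("recursion matches closed form for t = 1..10")
```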
Efficient Asymptotic Approximation in Temporal Difference Learning
Abstract

Cited by 1 (1 self)
TD(λ) is an algorithm that learns the value function associated with a policy in a Markov Decision Process (MDP). We propose in this paper an asymptotic approximation of online TD(λ) with accumulating eligibility traces, called ATD(λ). We then use the Ordinary Differential Equation (ODE) method to analyse ATD(λ) and to optimize the choice of the parameter λ and of the learning step-size, and we introduce ATD, a new efficient temporal difference learning algorithm.
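For reference, the online TD(λ) algorithm with accumulating traces that the abstract starts from can be sketched on a standard 5-state random-walk task. The task and constants below are illustrative; the ATD(λ) approximation itself is not reproduced here.

```python
import numpy as np

# Online TD(lambda), accumulating traces, linear (here tabular) features.
# Task: symmetric random walk on states 0..4, start in the middle,
# reward 1 for exiting on the right, 0 otherwise. True values are
# V(s) = (s+1)/6.
n_states, lam, gamma, alpha = 5, 0.9, 1.0, 0.05
rng = np.random.default_rng(0)
w = np.zeros(n_states)            # one weight per state

def phi(s):
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

for _ in range(3000):             # episodes
    s, z = 2, np.zeros(n_states)  # reset state and eligibility trace
    while 0 <= s < n_states:
        s2 = s + (1 if rng.random() < 0.5 else -1)
        r = 1.0 if s2 == n_states else 0.0       # reward only at right exit
        v2 = w @ phi(s2) if 0 <= s2 < n_states else 0.0
        delta = r + gamma * v2 - w @ phi(s)      # TD error
        z = gamma * lam * z + phi(s)             # accumulating trace
        w += alpha * delta * z
        s = s2

print(np.round(w, 2))  # true values: 1/6, 2/6, 3/6, 4/6, 5/6
```

With a fixed step-size the estimates fluctuate around the true values, which is precisely the step-size/λ trade-off the ODE analysis above is optimizing.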
Hyperbolically Discounted Temporal Difference Learning
Abstract
Hyperbolic discounting of future outcomes is widely observed to underlie choice behavior in animals. Additionally, recent studies (Kobayashi & Schultz, 2008) have reported that hyperbolic discounting is observed even in neural systems underlying choice. However, the most prevalent models of temporal discounting, such as temporal difference learning, assume that future outcomes are discounted exponentially. Exponential discounting has been preferred largely because it can be expressed recursively, whereas hyperbolic discounting has heretofore been thought not to have a recursive definition. In this letter, we define a learning algorithm, hyperbolically discounted temporal difference (HDTD) learning, which constitutes a recursive formulation of the hyperbolic model.
TD(0) Leads to Better Policies than Approximate Value Iteration
Abstract
We consider approximate value iteration with a parameterized approximator in which the state space is partitioned and the optimal cost-to-go function over each partition is approximated by a constant. We establish performance loss bounds for policies derived from approximations associated with fixed points. These bounds identify benefits to having projection weights equal to the invariant distribution of the resulting policy. Such projection weighting leads to the same fixed points as TD(0). Our analysis also leads to the first performance loss bound for approximate value iteration with an average cost objective.
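The fixed points in question can be illustrated with state aggregation, the simplest piecewise-constant approximator: back up exactly, then project each partition onto a constant using the chosen weights. A toy sketch — the chain MDP and the uniform projection weights are my assumptions, not the paper's setting:

```python
import numpy as np

# Approximate value iteration with state aggregation on a 6-state chain:
# states drift right deterministically, the last state is absorbing and
# pays reward 1 per step. States are partitioned into 3 blocks of 2, and
# each iteration is an exact Bellman backup followed by a d-weighted
# projection onto the piecewise-constant class.
n = 6
partition = np.array([0, 0, 1, 1, 2, 2])     # block index of each state
P = np.zeros((n, n))
r = np.zeros(n)
for s in range(n):
    P[s, min(s + 1, n - 1)] = 1.0
r[n - 1] = 1.0
gamma = 0.9
d = np.full(n, 1.0 / n)                      # uniform projection weights

theta = np.zeros(3)                          # one constant per block
for _ in range(200):
    v = theta[partition]                     # lift block values to states
    backup = r + gamma * (P @ v)             # exact Bellman backup
    for b in range(3):                       # weighted projection per block
        mask = partition == b
        theta[b] = np.average(backup[mask], weights=d[mask])

print(np.round(theta, 3))
```

The composed backup-then-project operator is a γ-contraction, so the iteration converges to the unique fixed point; the paper's question is how the choice of the weights d affects the quality of the policy greedy with respect to that fixed point.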
Opponent interactions between serotonin and dopamine
, 2002
Abstract
Anatomical and pharmacological evidence suggests that the dorsal raphe serotonin system and the ventral tegmental and substantia nigra dopamine system may act as mutual opponents. In the light of the temporal difference model of the involvement of the dopamine system in reward learning, we consider three aspects of motivational opponency involving dopamine and serotonin. We suggest that a tonic serotonergic signal reports the long-run average reward rate as part of an average-case reinforcement learning model; that a tonic dopaminergic signal reports the long-run average punishment rate in a similar context; and finally speculate that a phasic serotonin signal
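A "tonic signal reporting the long-run average reward rate" maps naturally onto the ρ term of average-reward TD learning, where the prediction error is δ = r − ρ + V(s′) − V(s) and ρ is a slowly updated estimate of the reward rate. A minimal tabular sketch — the chain and constants are illustrative, not the paper's model:

```python
import numpy as np

# Average-reward (differential) TD learning on a 4-state chain with
# uniform random transitions; entering state 2 pays reward 1, so the
# true long-run reward rate is 0.25. The slowly adapting rho plays the
# role of the hypothesized tonic signal; delta is the phasic one.
rng = np.random.default_rng(1)
n, alpha, beta = 4, 0.1, 0.01            # value and rate step-sizes
V, rho = np.zeros(n), 0.0
P = np.full((n, n), 1.0 / n)             # assumed uniform transitions
r_mean = np.array([0.0, 0.0, 1.0, 0.0])  # reward on entering each state

s = 0
for _ in range(20000):
    s2 = rng.choice(n, p=P[s])
    r = r_mean[s2]
    delta = r - rho + V[s2] - V[s]       # differential TD error
    V[s] += alpha * delta
    rho += beta * delta                  # tonic estimate of reward rate
    s = s2

print(round(rho, 2))  # approaches the true rate, 0.25
```

Punishments can be handled symmetrically with a second rate estimate, which is the shape of the opponency the abstract proposes.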