Results 1-10 of 13
Reinforcement Learning with Replacing Eligibility Traces
Machine Learning, 1996
Cited by 196 (11 self)
The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it theoretically, and show that it results in faster, more reliable learning than the conventional trace. Both kinds of trace assign credit to prior events according to how recently they occurred, but only the conventional trace gives greater credit to repeated events. Our analysis is for conventional and replace-trace versions of the offline TD(1) algorithm applied to undiscounted absorbing Markov chains. First, we show that these methods converge under repeated presentations of the training set to the same predictions as two well-known Monte Carlo methods. We then analyze the relative efficiency of the two Monte Carlo methods. We show that the method corresponding to conventional TD is biased, whereas the method corresponding to replace-trace TD is unbiased. In addition, we show that t...
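The difference between the two kinds of trace described above can be illustrated with a minimal sketch. It is not taken from the paper: the function names, the dictionary representation, and the single `decay` factor are assumptions made for illustration. On each step every trace decays; the visited state's trace is then either incremented (conventional, accumulating trace) or reset to 1 (replacing trace).

```python
def update_traces(traces, visited_state, decay, kind="accumulating"):
    """Return new traces after one step (hypothetical helper).

    traces: dict mapping state -> trace value
    decay:  assumed per-step decay factor in [0, 1]
    kind:   "accumulating" (conventional) or "replacing"
    """
    # Every existing trace fades according to how long ago the state was seen.
    new_traces = {state: decay * value for state, value in traces.items()}
    if kind == "accumulating":
        # Repeated visits pile up, so repeated events earn extra credit.
        new_traces[visited_state] = new_traces.get(visited_state, 0.0) + 1.0
    elif kind == "replacing":
        # A revisit resets the trace to 1, so only recency matters.
        new_traces[visited_state] = 1.0
    else:
        raise ValueError(f"unknown trace kind: {kind}")
    return new_traces


def run_visits(visits, decay, kind):
    """Apply update_traces over a sequence of visited states."""
    traces = {}
    for state in visits:
        traces = update_traces(traces, state, decay, kind)
    return traces
```

With the visit sequence `["a", "a", "b"]` and `decay=0.5`, the accumulating trace for `a` ends at 0.75 and the replacing trace at 0.5, which shows that only the conventional trace rewards the repeated visit.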
Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces
1996
Cited by 104 (6 self)
A key element in the solution of reinforcement learning problems is the value function. The purpose of this function is to measure the long-term utility or value of any given state, and it is important because an agent can use it to decide what to do next. A common problem in reinforcement learning when applied to systems having continuous state and action spaces is that the value function must operate with a domain consisting of real-valued variables, which means that it should be able to represent the value of infinitely many state and action pairs. For this reason, function approximators are used to represent the value function when a closed-form solution of the optimal policy is not available. In this paper, we extend a previously proposed reinforcement learning algorithm so that it can be used with function approximators that generalize the value of individual experiences across both state and action spaces. In particular, we discuss the benefits of using sparse coarse-coded funct...
R.: Incremental multi-step Q-learning
1996
Cited by 93 (2 self)
Abstract. This paper presents a novel incremental algorithm that combines Q-learning, a well-known dynamic-programming based reinforcement learning method, with the TD(λ) return estimation process, which is typically used in actor-critic learning, another well-known dynamic-programming based reinforcement learning method. The parameter λ is used to distribute credit throughout sequences of actions, leading to faster learning and also helping to alleviate the non-Markovian effect of coarse state-space quantization. The resulting algorithm, Q(λ)-learning, thus combines some of the best features of the Q-learning and actor-critic learning paradigms. The behavior of this algorithm has been demonstrated through computer simulations.
Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions
1993
Cited by 87 (1 self)
Consider a given value function on states of a Markov decision problem, as might result from applying a reinforcement learning algorithm. Unless this value function equals the corresponding optimal value function, at some states there will be a discrepancy, which is natural to call the Bellman residual, between what the value function specifies at that state and what is obtained by a one-step lookahead along the seemingly best action at that state using the given value function to evaluate all succeeding states. This paper derives a tight bound on how far from optimal the discounted return for a greedy policy based on the given value function will be as a function of the maximum norm magnitude of this Bellman residual. A corresponding result is also obtained for value functions defined on state-action pairs, as are used in Q-learning. One significant application of these results is to problems where a function approximator is used to learn a value function, with training of the approxi...
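The Bellman residual described above can be computed directly for a small tabular problem. The following is a minimal sketch under stated assumptions, not the paper's own formulation: the dictionary layout for transitions `P` and rewards `R`, the function name, and the toy two-state problem in the usage note are all hypothetical. It returns the maximum-norm magnitude of the residual, that is, the largest gap between the given value and the one-step lookahead along the seemingly best action.

```python
def max_bellman_residual(V, P, R, gamma):
    """Maximum-norm Bellman residual of value function V (hypothetical helper).

    V:     dict state -> estimated value
    P:     dict state -> dict action -> list of (probability, next_state)
    R:     dict state -> dict action -> immediate reward
    gamma: discount factor
    """
    worst = 0.0
    for state, actions in P.items():
        # One-step lookahead along the seemingly best action,
        # using V itself to evaluate all succeeding states.
        lookahead = max(
            R[state][action]
            + gamma * sum(prob * V[nxt] for prob, nxt in outcomes)
            for action, outcomes in actions.items()
        )
        worst = max(worst, abs(V[state] - lookahead))
    return worst
```

As a usage example, take an assumed toy problem where state 0 can "stay" (reward 0) or "go" to state 1 (reward 1), and state 1 can only "stay" (reward 0). With `gamma=0.9`, the value function `{0: 1.0, 1: 0.0}` has residual 0, while the all-zero value function has residual 1.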
Temporal sequence learning, prediction and control - a review of different models and their relation to biological mechanisms
Neural Computation, 2004
Cited by 29 (5 self)
In this article we compare methods for temporal sequence learning (TSL) across the disciplines machine control, classical conditioning, neuronal models for TSL as well as spike-timing-dependent plasticity. This review will briefly introduce the most influential models and focus on two questions: 1) To what degree are reward-based (e.g. TD-learning) and correlation-based (Hebbian) learning related? and 2) How do the different models correspond to possibly underlying biological mechanisms of synaptic plasticity? We will first compare the different models in an open-loop condition, where behavioral feedback does not alter the learning. Here we observe that reward-based and correlation-based learning are indeed very similar. Machine control is then used to introduce the problem of closed-loop control (e.g. “actor-critic architectures”). Here the problem of evaluative (“rewards”) versus non-evaluative (“correlations”) feedback from the environment will be discussed, showing that both learning approaches are fundamentally different in the closed-loop condition. In trying to answer the second question we will compare neuronal versions of the different learning architectures to the anatomy of the involved brain structures (basal ganglia, thalamus and
Value Function Based Production Scheduling
In International Conference on Machine Learning, 1998
Cited by 18 (1 self)
Production scheduling, the problem of sequentially configuring a factory to meet forecasted demands, is a critical problem throughout the manufacturing industry. The requirement of maintaining product inventories in the face of unpredictable demand and stochastic factory output makes standard scheduling models, such as job-shop, inadequate. Currently applied algorithms, such as simulated annealing and constraint propagation, must employ ad hoc methods such as frequent replanning to cope with uncertainty. In this paper, we describe a Markov Decision Process (MDP) formulation of production scheduling which captures stochasticity in both production and demands. The solution to this MDP is a value function which can be used to generate optimal scheduling decisions online. A simple example illustrates the theoretical superiority of this approach over replanning-based methods. We then describe an industrial application and two reinforcement learning methods for generating an approximate valu...
Approximate dynamic programming strategies and their applicability for process control: A review and future directions
International Journal of Control Automation and Systems, 2004
Cited by 4 (0 self)
Abstract: This paper reviews dynamic programming (DP), surveys approximate solution methods for it, and considers their applicability to process control problems. Reinforcement Learning (RL) and Neuro-Dynamic Programming (NDP), which can be viewed as approximate DP techniques, are already established techniques for solving difficult multi-stage decision problems in the fields of operations research, computer science, and robotics. Owing to the significant disparity of problem formulations and objectives, however, the algorithms and techniques available from these fields are not directly applicable to process control problems, and reformulations based on accurate understanding of these techniques are needed. We categorize the currently available approximate solution techniques for dynamic programming and identify those most suitable for process control problems. Several open issues are also identified and discussed.
A Study on Architecture, Algorithms, and Applications of Approximate Dynamic Programming Based Approach to Optimal Control
2004