Results 1 - 9 of 9
Reinforcement Learning with Replacing Eligibility Traces
 Machine Learning
, 1996
Abstract

Cited by 186 (11 self)
The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it theoretically, and show that it results in faster, more reliable learning than the conventional trace. Both kinds of trace assign credit to prior events according to how recently they occurred, but only the conventional trace gives greater credit to repeated events. Our analysis is for conventional and replace-trace versions of the offline TD(1) algorithm applied to undiscounted absorbing Markov chains. First, we show that these methods converge under repeated presentations of the training set to the same predictions as two well-known Monte Carlo methods. We then analyze the relative efficiency of the two Monte Carlo methods. We show that the method corresponding to conventional TD is biased, whereas the method corresponding to replace-trace TD is unbiased. In addition, we show that t...
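The replacing-versus-accumulating distinction the abstract describes can be sketched in a few lines of tabular TD(λ). This is a minimal illustrative sketch, not the paper's code; the state names, step sizes, and episode format are assumptions.

```python
def td_lambda_episode(episode, V, alpha=0.1, lam=0.9, gamma=1.0, replacing=True):
    """One offline pass of tabular TD(lambda) over a list of
    (state, reward, next_state) transitions; next_state may be None (absorbing)."""
    e = {s: 0.0 for s in V}                  # eligibility traces, one per state
    for s, r, s_next in episode:
        delta = r + gamma * V.get(s_next, 0.0) - V[s]   # TD error
        for k in e:
            e[k] *= gamma * lam              # decay every trace
        if replacing:
            e[s] = 1.0                       # replacing trace: reset to 1
        else:
            e[s] += 1.0                      # conventional trace: accumulate
        for k in e:
            V[k] += alpha * delta * e[k]     # credit prior states by recency
    return V
```

On an episode that revisits a state, the accumulating trace exceeds 1 for that state, so the conventional version gives extra credit to the repeated event while the replacing version does not.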
Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces
, 1996
Abstract

Cited by 92 (6 self)
A key element in the solution of reinforcement learning problems is the value function. The purpose of this function is to measure the long-term utility or value of any given state, and it is important because an agent can use it to decide what to do next. A common problem in reinforcement learning when applied to systems having continuous state and action spaces is that the value function must operate with a domain consisting of real-valued variables, which means that it should be able to represent the value of infinitely many state and action pairs. For this reason, function approximators are used to represent the value function when a closed-form solution of the optimal policy is not available. In this paper, we extend a previously proposed reinforcement learning algorithm so that it can be used with function approximators that generalize the value of individual experiences across both state and action spaces. In particular, we discuss the benefits of using sparse coarse-coded funct...
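The coarse-coding idea the abstract refers to can be sketched with a linear value function over binary, overlapping receptive fields on a 1-D continuous state. This is an illustrative assumption-laden sketch (the feature layout, widths, and step size are invented here, not taken from the paper):

```python
import numpy as np

def coarse_features(x, centers, width):
    """Binary coarse-coded features: one per overlapping interval containing x."""
    return (np.abs(centers - x) <= width).astype(float)

centers = np.linspace(0.0, 1.0, 11)   # 11 overlapping receptive fields
width, alpha = 0.15, 0.1
w = np.zeros_like(centers)            # one weight per feature

def value(x, w):
    return float(w @ coarse_features(x, centers, width))

# One TD-style update toward a target of 1.0 at x = 0.5; because nearby
# states share active features, their estimated values move too.
phi = coarse_features(0.5, centers, width)
w = w + alpha * (1.0 - value(0.5, w)) * phi
```

After the single update, states close to 0.5 (e.g. 0.52) share the updated features and so inherit a nonzero value, while distant states (e.g. 0.0) are untouched; this is the generalization across experiences the abstract describes.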
Incremental Multi-Step Q-Learning
 Machine Learning
, 1996
Abstract

Cited by 88 (2 self)
This paper presents a novel incremental algorithm that combines Q-learning, a well-known dynamic programming-based reinforcement learning method, with the TD(λ) return estimation process, which is typically used in actor-critic learning, another well-known dynamic programming-based reinforcement learning method. The λ parameter is used to distribute credit throughout sequences of actions, leading to faster learning and also helping to alleviate the non-Markovian effect of coarse state-space quantization. The resulting algorithm, Q(λ)-learning, thus combines some of the best features of the Q-learning and actor-critic learning paradigms. The behavior of this algorithm has been demonstrated through computer simulations.

Keywords: reinforcement learning, temporal difference learning

1. Introduction

The incremental multi-step Q-learning (Q(λ)-learning) method is a new direct (or model-free) algorithm that extends the one-step Q-learning algorithm (Watkins 1989) by combining it with TD(λ) ret...
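The combination of one-step Q-learning with TD(λ)-style traces can be sketched as below. This is a simplified tabular sketch, not the paper's incremental algorithm (a full Watkins-style version would additionally zero all traces whenever the action taken is non-greedy, and Peng's variant differs further); state and action names and parameters are illustrative.

```python
from collections import defaultdict

def q_lambda_step(Q, e, s, a, r, s_next, actions,
                  alpha=0.1, gamma=0.9, lam=0.8):
    """One Q(lambda)-style update; Q and e map (state, action) pairs to floats."""
    a_star = max(actions, key=lambda b: Q[(s_next, b)])    # greedy next action
    delta = r + gamma * Q[(s_next, a_star)] - Q[(s, a)]    # one-step TD error
    e[(s, a)] += 1.0                                       # mark the taken pair
    for key in list(e):
        Q[key] += alpha * delta * e[key]   # spread credit along the trajectory
        e[key] *= gamma * lam              # decay traces toward zero
    return Q, e

Q, e = defaultdict(float), defaultdict(float)
Q, e = q_lambda_step(Q, e, "s0", "left", 1.0, "s1", ["left", "right"])
```

The trace dictionary is what makes this multi-step: a single TD error updates every recently visited state-action pair in proportion to its decayed eligibility, rather than only the most recent one.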
Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions
, 1993
Abstract

Cited by 83 (1 self)
Consider a given value function on states of a Markov decision problem, as might result from applying a reinforcement learning algorithm. Unless this value function equals the corresponding optimal value function, at some states there will be a discrepancy, which it is natural to call the Bellman residual, between what the value function specifies at that state and what is obtained by a one-step lookahead along the seemingly best action at that state, using the given value function to evaluate all succeeding states. This paper derives a tight bound on how far from optimal the discounted return for a greedy policy based on the given value function will be, as a function of the maximum-norm magnitude of this Bellman residual. A corresponding result is also obtained for value functions defined on state-action pairs, as are used in Q-learning. One significant application of these results is to problems where a function approximator is used to learn a value function, with training of the approxi...
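A bound of the kind described, relating the greedy policy's loss to the max-norm Bellman residual ε via 2γε/(1-γ), can be checked numerically on a toy problem. The tiny deterministic MDP below is a construction of ours for illustration, not from the paper, and the exact constant in the paper's tight bound may differ from this commonly quoted form:

```python
import numpy as np

gamma = 0.9
nxt = np.array([[0, 1], [0, 1]])      # next_state[s][a] (deterministic)
rew = np.array([[1.0, 0.0], [0.0, 2.0]])

def backup(V):
    """One step of the Bellman optimality operator."""
    return np.max(rew + gamma * V[nxt], axis=1)

V_star = np.zeros(2)
for _ in range(2000):                 # value iteration to convergence
    V_star = backup(V_star)

V = V_star + np.array([0.5, -0.5])    # an imperfect value function
eps = np.max(np.abs(backup(V) - V))   # max-norm Bellman residual

pi = np.argmax(rew + gamma * V[nxt], axis=1)   # greedy policy w.r.t. V
V_pi = np.zeros(2)
for _ in range(2000):                 # exact evaluation of the greedy policy
    V_pi = rew[np.arange(2), pi] + gamma * V_pi[nxt[np.arange(2), pi]]

loss = np.max(np.abs(V_star - V_pi))  # how far from optimal the greedy policy is
bound = 2 * gamma * eps / (1 - gamma)
```

In this example the greedy policy is genuinely suboptimal (the perturbation flips the apparent best action in state 0), so the loss is nonzero yet still sits under the residual-based bound.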
Temporal sequence learning, prediction and control - a review of different models and their relation to biological mechanisms
 Neural Computation
, 2004
Abstract

Cited by 26 (5 self)
In this article we compare methods for temporal sequence learning (TSL) across the disciplines of machine control, classical conditioning, neuronal models for TSL, and spike-timing-dependent plasticity. This review will briefly introduce the most influential models and focus on two questions: 1) To what degree are reward-based (e.g. TD learning) and correlation-based (Hebbian) learning related? and 2) How do the different models correspond to possibly underlying biological mechanisms of synaptic plasticity? We will first compare the different models in an open-loop condition, where behavioral feedback does not alter the learning. Here we observe that reward-based and correlation-based learning are indeed very similar. Machine control is then used to introduce the problem of closed-loop control (e.g. “actor-critic architectures”). Here the problem of evaluative (“rewards”) versus non-evaluative (“correlations”) feedback from the environment will be discussed, showing that the two learning approaches are fundamentally different in the closed-loop condition. In trying to answer the second question, we will compare neuronal versions of the different learning architectures to the anatomy of the involved brain structures (basal ganglia, thalamus and
Value Function Based Production Scheduling
 In International Conference on Machine Learning
, 1998
Abstract

Cited by 17 (1 self)
Production scheduling, the problem of sequentially configuring a factory to meet forecasted demands, is a critical problem throughout the manufacturing industry. The requirement of maintaining product inventories in the face of unpredictable demand and stochastic factory output makes standard scheduling models, such as job-shop, inadequate. Currently applied algorithms, such as simulated annealing and constraint propagation, must employ ad hoc methods such as frequent replanning to cope with uncertainty. In this paper, we describe a Markov Decision Process (MDP) formulation of production scheduling which captures stochasticity in both production and demands. The solution to this MDP is a value function which can be used to generate optimal scheduling decisions online. A simple example illustrates the theoretical superiority of this approach over replanning-based methods. We then describe an industrial application and two reinforcement learning methods for generating an approximate valu...
A Study on Architecture, Algorithms, and Applications of Approximate Dynamic Programming Based Approach to Optimal Control
, 2004
Learning Adaptive Reactive Agents
, 1997
Abstract
this document, we will use the term learning to denote knowledge-level learning and the term adaptation to denote symbol-level learning. Intuitively, learning involves the acquisition of new knowledge, and adaptation involves modifying the internal self of the agent. This section discusses the classification of agents according to these dimensions. The classification is useful for characterizing the research in this area.

Two main factors influence the actions the agent chooses to execute at any given stage: the agent's state and the policy. The state of the agent characterizes the current situation of the agent, and the policy represents the decision-making strategy of the agent. Thus, at any given decision stage, the agent applies its policy to the current state to decide what action to execute next. The main objective of the policy is to map states to actions in a way that produces efficient task performance when interacting with the environment. The implementation of the policy can range from a simple lookup in a table mapping the appropriate action for all possible states to a complex computation that depends exclusively on the entire state history.

There are two dimensions that influence the selection of the action, and they can be used to usefully classify agents. These dimensions correspond to whether the policy remains fixed or changes with experience (i.e., adaptation), and whether or not the state is capable of incorporating new information about the unknown properties of the environment with new sensations (i.e., learning). An agent using a fixed policy has a strategy for selecting actions that remains constant over time. Such an agent will respond with the same action whenever it finds itself in the same state, or, in the case of a stochastic ...
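The table-lookup end of the spectrum described above, and the fixed-versus-adaptive distinction, can be sketched in a few lines. The state and action names and the adaptation rule here are invented for illustration, not taken from the document:

```python
# Fixed policy: a static table from states to actions.
fixed_policy = {"at_door": "open", "at_wall": "turn", "clear": "forward"}

def act_fixed(state):
    return fixed_policy[state]        # same state -> same action, always

# Adaptive policy: starts from the same table but changes with experience.
adaptive_policy = dict(fixed_policy)

def update(state, action, reward):
    """Crude adaptation rule (an assumption): after a penalty, stop choosing
    the punished action in that state."""
    if reward < 0:
        adaptive_policy[state] = "turn" if action != "turn" else "forward"
```

After a penalty the adaptive agent responds differently the next time it sees the same state, while the fixed agent's table, and hence its behavior, never changes.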
From Single-Agent to Multi-Agent Reinforcement Learning: Foundational Concepts and Methods. Learning Theory Course
Abstract
Interest in robotic and software agents has increased greatly in the last decades. They allow us to do tasks that we would hardly accomplish otherwise. In particular, multi-agent systems motivate distributed solutions that can be cheaper and more efficient than centralized single-agent ones. In this context, reinforcement learning provides a way for agents to compute optimal ways of performing the required tasks, with just a small instruction indicating whether or not the task was accomplished. Learning in multi-agent systems, however, poses the problem of non-stationarity due to interactions with other agents. In fact, RL methods for the single-agent domain assume stationarity of the environment and cannot be applied directly. This work is divided into two main parts. In the first, the reinforcement learning framework for single-agent domains is analyzed and some classical solutions presented, based on Markov decision processes. In the second part, the multi-agent domain is analyzed, borrowing tools