Reinforcement Learning with Replacing Eligibility Traces
Machine Learning, 1996
Cited by 168 (8 self)
The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it theoretically, and show that it results in faster, more reliable learning than the conventional trace. Both kinds of trace assign credit to prior events according to how recently they occurred, but only the conventional trace gives greater credit to repeated events. Our analysis is for conventional and replace-trace versions of the offline TD(1) algorithm applied to undiscounted absorbing Markov chains. First, we show that these methods converge under repeated presentations of the training set to the same predictions as two well-known Monte Carlo methods. We then analyze the relative efficiency of the two Monte Carlo methods. We show that the method corresponding to conventional TD is biased, whereas the method corresponding to replace-trace TD is unbiased. In addition, we show that t...
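The difference between the two kinds of trace described in this abstract can be sketched in a few lines. This is a minimal illustration assuming a tabular state representation; the function names, the dictionary layout, and the example parameter values are hypothetical and not taken from the paper.

```python
def update_traces(traces, visited_state, gamma, lam, kind="accumulating"):
    """Decay every trace by gamma * lam, then credit the state just visited."""
    decayed = {s: gamma * lam * e for s, e in traces.items()}
    if kind == "accumulating":
        # Conventional trace: repeated visits pile up extra credit.
        decayed[visited_state] = decayed.get(visited_state, 0.0) + 1.0
    elif kind == "replacing":
        # Replacing trace: a revisit resets the trace to 1 instead of adding.
        decayed[visited_state] = 1.0
    else:
        raise ValueError("kind must be 'accumulating' or 'replacing'")
    return decayed


def run_traces(visits, gamma, lam, kind):
    """Apply the trace update along a sequence of visited states."""
    traces = {}
    for state in visits:
        traces = update_traces(traces, state, gamma, lam, kind)
    return traces


# State "A" is visited twice in a row, then "B" once (illustrative values).
accumulating = run_traces(["A", "A", "B"], gamma=1.0, lam=0.5, kind="accumulating")
replacing = run_traces(["A", "A", "B"], gamma=1.0, lam=0.5, kind="replacing")
```

Both variants weight states by recency, but only the accumulating variant ends with a larger trace for "A" because of the repeated visit.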
Incremental Multi-Step Q-Learning
Machine Learning, 1996
Cited by 81 (2 self)
This paper presents a novel incremental algorithm that combines Q-learning, a well-known dynamic programming-based reinforcement learning method, with the TD(λ) return estimation process, which is typically used in actor-critic learning, another well-known dynamic programming-based reinforcement learning method. The parameter λ is used to distribute credit throughout sequences of actions, leading to faster learning and also helping to alleviate the non-Markovian effect of coarse state-space quantization. The resulting algorithm, Q(λ)-learning, thus combines some of the best features of the Q-learning and actor-critic learning paradigms. The behavior of this algorithm has been demonstrated through computer simulations. Keywords: reinforcement learning, temporal difference learning 1. Introduction The incremental multi-step Q-learning (Q(λ)-learning) method is a new direct (or model-free) algorithm that extends the one-step Q-learning algorithm (Watkins 1989) by combining it with TD(λ) ret...
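As a rough illustration of how a TD(λ)-style return distributes credit along a sequence, the sketch below computes λ-returns with the standard backward recursion G_t = r_t + γ((1 - λ)V(s_{t+1}) + λG_{t+1}). It is not the paper's incremental algorithm; the function name and argument layout are assumptions made for this example, and in a Q-learning setting the entries of next_values would typically be max-over-actions Q estimates.

```python
def lambda_returns(rewards, next_values, gamma, lam):
    """Compute lambda-returns for one trajectory by a backward pass.

    rewards[t]     : reward received on step t
    next_values[t] : current value estimate of the state reached after step t
                     (use 0.0 for a terminal state)
    """
    if not rewards:
        return []
    returns = [0.0] * len(rewards)
    # Beyond the last step, fall back on the estimate of the final state.
    g_next = next_values[-1]
    for t in reversed(range(len(rewards))):
        # Blend the one-step bootstrap with the longer return that follows it.
        g = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * g_next)
        returns[t] = g
        g_next = g
    return returns
```

With lam=0.0 this reduces to the one-step target used by ordinary Q-learning, and with lam=1.0 it becomes the full sampled return; intermediate values interpolate between the two.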
Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces
1996
Cited by 77 (6 self)
A key element in the solution of reinforcement learning problems is the value function. The purpose of this function is to measure the long-term utility or value of any given state, and it is important because an agent can use it to decide what to do next. A common problem in reinforcement learning when applied to systems having continuous state and action spaces is that the value function must operate with a domain consisting of real-valued variables, which means that it should be able to represent the value of infinitely many state and action pairs. For this reason, function approximators are used to represent the value function when a closed-form solution of the optimal policy is not available. In this paper, we extend a previously proposed reinforcement learning algorithm so that it can be used with function approximators that generalize the value of individual experiences across both state and action spaces. In particular, we discuss the benefits of using sparse coarse-coded funct...
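To make the idea of a sparse coarse-coded approximator concrete, here is a minimal one-dimensional tile-coding sketch. It is an illustration only, not the approximator used in the paper: the class name, the number of tilings, the offsets, and the step size are all assumptions, and a real application would code state and action jointly.

```python
class TileCoder1D:
    """Several overlapping, offset tilings over the interval [low, high]."""

    def __init__(self, low, high, tiles_per_tiling=8, num_tilings=4, step_size=0.1):
        self.low = low
        self.width = (high - low) / tiles_per_tiling
        self.num_tilings = num_tilings
        self.tiles = tiles_per_tiling + 1  # one spare tile absorbs the offsets
        self.alpha = step_size / num_tilings
        self.weights = [[0.0] * self.tiles for _ in range(num_tilings)]

    def active_tiles(self, x):
        """Return one active tile index per tiling (the sparse feature set)."""
        indices = []
        for k in range(self.num_tilings):
            offset = k * self.width / self.num_tilings
            i = int((x - self.low + offset) / self.width)
            indices.append(min(max(i, 0), self.tiles - 1))
        return indices

    def value(self, x):
        """The estimate is the sum of the weights of the active tiles."""
        return sum(self.weights[k][i] for k, i in enumerate(self.active_tiles(x)))

    def update(self, x, target):
        """Move the estimate at x toward target; nearby inputs share the change."""
        error = target - self.value(x)
        for k, i in enumerate(self.active_tiles(x)):
            self.weights[k][i] += self.alpha * error


coder = TileCoder1D(0.0, 1.0, step_size=1.0)
coder.update(0.5, 2.0)  # one experience at x = 0.5
```

After this single update, inputs that share tiles with 0.5 inherit part of the new value while distant inputs are unaffected, which is the kind of generalization across experiences the abstract refers to.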
Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions
1993
Cited by 72 (1 self)
Consider a given value function on states of a Markov decision problem, as might result from applying a reinforcement learning algorithm. Unless this value function equals the corresponding optimal value function, at some states there will be a discrepancy, which is natural to call the Bellman residual, between what the value function specifies at that state and what is obtained by a one-step lookahead along the seemingly best action at that state using the given value function to evaluate all succeeding states. This paper derives a tight bound on how far from optimal the discounted return for a greedy policy based on the given value function will be as a function of the maximum norm magnitude of this Bellman residual. A corresponding result is also obtained for value functions defined on state-action pairs, as are used in Q-learning. One significant application of these results is to problems where a function approximator is used to learn a value function, with training of the approxi...
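The two quantities this abstract relates, the maximum-norm Bellman residual of a value function and the loss of the greedy policy derived from it, can be computed directly on a small example. The sketch below assumes a hypothetical two-state MDP stored as nested dictionaries; the data layout, function names, and numbers are illustrative, and the paper's bound itself is not restated here.

```python
def backup(mdp, values, state, action, gamma):
    """One-step lookahead value of taking `action` in `state`."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in mdp[state][action])


def bellman_residual(mdp, values, gamma):
    """Maximum-norm gap between the values and their one-step greedy backup."""
    return max(
        abs(values[s] - max(backup(mdp, values, s, a, gamma) for a in mdp[s]))
        for s in mdp
    )


def greedy_policy(mdp, values, gamma):
    """Pick, in each state, the action that looks best under `values`."""
    return {s: max(mdp[s], key=lambda a: backup(mdp, values, s, a, gamma)) for s in mdp}


def evaluate_policy(mdp, policy, gamma, sweeps=2000):
    """Iteratively approximate the discounted return of a fixed policy."""
    values = {s: 0.0 for s in mdp}
    for _ in range(sweeps):
        values = {s: backup(mdp, values, s, policy[s], gamma) for s in mdp}
    return values


def optimal_values(mdp, gamma, sweeps=2000):
    """Plain value iteration for the optimal value function."""
    values = {s: 0.0 for s in mdp}
    for _ in range(sweeps):
        values = {s: max(backup(mdp, values, s, a, gamma) for a in mdp[s]) for s in mdp}
    return values


def greedy_loss(mdp, values, gamma):
    """Largest shortfall of the greedy policy's return relative to optimal."""
    best = optimal_values(mdp, gamma)
    achieved = evaluate_policy(mdp, greedy_policy(mdp, values, gamma), gamma)
    return max(best[s] - achieved[s] for s in mdp)


# Hypothetical MDP: mdp[state][action] = [(probability, next_state, reward), ...]
TOY_MDP = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 1.0)], "go": [(1.0, 0, 0.0)]},
}
imperfect = {0: 5.0, 1: 0.0}
residual = bellman_residual(TOY_MDP, imperfect, gamma=0.9)
loss = greedy_loss(TOY_MDP, imperfect, gamma=0.9)
```

When the supplied values equal the optimal ones, both the residual and the loss vanish; the abstract's result concerns how large the loss can become as the residual grows.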
Temporal sequence learning, prediction and control - a review of different models and their relation to biological mechanisms
Neural Computation, 2004
Cited by 17 (3 self)
In this article we compare methods for temporal sequence learning (TSL) across the disciplines of machine control, classical conditioning, neuronal models for TSL, and spike-timing-dependent plasticity. This review will briefly introduce the most influential models and focus on two questions: 1) To what degree are reward-based (e.g., TD-learning) and correlation-based (Hebbian) learning related? and 2) How do the different models correspond to possibly underlying biological mechanisms of synaptic plasticity? We will first compare the different models in an open-loop condition, where behavioral feedback does not alter the learning. Here we observe that reward-based and correlation-based learning are indeed very similar. Machine control is then used to introduce the problem of closed-loop control (e.g., “actor-critic architectures”). Here the problem of evaluative (“rewards”) versus nonevaluative (“correlations”) feedback from the environment will be discussed, showing that both learning approaches are fundamentally different in the closed-loop condition. In trying to answer the second question we will compare neuronal versions of the different learning architectures to the anatomy of the involved brain structures (basal-ganglia, thalamus and
Value Function Based Production Scheduling
In International Conference on Machine Learning, 1998
Cited by 16 (1 self)
Production scheduling, the problem of sequentially configuring a factory to meet forecasted demands, is a critical problem throughout the manufacturing industry. The requirement of maintaining product inventories in the face of unpredictable demand and stochastic factory output makes standard scheduling models, such as job-shop, inadequate. Currently applied algorithms, such as simulated annealing and constraint propagation, must employ ad-hoc methods such as frequent replanning to cope with uncertainty. In this paper, we describe a Markov Decision Process (MDP) formulation of production scheduling which captures stochasticity in both production and demands. The solution to this MDP is a value function which can be used to generate optimal scheduling decisions online. A simple example illustrates the theoretical superiority of this approach over replanning-based methods. We then describe an industrial application and two reinforcement learning methods for generating an approximate valu...
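The idea of solving an MDP for a value function and then using it to make scheduling decisions online can be sketched on a toy problem. Everything below is an illustrative assumption rather than the paper's formulation: a single product, an inventory capped at three units, made-up probabilities and costs, and exact value iteration in place of the approximate reinforcement learning methods the abstract mentions.

```python
# Toy single-product factory; every number here is an illustrative assumption.
MAX_INVENTORY = 3
ACTIONS = (0, 1)        # 0 = leave the line idle, 1 = run the line
P_OUTPUT = 0.8          # chance that a run actually yields one unit
P_DEMAND = 0.5          # chance that one unit is demanded this period
HOLDING_COST = 1.0      # cost per unit left in stock
STOCKOUT_COST = 10.0    # cost per unit of unmet demand
GAMMA = 0.9


def transitions(inventory, action):
    """Yield (probability, next_inventory, reward) for one period."""
    output_cases = [(P_OUTPUT, 1), (1 - P_OUTPUT, 0)] if action == 1 else [(1.0, 0)]
    for p_out, output in output_cases:
        for p_dem, demand in [(P_DEMAND, 1), (1 - P_DEMAND, 0)]:
            available = inventory + output
            shortfall = max(demand - available, 0)
            next_inventory = min(max(available - demand, 0), MAX_INVENTORY)
            reward = -HOLDING_COST * next_inventory - STOCKOUT_COST * shortfall
            yield p_out * p_dem, next_inventory, reward


def q_value(values, inventory, action):
    """Expected discounted return of `action`, looking one period ahead."""
    return sum(p * (r + GAMMA * values[s2]) for p, s2, r in transitions(inventory, action))


def value_iteration(sweeps=500):
    """Solve the toy MDP for its value function."""
    values = [0.0] * (MAX_INVENTORY + 1)
    for _ in range(sweeps):
        values = [max(q_value(values, s, a) for a in ACTIONS)
                  for s in range(MAX_INVENTORY + 1)]
    return values


def schedule(values, inventory):
    """Online decision: act greedily with respect to the value function."""
    return max(ACTIONS, key=lambda a: q_value(values, inventory, a))


values = value_iteration()
decision_when_empty = schedule(values, 0)
```

Because uncertainty in output and demand is already folded into the value function, each online decision is a single one-step lookahead rather than a fresh replanning pass.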
A Study on Architecture, Algorithms, and Applications of Approximate Dynamic Programming Based Approach to Optimal Control
2004
Learning Adaptive Reactive Agents
1997
this document, we will use the term learning to denote knowledge-level learning and the term adaptation to denote symbol-level learning. Intuitively, learning involves the acquisition of new knowledge, and adaptation involves modifying the internal self of the agent. This section discusses the classification of agents according to these dimensions. The classification is useful for characterizing the research in this area. Two main factors influence the actions the agent chooses to execute at any given stage: the agent's state and the policy. The state of the agent characterizes the current situation of the agent, and the policy represents the decision-making strategy of the agent. Thus, at any given decision stage, the agent applies its policy to the current state to decide what action to execute next. The main objective of the policy is to map states to actions in a way that produces efficient task performance when interacting with the environment. The implementation of the policy can range from a simple lookup in a table mapping the appropriate action for all possible states to a complex computation that depends exclusively on the entire state history. There are two possible dimensions that influence the selection of the action, and they can be used to classify agents. These dimensions correspond to whether the policy remains fixed or changes with experience (i.e., adaptation), and whether or not the state is capable of incorporating new information about the unknown properties of the environment with new sensations (i.e., learning). An agent using a fixed policy has a strategy for selecting actions that remains constant over time. Such an agent will respond with the same action whenever it finds itself in the same state, or, in case of a stochastic ...

