On-line Q-learning using connectionist systems (1994)

by G A Rummery, M Niranjan

Results 1 - 10 of 381

Reinforcement Learning I: Introduction

by Richard S. Sutton, Andrew G. Barto , 1998
"... In which we try to give a basic intuitive sense of what reinforcement learning is and how it differs and relates to other fields, e.g., supervised learning and neural networks, genetic algorithms and artificial life, control theory. Intuitively, RL is trial and error (variation and selection, search ..."
Abstract - Cited by 5614 (118 self) - Add to MetaCart
In which we try to give a basic intuitive sense of what reinforcement learning is and how it differs and relates to other fields, e.g., supervised learning and neural networks, genetic algorithms and artificial life, control theory. Intuitively, RL is trial and error (variation and selection, search) plus learning (association, memory). We argue that RL is the only field that seriously addresses the special features of the problem of learning from interaction to achieve long-term goals.

Reinforcement learning: a survey

by Leslie Pack Kaelbling, Michael L. Littman, Andrew W. Moore - Journal of Artificial Intelligence Research , 1996
"... This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem ..."
Abstract - Cited by 1714 (25 self) - Add to MetaCart
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.

Citation Context

...pects of the experiments: 1. Small changes to the task specifications. 2. A very different kind of function approximator (CMAC [2]) that has weak generalization. 3. A different learning algorithm: SARSA [95] instead of value iteration. 4. A different training regime. Boyan and Moore sampled states uniformly in state space, whereas Sutton's method sampled along empirical trajectories. There are intuitive r...

Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition

by Thomas G. Dietterich - Journal of Artificial Intelligence Research , 2000
"... This paper presents a new approach to hierarchical reinforcement learning based on decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs and decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs. Th ..."
Abstract - Cited by 443 (6 self) - Add to MetaCart
This paper presents a new approach to hierarchical reinforcement learning based on decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs and decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs. The decomposition, known as the MAXQ decomposition, has both a procedural semantics---as a subroutine hierarchy---and a declarative semantics---as a representation of the value function of a hierarchical policy. MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, Kaelbling, and Dayan and Hinton. It is based on the assumption that the programmer can identify useful subgoals and define subtasks that achieve these subgoals. By defining such subgoals, the programmer constrains the set of policies that need to be considered during reinforcement learning. The MAXQ value function decomposition can represent the value function of any policy that is consisten...
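
The additive decomposition described above can be illustrated with a short recursive evaluation: the value of invoking a child subtask is the child's own value plus a completion term for finishing the parent afterwards. The sketch below is a minimal reading of that idea, assuming hypothetical dictionaries for the completion function C and the primitive-action values; it is not Dietterich's implementation.

    # Minimal sketch of the MAXQ additive decomposition (illustrative data layout).
    # V_primitive[(a, s)] : estimated expected immediate reward of primitive action a in state s
    # C[(i, s, a)]        : estimated value of completing parent task i after child a finishes in s
    # children[i]         : child subtasks/actions available to composite task i

    def q_value(i, s, a, V_primitive, C, children):
        """Q(i, s, a) = V(a, s) + C(i, s, a): do child a, then complete task i."""
        return v_value(a, s, V_primitive, C, children) + C.get((i, s, a), 0.0)

    def v_value(i, s, V_primitive, C, children):
        """V(i, s): immediate reward for a primitive, best child Q-value for a composite."""
        if i not in children:                                  # primitive action
            return V_primitive.get((i, s), 0.0)
        return max(q_value(i, s, a, V_primitive, C, children)  # composite task
                   for a in children[i])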

Citation Context

...erge to the cumulative reward of the optimal policy for the MDP. In this paper, we will make use of two well-known learning algorithms: Q learning (Watkins, 1989; Watkins & Dayan, 1992) and SARSA(0) (Rummery & Niranjan, 1994). Both of these algorithms maintain a tabular representation of the action-value function Q(s,a). Every entry of the table is initialized arbitrarily. In Q learning, after the algorithm has observed ...
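
A minimal tabular SARSA(0) update of the kind this excerpt describes is sketched below; the step size, discount factor, and variable names are illustrative choices rather than anything taken from the cited papers.

    from collections import defaultdict

    # Tabular action-value function Q(s, a); entries start at an arbitrary value (here 0.0).
    Q = defaultdict(float)

    def sarsa0_update(s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
        """One SARSA(0) step: move Q(s, a) toward r + gamma * Q(s', a'),
        where a' is the action the agent actually takes in s' (on-policy)."""
        target = r + gamma * Q[(s_next, a_next)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])

Q-learning differs only in the target, which uses the maximum action value in s' rather than the value of the action actually taken.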

Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding

by Richard S. Sutton - Advances in Neural Information Processing Systems 8 , 1996
"... On large problems, reinforcement learning systems must use parameterized function approximators such as neural networks in order to generalize between similar situations and actions. In these cases there are no strong theoretical results on the accuracy of convergence, and computational results have ..."
Abstract - Cited by 433 (20 self) - Add to MetaCart
On large problems, reinforcement learning systems must use parameterized function approximators such as neural networks in order to generalize between similar situations and actions. In these cases there are no strong theoretical results on the accuracy of convergence, and computational results have been mixed. In particular, Boyan and Moore reported at last year's meeting a series of negative results in attempting to apply dynamic programming together with function approximation to simple control problems with continuous state spaces. In this paper, we present positive results for all the control tasks they attempted, and for one that is significantly larger. The most important differences are that we used sparse-coarse-coded function approximators (CMACs) whereas they used mostly global function approximators, and that we learned online whereas they learned offline. Boyan and Moore and others have suggested that the problems they encountered could be solved by using actual outcomes (...
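
The sparse coarse coding used here (tile coding, or CMAC) can be sketched roughly as follows: several overlapping grids, each offset slightly, map a continuous state to one active tile per grid, and the value estimate is the sum of the weights of the active tiles. The grid layout and offsets below are simplified stand-ins, not the paper's actual CMAC configuration.

    import numpy as np

    def active_tiles(x, y, num_tilings=8, tiles_per_dim=8):
        """One active tile index per tiling for a 2-D state in [0, 1)^2.
        Each tiling is shifted by a different fraction of a tile width."""
        tiles = []
        for t in range(num_tilings):
            offset = t / (num_tilings * tiles_per_dim)
            xi = int((x + offset) * tiles_per_dim) % tiles_per_dim
            yi = int((y + offset) * tiles_per_dim) % tiles_per_dim
            tiles.append(t * tiles_per_dim * tiles_per_dim + yi * tiles_per_dim + xi)
        return tiles

    # Linear value estimate over a sparse binary feature vector: only the active
    # tiles contribute, which is what makes learning fast and local.
    weights = np.zeros(8 * 8 * 8)

    def value(x, y):
        return sum(weights[i] for i in active_tiles(x, y))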

Near-optimal reinforcement learning in polynomial time

by Michael Kearns - Machine Learning , 1998
"... We present new algorithms for reinforcement learning, and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the m ..."
Abstract - Cited by 304 (5 self) - Add to MetaCart
We present new algorithms for reinforcement learning, and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the Exploration-Exploitation trade-off.

Citation Context

...rategies that guarantee both sufficient exploration for asymptotic convergence to optimal actions, and asymptotic exploitation, for both the Q-learning and SARSA algorithms (a variant of Q-learning) (Rummery & Niranjan, 1994; Singh & Sutton, 1996; Sutton, 1995). Gullapalli and Barto (1994) and Jalali and Ferguson (1989) presented algorithms that learn a model of the environment from experience, perform value iteration on...
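
One common way to obtain both sufficient exploration and asymptotic exploitation, as described here, is an epsilon-greedy policy whose epsilon decays with the visit count of the current state; the 1/N(s) schedule below is a standard illustrative choice, not necessarily the one analyzed in the cited work.

    import random

    def glie_epsilon_greedy(Q, s, actions, visit_count):
        """Epsilon-greedy action selection with epsilon = 1 / N(s).
        As N(s) grows the policy explores less and becomes greedy in the limit."""
        visit_count[s] = visit_count.get(s, 0) + 1
        epsilon = 1.0 / visit_count[s]
        if random.random() < epsilon:
            return random.choice(actions)                      # explore
        return max(actions, key=lambda a: Q.get((s, a), 0.0))  # exploit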

Reinforcement Learning with Replacing Eligibility Traces

by Satinder Singh, Richard S. Sutton - MACHINE LEARNING , 1996
"... The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it theoretically, and show that it results in faster, more reliable learning than the conventional ..."
Abstract - Cited by 241 (14 self) - Add to MetaCart
The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it theoretically, and show that it results in faster, more reliable learning than the conventional trace. Both kinds of trace assign credit to prior events according to how recently they occurred, but only the conventional trace gives greater credit to repeated events. Our analysis is for conventional and replace-trace versions of the offline TD(1) algorithm applied to undiscounted absorbing Markov chains. First, we show that these methods converge under repeated presentations of the training set to the same predictions as two well known Monte Carlo methods. We then analyze the relative efficiency of the two Monte Carlo methods. We show that the method corresponding to conventional TD is biased, whereas the method corresponding to replace-trace TD is unbiased. In addition, we show that t...
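
The distinction the abstract draws between the two traces is easiest to see in the update itself; the sketch below shows both variants for a tabular trace, with gamma and lambda as the usual decay parameters (illustrative code, not the authors' implementation).

    def update_traces(e, visited, gamma, lam, replacing=True):
        """Decay all eligibility traces, then credit the state just visited.
        Accumulating trace: e[s] += 1, so repeated visits pile up credit.
        Replacing trace:    e[s]  = 1, so repeated visits reset credit to 1."""
        for s in e:
            e[s] *= gamma * lam
        if replacing:
            e[visited] = 1.0
        else:
            e[visited] = e.get(visited, 0.0) + 1.0
        return e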

Recent advances in hierarchical reinforcement learning

by Andrew G. Barto , 2003
"... A preliminary unedited version of this paper was incorrectly published as part of Volume ..."
Abstract - Cited by 229 (24 self) - Add to MetaCart
A preliminary unedited version of this paper was incorrectly published as part of Volume

Citation Context

...eters (e.g., multilayer neural networks) can be effective for difficult problems (e.g., refs. [11, 40, 64, 75]). Of the many RL algorithms, perhaps the most widely used are Q-learning [82, 83] and Sarsa [59, 70]. Q-learning is based on the DP backup (5) but with the expected immediate reward and the expected maximum action-value of the successor state on the right-hand side of (5) respectively replaced by a s...
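
The contrast drawn in this excerpt, with the expectations in the DP backup replaced by a single sampled reward and successor state, is exactly the one-step tabular Q-learning update; the sketch below is illustrative, with assumed names and parameters.

    def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
        """Sample backup: the observed reward r and the max action value of the one
        sampled successor s_next stand in for the expectations in the DP backup."""
        q_sa = Q.get((s, a), 0.0)
        best_next = max(Q.get((s_next, b), 0.0) for b in actions)
        Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)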

Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms

by Satinder Singh, Tommi Jaakkola, Michael L. Littman, Csaba Szepesvári - MACHINE LEARNING , 1998
"... An important application of reinforcement learning (RL) is to finite-state control problems and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform e ..."
Abstract - Cited by 154 (7 self) - Add to MetaCart
An important application of reinforcement learning (RL) is to finite-state control problems and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning that result in convergence to both optimal values and optimal policies.

Citation Context

...s actions uniformly at random. Later, we describe several other learning policies that result in convergence when combined with the Q-learning update rule. The update rule for sarsa(0) (Rummery, 1994; Rummery & Niranjan, 1994; John, 1994, 1995; Singh & Sutton, 1995; Sutton, 1996) is quite similar to Q-learning: Q_{t+1}(s_t, a_t) = (1 - α_t(s_t, a_t)) Q_t(s_t, a_t) + α_t(s_t, a_t) [r_t + γ Q_t(s_{t+1}, a_{t+1})]  (3). The main difference is that Q-learning m...
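
Equation (3) indexes the learning rate by the state-action pair; one standard way to realize such a schedule (an illustrative choice, not necessarily the paper's) is to decay it with the number of visits to that pair, e.g. α = 1/N(s, a), which satisfies the usual stochastic-approximation conditions.

    def sarsa0_step(Q, N, s, a, r, s_next, a_next, gamma=0.99):
        """Apply update (3) with a per-pair step size alpha = 1 / N(s, a)."""
        N[(s, a)] = N.get((s, a), 0) + 1
        alpha = 1.0 / N[(s, a)]
        target = r + gamma * Q.get((s_next, a_next), 0.0)
        Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target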

Reinforcement learning for RoboCup-soccer keepaway

by Peter Stone, Richard S. Sutton, Gregory Kuhlmann - Adaptive Behavior , 2005
"... 1 RoboCup simulated soccer presents many challenges to reinforcement learning methods, in-cluding a large state space, hidden and uncertain state, multiple independent agents learning simultaneously, and long and variable delays in the effects of actions. We describe our appli-cation of episodic SMD ..."
Abstract - Cited by 134 (36 self) - Add to MetaCart
RoboCup simulated soccer presents many challenges to reinforcement learning methods, including a large state space, hidden and uncertain state, multiple independent agents learning simultaneously, and long and variable delays in the effects of actions. We describe our application of episodic SMDP Sarsa(λ) with linear tile-coding function approximation and variable λ to learning higher-level decisions in a keepaway subtask of RoboCup soccer. In keepaway, one team, “the keepers,” tries to keep control of the ball for as long as possible despite the efforts of “the takers.” The keepers learn individually when to hold the ball and when to pass to a teammate. Our agents learned policies that significantly outperform a range of benchmark policies. We demonstrate the generality of our approach by applying it to a number of task variations including different field sizes and different numbers of players on each team.

Citation Context

...forcement Learning Algorithm We use the SMDP version of the Sarsa(λ) algorithm with linear tile-coding function approximation (also known as CMACs) and replacing eligibility traces (see (Albus, 1981; Rummery & Niranjan, 1994; Sutton & Barto, 1998)). Each player learns simultaneously and independently from its own actions and its own perception of the state. Note that as a result, the value of a player’s decision depen...
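
Putting the pieces in this excerpt together (linear function approximation over binary tile features, replacing eligibility traces, and an on-policy Sarsa target), one learning step might look roughly like the sketch below; the feature extraction and parameter values are assumed, not taken from the keepaway implementation.

    import numpy as np

    def sarsa_lambda_step(w, z, active, active_next, r, gamma, lam, alpha):
        """One linear Sarsa(lambda) step with replacing traces.
        `active` / `active_next` are indices of the tiles active for the current
        and next state-action pairs; w is the weight vector, z the trace vector."""
        q = w[active].sum()              # Q(s, a) under the linear approximator
        q_next = w[active_next].sum()    # Q(s', a') for the action actually chosen
        delta = r + gamma * q_next - q   # on-policy TD error
        z *= gamma * lam                 # decay all traces
        z[active] = 1.0                  # replacing traces: active features reset to 1
        w += alpha * delta * z           # move every weight along its trace
        return w, z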

Algorithms for Reinforcement Learning

by Csaba Szepesvári , 2009
"... ..."
Abstract - Cited by 129 (6 self) - Add to MetaCart
Abstract not found