Results 1–10 of 106
Reinforcement learning: a survey
 Journal of Artificial Intelligence Research
, 1996
Abstract

Cited by 1586 (25 self)
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
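The trial-and-error learning and the exploration/exploitation trade-off this survey discusses can be illustrated with a minimal sketch. This is my own toy example, not code from the survey: epsilon-greedy tabular Q-learning on a hypothetical 5-state chain, where action 1 moves right, action 0 moves left, and reaching the last state yields reward 1 and ends the episode.

```python
import random

def q_learning(n_states=5, n_actions=2, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # Trade off exploration and exploitation: mostly greedy,
            # occasionally a random action.
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == n_states - 1 else 0.0
            # One-step Q-learning backup (Watkins 1989).
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```

After training, the learned values favor moving right toward the reward from every state, showing how delayed reinforcement propagates backward through the chain.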
On-Line Q-Learning Using Connectionist Systems
, 1994
Abstract

Cited by 352 (1 self)
Reinforcement learning algorithms are a powerful machine learning technique. However, much of the work on these algorithms has been developed with regard to discrete finite-state Markovian problems, which is too restrictive for many real-world environments. Therefore, it is desirable to extend these methods to high-dimensional continuous state-spaces, which requires the use of function approximation to generalise the information learnt by the system. In this report, the use of backpropagation neural networks (Rumelhart, Hinton and Williams 1986) is considered in this context. We consider a number of different algorithms based around Q-Learning (Watkins 1989) combined with the Temporal Difference algorithm (Sutton 1988), including a new algorithm (Modified Connectionist Q-Learning), and Q(λ) (Peng and Williams 1994). In addition, we present algorithms for applying these updates online during trials, unlike backward replay used by Lin (1993) that requires waiting until the end of each t...
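The core idea, Q-learning combined with a trained function approximator, can be sketched as follows. This is my own illustration, not the report's code, and it substitutes a linear approximator for the backpropagation networks the report actually uses: Q(s, a) is the dot product of a weight vector with a feature vector, and the temporal-difference error drives a gradient step.

```python
def td_q_update(w, phi_s_a, r, max_q_next, alpha=0.1, gamma=0.9):
    """One gradient-descent Q-learning step on linear weights w.

    phi_s_a: feature vector for the (state, action) pair just taken.
    max_q_next: the approximator's max_a' Q(s', a') at the next state.
    """
    q = sum(wi * xi for wi, xi in zip(w, phi_s_a))   # Q(s, a) = w . phi
    delta = r + gamma * max_q_next - q               # TD error
    # Gradient of the linear Q w.r.t. w is just phi, so the step is
    # alpha * delta * phi; an MLP would backpropagate delta instead.
    return [wi + alpha * delta * xi for wi, xi in zip(w, phi_s_a)]
```

The same delta would be backpropagated through an MLP in the connectionist setting; the linear case keeps the update explicit.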
Reinforcement Learning with Replacing Eligibility Traces
 Machine Learning
, 1996
Abstract

Cited by 219 (13 self)
The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it theoretically, and show that it results in faster, more reliable learning than the conventional trace. Both kinds of trace assign credit to prior events according to how recently they occurred, but only the conventional trace gives greater credit to repeated events. Our analysis is for conventional and replace-trace versions of the offline TD(1) algorithm applied to undiscounted absorbing Markov chains. First, we show that these methods converge under repeated presentations of the training set to the same predictions as two well-known Monte Carlo methods. We then analyze the relative efficiency of the two Monte Carlo methods. We show that the method corresponding to conventional TD is biased, whereas the method corresponding to replace-trace TD is unbiased. In addition, we show that t...
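The difference between the two traces described above is small but consequential, and can be shown in a few lines. This sketch is my own, based on the paper's definitions: an accumulating trace adds 1 on each visit, while a replacing trace resets the visited state's trace to 1, so repeated visits earn no extra credit.

```python
def decay_and_bump(trace, visited, lam, gamma, replacing):
    """One time step of trace maintenance for all states.

    trace: dict mapping state -> current eligibility.
    visited: the state visited on this step.
    """
    for s in trace:
        trace[s] *= gamma * lam                      # decay every trace
    if replacing:
        trace[visited] = 1.0                         # replacing trace
    else:
        trace[visited] = trace.get(visited, 0.0) + 1.0  # accumulating trace
    return trace
```

With gamma = lam = 1, visiting the same state twice gives an accumulating trace of 2.0 but a replacing trace of 1.0, which is exactly the repeated-event credit difference the paper analyzes.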
The MAXQ Method for Hierarchical Reinforcement Learning
 In Proceedings of the Fifteenth International Conference on Machine Learning
, 1998
Abstract

Cited by 136 (5 self)
This paper presents a new approach to hierarchical reinforcement learning based on the MAXQ decomposition of the value function. The MAXQ decomposition has both a procedural semantics, as a subroutine hierarchy, and a declarative semantics, as a representation of the value function of a hierarchical policy. MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, Kaelbling, and Dayan and Hinton. Conditions under which the MAXQ decomposition can represent the optimal value function are derived. The paper defines a hierarchical Q-learning algorithm, proves its convergence, and shows experimentally that it can learn much faster than ordinary "flat" Q-learning. Finally, the paper discusses some interesting issues that arise in hierarchical reinforcement learning, including the hierarchical credit assignment problem and non-hierarchical execution of the MAXQ hierarchy.
1 Introduction
Hierarchical approaches to reinforcement learning (RL) problems promise ma...
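The shape of the decomposition can be sketched briefly. This is my own reconstruction of its form, not Dietterich's code: the value of invoking subtask a inside parent task i is Q(i, s, a) = V(a, s) + C(i, s, a), where V is the child's own value and C is a "completion" value for finishing task i afterwards; the V_primitive, C, and children inputs below are hypothetical.

```python
def maxq_value(task, s, V_primitive, C, children):
    """Recursively evaluate V(task, s) under a MAXQ-style decomposition.

    children: dict mapping each composite task to its list of subtasks;
              tasks absent from it are primitive actions.
    V_primitive: value (expected reward) of each primitive action in s.
    C: completion values C[(parent, s, child)].
    """
    if task not in children:                         # primitive action
        return V_primitive[(task, s)]
    # Composite task: best subtask's value plus the completion value.
    return max(maxq_value(a, s, V_primitive, C, children) + C[(task, s, a)]
               for a in children[task])
```

The recursion mirrors the subroutine hierarchy: the root's value is assembled from child values and completion terms rather than stored as one flat table.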
Nash Q-Learning for General-Sum Stochastic Games
 Journal of Machine Learning Research
, 2003
Abstract

Cited by 133 (0 self)
We extend Q-learning to a noncooperative multiagent context, using the framework of general-sum stochastic games. A learning agent maintains Q-functions over joint actions, and performs updates based on assuming Nash equilibrium behavior over the current Q-values. This learning protocol provably converges given certain restrictions on the stage games (defined by Q-values) that arise during learning. Experiments with a pair of two-player grid games suggest that such restrictions on the game structure are not necessarily required. Stage games encountered during learning in both grid environments violate the conditions. However, learning consistently converges in the first grid game, which has a unique equilibrium Q-function, but sometimes fails to converge in the second, which has three different equilibrium Q-functions. In a comparison of offline learning performance in both games, we find agents are more likely to reach a joint optimal path with Nash Q-learning than with a single-agent Q-learning method. When at least one agent adopts Nash Q-learning, the performance of both agents is better than using single-agent Q-learning. We have also implemented an online version of Nash Q-learning that balances exploration with exploitation, yielding improved performance.
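The update described above, backing up through the Nash-equilibrium value of the next state's stage game, can be sketched as follows. This is my own illustration: computing the equilibrium value in general requires a game solver, so the helper below handles only the special case of a pure-strategy Nash equilibrium in a two-player game, which suffices to show the structure.

```python
def pure_nash_value(Q1, Q2):
    """Payoff to player 1 at a pure-strategy Nash equilibrium.

    Q1[a][b], Q2[a][b]: stage-game payoffs to players 1 and 2 for the
    joint action (a, b). Returns the first pure equilibrium found;
    mixed equilibria (the general case) are not handled here.
    """
    for a in Q1:
        for b in Q1[a]:
            best_a = all(Q1[a][b] >= Q1[a2][b] for a2 in Q1)
            best_b = all(Q2[a][b] >= Q2[a][b2] for b2 in Q2[a])
            if best_a and best_b:
                return Q1[a][b]
    raise ValueError("no pure-strategy equilibrium")

def nash_q_update(Q1, s, a, b, r1, nash_next, alpha=0.1, gamma=0.9):
    """Player 1's update toward r1 + gamma * NashQ(s')."""
    Q1[s][a][b] += alpha * (r1 + gamma * nash_next - Q1[s][a][b])
```

The key structural point is that Q-values are indexed by joint actions, and the bootstrapped target uses the stage game's equilibrium value rather than a simple max.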
Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces
, 1996
Abstract

Cited by 113 (6 self)
A key element in the solution of reinforcement learning problems is the value function. The purpose of this function is to measure the long-term utility or value of any given state, and it is important because an agent can use it to decide what to do next. A common problem in reinforcement learning when applied to systems having continuous state and action spaces is that the value function must operate with a domain consisting of real-valued variables, which means that it should be able to represent the value of infinitely many state and action pairs. For this reason, function approximators are used to represent the value function when a closed-form solution of the optimal policy is not available. In this paper, we extend a previously proposed reinforcement learning algorithm so that it can be used with function approximators that generalize the value of individual experiences across both state and action spaces. In particular, we discuss the benefits of using sparse coarse-coded funct...
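Sparse coarse coding of the kind mentioned above can be illustrated with a small sketch of my own (not the paper's implementation): several offset tilings over a continuous 1-D state, where each tiling activates exactly one feature, so nearby states share most of their features and the learned value generalizes locally.

```python
def active_features(x, n_tilings=4, tile_width=1.0):
    """Return the active (tiling, tile) feature for each offset tiling."""
    feats = []
    for t in range(n_tilings):
        offset = t * tile_width / n_tilings          # shift each tiling
        feats.append((t, int((x + offset) // tile_width)))
    return feats

def value(x, weights):
    """Approximate value: sum of weights of the sparse active features."""
    return sum(weights.get(f, 0.0) for f in active_features(x))
```

States 0.9 and 1.0 share three of their four active features, so a value learned at one transfers to the other, while distant states share none.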
Eligibility Traces for Off-Policy Policy Evaluation
 Proceedings of the Seventeenth International Conference on Machine Learning
, 2000
Abstract

Cited by 63 (7 self)
Eligibility traces have been shown to speed reinforcement learning, to make it more robust to hidden states, and to provide a link between Monte Carlo and temporal-difference methods. Here we generalize eligibility traces to off-policy learning, in which one learns about a policy different from the policy that generates the data. Off-policy methods can greatly multiply learning, as many policies can be learned about from the same data stream, and have been identified as particularly useful for learning about subgoals and temporally extended macro-actions. In this paper we consider the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method. We analyze and compare this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling. Our main results are 1) to establish the consistency and bias properties of the new methods and 2) to empirically rank the new methods, showing improvement over one-step and Monte Carlo methods. Our results are restricted to model-free, table-lookup methods and to offline updating (at the end of each episode) although several of the algorithms could be applied more generally.
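The importance-sampling technique the paper builds on can be sketched in a few lines. This is my own illustration, not one of the paper's algorithms: to evaluate a target policy pi from episodes generated by a behavior policy b, weight each observed return by the product of per-step probability ratios pi(a|s) / b(a|s); the episode format below is a hypothetical convention.

```python
def ordinary_is_estimate(episodes, pi, b):
    """Ordinary importance-sampling estimate of the return under pi.

    episodes: list of (steps, G), where steps is a list of (state, action)
    pairs taken and G is the observed return of that episode.
    pi, b: dicts mapping (state, action) -> probability under each policy.
    """
    total = 0.0
    for steps, G in episodes:
        rho = 1.0
        for s, a in steps:
            rho *= pi[(s, a)] / b[(s, a)]    # per-step likelihood ratio
        total += rho * G                      # reweighted return
    return total / len(episodes)
```

Episodes whose actions the target policy would never take get weight zero; the estimator is unbiased but can have high variance, which is what trace-based variants aim to reduce.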
Problem Solving With Reinforcement Learning
, 1995
Abstract

Cited by 53 (0 self)
This dissertation is submitted for consideration for the degree of Doctor of Philosophy at the University of Cambridge. Summary: This thesis is concerned with practical issues surrounding the application of reinforcement learning techniques to tasks that take place in high-dimensional continuous state-space environments. In particular, the extension of online updating methods is considered, where the term implies systems that learn as each experience arrives, rather than storing the experiences for use in a separate offline learning phase. Firstly, the use of alternative update rules in place of standard Q-learning (Watkins 1989) is examined to provide faster convergence rates. Secondly, the use of multi-layer perceptron (MLP) neural networks (Rumelhart, Hinton and Williams 1986) is investigated to provide suitable generalising function approximators. Finally, consideration is given to the combination of Adaptive Heuristic Critic (AHC) methods and Q-learning to produce systems combining the benefits of real-valued actions and discrete switching.
A Generalized ReinforcementLearning Model: Convergence and Applications
 In Proceedings of the 13th International Conference on Machine Learning
, 1996
Abstract

Cited by 45 (6 self)
Reinforcement learning is the process by which an autonomous agent uses its experience interacting with an environment to improve its behavior. The Markov decision process (MDP) model is a popular way of formalizing the reinforcement-learning problem, but it is by no means the only way. In this paper, we show how many of the important theoretical results concerning reinforcement learning in MDPs extend to a generalized MDP model that includes MDPs, two-player games and MDPs under a worst-case optimality criterion as special cases. The basis of this extension is a stochastic-approximation theorem that reduces asynchronous convergence to synchronous convergence.
1 INTRODUCTION
Reinforcement learning is the process by which an agent improves its behavior in an environment via experience. A reinforcement-learning scenario is defined by the experience presented to the agent at each step, and the criterion for evaluating the agent's behavior. One particularly well-studied reinforcement-le...
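The generalization described above can be sketched concretely. This is my own illustration of the idea, not the paper's formulation: the max over actions in the Bellman backup is replaced by a pluggable summarization operator, so one value-iteration loop covers ordinary MDPs (max), worst-case criteria (min), and related special cases; the R, P, and toy model here are hypothetical.

```python
def generalized_backup(V, R, P, gamma, summarize):
    """One synchronous backup: V(s) <- summarize_a [R(s,a) + gamma E[V(s')]].

    R[s][a]: immediate reward; P[s][a]: dict of next-state probabilities;
    summarize: operator over the action values (e.g. max, min).
    """
    return {
        s: summarize(
            R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
            for a in R[s]
        )
        for s in V
    }
```

Swapping `max` for `min` switches the same code from expected-reward optimality to a worst-case criterion, which is exactly the kind of unification the generalized model provides.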
Fast Concurrent Reinforcement Learners
 In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence
, 2001
Abstract

Cited by 41 (6 self)
When several agents learn concurrently, the payoff received by an agent is dependent on the behavior of the other agents. As the other agents learn, the reward of one agent becomes non-stationary. This makes learning in multiagent systems more difficult than single-agent learning. A few methods, however, are known to guarantee convergence to equilibrium in the limit in such systems. In this paper we experimentally study one such technique, minimax-Q, in a competitive domain and prove its equivalence with another well-known method for competitive domains. We study the rate of convergence of minimax-Q and investigate possible ways for increasing the same. We also present a variant of the algorithm, minimax-SARSA, and prove its convergence to minimax-Q values under appropriate conditions. Finally we show that this new algorithm performs better than simple minimax-Q in a general-sum domain as well.
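The backup at the heart of minimax-Q can be sketched briefly. This is my own illustration: in a zero-sum stage game defined by Q-values Q[s][a][o] over own action a and opponent action o, the state's value is the best payoff the agent can guarantee against any opponent choice. The full algorithm optimizes over mixed strategies with a linear program; for brevity this sketch takes the max-min over pure strategies only, a labeled simplification.

```python
def pure_maxmin_value(Q_s):
    """max over own actions of min over opponent actions of Q_s[a][o].

    Simplification: true minimax-Q uses mixed strategies via an LP.
    """
    return max(min(row.values()) for row in Q_s.values())

def minimax_q_update(Q, V, s, a, o, r, s2, alpha=0.1, gamma=0.9):
    """One minimax-Q-style step after observing (s, a, o, r, s2)."""
    Q[s][a][o] += alpha * (r + gamma * V[s2] - Q[s][a][o])
    V[s] = pure_maxmin_value(Q[s])     # re-solve the stage game at s
```

The non-stationarity discussed above is handled by evaluating each state as a game against the opponent rather than assuming a fixed environment.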