Results 1 - 10 of 24
Markov games as a framework for multiagent reinforcement learning
 In Proceedings of the Eleventh International Conference on Machine Learning
, 1994
Abstract
Cited by 500 (10 self)
In the Markov decision process (MDP) formalization of reinforcement learning, a single adaptive agent interacts with an environment defined by a probabilistic transition function. In this solipsistic view, secondary agents can only be part of the environment and are therefore fixed in their behavior. The framework of Markov games allows us to widen this view to include multiple adaptive agents with interacting or competing goals. This paper considers a step in this direction in which exactly two agents with diametrically opposed goals share an environment. It describes a Q-learning-like algorithm for finding optimal policies and demonstrates its application to a simple two-player game in which the optimal policy is probabilistic.
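As a hedged illustration of why the optimal policy can be probabilistic (our toy, not the paper's code): the Q-learning-like algorithm replaces the usual max in the backup with the value of a zero-sum matrix game, which may be achieved only by a mixed strategy. A grid search over the agent's mixing probability makes this concrete for matching pennies:

```python
# Sketch: the minimax value of a 2-action zero-sum matrix game, the
# quantity a minimax-style Q-update would bootstrap on. All names here
# are ours for illustration.

def minimax_value(payoff, grid=1000):
    """Max over the row player's mixed strategies p of the min over
    opponent columns of expected payoff."""
    best_p, best_v = 0.0, float("-inf")
    for i in range(grid + 1):
        p = i / grid
        # expected payoff against each opponent action
        vals = [p * payoff[0][c] + (1 - p) * payoff[1][c] for c in range(2)]
        v = min(vals)  # opponent plays its best response
        if v > best_v:
            best_p, best_v = p, v
    return best_p, best_v

# Matching pennies: the agent wins +1 on a match, loses -1 otherwise.
pennies = [[1, -1], [-1, 1]]
p, v = minimax_value(pennies)
# no deterministic policy achieves the game value; the optimum is p = 0.5
```

Any deterministic choice here is exploitable (value -1 against a best-responding opponent), which is exactly why the learned optimal policy must randomize.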
Generalization in Reinforcement Learning: Safely Approximating the Value Function
 Advances in Neural Information Processing Systems 7
, 1995
Abstract
Cited by 252 (3 self)
To appear in: G. Tesauro, D. S. Touretzky and T. K. Leen, eds., Advances in Neural Information Processing Systems 7, MIT Press, Cambridge MA, 1995. A straightforward approach to the curse of dimensionality in reinforcement learning and dynamic programming is to replace the lookup table with a generalizing function approximator such as a neural net. Although this has been successful in the domain of backgammon, there is no guarantee of convergence. In this paper, we show that the combination of dynamic programming and function approximation is not robust, and in even very benign cases, may produce an entirely wrong policy. We then introduce Grow-Support, a new algorithm which is safe from divergence yet can still reap the benefits of successful generalization.
1 INTRODUCTION
Reinforcement learning, the problem of getting an agent to learn to act from sparse, delayed rewards, has been advanced by techniques based on dynamic programming (DP). These algorithms compute a value function ...
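The non-robustness claim can be reproduced in a few lines. The toy below is a hedged stand-in (modeled on the well-known two-state divergence example of Tsitsiklis and Van Roy, not the paper's own experiments): fitted value iteration with a linear approximator blows up on a benign zero-reward chain whose true values are all zero.

```python
# Two states with features 1 and 2, so V_hat(s_i) = w * f_i. Both states
# transition to state 2 with zero reward, so the true value function is 0.
# Repeating "Bellman backup, then exact least-squares refit" diverges.

gamma = 0.9
features = [1.0, 2.0]
w = 1.0  # initial weight

for sweep in range(20):
    v2 = w * features[1]                 # approximate value of state 2
    targets = [gamma * v2, gamma * v2]   # Bellman backups (reward = 0)
    # exact least-squares fit of w to the targets:
    # minimize sum_i (w * f_i - y_i)^2  ->  w = sum(f_i * y_i) / sum(f_i^2)
    num = sum(f * y for f, y in zip(features, targets))
    den = sum(f * f for f in features)
    w = num / den

# each sweep multiplies w by 6 * gamma / 5 = 1.08, so w grows without bound
```

Each refit scales the weight by 6γ/5, which exceeds 1 whenever γ > 5/6, so the fitted values drift arbitrarily far from the correct answer of zero.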
Packet Routing in Dynamically Changing Networks: A Reinforcement Learning Approach
 Advances in Neural Information Processing Systems 6
, 1994
Abstract
Cited by 184 (2 self)
This paper describes the Q-routing algorithm for packet routing, in which a reinforcement learning module is embedded into each node of a switching network. Only local communication is used by each node to keep accurate statistics on which routing decisions lead to minimal delivery times. In simple experiments involving a 36-node, irregularly connected network, Q-routing proves superior to a non-adaptive algorithm based on precomputed shortest paths and is able to route efficiently even when critical aspects of the simulation, such as the network load, are allowed to vary dynamically. The paper concludes with a discussion of the tradeoff between discovering shortcuts and maintaining stable policies.
1 INTRODUCTION
The field of reinforcement learning has grown dramatically over the past several years, but with the exception of backgammon [8, 2], has had few successful applications to large-scale, practical tasks. This paper demonstrates that the practical task of routing packets through...
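A minimal sketch of the Q-routing update, assuming the standard formulation (the variable names and the tiny network are ours, not the paper's): node x keeps Q[x][d][y], its estimate of the delivery time for a packet bound for destination d if forwarded to neighbor y. When x sends to y, it learns y's own best remaining estimate and moves its entry toward (queueing delay + transmission delay + y's estimate), using only this local exchange.

```python
# Hedged sketch of a Q-routing table update via purely local communication.

def q_update(Q, x, d, y, queue_delay, trans_delay, alpha=0.5):
    """Update node x's delivery-time estimate for destination d via neighbor y."""
    # y reports its best remaining estimate for d (0 if y is the destination)
    remaining = min(Q[y][d].values()) if y != d else 0.0
    target = queue_delay + trans_delay + remaining
    Q[x][d][y] += alpha * (target - Q[x][d][y])
    return Q[x][d][y]

# tiny 3-node line network: 0 -- 1 -- 2, destination 2
Q = {
    0: {2: {1: 8.0}},            # node 0 thinks "via 1" takes 8 time units
    1: {2: {0: 9.0, 2: 3.0}},    # node 1's estimates for destination 2
    2: {2: {}},
}
# node 0 forwards a packet for 2 via node 1, observing 1 unit of queueing
# and 1 unit of transmission delay
new_q = q_update(Q, 0, 2, 1, queue_delay=1.0, trans_delay=1.0)
# target = 1 + 1 + min(9, 3) = 5, so the estimate moves from 8 toward 5
```

No node ever needs a global view of the topology, which is what lets the scheme adapt when load or connectivity changes.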
Algorithms for Sequential Decision Making
, 1996
Abstract
Cited by 177 (8 self)
Sequential decision making is a fundamental task faced by any intelligent agent in an extended interaction with its environment; it is the act of answering the question "What should I do now?" In this thesis, I show how to answer this question when "now" is one of a finite set of states, "do" is one of a finite set of actions, "should" is maximize a long-run measure of reward, and "I" is an automated planning or learning system (agent). In particular,
CoEvolution in the Successful Learning of Backgammon Strategy
 Machine Learning
, 1998
Abstract
Cited by 107 (24 self)
Following Tesauro's work on TD-Gammon, we used a 4000 parameter feedforward neural network to develop a competitive backgammon evaluation function. Play proceeds by a roll of the dice, application of the network to all legal moves, and choosing the move with the highest evaluation. However, no backpropagation, reinforcement or temporal difference learning methods were employed. Instead we apply simple hill-climbing in a relative fitness environment. We start with an initial champion of all zero weights and proceed simply by playing the current champion network against a slightly mutated challenger and changing weights if the challenger wins. Surprisingly, this worked rather well. We investigate how the peculiar dynamics of this domain enabled a previously discarded weak method to succeed, by preventing suboptimal equilibria in a "meta-game" of self-learning.
Keywords: coevolution, backgammon, reinforcement, temporal difference learning, self-learning
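The champion-vs-challenger loop above is simple enough to sketch. This is a hedged stand-in, not the paper's code: the "game" here is a noisy comparison of a hidden fitness rather than full games of backgammon, but the mechanics are the same, mutate the champion, play the pair against each other, and adopt the challenger's weights only if it wins.

```python
# Hedged sketch of hill-climbing in a relative fitness environment.
# The stand-in game and all parameter values are our assumptions.

import random

def play(champ, chall, rng):
    # stand-in "game": noisy comparison of a hidden fitness; in the paper
    # this would be games of backgammon between the two networks
    def fitness(w):
        return -sum((wi - 1.0) ** 2 for wi in w)  # hidden optimum at all-ones
    return fitness(chall) + rng.gauss(0, 0.05) > fitness(champ)

rng = random.Random(0)
champion = [0.0] * 10  # start from the all-zero "network", as in the paper
for generation in range(2000):
    challenger = [w + rng.gauss(0, 0.1) for w in champion]  # slight mutation
    if play(champion, challenger, rng):
        champion = challenger  # challenger won: it becomes the new champion

# the champion climbs toward the optimum with no gradients and no
# temporal-difference signal -- only win/loss outcomes
```

The noisy win/loss signal is the whole learning signal, which is what makes it striking that this "weak method" produced a competitive player.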
Coevolution of A Backgammon Player
 Proceedings Artificial Life V
Abstract
Cited by 79 (11 self)
One of the persistent themes in Artificial Life research is the use of coevolutionary arms races in the development of specific and complex behaviors. However, other than Sims's work on artificial robots, most of the work has attacked very simple games of prisoner's dilemma or predator and prey. Following Tesauro's work on TD-Gammon, we used a 4000 parameter feedforward neural network to develop a competitive backgammon evaluation function. Play proceeds by a roll of the dice, application of the network to all legal moves, and choosing the move with the highest evaluation. However, no backpropagation, reinforcement ...
Issues in Using Function Approximation for Reinforcement Learning
 In Proceedings of the Fourth Connectionist Models Summer School
, 1993
Problem Solving With Reinforcement Learning
, 1995
Abstract
Cited by 47 (0 self)
This dissertation is submitted for consideration for the degree of Doctor of Philosophy at the University of Cambridge.
Summary: This thesis is concerned with practical issues surrounding the application of reinforcement learning techniques to tasks that take place in high dimensional continuous state-space environments. In particular, the extension of online updating methods is considered, where the term implies systems that learn as each experience arrives, rather than storing the experiences for use in a separate offline learning phase. Firstly, the use of alternative update rules in place of standard Q-learning (Watkins 1989) is examined to provide faster convergence rates. Secondly, the use of multi-layer perceptron (MLP) neural networks (Rumelhart, Hinton and Williams 1986) is investigated to provide suitable generalising function approximators. Finally, consideration is given to the combination of Adaptive Heuristic Critic (AHC) methods and Q-learning to produce systems combining the benefits of real-valued actions and discrete switching
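To illustrate what "alternative update rules in place of standard Q-learning" can mean (this toy is our assumption, not taken from the thesis): the standard Q-learning target bootstraps on the best next action, while an on-policy alternative of the kind studied in this line of work bootstraps on the action the agent actually takes, which changes convergence behavior when exploration is ongoing.

```python
# Hedged comparison of two bootstrap targets for the same transition.
# The table values and names below are invented for illustration.

def q_learning_target(Q, s2, r, gamma):
    # off-policy: bootstrap on the best action available in s2
    return r + gamma * max(Q[s2].values())

def on_policy_target(Q, s2, a2, r, gamma):
    # on-policy alternative: bootstrap on the action a2 actually taken in s2
    return r + gamma * Q[s2][a2]

Q = {"s2": {"left": 0.2, "right": 1.0}}
# same transition (reward 0, next state s2), different targets when the
# agent explores with the inferior action "left":
t_off = q_learning_target(Q, "s2", r=0.0, gamma=0.9)
t_on = on_policy_target(Q, "s2", "left", r=0.0, gamma=0.9)
```

With a generalizing approximator such as an MLP, which target is regressed on matters: the off-policy target can interact badly with exploration and function approximation, which is part of the motivation for examining alternatives.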
A Generalized ReinforcementLearning Model: Convergence and Applications
 In Proceedings of the 13th International Conference on Machine Learning
, 1996
Abstract
Cited by 43 (5 self)
Reinforcement learning is the process by which an autonomous agent uses its experience interacting with an environment to improve its behavior. The Markov decision process (MDP) model is a popular way of formalizing the reinforcement-learning problem, but it is by no means the only way. In this paper, we show how many of the important theoretical results concerning reinforcement learning in MDPs extend to a generalized MDP model that includes MDPs, two-player games and MDPs under a worst-case optimality criterion as special cases. The basis of this extension is a stochastic-approximation theorem that reduces asynchronous convergence to synchronous convergence.
1 INTRODUCTION
Reinforcement learning is the process by which an agent improves its behavior in an environment via experience. A reinforcement-learning scenario is defined by the experience presented to the agent at each step, and the criterion for evaluating the agent's behavior. One particularly well-studied reinforcement-le...
Generalized Markov Decision Processes: Dynamic-Programming and Reinforcement-Learning Algorithms
 In: Proceedings of the 13th International Conference on Machine Learning (ICML-96)
, 1996
Abstract
Cited by 24 (6 self)
The problem of maximizing the expected total discounted reward in a completely observable Markovian environment, i.e., a Markov decision process (MDP), models a particular class of sequential decision problems. Algorithms have been developed for making optimal decisions in MDPs given either an MDP specification or the opportunity to interact with the MDP over time. Recently, other sequential decision-making problems have been studied prompting the development of new algorithms and analyses. We describe a new generalized model that subsumes MDPs as well as many of the recent variations. We prove some basic results concerning this model and develop generalizations of value iteration, policy iteration, model-based reinforcement learning, and Q-learning that can be used to make optimal decisions in the generalized model under various assumptions. Applications of the theory to particular models are described, including risk-averse MDPs, exploration-sensitive MDPs, Sarsa, Q-learning with spreading, two-player games, and approximate max picking via sampling. Central to the results are the contraction property of the value operator and a stochastic-approximation theorem that reduces asynchronous convergence to synchronous convergence.
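One way to see the unifying idea behind the generalized model (a hedged toy of ours, not the paper's formulation) is value iteration in which the operator that summarizes over actions is a parameter: plugging in max recovers ordinary MDP value iteration, while plugging in min yields a worst-case optimality criterion. As long as the resulting value operator stays a contraction, the same convergence argument applies to every choice.

```python
# Hedged sketch: value iteration with a swappable action-summary operator.
# The tiny two-state chain below is invented for illustration.

def value_iteration(P, R, gamma, summarize, iters=200):
    """P[s][a] = list of (prob, next_state); R[s][a] = immediate reward.
    `summarize` collapses the per-action backed-up values into V(s)."""
    V = {s: 0.0 for s in P}
    for _ in range(iters):
        V = {
            s: summarize(
                R[s][a] + gamma * sum(p * V[t] for p, t in P[s][a])
                for a in P[s]
            )
            for s in P
        }
    return V

# two-state chain: in state 0, 'stay' gives reward 0, 'go' gives reward 1
# and moves to absorbing state 1 (all rewards 0 there)
P = {0: {"stay": [(1.0, 0)], "go": [(1.0, 1)]}, 1: {"stay": [(1.0, 1)]}}
R = {0: {"stay": 0.0, "go": 1.0}, 1: {"stay": 0.0}}

V_opt = value_iteration(P, R, 0.9, max)  # ordinary MDP: agent picks actions
V_pes = value_iteration(P, R, 0.9, min)  # worst-case: an adversary picks them
```

The same loop, run with different summary operators, converges to different fixed points (here 1.0 versus 0.0 in state 0), which is the special-case structure the paper's contraction and stochastic-approximation results exploit.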