## Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games (2002)

Venue: Advances in Neural Information Processing Systems

Citations: 78 (3 self)

### BibTeX

@INPROCEEDINGS{Wang02reinforcementlearning,

author = {Xiaofeng Wang and Tuomas Sandholm},

title = {Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games},

booktitle = {Advances in Neural Information Processing Systems},

year = {2002},

pages = {1571--1578},

publisher = {MIT Press}

}

### Abstract

Multiagent learning is a key problem in game theory and AI. It involves two interrelated learning problems: identifying the game and learning to play. These two problems prevail even in team games where the agents' interests do not conflict. Even team games can have multiple Nash equilibria, only some of which are optimal. We present optimal adaptive learning (OAL), the first algorithm that converges to an optimal Nash equilibrium for any team Markov game. We provide a convergence proof, and show that the algorithm's parameters are easy to set so that the convergence conditions are met. Our experiments show that existing algorithms do not converge in many of these problems while OAL does. We also demonstrate the importance of the fundamental ideas behind OAL: incomplete history sampling and biased action selection.

### Citations

780 |
The Theory of Learning in Games
- Fudenberg, Levine
- 1998
Citation Context: ...t optimally, the agents have to coordinate which equilibrium they will play. This problem is called equilibrium selection. A learning approach widely advocated to tackle this problem is fictitious play [3]: each agent i assumes that the other agents use a stationary (but unknown) joint policy, and i models that policy based on the frequency of past observations. Each agent i uses its best-response stra... |
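
The fictitious-play idea in this excerpt can be sketched in a few lines; the 2x2 team payoff matrix, the pseudo-count initialization, and the round count below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Hypothetical 2-player team game: both agents receive the same payoff.
# Rows index agent 1's action, columns index agent 2's action.
PAYOFF = np.array([[10.0, 0.0],
                   [0.0,  5.0]])

def fictitious_play(payoff, rounds=200):
    """Each agent best-responds to the empirical frequency of the other's past actions."""
    counts = [np.ones(payoff.shape[0]), np.ones(payoff.shape[1])]  # pseudo-counts
    a1 = a2 = 0
    for _ in range(rounds):
        freq2 = counts[1] / counts[1].sum()   # agent 1's model of agent 2
        freq1 = counts[0] / counts[0].sum()   # agent 2's model of agent 1
        a1 = int(np.argmax(payoff @ freq2))   # best response to the modeled policy
        a2 = int(np.argmax(freq1 @ payoff))
        counts[0][a1] += 1                    # record observed actions
        counts[1][a2] += 1
    return a1, a2

print(fictitious_play(PAYOFF))  # -> (0, 0), the optimal joint action here
```

Note that the equilibrium-selection problem the excerpt raises is visible in this setup: both (0, 0) and (1, 1) are Nash equilibria, and which one plain fictitious play reaches depends on initial beliefs.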

516 |
Dynamic Programming and Markov Processes
- Howard
- 1960
Citation Context: ... in state s. The objective is to find a policy π that maximizes E(∑_{t=0}^∞ γ^t r_t | π), where r_t is the payoff at time t and γ ∈ [0, 1) is a discount factor. There exists a deterministic optimal policy [12]. The Q-function for this policy, Q*, is defined by the set of equations Q*(s, a) = R(s, a) + γ ∑_{s'∈S} T(s, a, s') max_{a'∈A} Q*(s', a'). At any state s, the optimal policy chooses arg max... |
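
The fixed-point equations in this excerpt can be solved by iterating them directly; the two-state MDP below (payoffs, transitions, gamma = 0.9) is made up for illustration:

```python
import numpy as np

S, A, GAMMA = 2, 2, 0.9
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])          # R[s, a]: expected payoff
T = np.zeros((S, A, S))             # T[s, a, s']: transition probability
T[0, 0] = [0.8, 0.2]; T[0, 1] = [0.1, 0.9]
T[1, 0] = [0.5, 0.5]; T[1, 1] = [0.0, 1.0]

def q_iteration(R, T, gamma, iters=500):
    """Iterate Q(s,a) <- R(s,a) + gamma * sum_{s'} T(s,a,s') max_{a'} Q(s',a')."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * (T @ Q.max(axis=1))   # contracts over s'
    return Q

Q = q_iteration(R, T, GAMMA)
policy = Q.argmax(axis=1)   # greedy policy: arg max_a Q*(s, a)
```

Since gamma < 1 the update is a contraction, so the iterates converge to the unique Q* regardless of initialization; here the greedy policy takes action 1 in both states (state 1 under action 1 is absorbing with payoff 2, so Q*(1, 1) = 2 / (1 - 0.9) = 20).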

304 | The dynamics of reinforcement learning in cooperative multiagent systems
- Claus, Boutilier
- 1998
Citation Context: ...ula: T_e^t = 0.99 T_e^{t-1}. For OAL, we used C = 3.5 and B(N_t) = N_t^{0.4}. 5.1 OAL vs. JAL The first experiment compared our OAL algorithm to Claus and Boutilier's joint action learner (JAL) algorithm [2]. The game is the 3-player game of Table 1, where the players do not know the game structure. OAL converged to an optimal Nash equilibrium on each of the 1,000 runs. JAL converged to a suboptimal Nash... |

294 | The Evolution of Conventions
- Young
- 1993
Citation Context: ...o 0 to every other joint action. Virtual games simplify the coordination task for the learning agents. For example, for the game in Table 1, the VG is weakly acyclic. Definition 2 (Weakly acyclic game [18]) Let G be an n-player game in matrix form. The best-response graph of G takes each joint action a ∈ A as a vertex and connects two vertices a and a' with a directed edge a → a' if and only if 1) a ... |
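
The best-response graph of Definition 2 can be constructed mechanically. In this sketch the 2-player game, its shared payoffs, and the tie-breaking rule (max keeps the first best action) are illustrative assumptions; the edge rule used (a' differs from a in exactly one player's action, which is a best response to the others' unchanged actions) is the standard one, and the excerpt's own condition is truncated:

```python
from itertools import product

# Hypothetical 2-player team game in matrix form: shared payoff per joint action.
PAYOFF = {(0, 0): 10, (0, 1): 0,
          (1, 0): 0,  (1, 1): 5}
ACTIONS = [0, 1]

def best_response_edges(payoff, actions, n_players=2):
    """Edges a -> a' where a' changes one player's action to a best response."""
    edges = set()
    for a in product(actions, repeat=n_players):
        for i in range(n_players):
            best = max(actions, key=lambda x: payoff[a[:i] + (x,) + a[i + 1:]])
            a_next = a[:i] + (best,) + a[i + 1:]
            if a_next != a:
                edges.add((a, a_next))
    return edges

edges = best_response_edges(PAYOFF, ACTIONS)
# The equilibria (0,0) and (1,1) have no outgoing edges (they are sinks);
# from (0,1) and (1,0) a single best-response step reaches an equilibrium,
# so this game is weakly acyclic.
```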

283 | Multiagent reinforcement learning: Theoretical framework and an algorithm
- Hu, Wellman
- 1998
Citation Context: ...a in Markov games (aka. stochastic games) [16] when the game structure is unknown. This has been studied under various types of Markov games such as zero-sum Markov games [9], general-sum Markov games [6, 8] and team Markov games [2]. Multiagent RL in Markov games involves two interrelated learning problems: identifying the game and learning to play. These two problems prevail even in team Markov games w... |

248 | Multi-agent reinforcement learning: Independent vs. cooperative agents
- Tan
- 1993
Citation Context: ...selection. 1 Introduction Multiagent learning is a key problem in game theory and AI. For a decade, computer scientists have worked on extending reinforcement learning (RL) [7] to multiagent settings [11, 15, 5, 17]. In game-theoretic terms, multiagent RL is the problem of learning to play Nash equilibria in Markov games (aka. stochastic games) [16] when the game structure is unknown. This has been studied under... |

165 |
Spieltheoretische Behandlung eines Oligopolmodells mit Nachfrageträgheit [Game-theoretic treatment of an oligopoly model with demand inertia]. Zeitschrift für die gesamte Staatswissenschaft, 121, 301–324 and 667–689
- Selten, R
- 1965
Citation Context: ...s, a) be the payoff that the agents¹ receive from the VG in state s for a joint action a. (¹Throughout the paper, every Nash equilibrium that we discuss is also a subgame perfect Nash equilibrium. This refinement of Nash equilibrium was first introduced in [13] for different games.) We let VG(s, a) = 1 if a = arg max_{a'∈A} Q*(s, a') and VG(s, a) = 0 otherwise. For example, the VG for the game ... |

148 | Learning to coordinate without sharing information
- Sen, Sekaran, et al.
- 1994
Citation Context: ...selection. 1 Introduction Multiagent learning is a key problem in game theory and AI. For a decade, computer scientists have worked on extending reinforcement learning (RL) [7] to multiagent settings [11, 15, 5, 17]. In game-theoretic terms, multiagent RL is the problem of learning to play Nash equilibria in Markov games (aka. stochastic games) [16] when the game structure is unknown. This has been studied under... |

111 | Convergence results for single-step on-policy reinforcement-learning algorithms
- Singh, Jaakkola, et al.
- 2000
Citation Context: ... the Limit with Infinite Exploration" (GLIE) property, then Q_t will converge to Q* (with either a model-based or model-free approach [7]) and the agent will converge in behavior to an optimal policy [14]. Using GLIE, every state-action pair is visited infinitely often, and in the limit the action selection is greedy with respect to the Q-function w.p.1. One common GLIE policy is Boltzmann exploration ... |
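
Boltzmann exploration with a decaying temperature is one way to obtain GLIE behavior, provided the decay is slow enough; in this sketch the Q-values and the log-based schedule are illustrative assumptions, not the paper's settings:

```python
import math
import random

def boltzmann_action(q_values, temperature):
    """Sample an action with probability proportional to exp(Q(s,a)/T)."""
    m = max(q_values)  # subtract max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return random.choices(range(len(q_values)), weights=[w / total for w in weights])[0]

q = [1.0, 2.0, 0.5]
for t in range(1, 6):
    # Hypothetical GLIE-style schedule: T -> 0, so action selection becomes
    # greedy in the limit, yet every action has positive probability at any
    # finite t (infinite exploration).
    temperature = 1.0 / math.log(t + 1)
    a = boltzmann_action(q, temperature)
```

At high temperature the distribution is near uniform (exploration); as the temperature approaches zero the probability mass concentrates on arg max_a Q(s, a) (greedy in the limit).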

47 |
Learning, mutation, and long run equilibria in games

- Kandori, Mailath, et al.
- 1993
Citation Context: ...ame. However, this equilibrium may not be optimal. The same problem prevails in other equilibrium-selection approaches in game theory such as adaptive play [18] and the evolutionary model proposed in [7]. In RL, the agents usually do not know the environmental model (game) up front and receive noisy payoffs. In this case, even the lexicographic approaches may not work because agents receive noisy pay... |

42 | Value-function reinforcement learning in Markov games
- Littman
- 2000
Citation Context: ...arning to play Nash equilibria in Markov games (aka. stochastic games) [16] when the game structure is unknown. This has been studied under various types of Markov games such as zero-sum Markov games [9], general-sum Markov games [6, 8] and team Markov games [2]. Multiagent RL in Markov games involves two interrelated learning problems: identifying the game and learning to play. These two problems pre... |

41 |
Planning, learning and coordination in multiagent decision processes

- Boutilier
- 1996
Citation Context: ...R, T⟩ where S is a finite state space; A is the space of actions the agent can take; R : S × A → ℝ is a payoff function (R(s, a) is the expected payoff for taking action a in state s); and T : S × A × S → [0, 1] is a transition function (T(s, a, s') is the probability of ending in state s', given that action a is taken in state s). An agent's deterministic policy (aka. strategy) is a mapping from states... |

21 |
Learning to coordinate actions in multi-agent systems
- Weiß
- 1993
Citation Context: ...selection. 1 Introduction Multiagent learning is a key problem in game theory and AI. For a decade, computer scientists have worked on extending reinforcement learning (RL) [7] to multiagent settings [11, 15, 5, 17]. In game-theoretic terms, multiagent RL is the problem of learning to play Nash equilibria in Markov games (aka. stochastic games) [16] when the game structure is unknown. This has been studied under... |

6 |
Reinforcement learning: A survey. JAIR
- Kaelbling, Littman, et al.
- 1996
Citation Context: ...sampling and biased action selection. 1 Introduction Multiagent learning is a key problem in game theory and AI. For a decade, computer scientists have worked on extending reinforcement learning (RL) [7] to multiagent settings [11, 15, 5, 17]. In game-theoretic terms, multiagent RL is the problem of learning to play Nash equilibria in Markov games (aka. stochastic games) [16] when the game structure ... |

6 |
Learning in the iterated prisoner’s dilemma
- Sandholm, Crites
- 1995
Citation Context: ...et the convergence conditions. 1 Introduction Multiagent learning is a key problem in AI. For a decade, computer scientists have worked on extending reinforcement learning (RL) to multiagent settings [11, 15, 5, 17]. Markov games (aka. stochastic games) [16] have emerged as the prevalent model of multiagent RL. An approach called Nash-Q [9, 6, 8] has been proposed for learning the game structure and the agents' ... |

4 |
Friend-or-Foe Q-learning in general-sum games
- Littman
Citation Context: ...a in Markov games (aka. stochastic games) [16] when the game structure is unknown. This has been studied under various types of Markov games such as zero-sum Markov games [9], general-sum Markov games [6, 8] and team Markov games [2]. Multiagent RL in Markov games involves two interrelated learning problems: identifying the game and learning to play. These two problems prevail even in team Markov games w... |

3 |
Markov chains: theory and applications

- Isaacson, Madsen
- 1976
Citation Context: ...describing BAP without exploration. Our objective here is to show that on a WAGB, BAP with GLIE exploration will converge to the ("clustered") terminal state. To do that, we need the following lemma [4]. Lemma 2: Let P be the finite transition matrix of a stationary Markov chain with a unique stationary distribution. Let {P_t}_{t=1}^∞ be a sequence of finite transition matrices. Let f be a probability... |
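
The lemma in this excerpt concerns convergence to a chain's unique stationary distribution, i.e. the fixed point μP = μ. A minimal power-iteration sketch (with a made-up ergodic 3-state transition matrix) illustrates that fixed point:

```python
import numpy as np

# Hypothetical 3-state transition matrix P; each row sums to 1.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])

def stationary_distribution(P, iters=10_000):
    """Power iteration: for an ergodic chain, any starting distribution
    converges to the unique mu with mu @ P == mu."""
    mu = np.full(P.shape[0], 1.0 / P.shape[0])  # start uniform
    for _ in range(iters):
        mu = mu @ P
    return mu

mu = stationary_distribution(P)
# mu @ P equals mu (up to floating point): mu is the stationary distribution.
```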

3 |
Markov decision processes: discrete stochastic dynamic programming

- Puterman
- 1994
Citation Context: ...on for this policy, Q*, is defined by the set of equations Q*(s, a) = R(s, a) + γ ∑_{s'∈S} T(s, a, s') max_{a'∈A} Q*(s', a'). At any state s, the optimal policy chooses arg max_a Q*(s, a) [10]. Reinforcement learning (RL) [7] can be viewed as a sampling method for estimating Q* when the payoff function R and/or transition function T are unknown. Q*(s, a) can be approximated by a functio... |

2 |
Optimality and equilibria in stochastic games. Centrum voor Wiskunde en Informatica

- Thuijsman
- 1992
Citation Context: ...inforcement learning (RL) [7] to multiagent settings [11, 15, 5, 17]. In game-theoretic terms, multiagent RL is the problem of learning to play Nash equilibria in Markov games (aka. stochastic games) [16] when the game structure is unknown. This has been studied under various types of Markov games such as zero-sum Markov games [9], general-sum Markov games [6, 8] and team Markov games [2]. Multiagent R... |
