## Nash Q-Learning for General-Sum Stochastic Games (2003)

Venue: Journal of Machine Learning Research

Citations: 116 (0 self)

### BibTeX

@ARTICLE{Hu03nashq-learning,
  author  = {Junling Hu and Michael P. Wellman},
  title   = {Nash Q-Learning for General-Sum Stochastic Games},
  journal = {Journal of Machine Learning Research},
  year    = {2003},
  volume  = {4},
  pages   = {1039--1069}
}

### Abstract

We extend Q-learning to a noncooperative multiagent context, using the framework of general-sum stochastic games. A learning agent maintains Q-functions over joint actions, and performs updates based on assuming Nash equilibrium behavior over the current Q-values. This learning protocol provably converges given certain restrictions on the stage games (defined by Q-values) that arise during learning. Experiments with a pair of two-player grid games suggest that such restrictions on the game structure are not necessarily required. Stage games encountered during learning in both grid environments violate the conditions. However, learning consistently converges in the first grid game, which has a unique equilibrium Q-function, but sometimes fails to converge in the second, which has three different equilibrium Q-functions. In a comparison of offline learning performance in both games, we find agents are more likely to reach a joint optimal path with Nash Q-learning than with a single-agent Q-learning method. When at least one agent adopts Nash Q-learning, the performance of both agents is better than using single-agent Q-learning. We have also implemented an online version of Nash Q-learning that balances exploration with exploitation, yielding improved performance.
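The update the abstract describes replaces the single-agent max with each agent's payoff in a Nash equilibrium of the stage game formed by the current Q-values at the next state. A minimal two-player sketch, restricted to pure-strategy equilibria (the paper computes equilibria with the Lemke-Howson method, which also handles mixed ones); the dict-of-matrices table layout, the zero fallback, and all names here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def pure_nash(q1, q2):
    # Scan joint actions for a pure-strategy Nash equilibrium of the stage
    # game (q1, q2): neither player can gain by deviating unilaterally.
    for i in range(q1.shape[0]):
        for j in range(q1.shape[1]):
            if q1[i, j] >= q1[:, j].max() and q2[i, j] >= q2[i, :].max():
                return i, j
    return None  # only mixed equilibria exist; Lemke-Howson would be needed

def nash_q_update(Q1, Q2, s, a1, a2, r1, r2, s_next, alpha=0.5, beta=0.9):
    # Each agent's Q-table maps a state to a matrix over joint actions.
    eq = pure_nash(Q1[s_next], Q2[s_next])
    nash1 = Q1[s_next][eq] if eq is not None else 0.0  # NashQ_1(s')
    nash2 = Q2[s_next][eq] if eq is not None else 0.0  # NashQ_2(s')
    Q1[s][a1, a2] = (1 - alpha) * Q1[s][a1, a2] + alpha * (r1 + beta * nash1)
    Q2[s][a1, a2] = (1 - alpha) * Q2[s][a1, a2] + alpha * (r2 + beta * nash2)
```

Each agent must observe the other's action and reward to maintain both tables, which is exactly the observability assumption Nash Q-learning makes.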

### Citations

4165 | Reinforcement Learning: An Introduction
- Sutton, Barto
- 1998
Citation Context: ...tiagent Learning 1. Introduction Researchers investigating learning in the context of multiagent systems have been particularly attracted to reinforcement learning techniques (Kaelbling et al., 1996, Sutton and Barto, 1998), perhaps because they do not require environment models and they allow agents to take actions while they learn. In typical multiagent systems, agents lack full information about their counterparts, ...

1915 | A Course in Game Theory
- Osborne, Rubinstein
- 1994
Citation Context: ...ng in infinitely repeated games (Fudenberg and Levine, 1998) suggests that there are generally a great multiplicity of non-stationary equilibria. This fact is partially demonstrated by Folk Theorems (Osborne and Rubinstein, 1994). 3. Multiagent Q-learning We extend Q-learning to multiagent systems, based on the framework of stochastic games. First, we redefine Q-values for multiagent case, and then present the algorithm for ...

1412 | Learning from Delayed Rewards
- Watkins
- 1989
Citation Context: ...on about their counterparts, and thus the multiagent environment constantly changes as agents learn about each other and adapt their behaviors accordingly. Among reinforcement techniques, Q-learning (Watkins, 1989, Watkins and Dayan, 1992) has been especially well-studied, and possesses a firm foundation in the theory of Markov decision processes. It is also quite easy to use, and has seen wide application, fo...

1329 | Markov Decision Processes: Discrete Stochastic Dynamic Programming
- Puterman
- 1994
Citation Context: ...time. π is called a behavior strategy if its decision rule may depend on the history of the game play, π_t = f_t(H_t). The standard solution to the problem above is through an iterative search method (Puterman, 1994) that searches for a fixed point of the following Bellman equation: v(s, π) = max_a { r(s, a) + β ∑_{s'} p(s'|s, a) v(s', π) }, (2) where r(s, a) is the reward for taking action a at state s, s' is the nex...
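The iterative search described in this snippet is standard value iteration for the Bellman equation (2). A compact sketch under assumed array shapes (`r` indexed `[state, action]`, `p` indexed `[state, action, next_state]`); the shapes, tolerance, and discount value are illustrative assumptions:

```python
import numpy as np

def value_iteration(r, p, beta=0.9, tol=1e-8):
    """Search for a fixed point of
    v(s) = max_a { r(s, a) + beta * sum_{s'} p(s'|s, a) v(s') }."""
    v = np.zeros(r.shape[0])
    while True:
        # p @ v sums out the next-state axis: result has shape (states, actions)
        v_new = (r + beta * (p @ v)).max(axis=1)
        if np.abs(v_new - v).max() < tol:
            return v_new
        v = v_new
```

Because the Bellman operator is a β-contraction in the sup norm, the loop converges geometrically from any starting point.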

947 | Non-cooperative Games
- Nash
- 1951
Citation Context: ...e buyer and seller have compatible interests in reaching a deal, but have conflicting interests in the direction of price. The baseline solution concept for general-sum games is the Nash equilibrium (Nash, 1951). In a Nash equilibrium, each player effectively holds a correct expectation about the other players’ behaviors, and acts rationally with respect to this expectation. Acting rationally means the agen...

903 | The Theory of Learning in Games
- Fudenberg, Levine
- 1998
Citation Context: ...strategies, which allow conditioning of action on history of play, are more complex, and relatively less well-studied in the stochastic game framework. Work on learning in infinitely repeated games (Fudenberg and Levine, 1998) suggests that there are generally a great multiplicity of non-stationary equilibria. This fact is partially demonstrated by Folk Theorems (Osborne and Rubinstein, 1994). 3. Multiagent Q-learning We ...

628 | The Linear Complementarity Problem
- Cottle, Pang, et al.
- 1992
Citation Context: ...Right 31, 31 0, 65 Up 0, 0 49, 49 Table 7: Grid Game 2: Q-values in state (0 2) after 61 episodes if always choosing the first Nash tation, we calculate Nash equilibria using the Lemke-Howson method (Cottle et al., 1992), which can be employed to generate equilibria in a fixed order.7 We then define a first-Nash learning agent as one that updates its Q-values using the first Nash equilibrium generated. A second-Nash...

533 | Markov games as a framework for multi-agent reinforcement learning
- Littman
- 1994
Citation Context: ...er agents, who might be presumed rational or at least regular in some important way. We might expect that accounting for this explicitly would be advantageous to the learner. Indeed, several studies (Littman, 1994, Claus and Boutilier, 1998, Hu and Wellman, 2000) have demonstrated situations where an agent who considers the effect of joint actions outperforms a corresponding agent who learns only in terms of i...

508 | A general theory of equilibrium selection in games - Harsanyi, Selten - 1988

323 | The dynamics of reinforcement learning in cooperative multi-agent systems
- Claus, Boutilier
- 1998
Citation Context: ...might be presumed rational or at least regular in some important way. We might expect that accounting for this explicitly would be advantageous to the learner. Indeed, several studies (Littman, 1994, Claus and Boutilier, 1998, Hu and Wellman, 2000) have demonstrated situations where an agent who considers the effect of joint actions outperforms a corresponding agent who learns only in terms of its own actions. Boutilier (...

299 | Multiagent reinforcement learning: Theoretical framework and an algorithm
- Hu, Wellman
- 1998
Citation Context: ...σ_n(a_n) |Q^j(s, a_1, . . . , a_n) − Q̂^j(s, a_1, . . . , a_n)| ≤ ∑_{a_1,...,a_n} σ_1(a_1) · · · σ_n(a_n) ‖Q^j(s) − Q̂^j(s)‖ (15) = ‖Q^j(s) − Q̂^j(s)‖, 3. In our statement of this assumption in previous writings (Hu and Wellman, 1998, Hu, 1999), we neglected to include the qualification that the same condition be satisfied by all stage games. We have made the qualification more explicit subsequently (Hu and Wellman, 2000). As Bow...

271 | Multi-agent reinforcement learning: Independent vs. cooperative agents
- Tan
- 1993
Citation Context: ...multiagent context, its ease of application does, and the method has been employed in such multiagent domains as robotic soccer (Stone and Sutton, 2001, Balch, 1997), predator-and-prey pursuit games (Tan, 1993, De Jong, 1997, Ono and Fukumoto, 1996), and Internet pricebots (Kephart and Tesauro, 2000). Whereas it is possible to apply Q-learning in a straightforward fashion to each agent in a multiagent syst...

248 | R-max – a general polynomial time algorithm for near-optimal reinforcement learning - Brafman, Tennenholtz

243 | Competitive Markov Decision Processes
- Filar, Vrieze
- 1996
Citation Context: ...(s,a) under the assumption that all states and actions have been visited infinitely often and the learning rate satisfies certain constraints. 2.2 Stochastic Games The framework of stochastic games (Filar and Vrieze, 1997, Thuijsman, 1992) models multiagent systems with discrete time¹ and noncooperative nature. We employ the term “noncooperative” in the technical game-theoretic sense, where it means that agents pursu...

228 | Graphical models for game theory
- Kearns, Littman, et al.
- 2001
Citation Context: ...cally employed for n-player games (McKelvey and McLennan, 1996). 2. Given known locality in agent interaction, it is sometimes possible to achieve more compact representations using graphical models (Kearns et al., 2001, Koller and Milch, 2001). 4. Convergence We would like to prove the convergence of Q^i_t to an equilibrium Q^i for the learning agent i. The value of Q^i is determined by the joint...

196 | Multiagent learning using a variable learning rate - Bowling, Veloso - 2002

160 | Multi-agent influence diagrams for representing and solving games - Koller, Milch - 2001

151 | Sequential optimality and coordination in multiagent systems - Boutilier - 1999

130 | Computation of Equilibria in Finite Games - McKelvey, McLennan - 1996

130 | Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems
- Singh, Bertsekas
- 1996
Citation Context: ...kov decision processes. It is also quite easy to use, and has seen wide application, for example to such areas as (to arbitrarily choose four diverse instances) cellular telephone channel allocation (Singh and Bertsekas, 1996), spoken dialog systems (Walker, 2000), robotic control (Hagen, 2001), and computer vision (Bandera et al., 1996). Although the single-agent properties do not transfer ...

124 | Friend-or-Foe Q-learning in general-sum games - Littman - 2001

113 | Scaling reinforcement learning toward RoboCup soccer
- Stone, Sutton
- 2001
Citation Context: ...directly to the multiagent context, its ease of application does, and the method has been employed in such multiagent domains as robotic soccer (Stone and Sutton, 2001, Balch, 1997), predator-and-prey pursuit games (Tan, 1993, De Jong, 1997, Ono and Fukumoto, 1996), and Internet pricebots (Kephart and Tesauro, 2000). Whereas it is possible to apply Q-learning in a s...

93 | Incremental multi-step Q-learning
- Peng, Williams
- 1996
Citation Context: ...um stochastic games, and presented evidence of faster convergence than Minimax-Q. One of the drawbacks of Q-learning is that each state-action tuple has to be visited infinitely often. Q(λ) learning (Peng and Williams, 1996, Wiering and Schmidhuber, 1998) promises an interesting way to speed up the learning process. To apply Q-learning online, we seek a general criterion for setting the exploitation rate, which would ma...

91 | Differential Games: A Mathematical Theory with Applications to Warfare and Pursuit, Control and Optimization
- Isaacs
- 1965
Citation Context: ...ry strategies. The following result shows that there always exists an equilibrium in stationary strategies. 1. For a model of continuous-time multiagent systems, see literature on differential games (Isaacs, 1975, Petrosjan and Zenkevich, 1996). Theorem 4 (Fink, 1964) Every n-player discounted stochastic game possesses at least one Nash equilibrium point in stationary strategies. In this p...

64 | An application of reinforcement learning to dialogue strategy selection in a spoken dialogue system for email - Walker

60 | Correlated Q-learning - Greenwald, Hall, et al. - 2002

59 | Multi-agent reinforcement learning: A critical survey - Shoham, Powers, et al. - 2003

55 | Coordination in multiagent reinforcement learning: A Bayesian approach - Chalkiadakis, Boutilier - 2003

49 | Value-function reinforcement learning in Markov games - Littman

46 | Exploration of multi-state environments: local measures and back-propagation of uncertainty - Meuleau, Bourgine - 1999

45 | Learning Roles: Behavioral Diversity in Robot Teams
- Balch
- 1997
Citation Context: ...directly to the multiagent context, its ease of application does, and the method has been employed in such multiagent domains as robotic soccer (Stone and Sutton, 2001, Balch, 1997), predator-and-prey pursuit games (Tan, 1993, De Jong, 1997, Ono and Fukumoto, 1996), and Internet pricebots (Kephart and Tesauro, 2000). Whereas it is possible to apply Q-learning in a straightforwar...

41 | Fast concurrent reinforcement learners - Banerjee, Sen, et al. - 2001

39 | Equilibrium in a stochastic n-person game
- Fink
- 1964
Citation Context: ...rium in stationary strategies. 1. For a model of continuous-time multiagent systems, see literature on differential games (Isaacs, 1975, Petrosjan and Zenkevich, 1996). Theorem 4 (Fink, 1964) Every n-player discounted stochastic game possesses at least one Nash equilibrium point in stationary strategies. In this paper, we limit our study to stationary strategies. Non-stationary strategie...

34 | A unified analysis of value-function-based reinforcement-learning algorithms - Szepesvári, Littman - 1999

30 | Convergence problems of general-sum multiagent reinforcement learning - Bowling - 2000

26 | Multi-agent reinforcement learning: A modular approach
- Ono, Fukumoto
- 1996
Citation Context: ...ase of application does, and the method has been employed in such multiagent domains as robotic soccer (Stone and Sutton, 2001, Balch, 1997), predator-and-prey pursuit games (Tan, 1993, De Jong, 1997, Ono and Fukumoto, 1996), and Internet pricebots (Kephart and Tesauro, 2000). Whereas it is possible to apply Q-learning in a straightforward fashion to each agent in a multiagent system, doing so (as recognized in several ...

25 | Convergence results for single-step on-policy reinforcement-learning algorithms - Singh, Jaakkola, Littman, Szepesvári

21 | A near optimal polynomial time algorithm for learning in certain classes of stochastic games - Brafman, Tennenholtz - 2000 |

20 | Residual Q-learning applied to visual attention
- Bandera, Vico, et al.
- 1996
Citation Context: ...arbitrarily choose four diverse instances) cellular telephone channel allocation (Singh and Bertsekas, 1996), spoken dialog systems (Walker, 2000), robotic control (Hagen, 2001), and computer vision (Bandera et al., 1996). Although the single-agent properties do not transfer directly to the multiagent context, its ease of application does, and the method has be...

17 | Multi-agent Q-learning and regression trees for automated pricing decisions - Sridharan, Tesauro - 2000

15 | Experimental results on Q-learning for general-sum stochastic games
- Hu, Wellman
- 2000
Citation Context: ...or at least regular in some important way. We might expect that accounting for this explicitly would be advantageous to the learner. Indeed, several studies (Littman, 1994, Claus and Boutilier, 1998, Hu and Wellman, 2000) have demonstrated situations where an agent who considers the effect of joint actions outperforms a corresponding agent who learns only in terms of its own actions. Boutilier (1999) studies the poss...

15 | A multiagent reinforcement learning algorithm using extended optimal response
- Suematsu, Hayashi
- 2002
Citation Context: ...Perhaps more promising than NashQ itself are the many conceivable extensions and variants that have already begun to appear in the literature. For example, the “extended optimal response” approach (Suematsu and Hayashi, 2002) maintains Q-tables for all agents, and anticipates the other agents’ actions based on a balance of their presumed optimal and observed behaviors. Another interesting direction is reflected in the wo...

8 | Conjectural equilibrium in multiagent learning
- Wellman, Hu
- 1998
Citation Context: ...σ_n(a_n) |Q^j(s, a_1, . . . , a_n) − Q̂^j(s, a_1, . . . , a_n)| ≤ ∑_{a_1,...,a_n} σ_1(a_1) · · · σ_n(a_n) ‖Q^j(s) − Q̂^j(s)‖ (15) = ‖Q^j(s) − Q̂^j(s)‖, 3. In our statement of this assumption in previous writings (Hu and Wellman, 1998, Hu, 1999), we neglected to include the qualification that the same condition be satisfied by all stage games. We have made the qualification more explicit subsequently (Hu and Wellman, 2000). As Bow...

6 | Continuous State Space Q-Learning for Control of Nonlinear Systems
- Hagen
- 2001
Citation Context: ...for example to such areas as (to arbitrarily choose four diverse instances) cellular telephone channel allocation (Singh and Bertsekas, 1996), spoken dialog systems (Walker, 2000), robotic control (Hagen, 2001), and computer vision (Bandera et al., 1996). Although the single-agent properties do not transfer directly to the multiagent context, its eas...

6 | Learning in Dynamic Noncooperative Multiagent Systems
- Hu
- 1999
Citation Context: ...|Q^j(s, a_1, . . . , a_n) − Q̂^j(s, a_1, . . . , a_n)| ≤ ∑_{a_1,...,a_n} σ_1(a_1) · · · σ_n(a_n) ‖Q^j(s) − Q̂^j(s)‖ (15) = ‖Q^j(s) − Q̂^j(s)‖, 3. In our statement of this assumption in previous writings (Hu and Wellman, 1998, Hu, 1999), we neglected to include the qualification that the same condition be satisfied by all stage games. We have made the qualification more explicit subsequently (Hu and Wellman, 2000). As Bowling (2000...

6 | A generalized reinforcement-learning model: Convergence and applications - Littman, Szepesvári - 1996

6 | A unified analysis of value-function-based reinforcement-learning algorithms - Littman, Szepesvári - 1999

5 | Non-random exploration bonuses for online reinforcement learning - De Jong - 1997

3 | Optimality and Equilibria in Stochastic Games. Centrum voor Wiskunde en Informatica
- Thuijsman
- 1992
Citation Context: ...ion that all states and actions have been visited infinitely often and the learning rate satisfies certain constraints. 2.2 Stochastic Games The framework of stochastic games (Filar and Vrieze, 1997, Thuijsman, 1992) models multiagent systems with discrete time¹ and noncooperative nature. We employ the term “noncooperative” in the technical game-theoretic sense, where it means that agents pursue their individual...

1 | Game Theory - Petrosjan, Zenkevich - 1996