## If multi-agent learning is the answer, what is the question? (2007)

Venue: Artificial Intelligence

Citations: 78 (0 self)

### BibTeX

```bibtex
@ARTICLE{Shoham07ifmulti-agent,
  author  = {Yoav Shoham and Rob Powers and Trond Grenager},
  title   = {If multi-agent learning is the answer, what is the question?},
  journal = {Artificial Intelligence},
  year    = {2007},
  volume  = {171},
  pages   = {365--377}
}
```

### Abstract

The area of learning in multi-agent systems is today one of the most fertile grounds for interaction between game theory and artificial intelligence. We focus on the foundational questions in this interdisciplinary area, and identify several distinct agendas that ought to, we argue, be separated. The goal of this article is to start a discussion in the research community that will result in firmer foundations for the area.

### Citations

4113 | Reinforcement Learning: An Introduction
- Sutton, Barto
- 1998
Citation Context ... the history of single-agent learning is as rich if not richer, with thousands of articles, many books, and some very compelling applications in a variety of fields (for some examples see [29,40], or [50]). While it is only in recent years that AI has branched into the multi-agent aspects of learning, it has done so with ...

2851 | Dynamic Programming
- Bellman
- 1957
Citation Context ...one learns how well one's own various possible actions fare. This work takes place under the general heading of reinforcement learning, and most approaches have their roots in the Bellman equations [3]. The basic algorithm for solving for the best policy in a known MDP starts by initializing a value function, V0: S → R, with a value for each state in the MDP. The value function can then be iterati...
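The Bellman-style backup described in this excerpt, starting from an initial value function V0 and iterating to the optimal values, can be sketched concretely. The two-state MDP below is a hypothetical example chosen only to illustrate the iteration, not taken from the paper.

```python
# Value iteration for a known MDP:
# V_{k+1}(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) * V_k(s') ]
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-9):
    V = {s: 0.0 for s in states}  # V0 : S -> R, initialized to zero
    while True:
        V_new = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                        for a in actions)
                 for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

# Hypothetical two-state MDP: 'go' switches state, 'stay' stays; only staying
# in B is rewarded, so V(B) = 1/(1 - gamma) = 10 and V(A) = gamma * V(B) = 9.
states, actions = ["A", "B"], ["stay", "go"]
P = {"A": {"stay": {"A": 1.0}, "go": {"B": 1.0}},
     "B": {"stay": {"B": 1.0}, "go": {"A": 1.0}}}
R = {"A": {"stay": 0.0, "go": 0.0}, "B": {"stay": 1.0, "go": 0.0}}
V = value_iteration(states, actions, P, R)
```

Each sweep contracts the error by a factor of γ, which is why the loop may simply run until successive value functions agree to within a tolerance.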

2470 | A decision-theoretic generalization of online learning and an application to boosting
- Freund, Schapire
- 1997
Citation Context ...ch of its actions with probability proportional to max(r_i^t(a_j, s_i), 0) at each time step t + 1. Recently, these ideas have also been adopted by researchers in the computer science community (e.g., [17,27,55]). Note that the application of approaches based on regret minimization has been restricted to the case of repeated games. The difficulties of extending this concept to stochastic games are discussed ...

1387 | Reinforcement learning: A survey
- Kaelbling, Littman, et al.
- 1996
Citation Context ...ligence (AI) the history of single-agent learning is as rich if not richer, with thousands of articles, many books, and some very compelling applications in a variety of fields (for some examples see [29,40], or [50]). While it is only in recent years that AI has branched into the multi-agent aspects of learning, it has done so with ...

973 | The Evolution and the Theory of Games
- Smith
- 1982
Citation Context ...hem, we also intend our comments to apply at a general level to large population games and evolutionary models, and particularly replicator dynamics (RD) [47] and evolutionary stable strategies (ESS) [49]. These are defined as follows. The replicator dynamic model assumes a population of homogeneous agents each of which continuously plays a two-player game against every other agent. Formally the setti...
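The replicator dynamic defined in this excerpt can be sketched in discrete time: each strategy's population share grows in proportion to its fitness relative to the population average. The payoff matrix below is a hypothetical Hawk-Dove-style game (shifted to positive payoffs), chosen so the dynamic settles at an interior mixed rest point, which is also the game's ESS.

```python
# Discrete-time replicator dynamic for a symmetric two-strategy game:
# x_i <- x_i * f_i / f_avg, where f_i = (A x)_i is strategy i's fitness
# against the current population mix x.
def replicate(A, x, steps=500):
    n = len(x)
    for _ in range(steps):
        f = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        avg = sum(x[i] * f[i] for i in range(n))
        x = [x[i] * f[i] / avg for i in range(n)]
    return x

# Hypothetical Hawk-Dove payoffs; both strategies earn equal fitness at the
# mix (0.5, 0.5), which is the interior rest point the dynamic approaches.
A = [[1.0, 4.0],
     [2.0, 3.0]]
x = replicate(A, [0.9, 0.1])
```

Note that shares stay normalized automatically, since each update divides by the population-average fitness.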

875 | The theory of learning in games
- Fudenberg, Levine
- 1998
Citation Context ...oposals for how to evaluate the success of learning rules going back to [23] and [5]. Since that time hundreds, if not thousands, of articles have been published on the topic, and at least two books ([20] and [54]). In Artificial Intelligence (AI) the history of single-agent learning is as rich if not richer, with thousands of articles, many books, and some very compelling applications in a variety of...

525 | Markov games as a framework for multi-agent reinforcement learning
- Littman
- 1994
Citation Context ...erature attempting to extend Bellman-style single-agent reinforcement learning techniques (in particular, Q-learning [53]) to the multi-agent setting, has fared well in zero-sum repeated games (e.g., [36] and [38]) as well as common-payoff (or 'team') repeated games (e.g., [14,31,52]), but less well in general-sum stochastic games (e.g., [21,26,37]) (for the reader unfamiliar with this line of work, w...

459 | Predicting how people play games: Reinforcement learning in experimental games with unique, mixed strategy equilibria
- Erev, Roth
- 1998
Citation Context ...arning dynamics being studied. One approach is to apply the experimental methodology of the social sciences. There are several good examples of this approach in economics and game theory, for example [15] and [11]. There could be other supports for studying a given learning process. For example, to the extent that one accepts the Bayesian model as at least an idealized model of human decision making, ...

321 | The dynamics of reinforcement learning in cooperative multiagent systems
- Claus, Boutilier
- 1998
Citation Context ...ning techniques (in particular, Q-learning [53]) to the multi-agent setting, has fared well in zero-sum repeated games (e.g., [36] and [38]) as well as common-payoff (or 'team') repeated games (e.g., [14,31,52]), but less well in general-sum stochastic games (e.g., [21,26,37]) (for the reader unfamiliar with this line of work, we cover it briefly in Section 4). Indeed, upon close examination, it becomes cle...

297 | Multiagent reinforcement learning: Theoretical framework and an algorithm
- Hu, Wellman
- 1998
Citation Context ...t setting, has fared well in zero-sum repeated games (e.g., [36] and [38]) as well as common-payoff (or 'team') repeated games (e.g., [14,31,52]), but less well in general-sum stochastic games (e.g., [21,26,37]) (for the reader unfamiliar with this line of work, we cover it briefly in Section 4). Indeed, upon close examination, it becomes clear that the very foundations of MAL could benefit from explicit di...

248 | Near-optimal reinforcement learning in polynomial time
- Kearns, Singh
Citation Context ...and thus the agents' problem is that of coordination; indeed these are also called games of pure coordination. The work on zero-sum and common-payoff games continues to be refined and extended (e.g., [8,32,34,52]). Much of this work has concentrated on provably optimal tradeoffs between exploration and exploitation in unknown, zero-sum games; this is a fascinating topic, but not germane to our focus. More rele...

245 | R-max - a general polynomial time algorithm for near-optimal reinforcement learning
- Brafman, Tennenholtz, et al.
- 2002
Citation Context ...and thus the agents' problem is that of coordination; indeed these are also called games of pure coordination. The work on zero-sum and common-payoff games continues to be refined and extended (e.g., [8,32,34,52]). Much of this work has concentrated on provably optimal tradeoffs between exploration and exploitation in unknown, zero-sum games; this is a fascinating topic, but not germane to our focus. More rele...

243 | Rational Learning Leads to Nash Equilibrium
- Kalai, Lehrer
- 1993
Citation Context ... of different ways to assign these probabilities such as smooth fictitious play [18] and exponential fictitious play [19]. A more sophisticated version of the same scheme is seen in rational learning [30]. The model is a distribution over the repeated-game strategies. One starts with some prior distribution; for example, in a repeated Rochambeau game, the prior could state that with probability 0.5 th...

242 | A simple adaptive procedure leading to correlated equilibrium
- Hart, Mas-Colell
- 2000
Citation Context ...merous times over the years under the names of universal consistency, no-regret learning, and the Bayes envelope (see [16] for an overview of this history). We will describe the algorithm proposed in [24] as a representative of this body of work. We start by defining the regret, r_i^t(a_j, s_i), of agent i for playing the sequence of actions s_i instead of playing action a_j, given that the opponents play...
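A minimal sketch of the regret-matching rule this entry describes, under assumptions of my own: the two-action matching game and the stationary randomizing opponent below are hypothetical. Cumulative regret is tracked per action, and actions are sampled in proportion to positive regret; since the rule is Hannan-consistent, the long-run average payoff should approach the best-response payoff (here 0.8).

```python
import random

random.seed(0)

# Regret matching (after Hart & Mas-Colell): track, for each action a_j, the
# cumulative regret for not having always played a_j; then play each action
# with probability proportional to its positive regret.
def regret_matching(n_actions, payoff, opponent, rounds=20000):
    regret = [0.0] * n_actions
    total = 0.0
    for _ in range(rounds):
        pos = [max(r, 0.0) for r in regret]
        if sum(pos) > 0:
            a = random.choices(range(n_actions), weights=pos)[0]
        else:
            a = random.randrange(n_actions)  # no positive regret: play uniformly
        b = opponent()
        total += payoff[a][b]
        for j in range(n_actions):
            # what a_j would have earned this round, minus what we earned
            regret[j] += payoff[j][b] - payoff[a][b]
    return total / rounds

# Hypothetical matching game: payoff 1 on a match, 0 otherwise. The opponent
# plays action 0 with probability 0.8, so a best response earns 0.8 on average.
payoff = [[1.0, 0.0], [0.0, 1.0]]
avg = regret_matching(2, payoff, lambda: 0 if random.random() < 0.8 else 1)
```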

193 | Online convex programming and generalized infinitesimal gradient ascent
- Zinkevich
- 2003
Citation Context ...ch of its actions with probability proportional to max(r_i^t(a_j, s_i), 0) at each time step t + 1. Recently, these ideas have also been adopted by researchers in the computer science community (e.g., [17,27,55]). Note that the application of approaches based on regret minimization has been restricted to the case of repeated games. The difficulties of extending this concept to stochastic games are discussed ...

159 | Learning to coordinate without sharing information
- Sen, Sekaran, et al.
- 1994
Citation Context ...environment is passive: Qi(s, ai) ← (1 − αt) Qi(s, ai) + αt [Ri(s, a) + γ Vi(s′)]; Vi(s) ← max_{ai∈Ai} Qi(s, ai). Several authors have tested variations of the basic Q-learning algorithm for MAL (e.g., [48]). However, this approach ignores the multi-agent nature of the setting entirely. The Q-values are updated without regard for the actions selected by the other agents. While this can be justified when...
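The update pair quoted above can be sketched as code. The single-state task, rewards, and learning rate below are hypothetical, chosen so the fixed point is easy to check: V(s) = 1/(1 − γ) = 20.

```python
# Single-agent Q-learning backup as in the excerpt: agent i updates its own
# Q-value from its own reward, ignoring what the other agents played.
def q_update(Q, V, actions, s, a_i, reward, s_next, alpha=0.1, gamma=0.95):
    Q[(s, a_i)] = (1 - alpha) * Q.get((s, a_i), 0.0) \
                  + alpha * (reward + gamma * V.get(s_next, 0.0))
    V[s] = max(Q.get((s, a), 0.0) for a in actions)  # greedy value of s

# Hypothetical one-state task: action 1 pays 1, action 0 pays 0, and the
# state loops to itself, so V(s) converges to 1 / (1 - 0.95) = 20.
Q, V = {}, {}
actions = [0, 1]
for _ in range(5000):
    for a in actions:
        q_update(Q, V, actions, "s", a, reward=float(a), s_next="s")
```

As the excerpt notes, nothing in this update looks at the other agents' actions; that is precisely the criticism being made of applying it unchanged to the multi-agent setting.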

149 | Multiagent planning with factored MDPs
- Guestrin, Koller, et al.
- 2001
Citation Context ... or time required to learn the policy. In this case there is rarely a role for equilibrium analysis; the agents have no freedom to deviate from the prescribed algorithm. Examples of this work include [12,14,22] to name a small sample. Researchers interested in this agenda have access to a large body of existing work both within AI and other fields such as control theory and distributed computing. In our fin...

141 | Learning mixed equilibria
- Fudenberg, Kreps
- 1993
Citation Context ...the highest probability, but allowing some chance of playing any of the strategies. A number of proposals have been made of different ways to assign these probabilities such as smooth fictitious play [18] and exponential fictitious play [19]. A more sophisticated version of the same scheme is seen in rational learning [30]. The model is a distribution over the repeated-game strategies. One starts with...

135 | Approximating game-theoretic optimal strategies for full-scale poker
- Billings, Burch, et al.
- 2003
Citation Context ...ples where computer calculation of approximate equilibria within a restricted strategy space provided valuable guidance in constructing effective strategies. This includes the game of Poker ([33] and [4]), and even, as an exception to the general rule we mentioned, one program that competed in the Trading Agent Competition [13]. Our point has only been that in the context of complex games, so-called ...

128 | Representations and solutions for game-theoretic problems
- Koller, Pfeffer
- 1997
Citation Context ... are examples where computer calculation of approximate equilibria within a restricted strategy space provided valuable guidance in constructing effective strategies. This includes the game of Poker ([33] and [4]), and even, as an exception to the general rule we mentioned, one program that competed in the Trading Agent Competition [13]. Our point has only been that in the context of complex games, so...

124 | Friend-or-Foe Q-learning in general-sum games
- Littman
- 2001
Citation Context ...t setting, has fared well in zero-sum repeated games (e.g., [36] and [38]) as well as common-payoff (or 'team') repeated games (e.g., [14,31,52]), but less well in general-sum stochastic games (e.g., [21,26,37]) (for the reader unfamiliar with this line of work, we cover it briefly in Section 4). Indeed, upon close examination, it becomes clear that the very foundations of MAL could benefit from explicit di...

114 | Regret in the on-line decision problem
- Foster, Vohra
- 1999
Citation Context ... success of learning rules [5,23], and has since been extended and rediscovered numerous times over the years under the names of universal consistency, no-regret learning, and the Bayes envelope (see [16] for an overview of this history). We will describe the algorithm proposed in [24] as a representative of this body of work. We start by defining the regret, r_i^t(a_j, s_i), of agent i for playing the s...

114 | Nash q-learning for general-sum stochastic games
- Hu, Wellman
Citation Context ...s; this is a fascinating topic, but not germane to our focus. More relevant are the most recent efforts in this line of research to extend the "Bellman heritage" to general-sum games (e.g., Nash-Q by [25] and CE-Q by [21]). We do not cover these for two reasons: The description is more involved, and the results have been less satisfactory; more on the latter below. 4.1.3. Regret minimization approache...

111 | Consistency and cautious fictitious play
- Fudenberg, Levine
- 1995
Citation Context ... some chance of playing any of the strategies. A number of proposals have been made of different ways to assign these probabilities such as smooth fictitious play [18] and exponential fictitious play [19]. A more sophisticated version of the same scheme is seen in rational learning [30]. The model is a distribution over the repeated-game strategies. One starts with some prior distribution; for example...

106 | Approximation to Bayes risk in repeated play (Contributions to the Theory of Games)
- Hannan
- 1957
Citation Context ...arly as 1951, fictitious play [10] was proposed as a learning algorithm for computing equilibria in games and there have been proposals for how to evaluate the success of learning rules going back to [23] and [5]. Since that time hundreds, if not thousands, of articles have been published on the topic, and at least two books ([20] and [54]). In Artificial Intelligence (AI) the history of single-agent ...

90 | Strategic Learning and its Limits
- Young
- 2006
Citation Context ...or how to evaluate the success of learning rules going back to [23] and [5]. Since that time hundreds, if not thousands, of articles have been published on the topic, and at least two books ([20] and [54]). In Artificial Intelligence (AI) the history of single-agent learning is as rich if not richer, with thousands of articles, many books, and some very compelling applications in a variety of fields (...

87 | Rational and convergent learning in stochastic games
- Bowling, Veloso
Citation Context ...ussed. It clearly defines what it means for a reward to be high enough (namely, to exhibit no regret); we also discussed the limitations of this criterion. A more recent example, this one from AI, is [7]. This work puts forward two criteria for any learning algorithm in a multi-agent setting: (1) The learning should always converge to a stationary policy, and (2) if the opponent converges to a statio...

82 | Iterative solution of games by fictitious play (Activity Analysis of Production and Allocation)
- Brown
- 1951
Citation Context ... learning in multi-agent systems, or multi-agent learning (MAL henceforth), has a long history in game theory, almost as long as the history of game theory itself. As early as 1951, fictitious play [10] was proposed as a learning algorithm for computing equilibria in games and there have been proposals for how to evaluate the success of learning rules going back to [23] and [5]. Since that time hund...

81 | Reinforcement learning to play an optimal Nash equilibrium in team Markov games
- Wang, Sandholm
- 2003
Citation Context ...ning techniques (in particular, Q-learning [53]) to the multi-agent setting, has fared well in zero-sum repeated games (e.g., [36] and [38]) as well as common-payoff (or 'team') repeated games (e.g., [14,31,52]), but less well in general-sum stochastic games (e.g., [21,26,37]) (for the reader unfamiliar with this line of work, we cover it briefly in Section 4). Indeed, upon close examination, it becomes cle...

80 | An algorithm for distributed reinforcement learning in cooperative multi-agent systems
- Lauer, Riedmiller
- 2000
Citation Context ...and thus the agents' problem is that of coordination; indeed these are also called games of pure coordination. The work on zero-sum and common-payoff games continues to be refined and extended (e.g., [8,32,34,52]). Much of this work has concentrated on provably optimal tradeoffs between exploration and exploitation in unknown, zero-sum games; this is a fascinating topic, but not germane to our focus. More rele...

71 | "Evolutionary" selection dynamics in games: Convergence and limit properties
- Nachbar
- 1990
Citation Context ...its play can be shown to converge to an equilibrium in zero-sum games [46], 2 × 2 games with generic payoffs [41], or games that can be solved by iterated elimination of strictly dominated strategies [42]. Similarly in AI, in [38] minimax-Q learning is proven to converge in the limit to the correct Q-values for any zero-sum game, guaranteeing convergence to a Nash equilibrium in self-play. This result...

70 | Convergence and no-regret in multiagent learning
- Bowling
- 2004
Citation Context ... r_i(a_j, s_i | s_−i) < ε. In both game theory and artificial intelligence, a large number of algorithms have been shown to satisfy universal consistency or no-regret requirements. In addition, recent work [6] has tried to combine these criteria, resulting in GIGA-WoLF, a no-regret algorithm that provably achieves convergence to a Nash equilibrium in self-play for games with two players and two actions per ...

70 | Replicator dynamics
- Schuster, Sigmund
- 1983
Citation Context ...l. Although we will not specifically include them, we also intend our comments to apply at a general level to large population games and evolutionary models, and particularly replicator dynamics (RD) [47] and evolutionary stable strategies (ESS) [49]. These are defined as follows. The replicator dynamic model assumes a population of homogeneous agents each of which continuously plays a two-player game...

68 | Rationality of Self and Others in an Economic System
- Arrow
- 1986
Citation Context ...extend naturally to the multi-agent case. It has been noted that game theory is somewhat unusual in having the notion of an equilibrium without associated dynamics that give rise to the equilibrium [1]. ... (1) (Targeted optimality) The algorithm must achieve an ε-optimal payoff against any 'target opponent', (2) (Safety) The algorithm must achieve at...

67 | Run the GAMUT: A comprehensive approach to evaluating game-theoretic algorithms
- Nudelman, Wortman, et al.
- 2004
Citation Context ...e bake-off between our proposed algorithms and the other leading contenders across a broad range of games. The algorithms we coded ourselves; the games were drawn from GAMUT, an existing testbed (see [43] and http://gamut.stanford.edu). GAMUT is available to the community at large. It would be useful to have a learning-algorithm repository as well. To conclude, we re-emphasize the statement made at th...

60 | Correlated Q-learning
- Greenwald, Hall
- 2003
Citation Context ...t setting, has fared well in zero-sum repeated games (e.g., [36] and [38]) as well as common-payoff (or 'team') repeated games (e.g., [14,31,52]), but less well in general-sum stochastic games (e.g., [21,26,37]) (for the reader unfamiliar with this line of work, we cover it briefly in Section 4). Indeed, upon close examination, it becomes clear that the very foundations of MAL could benefit from explicit di...

55 | New criteria and a new algorithm for learning in multi-agent systems
- Powers, Shoham
- 2005
Citation Context ...ent's strategy, is not feasible. But this work, to our knowledge, marks the first time a formal criterion was put forward in AI. A third example of the last agenda is our own work in recent years. In [45] we define a criterion parameterized by a class of 'target opponents'; with this parameter we make three requirements of any learning algorithm: (1) (Targeted ...

54 | Controlled random walks
- Blackwell
- 1956
Citation Context ...951, fictitious play [10] was proposed as a learning algorithm for computing equilibria in games and there have been proposals for how to evaluate the success of learning rules going back to [23] and [5]. Since that time hundreds, if not thousands, of articles have been published on the topic, and at least two books ([20] and [54]). In Artificial Intelligence (AI) the history of single-agent learning...

51 | Local-effect games
- Leyton-Brown, Tennenholtz
- 2003
Citation Context ...a sample Nash equilibrium in symmetric games. Other adaptive procedures have been proposed more recently for computing other solution concepts (for example, computing equilibria in local-effect games [35]). These tend not to be the most efficient computation methods, but they do sometimes constitute quick-and-dirty methods that can easily be understood and implemented. The second agenda is descriptive...

48 | Efficient learning equilibrium
- Brafman, Tennenholtz
Citation Context ...ma game. Although one might expect that game theory purists might flock to this approach, there are very few examples of it. In fact, the only example we know originates in AI rather than game theory [9], and it is explicitly rejected by at least some game theorists [18]. We consider it a legitimate normative theory. Its practicality depends on the complexity of the stage game being played and the le...

43 | A generalized reinforcement-learning model: Convergence and applications
- Littman, Szepesvári
- 1996
Citation Context ...ttempting to extend Bellman-style single-agent reinforcement learning techniques (in particular, Q-learning [53]) to the multi-agent setting, has fared well in zero-sum repeated games (e.g., [36] and [38]) as well as common-payoff (or 'team') repeated games (e.g., [14,31,52]), but less well in general-sum stochastic games (e.g., [21,26,37]) (for the reader unfamiliar with this line of work, we cover i...

42 | Learning against opponents with bounded memory
- Powers, Shoham
Citation Context ...s the set of stationary opponents in general-sum two-player repeated games. More recent work has extended these results to handle opponents whose play is conditional on the recent history of the game [44] and settings with more than two players [51]. 6. Summary In this article we have made the following points: 1. Learning in MAS is conceptually, not only technically, challenging. 2. One needs to be c...

31 | On no-regret learning, fictitious play, and nash equilibrium
- Jafari, Greenwald, et al.
- 2001
Citation Context ...ch of its actions with probability proportional to max(r_i^t(a_j, s_i), 0) at each time step t + 1. Recently, these ideas have also been adopted by researchers in the computer science community (e.g., [17,27,55]). Note that the application of approaches based on regret minimization has been restricted to the case of repeated games. The difficulties of extending this concept to stochastic games are discussed ...

30 | Sophisticated EWA learning and strategic teaching in repeated games
- Camerer, Ho, et al.
- 2002
Citation Context ...namics being studied. One approach is to apply the experimental methodology of the social sciences. There are several good examples of this approach in economics and game theory, for example [15] and [11]. There could be other supports for studying a given learning process. For example, to the extent that one accepts the Bayesian model as at least an idealized model of human decision making, one could...

25 | An iterative method of solving a game (Annals of Mathematics 54:298–301)
- Robinson
- 1951
Citation Context ... AI. For example, while fictitious play does not in general converge to a Nash equilibrium of the stage game, the distribution of its play can be shown to converge to an equilibrium in zero-sum games [46], 2 × 2 games with generic payoffs [41], or games that can be solved by iterated elimination of strictly dominated strategies [42]. Similarly in AI, in [38] minimax-Q learning is proven to converge in...

23 | Walverine: A Walrasian Trading Agent
- Cheng, Leung, et al.
- 2003
Citation Context ...nstructing effective strategies. This includes the game of Poker ([33] and [4]), and even, as an exception to the general rule we mentioned, one program that competed in the Trading Agent Competition [13]. Our point has only been that in the context of complex games, so-called "bounded rationality", or the deviation from the ideal behavior of omniscient agents, is not an esoteric phenomenon to be brus...

19 | Mobilized ad-hoc networks: A reinforcement learning approach
- Chang, Ho, Kaelbling
- 2004
Citation Context ... or time required to learn the policy. In this case there is rarely a role for equilibrium analysis; the agents have no freedom to deviate from the prescribed algorithm. Examples of this work include [12,14,22] to name a small sample. Researchers interested in this agenda have access to a large body of existing work both within AI and other fields such as control theory and distributed computing. In our fin...

17 | Learning to play games in extensive form by valuation
- Jehiel, Samet
- 2005
Citation Context ...etting, we would be remiss not to acknowledge the literature that does not. Certainly one could discuss learning in the context of extensive-form games of incomplete and/or imperfect information (cf. [28]). We don't dwell on those since it would distract from the main discussion, and since the lessons we draw from our setting will apply there as well. Although we will not specifically include them, we...

17 | The empirical Bayes envelope and regret minimization in competitive Markov decision processes
- Mannor, Shimkin
Citation Context ...Note that the application of approaches based on regret minimization has been restricted to the case of repeated games. The difficulties of extending this concept to stochastic games are discussed in [39]. 4.2. Some typical results One sees at least three kinds of results in the literature regarding the learning algorithms presented above, and others like them. These are: ...

13 | Learning against multiple opponents
- Vu, Powers, et al.
- 2006
Citation Context ...sum two-player repeated games. More recent work has extended these results to handle opponents whose play is conditional on the recent history of the game [44] and settings with more than two players [51]. 6. Summary In this article we have made the following points: 1. Learning in MAS is conceptually, not only technically, challenging. 2. One needs to be crystal clear about the problem being addresse...