## Convergence and no-regret in multiagent learning (2005)

Venue: Advances in Neural Information Processing Systems 17

Citations: 65 (0 self)

### BibTeX

@INPROCEEDINGS{Bowling05convergenceand,
  author = {Michael Bowling},
  title = {Convergence and no-regret in multiagent learning},
  booktitle = {Advances in Neural Information Processing Systems 17},
  year = {2005},
  pages = {209--216},
  publisher = {MIT Press}
}

### Abstract

Learning in a multiagent system is a challenging problem due to two key factors. First, if other agents are simultaneously learning then the environment is no longer stationary, thus undermining convergence guarantees. Second, learning is often susceptible to deception, where the other agents may be able to exploit a learner’s particular dynamics. In the worst case, this could result in poorer performance than if the agent was not learning at all. These challenges are identifiable in the two most common evaluation criteria for multiagent learning algorithms: convergence and regret. Algorithms focusing on convergence or regret in isolation are numerous. In this paper, we seek to address both criteria in a single algorithm by introducing GIGA-WoLF, a learning algorithm for normal-form games. We prove the algorithm guarantees at most zero average regret, while demonstrating the algorithm converges in many situations of self-play. We prove convergence in a limited setting and give empirical results in a wider variety of situations. These results also suggest a third new learning criterion combining convergence and regret, which we call negative non-convergence regret (NNR).
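To make the regret criterion in the abstract concrete, the following is a minimal sketch of how average regret against the best static strategy could be computed for a recorded sequence of play. It is not taken from the paper: the function names `expected_payoff` and `average_regret`, and the representation of each round as a mixed strategy plus a per-action reward vector, are assumptions made for illustration. Because expected payoff is linear in the strategy, the best static strategy in hindsight can be taken to be a single pure action.

```python
def expected_payoff(strategy, rewards):
    """Expected reward of a mixed strategy against one round's reward vector."""
    return sum(p * r for p, r in zip(strategy, rewards))


def average_regret(strategies, reward_vectors):
    """Average regret versus the best static action in hindsight.

    strategies[t]     -- the learner's mixed strategy in round t (hypothetical input)
    reward_vectors[t] -- the reward each action would have earned in round t
    """
    rounds = len(strategies)
    n_actions = len(reward_vectors[0])
    # Total expected reward the learner actually earned.
    earned = sum(expected_payoff(x, r) for x, r in zip(strategies, reward_vectors))
    # Total reward each fixed action would have earned over all rounds.
    static_totals = [sum(r[a] for r in reward_vectors) for a in range(n_actions)]
    return (max(static_totals) - earned) / rounds


# A learner that always plays action 0 while action 1 always pays 1
# has average regret 1; a no-regret learner drives this to at most zero.
always_first = [[1.0, 0.0]] * 4
rewards = [[0.0, 1.0]] * 4
print(average_regret(always_first, rewards))
```

A no-regret guarantee, in this reading, says the quantity returned by `average_regret` is at most zero in the limit of many rounds, whatever the other agents do.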

### Citations

498 | Markov games as a framework for multi-agent reinforcement learning
- Littman
- 1994
Citation Context ...settings. The desirability of convergence has been recently contested. We offer some brief insight into this debate in the introduction of the extended version of this paper [1]. Equilibrium learners [2, 3, 4] are one method of handling the loss of stationarity. These algorithms learn joint-action values, which are stationary, and in certain circumstances guarantee these values converge to Nash (or correla...

304 | The dynamics of reinforcement learning in cooperative multiagent systems
- Claus, Boutilier
- 1998
Citation Context ..., including when other agents play suboptimal responses. Equilibrium learners, therefore, can fail to learn best-response policies even against simple non-learning opponents. Best-response learners [5, 6, 7] are an alternative approach that has sought to learn best-responses, but still considering whether the resulting algorithm converges in some form. These approaches usually examine convergence in self...

283 | Multiagent reinforcement learning: Theoretical framework and an algorithm
- Hu, Wellman
- 1998
Citation Context ...settings. The desirability of convergence has been recently contested. We offer some brief insight into this debate in the introduction of the extended version of this paper [1]. Equilibrium learners [2, 3, 4] are one method of handling the loss of stationarity. These algorithms learn joint-action values, which are stationary, and in certain circumstances guarantee these values converge to Nash (or correla...

219 | A simple adaptive procedure leading to correlated equilibrium
- Hart, Mas-Colell
- 2000
Citation Context ...earner PHC [7] could be exploited by a particular dynamic strategy. One method of measuring whether an algorithm can be exploited is the notion of regret. Regret has been explored both in game theory [9] and computer science [10, 11]. Regret measures how much worse an algorithm performs compared to the best static strategy, with the goal to guarantee at least zero average regret, no-regret, in the li...

186 | Gambling in a rigged casino: The adversarial multi-armed bandit problem
- Auer, Cesa-Bianchi, et al.
- 1995
Citation Context ...xploited by a particular dynamic strategy. One method of measuring whether an algorithm can be exploited is the notion of regret. Regret has been explored both in game theory [9] and computer science [10, 11]. Regret measures how much worse an algorithm performs compared to the best static strategy, with the goal to guarantee at least zero average regret, no-regret, in the limit. These two challenges resu...

183 | Online convex programming and generalized infinitesimal gradient ascent
- Zinkevich
- 2003
Citation Context ...xploited by a particular dynamic strategy. One method of measuring whether an algorithm can be exploited is the notion of regret. Regret has been explored both in game theory [9] and computer science [10, 11]. Regret measures how much worse an algorithm performs compared to the best static strategy, with the goal to guarantee at least zero average regret, no-regret, in the limit. These two challenges resu...

180 | Multiagent learning using a variable learning rate
- Bowling, Veloso
- 2002
Citation Context ..., including when other agents play suboptimal responses. Equilibrium learners, therefore, can fail to learn best-response policies even against simple non-learning opponents. Best-response learners [5, 6, 7] are an alternative approach that has sought to learn best-responses, but still considering whether the resulting algorithm converges in some form. These approaches usually examine convergence in self...

169 | Game Theory Evolving
- Gintis
- 2000
Citation Context ... 4: Trajectories of joint strategies in Rock-Paper-Scissors when both players use GIGA (a), GIGA-WoLF (b), or when GIGA plays against GIGA-WoLF (c). In Figure 6, we show results from the Blotto Game (Gintis, 2000), a more complicated 4 × 5 action zero-sum game. The plot shows player one’s action probabilities over time, further demonstrating GIGA-WoLF converging to a Nash equilibrium in self-play, while GIGA ...

91 | Nash convergence of gradient dynamics in general-sum games
- Singh, Kearns, et al.
- 2000
Citation Context ..., including when other agents play suboptimal responses. Equilibrium learners, therefore, can fail to learn best-response policies even against simple non-learning opponents. Best-response learners [5, 6, 7] are an alternative approach that has sought to learn best-responses, but still considering whether the resulting algorithm converges in some form. These approaches usually examine convergence in self...

56 | Correlated Q-learning
- Greenwald, Hall
- 2003
Citation Context ...settings. The desirability of convergence has been recently contested. We offer some brief insight into this debate in the introduction of the extended version of this paper [1]. Equilibrium learners [2, 3, 4] are one method of handling the loss of stationarity. These algorithms learn joint-action values, which are stationary, and in certain circumstances guarantee these values converge to Nash (or correla...

50 | The extragradient method for finding saddle points and other problems
- Korpelevich
Citation Context ...otice also that, as long as xt is not near the boundary, the change due to step (3) is of lower magnitude than the change due 4 WoLF-IGA may, in fact, be a limited variant of the extragradient method [13] for variational inequality problems. The extragradient algorithm is guaranteed to converge to a Nash equilibrium in self-play for all zero-sum games. Like WoLF-IGA, though, it does not have any known...

30 | Playing is believing: The role of beliefs in multi-agent learning
- Chang, Kaelbling
- 2001
Citation Context ...er opponent. A deceptive strategy may “lure” a dynamic strategy away from a safe choice in order to switch to a strategy where the learner receives much lower reward. For example, Chang and Kaelbling [8] demonstrated that the best-response learner PHC [7] could be exploited by a particular dynamic strategy. One method of measuring whether an algorithm can be exploited is the notion of regret. Regret ...

5 | On no-regret learning, fictitious play, and Nash equilibrium
- Jafari, Greenwald, et al.
Citation Context ...ost exclusively been explored in isolation. For example, equilibrium learners can have arbitrarily large average regret. On the other hand, no-regret learners’ strategies rarely converge in self-play [12] in even the simplest of games. In this paper, we seek to explore these two criteria in a single algorithm for learning in normal-form games. In Section 2 we present a more formal description of the...