Results 1–10 of 78
The nonstochastic multiarmed bandit problem
 SIAM Journal on Computing
, 2002
"... In the multiarmed bandit problem, a gambler must decide which arm of £ nonidentical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the tradeoff between exploration (trying ou ..."
Abstract

Cited by 316 (27 self)
In the multi-armed bandit problem, a gambler must decide which arm of K non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the tradeoff between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines. In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of T plays, we prove that the per-round payoff of our algorithm approaches that of the best arm at the rate O(T^{-1/2}). We show by a matching lower bound that this is best possible. We also prove that our algorithm approaches the per-round payoff of any set of strategies at a similar rate: if the best strategy is chosen from a pool of N strategies, then our algorithm approaches the per-round payoff of the strategy at the rate O((log N)^{1/2} T^{-1/2}). Finally, we apply our results to the problem of playing an unknown repeated matrix game. We show that our algorithm approaches the minimax payoff of the unknown game at the rate O(T^{-1/2}).
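As a rough illustration of the exponential-weighting-with-exploration idea the abstract describes, here is a minimal Exp3-style sketch. The function name, the `reward_fn` interface, and the single `gamma` parameter are illustrative choices, not the paper's exact pseudocode or tuning.

```python
import math
import random

def exp3(num_arms, num_rounds, reward_fn, gamma=0.1):
    """Exp3-style sketch: exponential weights plus uniform exploration.

    `reward_fn(t, arm)` returns this round's reward in [0, 1] for the
    pulled arm only (bandit feedback). Illustrative, not the paper's
    exact algorithm or parameterization.
    """
    weights = [1.0] * num_arms
    total_reward = 0.0
    for t in range(num_rounds):
        w_sum = sum(weights)
        # Mix the weight-proportional distribution with uniform exploration.
        probs = [(1 - gamma) * w / w_sum + gamma / num_arms for w in weights]
        arm = random.choices(range(num_arms), weights=probs)[0]
        reward = reward_fn(t, arm)
        total_reward += reward
        # Importance-weighted estimate keeps the update unbiased even
        # though only the pulled arm's reward is observed.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / num_arms)
    return total_reward
```

Against an adversary-controlled reward sequence, the exploration floor `gamma / num_arms` is what keeps the importance weights bounded.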
A Simple Adaptive Procedure Leading to Correlated Equilibrium
 Econometrica, September
"... We propose a new and simple adaptive procedure for playing a game: ‘‘regretmatching.’’ In this procedure, players may depart from their current play with probabilities that are proportional to measures of regret for not having used other strategies in the past. It is shown that our adaptive procedu ..."
Abstract

Cited by 226 (13 self)
We propose a new and simple adaptive procedure for playing a game: “regret-matching.” In this procedure, players may depart from their current play with probabilities that are proportional to measures of regret for not having used other strategies in the past. It is shown that our adaptive procedure guarantees that, with probability one, the empirical distributions of play converge to the set of correlated equilibria of the game.
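The core of the regret-matching rule above fits in a few lines: play each action with probability proportional to its positive cumulative regret, and uniformly when no regret is positive. This is a sketch of the idea, with illustrative names, not the paper's formal statement.

```python
def regret_matching_step(cum_regrets):
    """One mixed strategy from cumulative regrets (regret-matching sketch).

    Probabilities are proportional to the positive part of each action's
    cumulative regret; if no regret is positive, play uniformly.
    """
    positives = [max(r, 0.0) for r in cum_regrets]
    total = sum(positives)
    n = len(cum_regrets)
    if total <= 0:
        return [1.0 / n] * n  # no regret: any play is fine, use uniform
    return [p / total for p in positives]
```

For example, with cumulative regrets `[2, -1, 2]` the rule splits play evenly between the first and third actions and never plays the second.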
Gambling in a rigged casino: The adversarial multiarmed bandit problem
, 1995
"... In the multiarmed bandit problem, a gambler must decide which arm of K nonidentical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the tradeoff between exploration (trying ou ..."
Abstract

Cited by 189 (7 self)
In the multi-armed bandit problem, a gambler must decide which arm of K non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the tradeoff between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines. In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of T plays, we prove that the expected per-round payoff of our algorithm approaches that of the best arm at the rate O(T^{-1/2}), and we give an improved rate of conver...
Online Convex Programming and Generalized Infinitesimal Gradient Ascent
, 2003
"... Convex programming involves a convex set F R and a convex function c : F ! R. The goal of convex programming is to nd a point in F which minimizes c. In this paper, we introduce online convex programming. In online convex programming, the convex set is known in advance, but in each step of some ..."
Abstract

Cited by 183 (4 self)
Convex programming involves a convex set F ⊆ R^n and a convex function c : F → R. The goal of convex programming is to find a point in F which minimizes c. In this paper, we introduce online convex programming. In online convex programming, the convex set is known in advance, but in each step of some repeated optimization problem, one must select a point in F before seeing the cost function for that step. This can be used to model factory production, farm production, and many other industrial optimization problems where one is unaware of the value of the items produced until they have already been constructed. We introduce an algorithm for this domain, apply it to repeated games, and show that it is really a generalization of infinitesimal gradient ascent, and the results here imply that generalized infinitesimal gradient ascent (GIGA) is universally consistent.
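The online convex programming setting above is commonly addressed by a projected gradient step: move against the gradient of the just-revealed cost, then project back onto the feasible set. The sketch below uses a box constraint `[lo, hi]` as an illustrative stand-in for a general convex set F; the function and parameter names are assumptions, not the paper's notation.

```python
def online_gradient_step(x, grad, eta, lo, hi):
    """One projected online gradient descent step (sketch).

    Move against the observed gradient with step size `eta`, then
    project back onto the interval [lo, hi]. A 1-D box stands in for
    a general convex feasible set.
    """
    x = x - eta * grad
    return min(max(x, lo), hi)
```

With a decaying step size such as eta_t = 1/sqrt(t+1), iterates on a fixed cost like c(x) = (x - 0.7)^2 over [0, 1] settle near the minimizer, and the same update gives vanishing average regret when the cost changes each round.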
SCHAPIRE: Adaptive game playing using multiplicative weights
 Games and Economic Behavior
, 1999
"... We present a simple algorithm for playing a repeated game. We show that a player using this algorithm suffers average loss that is guaranteed to come close to the minimum loss achievable by any fixed strategy. Our bounds are nonasymptotic and hold for any opponent. The algorithm, which uses the mult ..."
Abstract

Cited by 134 (14 self)
We present a simple algorithm for playing a repeated game. We show that a player using this algorithm suffers average loss that is guaranteed to come close to the minimum loss achievable by any fixed strategy. Our bounds are nonasymptotic and hold for any opponent. The algorithm, which uses the multiplicative-weight methods of Littlestone and Warmuth, is analyzed using the Kullback–Leibler divergence. This analysis yields a new, simple proof of the min–max theorem, as well as a provable method of approximately solving a game. A variant of our game-playing algorithm is proved to be optimal in a very strong sense.
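The multiplicative-weight update of Littlestone and Warmuth that the abstract refers to can be sketched as follows: each strategy's weight decays exponentially in its observed loss, and the normalized weights give the mixed strategy for the next round. The function name and the choice of `eta` are illustrative, not the paper's exact scheme.

```python
import math

def hedge_update(weights, losses, eta=0.5):
    """Multiplicative-weight (Hedge-style) update, a sketch.

    Each strategy's weight is multiplied by exp(-eta * loss), so
    strategies that performed badly lose influence exponentially.
    Returns the new weights and the induced mixed strategy.
    """
    new_w = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    total = sum(new_w)
    return new_w, [w / total for w in new_w]
```

For instance, with two strategies at equal weight and losses (0, 1), taking eta = ln 2 halves the second weight, so the next mixed strategy is (2/3, 1/3).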
Game Theory, Online Prediction and Boosting
 In Proceedings of the Ninth Annual Conference on Computational Learning Theory
, 1996
"... We study the close connections between game theory, online prediction and boosting. After a brief review of game theory, we describe an algorithm for learning to play repeated games based on the online prediction methods of Littlestone and Warmuth. The analysis of this algorithm yields a simple pr ..."
Abstract

Cited by 133 (13 self)
We study the close connections between game theory, online prediction and boosting. After a brief review of game theory, we describe an algorithm for learning to play repeated games based on the online prediction methods of Littlestone and Warmuth. The analysis of this algorithm yields a simple proof of von Neumann's famous min-max theorem, as well as a provable method of approximately solving a game. We then show that the online prediction model is obtained by applying this game-playing algorithm to an appropriate choice of game and that boosting is obtained by applying the same algorithm to the "dual" of this game.
Regret in the Online Decision Problem
, 1999
"... At each point in time a decision maker must choose a decision. The payoff in a period from the decision chosen depends on the decision as well as the state of the world that obtains at that time. The difficulty is that the decision must be made in advance of any knowledge, even probabilistic, about ..."
Abstract

Cited by 115 (2 self)
At each point in time a decision maker must choose a decision. The payoff in a period from the decision chosen depends on the decision as well as the state of the world that obtains at that time. The difficulty is that the decision must be made in advance of any knowledge, even probabilistic, about which state of the world will obtain. A range of problems from a variety of disciplines can be framed in this way. In this
AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents
, 2003
"... A satisfactory multiagent learning algorithm should, at a minimum, learn to play optimally against stationary opponents and converge to a Nash equilibrium in selfplay. The algorithm that has come closest, WoLFIGA, has been proven to have these two properties in 2player 2action repeated games— as ..."
Abstract

Cited by 81 (5 self)
A satisfactory multiagent learning algorithm should, at a minimum, learn to play optimally against stationary opponents and converge to a Nash equilibrium in self-play. The algorithm that has come closest, WoLF-IGA, has been proven to have these two properties in 2-player, 2-action repeated games, assuming that the opponent's (mixed) strategy is observable. In this paper we present AWESOME, the first algorithm that is guaranteed to have these two properties in all repeated (finite) games. It requires only that the other players' actual actions (not their strategies) can be observed at each step. It also learns to play optimally against opponents that eventually become stationary. The basic idea behind AWESOME (Adapt When Everybody is Stationary, Otherwise Move to Equilibrium) is to try to adapt to the others' strategies when they appear stationary, but otherwise to retreat to a precomputed equilibrium strategy. The techniques used to prove the properties of AWESOME are fundamentally different from those used for previous algorithms, and may help in analyzing other multiagent learning algorithms as well.
Autonomous vehicle-target assignment: a game theoretical formulation
 ASME Journal of Dynamic Systems, Measurement and Control
"... We consider an autonomous vehicletarget assignment problem where a group of vehicles are expected to optimally assign themselves to a set of targets. We introduce a gametheoretical formulation of the problem in which the vehicles are viewed as selfinterested decision makers. Thus, we seek the opti ..."
Abstract

Cited by 60 (22 self)
We consider an autonomous vehicle-target assignment problem where a group of vehicles are expected to optimally assign themselves to a set of targets. We introduce a game-theoretical formulation of the problem in which the vehicles are viewed as self-interested decision makers. Thus, we seek the optimization of a global utility function through autonomous vehicles that are capable of making individually rational decisions to optimize their own utility functions. The first important aspect of the problem is to choose the utility functions of the vehicles in such a way that the objectives of the vehicles are localized to each vehicle yet aligned with a global utility function. The second important aspect of the problem is to equip the vehicles with an appropriate negotiation mechanism by which each vehicle pursues the optimization of its own utility function. We present several design procedures and accompanying caveats for vehicle utility design. We present two new negotiation mechanisms, namely, “generalized regret monitoring with fading memory and inertia” and “selective spatial adaptive play,” and provide accompanying proofs of their convergence. Finally, we present simulations that illustrate how vehicle negotiations can consistently lead to near-optimal assignments provided that the utilities of the vehicles are designed appropriately. DOI: 10.1115/1.2766722
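The "fading memory" part of the regret-monitoring mechanism named above amounts to exponentially discounting past regret so that recent play matters more. The update below is a minimal sketch of that discounting step; the function name and the fading factor `rho` are illustrative, not the paper's notation.

```python
def fading_memory_regret(avg_regrets, inst_regrets, rho=0.1):
    """Fading-memory regret update (sketch).

    Blend the running regret average with the newly observed
    per-action regret: new = (1 - rho) * old + rho * instant.
    Larger rho forgets the past faster.
    """
    return [(1 - rho) * a + rho * r
            for a, r in zip(avg_regrets, inst_regrets)]
```

Repeatedly applying this update makes stale regret decay geometrically, which is what lets such a mechanism track opponents whose behavior drifts over time.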