## The nonstochastic multiarmed bandit problem (2002)

Venue: SIAM Journal on Computing

Citations: 328 (29 self)

### BibTeX

```bibtex
@ARTICLE{Auer02thenonstochastic,
  author  = {Peter Auer and Nicolò Cesa-Bianchi and Yoav Freund and Robert E. Schapire},
  title   = {The nonstochastic multiarmed bandit problem},
  journal = {SIAM Journal on Computing},
  year    = {2002},
  volume  = {32},
  number  = {1},
  pages   = {48--77}
}
```

### Abstract

In the multi-armed bandit problem, a gambler must decide which arm of K non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines. In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of T plays, we prove that the per-round payoff of our algorithm approaches that of the best arm at the rate O(T^{-1/2}). We show by a matching lower bound that this is best possible. We also prove that our algorithm approaches the per-round payoff of any set of strategies at a similar rate: if the best strategy is chosen from a pool of N strategies, then our algorithm approaches the per-round payoff of the strategy at the rate O((log N)^{1/2} T^{-1/2}). Finally, we apply our results to the problem of playing an unknown repeated matrix game. We show that our algorithm approaches the minimax payoff of the unknown game at the rate O(T^{-1/2}).
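The algorithm behind these bounds is Exp3 (exponential-weight algorithm for exploration and exploitation), which mixes an exponential-weights distribution with uniform exploration and feeds importance-weighted reward estimates back into the weights. The following is a minimal illustrative sketch, not the paper's exact pseudocode: the function names, the fixed mixing parameter `gamma`, and the toy reward function in the test are our own choices.

```python
import math
import random

def exp3(K, T, reward_fn, gamma=0.1):
    """Sketch of Exp3 for adversarial bandits with rewards in [0, 1].

    reward_fn(i, t) returns the (possibly adversarial) reward of arm i
    at round t; only the chosen arm's reward is ever observed.
    """
    weights = [1.0] * K
    total_reward = 0.0
    for t in range(T):
        total_w = sum(weights)
        # Mix the exponential-weights distribution with uniform exploration.
        probs = [(1 - gamma) * w / total_w + gamma / K for w in weights]
        arm = random.choices(range(K), weights=probs)[0]
        x = reward_fn(arm, t)          # only this arm's payoff is revealed
        total_reward += x
        x_hat = x / probs[arm]         # importance-weighted reward estimate
        weights[arm] *= math.exp(gamma * x_hat / K)
    return total_reward
```

With `gamma` tuned on the order of sqrt(K ln K / T) as in the paper, Exp3's expected weak regret against the best single arm is O(sqrt(KT ln K)), which is the O(T^{-1/2}) per-round rate quoted in the abstract.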

### Citations

9216 | Elements of Information Theory - Cover, Thomas - 1991
Citation Context: ...E_i[f(r)] ≤ E_unif[f(r)] + (M/2)·√(−E_unif[N_i] ln(1 − 4ε²)), where Σ_{t=1}^T r_t denotes the return of the algorithm. Proof. We apply standard methods that can be found, for instance, in Cover and Thomas [4]. For any distributions P and Q over reward sequences r ∈ {0,1}^T, let ‖P − Q‖₁ = Σ_r |P{r} − Q{r}| be the variational distance, and let KL(P ‖ Q) = Σ_r P{r} lg(P{r}/Q{r}) be the Kullback–Leibler divergence...

2489 | A decision-theoretic generalization of on-line learning and an application to boosting - Freund, Schapire - 1997
Citation Context: ...e an example of a simple Hannan-consistent player whose convergence rate is optimal up to logarithmic factors. Our player algorithms are based in part on an algorithm presented by Freund and Schapire [6, 7], which in turn is a variant of Littlestone and Warmuth’s [15] weighted majority algorithm and Vovk’s [18] aggregating strategies. In the setting analyzed by Freund and Schapire, the player scores on...

702 | The weighted majority algorithm - Littlestone, Warmuth - 1994
Citation Context: ...nce rate is optimal up to logarithmic factors. Our player algorithms are based in part on an algorithm presented by Freund and Schapire [6, 7], which in turn is a variant of Littlestone and Warmuth’s [15] weighted majority algorithm and Vovk’s [18] aggregating strategies. In the setting analyzed by Freund and Schapire, the player scores on each pull the reward of the chosen arm but gains access to the...

323 | How to use expert advice - Cesa-Bianchi, Freund, et al. - 1997
Citation Context: ...his best strategy and the actual gambler’s return. Using a randomized player that combines the choices of the N strategies (in the same vein as the algorithms for “prediction with expert advice” from [3]), we show that the expected regret for the best strategy is O(√(KT ln N)); see Theorem 7.1. Note that the dependence on the number of strategies is only logarithmic, and therefore the bound is quite r...

294 | Some aspects of the sequential design of experiments - Robbins - 1952
Citation Context: ...ersarial bandit problem, unknown matrix games. AMS subject classifications: 68Q32, 68T05, 91A20. PII S0097539701398375. 1. Introduction. In the multiarmed bandit problem, originally proposed by Robbins [17], a gambler must choose which of K slot machines to play. At each time step, he pulls the arm of one of the machines and receives a reward or payoff (possibly zero or negative). The gambler’s purpose...

263 | Aggregating strategies - Vovk - 1990
Citation Context: ...s. Our player algorithms are based in part on an algorithm presented by Freund and Schapire [6, 7], which in turn is a variant of Littlestone and Warmuth’s [15] weighted majority algorithm and Vovk’s [18] aggregating strategies. In the setting analyzed by Freund and Schapire, the player scores on each pull the reward of the chosen arm but gains access to the rewards associated with all of the arms (no...

227 | Multi-Armed Bandit Allocation Indices - Gittins - 1989
Citation Context: ...gret bounds become trivial when the hardness of the sequence (j1, ..., jT) we compete against gets too close to T. As a remark, note that a deterministic bandit problem was also considered by Gittins [9] and Ishikida and Varaiya [13]. However, their version of the bandit problem is very different from ours: they assume that the player can compute ahead of time exactly what payoffs will be received fr...

138 | Adaptive game playing using multiplicative weights - Freund, Schapire - 1999
Citation Context: ...e an example of a simple Hannan-consistent player whose convergence rate is optimal up to logarithmic factors. Our player algorithms are based in part on an algorithm presented by Freund and Schapire [6, 7], which in turn is a variant of Littlestone and Warmuth’s [15] weighted majority algorithm and Vovk’s [18] aggregating strategies. In the setting analyzed by Freund and Schapire, the player scores on...

116 | Regret in the on-line decision problem - Foster, Vohra - 1999
Citation Context: ...t framework) that the weak regret per time step of the player converges to 0 with probability 1. Examples of Hannan-consistent player strategies have been provided by several authors in the past (see [5] for a survey of these results). By applying (slight extensions of) Theorems 6.3 and 6.4, we can provide an example of a simple Hannan-consistent player whose convergence rate is optimal up to logarit...

112 | Consistency and Cautious Fictitious Play - Fudenberg, Levine - 1995
Citation Context: ...how much player i lost on average for not playing the pure strategy j on all rounds, given that all the other players kept their choices fixed. A desirable property for a player is Hannan-consistency [8], defined as follows. Player i is Hannan-consistent if lim sup_{T→∞} max_{j∈Si} R_i^{(j)}(T) = 0 with probability 1. The existence and properties of Hannan-consistent players have been first investigated...

108 | Approximation to Bayes risk in repeated plays (Contributions to the Theory of Games) - Hannan - 1957
Citation Context: ...s follows. Player i is Hannan-consistent if lim sup_{T→∞} max_{j∈Si} R_i^{(j)}(T) = 0 with probability 1. The existence and properties of Hannan-consistent players have been first investigated by Hannan [10] and Blackwell [2] and later by many others (see [5] for a nice survey). Hannan-consistency can be also studied in the so-called unknown game setup, where it is further assumed that (1) each player kn...

80 | A General Class of Adaptive Strategies - Hart, Mas-Colell
Citation Context: ...ch player sees its own payoffs but it sees neither the choices of the other players nor the resulting payoffs. This setup was previously studied by Baños [1], Megiddo [16], and by Hart and Mas-Colell [11, 12]. We can apply the results of section 6 to prove that a player using algorithm Exp3.P.1 as mixed strategy is Hannan-consistent in the unknown game setup whenever the payoffs obtained by the player bel...

69 | Discrete-Parameter Martingales - Neveu - 1975

54 | Controlled Random Walks - Blackwell - 1956
Citation Context: ...is Hannan-consistent if lim sup_{T→∞} max_{j∈Si} R_i^{(j)}(T) = 0 with probability 1. The existence and properties of Hannan-consistent players have been first investigated by Hannan [10] and Blackwell [2] and later by many others (see [5] for a nice survey). Hannan-consistency can be also studied in the so-called unknown game setup, where it is further assumed that (1) each player knows neither the to...

33 | A randomization rule for selecting forecasts - Foster, Vohra - 1993

26 | On repeated games with incomplete information played by non-Bayesian players - Megiddo - 1980
Citation Context: ...itself), (2) after each round each player sees its own payoffs but it sees neither the choices of the other players nor the resulting payoffs. This setup was previously studied by Baños [1], Megiddo [16], and by Hart and Mas-Colell [11, 12]. We can apply the results of section 6 to prove that a player using algorithm Exp3.P.1 as mixed strategy is Hannan-consistent in the unknown game setup whenever t...

20 | On pseudo-games - Baños - 1968
Citation Context: ...of possible actions, where each action is denoted by an integer 1 ≤ i ≤ K, and by an assignment of rewards, i.e., an infinite sequence x(1), x(2), ... of vectors x(t) = (x1(t), ..., xK(t)), where xi(t) ∈ [0, 1] denotes the reward obtained if action i is chosen at time step (also called “trial”) t. (Even though throughout the paper we will assume that all rewards belong to the [0, 1] interval, the generaliza...

13 | Multi-armed bandit problem revisited - Ishikida, Varaiya - 1994
Citation Context: ...en the hardness of the sequence (j1, ..., jT) we compete against gets too close to T. As a remark, note that a deterministic bandit problem was also considered by Gittins [9] and Ishikida and Varaiya [13]. However, their version of the bandit problem is very different from ours: they assume that the player can compute ahead of time exactly what payoffs will be received from each arm, and their problem...

11 | Asymptotically efficient adaptive allocation rules - Lai, Robbins - 1985
Citation Context: ...T)^{1+ε}) with probability 1 for any fixed ε > 0; see Corollary 6.5. Our worst-case bounds may appear weaker than the bounds proved using statistical assumptions, such as those shown by Lai and Robbins [14] of the form O(ln T). However, when comparing our results to those in the statistics literature, it is important to point out an important difference in the asymptotic quantification. In the work of...

1 | A simple procedure leading to correlated equilibrium - Hart, Mas-Colell
Citation Context: ...ch player sees its own payoffs but it sees neither the choices of the other players nor the resulting payoffs. This setup was previously studied by Baños [1], Megiddo [16], and by Hart and Mas-Colell [11, 12]. We can apply the results of section 6 to prove that a player using algorithm Exp3.P.1 as mixed strategy is Hannan-consistent in the unknown game setup whenever the payoffs obtained by the player bel...