Results 1–10 of 23
The KL-UCB algorithm for bounded stochastic bandits and beyond
 In Proceedings of COLT, 2011
Abstract

Cited by 56 (4 self)
This paper presents a finite-time analysis of the KL-UCB algorithm, an online, horizon-free index policy for stochastic bandit problems. We prove two distinct results: first, for arbitrary bounded rewards, the KL-UCB algorithm satisfies a uniformly better regret bound than UCB and its variants; second, in the special case of Bernoulli rewards, it reaches the lower bound of Lai and Robbins. Furthermore, we show that simple adaptations of the KL-UCB algorithm are also optimal for specific classes of (possibly unbounded) rewards, including those generated from exponential families of distributions. A large-scale numerical study comparing KL-UCB with its main competitors (UCB, MOSS, UCB-Tuned, UCB-V, DMED) shows that KL-UCB is remarkably efficient and stable, including for short time horizons. KL-UCB is also the only method that always performs better than the basic UCB policy. Our regret bounds rely on deviation results of independent interest which are stated and proved in the Appendix. As a by-product, we also obtain an improved regret bound for the standard UCB algorithm.
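The index described in this abstract can be sketched for Bernoulli rewards: each arm's index is the largest mean compatible with the empirical mean within a KL ball of radius log(t)/pulls. The snippet below is a minimal illustration, not the authors' reference implementation; in particular, the exploration level `log(t)` omits the lower-order refinements discussed in the paper.

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped for stability."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def klucb_index(mean, pulls, t, precision=1e-6):
    """Largest q >= mean with pulls * kl(mean, q) <= log(t), found by bisection.
    kl(mean, .) is increasing on [mean, 1], so bisection applies."""
    level = math.log(t) / pulls
    lo, hi = mean, 1.0
    while hi - lo > precision:
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo
```

At each round the policy plays the arm with the largest index; as an arm accumulates pulls, its index shrinks back toward its empirical mean.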
Thompson sampling: An asymptotically optimal finite-time analysis
 In Algorithmic Learning Theory
Abstract

Cited by 37 (6 self)
The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem had been open since 1933. In this paper we answer it positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret. The proof is accompanied by a numerical comparison with other optimal policies, experiments that have been lacking in the literature until now for the Bernoulli case.
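The Bernoulli case analyzed in this paper admits a very short implementation: maintain a Beta posterior per arm, sample once from each posterior, and play the argmax. The sketch below assumes uniform Beta(1,1) priors; the arm means and horizon are illustrative, not taken from the paper's experiments.

```python
import random

def thompson_bernoulli(true_means, horizon, seed=0):
    """Thompson Sampling for Bernoulli bandits with Beta(1,1) priors.
    Returns the total reward collected over the horizon."""
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    total = 0
    for _ in range(horizon):
        # Draw one sample from each arm's Beta posterior and play the argmax.
        samples = [rng.betavariate(1 + successes[a], 1 + failures[a])
                   for a in range(k)]
        arm = max(range(k), key=lambda a: samples[a])
        # Observe a Bernoulli reward and update the played arm's posterior.
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        total += reward
    return total
```

With well-separated arms the posterior of the suboptimal arm concentrates quickly, so the policy samples it only rarely, which is the behavior the finite-time analysis quantifies.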
Supplement to “Kullback-Leibler upper confidence bounds for optimal sequential allocation.” DOI: 10.1214/13-AOS1119SUPP, 2013
Abstract

Cited by 31 (5 self)
We consider optimal sequential allocation in the context of the so-called stochastic multi-armed bandit model. We describe a generic index policy, in ...
On Bayesian upper confidence bounds for bandit problems
 In AISTATS, 2012
Abstract

Cited by 25 (4 self)
Stochastic bandit problems have been analyzed from two different perspectives: a frequentist view, where the parameter is a deterministic unknown quantity, and a Bayesian approach, where the parameter is drawn from a prior distribution. We show in this paper that methods derived from this second perspective prove optimal when evaluated using the frequentist cumulated regret as a measure of performance. We give a general formulation for a class of Bayesian index policies that rely on quantiles of the posterior distribution. For binary bandits, we prove that the corresponding algorithm, termed Bayes-UCB, satisfies finite-time regret bounds that imply its asymptotic optimality. More generally, Bayes-UCB appears as a unifying framework for several variants of the UCB algorithm addressing different bandit problems (parametric multi-armed bandits, Gaussian bandits with unknown mean and variance, linear bandits). But the generality of the Bayesian approach makes it possible to address more challenging models. In particular, we show how to handle linear bandits with sparsity constraints by resorting to Gibbs sampling.
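For binary bandits the index this abstract describes is a posterior quantile: with a Beta(1 + successes, 1 + failures) posterior, the arm's index is (roughly) its 1 − 1/t quantile. The Python standard library has no Beta quantile function, so the sketch below approximates it by Monte Carlo; this is an illustration of the idea, not the paper's implementation, and the sample size `n` is an arbitrary choice.

```python
import random

def beta_quantile_mc(a, b, q, n=4000, seed=0):
    """Monte-Carlo approximation of the q-quantile of Beta(a, b)."""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(a, b) for _ in range(n))
    return draws[min(int(q * n), n - 1)]

def bayes_ucb_index(successes, pulls, t):
    """Bayes-UCB index for a Bernoulli arm: the 1 - 1/t quantile of the
    Beta(1 + successes, 1 + failures) posterior (Beta(1,1) prior assumed)."""
    failures = pulls - successes
    return beta_quantile_mc(1 + successes, 1 + failures, 1 - 1.0 / t)
```

In practice one would use an exact Beta quantile (e.g. `scipy.stats.beta.ppf`) instead of sampling; the Monte-Carlo version is only meant to keep the sketch dependency-free.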
A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. 2011. URL http://hal.archives-ouvertes.fr/inria-00574987
Abstract

Cited by 17 (3 self)
We consider a Kullback-Leibler-based algorithm for the stochastic multi-armed bandit problem in the case of distributions with finite supports (not necessarily known beforehand), whose asymptotic regret matches the lower bound of Burnetas and Katehakis (1996). Our contribution is to provide a finite-time analysis of this algorithm; we get bounds whose main terms are smaller than those of previously known algorithms with finite-time analyses (like UCB-type algorithms).
Thompson Sampling for 1-Dimensional Exponential Family Bandits
 In Neural Information Processing Systems, 2013
Robust Risk-Averse Stochastic Multi-Armed Bandits, 2013
Abstract

Cited by 3 (0 self)
We study a variant of the standard stochastic multi-armed bandit problem in which one is not interested in the arm with the best mean, but instead in the arm maximising some coherent risk measure criterion. Further, we study the deviations of the regret instead of the less informative expected regret. We provide an algorithm, called RA-UCB, to solve this problem, together with a high-probability bound on its regret.
The Best of Both Worlds: Stochastic and Adversarial Bandits
 In 25th Annual Conference on Learning Theory
Abstract
We present a new bandit algorithm, SAO (Stochastic and Adversarial Optimal), whose regret is (essentially) optimal both for adversarial rewards and for stochastic rewards. Specifically, SAO combines the O(√n) worst-case regret of Exp3 (Auer et al., 2002b) and the (poly)logarithmic regret of UCB1 (Auer et al., 2002a) for stochastic rewards. Adversarial rewards and stochastic rewards are the two main settings in the literature on multi-armed bandits (MAB). Prior work on MAB treats them separately and does not attempt to jointly optimize for both. This result falls into the general agenda of designing algorithms that combine optimal worst-case performance with improved guarantees for “nice” problem instances.
Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation
 Submitted to the Annals of Statistics
By Olivier Cappé, Aurélien Garivier, Odalric-Ambrym
Abstract
We consider optimal sequential allocation in the context of the so-called stochastic multi-armed bandit model. We describe a generic index policy, in the sense of Gittins (1979), based on upper confidence bounds of the arm payoffs computed using the Kullback-Leibler divergence. We consider two classes of distributions for which instances of this general idea are analyzed: the kl-UCB algorithm is designed for one-parameter exponential families and the empirical KL-UCB algorithm for bounded and finitely supported distributions. Our main contribution is a unified finite-time analysis of the regret of these algorithms that asymptotically matches the lower bounds of Lai and Robbins (1985) and Burnetas and Katehakis (1996), respectively. We also investigate the behavior of these algorithms when used with general bounded rewards, showing in particular that they provide significant improvements over the state-of-the-art.