The multi-armed bandit problem with covariates. arXiv preprint arXiv:1110.6084, 2011

by V. Perchet, P. Rigollet
Results 1 - 7 of 7

Bounded regret in stochastic multi-armed bandits

by Sébastien Bubeck, Vianney Perchet, et al. - JMLR: Workshop and Conference Proceedings (2013), 1–13
"... We study the stochastic multi-armed bandit problem when one knows the valueµ (⋆) of an optimal arm, as a well as a positive lower bound on the smallest positive gap∆. We propose a new randomized policy that attains a regret uniformly bounded over time in this setting. We also prove several lower bou ..."
Abstract - Cited by 5 (2 self) - Add to MetaCart
We study the stochastic multi-armed bandit problem when one knows the value μ* of an optimal arm, as well as a positive lower bound on the smallest positive gap Δ. We propose a new randomized policy that attains a regret uniformly bounded over time in this setting. We also prove several lower bounds, which show in particular that bounded regret is not possible if one only knows Δ, and bounded regret of order 1/Δ is not possible if one only knows μ*.

The best of both worlds: Stochastic and adversarial bandits.

by Sébastien Bubeck, Aleksandrs Slivkins - In COLT, 2012
"... Abstract We present a new bandit algorithm, SAO (Stochastic and Adversarial Optimal) whose regret is (essentially) optimal both for adversarial rewards and for stochastic rewards. Specifically, SAO combines the O( √ n) worst-case regret of Exp3 ..."
Abstract - Cited by 4 (0 self) - Add to MetaCart
We present a new bandit algorithm, SAO (Stochastic and Adversarial Optimal), whose regret is (essentially) optimal both for adversarial rewards and for stochastic rewards. Specifically, SAO combines the O(√n) worst-case regret of Exp3 ...

Citation Context

...include additional information and/or assumptions about rewards. Most relevant to this paper are algorithms UCB1 (Auer et al., 2002a) and Exp3 (Auer et al., 2002b). UCB1 has a slightly more refined regret bound than the one that we cited earlier: Rn = O(∑_{i: μi < μ*} (log n)/(μ* − μi)) with high probability. A matching lower bound (up to the considerations of the variance and constant factors) is proved in Lai and Robbins (1985). Several recent papers (Auer and Ortner, 2010; Honda and Takemura, 2010; Audibert et al., 2009; Audibert and Bubeck, 2010; Maillard and Munos, 2011; Garivier and Cappé, 2011; Perchet and Rigollet, 2011) improve over UCB1, obtaining algorithms with regret bounds that are even closer to the lower bound. The regret bound for Exp3 is E[Rn] = O(√(nK log K)), and a version of Exp3 achieves this with high probability (Auer et al., 2002b). There is a nearly matching lower bound of Ω(√(Kn)). Recently Audibert and Bubeck (2010) have shaved off the log K factor, achieving an algorithm with regret O(√(Kn)) in the adversarial model against an oblivious adversary. High-level ideas. For clarity, let us consider the simplified algorithm for the special case of two arms and oblivious adversary. The algorithm s...
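
For a rough illustration of the UCB1 index policy whose regret bound is quoted above, a minimal Python sketch follows; the pull_arm callback, the Bernoulli reward simulation, and all names are illustrative assumptions, not code from any of the cited papers.

    import math
    import random

    def ucb1(pull_arm, n_arms, horizon):
        """Minimal UCB1 sketch: after playing each arm once, repeatedly play the
        arm maximizing empirical mean + sqrt(2 ln t / n_i). The callback
        pull_arm(i) is an assumed interface returning a reward in [0, 1]."""
        counts = [0] * n_arms
        sums = [0.0] * n_arms
        for i in range(n_arms):  # initialization: pull every arm once
            sums[i] += pull_arm(i)
            counts[i] += 1
        for t in range(n_arms, horizon):
            ucb = [sums[i] / counts[i] + math.sqrt(2 * math.log(t + 1) / counts[i])
                   for i in range(n_arms)]
            best = max(range(n_arms), key=lambda i: ucb[i])
            sums[best] += pull_arm(best)
            counts[best] += 1
        return [s / c for s, c in zip(sums, counts)]

    # Illustrative usage with Bernoulli arms (the means below are made up).
    means = [0.3, 0.5, 0.7]
    estimates = ucb1(lambda i: float(random.random() < means[i]), len(means), 10_000)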

Clustered Bandits

by Loc Bui, Ramesh Johari, Shie Mannor, 2012
"... ..."
Abstract - Add to MetaCart
Abstract not found

Online Learning and Game Theory: A Quick Overview with Recent Results and Applications

by Mathieu Faure, Pierre Gaillard, Bruno Gaujal, Vianney Perchet, 2015
"... We study one of the main concept of online learning and sequential decision problem known as regret minimization. We investigate three different frameworks, whether data are generated accordingly to some i.i.d. process, or when no assumption whatsoever are made on their generation and, finally, whe ..."
Abstract - Add to MetaCart
We study one of the main concepts of online learning and sequential decision problems, known as regret minimization. We investigate three different frameworks: when data are generated according to some i.i.d. process, when no assumptions whatsoever are made on their generation, and, finally, when they are the consequence of sequential interactions between players. The overall objective is to provide a comprehensive introduction to this domain. In each of these main setups, we define classical algorithms and analyze their performance. Finally, we also show that some concepts of equilibria that emerged in game theory are learnable by players using online learning schemes, while some other concepts are not learnable.

The Best of Both Worlds: Stochastic and Adversarial Bandits (25th Annual Conference on Learning Theory)

by Sébastien Bubeck, Aleksandrs Slivkins, Shie Mannor, Nathan Srebro, Robert C. Williamson
"... We present a new bandit algorithm, SAO (Stochastic and Adversarial Optimal) whose regret is (essentially) optimal both for adversarial rewards and for stochastic rewards. Specifically, SAO combines the O ( √ n) worst-case regret of Exp3 (Auer et al., 2002b) and the (poly)logarithmic regret of UCB1 ..."
Abstract - Add to MetaCart
We present a new bandit algorithm, SAO (Stochastic and Adversarial Optimal), whose regret is (essentially) optimal both for adversarial rewards and for stochastic rewards. Specifically, SAO combines the O(√n) worst-case regret of Exp3 (Auer et al., 2002b) and the (poly)logarithmic regret of UCB1 (Auer et al., 2002a) for stochastic rewards. Adversarial rewards and stochastic rewards are the two main settings in the literature on multi-armed bandits (MAB). Prior work on MAB treats them separately, and does not attempt to jointly optimize for both. This result falls into the general agenda to design algorithms that combine the optimal worst-case performance with improved guarantees for “nice” problem instances.

Citation Context

...n Lai and Robbins (1985). Several recent papers (Auer and Ortner, 2010; Honda and Takemura, 2010; Audibert et al., 2009; Audibert and Bubeck, 2010; Maillard and Munos, 2011; Garivier and Cappé, 2011; Perchet and Rigollet, 2011) improve over UCB1, obtaining algorithms with regret bounds that are even closer to the lower bound. The regret bound for Exp3 is E[Rn] = O(√(nK log K)), and a version of Exp3 achieves this with high...

Stochastic Optimization

by Lauren A. Hannah, 2014
"... ..."
Abstract - Add to MetaCart
Abstract not found

Citation Context

...alue of the arms as a function of the side information using regression methods like linear combinations of basis functions (Li et al. 2010), discretization of the state space (Rigollet & Zeevi 2010, Perchet & Rigollet 2013), random histograms (Yang & Zhu 2002), nearest neighbors (Yang & Zhu 2002) or adaptive partitioning (Slivkins 2011). While all bandit problems are broadly applicable to many e-commerce problems like a...
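
The discretization approach mentioned in this context (binning the covariate space and running an independent index policy in each bin, in the spirit of Rigollet & Zeevi 2010 and Perchet & Rigollet 2013) can be sketched as follows; the class name, the use of a UCB index within each bin, and the simulated rewards are illustrative assumptions, not the authors' implementation.

    import math
    import random
    from collections import defaultdict

    class BinnedUCB:
        """Sketch of a bandit with covariates via discretization: a covariate
        x in [0, 1] is mapped to one of n_bins intervals, and each bin keeps
        its own UCB statistics over the arms."""

        def __init__(self, n_arms, n_bins):
            self.n_arms, self.n_bins = n_arms, n_bins
            self.counts = defaultdict(lambda: [0] * n_arms)
            self.sums = defaultdict(lambda: [0.0] * n_arms)
            self.pulls = defaultdict(int)

        def select(self, x):
            b = min(int(x * self.n_bins), self.n_bins - 1)  # bin of covariate x
            self.pulls[b] += 1
            for i in range(self.n_arms):  # play each arm once per bin first
                if self.counts[b][i] == 0:
                    return b, i
            ucb = [self.sums[b][i] / self.counts[b][i]
                   + math.sqrt(2 * math.log(self.pulls[b]) / self.counts[b][i])
                   for i in range(self.n_arms)]
            return b, max(range(self.n_arms), key=lambda i: ucb[i])

        def update(self, b, arm, reward):
            self.counts[b][arm] += 1
            self.sums[b][arm] += reward

    # Illustrative usage: each arm's Bernoulli mean depends on the covariate x
    # through made-up functions.
    policy = BinnedUCB(n_arms=2, n_bins=8)
    for _ in range(5000):
        x = random.random()
        b, arm = policy.select(x)
        reward = float(random.random() < (x if arm == 0 else 1.0 - x))
        policy.update(b, arm, reward)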

The best of both worlds: stochastic and adversarial bandits

by Sébastien Bubeck, Aleksandrs Slivkins, 2012
"... ar ..."
Abstract - Add to MetaCart
Abstract not found

Citation Context

... Lai and Robbins [1985]. Several recent papers [Auer and Ortner, 2010, Honda and Takemura, 2010, Audibert et al., 2009, Audibert and Bubeck, 2010, Maillard and Munos, 2011, Garivier and Cappé, 2011, Perchet and Rigollet, 2011] improve over UCB1, obtaining algorithms with regret bounds that are even closer to the lower bound. The regret bound for Exp3 is E[Rn] = O(√(nK log K)), and a version of Exp3 achieves this with high ...
