Optimistic Bayesian sampling in contextual-bandit problems (2011)

by B. C. May, N. Korda, A. Lee, D. S. Leslie
Results 1 - 10 of 20

An Empirical Evaluation of Thompson Sampling

by Olivier Chapelle, Lihong Li
"... Thompson sampling is one of oldest heuristic to address the exploration / exploitation trade-off, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. And since this heur ..."
Abstract - Cited by 72 (6 self) - Add to MetaCart
Thompson sampling is one of the oldest heuristics to address the exploration/exploitation trade-off, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. Since this heuristic is also very easy to implement, we argue that it should be part of the standard baselines to compare against.
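
The ease of implementation the authors argue for is easy to see concretely: sample a mean-reward estimate from each arm's posterior and play the best draw. A minimal sketch for Bernoulli rewards under independent Beta(1, 1) priors (the pull callback standing in for the environment is hypothetical, not part of the paper):

    import random

    def thompson_sampling(pull, n_arms, horizon):
        # Successes and failures per arm; the posterior of arm i is
        # Beta(1 + s[i], 1 + f[i]) under a uniform Beta(1, 1) prior.
        s = [0] * n_arms
        f = [0] * n_arms
        for _ in range(horizon):
            # Draw one mean-reward sample per arm and play the argmax.
            draws = [random.betavariate(1 + s[i], 1 + f[i]) for i in range(n_arms)]
            arm = max(range(n_arms), key=lambda i: draws[i])
            if pull(arm):            # pull returns a 0/1 (or boolean) reward
                s[arm] += 1
            else:
                f[arm] += 1
        return s, f

Each round costs one Beta draw per arm, and the posterior update is just a success/failure counter, which is what makes the heuristic such a cheap baseline, e.g. thompson_sampling(lambda a: random.random() < (0.3, 0.5)[a], 2, 1000).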

Citation Context

...he reason why it is not very popular might be because of its lack of theoretical analysis. Only two papers have tried to provide such analysis, but they were only able to prove asymptotic convergence [6, 11]. In this work, we present some empirical results, first on a simulated problem and then on two real-world ones: display advertisement selection and news article recommendation. In all cases, despite i...

Thompson Sampling for Contextual Bandits with Linear Payoffs.

by Shipra Agrawal, Navin Goyal, 2013
"... Abstract Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the stateof-the-ar ..."
Abstract - Cited by 28 (3 self) - Add to MetaCart
Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the state-of-the-art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze a generalization of the Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studied versions of the contextual bandits problem. We provide the first theoretical guarantees for the contextual version of Thompson Sampling. We prove a high probability regret bound of Õ(d^(3/2)√T) (or Õ(d√(T log N))), which is the best regret bound achieved by any computationally efficient algorithm available for this problem in the current literature, and is within a factor of √d (or log(N)) of the information-theoretic lower bound for this problem.
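
The generalization analyzed here maintains a Gaussian posterior over a shared weight vector and samples from it once per round. A rough sketch in that spirit, not the paper's exact algorithm; the exploration scale v is left as a free parameter here, whereas the paper derives it from the problem parameters:

    import numpy as np

    class LinearTS:
        def __init__(self, d, v=1.0):
            self.B = np.eye(d)       # regularized Gram matrix of seen contexts
            self.f = np.zeros(d)     # sum of reward-weighted contexts
            self.v = v               # posterior scale; controls exploration

        def choose(self, contexts):
            # contexts: (n_arms, d) array, one feature vector per arm.
            mu = np.linalg.solve(self.B, self.f)          # posterior mean of theta
            cov = self.v ** 2 * np.linalg.inv(self.B)     # posterior covariance
            theta = np.random.multivariate_normal(mu, cov)
            return int(np.argmax(contexts @ theta))

        def update(self, x, reward):
            self.B += np.outer(x, x)
            self.f += reward * x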

Citation Context

...vertising and news article recommendation modeled by the contextual bandits problem, it is competitive with or better than other methods such as UCB. In their experiments, TS is also more robust to delayed or batched feedback than the other methods. TS has been used in an industrial-scale application for CTR prediction of search ads on search engines (Graepel et al., 2010). Kaufmann et al. (2012) do a thorough comparison of TS with the best known versions of UCB and show that TS has the lowest regret in the long run. However, the theoretical understanding of TS is limited. Granmo (2010) and May et al. (2012) provided weak guarantees, namely, a bound of o(T) on the expected regret in time T. For the basic (i.e. without contexts) version of the stochastic MAB problem, some significant progress was made by Agrawal & Goyal (2012), Kaufmann et al. (2012) and, more recently, by Agrawal & Goyal (2013b), who provided optimal bounds on the expected regret. But many questions regarding the theoretical analysis of TS remained open, including high probability regret bounds, and regret bounds for the more general contextual bandits setting. In particular, the contextual MAB problem does not seem eas...

Further Optimal Regret Bounds for Thompson Sampling. AISTATS.

by Shipra Agrawal, Navin Goyal, 2013
"... Abstract Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have comparable or better empirical performance compared to the ..."
Abstract - Cited by 23 (3 self) - Add to MetaCart
Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have comparable or better empirical performance compared to the state-of-the-art methods. In this paper, we provide a novel regret analysis for Thompson Sampling that proves the first near-optimal problem-independent bound of O(√(NT ln T)) on the expected regret of this algorithm. Our novel martingale-based analysis techniques are conceptually simple, and easily extend to distributions other than the Beta distribution. For the version of Thompson Sampling that uses Gaussian priors, we prove a problem-independent bound of O(√(NT ln N)) on the expected regret, and demonstrate the optimality of this bound by providing a matching lower bound. This lower bound of Ω(√(NT ln N)) is the first lower bound on the performance of a natural version of Thompson Sampling that is away from the general lower bound of Ω(√(NT)) for the multi-armed bandit problem. Our near-optimal problem-independent bounds for Thompson Sampling solve a COLT 2012 open problem of Chapelle and Li. Additionally, our techniques simultaneously provide the optimal problem-dependent bound of (1 + ε) Σ_i ln T / d(µ_i, µ_1) + O(N/ε²) on the expected regret. The optimal problem-dependent regret bound for this problem was first proven recently by Kaufmann et al. (2012).
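
The Gaussian-prior variant discussed in this abstract replaces Beta posteriors with Gaussians whose variance shrinks with the play count. A minimal sketch, assuming the same hypothetical pull interface as in the earlier Bernoulli sketch and arm i scored by a draw from N(mu_i, 1/(n_i + 1)):

    import random

    def gaussian_ts(pull, n_arms, horizon):
        n = [0] * n_arms        # number of plays per arm
        mu = [0.0] * n_arms     # empirical mean reward per arm
        for _ in range(horizon):
            # Arm i's score is drawn from N(mu[i], 1 / (n[i] + 1)).
            scores = [random.gauss(mu[i], (1.0 / (n[i] + 1)) ** 0.5)
                      for i in range(n_arms)]
            arm = max(range(n_arms), key=lambda i: scores[i])
            r = pull(arm)
            n[arm] += 1
            mu[arm] += (r - mu[arm]) / n[arm]   # incremental mean update
        return mu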

Learning to Optimize Via Posterior Sampling

by Daniel Russo, Benjamin Van Roy, 2013
"... This paper considers the use of a simple posterior sampling algorithm to balance between exploration and exploitation when learning to optimize actions such as in multi-armed bandit problems. The algorithm, also known as Thompson Sampling, offers significant advantages over the popular upper confide ..."
Abstract - Cited by 18 (8 self) - Add to MetaCart
This paper considers the use of a simple posterior sampling algorithm to balance between exploration and exploitation when learning to optimize actions, such as in multi-armed bandit problems. The algorithm, also known as Thompson Sampling, offers significant advantages over the popular upper confidence bound (UCB) approach, and can be applied to problems with finite or infinite action spaces and complicated relationships among action rewards. We make two theoretical contributions. The first establishes a connection between posterior sampling and UCB algorithms. This result lets us convert regret bounds developed for UCB algorithms into Bayes risk bounds for posterior sampling. Our second theoretical contribution is a Bayes risk bound for posterior sampling that applies broadly and can be specialized to many model classes. This bound depends on a new notion we refer to as the margin dimension, which measures the degree of dependence among action rewards. Compared to Bayes risk bounds for UCB algorithms on specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. Further, our analysis provides insight into the performance advantages of posterior sampling, which are highlighted through simulation results that demonstrate performance surpassing recently proposed UCB algorithms.

Citation Context

...st proposed almost eighty years ago, it has until recently received little attention in the literature on multi-armed bandits. While its asymptotic convergence has been established in some generality [20], not much else is known about its theoretical properties in the case of dependent arms, or even in the case of independent arms with general prior distributions. Our work provides some of the first t...

Exponential regret bounds for Gaussian process bandits with deterministic observations

by Nando De Freitas, Alex J. Smola, Masrour Zoghi - In ICML, 2012
"... This paper analyzes the problem of Gaussian process (GP) bandits with deterministic observations. The analysis uses a branch and bound algorithm that is related to the UCB algorithm of (Srinivas et al., 2010). For GPs with Gaussian observation noise, with variance strictly greater than zero, (Sriniv ..."
Abstract - Cited by 17 (9 self) - Add to MetaCart
This paper analyzes the problem of Gaussian process (GP) bandits with deterministic observations. The analysis uses a branch and bound algorithm that is related to the UCB algorithm of (Srinivas et al., 2010). For GPs with Gaussian observation noise, with variance strictly greater than zero, (Srinivas et al., 2010) proved that the regret vanishes at the approximate rate of O(1/√t), where t is the number of observations. To complement their result, we attack the deterministic case and attain a much faster exponential convergence rate. Under some regularity assumptions, we show that the regret decreases asymptotically according to O(e^(−τt/(ln t)^(d/4))) with high probability. Here, d is the dimension of the search space and τ is a constant that depends on the behaviour of the objective function near its global maximum.

Simulation studies in optimistic Bayesian sampling in contextual-bandit problems

by Benedict C. May, David S. Leslie, 2011
"... ..."
Abstract - Cited by 13 (0 self) - Add to MetaCart
Abstract not found

Citation Context

...ulation results for the OBS algorithm are very encouraging. In every case the OBS algorithm significantly outperformed the LTS algorithm in the short term, as hypothesised in the accompanying article [3]. IE methods can outperform OBS in the short term, however this is at the expense of a lack of convergence and performance that is highly sensitive to the significance choice. In the cases considered,...

Open Problem: Regret Bounds for Thompson Sampling

by Lihong Li, Olivier Chapelle, 2012
"... Contextual multi-armed bandits (Langford and Zhang, 2008) have received substantial interests in recent years due to their wide applications on the Internet, such as new recommendation and advertising. The fundamental challenge here is to balance exploration and exploitation so that the total payoff ..."
Abstract - Cited by 5 (0 self) - Add to MetaCart
Contextual multi-armed bandits (Langford and Zhang, 2008) have received substantial interest in recent years due to their wide applications on the Internet, such as news recommendation and advertising. The fundamental challenge here is to balance exploration and exploitation so that the total payoff collected by an algorithm approaches that of an optimal strategy. Exploration techniques like ε-greedy, UCB (upper confidence bound), and their many variants have been extensively studied. Interestingly, one of the oldest exploration heuristics, dating back to Thompson (1933), has not been popular in the literature until recently, when researchers started to realize its effectiveness in critical applications.
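
For contrast with the posterior-sampling sketches above, the UCB family mentioned here is deterministic: instead of a random posterior draw, each arm gets an explicit optimism bonus. A minimal UCB1-style sketch, again using the hypothetical pull interface:

    import math

    def ucb1(pull, n_arms, horizon):
        n = [0] * n_arms        # number of plays per arm
        mu = [0.0] * n_arms     # empirical mean reward per arm
        for t in range(horizon):
            if t < n_arms:
                arm = t          # initialization: play every arm once
            else:
                arm = max(range(n_arms),
                          key=lambda i: mu[i] + math.sqrt(2.0 * math.log(t) / n[i]))
            r = pull(arm)
            n[arm] += 1
            mu[arm] += (r - mu[arm]) / n[arm]
        return mu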

An information-theoretic analysis of Thompson sampling.

by Daniel Russo, Benjamin Van Roy, 2014
"... Abstract We provide an information-theoretic analysis of Thompson sampling that applies across a broad range of online optimization problems in which a decision-maker must learn from partial feedback. This analysis inherits the simplicity and elegance of information theory and leads to regret bound ..."
Abstract - Cited by 4 (1 self) - Add to MetaCart
We provide an information-theoretic analysis of Thompson sampling that applies across a broad range of online optimization problems in which a decision-maker must learn from partial feedback. This analysis inherits the simplicity and elegance of information theory and leads to regret bounds that scale with the entropy of the optimal-action distribution. This strengthens preexisting results and yields new insight into how information improves performance.

Citation Context

...in industry. This has prompted a surge of interest in providing theoretical guarantees for Thompson sampling. One of the first theoretical guarantees for Thompson sampling was provided by May et al. [22], but they showed only that the algorithm converges asymptotically to optimality. Agrawal and Goyal [2, 3], Kaufmann et al. [18] and Korda et al. [19] studied the classical multiarmed bandit probl...

Spectral Thompson Sampling

by Tomas Kocak, Michal Valko, Remi Munos, Shipra Agrawal, 2014
"... Thompson Sampling (TS) has surged a lot of interest due to its good empirical performance, in particular in the compu-tational advertising. Though successful, the tools for its per-formance analysis appeared only recently. In this paper, we describe and analyze SpectralTS algorithm for a bandit prob ..."
Abstract - Cited by 3 (2 self) - Add to MetaCart
Thompson Sampling (TS) has attracted a lot of interest due to its good empirical performance, in particular in computational advertising. Though successful, the tools for its performance analysis appeared only recently. In this paper, we describe and analyze the SpectralTS algorithm for a bandit problem where the payoffs of the choices are smooth given an underlying graph. In this setting, each choice is a node of a graph and the expected payoffs of neighboring nodes are assumed to be similar. Although the setting has applications both in recommender systems and advertising, traditional algorithms would scale poorly with the number of choices. For that purpose we consider an effective dimension d, which is small in real-world graphs. We deliver the analysis showing that the regret of SpectralTS scales as d√(T ln N) with high probability, where T is the time horizon and N is the number of choices. Since a d√(T ln N) regret is comparable to the known results, SpectralTS offers a computationally more efficient alternative. We also show that our algorithm is competitive on both synthetic and real-world data.
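
The setting can be sketched as linear Thompson Sampling in the Laplacian eigenbasis: arm i's feature vector is row i of the first d eigenvectors of the graph Laplacian, so payoff functions that are smooth over the graph have small-norm weight vectors. An illustrative sketch of the feature construction only, assuming a symmetric adjacency matrix; it is not the paper's full algorithm:

    import numpy as np

    def spectral_arm_features(adjacency, d):
        # Unnormalized graph Laplacian L = D - A.
        degrees = adjacency.sum(axis=1)
        laplacian = np.diag(degrees) - adjacency
        # Eigenvectors of L sorted by increasing eigenvalue; the first d
        # are the "smoothest" functions on the graph.
        _, eigvecs = np.linalg.eigh(laplacian)
        return eigvecs[:, :d]    # row i = d-dimensional feature vector of arm i

These rows could then be fed, one per arm, to a linear Thompson Sampling routine such as the LinearTS sketch above.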

Generalized Thompson sampling for contextual bandits. arXiv preprint arXiv:1310.7163

by Lihong Li, 2013
"... Abstract Thompson Sampling, one of the oldest heuristics for solving multi-armed bandits, has recently been shown to demonstrate state-of-the-art performance. The empirical success has led to great interests in theoretical understanding of this heuristic. In this paper, we approach this problem in ..."
Abstract - Cited by 3 (1 self) - Add to MetaCart
Thompson Sampling, one of the oldest heuristics for solving multi-armed bandits, has recently been shown to demonstrate state-of-the-art performance. The empirical success has led to great interest in theoretical understanding of this heuristic. In this paper, we approach this problem in a way very different from existing efforts. In particular, motivated by the connection between Thompson Sampling and exponentiated updates, we propose a new family of algorithms called Generalized Thompson Sampling in the expert-learning framework, which includes Thompson Sampling as a special case. Similar to most expert-learning algorithms, Generalized Thompson Sampling uses a loss function to adjust the experts' weights. General regret bounds are derived, which are also instantiated for two important loss functions: square loss and logarithmic loss. In contrast to existing bounds, our results apply to quite general contextual bandits. More importantly, they quantify the effect of the "prior" distribution on the regret bounds.
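
The exponentiated-update view can be made concrete in a few lines: keep one weight per expert, draw an expert in proportion to its weight (the Thompson-style randomization), play its recommended arm, then reweight every expert multiplicatively by its loss on the observed reward. A sketch of one round, with square loss and the learning rate eta as assumed instantiations rather than the paper's only choices:

    import math
    import random

    def generalized_ts_round(weights, estimates, pull, eta=1.0):
        # estimates[i][a]: expert i's predicted reward probability for arm a.
        expert = random.choices(range(len(weights)), weights=weights)[0]
        arm = max(range(len(estimates[expert])),
                  key=lambda a: estimates[expert][a])   # follow the drawn expert
        reward = pull(arm)                              # observed 0/1 reward
        # Exponentiated update: every expert is scored on the played arm.
        for i in range(len(weights)):
            loss = (estimates[i][arm] - reward) ** 2    # square loss
            weights[i] *= math.exp(-eta * loss)
        return weights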

Citation Context

...g unpopular for decades, this algorithm was recently shown to be state-of-the-art in empirical studies, and has found success in important applications like news recommendation and online advertising [16, 10, 7, 14]. In addition, it has other advantages such as robustness to observation delay [7] and simplicity in implementation, compared to the dominant strategies based on upper confidence bounds (UCB). Despite...
