Results 1 - 10 of 20
An Empirical Evaluation of Thompson Sampling
"... Thompson sampling is one of oldest heuristic to address the exploration / exploitation trade-off, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. And since this heur ..."
Abstract
-
Cited by 72 (6 self)
- Add to MetaCart
(Show Context)
Thompson sampling is one of the oldest heuristics for addressing the exploration/exploitation trade-off, but it is surprisingly unpopular in the literature. We present empirical results using Thompson sampling on simulated and real data and show that it is highly competitive. Since this heuristic is very easy to implement, we argue that it should be part of the standard set of baselines to compare against.
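To make the "very easy to implement" claim concrete, here is a minimal sketch of Thompson sampling for a Bernoulli bandit with independent Beta(1, 1) priors. The arm means in the example are made up for illustration and are not data from the paper.

    import numpy as np

    def thompson_bernoulli(true_means, horizon, seed=0):
        """Minimal Beta-Bernoulli Thompson sampling on a simulated bandit."""
        rng = np.random.default_rng(seed)
        k = len(true_means)
        successes = np.ones(k)   # Beta(1, 1) prior pseudo-counts
        failures = np.ones(k)
        total_reward = 0.0
        for _ in range(horizon):
            theta = rng.beta(successes, failures)   # sample a mean for each arm
            arm = int(np.argmax(theta))             # play the arm that looks best under the sample
            reward = float(rng.random() < true_means[arm])
            successes[arm] += reward                # conjugate posterior update
            failures[arm] += 1.0 - reward
            total_reward += reward
        return total_reward

    # Hypothetical example: three arms with unknown click-through rates.
    print(thompson_bernoulli([0.05, 0.04, 0.02], horizon=10_000))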
Thompson Sampling for Contextual Bandits with Linear Payoffs, 2013
"... Abstract Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the stateof-the-ar ..."
Abstract
-
Cited by 28 (3 self)
- Add to MetaCart
(Show Context)
Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance than state-of-the-art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze a generalization of the Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studied versions of the contextual bandits problem. We provide the first theoretical guarantees for the contextual version of Thompson Sampling. We prove a high-probability regret bound of Õ(d^{3/2}√T) (or Õ(d√(T log N))), which is the best regret bound achieved by any computationally efficient algorithm available for this problem in the current literature, and is within a factor of √d (or log N) of the information-theoretic lower bound for this problem.
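As a rough illustration of the linear-payoff setting (not the paper's exact algorithm or constants), a Thompson-sampling round can maintain a ridge-regression estimate and sample the parameter from a Gaussian centered at it; the scale v and the helper names below are placeholders of my own.

    import numpy as np

    def lin_ts_round(B, f, contexts, v, rng):
        """One round of Thompson sampling with linear payoffs (sketch).
        B: d x d design matrix (initialized to the identity);
        f: d-vector of reward-weighted contexts;
        contexts: one d-dimensional feature vector per arm for this round."""
        mu_hat = np.linalg.solve(B, f)                      # regularized least-squares estimate
        cov = v ** 2 * np.linalg.inv(B)
        theta_tilde = rng.multivariate_normal(mu_hat, cov)  # sample from the Gaussian "posterior"
        return int(np.argmax(contexts @ theta_tilde))       # play the best arm under the sample

    def lin_ts_update(B, f, x, reward):
        """Rank-one update after observing the chosen context x and its reward."""
        B += np.outer(x, x)
        f += reward * x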
Further Optimal Regret Bounds for Thompson Sampling. AISTATS, 2013
"... Abstract Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have comparable or better empirical performance compared to the ..."
Abstract
-
Cited by 23 (3 self)
- Add to MetaCart
Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have comparable or better empirical performance than the state-of-the-art methods. In this paper, we provide a novel regret analysis for Thompson Sampling that proves the first near-optimal problem-independent bound of O(√(NT ln T)) on the expected regret of this algorithm. Our novel martingale-based analysis techniques are conceptually simple, and easily extend to distributions other than the Beta distribution. For the version of Thompson Sampling that uses Gaussian priors, we prove a problem-independent bound of O(√(NT ln N)) on the expected regret, and demonstrate the optimality of this bound by providing a matching lower bound. This lower bound of Ω(√(NT ln N)) is the first lower bound on the performance of a natural version of Thompson Sampling that is away from the general lower bound of Ω(√(NT)) for the multi-armed bandit problem. Our near-optimal problem-independent bounds for Thompson Sampling solve a COLT 2012 open problem of Chapelle and Li. Additionally, our techniques simultaneously provide the optimal problem-dependent bound of (1 + ε) Σ_i ln T / d(µ_i, µ_1) + O(N/ε²) on the expected regret; this optimal problem-dependent regret bound was first proven recently by Kaufmann et al. (2012).
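The Gaussian-prior variant referred to above can be sketched roughly as follows: each arm's mean is sampled from a normal distribution centered near its empirical average, with variance shrinking in the number of plays. The exact centering and variance here follow my reading of that variant and should be treated as an approximation.

    import numpy as np

    def gaussian_ts_arm(reward_sums, play_counts, rng):
        """One step of Thompson sampling with Gaussian 'posteriors' (sketch)."""
        means = reward_sums / (play_counts + 1.0)
        theta = rng.normal(means, 1.0 / np.sqrt(play_counts + 1.0))  # wider for rarely played arms
        return int(np.argmax(theta))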
Learning to Optimize Via Posterior Sampling, 2013
"... This paper considers the use of a simple posterior sampling algorithm to balance between exploration and exploitation when learning to optimize actions such as in multi-armed bandit problems. The algorithm, also known as Thompson Sampling, offers significant advantages over the popular upper confide ..."
Abstract
-
Cited by 18 (8 self)
- Add to MetaCart
(Show Context)
This paper considers the use of a simple posterior sampling algorithm to balance between exploration and exploitation when learning to optimize actions, such as in multi-armed bandit problems. The algorithm, also known as Thompson Sampling, offers significant advantages over the popular upper confidence bound (UCB) approach, and can be applied to problems with finite or infinite action spaces and complicated relationships among action rewards. We make two theoretical contributions. The first establishes a connection between posterior sampling and UCB algorithms. This result lets us convert regret bounds developed for UCB algorithms into Bayes risk bounds for posterior sampling. Our second theoretical contribution is a Bayes risk bound for posterior sampling that applies broadly and can be specialized to many model classes. This bound depends on a new notion we refer to as the margin dimension, which measures the degree of dependence among action rewards. Compared to Bayes risk bounds derived from UCB algorithms for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. Further, our analysis provides insight into the performance advantages of posterior sampling, which are highlighted through simulation results that demonstrate performance surpassing recently proposed UCB algorithms.
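For contrast with the posterior-sampling loops sketched above, the UCB family mentioned here picks the arm whose empirical mean plus a confidence width is largest. The snippet below is the classic UCB1 index, shown only as a generic representative rather than one of the specific algorithms analyzed in this paper; it assumes every arm has already been played at least once.

    import numpy as np

    def ucb1_arm(reward_sums, play_counts, t):
        """Classic UCB1 rule: empirical mean plus an exploration bonus
        that shrinks as an arm is played more often."""
        means = reward_sums / play_counts
        bonus = np.sqrt(2.0 * np.log(t) / play_counts)
        return int(np.argmax(means + bonus))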
Exponential regret bounds for Gaussian process bandits with deterministic observations. In ICML, 2012
"... This paper analyzes the problem of Gaussian process (GP) bandits with deterministic observations. The analysis uses a branch and bound algorithm that is related to the UCB algorithm of (Srinivas et al., 2010). For GPs with Gaussian observation noise, with variance strictly greater than zero, (Sriniv ..."
Abstract
-
Cited by 17 (9 self)
- Add to MetaCart
This paper analyzes the problem of Gaussian process (GP) bandits with deterministic observations. The analysis uses a branch and bound algorithm that is related to the UCB algorithm of Srinivas et al. (2010). For GPs with Gaussian observation noise, with variance strictly greater than zero, Srinivas et al. (2010) proved that the regret vanishes at the approximate rate of O(1/√t), where t is the number of observations. To complement their result, we attack the deterministic case and attain a much faster exponential convergence rate. Under some regularity assumptions, we show that the regret decreases asymptotically according to O(e^(−τt/(ln t)^(d/4))) with high probability. Here, d is the dimension of the search space and τ is a constant that depends on the behaviour of the objective function near its global maximum.
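The deterministic-observation setting means the GP posterior collapses onto the observed values exactly; a minimal numpy sketch of that posterior (with an assumed squared-exponential kernel and a tiny jitter for numerical stability, not the paper's branch-and-bound procedure) looks like this:

    import numpy as np

    def rbf(a, b, length_scale=0.2):
        """Squared-exponential kernel between two sets of 1-D points."""
        d = a[:, None] - b[None, :]
        return np.exp(-0.5 * (d / length_scale) ** 2)

    def gp_posterior(x_obs, y_obs, x_query, jitter=1e-10):
        """GP posterior mean and variance under (effectively) noise-free observations."""
        K = rbf(x_obs, x_obs) + jitter * np.eye(len(x_obs))
        Ks = rbf(x_obs, x_query)
        mean = Ks.T @ np.linalg.solve(K, y_obs)
        var = 1.0 - np.einsum('ij,ij->j', Ks, np.linalg.solve(K, Ks))
        return mean, var   # variance is essentially zero at the observed points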
Simulation studies in optimistic Bayesian sampling in contextual-bandit problems, 2011
"... ..."
(Show Context)
Open Problem: Regret Bounds for Thompson Sampling, 2012
"... Contextual multi-armed bandits (Langford and Zhang, 2008) have received substantial interests in recent years due to their wide applications on the Internet, such as new recommendation and advertising. The fundamental challenge here is to balance exploration and exploitation so that the total payoff ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
Contextual multi-armed bandits (Langford and Zhang, 2008) have received substantial interest in recent years due to their wide applications on the Internet, such as news recommendation and advertising. The fundamental challenge here is to balance exploration and exploitation so that the total payoff collected by an algorithm approaches that of an optimal strategy. Exploration techniques like ε-greedy, UCB (upper confidence bound), and their many variants have been extensively studied. Interestingly, one of the oldest exploration heuristics, dating back to Thompson (1933), did not become popular in the literature until recently, when researchers started to realize its effectiveness in critical applications.
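Of the exploration techniques listed above, ε-greedy is the simplest; for comparison with the Thompson-sampling snippets earlier, a purely illustrative one-step version is:

    import numpy as np

    def epsilon_greedy_arm(estimated_means, epsilon, rng):
        """With probability epsilon explore uniformly; otherwise exploit the current best estimate."""
        if rng.random() < epsilon:
            return int(rng.integers(len(estimated_means)))
        return int(np.argmax(estimated_means))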
An information-theoretic analysis of Thompson sampling, 2014
"... Abstract We provide an information-theoretic analysis of Thompson sampling that applies across a broad range of online optimization problems in which a decision-maker must learn from partial feedback. This analysis inherits the simplicity and elegance of information theory and leads to regret bound ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
(Show Context)
We provide an information-theoretic analysis of Thompson sampling that applies across a broad range of online optimization problems in which a decision-maker must learn from partial feedback. This analysis inherits the simplicity and elegance of information theory and leads to regret bounds that scale with the entropy of the optimal-action distribution. This strengthens preexisting results and yields new insight into how information improves performance.
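The entropy-scaled bounds mentioned here have, to the best of my recollection, the following shape; this paraphrases the information-ratio argument and is not a verbatim statement from the paper:

    \mathbb{E}\bigl[\mathrm{Regret}(T)\bigr] \;\le\; \sqrt{\bar{\Gamma}\, H(\alpha^{*})\, T}

where Γ̄ is a uniform bound on the so-called information ratio, H(α*) is the entropy of the prior distribution of the optimal action, and T is the horizon; for a K-armed bandit the information ratio of Thompson sampling is bounded by K/2.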
Spectral Thompson Sampling, 2014
"... Thompson Sampling (TS) has surged a lot of interest due to its good empirical performance, in particular in the compu-tational advertising. Though successful, the tools for its per-formance analysis appeared only recently. In this paper, we describe and analyze SpectralTS algorithm for a bandit prob ..."
Abstract
-
Cited by 3 (2 self)
- Add to MetaCart
Thompson Sampling (TS) has attracted a lot of interest due to its good empirical performance, in particular in computational advertising. Though successful, the tools for its performance analysis appeared only recently. In this paper, we describe and analyze the SpectralTS algorithm for a bandit problem where the payoffs of the choices are smooth given an underlying graph. In this setting, each choice is a node of a graph and the expected payoffs of neighboring nodes are assumed to be similar. Although the setting has applications both in recommender systems and advertising, traditional algorithms would scale poorly with the number of choices. For that purpose we consider an effective dimension d, which is small in real-world graphs. We deliver an analysis showing that the regret of SpectralTS scales as d√(T ln N) with high probability, where T is the time horizon and N is the number of choices. Since a d√(T ln N) regret is comparable to the known results, SpectralTS offers a computationally more efficient alternative. We also show that our algorithm is competitive on both synthetic and real-world data.
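A rough sketch of the spectral construction such an algorithm relies on: each arm's feature vector is a row of the first few eigenvectors of the graph Laplacian (the smoothest basis functions on the graph), which can then be fed to a linear Thompson-sampling round like the one sketched earlier. The truncation to k components and the unnormalized Laplacian are simplifications on my part, not the paper's exact recipe.

    import numpy as np

    def spectral_features(adjacency, k):
        """First k Laplacian eigenvectors as per-node features (sketch)."""
        degree = np.diag(adjacency.sum(axis=1))
        laplacian = degree - adjacency
        _, eigvecs = np.linalg.eigh(laplacian)   # eigenvalues returned in ascending order
        return eigvecs[:, :k]                    # one k-dimensional feature row per node/arm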
Generalized Thompson sampling for contextual bandits. arXiv preprint arXiv:1310.7163, 2013
"... Abstract Thompson Sampling, one of the oldest heuristics for solving multi-armed bandits, has recently been shown to demonstrate state-of-the-art performance. The empirical success has led to great interests in theoretical understanding of this heuristic. In this paper, we approach this problem in ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
(Show Context)
Thompson Sampling, one of the oldest heuristics for solving multi-armed bandits, has recently been shown to achieve state-of-the-art performance. This empirical success has led to great interest in a theoretical understanding of the heuristic. In this paper, we approach the problem in a way very different from existing efforts. In particular, motivated by the connection between Thompson Sampling and exponentiated updates, we propose a new family of algorithms called Generalized Thompson Sampling in the expert-learning framework, which includes Thompson Sampling as a special case. Similar to most expert-learning algorithms, Generalized Thompson Sampling uses a loss function to adjust the experts' weights. General regret bounds are derived and then instantiated for two important loss functions: square loss and logarithmic loss. In contrast to existing bounds, our results apply to quite general contextual bandits. More importantly, they quantify the effect of the "prior" distribution on the regret bounds.
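The exponentiated update at the heart of this expert-learning view is essentially a multiplicative-weights step; a generic sketch (the specific square or logarithmic loss would be supplied through `losses`, and this is my paraphrase rather than the paper's exact update) is:

    import numpy as np

    def exponentiated_update(weights, losses, eta):
        """Multiplicative-weights step: down-weight experts in proportion to their loss."""
        w = weights * np.exp(-eta * losses)
        return w / w.sum()   # renormalize to keep a probability distribution over experts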