Results 1–10 of 18
Thompson Sampling for Contextual Bandits with Linear Payoffs, 2013
Cited by 28 (3 self)
Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the state-of-the-art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze a generalization of the Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studied versions of the contextual bandits problem. We provide the first theoretical guarantees for the contextual version of Thompson Sampling. We prove a high-probability regret bound of Õ(d^(3/2)√T), which is the best regret bound achieved by any computationally efficient algorithm available for this problem in the current literature, and is within a factor of √d (or √(log N)) of the information-theoretic lower bound for this problem.
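The per-round step of Thompson Sampling with linear payoffs can be sketched as follows. This is a minimal illustration rather than the paper's exact algorithm: it assumes a Gaussian posterior N(B⁻¹f, v²B⁻¹) over the parameter, and the contexts, noise level, and posterior scale v are made-up toy values.

```python
import numpy as np

def lin_ts_step(B, f, contexts, v, rng):
    """One round of Thompson Sampling with linear payoffs.

    B: d x d design matrix (I + sum of x x^T), f: reward-weighted feature sums,
    v: posterior scale. Returns the chosen arm index and the sampled parameter.
    """
    mu = np.linalg.solve(B, f)                 # posterior mean
    cov = v ** 2 * np.linalg.inv(B)            # posterior covariance
    theta = rng.multivariate_normal(mu, cov)   # draw one posterior sample
    arm = int(np.argmax(contexts @ theta))     # act greedily w.r.t. the sample
    return arm, theta

# toy run: true parameter and 3 fixed context vectors (illustrative values)
rng = np.random.default_rng(0)
theta_star = np.array([1.0, 0.0])
contexts = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
d = 2
B, f = np.eye(d), np.zeros(d)
for t in range(200):
    arm, _ = lin_ts_step(B, f, contexts, v=0.5, rng=rng)
    x = contexts[arm]
    r = x @ theta_star + 0.1 * rng.standard_normal()  # noisy linear reward
    B += np.outer(x, x)                               # rank-one posterior update
    f += r * x
```

After enough rounds the posterior mean concentrates near the true parameter and the sampled-greedy choice settles on the best context.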
Prior-free and prior-dependent regret bounds for Thompson Sampling
Cited by 9 (0 self)
We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We are interested in studying prior-free and prior-dependent regret bounds, very much in the same spirit as the usual distribution-free and distribution-dependent bounds for the non-Bayesian stochastic bandit. Building on the techniques of Audibert and Bubeck [2009] and Russo and Van Roy [2013], we first show that Thompson Sampling attains an optimal prior-free bound in the sense that for any prior distribution its Bayesian regret is bounded from above by 14√(nK). This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by (1/20)√(nK). We also study the case of priors for the setting of Bubeck et al. [2013] (where the optimal mean is known as well as a lower bound on the smallest gap) and we show that in this case the regret of Thompson Sampling is in fact uniformly bounded over time, thus showing that Thompson Sampling can greatly take advantage of the nice properties of these priors.
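The basic algorithm this paper analyzes can be sketched for Bernoulli rewards with independent Beta priors. The uniform Beta(1, 1) prior, horizon, and arm means below are illustrative toy choices, not values from the paper.

```python
import random

def thompson_bernoulli(true_means, horizon, seed=0):
    """Thompson Sampling for Bernoulli bandits with Beta(1,1) priors per arm."""
    rng = random.Random(seed)
    k = len(true_means)
    alpha, beta = [1] * k, [1] * k        # Beta posterior parameters per arm
    pulls = [0] * k
    for _ in range(horizon):
        # sample a mean for each arm from its posterior, play the argmax
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        alpha[arm] += reward              # conjugate posterior update
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_bernoulli([0.2, 0.5, 0.8], horizon=2000)
```

Swapping the Beta(1, 1) prior for an informative one is exactly the lever the prior-dependent bounds in the paper are about.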
An information-theoretic analysis of Thompson sampling, 2014
Cited by 4 (1 self)
We provide an information-theoretic analysis of Thompson sampling that applies across a broad range of online optimization problems in which a decision-maker must learn from partial feedback. This analysis inherits the simplicity and elegance of information theory and leads to regret bounds that scale with the entropy of the optimal-action distribution. This strengthens pre-existing results and yields new insight into how information improves performance.
Spectral Thompson Sampling, 2014
Cited by 3 (2 self)
Thompson Sampling (TS) has attracted a lot of interest due to its good empirical performance, in particular in computational advertising. Though successful, the tools for its performance analysis appeared only recently. In this paper, we describe and analyze the SpectralTS algorithm for a bandit problem where the payoffs of the choices are smooth given an underlying graph. In this setting, each choice is a node of a graph and the expected payoffs of neighboring nodes are assumed to be similar. Although the setting has applications both in recommender systems and advertising, traditional algorithms would scale poorly with the number of choices. For that purpose we consider an effective dimension d, which is small in real-world graphs. We deliver an analysis showing that the regret of SpectralTS scales as d√(T ln N) with high probability, where T is the time horizon and N is the number of choices. Since a d√(T ln N) regret is comparable to the known results, SpectralTS offers a computationally more efficient alternative. We also show that our algorithm is competitive on both synthetic and real-world data.
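The core idea, smooth payoffs over a graph, can be sketched by using the first few eigenvectors of the graph Laplacian as features and running linear Thompson Sampling in that basis. This is a rough illustration of the setting, not the paper's SpectralTS; the path graph, payoff ramp, and posterior scale are invented toy values.

```python
import numpy as np

def spectral_features(adj, k):
    """First k eigenvectors of the graph Laplacian: a smooth basis for payoffs."""
    deg = np.diag(adj.sum(axis=1))
    laplacian = deg - adj
    _, eigvecs = np.linalg.eigh(laplacian)   # ascending eigenvalues
    return eigvecs[:, :k]                    # rows = per-node feature vectors

# path graph on 6 nodes: neighboring nodes get similar payoffs
n = 6
adj = np.zeros((n, n))
for i in range(n - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
X = spectral_features(adj, k=3)

# smooth toy payoff increasing along the path; run TS with spectral features
rng = np.random.default_rng(1)
payoff = np.linspace(0.0, 1.0, n)
B, f = np.eye(3), np.zeros(3)
for t in range(400):
    mu = np.linalg.solve(B, f)
    theta = rng.multivariate_normal(mu, 0.25 * np.linalg.inv(B))
    arm = int(np.argmax(X @ theta))          # node with largest sampled payoff
    r = payoff[arm] + 0.1 * rng.standard_normal()
    B += np.outer(X[arm], X[arm])
    f += r * X[arm]
```

The payoff vector is learned in a 3-dimensional basis rather than over all N nodes, which is the scalability point the abstract makes.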
Generalized Thompson Sampling for contextual bandits, arXiv preprint arXiv:1310.7163, 2013
Cited by 3 (1 self)
Thompson Sampling, one of the oldest heuristics for solving multi-armed bandits, has recently been shown to demonstrate state-of-the-art performance. This empirical success has led to great interest in theoretical understanding of this heuristic. In this paper, we approach this problem in a way very different from existing efforts. In particular, motivated by the connection between Thompson Sampling and exponentiated updates, we propose a new family of algorithms called Generalized Thompson Sampling in the expert-learning framework, which includes Thompson Sampling as a special case. Similar to most expert-learning algorithms, Generalized Thompson Sampling uses a loss function to adjust the experts' weights. General regret bounds are derived, which are also instantiated for two important loss functions: square loss and logarithmic loss. In contrast to existing bounds, our results apply to quite general contextual bandits. More importantly, they quantify the effect of the "prior" distribution on the regret bounds.
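The exponentiated-update view can be sketched as follows: experts are candidate reward models, a model is sampled from the current weights (the Thompson-style randomization), its greedy arm is played, and every expert's weight is multiplied by exp(-η · loss) using the square loss on the observation. This is a simplified illustration under assumed toy models, learning rate, and noise, not the paper's exact Generalized Thompson Sampling.

```python
import math
import random

def generalized_ts(models, true_rewards, horizon=500, eta=2.0, seed=0):
    """Thompson-style sampling over experts with exponentiated square-loss updates.

    Each 'expert' is a model predicting the mean reward of every arm.
    Sample a model from the current weights, play its greedy arm, then
    reweight every model by exp(-eta * square_loss) on the observation.
    """
    rng = random.Random(seed)
    w = [1.0] * len(models)
    plays = [0] * len(true_rewards)
    for _ in range(horizon):
        m = rng.choices(range(len(models)), weights=w)[0]   # posterior-style draw
        arm = max(range(len(true_rewards)), key=lambda a: models[m][a])
        r = true_rewards[arm] + 0.1 * rng.gauss(0, 1)
        for i, pred in enumerate(models):
            w[i] *= math.exp(-eta * (pred[arm] - r) ** 2)   # exponentiated update
        plays[arm] += 1
    return plays, w

models = [(0.9, 0.1), (0.1, 0.9), (0.5, 0.5)]   # candidate reward profiles (toy)
plays, w = generalized_ts(models, true_rewards=(0.9, 0.1))
```

With the logarithmic loss and weights interpreted as a posterior, this update reduces to standard Thompson Sampling, which is the special-case claim in the abstract.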
Efficient reinforcement learning via posterior sampling, in NIPS
Cited by 3 (0 self)
Most provably efficient reinforcement learning algorithms introduce optimism about poorly understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an Õ(τS√(AT)) bound on expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds.
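The episode loop described above (sample one MDP from the posterior, act optimally for it, update) can be sketched on a tabular toy problem. This assumes known rewards and a Dirichlet posterior over transitions; the two-state MDP, horizon, and episode count are invented for illustration.

```python
import numpy as np

def psrl_episode(counts, R, H, rng):
    """One PSRL planning step on a tabular MDP with known rewards.

    counts[s, a]: Dirichlet posterior parameters over next states.
    Sample one transition model, solve it by finite-horizon value
    iteration, and return the greedy policy for that sample.
    """
    S, A = R.shape
    P = np.array([[rng.dirichlet(counts[s, a]) for a in range(A)]
                  for s in range(S)])           # one posterior sample, shape (S, A, S)
    Q = np.zeros((H + 1, S, A))
    for h in range(H - 1, -1, -1):              # backward induction
        V_next = Q[h + 1].max(axis=1)
        Q[h] = R + P @ V_next
    return Q[:H].argmax(axis=2), P              # greedy action per (step, state)

# toy two-state chain: action 1 in state 0 usually moves to the rewarding state 1
R = np.array([[0.0, 0.0], [1.0, 1.0]])          # reward depends only on the state
true_P = np.array([[[1.0, 0.0], [0.2, 0.8]],    # state 0: stay / move right
                   [[1.0, 0.0], [0.2, 0.8]]])   # state 1: fall back / stay right
rng = np.random.default_rng(0)
counts = np.ones((2, 2, 2))                     # Dirichlet(1, 1) prior everywhere
H = 4
for episode in range(200):
    policy, _ = psrl_episode(counts, R, H, rng)
    s = 0
    for h in range(H):
        a = policy[h, s]
        s_next = rng.choice(2, p=true_P[s, a])
        counts[s, a, s_next] += 1               # posterior update from the transition
        s = s_next
```

A single posterior sample per episode gives exploration "for free": an uncertain transition occasionally looks good in the sample and gets tried, without any explicit bonus term.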
Generalization and exploration via randomized value functions, arXiv preprint arXiv:1402.0635; in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 2014
Cited by 2 (2 self)
We consider the problem of reinforcement learning with an orientation toward contexts in which an agent must generalize from past experience and explore to reduce uncertainty. We propose an approach to exploration based on randomized value functions and an algorithm – randomized least-squares value iteration (RLSVI) – that embodies this approach. We explain why versions of least-squares value iteration that use Boltzmann or ε-greedy exploration can be highly inefficient, and present computational results that demonstrate dramatic efficiency gains enjoyed by RLSVI. Our experiments focus on learning over episodes of a finite-horizon Markov decision process and use a version of RLSVI designed for that task, but we also propose a version of RLSVI that addresses continual learning in an infinite-horizon discounted Markov decision process.
Efficient exploration and value function generalization in deterministic systems, in Advances in Neural Information Processing Systems 26, 2013
Cited by 2 (2 self)
We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q∗ lies within the hypothesis class Q, OCP selects optimal actions over all but at most dim_E[Q] episodes, where dim_E denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees that apply even if Q∗ does not lie in Q, for the special case where Q is the span of pre-specified indicator functions over disjoint sets.
Efficient learning in large-scale combinatorial semi-bandits, in Proceedings of the 32nd International Conference on Machine Learning, 2015
Cited by 2 (1 self)
A stochastic combinatorial semi-bandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to combinatorial constraints, and then observes stochastic weights of these items and receives their sum as a payoff. In this paper, we consider efficient learning in large-scale combinatorial semi-bandits with linear generalization and, as a solution, propose two learning algorithms called Combinatorial Linear Thompson Sampling (CombLinTS) and Combinatorial Linear UCB (CombLinUCB). Both algorithms are computationally efficient as long as the offline version of the combinatorial problem can be solved efficiently. We establish that CombLinTS and CombLinUCB are also provably statistically efficient under reasonable assumptions, by developing regret bounds that are independent of the problem scale (number of items) and sublinear in time. We also evaluate CombLinTS on a variety of problems with thousands of items. Our experimental results demonstrate that CombLinTS is scalable, robust to the choice of algorithm parameters, and significantly outperforms the best of our baselines.
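The semi-bandit loop above can be sketched with the simplest combinatorial oracle, top-k selection: sample a parameter from the Gaussian posterior, let the oracle pick the k items with the largest sampled weights, and update the posterior once per observed item. This is a rough sketch of the CombLinTS idea under assumed toy features, noise, and horizon, not the paper's exact algorithm.

```python
import numpy as np

def comb_lin_ts(features, true_theta, k, horizon=300, noise=0.1, lam=1.0, seed=0):
    """Combinatorial linear Thompson Sampling with a top-k 'oracle'.

    Items have feature vectors; the expected weight of item i is
    features[i] @ theta. Semi-bandit feedback: every chosen item's noisy
    weight is observed, giving k posterior updates per round.
    """
    rng = np.random.default_rng(seed)
    n, d = features.shape
    B, f = lam * np.eye(d), np.zeros(d)
    picks = np.zeros(n, dtype=int)
    for _ in range(horizon):
        mu = np.linalg.solve(B, f)
        theta = rng.multivariate_normal(mu, np.linalg.inv(B))  # posterior sample
        scores = features @ theta
        chosen = np.argsort(scores)[-k:]            # offline top-k oracle
        for i in chosen:
            w = features[i] @ true_theta + noise * rng.standard_normal()
            B += np.outer(features[i], features[i]) # one update per observed item
            f += w * features[i]
            picks[i] += 1
    return picks

# 8 random items in a 3-d feature space (illustrative values)
rng0 = np.random.default_rng(42)
features = rng0.standard_normal((8, 3))
true_theta = np.array([1.0, 0.5, -0.5])
picks = comb_lin_ts(features, true_theta, k=2)
```

Because the posterior lives in the d-dimensional feature space, the learning cost does not grow with the number of items, which mirrors the scale-independent bounds the abstract claims.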
Learning to Optimize via Information-Directed Sampling, 2014
Cited by 1 (0 self)
We propose information-directed sampling – a new algorithm for online optimization problems in which a decision-maker must balance between exploration and exploitation while learning from partial feedback. Each action is sampled in a manner that minimizes the ratio between squared expected single-period regret and a measure of information gain: the mutual information between the optimal action and the next observation. We establish an expected regret bound for information-directed sampling that applies across a very general class of models and scales with the entropy of the optimal-action distribution. For the widely studied Bernoulli, Gaussian, and linear bandit problems, we demonstrate simulation performance surpassing popular approaches, including upper confidence bound algorithms, Thompson sampling, and the knowledge gradient algorithm. Further, we present simple analytic examples illustrating that, due to the way it measures information gain, information-directed sampling can dramatically outperform upper confidence bound algorithms and Thompson sampling.