An information-theoretic analysis of Thompson sampling. (2014)

by D Russo, B Van Roy
Results 1 - 4 of 4

Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits

by Branislav Kveton, Zheng Wen, Azin Ashkan
"... A stochastic combinatorial semi-bandit is an on-line learning problem where at each step a learn-ing agent chooses a subset of ground items sub-ject to constraints, and then observes stochastic weights of these items and receives their sum as a payoff. In this paper, we close the problem of computat ..."
Abstract - Cited by 7 (2 self) - Add to MetaCart
A stochastic combinatorial semi-bandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to constraints, and then observes stochastic weights of these items and receives their sum as a payoff. In this paper, we close the problem of computationally and sample efficient learning in stochastic combinatorial semi-bandits. In particular, we analyze a UCB-like algorithm for solving the problem, which is known to be computationally efficient, and prove O(KL(1/∆) log n) and O(√(KLn log n)) upper bounds on its n-step regret, where L is the number of ground items, K is the maximum number of chosen items, and ∆ is the gap between the expected returns of the optimal and the best suboptimal solutions. The gap-dependent bound is tight up to a constant factor and the gap-free bound is tight up to a polylogarithmic factor.
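The UCB-like algorithm referenced above maintains optimistic per-item estimates and calls a combinatorial oracle each round. The following is a minimal sketch of that loop in Python, assuming hypothetical oracle and observe_weights helpers for the constraint structure and the environment; the bonus constant and initialization details follow common practice rather than the paper's exact specification.

import numpy as np

def comb_ucb_sketch(L, n, oracle, observe_weights):
    """Minimal UCB-style loop for a stochastic combinatorial semi-bandit.

    L: number of ground items; n: number of rounds.
    oracle(scores) -> feasible subset (list of item indices) maximizing the
                      sum of per-item scores under the constraints (hypothetical).
    observe_weights(A) -> realized stochastic weights of the chosen items (hypothetical).
    """
    counts = np.zeros(L)   # number of times each item has been observed
    means = np.zeros(L)    # empirical mean weight of each item

    for t in range(1, n + 1):
        # optimistic per-item estimate: empirical mean plus confidence bonus
        bonus = np.sqrt(1.5 * np.log(t + 1) / np.maximum(counts, 1))
        A = oracle(means + bonus)            # best feasible subset under the UCBs
        w = observe_weights(A)               # semi-bandit feedback: one weight per chosen item
        for i, wi in zip(A, w):              # incremental update of per-item statistics
            counts[i] += 1
            means[i] += (wi - means[i]) / counts[i]
    return means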

Efficient Learning in Large-Scale Combinatorial Semi-Bandits

by Zheng Wen (Adobe Research, San Jose, CA) - In Proceedings of the 32nd International Conference on Machine Learning, 2015
"... Abstract A stochastic combinatorial semi-bandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to combinatorial constraints, and then observes stochastic weights of these items and receives their sum as a payoff. In this paper, we consider ..."
Abstract - Cited by 2 (1 self) - Add to MetaCart
A stochastic combinatorial semi-bandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to combinatorial constraints, and then observes stochastic weights of these items and receives their sum as a payoff. In this paper, we consider efficient learning in large-scale combinatorial semi-bandits with linear generalization, and as a solution, propose two learning algorithms called Combinatorial Linear Thompson Sampling (CombLinTS) and Combinatorial Linear UCB (CombLinUCB). Both algorithms are computationally efficient as long as the offline version of the combinatorial problem can be solved efficiently. We establish that CombLinTS and CombLinUCB are also provably statistically efficient under reasonable assumptions, by developing regret bounds that are independent of the problem scale (number of items) and sublinear in time. We also evaluate CombLinTS on a variety of problems with thousands of items. Our experiment results demonstrate that CombLinTS is scalable, robust to the choice of algorithm parameters, and significantly outperforms the best of our baselines.
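Under the linear generalization described above, each ground item has a known feature vector and its mean weight is modeled as an inner product with an unknown parameter. A CombLinTS-style round can be sketched as below; oracle and observe_weights are hypothetical placeholders, and the Gaussian posterior update is the standard Bayesian linear-regression recursion rather than a verbatim transcription of the paper's algorithm.

import numpy as np

def comb_lin_ts_sketch(X, n, oracle, observe_weights, sigma=1.0, lam=1.0):
    """Thompson sampling with linear generalization for combinatorial semi-bandits.

    X: (L, d) array of known item features.
    oracle(scores) -> feasible subset maximizing the sum of scores (hypothetical).
    observe_weights(A) -> realized weights of the chosen items (hypothetical).
    sigma: assumed observation-noise scale; lam: prior precision.
    """
    L, d = X.shape
    B = lam * np.eye(d)      # posterior precision of the parameter
    f = np.zeros(d)          # precision-weighted posterior mean

    for _ in range(n):
        cov = np.linalg.inv(B)
        theta = np.random.multivariate_normal(cov @ f, cov)  # posterior sample
        A = oracle(X @ theta)              # act greedily with respect to the sample
        w = observe_weights(A)
        for i, wi in zip(A, w):            # rank-one Bayesian linear-regression update
            B += np.outer(X[i], X[i]) / sigma**2
            f += X[i] * wi / sigma**2

Per the abstract, the resulting regret bounds are independent of the number of items L, which is what makes the approach viable with thousands of items.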

Learning to Optimize Via Information-Directed Sampling

by Daniel Russo, Benjamin Van Roy, 2014
"... We propose information-directed sampling – a new algorithm for online optimization prob-lems in which a decision-maker must balance between exploration and exploitation while learning from partial feedback. Each action is sampled in a manner that minimizes the ratio between squared expected single-p ..."
Abstract - Cited by 1 (0 self) - Add to MetaCart
We propose information-directed sampling – a new algorithm for online optimization problems in which a decision-maker must balance between exploration and exploitation while learning from partial feedback. Each action is sampled in a manner that minimizes the ratio between squared expected single-period regret and a measure of information gain: the mutual information between the optimal action and the next observation. We establish an expected regret bound for information-directed sampling that applies across a very general class of models and scales with the entropy of the optimal action distribution. For the widely studied Bernoulli, Gaussian, and linear bandit problems, we demonstrate simulation performance surpassing popular approaches, including upper confidence bound algorithms, Thompson sampling, and the knowledge gradient algorithm. Further, we present simple analytic examples illustrating that, due to the way it measures information gain, information-directed sampling can dramatically outperform upper confidence bound algorithms and Thompson sampling.

Citation Context

...assesses information gain allows it to dramatically outperform UCB algorithms and Thompson sampling. Further, by leveraging the tools of our recent information-theoretic analysis of Thompson sampling [47], we establish an expected regret bound for IDS that applies across a very general class of models and scales with the entropy of the optimal action distribution. We also specialize this bound to seve...
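The central quantity above is the information ratio: squared expected single-period regret divided by expected information gain. The IDS paper shows the minimizing action distribution can be taken to have support of size at most two, so for a finite action set a small grid search suffices. The sketch below assumes the per-action regret estimates delta and information-gain estimates gain have already been computed from the posterior.

import numpy as np

def ids_action_distribution(delta, gain):
    """Return an (approximately) information-ratio-minimizing distribution.

    delta[a]: posterior expected single-period regret of action a.
    gain[a]:  information gain of action a, e.g. the mutual information
              between the optimal action and the next observation.
    Returns (p, (a, b)): play action a with probability p, action b otherwise.
    """
    K = len(delta)
    best_ratio, best_p, best_pair = np.inf, 1.0, (0, 0)
    ps = np.linspace(0.0, 1.0, 101)                   # grid over mixing probabilities
    for a in range(K):
        for b in range(K):
            d = ps * delta[a] + (1 - ps) * delta[b]   # expected regret of the mixture
            g = ps * gain[a] + (1 - ps) * gain[b]     # expected information gain
            ratio = d**2 / np.maximum(g, 1e-12)       # information ratio (guarded against zero gain)
            j = int(np.argmin(ratio))
            if ratio[j] < best_ratio:
                best_ratio, best_p, best_pair = ratio[j], ps[j], (a, b)
    return best_p, best_pair

Sampling an action then means drawing best_pair[0] with probability best_p and best_pair[1] otherwise; the regret bound in the abstract follows from controlling this ratio across rounds.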

Bayesian Policy Gradient and Actor-Critic Algorithms

by Mohammad Ghavamzadeh, Yaakov Engel, Michal Valko, 2016
"... Abstract Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Many conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. The policy is improved by adjusting the parameters i ..."
Abstract - Add to MetaCart
Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Many conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. The policy is improved by adjusting the parameters in the direction of the gradient estimate. Since Monte-Carlo methods tend to have high variance, a large number of samples is required to attain accurate estimates, resulting in slow convergence. In this paper, we first propose a Bayesian framework for policy gradient, based on modeling the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates, namely, the gradient covariance, are provided at little extra cost. Since the proposed Bayesian framework considers system trajectories as its basic observable unit, it does not require the dynamics within trajectories to be of any particular form, and thus, can be easily extended to partially observable problems. On the downside, it cannot take advantage of the Markov property when the system is Markovian. To address this issue, we proceed to supplement our Bayesian policy gradient framework with a new actor-critic learning model in which a Bayesian class of non-parametric critics, based on Gaussian process temporal difference learning, is used. Such critics model the action-value function as a Gaussian process, allowing Bayes' rule to be used in computing the posterior distribution over action-value functions, conditioned on the observed data. Appropriate choices of the policy parameterization and of the prior covariance (kernel) between action-values allow us to obtain closed-form expressions for the posterior distribution of the gradient of the expected return with respect to the policy parameters. We perform detailed experimental comparisons of the proposed Bayesian policy gradient and actor-critic algorithms with classic Monte-Carlo based policy gradient methods, as well as with each other, on a number of reinforcement learning problems.
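For context, the conventional Monte-Carlo (likelihood-ratio) gradient estimate that the Bayesian framework above is designed to improve on looks roughly like the sketch below; sample_trajectory and grad_log_policy are hypothetical interfaces to the environment and the parameterized policy, and the estimator's variance, which shrinks only as 1/M, is what motivates modeling the gradient as a Gaussian process instead.

import numpy as np

def mc_policy_gradient_sketch(sample_trajectory, grad_log_policy, M):
    """Plain Monte-Carlo policy gradient estimate from M sampled trajectories.

    sample_trajectory() -> (states, actions, total_return) under the current
                           policy (hypothetical environment interface).
    grad_log_policy(s, a) -> gradient of log pi_theta(a | s) w.r.t. theta,
                             as a NumPy array (hypothetical policy interface).
    """
    grad = None
    for _ in range(M):
        states, actions, ret = sample_trajectory()
        # likelihood-ratio (score-function) term for one trajectory
        score = sum(grad_log_policy(s, a) for s, a in zip(states, actions))
        g = ret * np.asarray(score)
        grad = g if grad is None else grad + g
    return grad / M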