Results 1 - 4 of 4
Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits
Abstract

Cited by 7 (2 self)
A stochastic combinatorial semi-bandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to constraints, and then observes stochastic weights of these items and receives their sum as a payoff. In this paper, we close the problem of computationally and sample efficient learning in stochastic combinatorial semi-bandits. In particular, we analyze a UCB-like algorithm for solving the problem, which is known to be computationally efficient; and prove O(KL(1/∆) log n) and O(√(KLn log n)) upper bounds on its n-step regret, where L is the number of ground items, K is the maximum number of chosen items, and ∆ is the gap between the expected returns of the optimal and best suboptimal solutions. The gap-dependent bound is tight up to a constant factor and the gap-free bound is tight up to a polylogarithmic factor.
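The UCB-style loop this abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's exact algorithm: the oracle interface, the confidence-radius constant, and the weight model are all assumptions made here for concreteness.

```python
import math

def comb_ucb1_sketch(oracle, L, n, weights):
    """Minimal UCB-style semi-bandit loop (illustrative sketch).

    oracle(index) returns a feasible subset of the L ground items
    maximizing the sum of the given per-item indices;
    weights(i) draws a stochastic weight in [0, 1] for item i.
    """
    pulls = [0] * L    # observation counts per ground item
    mean = [0.0] * L   # empirical mean weight per ground item
    total = 0.0
    for t in range(1, n + 1):
        # optimistic index: empirical mean plus a confidence radius
        index = [mean[i] + math.sqrt(1.5 * math.log(t) / pulls[i])
                 if pulls[i] > 0 else float("inf") for i in range(L)]
        for i in oracle(index):      # semi-bandit feedback:
            w = weights(i)           # each chosen item's weight
            pulls[i] += 1            # is observed individually
            mean[i] += (w - mean[i]) / pulls[i]
            total += w
    return total
```

Note the key structural point from the abstract: the combinatorial step is delegated entirely to the offline oracle, so the learner stays computationally efficient whenever the offline problem is.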
Efficient learning in large-scale combinatorial semi-bandits.
In Proceedings of the 32nd International Conference on Machine Learning, 2015
Abstract

Cited by 2 (1 self)
A stochastic combinatorial semi-bandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to combinatorial constraints, and then observes stochastic weights of these items and receives their sum as a payoff. In this paper, we consider efficient learning in large-scale combinatorial semi-bandits with linear generalization, and as a solution, propose two learning algorithms called Combinatorial Linear Thompson Sampling (CombLinTS) and Combinatorial Linear UCB (CombLinUCB). Both algorithms are computationally efficient as long as the offline version of the combinatorial problem can be solved efficiently. We establish that CombLinTS and CombLinUCB are also provably statistically efficient under reasonable assumptions, by developing regret bounds that are independent of the problem scale (number of items) and sublinear in time. We also evaluate CombLinTS on a variety of problems with thousands of items. Our experiment results demonstrate that CombLinTS is scalable, robust to the choice of algorithm parameters, and significantly outperforms the best of our baselines.
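The linear-generalization idea behind CombLinTS can be sketched with a simplified Thompson-sampling loop over a Bayesian linear model. This is an illustrative simplification, not the paper's algorithm: the prior, noise model, and update rule below are assumptions chosen to keep the sketch short.

```python
import numpy as np

def comb_lin_ts_sketch(features, oracle, observe, n, sigma=0.3, lam=1.0):
    """Simplified Thompson sampling in the spirit of CombLinTS.

    features: (L, d) array; item i's mean weight is theta^T features[i].
    oracle(scores) returns a feasible subset maximizing the score sum.
    observe(i) returns a noisy weight observation for item i.
    """
    L, d = features.shape
    precision = lam * np.eye(d)   # Bayesian linear-regression posterior
    xty = np.zeros(d)
    picks = []
    for _ in range(n):
        cov = np.linalg.inv(precision)
        theta_hat = cov @ xty
        # sample a plausible parameter vector from the posterior
        theta_tilde = np.random.multivariate_normal(theta_hat, sigma**2 * cov)
        chosen = oracle(features @ theta_tilde)
        for i in chosen:          # semi-bandit feedback per chosen item
            x, y = features[i], observe(i)
            precision += np.outer(x, x) / sigma**2
            xty += x * y / sigma**2
        picks.append(chosen)
    return picks
```

Because learning happens in the d-dimensional parameter space rather than per item, the regret of this style of algorithm can be made independent of the number of items L, which is the scalability claim the abstract makes.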
Learning to Optimize Via Information-Directed Sampling
, 2014
Abstract

Cited by 1 (0 self)
We propose information-directed sampling – a new algorithm for online optimization problems in which a decision-maker must balance between exploration and exploitation while learning from partial feedback. Each action is sampled in a manner that minimizes the ratio between squared expected single-period regret and a measure of information gain: the mutual information between the optimal action and the next observation. We establish an expected regret bound for information-directed sampling that applies across a very general class of models and scales with the entropy of the optimal action distribution. For the widely studied Bernoulli, Gaussian, and linear bandit problems, we demonstrate simulation performance surpassing popular approaches, including upper confidence bound algorithms, Thompson sampling, and the knowledge gradient algorithm. Further, we present simple analytic examples illustrating that, due to the way it measures information gain, information-directed sampling can dramatically outperform upper confidence bound algorithms and Thompson sampling.
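The ratio the abstract describes can be approximated from posterior samples. The sketch below is a deliberate simplification: it uses a variance-based proxy for the information gain rather than the full mutual information, and it picks a single deterministic action rather than the randomized two-action distribution the full method optimizes over.

```python
import numpy as np

def ids_action_sketch(samples):
    """Sample-based information-ratio sketch (simplified).

    samples: (M, A) posterior draws of the A arm means.
    Returns the action minimizing squared expected regret divided by a
    variance-based information-gain proxy.
    """
    M, A = samples.shape
    best = samples.argmax(axis=1)             # optimal arm per draw
    p_star = np.bincount(best, minlength=A) / M
    overall = samples.mean(axis=0)            # E[theta_a]
    # expected single-period regret of playing each arm
    delta = samples.max(axis=1).mean() - overall
    # gain[a] = sum_{a*} p(a*) (E[theta_a | a* optimal] - E[theta_a])^2
    gain = np.zeros(A)
    for a_star in range(A):
        if p_star[a_star] == 0:
            continue
        cond = samples[best == a_star].mean(axis=0)
        gain += p_star[a_star] * (cond - overall) ** 2
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(gain > 1e-12, delta**2 / gain, np.inf)
    return int(ratio.argmin())
```

The intuition matches the abstract: an action is attractive either because its expected regret is small or because observing it reveals a lot about which action is optimal.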
Bayesian Policy Gradient and Actor-Critic Algorithms
, 2016
Abstract
Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Many conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. The policy is improved by adjusting the parameters in the direction of the gradient estimate. Since Monte-Carlo methods tend to have high variance, a large number of samples is required to attain accurate estimates, resulting in slow convergence. In this paper, we first propose a Bayesian framework for policy gradient, based on modeling the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates, namely, the gradient covariance, are provided at little extra cost. Since the proposed Bayesian framework considers system trajectories as its basic observable unit, it does not require the dynamics within trajectories to be of any particular form, and thus, can be easily extended to partially observable problems. On the downside, it cannot take advantage of the Markov property when the system is Markovian. To address this issue, we proceed to supplement our Bayesian policy gradient framework with a new actor-critic learning model in which a Bayesian class of nonparametric critics, based on Gaussian process temporal difference learning, is used. Such critics model the action-value function as a Gaussian process, allowing Bayes' rule to be used in computing the posterior distribution over action-value functions, conditioned on the observed data. Appropriate choices of the policy parameterization and of the prior covariance (kernel) between action-values allow us to obtain closed-form expressions for the posterior distribution of the gradient of the expected return with respect to the policy parameters.
We perform detailed experimental comparisons of the proposed Bayesian policy gradient and actor-critic algorithms with classic Monte-Carlo-based policy gradient methods, as well as with each other, on a number of reinforcement learning problems.
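The high-variance Monte-Carlo estimator that this abstract sets out to improve can be illustrated with a toy score-function (REINFORCE-style) update. Everything below is a hypothetical illustration of that baseline on a two-action bandit with a sigmoid policy; it is not the paper's Bayesian method.

```python
import math
import random

def mc_policy_gradient_sketch(n_episodes, theta, lr=0.1):
    """Monte-Carlo score-function policy gradient on a toy two-action
    bandit: the high-variance baseline the Bayesian framework improves.
    """
    returns = []
    for _ in range(n_episodes):
        p = 1.0 / (1.0 + math.exp(-theta))   # sigmoid policy: P(action 1)
        a = 1 if random.random() < p else 0
        r = 1.0 if a == 1 else 0.2           # action 1 is better
        # score function for the Bernoulli policy: d/dtheta log pi(a) = a - p
        theta += lr * r * (a - p)
        returns.append(r)
    return theta, sum(returns) / n_episodes
```

Each update uses a single sampled return, which is exactly why such estimates are noisy; the paper's framework instead places a Gaussian-process model over the gradient to extract more information from each trajectory.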