Thompson Sampling for Complex Online Problems
"... We consider stochastic multi-armed bandit prob-lems with complex actions over a set of basic arms, where the decision maker plays a complex action rather than a basic arm in each round. The reward of the complex action is some function of the basic arms ’ rewards, and the feedback ob-served may not ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
We consider stochastic multi-armed bandit problems with complex actions over a set of basic arms, where the decision maker plays a complex action rather than a basic arm in each round. The reward of the complex action is some function of the basic arms' rewards, and the feedback observed may not necessarily be the reward per arm. For instance, when the complex actions are subsets of the arms, we may only observe the maximum reward over the chosen subset. Thus, feedback across complex actions may be coupled due to the nature of the reward function. We prove a frequentist regret bound for Thompson sampling in a very general setting involving parameter, action and observation spaces and a likelihood function over them. The bound holds for discretely-supported priors over the parameter space without additional structural properties such as closed-form posteriors, conjugate prior structure or independence across arms. The regret bound scales logarithmically with time but, more importantly, with an improved constant that non-trivially captures the coupling across complex actions due to the structure of the rewards. As applications, we derive improved regret bounds for classes of complex bandit problems involving selecting subsets of arms, including the first nontrivial regret bounds for nonlinear MAX reward feedback from subsets. Using particle filters for computing posterior distributions which lack an explicit closed form, we present numerical results for the performance of Thompson sampling for subset selection and job scheduling.
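As a rough illustration of the setup in this abstract, the sketch below runs Thompson sampling over subsets of Bernoulli arms with MAX-of-subset feedback, using a crude importance-weighted particle filter for the posterior. All numbers, the resampling rule, and the jitter step are assumptions made for the example; this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

K, M, T = 6, 2, 2000                       # basic arms, subset size, horizon (all made up)
true_p = rng.uniform(0.1, 0.9, K)          # hypothetical Bernoulli means, unknown to the learner

# Discretely-supported prior over the parameter space, represented by particles.
N_PART = 500
particles = rng.uniform(0.0, 1.0, size=(N_PART, K))   # each particle is a candidate mean vector
weights = np.full(N_PART, 1.0 / N_PART)

for t in range(T):
    # Thompson step: draw one particle from the current posterior, act greedily on it.
    theta = particles[rng.choice(N_PART, p=weights)]
    subset = np.argsort(theta)[-M:]                    # complex action: the M best-looking arms

    # The environment reveals only the MAX reward over the chosen subset.
    pulls = rng.random(K) < true_p
    y = int(pulls[subset].max())

    # Posterior update: reweight each particle by the likelihood of the observed MAX.
    p_max = 1.0 - np.prod(1.0 - particles[:, subset], axis=1)
    lik = p_max if y == 1 else 1.0 - p_max
    weights = weights * lik
    weights /= weights.sum()

    # Crude resampling with jitter to avoid weight degeneracy.
    if 1.0 / np.sum(weights ** 2) < N_PART / 2:
        idx = rng.choice(N_PART, size=N_PART, p=weights)
        particles = particles[idx] + rng.normal(0.0, 0.02, size=(N_PART, K))
        particles = particles.clip(1e-3, 1.0 - 1e-3)
        weights = np.full(N_PART, 1.0 / N_PART)
```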
Spectral Thompson Sampling
, 2014
"... Thompson Sampling (TS) has surged a lot of interest due to its good empirical performance, in particular in the compu-tational advertising. Though successful, the tools for its per-formance analysis appeared only recently. In this paper, we describe and analyze SpectralTS algorithm for a bandit prob ..."
Abstract
-
Cited by 3 (2 self)
- Add to MetaCart
Thompson Sampling (TS) has attracted a lot of interest due to its good empirical performance, in particular in computational advertising. Though successful, the tools for its performance analysis appeared only recently. In this paper, we describe and analyze the SpectralTS algorithm for a bandit problem where the payoffs of the choices are smooth given an underlying graph. In this setting, each choice is a node of a graph and the expected payoffs of the neighboring nodes are assumed to be similar. Although the setting has applications both in recommender systems and advertising, traditional algorithms would scale poorly with the number of choices. For that purpose we consider an effective dimension d, which is small in real-world graphs. We deliver an analysis showing that the regret of SpectralTS scales as d√(T ln N) with high probability, where T is the time horizon and N is the number of choices. Since a d√(T ln N) regret is comparable to the known results, SpectralTS offers a computationally more efficient alternative. We also show that our algorithm is competitive on both synthetic and real-world data.
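A minimal sketch of the idea behind this abstract: each node is featurized by the eigenvectors of the graph Laplacian, and linear Thompson sampling is run over the spectral coefficients with a prior that penalizes high-frequency components. The toy ring graph, the regularization constant, and the noise level are assumptions for illustration, not the SpectralTS specification from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical graph: N nodes on a ring, so payoffs that are smooth along the ring
# have most of their energy in the low-frequency Laplacian eigenvectors.
N, T, noise_sd = 30, 1000, 0.1
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
L = np.diag(A.sum(axis=1)) - A                  # graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)

X = eigvecs                                     # row i = spectral features of node i
true_alpha = np.exp(-2.0 * eigvals) * rng.normal(size=N)   # smooth ground-truth signal
true_mu = X @ true_alpha

# Gaussian posterior over the spectral coefficients, with a prior precision that grows
# with the eigenvalue, i.e. high-frequency components are penalized.
B = np.diag(eigvals + 1e-2)
f = np.zeros(N)

for t in range(T):
    B_inv = np.linalg.inv(B)
    alpha_hat = B_inv @ f
    alpha_tilde = rng.multivariate_normal(alpha_hat, noise_sd ** 2 * B_inv)  # Thompson draw
    arm = int(np.argmax(X @ alpha_tilde))       # play the node that looks best under the draw
    reward = true_mu[arm] + noise_sd * rng.normal()
    B += np.outer(X[arm], X[arm])               # rank-one update of the posterior precision
    f += reward * X[arm]
```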
Global Multi-armed Bandits with Hölder Continuity
, 2015
"... Standard Multi-Armed Bandit (MAB) problems assume that the arms are independent. However, in many application scenarios, the information obtained by playing an arm provides information about the remainder of the arms. Hence, in such applications, this informativeness can and should be exploited to ..."
Abstract
- Add to MetaCart
(Show Context)
Standard Multi-Armed Bandit (MAB) problems assume that the arms are independent. However, in many application scenarios, the information obtained by playing an arm provides information about the remainder of the arms. Hence, in such applications, this informativeness can and should be exploited to enable faster convergence to the optimal solution. In this paper, we formalize a new class of multi-armed bandit methods, the Global Multi-armed Bandit (GMAB), in which arms are globally informative through a global parameter, i.e., choosing an arm reveals information about all the arms. We propose a greedy policy for the GMAB which always selects the arm with the highest estimated expected reward, and prove that it achieves bounded parameter-dependent regret. Hence, this policy selects suboptimal arms only finitely many times, and after a finite number of initial time steps, the optimal arm is selected in all of the remaining time steps with probability one. In addition, we also study how the informativeness of the arms about each other's rewards affects the speed of learning. Specifically, we prove that the parameter-free (worst-case) regret is sublinear in time, and decreases with the informativeness of the arms. We also prove a sublinear-in-time Bayesian risk bound for the GMAB which reduces to the well-known Bayesian risk bound for linearly parameterized bandits when the arms are fully informative. GMABs have applications ranging from drug dosage control to dynamic pricing.
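The following toy sketch illustrates the greedy policy this abstract describes, under the assumption that each arm's expected reward is a known function of a single scalar global parameter; the grid-based, weighted least-squares estimator of that parameter is an illustrative stand-in, not the estimator analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical known reward functions of a single global parameter theta in [0, 1].
funcs = [lambda th: th, lambda th: 1.0 - th, lambda th: 4.0 * th * (1.0 - th)]
K, T = len(funcs), 2000
true_theta = 0.35                               # unknown to the learner
grid = np.linspace(0.0, 1.0, 501)               # candidate values of the global parameter

sum_r = np.zeros(K)
n = np.zeros(K)

for t in range(T):
    if t < K:
        arm = t                                 # play each arm once to initialize
    else:
        # Estimate theta by matching empirical arm means to the known functions,
        # weighting each arm by how often it has been played.
        means = sum_r / n
        err = sum(n[k] * (f(grid) - means[k]) ** 2 for k, f in enumerate(funcs))
        theta_hat = grid[np.argmin(err)]
        # Greedy step: pull the arm whose predicted mean under theta_hat is largest.
        arm = int(np.argmax([f(theta_hat) for f in funcs]))
    reward = funcs[arm](true_theta) + 0.1 * rng.normal()
    n[arm] += 1
    sum_r[arm] += reward
```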
Global Multi-armed Bandits with Hölder Continuity
- UNDER REVIEW BY AISTATS 2014
"... Standard Multi-Armed Bandit (MAB) problems assume that the arms are independent. However, in many application scenarios, the information obtained by playing an arm provides information about the remainder of the arms. Hence, in such applications, this informativeness can and should be exploited to e ..."
Abstract
- Add to MetaCart
Standard Multi-Armed Bandit (MAB) problems assume that the arms are independent. However, in many application scenarios, the information obtained by playing an arm provides information about the remainder of the arms. Hence, in such applications, this informativeness can and should be exploited to enable faster convergence to the optimal solution. In this paper, we introduce and formalize the Global MAB (GMAB), in which arms are globally informative through a global parameter, i.e., choosing an arm reveals information about all the arms. We propose a greedy policy for the GMAB which always selects the arm with the highest estimated expected reward, and prove that it achieves bounded parameter-dependent regret. Hence, this policy selects suboptimal arms only finitely many times, and after a finite number of initial time steps, the optimal arm is selected in all of the remaining time steps with probability one. In addition, we also study how the informativeness of the arms about each other's rewards affects the speed of learning. Specifically, we prove that the parameter-free (worst-case) regret is sublinear in time, and decreases with the informativeness of the arms. We also prove a sublinear-in-time Bayesian risk bound for the GMAB which reduces to the well-known Bayesian risk bound for linearly parameterized bandits when the arms are fully informative. GMABs have applications ranging from drug and treatment discovery to dynamic pricing. Preliminary work.
Γ-UCB: A multiplicative UCB strategy for Gamma rewards
"... We consider the stochastic multi-armed bandit problem where rewards are distributed according to Gamma probability measures (unknown up to a lower bound on the form factor). To handle this problem, we propose an UCB-like strategy where indexes are multiplicative (sampled mean times a scaling factor) ..."
Abstract
- Add to MetaCart
(Show Context)
We consider the stochastic multi-armed bandit problem where rewards are distributed according to Gamma probability measures (unknown up to a lower bound on the form factor). To handle this problem, we propose a UCB-like strategy where the indexes are multiplicative (sampled mean times a scaling factor). An upper bound on the associated regret is provided and the proposed strategy is illustrated on some simple experiments.
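A minimal sketch of a multiplicative index of the kind described here: the empirical mean of each arm is multiplied by a factor greater than one that shrinks as the arm is pulled more often. The particular scaling factor below, and the problem parameters, are assumptions chosen for illustration; the index actually analyzed in the paper depends on the form-factor bound in a way not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(3)

K, T = 4, 5000
shape = 2.0                                     # assumed known lower bound on the form factor
true_scale = np.array([0.5, 0.8, 1.0, 0.6])     # hypothetical unknown scale parameters

sum_r = np.zeros(K)
n = np.zeros(K)

for t in range(1, T + 1):
    if t <= K:
        arm = t - 1                             # initialize: one pull per arm
    else:
        means = sum_r / n
        # Multiplicative index: empirical mean times a factor > 1 that decays as n grows.
        # This particular factor is illustrative; the paper derives the one used in its bound.
        factor = np.exp(np.sqrt(2.0 * np.log(t) / (shape * n)))
        arm = int(np.argmax(means * factor))
    reward = rng.gamma(shape, true_scale[arm])
    n[arm] += 1
    sum_r[arm] += reward
```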
Bounded Regret for Finite-Armed Structured Bandits
, 2014
"... We study a new type of K-armed bandit problem where the expected return of one arm may depend on the returns of other arms. We present a new algorithm for this general class of problems and show that under certain circumstances it is possible to achieve finite expected cumulative regret. We also giv ..."
Abstract
- Add to MetaCart
We study a new type of K-armed bandit problem where the expected return of one arm may depend on the returns of other arms. We present a new algorithm for this general class of problems and show that under certain circumstances it is possible to achieve finite expected cumulative regret. We also give problem-dependent lower bounds on the cumulative regret, showing that, at least in special cases, the new algorithm is nearly optimal.
Optimality of Thompson Sampling for Gaussian Bandits Depends on Priors
, 2014
"... In stochastic bandit problems, a Bayesian policy called Thompson sampling (TS) has recently attracted much attention for its excellent empirical performance. However, the theoretical analysis of this policy is difficult and its asymptotic optimality is only proved for one-parameter models. In this p ..."
Abstract
- Add to MetaCart
(Show Context)
In stochastic bandit problems, a Bayesian policy called Thompson sampling (TS) has recently attracted much attention for its excellent empirical performance. However, the theoretical analysis of this policy is difficult, and its asymptotic optimality has only been proved for one-parameter models. In this paper we discuss the optimality of TS for the model of normal distributions with unknown means and variances, as one of the most fundamental examples of multiparameter models. First we prove that the expected regret of TS with the uniform prior achieves the theoretical bound, which is the first result to show that the asymptotic bound is achievable for the normal distribution model. Next we prove that TS with the Jeffreys prior and the reference prior cannot achieve the theoretical bound. Therefore the choice of prior is important for TS, and non-informative priors are sometimes risky in the case of multiparameter models.
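To make the role of the prior concrete, the sketch below runs Thompson sampling for Gaussian arms with unknown mean and variance under a prior proportional to (sigma^2)^(-alpha), sampled via the standard normal-inverse-gamma posterior. Here alpha = 0 plays the role of the uniform prior, and larger exponents stand in for the Jeffreys and reference priors discussed above; the exact exponents, the forced-exploration schedule, and the regret analysis are the paper's, not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(4)

def ts_gaussian(true_mu, true_sd, T=5000, alpha=0.0, n_init=4):
    """Thompson sampling for Gaussian arms with unknown mean and variance,
    under the (possibly improper) prior pi(mu, sigma^2) ~ (sigma^2)^(-alpha).
    alpha = 0.0 plays the role of the uniform prior; larger exponents stand in
    for the Jeffreys / reference priors (exact exponents are in the paper)."""
    K = len(true_mu)
    obs = [[] for _ in range(K)]
    picks = np.zeros(K, dtype=int)
    for t in range(T):
        if t < K * n_init:
            arm = t % K                          # a few forced pulls so every posterior is proper
        else:
            samples = np.empty(K)
            for k in range(K):
                x = np.asarray(obs[k])
                num, xbar = len(x), x.mean()
                S = ((x - xbar) ** 2).sum()
                # sigma^2 | data  ~  Inv-Gamma(alpha + (num - 3) / 2, S / 2)
                var = (S / 2.0) / rng.gamma(alpha + (num - 3) / 2.0)
                # mu | sigma^2, data  ~  Normal(xbar, sigma^2 / num)
                samples[k] = rng.normal(xbar, np.sqrt(var / num))
            arm = int(np.argmax(samples))
        obs[arm].append(rng.normal(true_mu[arm], true_sd[arm]))
        picks[arm] += 1
    return picks

# With alpha = 0.0 (uniform prior) the better arm should dominate the pull counts.
print(ts_gaussian(np.array([0.0, 0.5]), np.array([1.0, 1.0]), alpha=0.0))
```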