Results 1 - 3 of 3
Thompson Sampling for Complex Online Problems
"... We consider stochastic multi-armed bandit prob-lems with complex actions over a set of basic arms, where the decision maker plays a complex action rather than a basic arm in each round. The reward of the complex action is some function of the basic arms ’ rewards, and the feedback ob-served may not ..."
Cited by 5 (0 self)
Abstract: We consider stochastic multi-armed bandit problems with complex actions over a set of basic arms, where the decision maker plays a complex action rather than a basic arm in each round. The reward of the complex action is some function of the basic arms' rewards, and the feedback observed may not necessarily be the reward per arm. For instance, when the complex actions are subsets of the arms, we may only observe the maximum reward over the chosen subset. Thus, feedback across complex actions may be coupled due to the nature of the reward function. We prove a frequentist regret bound for Thompson sampling in a very general setting involving parameter, action and observation spaces and a likelihood function over them. The bound holds for discretely-supported priors over the parameter space without additional structural properties such as closed-form posteriors, conjugate prior structure or independence across arms. The regret bound scales logarithmically with time but, more importantly, with an improved constant that non-trivially captures the coupling across complex actions due to the structure of the rewards. As applications, we derive improved regret bounds for classes of complex bandit problems involving selecting subsets of arms, including the first nontrivial regret bounds for nonlinear MAX reward feedback from subsets. Using particle filters for computing posterior distributions which lack an explicit closed-form, we present numerical results for the performance of Thompson sampling for subset-selection and job ...
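For intuition, here is a minimal sketch (not the authors' code) of Thompson sampling with a discretely-supported prior for the MAX-of-subset feedback setting described above. Because the candidate parameter grid is small, the posterior is reweighted exactly at each step rather than approximated with the particle filters mentioned in the abstract; all names and values (K, m, T, true_means, candidates, p_max_one) are illustrative assumptions.

    import itertools
    import numpy as np

    # Illustrative sketch only: Thompson sampling over a finite set of candidate
    # parameters for a complex bandit whose actions are size-m subsets of K
    # Bernoulli arms, with only the MAX reward of the chosen subset observed.

    rng = np.random.default_rng(0)
    K, m, T = 5, 2, 2000
    true_means = rng.uniform(0.1, 0.9, size=K)           # unknown to the learner

    # Discretely-supported prior: a uniform prior over a grid of mean vectors.
    candidates = np.array(list(itertools.product([0.2, 0.5, 0.8], repeat=K)))
    log_post = np.zeros(len(candidates))                  # log posterior weights
    subsets = list(itertools.combinations(range(K), m))

    def p_max_one(means, subset):
        # Probability that the MAX of the Bernoulli rewards in `subset` is 1.
        return 1.0 - np.prod(1.0 - means[list(subset)])

    for t in range(T):
        # 1) Sample a parameter vector from the current posterior.
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        theta = candidates[rng.choice(len(candidates), p=post)]
        # 2) Play the complex action (subset) that is optimal under the sample.
        S = max(subsets, key=lambda s: p_max_one(theta, s))
        # 3) Observe only the coupled MAX feedback, not per-arm rewards.
        y = int(rng.random() < p_max_one(true_means, S))
        # 4) Bayes update: reweight every candidate by the likelihood of y.
        likelihood = np.array([p_max_one(c, S) for c in candidates])
        log_post += np.log(np.where(y == 1, likelihood, 1.0 - likelihood) + 1e-12)

When the parameter space is too large to enumerate, the exact reweighting step above is the part the paper replaces with a particle filter.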
Better Optimism By Bayes: Adaptive Planning with Rich Models
"... The computational costs of inference and planning have confined Bayesian model-based reinforcement learning to one of two dismal fates: powerful Bayes-adaptive planning but only for simplistic models, or powerful, Bayesian non-parametric models but using simple, myopic planning strategies such as Th ..."
Cited by 1 (0 self)
Abstract: The computational costs of inference and planning have confined Bayesian model-based reinforcement learning to one of two dismal fates: powerful Bayes-adaptive planning but only for simplistic models, or powerful, Bayesian non-parametric models but using simple, myopic planning strategies such as Thompson sampling. We ask whether it is feasible and truly beneficial to combine rich probabilistic models with a closer approximation to fully Bayesian planning. First, we use a collection of counterexamples to show formal problems with the over-optimism inherent in Thompson sampling. Then we leverage state-of-the-art techniques in efficient Bayes-adaptive planning and non-parametric Bayesian methods to perform qualitatively ...
Bayesian Reinforcement Learning with Exploration
"... Abstract. We consider a general reinforcement learning problem and show that carefully combining the Bayesian optimal policy and an exploring policy leads to minimax sample-complexity bounds in a very general class of (history-based) environments. We also prove lower bounds and show that the new al ..."
Abstract: We consider a general reinforcement learning problem and show that carefully combining the Bayesian optimal policy and an exploring policy leads to minimax sample-complexity bounds in a very general class of (history-based) environments. We also prove lower bounds and show that the new algorithm displays adaptive behaviour when the environment is easier than worst-case.