Results 1-10 of 25
Parallelizing exploration-exploitation tradeoffs with Gaussian process bandit optimization
In Proc. International Conference on Machine Learning
, 2012
"... How can we take advantage of opportunities for experimental parallelization in explorationexploitation tradeoffs? In many experimental scenarios, it is often desirable to execute experiments simultaneously or in batches, rather than only performing one at a time. Additionally, observations may be ..."
Abstract

Cited by 20 (4 self)
 Add to MetaCart
How can we take advantage of opportunities for experimental parallelization in exploration-exploitation tradeoffs? In many experimental scenarios, it is often desirable to execute experiments simultaneously or in batches, rather than only performing one at a time. Additionally, observations may be both noisy and expensive. We introduce Gaussian Process Batch Upper Confidence Bound (GP-BUCB), an upper confidence bound-based algorithm, which models the reward function as a sample from a Gaussian process and which can select batches of experiments to run in parallel. We prove a general regret bound for GP-BUCB, as well as the surprising result that for some common kernels, the asymptotic average regret can be made independent of the batch size. The GP-BUCB algorithm is also applicable in the related case of a delay between initiation of an experiment and observation of its results, for which the same regret bounds hold. We also introduce Gaussian Process Adaptive Upper Confidence Bound (GP-AUCB), a variant of GP-BUCB which can exploit parallelism in an adaptive manner. We evaluate GP-BUCB and GP-AUCB on several simulated and real data sets. These experiments show that GP-BUCB and GP-AUCB are competitive with state-of-the-art heuristics.
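The key observation behind batch selection in this abstract is that a GP's posterior variance depends only on where you have sampled, not on what you observed, so future variances can be "hallucinated" before any rewards come back. A minimal sketch of that idea (all kernel parameters, the candidate grid, and the fantasy rule are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np

def rbf(a, b, ls=0.2):
    # Squared-exponential kernel between two 1-D point sets.
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def gp_posterior(Xc, Xt, yt, noise=1e-2):
    # Standard GP posterior mean and variance on candidate points Xc.
    K = rbf(Xt, Xt) + noise * np.eye(len(Xt))
    k = rbf(Xc, Xt)
    Ki = np.linalg.inv(K)
    mean = k @ Ki @ yt
    var = 1.0 - np.sum((k @ Ki) * k, axis=1)
    return mean, np.maximum(var, 1e-12)

def gp_bucb_batch(Xc, Xt, yt, B=3, beta=2.0):
    # Pick B points for parallel evaluation. The mean is frozen for the
    # whole batch; only the variance is updated with hallucinated
    # observations, which is legitimate because the GP posterior
    # variance does not depend on the observed reward values.
    mean, _ = gp_posterior(Xc, Xt, yt)
    Xf, yf = Xt.copy(), yt.copy()
    batch = []
    for _ in range(B):
        _, var = gp_posterior(Xc, Xf, yf)
        i = int(np.argmax(mean + beta * np.sqrt(var)))
        batch.append(i)
        Xf = np.append(Xf, Xc[i])
        yf = np.append(yf, mean[i])  # hallucinated observation
    return batch
```

Because each hallucination collapses the variance at the chosen point, successive UCB picks are pushed toward unexplored regions, which is what makes the batch diverse.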
Learning to Optimize Via Posterior Sampling
, 2013
"... This paper considers the use of a simple posterior sampling algorithm to balance between exploration and exploitation when learning to optimize actions such as in multiarmed bandit problems. The algorithm, also known as Thompson Sampling, offers significant advantages over the popular upper confide ..."
Abstract

Cited by 18 (8 self)
 Add to MetaCart
(Show Context)
This paper considers the use of a simple posterior sampling algorithm to balance between exploration and exploitation when learning to optimize actions, such as in multi-armed bandit problems. The algorithm, also known as Thompson Sampling, offers significant advantages over the popular upper confidence bound (UCB) approach, and can be applied to problems with finite or infinite action spaces and complicated relationships among action rewards. We make two theoretical contributions. The first establishes a connection between posterior sampling and UCB algorithms. This result lets us convert regret bounds developed for UCB algorithms into Bayes risk bounds for posterior sampling. Our second theoretical contribution is a Bayes risk bound for posterior sampling that applies broadly and can be specialized to many model classes. This bound depends on a new notion we refer to as the margin dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm Bayes risk bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. Further, our analysis provides insight into performance advantages of posterior sampling, which are highlighted through simulation results that demonstrate performance surpassing recently proposed UCB algorithms.
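Posterior sampling itself is a few lines of code: sample a model from the posterior, act greedily on the sample. A minimal sketch for independent Gaussian arms (the unit-variance rewards, the N(0, 1) prior, and the horizon are assumptions made for illustration, not the paper's general setting):

```python
import numpy as np

def thompson_sampling(true_means, horizon=2000, seed=0):
    # Gaussian Thompson sampling: unit-variance rewards and an
    # independent N(0, 1) prior on each arm's mean, so the posterior
    # after n pulls with reward sum s is N(s / (n + 1), 1 / (n + 1)).
    rng = np.random.default_rng(seed)
    k = len(true_means)
    n = np.zeros(k)   # pull counts
    s = np.zeros(k)   # reward sums
    for _ in range(horizon):
        samples = rng.normal(s / (n + 1.0), 1.0 / np.sqrt(n + 1.0))
        a = int(np.argmax(samples))          # act greedily on the sampled model
        s[a] += true_means[a] + rng.normal() # noisy reward
        n[a] += 1.0
    return n  # pull counts per arm
```

The randomness of the posterior sample is the entire exploration mechanism: arms pulled rarely keep wide posteriors and occasionally produce the largest sample.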
High-dimensional Gaussian process bandits.
 In Advances in Neural Information Processing Systems (NIPS)
, 2013
"... Abstract Many applications in machine learning require optimizing unknown functions defined over a highdimensional space from noisy samples that are expensive to obtain. We address this notoriously hard challenge, under the assumptions that the function varies only along some lowdimensional subsp ..."
Abstract

Cited by 10 (0 self)
 Add to MetaCart
(Show Context)
Many applications in machine learning require optimizing unknown functions defined over a high-dimensional space from noisy samples that are expensive to obtain. We address this notoriously hard challenge, under the assumptions that the function varies only along some low-dimensional subspace and is smooth (i.e., it has a low norm in a Reproducing Kernel Hilbert Space). In particular, we present the SI-BO algorithm, which leverages recent low-rank matrix recovery techniques to learn the underlying subspace of the unknown function and applies Gaussian Process Upper Confidence sampling for optimization of the function. We carefully calibrate the exploration-exploitation tradeoff by allocating the sampling budget to subspace estimation and function optimization, and obtain the first subexponential cumulative regret bounds and convergence rates for Bayesian optimization in high dimensions under noisy observations. Numerical results demonstrate the effectiveness of our approach in difficult scenarios.
Online Learning for Linearly Parametrized Control Problems
, 2012
"... Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of ..."
Abstract

Cited by 6 (5 self)
 Add to MetaCart
(Show Context)
Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author’s prior written permission.
Parallel Gaussian process optimization with upper confidence bound and pure exploration
 In Machine Learning and Knowledge Discovery in Databases
, 2013
"... Abstract. In this paper, we consider the challenge of maximizing an unknown function f for which evaluations are noisy and are acquired with high cost. An iterative procedure uses the previous measures to actively select the next estimation of f which is predicted to be the most useful. We focus on ..."
Abstract

Cited by 5 (2 self)
 Add to MetaCart
(Show Context)
In this paper, we consider the challenge of maximizing an unknown function f for which evaluations are noisy and are acquired with high cost. An iterative procedure uses the previous measurements to actively select the next evaluation of f which is predicted to be the most useful. We focus on the case where the function can be evaluated in parallel with batches of fixed size and analyze the benefit compared to the purely sequential procedure in terms of cumulative regret. We introduce the Gaussian Process Upper Confidence Bound and Pure Exploration algorithm (GP-UCB-PE), which combines the UCB strategy and Pure Exploration in the same batch of evaluations along the parallel iterations. We prove theoretical upper bounds on the regret with batches of size K for this procedure, which show the improvement of the order of K for fixed iteration cost over purely sequential versions. Moreover, the multiplicative constants involved have the property of being dimension-free. We also confirm empirically the efficiency of GP-UCB-PE on real and synthetic problems compared to state-of-the-art competitors.
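The batch rule described above splits naturally into one UCB pick followed by K-1 pure-exploration picks. A hedged sketch (the full algorithm restricts pure exploration to a high-probability "relevant region", which this toy version omits; kernel and parameters are illustrative):

```python
import numpy as np

def rbf(a, b, ls=0.2):
    # Squared-exponential kernel between two 1-D point sets.
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def posterior(Xc, Xt, yt, noise=1e-2):
    # GP posterior mean and variance on candidate points Xc.
    K = rbf(Xt, Xt) + noise * np.eye(len(Xt))
    k = rbf(Xc, Xt)
    Ki = np.linalg.inv(K)
    return k @ Ki @ yt, np.maximum(1.0 - np.sum((k @ Ki) * k, axis=1), 1e-12)

def gp_ucb_pe_batch(Xc, Xt, yt, K_batch=3, beta=2.0):
    # First point of the batch: classic UCB pick.
    mean, var = posterior(Xc, Xt, yt)
    first = int(np.argmax(mean + beta * np.sqrt(var)))
    batch = [first]
    # Remaining K-1 points: pure exploration, i.e. greedily shrink the
    # posterior variance. Only the inputs matter for the variance, so
    # the fantasy observation values can be arbitrary (zeros here).
    Xf = np.append(Xt, Xc[first])
    yf = np.append(yt, 0.0)
    for _ in range(K_batch - 1):
        _, var = posterior(Xc, Xf, yf)
        j = int(np.argmax(var))
        batch.append(j)
        Xf = np.append(Xf, Xc[j])
        yf = np.append(yf, 0.0)
    return batch
```

The exploration picks buy information that sharpens the UCB pick of the next batch, which is where the order-K regret improvement per iteration comes from.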
Eluder dimension and the sample complexity of optimistic exploration
 In Advances in Neural Information Processing Systems
, 2013
"... Abstract This paper considers the sample complexity of the multiarmed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper c ..."
Abstract

Cited by 5 (1 self)
 Add to MetaCart
(Show Context)
This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models.
Modeling Human Decision-making in Generalized Gaussian Multi-armed Bandits
, 2014
"... We present a formal model of human decisionmaking in exploreexploit tasks using the context of multiarmed bandit problems, where the decisionmaker must choose among multiple options with uncertain rewards. We address the standard multiarmed bandit problem, the multiarmed bandit problem with tr ..."
Abstract

Cited by 4 (4 self)
 Add to MetaCart
We present a formal model of human decision-making in explore-exploit tasks using the context of multi-armed bandit problems, where the decision-maker must choose among multiple options with uncertain rewards. We address the standard multi-armed bandit problem, the multi-armed bandit problem with transition costs, and the multi-armed bandit problem on graphs. We focus on the case of Gaussian rewards in a setting where the decision-maker uses Bayesian inference to estimate the reward values. We model the decision-maker's prior knowledge with the Bayesian prior on the mean reward. We develop the upper credible limit (UCL) algorithm for the standard multi-armed bandit problem and show that this deterministic algorithm achieves logarithmic cumulative expected regret, which is optimal performance for uninformative priors. We show how good priors and good assumptions on the correlation structure among arms can greatly enhance decision-making performance, even over short time horizons. We extend to the stochastic UCL algorithm and draw several connections to human decision-making behavior. We present empirical data from human experiments and show that human performance is efficiently captured by the stochastic UCL algorithm with appropriate parameters. For the multi-armed bandit problem with transition costs and the multi-armed bandit problem on graphs, we generalize the UCL algorithm to the block UCL algorithm and the graphical block UCL algorithm, respectively. We show that these algorithms also achieve logarithmic cumulative expected regret and require a sublogarithmic expected number of transitions among arms. We further illustrate the performance of these algorithms with numerical examples.
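The deterministic UCL index replaces the usual UCB bonus with a Bayesian quantile of the posterior. A minimal sketch for independent arms (the broad N(0, 100) prior, the 1 - 1/(e*t) quantile schedule, and unit reward variance are illustrative assumptions in the spirit of the abstract, not the paper's exact constants):

```python
import numpy as np
from statistics import NormalDist

def ucl_bandit(true_means, horizon=1000, seed=0):
    # Deterministic UCL: each round plays the arm with the largest
    # (1 - 1/(e*t)) upper credible limit of its Gaussian posterior.
    rng = np.random.default_rng(seed)
    k = len(true_means)
    prior_prec = 0.01          # precision of a broad N(0, 100) prior
    n = np.zeros(k)            # pull counts
    s = np.zeros(k)            # reward sums
    inv_cdf = NormalDist().inv_cdf
    for t in range(1, horizon + 1):
        prec = n + prior_prec  # posterior precision per arm
        q = inv_cdf(1.0 - 1.0 / (np.e * t))      # credibility quantile
        ucl = s / prec + q / np.sqrt(prec)       # posterior mean + quantile * std
        a = int(np.argmax(ucl))
        s[a] += true_means[a] + rng.normal()     # unit-variance Gaussian reward
        n[a] += 1.0
    return n
```

An informative prior enters simply through `prior_prec` and a nonzero prior mean in `s / prec`, which is how good prior knowledge accelerates the algorithm in the paper's human-subject comparisons.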
An information-theoretic analysis of Thompson sampling.
, 2014
"... Abstract We provide an informationtheoretic analysis of Thompson sampling that applies across a broad range of online optimization problems in which a decisionmaker must learn from partial feedback. This analysis inherits the simplicity and elegance of information theory and leads to regret bound ..."
Abstract

Cited by 4 (1 self)
 Add to MetaCart
(Show Context)
We provide an information-theoretic analysis of Thompson sampling that applies across a broad range of online optimization problems in which a decision-maker must learn from partial feedback. This analysis inherits the simplicity and elegance of information theory and leads to regret bounds that scale with the entropy of the optimal-action distribution. This strengthens preexisting results and yields new insight into how information improves performance.
On optimal foraging and multi-armed bandits
 In Proc. 51st Annu. Allerton Conf. Commun. Control Comput.
, 2013
"... Abstract—We consider two variants of the standard multiarmed bandit problem, namely, the multiarmed bandit problem with transition costs and the multiarmed bandit problem on graphs. We develop block allocation algorithms for these problems that achieve an expected cumulative regret that is unifo ..."
Abstract

Cited by 4 (3 self)
 Add to MetaCart
(Show Context)
We consider two variants of the standard multi-armed bandit problem, namely, the multi-armed bandit problem with transition costs and the multi-armed bandit problem on graphs. We develop block allocation algorithms for these problems that achieve an expected cumulative regret that is uniformly dominated by a logarithmic function of time, and an expected cumulative number of transitions from one arm to another uniformly dominated by a double-logarithmic function of time. We observe that the multi-armed bandit problem with transition costs and the associated block allocation algorithm capture the key features of popular animal foraging models in the literature.
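The mechanism that keeps transitions rare is committing to the chosen arm for geometrically growing blocks. A hedged sketch (a plain UCB index and blocks that double the arm's pull count; the paper's block schedule and index differ in detail):

```python
import numpy as np

def block_ucb(true_means, horizon=2000, seed=0):
    # Block allocation: a UCB index chooses the arm, but we then commit
    # to it for a block long enough to roughly double its pull count.
    # Geometric blocks keep arm-to-arm transitions sparse in the horizon.
    rng = np.random.default_rng(seed)
    k = len(true_means)
    n = np.zeros(k)   # pull counts
    s = np.zeros(k)   # reward sums
    t, switches, last = 0, 0, -1
    while t < horizon:
        if t < k:
            a, block = t, 1                     # initialise: one pull per arm
        else:
            ucb = s / n + np.sqrt(2.0 * np.log(t) / n)
            a = int(np.argmax(ucb))
            block = int(n[a])                   # commit long enough to double n[a]
        if a != last and last >= 0:
            switches += 1                       # count arm-to-arm transitions
        last = a
        for _ in range(min(block, horizon - t)):
            s[a] += true_means[a] + rng.normal()
            n[a] += 1.0
            t += 1
    return n, switches
```

Since each block doubles the chosen arm's pull count, an arm can be the target of at most about log2(horizon) blocks, so the transition count grows far more slowly than the horizon, mirroring the patch-residence behavior of the foraging models the abstract mentions.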
A Stepwise uncertainty reduction approach to constrained global optimization
"... Using statistical emulators to guide sequential evaluations of complex computer experiments is now a wellestablished practice. When a model provides multiple outputs, a typical objective is to optimize one of the outputs with constraints (for instance, a threshold not to exceed) on the values o ..."
Abstract

Cited by 3 (0 self)
 Add to MetaCart
Using statistical emulators to guide sequential evaluations of complex computer experiments is now a well-established practice. When a model provides multiple outputs, a typical objective is to optimize one of the outputs with constraints (for instance, a threshold not to exceed) on the values of the other outputs. We propose here a new optimization strategy based on the stepwise uncertainty reduction paradigm, which offers an efficient tradeoff between exploration and local search near the boundaries. The strategy is illustrated on numerical examples.
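The paper's stepwise uncertainty reduction criterion integrates the expected posterior uncertainty, which is computation-heavy to reproduce here. As a lighter-weight stand-in that still illustrates emulator-guided constrained search, the classic feasibility-weighted expected improvement scores each candidate by EI on the objective emulator times the probability the constraint emulator stays below its threshold; this is a different, simpler criterion than the paper's SUR strategy, and all emulator statistics below are assumed given:

```python
import numpy as np
from statistics import NormalDist

_nd = NormalDist()

def constrained_acquisition(mu_f, sd_f, mu_c, sd_c, best_feasible, threshold):
    # Feasibility-weighted expected improvement (minimisation): for each
    # candidate, EI of the objective emulator below the best feasible
    # value so far, times the probability that the constraint emulator
    # falls below the threshold.
    scores = []
    for mf, sf, mc, sc in zip(mu_f, sd_f, mu_c, sd_c):
        z = (best_feasible - mf) / sf
        ei = sf * (z * _nd.cdf(z) + _nd.pdf(z))       # expected improvement
        p_feas = _nd.cdf((threshold - mc) / sc)       # P(constraint satisfied)
        scores.append(ei * p_feas)
    return int(np.argmax(scores))
```

Like SUR, the feasibility factor pulls sampling toward the constraint boundary, where the tradeoff between objective improvement and feasibility is decided.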