Results 1–10 of 35
Parallelizing exploration-exploitation tradeoffs with Gaussian process bandit optimization
In Proc. International Conference on Machine Learning
, 2012
Abstract

Cited by 19 (4 self)
How can we take advantage of opportunities for experimental parallelization in exploration-exploitation tradeoffs? In many experimental scenarios, it is often desirable to execute experiments simultaneously or in batches, rather than only performing one at a time. Additionally, observations may be both noisy and expensive. We introduce Gaussian Process Batch Upper Confidence Bound (GP-BUCB), an upper confidence bound-based algorithm, which models the reward function as a sample from a Gaussian process and which can select batches of experiments to run in parallel. We prove a general regret bound for GP-BUCB, as well as the surprising result that for some common kernels, the asymptotic average regret can be made independent of the batch size. The GP-BUCB algorithm is also applicable in the related case of a delay between initiation of an experiment and observation of its results, for which the same regret bounds hold. We also introduce Gaussian Process Adaptive Upper Confidence Bound (GP-AUCB), a variant of GP-BUCB which can exploit parallelism in an adaptive manner. We evaluate GP-BUCB and GP-AUCB on several simulated and real data sets. These experiments show that GP-BUCB and GP-AUCB are competitive with state-of-the-art heuristics.
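To make the batch-selection idea concrete, here is a minimal sketch of the "hallucinated observation" trick behind GP-BUCB on a one-dimensional grid. The RBF kernel, length scale, confidence parameter, and toy data are all illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def rbf(a, b, ls=0.3):
    """Squared-exponential kernel on 1-D inputs (unit prior variance)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(x_obs, y_obs, x_grid, noise=1e-4):
    """Standard GP regression posterior mean and variance on a grid."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_obs, x_grid)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = 1.0 - np.sum(v * v, axis=0)  # prior variance is 1 on the diagonal
    return mu, np.maximum(var, 1e-12)

def select_batch(x_obs, y_obs, x_grid, batch_size, beta=2.0):
    """Greedy batch selection: maximize the UCB, then 'hallucinate' each
    chosen point -- the mean stays fixed at the last real feedback, while
    the posterior variance shrinks as if the point had been observed
    (variance depends only on the inputs, not on the observed values)."""
    mu, _ = gp_posterior(x_obs, y_obs, x_grid)  # mean from real data only
    batch, xs = [], list(x_obs)
    for _ in range(batch_size):
        _, var = gp_posterior(np.array(xs), np.zeros(len(xs)), x_grid)
        i = int(np.argmax(mu + beta * np.sqrt(var)))
        batch.append(float(x_grid[i]))
        xs.append(float(x_grid[i]))  # hallucinated observation at the pick
    return batch

x_obs = np.array([0.1, 0.9])
y_obs = np.array([0.2, 0.4])
x_grid = np.linspace(0.0, 1.0, 101)
batch = select_batch(x_obs, y_obs, x_grid, batch_size=3)
```

Because each hallucinated point collapses the variance in its neighborhood, the batch spreads out across the search space instead of piling onto one maximizer.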
Learning to Optimize Via Posterior Sampling
, 2013
Abstract

Cited by 17 (8 self)
This paper considers the use of a simple posterior sampling algorithm to balance between exploration and exploitation when learning to optimize actions, such as in multi-armed bandit problems. The algorithm, also known as Thompson Sampling, offers significant advantages over the popular upper confidence bound (UCB) approach, and can be applied to problems with finite or infinite action spaces and complicated relationships among action rewards. We make two theoretical contributions. The first establishes a connection between posterior sampling and UCB algorithms. This result lets us convert regret bounds developed for UCB algorithms into Bayes risk bounds for posterior sampling. Our second theoretical contribution is a Bayes risk bound for posterior sampling that applies broadly and can be specialized to many model classes. This bound depends on a new notion we refer to as the margin dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm Bayes risk bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. Further, our analysis provides insight into performance advantages of posterior sampling, which are highlighted through simulation results that demonstrate performance surpassing recently proposed UCB algorithms.
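The posterior sampling recipe itself is compact. A toy Beta-Bernoulli sketch (the arm probabilities and horizon are hypothetical, and this is the generic Thompson Sampling template rather than code from the paper):

```python
import numpy as np

def thompson_sampling(true_probs, horizon, rng):
    """Beta-Bernoulli Thompson Sampling: draw a mean from each arm's
    Beta posterior, pull the arm with the largest draw, then update."""
    k = len(true_probs)
    a = np.ones(k)  # Beta posterior: successes + 1
    b = np.ones(k)  # Beta posterior: failures + 1
    total = 0.0
    for _ in range(horizon):
        arm = int(np.argmax(rng.beta(a, b)))       # posterior draw per arm
        r = float(rng.random() < true_probs[arm])  # Bernoulli reward
        a[arm] += r
        b[arm] += 1.0 - r
        total += r
    return total, a, b

rng = np.random.default_rng(0)
total, a, b = thompson_sampling([0.2, 0.5, 0.8], horizon=2000, rng=rng)
```

Over a long horizon, the posteriors of clearly inferior arms concentrate below the leader's, so those arms are sampled less and less, which is exploration and exploitation in one mechanism.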
The Knowledge Gradient Algorithm For Online Subset Selection
Abstract

Cited by 8 (5 self)
Abstract — We derive a one-period look-ahead policy for online subset selection problems, where learning about one subset also gives us information about other subsets. We show that the resulting decision rule is easily computable, and present experimental evidence that the policy is competitive against other online learning policies.
Sequential Bayes-optimal policies for multiple comparisons with a control
, 2012
Abstract

Cited by 8 (5 self)
We consider the problem of efficiently allocating simulation effort to determine which of several simulated systems have mean performance exceeding a known threshold. This determination is known as multiple comparisons with a control. Within a Bayesian formulation, the optimal fully sequential policy for allocating simulation effort is the solution to a dynamic program. We show that this dynamic program can be solved efficiently, providing a tractable way to compute the Bayes-optimal policy. The solution uses techniques from optimal stopping and multi-armed bandits. We then present further theoretical results characterizing this Bayes-optimal policy, compare it numerically to several approximate policies, and apply it to an application in ambulance positioning. Key words: multiple comparisons with a control; sequential experimental design; dynamic programming; Bayesian statistics; value of information.
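As a hedged illustration of the allocation problem only (a naive heuristic, not the Bayes-optimal dynamic-programming policy the paper derives), one can combine conjugate normal updates with a rule that samples the system whose comparison with the threshold is most ambiguous. All system means, priors, and noise levels below are hypothetical:

```python
import math
import numpy as np

def allocate(mu0, sigma0, sample_sd, threshold, n_samples, simulate, rng):
    """Sequentially sample the system whose posterior probability of
    exceeding the threshold is closest to 1/2, i.e. the most ambiguous
    comparison with the control. Uses conjugate normal-normal updates."""
    mu = np.array(mu0, dtype=float)
    prec = 1.0 / np.array(sigma0, dtype=float) ** 2  # posterior precisions
    obs_prec = 1.0 / sample_sd ** 2
    for _ in range(n_samples):
        sd = 1.0 / np.sqrt(prec)
        p_above = np.array(
            [0.5 * (1.0 + math.erf((m - threshold) / (s * math.sqrt(2.0))))
             for m, s in zip(mu, sd)])
        i = int(np.argmin(np.abs(p_above - 0.5)))  # most ambiguous system
        y = simulate(i, rng)                       # one simulation replication
        mu[i] = (prec[i] * mu[i] + obs_prec * y) / (prec[i] + obs_prec)
        prec[i] += obs_prec
    return mu, 1.0 / np.sqrt(prec)

rng = np.random.default_rng(1)
truth = [0.0, 0.9, 2.0]  # hypothetical true system means
mu, sd = allocate(mu0=[1.0, 1.0, 1.0], sigma0=[1.0, 1.0, 1.0],
                  sample_sd=0.5, threshold=1.0, n_samples=50,
                  simulate=lambda i, r: truth[i] + r.normal(0.0, 0.5),
                  rng=rng)
```

Systems far above or below the threshold are quickly resolved and receive few samples; most of the budget flows to the system whose mean sits near the threshold.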
Information collection on a graph
, 2010
Abstract

Cited by 7 (6 self)
We derive a knowledge gradient policy for an optimal learning problem on a graph, in which we use sequential measurements to refine Bayesian estimates of individual edge values in order to learn about the best path. This problem differs from traditional ranking and selection, in that the implementation decision (the path we choose) is distinct from the measurement decision (the edge we measure). Our decision rule is easy to compute, and performs competitively against other learning policies, including a Monte Carlo adaptation of the knowledge gradient policy for ranking and selection.
Cheap but clever: Human active learning in a bandit setting
 In Proceedings of the Cognitive Science Society Conference
, 2013
Abstract

Cited by 7 (1 self)
How people achieve long-term goals in an imperfectly known environment, via repeated tries and noisy outcomes, is an important problem in cognitive science. There are two interrelated questions: how humans represent information, both what has been learned and what can still be learned, and how they choose actions, in particular how they negotiate the tension between exploration and exploitation. In this work, we examine human behavioral data in a multi-armed bandit setting, in which the subject chooses one of four “arms” to pull on each trial and receives a binary outcome (win/lose). We implement both the Bayes-optimal policy, which maximizes the expected cumulative reward in this finite-horizon bandit environment, as well as a variety of heuristic policies that vary in their complexity of information representation and decision policy. We find that the knowledge gradient algorithm, which combines exact Bayesian learning with a decision policy that maximizes a combination of immediate reward gain and long-term knowledge gain, captures subjects’ trial-by-trial choices best among all the models considered; it also provides the best approximation to the computationally intense optimal policy among all the heuristic policies.
Knowledge-gradient methods for statistical learning
, 2009
Abstract

Cited by 6 (4 self)
We consider the class of fully sequential Bayesian information collection problems, a class that includes ranking and selection problems, multi-armed bandit problems, and many others. Although optimal policies for such problems are generally known to exist and to satisfy Bellman’s recursion, the curses of dimensionality prevent us from actually computing them except in a few very special cases. Motivated by this difficulty, we develop a general class of practical and theoretically well-founded information collection policies known as knowledge-gradient (KG) policies. KG policies have several attractive qualities: they are myopically optimal in general; they are asymptotically optimal in a broad class of problems; they are flexible and may be computed easily in a broad class of problems; and they perform well numerically in several well-studied ranking and selection problems compared with other state-of-the-art policies designed specifically for these problems.
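For the ranking-and-selection special case with independent normal beliefs and known Gaussian measurement noise, the KG factor has a well-known closed form, nu_x = s_x * f(-|mu_x - max_{x' != x} mu_{x'}| / s_x) with f(z) = z*Phi(z) + phi(z). A small sketch with illustrative numbers (not taken from the thesis):

```python
import math
import numpy as np

def kg_factors(mu, sigma, noise_sd):
    """Knowledge-gradient factors for ranking and selection with
    independent normal beliefs and known Gaussian measurement noise.
    s_x is the one-step reduction in the standard deviation of the
    belief about alternative x; f(z) = z * Phi(z) + phi(z)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    nu = np.zeros(len(mu))
    for x in range(len(mu)):
        s = sigma[x] ** 2 / math.sqrt(noise_sd ** 2 + sigma[x] ** 2)
        best_other = max(mu[i] for i in range(len(mu)) if i != x)
        z = -abs(mu[x] - best_other) / s
        phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
        nu[x] = s * (z * Phi + phi)
    return nu

# Illustrative numbers: a well-known arm (index 1) slightly leads an
# uncertain arm (index 0); KG prefers to measure the uncertain contender.
nu = kg_factors(mu=[1.0, 1.2, 0.5], sigma=[0.8, 0.1, 0.8], noise_sd=1.0)
measure = int(np.argmax(nu))
```

The measured alternative is the one whose single extra observation most improves the expected value of the final choice, which is the myopic optimality the abstract refers to.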
Learning the demand curve in posted-price digital goods auctions
In Proceedings of the Tenth International Joint Conference on Autonomous Agents and Multiagent Systems
, 2011
Abstract

Cited by 5 (1 self)
Online digital goods auctions are settings where a seller with an unlimited supply of goods (e.g. music or movie downloads) interacts with a stream of potential buyers. In the posted price setting, the seller makes a take-it-or-leave-it offer to each arriving buyer. We study the seller’s revenue maximization problem in posted-price auctions of digital goods. We find that algorithms from the multi-armed bandit literature like UCB, which come with good regret bounds, can be slow to converge. We propose and study two alternatives: (1) a scheme based on using Gittins indices with priors that make appropriate use of domain knowledge; (2) a new learning algorithm, LLVD, that assumes a linear demand curve and maintains a Beta prior over the free parameter using a moment-matching approximation. LLVD is not only (approximately) optimal for linear demand, but also learns fast and performs well when the linearity assumption is violated, for example in the cases of two natural valuation distributions, exponential and log-normal.
Information collection for linear programs with uncertain objective coefficients
 SIAM Journal on Optimization
, 2012
Abstract

Cited by 4 (3 self)
Consider a linear program (LP) with uncertain objective coefficients, for which we have a Bayesian prior. We can collect information to improve our understanding of these coefficients, but this may be expensive, giving us a separate problem of optimizing the collection of information to improve the quality of the solution relative to the true cost coefficients. We formulate this information collection problem for LPs for the first time and derive a knowledge gradient policy which finds the marginal value of each measurement by solving a sequence of LPs. We prove that this policy is asymptotically optimal and demonstrate its performance on a network flow problem. Key words: optimal learning, stochastic optimization, sequential learning, stochastic programming.