Results 1–10 of 12
Approximation algorithms for restless bandit problems
 CoRR
Abstract

Cited by 17 (1 self)
In this paper, we consider the restless bandit problem, which is one of the most well-studied generalizations of the celebrated stochastic multi-armed bandit problem in decision theory. In its ultimate generality, the restless bandit problem is known to be PSPACE-hard to approximate to any non-trivial factor, and little progress has been made on this problem despite its significance in modeling activity allocation under uncertainty. We make progress on this problem by showing that for an interesting and general subclass that we term Monotone bandits, a surprisingly simple and intuitive greedy policy yields a factor-2 approximation. Such greedy policies are termed index policies, and are popular due to their simplicity and their optimality for the stochastic multi-armed bandit problem. The Monotone bandit problem strictly generalizes the stochastic multi-armed bandit problem, and naturally models multi-project scheduling where the state of a project becomes increasingly uncertain when the project is not scheduled. We develop several novel techniques in the design and analysis of the index policy. Our algorithm proceeds by introducing a novel "balance" constraint to the dual of a well-known LP relaxation of the restless bandit problem. This is followed by a structural characterization of the optimal solution using both the exact primal and dual complementary slackness conditions. This yields an interpretation of the dual variables as potential functions, from which we derive the index policy and the associated analysis.
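The index-policy idea in this abstract can be illustrated with a toy sketch: at every step, play the arm whose index is currently highest, while the states of all arms (played or not) keep evolving. The depletion/recovery dynamics and the myopic index below are invented for illustration and are not the paper's LP-dual-based index.

```python
import random

random.seed(0)

# Toy restless bandit: playing an arm depletes its reward probability, and
# idle arms slowly recover -- so every arm's state changes at every step,
# which is what makes the bandit "restless". The myopic index used here
# (current reward probability) is only a stand-in for the index derived
# from the LP duals in the paper.

def index_policy(n_arms, horizon, cap=0.9, decay=0.5, recovery=0.1):
    p = [cap] * n_arms                    # true reward probability per arm
    total = 0
    for _ in range(horizon):
        chosen = max(range(n_arms), key=lambda a: p[a])   # index policy
        total += int(random.random() < p[chosen])
        for a in range(n_arms):
            if a == chosen:
                p[a] *= decay             # played arm is depleted
            else:
                p[a] += recovery * (cap - p[a])   # idle arms recover
    return total

total = index_policy(n_arms=3, horizon=1000)
```

Note that the policy naturally rotates through the arms as their states drift, which is the behavior index policies exploit in restless settings.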
Regret bounds for gaussian process bandit problems
 In AISTATS
, 2010
Abstract

Cited by 14 (1 self)
Bandit algorithms are concerned with trading off exploration against exploitation when a number of options are available but we can only learn their quality by experimenting with them. We consider the scenario in which the reward distribution for the arms is modelled by a Gaussian process and there is no noise in the observed reward. Our main result bounds the regret experienced by algorithms relative to the a posteriori optimal strategy of playing the best arm throughout, based on benign assumptions about the covariance function defining the Gaussian process. We further complement these upper bounds with corresponding lower bounds for particular covariance functions, demonstrating that in general there is at most a logarithmic looseness in our upper bounds.
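The setting described (noiseless observations of a reward function modelled by a Gaussian process) can be sketched as follows. The RBF kernel, the discrete arm grid, and the posterior-upper-bound selection rule are illustrative assumptions, not the algorithm analyzed in the paper.

```python
import numpy as np

# Noiseless GP bandit sketch on a discrete arm set: maintain the GP
# posterior over rewards and pull the arm with the highest posterior
# upper bound. Kernel and constants are arbitrary illustrative choices.

def rbf(x, y, ell=0.3):
    return np.exp(-0.5 * (x[:, None] - y[None, :]) ** 2 / ell ** 2)

def gp_bandit(f, arms, rounds, beta=2.0):
    xs, ys = [], []                      # noiseless observations so far
    for _ in range(rounds):
        if not xs:
            i = len(arms) // 2           # arbitrary first pull
        else:
            X = np.array(xs)
            K = rbf(X, X) + 1e-8 * np.eye(len(X))    # jitter for stability
            k_star = rbf(arms, X)
            alpha = np.linalg.solve(K, np.array(ys))
            mu = k_star @ alpha                       # posterior mean
            var = 1.0 - np.einsum('ij,ij->i', k_star,
                                  np.linalg.solve(K, k_star.T).T)
            i = int(np.argmax(mu + beta * np.sqrt(np.maximum(var, 0.0))))
        xs.append(arms[i])
        ys.append(f(arms[i]))
    return max(ys)                       # best reward found

arms = np.linspace(0, 1, 50)
best = gp_bandit(lambda x: -(x - 0.73) ** 2, arms, rounds=15)
```

With no observation noise the posterior variance collapses to zero at sampled points, so the upper-bound rule never wastes pulls re-exploring known arms unless their mean is genuinely competitive.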
Mortal Multi-Armed Bandits
Abstract

Cited by 7 (0 self)
We formulate and study a new variant of the k-armed bandit problem, motivated by e-commerce applications. In our model, arms have a (stochastic) lifetime after which they expire. In this setting an algorithm needs to continuously explore new arms, in contrast to the standard k-armed bandit model, in which arms are available indefinitely and exploration is reduced once an optimal arm is identified with near-certainty. The main motivation for our setting is online advertising, where ads have limited lifetime due to, for example, the nature of their content and their campaign budgets. An algorithm needs to choose among a large collection of ads, more than can be fully explored within the typical ad lifetime. We present an optimal algorithm for the state-aware (deterministic reward function) case, and build on this technique to obtain an algorithm for the state-oblivious (stochastic reward function) case. Empirical studies on various reward distributions, including one derived from a real-world ad serving application, show that the proposed algorithms significantly outperform the standard multi-armed bandit approaches applied to these settings.
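A minimal sketch of the state-oblivious setting: arms die at random and are replaced by fresh draws, so exploration never stops paying off. The epsilon-greedy rule and all constants below are placeholders for illustration, not the paper's algorithms.

```python
import random

random.seed(1)

# Mortal-bandit sketch: each arm dies with probability death_p per step and
# is replaced by a fresh arm whose mean payoff is a new draw from the payoff
# distribution (uniform here). Statistics for a dead arm are discarded.

def mortal_bandit(n_arms=10, horizon=5000, death_p=0.01, eps=0.1):
    payoff = [random.random() for _ in range(n_arms)]  # true mean reward
    pulls = [0] * n_arms
    wins = [0] * n_arms
    total = 0
    for _ in range(horizon):
        if random.random() < eps or not any(pulls):
            a = random.randrange(n_arms)               # explore
        else:                                          # exploit best estimate
            a = max(range(n_arms),
                    key=lambda i: wins[i] / pulls[i] if pulls[i] else 0.0)
        r = int(random.random() < payoff[a])
        pulls[a] += 1
        wins[a] += r
        total += r
        for i in range(n_arms):                        # arms die and are reborn
            if random.random() < death_p:
                payoff[i] = random.random()
                pulls[i] = wins[i] = 0
    return total / horizon

avg = mortal_bandit()
```

Because the best arm keeps expiring, the average reward stays well below the "play the best arm forever" benchmark of the immortal setting, which is exactly the phenomenon the abstract highlights.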
Sort Me if You Can: How to Sort Dynamic Data
Abstract

Cited by 1 (1 self)
We formulate and study a new computational model for dynamic data. In this model the data changes gradually, and the goal of an algorithm is to compute the solution to some problem on the data at each time step, under the constraint that it has only limited access to the data each time. As the data is constantly changing and the algorithm might be unaware of these changes, it cannot be expected to always output the exact right solution; we are interested in algorithms that guarantee to output an approximate solution. In particular, we focus on the fundamental problems of sorting and selection, where the true ordering of the elements changes slowly. We provide algorithms with performance close to optimal in expectation and with high probability.
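The model can be sketched as follows, assuming one random adjacent swap of the true order per step and a fixed comparison budget spent on a cyclic bubble pass over the maintained estimate; both choices are illustrative, not the paper's algorithms.

```python
import random

random.seed(2)

# Dynamic-data sketch: at each step the true order changes by one random
# adjacent transposition, and the algorithm may make only `budget` order
# queries, spent on a cyclic bubble pass over its estimated ordering.

def dynamic_sort(n=50, steps=2000, budget=5):
    true = list(range(n))          # true[i] = element at rank i
    est = list(range(n))
    random.shuffle(est)            # start from an arbitrary estimate
    pos = 0
    for _ in range(steps):
        i = random.randrange(n - 1)            # the data drifts under us
        true[i], true[i + 1] = true[i + 1], true[i]
        for _ in range(budget):                # limited access per step
            j = pos % (n - 1)
            a, b = est[j], est[j + 1]
            if true.index(a) > true.index(b):  # one query on the live data
                est[j], est[j + 1] = b, a
            pos += 1
    # Kendall-tau distance between the estimate and the final true order
    rank = {v: k for k, v in enumerate(true)}
    inv = sum(1 for x in range(n) for y in range(x + 1, n)
              if rank[est[x]] > rank[est[y]])
    return inv

inv = dynamic_sort()
```

The estimate cannot become exactly sorted (the truth moves every step), but the local repairs keep the number of inversions far below that of a random permutation, which is the flavor of guarantee the paper formalizes.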
Dynamic Pricing with Limited Supply
, 2012
Abstract

Cited by 1 (0 self)
We consider the problem of designing revenue-maximizing online posted-price mechanisms when the seller has limited supply. A seller has k identical items for sale and faces n potential buyers ("agents") arriving sequentially. Each agent is interested in buying one item. Each agent's value for an item is an independent sample from some fixed (but unknown) distribution with support [0,1]. The seller offers a take-it-or-leave-it price to each arriving agent (possibly different for different agents), and aims to maximize his expected revenue. We focus on mechanisms that do not use any information about the distribution; such mechanisms are called detail-free (or prior-independent). They are desirable because knowing the distribution is unrealistic in many practical scenarios. We study how the revenue of such mechanisms compares to the revenue of the optimal offline mechanism that knows the distribution (the "offline benchmark"). We present a detail-free online posted-price mechanism whose revenue is at most O((k log n)^{2/3}) less than the offline benchmark, for every distribution that is regular. In fact, this guarantee holds without any assumptions if the benchmark is relaxed to fixed-price mechanisms. Further, we prove a matching lower bound. The performance guarantee for the same mechanism can be improved to O(√(k log n)), with a distribution-dependent constant, if the ratio k/n is sufficiently small. We show that, in the worst case over all demand distributions, this is essentially the best rate that can be obtained with a distribution-specific constant. On a technical level, we exploit the connection to multi-armed bandits (MAB). While dynamic pricing with unlimited supply can easily be seen as an MAB problem, the intuition behind MAB approaches breaks down when applied to the setting with limited supply. Our high-level conceptual contribution is that even the limited-supply setting can be fruitfully treated as a bandit problem.
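The MAB connection mentioned at the end can be sketched by treating a small price grid as bandit arms and stopping when supply runs out. The grid, the epsilon-greedy rule, and the uniform value distribution are illustrative assumptions and do not reproduce the paper's mechanism or its O((k log n)^{2/3}) guarantee.

```python
import random

random.seed(3)

# Detail-free posted pricing as a bandit: each price in a small grid is an
# arm, the reward of an offer is the revenue it yields, and the process
# stops once the k items are sold. Epsilon-greedy ignores the supply
# constraint, which is exactly the naivety the paper's analysis addresses.

def posted_price(n_agents=2000, k=100, prices=(0.2, 0.4, 0.6, 0.8), eps=0.1):
    offers = [0] * len(prices)
    revenue_by_price = [0.0] * len(prices)
    revenue, sold = 0.0, 0
    for _ in range(n_agents):
        if sold >= k:
            break                                  # supply exhausted
        if random.random() < eps or 0 in offers:
            j = random.randrange(len(prices))      # explore a price
        else:                                      # exploit best per-offer revenue
            j = max(range(len(prices)),
                    key=lambda i: revenue_by_price[i] / offers[i])
        value = random.random()                    # private value ~ U[0,1]
        offers[j] += 1
        if value >= prices[j]:                     # take-it-or-leave-it offer
            revenue += prices[j]
            revenue_by_price[j] += prices[j]
            sold += 1
    return revenue, sold

rev, sold = posted_price()
```

With k much smaller than n, maximizing per-offer revenue is the wrong objective (a higher price that sells the same k items is strictly better), which is one way to see why the unlimited-supply bandit intuition breaks down here.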
Sorting and Selection on Dynamic Data
Abstract
We formulate and study a new computational model for dynamic data. In this model, the data changes gradually, and the goal of an algorithm is to compute the solution to some problem on the data at each time step, under the constraint that it has only limited access to the data each time. As the data is constantly changing and the algorithm might be unaware of these changes, it cannot be expected to always output the exact right solution; we are interested in algorithms that guarantee to output an approximate solution. In particular, we focus on the fundamental problems of sorting and selection, where the true ordering of the elements changes slowly. We provide algorithms with performance close to optimal in expectation and with high probability.
JEUX DE BANDITS ET FONDATIONS DU CLUSTERING (Bandit Games and Foundations of Clustering). Reviewers: M. Olivier CATONI, CNRS and ENS
Abstract
First of all I want to thank you, Rémi. Working with you was a real pleasure; your constant enthusiasm, your availability, and your unique mathematical vision will stay with me for a long time. Jean-Yves, we have only just begun to explore our shared interests, and I feel that there is still a great deal left for us to do. I particularly wanted to thank you for sharing your always very stimulating ideas (to say the least...) with me. The customs of the academic world can sometimes be hard to penetrate; fortunately, in this respect I had an expert in the field as my mentor, Gilles. On the mathematical side you allowed me to start my thesis on solid foundations, and your help has been invaluable. I was lucky enough to be introduced to the world of research by you, Ulrike. You taught me how to do (hopefully) useful theoretical research, but also all the basic tricks that a researcher has to know. I wish both of us had more time to continue our exciting projects, but I am confident that in the near future we will collaborate again! In the cold and icy land of Alberta lived a man known for his perfect knowledge of the right references, but also for his constant kindness. Csaba, I am looking forward to (finally) starting a new project with you.
Charles River Analytics
Abstract
We present a novel methodology for decision-making by computer agents that leverages a computational concept of emotions. It is believed that emotions help living organisms perform well in complex environments. Can we use them to improve the decision-making performance of computer agents? We explore this possibility by formulating emotions as mathematical operators that serve to update the relative priorities of the agent's goals. The agent uses rudimentary domain knowledge to monitor the expectation that its goals are going to be accomplished in the future, and reacts to changes in this expectation by "experiencing emotions." The end result is a projection of the agent's long-run utility function, which might be too complex to optimize or even represent, onto a time-varying valuation function that is myopically maximized by selecting appropriate actions. Our methodology provides a systematic way to incorporate emotion into a decision-theoretic framework, and also provides a principled, domain-independent methodology for generating heuristics in novel situations. We test our agents in simulation in two domains: restless bandits and a simple foraging environment. Our results indicate that emotion-based agents outperform other reasonable heuristics for such difficult domains, and closely approach computationally expensive near-optimal solutions, whenever these are computable, while requiring only a fraction of the cost.
Editor: unknown
Abstract
Multi-armed bandit problems are considered a paradigm of the trade-off between exploring the environment to find profitable actions and exploiting what is already known. In the stationary case, where the distributions of the rewards do not change over time, Upper-Confidence Bound (UCB) policies, proposed in Agrawal (1995) and later analyzed in Auer et al. (2002), have been shown to be rate optimal. A challenging variant of the multi-armed bandit problem is the non-stationary bandit problem, where the gambler must decide which arm to play while facing the possibility of a changing environment. In this paper, we consider the situation where the distributions of rewards remain constant over epochs and change at unknown time instants. We analyze two algorithms: the discounted UCB and the sliding-window UCB. We establish for these two algorithms an upper bound on the expected regret by upper-bounding the expectation of the number of times a suboptimal arm is played. For that purpose, we derive a Hoeffding-type inequality for self-normalized deviations with a random number of summands. We establish a lower bound on the regret in the presence of abrupt changes in the arms' reward distributions. We show that the discounted UCB and the sliding-window UCB both match the lower bound up to a logarithmic factor.
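A sketch of the sliding-window idea: compute each arm's empirical mean and confidence bonus from the last `window` plays only, so statistics gathered before an abrupt change age out. The window size and confidence constant below are arbitrary illustrative choices, not the tuned values from the paper's analysis.

```python
import math
import random
from collections import deque

random.seed(4)

# Sliding-window UCB sketch on Bernoulli arms whose means change abruptly
# between epochs (at times unknown to the policy). Only the last `window`
# (arm, reward) pairs contribute to the means and confidence bonuses.

def sw_ucb(means_per_epoch, epoch_len, window=200, c=0.6):
    n_arms = len(means_per_epoch[0])
    history = deque()                 # (arm, reward) pairs, newest last
    total = 0
    t = 0
    for means in means_per_epoch:     # distributions change between epochs
        for _ in range(epoch_len):
            t += 1
            counts = [0] * n_arms
            sums = [0.0] * n_arms
            for arm, rew in history:  # windowed statistics only
                counts[arm] += 1
                sums[arm] += rew

            def ucb(a):
                if counts[a] == 0:
                    return float('inf')        # unseen in window: must try
                bonus = c * math.sqrt(math.log(min(t, window)) / counts[a])
                return sums[a] / counts[a] + bonus

            a = max(range(n_arms), key=ucb)
            r = int(random.random() < means[a])   # Bernoulli reward
            total += r
            history.append((a, r))
            if len(history) > window:
                history.popleft()                 # old observations age out
    return total

# Two abrupt changes: the best arm moves from arm 0 to arm 1 and back.
reward = sw_ucb([(0.9, 0.1), (0.1, 0.9), (0.9, 0.1)], epoch_len=1000)
```

Because an arm's samples eventually fall out of the window, its count drops back to zero and the policy is forced to retry it, which is what lets the mechanism detect that a formerly bad arm has become good.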
Sébastien BUBECK, JEUX DE BANDITS ET FONDATIONS DU CLUSTERING (Bandit Games and Foundations of Clustering). Reviewers: M. Olivier CATONI, CNRS and ENS
, 2013
Abstract
Your constant enthusiasm, your availability, and your unique mathematical vision will stay with me for a long time. Jean-Yves, we have only just begun to explore our shared interests, and I feel that there is still a great deal left for us to do. I particularly wanted to thank you for sharing your always very stimulating ideas (to say the least...) with me. The customs of the academic world can sometimes be hard to penetrate; fortunately, in this respect I had an expert in the field as my mentor, Gilles. On the mathematical side you allowed me to start my thesis on solid foundations, and your help has been invaluable. I was lucky enough to be introduced to the world of research by you, Ulrike. You taught me how to do (hopefully) useful theoretical research, but also all the basic tricks that a researcher has to know. I wish both of us had more time to continue our exciting projects, but I am confident that in the near future we will collaborate again! In the cold and icy land of Alberta lived a man known for his perfect knowledge of the right references, but also for his constant kindness. Csaba, I am looking forward to (finally) starting a new project with you.