Combinatorial Bandits
Cited by 46 (7 self)
We study sequential prediction problems in which, at each time instance, the forecaster chooses a binary vector from a certain fixed set S ⊆ {0, 1}^d and suffers a loss that is the sum of the losses of those vector components that are equal to one. The goal of the forecaster is to ensure that, in the long run, the accumulated loss is not much larger than that of the best possible vector in the class. We consider the “bandit” setting in which the forecaster has access only to the losses of the chosen vectors. We introduce a new general forecaster achieving a regret bound that, for a variety of concrete choices of S, is of order √(nd ln |S|), where n is the time horizon. This is not improvable in general and is better than previously known bounds. We also point out that computationally efficient implementations for various interesting choices of S exist.
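The setting in this abstract can be sketched in a few lines. The code below is not the forecaster introduced in the paper: it is a naive baseline that treats every vector in S as a separate arm and runs a plain Exp3-style exponential-weights update on importance-weighted losses. The action set (all vectors with exactly m ones), the synthetic losses, and the parameters eta and gamma are assumptions chosen only for illustration.

```python
import itertools
import math
import random


def combinatorial_loss(v, losses):
    """Loss of binary vector v: sum of the component losses where v is one."""
    return sum(l for bit, l in zip(v, losses) if bit == 1)


def exp3_over_set(S, loss_rounds, eta, gamma, rng):
    """Naive Exp3 that treats each vector in S as one arm (bandit feedback).

    Returns (total loss of the forecaster, total loss of the best fixed vector).
    """
    d = len(S[0])
    weights = [1.0] * len(S)
    total = 0.0
    for losses in loss_rounds:
        wsum = sum(weights)
        # Mix the exponential weights with uniform exploration.
        probs = [(1 - gamma) * w / wsum + gamma / len(S) for w in weights]
        i = rng.choices(range(len(S)), weights=probs)[0]
        # Only the scalar loss of the chosen vector is revealed.
        observed = combinatorial_loss(S[i], losses)
        total += observed
        # Importance-weighted estimate, scaled into [0, 1] by dividing by d.
        estimate = (observed / d) / probs[i]
        weights[i] *= math.exp(-eta * estimate)
    best = min(sum(combinatorial_loss(v, l) for l in loss_rounds) for v in S)
    return total, best


# Hypothetical instance: S = all binary vectors of length d with m ones.
rng = random.Random(0)
d, m, n = 5, 2, 2000
S = [tuple(1 if j in c else 0 for j in range(d))
     for c in itertools.combinations(range(d), m)]
# Assumed losses: the first m components are cheaper on average.
loss_rounds = [[rng.random() * (0.2 if j < m else 1.0) for j in range(d)]
               for _ in range(n)]
total, best = exp3_over_set(S, loss_rounds, eta=0.05, gamma=0.1, rng=rng)
regret = total - best
```

Because this baseline keeps one weight per element of S, its cost grows with |S|, which is exactly what a forecaster exploiting the component structure is meant to avoid.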
Optimistic Bayesian sampling in contextual-bandit problems
, 2011
Cited by 18 (1 self)
In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. We address the exploration-exploitation dilemma in a general setting encompassing both standard and contextualised bandit problems. The contextual bandit problem has recently resurfaced in attempts to maximise click-through rates in web-based applications, a task with significant commercial interest. In this article we consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. We extend the approach by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. This results in better directed exploratory behaviour. We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and measure its performance in simulated Bernoulli bandit and linear regression domains, and also when tested with the task of personalised news article recommendation on a Yahoo! Front Page Today Module data set. We find that OBS performs competitively when compared to recently proposed benchmark algorithms and outperforms Thompson’s method throughout.
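A minimal sketch of posterior sampling on a Bernoulli bandit may help make the two approaches concrete. The Thompson-style rule below samples each arm's Beta posterior and plays the largest sample. The optimistic variant is an assumption about how OBS works, since the abstract does not state the rule: here each sample is replaced by the maximum of the sample and the posterior mean, so an arm never scores below its current estimate. The uniform Beta(1, 1) prior and the arm means are also assumptions.

```python
import random


def choose_arm(successes, failures, rng, optimistic=False):
    """Pick an arm by sampling each Beta(1 + s, 1 + f) posterior."""
    best_arm, best_score = 0, -1.0
    for arm, (s, f) in enumerate(zip(successes, failures)):
        sample = rng.betavariate(1 + s, 1 + f)
        if optimistic:
            # Assumed optimistic rule: never score below the posterior mean.
            mean = (1 + s) / (2 + s + f)
            sample = max(sample, mean)
        if sample > best_score:
            best_arm, best_score = arm, sample
    return best_arm


def run_bernoulli_bandit(true_means, horizon, rng, optimistic=False):
    """Simulate the sampler and return the number of pulls per arm."""
    k = len(true_means)
    successes, failures = [0] * k, [0] * k
    for _ in range(horizon):
        arm = choose_arm(successes, failures, rng, optimistic)
        if rng.random() < true_means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


# Hypothetical three-armed problem, run once per rule.
pulls_ts = run_bernoulli_bandit([0.2, 0.5, 0.8], 2000, random.Random(1))
pulls_obs = run_bernoulli_bandit([0.2, 0.5, 0.8], 2000, random.Random(1),
                                 optimistic=True)
```

Taking the maximum with the posterior mean only ever raises an arm's score, and the amount it can raise it is larger when the posterior is wide, which matches the abstract's description of favouring uncertain actions.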
Learning optimally diverse rankings over large document collections
Cited by 17 (4 self)
Most learning-to-rank research has assumed that the utility of different documents is independent, which results in learned ranking functions that return redundant results. The few approaches that avoid this have rather unsatisfyingly lacked theoretical foundations, or do not scale. We present a learning-to-rank formulation that optimizes the fraction of satisfied users, with a scalable algorithm that explicitly takes document similarity and ranking context into account. We present theoretical justifications for this approach, as well as a near-optimal algorithm. Our evaluation adds optimizations that improve empirical performance, and shows that our algorithms learn orders of magnitude more quickly than previous approaches.
Contextual Gaussian Process Bandit Optimization
Cited by 15 (2 self)
How should we design experiments to maximize performance of a complex system, taking into account uncontrollable environmental conditions? How should we select relevant documents (ads) to display, given information about the user? These tasks can be formalized as contextual bandit problems, where at each round, we receive context (about the experimental conditions, the query), and have to choose an action (parameters, documents). The key challenge is to trade off exploration by gathering data for estimating the mean payoff function over the context-action space, and to exploit by choosing an action deemed optimal based on the gathered data. We model the payoff function as a sample from a Gaussian process defined over the joint context-action space, and develop CGP-UCB, an intuitive upper-confidence style algorithm. We show that by mixing and matching kernels for contexts and actions, CGP-UCB can handle a variety of practical applications. We further provide generic tools for deriving regret bounds when using such composite kernel functions. Lastly, we evaluate our algorithm on two case studies, in the context of automated vaccine design and sensor management. We show that context-sensitive optimization outperforms no or naive use of context.
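The idea of mixing and matching kernels can be illustrated with a small sketch. The abstract does not say how the context and action kernels are combined; the code below assumes two standard constructions, the product and the sum of a context kernel and an action kernel, each of which yields a valid kernel on the joint context-action space. The function names and the choice of RBF and linear base kernels are illustrative, not taken from the paper.

```python
import math


def rbf_kernel(x, y, lengthscale=1.0):
    """Squared-exponential kernel on real vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * lengthscale ** 2))


def linear_kernel(x, y):
    """Inner-product kernel on real vectors."""
    return sum(a * b for a, b in zip(x, y))


def product_kernel(k_context, k_action):
    """Joint kernel: two points are similar only if both parts are similar."""
    def k(p, q):
        (z1, s1), (z2, s2) = p, q
        return k_context(z1, z2) * k_action(s1, s2)
    return k


def additive_kernel(k_context, k_action):
    """Joint kernel: payoff modelled as a context part plus an action part."""
    def k(p, q):
        (z1, s1), (z2, s2) = p, q
        return k_context(z1, z2) + k_action(s1, s2)
    return k


# A joint point is a (context, action) pair of tuples.
p = ((0.0, 0.0), (1.0, 2.0))
q = ((0.0, 0.0), (3.0, 4.0))
k_prod = product_kernel(rbf_kernel, linear_kernel)
k_add = additive_kernel(rbf_kernel, linear_kernel)
prod_value = k_prod(p, q)   # identical contexts, so this equals the action part
add_value = k_add(p, q)
```

An upper-confidence rule would then score each candidate action by the Gaussian process posterior mean plus a multiple of the posterior standard deviation at the current context, computed with whichever joint kernel was chosen.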
Nonparametric bandits with covariates
 In COLT
, 2010
Cited by 13 (1 self)
We consider a bandit problem which involves sequential sampling from two populations (arms). Each arm produces a noisy reward realization which depends on an observable random covariate. The goal is to maximize cumulative expected reward. We derive general lower bounds on the performance of any admissible policy, and develop an algorithm whose performance achieves the order of said lower bound up to logarithmic terms. This is done by decomposing the global problem into suitably “localized” bandit problems. Proofs blend ideas from nonparametric statistics and traditional methods used in the bandit literature.
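The decomposition into localized problems can be sketched as follows, under several assumptions not stated in the abstract: the covariate lies in [0, 1], the localization is a fixed partition into equal-width bins, and each bin runs an independent UCB1 index policy over the two arms. The class name, the number of bins, and the reward model in the simulation are hypothetical.

```python
import math
import random


class BinnedTwoArmBandit:
    """Independent two-armed UCB1 policy in each bin of the covariate."""

    def __init__(self, n_bins):
        self.n_bins = n_bins
        self.counts = [[0, 0] for _ in range(n_bins)]
        self.sums = [[0.0, 0.0] for _ in range(n_bins)]

    def bin_of(self, x):
        """Map a covariate in [0, 1] to its bin index."""
        return min(int(x * self.n_bins), self.n_bins - 1)

    def choose(self, x):
        b = self.bin_of(x)
        # Play each arm once in a bin before using the index.
        for arm in (0, 1):
            if self.counts[b][arm] == 0:
                return arm
        t = sum(self.counts[b])

        def index(arm):
            n = self.counts[b][arm]
            return self.sums[b][arm] / n + math.sqrt(2.0 * math.log(t) / n)

        return max((0, 1), key=index)

    def update(self, x, arm, reward):
        b = self.bin_of(x)
        self.counts[b][arm] += 1
        self.sums[b][arm] += reward


# Assumed reward model: arm 0 pays off with probability x, arm 1 with 1 - x.
rng = random.Random(0)
policy = BinnedTwoArmBandit(n_bins=4)
horizon = 20000
for _ in range(horizon):
    x = rng.random()
    arm = policy.choose(x)
    mean = x if arm == 0 else 1.0 - x
    reward = 1.0 if rng.random() < mean else 0.0
    policy.update(x, arm, reward)
```

The number of bins trades bias against variance: narrow bins track the reward functions more closely but leave fewer samples per local problem.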
Dynamic Pricing with Limited Supply
, 2012
Cited by 13 (4 self)
We consider the problem of designing revenue-maximizing online posted-price mechanisms when the seller has limited supply. A seller has k identical items for sale and is facing n potential buyers (“agents”) that are arriving sequentially. Each agent is interested in buying one item. Each agent’s value for an item is an independent sample from some fixed (but unknown) distribution with support [0,1]. The seller offers a take-it-or-leave-it price to each arriving agent (possibly different for different agents), and aims to maximize his expected revenue. We focus on mechanisms that do not use any information about the distribution; such mechanisms are called detail-free (or prior-independent). They are desirable because knowing the distribution is unrealistic in many practical scenarios. We study how the revenue of such mechanisms compares to the revenue of the optimal offline mechanism that knows the distribution (“offline benchmark”). We present a detail-free online posted-price mechanism whose revenue is at most O((k log n)^{2/3}) less than the offline benchmark, for every distribution that is regular. In fact, this guarantee holds without any assumptions if the benchmark is relaxed to fixed-price mechanisms. Further, we prove a matching lower bound. The performance guarantee for the same mechanism can be improved to O(√(k log n)), with a distribution-dependent constant, if the ratio k/n is sufficiently small. We show that, in the worst case over all demand distributions, this is essentially the best rate that can be obtained with a distribution-specific constant. On a technical level, we exploit the connection to multi-armed bandits (MAB). While dynamic pricing with unlimited supply can easily be seen as an MAB problem, the intuition behind MAB approaches breaks when applied to the setting with limited supply. Our high-level conceptual contribution is that even the limited supply setting can be fruitfully treated as a bandit problem.
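The fixed-price mechanisms mentioned as a relaxed benchmark are easy to simulate. The sketch below is not the paper's mechanism: it only computes the revenue of offering one price to every arriving agent until the k items run out, and picks the best price on a grid in hindsight for one sampled sequence of values (the benchmark in the abstract is defined in expectation over the distribution). The uniform value distribution, the price grid, and the function names are assumptions.

```python
import random


def fixed_price_revenue(price, k, values):
    """Revenue from offering one price to agents in arrival order, k items."""
    sold = 0
    for value in values:
        if sold == k:
            break  # supply exhausted, later agents cannot buy
        if value >= price:
            sold += 1
    return price * sold


def best_fixed_price(k, values, grid):
    """Price on the grid with the highest revenue on this value sequence."""
    return max(grid, key=lambda p: fixed_price_revenue(p, k, values))


# Hypothetical instance: n = 1000 agents with uniform values, k = 50 items.
rng = random.Random(0)
values = [rng.random() for _ in range(1000)]
grid = [i / 20 for i in range(1, 20)]
p_star = best_fixed_price(50, values, grid)
revenue_star = fixed_price_revenue(p_star, 50, values)
```

With limited supply a low price sells out early to low-value agents, which is why the best fixed price depends on the ratio k/n and not only on the value distribution.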
Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising
Cited by 11 (0 self)
This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select the changes that would have improved the system performance. This work is illustrated by experiments on the ad placement system associated with the Bing search engine.
Graphical models for bandit problems
 In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence
, 2011
Cited by 9 (2 self)
We introduce a rich class of graphical models for multi-armed bandit problems that permit both the state or context space and the action space to be very large, yet succinctly specify the payoffs for any context-action pair. Our main result is an algorithm for such models whose regret is bounded by the number of parameters and whose running time depends only on the treewidth of the graph substructure induced by the action space.
Spectral Bandits for Smooth Graph Functions
 In 31st International Conference on Machine Learning
, 2014
Cited by 8 (5 self)
Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this paper, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as content-based recommendation. In this problem, each item we can recommend is a node and its expected rating is similar to that of its neighbors. The goal is to recommend items that have high expected ratings. We aim for algorithms whose cumulative regret with respect to the optimal policy does not scale poorly with the number of nodes. In particular, we introduce the notion of an effective dimension, which is small in real-world graphs, and propose two algorithms for solving our problem that scale linearly and sublinearly in this dimension. Our experiments on a real-world content recommendation problem show that a good estimator of user preferences for thousands of items can be learned from just tens of node evaluations.