Results 1-10 of 87
Nearly tight bounds for the continuum-armed bandit problem
Advances in Neural Information Processing Systems 17, 2005
Cited by 121 (7 self)
In the multi-armed bandit problem, an online algorithm must choose from a set of strategies in a sequence of n trials so as to minimize the total cost of the chosen strategies. While nearly tight upper and lower bounds are known in the case when the strategy set is finite, much less is known when there is an infinite strategy set. Here we consider the case when the set of strategies is a subset of R^d, and the cost functions are continuous. In the d = 1 case, we improve on the best-known upper and lower bounds, closing the gap to a sublogarithmic factor. We also consider the case where d > 1 and the cost functions are convex, adapting a recent online convex optimization algorithm of Zinkevich to the sparser feedback model of the multi-armed bandit problem.
Multi-Armed Bandits in Metric Spaces
STOC'08, 2008
Cited by 92 (11 self)
In a multi-armed bandit problem, an online algorithm chooses from a set of strategies in a sequence of n trials so as to maximize the total payoff of the chosen strategies. While the performance of bandit algorithms with a small finite strategy set is quite well understood, bandit problems with large strategy sets are still a topic of very active investigation, motivated by practical applications such as online auctions and web advertisement. The goal of such research is to identify broad and natural classes of strategy sets and payoff functions which enable the design of efficient solutions. In this work we study a very general setting for the multi-armed bandit problem in which the strategies form a metric space, and the payoff function satisfies a Lipschitz condition with respect to the metric. We refer to this problem as the Lipschitz MAB problem. We present a complete solution for the multi-armed problem in this setting. That is, for every metric space (L, X) we define an isometry invariant MaxMinCOV(X) which bounds from below the performance of Lipschitz MAB algorithms for X, and we present an algorithm which comes arbitrarily close to meeting this bound. Furthermore, our technique gives even better results for benign payoff functions.
The Value of Knowing a Demand Curve: Bounds on Regret for Online Posted-Price Auctions
In Proc. of the 44th IEEE Symp. on Foundations of Computer Science, 2003
Cited by 68 (7 self)
We consider the revenue-maximization problem for a seller with an unlimited supply of identical goods, interacting sequentially with a population of n buyers through an online posted-price auction mechanism, a paradigm which is frequently available to vendors selling goods over the Internet. For each buyer, the seller names a price between 0 and 1; the buyer decides whether or not to buy the item at the specified price, based on her privately-held valuation. The price offered is allowed to vary as the auction proceeds, as the seller gains information from interactions with the earlier buyers. The additive regret of a pricing strategy is defined to be the difference between the strategy’s expected revenue and the revenue derived from the optimal fixed-price strategy. In the case where buyers’ valuations are independent samples from a fixed probability distribution (usually specified by a demand curve), one can interpret the regret as specifying how much the seller should be willing to pay for knowledge of the demand curve from which buyers’ valuations are sampled. The answer to the problem depends on what assumptions one makes about the buyers’ valuations. We consider three such assumptions: that the valuations are all equal to some unknown number p, that they are independent samples from an unknown probability distribution, or that they are chosen by an oblivious adversary. In each case, we derive upper and lower bounds on regret which match within a factor of log n; the bounds match up to a constant factor in the case of identical valuations.
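The regret definition in this abstract can be illustrated with a small sketch. It is not the paper's algorithm: it assumes a fixed, already-realized list of valuations rather than an expectation, and it assumes a buyer purchases exactly when her valuation is at least the posted price. The function names are hypothetical.

```python
def revenue(prices, valuations):
    """Revenue of a price sequence: buyer i pays prices[i] if she buys.

    Assumption (not stated in the abstract): a buyer buys exactly when
    her valuation is at least the posted price.
    """
    return sum(p for p, v in zip(prices, valuations) if v >= p)


def best_fixed_price_revenue(valuations):
    """Revenue of the best single price offered to every buyer.

    Some optimal fixed price always equals one of the valuations, so it
    suffices to try each valuation as a candidate price.
    """
    return max(p * sum(1 for v in valuations if v >= p) for p in valuations)


def additive_regret(prices, valuations):
    """Best fixed-price revenue minus the revenue actually collected."""
    return best_fixed_price_revenue(valuations) - revenue(prices, valuations)


valuations = [0.5, 0.25, 0.75, 0.5]   # hypothetical buyers
prices = [0.5, 0.5, 0.5, 1.0]         # hypothetical posted prices
print(additive_regret(prices, valuations))  # 0.5
```

Here the best fixed price is 0.5 (three sales, revenue 1.5), while the posted sequence collects 1.0, so the regret is 0.5.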
An Adaptive Algorithm for Selecting Profitable Keywords for Search-Based Advertising Services
In EC ’06: Proceedings of the 7th ACM Conference on Electronic Commerce, 2006
Cited by 61 (0 self)
Increases in online searches have spurred the growth of search-based advertising services offered by search engines, enabling companies to promote their products to consumers based on search queries. With millions of available keywords whose click-thru rates and profits are highly uncertain, identifying the most profitable set of keywords becomes challenging. We formulate a stylized model of keyword selection in search-based advertising services. Assuming known profits and unknown click-thru rates, we develop an approximate adaptive algorithm that prioritizes keywords based on a prefix ordering – sorting of keywords in a descending order of expected-profit-to-cost ratio (or “bang-per-buck”). We show that the average expected profit generated by our algorithm converges to near-optimal profits, with the convergence rate that is independent of the number of keywords and scales gracefully with the problem’s parameters. By leveraging the special structure of our problem, our algorithm trades off bias with faster convergence rate, converging very quickly but with only near-optimal profit in the limit. Extensive numerical simulations show that when the number of keywords is large, our algorithm outperforms existing methods, increasing profits by about 20% in as little as 40 periods. We also extend our algorithm to the setting when both the click-thru rates and the expected profits are unknown.
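The prefix ordering mentioned in this abstract can be sketched as follows. This only illustrates the sorting step, not the paper's adaptive algorithm: the budget constraint, the tuple layout, and the function names are assumptions made for the example, and the ratios are taken as known.

```python
def prefix_order(keywords):
    """Sort (name, expected_profit, cost) tuples by descending profit-to-cost ratio."""
    return sorted(keywords, key=lambda k: k[1] / k[2], reverse=True)


def select_prefix(keywords, budget):
    """Take the longest prefix of the ordering whose total cost fits the budget.

    The budget is a hypothetical constraint added for illustration.
    """
    chosen, spent = [], 0.0
    for name, _profit, cost in prefix_order(keywords):
        if spent + cost > budget:
            break  # stop at the first keyword that no longer fits
        chosen.append(name)
        spent += cost
    return chosen


keywords = [
    ("shoes", 6.0, 2.0),   # ratio 3.0
    ("boots", 4.0, 4.0),   # ratio 1.0
    ("socks", 3.0, 0.5),   # ratio 6.0
    ("laces", 1.0, 2.0),   # ratio 0.5
]
print(select_prefix(keywords, budget=3.0))  # ['socks', 'shoes']
```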
Contextual Bandits with Similarity Information
24th Annual Conference on Learning Theory, 2011
Cited by 53 (8 self)
In a multi-armed bandit (MAB) problem, an online algorithm makes a sequence of choices. In each round it chooses from a time-invariant set of alternatives and receives the payoff associated with this alternative. While the case of small strategy sets is by now well-understood, a lot of recent work has focused on MAB problems with exponentially or infinitely large strategy sets, where one needs to assume extra structure in order to make the problem tractable. In particular, recent literature considered information on similarity between arms. We consider similarity information in the setting of contextual bandits, a natural extension of the basic MAB problem where before each round an algorithm is given the context – a hint about the payoffs in this round. Contextual bandits are directly motivated by placing advertisements on webpages, one of the crucial problems in sponsored search. A particularly simple way to represent similarity information in the contextual bandit setting is via a similarity distance between the context-arm pairs which bounds from above the difference between the respective expected payoffs. Prior work
Online optimization in X-armed bandits
In Advances in Neural Information Processing Systems 22, 2008
Cited by 47 (8 self)
We consider a generalization of stochastic bandit problems where the set of arms, X, is allowed to be a generic topological space and the mean-payoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy whose regret improves upon previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally Hölder with a known exponent, then the expected regret is bounded up to a logarithmic factor by √n, i.e., the rate of the growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm for the class of problems considered.
Algorithms for Infinitely Many-Armed Bandits
Cited by 44 (5 self)
We consider multi-armed bandit problems where the number of arms is larger than the possible number of experiments. We make a stochastic assumption on the mean-reward of a new selected arm which characterizes its probability of being a near-optimal arm. Our assumption is weaker than in previous works. We describe algorithms based on upper-confidence-bounds applied to a restricted set of randomly selected arms and provide upper-bounds on the resulting expected regret. We also derive a lower-bound which matches (up to a logarithmic factor) the upper-bound in some cases.
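The idea of applying upper confidence bounds to a restricted set of randomly selected arms can be sketched as below. This is a rough illustration under stated assumptions, not the paper's algorithm: it uses the standard UCB1 index as a stand-in, Bernoulli rewards, and a subset size k_arms chosen by the caller; the abstract specifies none of these, and all names are hypothetical.

```python
import math
import random


def ucb_on_random_subset(draw_arm_mean, n_rounds, k_arms, rng):
    """Sample k_arms arms from the arm reservoir, then run UCB1 on them only.

    draw_arm_mean(rng) returns the (hidden) mean reward of a freshly drawn arm.
    Returns the total reward and the number of pulls per sampled arm.
    """
    means = [draw_arm_mean(rng) for _ in range(k_arms)]  # hidden from the learner
    counts = [0] * k_arms
    sums = [0.0] * k_arms
    total = 0.0
    for t in range(1, n_rounds + 1):
        if t <= k_arms:
            i = t - 1  # pull each sampled arm once to initialize its estimate
        else:
            # UCB1 index: empirical mean plus an exploration bonus
            i = max(
                range(k_arms),
                key=lambda j: sums[j] / counts[j]
                + math.sqrt(2.0 * math.log(t) / counts[j]),
            )
        reward = 1.0 if rng.random() < means[i] else 0.0  # Bernoulli reward
        counts[i] += 1
        sums[i] += reward
        total += reward
    return total, counts


rng = random.Random(0)
total, counts = ucb_on_random_subset(lambda r: r.random(), 200, 5, rng)
print(total, counts)
```

How many arms to sample is the key design choice in this setting; here it is simply left as a parameter.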
Improved Rates for the Stochastic Continuum-Armed Bandit Problem
In 20th Conference on Learning Theory (COLT), 2007
Cited by 43 (5 self)
Considering one-dimensional continuum-armed bandit problems, we propose an improvement of an algorithm of Kleinberg and a new set of conditions which give rise to improved rates. In particular, we introduce a novel assumption that is complementary to the previous smoothness conditions, while at the same time smoothness of the mean payoff function is required only at the maxima. Under these new assumptions new bounds on the expected regret are derived. In particular, we show that apart from logarithmic factors, the expected regret scales with the square-root of the number of trials, provided that the mean payoff function has finitely many maxima and its second derivatives are continuous and non-vanishing at the maxima. This improves a previous result of Cope by weakening the assumptions on the function. We also derive matching lower bounds. To complement the bounds on the expected regret, we provide high probability bounds which exhibit similar scaling.
Online Linear Optimization and Adaptive Routing
2006
Cited by 39 (4 self)
This paper studies an online linear optimization problem generalizing the multi-armed bandit problem. Motivated primarily by the task of designing adaptive routing algorithms for overlay networks, we present two randomized online algorithms for selecting a sequence of routing paths in a network with unknown edge delays varying adversarially over time. In contrast with earlier work on this problem, we assume that the only feedback after choosing such a path is the total end-to-end delay of the selected path. We present two algorithms whose regret is sublinear in the number of trials and polynomial in the size of the network. The first of these algorithms generalizes to solve any online linear optimization problem, given an oracle for optimizing linear functions over the set of strategies; our work may thus be interpreted as a general-purpose reduction from offline to online linear optimization. A key element of this algorithm is the notion of a barycentric spanner, a special type of basis for the vector space of strategies which allows any feasible strategy to be expressed as a linear combination of basis vectors using bounded coefficients. We also present a second algorithm for the online shortest path problem, which solves the problem using a chain of online decision oracles, one at each node of the graph. This has several advantages over the online linear optimization approach. First, it is effective against an adaptive adversary, whereas our linear optimization algorithm assumes an oblivious adversary. Second, even in the case of an oblivious adversary, the second algorithm performs slightly better than the first, as measured by their additive regret.
Online Decision Problems with Large Strategy Sets
2005
Cited by 34 (3 self)
In an online decision problem, an algorithm performs a sequence of trials, each of which involves selecting one element from a fixed set of alternatives (the “strategy set”) whose costs vary over time. After T trials, the combined cost of the algorithm’s choices is compared with that of the single strategy whose combined cost is minimum. Their difference is called regret, and one seeks algorithms which are efficient in that their regret is sublinear in T and polynomial in the problem size. We study an important class of online decision problems called generalized multi-armed bandit problems. In the past such problems have found applications in areas as diverse as statistics, computer science, economic theory, and medical decision-making. Most existing algorithms were efficient only in the case of a small (i.e. polynomial-sized) strategy set. We extend the theory by supplying nontrivial algorithms and lower bounds for cases in which the strategy set is much larger (exponential or infinite) and
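The cost-based regret described in this abstract can be computed directly for a small finite strategy set. The sketch below assumes the full cost table is available after the fact, which is for illustration only; the function name and the data are hypothetical.

```python
def regret(costs, choices):
    """Combined cost of the algorithm's choices minus that of the best single strategy.

    costs[t][s] is the cost of strategy s at trial t; choices[t] is the
    strategy the algorithm picked at trial t.
    """
    algorithm_cost = sum(costs[t][s] for t, s in enumerate(choices))
    n_strategies = len(costs[0])
    best_single = min(sum(row[s] for row in costs) for s in range(n_strategies))
    return algorithm_cost - best_single


costs = [
    [1, 3, 2],
    [2, 1, 2],
    [3, 1, 0],
]
choices = [0, 1, 0]
print(regret(costs, choices))  # 1
```

The algorithm pays 1 + 1 + 3 = 5, the best single strategy (index 2) pays 4, so the regret over T = 3 trials is 1.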