• Documents
  • Authors
  • Tables
  • Other Seers ▼
    RefSeer AckSeer CollabSeer SeerSeer
  • Log in
  • Sign up
  • MetaCart

CiteSeerX logo

Advanced Search Include Citations
Advanced Search Include Citations | Disambiguate

Online Decision Problems with Large Strategy Sets (2005)

by R Kleinberg
Add To MetaCart

Tools

Sorted by:
Results 1 - 10 of 16
Next 10 →

An Adaptive Algorithm for Selecting Profitable Keywords for Search-Based Advertising Services

by Paat Rusmevichientong, David P. Williamson - In EC ’06: Proceedings of the 7th ACM conference on Electronic commerce , 2006
"... Increases in online searches have spurred the growth of search-based advertising services offered by search engines, enabling companies to promote their products to consumers based on search queries. With millions of available keywords whose clickthru rates and profits are highly uncertain, identify ..."
Abstract - Cited by 37 (0 self) - Add to MetaCart
Increases in online searches have spurred the growth of search-based advertising services offered by search engines, enabling companies to promote their products to consumers based on search queries. With millions of available keywords whose clickthru rates and profits are highly uncertain, identifying the most profitable set of keywords becomes challenging. We formulate a stylized model of keyword selection in search-based advertising services. Assuming known profits and unknown clickthru rates, we develop an approximate adaptive algorithm that prioritizes keywords based on a prefix ordering – sorting of keywords in a descending order of expected-profit-to-cost ratio (or “bang-per-buck”). We show that the average expected profit generated by our algorithm converges to near-optimal profits, with the convergence rate that is independent of the number of keywords and scales gracefully with the problem’s parameters. By leveraging the special structure of our problem, our algorithm trades off bias with faster convergence rate, converging very quickly but with only near-optimal profit in the limit. Extensive numerical simulations show that when the number of keywords is large, our algorithm outperforms existing methods, increasing profits by about 20 % in as little as 40 periods. We also extend our algorithm to the setting when both the clickthru rates and the expected profits are unknown. 1

Multi-Armed Bandits in Metric Spaces

by Robert Kleinberg, Aleksandrs Slivkins, Eli Upfal - STOC'08 , 2008
"... In a multi-armed bandit problem, an online algorithm chooses from a set of strategies in a sequence of n trials so as to maximize the total payoff of the chosen strategies. While the performance of bandit algorithms with a small finite strategy set is quite well understood, bandit problems with larg ..."
Abstract - Cited by 24 (4 self) - Add to MetaCart
In a multi-armed bandit problem, an online algorithm chooses from a set of strategies in a sequence of n trials so as to maximize the total payoff of the chosen strategies. While the performance of bandit algorithms with a small finite strategy set is quite well understood, bandit problems with large strategy sets are still a topic of very active investigation, motivated by practical applications such as online auctions and web advertisement. The goal of such research is to identify broad and natural classes of strategy sets and payoff functions which enable the design of efficient solutions. In this work we study a very general setting for the multiarmed bandit problem in which the strategies form a metric space, and the payoff function satisfies a Lipschitz condition with respect to the metric. We refer to this problem as the Lipschitz MAB problem. We present a complete solution for the multi-armed problem in this setting. That is, for every metric space (L, X) we define an isometry invariant MaxMinCOV(X) which bounds from below the performance of Lipschitz MAB algorithms for X, and we present an algorithm which comes arbitrarily close to meeting this bound. Furthermore, our technique gives even better results for benign payoff functions.

Algorithms for Infinitely Many-Armed Bandits

by Yizao Wang, Jean-yves Audibert, Rémi Munos
"... We consider multi-armed bandit problems where the number of arms is larger than the possible number of experiments. We make a stochastic assumption on the mean-reward of a new selected arm which characterizes its probability of being a near-optimal arm. Our assumption is weaker than in previous work ..."
Abstract - Cited by 11 (1 self) - Add to MetaCart
We consider multi-armed bandit problems where the number of arms is larger than the possible number of experiments. We make a stochastic assumption on the mean-reward of a new selected arm which characterizes its probability of being a near-optimal arm. Our assumption is weaker than in previous works. We describe algorithms based on upper-confidence-bounds applied to a restricted set of randomly selected arms and provide upper-bounds on the resulting expected regret. We also derive a lower-bound which matches (up to a logarithmic factor) the upper-bound in some cases. 1

High-Probability Regret Bounds for Bandit Online Linear Optimization

by Peter L. Bartlett, Varsha Dani, Alexander Rakhlin, Thomas P. Hayes, Ambuj Tewari, Sham M. Kakade
"... We present a modification of the algorithm of Dani et al. [8] for the online linear optimization problem in the bandit setting, which with high probability has regret at most O ∗ ( √ T) against an adaptive adversary. This improves on the previous algorithm [8] whose regret is bounded in expectatio ..."
Abstract - Cited by 11 (0 self) - Add to MetaCart
We present a modification of the algorithm of Dani et al. [8] for the online linear optimization problem in the bandit setting, which with high probability has regret at most O ∗ ( √ T) against an adaptive adversary. This improves on the previous algorithm [8] whose regret is bounded in expectation against an oblivious adversary. We obtain the same dependence on the dimension (n 3/2) as that exhibited by Dani et al. The results of this paper rest firmly on those of [8] and the remarkable technique of Auer et al. [2] for obtaining highprobability bounds via optimistic estimates. This paper answers an open question: it eliminates the gap between the high-probability bounds obtained in the full-information vs bandit settings. 1

Adapting to a Changing Environment: the Brownian Restless Bandits

by Aleksandrs Slivkins, Eli Upfal
"... In the multi-armed bandit (MAB) problem there are k distributions associated with the rewards of playing each of k strategies (slot machine arms). The reward distributions are initially unknown to the player. The player iteratively plays one strategy per round, observes the associated reward, and de ..."
Abstract - Cited by 8 (3 self) - Add to MetaCart
In the multi-armed bandit (MAB) problem there are k distributions associated with the rewards of playing each of k strategies (slot machine arms). The reward distributions are initially unknown to the player. The player iteratively plays one strategy per round, observes the associated reward, and decides on the strategy for the next iteration. The goal is to maximize the reward by balancing exploitation: the use of acquired information, with exploration: learning new information. We introduce and study a dynamic MAB problem in which the reward functions stochastically and gradually change in time. Specifically, the expected reward of each arm follows a Brownian motion, a discrete random walk, or similar processes. In this setting a player has to continuously keep exploring in order to adapt to the changing environment. Our formulation is (roughly) a special case of the notoriously intractable restless MAB problem. Our goal here is to characterize the cost of learning and adapting to the changing environment, in terms of the stochastic rate of the change. We consider an infinite time horizon, and strive to minimize the average cost per step which we define with respect to a hypothetical algorithm that at every step plays the arm with the maximum expected reward at this step. A related line of work on the adversarial MAB problem used a significantly weaker benchmark, the best time-invariant policy. The dynamic MAB problem models a variety of practical online, game-against- nature type optimization settings. While building on prior work, algorithms and steady-state analysis for the dynamic setting require a novel approach based on different stochastic tools.

Dynamic Cost-Per-Action Mechanisms and Applications to Online Advertising

by Hamid Nazerzadeh, Amin Saberi, Rakesh Vohra
"... We study the Cost-Per-Action or Cost-Per-Acquisition (CPA) charging scheme in online advertising. In this scheme, instead of paying per click, the advertisers pay only when a user takes a specific action (e.g. fills out a form) or completes a transaction on their websites. We focus on designing effi ..."
Abstract - Cited by 7 (0 self) - Add to MetaCart
We study the Cost-Per-Action or Cost-Per-Acquisition (CPA) charging scheme in online advertising. In this scheme, instead of paying per click, the advertisers pay only when a user takes a specific action (e.g. fills out a form) or completes a transaction on their websites. We focus on designing efficient and incentive compatible mechanisms that use this charging scheme. We describe a mechanism based on a sampling-based learning algorithm that under suitable assumptions is asymptotically individually rational, asymptotically Bayesian incentive compatible and asymptotically ex-ante efficient. In particular, we demonstrate our mechanism for the case where the utility functions of the advertisers are independent and identically-distributed random variables as well as the case where they evolve like independent reflected Brownian motions.

Contextual Bandits with Similarity Information

by Aleksandrs Slivkins - 24TH ANNUAL CONFERENCE ON LEARNING THEORY , 2011
"... In a multi-armed bandit (MAB) problem, an online algorithm makes a sequence of choices. In each round it chooses from a time-invariant set of alternatives and receives the payoff associated with this alternative. While the case of small strategy sets is by now wellunderstood, a lot of recent work ha ..."
Abstract - Cited by 6 (1 self) - Add to MetaCart
In a multi-armed bandit (MAB) problem, an online algorithm makes a sequence of choices. In each round it chooses from a time-invariant set of alternatives and receives the payoff associated with this alternative. While the case of small strategy sets is by now wellunderstood, a lot of recent work has focused on MAB problems with exponentially or infinitely large strategy sets, where one needs to assume extra structure in order to make the problem tractable. In particular, recent literature considered information on similarity between arms. We consider similarity information in the setting of contextual bandits, a natural extension of the basic MAB problem where before each round an algorithm is given the context – a hint about the payoffs in this round. Contextual bandits are directly motivated by placing advertisements on webpages, one of the crucial problems in sponsored search. A particularly simple way to represent similarity information in the contextual bandit setting is via a similarity distance between the context-arm pairs which bounds from above the difference between the respective expected payoffs. Prior work

The Knowledge-Gradient Algorithm for Sequencing Experiments in Drug Discovery

by Diana M. Negoescu, Peter I. Frazier, Warren B. Powell - INFORMS J. on Computing , 2010
"... We present a new technique for adaptively choosing the sequence of molecular compounds to test in drug discovery. Beginning with a base compound, we consider the problem of searching for a chemical derivative of the molecule that best treats a given disease. The problem of choosing molecules to test ..."
Abstract - Cited by 3 (3 self) - Add to MetaCart
We present a new technique for adaptively choosing the sequence of molecular compounds to test in drug discovery. Beginning with a base compound, we consider the problem of searching for a chemical derivative of the molecule that best treats a given disease. The problem of choosing molecules to test to maximize the expected quality of the best compound discovered may be formulated mathematically as a ranking-andselection problem in which each molecule is an alternative. We apply a recently developed algorithm, known as the knowledge-gradient algorithm, that uses correlations in our Bayesian prior distribution between the performance of different alternatives (molecules) to dramatically reduce the number of molecular tests required, but it has heavy computational requirements that limit the number of possible alternatives to a few thousand. We develop computational improvements that allow the knowledge-gradient method to consider much larger sets of alternatives, and we demonstrate the method on a problem with 87,120 alternatives.

Mortal Multi-Armed Bandits

by Deepayan Chakrabarti, Filip Radlinski, Ravi Kumar, Eli Upfal
"... We formulate and study a new variant of the k-armed bandit problem, motivated by e-commerce applications. In our model, arms have (stochastic) lifetime after which they expire. In this setting an algorithm needs to continuously explore new arms, in contrast to the standard k-armed bandit model in wh ..."
Abstract - Cited by 3 (0 self) - Add to MetaCart
We formulate and study a new variant of the k-armed bandit problem, motivated by e-commerce applications. In our model, arms have (stochastic) lifetime after which they expire. In this setting an algorithm needs to continuously explore new arms, in contrast to the standard k-armed bandit model in which arms are available indefinitely and exploration is reduced once an optimal arm is identified with nearcertainty. The main motivation for our setting is online-advertising, where ads have limited lifetime due to, for example, the nature of their content and their campaign budgets. An algorithm needs to choose among a large collection of ads, more than can be fully explored within the typical ad lifetime. We present an optimal algorithm for the state-aware (deterministic reward function) case, and build on this technique to obtain an algorithm for the state-oblivious (stochastic reward function) case. Empirical studies on various reward distributions, including one derived from a real-world ad serving application, show that the proposed algorithms significantly outperform the standard multi-armed bandit approaches applied to these settings. 1

Hierarchical Knowledge Gradient for Sequential Sampling

by Martijn R. K. Mes, Warren B. Powell, Peter I. Frazier, Ronald Parr
"... We propose a sequential sampling policy for noisy discrete global optimization and ranking and selection, in which we aim to efficiently explore a finite set of alternatives before selecting an alternative as best when exploration stops. Each alternative may be characterized by a multidimensional ve ..."
Abstract - Cited by 2 (2 self) - Add to MetaCart
We propose a sequential sampling policy for noisy discrete global optimization and ranking and selection, in which we aim to efficiently explore a finite set of alternatives before selecting an alternative as best when exploration stops. Each alternative may be characterized by a multidimensional vector of categorical and numerical attributes and has independent normal rewards. We use a Bayesian probability model for the unknown reward of each alternative and follow a fully sequential sampling policy called the knowledge-gradient policy. This policy myopically optimizes the expected increment in the value of sampling information in each time period. We propose a hierarchical aggregation technique that uses the common features shared by alternatives to learn about many alternatives from even a single measurement. This approach greatly reduces the measurement effort required, but it requires some prior knowledge on the smoothness of the function in the form of an aggregation function and computational issues limit the number of alternatives that can be easily considered to the thousands. We prove that our policy is consistent, finding a globally optimal alternative when given enough measurements, and show through simulations that it performs competitively with or significantly better than other policies.
The National Science Foundation
  • About CiteSeerX
  • Submit Documents
  • Privacy Policy
  • Help
  • Data
  • Source
  • Contact Us

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2010 The Pennsylvania State University