Results 1–10 of 280
Bandit based Monte-Carlo Planning
In: ECML'06, number 4212 in LNCS, 2006
Cited by 217 (6 self)
Abstract. For large state-space Markovian Decision Problems, Monte-Carlo planning is one of the few viable approaches to find near-optimal solutions. In this paper we introduce a new algorithm, UCT, that applies bandit ideas to guide Monte-Carlo planning. In finite-horizon or discounted MDPs the algorithm is shown to be consistent, and finite-sample bounds are derived on the estimation error due to sampling. Experimental results show that in several domains, UCT is significantly more efficient than its alternatives.
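UCT's key step is to treat action selection at each tree node as a multi-armed bandit and pick actions with the UCB1 rule. A minimal sketch of that selection rule (the function name and data layout are illustrative, not taken from the paper):

```python
import math

def ucb1_select(counts, values, c=math.sqrt(2)):
    """Pick the arm maximizing empirical mean + c * sqrt(ln(total) / n).

    counts[i] = number of times arm i was played,
    values[i] = total reward collected from arm i.
    """
    total = sum(counts)
    best_arm, best_score = None, float("-inf")
    for arm, (n, v) in enumerate(zip(counts, values)):
        if n == 0:
            return arm  # play every untried arm once first
        score = v / n + c * math.sqrt(math.log(total) / n)
        if score > best_score:
            best_arm, best_score = arm, score
    return best_arm
```

In UCT this rule is applied at every node of the search tree, with rewards backed up from random rollouts below the node.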
Online Choice of Active Learning Algorithms
In: Journal of Machine Learning Research, 2004
Cited by 88 (2 self)
This work is concerned with the question of how to combine online an ensemble of active learners so as to expedite the learning progress in pool-based active learning. We develop an active-learning master algorithm, based on a known competitive algorithm for the multi-armed bandit problem. A major challenge in successfully choosing top performing active learners online is to reliably estimate their progress during the learning session. To this end we propose a simple maximum entropy criterion that provides effective estimates in realistic settings. We study the performance of the proposed master algorithm using an ensemble containing two of the best known active-learning algorithms as well as a new algorithm. The resulting …
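The "known competitive algorithm for the multi-armed bandit problem" referenced here belongs to the Exp3/Exp4 family, which weights each candidate (here, each active learner) exponentially by an importance-weighted reward estimate. One round of plain Exp3, as a minimal sketch of such a master loop (all names are illustrative):

```python
import math
import random

def exp3_step(weights, gamma, reward_fn):
    """One round of Exp3 over k arms.

    Samples an arm from a mix of the weight distribution and the
    uniform distribution, observes its reward in [0, 1], and updates
    only that arm's weight via an importance-weighted estimate.
    """
    k = len(weights)
    total = sum(weights)
    probs = [(1 - gamma) * w / total + gamma / k for w in weights]
    arm = random.choices(range(k), weights=probs)[0]
    reward = reward_fn(arm)
    est = reward / probs[arm]  # unbiased estimate of the arm's reward
    weights[arm] *= math.exp(gamma * est / k)
    return arm, reward
```

In the active-learning setting, `reward_fn` would be replaced by a progress estimate for the chosen learner (the paper proposes a maximum entropy criterion for exactly this purpose).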
A Contextual-Bandit Approach to Personalized News Article Recommendation
Cited by 59 (11 self)
Personalized web services strive to adapt their services (advertisements, news articles, etc.) to individual users by making use of both content and user information. Despite a few recent advances, this problem remains challenging for at least two reasons. First, web service is featured with dynamically changing pools of content, rendering traditional collaborative filtering methods inapplicable. Second, the scale of most web services of practical interest calls for solutions that are both fast in learning and computation. In this work, we model personalized recommendation of news articles as a contextual bandit problem, a principled approach in which a learning algorithm sequentially selects articles to serve users based on contextual information about the users and articles, while simultaneously adapting its article-selection strategy based on user-click feedback to maximize total user clicks. The contributions of this work are threefold. First, we propose a new, general contextual bandit algorithm that is computationally efficient and well motivated from learning theory. Second, we argue that any bandit algorithm can be reliably evaluated offline using previously recorded random traffic. Finally, using this offline evaluation method, we successfully applied our new algorithm to a Yahoo! Front Page Today Module dataset containing over 33 million events. Results showed a 12.5% click lift compared to a standard context-free bandit algorithm, and the advantage becomes even greater when data gets more scarce.
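A common disjoint-linear-model form of such a contextual bandit (often called LinUCB) scores each arm by a ridge-regression estimate plus a confidence width. A minimal sketch under that assumption; this is illustrative, not the paper's exact implementation:

```python
import numpy as np

def linucb_select(contexts, A_list, b_list, alpha=1.0):
    """Disjoint LinUCB arm selection.

    For each arm a with design matrix A_a and response vector b_a,
    score x^T theta_a + alpha * sqrt(x^T A_a^{-1} x), where
    theta_a = A_a^{-1} b_a, and pick the arm with the highest score.
    """
    best_arm, best_ucb = None, float("-inf")
    for a, x in enumerate(contexts):
        A_inv = np.linalg.inv(A_list[a])
        theta = A_inv @ b_list[a]
        ucb = float(x @ theta + alpha * np.sqrt(x @ A_inv @ x))
        if ucb > best_ucb:
            best_arm, best_ucb = a, ucb
    return best_arm

def linucb_update(A_list, b_list, arm, x, reward):
    """Rank-one update after observing the click/no-click reward."""
    A_list[arm] += np.outer(x, x)
    b_list[arm] += reward * x
```

Each `A_list[a]` starts as the identity matrix (the ridge prior) and `b_list[a]` as the zero vector; `x` is the user/article feature vector for arm `a` on the current round.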
Online Learning in Online Auctions
2003
Cited by 58 (5 self)
… bidding truthfully and setting b_i = v_i. As shown in that paper, this condition is equivalent to the condition that each s_i depends only on the first i-1 bids, and not on the i-th bid. Hence, the auction mechanism is essentially trying to guess the i-th valuation, based on the first i-1 valuations. As in previous papers [3, 5, 6], we will use competitive analysis to analyze the performance of any given auction. Hence, we are interested in the worst-case ratio (over all sequences of valuations) …
Learning diverse rankings with multi-armed bandits
In: Proceedings of the 25th ICML, 2008
Cited by 56 (4 self)
Algorithms for learning to rank Web documents usually assume a document's relevance is independent of other documents. This leads to learned ranking functions that produce rankings with redundant results. In contrast, user studies have shown that diversity at high ranks is often preferred. We present two online learning algorithms that directly learn a diverse ranking of documents based on users' clicking behavior. We show that these algorithms minimize abandonment, or alternatively, maximize the probability that a relevant document is found in the top k positions of a ranking. Moreover, one of our algorithms asymptotically achieves optimal worst-case performance even if users' interests change.
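One way to realize a "one bandit per rank position" scheme of this kind: position i runs its own bandit over documents, skipping documents already placed above, and is rewarded only when its own pick is clicked. A minimal sketch with a simple epsilon-greedy bandit (the class, reward scheme, and names are illustrative):

```python
import random

class EpsGreedyBandit:
    """Epsilon-greedy bandit over a fixed set of documents."""
    def __init__(self, docs, eps=0.1):
        self.eps = eps
        self.counts = {d: 0 for d in docs}
        self.values = {d: 0.0 for d in docs}  # running mean reward

    def select(self, exclude=frozenset()):
        cands = [d for d in self.counts if d not in exclude]
        if random.random() < self.eps:
            return random.choice(cands)
        return max(cands, key=lambda d: self.values[d])

    def update(self, doc, reward):
        self.counts[doc] += 1
        self.values[doc] += (reward - self.values[doc]) / self.counts[doc]

def ranked_bandit_round(position_bandits, user_clicks):
    """One round: rank i is filled by bandit i (skipping docs already
    placed); each bandit is rewarded only if its own pick was clicked."""
    ranking = []
    for bandit in position_bandits:
        ranking.append(bandit.select(exclude=frozenset(ranking)))
    clicked = user_clicks(ranking)  # set of clicked documents
    for bandit, doc in zip(position_bandits, ranking):
        bandit.update(doc, 1.0 if doc in clicked else 0.0)
    return ranking
```

Because the bandit at rank i only earns reward for clicks on its own document, it learns to cover user interests not already satisfied by ranks above it, which is what drives the diversity of the learned ranking.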
Competing in the dark: An efficient algorithm for bandit linear optimization
In: Proceedings of the 21st Annual Conference on Learning Theory (COLT), 2008
Cited by 50 (9 self)
We introduce an efficient algorithm for the problem of online linear optimization in the bandit setting which achieves the optimal O*(√T) regret. The setting is a natural generalization of the non-stochastic multi-armed bandit problem, and the existence of an efficient optimal algorithm has been posed as an open problem in a number of recent papers. We show how the difficulties encountered by previous approaches are overcome by the use of a self-concordant potential function. Our approach presents a novel connection between online learning and interior point methods.
Robbing the Bandit: Less Regret in Online Geometric Optimization Against an Adaptive Adversary
In: Proceedings of the 17th ACM-SIAM Symposium on Discrete Algorithms (SODA), 2006
Cited by 47 (5 self)
We consider “online bandit geometric optimization,” a problem of iterated decision making in a largely unknown and constantly changing environment. The goal is to minimize “regret,” defined as the difference between the actual loss of an online decision-making procedure and that of the best single decision in hindsight. “Geometric optimization” refers to a generalization of the well-known multi-armed bandit problem, in which the decision space is some bounded subset of R^d, the adversary is restricted to linear loss functions, and regret bounds should depend on the dimensionality d, rather than the total number of possible decisions. “Bandit” refers to the setting in which the algorithm is only told its loss on each round, rather than the entire loss function. McMahan and Blum [10] presented the best known algorithm in this setting, and proved that its expected additive regret is O(poly(d) T^{3/4}). We simplify and improve their analysis of this algorithm to obtain regret O(poly(d) T^{2/3}). We also prove that, for a large class of full-information online optimization problems, the optimal regret against an adaptive adversary is the same as against a non-adaptive adversary.
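The regret notion used above can be made concrete with a tiny worked example over a finite decision set (the `regret` helper and the dict-based loss representation are illustrative, not from the paper):

```python
def regret(losses, decisions):
    """Additive regret: total loss of the online play minus the loss
    of the best single decision in hindsight (the same decision played
    every round). `losses` is a list of per-round dicts mapping each
    decision to its loss; `decisions` is the sequence actually played."""
    actual = sum(loss[d] for loss, d in zip(losses, decisions))
    best_fixed = min(sum(loss[d] for loss in losses) for d in losses[0])
    return actual - best_fixed
```

For example, with two rounds where decision "a" costs 1 then 0 and "b" costs 0 then 1, every fixed decision totals 1, so playing "a" both rounds incurs zero regret even though it was not optimal in every round.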
Improved second-order bounds for prediction with expert advice
In: COLT, 2005
Cited by 46 (9 self)
Abstract. This work studies external regret in sequential prediction games with both positive and negative payoffs. External regret measures the difference between the payoff obtained by the forecasting strategy and the payoff of the best action. In this setting, we derive new and sharper regret bounds for the well-known exponentially weighted average forecaster and for a new forecaster with a different multiplicative update rule. Our analysis has two main advantages: first, no preliminary knowledge about the payoff sequence is needed, not even its range; second, our bounds are expressed in terms of sums of squared payoffs, replacing larger first-order quantities appearing in previous bounds. In addition, our most refined bounds have the natural and desirable property of being stable under rescalings and general translations of the payoff sequence.
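The exponentially weighted average forecaster mentioned here weights each expert in proportion to the exponential of its cumulative payoff. A minimal sketch of computing those weights (the max-shift is a standard numerical-stability trick, not part of the paper's analysis):

```python
import math

def ewa_weights(payoff_history, eta):
    """Exponentially weighted average forecaster weights.

    `payoff_history` is a list of rounds, each a list of per-expert
    payoffs; expert i gets weight proportional to
    exp(eta * cumulative_payoff_i). Returns a probability vector.
    """
    k = len(payoff_history[0])
    cum = [0.0] * k
    for round_payoffs in payoff_history:
        for i, p in enumerate(round_payoffs):
            cum[i] += p
    m = max(cum)  # shift exponents by the max for numerical stability
    w = [math.exp(eta * (c - m)) for c in cum]
    total = sum(w)
    return [x / total for x in w]
```

The learning rate `eta` trades off how aggressively the forecaster concentrates on the currently best expert; the paper's second-order bounds let it be tuned via sums of squared payoffs rather than the payoff range.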
Near-Optimal Online Auctions
In: Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms, 2005
Cited by 45 (11 self)
Abstract. We consider the online auction problem proposed by Bar-Yossef, Hildrum, and Wu [4] in which an auctioneer is selling identical items to bidders arriving one at a time. We give an auction that achieves a constant factor of the optimal profit less an O(h) additive loss term, where h is the value of the highest bid. Furthermore, this auction does not require foreknowledge of the range of bidders' valuations. On both counts, this answers open questions from [4, 5]. We further improve on the results from [5] for the online posted-price problem by reducing their additive loss term from O(h log h log log h) to O(h log log h). Finally, we define the notion of an (offline) attribute auction for modeling the problem of auctioning items to consumers who are not a priori indistinguishable. We apply our online auction solution to achieve good bounds for the attribute auction problem with 1-dimensional attributes.