Bandit based Monte-Carlo Planning
In: ECML-06, number 4212 in LNCS, 2006
Cited by 111 (4 self)
For large state-space Markovian Decision Problems, Monte-Carlo planning is one of the few viable approaches to find near-optimal solutions. In this paper we introduce a new algorithm, UCT, that applies bandit ideas to guide Monte-Carlo planning. In finite-horizon or discounted MDPs the algorithm is shown to be consistent, and finite-sample bounds are derived on the estimation error due to sampling. Experimental results show that in several domains, UCT is significantly more efficient than its alternatives.
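The "bandit ideas" mentioned in this abstract can be illustrated with the standard UCB1 selection rule for a single decision node. This is a minimal sketch, not the paper's implementation: the function name `ucb1_select`, its arguments, and the default exploration constant are assumptions made for illustration.

```python
import math


def ucb1_select(counts, value_sums, exploration=math.sqrt(2)):
    """Pick an arm by the UCB1 rule: empirical mean plus a confidence bonus.

    counts[i]     -- how many times arm i has been tried so far
    value_sums[i] -- total reward collected from arm i
    exploration   -- bonus scale (assumed default; sqrt(2) gives the
                     classic sqrt(2 ln t / n_i) term)
    """
    # Try every arm once before relying on confidence bounds.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm

    total = sum(counts)

    def score(arm):
        mean = value_sums[arm] / counts[arm]
        bonus = exploration * math.sqrt(math.log(total) / counts[arm])
        return mean + bonus

    return max(range(len(counts)), key=score)
```

A tree search in this style would apply such a rule at each visited node to choose which action to sample next, so rarely tried actions keep receiving some exploration.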
Online Choice of Active Learning Algorithms
Journal of Machine Learning Research, 2004
Cited by 57 (1 self)
This work is concerned with the question of how to combine online an ensemble of active learners so as to expedite the learning progress in pool-based active learning. We develop an active-learning master algorithm, based on a known competitive algorithm for the multiarmed bandit problem. A major challenge in successfully choosing top performing active learners online is to reliably estimate their progress during the learning session. To this end we propose a simple maximum entropy criterion that provides effective estimates in realistic settings. We study the performance of the proposed master algorithm using an ensemble containing two of the best known active-learning algorithms as well as a new algorithm. The resulting
Online Learning in Online Auctions
2003
Cited by 50 (4 self)
... bidding truthfully and setting $b_i = v_i$. As shown in that paper, this condition is equivalent to the condition that each $s_i$ depends only on the first $i-1$ bids, and not on the $i$th bid. Hence, the auction mechanism is essentially trying to guess the $i$th valuation, based on the first $i-1$ valuations. As in previous papers [3, 5, 6], we will use competitive analysis to analyze the performance of any given auction. Hence, we are interested in the worst-case ratio (over all sequences of valuations)
Author affiliations: Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA (avrim@cs.cmu.edu); Strategic Planning and Optimization Team, Amazon.com, Seattle, WA (vijayk@amazon.com); Department of Computer Science, University of Texas at Austin, Austin, TX, work done while the author was at IBM India Research Lab, New Delhi, India (atri@cs.utexas.edu); Computer Science Division, University of California at Berkeley, Berkeley, CA (felix@cs.berkeley.edu).
Near-Optimal Online Auctions
In Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms, 2005
Cited by 41 (10 self)
We consider the online auction problem proposed by Bar-Yossef, Hildrum, and Wu [4] in which an auctioneer is selling identical items to bidders arriving one at a time. We give an auction that achieves a constant factor of the optimal profit less an $O(h)$ additive loss term, where $h$ is the value of the highest bid. Furthermore, this auction does not require foreknowledge of the range of bidders' valuations. On both counts, this answers open questions from [4, 5]. We further improve on the results from [5] for the online posted-price problem by reducing their additive loss term from $O(h \log h \log\log h)$ to $O(h \log\log h)$. Finally, we define the notion of an (offline) attribute auction for modeling the problem of auctioning items to consumers who are not a priori indistinguishable. We apply our online auction solution to achieve good bounds for the attribute auction problem with 1-dimensional attributes.
Robbing the Bandit: Less Regret in Online Geometric Optimization Against an Adaptive Adversary
In Proceedings of the 17th ACM-SIAM Symposium on Discrete Algorithms (SODA), 2006
Cited by 39 (5 self)
We consider “online bandit geometric optimization,” a problem of iterated decision making in a largely unknown and constantly changing environment. The goal is to minimize “regret,” defined as the difference between the actual loss of an online decision-making procedure and that of the best single decision in hindsight. “Geometric optimization” refers to a generalization of the well-known multi-armed bandit problem, in which the decision space is some bounded subset of $\mathbb{R}^d$, the adversary is restricted to linear loss functions, and regret bounds should depend on the dimensionality $d$, rather than the total number of possible decisions. “Bandit” refers to the setting in which the algorithm is only told its loss on each round, rather than the entire loss function. McMahan and Blum [10] presented the best known algorithm in this setting, and proved that its expected additive regret is $O(\mathrm{poly}(d)\,T^{3/4})$. We simplify and improve their analysis of this algorithm to obtain regret $O(\mathrm{poly}(d)\,T^{2/3})$. We also prove that, for a large class of full-information online optimization problems, the optimal regret against an adaptive adversary is the same as against a non-adaptive adversary.
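The definition of regret given in this abstract (actual loss minus the loss of the best single decision in hindsight) can be made concrete with a small computation. The sketch below assumes a finite set of decisions indexed by integers, which is a simplification of the continuous decision space the abstract describes; the function name `external_regret` and the data layout are illustrative assumptions.

```python
def external_regret(losses, choices):
    """Regret of a sequence of choices against the best fixed decision.

    losses[t][a] -- loss of decision a on round t (assumed fully known
                    here, purely to evaluate regret after the fact)
    choices[t]   -- index of the decision actually played on round t
    """
    # Loss actually incurred by the online procedure.
    actual = sum(losses[t][choices[t]] for t in range(len(choices)))

    # Loss of the best single decision, chosen with hindsight.
    n_decisions = len(losses[0])
    best_fixed = min(
        sum(round_losses[a] for round_losses in losses)
        for a in range(n_decisions)
    )
    return actual - best_fixed
```

In the bandit setting the learner would only observe `losses[t][choices[t]]` on each round; the full table is used here only to measure regret afterwards.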
Competing in the dark: An efficient algorithm for bandit linear optimization
In Proceedings of the 21st Annual Conference on Learning Theory (COLT), 2008
Cited by 36 (7 self)
We introduce an efficient algorithm for the problem of online linear optimization in the bandit setting which achieves the optimal $O^*(\sqrt{T})$ regret. The setting is a natural generalization of the nonstochastic multi-armed bandit problem, and the existence of an efficient optimal algorithm has been posed as an open problem in a number of recent papers. We show how the difficulties encountered by previous approaches are overcome by the use of a self-concordant potential function. Our approach presents a novel connection between online learning and interior point methods.
Improved second-order bounds for prediction with expert advice
In COLT, 2005
Cited by 31 (6 self)
This work studies external regret in sequential prediction games with both positive and negative payoffs. External regret measures the difference between the payoff obtained by the forecasting strategy and the payoff of the best action. In this setting, we derive new and sharper regret bounds for the well-known exponentially weighted average forecaster and for a new forecaster with a different multiplicative update rule. Our analysis has two main advantages: first, no preliminary knowledge about the payoff sequence is needed, not even its range; second, our bounds are expressed in terms of sums of squared payoffs, replacing larger first-order quantities appearing in previous bounds. In addition, our most refined bounds have the natural and desirable property of being stable under rescalings and general translations of the payoff sequence.
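For readers unfamiliar with the exponentially weighted average forecaster named in this abstract, a minimal sketch follows. It uses a fixed learning rate `eta`, which is an assumption made for simplicity; the abstract's contribution concerns bounds that need no prior knowledge of the payoffs, and that tuning is not reproduced here. The name `exp_weights` and the input layout are likewise illustrative.

```python
import math


def exp_weights(payoffs, eta=0.5):
    """Run an exponentially weighted average forecaster on known payoffs.

    payoffs[t][a] -- payoff of action a on round t (may be negative)
    eta           -- fixed learning rate (assumed; not the paper's tuning)
    Returns the probability distribution over actions used on each round.
    """
    n_actions = len(payoffs[0])
    cumulative = [0.0] * n_actions
    history = []
    for round_payoffs in payoffs:
        # Weight each action by exp(eta * cumulative payoff); subtracting
        # the maximum first avoids overflow and leaves the result unchanged.
        top = max(cumulative)
        weights = [math.exp(eta * (c - top)) for c in cumulative]
        norm = sum(weights)
        history.append([w / norm for w in weights])
        # Reveal this round's payoffs and update the running totals.
        cumulative = [c + x for c, x in zip(cumulative, round_payoffs)]
    return history
```

Because only differences between cumulative payoffs enter the weights, adding the same constant to every action's payoff leaves the distributions unchanged.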
Minimizing regret with label efficient prediction
IEEE Trans. Inform. Theory, 2005
Cited by 28 (4 self)
We investigate label efficient prediction, a variant of the problem of prediction with expert advice, proposed by Helmbold and Panizza, in which the forecaster does not have access to the outcomes of the sequence to be predicted unless he asks for it, which he can do for a limited number of times. We determine matching upper and lower bounds for the best possible excess error when the number of allowed queries is a constant. We also prove that a query rate of order $(\ln n)(\ln\ln n)^2/n$ is sufficient for achieving Hannan consistency, a fundamental property in game-theoretic prediction models. Finally, we apply the label efficient framework to pattern classification and prove a label efficient mistake bound for a randomized variant of Littlestone’s zero-threshold Winnow algorithm.
Learning diverse rankings with multi-armed bandits
In Proceedings of the 25th ICML, 2008
Cited by 27 (3 self)
Algorithms for learning to rank Web documents usually assume a document’s relevance is independent of other documents. This leads to learned ranking functions that produce rankings with redundant results. In contrast, user studies have shown that diversity at high ranks is often preferred. We present two online learning algorithms that directly learn a diverse ranking of documents based on users’ clicking behavior. We show that these algorithms minimize abandonment, or alternatively, maximize the probability that a relevant document is found in the top $k$ positions of a ranking. Moreover, one of our algorithms asymptotically achieves optimal worst-case performance even if users’ interests change.

