Results 1 - 10 of 47
Bandit based Monte-Carlo Planning
 In: ECML'06, Number 4212 in LNCS, 2006
Cited by 217 (5 self)
Abstract. For large state-space Markovian Decision Problems Monte-Carlo planning is one of the few viable approaches to find near-optimal solutions. In this paper we introduce a new algorithm, UCT, that applies bandit ideas to guide Monte-Carlo planning. In finite-horizon or discounted MDPs the algorithm is shown to be consistent and finite sample bounds are derived on the estimation error due to sampling. Experimental results show that in several domains, UCT is significantly more efficient than its alternatives.
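The bandit idea at the heart of UCT is UCB1 applied at every tree node: descend the tree by picking the child that maximizes empirical mean reward plus an exploration bonus that shrinks with visits. A minimal sketch of that selection rule (the dict-based node representation and the exploration constant are illustrative assumptions, not the paper's code):

```python
import math

def uct_select(children, c=math.sqrt(2)):
    """Pick the child maximizing the UCB1 score used by UCT:
    empirical mean reward plus an exploration bonus that decays
    as a child accumulates visits."""
    total = sum(ch["visits"] for ch in children)

    def score(ch):
        if ch["visits"] == 0:
            return float("inf")  # always try unvisited children first
        mean = ch["reward"] / ch["visits"]
        return mean + c * math.sqrt(math.log(total) / ch["visits"])

    return max(children, key=score)
```

In a full planner this rule is applied at each node on the way down, a rollout is run from the reached leaf, and the observed return is backed up along the path.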
The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits
Cited by 49 (9 self)
We present Epoch-Greedy, an algorithm for contextual multi-armed bandits (also known as bandits with side information). Epoch-Greedy has the following properties: 1. No knowledge of a time horizon T is necessary. 2. The regret incurred by Epoch-Greedy is controlled by a sample complexity bound for a hypothesis class. 3. The regret scales as O(T^(2/3) S^(1/3)) or better (sometimes, much better). Here S is the complexity term in a sample complexity bound for standard supervised learning.
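The epoch structure behind these properties can be sketched as follows: each epoch spends one round on uniform exploration, retrains on the accumulated exploration sample, then exploits the learned policy for a growing number of rounds. The interface names and the epoch schedule below are hypothetical illustrations, not the paper's implementation:

```python
import random

def epoch_greedy(contexts, k, reward, learn, exploit_len=lambda t: t):
    """Sketch of an Epoch-Greedy loop (hypothetical interface).
    Each epoch t: one uniform-exploration round (logged), retrain via
    learn(samples) -> policy, then exploit for exploit_len(t) rounds.
    Returns the actions taken during exploitation rounds."""
    samples, exploited, epoch = [], [], 0
    stream = iter(contexts)
    try:
        while True:
            epoch += 1
            x = next(stream)
            a = random.randrange(k)            # exploration: uniform action
            samples.append((x, a, reward(x, a)))
            policy = learn(samples)            # e.g. ERM over the hypothesis class
            for _ in range(exploit_len(epoch)):
                exploited.append(policy(next(stream)))  # no learning here
    except StopIteration:                      # context stream exhausted
        return exploited
```

No horizon T appears anywhere: the schedule simply runs until the context stream ends, which is how the algorithm avoids needing T in advance.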
Exploration scavenging
 In: Proceedings of the International Conference on Machine Learning, 2008
Cited by 23 (7 self)
We examine the problem of evaluating a policy in the contextual bandit setting using only observations collected during the execution of another policy. We show that policy evaluation can be impossible if the exploration policy chooses actions based on the side information provided at each time step. We then propose and prove the correctness of a principled method for policy evaluation which works when this is not the case, even when the exploration policy is deterministic, as long as each action is explored sufficiently often. We apply this general technique to the problem of offline evaluation of internet advertising policies. Although our theoretical results hold only when the exploration policy chooses ads independent of side information, an assumption that is typically violated by commercial systems, we show how clever uses of the theory provide nontrivial and realistic applications. We also provide an empirical demonstration of the effectiveness of our techniques on real ad placement data.
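The estimator this setting enables can be sketched as follows: when the logging policy ignores side information, its per-action propensities can be estimated from action frequencies in the log itself, and rewards of actions that match the target policy are inverse-propensity weighted. A hypothetical minimal version (not the paper's exact estimator):

```python
from collections import Counter

def evaluate_policy(log, target):
    """Estimate the value of `target` (context -> action) from
    (context, action, reward) triples logged by ANOTHER policy.
    Valid when the logger chose actions independently of context,
    so its propensities equal its empirical action frequencies."""
    n = len(log)
    freq = Counter(a for _, a, _ in log)
    # Inverse-frequency weighting of rewards on matching actions.
    return sum(r / (freq[a] / n) for x, a, r in log if target(x) == a) / n
```

The "each action explored sufficiently often" condition from the abstract shows up here directly: an action the logger never took has zero frequency and its rewards under the target policy simply cannot be estimated.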
Creating an Upper-Confidence-Tree program for Havannah
 In: ACG 12, 2009
Cited by 13 (6 self)
... huge improvements in computer-Go. In this paper, we test the generality of the approach by experimenting on another game, Havannah, which is known for being especially difficult for computers. We show that the same results hold, with slight differences related to the absence of clearly known patterns for the game of Havannah, in spite of the fact that Havannah is more related to connection games like Hex than to territory games like Go.
The Offset Tree for Learning with Partial Labels
Cited by 9 (5 self)
We present an algorithm, called the offset tree, for learning in situations where a loss associated with different decisions is not known, but was randomly probed. The algorithm is an optimal reduction from this problem to binary classification. In particular, it has regret at most (k − 1) times the regret of the binary classifier it uses, where k is the number of decisions, and no reduction to binary classification can do better. We test the offset tree empirically and discover that it generally results in superior (or equal) performance, compared to several plausible alternative approaches.
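The prediction side of such a tree-structured reduction can be sketched as follows: the k decisions sit at the leaves of a binary tree, and each of the k − 1 internal nodes holds a binary classifier that routes toward one subtree. The dict keying and routing convention below are illustrative assumptions, and training of the node classifiers (the offset construction itself) is omitted:

```python
def offset_tree_predict(x, k, classifiers):
    """Tournament prediction over decisions 0..k-1 at the leaves of a
    binary tree. classifiers[(lo, hi)](x) -> True routes to the right
    half of decisions lo..hi-1, False to the left (hypothetical keying).
    Exactly the k-1 internal nodes are consulted across the tree."""
    lo, hi = 0, k
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if classifiers[(lo, hi)](x):
            lo = mid   # descend into the right half
        else:
            hi = mid   # descend into the left half
    return lo
```

One node erring costs at most one wrong routing on the path to a leaf, which is the intuition behind a regret bound that scales with the number of internal nodes, k − 1.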
Consistency Modifications for Automatically Tuned Monte-Carlo Tree Search
 In: Proc. Learn. Intell. Optim., 2010
Cited by 6 (2 self)
Abstract. Monte-Carlo Tree Search algorithms (MCTS [4, 6]), including upper confidence trees (UCT [9]), are known for their impressive ability in high-dimensional control problems. Whilst the main testbed is the game of Go, there are increasingly many applications [13, 12, 7]; these algorithms are now widely accepted as strong candidates for high-dimensional control applications. Unfortunately, it is known that for optimal performance on a given problem, MCTS requires some tuning; this tuning is often handcrafted or automated, in some cases at a loss of consistency, i.e. bad behavior asymptotically in the computational power. This highly undesirable property led to poor behavior of our main MCTS program MoGo in a real-world situation described in section 3. This is a serious problem for our work on automatic parameter tuning [3] and the genetic programming of new features in MoGo. We will see in this paper: – A theoretical analysis of MCTS consistency;
Bandit-Based Genetic Programming
 In: 13th European Conference on Genetic Programming, 2010
Cited by 6 (3 self)
We consider the validation of randomly generated patterns in a Monte-Carlo Tree Search program. Our bandit-based genetic programming (BGP) algorithm, with proven mathematical properties, outperformed a highly optimized handcrafted module of a well-known computer-Go program with several world records in the game of Go.
A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. 2011. URL http://hal.archives-ouvertes.fr/inria-00574987
Cited by 6 (1 self)
We consider a Kullback-Leibler-based algorithm for the stochastic multi-armed bandit problem in the case of distributions with finite supports (not necessarily known beforehand), whose asymptotic regret matches the lower bound of Burnetas and Katehakis (1996). Our contribution is to provide a finite-time analysis of this algorithm; we get bounds whose main terms are smaller than the ones of previously known algorithms with finite-time analyses (like UCB-type algorithms).
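For intuition, the Bernoulli special case of such a KL-based upper-confidence index can be computed by bisection: the index of an arm with empirical mean p̂ after n pulls at round t is the largest q ≥ p̂ with n · KL(p̂, q) ≤ log t. This is an illustrative sketch of the index computation only, not the paper's algorithm for general finite supports:

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q),
    with clamping to avoid log(0)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, iters=50):
    """Largest q >= mean with pulls * KL(mean, q) <= log(t),
    found by bisection on the increasing map q -> KL(mean, q)."""
    target = math.log(t) / pulls
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= target:
            lo = mid   # still inside the confidence region
        else:
            hi = mid
    return lo
```

Because KL(p̂, q) grows much faster than the quadratic surrogate behind Hoeffding-style bonuses as q approaches 1, this index is tighter than the UCB1 bonus for means near the boundary, which is the source of the smaller main terms mentioned above.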