Results 1 - 10 of 39
Bandit based Monte-Carlo Planning
In: ECML-06. Number 4212 in LNCS, 2006
Cited by 215 (5 self)
For large state-space Markovian Decision Problems, Monte-Carlo planning is one of the few viable approaches to find near-optimal solutions. In this paper we introduce a new algorithm, UCT, that applies bandit ideas to guide Monte-Carlo planning. In finite-horizon or discounted MDPs the algorithm is shown to be consistent and finite sample bounds are derived on the estimation error due to sampling. Experimental results show that in several domains, UCT is significantly more efficient than its alternatives.
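As a rough illustration of the bandit idea mentioned in this abstract, the sketch below shows a UCB1-style selection rule of the kind a UCT-like planner could apply at each node of its search tree. The function name `ucb1_select`, its arguments, and the default exploration constant `c` are assumptions made for illustration and are not taken from the paper.

```python
import math


def ucb1_select(counts, values, c=math.sqrt(2)):
    """Return the index of the action with the highest upper confidence bound.

    counts[a] is how often action a was tried at this node (hypothetical layout),
    values[a] is the average return observed for action a.
    """
    # Try every action once before trusting any average.
    for action, n in enumerate(counts):
        if n == 0:
            return action

    total = sum(counts)

    def score(action):
        # Average return plus an exploration bonus that shrinks with more visits.
        bonus = c * math.sqrt(math.log(total) / counts[action])
        return values[action] + bonus

    return max(range(len(counts)), key=score)
```

With `c = sqrt(2)` the bonus equals the familiar `sqrt(2 ln n / n_a)` term of UCB1; a larger `c` favors exploration, and `c = 0` reduces the rule to greedy selection.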
Automating the Construction of Internet Portals with Machine Learning
Information Retrieval, 2000
Cited by 169 (3 self)
Domain-specific internet portals are growing in popularity because they gather content from the Web and organize it for easy access, retrieval and search. For example, www.campsearch.com allows complex queries by age, location, cost and specialty over summer camps. This functionality is not possible with general, Web-wide search engines. Unfortunately these portals are difficult and time-consuming to maintain. This paper advocates the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific Internet portals. We describe new research in reinforcement learning, information extraction and text classification that enables efficient spidering, the identification of informative text segments, and the population of topic hierarchies. Using these techniques, we have built a demonstration system: a portal for computer science research papers. It already contains over 50,000 papers and is publicly available at www.cora.justresearch.com. These techniques are ...
Reinforcement Learning as Classification: Leveraging Modern Classifiers
In Proceedings of the Twentieth International Conference on Machine Learning, 2003
Cited by 60 (4 self)
The basic tools of machine learning appear in the inner loop of most reinforcement learning algorithms, typically in the form of Monte Carlo methods or function approximation techniques.
Programming backgammon using self-teaching neural nets
Artificial Intelligence, 2002
Cited by 41 (1 self)
TD-Gammon is a neural network that is able to teach itself to play backgammon solely by playing against itself and learning from the results. Starting from random initial play, TD-Gammon’s self-teaching methodology results in a surprisingly strong program: without lookahead, its positional judgement rivals that of human experts, and when combined with shallow lookahead, it reaches a level of play that surpasses even the best human players. The success of TD-Gammon has also been replicated by several other programmers; at least two other neural net programs also appear to be capable of superhuman play. Previous papers on TD-Gammon have focused on developing a scientific understanding of its reinforcement learning methodology. This paper views machine learning as a tool in a programmer’s toolkit, and considers how it can be combined with other programming techniques to achieve and surpass world-class backgammon play. Particular emphasis is placed on programming shallow-depth search algorithms, and on TD-Gammon’s doubling algorithm, which is described in print here for ...
Monte-Carlo Planning in Large POMDPs
In Advances in Neural Information Processing Systems 23, 2010
Cited by 37 (5 self)
This paper introduces a Monte-Carlo algorithm for online planning in large POMDPs. The algorithm combines a Monte-Carlo update of the agent’s belief state with a Monte-Carlo tree search from the current belief state. The new algorithm, POMCP, has two important properties. First, Monte-Carlo sampling is used to break the curse of dimensionality both during belief state updates and during planning. Second, only a black box simulator of the POMDP is required, rather than explicit probability distributions. These properties enable POMCP to plan effectively in significantly larger POMDPs than has previously been possible. We demonstrate its effectiveness in three large POMDPs. We scale up a well-known benchmark problem, rocksample, by several orders of magnitude. We also introduce two challenging new POMDPs: 10 × 10 battleship and partially observable PacMan, with approximately 10^18 and 10^56 states respectively. Our Monte-Carlo planning algorithm achieved a high level of performance with no prior knowledge, and was also able to exploit simple domain knowledge to achieve better results with less search. POMCP is the first general purpose planner to achieve high performance in such large and unfactored POMDPs.
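To make the idea of a Monte-Carlo belief update with only a black box simulator more concrete, here is a minimal sketch of one common approach: keep the belief as a bag of sampled states (particles) and accept a simulated successor only when its observation matches the real one. The names `update_belief`, `simulate`, and `toy_simulate`, the return shape of the simulator, and the `max_tries` cap are all assumptions for illustration; the abstract does not spell out these details.

```python
import random


def update_belief(particles, action, observation, simulate, n_particles, rng,
                  max_tries=100_000):
    """Approximate the next belief state by rejection sampling.

    simulate(state, action, rng) is a hypothetical black-box simulator that
    returns (next_state, observation). No explicit probabilities are needed.
    """
    new_particles = []
    for _ in range(max_tries):
        if len(new_particles) >= n_particles:
            break
        state = rng.choice(particles)            # sample a state from the belief
        next_state, simulated_obs = simulate(state, action, rng)
        if simulated_obs == observation:         # keep only consistent samples
            new_particles.append(next_state)
    return new_particles


# Toy simulator (assumed): the state is an integer, the action is added to it,
# and the agent only observes the parity of the new state.
def toy_simulate(state, action, rng):
    next_state = state + action
    return next_state, next_state % 2


belief = update_belief([0, 1, 2, 3], 1, 0, toy_simulate, 50, random.Random(0))
```

In the toy run the agent observes an even state, so every particle kept in `belief` is an even successor of the original particles.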
Learning domain-specific control knowledge from random walks
In Proceedings of the fourteenth international, 2004
Cited by 32 (4 self)
We describe and evaluate a system for learning domain-specific control knowledge. In particular, given a planning domain, the goal is to output a control policy that performs well on “long random walk” problem distributions. The system is based on viewing planning domains as very large Markov decision processes and then applying a recent variant of approximate policy iteration that is bootstrapped with a new technique based on random walks. We evaluate the system on the AIPS-2000 planning domains (among others) and show that often the learned policies perform well on problems drawn from the long-random-walk distribution. In addition, we show that these policies often perform well on the original problem distributions from the domains involved. Our evaluation also uncovers limitations of our current system that point to future challenges.
Bidding algorithms for simultaneous auctions: A case study
In Proceedings of Third ACM Conference on Electronic Commerce, 2001
Scheduling Straight-Line Code Using Reinforcement Learning and Rollouts
In Proceedings of Neural Information Processing Symposium, 1999
Cited by 18 (2 self)
The execution order of a block of computer instructions on a pipelined machine can make a difference in its running time by a factor of two or more. In order to achieve the best possible speed, compilers use heuristic schedulers appropriate to each specific architecture implementation. However, these heuristic schedulers are time-consuming and expensive to build. We present empirical results using both rollouts and reinforcement learning to construct heuristics for scheduling basic blocks. In simulation, the rollout scheduler outperformed a commercial scheduler, and the reinforcement learning scheduler performed almost as well as the commercial scheduler.
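The rollout idea referred to in this abstract can be sketched generically: for each candidate next step, finish the schedule with a simple base heuristic and pick the step whose completed schedule is cheapest. The code below is only a toy under stated assumptions; `rollout_choice`, the callback names, the `DURATIONS` data, and the sum-of-completion-times cost are hypothetical and do not model a pipelined machine or the paper's scheduler.

```python
def rollout_choice(state, legal_actions, apply_action, base_policy, is_done, cost):
    """Pick the action whose base-policy completion has the lowest cost."""
    def completed_cost(s):
        # Finish the schedule by following the base heuristic to the end.
        while not is_done(s):
            s = apply_action(s, base_policy(s))
        return cost(s)

    return min(legal_actions(state),
               key=lambda a: completed_cost(apply_action(state, a)))


# Toy problem (assumed): order three jobs with these durations so that the
# sum of completion times is small.
DURATIONS = (3, 1, 2)


def legal_actions(order):
    return [d for d in DURATIONS if d not in order]


def apply_action(order, duration):
    return order + (duration,)


def base_policy(order):
    # Naive heuristic: take the remaining jobs in their listed order.
    return legal_actions(order)[0]


def is_done(order):
    return len(order) == len(DURATIONS)


def cost(order):
    total, elapsed = 0, 0
    for duration in order:
        elapsed += duration
        total += elapsed
    return total


first_job = rollout_choice((), legal_actions, apply_action, base_policy, is_done, cost)
```

Even though the base heuristic here is naive, evaluating each first step by its completed schedule lets the rollout pick the shortest job first in this toy instance.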
A unifying framework for computational reinforcement learning theory
2009
Cited by 18 (6 self)
Computational learning theory studies mathematical models that allow one to formally analyze and compare the performance of supervised-learning algorithms such as their sample complexity. While existing models such as PAC (Probably Approximately Correct) have played an influential role in understanding the nature of supervised learning, they have not been as successful in reinforcement learning (RL). Here, the fundamental barrier is the need for active exploration in sequential decision problems. An RL agent tries to maximize long-term utility by exploiting its knowledge about the problem, but this knowledge has to be acquired by the agent itself through exploring the problem that may reduce short-term utility. The need for active exploration is common in many problems in daily life, engineering, and sciences. For example, a Backgammon program strives to take good moves to maximize the probability of winning a game, but sometimes it may try novel and possibly harmful moves to discover how the opponent reacts in the hope of discovering a better game-playing strategy. It has been known since the early days of RL that a good tradeoff between exploration and exploitation is critical for the agent to learn fast (i.e., to reach near-optimal strategies ...
Value Function Based Production Scheduling
In International Conference on Machine Learning, 1998
Cited by 17 (1 self)
Production scheduling, the problem of sequentially configuring a factory to meet forecasted demands, is a critical problem throughout the manufacturing industry. The requirement of maintaining product inventories in the face of unpredictable demand and stochastic factory output makes standard scheduling models, such as job-shop, inadequate. Currently applied algorithms, such as simulated annealing and constraint propagation, must employ ad-hoc methods such as frequent replanning to cope with uncertainty. In this paper, we describe a Markov Decision Process (MDP) formulation of production scheduling which captures stochasticity in both production and demands. The solution to this MDP is a value function which can be used to generate optimal scheduling decisions online. A simple example illustrates the theoretical superiority of this approach over replanning-based methods. We then describe an industrial application and two reinforcement learning methods for generating an approximate valu...