Results 1 – 10 of 55
Bandit based Monte-Carlo Planning
In: ECML-06, Number 4212 in LNCS, 2006
"... Abstract. For large statespace Markovian Decision Problems MonteCarlo planning is one of the few viable approaches to find nearoptimal solutions. In this paper we introduce a new algorithm, UCT, that applies bandit ideas to guide MonteCarlo planning. In finitehorizon or discounted MDPs the algo ..."
Abstract

Cited by 433 (7 self)
 Add to MetaCart
Abstract. For large state-space Markovian Decision Problems Monte-Carlo planning is one of the few viable approaches to find near-optimal solutions. In this paper we introduce a new algorithm, UCT, that applies bandit ideas to guide Monte-Carlo planning. In finite-horizon or discounted MDPs the algorithm is shown to be consistent and finite sample bounds are derived on the estimation error due to sampling. Experimental results show that in several domains, UCT is significantly more efficient than its alternatives.
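For illustration (a sketch, not code from the paper itself), the action-selection rule at the heart of UCT applies a UCB1-style score at each tree node: the empirical mean reward of an action plus an exploration bonus that shrinks as the action is tried more often. The exploration constant `c` and the `(total_reward, visits)` layout below are assumptions:

```python
import math

def uct_select(parent_visits, children, c=1.4):
    """children: list of (total_reward, visits) tuples, one per action.
    Returns the index of the action maximizing the empirical mean reward
    plus the exploration bonus c * sqrt(ln(parent_visits) / visits)."""
    best_action, best_score = 0, float("-inf")
    for action, (total, visits) in enumerate(children):
        if visits == 0:
            return action          # expand untried actions first
        score = total / visits + c * math.sqrt(math.log(parent_visits) / visits)
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```

In a full Monte-Carlo tree search this rule would be applied at every node along the simulated trajectory, with the rollout return backed up through the visited nodes.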
Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms
Machine Learning, 1998
"... An important application of reinforcement learning (RL) is to finitestate control problems and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform e ..."
Abstract

Cited by 154 (7 self)
 Add to MetaCart
An important application of reinforcement learning (RL) is to finite-state control problems and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning that result in convergence to both optimal values and optimal policies.
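As a minimal illustration of the class of algorithms studied (single-step on-policy methods such as Sarsa; the exact variants and conditions are in the paper, and `alpha`/`gamma` here are assumed parameters), one tabular Sarsa(0) update bootstraps on the action the exploring policy actually chose, not on the greedy maximum:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    """One Sarsa(0) step. Q maps (state, action) -> value. On-policy:
    the target uses a_next, the action selected by the behavior policy,
    so exploration cannot be separated from learning."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
```

Because the update depends on the behavior policy's own action choices, convergence guarantees must constrain the exploration schedule itself, which is exactly the issue the paper addresses.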
Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
1996
"... This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dyna ..."
Abstract

Cited by 129 (14 self)
 Add to MetaCart
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies, and can fall into suboptimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
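As a sketch of the method studied empirically above (simplified tabular R-learning in the style of Schwartz's original formulation; the step sizes `alpha`, `beta` and the dict layout are assumptions), the update maintains action values and an average-reward estimate `rho` jointly, adjusting `rho` only on greedy steps:

```python
def r_learning_update(Q, rho, s, a, r, s_next, alpha=0.1, beta=0.01):
    """One tabular R-learning step. Q[s][a] holds relative action values;
    rho is the running average-reward estimate. Returns the new rho;
    Q is modified in place."""
    max_next = max(Q[s_next].values())
    max_here = max(Q[s].values())
    greedy = Q[s][a] == max_here        # rho is updated only on greedy actions
    Q[s][a] += alpha * (r - rho + max_next - Q[s][a])
    if greedy:
        rho += beta * (r + max_next - max_here - rho)
    return rho
```

Note the structural point the overview makes: the average reward `rho` and the relative values `Q` are estimated independently, which is identified above as crucial to convergence.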
A Comparison of Reinforcement Learning Methods for Automatic Guided Vehicle Scheduling
"... ingly being used in manufacturing plants for transportation tasks. Optimal scheduling of AGVs is a difficult problem. A learning AGV is very attractive in a manufacturing plant since it is hard to manually optimize the scheduling algorithm to each new situation. In this paper we compare four rein ..."
reinforcement learning methods for scheduling AGVs. Q-learning [Watkins and Dayan 92] and R-learning [Schwartz 93] do not use action models. Q-learning optimizes the discounted total reward, while R-learning optimizes the average undiscounted reward per step. ARTDP [Barto et al., to appear] is a discounted method
Auto-exploratory Average Reward Reinforcement Learning
Artificial Intelligence, 1996
"... We introduce a modelbased average reward Reinforcement Learning method called Hlearning and compare it with its discounted counterpart, Adaptive RealTime Dynamic Programming, in a simulated robot scheduling task. We also introduce an extension to Hlearning, which automatically explores the unexp ..."
Cited by 36 (10 self)
... Programming (ARTDP) (Barto, Bradtke, & Singh 95), optimize the total discounted reward ...
Explanation-based learning and reinforcement learning: A unified view
In: Proceedings of the Twelfth International Conference on Machine Learning, 1995
"... Abstract. In speeduplearning problems, where full descriptions of operators are known, both explanationbased learning (EBL) and reinforcement learning (RL) methods can be applied. This paper shows that both methods involve fundamentally the same process of propagating information backward from the ..."
Abstract

Cited by 56 (3 self)
 Add to MetaCart
Abstract. In speedup-learning problems, where full descriptions of operators are known, both explanation-based learning (EBL) and reinforcement learning (RL) methods can be applied. This paper shows that both methods involve fundamentally the same process of propagating information backward from the goal toward the starting state. Most RL methods perform this propagation on a state-by-state basis, while EBL methods compute the weakest preconditions of operators, and hence, perform this propagation on a region-by-region basis. Barto, Bradtke, and Singh (1995) have observed that many algorithms for reinforcement learning can be viewed as asynchronous dynamic programming. Based on this observation, this paper shows how to develop dynamic programming versions of EBL, which we call region-based dynamic programming or Explanation-Based Reinforcement Learning (EBRL). The paper compares batch and online versions of EBRL to batch and online versions of point-based dynamic programming and to standard EBL. The results show that region-based dynamic programming combines the strengths of EBL (fast learning and the ability to scale to large state spaces) with the strengths of reinforcement learning algorithms (learning of optimal policies). Results are shown in chess endgames and in synthetic maze tasks.
Exploration of Multi-State Environments: Local Measures and Back-Propagation of Uncertainty
1998
"... . This paper presents an action selection technique for reinforcement learning in stationary Markovian environments. This technique may be used in direct algorithms such as Qlearning, or in indirect algorithms such as adaptive dynamic programming. It is based on two principles. The rst is to dene a ..."
Abstract

Cited by 52 (1 self)
 Add to MetaCart
This paper presents an action selection technique for reinforcement learning in stationary Markovian environments. This technique may be used in direct algorithms such as Q-learning, or in indirect algorithms such as adaptive dynamic programming. It is based on two principles. The first is to define a local measure of the uncertainty using the theory of bandit problems. We show that such a measure suffers from several drawbacks. In particular, a direct application of it leads to algorithms of low quality that can be easily misled by particular configurations of the environment. The second basic principle was introduced to eliminate this drawback. It consists of assimilating the local measures of uncertainty to rewards, and back-propagating them with the dynamic programming or temporal difference mechanisms. This allows reproducing global-scale reasoning about the uncertainty, using only local measures of it. Numerical simulations clearly show the efficiency of these propositions. Keywords: ...
Multi-criteria Reinforcement Learning
1998
"... We consider multicriteria sequential decision making problems where the vectorvalued evaluations are compared by a given, fixed total ordering. Conditions for the optimality of stationary policies and the Bellman optimality equation are given. The analysis requires special care as the topology int ..."
Abstract

Cited by 34 (0 self)
 Add to MetaCart
We consider multi-criteria sequential decision making problems where the vector-valued evaluations are compared by a given, fixed total ordering. Conditions for the optimality of stationary policies and the Bellman optimality equation are given. The analysis requires special care as the topology introduced by pointwise convergence and the order topology introduced by the preference order are in general incompatible. Reinforcement learning algorithms are proposed and analyzed. Preliminary computer experiments confirm the validity of the derived algorithms. It is observed that in the medium term multi-criteria RL often converges to better solutions (measured by the first criterion) than their single-criterion counterparts. These types of multi-criteria problems are most useful when there are several optimal solutions to a problem and one wants to choose the one among these which is optimal according to another fixed criterion. Example applications include alternating games, when in addition...
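To make the "fixed total ordering" concrete, a minimal sketch (lexicographic order is assumed here as one common instance; the paper treats the general case): vector-valued returns are compared so that the first criterion dominates and later criteria only break ties:

```python
def lex_better(v, w):
    """True if return vector v strictly precedes w in the lexicographic
    total order: the first criterion dominates; later criteria only
    break ties. Python tuples already compare lexicographically."""
    return tuple(v) > tuple(w)
```

Under such an ordering, a policy optimal for the first criterion can still be refined by the second, which matches the use case described above of choosing among several first-criterion-optimal solutions.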
Scaling Up Average Reward Reinforcement Learning by Approximating the Domain Models and the Value Function
In Saitta, 1996
"... Almost all the work in Averagereward Reinforcement Learning (ARL) so far has focused on tablebased methods which do not scale to domains with large state spaces. In this paper, we propose two extensions to a modelbased ARL method called Hlearning to address the scaleup problem. We extend Hlear ..."
Abstract

Cited by 30 (2 self)
 Add to MetaCart
Almost all the work in Average-reward Reinforcement Learning (ARL) so far has focused on table-based methods which do not scale to domains with large state spaces. In this paper, we propose two extensions to a model-based ARL method called H-learning to address the scale-up problem. We extend H-learning to learn action models and reward functions in the form of Bayesian networks, and approximate its value function using local linear regression. We test our algorithms on several scheduling tasks for a simulated Automatic Guided Vehicle (AGV) and show that they are effective in significantly reducing the space requirement of H-learning and making it converge faster. To the best of our knowledge, our results are the first in applying function approximation to ARL. 1 Introduction Most Reinforcement Learning (RL) methods optimize the discounted total reward received by an agent (Barto, Bradtke, & Singh, 1995; Watkins & Dayan, 1992). However, in many real-world domains, the natural criterio...
Potential-based shaping in model-based reinforcement learning
In: Proceedings of the AAAI Conference on Artificial Intelligence, 2008
"... Potentialbased shaping was designed as a way of introducing background knowledge into modelfree reinforcementlearning algorithms. By identifying states that are likely to have high value, this approach can decrease experience complexity—the number of trials needed to find nearoptimal behavior. A ..."
Abstract

Cited by 24 (2 self)
 Add to MetaCart
Potential-based shaping was designed as a way of introducing background knowledge into model-free reinforcement-learning algorithms. By identifying states that are likely to have high value, this approach can decrease experience complexity, i.e., the number of trials needed to find near-optimal behavior. An orthogonal way of decreasing experience complexity is to use a model-based learning approach, building and exploiting an explicit transition model. In this paper, we show how potential-based shaping can be redefined to work in the model-based setting to produce an algorithm that shares the benefits of both ideas.
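As a sketch of the underlying mechanism (the potential-based form originates with Ng, Harada & Russell 1999; the potential function `phi` and discount `gamma` below are assumed inputs, and this shows only the classic model-free form, not the paper's model-based redefinition), shaping replaces each reward with the reward plus a potential difference:

```python
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Potential-based reward shaping: the agent learns from
    r + gamma * phi(s_next) - phi(s) instead of r. For any potential
    function phi this transformation preserves the set of optimal
    policies of the original MDP."""
    return r + gamma * phi(s_next) - phi(s)
```

Choosing `phi` close to the optimal value function concentrates the shaping signal on the states described above as "likely to have high value," which is how shaping reduces experience complexity.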