Results 11-20 of 377
Technical update: Least-squares temporal difference learning
 Machine Learning
, 2002
Cited by 128 (2 self)
Abstract. TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it may make inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto (1996, Machine Learning, 22:1–3, 33–57) eliminates all stepsize parameters and improves data efficiency. This paper updates Bradtke and Barto’s work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting new algorithm is shown to be a practical, incremental formulation of supervised linear regression. Third, it presents a novel and intuitive interpretation of LSTD as a model-based reinforcement learning technique.
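The abstract above describes LSTD only in words. As a rough illustration, the following is a minimal batch sketch of one common formulation of LSTD(λ) with linear features, not necessarily the paper's exact procedure. The function name `lstd_lambda`, the optional `ridge` term, and the three-state chain used as data are assumptions made for this example.

```python
import numpy as np

def lstd_lambda(episodes, n_features, gamma=0.9, lam=0.0, ridge=0.0):
    """Batch LSTD(lambda) sketch: accumulate A and b, then solve A @ theta = b.

    episodes: list of episodes, each a list of (phi, reward, phi_next) tuples,
              where phi_next is the zero vector at a terminal transition.
    ridge:    optional diagonal term added to A (an assumption, for stability).
    """
    A = ridge * np.eye(n_features)
    b = np.zeros(n_features)
    for episode in episodes:
        z = np.zeros(n_features)                      # eligibility trace, reset per episode
        for phi, reward, phi_next in episode:
            z = gamma * lam * z + phi                 # decay the trace, add current features
            A += np.outer(z, phi - gamma * phi_next)
            b += z * reward
    return np.linalg.solve(A, b)                      # no step-size parameter involved

# Hypothetical data: deterministic chain 0 -> 1 -> 2 -> terminal, reward 1 per step,
# with one-hot (tabular) features.
features = np.eye(3)
terminal = np.zeros(3)
episode = [
    (features[0], 1.0, features[1]),
    (features[1], 1.0, features[2]),
    (features[2], 1.0, terminal),
]
theta = lstd_lambda([episode], n_features=3, gamma=0.9, lam=0.5)
```

With tabular features and deterministic transitions, the solved weights equal the true discounted values of the chain for any λ, which makes this a convenient sanity check.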
Least-Squares Temporal Difference Learning
 In Proceedings of the Sixteenth International Conference on Machine Learning
, 1999
Cited by 118 (0 self)
TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it makes inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto (1996) eliminates all stepsize parameters and improves data efficiency. This paper extends Bradtke and Barto's work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting algorithm is shown to be a practical formulation of supervised linear regression. Third, it presents a novel, intuitive interpretation of LSTD as a model-based reinforcement learning technique.
Incremental multi-step Q-learning
, 1996
Cited by 112 (3 self)
Abstract. This paper presents a novel incremental algorithm that combines Q-learning, a well-known dynamic-programming based reinforcement learning method, with the TD(λ) return estimation process, which is typically used in actor-critic learning, another well-known dynamic-programming based reinforcement learning method. The parameter λ is used to distribute credit throughout sequences of actions, leading to faster learning and also helping to alleviate the non-Markovian effect of coarse state-space quantization. The resulting algorithm, Q(λ)-learning, thus combines some of the best features of the Q-learning and actor-critic learning paradigms. The behavior of this algorithm has been demonstrated through computer simulations.
Map Learning and High-Speed Navigation in RHINO
, 1998
Cited by 108 (32 self)
This chapter surveys basic methods for learning maps and high speed autonomous navigation for indoor mobile robots. The methods have been developed in our lab over the past few years, and most of them have been tested thoroughly in various indoor environments. The chapter is targeted towards researchers and engineers who attempt to build reliable mobile robot navigation software.
A Bayesian framework for reinforcement learning
 In Proceedings of the Seventeenth International Conference on Machine Learning
, 2000
Cited by 106 (1 self)
The reinforcement learning problem can be decomposed into two parallel types of inference: (i) estimating the parameters of a model for the underlying process; (ii) determining behavior which maximizes return under the estimated model. Following Dearden, Friedman and Andre (1999), it is proposed that the learning process estimates online the full posterior distribution over models. To determine behavior, a hypothesis is sampled from this distribution and the greedy policy with respect to the hypothesis is obtained by dynamic programming. By using a different hypothesis for each trial, appropriate exploratory and exploitative behavior is obtained. This Bayesian method always converges to the optimal policy for a stationary process with discrete states.
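The sample-then-plan loop described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: transitions get an independent Dirichlet posterior per state-action pair, rewards are treated as known, and dynamic programming is plain value iteration. The name `sample_greedy_policy` and all its parameters are hypothetical.

```python
import numpy as np

def sample_greedy_policy(counts, rewards, gamma=0.95, prior=1.0, iters=500, rng=None):
    """Sample one model hypothesis from the posterior and return its greedy policy.

    counts:  array (S, A, S) of observed transition counts.
    rewards: array (S, A) of rewards, assumed known for this sketch.
    prior:   symmetric Dirichlet pseudo-count (an assumption).
    """
    rng = np.random.default_rng() if rng is None else rng
    n_states, n_actions, _ = counts.shape
    # Draw one transition model from the Dirichlet posterior.
    P = np.empty_like(counts, dtype=float)
    for s in range(n_states):
        for a in range(n_actions):
            P[s, a] = rng.dirichlet(counts[s, a] + prior)
    # Value iteration on the sampled model.
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = rewards + gamma * (P @ V)   # shape (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)             # greedy action for each state

# Hypothetical two-state, two-action problem with no data yet:
# action 1 always pays 1, action 0 pays 0.
counts = np.zeros((2, 2, 2))
rewards = np.array([[0.0, 1.0], [0.0, 1.0]])
policy = sample_greedy_policy(counts, rewards, rng=np.random.default_rng(0))
```

Calling the function once per trial, after updating `counts` with the transitions observed so far, gives a different hypothesis for each trial as the abstract describes.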
Model based Bayesian Exploration
 In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence
, 1999
Cited by 103 (0 self)
Reinforcement learning systems are often concerned with balancing exploration of untested actions against exploitation of actions that are known to be good. The benefit of exploration can be estimated using the classical notion of Value of Information: the expected improvement in future decision quality arising from the information acquired by exploration. Estimating this quantity requires an assessment of the agent's uncertainty about its current value estimates for states. In this paper we investigate ways to represent and reason about this uncertainty in algorithms where the system attempts to learn a model of its environment. We explicitly represent uncertainty about the parameters of the model and build probability distributions over Q-values based on these. These distributions are used to compute a myopic approximation to the value of information for each action and hence to select the action that best balances exploration and exploitation.
Learning Maps for Indoor Mobile Robot Navigation
 ARTIFICIAL INTELLIGENCE (ACCEPTED FOR PUBLICATION)
, 1997
Cited by 91 (10 self)
Autonomous robots must be able to learn and maintain models of their environments. Research on mobile robot navigation has produced two major paradigms for mapping indoor environments: grid-based and topological. While grid-based methods produce accurate metric maps, their complexity often prohibits efficient planning and problem solving in large-scale indoor environments. Topological maps, on the other hand, can be used much more efficiently, yet accurate and consistent topological maps are often difficult to learn and maintain in large-scale environments, particularly if momentary sensor data is highly ambiguous. This paper describes an approach that integrates both paradigms: grid-based and topological. Grid-based maps are learned using artificial neural networks and naive Bayesian integration. Topological maps are generated on top of the grid-based maps, by partitioning the latter into coherent regions. By combining both paradigms, the approach presented here gains advantages from both worlds: accuracy/consistency and efficiency. The paper gives results for autonomous exploration, mapping and operation of a mobile robot in populated multi-room environments.
Multi-time Models for Temporally Abstract Planning
 In Advances in Neural Information Processing Systems 10
, 1997
Cited by 88 (9 self)
Doina Precup, Richard S. Sutton, University of Massachusetts, Amherst, MA 01003, {dprecup|rich}@cs.umass.edu
Planning and learning at multiple levels of temporal abstraction is a key problem for artificial intelligence. In this paper we summarize an approach to this problem based on the mathematical framework of Markov decision processes and reinforcement learning. Current model-based reinforcement learning is based on one-step models that cannot represent commonsense higher-level actions, such as going to lunch, grasping an object, or flying to Denver. This paper generalizes prior work on temporally abstract models [Sutton, 1995] and extends it from the prediction setting to include actions, control, and planning. We introduce a more general form of temporally abstract model, the multi-time model, and establish its suitability for planning and learning by virtue of its relationship to the Bellman equations. This paper summarizes the theoretical framewo...
Accelerating Reinforcement Learning through Implicit Imitation
 JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH
, 2003
Cited by 78 (0 self)
Imitation can be viewed as a means of enhancing learning in multiagent environments. It augments
How to Dynamically Merge Markov Decision Processes
, 1997
Cited by 76 (2 self)
We are frequently called upon to perform multiple tasks that compete for our attention and resource. Often we know the optimal solution to each task in isolation; in this paper, we describe how this knowledge can be exploited to efficiently find good solutions for doing the tasks in parallel. We formulate this problem as that of dynamically merging multiple Markov decision processes (MDPs) into a composite MDP, and present a new theoretically sound dynamic programming algorithm for finding an optimal policy for the composite MDP. We analyze various aspects of our algorithm and illustrate its use on a simple merging problem. Every day, we are faced with the problem of doing multiple tasks in parallel, each of which competes for our attention and resource. If we are running a job shop, we must decide which machines to allocate to which jobs, and in what order, so that no jobs miss their deadlines. If we are a mail delivery robot, we must find the intended recipients of the mail while simul...