Results 11-20 of 377
Technical update: Least-squares temporal difference learning
 Machine Learning
, 2002
Cited by 131 (2 self)
Abstract. TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it may make inefficient use of data, and it requires the user to manually tune a step-size schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto (1996, Machine Learning, 22:1–3, 33–57) eliminates all step-size parameters and improves data efficiency. This paper updates Bradtke and Barto’s work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting new algorithm is shown to be a practical, incremental formulation of supervised linear regression. Third, it presents a novel and intuitive interpretation of LSTD as a model-based reinforcement learning technique.
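To make the idea in this abstract concrete, here is a minimal batch sketch of LSTD(λ) for a linear value function V(s) = θ·φ(s). It is an illustration under stated assumptions, not the paper's own code: the names `lstd_lambda` and `solve`, the discount parameter `gamma` (defaulting to 1), the episode format, and the two-state example chain are all hypothetical. The point it shows is that θ comes from solving one linear system, so there is no step size to tune.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("A is singular; more data or regularization needed")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def lstd_lambda(episodes, n_features, lam=0.0, gamma=1.0):
    """Batch LSTD(lambda): accumulate A and b, then solve A theta = b.

    episodes: list of trajectories, each a list of (phi, reward, phi_next)
    tuples; phi_next is None on the transition into a terminal state.
    """
    A = [[0.0] * n_features for _ in range(n_features)]
    b = [0.0] * n_features
    for episode in episodes:
        z = [0.0] * n_features  # eligibility trace, reset at episode start
        for phi, reward, phi_next in episode:
            if phi_next is None:
                phi_next = [0.0] * n_features
            z = [gamma * lam * zi + pi for zi, pi in zip(z, phi)]
            for i in range(n_features):
                b[i] += z[i] * reward
                for j in range(n_features):
                    A[i][j] += z[i] * (phi[j] - gamma * phi_next[j])
    return solve(A, b)


# Hypothetical chain: s0 -(reward 1)-> s1 -(reward 2)-> terminal, one-hot features.
chain = [[([1.0, 0.0], 1.0, [0.0, 1.0]), ([0.0, 1.0], 2.0, None)]]
theta_td0 = lstd_lambda(chain, n_features=2, lam=0.0)
theta_mc = lstd_lambda(chain, n_features=2, lam=1.0)
```

On this deterministic chain both λ = 0 and λ = 1 recover the true values V(s0) = 3 and V(s1) = 2. With noisy data and non-tabular features the two settings would generally give different estimates.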
Variable Resolution Discretization in Optimal Control
 MACHINE LEARNING
, 2001
Cited by 128 (3 self)
The problem of state abstraction is of central importance in optimal control, reinforcement learning and Markov decision processes. This paper studies the case of variable resolution state abstraction for continuous time and space, deterministic dynamic control problems in which near-optimal policies are required. We begin by defining a class of variable resolution policy and value function representations based on Kuhn triangulations embedded in a kd-trie. We then consider top-down approaches to choosing which cells to split in order to generate improved policies. The core of this paper is the introduction and evaluation of a wide variety of possible splitting criteria. We begin with local approaches based on value function and policy properties that use only features of individual cells in making split choices. Later, by introducing two new non-local measures, influence and variance, we derive splitting criteria that allow one cell to efficiently take into account its impact on other cells when deciding whether to split. Influence is an efficiently-calculable measure of the extent to which changes in some state affect the value function of some other states. Variance is an efficiently-calculable measure of how risky some state in a Markov chain is: a low-variance state is one in which we would be very surprised if, during any one execution, the long-term reward attained from that state differed substantially from its expected value, given by the value function. The paper proceeds by graphically demonstrating the various approaches to splitting on the familiar, nonlinear, non-minimum-phase, and two-dimensional problem of the "Car on the hill". It then evaluates the performance of a variety of splitting criteria on many benchmark problems, paying careful attention to their number...
Least-Squares Temporal Difference Learning
 In Proceedings of the Sixteenth International Conference on Machine Learning
, 1999
Cited by 116 (0 self)
TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it makes inefficient use of data, and it requires the user to manually tune a step-size schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto (1996) eliminates all step-size parameters and improves data efficiency. This paper extends Bradtke and Barto's work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting algorithm is shown to be a practical formulation of supervised linear regression. Third, it presents a novel, intuitive interpretation of LSTD as a model-based reinforcement learning technique.
R.: Incremental multi-step Q-learning
, 1996
Cited by 111 (2 self)
Abstract. This paper presents a novel incremental algorithm that combines Q-learning, a well-known dynamic-programming based reinforcement learning method, with the TD(λ) return estimation process, which is typically used in actor-critic learning, another well-known dynamic-programming based reinforcement learning method. The parameter λ is used to distribute credit throughout sequences of actions, leading to faster learning and also helping to alleviate the non-Markovian effect of coarse state-space quantization. The resulting algorithm, Q(λ)-learning, thus combines some of the best features of the Q-learning and actor-critic learning paradigms. The behavior of this algorithm has been demonstrated through computer simulations.
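The credit-distribution role of λ can be illustrated with a small tabular sketch. This is written in the spirit of the abstract and should not be read as the paper's exact update rule, which may differ in details. The function name `q_lambda_step`, the table layout, and the two-state example are assumptions made for illustration. Each step applies a one-step Q-learning correction to the current state-action pair and pushes a trace-weighted correction back to earlier pairs.

```python
def q_lambda_step(Q, trace, s, a, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    """Apply one Q(lambda)-style update in place.

    Q and trace are lists of per-state lists indexed [state][action];
    s_next is None when the episode has terminated. Traces should be
    reset to zero at the start of every episode.
    """
    v_next = 0.0 if s_next is None else max(Q[s_next])
    target = r + gamma * v_next
    err_trace = target - max(Q[s])  # error propagated to earlier pairs
    err_now = target - Q[s][a]      # one-step error for the current pair
    for state in range(len(Q)):
        for action in range(len(Q[state])):
            trace[state][action] *= gamma * lam
            Q[state][action] += alpha * trace[state][action] * err_trace
    Q[s][a] += alpha * err_now
    trace[s][a] += 1.0


# Hypothetical two-state chain where only the final step is rewarded.
Q = [[0.0, 0.0], [0.0, 0.0]]
trace = [[0.0, 0.0], [0.0, 0.0]]
q_lambda_step(Q, trace, s=0, a=0, r=0.0, s_next=1, alpha=0.5, gamma=1.0, lam=1.0)
q_lambda_step(Q, trace, s=1, a=0, r=1.0, s_next=None, alpha=0.5, gamma=1.0, lam=1.0)
```

After this single episode, the first state's action already carries credit for the final reward. With λ = 0 the trace term vanishes and the first state's estimate would stay at zero until a later episode.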
A Bayesian framework for reinforcement learning
 In Proceedings of the Seventeenth International Conference on Machine Learning
, 2000
Cited by 109 (1 self)
The reinforcement learning problem can be decomposed into two parallel types of inference: (i) estimating the parameters of a model for the underlying process; (ii) determining behavior which maximizes return under the estimated model. Following Dearden, Friedman and Andre (1999), it is proposed that the learning process estimates online the full posterior distribution over models. To determine behavior, a hypothesis is sampled from this distribution and the greedy policy with respect to the hypothesis is obtained by dynamic programming. By using a different hypothesis for each trial, appropriate exploratory and exploitative behavior is obtained. This Bayesian method always converges to the optimal policy for a stationary process with discrete states.
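The sample-then-plan loop described in this abstract can be sketched as follows. This is a minimal illustration under assumptions not stated in the abstract: a Dirichlet posterior over transitions built from visit counts, known rewards, and value iteration as the dynamic programming step. The names `sample_transition_model` and `greedy_policy` and the two-state example are hypothetical.

```python
import random


def sample_transition_model(counts, prior=1.0, rng=random):
    """Draw one transition model from independent Dirichlet posteriors.

    counts[s][a][s2] is the number of observed s -a-> s2 transitions.
    """
    model = []
    for s_counts in counts:
        row = []
        for a_counts in s_counts:
            # Normalized Gamma draws give a Dirichlet sample.
            draws = [rng.gammavariate(c + prior, 1.0) for c in a_counts]
            total = sum(draws)
            row.append([d / total for d in draws])
        model.append(row)
    return model


def greedy_policy(model, rewards, gamma=0.95, sweeps=500):
    """Value iteration on a sampled model; returns the greedy action per state."""
    n_states = len(model)
    n_actions = len(model[0])
    values = [0.0] * n_states

    def q(s, a):
        expected = sum(p * values[s2] for s2, p in enumerate(model[s][a]))
        return rewards[s][a] + gamma * expected

    for _ in range(sweeps):
        values = [max(q(s, a) for a in range(n_actions)) for s in range(n_states)]
    return [max(range(n_actions), key=lambda a: q(s, a)) for s in range(n_states)]


# Hypothetical problem: action 1 tends to lead to state 1, and only
# (state 1, action 1) is rewarded. One hypothesis is sampled per trial.
counts = [[[50, 0], [0, 50]], [[50, 0], [0, 50]]]
rewards = [[0.0, 0.0], [0.0, 1.0]]
rng = random.Random(0)
model = sample_transition_model(counts, rng=rng)
policy = greedy_policy(model, rewards)
```

With few observations the sampled models vary widely between trials, which produces exploration. As counts grow, the samples concentrate and the greedy policy stabilizes.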
Map Learning and High-Speed Navigation in RHINO
, 1998
Cited by 108 (32 self)
This chapter surveys basic methods for learning maps and high speed autonomous navigation for indoor mobile robots. The methods have been developed in our lab over the past few years, and most of them have been tested thoroughly in various indoor environments. The chapter is targeted towards researchers and engineers who attempt to build reliable mobile robot navigation software.
Model based Bayesian Exploration
 In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence
, 1999
Cited by 102 (0 self)
Reinforcement learning systems are often concerned with balancing exploration of untested actions against exploitation of actions that are known to be good. The benefit of exploration can be estimated using the classical notion of Value of Information: the expected improvement in future decision quality arising from the information acquired by exploration. Estimating this quantity requires an assessment of the agent's uncertainty about its current value estimates for states. In this paper we investigate ways to represent and reason about this uncertainty in algorithms where the system attempts to learn a model of its environment. We explicitly represent uncertainty about the parameters of the model and build probability distributions over Q-values based on these. These distributions are used to compute a myopic approximation to the value of information for each action and hence to select the action that best balances exploration and exploitation.
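A myopic value-of-information estimate of the kind this abstract describes can be sketched from samples of each action's Q-value distribution. The sketch assumes the samples are already available (how the paper derives them from model uncertainty is not shown here) and assumes at least two actions. The names `myopic_vpi` and `select_action` and the sample values are hypothetical. Information about an action is counted as valuable only when it would change which action looks best.

```python
def myopic_vpi(q_samples):
    """Estimate the myopic value of information for each action in one state.

    q_samples[a] is a list of sampled Q-values for action a.
    Returns (mean Q-values, VPI estimates).
    """
    means = [sum(qs) / len(qs) for qs in q_samples]
    order = sorted(range(len(means)), key=lambda a: means[a], reverse=True)
    best, second = order[0], order[1]
    vpi = []
    for a, qs in enumerate(q_samples):
        if a == best:
            # Gain only if the best action turns out worse than the runner-up.
            gains = [max(means[second] - q, 0.0) for q in qs]
        else:
            # Gain only if this action turns out better than the current best.
            gains = [max(q - means[best], 0.0) for q in qs]
        vpi.append(sum(gains) / len(gains))
    return means, vpi


def select_action(q_samples):
    """Pick the action maximizing expected value plus value of information."""
    means, vpi = myopic_vpi(q_samples)
    return max(range(len(means)), key=lambda a: means[a] + vpi[a])


# Action 0 is well known; action 1 has a lower mean but a wide distribution.
samples = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 3.6]]
means, vpi = myopic_vpi(samples)
chosen = select_action(samples)
```

Here the uncertain action is chosen despite its lower mean, because learning its true value could overturn the current ranking.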
Learning Maps for Indoor Mobile Robot Navigation
 ARTIFICIAL INTELLIGENCE (ACCEPTED FOR PUBLICATION)
, 1997
Cited by 92 (10 self)
Autonomous robots must be able to learn and maintain models of their environments. Research on mobile robot navigation has produced two major paradigms for mapping indoor environments: grid-based and topological. While grid-based methods produce accurate metric maps, their complexity often prohibits efficient planning and problem solving in large-scale indoor environments. Topological maps, on the other hand, can be used much more efficiently, yet accurate and consistent topological maps are often difficult to learn and maintain in large-scale environments, particularly if momentary sensor data is highly ambiguous. This paper describes an approach that integrates both paradigms: grid-based and topological. Grid-based maps are learned using artificial neural networks and naive Bayesian integration. Topological maps are generated on top of the grid-based maps, by partitioning the latter into coherent regions. By combining both paradigms, the approach presented here gains advantages from both worlds: accuracy/consistency and efficiency. The paper gives results for autonomous exploration, mapping and operation of a mobile robot in populated multi-room environments.
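The "naive Bayesian integration" step mentioned in this abstract can be illustrated for a single grid cell. The sketch assumes each sensor reading has already been converted into an occupancy probability (in the paper this interpretation is done by neural networks, which are not modeled here) and that readings are conditionally independent given the cell's true state. The name `fuse_occupancy` and the example values are hypothetical.

```python
import math


def fuse_occupancy(readings, prior=0.5):
    """Combine per-reading occupancy probabilities for one grid cell.

    readings: probabilities strictly between 0 and 1, each an estimate of
    P(occupied | one sensor reading). Under the independence assumption the
    evidence adds in log-odds space.
    """
    prior_logit = math.log(prior / (1.0 - prior))
    logit = prior_logit
    for p in readings:
        logit += math.log(p / (1.0 - p)) - prior_logit
    return 1.0 / (1.0 + math.exp(-logit))


uninformative = fuse_occupancy([0.5, 0.5])  # readings carry no evidence
reinforced = fuse_occupancy([0.8, 0.8])     # two agreeing readings
conflicting = fuse_occupancy([0.8, 0.2])    # evidence cancels out
```

Two agreeing readings of 0.8 combine to about 0.94, while a 0.8 and a 0.2 cancel back to the prior. Working in log-odds keeps the update a simple sum per reading.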
Multi-time Models for Temporally Abstract Planning
 In Advances in Neural Information Processing Systems 10
, 1997
Cited by 85 (9 self)
Doina Precup, Richard S. Sutton. University of Massachusetts, Amherst, MA 01003. {dprecup|rich}@cs.umass.edu. Abstract. Planning and learning at multiple levels of temporal abstraction is a key problem for artificial intelligence. In this paper we summarize an approach to this problem based on the mathematical framework of Markov decision processes and reinforcement learning. Current model-based reinforcement learning is based on one-step models that cannot represent commonsense higher-level actions, such as going to lunch, grasping an object, or flying to Denver. This paper generalizes prior work on temporally abstract models [Sutton, 1995] and extends it from the prediction setting to include actions, control, and planning. We introduce a more general form of temporally abstract model, the multi-time model, and establish its suitability for planning and learning by virtue of its relationship to the Bellman equations. This paper summarizes the theoretical framewo...
Accelerating Reinforcement Learning through Implicit Imitation
 JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH
, 2003
Cited by 79 (0 self)
Imitation can be viewed as a means of enhancing learning in multiagent environments. It augments