Results 11  20
of
257
LeastSquares Temporal Difference Learning
 In Proceedings of the Sixteenth International Conference on Machine Learning
, 1999
"... TD() is a popular family of algorithms for approximate policy evaluation in large MDPs. TD() works by incrementally updating the value function after each observed transition. It has two major drawbacks: it makes inefficient use of data, and it requires the user to manually tune a stepsize schedule ..."
Abstract

Cited by 95 (0 self)
 Add to MetaCart
TD() is a popular family of algorithms for approximate policy evaluation in large MDPs. TD() works by incrementally updating the value function after each observed transition. It has two major drawbacks: it makes inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and = 0, the LeastSquares TD (LSTD) algorithm of Bradtke and Barto (Bradtke and Barto, 1996) eliminates all stepsize parameters and improves data efficiency. This paper extends Bradtke and Barto's work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from = 0 to arbitrary values of ; at the extreme of = 1, the resulting algorithm is shown to be a practical formulation of supervised linear regression. Third, it presents a novel, intuitive interpretation of LSTD as a modelbased reinforcement learning technique.
Model based Bayesian Exploration
 In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence
, 1999
"... Reinforcement learning systems are often concerned with balancing exploration of untested actions against exploitation of actions that are known to be good. The benefit of exploration can be estimated using the classical notion of Value of Information  the expected improvement in future deci ..."
Abstract

Cited by 89 (0 self)
 Add to MetaCart
Reinforcement learning systems are often concerned with balancing exploration of untested actions against exploitation of actions that are known to be good. The benefit of exploration can be estimated using the classical notion of Value of Information  the expected improvement in future decision quality arising from the information acquired by exploration. Estimating this quantity requires an assessment of the agent's uncertainty about its current value estimates for states. In this paper we investigate ways to represent and reason about this uncertainty in algorithms where the system attempts to learn a model of its environment. We explicitly represent uncertainty about the parameters of the model and build probability distributions over Qvalues based on these. These distributions are used to compute a myopic approximation to the value of information for each action and hence to select the action that best balances exploration and exploitation. 1 Introduction Rei...
Technical update: Leastsquares temporal difference learning
 Machine Learning
, 2002
"... Abstract. TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it may make inefficient use of data, and it requires the user to manually tune a ste ..."
Abstract

Cited by 89 (2 self)
 Add to MetaCart
Abstract. TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it may make inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the LeastSquares TD (LSTD) algorithm of Bradtke and Barto (1996, Machine learning, 22:1–3, 33–57) eliminates all stepsize parameters and improves data efficiency. This paper updates Bradtke and Barto’s work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting new algorithm is shown to be a practical, incremental formulation of supervised linear regression. Third, it presents a novel and intuitive interpretation of LSTD as a modelbased reinforcement learning technique.
Incremental MultiStep QLearning
 Machine Learning
, 1996
"... . This paper presents a novel incremental algorithm that combines Qlearning, a wellknown dynamic programmingbased reinforcement learning method, with the TD() return estimation process, which is typically used in actorcritic learning, another wellknown dynamic programmingbased reinforcement le ..."
Abstract

Cited by 88 (2 self)
 Add to MetaCart
. This paper presents a novel incremental algorithm that combines Qlearning, a wellknown dynamic programmingbased reinforcement learning method, with the TD() return estimation process, which is typically used in actorcritic learning, another wellknown dynamic programmingbased reinforcement learning method. The parameter is used to distribute credit throughout sequences of actions, leading to faster learning and also helping to alleviate the nonMarkovian effect of coarse statespace quantization. The resulting algorithm, Q()learning, thus combines some of the best features of the Qlearning and actorcritic learning paradigms. The behavior of this algorithm has been demonstrated through computer simulations. Keywords: reinforcement learning, temporal difference learning 1. Introduction The incremental multistep Qlearning (Q()learning) method is a new direct (or modelfree) algorithm that extends the onestep Qlearning algorithm (Watkins 1989) by combining it with TD() ret...
Learning Maps for Indoor Mobile Robot Navigation
 ARTIFICIAL INTELLIGENCE (ACCEPTED FOR PUBLICATION)
, 1997
"... Autonomous robots must be able to learn and maintain models of their environments. Research on mobile robot navigation has produced two major paradigms for mapping indoor environments: gridbased and topological. While gridbased methods produce accurate metric maps, their complexity often prohibits ..."
Abstract

Cited by 83 (12 self)
 Add to MetaCart
Autonomous robots must be able to learn and maintain models of their environments. Research on mobile robot navigation has produced two major paradigms for mapping indoor environments: gridbased and topological. While gridbased methods produce accurate metric maps, their complexity often prohibits efficient planning and problem solving in largescale indoor environments. Topological maps, on the other hand, can be used much more efficiently, yet accurate and consistent topological maps are often difficult to learn and maintain in largescale environments, particularly if momentary sensor data is highly ambiguous. This paper describes an approach that integrates both paradigms: gridbased and topological. Gridbased maps are learned using artificial neural networks and naive Bayesian integration. Topological maps are generated on top of the gridbased maps, by partitioning the latter into coherent regions. By combining both paradigms, the approach presented here gains advantages from both worlds: accuracy/consistency and efficiency. The paper gives results for autonomous exploration, mapping and operation of a mobile robot in populated multiroom environments.
Multitime Models for Temporally Abstract Planning
 In Advances in Neural Information Processing Systems 10
, 1997
"... Planning Doina Precup, Richard S. Sutton University of Massachusetts Amherst, MA 01003 fdprecupjrichg@cs.umass.edu Abstract Planning and learning at multiple levels of temporal abstraction is a key problem for artificial intelligence. In this paper we summarize an approach to this problem ba ..."
Abstract

Cited by 77 (8 self)
 Add to MetaCart
Planning Doina Precup, Richard S. Sutton University of Massachusetts Amherst, MA 01003 fdprecupjrichg@cs.umass.edu Abstract Planning and learning at multiple levels of temporal abstraction is a key problem for artificial intelligence. In this paper we summarize an approach to this problem based on the mathematical framework of Markov decision processes and reinforcement learning. Current modelbased reinforcement learning is based on onestep models that cannot represent commonsense higherlevel actions, such as going to lunch, grasping an object, or flying to Denver. This paper generalizes prior work on temporally abstract models [Sutton, 1995] and extends it from the prediction setting to include actions, control, and planning. We introduce a more general form of temporally abstract model, the multitime model, and establish its suitability for planning and learning by virtue of its relationship to the Bellman equations. This paper summarizes the theoretical framewo...
A Bayesian framework for reinforcement learning
 In Proceedings of the Seventeenth International Conference on Machine Learning
, 2000
"... The reinforcement learning problem can be decomposed into two parallel types of inference: (i) estimating the parameters of a model for the underlying process; (ii) determining behavior which maximizes return under the estimated model. Following Dearden, Friedman and Andre (1999), it is proposed tha ..."
Abstract

Cited by 74 (1 self)
 Add to MetaCart
The reinforcement learning problem can be decomposed into two parallel types of inference: (i) estimating the parameters of a model for the underlying process; (ii) determining behavior which maximizes return under the estimated model. Following Dearden, Friedman and Andre (1999), it is proposed that the learning process estimates online the full posterior distribution over models. To determine behavior, a hypothesis is sampled from this distribution and the greedy policy with respect to the hypothesis is obtained by dynamic programming. By using a different hypothesis for each trial appropriate exploratory and exploitative behavior is obtained. This Bayesian method always converges to the optimal policy for a stationary process with discrete states. 1.
How to Dynamically Merge Markov Decision Processes
, 1997
"... We are frequently called upon to perform multiple tasks that compete for our attention and resource. Often we know the optimal solution to each task in isolation; in this paper, we describe how this knowledge can be exploited to e#ciently find good solutions for doing the tasks in parallel. We formu ..."
Abstract

Cited by 63 (1 self)
 Add to MetaCart
We are frequently called upon to perform multiple tasks that compete for our attention and resource. Often we know the optimal solution to each task in isolation; in this paper, we describe how this knowledge can be exploited to e#ciently find good solutions for doing the tasks in parallel. We formulate this problem as that of dynamically merging multiple Markov decision processes (MDPs) into a composite MDP, and present a new theoreticallysound dynamic programming algorithm for finding an optimal policy for the composite MDP. We analyze various aspects of our algorithm and illustrate its use on a simple merging problem. Every day, we are faced with the problem of doing multiple tasks in parallel, each of which competes for our attention and resource. If we are running a job shop, we must decide which machines to allocate to which jobs, and in what order, so that no jobs miss their deadlines. If we are a mail delivery robot, we must find the intended recipients of the mail while simul...
TD Models: Modeling the World at a Mixture of Time Scales
 PROCEEDINGS OF THE TWELFTH INTERNATIONAL CONFERENCE ON MACHINE LEARNING
, 1995
"... Temporaldifference (TD) learning can be used not just to predict rewards, as is commonly done in reinforcement learning, but also to predict states, i.e., to learn a model of the world's dynamics. We present theory and algorithms for intermixing TD models of the world at different levels of tem ..."
Abstract

Cited by 63 (18 self)
 Add to MetaCart
Temporaldifference (TD) learning can be used not just to predict rewards, as is commonly done in reinforcement learning, but also to predict states, i.e., to learn a model of the world's dynamics. We present theory and algorithms for intermixing TD models of the world at different levels of temporal abstraction within a single structure. Such multiscale TD models can be used in modelbased reinforcementlearning architectures and dynamic programming methods in place of conventional Markov models. This enables planning at higher and varied levels of abstraction, and, as such, may prove useful in formulating methods for hierarchical or multilevel planning and reinforcement learning. In this paper we treat only the prediction problemthat of learning a model and value function for the case of fixed agent behavior. Within this context, we establish the theoretical foundations of multiscale models and derive TD algorithms for learning them. Two small computational experim...
A Feedback Control Structure for Online Learning Tasks
 Robotics and Autonomous Systems
, 1997
"... This paper addresses adaptive control architectures for systems that respond autonomously to changing tasks. Such systems often have many sensory and motor alternatives and behavior drawn from these produces varying quality solutions. The objective is then to ground behavior in control laws which, c ..."
Abstract

Cited by 59 (19 self)
 Add to MetaCart
This paper addresses adaptive control architectures for systems that respond autonomously to changing tasks. Such systems often have many sensory and motor alternatives and behavior drawn from these produces varying quality solutions. The objective is then to ground behavior in control laws which, combined with resources, enumerate closedloop behavioral alternatives. Use of such controllers leads to analyzable and predictable composite systems, permitting the construction of abstract behavioral models. Here, discrete event system and reinforcement learning techniques are employed to constrain the behavioral alternatives and to synthesize behavior online. To illustrate this, a quadruped robot learning a turning gait subject to safety and kinematic constraints is presented. Keywords: Control Composition, DEDS, Reinforcement Learning, Walking. 1 Introduction Behavior generation in complex sensorimotor systems can be viewed as a scheduling problem in which a policy for engaging resour...