Results 1–10 of 28
Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition
Journal of Artificial Intelligence Research, 2000
Abstract

Cited by 440 (6 self)
This paper presents a new approach to hierarchical reinforcement learning based on decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs and decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs. The decomposition, known as the MAXQ decomposition, has both a procedural semantics, as a subroutine hierarchy, and a declarative semantics, as a representation of the value function of a hierarchical policy. MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, Kaelbling, and Dayan and Hinton. It is based on the assumption that the programmer can identify useful subgoals and define subtasks that achieve these subgoals. By defining such subgoals, the programmer constrains the set of policies that need to be considered during reinforcement learning. The MAXQ value function decomposition can represent the value function of any policy that is consistent ...
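A minimal sketch of the additive decomposition this abstract describes: the value of invoking child subtask a in state s under parent task i splits into V(a, s), the reward earned while a runs, plus a completion term C(i, s, a) for finishing i afterwards. The tiny hierarchy, task names, and numbers below are hypothetical illustrations, not the paper's experimental domain.

```python
V = {}  # V[(task, state)]: expected reward of a primitive task
C = {}  # C[(parent, state, child)]: expected reward after child ends

def v_value(task, state, children):
    """V(task, s): stored value for primitives, max over children otherwise."""
    kids = children.get(task)
    if not kids:
        return V.get((task, state), 0.0)
    return max(q_value(task, state, k, children) for k in kids)

def q_value(parent, state, child, children):
    """Decomposed Q(parent, s, child) = V(child, s) + C(parent, s, child)."""
    return v_value(child, state, children) + C.get((parent, state, child), 0.0)

# Toy two-level hierarchy: a root task with two primitive children.
children = {"root": ["navigate", "pickup"]}
V[("navigate", "s0")] = 1.0
V[("pickup", "s0")] = 2.0
C[("root", "s0", "navigate")] = 5.0
C[("root", "s0", "pickup")] = 3.0
```

Here V("root", "s0") recursively evaluates to max(1.0 + 5.0, 2.0 + 3.0) = 6.0, combining the child value functions additively rather than storing one flat Q-table.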
Stochastic Dynamic Programming with Factored Representations
1997
Abstract

Cited by 190 (10 self)
Markov decision processes (MDPs) have proven to be popular models for decision-theoretic planning, but standard dynamic programming algorithms for solving MDPs rely on explicit, state-based specifications and computations. To alleviate the combinatorial problems associated with such methods, we propose new representational and computational techniques for MDPs that exploit certain types of problem structure. We use dynamic Bayesian networks (with decision trees representing the local families of conditional probability distributions) to represent stochastic actions in an MDP, together with a decision-tree representation of rewards. Based on this representation, we develop versions of standard dynamic programming algorithms that directly manipulate decision-tree representations of policies and value functions. This generally obviates the need for state-by-state computation, aggregating states at the leaves of these trees and requiring computations only for each aggregate state. The key to these algorithms is a decision-theoretic generalization of classic regression analysis, in which we determine the features relevant to predicting expected value. We demonstrate the method empirically on several planning problems ...
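The core representational idea above can be illustrated with a toy sketch (not the paper's algorithm): store the value function as a decision tree over state variables, so that one computation at a leaf covers the entire aggregate of states reaching that leaf. Trees here are plain tuples, either ("leaf", value) or ("split", var, if_true, if_false); the variable names and numbers are invented for illustration.

```python
def tree_eval(tree, state):
    """Look up the value of a concrete state by walking the tree."""
    if tree[0] == "leaf":
        return tree[1]
    _, var, t, f = tree
    return tree_eval(t if state[var] else f, state)

def tree_map(tree, fn):
    """Apply fn once per leaf -- one backup per aggregate state,
    never per concrete state."""
    if tree[0] == "leaf":
        return ("leaf", fn(tree[1]))
    _, var, t, f = tree
    return ("split", var, tree_map(t, fn), tree_map(f, fn))

# Value tree: 10 if the agent holds the key, else 0.
value = ("split", "has_key", ("leaf", 10.0), ("leaf", 0.0))
# A (purely illustrative) discounting pass touches two leaves,
# regardless of how many other state variables exist.
discounted = tree_map(value, lambda v: 0.9 * v)
```

A full structured dynamic programming step would also regress the tree through the action's DBN, but the aggregation benefit is already visible: the update cost scales with the number of leaves, not the number of states.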
The MAXQ Method for Hierarchical Reinforcement Learning
In Proceedings of the Fifteenth International Conference on Machine Learning, 1998
Abstract

Cited by 146 (5 self)
This paper presents a new approach to hierarchical reinforcement learning based on the MAXQ decomposition of the value function. The MAXQ decomposition has both a procedural semantics, as a subroutine hierarchy, and a declarative semantics, as a representation of the value function of a hierarchical policy. MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, Kaelbling, and Dayan and Hinton. Conditions under which the MAXQ decomposition can represent the optimal value function are derived. The paper defines a hierarchical Q-learning algorithm, proves its convergence, and shows experimentally that it can learn much faster than ordinary "flat" Q-learning. Finally, the paper discusses some interesting issues that arise in hierarchical reinforcement learning, including the hierarchical credit assignment problem and non-hierarchical execution of the MAXQ hierarchy.

1 Introduction. Hierarchical approaches to reinforcement learning (RL) problems promise ma...
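A hedged sketch of the kind of completion-function update a hierarchical Q-learning algorithm like the one above uses: after a child subtask runs for N primitive steps from s and terminates in s', the downstream value is discounted by gamma**N (an SMDP-style target). The tables, task names, and learning rates below are illustrative, not the paper's exact algorithm.

```python
C = {}                          # C[(parent, s, child)] completion values
V = {("deliver", "s1"): 10.0}   # child values, assumed already learned here

def best_q(parent, s, children):
    """max over child actions of V(child, s) + C(parent, s, child)."""
    return max(V.get((c, s), 0.0) + C.get((parent, s, c), 0.0)
               for c in children)

def c_update(parent, s, child, s_next, n_steps, children,
             alpha=1.0, gamma=0.5):
    """One update of C toward gamma**N * max_a' Q(parent, s', a')."""
    target = (gamma ** n_steps) * best_q(parent, s_next, children)
    old = C.get((parent, s, child), 0.0)
    C[(parent, s, child)] = (1 - alpha) * old + alpha * target

# Child "deliver" took 2 steps from s0 and terminated in s1.
c_update("root", "s0", "deliver", "s1", n_steps=2, children=["deliver"])
```

With alpha=1.0 and gamma=0.5 the stored completion value becomes 0.25 * 10.0 = 2.5, i.e. the value of the best continuation discounted by the subtask's duration.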
Machine-Learning Research: Four Current Directions
Abstract

Cited by 144 (1 self)
Machine Learning research has been making great progress in many directions. This article summarizes four of these directions and discusses some current open problems. The four directions are (a) improving classification accuracy by learning ensembles of classifiers, (b) methods for scaling up supervised learning algorithms, (c) reinforcement learning, and (d) learning complex stochastic models.
An Overview of MAXQ Hierarchical Reinforcement Learning
In Abstraction, Reformulation, and Approximation, 2000
Abstract

Cited by 40 (0 self)
Reinforcement learning addresses the problem of learning optimal policies for sequential decision-making problems involving stochastic operators and numerical reward functions, rather than the more traditional deterministic operators and logical goal predicates. In many ways, reinforcement learning research is recapitulating the development of classical research in planning and problem solving. After studying the problem of solving "flat" problem spaces, researchers have recently turned their attention to hierarchical methods that incorporate subroutines and state abstractions. This paper gives an overview of the MAXQ value function decomposition and its support for state abstraction and action abstraction.

1 Introduction. Reinforcement learning studies the problem of a learning agent that interacts with an unknown, stochastic, but fully observable environment. This problem can be formalized as a Markov decision process (MDP), and reinforcement learning research has develop...
An Integrated Approach to Hierarchy and Abstraction for POMDPs
2002
Abstract

Cited by 27 (1 self)
This paper presents an algorithm for planning in structured partially observable Markov decision processes (POMDPs). The new algorithm, named PolCA (for Policy-Contingent Abstraction), uses an action-based decomposition to partition complex POMDP problems into a hierarchy of smaller subproblems. Low-level subtasks are solved first, and their partial policies are used to model abstract actions in the context of higher-level subtasks. At all levels of the hierarchy, subtasks need only consider a reduced action, state, and observation space. The reduced action set is provided by a designer, whereas the reduced state and observation sets are discovered automatically on a subtask-per-subtask basis. This typically results in lower-level subtasks having few, but high-resolution, state/observation features, whereas high-level subtasks tend to have many, but low-resolution, state/observation features. This paper presents a detailed overview of PolCA in the context of a POMDP hierarchical planning and execution algorithm. It also includes theoretical results demonstrating that in the special case of fully observable MDPs, the algorithm converges to a recursively optimal solution. Experimental results included in the paper demonstrate the usefulness of the approach on a range of problems and show favorable performance compared to competing function-approximation POMDP algorithms. Finally, the paper presents a real-world implementation and deployment of a robotic system which uses PolCA in the context of a high-level robot behavior control task.
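The bottom-up structure described in this abstract (solve low-level subtasks first, then expose each solved subtask's policy as a single abstract action to its parent) can be sketched generically. Everything below is a hypothetical stand-in for PolCA's actual POMDP machinery: `solve`, the subtask names, and the one-call "run to termination" are all simplifications for illustration.

```python
def make_abstract_action(policy):
    """Wrap a solved subtask policy as one abstract action that runs
    the subtask to termination (collapsed to one call for brevity)."""
    def run(state):
        return policy(state)
    return run

def solve_bottom_up(subtasks, solve):
    """subtasks: (name, spec) pairs ordered leaves -> root. `solve`
    may plan over the abstract actions of already-solved children,
    so each parent sees only a reduced action set."""
    abstract = {}
    for name, spec in subtasks:
        policy = solve(spec, dict(abstract))
        abstract[name] = make_abstract_action(policy)
    return abstract

# Illustrative use: a leaf that advances one step, and a parent whose
# policy simply invokes the solved leaf twice.
def solve(spec, abstract_children):
    if spec == "step":
        return lambda pos: pos + 1
    leaf = abstract_children["approach"]
    return lambda pos: leaf(leaf(pos))

actions = solve_bottom_up([("approach", "step"), ("dock", "use-child")], solve)
```

The parent never reasons about the leaf's internal actions; it only sees the abstract action, which is the source of the reduced action space the abstract mentions.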
Hierarchical Optimization of Policy-Coupled Semi-Markov Decision Processes
In Proceedings of the Sixteenth International Conference on Machine Learning, 1999
Abstract

Cited by 25 (7 self)
One general strategy for approximately solving large Markov decision processes is "divide-and-conquer": the original problem is decomposed into subproblems which interact with each other but can nonetheless be solved independently by taking into account the nature of the interaction. In this paper we focus on a class of "policy-coupled" semi-Markov decision processes (SMDPs), which arise in many nonstationary, real-world multi-agent tasks, such as manufacturing and robotics. The nature of the interaction among subproblems (agents) is more subtle than that studied previously: the components of a sub-SMDP, namely the available states and actions, transition probabilities, and rewards, depend on the policies used in solving the "neighboring" sub-SMDPs. This "strongly coupled" interaction among subproblems causes the approach of solving each sub-SMDP in parallel to fail. We present a novel approach whereby many variants of each sub-SMDP are solved, explicitly taking into account the different mod...
Algorithms for Partially Observable Markov Decision Processes
Hong Kong University of Science and Technology, 2001
Abstract

Cited by 20 (1 self)
A Partially Observable Markov Decision Process (POMDP) is a general sequential decision-making model where the effects of actions are...
High-level robot behavior control using POMDPs
In AAAI Workshop Notes, 2002
Abstract

Cited by 19 (1 self)
This paper describes a robot controller which uses probabilistic decision-making techniques at the highest level of behavior control. The POMDP-based robot controller has the ability to incorporate noisy and partial sensor information, and can arbitrate between information-gathering and performance-related actions. The complexity of the robot control domain requires a POMDP model that is beyond the capability of current exact POMDP solvers; we therefore present a hierarchical variant of the POMDP model which exploits structure in the problem domain to accelerate planning. This POMDP controller is implemented and tested on board a mobile robot in the context of an interactive service task. During the course of experiments conducted in an assisted living facility, the robot successfully demonstrated that it could autonomously provide guidance and information to elderly residents with mild physical and cognitive disabilities.
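A controller like the one above maintains a belief, a probability distribution over states, updated from noisy, partial observations. The generic discrete POMDP belief update (not this paper's specific model; the states, action, and observation probabilities below are invented for illustration) is b'(s') proportional to O(o | s', a) * sum over s of T(s' | s, a) * b(s), renormalized:

```python
def belief_update(b, a, o, T, O, states):
    """Exact discrete Bayes filter step for a POMDP belief."""
    new_b = {}
    for s2 in states:
        predicted = sum(T[(s, a, s2)] * b[s] for s in states)
        new_b[s2] = O[(s2, a, o)] * predicted
    z = sum(new_b.values())
    if z == 0.0:
        raise ValueError("observation impossible under the model")
    return {s: p / z for s, p in new_b.items()}

# Toy model: the robot is "near" or "far" from a resident; waiting
# does not move it, and a voice is heard with probability 0.9 when
# near and 0.1 when far.
states = ["near", "far"]
T = {(s, "wait", s2): 1.0 if s == s2 else 0.0
     for s in states for s2 in states}
O = {("near", "wait", "voice"): 0.9, ("far", "wait", "voice"): 0.1}
b = belief_update({"near": 0.5, "far": 0.5}, "wait", "voice", T, O, states)
```

Starting from a uniform belief, hearing a voice shifts the belief to 0.9 near / 0.1 far, which is exactly the arbitration raw material: the controller can compare acting now against gathering more information to sharpen the belief.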
Hierarchical learning of navigational behaviors in an autonomous robot using a predictive sparse distributed memory
Machine Learning, 1998
Abstract

Cited by 18 (0 self)
We describe a general framework for learning perception-based navigational behaviors in autonomous mobile robots. A hierarchical, behavior-based decomposition of the control architecture is used to facilitate efficient modular learning. Lower-level reactive behaviors such as collision detection and obstacle avoidance are learned using a stochastic hill-climbing method, while higher-level goal-directed navigation is achieved using a self-organizing sparse distributed memory. The memory is initially trained by teleoperating the robot on a small number of paths within a given domain of interest. During training, the vectors in the sensory space as well as the motor space are continually adapted using a form of competitive learning to yield basis vectors that efficiently span the sensorimotor space. After training, the robot navigates from arbitrary locations to a desired goal location using motor output vectors computed by a saliency-based weighted averaging scheme. The pervasive problem of perceptual aliasing in finite-order Markovian environments is handled by allowing both the current as well as the set of immediately preceding perceptual inputs to predict the motor output vector for the current time instant. We describe experimental and simulation results obtained using a mobile robot equipped with bump sensors, photosensors, and infrared receivers, navigating within an enclosed obstacle-ridden arena. The results indicate that the method performs successfully in a number of navigational tasks exhibiting varying degrees of perceptual aliasing.
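The competitive-learning adaptation this abstract mentions (pull the nearest stored basis vector toward each new sensorimotor sample) can be sketched generically. The vector dimensions and learning rate below are illustrative, and this is not the paper's exact sparse-distributed-memory architecture:

```python
def nearest(protos, x):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return min(range(len(protos)),
               key=lambda i: sum((p - q) ** 2 for p, q in zip(protos[i], x)))

def competitive_step(protos, x, lr=0.5):
    """Winner-take-all update: move only the nearest prototype a
    fraction lr of the way toward the sample x."""
    i = nearest(protos, x)
    protos[i] = [p + lr * (q - p) for p, q in zip(protos[i], x)]
    return i

# Two prototype basis vectors; a sample near the second one wins
# and drags that prototype toward itself.
protos = [[0.0, 0.0], [1.0, 1.0]]
winner = competitive_step(protos, [0.9, 1.0])
```

Repeating this over many teleoperated samples spreads the prototypes to cover the region of sensorimotor space the robot actually visits, which is the "basis vectors that efficiently span the sensorimotor space" the abstract refers to.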