## Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition (2000)

Venue: | Journal of Artificial Intelligence Research |

Citations: | 395 - 6 self |

### BibTeX

@ARTICLE{Dietterich00hierarchicalreinforcement,

author = {Thomas G. Dietterich},

title = {Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition},

journal = {Journal of Artificial Intelligence Research},

year = {2000},

volume = {13},

pages = {227--303}

}

### Abstract

This paper presents a new approach to hierarchical reinforcement learning based on decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs and decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs. The decomposition, known as the MAXQ decomposition, has both a procedural semantics---as a subroutine hierarchy---and a declarative semantics---as a representation of the value function of a hierarchical policy. MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, Kaelbling, and Dayan and Hinton. It is based on the assumption that the programmer can identify useful subgoals and define subtasks that achieve these subgoals. By defining such subgoals, the programmer constrains the set of policies that need to be considered during reinforcement learning. The MAXQ value function decomposition can represent the value function of any policy that is consisten...
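The additive combination mentioned in the abstract can be written out as a short sketch. The notation below is an assumption, since the truncated abstract does not show it: $i$ is a parent subtask, $a$ a child subtask or primitive action, $V^\pi(a, s)$ the expected reward of executing $a$ from state $s$, and $C^\pi(i, s, a)$ a "completion" term for finishing $i$ after $a$ terminates.

$$
Q^\pi(i, s, a) = V^\pi(a, s) + C^\pi(i, s, a),
\qquad
V^\pi(i, s) =
\begin{cases}
Q^\pi(i, s, \pi_i(s)) & \text{if } i \text{ is a composite subtask} \\
\sum_{s'} P(s' \mid s, i)\, R(s' \mid s, i) & \text{if } i \text{ is a primitive action}
\end{cases}
$$

Unrolling this recursion from the root subtask (written $0$ here) down a path of chosen children $a_1, \ldots, a_m$, with $a_m$ primitive, gives the value of the whole hierarchical policy as a sum of one primitive value and one completion term per level:

$$
V^\pi(0, s) = V^\pi(a_m, s) + C^\pi(a_{m-1}, s, a_m) + \cdots + C^\pi(a_1, s, a_2) + C^\pi(0, s, a_1)
$$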

### Citations

2894 |
Applied Dynamic Programming
- BELLMAN, DREYFUS
- 1962
Citation Context ...ans by interacting directly with the external environment. The basic methods in reinforcement learning are based on the classical dynamic programming algorithms that were developed in the late 1950s (=-=Bellman, 1957-=-; Howard, 1960). However, reinforcement learning methods offer two important advantages over classical dynamic programming. First, the methods are online. This permits them to focus their attention on... |

1408 |
Learning from Delayed Rewards
- Watkins
- 1989
Citation Context ... Note that all optimal policies are "greedy" with respect to the backed-up value of the available actions. Closely related to the value function is the so-called action-value function, or Q function (=-=Watkins, 1989-=-). This function, $Q^\pi(s, a)$, gives the expected cumulative reward of performing action a in state s and then following policy $\pi$ thereafter. The Q function also satisfies a Bellman equation: $Q^\pi(s, a) = \sum$ ... |

943 | Introduction to Reinforcement Learning
- Sutton, Barto
- 1998
Citation Context ...ment learning. © 2000 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved. Dietterich 1. Introduction The area of Reinforcement Learning (Bertsekas & Tsitsiklis, 1996; Sutton & =-=Barto, 1998-=-) studies methods by which an agent can learn optimal or near-optimal plans by interacting directly with the external environment. The basic methods in reinforcement learning are based on the classica... |

862 | SOAR: An architecture for general intelligence - Laird, Newell, et al. - 1987 |

803 |
Nonlinear Programming. Athena Scientific
- Bertsekas
Citation Context ...d SARSA(0) (Bertsekas & Tsitsiklis, 1996; Jaakkola et al., 1994). We will employ the following result from stochastic approximation theory, which we state without proof: Lemma 1 (Proposition 4.5 from =-=Bertsekas and Tsitsiklis, 1996-=-) Consider the iteration $r_{t+1}(i) := (1 - \alpha_t(i))\, r_t(i) + \alpha_t(i)\big((U r_t)(i) + w_t(i) + u_t(i)\big)$. Let $\mathcal{F}_t = \{r_0(i), \ldots, r_t(i), w_0(i), \ldots, w_{t-1}(i), \alpha_0(i), \ldots, \alpha_t(i), \forall i\}$ be the entire history of the iteration. If (a) The $\alpha_t(i)$ ... |

645 |
Markov Decision Processes
- Puterman
- 1994
Citation Context ....e., the state resulting from an action is a probabilistic function of the previous state and the chosen action), the resulting sequential decision problem is known as a Markov Decision Problem (MDP; =-=Puterman, 1994-=-). Because Markov Decision Problems provide a very general model of sequential decision-making under uncertainty, they have provided the foundation for much recent work on probabilistic planning and l... |

641 |
Rete: A fast algorithm for the many pattern/ many object pattern match problem
- Forgy
- 1982
Citation Context ... result of the state change are re-considered. It should be possible to develop an efficient bottom-up method similar to the RETE algorithm (and its successors) that is used in the SOAR architecture (=-=Forgy, 1982-=-; Tambe & Rosenbloom, 1994). The third thing that must be specified to complete our definition of MAXQ-0 is the exploration policy, $\pi_x$. We require that $\pi_x$ be an ordered GLIE policy. Definition 9 An ord... |

563 |
Dynamic Programming and Markov Processes
- Howard
- 1960
Citation Context ...ing directly with the external environment. The basic methods in reinforcement learning are based on the classical dynamic programming algorithms that were developed in the late 1950s (Bellman, 1957; =-=Howard, 1960-=-). However, reinforcement learning methods offer two important advantages over classical dynamic programming. First, the methods are online. This permits them to focus their attention on the parts of ... |

563 | Learning to act using real-time dynamic programming - Barto, Bradtke, et al. - 1995 |

504 |
Dynamic Programming and Optimal Control. Athena Scientific
- Bertsekas
Citation Context ...n (starting at MaxRoot) to choose the next action. This permits us to follow the greedy policy with respect to the learned MAXQ decomposition. Because this constitutes one step of policy improvement (=-=Bertsekas, 1995-=-), it is guaranteed to give a better policy than the hierarchical policy if the hierarchical policy is not optimal. This informally proves the following: Theorem 4 For all states s, the value of the p... |

469 |
Planning in a hierarchy of abstraction spaces
- Sacerdoti
- 1974
Citation Context ...assical planning has shown that hierarchical methods such as hierarchical task networks (Currie & Tate, 1991), macro actions (Fikes, Hart, & Nilsson, 1972; Korf, 1985), and state abstraction methods (=-=Sacerdoti, 1974-=-; Knoblock, 1990) can provide exponential reductions in the computational cost of finding good plans. However, all of the basic algorithms for probabilistic planning and reinforcement learning are "fl... |

452 |
Learning and Executing Generalized Robot Plans
- Fikes, Hart, et al.
- 1972
Citation Context ...method for incorporating hierarchies into these algorithms. Research in classical planning has shown that hierarchical methods such as hierarchical task networks (Currie & Tate, 1991), macro actions (=-=Fikes, Hart, & Nilsson, 1972-=-; Korf, 1985), and state abstraction methods (Sacerdoti, 1974; Knoblock, 1990) can provide exponential reductions in the computational cost of finding good plans. However, all of the basic algorithms ... |

344 | O-Plan: the open planning architecture
- Currie, Tate
- 1991
Citation Context ... is the lack of a fully satisfactory method for incorporating hierarchies into these algorithms. Research in classical planning has shown that hierarchical methods such as hierarchical task networks (=-=Currie & Tate, 1991-=-), macro actions (Fikes, Hart, & Nilsson, 1972; Korf, 1985), and state abstraction methods (Sacerdoti, 1974; Knoblock, 1990) can provide exponential reductions in the computational cost of finding goo... |

336 | Prioritized sweeping: Reinforcement learning with less data and less time - Moore, Atkeson - 1993 |

304 | On-line Q-learning using connectionist systems
- Rummery, Niranjan
- 1994
Citation Context ...erge to the cumulative reward of the optimal policy for the MDP. In this paper, we will make use of two well-known learning algorithms: Q learning (Watkins, 1989; Watkins & Dayan, 1992) and SARSA(0) (=-=Rummery & Niranjan, 1994-=-). Both of these algorithms maintain a tabular representation of the action-value function Q(s,a). Every entry of the table is initialized arbitrarily. In Q learning, after the algorithm has observed ... |

273 | Reward, motivation, and reinforcement learning
- Dayan, Balleine
- 2002
Citation Context ...paths determines the cost of learning and planning, because information about future rewards must be propagated backward along these paths. Many researchers (Singh, 1992a; Lin, 1993; Kaelbling, 1993; =-=Dayan & Hinton, 1993-=-; Hauskrecht, Meuleau, Boutilier, Kaelbling, & Dean, 1998; Parr & Russell, 1998; Sutton, Precup, & Singh, 1998) have experimented with different methods of hierarchical reinforcement learning and hier... |

258 | Residual algorithms: Reinforcement learning with function approximation - Baird - 1995 |

251 | Reinforcement learning with hierarchies of machines
- Parr, Russell
- 1997
Citation Context ...uture rewards must be propagated backward along these paths. Many researchers (Singh, 1992a; Lin, 1993; Kaelbling, 1993; Dayan & Hinton, 1993; Hauskrecht, Meuleau, Boutilier, Kaelbling, & Dean, 1998; =-=Parr & Russell, 1998-=-; Sutton, Precup, & Singh, 1998) have experimented with different methods of hierarchical reinforcement learning and hierarchical probabilistic planning. This research has explored many different po... |

231 | Exploiting structure in policy construction - Boutilier, Dearden, et al. - 1995 |

214 | On the convergence of stochastic iterative dynamic programming algorithms - Jaakkola, Jordan, et al. - 1994 |

196 |
Reinforcement Learning for Robots Using Neural Networks
- Lin
- 1993
Citation Context ...ong, and the length of these paths determines the cost of learning and planning, because information about future rewards must be propagated backward along these paths. Many researchers (Singh, 1992; =-=Lin, 1993-=-; Kaelbling, 1993; Dayan & Hinton, 1993; Hauskrecht, et al., 1998; Parr & Russell, 1998; Sutton, Precup, & Singh, 1998) have experimented with different methods of hierarchical reinforcement learning ... |

166 | Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning 8:323–339
- Singh
- 1992
Citation Context ...te are very long, and the length of these paths determines the cost of learning and planning, because information about future rewards must be propagated backward along these paths. Many researchers (=-=Singh, 1992-=-; Lin, 1993; Kaelbling, 1993; Dayan & Hinton, 1993; Hauskrecht, et al., 1998; Parr & Russell, 1998; Sutton, Precup, & Singh, 1998) have experimented with different methods of hierarchical reinforcemen... |

155 |
Decision theoretic planning: Structural assumptions and computational leverage
- Boutilier, Dean, et al.
- 1999
Citation Context ...te space, which makes them prohibitively expensive for most AI problems. Hence, recent research has focused on methods that can exploit structure within the planning problem to work more efficiently (=-=Boutilier, Dean, & Hanks, 1999-=-). The area of Reinforcement Learning (Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 1998) studies methods for learning optimal or near-optimal plans by interacting directly with the external environm... |

149 |
Macro-operators: A weak method for learning
- Korf
- 1985
Citation Context ...orcement learning algorithms. Research in classical planning has shown that hierarchical methods such as hierarchical task networks (Currie & Tate, 1991), macro actions (Fikes, Hart, & Nilsson, 1972; =-=Korf, 1985-=-), and state abstraction methods (Sacerdoti, 1974; Knoblock, 1990) can provide exponential reductions in the computational cost of finding good plans. However, all of the basic algorithms for probabil... |

129 | Hierarchical solution of Markov decision processes using macroactions
- Hauskrecht, Meuleau, et al.
- 1998
Citation Context ...cost of learning and planning, because information about future rewards must be propagated backward along these paths. Many researchers (Singh, 1992; Lin, 1993; Kaelbling, 1993; Dayan & Hinton, 1993; =-=Hauskrecht, et al., 1998-=-; Parr & Russell, 1998; Sutton, Precup, & Singh, 1998) have experimented with different methods of hierarchical reinforcement learning and hierarchical probabilistic planning. This research has explor... |

127 | The MAXQ method for hierarchical reinforcement learning - Dietterich - 1998 |

124 | Convergence results for single-step on-policy reinforcement-learning algorithms - Singh, Jaakkola, et al. |

117 | Hierarchical Control and learning for Markov decision processes - Parr - 1998 |

112 | Decomposition techniques for planning in stochastic domains - Dean, Lin - 1995 |

86 | Learning Abstraction Hierarchies for Problem Solving
- Knoblock
- 1990
Citation Context ...has shown that hierarchical methods such as hierarchical task networks (Currie & Tate, 1991), macro actions (Fikes, Hart, & Nilsson, 1972; Korf, 1985), and state abstraction methods (Sacerdoti, 1974; =-=Knoblock, 1990-=-) can provide exponential reductions in the computational cost of finding good plans. However, all of the basic algorithms for probabilistic planning and reinforcement learning are "flat" methods---th... |

79 | Multi-time models for temporally abstract planning - Precup, Sutton - 1998 |

74 |
Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988
Citation Context ...archy. Our philosophy in developing MAXQ (which we share with other reinforcement learning researchers, notably Parr and Russell) has been to draw inspiration from the development of Belief Networks (=-=Pearl, 1988-=-). Belief networks were first introduced as a formalism in which the knowledge engineer would describe the structure of the networks and domain experts would provide the necessary probability estimate... |

63 | TD models: Modeling the world at a mixture of time scales
- Sutton
- 1995
Citation Context ...he tree can typically be given highly abstracted views of the state. This improves the ability to transfer such subtasks from one problem to another. In a series of papers, Sutton and his colleagues (=-=Sutton, 1995-=-; Precup & Sutton, 1998; Sutton et al., 1998) have studied a kind of subtask that they call an "option." An option is exactly the same as a subtask in the MAXQ hierarchy, although in their work, they ... |

57 | Between MDPs and semi-MDPs: Learning, planning, and representing knowledge at multiple temporal scales - Sutton, Precup, et al. - 1998 |

48 | Flexible decomposition algorithms for weakly coupled Markov decision problems - Parr - 1998 |

38 | Approximating value trees in structured dynamic programming - Boutilier, Dearden - 1996 |

27 |
Hierarchical reinforcement learning: Preliminary results
- Kaelbling
- 1993
Citation Context ...e length of these paths determines the cost of learning and planning, because information about future rewards must be propagated backward along these paths. Many researchers (Singh, 1992; Lin, 1993; =-=Kaelbling, 1993-=-; Dayan & Hinton, 1993; Hauskrecht, et al., 1998; Parr & Russell, 1998; Sutton, Precup, & Singh, 1998) have experimented with different methods of hierarchical reinforcement learning and hierarchical ... |

22 | Improved switching among temporally abstract actions - Sutton, Singh, et al. - 1999 |

20 |
Investigating production system representations for non-combinatorial match
- Tambe, Rosenbloom
- 1994
Citation Context ...change are re-considered. It should be possible to develop an efficient bottom-up method similar to the RETE algorithm (and its successors) that is used in the SOAR architecture (Forgy, 1982; Tambe & =-=Rosenbloom, 1994-=-). The third thing that must be specified to complete our definition of MAXQ-0 is the exploration policy, $\pi_x$. We require that $\pi_x$ be an ordered GLIE policy. Definition 9 An ordered GLIE policy is a GLIE... |

18 | Multivalue-functions: Efficient automatic action hierarchies for multiple goal MDPs - Moore, Baird, et al. - 1999 |

16 | Hierarchical explanation-based reinforcement learning
- Tadepalli, Dietterich
- 1997
Citation Context ...sks. 7.4 Other Domains In addition to the three domains discussed above, we have developed MAXQ graphs for Singh’s (1992b) “flag task”, the treasure hunter task described by Tadepalli and Dietterich (=-=Tadepalli & Dietterich, 1997-=-), and Dayan and Hinton’s (1993) Feudal-Q learning task. All of these tasks can be easily and naturally placed into the MAXQ framework—indeed, all of them fit more easily than the Parr and Russell maz... |

15 | Module Based Reinforcement Learning for a Real Robot - Kalmar, Szepesvari, et al. |

5 |
Economic principles of multi-agent systems (Editorial)
- Boutilier, Shoham, et al.
- 1997
Citation Context ...es for constructing robust, autonomous agents that are able to achieve good performance in complex, real-world environments. One fruitful line of research views agents from an “economic” perspective (=-=Boutilier, Shoham, & Wellman, 1997-=-): An agent interacts with an environment and receives real-valued rewards and penalties. The agent’s goal is to maximize the total reward it receives. The economic view makes it easy to formalize tra... |

1 | Multi-value-functions: Efficient automatic action hierarchies for multiple goal MDPs - Moore, Baird, et al. - 1999 |

1 |
Hierarchical reinforcement learning: Preliminary results
- Kaufmann, Francisco, et al.
- 1993
Citation Context ... large penalty. For each of these parallel problems, we can apply standard Q learning to learn $\tilde{C}$ based on $\tilde{R}_u$: $\tilde{C}_{u,v}(s) := (1 - \alpha)\, \tilde{C}_{u,v}(s) + \alpha\big[\tilde{R}_u(s') + \max_a \tilde{Q}_{u,a}(s')\big]$ =-=(5)-=- where $\tilde{Q}_{u,v}(s) = \mathrm{MAX}_v(s) + \tilde{C}_{u,v}(s)$ (6). Note that $\mathrm{MAX}_v(s)$ is the true expected cost of taking action v. The policy can now be defined by $\pi_u(s) = \arg\max_v \tilde{Q}_{u,v}(s)$. The resulting MAXQ hi... |
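
Several of the citation contexts above refer to tabular Q learning and SARSA(0), both of which keep a table of action values Q(s,a) and update one entry per observed transition. The sketch below illustrates the two standard update rules only; it is not code from the paper. The function names, the dictionary-based table, the zero initialization (the paper allows arbitrary initialization), and the default step size `alpha` and discount `gamma` are all assumptions made for illustration.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy update: bootstrap from the best-valued action in s_next."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)


def sarsa0_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: bootstrap from the action actually taken in s_next."""
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * Q[(s_next, a_next)])


# Hypothetical two-action problem; state and action names are placeholders.
actions = ["left", "right"]

# Q learning on a table whose entries all start at zero.
Q = defaultdict(float)
q_learning_update(Q, "s0", "right", 1.0, "s1", actions)

# SARSA(0) on a table where one successor entry is already nonzero.
Q2 = defaultdict(float)
Q2[("s1", "left")] = 2.0
sarsa0_update(Q2, "s0", "right", 1.0, "s1", "left")
```

The only difference between the two rules is the bootstrap target: Q learning uses the maximum over actions in the next state, whereas SARSA(0) uses the value of the action the agent actually selects next.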