## Reinforcement learning with hierarchies of machines (1998)

Venue: Advances in Neural Information Processing Systems 10

Citations: 251 (11 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Parr98reinforcementlearning,
  author    = {Ronald Parr and Stuart Russell},
  title     = {Reinforcement learning with hierarchies of machines},
  booktitle = {Advances in Neural Information Processing Systems 10},
  year      = {1998},
  pages     = {1043--1049},
  publisher = {MIT Press}
}
```

### Abstract

We present a new approach to reinforcement learning in which the policies considered by the learning process are constrained by hierarchies of partially specified machines. This allows for the use of prior knowledge to reduce the search space and provides a framework in which knowledge can be transferred across problems and in which component solutions can be recombined to solve larger and more complicated problems. Our approach can be seen as providing a link between reinforcement learning and “behavior-based” or “teleo-reactive” approaches to control. We present provably convergent algorithms for problem-solving and learning with hierarchical machines and demonstrate their effectiveness on a problem with several thousand states.
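The abstract's core idea, a policy space constrained by partially specified machines, can be illustrated with a minimal sketch. All names below (`Machine`, `MachineState`) are hypothetical illustrations, not the paper's formal definitions: "action" states are fixed by the designer, while "choice" states are left open for the learner to resolve, shrinking the search space.

```python
# Minimal sketch of a partially specified machine in the spirit of HAMs.
# Hypothetical names, not the paper's formal construction: "action" states
# execute a fixed action; "choice" states are left to the learner.

class MachineState:
    def __init__(self, kind, options):
        assert kind in ("action", "choice")
        self.kind = kind          # "action": fixed; "choice": learner decides
        self.options = options    # a single action, or the set to choose among

class Machine:
    def __init__(self, states):
        self.states = states

    def free_choices(self):
        """Only choice states contribute to the learner's search space."""
        return [i for i, s in enumerate(self.states) if s.kind == "choice"]

# A corridor-following machine: two fixed steps, one learned decision.
m = Machine([
    MachineState("action", ["forward"]),
    MachineState("choice", ["left", "right"]),   # learner fills this in
    MachineState("action", ["forward"]),
])
print(m.free_choices())  # prints [1]
```

The point of the sketch is that learning only touches the choice points, which is how prior knowledge encoded in the machine reduces the effective search space.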

### Citations

2961 |
A robust layered control system for a mobile robot
- Brooks
- 1986
Citation Context: ...tions. These approaches can be encompassed within our framework by encoding the events or behaviors as machines. The design of hierarchically organized, “layered” controllers was popularized by Brooks [4]. His designs use a somewhat different means of passing control, but our analysis and theorems apply equally well to his machine description language. The “teleo-reactive” agent designs of Benson and N...

1329 |
Markov Decision Processes: Discrete Stochastic Dynamic Programming
- Puterman
- 1994
Citation Context: ... standard RL algorithms. We conclude with a discussion of related approaches and ongoing work. 2 Markov Decision Processes: We assume the reader is familiar with the basic concepts of MDPs (see, e.g., [10]). We will use the following notation: An MDP is a 4-tuple, (S, A, T, R), where S is a set of states, A is a set of actions, T is a transition model mapping S × A × S into probabilities in [0...
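The 4-tuple definition quoted in this snippet maps directly onto a small data structure. The sketch below is an illustrative encoding under my own naming conventions, not notation from the paper, with T stored as a dict from (state, action) to a distribution over next states:

```python
# Illustrative encoding of an MDP (S, A, T, R) as plain Python data.
# T maps (state, action) to a dict of next-state probabilities in [0, 1];
# R here maps states to rewards (state-action rewards work the same way).

from typing import NamedTuple

class MDP(NamedTuple):
    S: list           # states
    A: list           # actions
    T: dict           # (s, a) -> {s_next: probability}
    R: dict           # s -> reward

two_state = MDP(
    S=["s0", "s1"],
    A=["stay", "go"],
    T={
        ("s0", "stay"): {"s0": 1.0},
        ("s0", "go"):   {"s1": 0.9, "s0": 0.1},  # noisy transition
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "go"):   {"s0": 1.0},
    },
    R={"s0": 0.0, "s1": 1.0},
)

# Sanity check: each row of T is a probability distribution over S.
for dist in two_state.T.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```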

682 |
Parallel and Distributed Computation: Numerical Methods
- Bertsekas, Tsitsiklis
- 1989
Citation Context: ...rformance guarantees and oscillation or divergence in reinforcement learning. Moreover, state aggregation may be hard to apply effectively in many cases. Dean and Lin [8] and Bertsekas and Tsitsiklis [2] showed that some MDPs are loosely coupled and hence amenable to divide-and-conquer algorithms. A machine-like language was used in [13] to partition an MDP into decoupled subproblems. In problems th...

395 | Hierarchical reinforcement learning with the MAXQ value function decomposition
- Dietterich
Citation Context: ...es an explicit subgoal structure, with fixed values for each subgoal achieved, in order to achieve a hierarchical decomposition of the state space. Dietterich extends and generalizes this approach in [9]. Singh has investigated a number of approaches to subgoal based decomposition in reinforcement learning (e.g. [17] and [16]). Subgoals seem natural in some domains, but they may require a significant...

344 | O-plan: The open planning architecture
- Currie, Tate
- 1991
Citation Context: ...complexity of decision making from exponential to linear in the size of the problem. For example, hierarchical task network (HTN) planners can generate solutions containing tens of thousands of steps [5], whereas “flat” planners can manage only tens of steps. HTN planners are successful because they use a plan library that describes the decomposition of high-level activities into lower-level activiti...

273 | Reward, motivation, and reinforcement learning
- Dayan, Balleine
- 2002
Citation Context: ...ke language was used in [13] to partition an MDP into decoupled subproblems. In problems that are amenable to decoupling, these approaches could be used in combination with HAMs. Dayan and Hinton [6] have proposed feudal RL, which specifies an explicit subgoal structure, with fixed values for each subgoal achieved, in order to achieve a hierarchical decomposition of the state space. Dietterich ext...

215 | On the convergence of stochastic iterative dynamic programming algorithms
- Jaakkola, Jordan, et al.
- 1994
Citation Context: ...ement signal in HAMQ-learning is the same as the expected reinforcement signal that would be received if the agent were acting directly in the transformed model of Theorem 1 above. Thus, Theorem 1 of [11] can be applied to prove the convergence of the HAMQ-learning agent, provided that we enforce suitable constraints on the exploration strategy and the update parameter decay rate. We ran some experime...
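The "update parameter decay rate" condition mentioned in this snippet is the standard stochastic-approximation requirement on the step size. The sketch below shows an ordinary tabular Q-learning backup with a 1/n schedule; it is a generic illustration of that decay condition, not the paper's HAMQ-learning algorithm, which applies an update of this kind at machine choice points rather than at every primitive step.

```python
# Generic tabular Q-learning update with a decaying step size.
# This is standard Q-learning, shown only to illustrate the decay
# condition under which stochastic-approximation convergence results
# apply; it is not the paper's HAMQ-learning algorithm itself.

from collections import defaultdict

Q = defaultdict(float)        # (state, action) -> value estimate
visits = defaultdict(int)     # (state, action) -> number of updates
gamma = 0.9                   # discount factor

def q_update(s, a, r, s_next, actions):
    """One Q-learning backup with step size alpha_n = 1 / n."""
    visits[(s, a)] += 1
    alpha = 1.0 / visits[(s, a)]   # sum(alpha_n) diverges, sum(alpha_n^2) converges
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Repeated backups of one deterministic transition: the successor's value
# stays 0 here, so the estimate settles at the immediate reward.
for _ in range(50):
    q_update("s0", "go", 1.0, "terminal", ["stay"])
print(round(Q[("s0", "go")], 6))  # prints 1.0
```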

196 |
Reinforcement Learning for Robots Using Neural Networks
- Lin
- 1993
Citation Context: ...ate sequences of state transitions together to permit reasoning about temporally extended events, and which can thereby form a behavioral hierarchy as in [14] and [15]. Lin’s somewhat informal scheme [12] also allows agents to treat entire policies as single actions. These approaches can be encompassed within our framework by encoding the events or behaviors as machines. The design of hierarchically o...

166 | Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning 8:323–339
- Singh
- 1992
Citation Context: ...decomposition of the state space. Dietterich extends and generalizes this approach in [9]. Singh has investigated a number of approaches to subgoal based decomposition in reinforcement learning (e.g. [17] and [16]). Subgoals seem natural in some domains, but they may require a significant amount of outside knowledge about the domain and establishing the relationship between the value of subgoals with ...

121 | Reinforcement learning with soft state aggregation
- Singh, Jaakkola, et al.
- 1996
Citation Context: ...ng. 2 Speedup techniques such as eligibility traces could be applied to get better Q-learning results; such methods apply equally well to HAMQ-learning. 6 Related work: State aggregation (see, e.g., [18] and [7]) clusters “similar” states together and assigns them the same value, effectively reducing the state space. This is orthogonal to our approach and could be combined with HAMs. However, aggrega...

117 | Reinforcement learning methods for continuous-time markov decision problems
- Bradtke, Duff
- 1994
Citation Context: ...quire a significant amount of outside knowledge about the domain and establishing the relationship between the value of subgoals with respect to the overall problem can be difficult. Bradtke and Duff [3] proposed an RL algorithm for SMDPs. Sutton [19] proposes temporal abstractions, which concatenate sequences of state transitions together to permit reasoning about temporally extended events, and whi...

112 | Decomposition techniques for planning in stochastic domains.In
- Dean, Lin
- 1995
Citation Context: ...roperty leading to the loss of performance guarantees and oscillation or divergence in reinforcement learning. Moreover, state aggregation may be hard to apply effectively in many cases. Dean and Lin [8] and Bertsekas and Tsitsiklis [2] showed that some MDPs are loosely coupled and hence amenable to divide-and-conquer algorithms. A machine-like language was used in [13] to partition an MDP into deco...

79 | Multi-time models for temporally abstract planning
- Precup, Sutton
- 1998
Citation Context: ...temporal abstractions, which concatenate sequences of state transitions together to permit reasoning about temporally extended events, and which can thereby form a behavioral hierarchy as in [14] and [15]. Lin’s somewhat informal scheme [12] also allows agents to treat entire policies as single actions. These approaches can be encompassed within our framework by encoding the events or behaviors as mac...

56 | Model reduction techniques for computing approximately optimal solutions for markov decision processes
- Dean, Givan, et al.
Citation Context: ...peedup techniques such as eligibility traces could be applied to get better Q-learning results; such methods apply equally well to HAMQ-learning. 6 Related work: State aggregation (see, e.g., [18] and [7]) clusters “similar” states together and assigns them the same value, effectively reducing the state space. This is orthogonal to our approach and could be combined with HAMs. However, aggregation sho...

44 | Reacting, planning and learning in an autonomous agent
- Benson, Nilsson
- 1995
Citation Context: ...esigns use a somewhat different means of passing control, but our analysis and theorems apply equally well to his machine description language. The “teleo-reactive” agent designs of Benson and Nilsson [1] are even closer to our HAM language. Both of these approaches assume that the agent is completely specified, albeit self-modifiable. The idea of partial behavior descriptions can be traced at least t...

41 | Roles of macro-actions in accelerating reinforcement learning
- McGovern, Sutton, et al.
- 1997
Citation Context: ...proposes temporal abstractions, which concatenate sequences of state transitions together to permit reasoning about temporally extended events, and which can thereby form a behavioral hierarchy as in [14] and [15]. Lin’s somewhat informal scheme [12] also allows agents to treat entire policies as single actions. These approaches can be encompassed within our framework by encoding the events or behavio...

25 | Scaling reinforcement learning algorithms by learning variable temporal resolution models
- Singh
- 1992
Citation Context: ...tion of the state space. Dietterich extends and generalizes this approach in [9]. Singh has investigated a number of approaches to subgoal based decomposition in reinforcement learning (e.g. [17] and [16]). Subgoals seem natural in some domains, but they may require a significant amount of outside knowledge about the domain and establishing the relationship between the value of subgoals with respect t...

5 |
Temporal abstraction in reinforcement learning
- Sutton
- 1995
Citation Context: ... about the domain and establishing the relationship between the value of subgoals with respect to the overall problem can be difficult. Bradtke and Duff [3] proposed an RL algorithm for SMDPs. Sutton [19] proposes temporal abstractions, which concatenate sequences of state transitions together to permit reasoning about temporally extended events, and which can thereby form a behavioral hierarchy as in...

4 | Exploiting Structure for Planning and Control
- Lin
- 1997
Citation Context: ...ly in many cases. Dean and Lin [8] and Bertsekas and Tsitsiklis [2] showed that some MDPs are loosely coupled and hence amenable to divide-and-conquer algorithms. A machine-like language was used in [13] to partition an MDP into decoupled subproblems. In problems that are amenable to decoupling, these approaches could be used in combination with HAMs. Dayan and Hinton [6] have proposed feudal RL ...

1 |
Synthesizing efficient agents from partial programs
- Hsu
- 1991
Citation Context: ...AM language. Both of these approaches assume that the agent is completely specified, albeit self-modifiable. The idea of partial behavior descriptions can be traced at least to Hsu’s partial programs [10], which were used with a deterministic logical planner. 7 Conclusions and future work: We have presented HAMs as a principled means of constraining the set of policies that are considered for a Markov ...
