## Recent advances in hierarchical reinforcement learning (2003)

### Download Links

- [www.fias.uni-frankfurt.de]
- [fias.uni-frankfurt.de]
- [www-anw.cs.umass.edu]
- [www-all.cs.umass.edu]
- [www.cs.iastate.edu]
- [www.cs.umass.edu]
- [www.cs.utexas.edu]
- [people.cs.umass.edu]
- DBLP

### Other Repositories/Bibliography

Citations: 164 (23 self)

### BibTeX

@ARTICLE{Barto03recentadvances,
  author  = {Andrew G. Barto and Sridhar Mahadevan},
  title   = {Recent advances in hierarchical reinforcement learning},
  journal = {Discrete Event Dynamic Systems},
  year    = {2003},
  volume  = {13},
  pages   = {341--379}
}

### Abstract

A preliminary unedited version of this paper was incorrectly published as part of Volume 13.

### Citations

3803 | Reinforcement Learning: An Introduction
- Sutton, Barto
- 1998
Citation Context ...sing partial observability. Concluding remarks address open challenges facing the further development of reinforcement learning in a hierarchical setting. 1. Introduction Reinforcement learning (RL) [5, 72] is an active area of machine learning research that is also receiving attention from the fields of decision theory, operations research, and control engineering. RL algorithms address the problem of ... |

1326 | Learning from delayed rewards
- Watkins
- 1989
Citation Context ...he adjustable parameters (e.g., multilayer neural networks) can be effective for difficult problems (e.g., refs. [11, 40, 64, 75]). Of the many RL algorithms, perhaps the most widely used are Q-learning [82, 83] and Sarsa [59, 70]. Q-learning is based on the DP backup (5) but with the expected immediate reward and the expected maximum action-value of the successor state on the right-hand side of (5) respectiv... |
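The excerpt above contrasts Q-learning with Sarsa. A minimal sketch of the tabular Q-learning backup it describes (sampled immediate reward plus discounted maximum successor action-value); the states, actions, and step sizes below are illustrative, not from the paper:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning backup: move Q(s, a) toward the sampled
    target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q

# Tiny two-state example; all values are illustrative.
Q = {(s, a): 0.0 for s in (0, 1) for a in ("L", "R")}
Q = q_learning_update(Q, s=0, a="R", r=1.0, s_next=1, actions=("L", "R"))
print(Q[(0, "R")])  # -> 0.1, one step of size alpha toward the target 1.0
```

Because the target uses the *maximum* over successor actions, the update is off-policy: it is indifferent to which action the agent actually executes next.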

1307 | Reinforcement learning: A survey
- Kaelbling, Littman, et al.
- 1996
Citation Context ...ent interest is attributable to Werbos [85, 86, 87], Watkins [82], and Tesauro's backgammon-playing system TD-Gammon [75, 76]. Additional information about RL can be found in several references (e.g., [2, 5, 32, 72]). Despite the utility of RL methods in many applications, the amount of time they can take to form acceptable approximate solutions can still be unacceptable. As a result, RL researchers are investig... |

832 | Planning and acting in partially observable stochastic domains
- Kaelbling, Littman, et al.
- 1998
Citation Context ...proach is formalized in terms of partially observable Markov decision processes (POMDPs), where agents learn policies over belief states, i.e., probability distributions over the underlying state set [31]. It can be shown that belief states satisfy the Markov property and consequently yield a new (and more complex) MDP over information states. Belief states can be recursively updated using the transit... |
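The recursive belief-state update mentioned at the end of the excerpt can be sketched as a Bayes filter. The dictionary-based layout of the transition model `T` and observation model `O` below is an assumption for illustration, not the paper's notation:

```python
def belief_update(b, a, y, T, O):
    """Bayes-filter update of a POMDP belief state.
    b: dict state -> probability; T[(s, a)][s2]: transition probability;
    O[(s2, a)][y]: probability of observing y after action a lands in s2."""
    new_b = {}
    for s2 in b:
        pred = sum(b[s] * T[(s, a)].get(s2, 0.0) for s in b)  # prediction step
        new_b[s2] = O[(s2, a)].get(y, 0.0) * pred             # correction step
    z = sum(new_b.values())                                    # normalize
    return {s: p / z for s, p in new_b.items()}

# Two-state example with a noisy sensor (all numbers illustrative).
T = {(0, "go"): {0: 0.2, 1: 0.8}, (1, "go"): {0: 0.0, 1: 1.0}}
O = {(0, "go"): {"beep": 0.1}, (1, "go"): {"beep": 0.9}}
b = belief_update({0: 1.0, 1: 0.0}, "go", "beep", T, O)
```

The updated belief remains a proper distribution, which is exactly why it satisfies the Markov property as a state for the derived information-state MDP.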

746 | Neuro-Dynamic Programming
- Bertsekas, Tsitsiklis
- 1996
Citation Context ...sing partial observability. Concluding remarks address open challenges facing the further development of reinforcement learning in a hierarchical setting. 1. Introduction Reinforcement learning (RL) (Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998) is an active area of machine learning research that is also receiving attention from the fields of decision theory, operations research, and control engineering. RL algorithms ... |

615 | Some studies in machine learning using the game of checkers
- Samuel
- 1959
Citation Context ...state of understanding rather than the intuition underlying the origination of these methods. Indeed, DP-based learning originated at least as far back as Samuel's famous checkers player of the 1950s [61, 60], which, however, made no reference to the DP literature existing at that time. Other early RL research was explicitly motivated by animal behavior and its neural basis [45, 33, 34, 71]. Much of the c... |

569 | Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence
- Weiss
- 2000
Citation Context ... where the two agents, A1 and A2, will maximize their performance at the task if they learn to coordinate with each other. Here, we want to design learning algorithms for cooperative multiagent tasks [84], where the agents learn the coordination skills by trial and error. The key idea here is that coordination skills are learned more efficiently if agents learn to synchronize using a hierarchical repres... |

532 | Learning to act using real-time dynamic programming
- Barto, Bradtke, et al.
- 1995
Citation Context ...ent interest is attributable to Werbos [85, 86, 87], Watkins [82], and Tesauro's backgammon-playing system TD-Gammon [75, 76]. Additional information about RL can be found in several references (e.g., [2, 5, 32, 72]). Despite the utility of RL methods in many applications, the amount of time they can take to form acceptable approximate solutions can still be unacceptable. As a result, RL researchers are investig... |

497 | Markov games as a framework for multi-agent reinforcement learning
- Littman
- 1994
Citation Context ...s the utility of A2 picking up trash from T1 if A1 is also picking up from the same bin, and so on). The proposed approach differs significantly from previous work in multiagent reinforcement learning [38, 74] in using hierarchical task structure to accelerate learning, and as well in its use of concurrent activities. To illustrate the use of this decomposition in learning multiagent coordination, for the ... |

459 | A model for reasoning about persistence and causation
- Dean, Kanazawa
- 1989
Citation Context ...nal structure [17]. Much work in artificial intelligence has focused on exploiting this structure to develop compact representations of single-step actions (e.g., the Dynamic Bayes Net representation [13]). A natural question to consider is how to extend these single-step compact models into compact models of temporally-extended activities, such as options. The problem is a bit subtle since even if al... |

427 | Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning
- Sutton, Precup, et al.
- 1999
Citation Context ...erally defined for a subset of the state set. The partial policies must also have well-defined termination conditions. These partial policies are sometimes called temporally-extended actions, options [73], skills [80], behaviors [9, 27], or the more control-theoretic modes [22]. When not discussing a specific formalism, we will use the term activity, as suggested by Harel [23]. For MDPs, this extensio... |

422 | Learning and executing generalized robot plans
- Fikes, Hart, et al.
- 1972
Citation Context ...ficial intelligence researchers have addressed the need for large-scale planning and problem solving by introducing various forms of abstraction into problem solving and planning systems, e.g., refs. [18, 37]. Abstraction allows a system to ignore details that are irrelevant for the task at hand. One of the simplest types of abstraction is the idea of a "macro-operator," or just a "macro," which is a sequ... |

373 | Dynamic Programming: Deterministic and Stochastic Models
- Bertsekas
- 1987
Citation Context ...heir properties. Here we briefly describe this well-known framework, with a few twists characteristic of how it is used in RL research; additional details can be found in many references (e.g., refs. [4, 5, 55, 58, 72]). A finite MDP models the following type of problem. At each stage in a sequence of stages, an agent (the controller) observes a system's state s, contained in a finite set S, and executes an action ... |

371 | Hierarchical reinforcement learning with the MAXQ value function decomposition
- Dietterich
- 2000
Citation Context ...es to hierarchical RL: the options formalism of Sutton, Precup, and Singh [73], the hierarchies of abstract machines (HAMs) approach of Parr and Russell [48, 49], and the MAXQ framework of Dietterich [14]. Although these approaches were developed relatively independently, they have many elements in common. In particular, they all rely on the theory of semi-Markov decision processes to provide a formal... |

368 | Practical Issues in Temporal Difference Learning
- Tesauro
- 1992
Citation Context ...ing function approximation methods for accumulating value function information, RL algorithms have produced good results on problems that pose significant challenges for standard methods (e.g., refs. [11, 75]). However, current RL methods by no means completely circumvent the curse of dimensionality: the exponential growth of the number of parameters to be learned with the size of any compact encoding of ... |

353 | Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding
- Sutton
- 1996
Citation Context ...eters (e.g., multilayer neural networks) can be effective for difficult problems (e.g., refs. [11, 40, 64, 75]). Of the many RL algorithms, perhaps the most widely used are Q-learning [82, 83] and Sarsa [59, 70]. Q-learning is based on the DP backup (5) but with the expected immediate reward and the expected maximum action-value of the successor state on the right-hand side of (5) respectively replaced by a s... |

289 | On-line Q-learning using connectionist systems
- Rummery, Niranjan
- 1994
Citation Context ...eters (e.g., multilayer neural networks) can be effective for difficult problems (e.g., refs. [11, 40, 64, 75]). Of the many RL algorithms, perhaps the most widely used are Q-learning [82, 83] and Sarsa [59, 70]. Q-learning is based on the DP backup (5) but with the expected immediate reward and the expected maximum action-value of the successor state on the right-hand side of (5) respectively replaced by a s... |

287 | The complexity of decentralized control of Markov decision processes
- Bernstein, Givan, et al.
- 2002
Citation Context ...f observing y if action a was performed and resulted in state s. However, mapping belief states to optimal actions is known to be intractable, particularly in the decentralized multiagent formulation [3]. Also, learning a perfect model of the underlying POMDP is a challenging task. An empirically more effective (but theoretically less powerful) approach is to use finite memory models as linear chains ... |

278 | Reinforcement Learning with Selective Perception and Hidden State
- Mccallum
- 1996
Citation Context ...f the underlying POMDP is a challenging task. An empirically more effective (but theoretically less powerful) approach is to use finite memory models as linear chains or nonlinear trees over histories [42]. However, such finite memory structures can be defeated by long sequences of mostly irrelevant observations and actions that conceal a critical past observation. We briefly summarize three multiscale... |

268 | Transition network grammars for natural language analysis
- Woods
- 1970
Citation Context ... do not need to be treated as part of the program state, a point we gloss over in our discussion. This kind of machine hierarchy is an instance of a Recursive Transition Network as discussed by Woods [88]. ...can be applied to reduce(H ∘ M) to approximate optimal policies for H ∘ M. The important strength of an RL method like SMDP Q-learning in this context is that it can be applied to reduce(H ∘ M) w... |

263 | Tractable inference for complex stochastic processes
- Boyen, Koller
- 1998
Citation Context ...generally does not hold over an extended activity. One approach that Rohanimanesh and Mahadevan [56] have been studying is how to exploit results from approximation of structured stochastic processes [6] to develop structured ways of approximating the next-state predictions of temporally-extended activities. The key idea is that by clustering the state variables into disjoint subsets, and keeping tra... |

248 | Introduction to Stochastic Dynamic Programming
- Ross
- 1983
Citation Context ...heir properties. Here we briefly describe this well-known framework, with a few twists characteristic of how it is used in RL research; additional details can be found in many references (e.g., refs. [4, 5, 55, 58, 72]). A finite MDP models the following type of problem. At each stage in a sequence of stages, an agent (the controller) observes a system's state s, contained in a finite set S, and executes an action ... |

248 | Multiagent reinforcement learning: Independent vs. cooperative agents
- Tan
- 1993
Citation Context ...s the utility of A2 picking up trash from T1 if A1 is also picking up from the same bin, and so on). The proposed approach differs significantly from previous work in multiagent reinforcement learning [38, 74] in using hierarchical task structure to accelerate learning, and as well in its use of concurrent activities. To illustrate the use of this decomposition in learning multiagent coordination, for the ... |

241 | Reinforcement learning with hierarchies of machines
- Parr, Russell
- 1997
Citation Context ... particularly that of Iba [28], who proposed a method for discovering macro-operators in problem solving. Related ideas have been studied by Digney [15, 16]. 4.2 Hierarchies of Abstract Machines Parr [48, 49] developed an approach to hierarchically structuring MDP policies called Hierarchies of Abstract Machines or HAMs. Like the options formalism, HAMs exploit the theory of SMDPs, but the emphasis is on ... |

236 | The Hierarchical Hidden Markov Model: Analysis and applications
- Fine, Singer, et al.
- 1998
Citation Context ...lanning algorithms scale poorly with model size. Theocharous et al. [79] developed a hierarchical POMDP formalism, termed H-POMDPs (Figure 7), by extending the hierarchical hidden Markov model (HHMM) [19] to include rewards and temporally-extended ... |

227 | TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation 6(2)
- Tesauro
- 1994
Citation Context ...motivated by animal behavior and its neural basis [45, 33, 34, 71]. Much of the current interest is attributable to Werbos [85, 86, 87], Watkins [82], and Tesauro's backgammon-playing system TD-Gammon [75, 76]. Additional information about RL can be found in several references (e.g., [2, 5, 32, 72]). Despite the utility of RL methods in many applications, the amount of time they can take to form acceptable... |

217 | An analysis of temporal-difference learning with function approximation (Technical Report LIDS-P-2322
- Tsitsiklis, Roy
- 1996
Citation Context ... of the agent. The agent's policy does not need high precision in states that are rarely visited. Feature 3 is the least understood aspect of RL, but results exist for the linear case (notably ref. [81]) and numerous examples illustrate how function approximation schemes that are nonlinear in the adjustable parameters (e.g., multilayer neural networks) can be effective for difficult problems (e.g., ... |

209 | On the convergence of stochastic iterative dynamic programming algorithms
- Jaakkola, Jordan, et al.
- 1994
Citation Context ...missible state-action pairs are updated infinitely often, and α_k decays with increasing k while obeying the usual stochastic approximation conditions, then {Q_k} converges to Q* with probability 1 [29, 5]. As long as these conditions are satisfied, the policy followed by the agent during learning is irrelevant. Of course, when Q-learning is being used, the agent's policy does matter since one is usual... |

191 | Toward a Modern Theory of Adaptive Network
- SUTTON, BARTO
- 1981
Citation Context ...rs player of the 1950s [61, 60], which, however, made no reference to the DP literature existing at that time. Other early RL research was explicitly motivated by animal behavior and its neural basis [45, 33, 34, 71]. Much of the current interest is attributable to Werbos [85, 86, 87], Watkins [82], and Tesauro's backgammon-playing system TD-Gammon [75, 76]. Additional information about RL can be found in several ... |

184 | A unified framework for hybrid control: Model and optimal control theory
- Branicky, Borkar, et al.
- 1998
Citation Context ...avior of the plant and intervenes when its state enters a set of boundary states. Intervention takes the form of switching to a new low-level regulator. This is not unlike many hybrid control methods [8] except that the low-level process is formalized as a finite MDP and the supervisor's task as a finite SMDP. The supervisor's decisions occur whenever the plant reaches a boundary state, which effectiv... |

169 | DERVISH: An Office-Navigating Robot
- Nourbakhsh, Powers, et al.
- 1995
Citation Context ...tes to actions provide good performance in robot navigation (e.g., the most-likely-state (MLS) heuristic assumes the agent is in the state corresponding to the "peak" of the belief state distribution) [35, 63, 47]. Such heuristics work much better in H-POMDPs because they can be applied at multiple levels, and belief states over abstract states usually have lower entropy (Figure 8). For a detailed study of the... |

133 | Learning topological maps with weak local odometric information
- Shatkay, Kaelbling
- 1997
Citation Context ...tion (e.g., the most-likely-state (MLS) heuristic assumes the agent is in the state corresponding to the "peak" of the belief state distribution) (Koenig and Simmons, 1997; Nourbakhsh et al., 1995; Shatkay and Kaelbling, 1997). Such heuristics work much better in H-POMDPs because they can be applied at multiple levels, and belief states over abstract states usually have lower entropy (Figure 8). For a detailed study of th... |

121 | Reinforcement learning for dynamic channel allocation in cellular telephone systems
- Singh, Bertsekas
- 1997
Citation Context ...nd numerous examples illustrate how function approximation schemes that are nonlinear in the adjustable parameters (e.g., multilayer neural networks) can be effective for difficult problems (e.g., refs. [11, 40, 64, 75]). Of the many RL algorithms, perhaps the most widely used are Q-learning [82, 83] and Sarsa [59, 70]. Q-learning is based on the DP backup (5) but with the expected immediate reward and the expected m... |

116 | Automatic discovery of subgoals in reinforcement learning using diverse density
- McGovern, Barto
- 2001
Citation Context ...lues, and Dietterich [14], whose approach we discuss in Section 4.3, proposes a similar scheme using pseudo-reward functions. A natural question, then, is how are useful subgoals determined? McGovern [43, 44] developed a method for automatically identifying potentially useful subgoals by detecting regions that the agent visits frequently on successful trajectories but not on unsuccessful trajectories. An ... |

114 | Reinforcement learning methods for continuous-time Markov decision problems
- Bradtke, Duff
- 1995
Citation Context ... r_{t+i} is the immediate reward received at time step t + i. The return accumulated during the waiting time must be bounded, and it can be computed recursively during the waiting time. Bradtke and Duff [7] showed how to do this for continuous-time SMDPs, Parr [48] proved that it converges under essentially the same conditions required for Q-learning convergence, and Das et al. [12] developed the averag... |
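The bounded return accumulated during an activity's waiting time, as described above, can be computed online with a simple recursion. This is a sketch under assumed notation (discrete steps, a fixed discount), not Bradtke and Duff's exact continuous-time algorithm:

```python
def accumulate_smdp_return(rewards, gamma=0.9):
    """Accumulate the discounted return r_t + gamma*r_{t+1} + ... online,
    one reward at a time, while a temporally-extended action executes.
    Also returns gamma**tau, the discount to apply to the successor
    state's value in the SMDP backup."""
    R, discount = 0.0, 1.0
    for r in rewards:        # rewards observed step by step while waiting
        R += discount * r
        discount *= gamma
    return R, discount

# Three unit rewards during a waiting time of tau = 3 steps.
R, d = accumulate_smdp_return([1.0, 1.0, 1.0])
# R = 1 + 0.9 + 0.81 = 2.71, and d = 0.9**3 discounts the successor value.
```

Keeping `discount` alongside `R` is what lets the agent apply the full SMDP backup when the activity finally terminates, without remembering the individual rewards.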

112 | Convergence results for single-step on-policy reinforcement-learning algorithms
- Singh, Jaakkola, et al.
Citation Context ...called this algorithm Sarsa due to its dependence on s, a, r, s′, and a′. (Eq. (9) is actually a special case called Sarsa(0).) Unlike Q-learning, here the agent's policy does matter. Singh et al. [65] show that if the policy has the property that each action is executed infinitely often in every state that is visited infinitely often, and it is greedy with respect to the current action-value funct... |
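The on-policy Sarsa(0) backup described in the excerpt depends on the quintuple (s, a, r, s′, a′); a minimal sketch with illustrative states and values:

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    """Sarsa(0) backup: the target uses the action a2 the agent actually
    takes in s2, so the behavior policy matters (on-policy)."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
    return Q

# Illustrative values, not from the paper.
Q = {(s, a): 0.0 for s in (0, 1) for a in ("L", "R")}
Q[(1, "L")] = 2.0
Q = sarsa_update(Q, 0, "R", 1.0, 1, "L")
# Q[(0, "R")] moves one step of size alpha toward the target 1 + 0.9*2 = 2.8.
```

Compare with Q-learning, whose target maximizes over successor actions regardless of what the agent does next; here the agent's own exploratory choices feed directly into the backup, which is why the convergence conditions in [65] constrain the policy.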

109 | Hierarchical Control and Learning for Markov Decision Processes, Ph.D. thesis
- Parr
- 1998
Citation Context ... ideas of RL, and then we review three approaches to hierarchical RL: the options formalism of Sutton, Precup, and Singh [73], the hierarchies of abstract machines (HAMs) approach of Parr and Russell [48, 49], and the MAXQ framework of Dietterich [14]. Although these approaches were developed relatively independently, they have many elements in common. In particular, they all rely on the theory of semi-Ma... |

105 | A heuristic approach to the discovery of macro-operators
- Iba
- 1989
Citation Context ... knowledge transfer as previously-discovered options are reused in related tasks. This approach builds on previous work in artificial intelligence that addresses abstraction, particularly that of Iba [28], who proposed a method for discovering macro-operators in problem solving. Related ideas have been studied by Digney [15, 16]. 4.2 Hierarchies of Abstract Machines Parr [48, 49] developed an approach... |

102 | Programmable Reinforcement Learning Agents
- Andre
- 2003
Citation Context ...d the state transition probabilities, P(s′|s, a), s, s′ ∈ S, together comprise what RL researchers often call the one-step model of action a. A (stationary, stochastic) policy π : S × ⋃_{s∈S} A_s → [0, 1], with π(s, a) = 0 for a ∉ A_s, specifies that the agent executes action a ∈ A_s with probability π(s, a) whenever it observes state s. For any policy π and s ∈ S, V^π(s) denotes the expected infin... |
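A stationary stochastic policy as defined in the excerpt (π(s, a) = 0 for inadmissible actions, and π(s, ·) a distribution over the admissible set A_s) can be represented and sampled as below; the dictionary layout and names are illustrative assumptions:

```python
import random

def sample_action(policy, s, rng):
    """Draw an action from pi(s, .), a distribution over the admissible
    set A_s; actions absent from policy[s] implicitly have probability 0."""
    actions = list(policy[s])
    weights = [policy[s][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

# pi(s0, .) puts 3/4 of its mass on action "R"; pi must sum to 1 over A_s.
policy = {"s0": {"L": 0.25, "R": 0.75}}
rng = random.Random(0)
draws = [sample_action(policy, "s0", rng) for _ in range(1000)]
```

Only admissible actions ever appear in the draws, matching the constraint π(s, a) = 0 for a ∉ A_s.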

102 | Scaling Reinforcement Learning toward RoboCup Soccer
- Stone, Sutton
- 2001
Citation Context ...g problem. They demonstrated that the policies learned for this problem were better than standard heuristics used in industry, such as the "go to the nearest free machine" heuristic. Stone and Sutton [68] applied the framework of options to a "keep away" task in robot soccer. This task involves a set of players from one team passing the ball between them and keeping the ball in their possession agains... |

102 | Finding structure in reinforcement learning
- Thrun, Schwartz
- 1995
Citation Context ...d for a subset of the state set. The partial policies must also have well-defined termination conditions. These partial policies are sometimes called temporally-extended actions, options [73], skills [80], behaviors [9, 27], or the more control-theoretic modes [22]. When not discussing a specific formalism, we will use the term activity, as suggested by Harel [23]. For MDPs, this extension adds to the... |

99 | Average Reward Reinforcement Learning: Foundations, Algorithms and Empirical Results
- MAHADEVAN
- 1996
Citation Context ...e simplest class of MDPs, and here we restrict attention to discounted problems. However, RL algorithms have also been developed for MDPs with other definitions of return, such as average reward MDPs [39, 62]. Playing important roles in many RL algorithms are action-value functions, which assign values to admissible state-action pairs. Given a policy π, the value of (s, a), a ∈ A_s, denoted Q^π(s, a), i... |

98 | Xavier: A robot navigation architecture based on partially observable Markov decision process models
- Koenig, Simmons
- 1998
Citation Context ...tes to actions provide good performance in robot navigation (e.g., the most-likely-state (MLS) heuristic assumes the agent is in the state corresponding to the "peak" of the belief state distribution) [35, 63, 47]. Such heuristics work much better in H-POMDPs because they can be applied at multiple levels, and belief states over abstract states usually have lower entropy (Figure 8). For a detailed study of the... |

95 | A Reinforcement Learning Method for Maximizing Undiscounted Rewards
- SCHWARTZ
- 1993
Citation Context ...e simplest class of MDPs, and here we restrict attention to discounted problems. However, RL algorithms have also been developed for MDPs with other definitions of return, such as average reward MDPs [39, 62]. Playing important roles in many RL algorithms are action-value functions, which assign values to admissible state-action pairs. Given a policy π, the value of (s, a), a ∈ A_s, denoted Q^π(s, a), i... |

89 | Achieving Artificial Intelligence through Building Robots, AI Laboratory
- Brooks
- 1986
Citation Context ...f the state set. The partial policies must also have well-defined termination conditions. These partial policies are sometimes called temporally-extended actions, options [73], skills [80], behaviors [9, 27], or the more control-theoretic modes [22]. When not discussing a specific formalism, we will use the term activity, as suggested by Harel [23]. For MDPs, this extension adds to the sets of admissible... |

82 | Discovering hierarchy in reinforcement learning with HEXQ
- Hengst
- 2002
Citation Context ...ly discussed in Section 4.1 automated methods for identifying useful subgoals [15, 16, 43, 44] which address some aspects of this problem. Another approach called HEXQ was recently proposed by Hengst [24]. It exploits a factored state representation and sorts state variables into an ordered list, beginning with the variable that changes most rapidly. HEXQ builds a task hierarchy, consisting of one lev... |
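HEXQ's first step as described in the excerpt, sorting state variables by how rapidly they change, can be sketched from observed trajectories. This bookkeeping is a guess for illustration, not Hengst's implementation:

```python
def order_variables_by_change(trajectory):
    """Sort state-variable indices by how often each one changes along a
    trajectory (most rapidly changing first), mimicking HEXQ's ordering.
    `trajectory` is a list of equal-length state tuples."""
    n = len(trajectory[0])
    changes = [0] * n
    for prev, cur in zip(trajectory, trajectory[1:]):
        for i in range(n):
            if prev[i] != cur[i]:
                changes[i] += 1
    return sorted(range(n), key=lambda i: -changes[i])

# Variable 0 flips nearly every step; variable 1 changes only once.
traj = [(0, 0), (1, 0), (0, 0), (1, 1), (0, 1)]
order = order_variables_by_change(traj)  # -> [0, 1]
```

The fast-changing variable then seeds the lowest level of the task hierarchy, with slower variables stacked above it.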

80 | Elevator group control using multiple reinforcement learning agents
- Crites, Barto
- 1998
Citation Context ...ing function approximation methods for accumulating value function information, RL algorithms have produced good results on problems that pose significant challenges for standard methods (e.g., refs. [11, 75]). However, current RL methods by no means completely circumvent the curse of dimensionality: the exponential growth of the number of parameters to be learned with the size of any compact encoding of ... |

80 | Approximate Dynamic Programming for Real-Time Control and Neural Modeling
- Werbos
- 1992
Citation Context ... DP literature existing at that time. Other early RL research was explicitly motivated by animal behavior and its neural basis [45, 33, 34, 71]. Much of the current interest is attributable to Werbos [85, 86, 87], Watkins [82], and Tesauro's backgammon-playing system TD-Gammon [75, 76]. Additional information about RL can be found in several references (e.g., [2, 5, 32, 72]). Despite the utility of RL methods ... |

77 | Multi-time models for temporally abstract planning
- Precup, Sutton
- 1998

76 | Singular Perturbation Methods in Control: Analysis and Design
- Kokotovic, Khalil, et al.
- 1986
Citation Context ...roblems. Further work is required for understanding how to build task hierarchies in such cases, and how to integrate this approach to related systems approaches such as singular perturbation methods [36, 46]. 6.3 Dynamic Abstraction Systems such as those outlined in this article naturally provide opportunities for using different state representations depending on the activity that is currently executing...