## Hierarchical Multiagent Reinforcement Learning (2004)

Citations: 19 (5 self)

### BibTeX

@MISC{Ghavamzadeh04hierarchicalmultiagent,
  author = {Mohammad Ghavamzadeh and Sridhar Mahadevan},
  title  = {Hierarchical Multiagent Reinforcement Learning},
  year   = {2004}
}

### Abstract

In this paper, we investigate the use of hierarchical reinforcement learning (HRL) to speed up the acquisition of cooperative multiagent tasks. We introduce a hierarchical multiagent reinforcement learning (RL) framework and propose a hierarchical multiagent RL algorithm called Cooperative HRL. In our approach, agents are cooperative and homogeneous (they use the same task decomposition). Learning is decentralized, with each agent learning three interrelated skills: how to perform subtasks, in which order to do them, and how to coordinate with other agents. We define cooperative subtasks to be those subtasks in which coordination among agents significantly improves the performance of the overall task. The levels of the hierarchy that include cooperative subtasks are called cooperation levels. Since coordination at high levels allows for better cooperation skills (agents are not confused by low-level details), we usually define cooperative subtasks at the high levels of the hierarchy. This hierarchical approach allows agents to learn coordination faster by sharing information at the level of cooperative subtasks, rather than attempting to learn …
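The abstract's central mechanism (coordinating over abstract cooperative subtasks rather than primitive actions) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the class name, subtask labels, and the Q-table keyed on the other agents' subtasks are all assumptions.

```python
import random
from collections import defaultdict

class CooperationLevelPolicy:
    """Illustrative sketch (not the paper's code): at a cooperation level,
    an agent selects its next subtask conditioned on which high-level
    subtasks the OTHER agents are executing, so coordination is learned
    over abstract subtasks rather than low-level details."""

    def __init__(self, subtasks, epsilon=0.1):
        self.subtasks = subtasks
        # (state, other agents' subtasks, own subtask) -> estimated value
        self.q = defaultdict(float)
        self.epsilon = epsilon

    def choose(self, state, others_subtasks):
        # epsilon-greedy over values conditioned on the others' subtasks
        if random.random() < self.epsilon:
            return random.choice(self.subtasks)
        return max(self.subtasks,
                   key=lambda t: self.q[(state, tuple(others_subtasks), t)])
```

Because the key includes only the other agents' high-level subtasks, the table stays small even when the low-level state and action spaces are large.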

### Citations

1329 | Learning from Delayed Rewards
- Watkins
- 1989
Citation Context: ...rdless of the behavior of the other agents. On the other hand, best-response learners seek to learn the best response to the other agents. Although not an explicitly multi-agent algorithm, Q-learning [42] was one of the first algorithms applied to multi-agent problems [8, 40]. WoLF-PHC [6], joint-state/joint-action learners [5], and the gradient ascent learner in [35] are other examples of a best-resp...
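For context, the Q-learning rule cited as [42] is the standard one-step tabular update; a minimal sketch (the state/action encoding is an assumption for illustration):

```python
from collections import defaultdict

def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

    The target bootstraps from the greedy value at s' regardless of what
    any other agent does, which is why independent Q-learners treat a
    multi-agent problem as if it were stationary.
    """
    best_next = max(q[(s_next, a2)] for a2 in actions)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
    return q[(s, a)]
```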

583 | Markov Decision Processes
- Puterman
- 1990
Citation Context: ...ey can take a variable, stochastic amount of time. Thus, semi-Markov decision processes (SMDPs) have become the preferred language for modeling temporally extended actions. Semi-Markov decision processes [14, 30] extend the MDP model in several aspects. Decisions are only made at discrete points in time. The state of the system may change continually between decisions, unlike MDPs where state changes are only...
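The SMDP extension described here changes only the discount term of the Q-update: because a temporally extended action runs for a random duration N, the bootstrap value is discounted by gamma**N rather than gamma. A hedged sketch (function name and interfaces are assumptions):

```python
def smdp_q_update(q, s, a, reward, n_steps, s_next, actions,
                  alpha=0.1, gamma=0.95):
    """SMDP Q-learning step. `reward` is the (already discounted) return
    accumulated while the action ran for `n_steps` time steps, and the
    future value is discounted by gamma ** n_steps instead of gamma."""
    old = q.get((s, a), 0.0)
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    q[(s, a)] = old + alpha * (reward + gamma ** n_steps * best_next - old)
    return q[(s, a)]
```

Setting `n_steps = 1` recovers the ordinary one-step MDP update, which is why SMDPs are a strict generalization of MDPs here.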

575 | Multiagent Systems: a modern approach to distributed artificial intelligence
- Weiss
- 2001
Citation Context: ...A multi-agent system is a system in which several interacting, intelligent agents pursue some set of goals or perform some set of tasks [43]. In these systems, the decisions of an agent usually depend on the behavior of the other agents, which is often not predictable. This makes learning and adaptation a necessary component of an agent. Multi-...

499 | Markov Games as a Framework for Multi-agent Reinforcement Learning
- Littman
- 1994
Citation Context: ...to multiple agents. These algorithms can be summarized by broadly grouping them into two categories: equilibria learners and best-response learners. Equilibria learners such as Nash-Q [15], Minimax-Q [21], and Friend-or-Foe-Q [22] seek to learn an equilibrium of the game by iteratively computing intermediate equilibria. Under certain conditions, they guarantee convergence to their part of an equilibriu...

436 | Behavior-Based Formation Control for Multirobot Teams
- Balch, Arkin
- 1998
Citation Context: ...ty has also been addressed in multi-agent robotics. Multi-robot learning methods usually reduce the complexity of the problem by not modeling joint states or actions explicitly, such as work by Balch [2] and Mataric [25], among others. In such systems, each robot maintains its position in a formation depending on the locations of the other robots, so there is some implicit communication or sensing of...

428 | Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning
- Sutton, Precup, et al.
- 1999
Citation Context: ...ework for scaling RL to large domains by using the task structure to restrict the space of policies [3]. Several alternative frameworks for hierarchical RL (HRL) have been proposed, including options [38], HAMs [28], and MAXQ [9]. The key idea underlying our approach is that coordination skills are learned much more efficiently if the agents have a hierarchical representation of the task structure. Al...

372 | Hierarchical reinforcement learning with the MAXQ value function decomposition
- Dietterich
Citation Context: ...rge domains by using the task structure to restrict the space of policies [3]. Several alternative frameworks for hierarchical RL (HRL) have been proposed, including options [38], HAMs [28], and MAXQ [9]. The key idea underlying our approach is that coordination skills are learned much more efficiently if the agents have a hierarchical representation of the task structure. Algorithms for learning tas...

289 | The complexity of decentralized control of Markov decision processes
- Bernstein, Zilberstein, et al.
Citation Context: ...buted agents to base their decisions on their local observations. This model is called decentralized POMDP (DEC-POMDP), and it has been shown that the decision problem for a DEC-POMDP is NEXP-complete [4]. One way to address partial observability in distributed multi-agent domains is to use communication to exchange required information. However, since communication can be costly, in addition to its n...

284 | Multiagent Reinforcement Learning: Theoretical Framework and an Algorithm
- Hu, Wellman
- 1998
Citation Context: ...rocesses (MDPs) to multiple agents. These algorithms can be summarized by broadly grouping them into two categories: equilibria learners and best-response learners. Equilibria learners such as Nash-Q [15], Minimax-Q [21], and Friend-or-Foe-Q [22] seek to learn an equilibrium of the game by iteratively computing intermediate equilibria. Under certain conditions, they guarantee convergence to their part...

253 | Game Theory
- Owen
- 1995
Citation Context: ...hey have started to attract interest in AI, where their integration with existing methods constitutes a promising area of research. The game-theoretic concepts of stochastic games and Nash equilibria [10, 27] are the foundation for much of the recent research in multi-agent learning. Learning algorithms use stochastic games as a natural extension of Markov decision processes (MDPs) to multiple agents. The...

250 | Multi-agent reinforcement learning: Independent vs. cooperative agents
- Tan
- 1993
Citation Context: ...esponse learners seek to learn the best response to the other agents. Although not an explicitly multi-agent algorithm, Q-learning [42] was one of the first algorithms applied to multi-agent problems [8, 40]. WoLF-PHC [6], joint-state/joint-action learners [5], and the gradient ascent learner in [35] are other examples of a best-response learner. If an algorithm in which best-response learners playing wi...

227 | Competitive Markov decision processes
- Filar, Vrieze
- 1996
Citation Context: ...hey have started to attract interest in AI, where their integration with existing methods constitutes a promising area of research. The game-theoretic concepts of stochastic games and Nash equilibria [10, 27] are the foundation for much of the recent research in multi-agent learning. Learning algorithms use stochastic games as a natural extension of Markov decision processes (MDPs) to multiple agents. The...

224 | Graphical models for game theory
- Kearns, Littman, et al.
- 2001
Citation Context: ...ce in multi-agent systems and game theory [18, 19]. The previous work established algorithms for computing Nash equilibria in one-stage games, including efficient algorithms for computing approximate [16] and exact [23] Nash equilibria in tree-structured games, and convergent heuristics for computing Nash equilibria in general graphs [26, 41]. The curse of dimensionality has also been addressed in mul...

155 | Multi-agent influence diagrams for representing and solving games
- Koller, Milch
- 2003
Citation Context: ...onality in multi-agent systems. The goal is to transfer the representational and computational benefits that graphical models provide to probabilistic inference in multi-agent systems and game theory [18, 19]. The previous work established algorithms for computing Nash equilibria in one-stage games, including efficient algorithms for computing approximate [16] and exact [23] Nash equilibria in tree-struct...

144 | Sequential optimality and coordination in multiagent systems
- Boutilier
- 1999
Citation Context: ...ther agents. Although not an explicitly multi-agent algorithm, Q-learning [42] was one of the first algorithms applied to multi-agent problems [8, 40]. WoLF-PHC [6], joint-state/joint-action learners [5], and the gradient ascent learner in [35] are other examples of a best-response learner. If an algorithm in which best-response learners playing with each other converges, it must be to a Nash equilib...

136 | Reinforcement learning in the multi-robot domain
- Matarić
- 1997
Citation Context: ...icated problem, because the behavior of the other agents can be changing as they also adapt to achieve their own goals. It usually makes the environment non-stationary and often non-Markovian as well [25]. Robot soccer, disaster rescue, and e-commerce are examples of challenging multi-agent domains that need robust learning algorithms for coordination among multiple agents or for effectively responding t...

121 | Policy recognition in the abstract hidden Markov model
- Bui, Venkatesh, et al.
- 2002
Citation Context: ...n. Saria and Mahadevan presented a theoretical framework for online probabilistic plan recognition in cooperative multi-agent systems [33]. Their model extends the abstract hidden Markov model (AHMM) [7] to cooperative multi-agent domains. We believe that the model presented by Saria and Mahadevan can be combined with the learning algorithms proposed in this paper to reduce communication by learning ...

119 | Friend-or-foe Q-learning in general-sum games
- Littman
Citation Context: ... algorithms can be summarized by broadly grouping them into two categories: equilibria learners and best-response learners. Equilibria learners such as Nash-Q [15], Minimax-Q [21], and Friend-or-Foe-Q [22] seek to learn an equilibrium of the game by iteratively computing intermediate equilibria. Under certain conditions, they guarantee convergence to their part of an equilibrium solution regardless of ...

109 | Hierarchical Control and Learning for Markov Decision Processes (PhD thesis)
- Parr
- 1998
Citation Context: ...caling RL to large domains by using the task structure to restrict the space of policies [3]. Several alternative frameworks for hierarchical RL (HRL) have been proposed, including options [38], HAMs [28], and MAXQ [9]. The key idea underlying our approach is that coordination skills are learned much more efficiently if the agents have a hierarchical representation of the task structure. Algorithms fo...

91 | Nash convergence of gradient dynamics in general-sum games
- Singh, Kearns, et al.
- 2000
Citation Context: ...multi-agent algorithm, Q-learning [42] was one of the first algorithms applied to multi-agent problems [8, 40]. WoLF-PHC [6], joint-state/joint-action learners [5], and the gradient ascent learner in [35] are other examples of a best-response learner. If an algorithm in which best-response learners playing with each other converges, it must be to a Nash equilibrium [6]. The RL framework has been well-...

85 | Coordinated Reinforcement Learning
- Guestrin, Lagoudakis, et al.
- 2002
Citation Context: ...ssed using value function based RL [34] as well as policy gradient based RL [29]. Another approach is to exploit the structure in a multi-agent problem using factored value functions. Guestrin et al. [13] integrate these ideas in collaborative multi-agent domains. They use value function approximation and approximate the joint value function as a linear combination of local value functions, each of wh...
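The factored value function idea attributed to Guestrin et al. here can be made concrete in a few lines: the joint value is approximated as a linear combination of local value functions, each of which sees only a small subset of the state variables. The term structure and state representation below are assumptions for illustration, not the cited paper's API:

```python
def factored_joint_value(local_terms, state):
    """Approximate the joint value as a linear combination of local value
    functions. Each term is (weight, scope, f): f is evaluated on only the
    state variables named in `scope`, so no term ever touches the full
    joint state -- this is what keeps the representation tractable."""
    return sum(w * f({k: state[k] for k in scope})
               for w, scope, f in local_terms)
```

Adding an agent then adds terms over small scopes instead of exponentially enlarging one joint table.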

80 | Modeling and Analysis of Manufacturing Systems
- Askin, Standridge
- 1993
Citation Context: ... as in an SMDP, the set of states of the system being controlled, the reward function mapping S → ℝ, and the state- and action-dependent multi-step transition probability function P : S × ℕ × S × A → [0, 1] (where ℕ is the set of natural numbers). The term P(s′, N | s, a) denotes the probability that the joint action a will cause the system to transition from state s to state s′ in N time steps. Sin...

80 | Elevator group control using multiple reinforcement learning agents
- Crites, Barto
- 1998
Citation Context: ...esponse learners seek to learn the best response to the other agents. Although not an explicitly multi-agent algorithm, Q-learning [42] was one of the first algorithms applied to multi-agent problems [8, 40]. WoLF-PHC [6], joint-state/joint-action learners [5], and the gradient ascent learner in [35] are other examples of a best-response learner. If an algorithm in which best-response learners playing wi...

64 | Communication decisions in multi-agent cooperation: Model and experiments
- Xuan, Lesser, et al.
- 2001
Citation Context: ... to use communication to exchange required information. However, since communication can be costly, in addition to its normal actions, each agent needs to decide about communication with other agents [44, 45]. Pynadath and Tambe [31] extended DEC-POMDP by including communication decisions in the model, and propo...

55 | Team-partitioned, opaque-transition reinforcement learning
- Stone, Veloso
- 1999
Citation Context: ...g of states and actions of the other agents. There has also been work on reducing the parameters needed for Q-learning in multi-agent domains, by learning action values over a set of derived features [36]. These derived features are domain-specific, and have to be encoded by hand, or constructed by a supervised learning algorithm. In a cooperative multi-agent setting, it is usually necessary for each ...

53 | Distributed Value Functions
- Schneider, Wong, et al.
- 1999
Citation Context: ...of information that is available to each agent and hope to maximize the global payoff by solving local optimization problems for each agent. This idea has been addressed using value function based RL [34] as well as policy gradient based RL [29]. Another approach is to exploit the structure in a multi-agent problem using factored value functions. Guestrin et al. [13] integrate these ideas in collabora...

41 | Multi-agent algorithms for solving graphical games
- Vickrey, Koller
- 2002
Citation Context: ...games, including efficient algorithms for computing approximate [16] and exact [23] Nash equilibria in tree-structured games, and convergent heuristics for computing Nash equilibria in general graphs [26, 41]. The curse of dimensionality has also been addressed in multi-agent robotics. Multi-robot learning methods usually reduce the complexity of the problem by not modeling joint states or actions explici...

39 | Learning to improve coordinated actions in cooperative distributed problem-solving environments
- Sugawara, Lesser
- 1998

32 | Game networks
- Mura, P
- 2000
Citation Context: ...onality in multi-agent systems. The goal is to transfer the representational and computational benefits that graphical models provide to probabilistic inference in multi-agent systems and game theory [18, 19]. The previous work established algorithms for computing Nash equilibria in one-stage games, including efficient algorithms for computing approximate [16] and exact [23] Nash equilibria in tree-struct...

32 | Hierarchical multi-agent reinforcement learning
- Makar, Mahadevan, et al.
- 2001
Citation Context: ...t systems. Our approach differs from the previous work in one key respect, namely the use of task hierarchies to scale multi-agent reinforcement learning (RL). We originally proposed this approach in [24], and subsequently extended it in [12]. Hierarchical methods constitute a general framework for scaling RL to large domains by using the task structure to restrict the space of policies [3]. Several a...

32 | Multi-agent policies: from centralized ones to decentralized ones
- Xuan, Lesser
- 2002
Citation Context: ... to use communication to exchange required information. However, since communication can be costly, in addition to its normal actions, each agent needs to decide about communication with other agents [44, 45]. Pynadath and Tambe [31] extended DEC-POMDP by including communication decisions in the model, and propo...

25 | Scaling up average reward reinforcement learning by approximating the domain models and the value function
- Tadepalli, Ok
- 1996
Citation Context: ... and their combinations are generally used to schedule AGVs [17, 20]. However, the heuristics perform poorly when the constraints on the movement of the AGVs are reduced. Previously, Tadepalli and Ok [39] studied a single-agent AGV scheduling task using flat average-reward RL. However, the multi-agent AGV task we study is more complex. Figure 5 shows the layout of the AGV scheduling domain used in the...

23 | Nash propagation for loopy graphical games
- Ortiz, Kearns
- 2002
Citation Context: ...games, including efficient algorithms for computing approximate [16] and exact [23] Nash equilibria in tree-structured games, and convergent heuristics for computing Nash equilibria in general graphs [26, 41]. The curse of dimensionality has also been addressed in multi-agent robotics. Multi-robot learning methods usually reduce the complexity of the problem by not modeling joint states or actions explici...

20 | An efficient exact algorithm for singly connected graphical games
- Littman, Kearns, et al.
- 2001
Citation Context: ...nt systems and game theory [18, 19]. The previous work established algorithms for computing Nash equilibria in one-stage games, including efficient algorithms for computing approximate [16] and exact [23] Nash equilibria in tree-structured games, and convergent heuristics for computing Nash equilibria in general graphs [26, 41]. The curse of dimensionality has also been addressed in multi-agent roboti...

18 | Dynamic Probabilistic Systems: Semi-Markov and Decision Processes
- Howard
- 1971
Citation Context: ...ey can take a variable, stochastic amount of time. Thus, semi-Markov decision processes (SMDPs) have become the preferred language for modeling temporally extended actions. Semi-Markov decision processes [14, 30] extend the MDP model in several aspects. Decisions are only made at discrete points in time. The state of the system may change continually between decisions, unlike MDPs where state changes are only...

16 | Probabilistic plan recognition in multiagent systems
- Saria, Mahadevan
Citation Context: ...teammates using recent observations instead of direct communication. Saria and Mahadevan presented a theoretical framework for online probabilistic plan recognition in cooperative multi-agent systems [33]. Their model extends the abstract hidden Markov model (AHMM) [7] to cooperative multi-agent domains. We believe that the model presented by Saria and Mahadevan can be combined with the learning algor...

14 | Learning to take concurrent actions
- Rohanimanesh, Mahadevan
- 2002
Citation Context: ...e decision epochs and, as a result, depends on the termination scheme T. Three termination strategies τany, τall, and τcontinue for temporally extended joint actions were introduced and analyzed in [32]. In the τany termination scheme, the next decision epoch is when the first action within the joint action currently being executed terminates, where the rest of the actions that did not terminate are int...
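The τany and τall schemes described in this context can be made concrete with a small helper. This is an illustrative sketch under an assumed representation (one remaining duration per agent's current action), not the cited paper's formulation:

```python
def next_decision_epoch(remaining, scheme):
    """Return how many time steps until the next joint decision epoch,
    given the remaining durations of the agents' executing actions.

    - 'any': decide as soon as the FIRST action terminates (the actions
             that did not terminate are interrupted at that point).
    - 'all': decide only when EVERY action has terminated (agents that
             finish early idle until then).
    (τcontinue lets finished agents pick new actions while the others keep
    running, so it has no single joint epoch and is omitted here.)
    """
    if scheme == "any":
        return min(remaining)
    if scheme == "all":
        return max(remaining)
    raise ValueError(f"unknown termination scheme: {scheme!r}")
```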

9 | Hierarchical policy gradient algorithms
- Ghavamzadeh, Mahadevan
- 2003
Citation Context: ...inuous state and/or action spaces, using a mixture of policy gradient based RL and value function based RL methods [11]. We believe that the algorithms proposed in this paper can be combined with the algorithms presented in [11] to be used in multi-agent domains with continuous state and/or action. The success of the ...

8 | Learning to communicate and act using hierarchical reinforcement learning
- Ghavamzadeh, Mahadevan
Citation Context: ...he previous work in one key respect, namely the use of task hierarchies to scale multi-agent reinforcement learning (RL). We originally proposed this approach in [24], and subsequently extended it in [12]. Hierarchical methods constitute a general framework for scaling RL to large domains by using the task structure to restrict the space of policies [3]. Several alternative frameworks for hierarchical...

7 | Composite Dispatching Rules for Multiple-Vehicle
- Lee
- 1996
Citation Context: ...he warehouse or some other locations. The pick-up point is the machine or workstation's output buffer. Any FMS using AGVs faces the problem of optimally scheduling the paths of the AGVs in the system [20]. For example, a move request occurs when a part finishes at a workstation. If more than one vehicle is empty, the vehicle which would service this request needs to be selected. Also, when a vehicle b...

5 | Multiagent learning using a variable learning rate
- Bowling, Veloso
- 2002
Citation Context: ...eek to learn the best response to the other agents. Although not an explicitly multi-agent algorithm, Q-learning [42] was one of the first algorithms applied to multi-agent problems [8, 40]. WoLF-PHC [6], joint-state/joint-action learners [5], and the gradient ascent learner in [35] are other examples of a best-response learner. If an algorithm in which best-response learners playing with each other ...

4 | Learning to Cooperate via Policy Search
- Peshkin, Kim, et al.
- 2000
Citation Context: ...agent and hope to maximize the global payoff by solving local optimization problems for each agent. This idea has been addressed using value function based RL [34] as well as policy gradient based RL [29]. Another approach is to exploit the structure in a multi-agent problem using factored value functions. Guestrin et al. [13] integrate these ideas in collaborative multi-agent domains. They use value ...

4 | The Communicative Multiagent Team Decision Problem: Analyzing Teamwork Theories and Models
- Pynadath, Tambe
- 2002
Citation Context: ...ange required information. However, since communication can be costly, in addition to its normal actions, each agent needs to decide about communication with other agents [44, 45]. Pynadath and Tambe [31] extended DEC-POMDP by including communication decisions in the model, and proposed a framework called co...

4 | Learning to Improve Coordinated Actions
- Sugawara, Lesser
- 1998
Citation Context: ... learned much more efficiently if the agents have a hierarchical representation of the task structure. Algorithms for learning task-level coordination have already been developed in non-MDP approaches [37]; however, to the best of our knowledge, our work has been the first attempt to use task-level coordination in an MDP setting. The use of hierarchy speeds up learning in multi-agent domains by making i...