## Planning, learning and coordination in multiagent decision processes (1996)

### Download Links

- [www.cs.toronto.edu]
- [www.cs.utoronto.ca]
- DBLP

### Other Repositories/Bibliography

Venue: Proceedings of the Sixth Conference on Theoretical Aspects of Rationality and Knowledge (TARK-96)

Citations: 96 (1 self)

### BibTeX

@INPROCEEDINGS{Boutilier96planning,

author = {Craig Boutilier},

title = {Planning, learning and coordination in multiagent decision processes},

booktitle = {Proceedings of the Sixth Conference on Theoretical Aspects of Rationality and Knowledge (TARK-96)},

year = {1996},

pages = {195--210}

}

### Abstract

There has been a growing interest in AI in the design of multiagent systems, especially in multiagent cooperative planning. In this paper, we investigate the extent to which methods from single-agent planning and learning can be applied in multiagent settings. We survey a number of different techniques from decision-theoretic planning and reinforcement learning and describe a number of interesting issues that arise with regard to coordinating the policies of individual agents. To this end, we describe multiagent Markov decision processes as a general model in which to frame this discussion. These are special n-person cooperative games in which agents share the same utility function. We discuss coordination mechanisms based on imposed conventions (or social laws) as well as learning methods for coordination. Our focus is on the decomposition of sequential decision processes so that coordination can be learned (or imposed) locally, at the level of individual states. We also discuss the use of structured problem representations and their role in the generalization of learned conventions and in approximation.

### Citations

2611 | Dynamic Programming
- Bellman
- 1957
Citation Context: ... discounted problems have been well-studied. While algorithms such as modified policy iteration [43] are often used in practice, an especially simple algorithm is value iteration, based on Bellman's [4] "principle of optimality." We discuss value iteration because of its simplicity, as well as the close relationship it bears to the RL techniques we describe below. We start with a random value function ...
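The value iteration procedure described in this excerpt can be made concrete in a few lines. Below is a minimal sketch on an invented two-state, two-action MDP; the transition probabilities, rewards, and discount factor are assumptions for the example, not taken from the paper:

```python
# Minimal value iteration sketch on an invented 2-state MDP.
# P[s][a]: list of (next_state, probability); R[s][a]: immediate reward.
GAMMA = 0.9  # assumed discount factor

P = {
    0: {"stay": [(0, 1.0)], "go": [(1, 0.8), (0, 0.2)]},
    1: {"stay": [(1, 1.0)], "go": [(0, 1.0)]},
}
R = {0: {"stay": 0.0, "go": 0.0}, 1: {"stay": 1.0, "go": 0.0}}

def value_iteration(eps=1e-6):
    V = {s: 0.0 for s in P}  # start from an arbitrary (zero) value function
    while True:
        # Bellman backup: best expected reward plus discounted future value
        V_new = {
            s: max(R[s][a] + GAMMA * sum(p * V[t] for t, p in P[s][a]) for a in P[s])
            for s in P
        }
        if max(abs(V_new[s] - V[s]) for s in P) < eps:
            return V_new
        V = V_new

V = value_iteration()
```

Each sweep applies one Bellman backup to every state; the loop stops when successive value functions differ by less than `eps`, which (for discounted problems) bounds the distance to the optimal value function.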

1227 | Learning to predict by the methods of temporal differences
- Sutton
- 1988
Citation Context: ...ts of the MDP that are unknown. Instead the agent learns the optimal value function and optimal policy (more or less) directly. Two popular (and related) methods are temporal difference (TD) learning [53] and Q-learning [56]. It is best to think of TD-methods as learning the value function for a fixed policy; thus it must be combined with another RL method that can use the value function to do policy ...
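The Q-learning rule mentioned here admits a compact tabular sketch. The toy two-state environment, learning rate, and epsilon-greedy exploration schedule below are illustrative assumptions, not details from the surveyed paper:

```python
import random

# Tabular Q-learning sketch on an invented 2-state, 2-action environment.
random.seed(0)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # assumed learning parameters
ACTIONS = ("left", "right")

def step(state, action):
    # toy deterministic dynamics: 'right' always leads to state 1;
    # taking 'right' *from* state 1 pays reward 1, everything else pays 0
    if action == "right":
        return 1, (1.0 if state == 1 else 0.0)
    return 0, 0.0

Q = {(s, a): 0.0 for s in (0, 1) for a in ACTIONS}
for _ in range(2000):
    s = random.choice((0, 1))
    for _ in range(10):
        # epsilon-greedy action selection
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda b: Q[(s, b)])
        s2, r = step(s, a)
        # Q-learning update: bootstrap on the best action in the next state
        target = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2
```

Because the update bootstraps on `max` over next-state actions rather than on the action actually taken, the agent learns the optimal value function directly, as the excerpt notes.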

1198 | Markov Decision Processes: Discrete Stochastic Dynamic Programming
- Puterman
- 1994
Citation Context: ...ecision making might take. Since we are interested in planning under uncertainty, with competing objectives and (potentially) indefinite or infinite horizon, we adopt Markov decision processes (MDPs) [26, 42] as our underlying (single agent) decision model. MDPs have been used as the basis for much work in decision-theoretic planning (DTP) [20, 17, 7, 55, 9], and techniques for computing optimal policies ...

727 | The Evolution of Cooperation
- Axelrod
- 1984
Citation Context: ...ponse to the population at large, as well as limited memory models. Such models have been studied quite extensively in game theory as well, including experimental work and formal analysis of convergence [2, 35, 59, 24]. The notion of fictitious play [41] offers a very simple learning model in which agents keep track of the frequency with which opponents use particular strategies, and at any point in time adopt a be...
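Fictitious play, as described in this excerpt, can be sketched directly: each agent tracks the empirical frequency of the other's past actions and best-responds to it. The 2x2 identical-payoff coordination game and the initial belief pseudo-counts below are invented for illustration:

```python
# Fictitious play sketch in an identical-payoff coordination game:
# both agents get 1 iff their actions match.
ACTIONS = ("a", "b")

def payoff(x, y):
    return 1.0 if x == y else 0.0

# Each agent keeps counts of the OTHER agent's past plays; the asymmetric
# initial pseudo-counts are an arbitrary assumption for the example.
counts = [{"a": 1, "b": 0}, {"a": 0, "b": 1}]
history = []
for _ in range(50):
    plays = []
    for i in (0, 1):
        total = sum(counts[i].values())
        freq = {y: counts[i][y] / total for y in ACTIONS}
        # best response to the opponent's empirical mixed strategy
        plays.append(max(ACTIONS, key=lambda x: sum(freq[y] * payoff(x, y) for y in ACTIONS)))
    counts[0][plays[1]] += 1  # agent 0 records agent 1's play
    counts[1][plays[0]] += 1  # and vice versa
    history.append(tuple(plays))
```

In this run the agents miscoordinate once, their beliefs equalize, and from then on both lock into the same action: a convention has emerged from repeated play.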

602 | Game Theory: Analysis of Conflict
- Myerson
- 1991
Citation Context: ...games in which the payoff function is the same for all agents. MMDPs are a form of stochastic game [41]; but they are most closely related to the general framework for repeated games discussed by Myerson [39] (which themselves are generalizations of partially observable MDPs [52, 1]). We will have occasion to exploit both perspectives: MMDPs as a generalization of (single-agent) MDPs; and MMDPs as a speci...

526 | Learning to act using real-time dynamic programming
- Barto, Bradtke, et al.
- 1995
Citation Context: ...dimensionality. Much emphasis in DTP research has been placed on the issue of speeding up computation, and several solutions proposed, including restricting search to local regions of the state space [17, 21, 3, 55] or reducing the state space via abstraction or clustering of states [7]. Both approaches reduce the state space in a way that allows MDP solution techniques to be used, and generate approximately opt...

516 | Dynamic Programming and Markov Processes
- Howard
- 1960
Citation Context: ...ecision making might take. Since we are interested in planning under uncertainty, with competing objectives and (potentially) indefinite or infinite horizon, we adopt Markov decision processes (MDPs) [26, 42] as our underlying (single agent) decision model. MDPs have been used as the basis for much work in decision-theoretic planning (DTP) [20, 17, 7, 55, 9], and techniques for computing optimal policies ...

498 | Markov games as a framework for multi-agent reinforcement learning
- Littman
- 1994
Citation Context: ...tions (or social laws) might be imposed by the system designer so that optimal joint action is assured [31, 48]; or a coordinated policy (or conventions) might be learned through repeated interaction [47, 46, 32]. We focus here primarily on imposed conventions and learned coordination of behavior, especially in sequential decision processes. Ultimately, we are interested in the extent to which models, represe...

473 | Integrated architectures for learning, planning, and reacting based on approximating dynamic programming
- Sutton
- 1990
Citation Context: ...ar that this approach will generally be wildly infeasible, for it requires recomputation of an optimal policy after each sample (nor does it address the issue of exploration). Sutton's Dyna technique [54] adopts a less extreme approach. After a sample ⟨s, a, t, r⟩ is obtained, the estimated model P̂r, R̂ is statistically updated and some number of Q-values are revised using the new model. In particu...
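The Dyna scheme described here interleaves backups from real experience with simulated backups drawn from a learned model. A minimal sketch, assuming an invented deterministic toy environment (so the "statistical" model update reduces to memorizing the last observed outcome) and purely random acting:

```python
import random

# Dyna-style learner sketch: one real backup plus K model-simulated backups
# per real environment step. Environment and parameters are invented.
random.seed(1)
ALPHA, GAMMA, K = 0.2, 0.9, 5  # K = simulated backups per real step (assumed)
ACTIONS = ("left", "right")

def env(state, action):
    # deterministic toy chain: 'right' leads to state 1; from state 1 it pays 1
    if action == "right":
        return 1, (1.0 if state == 1 else 0.0)
    return 0, 0.0

Q = {(s, a): 0.0 for s in (0, 1) for a in ACTIONS}
model = {}  # learned model (s, a) -> (s', r); deterministic world, so a lookup suffices

def backup(s, a, s2, r):
    best_next = max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

for _ in range(300):
    s = random.choice((0, 1))
    a = random.choice(ACTIONS)   # purely exploratory acting, for brevity
    s2, r = env(s, a)
    model[(s, a)] = (s2, r)      # model update from the real sample
    backup(s, a, s2, r)          # one real backup ...
    for _ in range(K):           # ... plus K simulated backups from the model
        ms, ma = random.choice(list(model))
        ms2, mr = model[(ms, ma)]
        backup(ms, ma, ms2, mr)
```

The simulated backups let the agent squeeze extra value-function updates out of each real sample, the middle ground between pure model-free Q-learning and full policy recomputation that the excerpt describes.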

457 | A model for reasoning about persistence and causation
- Dean, Kanazawa
- 1989
Citation Context: ...on for MDPs. We assume that a set of atomic propositions P describes our system, inducing a state space of size 2^|P|, and use two-stage temporal or dynamic Bayesian networks to describe our actions [18, 9]. For each action, we have a Bayes net with one set of nodes representing the system state prior to the action (one node for each variable), another set representing the world after the action has ...

417 | Convention: A Philosophical Study
- Lewis
- 1969
Citation Context: ...ing. For example, agents might communicate in order to determine task allocation [37, 57]; conventions (or social laws) might be imposed by the system designer so that optimal joint action is assured [31, 48]; or a coordinated policy (or conventions) might be learned through repeated interaction [47, 46, 32]. We focus here primarily on imposed conventions and learned coordination of behavior, especially i...

402 | A General Theory of Equilibrium Selection in Games
- Harsanyi, Selten
- 1988
Citation Context: ...Thus, individual agents must determine their own policies. Treating the MMDP as an n-person game, it is easy to see that determining an optimal joint policy is a problem of equilibrium selection [39, 25]. In particular, each optimal joint policy is a Nash equilibrium of the stochastic game: once the agents adopt the individual components of an optimal policy, there is no incentive to deviate from thi...
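The equilibrium-selection difficulty noted in this excerpt is visible in even the smallest identical-payoff game. The following sketch, with an invented 2x2 coordination game, shows that several optimal joint actions can coexist, so knowing the optimal value alone does not tell an individual agent which component to play:

```python
from itertools import product

# Invented 2x2 identical-payoff "state game": both agents are rewarded 1
# iff their action choices match.
ACTIONS = ("a", "b")
U = {(x, y): (1.0 if x == y else 0.0) for x, y in product(ACTIONS, repeat=2)}

best = max(U.values())
optimal_joint = sorted(ja for ja, u in U.items() if u == best)
# Two distinct optimal joint actions (both pure Nash equilibria) remain,
# so the agents still face a coordination / equilibrium-selection choice.
```

Each element of `optimal_joint` is a Nash equilibrium: if both agents play their component of one of them, neither can gain by deviating unilaterally, yet nothing in the payoffs singles out which equilibrium to adopt.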

336 | The Optimal Control of Partially Observable Markov Processes
- Sondik
- 1971
Citation Context: ...a form of stochastic game [41]; but it is most closely related to the general framework for repeated games discussed by Myerson [39] (which themselves are generalizations of partially observable MDPs [52, 1]). We will have occasion to exploit both perspectives: MMDPs as a generalization of (single-agent) MDPs; and MMDPs as a specialization of n-person stochastic games. Definition A multiagent Markov deci...

305 | Planning and control
- Dean, Wellman
- 1991
Citation Context: ...finite horizon, we adopt Markov decision processes (MDPs) [26, 42] as our underlying (single agent) decision model. MDPs have been used as the basis for much work in decision-theoretic planning (DTP) [20, 17, 7, 55, 9], and techniques for computing optimal policies have been adapted to AI planning tasks. Furthermore, MDPs form the foundation of most work in reinforcement learning (RL), in which agents learn optimal...

295 | The optimal control of partially observable Markov processes over a finite horizon
- Smallwood, Sondik
- 1973
Citation Context: ...adopt an optimal policy that maximizes the expected rewards accumulated as it performs ... [Footnote 1: Partially observable processes are much more realistic in many cases [13], but are much less tractable computationally [51]. We do not consider these here (but see the concluding section). Footnote 2: Thus we restrict attention to stationary policies. For the problems we consider, optimal stationary policies always exist.]

294 | The Evolution of Conventions
- Young
- 1993
Citation Context: ...ponse to the population at large, as well as limited memory models. Such models have been studied quite extensively in game theory as well, including experimental work and formal analysis of convergence [2, 35, 59, 24]. The notion of fictitious play [41] offers a very simple learning model in which agents keep track of the frequency with which opponents use particular strategies, and at any point in time adopt a be...

275 | Acting optimally in partially observable stochastic domains
- Cassandra, Kaelbling, et al.
- 1994
Citation Context: ...t is in state s. Given an MDP, an agent ought to adopt an optimal policy that maximizes the expected rewards accumulated as it ... [Footnote 1: Partially observable processes are much more realistic in many cases [13], but are much less tractable computationally [51]. We do not consider these here (but see the concluding section). Footnote 2: Thus we restrict attention to stationary policies. For the problems we consider, o...]

254 | Reward, motivation, and reinforcement learning
- Dayan, Balleine
- 2002
Citation Context: ...ults. In a similar vein, Yanco and Stein [58] also reported experimental results with hierarchical Q-learning in order to learn cooperative communication between a leader and follower agent (see also [16]). In both works, convergence is achieved. A slightly different "semi-cooperative" application of Q-learning is reported in [45], where a set of agents used straightforward Q-learning in the presence o...

253 | Probabilistic robot navigation in partially observable environments
- Simmons, Koenig
- 1995
Citation Context: ...planning by addressing these issues [20]. In particular, the theory of Markov decision processes (MDPs) has found considerable popularity recently both as a conceptual and computational model for DTP [17, 7, 55, 9, 49]. In addition, reinforcement learning [28] can be viewed as a means of learning to act optimally, or incrementally constructing an optimal plan through repeated interaction with the environment. Again...

251 | Generalization in reinforcement learning: Safely approximating the value function
- Boyan, Moore
- 1995
Citation Context: ...research has been carried out in the RL community on generalization of learned values and action choices, as well as value function approximation (especially in continuous domains); see, for example, [22, 14, 38, 12, 36, 50]. The survey [28] provides a nice discussion of these issues. 6 Concluding Remarks We have surveyed a number of issues that arise in the application of single-agent planning and learning techniques to...

250 | Game Theory
- Owen
- 1982
Citation Context: ...existence of a joint utility function. But, in fact, they are nothing more than n-person stochastic games in which the payoff function is the same for all agents. MMDPs are a form of stochastic game [41]; but it is most closely related to the general framework for repeated games discussed by Myerson [39] (which themselves are generalizations of partially observable MDPs [52, 1]). We will have occasio...

224 | Exploiting structure in policy construction
- Boutilier, Dearden, Goldszmidt
- 1995
Citation Context: ...finite horizon, we adopt Markov decision processes (MDPs) [26, 42] as our underlying (single agent) decision model. MDPs have been used as the basis for much work in decision-theoretic planning (DTP) [20, 17, 7, 55, 9], and techniques for computing optimal policies have been adapted to AI planning tasks. Furthermore, MDPs form the foundation of most work in reinforcement learning (RL), in which agents learn optimal...

224 | The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces
- Moore, Atkeson
- 1995
Citation Context: ...e they have similar or identical value and/or action choice. These aggregates are treated as a single state in dynamic programming algorithms for the solution of MDPs or the related methods used in RL [44, 5, 36, 7, 9, 19, 22, 14, 38]. Such aggregations can be based on a number of different problem features, such as similarity of states according to some domain metric, but generally assume that the states so grouped have the same ...

215 | Rational Learning leads to Nash Equilibrium
- Kalai, Lehrer
- 1993
Citation Context: ...gies, and adopt appropriate best responses. He shows conditions under which such a method will converge to a pure strategy equilibrium (which include coordination games). The work of Kalai and Lehrer [29] is also of considerable importance here. They model a "repeated game" as a true stochastic game so that performance during learning can be accounted for when determining a best response. They assume ...

172 | On the synthesis of useful social laws for artificial agent societies
- Shoham, Tennenholtz
- 1992
Citation Context: ...ing. For example, agents might communicate in order to determine task allocation [37, 57]; conventions (or social laws) might be imposed by the system designer so that optimal joint action is assured [31, 48]; or a coordinated policy (or conventions) might be learned through repeated interaction [47, 46, 32]. We focus here primarily on imposed conventions and learned coordination of behavior, especially i...

169 | Reward Functions for Accelerated Learning
- Mataric
- 1994
Citation Context: ...r agents. In other words, other agents are simply treated as part of the environment. Recent work in applying Q-learning to multiagent systems seems to adopt just this approach. For instance, Mataric [34] describes experiments with mobile robots in which Q-learning is applied to a cooperative task with good results. In a similar vein, Yanco and Stein [58] also reported experimental results with hierar...

148 | Learning to coordinate without sharing information
- Sen, Sekaran, et al.
- 1994
Citation Context: ...tions (or social laws) might be imposed by the system designer so that optimal joint action is assured [31, 48]; or a coordinated policy (or conventions) might be learned through repeated interaction [47, 46, 32]. We focus here primarily on imposed conventions and learned coordination of behavior, especially in sequential decision processes. Ultimately, we are interested in the extent to which models, represe...

137 | Planning with Deadlines in Stochastic domains
- Dean, Kaelbling, et al.
- 1993
Citation Context: ...finite horizon, we adopt Markov decision processes (MDPs) [26, 42] as our underlying (single agent) decision model. MDPs have been used as the basis for much work in decision-theoretic planning (DTP) [20, 17, 7, 55, 9], and techniques for computing optimal policies have been adapted to AI planning tasks. Furthermore, MDPs form the foundation of most work in reinforcement learning (RL), in which agents learn optimal...

134 | Input generalization in delayed reinforcement learning: an algorithm and performance comparisons
- Chapman, Kaelbling
- 1991
Citation Context: ...e they have similar or identical value and/or action choice. These aggregates are treated as a single state in dynamic programming algorithms for the solution of MDPs or the related methods used in RL [44, 5, 36, 7, 9, 19, 22, 14, 38]. Such aggregations can be based on a number of different problem features, such as similarity of states according to some domain metric, but generally assume that the states so grouped have the same ...

122 | Optimal Control of Markov Decision Processes with Incomplete State Estimation
- Astrom
- 1965
Citation Context: ...a form of stochastic game [41]; but it is most closely related to the general framework for repeated games discussed by Myerson [39] (which themselves are generalizations of partially observable MDPs [52, 1]). We will have occasion to exploit both perspectives: MMDPs as a generalization of (single-agent) MDPs; and MMDPs as a specialization of n-person stochastic games. Definition A multiagent Markov deci...

113 | Computing Optimal Policies for Partially Observable Decision Processes Using Compact Representations
- Boutilier, Poole
- 1996
Citation Context: ...coffee shop across the street, can get wet if it is raining unless it has an umbrella, and is rewarded if it brings coffee when the user requests it, and penalized (to a lesser extent) if it gets wet [9, 10]. This network describes the action of fetching coffee. See [22] for a similar approach to RL for goal-based, deterministic problems. [figure residue omitted: Bayes net and tree representation of the fetch-coffee action] ...

111 | Reinforcement learning with soft state aggregation
- Singh, Jaakkola, et al.
- 1995
Citation Context: ...research has been carried out in the RL community on generalization of learned values and action choices, as well as value function approximation (especially in continuous domains); see, for example, [22, 14, 38, 12, 36, 50]. The survey [28] provides a nice discussion of these issues. 6 Concluding Remarks We have surveyed a number of issues that arise in the application of single-agent planning and learning techniques to...

110 | Decomposition Techniques for Planning in Stochastic Domains
- Dean, Lin
- 1995
Citation Context: ...e they have similar or identical value and/or action choice. These aggregates are treated as a single state in dynamic programming algorithms for the solution of MDPs or the related methods used in RL [44, 5, 36, 7, 9, 19, 22, 14, 38]. Such aggregations can be based on a number of different problem features, such as similarity of states according to some domain metric, but generally assume that the states so grouped have the same ...

89 | Planning under uncertainty: structural assumptions and computational leverage. New directions
- Boutilier, Dean, et al.
- 1996
Citation Context: ...(s, a, ·) : a ∈ A}. Typical classical planning problems can be viewed as MDPs in which actions are deterministic and there are no competing objectives, only a single goal (e.g., the reward is 0-1) [6]. A plan or policy is a mapping π: S → A, where π(s) denotes the action an agent will perform whenever it is in state s. Given an MDP, an agent ought to adopt an optimal policy that maximizes the expe...

81 | Sequential equilibria
- Kreps, Wilson
- 1982
Citation Context: ...e this is often the most natural way to specify a problem. We do not address this issue here and assume the joint effects are given (but see Section 5). [Footnote 7: Policies correspond to behavioral strategies [30] that are restricted to be stationary. Footnote 8: Not all randomized policies for the joint MDP can be modeled as a collection of randomized individual policies, since correlated action choice between two agen...]

79 | An adaptive communication protocol for cooperating mobile robots
- Yanco, Stein
- 1993
Citation Context: ...dopt just this approach. For instance, Mataric [34] describes experiments with mobile robots in which Q-learning is applied to a cooperative task with good results. In a similar vein, Yanco and Stein [58] also reported experimental results with hierarchical Q-learning in order to learn cooperative communication between a leader and follower agent (see also [16]). In both works, convergence is achieved...

78 | Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued state-spaces
- Moore
- 1991

69 | Using abstractions for decision-theoretic planning with time constraints
- Boutilier, Dearden
- 1994

61 | The convergence of TD(λ) for general λ
- Dayan
- 1992
Citation Context: ..., if an eligibility trace (essentially, an encoding of history) is recorded. This is the basis of TD(λ), where a parameter λ captures the degree to which past states are influenced by the current sample [15]. In addition, there are variants in which truncated eligibility traces are used. Q-learning [56] is a straightforward and elegant method for combining value function learning (as in TD-methods) with ...
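The eligibility-trace mechanism of TD(λ) described here can be sketched for evaluating a fixed policy on a toy chain. The three-state chain, step size, discount, and λ below are assumptions for illustration:

```python
# TD(lambda) policy evaluation with accumulating eligibility traces, on an
# invented 3-state chain 0 -> 1 -> 2; entering terminal state 2 pays reward 1.
ALPHA, GAMMA, LAM = 0.1, 0.9, 0.8  # assumed step size, discount, and lambda

V = [0.0, 0.0, 0.0]
for _ in range(500):
    e = [0.0, 0.0, 0.0]  # eligibility trace: decaying record of visited states
    s = 0
    while s < 2:
        s2 = s + 1
        r = 1.0 if s2 == 2 else 0.0
        delta = r + GAMMA * V[s2] - V[s]  # one-step TD error
        e[s] += 1.0                       # accumulate trace for the current state
        for i in range(3):
            V[i] += ALPHA * delta * e[i]  # credit all recently visited states
            e[i] *= GAMMA * LAM           # decay every trace
        s = s2
```

Each TD error is broadcast to every state in proportion to its decayed trace, so earlier states receive credit for later rewards; with λ = 0 this reduces to plain one-step TD, and λ = 1 approaches a Monte Carlo update.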

54 | Divide and conquer in multi-agent planning, in
- Ephrati, Rosenschein
- 1994
Citation Context: ...roblems is that of multiagent planning (or multiagent sequential decision making), that is, the problem of devising effective action policies or strategies for a set of n agents who share common ends [23]. The key aspect of this problem is coordinating the actions of the individual agents so that the shared goals are achieved efficiently. Of course, the problem of multiagent planning falls squarely wi...

53 | Emergent conventions in multi-agent systems: initial experimental results and observations (preliminary report)
- Shoham, Tennenholtz
- 1992
Citation Context: ...tions (or social laws) might be imposed by the system designer so that optimal joint action is assured [31, 48]; or a coordinated policy (or conventions) might be learned through repeated interaction [47, 46, 32]. We focus here primarily on imposed conventions and learned coordination of behavior, especially in sequential decision processes. Ultimately, we are interested in the extent to which models, represe...

52 | Modified Policy Iteration Algorithms for Discounted Markov Decision Problems
- Puterman, Shin
- 1978
Citation Context: ...the MDP is the value function for any optimal policy. Techniques for constructing optimal policies for discounted problems have been well-studied. While algorithms such as modified policy iteration [43] are often used in practice, an especially simple algorithm is value iteration, based on Bellman's [4] "principle of optimality." We discuss value iteration because of its simplicity, as well as the ...

50 | Learning in Embedded Systems
- Kaelbling
- 1993
Citation Context: ...ve been adapted to AI planning tasks. Furthermore, MDPs form the foundation of most work in reinforcement learning (RL), in which agents learn optimal policies through experience with the environment [27, 28]. The extension of MDPs to the cooperative multiagent case is straightforward. Indeed, treating the collection of agents as a single agent with joint actions at its disposal allows one to compute (or ...

49 | Steady state learning and Nash equilibrium
- Fudenberg, Levine
- 1993
Citation Context: ...ponse to the population at large, as well as limited memory models. Such models have been studied quite extensively in game theory as well, including experimental work and formal analysis of convergence [2, 35, 59, 24]. The notion of fictitious play [41] offers a very simple learning model in which agents keep track of the frequency with which opponents use particular strategies, and at any point in time adopt a be...

36 | Approximating value trees in structured dynamic programming
- Boutilier, Dearden
- 1996
Citation Context: ...advantage of decomposing the coordination problem into state games. These representations can also be exploited in value function approximation and the construction of approximately optimal policies [8]. Generally speaking, value trees such as these can be pruned during computation to keep down computational costs, while sacrificing optimality. Error bounds on the resulting policies can be given as ...

36 | Control strategies for a stochastic planner
- Tash, Russell
- 1994

32 | Multiagent coordination with learning classifier systems
- Sen, Sekaran
- 1995
Citation Context: ...cooperative communication between a leader and follower agent (see also [16]). In both works, convergence is achieved. A slightly different "semi-cooperative" application of Q-learning is reported in [45], where a set of agents used straightforward Q-learning in the presence of other agents with orthogonal and slightly interacting interests. In these experiments, Q-learning also converges, but not to a ...

27 | Integrating planning and execution in stochastic domains
- Dearden, Boutilier
- 1994
Citation Context: ...dimensionality. Much emphasis in DTP research has been placed on the issue of speeding up computation, and several solutions proposed, including restricting search to local regions of the state space [17, 21, 3, 55] or reducing the state space via abstraction or clustering of states [7]. Both approaches reduce the state space in a way that allows MDP solution techniques to be used, and generate approximately opt...

26 | Adaptive Aggregation for Infinite Horizon Dynamic Programming
- Bertsekas, Castañon
- 1989

22 | Exploiting structure in policy construction
- Boutilier, Dearden, Goldszmidt
- 1995