## Multiagent Planning with Factored MDPs (2001)


Venue: NIPS-14

Citations: 143 (16 self)

### BibTeX

@inproceedings{Guestrin01multiagentplanning,
  author    = {Carlos Guestrin and Daphne Koller and Ronald Parr},
  title     = {Multiagent Planning with Factored {MDPs}},
  booktitle = {Advances in Neural Information Processing Systems 14 ({NIPS})},
  year      = {2001},
  pages     = {1523--1530},
  publisher = {The MIT Press}
}

### Abstract

We present a principled and efficient planning algorithm for cooperative multiagent dynamic systems. A striking feature of our method is that the coordination and communication between the agents are not imposed, but derived directly from the system dynamics and function approximation architecture. We view the entire multiagent system as a single, large Markov decision process (MDP), which we assume can be represented in a factored way using a dynamic Bayesian network (DBN). The action space of the resulting MDP is the joint action space of the entire set of agents. Our approach is based on the use of factored linear value functions as an approximation to the joint value function. This factorization of the value function allows the agents to coordinate their actions at runtime using a natural message passing scheme. We provide a simple and efficient method for computing such an approximate value function by solving a single linear program, whose size is determined by the interaction between the value function structure and the DBN. We thereby avoid the exponential blowup in the state and action space. We show that our approach compares favorably with approaches based on reward sharing. We also show that our algorithm is an efficient alternative to more complicated algorithms even in the single agent case.
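As a rough illustration of the abstract's central representation (the function names and toy numbers below are illustrative, not the paper's code): a factored linear value function is a weighted sum of basis functions, each of which depends on only a small subset of the state variables.

```python
# A factored linear value function V(x) = sum_j w_j * h_j(x), where each basis
# function h_j looks only at a small "scope" of state-variable indices.
# All names and toy numbers here are illustrative, not from the paper.

def make_factored_value(weights, bases, scopes):
    """weights: list of floats; bases: local functions of a sub-state tuple;
    scopes: per-basis tuples of state-variable indices."""
    def V(x):
        return sum(w * h(tuple(x[i] for i in scope))
                   for w, h, scope in zip(weights, bases, scopes))
    return V

# Toy example: 3 binary state variables, 2 local basis functions.
V = make_factored_value(
    weights=[2.0, 1.5],
    bases=[lambda s: float(s[0]),            # depends on variable 0 only
           lambda s: float(s[0] and s[1])],  # depends on variables 1 and 2
    scopes=[(0,), (1, 2)],
)
print(V((1, 1, 0)))  # 2.0: only the first basis function fires
```

Because each h_j touches few variables, evaluating or backprojecting V never requires enumerating the full joint state.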

### Citations

457 citations
A model for reasoning about persistence and causation
- Dean, Kanazawa
- 1989
Citation context: ...s are trying to maximize the sum of an immediate reward and a value that they expect to receive one step in the future. We describe the dynamics of the system G using a dynamic decision network (DDN) [4]. Let X_i denote the variable X_i at the current time and X'_i the variable at the next step. The transition graph of a DDN is a two-layer directed acyclic graph G whose nodes are {A_1, ..., A_g, ...

287 citations
The complexity of decentralized control of Markov decision processes
- Bernstein, Givan, et al.

270 citations
Bucket elimination: A unifying framework for reasoning
- Dechter
- 1999
Citation context: ...ular Q_i. Fig. 1(a) shows the coordination graph for an example where Q = Q_1(a_1, a_2) + Q_2(a_2, a_4) + Q_3(a_1, a_3) + Q_4(a_3, a_4). A graph structure suggests the use of a cost network [5], which can be solved using non-serial dynamic programming [1] or a variable elimination algorithm which is virtually identical to variable elimination in a Bayesian network. The key idea is that, rat...
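The variable elimination scheme described in the quoted context can be sketched as follows. The payoff tables and the elimination order below are invented toy values, not from the paper; only the example's graph structure Q = Q1(a1,a2) + Q2(a2,a4) + Q3(a1,a3) + Q4(a3,a4) is taken from the text.

```python
from itertools import product

def eliminate(factors, var, domain=(0, 1)):
    """Replace every factor whose scope mentions `var` with one new factor
    over the remaining variables, holding the max over `var` of their sum."""
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    scope = tuple(sorted({v for s, _ in touching for v in s if v != var}))
    table = {}
    for assign in product(domain, repeat=len(scope)):
        ctx = dict(zip(scope, assign))
        table[assign] = max(
            sum(tab[tuple(a if v == var else ctx[v] for v in s)]
                for s, tab in touching)
            for a in domain)
    return rest + [(scope, table)]

# The example from the text: Q = Q1(a1,a2) + Q2(a2,a4) + Q3(a1,a3) + Q4(a3,a4).
# Payoff numbers are made up; actions are binary.
Q1 = (("a1", "a2"), {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 3})
Q2 = (("a2", "a4"), {(0, 0): 2, (0, 1): 0, (1, 0): 0, (1, 1): 2})
Q3 = (("a1", "a3"), {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0})
Q4 = (("a3", "a4"), {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 1})

factors = [Q1, Q2, Q3, Q4]
for var in ("a4", "a3", "a2", "a1"):      # one agent eliminated per step
    factors = eliminate(factors, var)
print(factors[0][1][()])                  # max over joint actions of Q
```

Each elimination step only combines the factors that mention the variable being removed, which is why the cost is governed by the coordination graph's structure rather than the joint action space.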

179 citations
Nonserial Dynamic Programming
- Bertelè, Brioschi
- 1972
Citation context: ...le where Q = Q_1(a_1, a_2) + Q_2(a_2, a_4) + Q_3(a_1, a_3) + Q_4(a_3, a_4). A graph structure suggests the use of a cost network [5], which can be solved using non-serial dynamic programming [1] or a variable elimination algorithm which is virtually identical to variable elimination in a Bayesian network. The key idea is that, rather than summing all functions and then doing the maximization...

147 citations
Decision theoretic planning: Structural assumptions and computational leverage
- Boutilier, Dean, et al.
- 1999
Citation context: ... error relative to that of the best possible approximation in the subspace, and some guidance as to selecting the α's so as to improve the quality of the approximation. 5 Factored MDPs Factored MDPs [2] allow the representation of large structured MDPs by using a dynamic Bayesian network to represent the transition model. Our representation of the one-step transition dynamics in Section 3 is precise...

140 citations
The linear programming approach to approximate dynamic programming
- de Farias, Van Roy
- 2003
Citation context: ... the LP to k but the number of constraints remains |X| × |A|. There is, in general, no guarantee as to the quality of the approximation Σ_{j=1}^k w_j h_j, but recent work of de Farias and Van Roy [3] provides some analysis of the error relative to that of the best possible approximation in the subspace, and some guidance as to selecting the α's so as to improve the quality of the approximation. ...

129 citations
Learning to cooperate via policy search
- Peshkin, Kim, et al.
- 2000
Citation context: ...this approach impractical in most cases. Alternative approaches to this problem have used local optimization for the different agents, either via reward/value sharing [11, 13] or direct policy search [10]. We present a novel approach based on approximating the joint value function as a linear combination of local value functions, each of which relates only to the parts of the system controlled by a sm...

108 citations
From influence diagrams to junction trees
- Jensen, Jensen, et al.
- 1994

97 citations
Generalized polynomial approximations in Markovian decision processes
- Schweitzer, Seidmann
- 1985
Citation context: ...on over H is a function V that can be written as V(x) = Σ_{j=1}^k w_j h_j(x) for some coefficients w = (w_1, ..., w_k)'. The LP approach can be adapted to use this value function representation [12]: Variables: w_1, ..., w_k; Minimize: Σ_j α_j w_j; Subject to: Σ_{j=1}^k w_j h_j(x_i) ≥ R(x_i, a) + γ Σ_{x'_l} P(x'_l | x_i, a) Σ_{j=1}^k w_j h_j(x'_l), ∀ x_i ∈ X, a ∈ A; where α_j = Σ_{x_i} α...
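A hedged sketch of how the LP quoted above might be assembled: the function below only builds the constraint rows A w ≥ b, one per (state, action) pair, with A_{(x,a),j} = h_j(x) - γ E[h_j(x') | x, a]; any off-the-shelf LP solver could then minimize Σ_j α_j w_j subject to them. The two-state chain MDP and basis functions are invented for illustration.

```python
def lp_rows(states, actions, R, P, bases, gamma=0.9):
    """One constraint row per (state, action):
    sum_j (h_j(x) - gamma * E[h_j(x') | x, a]) * w_j >= R(x, a)."""
    A, b = [], []
    for x in states:
        for a in actions:
            A.append([h(x) - gamma * sum(P[x][a][x2] * h(x2) for x2 in states)
                      for h in bases])
            b.append(R[x][a])
    return A, b

# Toy 2-state, 1-action chain (all numbers invented): state 1 is absorbing.
states, actions = [0, 1], [0]
R = {0: {0: 1.0}, 1: {0: 0.0}}
P = {0: {0: {0: 0.0, 1: 1.0}}, 1: {0: {0: 0.0, 1: 1.0}}}
bases = [lambda x: 1.0, lambda x: float(x == 0)]   # constant + indicator basis
A, b = lp_rows(states, actions, R, P, bases)
# Row for (x=0, a=0) works out to roughly [1 - 0.9, 1 - 0] = [0.1, 1.0], rhs 1.0.
```

The LP has only k variables (the weights), but, as the quoted context notes, without further structure it still has |X| × |A| constraints; the paper's contribution is exploiting the factored form to avoid enumerating them.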

94 citations
Computing Factored Value Functions for Policies in Structured MDPs
- Koller, Parr
- 1999
Citation context: ...(x, a) + Σ_{x'} P(x' | x, a) h_j(x')], i.e., the immediate reward plus the expected value of the next state. The expectation is a summation over an exponential number of future states. As shown in [8], this can be simplified substantially. For example, if we consider the function h_1(X'_1) in Fig. 1(b), we can see that its expected value is a function only of X_1, A_1, A_2. More generally, w...
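The collapse of that exponential expectation can be illustrated as follows. The parent set (X_1, A_1, A_2) echoes the Fig. 1(b) example from the quoted context; the CPD values are made up for illustration.

```python
from itertools import product

# Backprojection sketch: for a basis function h over the single next-step
# variable X1', the expectation g(x, a) = sum_{x'} P(x'|x,a) h(x') depends
# only on Parents(X1') in the DBN, assumed here to be (X1, A1, A2).
# The CPD numbers below are invented, not from the paper.

def backproject(h, cpd, parent_assignments):
    """cpd: parent-assignment tuple -> {next_value: probability}.
    The exponential sum over full next states collapses to X1's next-step
    domain, evaluated once per parent assignment."""
    return {pa: sum(p * h(v) for v, p in cpd[pa].items())
            for pa in parent_assignments}

parents = list(product((0, 1), repeat=3))          # (x1, a1, a2)
cpd = {(x1, a1, a2): ({1: 0.9, 0: 0.1} if x1 or (a1 and a2)
                      else {1: 0.2, 0: 0.8})
       for (x1, a1, a2) in parents}
g = backproject(lambda v: float(v), cpd, parents)  # h(X1') = X1'
# g is a table over just 8 parent assignments, not over all joint states.
```

The resulting g has the same local-scope form as the basis functions themselves, which is what lets the LP constraints in the paper stay compact.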

72 citations
Policy Iteration for Factored MDPs
- Koller, Parr
- 2000
Citation context: ...nts use our message passing coordination algorithm at each step. (Here we have only one function h per agent, but our approach extends trivially to the case of multiple h functions.) In previous work [9, 6], we presented algorithms for computing approximate value functions of this form for factored MDPs. These algorithms can circumvent the exponential blowup in the number of state variables, but explici...

67 citations
Max-norm projections for factored MDPs
- Guestrin, Koller, et al.
- 2001
Citation context: ...nts use our message passing coordination algorithm at each step. (Here we have only one function h per agent, but our approach extends trivially to the case of multiple h functions.) In previous work [9, 6], we presented algorithms for computing approximate value functions of this form for factored MDPs. These algorithms can circumvent the exponential blowup in the number of state variables, but explici...

52 citations
Distributed Value Functions
- Schneider, Wong, et al.
- 1999
Citation context: ... the number of agents, rendering this approach impractical in most cases. Alternative approaches to this problem have used local optimization for the different agents, either via reward/value sharing [11, 13] or direct policy search [10]. We present a novel approach based on approximating the joint value function as a linear combination of local value functions, each of which relates only to the parts of ...

36 citations
General principles of learning-based multi-agent systems
- Wolpert, Wheeler, et al.
- 1999
Citation context: ... the number of agents, rendering this approach impractical in most cases. Alternative approaches to this problem have used local optimization for the different agents, either via reward/value sharing [11, 13] or direct policy search [10]. We present a novel approach based on approximating the joint value function as a linear combination of local value functions, each of which relates only to the parts of ...