## Stochastic Dynamic Programming with Factored Representations (1997)

Citations: 156 (10 self)

### BibTeX

@MISC{Boutilier97stochasticdynamic,

author = {Craig Boutilier and Richard Dearden and Moisés Goldszmidt},

title = {Stochastic Dynamic Programming with Factored Representations},

year = {1997}

}

### Abstract

Markov decision processes (MDPs) have proven to be popular models for decision-theoretic planning, but standard dynamic programming algorithms for solving MDPs rely on explicit, state-based specifications and computations. To alleviate the combinatorial problems associated with such methods, we propose new representational and computational techniques for MDPs that exploit certain types of problem structure. We use dynamic Bayesian networks (with decision trees representing the local families of conditional probability distributions) to represent stochastic actions in an MDP, together with a decision-tree representation of rewards. Based on this representation, we develop versions of standard dynamic programming algorithms that directly manipulate decision-tree representations of policies and value functions. This generally obviates the need for state-by-state computation, aggregating states at the leaves of these trees and requiring computations only for each aggregate state. The key to these algorithms is a decision-theoretic generalization of classic regression analysis, in which we determine the features relevant to predicting expected value. We demonstrate the method empirically on several planning problems,
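The state aggregation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class names `Leaf` and `Node`, the helper functions, the variable `R`, and all leaf values other than the 0.9 sub-reward for HCO (which appears in a citation context on this page) are assumptions made for illustration. The sketch shows only how a decision tree over boolean state variables assigns one value to each aggregate state, so that three leaves can cover eight explicit states.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, Union


@dataclass(frozen=True)
class Leaf:
    """An aggregate state: every state reaching this leaf shares one value."""
    value: float


@dataclass(frozen=True)
class Node:
    """An internal test on a single boolean state variable."""
    var: str
    if_true: "Tree"
    if_false: "Tree"


Tree = Union[Leaf, Node]


def evaluate(tree: Tree, state: Dict[str, bool]) -> float:
    """Follow the branch consistent with `state` down to a leaf."""
    while isinstance(tree, Node):
        tree = tree.if_true if state[tree.var] else tree.if_false
    return tree.value


def count_leaves(tree: Tree) -> int:
    """Number of aggregate states (leaves) the tree distinguishes."""
    if isinstance(tree, Leaf):
        return 1
    return count_leaves(tree.if_true) + count_leaves(tree.if_false)


# Hypothetical reward tree over three boolean variables. Only HCO and W are
# tested, so the third variable (R) is irrelevant to reward. The leaf values
# 0.0 and 0.1 are invented for this example.
VARIABLES = ["HCO", "W", "R"]
reward_tree = Node("HCO", Leaf(0.9), Node("W", Leaf(0.0), Leaf(0.1)))

# Explicit enumeration of the flat state space, for comparison only.
all_states = [
    dict(zip(VARIABLES, bits))
    for bits in product([True, False], repeat=len(VARIABLES))
]

if __name__ == "__main__":
    print(f"{len(all_states)} explicit states, {count_leaves(reward_tree)} leaves")
    for state in all_states:
        print(state, evaluate(reward_tree, state))
```

A structured dynamic programming algorithm of the kind the abstract describes would compute one backup per leaf rather than one per entry of `all_states`; the sketch stops short of that step and covers only the representation.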

### Citations

7319 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988
Citation Context ... as several classical algorithms for solving MDPs. In Section 3, we define a particular compact representation of an MDP, using dynamic Bayesian networks [25, 29] (a special form of Bayesian network [57]) to represent the dependence between variables before (footnote 3: More accurately, they produce solutions that are identical to their standard state-based counterparts, which may be ε-optimal.) and after t...

3987 | Reinforcement Learning: An Introduction
- Sutton, Barto
- 1999
Citation Context ...MDPs) have been adopted as the model of choice for DTP problems in much recent work [12, 26, 28, 30, 61, 78], and have also provided the underlying foundations for most work in reinforcement learning [48, 76, 77, 84]. MDPs allow the introduction of uncertainty into the effects of actions, the modeling of uncertain exogenous events, the presence of multiple, prioritized objectives, and the solution of nonterminati...

3030 | Graph-based algorithms for boolean function manipulation
- Bryant
- 1986
Citation Context ... representations may be suitable, and more compact, in certain circumstances. CPTs could sometimes be more compactly represented using rules [60, 64], decision lists [65] or boolean decision diagrams [19]. The algorithms we provide in the next section are designed to exploit the decision-tree representation, but we see no fundamental difficulties in developing similar algorithms to exploit these other...

2756 | Dynamic programming
- Bellman
- 1957
Citation Context ...ain exogenous events, the presence of multiple, prioritized objectives, and the solution of nonterminating process-oriented problems. The foundations and the basic computational techniques for MDPs [3, 5, 44, 62] are well-understood and in certain cases can be used directly in DTP. These methods exploit the dynamic programming principle and allow MDPs to be solved in time polynomial in the size of the state a...

1809 | STRIPS: A new approach to the application of theorem proving to problem solving
- Fikes, Nilsson
- 1971
Citation Context ...ctions in terms of state transitions is problematic. The intuition underlying the earliest representational mechanisms for reasoning about action and planning (the situation calculus [55] and STRIPS [36] being two important examples) is that actions can often be more compactly and more naturally specified by describing their effects on state variables. For example, in the STRIPS action representatio...

1527 | Some philosophical problems from the standpoint of artificial intelligence
- McCarthy, Hayes
- 1969
Citation Context ...the effects of actions in terms of state transitions is problematic. The intuition underlying the earliest representational mechanisms for reasoning about action and planning (the situation calculus [55] and STRIPS [36] being two important examples) is that actions can often be more compactly and more naturally specified by describing their effects on state variables. For example, in the STRIPS acti...

1361 | Decisions with Multiple Objectives: Preferences and Value Tradeoffs
- Keeney, Raiffa
- 1976
Citation Context ... comprised of a number of independent components whose values are combined with some simple function to determine overall reward. These ideas are common in the study of multi-attribute utility theory [49]. In our example, the reward function can be broken into two additive, independent components: one component determines the "sub-reward" determined by HCO (0.9 if HCO, 0 if not HCO); and the other determi...

1349 | Reinforcement learning: A survey
- Kaelbling, Littman, et al.
- 1996
Citation Context ...MDPs) have been adopted as the model of choice for DTP problems in much recent work [12, 26, 28, 30, 61, 78], and have also provided the underlying foundations for most work in reinforcement learning [48, 76, 77, 84]. MDPs allow the introduction of uncertainty into the effects of actions, the modeling of uncertain exogenous events, the presence of multiple, prioritized objectives, and the solution of nonterminati...

1260 | Markov Decision Processes: Discrete Stochastic Dynamic Programming
- Puterman
- 1994
Citation Context ...ain exogenous events, the presence of multiple, prioritized objectives, and the solution of nonterminating process-oriented problems. The foundations and the basic computational techniques for MDPs [3, 5, 44, 62] are well-understood and in certain cases can be used directly in DTP. These methods exploit the dynamic programming principle and allow MDPs to be solved in time polynomial in the size of the state a...

645 | Planning for Conjunctive Goals
- Chapman
Citation Context ...n action describes very concisely the transitions induced by that action over a large number of states. Similarly, classical planning techniques such as regression planning [83] or nonlinear planning [22, 54, 58, 66] exploit these representations to great effect, never requiring that one search (or implement "shortest-path" dynamic programming techniques) explicitly through state space. Intuitively, such methods ...

601 | Symbolic model checking: 10^20 states and beyond
- Burch, Clarke, et al.
- 1990

542 | Dynamic Programming and Markov Processes
- Howard
- 1960
Citation Context ...ain exogenous events, the presence of multiple, prioritized objectives, and the solution of nonterminating process-oriented problems. The foundations and the basic computational techniques for MDPs [3, 5, 44, 62] are well-understood and in certain cases can be used directly in DTP. These methods exploit the dynamic programming principle and allow MDPs to be solved in time polynomial in the size of the state a...

489 | Integrated architectures for learning, planning, and reacting based on approximating dynamic programming
- Sutton
- 1990
Citation Context ...MDPs) have been adopted as the model of choice for DTP problems in much recent work [12, 26, 28, 30, 61, 78], and have also provided the underlying foundations for most work in reinforcement learning [48, 76, 77, 84]. MDPs allow the introduction of uncertainty into the effects of actions, the modeling of uncertain exogenous events, the presence of multiple, prioritized objectives, and the solution of nonterminati...

470 | A model for reasoning about persistence and causation
- Dean, Kanazawa
- 1989
Citation Context ...that are used in the solution of MDPs, as well as several classical algorithms for solving MDPs. In Section 3, we define a particular compact representation of an MDP, using dynamic Bayesian networks [25, 29] (a special form of Bayesian network [57]) to represent the dependence between variables before (footnote 3: More accurately, they produce solutions that are identical to their standard state-based counterpart...)

429 | Guarded commands, nondeterminacy and formal derivation of programs
- Dijkstra
- 1975
Citation Context ...cision trees that test the values of specific variables. The computational advantage provided by such an approach is that value need only be computed once for each region instead of once per state. (Footnote 2: Regression is also a concept of fundamental importance in program synthesis and verification [24, 34].) 1.2 State Aggregation and Function Approximation. The approach we take to solving large MDPs is a specif...

428 | UCPOP: A Sound, Complete, Partial Order Planner for ADL
- Penberthy, Weld
- 1992
Citation Context ...n action describes very concisely the transitions induced by that action over a large number of states. Similarly, classical planning techniques such as regression planning [83] or nonlinear planning [22, 54, 58, 66] exploit these representations to great effect, never requiring that one search (or implement "shortest-path" dynamic programming techniques) explicitly through state space. Intuitively, such methods ...

397 | Systematic nonlinear planning
- McAllester, Rosenblitt
- 1991
Citation Context ...n action describes very concisely the transitions induced by that action over a large number of states. Similarly, classical planning techniques such as regression planning [83] or nonlinear planning [22, 54, 58, 66] exploit these representations to great effect, never requiring that one search (or implement "shortest-path" dynamic programming techniques) explicitly through state space. Intuitively, such methods ...

388 | Dynamic Programming: Deterministic and Stochastic Models
- Bertsekas
- 1987

388 | Evaluating influence diagrams
- Shachter
- 1986
Citation Context ...network with the choice of action represented as a variable, and the distributions over postaction variables conditioned on this action node. This type of representation, common in influence diagrams [69], can sometimes be more compact than a set of individual networks for each action (for example, when a variable's value persists for most or all actions); see [15] for a discussion of the relative adv...

377 | Learning Decision Lists
- Rivest
- 1987
Citation Context ...o describe actions. However, other representations may be suitable, and more compact, in certain circumstances. CPTs could sometimes be more compactly represented using rules [60, 64], decision lists [65] or boolean decision diagrams [19]. The algorithms we provide in the next section are designed to exploit the decision-tree representation, but we see no fundamental difficulties in developing similar...

344 | The optimal control of partially observable Markov processes over the infinite horizon: discounted costs
- Sondik
- 1978
Citation Context ...ertainty cannot be handled in the framework we adopt, specifically, partial observability, or uncertain knowledge about the state of the system being controlled. Partially observable MDPs (or POMDPs) [52, 53, 75, 73] can be used in such cases. We will make further remarks on POMDPs at the end of this article. In this paper, we develop similar techniques for solving certain classes of large MDPs. We first descri...

336 | Universal plans for reactive robots in unpredictable environments
- Schoppers
- 1987
Citation Context ... action choice does not depend on the stage of the decision problem. For the problems we consider, optimal stationary, Markovian policies always exist. In a sense, π is a conditional and universal plan [67], specifying an action (footnote 5: We could model the applicability conditions for actions using preconditions in a way that fits within our framework below. However, we prefer to think of actions as action att...)

326 | Symbolic Model Checking: 10^20 States and Beyond
- Burch, Clarke, et al.
- 1992
Citation Context ...h nodes will be understood to refer to variables at time t, not t + 1. (Footnote 15: Deterministic, goal-based regression algorithms have been developed for such representations in many circumstances; e.g., see [20] for a discussion of regression using boolean decision diagrams. Decision-theoretic generalizations of these techniques, using ideas developed in the following section, should prove useful.) ...

305 | The optimal control of partially observable Markov decision processes over a finite horizon
- Smallwood, Sondik
- 1973
Citation Context ...ertainty cannot be handled in the framework we adopt, specifically, partial observability, or uncertain knowledge about the state of the system being controlled. Partially observable MDPs (or POMDPs) [52, 53, 75, 73] can be used in such cases. We will make further remarks on POMDPs at the end of this article. In this paper, we develop similar techniques for solving certain classes of large MDPs. We first descri...

299 | Context-specific independence in Bayesian networks
- Boutilier, Friedman, et al.
- 1996
Citation Context ... and after the occurrence of actions. In addition, we use decision trees to represent the conditional probability matrices quantifying the network to exploit context-specific independence [14], that is, independence given a particular variable assignment. We note that this representation is somewhat related to the probabilistic variants of STRIPS operators introduced in [40] and augmented ...

299 | Probabilistic Horn abduction and Bayesian networks
- Poole
- 1993
Citation Context ...ic context). Algorithms for detecting these context-specific independencies using CPT representations such as decision trees and decision graphs are described in [14]. Related notions can be found in [38, 59, 70]. We note that asymmetric representations of conditional distributions in influence diagrams have also been proposed and investigated in [74]. (Footnote 13: We adopt the convention that, for boolean variables, l...)

283 | Acting optimally in partially observable stochastic domains
- Cassandra, Kaelbling, et al.
- 1994
Citation Context ...tion effects, the planning (or plan-executing) agent can observe the exact outcome of any action it has taken and knows the precise state of the system at any time. Partially observable MDPs (POMDPs) [21, 53, 73] are much more computationally demanding than fully observable MDPs. However, we will make a few remarks on the application of our techniques to POMDPs at the conclusion of this article. (Footnote 4: See [16] ...)

281 | Automatic Verification of Finite-state Concurrent Systems using Temporal Logic Specifications
- Clarke, Emerson, et al.
- 1986
Citation Context ...cision trees that test the values of specific variables. The computational advantage provided by such an approach is that value need only be computed once for each region instead of once per state. (Footnote 2: Regression is also a concept of fundamental importance in program synthesis and verification [24, 34].) 1.2 State Aggregation and Function Approximation. The approach we take to solving large MDPs is a specif...

271 | Algebraic decision diagrams and their applications
- Bahar, Frohm, et al.
- 1992
Citation Context ... value functions and policies could greatly improve the applicability of SPI. Both of these facts have been confirmed in subsequent work [43] that extends SPI using algebraic decision diagrams (ADDs) [2]. This improved structured representation and implementation (SPUDD) has been tested on the problems described above and has proven the benefit of decision-theoretic regression to be more substantial ...

263 | Generalization in reinforcement learning: Safely approximating the value function
- Boyan, Moore
- 1995
Citation Context ...ee with little loss in accuracy, in contrast to pruning for the purpose of preventing overfitting [64]. (Footnote 37: The approximation is thus careful enough to avoid the problems of approximation described in [18].) ... abstraction, decision-theoretic regression groups together states that have identical value or policy choice at various points in the dynamic programming computations required to solve an MDP. We...

262 | An algorithm for probabilistic planning
- Kushmerick, Hanks, et al.
- 1995
Citation Context ... performed in different states, we will adopt dynamic Bayesian networks as our representation scheme. We note that other representations are possible, such as the stochastic STRIPS rules described in [40, 41, 50]. However, we will see below that the Bayesian network methodology offers certain advantages. 3.1.1 The Basic Graphical Model. Formally, we assume that the system state can be characterized by a finite...

239 | TD-Gammon, a self-teaching backgammon program, achieves master-level play
- Tesauro
- 1994
Citation Context ...ings of states stay fixed or can they change during computation). Other compact representations of value functions have also been proposed, such as linear function representations or neural networks [1, 6, 80, 81]. These techniques do not seek to exploit regions of uniformity in value functions, but rather compact functions of state features that reflect value. As such they are distinguished from strict aggreg...

221 | The Nonlinear Nature of Plans
- Sacerdoti
- 1990

188 | Constructing optimal binary decision trees is NP-complete
- Hyafil, Rivest
- 1976
Citation Context ...s strongly influenced by the variable ordering in the tree. Again this issue arises in research on classification [64, 82]. Finding the smallest decision tree representing a given function is NP-hard [46], but in [13] we discuss certain feasible heuristics suitable for reordering an R-tree to make it smaller and/or more amenable to pruning. 3. Termination of SVI requires care when approximations are i...

184 | A survey of algorithmic methods for partially observable Markov decision processes
- Lovejoy
- 1991
Citation Context ...ertainty cannot be handled in the framework we adopt, specifically, partial observability, or uncertain knowledge about the state of the system being controlled. Partially observable MDPs (or POMDPs) [52, 53, 75, 73] can be used in such cases. We will make further remarks on POMDPs at the end of this article. In this paper, we develop similar techniques for solving certain classes of large MDPs. We first descri...

180 | Algorithms for sequential decision making
- Littman
- 1996

167 | Locally weighted learning for control
- Atkeson, Moore, et al.
- 1997
Citation Context ...ings of states stay fixed or can they change during computation). Other compact representations of value functions have also been proposed, such as linear function representations or neural networks [1, 6, 80, 81]. These techniques do not seek to exploit regions of uniformity in value functions, but rather compact functions of state features that reflect value. As such they are distinguished from strict aggreg...

165 | Transfer of learning by composing solutions for elemental sequential tasks
- Singh
- 1992
Citation Context ...ethods, and other structured value function representations (e.g., those that support some type of functional decomposition of the value function such as neural networks [6, 80] or additive structure [9, 31, 37, 47, 56, 72, 71]). This should prove possible because the structure assumed by SPI can be exploited in a way that is orthogonal to the types of structure assumed by many other solution methods. One example of this is...

151 | Decision-theoretic planning: Structural assumptions and computational leverage
- Boutilier, Dean, et al.
Citation Context ...going processes. These make MDPs ideal models for many decision-theoretic planning problems (for further discussion of the desirable features of MDPs from the perspective of modeling DTP problems, see [11, 17, 28, 35]). In this section, we describe the basic MDP model and consider several classical solution procedures. Primarily for reasons of presentation, we do not consider action costs in our formulation of MDP...

140 | Planning with deadlines in stochastic domains
- Dean, Kaelbling, et al.
- 1993
Citation Context ...here exist multiple, often conflicting, objectives whose desirability can be quantified. Markov decision processes (MDPs) have been adopted as the model of choice for DTP problems in much recent work [12, 26, 28, 30, 61, 78], and have also provided the underlying foundations for most work in reinforcement learning [48, 76, 77, 84]. MDPs allow the introduction of uncertainty into the effects of actions, the modeling of un...

139 | Feature-based methods for large scale dynamic programming
- Tsitsiklis, Van Roy
- 1996
Citation Context ...tate aggregation method. Other types of state aggregation techniques have been proposed, in which states with similar characteristics are grouped together. Such methods are reported in, for instance, [4, 68, 81], and can vary as to whether states are statically or dynamically aggregated (that is, do the groupings of states stay fixed or can they change during computation). Other compact representations of v...

138 | Input generalization in delayed reinforcement learning: An algorithm and performance comparisons
- Chapman, Kaelbling
- 1991
Citation Context ...tions, denoting the action to be performed at any state consistent with the labeling of the corresponding branch. Tree representations of policies are sometimes used in reinforcement learning as well [23], though in a somewhat different fashion. Examples of a policy and value tree are given in Figure 4. In our implementation of decision-theoretic regression and structured dynamic programming algorithm...

124 | The MAXQ method for hierarchical reinforcement learning
- Dietterich
- 1998
Citation Context ...ethods, and other structured value function representations (e.g., those that support some type of functional decomposition of the value function such as neural networks [6, 80] or additive structure [9, 31, 37, 47, 56, 72, 71]). This should prove possible because the structure assumed by SPI can be exploited in a way that is orthogonal to the types of structure assumed by many other solution methods. One example of this is...

122 | Decision tree induction based on efficient tree restructuring
- Utgoff, Berkman, et al.
- 1997
Citation Context ...me tradeoffs between tree size and solution quality [13]. 2. The ability to prune is strongly influenced by the variable ordering in the tree. Again this issue arises in research on classification [64, 82]. Finding the smallest decision tree representing a given function is NP-hard [46], but in [13] we discuss certain feasible heuristics suitable for reordering an R-tree to make it smaller and/or more ...

115 | Computing optimal policies for partially observable decision processes using compact representations
- Boutilier, Poole
- 1996
Citation Context ..., 73] are much more computationally demanding than fully observable MDPs. However, we will make a few remarks on the application of our techniques to POMDPs at the conclusion of this article. (Footnote 4: See [16] for more detailed investigations of this type.) We refer the reader to [5, 11, 62] for further material on MDPs. 2.1 The Basic Model. A Markov decision process can be defined as a tuple ⟨S, A, T, R⟩,...

112 | Dynamic programming and influence diagrams
- Tatman, Shachter
- 1990
Citation Context ...les is exploited. Some work on influence diagrams has considered the use of reward nodes such as these, which are combined using some function (e.g., summation) to determine overall value (see, e.g., [79]). If action costs need to be modeled (i.e., reward has the form R(s, a)), a node representing the chosen action can be included, as they are in influence diagrams, or a separate reward function can b...

112 | Achieving Several Goals Simultaneously
- Waldinger
- 1977
Citation Context ... STRIPS representation of an action describes very concisely the transitions induced by that action over a large number of states. Similarly, classical planning techniques such as regression planning [83] or nonlinear planning [22, 54, 58, 66] exploit these representations to great effect, never requiring that one search (or implement "shortest-path" dynamic programming techniques) explicitly through ...

107 | Model minimization in Markov decision processes
- Dean, Givan
- 1997
Citation Context ...here exist multiple, often conflicting, objectives whose desirability can be quantified. Markov decision processes (MDPs) have been adopted as the model of choice for DTP problems in much recent work [12, 26, 28, 30, 61, 78], and have also provided the underlying foundations for most work in reinforcement learning [48, 76, 77, 84]. MDPs allow the introduction of uncertainty into the effects of actions, the modeling of un...

103 | The algebraic structure theory of sequential machines
- Hartmanis, Stearns
- 1966
Citation Context ...two approaches to state aggregation that bear similarity to our method. The first is the model minimization approach of Givan and Dean [26, 27, 39]. In this work, the notion of automaton minimization [42, 51] is extended to MDPs and is used to analyze abstraction techniques such as those presented in [30]. More closely related to the specific model we propose in the current paper is that of Dietterich and...

81 | Solving very large weakly coupled Markov decision processes
- Meuleau, Hauskrecht, et al.
- 1998
Citation Context ...e true (global) reward function. How best to exploit such utility independence in MDPs in general is still an open question, though it has received some attention. For discussion of these issues, see [9, 37, 56, 71]. 3.3 Value Function and Policy Representation. It is clear that value functions and policies can also be represented using decision trees (or other compact function representations). Again, these expl...