## Approximating value trees in structured dynamic programming (1996)

Citations: 38 (13 self)

### BibTeX

@MISC{Boutilier96approximatingvalue,
  author = {Craig Boutilier and Richard Dearden},
  title = {Approximating value trees in structured dynamic programming},
  year = {1996}
}

### Abstract

We propose and examine a method of approximate dynamic programming for Markov decision processes based on structured problem representations. We assume an MDP is represented using a dynamic Bayesian network, and construct value functions using decision trees as our function representation. The size of the representation is kept within acceptable limits by pruning these value trees so that leaves represent possible ranges of values, thus approximating the value functions produced during optimization. We propose a method for detecting convergence, prove error bounds on the resulting approximately optimal value functions and policies, and describe some preliminary experimental results.
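The central idea in the abstract, collapsing decision-tree leaves into range leaves when their values are close enough, can be sketched as follows. This is a minimal illustration with hypothetical names (`Node`, `prune`), not the authors' implementation:

```python
# Sketch of value-tree pruning: replace a subtree whose leaf values
# span less than a tolerance `delta` with a single range leaf.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    var: Optional[str] = None      # internal node: variable tested (None for a leaf)
    low: Optional["Node"] = None   # subtree for var = false
    high: Optional["Node"] = None  # subtree for var = true
    lo: float = 0.0                # leaf: lower bound of the value range
    hi: float = 0.0                # leaf: upper bound of the value range

    @property
    def is_leaf(self) -> bool:
        return self.var is None

def prune(node: Node, delta: float) -> Node:
    """Bottom-up pruning: merge two leaf children into one range leaf
    when the combined spread of their value ranges is within delta."""
    if node.is_leaf:
        return node
    low, high = prune(node.low, delta), prune(node.high, delta)
    if low.is_leaf and high.is_leaf:
        lo, hi = min(low.lo, high.lo), max(low.hi, high.hi)
        if hi - lo <= delta:
            return Node(lo=lo, hi=hi)  # merged range leaf
    return Node(var=node.var, low=low, high=high)

# Two leaves within 0.1 of each other collapse into one range leaf:
tree = Node(var="HC", low=Node(lo=1.0, hi=1.0), high=Node(lo=1.05, hi=1.05))
pruned = prune(tree, 0.1)
print(pruned.is_leaf, pruned.lo, pruned.hi)  # → True 1.0 1.05
```

The merged leaf keeps the whole range [lo, hi] rather than a point estimate, which is what lets the paper's method track approximation error and derive the bounds mentioned in the abstract.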

### Citations

7493 |
Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988
Citation Context ...y when the action-model and reward function are known. In addition, we assume that the action model is specified using a compact and natural specification language, namely dynamic Bayesian networks [18, 10]. In previous work, we described a method for optimal policy construction that exploited the problem structure laid bare by the Bayes net representation [5]. Our algorithm built aggregations in a nonu... |

2901 |
Dynamic programming
- Bellman
- 1957
Citation Context ... to be unknown quantities that must be learned (possibly implicitly). With a known action model and rewards, optimization methods based on dynamic programming can be used to produce an optimal policy [1, 13, 20]. But a serious problem for dynamic programming is the curse of dimensionality: the time (and space) required grows polynomially with the size of the state space, which itself grows exponentially with... |

1329 |
Markov Decision Processes: Discrete Stochastic Dynamic Programming
- Puterman
- 1994
Citation Context ... these ideas to reinforcement learning. 2 MDPs and Structured Representations We assume that the system to be controlled can be described as a fully-observable, discrete-state Markov decision process [1, 13, 19], with a finite set of system states S. The controlling agent has available a finite set of actions A which cause stochastic state transitions: we write Pr(s, a, t) to denote the probability action a ... |

563 |
Dynamic Programming and Markov Processes
- Howard
- 1960
Citation Context ... to be unknown quantities that must be learned (possibly implicitly). With a known action model and rewards, optimization methods based on dynamic programming can be used to produce an optimal policy [1, 13, 20]. But a serious problem for dynamic programming is the curse of dimensionality: the time (and space) required grows polynomially with the size of the state space, which itself grows exponentially with... |

485 |
A model for reasoning about persistence and causation
- Dean, Kanazawa
- 1989
Citation Context ...y when the action-model and reward function are known. In addition, we assume that the action model is specified using a compact and natural specification language, namely dynamic Bayesian networks [18, 10]. In previous work, we described a method for optimal policy construction that exploited the problem structure laid bare by the Bayes net representation [5]. Our algorithm built aggregations in a nonu... |

269 | Generalization in reinforcement learning: Safely approximating the value function
- Boyan, Moore
- 1995
Citation Context ...uring computation with minimal effort. These will typically be much tighter than possible global bounds. Moreover, while approximation of value functions can sometimes lead to arbitrarily bad results [8], maintaining accurate value ranges allows us to circumvent convergence problems. We show convergence, describe error bounds, and report on some preliminary experimental results. We conclude with a di... |

238 | The parti-game algorithm for variable resolution reinforcement learning in multidimensional state spaces
- Moore, Atkeson
- 1995
Citation Context ...they have similar or identical values and/or action choice. These aggregates are treated as a single state in dynamic programming algorithms for the solution of MDPs or the related methods used in RL [22, 2, 16, 4, 5, 11, 12, 9, 17]. Such aggregations can be based on a number of different problem features, such as similarity of states according to some domain metric; but most methods generally assume that the states so grouped h... |

231 | Exploiting structure in policy construction
- Boutilier, Dearden, et al.
- 1995
Citation Context ...they have similar or identical values and/or action choice. These aggregates are treated as a single state in dynamic programming algorithms for the solution of MDPs or the related methods used in RL [22, 2, 16, 4, 5, 11, 12, 9, 17]. Such aggregations can be based on a number of different problem features, such as similarity of states according to some domain metric; but most methods generally assume that the states so grouped h... |

193 |
Constructing optimal binary decision trees is NP-complete
- Hyafil, Rivest
- 1976
Citation Context ...ly influenced by the node ordering used in the value tree. Again, this issue arises in research on classification [21, 25]. Finding the smallest decision tree representing a given function is NP-hard [14], but there are feasible heuristics one can use in our setting to reorder the tree to make it smaller and/or more amenable to pruning. Among these, one appears rather promising and is strongly related... |

139 | Input generalization in delayed reinforcement learning: An algorithm and performance comparisons
- Chapman, Kaelbling
- 1991
Citation Context ...they have similar or identical values and/or action choice. These aggregates are treated as a single state in dynamic programming algorithms for the solution of MDPs or the related methods used in RL [22, 2, 16, 4, 5, 11, 12, 9, 17]. Such aggregations can be based on a number of different problem features, such as similarity of states according to some domain metric; but most methods generally assume that the states so grouped h... |

127 | Decision tree induction based on efficient tree restructuring
- Utgoff, Berkman, et al.
- 1997
Citation Context ...these value trees. We first describe, in Section 4, an algorithm for pruning (and ordering) a single value tree, using methods adapted from those in the literature on classification by decision trees [3, 25]. In Section 5, we describe a structured version of value iteration that approximates the n-step optimal value functions it produces using the pruning method. These approximate value trees are labeled... |

117 | Computing optimal policies for partially observable decision processes using compact representations
- Boutilier, Poole
- 1996
Citation Context ...coffee shop across the street, can get wet if it is raining unless it has an umbrella, and is rewarded if it brings coffee when the user requests it, and penalized (to a lesser extent) if it gets wet [5, 7]. This network describes the action of fetching coffee. [figure residue omitted: network and tree representation of the fetch-coffee action] ... |

112 | Decomposition techniques for planning in stochastic domains
- Dean, Lin
- 1995

112 | Dynamic programming and influence diagrams
- Tatman, Shachter
- 1990
Citation Context ...atural problem structure. Most systems are characterized by a set of random variables or propositions that describe relevant features, and actions and rewards are specified in terms of these features [15, 4, 24]. In addition, since the state space grows exponentially with the number of features, explicit specification and computation over the state space can be problematic. We assume that a set of atomic pro... |

88 | An algorithm for probabilistic least-commitment planning
- Kushmerick, Hanks, et al.
- 1994
Citation Context ...atural problem structure. Most systems are characterized by a set of random variables or propositions that describe relevant features, and actions and rewards are specified in terms of these features [15, 4, 24]. In addition, since the state space grows exponentially with the number of features, explicit specification and computation over the state space can be problematic. We assume that a set of atomic pro... |

84 |
Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued state-spaces
- Moore
- 1991

70 | Using abstractions for decision-theoretic planning with time constraints
- Boutilier, Dearden
- 1994

68 | An upper bound on the loss from approximate optimal-value functions
- Singh, Yee
- 1994
Citation Context ..., all approximate values should lie within t = 0.1, or 10%, of the true value). There are two ways to implement such a tolerance: (a) a fixed tolerance set at t·(β/(1−β))·|Rmax − Rmin| (see [23] for discussion of policy error given an approximate value function), or (b) a sliding tolerance, where the tree for the n-stage-to-go function Vⁿ is pruned using a tolerance of t·Σ_{i=0..n} βⁱ·|Rmax − ... |

56 |
Modified policy iteration algorithms for discounted markov decision problems
- Puterman, Shin
- 1978
Citation Context ... to be unknown quantities that must be learned (possibly implicitly). With a known action model and rewards, optimization methods based on dynamic programming can be used to produce an optimal policy [1, 13, 20]. But a serious problem for dynamic programming is the curse of dimensionality: the time (and space) required grows polynomially with the size of the state space, which itself grows exponentially with... |

51 | Explanation-based learning and reinforcement learning: A unified view
- Dietterich, Flann
- 1995

27 |
Adaptive aggregation for infinite horizon dynamic programming
- Bertsekas, Castañon
- 1989

22 |
Exploiting structure in policy construction
- Boutilier, Dearden, Goldszmidt
- 1995

20 |
The Frame Problem and Bayesian Network Action Representation
- Boutilier, Goldszmidt
- 1996

3 |
C4.5: Programs for Machine Learning
- Quinlan
- 1993
Citation Context ...hin acceptable tolerances---indeed the size of the tree before pruning---may be strongly influenced by the node ordering used in the value tree. Again, this issue arises in research on classification [21, 25]. Finding the smallest decision tree representing a given function is NP-hard [14], but there are feasible heuristics one can use in our setting to reorder the tree to make it smaller and/or more amen... |

2 |
Iterative aggregation-disaggregation procedures for discounted semi-Markov reward processes
- Schweitzer, Puterman, et al.
- 1985

1 |
Trading accuracy for simplicity in decision trees
- Bohanec, Bratko
- 1994
Citation Context ...these value trees. We first describe, in Section 4, an algorithm for pruning (and ordering) a single value tree, using methods adapted from those in the literature on classification by decision trees [3, 25]. In Section 5, we describe a structured version of value iteration that approximates the n-step optimal value functions it produces using the pruning method. These approximate value trees are labeled... |