## Approximate Policy Iteration with a Policy Language Bias (2003)

Venue: Journal of Artificial Intelligence Research

Citations: 112 (12 self)

### BibTeX

@ARTICLE{Fern03approximatepolicy,
  author  = {Alan Fern and Sungwook Yoon and Robert Givan},
  title   = {Approximate Policy Iteration with a Policy Language Bias},
  journal = {Journal of Artificial Intelligence Research},
  year    = {2003}
}

### Abstract

We explore approximate policy iteration (API), replacing the usual cost-function learning step with a learning step in policy space. We give policy-language biases that enable solution of very large relational Markov decision processes (MDPs) that no previous technique can solve.

### Citations

2613 | Applied Dynamic Programming
- Bellman, Dreyfus
- 1962
Citation Context: ...inistic and stochastic variants) by solving such domains as extremely large MDPs. 1 Introduction Dynamic-programming approaches to finding optimal control policies in Markov decision processes (MDPs) [4, 14] using explicit (flat) state space representations break down when the state space becomes extremely large. More recent work extends these algorithms to use propositional [6, 11, 7, 12] as well as rel...

595 | The FF Planning System: Fast Plan Generation through Heuristic Search
- Hoffmann, Nebel
Citation Context: ...ve policies for entire classical planning domains. 2) Each learned policy is a domain-specific planner that is fast and empirically compares well to the state-of-the-art domain-independent planner FF [13]. 3) API can improve on previously published control knowledge and on that learned by previous systems. Domains. We consider two deterministic domains with standard definitions and three stochastic do...

368 | Practical issues in temporal difference learning
- Tesauro
- 1992
Citation Context: ... of the training data to the entire state and action spaces. Due to API’s inductive nature, there are typically no guarantees for policy improvement—nevertheless, API often “converges” usefully, e.g. [24, 26]. We start API by providing it with an initial policy π0 and a real-valued heuristic function H, where H(s) is interpreted as an estimate of the cost of state s (presumably with respect to the optimal...
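The API scheme this context describes (start from an initial policy, evaluate it by sampling, induce an improved policy, repeat) can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: `evaluate_by_rollout` and `learn_policy` are stand-ins for the paper's simulation-based evaluation and policy-language learning steps, and the toy stubs at the bottom exist only so the loop runs.

```python
# Schematic outer loop of approximate policy iteration: each iteration
# estimates action costs of the current policy at sampled states, then
# induces an approximate next policy from those samples.
def approximate_policy_iteration(pi0, sample_states, evaluate_by_rollout,
                                 learn_policy, iterations=5):
    policy = pi0
    for _ in range(iterations):
        # Policy evaluation: sampled cost estimates per state.
        training = [(s, evaluate_by_rollout(policy, s)) for s in sample_states()]
        # Policy selection: induce an approximate improved policy.
        policy = learn_policy(training)
    return policy

# Toy stand-ins (illustrative only): 5 states, 2 actions, and an
# "evaluator" whose per-action costs prefer action s % 2 in state s.
pi0 = lambda s: 0
sample_states = lambda: range(5)
evaluate_by_rollout = lambda pol, s: {a: abs(a - s % 2) for a in (0, 1)}

def learn_policy(training):
    # A trivial "learner": memorize the cheapest sampled action per state.
    table = {s: min(costs, key=costs.get) for s, costs in training}
    return lambda s: table.get(s, 0)

pi1 = approximate_policy_iteration(pi0, sample_states,
                                   evaluate_by_rollout, learn_policy)
assert pi1(1) == 1 and pi1(2) == 0
```

In the paper the learner is far richer (a decision-list policy language over relational states), but the control flow of evaluation followed by induction is the same.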

273 | Using temporal logics to express search control knowledge for planning. Artif. Intell
- Bacchus, Kabanza
- 2000
Citation Context: ...ootstraps from the heuristic guidance. We also demonstrate that our technique is able to iteratively improve policies that correspond to previously published hand-coded control knowledge (for TL-plan [3]) and policies learned by Yoon et al. [28]. Our technique gives a new way of using heuristics in planning domains, complementing traditional heuristic search strategies. 2 Approximate Policy Iteration...

229 | Integrating planning and learning: The PRODIGY architecture
- Veloso, Carbonell, et al.
- 1995
Citation Context: ...r planning” systems [22] learn from small-problem solutions to improve the efficiency and/or quality of planning. Two primary approaches are to learn control knowledge for search-based planners, e.g. [23, 27, 10, 15, 1], and, more closely related, to learn stand-alone control policies [17, 19, 28]. The former work is severely limited by the utility problem (see [21]), i.e., being “swamped” by low utility rules. Crit...

171 | A sparse sampling algorithm for near-optimal planning in large Markov decision processes
- Kearns, Mansour, et al.
- 2002
Citation Context: ...representation, and later, in Section 3, detail a particular representation of planning domains as relational MDPs and the corresponding policy-space learning bias. Problem Setup. We follow and adapt [16] and [5]. We represent an MDP using a generative model 〈S, A, T, C, I〉, where S is a finite set of states, A is a finite set of actions, and T is a randomized “action-simulation” algorithm that, given...
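The generative model 〈S, A, T, C, I〉 quoted in this context can be made concrete with a small sketch. The class and the toy chain below are my own illustration, not the paper's interface: the key point is that T is a randomized action-simulation routine, so huge state spaces are never enumerated, only sampled.

```python
import random

# Hypothetical sketch of a generative MDP model <S, A, T, C, I>: actions A,
# sampled transitions T, immediate costs C, and a sampled initial state from I.
class GenerativeMDP:
    def __init__(self, actions, simulate, cost, draw_initial):
        self.actions = actions            # finite action set A
        self.simulate = simulate          # T: (state, action) -> sampled next state
        self.cost = cost                  # C: (state, action) -> immediate cost
        self.draw_initial = draw_initial  # I: () -> sampled initial state

# Toy instance: a 10-state chain where "right" advances with probability 0.9.
def sim(s, a):
    if a == "right" and random.random() < 0.9:
        return min(s + 1, 9)
    return s

mdp = GenerativeMDP(
    actions=["stay", "right"],
    simulate=sim,
    cost=lambda s, a: 0 if s == 9 else 1,  # unit cost until the goal state 9
    draw_initial=lambda: 0,
)

s = mdp.draw_initial()
s_next = mdp.simulate(s, "right")  # one sampled transition via T
assert s_next in (0, 1)
```

Note that nothing here ever materializes the state set S; the API algorithm only ever touches states reachable by simulation.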

152 | Explanation-based learning: A problem solving perspective
- Minton, Carbonell, et al.
- 1989
Citation Context: ...r planning” systems [22] learn from small-problem solutions to improve the efficiency and/or quality of planning. Two primary approaches are to learn control knowledge for search-based planners, e.g. [23, 27, 10, 15, 1], and, more closely related, to learn stand-alone control policies [17, 19, 28]. The former work is severely limited by the utility problem (see [21]), i.e., being “swamped” by low utility rules. Crit...

147 | Stochastic Dynamic Programming with Factored Representations
- Boutilier, Dearden, et al.
- 2000
Citation Context: ...sion processes (MDPs) [4, 14] using explicit (flat) state space representations break down when the state space becomes extremely large. More recent work extends these algorithms to use propositional [6, 11, 7, 12] as well as relational [8] state-space representations. These extensions have not yet shown the capacity to solve large classical planning problems such as the benchmark problems used in planning comp...

134 | Symbolic dynamic programming for first-order MDPs
- Boutilier, Reiter, et al.
- 2001
Citation Context: ...xplicit (flat) state space representations break down when the state space becomes extremely large. More recent work extends these algorithms to use propositional [6, 11, 7, 12] as well as relational [8] state-space representations. These extensions have not yet shown the capacity to solve large classical planning problems such as the benchmark problems used in planning competitions [2]. These method...

103 | Relational reinforcement learning
- Dzeroski, Raedt, et al.
- 1998
Citation Context: ...ing heuristic to avoid the need for access to small problems. Our learning approach is also not tied to having a base planner. The most closely related work is relational reinforcement learning (RRL) [9], a form of online API that learns relational cost-function approximations. Q-cost functions are learned in the form of relational decision trees (Q-trees) and are used to learn corresponding policies...

93 | Equivalence notions and model minimization in Markov decision processes
- Givan, Dean, et al.
Citation Context: ...sion processes (MDPs) [4, 14] using explicit (flat) state space representations break down when the state space becomes extremely large. More recent work extends these algorithms to use propositional [6, 11, 7, 12] as well as relational [8] state-space representations. These extensions have not yet shown the capacity to solve large classical planning problems such as the benchmark problems used in planning comp...

71 | Learning action strategies for planning domains
- Khardon
- 1999
Citation Context: ... is the complexity of typical cost functions for these problems, for which it is often more natural to specify a policy space. Recent work on inductive policy selection in relational planning domains [17, 19, 28] has shown that useful policies can be learned using a policy-space bias, described by a generic knowledge representation language. Here, we incorporate that work into a practical approach to API for...

67 | AIPS-00 Planning Competition
- Bacchus, Kautz, et al.
- 1995
Citation Context: ...as relational [8] state-space representations. These extensions have not yet shown the capacity to solve large classical planning problems such as the benchmark problems used in planning competitions [2]. These methods typically calculate a sequence of cost functions. For familiar STRIPS planning domains (among others), useful cost functions can be difficult or impossible to represent compactly. The ...

67 | Max-norm projections for factored MDPs
- Guestrin, Koller, et al.
- 2001
Citation Context: ...sion processes (MDPs) [4, 14] using explicit (flat) state space representations break down when the state space becomes extremely large. More recent work extends these algorithms to use propositional [6, 11, 7, 12] as well as relational [8] state-space representations. These extensions have not yet shown the capacity to solve large classical planning problems such as the benchmark problems used in planning comp...

60 | Reinforcement learning as classification: Leveraging modern classifiers
- Lagoudakis, Parr
- 2003
Citation Context: ...rk in relational reinforcement learning has been applied to STRIPS problems with much simpler goals than typical benchmark planning domains, and is discussed below in Section 5. 2 In concurrent work, [18] pursued a similar approach to API in attribute-value domains. We evaluate our API approach in several STRIPS planning domains, showing iterative policy improvement. Our technique solves entire planni...

39 | Inductive policy selection for first-order MDPs
- Yoon, Fern, et al.
- 2002
Citation Context: ... is the complexity of typical cost functions for these problems, for which it is often more natural to specify a policy space. Recent work on inductive policy selection in relational planning domains [17, 19, 28] has shown that useful policies can be learned using a policy-space bias, described by a generic knowledge representation language. Here, we incorporate that work into a practical approach to API for...

38 | Learning declarative control rules for constraint-based planning
- Huang, Selman, et al.
Citation Context: ...r planning” systems [22] learn from small-problem solutions to improve the efficiency and/or quality of planning. Two primary approaches are to learn control knowledge for search-based planners, e.g. [23, 27, 10, 15, 1], and, more closely related, to learn stand-alone control policies [17, 19, 28]. The former work is severely limited by the utility problem (see [21]), i.e., being “swamped” by low utility rules. Crit...

36 | Approximating value trees in structured dynamic programming
- Boutilier, Dearden
- 1996

32 | Multi-strategy learning of search control for partial-order planning
- Estlin, Mooney
- 1996

20 | Using genetic programming to learn and improve control knowledge
- Aler, Borrajo, et al.
- 2002

15 | Learning generalized policies in planning domains using concept languages
- Martin, Geffner
Citation Context: ... is the complexity of typical cost functions for these problems, for which it is often more natural to specify a policy space. Recent work on inductive policy selection in relational planning domains [17, 19, 28] has shown that useful policies can be learned using a policy-space bias, described by a generic knowledge representation language. Here, we incorporate that work into a practical approach to API for...

12 | Dynamic Programming and Markov Decision Processes
- Howard
- 1960
Citation Context: ...inistic and stochastic variants) by solving such domains as extremely large MDPs. 1 Introduction Dynamic-programming approaches to finding optimal control policies in Markov decision processes (MDPs) [4, 14] using explicit (flat) state space representations break down when the state space becomes extremely large. More recent work extends these algorithms to use propositional [6, 11, 7, 12] as well as rel...

2 | Quantitative results on the utility of explanation-based learning
- Minton
- 1988
Citation Context: ...dge for search-based planners, e.g. [23, 27, 10, 15, 1], and, more closely related, to learn stand-alone control policies [17, 19, 28]. The former work is severely limited by the utility problem (see [21]), i.e., being “swamped” by low utility rules. Critically, our policy-language bias confronts this issue by preferring simpler policies. Regarding the latter, our work is novel in using API to iterati...

1 | Relational reinforcement learning
- Dzeroski, De Raedt, Driessens

1 | Reinforcement learning as classification: Leveraging large margin classifiers
- Lagoudakis, Parr

1 | Taxonomic syntax for 1st-order inference
- McAllester, Givan
- 1993
Citation Context: ...19], decision lists of such rules were used as a language bias for learning policies. We use such lists, and represent the sets of objects needed using class expressions C written in taxonomic syntax [20], defined by C ::= C0 | anything | ¬C | (R C) | C ∩ C, with R ::= R0 | R−1 | R ∩ R | R∗. Here, C0 is any one-argument relation and R0 any binary relation from the predicates in P. One-argument rel...
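The class-expression grammar quoted in this context (C ::= C0 | anything | ¬C | (R C) | C ∩ C) admits a very small interpreter. The sketch below is my own illustration, not the paper's implementation: it handles only primitive relations R0, omitting the relation constructors (inverse, intersection, Kleene star) for brevity, and the tuple encoding of expressions is an arbitrary choice.

```python
# Illustrative interpreter for taxonomic class expressions over a relational
# state, given as a set of objects plus unary and binary relation extensions.
def eval_class(expr, objects, unary, binary):
    """Return the set of objects denoted by a class expression."""
    kind = expr[0]
    if kind == "prim":          # primitive one-argument relation C0
        return unary[expr[1]]
    if kind == "anything":      # the universal class
        return set(objects)
    if kind == "not":           # complement: ¬C
        return set(objects) - eval_class(expr[1], objects, unary, binary)
    if kind == "rel":           # (R C): objects R-related to some member of C
        _, rel, sub = expr
        target = eval_class(sub, objects, unary, binary)
        return {x for (x, y) in binary[rel] if y in target}
    if kind == "and":           # intersection: C ∩ C
        return (eval_class(expr[1], objects, unary, binary)
                & eval_class(expr[2], objects, unary, binary))
    raise ValueError(f"unknown construct: {kind}")

# Blocks-world flavour: "clear blocks sitting on a red block" = clear ∩ (on red)
objects = {"a", "b", "c"}
unary = {"clear": {"a", "c"}, "red": {"b"}}
binary = {"on": {("a", "b"), ("b", "c")}}
expr = ("and", ("prim", "clear"), ("rel", "on", ("prim", "red")))
print(eval_class(expr, objects, unary, binary))  # {'a'}
```

Expressions like this one give the learner a compact, relational vocabulary for the condition parts of decision-list policy rules.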

1 | Online policy improvement via Monte-Carlo search
- Tesauro, Galperin
- 1996
Citation Context: ...on mappings, and use a standard relational learner to learn these mappings. We inherit from familiar API methods a (sampled) policy-evaluation phase using simulation of the current policy, or rollout [25], and an inductive policy-selection phase inducing an approximate next policy from sampled current policy values. 1 Recent work in relational reinforcement learning has been applied to STRIPS problems...
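The rollout-based policy-evaluation phase mentioned in this context can be sketched as plain Monte-Carlo simulation: run the policy forward from a state several times and average the accumulated cost. The function and the toy chain below are illustrative assumptions (horizon, sample count, and simulator are my own), not the paper's exact procedure.

```python
import random

# Minimal Monte-Carlo rollout estimate of a policy's cost from a state:
# average accumulated cost over several simulated trajectories.
def rollout_cost(state, policy, simulate, cost, horizon=50, samples=30, rng=random):
    total = 0.0
    for _ in range(samples):
        s = state
        acc = 0.0
        for _ in range(horizon):
            a = policy(s)
            acc += cost(s, a)           # immediate cost of the chosen action
            s = simulate(s, a, rng)     # sampled next state from the simulator
        total += acc
    return total / samples

# Toy chain: "right" from state s usually reaches s + 1; the goal state 5 is free.
simulate = lambda s, a, rng: min(s + 1, 5) if rng.random() < 0.9 else s
cost = lambda s, a: 0.0 if s == 5 else 1.0
policy = lambda s: "right"

estimate = rollout_cost(0, policy, simulate, cost)
assert 0.0 <= estimate <= 50.0  # bounded by horizon * max cost
```

In the full API loop, such estimates are computed per candidate action at each sampled state, and the resulting state-to-action training data feeds the policy-language learner.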

1 | Feature-based methods for large scale DP
- Tsitsiklis, Roy
- 1996
Citation Context: ... of the training data to the entire state and action spaces. Due to API’s inductive nature, there are typically no guarantees for policy improvement—nevertheless, API often “converges” usefully, e.g. [24, 26]. We start API by providing it with an initial policy π0 and a real-valued heuristic function H, where H(s) is interpreted as an estimate of the cost of state s (presumably with respect to the optimal...