## Exploration and apprenticeship learning in reinforcement learning (2005)


Venue: Proceedings of the 22nd International Conference on Machine Learning (ICML), Bonn, Germany, 2005

Citations: 67 (2 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Abbeel05explorationand,
  author    = {Pieter Abbeel and Andrew Y. Ng},
  title     = {Exploration and apprenticeship learning in reinforcement learning},
  booktitle = {Proceedings of the 22nd International Conference on Machine Learning},
  year      = {2005},
  pages     = {1--8},
  publisher = {ICML}
}
```


### Abstract

We consider reinforcement learning in systems with unknown dynamics. Algorithms such as E^3 (Kearns and Singh, 2002) learn near-optimal policies by using “exploration policies” to drive the system towards poorly modeled states, so as to encourage exploration. But this makes these algorithms impractical for many systems; for example, on an autonomous helicopter, overly aggressive exploration may well result in a crash. In this paper, we consider the apprenticeship learning setting in which a teacher demonstration of the task is available. We show that, given the initial demonstration, no explicit exploration is necessary, and we can attain near-optimal performance (compared to the teacher) simply by repeatedly executing “exploitation policies” that try to maximize rewards. In finite-state MDPs, our algorithm scales polynomially in the number of states; in continuous-state linear dynamical systems, it scales polynomially in the dimension of the state. These results are proved using a martingale construction over relative losses.
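
The no-exploration loop the abstract describes (bootstrap a model from a teacher demonstration, then repeatedly fit the model and execute a greedy exploitation policy) can be sketched on a toy finite-state MDP. This is a minimal illustrative sketch, not the paper's actual algorithm or analysis; the toy MDP, its dimensions, and all helper names are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny invented finite MDP (4 states, 2 actions, horizon 10); this only
# mirrors the high-level loop from the abstract.
N_S, N_A, H = 4, 2, 10
TRUE_T = rng.dirichlet(np.ones(N_S), size=(N_S, N_A))  # TRUE_T[s, a] = P(s' | s, a)
TRUE_R = rng.uniform(size=N_S)                         # state reward, assumed known

def rollout(policy):
    """Execute a time-varying policy (H x N_S action table) in the true system."""
    s, traj = 0, []
    for t in range(H):
        a = policy[t, s]
        s2 = rng.choice(N_S, p=TRUE_T[s, a])
        traj.append((s, a, s2))
        s = s2
    return traj

def fit_model(data):
    """Maximum-likelihood transition estimate from all observed (s, a, s')
    triples, with add-one smoothing so unvisited pairs still define a
    distribution."""
    counts = np.ones((N_S, N_A, N_S))
    for traj in data:
        for s, a, s2 in traj:
            counts[s, a, s2] += 1
    return counts / counts.sum(axis=2, keepdims=True)

def exploitation_policy(T_hat):
    """Finite-horizon value iteration in the *estimated* model: a purely
    greedy policy, with no exploration bonus of any kind."""
    V = TRUE_R.copy()
    policy = np.zeros((H, N_S), dtype=int)
    for t in reversed(range(H)):
        Q = TRUE_R[:, None] + T_hat @ V  # Q[s,a] = R(s) + sum_s' T_hat[s,a,s'] V[s']
        policy[t] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

# A stand-in "teacher": here just a fixed arbitrary policy demonstrated 5 times.
teacher_policy = rng.integers(N_A, size=(H, N_S))
data = [rollout(teacher_policy) for _ in range(5)]

# Main loop: fit the model, exploit it, add the new trajectory to the data set.
for _ in range(20):
    policy = exploitation_policy(fit_model(data))
    data.append(rollout(policy))
```

The point of the paper is that, starting from the teacher's data, this exploit-only loop already suffices for near-teacher performance; the sketch omits the accuracy thresholds and termination test the formal algorithm uses.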

### Citations

3016 | Convergence of Probability Measures
- Billingsley
- 1968

Citation Context: ...e X0, X1, · · · is a martingale provided, for all i ≥ 0, we have that E[X_{i+1} | F_i] = X_i. Due to space constraints we cannot expand on these concepts here. We refer the reader to, e.g., (Durrett, 1995; Billingsley, 1995; Williams, 1991), for more details on martingales and stopping times. ... Lemma 10. Let any µ > 0, δ > 0 be given. For the algorithm descr...
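
The martingale condition quoted in this snippet, E[X_{i+1} | F_i] = X_i, can be illustrated with the textbook example of a symmetric random walk. This small empirical check is not from the paper; it only verifies the zero-mean-increment property that makes the walk a martingale:

```python
import numpy as np

rng = np.random.default_rng(0)

# A symmetric +-1 random walk X_i is the canonical martingale: its increments
# have zero mean, so E[X_{i+1} | F_i] = X_i + E[step] = X_i.
steps = rng.choice([-1, 1], size=(100_000, 20))  # 100k walks, 20 steps each
X = steps.cumsum(axis=1)                         # X[:, i] is the walk after i+1 steps
increments = np.diff(X, axis=1)                  # X_{i+1} - X_i
mean_increment = increments.mean()               # should be very close to 0
```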

848 | Introduction to Reinforcement Learning
- Sutton, Barto
- 1998
Citation Context: ...m provides a powerful set of tools for modeling and solving control problems, and many algorithms exist for finding (near) optimal solutions for a given MDP (see, e.g., Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 1998). To apply these algorithms to control problems in which the dynamics are not known in advance, the parameters of the MDP typically need to be learned from observations of the system. A key problem i...

741 | Nonlinear Programming (Athena Scientific) - Bertsekas - 1999

239 | Apprenticeship Learning via Inverse Reinforcement Learning
- Abbeel, Ng
- 2004

Citation Context: ... physical properties of the system being controlled, and thus easily specified. R (and H) is typically given by the task specification (or otherwise can be learned from a teacher demonstration, as in Abbeel & Ng, 2004). Finally, D is usually either known or can straightforwardly be estimated from data. Thus, in the sequel, we will assume that S, A, H, D and R are given, and focus exclusively on the problem of learni...

236 | R-max - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning
- Brafman, Tennenholtz
- 2003

Citation Context: ...te transition probabilities? The state-of-the-art answer to this problem is the E^3 algorithm (Kearns & Singh, 2002) (and variants/extensions: Kearns & Koller, 1999; Kakade, Kearns & Langford, 2003; Brafman & Tennenholtz, 2002). These algorithms guarantee that near-optimal perfo... [Page footer: Appearing in Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005. Copyright 2005 by the author(s)/owner(s).]

236 | Near-Optimal Reinforcement Learning in Polynomial Time
- Kearns, Singh
- 1998

Citation Context: ...Ng ANG@CS.STANFORD.EDU, Computer Science Department, Stanford University, Stanford, CA 94305, USA. Abstract: We consider reinforcement learning in systems with unknown dynamics. Algorithms such as E^3 (Kearns and Singh, 2002) learn near-optimal policies by using “exploration policies” to drive the system towards poorly modeled states, so as to encourage exploration. But this makes these algorithms impractical for many sy...

225 | Learning by watching: extracting reusable task knowledge from visual observation of human performance - Kuniyoshi, Inaba, et al. - 1994

218 | Optimal Control: Linear Quadratic Methods
- Anderson, Moore
- 1989

Citation Context: ...φ(·) : R^nS → R^nS. Thus, the next-state is a linear function of some (possibly non-linear) features of the current state (plus noise). This generalizes the familiar LQR model from classical control (Anderson & Moore, 1989) to non-linear settings. For example, the (body-coordinates) helicopter model used in (Ng et al., 2004) was of this form, with a particular choice of non-linear φ, and the unknown parameters A and B ...

136 | A robotic controller using learning by imitation - Hayes, Demiris - 1994

126 | ALVINN: An Autonomous Land Vehicle in a Neural Network - Pomerleau - 1989

125 | Autonomous helicopter flight via reinforcement learning
- Ng, Kim, et al.
- 2004
Citation Context: ...rrent state (plus noise). This generalizes the familiar LQR model from classical control (Anderson & Moore, 1989) to non-linear settings. For example, the (body-coordinates) helicopter model used in (Ng et al., 2004) was of this form, with a particular choice of non-linear φ, and the unknown parameters A and B were estimated from data. The process noise {w_t}_t is IID with w_t ∼ N(0, σ² I_nS). Here σ² is a fixed, k...
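
The model class this snippet describes, x_{t+1} = A φ(x_t) + B u_t + w_t, admits a straightforward least-squares estimate of A and B from trajectory data. The sketch below uses an invented toy system and feature map purely for illustration; it does not reproduce the paper's helicopter model or its particular φ:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented toy dimensions and feature map, purely illustrative.
n_s, n_u, T = 3, 2, 500
def phi(x):
    return np.tanh(x)  # some fixed non-linear state features

# Ground-truth parameters and one simulated trajectory of the system
# x_{t+1} = A phi(x_t) + B u_t + w_t, with small Gaussian process noise.
A_true = 0.5 * rng.standard_normal((n_s, n_s))
B_true = 0.5 * rng.standard_normal((n_s, n_u))
X = np.zeros((T + 1, n_s))
U = rng.standard_normal((T, n_u))
for t in range(T):
    X[t + 1] = A_true @ phi(X[t]) + B_true @ U[t] + 0.01 * rng.standard_normal(n_s)

# Stack regressors z_t = [phi(x_t); u_t] and solve the least-squares problem
# min_theta ||Z theta - X_{1:T}||^2, which recovers [A B].
Z = np.hstack([phi(X[:-1]), U])                    # shape (T, n_s + n_u)
theta, *_ = np.linalg.lstsq(Z, X[1:], rcond=None)  # theta is (n_s + n_u, n_s)
A_hat, B_hat = theta[:n_s].T, theta[n_s:].T
```

With enough data relative to the noise level, A_hat and B_hat converge to the true parameters; this is the "learning the dynamics from observations" step that the paper's teacher demonstration bootstraps.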

107 | Learning to fly - Sammut, Hurst, et al. - 1992

82 | Practical reinforcement learning in continuous spaces
- Smart, Kaelbling
- 2000
Citation Context: ...nd Kaelbling (2000) both give examples where learning is significantly faster when bootstrapping from a teacher. Their methods are somewhat related in spirit, but different in detail from ours (e.g., Smart and Kaelbling, 2000, uses model-free Q-learning, and does not learn the MDP parameters), and had no formal guarantees. Other examples include Sammut et al. (1992); Kuniyoshi, Inaba & Inoue (1994); Demiris & Hayes (1994)...

68 | Efficient Reinforcement Learning in Factored MDPs
- Kearns, Koller
- 1999
Citation Context: ...t we manage to collect accurate statistics for their state transition probabilities? The state-of-the-art answer to this problem is the E^3 algorithm (Kearns & Singh, 2002) (and variants/extensions: Kearns & Koller, 1999; Kakade, Kearns & Langford, 2003; Brafman & Tennenholtz, 2002). These ...

32 | Learning Movement Sequences from Demonstration - Amit, Mataric - 2002

31 | Exploration in metric state spaces - Kakade, Kearns, et al. - 2003

13 | Robot learning by nonparametric regression - Schaal, Atkeson - 1994

12 | Online bounds for Bayesian algorithms - Kakade, Ng - 2005

6 | Reinforcement learning: A survey. JAIR - Kaelbling, Littman, et al. - 1996

3 | Probability: Theory and Examples
- Durrett
- 2005

Citation Context: ...le. The sequence X0, X1, · · · is a martingale provided, for all i ≥ 0, we have that E[X_{i+1} | F_i] = X_i. Due to space constraints we cannot expand on these concepts here. We refer the reader to, e.g., (Durrett, 1995; Billingsley, 1995; Williams, 1991), for more details on martingales and stopping times. ... Lemma 10. Let any µ > 0, δ > 0 be given. For ...