## Learning for Control from Multiple Demonstrations


Citations: 41 (6 self-citations)

### BibTeX

@MISC{Coates_learningfor,

author = {Adam Coates and Pieter Abbeel and Andrew Y. Ng},

title = {Learning for Control from Multiple Demonstrations},

year = {}

}

### Abstract

We consider the problem of learning to follow a desired trajectory when given a small number of demonstrations from a sub-optimal expert. We present an algorithm that (i) extracts the initially unknown desired trajectory from the sub-optimal expert's demonstrations and (ii) learns a local model suitable for control along the learned trajectory. We apply our algorithm to the problem of autonomous helicopter flight. In all cases, the autonomous helicopter's performance exceeds that of our expert helicopter pilot's demonstrations. Moreover, our results significantly extend the state of the art in autonomous helicopter aerobatics. In particular, our results include the first autonomous tic-tocs, loops and hurricane, vastly superior performance on previously performed aerobatic maneuvers (such as in-place flips and rolls), and a complete airshow, which requires autonomous transitions between these and various other maneuvers.

### Citations

8089 | Maximum likelihood from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977
Citation Context: ...del we can directly estimate the multinomial parameters d in closed form; and we have a standard HMM parameter learning problem for the covariances Σ(·), which can be solved using the EM algorithm (Dempster et al., 1977), often referred to as Baum-Welch in the context of HMMs. Concretely, for our setting, the EM algorithm's E-step computes the pairwise marginals over sequential hidden state variables by running a (ex...
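The E-step mentioned above computes pairwise marginals over consecutive hidden states. As a rough illustration only, the sketch below performs that computation for a small discrete-state HMM with scaled forward-backward recursions. It assumes a finite set of hidden states and discrete observations, which is a simplification of the paper's setting (whose covariances Σ(·) indicate continuous quantities); the function and argument names are hypothetical.

```python
def pairwise_marginals(init, trans, emit, obs):
    """Forward-backward for a discrete HMM (hypothetical illustration).

    init[i]     : P(s_0 = i)
    trans[i][j] : P(s_{t+1} = j | s_t = i)
    emit[i][o]  : P(observation o | s = i)
    obs         : list of observation indices
    Returns xi, where xi[t][i][j] = P(s_t = i, s_{t+1} = j | obs).
    """
    n, T = len(init), len(obs)

    # Forward pass, normalized at every step to avoid numerical underflow.
    a = [init[i] * emit[i][obs[0]] for i in range(n)]
    c = sum(a)
    alpha = [[x / c for x in a]]
    for t in range(1, T):
        a = [sum(alpha[-1][i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
             for j in range(n)]
        c = sum(a)
        alpha.append([x / c for x in a])

    # Backward pass, also normalized; per-step scale factors cancel below
    # because each xi[t] is renormalized to sum to one.
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        b = [sum(trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
             for i in range(n)]
        c = sum(b)
        beta[t] = [x / c for x in b]

    # Pairwise marginals: xi_t(i, j) is proportional to
    # alpha_t(i) * trans(i, j) * emit(j, o_{t+1}) * beta_{t+1}(j).
    xi = []
    for t in range(T - 1):
        m = [[alpha[t][i] * trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
              for j in range(n)] for i in range(n)]
        z = sum(sum(row) for row in m)
        xi.append([[x / z for x in row] for row in m])
    return xi
```

In a Baum-Welch M-step, sums of these pairwise marginals over t would serve as the expected transition counts.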

1410 | A general method applicable to the search for similarities in the amino acid sequence of two proteins - Needleman, Wunsch - 1970
Citation Context: ... algorithm to find τ is known in the speech recognition literature as dynamic time warping (Sakoe & Chiba, 1978) and in the biological sequence alignment literature as the Needleman-Wunsch algorithm (Needleman & Wunsch, 1970). The fixed z we use is the one that maximizes the likelihood of the observations for the current setting of parameters τ, d, Σ(·).⁵ In practice, rather than alternating between complete optimizati...
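Classical dynamic time warping, named in the passage above, fills a cumulative-cost table by dynamic programming and then backtracks to recover the alignment. The sketch below assumes 1-D sequences, a squared-difference local cost, and the standard match/insert/delete step pattern. The paper instead scores the time-index assignments τ by likelihood under its own model, so this only illustrates the general technique; `dtw` is a hypothetical name.

```python
def dtw(ref, demo):
    """Classical dynamic time warping between two 1-D sequences.

    Returns (total_cost, path), where path is a list of (i, j) pairs
    aligning ref[i] with demo[j], using squared difference as local cost.
    """
    n, m = len(ref), len(demo)
    INF = float("inf")
    # D[i][j] = minimal cost of aligning ref[:i] with demo[:j].
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (ref[i - 1] - demo[j - 1]) ** 2
            # Diagonal step (advance both), or repeat one index (advance the other).
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])

    # Backtrack from the end to recover one optimal warping path
    # (ties are broken in favor of the diagonal step).
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = {(i - 1, j - 1): D[i - 1][j - 1],
                 (i - 1, j): D[i - 1][j],
                 (i, j - 1): D[i][j - 1]}
        i, j = min(steps, key=steps.get)
    path.reverse()
    return D[n][m], path
```

For example, `dtw([0, 1, 2, 3], [0, 0, 1, 2, 2, 3])` reports a total cost of 0, because the second sequence is a time-stretched copy of the first.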

430 | Dynamic programming algorithm optimization for spoken word recognition - Sakoe, Chiba - 1978
Citation Context: ...ic programming over the time-index assignments for each demonstration independently. The dynamic programming algorithm to find τ is known in the speech recognition literature as dynamic time warping (Sakoe & Chiba, 1978) and in the biological sequence alignment literature as the Needleman-Wunsch algorithm (Needleman & Wunsch, 1970). The fixed z we use is the one that maximizes the likelihood of the observations for...

311 | Robot Learning From Demonstration - Atkeson, Schaal - 1997

288 | Context-specific independence in Bayesian networks - Boutilier, Friedman, et al. - 1996
Citation Context: ...s τ introduces a very large set of dependencies between all the variables. However, when τ is known, the optimization problem in Eq. (11) greatly simplifies thanks to context-specific independencies (Boutilier et al., 1996). When τ is fixed, we obtain a model such as the one shown in Figure 2. In this model we can directly estimate the multinomial parameters d in closed form; and we have a standard HMM parameter learni...

239 | Apprenticeship Learning via Inverse Reinforcement Learning - Abbeel, Ng - 2004

194 | Algorithms for inverse reinforcement learning - Ng, Russell - 2000

159 | Locally Weighted Learning for Control - Atkeson, Moore, et al. - 1997
Citation Context: ...ls after (Figure 3: Our XCell Tempest autonomous helicopter.) To construct an accurate nonlinear model to predict zₜ₊₁ from zₜ, using the aligned data, one could use locally weighted linear regression (Atkeson et al., 1997), where a linear model is learned based on a weighted dataset. Data points from our aligned demonstrations that are nearer to the current time index along the trajectory, t, and nearer the current st...
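Locally weighted linear regression, as described above, fits a linear model to a dataset in which points closer to the query receive more weight. A minimal sketch for a scalar state, assuming Gaussian kernels over both the time index and the state, with hypothetical bandwidths `tau_t` and `tau_z` (the passage does not specify the weighting function). The model predicts the next state from the current one by weighted least squares.

```python
import math


def lwr_predict(data, t_query, z_query, tau_t=2.0, tau_z=0.5):
    """Locally weighted linear regression for a scalar next-state prediction.

    data: list of (t, z_t, z_next) tuples from aligned demonstrations.
    Each point gets a Gaussian weight that shrinks with its distance from
    the query in both time index and state (an assumed kernel); then
    z_next ~ a * z_t + b is fit by weighted least squares at z_query.
    Raises ZeroDivisionError if all weighted z_t values are identical.
    """
    w = [math.exp(-((t - t_query) ** 2) / (2 * tau_t ** 2)
                  - ((z - z_query) ** 2) / (2 * tau_z ** 2))
         for t, z, _ in data]
    sw = sum(w)

    # Weighted means of the input (z_t) and the target (z_next).
    mz = sum(wi * z for wi, (_, z, _) in zip(w, data)) / sw
    my = sum(wi * y for wi, (_, _, y) in zip(w, data)) / sw

    # Weighted covariance over weighted variance gives the local slope.
    cov = sum(wi * (z - mz) * (y - my) for wi, (_, z, y) in zip(w, data))
    var = sum(wi * (z - mz) ** 2 for wi, (_, z, _) in zip(w, data))
    a = cov / var
    b = my - a * mz
    return a * z_query + b
```

Because the weighted fit is recomputed for each query, the local model can change along the trajectory; the bandwidths trade locality against sensitivity to noise.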

134 | On learning, representing and generalizing a task in a humanoid robot - Calinon, Guenter, et al. - 2007

106 | Differential Dynamic Programming - Jacobson, Mayne - 1970
Citation Context: ...lized trajectory and models using our algorithm, we attempted to fly the trajectory on the actual helicopter. Our helicopter uses a receding-horizon differential dynamic programming (DDP) controller (Jacobson & Mayne, 1970). DDP approximately solves general continuous state-space optimal control problems by taking advantage of the fact that optimal control problems with linear dynamics and a quadratic reward function (...
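The passage notes that DDP exploits the tractability of problems with linear dynamics and a quadratic objective. The sketch below shows only that linear-quadratic building block: a finite-horizon LQR for a scalar system, solved by the backward Riccati recursion and written as cost minimization (a quadratic cost is a negated quadratic reward). It is not the paper's receding-horizon DDP controller; DDP-style methods apply a recursion of this kind repeatedly to local quadratic approximations around a nominal trajectory. All names are hypothetical.

```python
def lqr_gains(A, B, Q, R, Qf, T):
    """Finite-horizon discrete-time LQR for a scalar system.

    Dynamics: x_{t+1} = A x_t + B u_t
    Cost:     sum_{t<T} (Q x_t^2 + R u_t^2) + Qf x_T^2
    Returns (gains, P0): gains[t] defines u_t = -gains[t] * x_t, and the
    optimal cost from state x at time 0 is P0 * x**2.
    """
    P = Qf  # cost-to-go coefficient at the final time step
    gains = []
    for _ in range(T):
        # Backward Riccati step: compute K_t from P_{t+1}, then P_t.
        K = (B * P * A) / (R + B * P * B)
        P = Q + A * P * A - A * P * B * K
        gains.append(K)
    gains.reverse()  # gains were produced from t = T-1 down to t = 0
    return gains, P


def rollout_cost(A, B, Q, R, Qf, gains, x0):
    """Simulate the linear feedback policy and accumulate the quadratic cost."""
    x, cost = x0, 0.0
    for K in gains:
        u = -K * x
        cost += Q * x * x + R * u * u
        x = A * x + B * u
    return cost + Qf * x * x
```

Simulating the resulting policy with `rollout_cost` should reproduce the predicted optimal cost `P0 * x0**2`, which is a convenient sanity check.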

104 | Maximum Margin Planning - Ratliff, Bagnell, et al. - 2006

89 | Model-Based Control of a Robot Manipulator - An, Atkeson, et al. - 1988

74 | An application of reinforcement learning to aerobatic helicopter flight - Abbeel, Coates, et al. - 2007
Citation Context: ...cessfully extracts a good trajectory from the multiple sub-optimal demonstrations, and (ii) the resulting flight performance significantly extends the state of the art in aerobatic helicopter flight (Abbeel et al., 2007; Gavrilets et al., 2002). Most importantly, our resulting controllers are the first to perform as well as, and often even better than, our expert pilot. We posted movies of our autonomous helicopter fli...

54 | Bayesian Inverse Reinforcement Learning - Ramachandran, Amir - 2007

45 | Apprenticeship Learning using Inverse Reinforcement Learning and Gradient Methods - Neu, Szepesvári - 2007

38 | Autonomous inverted helicopter flight via reinforcement learning - Ng, Coates, et al. - 2004

34 | Using inaccurate models in reinforcement learning - Abbeel, Quigley, et al. - 2006

28 | Multiple alignment of continuous time series - Listgarten, Neal, et al. - 2005
Citation Context: ...upted by Gaussian noise. Listgarten et al. have used this same basic generative model (for the case where f(·) is the identity function) to align speech signals and biological data (Listgarten, 2006; Listgarten et al., 2005). We now augment the basic model to account for other sources of error which are important for modeling and control. (Section 2.2.1: Learning Local Model Parameters) For many systems, we can substantially impr...

23 | Learning vehicular dynamics, with application to modelling helicopters - Abbeel, Ganapathi, et al. - 2006

21 | Control logic for automated aerobatic flight of miniature helicopter - Gavrilets, Martinos, et al. - 2002
Citation Context: ...good trajectory from the multiple sub-optimal demonstrations, and (ii) the resulting flight performance significantly extends the state of the art in aerobatic helicopter flight (Abbeel et al., 2007; Gavrilets et al., 2002). Most importantly, our resulting controllers are the first to perform as well as, and often even better than, our expert pilot. We posted movies of our autonomous helicopter flights at: http://heli.sta...

3 | Analysis of Sibling Time Series Data: Alignment and Difference Detection - Listgarten - 2006
Citation Context: ...rved.) merely corrupted by Gaussian noise. Listgarten et al. have used this same basic generative model (for the case where f(·) is the identity function) to align speech signals and biological data (Listgarten, 2006; Listgarten et al., 2005). We now augment the basic model to account for other sources of error which are important for modeling and control. (Section 2.2.1: Learning Local Model Parameters) For many systems, ...