## Learning from demonstration (1997)

Venue: Advances in Neural Information Processing Systems 9

Citations: 329 (30 self-citations)

### BibTeX

```bibtex
@inproceedings{Schaal97learningfrom,
  author    = {Stefan Schaal},
  title     = {Learning from demonstration},
  booktitle = {Advances in Neural Information Processing Systems 9},
  year      = {1997},
  publisher = {MIT Press}
}
```

### Abstract

By now it is widely accepted that learning a task from scratch, i.e., without any prior knowledge, is a daunting undertaking. Humans, however, rarely attempt to learn from scratch. They extract initial biases as well as strategies for how to approach a learning problem from instructions and/or demonstrations by other humans. For learning control, this paper investigates how learning from demonstration can be applied in the context of reinforcement learning. We consider priming the Q-function, the value function, the policy, and the model of the task dynamics as possible areas where demonstrations can speed up learning. In general nonlinear learning problems, only model-based reinforcement learning shows significant speed-up after a demonstration, while in the special case of linear quadratic regulator (LQR) problems, all methods profit from the demonstration. In an implementation of pole balancing on a complex anthropomorphic robot arm, we demonstrate that, when facing the complexities of real signal processing, model-based reinforcement learning offers the most robustness for LQR problems. Using the suggested methods, the robot learns pole balancing in a single trial after a 30-second demonstration by the human instructor.
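The model-based route the abstract describes for the LQR case can be sketched on a toy problem. This is a hypothetical illustration, not the paper's implementation: fit a linear model of the task dynamics from demonstration data by least squares (priming the model), then derive the feedback controller from the fitted model by iterating the discrete Riccati equation. The system matrices and demonstration data below are invented.

```python
import numpy as np

# Toy linear system standing in for pole balancing near upright; the true
# dynamics are unknown to the learner and only visible through demonstration data.
A_true = np.array([[1.0, 0.1],
                   [0.0, 1.0]])
B_true = np.array([[0.0],
                   [0.1]])

rng = np.random.default_rng(0)

# "Demonstration": observed (state, command, next state) triples from a teacher.
X = rng.normal(size=(100, 2))           # states visited
U = rng.normal(size=(100, 1))           # commands issued
Xn = X @ A_true.T + U @ B_true.T        # resulting next states

# Prime the model: fit x' = A x + B u by least squares on the demonstration.
theta, *_ = np.linalg.lstsq(np.hstack([X, U]), Xn, rcond=None)
A_hat, B_hat = theta[:2].T, theta[2:].T

# Solve the infinite-horizon discrete LQR for the fitted model by iterating
# the Riccati recursion: P <- Q + A'P(A - BK), with K = (R + B'PB)^-1 B'PA.
Qc, Rc = np.eye(2), np.eye(1)
P = Qc.copy()
for _ in range(500):
    K = np.linalg.solve(Rc + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)
    P = Qc + A_hat.T @ P @ (A_hat - B_hat @ K)

# The controller u = -K x should stabilize the true system.
print(np.max(np.abs(np.linalg.eigvals(A_true - B_true @ K))))  # spectral radius < 1
```

One demonstration of noiseless transitions recovers the model exactly here; with real sensor data the paper instead learns the model incrementally (via RFWR), but the same pipeline applies.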

### Citations

498 | Neuron-like adaptive elements that can solve difficult learning control problems - Barto, Sutton, et al. - 1983 |

Citation context: ...e building up too much momentum during pumping will overshoot the upright position. The (approximately) linear example, Figure 1b, is the well-known cart-pole balancing problem (Widrow & Smith, 1964; Barto, Sutton, & Anderson, 1983). For both tasks, the learner is given information about the one-step reward r (Figure 1), and both tasks are formulated as continuous state and continuous action problems. The goal of each task is t...

494 | Integrated Architectures for learning, planning and reacting based on approximating dynamic programming - Sutton - 1990 |

173 | Learning from Delayed Rewards - Watkins - 1989 |

Citation context: ... THE NONLINEAR TASK: SWING-UP We applied reinforcement learning based on learning a value function (V-function) (Dyer & McReynolds, 1970) for the Swing-Up task, as the alternative method, Q–learning (Watkins, 1989), has yet received very limited research for continuous state-action spaces. The V–function assigns a scalar reward value V(x(t)) to each state x such that the entire V–function fulfills the consist...
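The consistency condition the snippet above refers to (the value of a state equals the one-step reward plus the discounted value of the successor along the best action) can be illustrated on a discrete toy problem. This is a hypothetical five-state chain, not the paper's continuous swing-up formulation:

```python
# Value iteration on a toy 5-state chain: actions move left or right, and
# entering the rightmost state pays reward 1. Iterating the consistency
# equation V(x) = max_u [ r(x,u) + gamma * V(x') ] drives V to its fixed point.
gamma, n = 0.9, 5

def step(x, u):
    return min(max(x + u, 0), n - 1)    # deterministic chain dynamics

def reward(x, u):
    return 1.0 if step(x, u) == n - 1 else 0.0

V = [0.0] * n
for _ in range(200):
    V = [max(reward(x, u) + gamma * V[step(x, u)] for u in (-1, 1))
         for x in range(n)]

print([round(v, 2) for v in V])         # → [7.29, 8.1, 9.0, 10.0, 10.0]
```

Values decay geometrically with distance from the goal, and the goal state's value is 1/(1 − γ) = 10, exactly as the fixed-point equation predicts.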

81 | Robot see, robot do : An overview of robot imitation - Bakker, Kuniyoshi - 1996 |

58 | Reinforcement learning applied to linear quadratic regulation - Bradtke - 1993 |

52 | Using local trajectory optimizers to speed up global optimization in dynamic programming - Atkeson - 1994 |

Citation context: .... REINFORCEMENT LEARNING FROM DEMONSTRATION Two example tasks will be the basis of our investigation of learning from demonstration. The nonlinear task is the “pendulum swing-up with limited torque” (Atkeson, 1994; Doya, 1996), as shown in Figure 1a. The goal is to balance the pendulum in an upright position starting from hanging downward. As the maximal torque available is restricted such that the pendulum ca...

42 | An approach to automatic robot programming based on inductive learning - Dufay, Latombe - 1984 |

31 | Temporal difference learning in continuous time and space - Doya - 1996 |

Citation context: ... LEARNING FROM DEMONSTRATION Two example tasks will be the basis of our investigation of learning from demonstration. The nonlinear task is the “pendulum swing-up with limited torque” (Atkeson, 1994; Doya, 1996), as shown in Figure 1a. The goal is to balance the pendulum in an upright position starting from hanging downward. As the maximal torque available is restricted such that the pendulum cannot be supp...

29 | Toward an assembly plan from observation - Ikeuchi, Suehiro - 1994 |

Citation context: ...g processes (e.g., Lozano-Pérez, 1982; Dufay & Latombe, 1983; Segre & DeJong, 1985). More recent approaches to programming by demonstration started to include more inductive learning components (e.g., Ikeuchi, 1993; Dillmann, Kaiser, & Ude, 1995). In the context of human skill learning, teaching by showing was investigated by Kawato, Gandolfo, Gomi, & Wada (1994) and Miyamoto et al. (1996) for a complex manipul...

29 | Fast, robust adaptive control by learning only forward models - Moore - 1992 |

24 | Acquisition of elementary robot skills from human demonstration - Dillmann, Kaiser, et al. - 1995 |

Citation context: ...g., Lozano-Pérez, 1982; Dufay & Latombe, 1983; Segre & DeJong, 1985). More recent approaches to programming by demonstration started to include more inductive learning components (e.g., Ikeuchi, 1993; Dillmann, Kaiser, & Ude, 1995). In the context of human skill learning, teaching by showing was investigated by Kawato, Gandolfo, Gomi, & Wada (1994) and Miyamoto et al. (1996) for a complex manipulation task to be learned by an ...

13 | Explanation-based manipulator learning: Acquisition of planning ability through observation - Segre, DeJong - 1985 |

Citation context: ...o primitive assembly actions and spatial relationships between manipulator and environment, and subsequently submitted to symbolic reasoning processes (e.g., Lozano-Pérez, 1982; Dufay & Latombe, 1983; Segre & DeJong, 1985). More recent approaches to programming by demonstration started to include more inductive learning components (e.g., Ikeuchi, 1993; Dillmann, Kaiser, & Ude, 1995). In the context of human skill lear...

11 | Teaching by showing in kendama based on optimization principle - Kawato, Gandolfo, et al. - 1994 |

8 | From isolation to cooperation: An alternative view of a system of experts - Schaal, Atkeson - 1996 |

Citation context: ...as in (6), and in state-predictive control with a Kalman filter to overcome the delays in visual information processing. The model was learned incrementally in real-time by an implementation of RFWR (Schaal & Atkeson, 1996). Figure 6 shows the results of learning from scratch and learning from demonstration of the actual robot. Without a demonstration, it took about 10-20 trials before learning succeeded in reliable pe...

7 | Task Planning - Lozano-Pérez - 1982 |

5 | The computation and theory of optimal control - Dyer, McReynolds - 1970 |

Citation context: ...xθθ, F). How can these demonstrations be used to speed up reinforcement learning? 2.1 THE NONLINEAR TASK: SWING-UP We applied reinforcement learning based on learning a value function (V-function) (Dyer & McReynolds, 1970) for the Swing-Up task, as the alternative method, Q–learning (Watkins, 1989), has yet received very limited research for continuous state-action spaces. The V–function assigns a scalar reward value ...

2 | Reinforcement learning with replacing eligibility traces - Singh, Sutton - 1996 |

Citation context: ...ell studied in the dynamic programming literature in the context of linear quadratic regulation (LQR) (Dyer & McReynolds, 1970). 2.2.1 Q–Learning In contrast to V-learning, Q–learning (Watkins, 1989; Singh & Sutton, 1996) learns a more complicated value function, Q(x,u), which depends both on the state and the command. The analogue of the consistency equation (2) for Q–learning is: Q(x(t), u(t)) = r(x(t), u(t)) + γ arg...
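The snippet's consistency equation is truncated; in standard tabular Q-learning the bootstrap target is r(x, u) + γ max over u' of Q(x', u'). A minimal illustration on a hypothetical discrete chain (the paper itself works in continuous state-action spaces, where this tabular form does not directly apply):

```python
import random

# Tabular Q-learning on a toy 5-state chain (reward 1 for entering the
# rightmost state). Each update moves Q(x,u) toward the bootstrap target
# r(x,u) + gamma * max_u' Q(x',u') -- the Q-learning consistency equation.
gamma, alpha, n = 0.9, 0.5, 5
actions = (-1, 1)

def step(x, u):
    return min(max(x + u, 0), n - 1)

def reward(x, u):
    return 1.0 if step(x, u) == n - 1 else 0.0

Q = {(x, u): 0.0 for x in range(n) for u in actions}
rng = random.Random(0)
x = 0
for _ in range(20000):
    u = rng.choice(actions)                         # purely exploratory policy
    xn, r = step(x, u), reward(x, u)
    target = r + gamma * max(Q[(xn, a)] for a in actions)
    Q[(x, u)] += alpha * (target - Q[(x, u)])       # TD step toward the target
    x = xn

print(round(max(Q[(0, a)] for a in actions), 2))    # optimal value of state 0
```

Because the chain is deterministic, the learned max over u of Q(0, u) settles at the optimal state value 0.9³ · 10 = 7.29, matching what value iteration on the same chain would give.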