Results 1–10 of 33
Active Learning for Reward Estimation in Inverse Reinforcement Learning, 2009
Abstract

Cited by 42 (14 self)
Inverse reinforcement learning addresses the general problem of recovering a reward function from samples of a policy provided by an expert/demonstrator. In this paper, we introduce active learning for inverse reinforcement learning. We propose an algorithm that allows the agent to query the demonstrator for samples at specific states, instead of relying only on samples provided at “arbitrary” states. The purpose of our algorithm is to estimate the reward function with accuracy similar to that of other methods from the literature, while reducing the number of policy samples required from the expert. We also discuss the use of our algorithm in higher-dimensional problems, using both Monte Carlo and gradient methods. We present illustrative results of our algorithm in several simulated examples of different complexities.
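As a rough illustration of the active-query idea in this abstract (not the paper's actual algorithm, which reasons over reward posteriors), the sketch below keeps a finite set of candidate reward hypotheses, queries the expert at the state where the hypotheses' greedy policies disagree most, and prunes hypotheses that contradict the answer. The hypothesis representation and all function names are invented for this sketch.

```python
import numpy as np

def disagreement_query(policies):
    """Pick the state where candidate policies disagree most.

    policies: (H, S) array, policies[h, s] = greedy action of reward
    hypothesis h at state s. Returns the state with the largest number
    of distinct proposed actions, i.e. where one expert label prunes
    the most hypotheses.
    """
    n_distinct = np.array([len(np.unique(policies[:, s]))
                           for s in range(policies.shape[1])])
    return int(np.argmax(n_distinct))

def prune(policies, hypotheses, state, expert_action):
    """Keep only reward hypotheses whose greedy policy matches the expert
    at the queried state."""
    keep = policies[:, state] == expert_action
    return policies[keep], [h for h, k in zip(hypotheses, keep) if k]
```

Repeating query-then-prune concentrates the expert's effort on informative states instead of "arbitrary" ones, which is the point of the active-learning variant.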
Relative Entropy Inverse Reinforcement Learning
Abstract

Cited by 27 (3 self)
We consider the problem of imitation learning where the examples, demonstrated by an expert, cover only a small part of a large state space. Inverse Reinforcement Learning (IRL) provides an efficient tool for generalizing the demonstration, based on the assumption that the expert is acting optimally in a Markov Decision Process (MDP). Most of the past work on IRL requires that a (near-)optimal policy can be computed for different reward functions. However, this requirement can hardly be satisfied in systems with a large, or continuous, state space. In this paper, we propose a model-free IRL algorithm, where the relative entropy between the empirical distribution of the state-action trajectories under a baseline policy and their distribution under the learned policy is minimized by stochastic gradient descent. We compare this new approach to well-known IRL algorithms using learned MDP models. Empirical results on simulated car racing, gridworld, and ball-in-a-cup problems show that our approach is able to learn good policies from a small number of demonstrations.
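The core update can be stated compactly. Assuming (as this sketch does, it is not a transcription of the paper) the trajectory-level exponential-family form commonly used in relative-entropy IRL, the gradient with respect to the reward weights is the expert's empirical feature counts minus a softmax-weighted average of baseline trajectory feature counts:

```python
import numpy as np

def relent_gradient(theta, f_expert, f_base):
    """One stochastic-gradient step direction for a RelEnt-IRL-style objective.

    theta:    (d,) current reward weights
    f_expert: (d,) empirical feature counts of the expert trajectories
    f_base:   (n, d) feature counts of n trajectories sampled from a
              baseline policy, importance-weighted by exp(theta . f)
    """
    logits = f_base @ theta
    w = np.exp(logits - logits.max())   # softmax weights, numerically stable
    w /= w.sum()
    return f_expert - w @ f_base        # push toward matching expert counts
```

Ascending this direction (theta += lr * relent_gradient(...)) reweights the baseline trajectories until their weighted feature counts match the expert's, which is the model-free step the abstract describes.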
Inverse Optimal Control with Linearly-Solvable MDPs
Abstract

Cited by 24 (4 self)
We present new algorithms for inverse optimal control (or inverse reinforcement learning, IRL) within the framework of linearly-solvable MDPs (LMDPs). Unlike most prior IRL algorithms, which recover only the control policy of the expert, we recover the policy, the value function, and the cost function. This is possible because here the cost and value functions are uniquely defined given the policy. Despite these special properties, we can handle a wide variety of problems, such as the grid worlds popular in RL and most of the nonlinear problems arising in robotics and control engineering. Direct comparisons to ...
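For reference, the LMDP structure that makes the cost and value recoverable from the policy can be summarized as follows (the notation is the standard one from the LMDP literature and is an assumption of this summary, not quoted from the paper). With passive dynamics $p$, state cost $q$, and desirability $z = e^{-v}$:

```latex
% Linearly-solvable Bellman equation and optimal policy in an LMDP:
z(x) = e^{-q(x)} \sum_{x'} p(x' \mid x)\, z(x'),
\qquad
u^*(x' \mid x) = \frac{p(x' \mid x)\, z(x')}{\sum_{y} p(y \mid x)\, z(y)}.
```

Because the ratio $u^*(x' \mid x) / p(x' \mid x)$ is proportional to $z(x')$, an observed policy pins down $z$ (hence the value $v = -\log z$ and, via the Bellman equation, the cost $q$) up to a constant. This is why, in this class of problems, inverse optimal control can recover all three objects rather than the policy alone.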
Abstraction Levels for Robotic Imitation: Overview and Computational Approaches, 2010
Abstract

Cited by 16 (6 self)
This chapter reviews several approaches to the problem of learning by imitation in robotics. We start by describing several cognitive processes identified in the literature as necessary for imitation. We then proceed by surveying different approaches to this problem, placing particular emphasis on methods whereby an agent first learns about its own body dynamics by means of self-exploration, and then uses this knowledge about its own body to recognize the actions being performed by other agents. This general approach is related to the motor theory of perception, particularly to the mirror neurons found in primates. We distinguish three fundamental classes of methods, corresponding to three abstraction levels at which imitation can be addressed. As such, the methods surveyed herein exhibit behaviors that range from raw sensorimotor trajectory matching to high-level abstract task replication. We also discuss the impact that knowledge about the world and/or the demonstrator can have on the particular behaviors exhibited.
A Reduction from Apprenticeship Learning to Classification
Abstract

Cited by 16 (0 self)
We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert’s behavior. Although this straightforward learning strategy is widely used in practice, it has been subject to very little formal analysis. We prove that, if the learned classifier has error rate ε, the difference between the value of the apprentice’s policy and the expert’s policy is O(√ε). Further, we prove that this difference is only O(ε) when the expert’s policy is close to optimal. This latter result has an important practical consequence: not only does imitating a near-optimal expert result in a better policy, but far fewer demonstrations are required to successfully imitate such an expert. This suggests an opportunity for substantial savings whenever the expert is known to be good, but demonstrations are expensive or difficult to obtain.
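The "classification" approach the abstract analyzes is, in its simplest form, behavioral cloning: fit a classifier from states to expert actions and execute its predictions. A deliberately minimal sketch with a 1-nearest-neighbour classifier (the choice of classifier is ours for illustration, not the paper's):

```python
def fit_1nn(demos):
    """Behavioral cloning as classification: memorize expert (state, action)
    pairs. demos: list of (state, action), states as numeric feature tuples."""
    return list(demos)

def predict(model, state):
    """1-nearest-neighbour action: copy the expert's action at the closest
    demonstrated state (squared Euclidean distance)."""
    def dist(s):
        return sum((a - b) ** 2 for a, b in zip(s, state))
    return min(model, key=lambda sa: dist(sa[0]))[1]
```

The paper's result then reads: if such a classifier mislabels an ε fraction of states under the expert's state distribution, the cloned policy loses at most O(√ε) in value, and only O(ε) when the expert is near-optimal.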
Feature Construction for Inverse Reinforcement Learning. In: Proc. of NIPS 2010
Abstract

Cited by 12 (1 self)
... function R, as well as example traces D from its optimal ...
Transferring Impedance Control Strategies Between Heterogeneous Systems via Apprenticeship Learning
Abstract

Cited by 9 (6 self)
We present a novel method for designing controllers for robots with variable impedance actuators. We take an imitation learning approach, whereby we learn impedance modulation strategies from observations of behaviour (for example, that of humans) and transfer these to a robotic plant with very different actuators and dynamics. In contrast to previous approaches where impedance characteristics are directly imitated, our method uses task performance as the metric of imitation, ensuring that the learnt controllers are directly optimised for the hardware of the imitator. As a key ingredient, we use apprenticeship learning to model the optimisation criteria underlying observed behaviour, in order to frame a correspondent optimal control problem for the imitator. We then apply local optimal feedback control techniques to find an appropriate impedance modulation strategy under the imitator’s dynamics. We test our approach on systems of varying complexity, including a novel antagonistic series elastic actuator and a biologically realistic two-joint, six-muscle model of the human arm.
Learning from Demonstration Using MDP Induced Metrics
Abstract

Cited by 6 (1 self)
In this paper we address the problem of learning a policy from demonstration. Assuming that the policy to be learned is the optimal policy for an underlying MDP, we propose a novel way of leveraging the underlying MDP structure in a kernel-based approach. Our proposed approach rests on the insight that the MDP structure can be encapsulated into an adequate state-space metric. In particular we show that, using MDP metrics, we are able to cast the problem of learning from demonstration as a classification problem and attain generalization performance similar to that of methods based on inverse reinforcement learning, at a much lower online computational cost. Our method is also able to attain better generalization than other supervised learning methods that fail to consider the MDP structure.
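One way to make the "MDP structure as a state-space metric" idea concrete is a bisimulation-flavoured metric: states are close when their rewards are close and their next-state distributions are close, and demonstrations are then generalized by nearest-neighbour classification under that metric. The sketch below is our own illustration (the total-variation term is a cheap surrogate for the Wasserstein term used in true bisimulation metrics), not the metric construction of the paper.

```python
import numpy as np

def mdp_metric(R, P, gamma=0.9):
    """Bisimulation-flavoured state metric: reward gap plus a discounted
    total-variation gap between next-state distributions.
    R: (S,) state rewards; P: (S, S) transition matrix."""
    tv = 0.5 * np.abs(P[:, None, :] - P[None, :, :]).sum(-1)
    return np.abs(R[:, None] - R[None, :]) + gamma * tv

def classify(d, demo_states, demo_actions, s):
    """Learning from demonstration as classification: copy the expert's
    action at the demonstrated state closest to s under the MDP metric."""
    i = min(range(len(demo_states)), key=lambda k: d[s, demo_states[k]])
    return demo_actions[i]
```

Because the metric reflects dynamics and reward rather than raw feature distance, the nearest demonstrated state tends to be one where the expert's action actually transfers.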
Bootstrapping Apprenticeship Learning
Abstract

Cited by 6 (1 self)
We consider the problem of apprenticeship learning where the examples, demonstrated by an expert, cover only a small part of a large state space. Inverse Reinforcement Learning (IRL) provides an efficient tool for generalizing the demonstration, based on the assumption that the expert is maximizing a utility function that is a linear combination of state-action features. Most IRL algorithms use a simple Monte Carlo estimation to approximate the expected feature counts under the expert’s policy. In this paper, we show that the quality of the learned policies is highly sensitive to the error in estimating the feature counts. To reduce this error, we introduce a novel approach for bootstrapping the demonstration by assuming that (i) the expert is (near-)optimal, and (ii) the dynamics of the system are known. Empirical results on gridworld and car racing problems show that our approach is able to learn good policies from a small number of demonstrations.
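The Monte Carlo feature-count estimator the abstract refers to is standard and easy to state: average the discounted feature sums of the demonstrated trajectories. The sketch below computes that estimator; it is the quantity whose estimation error the paper argues the learned policies are sensitive to when demonstrations are few.

```python
import numpy as np

def feature_counts(trajectories, phi, gamma=0.99):
    """Monte Carlo estimate of the expert's discounted feature counts:
        mu_hat = (1/m) * sum_traj sum_t gamma^t * phi(s_t, a_t)
    trajectories: list of trajectories, each a list of (state, action);
    phi: maps (state, action) -> (d,) feature vector."""
    mus = []
    for traj in trajectories:
        mu = sum(gamma ** t * np.asarray(phi(s, a))
                 for t, (s, a) in enumerate(traj))
        mus.append(mu)
    return np.mean(mus, axis=0)
```

With m trajectories the variance of this estimate shrinks only as 1/m, which is why the paper's bootstrapping of the demonstration (using near-optimality and known dynamics) helps in the small-m regime.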
A Cascaded Supervised Learning Approach to Inverse Reinforcement Learning. In: European Conference on Machine Learning (ECML), 2013
Abstract

Cited by 4 (3 self)
This paper considers the Inverse Reinforcement Learning (IRL) problem, that is, inferring a reward function for which a demonstrated expert policy is optimal. We propose to break the IRL problem down into two generic supervised learning steps: this is the Cascaded Supervised IRL (CSI) approach. A classification step that defines a score function is followed by a regression step providing a reward function. A theoretical analysis shows that the demonstrated expert policy is near-optimal for the computed reward function. Not needing to repeatedly solve a Markov Decision Process (MDP), and the ability to leverage existing techniques for classification and regression, are two important advantages of the CSI approach. It is furthermore empirically shown to compare favorably to state-of-the-art approaches when using only transitions sampled according to the expert policy, up to the use of some heuristics. This is exemplified on two classical benchmarks (the mountain car problem and a highway driving simulator).
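The cascade can be sketched in one function. Suppose the classification step has already produced a score function q(s, a) (for instance from classifier margins); an inverse-Bellman construction then turns scores into reward samples for the regression step. The exact form below is an assumption of this sketch made for tabular MDPs with known transitions, not a transcription of the paper, which works from sampled transitions:

```python
import numpy as np

def csi_reward(q, P, gamma=0.99):
    """Cascade step 2 (building regression targets), sketched: given a
    score function q from the classification step, form reward samples via
        r(s, a) = q(s, a) - gamma * sum_{s'} P(s' | s, a) * max_a' q(s', a'),
    which a regressor would then fit to generalize across states.
    q: (S, A) scores; P: (S, A, S) transition tensor."""
    v = q.max(axis=1)         # greedy value under the score function
    return q - gamma * P @ v  # (S, A) reward samples
```

By construction, q is then a Q-function for the returned reward, so the greedy (expert-matching) policy from the classification step is optimal for it; this is the mechanism behind the near-optimality guarantee the abstract mentions, and no MDP ever has to be solved.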