Results 1–10 of 163
Apprenticeship Learning via Inverse Reinforcement Learning
In Proceedings of the Twenty-first International Conference on Machine Learning, 2004
Cited by 239 (10 self)

Abstract
We consider learning in a Markov decision process where we are not explicitly given a reward function, but where instead we can observe an expert demonstrating the task that we want to learn to perform. This setting is useful in applications (such as the task of driving) where it may be difficult to write down an explicit reward function specifying exactly how different desiderata should be traded off. We think of the expert as trying to maximize a reward function that is expressible as a linear combination of known features, and give an algorithm for learning the task demonstrated by the expert. Our algorithm is based on using "inverse reinforcement learning" to try to recover the unknown reward function. We show that our algorithm terminates in a small number of iterations, and that even though we may never recover the expert's reward function, the policy output by the algorithm will attain performance close to that of the expert, where here performance is measured with respect to the expert's unknown reward function.
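The feature-matching idea in this abstract can be sketched roughly as follows (a minimal illustration, not the paper's exact algorithm; `phi`, `mu_expert`, and the helper names are hypothetical):

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.9):
    """Empirical discounted feature expectations: mu = E[sum_t gamma^t * phi(s_t)]."""
    mu = np.zeros_like(phi(trajectories[0][0]))
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu = mu + (gamma ** t) * phi(s)
    return mu / len(trajectories)

def projection_step(mu_expert, mu_bar, mu_new):
    """One projection-style update: move mu_bar toward the expert's feature
    expectations along the direction set by the newest learner policy's features.
    The reward weights for the next iteration are then w = mu_expert - mu_bar."""
    d = mu_new - mu_bar
    alpha = np.clip(np.dot(mu_expert - mu_bar, d) / np.dot(d, d), 0.0, 1.0)
    return mu_bar + alpha * d
```

In the full algorithm each iteration would solve an MDP with reward w·phi(s), compute the new policy's feature expectations, and repeat until mu_bar is close to mu_expert.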
Programmable reinforcement learning agents
2001
Cited by 102 (1 self)

Abstract
We present an expressive agent design language for reinforcement learning that allows the user to constrain the policies considered by the learning process. The language includes standard features such as parameterized subroutines, temporary interrupts, aborts, and memory variables, but also allows for unspecified choices in the agent program. For learning that which isn’t specified, we present provably convergent learning algorithms. We demonstrate by example that agent programs written in the language are concise as well as modular. This facilitates state abstraction and the transferability of learned skills.
Maximum entropy inverse reinforcement learning
In Proc. AAAI, 2008
Cited by 67 (15 self)

Abstract
Recent research has shown the benefit of framing problems of imitation learning as solutions to Markov Decision Problems. This approach reduces learning to the problem of recovering a utility function that makes the behavior induced by a near-optimal policy closely mimic demonstrated behavior. In this work, we develop a probabilistic approach based on the principle of maximum entropy. Our approach provides a well-defined, globally normalized distribution over decision sequences, while providing the same performance guarantees as existing methods. We develop our technique in the context of modeling real-world navigation and driving behaviors where collected data is inherently noisy and imperfect. Our probabilistic approach enables modeling of route preferences as well as a powerful new approach to inferring destinations and routes based on partial trajectories.
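The globally normalized distribution this abstract describes can be illustrated on a small enumerable set of trajectories (a sketch under the assumption that per-trajectory features `traj_features` are precomputed; in practice the normalization is computed by dynamic programming, not enumeration):

```python
import numpy as np

def maxent_trajectory_probs(traj_features, theta):
    """P(tau) proportional to exp(theta . f_tau), normalized over all trajectories."""
    scores = traj_features @ theta
    scores = scores - scores.max()      # stabilize the exponentials
    p = np.exp(scores)
    return p / p.sum()

def maxent_loglik_gradient(traj_features, theta, empirical_f):
    """Gradient of the demonstration log-likelihood: empirical feature
    counts minus expected feature counts under the model."""
    p = maxent_trajectory_probs(traj_features, theta)
    return empirical_f - p @ traj_features
```

Following this gradient drives the model's expected features toward the demonstrated ones while keeping the distribution as uncommitted as possible.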
Learning an Agent's Utility Function by Observing Behavior
In Proc. of the 18th Int’l Conf. on Machine Learning, 2001
Cited by 54 (0 self)

Abstract
This paper considers the task of predicting the future decisions of an agent A based on his past decisions. We assume that A is rational: he uses the principle of maximum expected utility. We also assume that the probability distribution P he assigns to random events is known, so that we need only infer his utility function u to model his decision process. We consider the task of using A's previous decisions to learn about u. In particular, A's past decisions can be viewed as constraints on u. If we have a prior probability distribution p(u) over u (e.g., learned from a set of utility functions in the population), we can then condition on these constraints to obtain a posterior distribution q(u). We present an efficient Markov Chain Monte Carlo scheme to generate samples from q(u), which can be used to estimate not only a single "expected" course of action for A, but a distribution over possible courses of action. We show that this capability is particularly useful in a two-player setting where a second learning agent is trying to optimize her own payoff, which also depends on A's actions and utilities.
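The sampling scheme can be sketched with a generic Metropolis-Hastings loop over a one-dimensional utility parameter (purely illustrative; the paper's actual chain operates over constrained utility functions, and the `log_prior`/`log_likelihood` callables here are hypothetical):

```python
import numpy as np

def mh_sample_utility(log_prior, log_likelihood, u0=0.0, n=8000, step=0.5, seed=0):
    """Metropolis-Hastings sampling from q(u) proportional to p(u) * L(u)."""
    rng = np.random.default_rng(seed)
    samples, u = [], u0
    lp = log_prior(u) + log_likelihood(u)
    for _ in range(n):
        u_new = u + step * rng.standard_normal()   # symmetric random-walk proposal
        lp_new = log_prior(u_new) + log_likelihood(u_new)
        if np.log(rng.random()) < lp_new - lp:     # accept with prob min(1, ratio)
            u, lp = u_new, lp_new
        samples.append(u)
    return np.array(samples)
```

With a standard normal prior and a flat likelihood, the chain's samples recover the prior, which is a convenient sanity check before adding real demonstration constraints.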
Bayesian inverse reinforcement learning
In 20th Int. Joint Conf. Artificial Intelligence, 2007
Cited by 53 (0 self)

Abstract
Inverse Reinforcement Learning (IRL) is the problem of learning the reward function underlying a Markov Decision Process given the dynamics of the system and the behaviour of an expert. IRL is motivated by situations where knowledge of the rewards is a goal by itself (as in preference elicitation) and by the task of apprenticeship learning (learning policies from an expert). In this paper we show how to combine prior knowledge and evidence from the expert’s actions to derive a probability distribution over the space of reward functions. We present efficient algorithms that find solutions for the reward learning and apprenticeship learning tasks that generalize well over these distributions. Experimental results show strong improvement for our methods over previous heuristic-based approaches.
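Over a finite set of candidate rewards, combining a prior with evidence from the expert's actions reduces to Bayes' rule with a softmax-rational action model (a sketch only; the rationality parameter `beta` and the precomputed `q_values` are assumptions, and the paper itself works over continuous reward spaces with MCMC):

```python
import numpy as np

def birl_posterior(q_values, demos, log_prior, beta=1.0):
    """Posterior over candidate rewards given state-action demonstrations.

    q_values: (n_rewards, n_states, n_actions) Q-values under each candidate.
    The expert is modeled as softmax-rational: P(a | s, R) ~ exp(beta * Q_R(s, a)).
    """
    log_post = np.asarray(log_prior, dtype=float).copy()
    for s, a in demos:
        q = beta * q_values[:, s, :]
        lse = q.max(axis=1) + np.log(
            np.exp(q - q.max(axis=1, keepdims=True)).sum(axis=1))
        log_post += q[:, a] - lse      # log-likelihood of the observed action
    log_post -= log_post.max()         # normalize stably
    p = np.exp(log_post)
    return p / p.sum()
```

Candidates whose induced Q-values rank the demonstrated actions highly gain posterior mass; the prior breaks ties among rewards that explain the data equally well.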
Accelerating Reinforcement Learning through Implicit Imitation
Journal of Artificial Intelligence Research, 2003
Cited by 51 (0 self)

Abstract
Imitation can be viewed as a means of enhancing learning in multiagent environments. It augments ...
Action understanding as inverse planning
Cognition, 2009
Cited by 46 (5 self)

Abstract
Humans are adept at inferring the mental states underlying other agents’ actions, such as goals, beliefs, desires, emotions and other thoughts. We propose a computational framework based on Bayesian inverse planning for modeling human action understanding. The framework represents an intuitive theory of intentional agents’ behavior based on the principle of rationality: the expectation that agents will plan approximately rationally to achieve their goals, given their beliefs about the world. The mental states that caused an agent’s behavior are inferred by inverting this model of rational planning using Bayesian inference, integrating the likelihood of the observed actions with the prior over mental states. This approach formalizes in precise probabilistic terms the essence of previous qualitative approaches to action understanding based on an “intentional stance” (Dennett, 1987) or a “teleological stance” (Gergely et al., 1995). In three psychophysical experiments using animated stimuli of agents moving in simple mazes, we assess how well different inverse planning models based on different goal priors can predict human goal inferences. The results provide quantitative evidence for an approximately rational inference mechanism in human goal inference within our simplified stimulus paradigm, and for the flexible nature of goal representations that human observers can adopt. We discuss the implications of our experimental results for human action understanding in real-world contexts, and suggest how our framework might be extended to capture other kinds of mental state inferences, such as inferences about beliefs, or inferring whether an entity is an intentional agent.
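The inversion step the abstract describes is ordinary Bayes' rule over a discrete set of candidate goals, assuming a planner supplies P(actions | goal) for each goal (a sketch; all names are hypothetical):

```python
import numpy as np

def goal_posterior(action_loglik, goal_prior):
    """P(goal | actions) proportional to P(actions | goal) * P(goal).

    action_loglik: log-likelihood of the observed action sequence under a
    (near-)rational planner, one entry per candidate goal."""
    log_post = action_loglik + np.log(goal_prior)
    log_post = log_post - log_post.max()   # avoid underflow before exponentiating
    p = np.exp(log_post)
    return p / p.sum()
```

The interesting modeling work lives in the likelihood: an approximately rational planner makes actions that move efficiently toward a goal much more probable under that goal than under the alternatives.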
Apprenticeship learning using inverse reinforcement learning and gradient methods
In Proc. UAI, 2007
Cited by 44 (1 self)

Abstract
In this paper we propose a novel gradient algorithm to learn a policy from an expert’s observed behavior assuming that the expert behaves optimally with respect to some unknown reward function of a Markovian Decision Problem. The algorithm’s aim is to find a reward function such that the resulting optimal policy closely matches the expert’s observed behavior. The main difficulty is that the mapping from the parameters to policies is both nonsmooth and highly redundant. Resorting to subdifferentials solves the first difficulty, while the second one is overcome by computing natural gradients. We tested the proposed method in two artificial domains and found it to be more reliable and efficient than some previous methods.
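The natural-gradient part of the update amounts to preconditioning the plain gradient by the inverse Fisher information matrix (a generic sketch, not the paper's full algorithm; the `fisher` matrix and `damping` term are assumptions):

```python
import numpy as np

def natural_gradient_step(theta, grad, fisher, lr=0.1, damping=1e-3):
    """theta <- theta + lr * F^{-1} grad, with damping for numerical stability."""
    # Solving the linear system avoids explicitly inverting the Fisher matrix.
    ng = np.linalg.solve(fisher + damping * np.eye(len(theta)), grad)
    return theta + lr * ng
```

Preconditioning this way makes the step invariant to reparameterizations of the reward, which helps precisely because the parameter-to-policy mapping is highly redundant.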
Learning to Search: Functional Gradient Techniques for Imitation Learning
Autonomous Robots, 2009
Cited by 44 (18 self)

Abstract
Programming robot behavior remains a challenging task. While it is often easy to abstractly define or even demonstrate a desired behavior, designing a controller that embodies the same behavior is difficult, time consuming, and ultimately expensive. The machine learning paradigm offers the promise of enabling “programming by demonstration” for developing high-performance robotic systems. Unfortunately, many “behavioral cloning” (Bain & Sammut, 1995; Pomerleau, 1989; LeCun et al., 2006) approaches that utilize classical tools of supervised learning (e.g. decision trees, neural networks, or support vector machines) do not fit the needs of modern robotic systems. These systems are often built atop sophisticated planning algorithms that efficiently reason far into the future; consequently, ignoring these planning algorithms in favor of a supervised learning approach often leads to myopic and poor-quality robot performance. While planning algorithms have shown success in many real-world applications ranging from legged locomotion (Chestnutt et al., 2003) to outdoor unstructured navigation (Kelly et al., 2004; Stentz, 2009), such algorithms rely on fully specified cost functions that map sensor readings and environment models to quantifiable costs. Such cost functions are usually manually designed and programmed. Recently, a set of techniques has been developed that explore learning these functions from expert human demonstration.
Learning for Control from Multiple Demonstrations
Cited by 41 (6 self)

Abstract
We consider the problem of learning to follow a desired trajectory when given a small number of demonstrations from a suboptimal expert. We present an algorithm that (i) extracts the—initially unknown—desired trajectory from the suboptimal expert’s demonstrations and (ii) learns a local model suitable for control along the learned trajectory. We apply our algorithm to the problem of autonomous helicopter flight. In all cases, the autonomous helicopter’s performance exceeds that of our expert helicopter pilot’s demonstrations. Even stronger, our results significantly extend the state-of-the-art in autonomous helicopter aerobatics. In particular, our results include the first autonomous tic-tocs, loops and hurricane, vastly superior performance on previously performed aerobatic maneuvers (such as in-place flips and rolls), and a complete airshow, which requires autonomous transitions between these and various other maneuvers.