## Accelerating Reinforcement Learning through Implicit Imitation (2003)

### Download Links

- [www.cs.toronto.edu]
- [www.cs.cmu.edu]
- [jair.org]
- [www.jair.org]
- [www.cs.washington.edu]
- [www.aaai.org]
- DBLP

### Other Repositories/Bibliography

Venue: Journal of Artificial Intelligence Research

Citations: 59 (0 self)

### BibTeX

@ARTICLE{Price03acceleratingreinforcement,
    author = {Bob Price and Craig Boutilier},
    title = {Accelerating Reinforcement Learning through Implicit Imitation},
    journal = {Journal of Artificial Intelligence Research},
    year = {2003},
    volume = {19},
    pages = {569--629}
}

### Abstract

Imitation can be viewed as a means of enhancing learning in multiagent environments. It augments

### Citations

4013 | Reinforcement Learning: an Introduction - Sutton, Barto - 1998 |

2784 | Dynamic Programming - Bellman - 1957 |

Citation Context: ...(Puterman, 1994). The (optimal) value of a state V∗(s) is its value Vπ∗(s) under any optimal policy π∗. By solving an MDP, we refer to the problem of constructing an optimal policy. Value iteration (Bellman, 1957) is a simple iterative approximation algorithm for optimal policy construction. Given some arbitrary estimate V^0 of the true value function V∗, we iteratively improve this estimate as follows: V n ∑...

1357 | Reinforcement learning: a survey - Kaelbling, Littman - 1996 |

1276 | Learning to predict by the methods of temporal differences - Sutton - 1988 |

Citation Context: ...del-free methods (those in which an estimate of the optimal value function or Q-function is learned directly, without recourse to a domain model) have attracted much attention. For example, TD-methods (Sutton, 1988) and Q-learning (Watkins & Dayan, 1992) have both proven to be among the more popular methods for reinforcement learning. Our methods can be modified to deal with model-free approaches, as we discuss...

1268 | Markov Decision Processes: Discrete Stochastic Dynamic Programming - Puterman - 1994 |

Citation Context: ...btained by executing π beginning at s. A policy π∗ is optimal if, for all s ∈ S and all policies π, we have Vπ∗(s) ≥ Vπ(s). We are guaranteed that such optimal (stationary) policies exist in our setting (Puterman, 1994). The (optimal) value of a state V∗(s) is its value Vπ∗(s) under any optimal policy π∗. By solving an MDP, we refer to the problem of constructing an optimal policy. Value iteration (Bellman, 1957) ...

802 | Neuro-Dynamic Programming - Bertsekas, Tsitsiklis - 1996 |

664 | Game Theory: Analysis of Conflict - Myerson - 1991 |

561 | Theory of Statistics - Schervish - 1995 |

Citation Context: ...tion over the parameters of the transition distribution Pr(s, a, ·), and then updates these with each experienced transition. For instance, we might assume a Dirichlet (Generalized Beta) distribution (DeGroot, 1975) with parameters n(s, a, t) associated with each possible successor state t. The Dirichlet parameters are equal to the experience-based counts C(s, a, t) plus a “prior count” P(s, a, t) representing th...

514 | Markov games as a framework for multiagent reinforcement learning - Littman - 1994 |

Citation Context: ...ng to multiagent systems offers unique opportunities and challenges. When agents are viewed as independently trying to achieve their own ends, interesting issues in the interaction of agent policies (Littman, 1994) must be resolved (e.g., by appeal to equilibrium concepts). However, the fact that agents may share information for mutual gain (Tan, 1993) or distribute their search for optimal policies and commun...

392 | Dynamic Programming: Deterministic and Stochastic Models - Bertsekas - 1987 |

328 | Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning 13:103–130 - Moore, Atkeson - 1993 |

Citation Context: ...ty equivalence approach, backups are applied until convergence. Thus prioritized sweeping can be viewed as a specific form of asynchronous value iteration, and has appealing computational properties (Moore & Atkeson, 1993). Under certainty equivalence, the agent acts as if the current approximation of the model is correct, even though the model is likely to be inaccurate early in the learning process. If the optimal p...

325 | Learning from demonstration - Schaal - 1997 |

Citation Context: ...idual agent performance can be improved is by having a novice agent learn reasonable behavior from an expert mentor. This type of learning can be brought about through explicit teaching or demonstration (Atkeson & Schaal, 1997; Lin, 1992; Whitehead, 1991a), by sharing of privileged information (Mataric, 1998), or through an explicit cognitive representation of imitation (Bakker & Kuniyoshi, 1996). In imitation, the agent’s...

314 | Learning in Embedded Systems - Kaelbling - 1993 |

305 | The optimal control of partially observable Markov decision processes over a finite horizon - Smallwood, Sondik - 1973 |

Citation Context: ...n of MDPs, though these pose no special complications. Finally, we make the assumption of full observability. Partially observable MDPs (POMDPs) (Cassandra, Kaelbling, & Littman, 1994; Lovejoy, 1991; Smallwood & Sondik, 1973) are much more computationally demanding than fully observable MDPs. Our imitation model will be based on a fully observable model, though some of the generalizations of our model mentioned in the co...

290 | Multiagent reinforcement learning: Theoretical framework and an algorithm - Hu, Wellman - 1998 |

284 | Acting optimally in partially observable stochastic domains - Cassandra, Kaelbling, et al. - 1994 |

Citation Context: ...ure. We do not consider action costs in our formulation of MDPs, though these pose no special complications. Finally, we make the assumption of full observability. Partially observable MDPs (POMDPs) (Cassandra, Kaelbling, & Littman, 1994; Lovejoy, 1991; Smallwood & Sondik, 1973) are much more computationally demanding than fully observable MDPs. Our imitation model will be based on a fully observable model, though some of the general...

283 | Self-improving reactive agents based on reinforcement learning, planning, and teaching - Lin - 1992 |

Citation Context: ...can be improved is by having a novice agent learn reasonable behavior from an expert mentor. This type of learning can be brought about through explicit teaching or demonstration (Atkeson & Schaal, 1997; Lin, 1992; Whitehead, 1991a), by sharing of privileged information (Mataric, 1998), or through an explicit cognitive representation of imitation (Bakker & Kuniyoshi, 1996). In imitation, the agent’s own explor...

268 | Multivariate observations - Seber - 1984 |

Citation Context: ..., which will then allow the agent to perform the required feasibility test. Our assumption is therefore self-correcting. We deal with the multivariate complications by performing the Bonferroni test (Seber, 1984), which has been shown to give good results in practice (Mi & Sampson, 1993), is efficient to compute, and is known to be robust to dependence between variables. A Bonferroni hypothesis test is obtai...

262 | Multi-agent reinforcement learning: Independent vs. cooperative agents - Tan - 1993 |

Citation Context: ...nteresting issues in the interaction of agent policies (Littman, 1994) must be resolved (e.g., by appeal to equilibrium concepts). However, the fact that agents may share information for mutual gain (Tan, 1993) or distribute their search for optimal policies and communicate reinforcement signals to one another (Mataric, 1998) offers intriguing possibilities for accelerating reinforcement learning and enhan...

245 | Near-optimal reinforcement learning in polynomial time - Kearns, Singh |

245 | Learning by watching: Extracting reusable task knowledge from visual observation of human performance - Kuniyoshi, Inaba, et al. - 1994 |

Citation Context: ... such as piloting aircraft (Sammut et al., 1992) and controlling loading cranes (Šuc & Bratko, 1997). Other researchers have investigated the use of imitation to simplify the programming of robots (Kuniyoshi, Inaba, & Inoue, 1994). The ability of imitation to transfer complex, nonlinear and dynamic behaviors from existing human agents makes it particularly attractive for control problems. 7. Extensions The model of implicit i...

245 | Stochastic games - Shapley - 1953 |

Citation Context: ...ount for the actions and objectives of multiple agents. In this section, we introduce a formal framework for studying implicit imitation. We begin by introducing a general model for stochastic games (Shapley, 1953; Myerson, 1991), and then impose various assumptions and restrictions on this general model that allow us to focus on the key aspects of implicit imitation. We note that the framework proposed here i...

217 | Algorithms for inverse reinforcement learning - Ng, Russell - 2000 |

186 | A survey of algorithmic methods for partially observed Markov decision processes - Lovejoy - 1991 |

Citation Context: ... our formulation of MDPs, though these pose no special complications. Finally, we make the assumption of full observability. Partially observable MDPs (POMDPs) (Cassandra, Kaelbling, & Littman, 1994; Lovejoy, 1991; Smallwood & Sondik, 1973) are much more computationally demanding than fully observable MDPs. Our imitation model will be based on a fully observable model, though some of the generalizations of our...

169 | Locally weighted learning for control - Atkeson, Moore, et al. - 1997 |

Citation Context: ...ction at points where no direct evidence has been received. Two important approaches are parameter-based models (e.g., neural networks) (Bertsekas & Tsitsiklis, 1996) and the memory-based approaches (Atkeson, Moore, & Schaal, 1997). In both these approaches, model-free learning is generally employed. That is, the agent keeps a value function but uses the environment as an implicit model to perform backups using the sampling di...

161 | Learning to predict by the methods of temporal differences - Sutton - 1988 |

152 | Decision theoretic planning: Structural assumptions and computational leverage - Boutilier, Dean, et al. - 1999 |

146 | Sequential optimality and coordination in multiagent systems - Boutilier - 1999 |

125 | Reinforcement Learning for Dynamic Channel Allocation - Singh, Bertsekas - 1996 |

112 | Learning to fly - Sammut, Hurst, et al. - 1992 |

Citation Context: ...mpt to directly infer a policy from its observations of mentor state-action pairs. This model has a conceptual simplicity and intuitive appeal, and forms the basis of the behavioral cloning paradigm (Sammut, Hurst, Kedzier, & Michie, 1992; Urbancic & Bratko, 1994). However, it assumes that the observer and mentor share the same reward function and action capabilities. It also assumes that complete and unambiguous trajectories (includi...

111 | Learning by imitation: A hierarchical approach - Byrne, Russon - 1998 |

Citation Context: ...apabilities and goals. Imitation can be further analyzed in terms of the type of correspondence demonstrated by the mentor’s behavior and the observer’s acquired behavior (Nehaniv & Dautenhahn, 1998; Byrne & Russon, 1998). Correspondence types are distinguished by level. At the action level, there is a correspondence between actions. At the program level, the actions may be completely different b...

107 | Model minimization in Markov decision processes - Dean, Givan - 1997 |

Citation Context: ...form of bisimulation of the type studied in automaton minimization (Hartmanis & Stearns, 1966; Lee & Yannakakis, 1992) and automatic abstraction methods developed for MDPs (Dearden & Boutilier, 1997; Dean & Givan, 1997). It is not hard to show (ignoring the presence of other agents) that the underlying system is Markovian with respect to the abstraction (or equivalently, w.r.t. Si) if condition 1 is met. The quantific...

105 | Algebraic Structure Theory of Sequential Machines - Hartmanis, Stearns - 1966 |

Citation Context: ...h state in an equivalence class has identical dynamics with respect to the abstraction induced by Li. This type of abstraction is a form of bisimulation of the type studied in automaton minimization (Hartmanis & Stearns, 1966; Lee & Yannakakis, 1992) and automatic abstraction methods developed for MDPs (Dearden & Boutilier, 1997; Dean & Givan, 1997). It is not hard to show (ignoring the presence of other agents) that the un...

86 | LEAP: A learning apprentice for VLSI design - Mitchell, Mahadevan, et al. - 1985 |

Citation Context: ...or and observer. A less direct form of teaching involves an observer extracting information from a mentor without the mentor making an explicit attempt to demonstrate a specific behavior of interest (Mitchell, Mahadevan, & Steinberg, 1985). In this paper we develop an imitation model we call implicit imitation that allows an agent to accelerate the reinforcement learning process through the observation of an expert mentor (or mentors)...

84 | Rational and convergent learning in stochastic games - Bowling, Veloso |

Citation Context: ...ial to accelerate learning. A general solution requires the integration of imitation into more general models for multiagent RL based on stochastic or Markov games (Littman, 1994; Hu & Wellman, 1998; Bowling & Veloso, 2001). This would no doubt be a rather challenging, yet rewarding endeavor. To take a simple example, in simple coordination problems (e.g., two mobile agents trying to avoid each other while carrying out...

83 | Elevator group control using multiple reinforcement learning agents - Crites, Barto - 1998 |

Citation Context: ...rned controller which has already adapted to its clients to a new learning controller with a completely different architecture. Many modern products such as elevator controllers (Crites & Barto, 1998), cell traffic routers (Singh & Bertsekas, 1997) and automotive fuel injection systems use adaptive controllers to optimize the performance of a system for specific user profiles. When upgrading the ...

80 | Robot see, robot do: An overview of robot imitation - Bakker, Kuniyoshi - 1996 |

Citation Context: ...it teaching or demonstration (Atkeson & Schaal, 1997; Lin, 1992; Whitehead, 1991a), by sharing of privileged information (Mataric, 1998), or through an explicit cognitive representation of imitation (Bakker & Kuniyoshi, 1996). In imitation, the agent’s own exploration is used to ground its observations of other agents’ behaviors in its own capabilities a...

67 | Abstraction and approximate decision theoretic planning - Dearden, Boutilier - 1997 |

Citation Context: ...s type of abstraction is a form of bisimulation of the type studied in automaton minimization (Hartmanis & Stearns, 1966; Lee & Yannakakis, 1992) and automatic abstraction methods developed for MDPs (Dearden & Boutilier, 1997; Dean & Givan, 1997). It is not hard to show (ignoring the presence of other agents) that the underlying system is Markovian with respect to the abstraction (or equivalently, w.r.t. Si) if condition 1 i...

57 | A complexity analysis of cooperative mechanisms in reinforcement learning - Whitehead - 1991 |

Citation Context: ...oved is by having a novice agent learn reasonable behavior from an expert mentor. This type of learning can be brought about through explicit teaching or demonstration (Atkeson & Schaal, 1997; Lin, 1992; Whitehead, 1991a), by sharing of privileged information (Mataric, 1998), or through an explicit cognitive representation of imitation (Bakker & Kuniyoshi, 1996). In imitation, the agent’s own exploration is used to ...

54 | Using communication to reduce locality in distributed multi-agent learning - Mataric - 1997 |

Citation Context: ...ibrium concepts). However, the fact that agents may share information for mutual gain (Tan, 1993) or distribute their search for optimal policies and communicate reinforcement signals to one another (Mataric, 1998) offers intriguing possibilities for accelerating reinforcement learning and enhancing agent performance. Another way in which individual agent performance can be improved is by having a novice agent...

54 | Two kinds of training information for evaluation function learning - Utgoff, Clouse - 1991 |

39 | Reconstructing human skill with machine learning - Urbancic, Bratko - 1994 |

38 | Learning to communicate through imitation in autonomous robots - Billard, Hayes - 1997 |

Citation Context: ...o be extended to learning of temporal sequences (Billard & Hayes, 1999). Associative learning has been used together with innate following behaviors to acquire navigation expertise from other agents (Billard & Hayes, 1997). A related but slightly different form of imitation has been studied in the multi-agent reinforcement learning community. An early precursor to imitation can be found in work on sharing of perception...

34 | Robot programming by demonstration (RPD): supporting the induction by human interaction - Friedrich, Münch, et al. - 1996 |

Citation Context: ...). Imitation techniques have been applied in a diverse collection of applications. Classical control applications include control systems for robot arms (Kuniyoshi et al., 1994; Friedrich, Munch, Dillmann, Bocionek, & Sassin, 1996), aeration plants (Scheffer et al., 1997), and container loading cranes (Šuc & Bratko, 1997; Urbancic & Bratko, 1994). Imitation learning has also been applied to acceleration of generic reinforcem...

30 | Mondrian: A Teachable Graphical Editor - Lieberman - 1993 |

Citation Context: ...ed to extract useful knowledge by watching users, but the goal of apprentices is not to independently solve problems. Learning apprentices are closely related to programming by demonstration systems (Lieberman, 1993). Later efforts used more sophisticated techniques to extract actions from visual perceptions and abstract these actions for future use (Kuniyoshi et al., 1994). Work on associ...

30 | Mapping between dissimilar bodies: Affordances and the algebraic foundations of imitation - Nehaniv, Dautenhahn - 1998 |

Citation Context: ... ways of defining similarity of ability, for example, by assuming that the observer may be able to move through state space in a similar fashion to the mentor without following the same trajectories (Nehaniv & Dautenhahn, 1998). For instance, the mentor may have a way of moving directly between key locations in state space, while the observer may be able to move between analogous locations in a less direct fashion. In such...

25 | Intelligent social learning - Conte, Paolucci - 2001 |

Citation Context: ...other animals, we know that socially facilitated learning is widespread throughout the animal kingdom. A number of researchers have pointed out, however, that social facilitation can take many forms (Conte, 2000; Noble & Todd, 1999). For instance, a mentor’s attention to an object can draw an observer’s attention to it and thereby lead the observer to manipulate the object independently of the model provided...

24 | Model-based Bayesian exploration - Dearden, Friedman, et al. - 1999 |

Citation Context: ...has been reached, then the estimate Pr(s, a, t) = C(s, a, t)/C(s, a). If C(s, a) = 0, some prior estimate is used (e.g., one might assume all state transitions are equiprobable). A Bayesian approach (Dearden, Friedman, & Andre, 1999) uses an explicit prior distribution over the parameters of the transition distribution Pr(s, a, ·), and then updates these with each experienced transition. For instance, we might assume a Dirichlet ...

22 | Behavior-based primitives for articulated control - Matarić, Williamson, et al. - 1998 |

Citation Context: ...select the action which is closest to the observed behavior (Demiris & Hayes, 1999). Explicit motor action schema have also been investigated in the dual role of perceptual and motor representations (Matarić, Williamson, Demiris, & Mohan, 1998). Imitation techniques have been applied in a diverse collection of applications. Classical control applications include control systems for robot arms (Kuniyoshi et al., 1994; ...

21 | Self-improvement based on reinforcement learning, planning and teaching - Lin - 1991 |

Citation Context: ...ursor to imitation can be found in work on sharing of perceptions between agents (Tan, 1993). Closer to imitation is the idea of replaying the perceptions and actions of one agent for a second agent (Lin, 1991; Whitehead, 1991a). Here, the transfer is from one agent to another, in contrast to behavioral cloning’s transfer from human to agent. The representation is also different. Reinforcement learning pro...