## Multiple model-based reinforcement learning (2002)

### Download Links

- [www.cns.atr.jp]
- [ece.ut.ac.ir]
- [www.isd.atr.co.jp]
- DBLP

### Other Repositories/Bibliography

Venue: Neural Computation

Citations: 49 (2 self)

### BibTeX

```bibtex
@ARTICLE{Doya02multiplemodel-based,
  author  = {Kenji Doya and Kazuyuki Samejima},
  title   = {Multiple model-based reinforcement learning},
  journal = {Neural Computation},
  year    = {2002},
  volume  = {14},
  pages   = {1347--1369}
}
```

### Abstract

We propose a modular reinforcement learning architecture for non-linear, non-stationary control tasks, which we call multiple model-based reinforcement learning (MMRL). The basic idea is to decompose a complex task into multiple domains in space and time, based on the predictability of the environmental dynamics. The system is composed of multiple modules, each of which consists of a state prediction model and a reinforcement learning controller. The "responsibility signal," given by the softmax function of the prediction errors, is used to weight the outputs of the multiple modules as well as to gate the learning of the prediction models and the reinforcement learning controllers. We formulate MMRL for both the discrete-time, finite-state case and the continuous-time, continuous-state case. The performance of MMRL is demonstrated in the discrete case on a non-stationary hunting task in a grid world, and in the continuous case on a non-linear, non-stationary control task of swinging up a pendulum with variable physical parameters.
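The responsibility-weighting idea from the abstract can be sketched in a few lines: compute a softmax over the (negative, scaled) squared prediction errors of the modules, then use the resulting weights both to blend control outputs and, in the full architecture, to gate learning. This is a minimal illustration, not the paper's implementation; the function names and the scaling parameter `sigma` are assumptions for the sketch.

```python
import numpy as np

def responsibility_signal(pred_errors, sigma=1.0):
    """Softmax over negative scaled squared prediction errors.

    pred_errors: per-module prediction errors ||x - x_hat_i||.
    Returns weights summing to 1; the module whose prediction model
    fits the current dynamics best (smallest error) gets the largest
    weight. `sigma` (illustrative, not from the paper) controls how
    sharply the softmax selects a single module.
    """
    logits = -np.asarray(pred_errors, dtype=float) ** 2 / (2.0 * sigma ** 2)
    logits -= logits.max()          # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def combined_action(pred_errors, module_actions, sigma=1.0):
    """Weight each module controller's output by its responsibility."""
    lam = responsibility_signal(pred_errors, sigma)
    return lam @ np.asarray(module_actions, dtype=float)
```

With errors `[0.1, 2.0]`, the first module dominates, so the combined action leans toward the first module's output.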

### Citations

1237 | Learning to predict by the methods of temporal differences - Sutton - 1988

Citation Context: ...ollers. The actual equation for the parameter update varies with the choice of the RL algorithms, which are detailed in the next section. When a temporal difference (TD) algorithm (Barto et al. 1983; Sutton 1988; Doya 2000) is used, the TD error, δ(t) = r(t) + γV(x(t+1)) − V(x(t)) (18) in the discrete case and δ(t) = r̂(t) − (1/τ)V(t) + V̇(t) (19) in the continuous case, is weighted by the responsibility sig...
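The two TD-error definitions quoted in the context above translate directly into one-line functions. A minimal sketch; the default values for γ and τ are illustrative, not from the paper.

```python
def td_error_discrete(r, v_now, v_next, gamma=0.95):
    """Discrete-time TD error: δ(t) = r(t) + γ V(x(t+1)) − V(x(t))."""
    return r + gamma * v_next - v_now

def td_error_continuous(r, v, v_dot, tau=1.0):
    """Continuous-time TD error: δ(t) = r̂(t) − (1/τ) V(t) + V̇(t)."""
    return r - v / tau + v_dot
```

In MMRL each module's TD error would then be multiplied by that module's responsibility signal before the parameter update.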

862 | Reinforcement learning - Sutton, Barto - 1998

Citation Context: ...ach state and then to improve the policy based on the value function. We define the value function of the state x(t) under the current policy as V(x(t)) = E[ Σ_{k=0}^∞ γ^k r(t+k) ] (5) in the discrete case (Sutton and Barto 1998) and V(x(t)) = E[ ∫_0^∞ e^{−s/τ} r(t+s) ds ] (6) in the continuous case (Doya 2000), where 0 ≤ γ ≤ 1 and 0 < τ are the parameters for discounting future reward. 2.1 Responsibility Signal: The purpos...
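The discrete-time value definition above can be checked numerically over a finite reward sequence. A sketch (the backward recursion is a standard trick, not taken from the paper): evaluating V = Σ_k γ^k r(t+k) from the last reward backward needs only one multiply-add per step.

```python
def discounted_return(rewards, gamma=0.9):
    """Evaluate V = Σ_{k=0}^{K-1} γ^k r(t+k) for a finite reward list,
    accumulating backwards so each step is g = r + γ·g."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For rewards [1, 1, 1] and γ = 0.5 this gives 1 + 0.5 + 0.25 = 1.75, matching the direct sum.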

780 | Adaptive mixtures of local experts - Jacobs, Jordan, et al. - 1991

Citation Context: ...architecture, a non-linear and/or non-stationary control task is decomposed in space and time based on the local predictability of the environmental dynamics. The “mixture of experts” architecture (Jacobs et al. 1991) has previously been applied to non-linear or non-stationary control tasks (Gomi and Kawato 1993; Cacciatore and Nowlan 1994). However, the success of such a modular architecture depends strongly on th...

479 | Neuronlike adaptive elements that can solve difficult learning control problems - Barto, Sutton, et al. - 1983

Citation Context: ...ing of the RL controllers. The actual equation for the parameter update varies with the choice of the RL algorithms, which are detailed in the next section. When a temporal difference (TD) algorithm (Barto et al. 1983; Sutton 1988; Doya 2000) is used, the TD error, δ(t) = r(t) + γV(x(t+1)) − V(x(t)) (18) in the discrete case and δ(t) = r̂(t) − (1/τ)V(t) + V̇(t) (19) in the continuous case, is weighted by the respo...

430 | Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning - Sutton, Precup, et al. - 1999

323 | Dynamic programming and optimal control - Bertsekas

Citation Context: ...arning and good generalization (Schaal and Atkeson 1996). Furthermore, if the reward function is locally approximated by a quadratic function, then we can use a linear quadratic controller (see, e.g., Bertsekas 1995) for the RL controller design. We use a local linear dynamic model ẋ̂_i(t) = A_i(x(t) − x_i^d) + B_i u(t) (39) and a local quadratic reward model r̂(x(t), u(t)) = r_i^0 − ½(x(t) − x_i^r)′ Q_i (x(t) − x_i^r)...
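Given a local linear model (A, B) and quadratic cost weights (Q, R) as in the context above, the linear quadratic controller gain can be found by iterating the Riccati equation. This is a discrete-time sketch for illustration (the paper's pendulum example is continuous-time), and the fixed iteration count is an assumption.

```python
import numpy as np

def lqr_gain(A, B, Q, R, iters=500):
    """Discrete-time Riccati iteration for the LQR gain K in u = -K x.

    A, B: local linear dynamics x' = A x + B u.
    Q, R: quadratic state / control cost weights.
    Iterates P = Q + A'P(A - BK) with K = (R + B'PB)^{-1} B'PA
    until (approximately) converged.
    """
    P = Q.astype(float)
    K = np.zeros((B.shape[1], A.shape[0]))
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K
```

For the scalar system A = B = Q = R = 1, the iteration converges to the well-known gain K = (√5 − 1)/2 ≈ 0.618.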

255 | Feudal reinforcement learning - Dayan, Hinton - 1993

Citation Context: ...d RL algorithms can perform badly when the environment is non-stationary or has hidden states. These problems have motivated the introduction of modular or hierarchical RL architectures (Singh 1992; Dayan and Hinton 1993; Littman et al. 1995; Wiering and Schmidhuber 1998; Parr and Russell 1998; Sutton et al. 1999; Morimoto and Doya 2001). The basic problem in modular or hierarchical RL is how to decompose a complex tas...

241 | Reinforcement learning with hierarchies of machines - Parr, Russell - 1997

Citation Context: ...or has hidden states. These problems have motivated the introduction of modular or hierarchical RL architectures (Singh 1992; Dayan and Hinton 1993; Littman et al. 1995; Wiering and Schmidhuber 1998; Parr and Russell 1998; Sutton et al. 1999; Morimoto and Doya 2001). The basic problem in modular or hierarchical RL is how to decompose a complex task into simpler subtasks. This paper presents a new RL architecture based ...

232 | Learning policies for partially observable environments: Scaling up - Littman, Cassandra, et al. - 1995

Citation Context: ...form badly when the environment is non-stationary or has hidden states. These problems have motivated the introduction of modular or hierarchical RL architectures (Singh 1992; Dayan and Hinton 1993; Littman et al. 1995; Wiering and Schmidhuber 1998; Parr and Russell 1998; Sutton et al. 1999; Morimoto and Doya 2001). The basic problem in modular or hierarchical RL is how to decompose a complex task into simpler subtas...

232 | Multiple Paired Forward and Inverse Models for Motor Control - Wolpert, Kawato - 1998

Citation Context: ...uence prediction. The use of the softmax function for module selection and combination was originally proposed for a tracking control paradigm as the “Multiple Paired Forward-Inverse Models (MPFIM)” (Wolpert and Kawato 1998; Wolpert et al. 1998; Haruno et al. 1999). It was recently reformulated as “MOdular Selection and Identification for Control (MOSAIC)” (Wolpert and Ghahramani 2000). In this paper, we apply the idea ...

162 | Transfer of learning by composing solutions of elemental sequential tasks - Singh - 1992

Citation Context: ...low. Standard RL algorithms can perform badly when the environment is non-stationary or has hidden states. These problems have motivated the introduction of modular or hierarchical RL architectures (Singh 1992; Dayan and Hinton 1993; Littman et al. 1995; Wiering and Schmidhuber 1998; Parr and Russell 1998; Sutton et al. 1999; Morimoto and Doya 2001). The basic problem in modular or hierarchical RL is how to...

114 | Reinforcement Learning in Continuous Time and Space - Doya

105 | MOSAIC model for sensorimotor learning and control - Haruno, Wolpert, et al. - 2001

101 | Computational principles of movement neuroscience - Wolpert, Ghahramani - 2000

Citation Context: ...Paired Forward-Inverse Models (MPFIM)” (Wolpert and Kawato 1998; Wolpert et al. 1998; Haruno et al. 1999). It was recently reformulated as “MOdular Selection and Identification for Control (MOSAIC)” (Wolpert and Ghahramani 2000). In this paper, we apply the idea of a softmax selection of modules to the paradigm of reinforcement learning. The resulting learning architecture, which we call “Multiple Model-based Reinforcement ...

96 | Learning to perceive the world as articulated: an approach for hierarchical learning in sensory-motor systems - Tani, Nolfi - 1999

67 | Annealed competition of experts for a segmentation and classification of switching dynamics - Pawelzik, Kohlmorgen, et al. - 1996

Citation Context: ... (26) Ė_i(t) = log α · E_i(t) + ‖ẋ(t) − ẋ̂_i(t)‖² (27). The use of these low-pass filtered prediction errors for responsibility prediction is helpful in avoiding chattering of the responsibility signal (Pawelzik et al. 1996). 2.3.2 Spatial locality: In the continuous case, we consider a Gaussian spatial prior λ̂_i(t) = exp(−½(x(t) − c_i)′ M_i^{−1} (x(t) − c_i)) / Σ_{j=1}^n exp(−½(x(t) − c_j)′ M_j^{−1} (x(t) − c_j)) (28), where c_i is t...
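The Gaussian spatial prior in equation (28) of the context above is itself a softmax over per-module Mahalanobis distances. A minimal sketch, assuming `centers` holds the module centers c_i and `covs` the matrices M_i:

```python
import numpy as np

def spatial_prior(x, centers, covs):
    """Normalized Gaussian responsibilities:
    λ̂_i ∝ exp(−½ (x − c_i)' M_i^{-1} (x − c_i)), normalized over modules.

    x: current state vector; centers: module centers c_i;
    covs: positive-definite matrices M_i (one per module).
    """
    x = np.asarray(x, dtype=float)
    logits = np.array([
        -0.5 * (x - c) @ np.linalg.solve(M, x - c)   # Mahalanobis distance
        for c, M in zip(centers, covs)
    ])
    logits -= logits.max()          # guard against underflow in exp
    w = np.exp(logits)
    return w / w.sum()
```

A state lying at a module's center gets that module the largest prior weight, which is how the prior localizes modules in state space.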

57 | Mixtures of controllers for jump linear and non-linear plants - Cacciatore, Nowlan - 1994

Citation Context: ...ictability of the environmental dynamics. The “mixture of experts” architecture (Jacobs et al. 1991) has previously been applied to non-linear or non-stationary control tasks (Gomi and Kawato 1993; Cacciatore and Nowlan 1994). However, the success of such a modular architecture depends strongly on the capability of the gating network to decide which of the given modules should be recruited at any particular moment. An alte...

40 | Human cerebellar activity reflecting an acquired internal model of a new tool - Imamizu, Miyauchi, et al. - 2000

Citation Context: ...ellum is activated initially, and then a smaller area remains active after long training. They proposed that such local activation spots are the neural correlates of internal models of tools (Imamizu et al. 2000). They also suggested that internal models of different tools are represented in separate areas in the cerebellum (Imamizu et al. 1997). Our simulation results in a non-stationary environment can pr...

37 | Adaptation and Learning Using Multiple Models, Switching and Tuning - Narendra, Balakrishnan, et al. - 1995

36 | Acquisition of standup behavior by a real robot using hierarchical reinforcement learning - Morimoto, Doya - 2001

Citation Context: ...motivated the introduction of modular or hierarchical RL architectures (Singh 1992; Dayan and Hinton 1993; Littman et al. 1995; Wiering and Schmidhuber 1998; Parr and Russell 1998; Sutton et al. 1999; Morimoto and Doya 2001). The basic problem in modular or hierarchical RL is how to decompose a complex task into simpler subtasks. This paper presents a new RL architecture based on multiple modules, each of which is compos...

27 | From isolation to cooperation: An alternative view of a system of experts - Schaal, Atkeson - 1996

Citation Context: ...se of linear models for the prediction models and the controllers is a reasonable choice because local linear models have been shown to have good properties of quick learning and good generalization (Schaal and Atkeson 1996). Furthermore, if the reward function is locally approximated by a quadratic function, then we can use a ...

21 | HQ-learning - Wiering, Schmidhuber - 1998

Citation Context: ...nvironment is non-stationary or has hidden states. These problems have motivated the introduction of modular or hierarchical RL architectures (Singh 1992; Dayan and Hinton 1993; Littman et al. 1995; Wiering and Schmidhuber 1998; Parr and Russell 1998; Sutton et al. 1999; Morimoto and Doya 2001). The basic problem in modular or hierarchical RL is how to decompose a complex task into simpler subtasks. This paper presents a new ...

21 | Improved switching among temporally abstract actions - Sutton, Singh, et al. - 1999

Citation Context: ...These problems have motivated the introduction of modular or hierarchical RL architectures (Singh 1992; Dayan and Hinton 1993; Littman et al. 1995; Wiering and Schmidhuber 1998; Parr and Russell 1998; Sutton et al. 1999). The basic problem in modular or hierarchical RL is how to decompose a complex task into different subtasks. This paper presents a new RL architecture based on multiple modules, each of which is com...

15 | Recognition of manipulated objects by motor learning with modular architecture networks - Gomi, Kawato - 1993

Citation Context: ...ased on the local predictability of the environmental dynamics. The “mixture of experts” architecture (Jacobs et al. 1991) has previously been applied to non-linear or non-stationary control tasks (Gomi and Kawato 1993; Cacciatore and Nowlan 1994). However, the success of such a modular architecture depends strongly on the capability of the gating network to decide which of the given modules should be recruited at an...

15 | Multiple paired forward-inverse models for human motor learning and control - Wolpert, M, et al. - 1999

Citation Context: ...ion for module selection and combination was originally proposed for a tracking control paradigm as the “Multiple Paired Forward-Inverse Models (MPFIM)” (Wolpert and Kawato 1998; Wolpert et al. 1998; Haruno et al. 1999). It was recently reformulated as “MOdular Selection and Identification for Control (MOSAIC)” (Wolpert and Ghahramani 2000). In this paper, we apply the idea of a softmax selection of modules to the ...

4 | Separated modules for visuomotor control and learning in the cerebellum: A functional MRI study - Imamizu, Miyauchi, et al. - 1997

Citation Context: ...spots are the neural correlates of internal models of tools (Imamizu et al. 2000). They also suggested that internal models of different tools are represented in separate areas in the cerebellum (Imamizu et al. 1997). Our simulation results in a non-stationary environment can provide a computational account of these fMRI data. When a new task is introduced, many modules initially compete to learn it. However, af...

