## Module-Based Reinforcement Learning: Experiments with a Real Robot

### BibTeX

```bibtex
@MISC{Kalmar_module-basedreinforcement,
  author = {Zs. Kalm{\'a}r and Cs. Szepesv{\'a}ri and A. L{\H{o}}rincz},
  title  = {Module-Based Reinforcement Learning: Experiments with a Real Robot},
  year   = {}
}
```

### Abstract

The behavior of reinforcement learning (RL) algorithms is best understood in com…

### Citations

533 | Learning to act using real-time dynamic programming
- Barto, Bradtke, et al.
- 1993
Citation Context: ...v being initialized to V_t at the beginning of the loop and letting V_{t+1} = v at the end of the loop. Algorithms where the value of the actual state is updated are called "real time" (Barto, Bradtke, and Singh, 1995). If, in each step, all the states are updated (F_t = S), and the inner loop is run until convergence is reached, the resulting algorithm will be called Adaptive Dynamic Programming (ADP). Another pop... |
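The context above distinguishes "real-time" backups (update only the visited state) from full-sweep ADP (update every state until convergence). A minimal sketch of both patterns, on an invented MDP in the cost-minimizing convention; all numbers and names here are illustrative, not from the paper:

```python
import numpy as np

# Invented 3-state, 2-action MDP (illustration only; not from the paper).
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
c = rng.uniform(size=(n_states, n_actions))                       # immediate costs
gamma = 0.9
V = np.zeros(n_states)

def bellman_backup(s, V):
    """One DP backup of state s (cost-minimizing Bellman operator)."""
    return min(c[s, a] + gamma * P[s, a] @ V for a in range(n_actions))

# "Real-time" style: back up only the state actually visited.
V[1] = bellman_backup(1, V)

# ADP style: sweep every state (F_t = S) until the inner loop converges.
for _ in range(1000):
    V_new = np.array([bellman_backup(s, V) for s in range(n_states)])
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
```

After the sweep loop stops, `V` is (numerically) a fixed point of the backup operator.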

443 | Planning in a hierarchy of abstraction spaces - Sacerdoti - 1974 |

356 | Generalization in reinforcement learning: Successful examples using sparse coarse coding
- Sutton
- 1996
Citation Context: ...ated to the learner in each time step. A few years ago Markovian Decision Problems (MDPs) were proposed as the model for the analysis of RL (Werbos, 1977; Sutton, 1984) and since then, a mathematically well-founded theory has been constructed for a large class of RL algorithms. These algorithms are based on modifications of the two basic dynamic-programming alg... |

337 | Automatic programming of behavior-based robots using reinforcement learning - Mahadevan, Connell - 1991 |

320 | A qualitative physics based on confluences - Kleer, Brown - 1984 |

241 | Reinforcement learning with hierarchies of machines - Parr, Russell - 1997 |

216 |
Applied Probability Models with Optimization Applications
- Ross
- 1970
Citation Context: ...the presence of the minimization operator), also known as the Bellman equations (Bellman, 1957), can be solved by various dynamic-programming methods such as the value- or policy-iteration methods (Ross, 1970). RL algorithms are generalizations of the DP methods to the case when the transition probabilities and immediate costs are unknown. The class of RL algorithms of interest here can be viewed as va... |

210 | On the Convergence of Stochastic Iterative Dynamic Programming Algorithms - Jaakkola, Jordan, et al. - 1994 |

209 | Learning to Coordinate Behaviors
- Maes, Brooks
- 1990
Citation Context: ...at Tyrrell's findings could be debated (Maes, 1991a). In another work of hers, she also proposed the learning of links between the modules (Maes, 1992) and she also tried out this on a real robot (Maes and Brooks, 1990). Yet another main direction of automatic robot programming research uses genetic algorithms to find good robotic programs (see e.g. Brooks, 1991a; Koza and Rice, 1992) in a space of possible progra... |

195 | Reinforcement learning with perceptual aliasing: the perceptual distinction approach - Chrisman - 1992 |

177 | Algorithms for sequential decision making
- LITTMAN
- 1996
Citation Context: ...has been found to deal with such involved problems. In fact the abovementioned learning problem is indeed intractable moving to partial observability. This result follows from a theorem of Littman's (Littman, 1996). One of the most promising approaches, originally suggested to deal with large, but observable problems, is based on the idea of decomposing the task into smaller subtasks. This very basic id... |

143 |
Computational mechanisms for action selection
- Tyrrell
- 1993
Citation Context: ...y Maes could not work well compared to other action-selection mechanisms. Moreover, he added that there might be theoretical reasons behind this failure (Tyrrell, 1993). Maes later pointed out that Tyrrell's findings could be debated (Maes, 1991a). In another work of hers, she also proposed the learning of links between the modules (Maes, 1992) and she also trie... |

135 | Feature-based methods for large scale dynamic programming - Tsitsiklis, Roy - 1996 |

128 | Learning without state-estimation in partially observable markovian decision processes - Singh, Jaakkola, et al. - 1994 |

123 | Robot shaping: Developing autonomous agents through learning - Dorigo, Colombetti - 1994 |

123 |
The dynamics of action selection
- Maes
- 1989
Citation Context: ...other action-selection mechanisms. Moreover, he added that there might be theoretical reasons behind this failure (Tyrrell, 1993). Maes later pointed out that Tyrrell's findings could be debated (Maes, 1991a). In another work of hers, she also proposed the learning of links between the modules (Maes, 1992) and she also tried out this on a real robot (Maes and Brooks, 1990). Yet another main direction... |

104 | Purposive Behavior Acquisition for a Real Robot by Vision-Based Reinforcement Learning - Asada, Noda, et al. - 1996 |

65 | Reinforcement Learning with a Hierarchy of Abstract Models
- Singh
- 1992
Citation Context: ...up, Sutton, and Singh, 1997). On the other hand, the learning of macro-operators, which can be interpreted as learning control modules, has a longer history (Korf, 1985b; Mahadevan and Connell, 1992; Singh, 1992). The third aspect is the learning of the switching of the particular controllers, which has been studied among others by Singh (1992) and more recently by Parr and Russell (1997) in hierarchical ... |

62 | How to dynamically merge Markov decision processes - Singh, Cohn - 1998 |

48 | Modeling agents as qualitative decision makers - Brafman, Tennenholtz - 1997 |

40 | A survey of some results in stochastic adaptive control - Kumar - 1985 |

28 | Behavior Analysis and Training: A Methodology for Behavior Engineering - Colombetti, Dorigo, et al. - 1996 |

26 |
A generalized reinforcement learning model: Convergence and applications
- Littman, Szepesvári
- 1996
Citation Context: ...ations of the two basic dynamic-programming algorithms used to solve MDPs, namely the value- and policy-iteration algorithms (Watkins and Dayan, 1992; Jaakkola, Jordan, and Singh, 1994; Littman and Szepesvári, 1996; Tsitsiklis and Van Roy, 1996; Sutton, 1996). The RL algorithms learn via experience, gradually building an estimate of the optimal value function, which is known to encompass all the knowledge n... |
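The experience-driven estimation described in this context can be illustrated by a single tabular Q-learning backup. This is the textbook cost-minimizing form, not anything specific to this paper; the function name and step-size values are my own:

```python
import numpy as np

def q_learning_update(Q, s, a, cost, s_next, alpha=0.1, gamma=0.9):
    """One experience-driven backup: move Q(s, a) toward the sampled
    one-step target  cost + gamma * min_a' Q(s', a')  (cost minimization)."""
    target = cost + gamma * Q[s_next].min()
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

Q = np.zeros((2, 2))   # two states, two actions, all estimates start at zero
Q = q_learning_update(Q, s=0, a=1, cost=1.0, s_next=1)
```

With all estimates at zero, the single update moves Q(0, 1) a fraction `alpha` toward the observed cost.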

24 | An analysis of temporal difference learning with function approximation - Tsitsiklis, Roy - 1997 |

20 | Planning with closed-loop macro actions - Precup, Sutton, et al. - 1997 |

20 |
Qualitative system identification: deriving structure from behavior
- Say, Kuru
- 1996
Citation Context: ...s description of the problem seems to be suitable if one wanted to automate this step using a planner. Qualitative modelling has a long tradition in artificial intelligence (de Kleer and Brown, 1984; Say and Kuru, 1996; Brafman and Tennenholtz, 1997). Nevertheless, regardless of what representation and method is used, we end up with a set of macro-actions and their associated subgoals. In the next step the designer... |

15 | Module Based Reinforcement Learning for a Real Robot - Kalmar, Szepesvari, et al. |

14 |
The loss from imperfect value functions in expectation-based and minimax-based tasks
- Heger
- 1996
Citation Context: ...owing: If there exists a controller for the above accessibility decision problem where the cumulated worst-case reward is non-zero (i.e. during evaluations only worst-case transitions are considered (Heger, 1996)) then there exists a switching controller that can solve the original problem (Lygeros, Godbole, and Sastry, 1997). Such a controller will be called proper in the worst-case sense. The reverse o... |

14 |
Vector-valued dynamic programming
- Henig
- 1983
Citation Context: ...has precedence over the subgoal of feeding which again has precedence over chasing moving objects. In MDPs such precedence relations can be captured by certain vector-valued evaluation functions (Henig, 1983) and also RL algorithms can be derived which take into account the predefined precedences (Gábor, Kalmár, and Szepesvári, 1998). Our module concept (operating conditions together with controllers) ... |

5 | Generalized dynamic concept model as a route to construct adaptive autonomous agents. Neural Network World; 5 - Kalmar, Lorincz - 1995 |

3 |
On the convergence of single-step on-policy reinforcement-learning algorithms
- Szepesvari
- 1997
Citation Context: ...visited before time t increased by 1), then, on the one hand, sufficient exploration is ensured while on the other, the whole process eventually converges to optimality (Singh, Jaakkola, Littman, and Szepesvári, 1997; Szepesvári, 1997b). The most common form of randomized action selection is called "Boltzmann exploration", where the probability of choosing action a in state s equals e^{Q_t(s,a)/T(s,t)}, where T(s... |
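The Boltzmann rule quoted in this context, p(a) ∝ e^{Q_t(s,a)/T(s,t)}, can be sketched as follows. The max-shift is a standard numerical guard I have added; it does not change the resulting probabilities:

```python
import numpy as np

def boltzmann_probs(q, temperature):
    """Action probabilities proportional to exp(Q_t(s, a) / T(s, t)),
    normalized over the actions available in state s."""
    z = np.asarray(q, dtype=float) / temperature
    z -= z.max()          # shift for numerical stability; cancels on normalization
    p = np.exp(z)
    return p / p.sum()

hot = boltzmann_probs([1.0, 2.0, 3.0], temperature=100.0)   # near-uniform
cold = boltzmann_probs([1.0, 2.0, 3.0], temperature=0.01)   # nearly greedy
```

High temperatures give near-uniform exploration; as T(s, t) is cooled, selection concentrates on the highest-valued action.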

2 |
Sixth European Workshop on Learning Robots
- Birk, Demiris
- 1998
Citation Context: ...completely predictable, e.g. a grasped ball may easily slip out from the gripper. Note also that the task is quite complex compared to the tasks considered in the mobile learning literature (see e.g. Birk and Demiris, 1998). 3.2. The modules 3.2.1. Subtask decomposition Firstly, according to the principles laid down in Section 2, the task was decomposed into subtasks. The following subtasks emerged naturally (see Figu... |

2 | Generalization in an autonomous agent - Lőrincz - 1994 |

1 | Dynamic Programming - Bellman - 1957 |

1 | A unified framework for hybrid control: Background, model, and theory. Technical report LIDS-P-2239, Laboratory for Information and Decision Systems, MIT, 77 Massachusetts Avenue - Branicky, et al. - 1994 |

1 | Artificial life and real robots - Brooks - 1991 |

1 |
Alecsys and the AutonoMouse: Learning to control a real robot by distributed classifier systems
- Dorigo
- 1995
Citation Context: ...ams (see e.g. Brooks, 1991a; Koza and Rice, 1992) in a space of possible programs. Alternatively basic behaviors, including their coordination, can be learned by using a classifier systems' approach (Dorigo, 1995) and genetic algorithm. Like us, Dorigo also emphasized that design and learning should be well balanced and outlined a general "methodology for behavior engineering" (Colombetti, Dorigo, and Bo... |

1 | Complexity analysis of real-time reinforcement learning applied to finding shortest paths in deterministic domains - Koenig, Simmons - 1985 |

1 | Macro-operators: A weak method for learning - Korf - 1985 |

1 | Planning as search: A quantitative approach - Korf - 1987 |

1 | A design framework for hierarchical, hybrid control - Lygeros, Godbole, et al. - 1997 |

1 |
Learning behavior networks from experience. In Toward a Practice of Autonomous Systems
- Maes
- 1992
Citation Context: ...te s is defined as the total expected discounted cost of executing the action from the given state and proceeding in an optimal fashion afterwards: Q*(s, a) = c(s, a) + γ Σ_{s'} p(s, a, s') V*(s'). (1) The general structure of value-function-approximation based RL algorithms is given in Table 1. In the RL algorithms, various models are utilized along with an update rule F_t and action-selection ... |
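Equation (1) in this context defines Q* pointwise from V*. A small sketch with invented two-state, two-action numbers (nothing here comes from the paper):

```python
import numpy as np

def q_from_v(c, P, V, gamma):
    """Equation (1): Q*(s, a) = c(s, a) + gamma * sum_{s'} p(s, a, s') V*(s')."""
    return c + gamma * P @ V      # (S, A) + (S, A, S) @ (S,) -> (S, A)

# Invented example data: c[s, a] costs, P[s, a, s'] transitions, V*[s] values.
c = np.array([[1.0, 0.5],
              [0.2, 0.9]])
P = np.array([[[1.0, 0.0], [0.5, 0.5]],
              [[0.0, 1.0], [1.0, 0.0]]])
V = np.array([2.0, 3.0])
Q = q_from_v(c, P, V, gamma=0.9)  # e.g. Q[0, 0] = 1.0 + 0.9 * 2.0 = 2.8
```

The matrix product over the last axis of `P` performs the sum over successor states s'.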

1 | Reinforcement learning in the multi-robot domain - tvlatarit - 1997 |

1 |
Algorithms for design of hybrid systems
- Sastry
- 1997
Citation Context: ...s non-zero (i.e. during evaluations only worst-case transitions are considered (Heger, 1996)) then there exists a switching controller that can solve the original problem (Lygeros, Godbole, and Sastry, 1997). Such a controller will be called proper in the worst-case sense. The reverse of the above implication is not necessarily true and this makes the analysis somewhat limited. Besides this, the exact... |

1 |
Dynamic Concept Model learns optimal policies
- Szepesvari
- 1994
Citation Context: ...representation on the top of an MDP representation (Kalmár, Szepesvári, and Lőrincz, 1994, 1995). This algorithm relies on a 'triplet' representation of MDPs (see Szepesvári and Lőrincz, 1994; Szepesvári, 1994) where transitions are represented and evaluated instead of state-action pairs or states. Transitions are then interpreted as rules that apply to specific situations and are combined to get new, more ... |

1 | Generalized Markov Decision Processes: Dynamic programming and reinforcement learning algorithms. Neural Computation - Szepesvári - 1997 |

1 | Behavior of an adaptive self-organizing autonomous agent working with cues and competing concepts - Szepesvári, Lőrincz - 1994 |

1 |
The role of exploration in learning control. Van Nostrand Reinhold
- Thrun
- 1992
Citation Context: ...e schedule performs significantly better than the others with constant learning rates. We have also tested another exploration strategy which Thrun found the best among several undirected methods (Thrun, 1992). These runs reinforced our previous findings that estimating a model (i.e. running ADP or ARTDP instead of Q-learning) could reduce the regret rate by as much as 50%. ... |

1 | Asynchronous stochastic approximation and q-Iearning - Tsitsiklis - 1994 |

1 | Q-learning using modular reinforcement learning - 1996 |