## Programmable reinforcement learning agents (2001)

Citations: 102 (1 self)

### BibTeX

@INPROCEEDINGS{Andre01programmablereinforcement,
    author = {David Andre and Stuart J. Russell},
    title = {Programmable reinforcement learning agents},
    booktitle = {},
    year = {2001},
    pages = {1019--1025},
    publisher = {MIT Press}
}

### Abstract

We present an expressive agent design language for reinforcement learning that allows the user to constrain the policies considered by the learning process. The language includes standard features such as parameterized subroutines, temporary interrupts, aborts, and memory variables, but also allows for unspecified choices in the agent program. For learning that which isn’t specified, we present provably convergent learning algorithms. We demonstrate by example that agent programs written in the language are concise as well as modular. This facilitates state abstraction and the transferability of learned skills.
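The idea of an agent program with unspecified choices can be illustrated with a small sketch. The code below is not the language described in the abstract and not the authors' learning algorithm; it is a toy Python analogue under stated assumptions: a hypothetical five-cell corridor task, a hand-written partial program (`options_at`) that fixes the move at the wall and leaves the move elsewhere open, and ordinary tabular Q-learning indexed by state and choice-point label to fill in the open choice. All names (`GOAL`, `options_at`, `step`, `greedy`, `learn`) are invented for illustration.

```python
import random
from collections import defaultdict

GOAL = 4  # hypothetical toy task: reach the right end of a 5-cell corridor


def options_at(pos):
    """Partial program: returns (choice-point label, allowed moves)."""
    if pos == 0:
        return "at_wall", ["right"]          # fully specified by the programmer
    return "in_corridor", ["left", "right"]  # left open for the learner


def step(pos, move):
    """Deterministic environment: returns (next position, reward, done)."""
    nxt = pos - 1 if move == "left" else pos + 1
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL


def greedy(q, pos, rng):
    """Pick the highest-valued allowed move, breaking ties at random."""
    label, opts = options_at(pos)
    best = max(q[(pos, label, o)] for o in opts)
    return rng.choice([o for o in opts if q[(pos, label, o)] == best])


def learn(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning restricted to the moves the partial program allows."""
    rng = random.Random(seed)
    q = defaultdict(float)
    for _ in range(episodes):
        pos = 0
        for _ in range(1000):  # safety cap on episode length
            label, opts = options_at(pos)
            if rng.random() < epsilon:
                move = rng.choice(opts)       # explore among allowed moves only
            else:
                move = greedy(q, pos, rng)
            nxt, reward, done = step(pos, move)
            if done:
                target = reward
            else:
                nlabel, nopts = options_at(nxt)
                target = reward + gamma * max(q[(nxt, nlabel, o)] for o in nopts)
            q[(pos, label, move)] += alpha * (target - q[(pos, label, move)])
            pos = nxt
            if done:
                break
    return q


q = learn()
policy = {pos: greedy(q, pos, random.Random(0)) for pos in range(GOAL)}
print(policy)
```

The learner never considers a move the partial program rules out, which is the sense in which the program constrains the policies under consideration; everything the program leaves open is resolved from experience.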

### Citations

1346 | Reinforcement Learning: A survey
- Kaelbling, Littman, et al.
- 1996
Citation Context ... The discount factor, β, is generalized to be a function, β(s, a), that represents the expected discount factor when action a is taken in state s. Our definitions follow those common in the literature [9, 6, 4]. The HAM language [8] provides for partial specification of agent programs. A HAM program consists of a set of partially specified Moore machines. Transitions in each machine may depend stochasticall...

690 | The Esterel synchronous programming language: Design, semantics, implementation
- Berry, Gonthier
- 1992
Citation Context ...ery important in physical behaviors, more so than in computation, and are crucial in allowing for modularity in behavioral descriptions. These features are all common in robot programming languages [2, 3, 5]; the key element of our approach is that behaviors need only be partially described; reinforcement learning does the rest. David was supported by the generosity of the Fannie and John Hertz Foundat...

444 | Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning
- Sutton, Precup, et al.
- 1999
Citation Context ... knowledge provides a partial description of desired behaviors. Several languages for partial descriptions have been proposed, including Hierarchical Abstract Machines (HAMs) [8], semi-Markov options [12], and the MAXQ framework [4]. This paper describes extensions to the HAM language that substantially increase its expressive power, using constructs borrowed from programming languages. Obviously, inc...

383 | Hierarchical reinforcement learning with the MAXQ value function decomposition
- Dietterich
- 2000
Citation Context ...ecomposition uses all value function components all the time to choose actions. In this sense, it differs from “delegation” techniques like feudal reinforcement learning (Dayan & Hinton, 1993), MAXQ (Dietterich, 2000), and the hierarchical abstract machines of Parr (1997) and Andre (2002). These methods also decompose an agent into subagents, but only one subagent (or one branch in a hierarchy of subagents) is us...

309 | The dynamics of reinforcement learning in cooperative multiagent systems
- Claus, Boutilier
- 1998

296 | On-Line Q-Learning Using Connectionist Systems
- Rummery, Niranjan
- 1994
Citation Context ... that chooses NoOp in state $S_0$. The principal result of the paper (Section 3.2) is the simple observation that global optimality is achieved by local reinforcement learning with the Sarsa algorithm (Rummery & Niranjan, 1994), provided that on each iteration the arbitrator communicates its decision to the subagents. This information allows the subagents to become realistic, rather than optimistic, about their own future ...

261 | Reward, motivation and reinforcement learning
- Dayan, Balleine
- 2002
Citation Context ...sition and exact updates. Q-decomposition uses all value function components all the time to choose actions. In this sense, it differs from “delegation” techniques like feudal reinforcement learning (Dayan & Hinton, 1993), MAXQ (Dietterich, 2000), and the hierarchical abstract machines of Parr (1997) and Andre (2002). These methods also decompose an agent into subagents, but only one subagent (or one branch in a hier...

247 | Reinforcement learning with hierarchies of machines
- Parr, Russell
- 1997
Citation Context ...ost obvious form of prior knowledge provides a partial description of desired behaviors. Several languages for partial descriptions have been proposed, including Hierarchical Abstract Machines (HAMs) [8], semi-Markov options [12], and the MAXQ framework [4]. This paper describes extensions to the HAM language that substantially increase its expressive power, using constructs borrowed from programming...

226 | Exploiting structure in policy construction
- Boutilier, Dearden, et al.
- 1995
Citation Context ...ed safe if optimal solutions in the abstract space are also optimal in the original space. Safe abstractions were introduced by Amarel [1] for the Missionaries and Cannibals problem. Boutilier et al. [5] proposed a general method for deriving safe state abstractions for Markov decision processes (MDPs). Faster problem solving and learning can also be achieved by providing prior constraints on behavio...

213 | Algorithms for inverse reinforcement learning
- Ng, Russell
- 2000
Citation Context ... learner ignores the actions of its peers when “optimizing” its policy, analogous to a local Q learner. There is still no central arbitration mechanism, but inverse reinforcement learning techniques (Ng & Russell, 2000) might be used to deduce the policies of other agents and bridge the communications gap. In treating only local Q learning and local Sarsa, this paper has evaluated two points in the continuum of pos...

200 | Teleo-reactive programs for agent control
- Nilsson
- 1994
Citation Context ...ng HAMs. For example, programs in Dietterich's MAXQ language [4] are written easily as PHAMs, but not as HAMs because the MAXQ language allows parameters. The language of teleo-reactive (TR) programs [7, 2] relies on a prioritized set of condition-action rules to achieve a goal. Each action can itself be another TR program. The TR architecture can be implemented directly in PHAMs using the abort mechan...

186 | Policy invariance under reward transformations: Theory and application to reward shaping
- Ng, Harada, et al.
- 1999
Citation Context ...o the wall, but will not cross into the infield. An agent receives a reward of 10 for completing one lap, and a penalty of −1 for each collision with the wall. An agent also receives a shaping reward (Ng et al., 1999) proportional to the measure of the arc swept by its action. 4.2. Implementation This paper evaluates a single racer on a 10-unit-wide track with a 15×20 infield. Training consisted of 4000 episodes,...

161 | Asynchronous stochastic approximation and Q-learning
- Tsitsiklis
- 1994
Citation Context ...the following theorem demonstrates that this sort of off-policy update leads to the convergence of the $Q_j$ estimates to a collection of locally greedy (“selfish”) estimates. Theorem 1. (Theorem 4 in (Tsitsiklis, 1994).) Suppose that each $(s, a) \in S \times A$ is visited infinitely often. Under the update scheme described in equation (2), each $Q_j$ will converge a.s. to a $\tilde{Q}_j$ satisfying $\tilde{Q}_j(s, a) = \sum_{s' \in S} P(s' \mid s, a)$ [ R ...

128 | Graphical models for preference and utility
- Bacchus, Grove
- 1995
Citation Context ...nal savings are also possible by combining Q-decomposition with graphical models of conditional utilities. While it may be possible to elicit and maintain conditional utilities for one-step problems (Bacchus & Grove, 1995; Boutilier et al., 2001), the dependencies introduced by both the transition model and reward de... [Figure 4. Characteristic depletion of fish stocks over one episode...; axes: Year, # Fish (×10^4)]

124 | The MAXQ method for hierarchical reinforcement learning
- Dietterich
- 1998
Citation Context ...sistent with partial programs. Hierarchical abstract machines, or HAMs [11], are hierarchical finite automata with nondeterministic choice points within them where learning is to occur. MAXQ programs [7, 8] organize behavior into a hierarchy in which each “subroutine” is simply a repeated choice among a fixed set David was supported by the generosity of the Fannie and John Hertz Foundation. The work was...

119 | On representations of problems of reasoning about actions
- Amarel
- 1968
Citation Context ...he agent to relearn a policy from scratch. An abstraction is called safe if optimal solutions in the abstract space are also optimal in the original space. Safe abstractions were introduced by Amarel [1] for the Missionaries and Cannibals problem. Boutilier et al. [5] proposed a general method for deriving safe state abstractions for Markov decision processes (MDPs). Faster problem solving and learni...

109 | A multivalued logic approach to integrating planning and control
- Saffiotti, Konolige, et al.
- 1995
Citation Context ... would be if it went Up. To overcome such problems, some have proposed command fusion, whereby the arbitrator executes some kind of combination (such as an average) of the subagents’ recommendations (Saffiotti et al., 1995; Ogasawara, 1993; Lin, 1993; Goldberg et al., in press). Unfortunately, fusing the subagents’ actions may be disastrous. In our example, averaging the direction vectors for Left and Right yields NoOp...

107 | Decision-theoretic, high-level agent programming in the situation calculus
- Boutilier, Reiter, et al.
- 2000
Citation Context ...nie and John Hertz Foundation. The work was also supported by the following grants: NSF ECS-9873474, ONR MURI N00014-00-1-0637 of lower-level subroutines until a termination condition is met. DTGolog [6] allows partial programming in Prolog combined with symbolic dynamic programming as a solution method. Programmable HAMs, or PHAMs [4], are described in Section 2; in short, they augment Lisp with cho...

96 | Computing factored value functions for policies in structured MDPs
- Koller, Parr
- 1999
Citation Context ...ction updates assume a particular form to guarantee optimal agent behavior. In some cases, like the fishery world, this additive decomposition results in a more compact value function. Other authors (Koller & Parr, 1999) have explored approximations that represent the true value function as a linear combination of basis functions, with each basis function defined over a small collection of state variables. In order ...

95 | UCP-Networks: A directed graphical representation of conditional utilities
- Boutilier, Bacchus, et al.
- 2001
Citation Context ...ssible by combining Q-decomposition with graphical models of conditional utilities. While it may be possible to elicit and maintain conditional utilities for one-step problems (Bacchus & Grove, 1995; Boutilier et al., 2001), the dependencies introduced by both the transition model and reward de... [Figure 4. Characteristic depletion of fish stocks over one episode...; axes: Year, # Fish (×10^4)]

93 | Achieving Artificial Intelligence through Building Robots
- Brooks
- 1986
Citation Context ...ery common design, called command arbitration, requires each subagent to recommend an action to the arbitrator. In the simplest such scheme, the arbitrator chooses one of the actions and executes it (Brooks, 1986). The problem with this approach is that each subagent may suggest an action that makes the other subagents very unhappy; there is no way to find a “compromise” action that is reasonable from every s...

89 | State abstraction for programmable reinforcement learning agents. Paper presented at AAAI-02
- Andre, Russell
- 2002

89 | Coordinated reinforcement learning
- Guestrin, Lagoudakis, et al.
- 2002

57 | Team-partitioned, opaque-transition reinforcement learning
- Stone, Veloso
- 1999

53 | Scaling up reinforcement learning for robot control
- Lin
- 1993
Citation Context ... problems, some have proposed command fusion, whereby the arbitrator executes some kind of combination (such as an average) of the subagents’ recommendations (Saffiotti et al., 1995; Ogasawara, 1993; Lin, 1993; Goldberg et al., in press). Unfortunately, fusing the subagents’ actions may be disastrous. In our example, averaging the direction vectors for Left and Right yields NoOp, Proceedings of the Twentie...

51 | Stock and recruitment
- Ricker
- 1954
Citation Context ...ed line) updates in the racetrack world. Arrowheads indicate one standard deviation for local Q; dots indicate one standard deviation for local Sarsa. produces according to a density-dependent model (Ricker, 1954): $f(t+1) = f(t) \exp\left[ R \left( 1 - \frac{f(t)}{f_{\max}} \right) \right]$, where R for a fish population without immigration or emigration is the difference between the birth rate and the death rate, and $f_{\max}$ is the “carrying capacity”...

49 | Learning Policies with External Memory
- Peshkin, Meuleau, et al.
- 1999
Citation Context ...re 1(b)). Memory variables are a feature of nearly every programming language. Some previous research has been done on using memory variables in reinforcement learning in partially observable domains [10]. For an example of memory use in our language, examine the DoDelivery subroutine in Figure 1(b), where $z_2$ is set to another memory value (set in Nav(dest,sp)). $z_2$ is then passed as a variable to th...

42 | Reacting, planning, and learning in an autonomous agent
- Benson, Nilsson
- 1995
Citation Context ...ery important in physical behaviors, more so than in computation, and are crucial in allowing for modularity in behavioral descriptions. These features are all common in robot programming languages [2, 3, 5]; the key element of our approach is that behaviors need only be partially described; reinforcement learning does the rest. David was supported by the generosity of the Fannie and John Hertz Foundat...

32 | Hierarchical multi-agent reinforcement learning
- Makar, Mahadevan, et al.
- 2001
Citation Context ... is possible to obtain safe state abstraction while maintaining hierarchical optimality. Although it is possible to use state abstraction in an approximate fashion as a form of function approximation [10], we are investigating the possibility of... [Figure: Results on Taxi World; y-axis: Score (Average of 10 trials); series: Q-Learning, PHAM w/o SA, PHAM w/ SA, Better PHAM w/...]

22 | Modularity issues in reactive planning
- Firby
- 1996
Citation Context ...ery important in physical behaviors, more so than in computation, and are crucial in allowing for modularity in behavioral descriptions. These features are all common in robot programming languages [2, 3, 5]; the key element of our approach is that behaviors need only be partially described; reinforcement learning does the rest. David was supported by the generosity of the Fannie and John Hertz Foundat...

21 | State abstraction in MAXQ hierarchical reinforcement learning
- Dietterich
Citation Context ...red behaviors. Several simple languages for hierarchical partial descriptions have been proposed, including Hierarchical Abstract Machines (HAMs) [8], semi-Markov options [12], and the MAXQ framework [4]. This paper describes extensions to the HAM language that substantially increase its expressive power, using constructs borrowed from programming languages. Obviously, increasing expressiveness makes...

21 | Multiple Objective Behavior-Based Control
- Pirjanian
- 2000
Citation Context ...ess-playing subagents recommend a knight move and a bishop move respectively. The weaknesses of command arbitration have been pointed out previously by proponents of utility fusion (Rosenblatt, 2000; Pirjanian, 2000). In a utility-fusion agent, each subagent calculates its own outcome probabilities for actions and its own utilities for the outcome states. The arbitrator combines this information to obtain a glob...

13 | Optimal Selection of Uncertain Actions by Maximizing Expected Utility
- Rosenblatt
- 2000
Citation Context ...ample, when two chess-playing subagents recommend a knight move and a bishop move respectively. The weaknesses of command arbitration have been pointed out previously by proponents of utility fusion (Rosenblatt, 2000; Pirjanian, 2000). In a utility-fusion agent, each subagent calculates its own outcome probabilities for actions and its own utilities for the outcome states. The arbitrator combines this information...

6 | RALPH-MEA: A Real-Time, Decision-Theoretic Agent Architecture
- Ogasawara
- 1993
Citation Context ... To overcome such problems, some have proposed command fusion, whereby the arbitrator executes some kind of combination (such as an average) of the subagents’ recommendations (Saffiotti et al., 1995; Ogasawara, 1993; Lin, 1993; Goldberg et al., in press). Unfortunately, fusing the subagents’ actions may be disastrous. In our example, averaging the direction vectors for Left and Right yields NoOp, Proceedings of ...

5 | Temporal abstraction in reinforcement learning
- Sutton
- 1995
Citation Context ... on the learning rate. 5 Expressiveness of the PHAM language As shown by Parr [9], the HAM language is at least as expressive as some existing action languages including options [12] and full-β models [11]. The PHAM language is substantially more expressive than HAMs. As mentioned earlier, the Deliver-Patrol PHAM program has 9 machines whereas the HAM program requires 63. In general, the additional nu...

2 | Hierarchical Control and Learning for MDPs
- Parr
- 1998

2 | Neuro-dynamic programming. Belmont, MA: Athena Scientific
- Bertsekas, Tsitsiklis
- 1996
Citation Context ...ward if it does so forever. [Footnote 1: In MDP jargon, an agent that chooses to stay forever is pursuing an “improper” policy. Convergence still holds under certain restrictions on the reward functions; cf. (Bertsekas & Tsitsiklis, 1996).] 3.3. Remarks Note that $\tilde{Q}_j$ provides an optimistic estimate of $Q^*_j$ by definition of the selfish action-value function: $\tilde{Q}(s, a) = \ldots \geq \ldots \geq \ldots = Q^*(s, a)$ ...

2 | Programmable hams. tech report: www.cs.berkeley.edu/˜pham.ps
- Andre
- 2000
Citation Context ... in a PHAM program and as the set of possible machine states achievable by the program (where a machine state includes the program counter, all memory variables, and the call stack). In previous work [2, 4], we showed that, under appropriate restrictions (such as that the number of machine states [...] stays bounded in every run in the environment), the problem of finding the optimal choices for each choi...

2 | State abstraction in phams. tech report: www.cs.berkeley.edu/˜sa.ps
- Andre
- 2001
Citation Context ...evel. When this isn’t the case, single-choice choice points are inserted, which allows simpler analysis. [...] and [...] are thus simple deterministic functions, determined from the program structure. See [3] for more details. [Table 1: Table of Q... Columns: x, y, pass, dest, and four value columns. Rows: (get-choice) 3 3 R G 0.23 -7.5 -1.0 8.74; (get-choice) 3 3 R B 1.13 -7.5 -1.0 9.63; (get-choice) 3 2 R G 1.29 -6.45 -1.0 8.74]

1 | Programmable HAMs. www.cs.berkeley.edu/dandre/pham.ps
- Andre
- 2000
Citation Context ...f the learning process for our language. 2 Background An MDP is a 4-tuple, (S, A, T, R), where S is a set of states, A is a set of actions, T is a probabilistic transition function mapping $S \times A \times S \to [0, 1]$, and R is a reward function mapping $S \times A \times S$ to the reals. In this paper, we focus on infinite-horizon MDPs with a discount factor β. A solution to a MDP is an optimal policy that maps from $S \to A$ ...

1 | State abstraction in MAXQ hierarchical RL
- Dietterich
- 2000
Citation Context ... description of desired behaviors. Several languages for partial descriptions have been proposed, including Hierarchical Abstract Machines (HAMs) [8], semi-Markov options [12], and the MAXQ framework [4]. This paper describes extensions to the HAM language that substantially increase its expressive power, using constructs borrowed from programming languages. Obviously, increasing expressiveness makes...

1 | Programmable hams. tech report: www.davidandre.com/pham.ps
- Andre
- 2000
Citation Context ... with the basics of MDPs, but provide a brief review. An MDP is a 4-tuple, (S, A, T, Z), where S is a set of states, A is a set of actions, T is a probabilistic transition function mapping $S \times A \times S \to [0, 1]$, and Z is a reward function mapping from $S \times A \times S$ to the reals. In this paper, we focus on infinite-horizon MDPs with a discount factor β. The solution to a MDP is an optimal policy that maps from S ...