Results 1–10 of 45
Reinforcement learning: a survey
Journal of Artificial Intelligence Research, 1996
Cited by 1309 (22 self)
Abstract: This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
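The exploration/exploitation trade-off and trial-and-error learning the survey discusses can be illustrated with tabular Q-learning under an ε-greedy policy. This is a minimal sketch, not code from the survey; the two-state chain task and all names are hypothetical:

```python
import random

def epsilon_greedy(q, state, actions, epsilon):
    """Explore with probability epsilon; otherwise exploit the greedy action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])

def q_learning(step, n_states, actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning by trial-and-error interaction with `step`,
    which maps (state, action) -> (next_state_or_None, reward)."""
    q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    for _ in range(episodes):
        s = 0
        while s is not None:
            a = epsilon_greedy(q, s, actions, epsilon)
            s2, r = step(s, a)
            target = r if s2 is None else r + gamma * max(q[(s2, b)] for b in actions)
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s2
    return q

# Toy two-state chain: action 1 advances toward a terminal goal worth +1;
# action 0 stays put for no reward.
def chain(s, a):
    if a == 0:
        return s, 0.0
    return (1, 0.0) if s == 0 else (None, 1.0)
```

With enough episodes the learned values reflect the delayed reward: the goal-adjacent value approaches 1 and its predecessor approaches γ·1.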
Algorithms for Sequential Decision Making
1996
Cited by 177 (8 self)
Abstract: Sequential decision making is a fundamental task faced by any intelligent agent in an extended interaction with its environment; it is the act of answering the question "What should I do now?" In this thesis, I show how to answer this question when "now" is one of a finite set of states, "do" is one of a finite set of actions, "should" is maximize a long-run measure of reward, and "I" is an automated planning or learning system (agent). In particular, ...
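The finite-state, finite-action, long-run-reward setting the thesis describes is the classic finite MDP, for which value iteration is a standard solution method. A minimal sketch (the transition/reward encoding here is an assumption for illustration, not the thesis's own code):

```python
def value_iteration(n_states, n_actions, P, R, gamma=0.9, tol=1e-9):
    """Compute optimal state values for a finite MDP.
    P[s][a] is a list of (prob, next_state); R[s][a] is the immediate reward."""
    V = [0.0] * n_states
    while True:
        V_new = [max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                     for a in range(n_actions))
                 for s in range(n_states)]
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            return V_new
        V = V_new
```

Each sweep applies the Bellman optimality backup to every state; the fixed point answers "What should I do now?" via the maximizing action in each state.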
From Implicit Skills to Explicit Knowledge: A Bottom-Up Model of Skill Learning
1999
Cited by 103 (33 self)
Abstract: This paper presents a skill learning model, CLARION. Different from existing models of mostly high-level skill learning that use a top-down approach (that is, turning declarative knowledge into procedural knowledge through practice), we adopt a bottom-up approach toward low-level skill learning, where procedural knowledge develops first and declarative knowledge develops later. Our model is formed by integrating connectionist, reinforcement, and symbolic learning methods to perform online reactive learning. It adopts a two-level dual-representation framework (Sun, 1995), with a combination of localist and distributed representation. We compare the model with human data in a minefield navigation task, demonstrating some match between the model and human data in several respects.
Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
1996
Cited by 99 (12 self)
Abstract: This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies, and can fall into suboptimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
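A hedged sketch of the R-learning update the paper studies: relative action values and the average-reward estimate ρ are estimated independently, with ρ updated only on greedy (non-exploratory) steps. Details such as update order vary across presentations, and the toy cyclic task is hypothetical:

```python
import random

def r_learning(step, states, actions, n_steps=20000,
               alpha=0.01, beta=0.1, epsilon=0.1, seed=0):
    """Average-reward R-learning sketch. `step` maps
    (state, action) -> (reward, next_state)."""
    rng = random.Random(seed)
    R = {(s, a): 0.0 for s in states for a in actions}
    rho = 0.0                       # running estimate of the average reward
    s = states[0]
    for _ in range(n_steps):
        greedy = max(actions, key=lambda a: R[(s, a)])
        a = rng.choice(actions) if rng.random() < epsilon else greedy
        r, s2 = step(s, a)
        best_next = max(R[(s2, b)] for b in actions)
        best_here = max(R[(s, b)] for b in actions)
        if a == greedy:             # update rho only on greedy steps
            rho += alpha * (r - rho + best_next - best_here)
        R[(s, a)] += beta * (r - rho + best_next - R[(s, a)])
        s = s2
    return R, rho

# Toy cyclic task: both actions hop between two states; action 1 pays +1,
# so the optimal average reward is 1.
def hop(s, a):
    return (1.0 if a == 1 else 0.0), 1 - s
```

On this unichain toy the estimate ρ should approach the optimal average reward while the relative values rank action 1 above action 0 in both states.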
Roles of Macro-Actions in Accelerating Reinforcement Learning
In Grace Hopper Celebration of Women in Computing, 1997
Cited by 40 (11 self)
Abstract: We analyze the use of built-in policies, or macro-actions, as a form of domain knowledge that can improve the speed and scaling of reinforcement learning algorithms. Such macro-actions are often used in robotics, and macro-operators are also well-known as an aid to state-space search in AI systems. The macro-actions we consider are closed-loop policies with termination conditions. The macro-actions can be chosen at the same level as primitive actions. Macro-actions commit the learning agent to act in a particular, purposeful way for a sustained period of time. Overall, macro-actions may either accelerate or retard learning, depending on the appropriateness of the macro-actions to the particular task. We analyze their effect in a simple example, breaking the acceleration effect into two parts: 1) the effect of the macro-action in changing exploratory behavior, independent of learning, and 2) the effect of the macro-action on learning, independent of its effect on behavior. In our example, ...
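The macro-actions described here, closed-loop policies with termination conditions, can be sketched as follows; the corridor task and all names are illustrative assumptions, not the paper's code:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MacroAction:
    """A closed-loop policy with a termination condition."""
    policy: Callable[[int], int]        # state -> primitive action
    terminate: Callable[[int], bool]    # state -> stop executing?

def run_macro(env_step, state, macro, gamma=0.9, max_steps=100):
    """Execute the macro until its termination condition fires.
    Returns (discounted reward along the way, resulting state, steps taken)."""
    total, discount = 0.0, 1.0
    for k in range(max_steps):
        if macro.terminate(state):
            return total, state, k
        reward, state = env_step(state, macro.policy(state))
        total += discount * reward
        discount *= gamma
    return total, state, max_steps

# Toy corridor of states 0..4; reaching state 4 pays +1. The primitive
# action is ignored in this toy: every step moves right.
def corridor(s, a):
    s2 = s + 1
    return (1.0 if s2 == 4 else 0.0), s2
```

A learner that treats `run_macro`'s return as a single temporally extended transition commits to purposeful behavior for several steps at once, which is exactly what can either accelerate or retard learning depending on the macro's fit to the task.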
Policy iteration for decentralized control of Markov decision processes
 JAIR
Cited by 22 (15 self)
Abstract: Coordination of distributed agents is required for problems arising in many areas, including multi-robot systems, networking and e-commerce. As a formal framework for such problems, we use the decentralized partially observable Markov decision process (DEC-POMDP). Though much work has been done on optimal dynamic programming algorithms for the single-agent version of the problem, optimal algorithms for the multi-agent case have been elusive. The main contribution of this paper is an optimal policy iteration algorithm for solving DEC-POMDPs. The algorithm uses stochastic finite-state controllers to represent policies. The solution can include a correlation device, which allows agents to correlate their actions without communicating. This approach alternates between expanding the controller and performing value-preserving transformations, which modify the controller without sacrificing value. We present two efficient value-preserving transformations: one can reduce the size of the controller and the other can improve its value while keeping the size fixed. Empirical results demonstrate the usefulness of value-preserving transformations in increasing value while keeping controller size to a minimum. To broaden the applicability of the approach, we also present a heuristic version of the policy iteration algorithm, which sacrifices convergence to optimality. This algorithm further reduces the size of the controllers at each step by assuming that probability distributions over the other agents' actions are known. While this assumption may not hold in general, it helps produce higher quality solutions in our test problems.
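The stochastic finite-state controllers used to represent policies can be sketched as below. This shows only the representation (a node stochastically selects an action, then transitions on the received observation), not the policy iteration algorithm or the correlation device; all names are illustrative:

```python
import random

class StochasticController:
    """A stochastic finite-state controller for one agent."""
    def __init__(self, action_dist, transitions, seed=0):
        self.action_dist = action_dist    # node -> {action: probability}
        self.transitions = transitions    # (node, observation) -> next node
        self.rng = random.Random(seed)
        self.node = 0                     # start node

    def act(self):
        """Sample an action from the current node's action distribution."""
        actions = list(self.action_dist[self.node])
        weights = [self.action_dist[self.node][a] for a in actions]
        return self.rng.choices(actions, weights=weights)[0]

    def observe(self, obs):
        """Deterministic node transition on the received observation."""
        self.node = self.transitions[(self.node, obs)]
```

Value-preserving transformations in the paper operate on exactly this kind of object: they grow, shrink, or reweight the node set without losing value.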
Incremental Dynamic Programming for On-Line Adaptive Optimal Control
1994
Cited by 20 (2 self)
Abstract: Reinforcement learning algorithms based on the principles of Dynamic Programming (DP) have enjoyed a great deal of recent attention both empirically and theoretically. These algorithms have been referred to generically as Incremental Dynamic Programming (IDP) algorithms. IDP algorithms are intended for use in situations where the information or computational resources needed by traditional dynamic programming algorithms are not available. IDP algorithms attempt to find a global solution to a DP problem by incrementally improving local constraint satisfaction properties as experience is gained through interaction with the environment. This class of algorithms is not new, going back at least as far as Samuel's adaptive checkers-playing programs, ...
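The incremental flavor of IDP, improving a local estimate as experience arrives rather than sweeping the whole state space at once, can be sketched as a single asynchronous backup (a hypothetical illustration with an assumed finite-MDP encoding, not the paper's algorithm):

```python
def incremental_backup(V, s, actions, P, R, gamma=0.9):
    """One asynchronous DP backup: improve V[s] in place from the current
    estimates of the other states, e.g. whenever experience visits s.
    P[s][a] is a list of (prob, next_state); R[s][a] is the reward."""
    V[s] = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
               for a in actions)
    return V[s]
```

Applied repeatedly to whichever states interaction makes available, these local improvements converge to the same fixed point as full synchronous value iteration (under the usual conditions that every state keeps being backed up).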
Self-Segmentation of Sequences: Automatic Formation of Hierarchies of Sequential Behaviors
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 2000
Cited by 19 (1 self)
Abstract: The paper presents an approach for hierarchical reinforcement learning that does not rely on a priori domain-specific knowledge regarding hierarchical structures. Thus this work deals with a more difficult problem compared with existing work. It involves learning to segment action sequences to create hierarchical structures (for example, for the purpose of dealing with partially observable Markov decision processes, with multiple limited-memory or memoryless modules). Segmentation is based on reinforcement received during task execution, with different levels of control communicating with each other through sharing reinforcement estimates obtained by each other. The algorithm segments action sequences to reduce non-Markovian temporal dependencies, and seeks out proper configurations of long- and short-range dependencies, to facilitate the learning of the overall task. Developing hierarchies also facilitates the extraction of explicit hierarchical plans. The initial experiments demonstrate the promise of the approach.
Open Theoretical Questions in Reinforcement Learning
1999
Cited by 19 (0 self)
Abstract: ...infinite number of terms (in which case we usually assume γ < 1). Infinite-horizon cases with γ = 1 are also possible though less common (e.g., see Mahadevan, 1996). The agent's action choices are a stochastic function of the state, called a policy, π: S → Pr(A). The value of a state given a policy is the expected return starting from that state following the policy: Vπ(s) = E{R_t | s_t = s, π}, and the best that can be done in a state is its optimal value: V*(s) = max_π Vπ(s). There is always at least one optimal policy, π*, that achieves this maximum at all states s ∈ S. Paralleling the two state-value functions defined above are two ...
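The state-value function Vπ defined in this excerpt can be approximated for a fixed policy by iterative policy evaluation, repeatedly applying the Bellman backup for π. A minimal sketch with an assumed finite-MDP encoding (not code from the paper):

```python
def evaluate_policy(pi, P, R, gamma=0.9, sweeps=500):
    """Iterative policy evaluation: approximates V(s) = E{R_t | s_t = s, pi}
    for a fixed deterministic policy pi (a list mapping state -> action).
    P[s][a] is a list of (prob, next_state); R[s][a] is the expected reward."""
    n = len(pi)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [R[s][pi[s]] + gamma * sum(p * V[s2] for p, s2 in P[s][pi[s]])
             for s in range(n)]
    return V
```

Evaluating every policy this way and taking the pointwise maximum would recover V*(s) = max_π Vπ(s), though practical algorithms interleave evaluation and improvement instead.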
Multi-Agent Reinforcement Learning: Weighting and Partitioning
1999
Cited by 19 (11 self)
Abstract: This paper addresses weighting and partitioning in complex reinforcement learning tasks, with the aim of facilitating learning. The paper presents some ideas regarding weighting of multiple agents and extends them into partitioning an input/state space into multiple regions with differential weighting in these regions, to exploit differential characteristics of regions and differential characteristics of agents, to reduce the learning complexity of agents (and their function approximators) and thus to facilitate the learning overall. It analyzes, in reinforcement learning tasks, different ways of partitioning a task and using agents selectively based on partitioning. Based on the analysis, some heuristic methods are described and experimentally tested. We find that some offline heuristic methods performed the best, significantly better than single-agent models. Keywords: weighting, averaging, neural networks, partitioning, gating, reinforcement learning.
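The idea of partitioning the input/state space into regions with differential weighting of agents can be sketched as a hard gating scheme (a hypothetical illustration; the paper's actual methods differ in detail, and all names are assumptions):

```python
def partitioned_estimate(x, regions, agents, weight_table):
    """Find the region containing x, then combine the agents' outputs with
    that region's weights. `regions` are predicates over inputs, `agents`
    are callables, and weight_table[r] gives per-agent weights in region r."""
    r = next(i for i, inside in enumerate(regions) if inside(x))
    weights = weight_table[r]
    total = sum(weights)
    return sum(w * agent(x) for w, agent in zip(weights, agents)) / total
```

Setting a region's weights to one-hot recovers purely selective use of agents; intermediate weights give the weighted averaging the abstract describes.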