Results 1  10
of
95
Policy Gradient Methods for Reinforcement Learning with Function Approximation
, 1999
"... Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by i ..."
Abstract

Cited by 319 (18 self)
 Add to MetaCart
Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams’s REINFORCE method and actor–critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate actionvalue or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
PEGASUS: A policy search method for large MDPs and POMDPs
 In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence
, 2000
"... We propose a new approach to the problem of searching a space of policies for a Markov decision process (MDP) or a partially observable Markov decision process (POMDP), given a model. Our approach is based on the following observation: Any (PO)MDP can be transformed into an "equivalent" POMDP ..."
Abstract

Cited by 207 (7 self)
 Add to MetaCart
We propose a new approach to the problem of searching a space of policies for a Markov decision process (MDP) or a partially observable Markov decision process (POMDP), given a model. Our approach is based on the following observation: Any (PO)MDP can be transformed into an "equivalent" POMDP in which all state transitions (given the current state and action) are deterministic. This reduces the general problem of policy search to one in which we need only consider POMDPs with deterministic transitions. We give a natural way of estimating the value of all policies in these transformed POMDPs. Policy search is then simply performed by searching for a policy with high estimated value. We also establish conditions under which our value estimates will be good, recovering theoretical results similar to those of Kearns, Mansour and Ng [7], but with "sample complexity" bounds that have only a polynomial rather than exponential dependence on the horizon time. Our method appl...
Multiagent Learning Using a Variable Learning Rate
 Artificial Intelligence
, 2002
"... Learning to act in a multiagent environment is a difficult problem since the normal definition of an optimal policy no longer applies. The optimal policy at any moment depends on the policies of the other agents and so creates a situation of learning a moving target. Previous learning algorithms hav ..."
Abstract

Cited by 180 (8 self)
 Add to MetaCart
Learning to act in a multiagent environment is a difficult problem since the normal definition of an optimal policy no longer applies. The optimal policy at any moment depends on the policies of the other agents and so creates a situation of learning a moving target. Previous learning algorithms have one of two shortcomings depending on their approach. They either converge to a policy that may not be optimal against the specific opponents' policies, or they may not converge at all. In this article we examine this learning problem in the framework of stochastic games. We look at a number of previous learning algorithms showing how they fail at one of the above criteria. We then contribute a new reinforcement learning technique using a variable learning rate to overcome these shortcomings. Specifically, we introduce the WoLF principle, "Win or Learn Fast", for varying the learning rate. We examine this technique theoretically, proving convergence in selfplay on a restricted class of iterated matrix games. We also present empirical results on a variety of more general stochastic games, in situations of selfplay and otherwise, demonstrating the wide applicability of this method.
Infinitehorizon policygradient estimation
 Journal of Artificial Intelligence Research
, 2001
"... Gradientbased approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in valuefunction methods. In this paper we introduce � � , a si ..."
Abstract

Cited by 153 (5 self)
 Add to MetaCart
Gradientbased approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in valuefunction methods. In this paper we introduce � � , a simulationbased algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes ( � s) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm’s chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter � � (which has a natural interpretation in terms of biasvariance tradeoff), and requires no knowledge of the underlying state. We prove convergence of � � , and show how the correct choice of the parameter is related to the mixing time of the controlled �. We briefly describe extensions of � � to controlled Markov chains, continuous state, observation and control spaces, multipleagents, higherorder derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter, Bartlett, & Weaver, 2001) we show how the gradient estimates generated by � � can be used in both a traditional stochastic gradient algorithm and a conjugategradient procedure to find local optima of the average reward. 1.
Learning to Cooperate via Policy Search
, 2000
"... Cooperative games are those in which both agents share the same payoff structure. Valuebased reinforcementlearning algorithms, such as variants of Qlearning, have been applied to learning cooperative games, but they only apply when the game state is completely observable to both agents. Poli ..."
Abstract

Cited by 131 (4 self)
 Add to MetaCart
Cooperative games are those in which both agents share the same payoff structure. Valuebased reinforcementlearning algorithms, such as variants of Qlearning, have been applied to learning cooperative games, but they only apply when the game state is completely observable to both agents. Policy search methods are a reasonable alternative to valuebased methods for partially observable environments. In this paper, we provide a gradientbased distributed policysearch method for cooperative games and compare the notion of local optimum to that of Nash equilibrium. We demonstrate the effectiveness of this method experimentally in a small, partially observable simulated soccer domain. 1 INTRODUCTION The interaction of decision makers who share an environment is traditionally studied in game theory and economics. The game theoretic formalism is very general, and analyzes the problem in terms of solution concepts such as Nash equilibrium [12], but usually works under the assu...
Approximate Planning in Large POMDPs via Reusable Trajectories
, 1999
"... We consider the problem of choosing a nearbest strategy from a restricted class of strategies in a partially observable Markov decision process (POMDP). We assume we are given the ability to simulate the behavior of the POMDP, and we provide methods for generating simulated experience su cient to a ..."
Abstract

Cited by 111 (10 self)
 Add to MetaCart
We consider the problem of choosing a nearbest strategy from a restricted class of strategies in a partially observable Markov decision process (POMDP). We assume we are given the ability to simulate the behavior of the POMDP, and we provide methods for generating simulated experience su cient to accurately approximate the expected return of any strategy in the class. We prove upper bounds on the amount of simulated experience our methods must generate in order to achieve such uniform approximation. These bounds have no dependence on the size or complexity of the underlying POMDP, but depend only on the complexity of the restricted strategy class. The main challenge is in generating trajectories in the POMDP that can be reused, in the sense that they simultaneously provide estimates of the return of many strategies in the class. Our measure of strategy class complexity generalizes the classical notion of VC dimension, and our methods develop connections between problems of current interest in reinforcement learning and wellstudied issues in the theory of supervised learning. We also discuss a number of practical planning algorithms for POMDPs that arise from our reusable trajectories.
Reinforcement learning for RoboCupsoccer keepaway
 Adaptive Behavior
, 2005
"... 1 RoboCup simulated soccer presents many challenges to reinforcement learning methods, including a large state space, hidden and uncertain state, multiple independent agents learning simultaneously, and long and variable delays in the effects of actions. We describe our application of episodic SMD ..."
Abstract

Cited by 109 (32 self)
 Add to MetaCart
1 RoboCup simulated soccer presents many challenges to reinforcement learning methods, including a large state space, hidden and uncertain state, multiple independent agents learning simultaneously, and long and variable delays in the effects of actions. We describe our application of episodic SMDP Sarsa(λ) with linear tilecoding function approximation and variable λ to learning higherlevel decisions in a keepaway subtask of RoboCup soccer. In keepaway, one team, “the keepers, ” tries to keep control of the ball for as long as possible despite the efforts of “the takers. ” The keepers learn individually when to hold the ball and when to pass to a teammate. Our agents learned policies that significantly outperform a range of benchmark policies. We demonstrate the generality of our approach by applying it to a number of task variations including different field sizes and different numbers of players on each team.
Scaling Reinforcement Learning toward RoboCup Soccer
, 2001
"... RoboCup simulated soccer presents many challenges to reinforcement learning methods, including a large state space, hidden and uncertain state, multiple agents, and long and variable delays in the eects of actions. We describe our application of episodic SMDP Sarsa() with linear tilecoding funct ..."
Abstract

Cited by 101 (21 self)
 Add to MetaCart
RoboCup simulated soccer presents many challenges to reinforcement learning methods, including a large state space, hidden and uncertain state, multiple agents, and long and variable delays in the eects of actions. We describe our application of episodic SMDP Sarsa() with linear tilecoding function approximation and variable to learning higherlevel decisions in a keepaway subtask of RoboCup soccer. In keepaway, one team, \the keepers," tries to keep control of the ball for as long as possible despite the eorts of \the takers." The keepers learn individually when to hold the ball and when to pass to a teammate, while the takers learn when to charge the ballholder and when to cover possible passing lanes. Our agents learned policies that signi cantly outperformed a range of benchmark policies. We demonstrate the generality of our approach by applying it to a number of task variations including dierent eld sizes and dierent numbers of players on each team.
Variable Resolution Discretization in Optimal Control
 Machine Learning
, 2001
"... The problem of state abstraction is of central importance in optimal control, reinforcement learning and Markov decision processes. This paper studies the case of variable resolution state abstraction for continuous time and space, deterministic dynamic control problems in which nearoptimal policie ..."
Abstract

Cited by 101 (2 self)
 Add to MetaCart
The problem of state abstraction is of central importance in optimal control, reinforcement learning and Markov decision processes. This paper studies the case of variable resolution state abstraction for continuous time and space, deterministic dynamic control problems in which nearoptimal policies are required. We begin by defining a class of variable resolution policy and value function representations based on Kuhn triangulations embedded in a kdtrie. We then consider topdown approaches to choosing which cells to split in order to generate improved policies. The core of this paper is the introduction and evaluation of a wide variety of possible splitting criteria. We begin with local approaches based on value function and policy properties that use only features of individual cells in making split choices. Later, by introducing two new nonlocal measures, inuence and variance, we derive splitting criteria that allow one cell to efficiently take into account its impact on other cells when deciding whether to split. Influence is an efficientlyalculable measure of the extent to which changes in some state effect the value function of some other states. Variance is an efficientlycalculable measure of how risky is some state in a Markov chain: a low variance state is one in which we would be very surprised if, during any one execution, the longterm reward attained from that state differed substantially from its expected value, given by the value function. The paper proceeds by graphically demonstrating the various approaches to splitting on the familiar, nonlinear, nonminimum phase, and two dimensional problem of the "Car on the hill". It then evaluates the performance of a variety of splitting criteria on many benchmark problems, paying careful attention to their number...
Reinforcement learning for humanoid robotics
 Autonomous Robot
, 2003
"... Abstract. The complexity of the kinematic and dynamic structure of humanoid robots make conventional analytical approaches to control increasingly unsuitable for such systems. Learning techniques offer a possible way to aid controller design if insufficient analytical knowledge is available, and lea ..."
Abstract

Cited by 91 (20 self)
 Add to MetaCart
Abstract. The complexity of the kinematic and dynamic structure of humanoid robots make conventional analytical approaches to control increasingly unsuitable for such systems. Learning techniques offer a possible way to aid controller design if insufficient analytical knowledge is available, and learning approaches seem mandatory when humanoid systems are supposed to become completely autonomous. While recent research in neural networks and statistical learning has focused mostly on learning from finite data sets without stringent constraints on computational efficiency, learning for humanoid robots requires a different setting, characterized by the need for realtime learning performance from an essentially infinite stream of incrementally arriving data. This paper demonstrates how even highdimensional learning problems of this kind can successfully be dealt with by techniques from nonparametric regression and locally weighted learning. As an example, we describe the application of one of the most advanced of such algorithms, Locally Weighted Projection Regression (LWPR), to the online learning of three problems in humanoid motor control: the learning of inverse dynamics models for modelbased control, the learning of inverse kinematics of redundant manipulators, and the learning of oculomotor reflexes. All these examples demonstrate fast, i.e., within seconds or minutes, learning convergence with highly accurate final peformance. We conclude that realtime learning for complex motor system like humanoid robots is possible with appropriately tailored algorithms, such that increasingly autonomous robots with massive learning abilities should be achievable in the near future. 1.