Results 1 - 10
of
94
Policy Gradient Methods for Reinforcement Learning with Function Approximation
, 1999
"... Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by i ..."
Abstract
-
Cited by 262 (13 self)
- Add to MetaCart
Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams’s REINFORCE method and actor–critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
AntNet: Distributed stigmergetic control for communications networks
- Journal of Artificial Intelligence Research
, 1998
"... This paper introduces AntNet, a novel approach to the adaptive learning of routing tables in communications networks. AntNet is a distributed, mobile agents based Monte Carlo system that was inspired by recent work on the ant colony metaphor for solving optimization problems. AntNet's agents concurr ..."
Abstract
-
Cited by 205 (29 self)
- Add to MetaCart
This paper introduces AntNet, a novel approach to the adaptive learning of routing tables in communications networks. AntNet is a distributed, mobile agents based Monte Carlo system that was inspired by recent work on the ant colony metaphor for solving optimization problems. AntNet's agents concurrently explore the network and exchange collected information. The communication among the agents is indirect and asynchronous, mediated by the network itself. This form of communication is typical of social insects and is called stigmergy. We compare our algorithm with six state-of-the-art routing algorithms coming from the telecommunications and machine learning elds. The algorithms' performance is evaluated over a set of realistic testbeds. We run many experiments over real and arti cial IP datagram networks with increasing number of nodes and under several paradigmatic spatial and temporal tra c distributions. Results are very encouraging. AntNet showed superior performance under all the experimental conditions with respect to its competitors. We analyze the main characteristics of the algorithm and try to explain the reasons for its superiority. 1.
Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems
- Advances in Neural Information Processing Systems 7
, 1995
"... Increasing attention has been paid to reinforcement learning algorithms in recent years, partly due to successes in the theoretical analysis of their behavior in Markov environments. If the Markov assumption is removed, however, neither generally the algorithms nor the analyses continue to be usable ..."
Abstract
-
Cited by 137 (7 self)
- Add to MetaCart
Increasing attention has been paid to reinforcement learning algorithms in recent years, partly due to successes in the theoretical analysis of their behavior in Markov environments. If the Markov assumption is removed, however, neither generally the algorithms nor the analyses continue to be usable. We propose and analyze a new learning algorithm to solve a certain class of non-Markov decision problems. Our algorithm applies to problems in which the environment is Markov, but the learner has restricted access to state information. The algorithm involves a Monte-Carlo policy evaluation combined with a policy improvement method that is similar to that of Markov decision problems and is guaranteed to converge to a local maximum. The algorithm operates in the space of stochastic policies, a space which can yield a policy that performs considerably better than any deterministic policy. Although the space of stochastic policies is continuous---even for a discrete action space---our algorith...
Infinite-horizon policy-gradient estimation
- Journal of Artificial Intelligence Research
, 2001
"... Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce � � , a si ..."
Abstract
-
Cited by 119 (5 self)
- Add to MetaCart
Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce � � , a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes ( � s) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm’s chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter � � (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of � � , and show how the correct choice of the parameter is related to the mixing time of the controlled �. We briefly describe extensions of � � to controlled Markov chains, continuous state, observation and control spaces, multiple-agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter, Bartlett, & Weaver, 2001) we show how the gradient estimates generated by � � can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward. 1.
Learning to Cooperate via Policy Search
, 2000
"... Cooperative games are those in which both agents share the same payoff structure. Valuebased reinforcement-learning algorithms, such as variants of Q-learning, have been applied to learning cooperative games, but they only apply when the game state is completely observable to both agents. Poli ..."
Abstract
-
Cited by 114 (4 self)
- Add to MetaCart
Cooperative games are those in which both agents share the same payoff structure. Valuebased reinforcement-learning algorithms, such as variants of Q-learning, have been applied to learning cooperative games, but they only apply when the game state is completely observable to both agents. Policy search methods are a reasonable alternative to value-based methods for partially observable environments. In this paper, we provide a gradient-based distributed policysearch method for cooperative games and compare the notion of local optimum to that of Nash equilibrium. We demonstrate the effectiveness of this method experimentally in a small, partially observable simulated soccer domain. 1 INTRODUCTION The interaction of decision makers who share an environment is traditionally studied in game theory and economics. The game theoretic formalism is very general, and analyzes the problem in terms of solution concepts such as Nash equilibrium [12], but usually works under the assu...
Value-function approximations for partially observable Markov decision processes
- Journal of Artificial Intelligence Research
, 2000
"... Partially observable Markov decision processes (POMDPs) provide an elegant mathematical framework for modeling complex decision and planning problems in stochastic domains in which states of the system are observable only indirectly, via a set of imperfect or noisy observations. The modeling advanta ..."
Abstract
-
Cited by 105 (0 self)
- Add to MetaCart
Partially observable Markov decision processes (POMDPs) provide an elegant mathematical framework for modeling complex decision and planning problems in stochastic domains in which states of the system are observable only indirectly, via a set of imperfect or noisy observations. The modeling advantage of POMDPs, however, comes at a price — exact methods for solving them are computationally very expensive and thus applicable in practice only to very simple problems. We focus on efficient approximation (heuristic) methods that attempt to alleviate the computational problem and trade off accuracy for speed. We have two objectives here. First, we survey various approximation methods, analyze their properties and relations and provide some new insights into their differences. Second, we present a number of new approximation methods and novel refinements of existing techniques. The theoretical results are supported by experiments on a problem from the agent navigation domain. 1.
Reinforcement Learning with Soft State Aggregation
- Advances in Neural Information Processing Systems 7
, 1995
"... It is widely accepted that the use of more compact representations than lookup tables is crucial to scaling reinforcement learning (RL) algorithms to real-world problems. Unfortunately almost all of the theory of reinforcement learning assumes lookup table representations. In this paper we address t ..."
Abstract
-
Cited by 97 (3 self)
- Add to MetaCart
It is widely accepted that the use of more compact representations than lookup tables is crucial to scaling reinforcement learning (RL) algorithms to real-world problems. Unfortunately almost all of the theory of reinforcement learning assumes lookup table representations. In this paper we address the pressing issue of combining function approximation and RL, and present 1) a function approximator based on a simple extension to state aggregation (a commonly used form of compact representation), namely soft state aggregation, 2) a theory of convergence for RL with arbitrary, but fixed, soft state aggregation, 3) a novel intuitive understanding of the effect of state aggregation on online RL, and 4) a new heuristic adaptive state aggregation algorithm that finds improved compact representations by exploiting the non-discrete nature of soft state aggregation. Preliminary empirical results are also presented. 1 INTRODUCTION The strong theory of convergence available for reinforcement lea...
Learning finite-state controllers for partially observable environments
- In Proceedings of the fifteenth conference on uncertainty in artificial intelligence
, 1999
"... Reactive (memoryless) policies are sufficient in completely observable Markov decision processes (MDPs), but some kind of memory is usually necessary for optimal control of a partially observable MDP. Policies with finite memory can be represented as finite-state automata. In this paper, we extend B ..."
Abstract
-
Cited by 67 (9 self)
- Add to MetaCart
Reactive (memoryless) policies are sufficient in completely observable Markov decision processes (MDPs), but some kind of memory is usually necessary for optimal control of a partially observable MDP. Policies with finite memory can be represented as finite-state automata. In this paper, we extend Baird and Moore’s VAPS algorithm to the problem of learning general finite-state automata. Because it performs stochastic gradient descent, this algorithm can be shown to converge to a locally optimal finitestate controller. We provide the details of the algorithm and then consider the question of under what conditions stochastic gradient descent will outperform exact gradient descent. We conclude with empirical results comparing the performance of stochastic and exact gradient descent, and showing the ability of our algorithm to extract the useful information contained in the sequence of past observations to compensate for the lack of observability at each time-step. 1
Experiments with Infinite-Horizon, Policy-Gradient Estimation
- Journal of Artificial Intelligence Research
, 2001
"... In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter & Bartlett, 2001), which computes biased estimates of the p ..."
Abstract
-
Cited by 49 (3 self)
- Add to MetaCart
In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter & Bartlett, 2001), which computes biased estimates of the performance gradient in POMDPs. The algorithm's chief advantages are that it uses only one free parameter 2 [0; 1), which has a natural interpretation in terms of bias-variance trade-off, it requires no knowledge of the underlying state, and it can be applied to infinite state, control and observation spaces. We show how the gradient estimates produced by GPOMDP can be used to perform gradient ascent, both with a traditional stochastic-gradient algorithm, and with an algorithm based on conjugate-gradients that utilizes gradient information to bracket maxima in line searches. Experimental results are presented illustrating both the theoretical results of Baxter and Bartlett (2001) on a toy problem, and practical aspects of the algorithms on a number of more realistic problems. 1.
Using Eligibility Traces to Find the Best Memoryless Policy in Partially Observable Markov Decision Processes
- In Proceedings of the Fifteenth International Conference on Machine Learning
, 1998
"... Recent research on hidden-state reinforcement learning (RL) problems has concentrated on overcoming partial observability by using memory to estimate state. However, such methods are computationally extremely expensive and thus have very limited applicability. This emphasis on state estimation has c ..."
Abstract
-
Cited by 49 (1 self)
- Add to MetaCart
Recent research on hidden-state reinforcement learning (RL) problems has concentrated on overcoming partial observability by using memory to estimate state. However, such methods are computationally extremely expensive and thus have very limited applicability. This emphasis on state estimation has come about because it has been widely observed that the presence of hidden state or partial observability renders popular RL methods such as Q-learning and Sarsa useless. However, this observation is misleading in two ways: first, the theoretical results supporting it only apply to RL algorithms that do not use eligibility traces, and second these results are worst-case results, which leaves open the possibility that there may be large classes of hidden-state problems in which RL algorithms work well without any state estimation. In this paper we show empirically that Sarsa(#), a well known family of RL algorithms that use eligibility traces, can work very well on hidden state problems that ...

