Results 1-10 of 298
Reinforcement learning: a survey
Journal of Artificial Intelligence Research, 1996
Cited by 1298 (23 self)
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
Markov games as a framework for multi-agent reinforcement learning
In Proceedings of the Eleventh International Conference on Machine Learning, 1994
Cited by 500 (10 self)
In the Markov decision process (MDP) formalization of reinforcement learning, a single adaptive agent interacts with an environment defined by a probabilistic transition function. In this solipsistic view, secondary agents can only be part of the environment and are therefore fixed in their behavior. The framework of Markov games allows us to widen this view to include multiple adaptive agents with interacting or competing goals. This paper considers a step in this direction in which exactly two agents with diametrically opposed goals share an environment. It describes a Q-learning-like algorithm for finding optimal policies and demonstrates its application to a simple two-player game in which the optimal policy is probabilistic.
Efficient Algorithms for Globally Optimal Trajectories
IEEE Transactions on Automatic Control, 1995
Cited by 297 (1 self)
We present serial and parallel algorithms for solving
Residual Algorithms: Reinforcement Learning with Function Approximation
In Proceedings of the Twelfth International Conference on Machine Learning, 1995
Cited by 237 (5 self)
A number of reinforcement learning algorithms have been developed that are guaranteed to converge to the optimal solution when used with lookup tables. It is shown, however, that these algorithms can easily become unstable when implemented directly with a general function-approximation system, such as a sigmoidal multilayer perceptron, a radial-basis-function system, a memory-based learning system, or even a linear function-approximation system. A new class of algorithms, residual gradient algorithms, is proposed, which perform gradient descent on the mean squared Bellman residual, guaranteeing convergence. It is shown, however, that they may learn very slowly in some cases. A larger class of algorithms, residual algorithms, is proposed that has the guaranteed convergence of the residual gradient algorithms, yet can retain the fast learning speed of direct algorithms. In fact, both direct and residual gradient algorithms are shown to be special cases of residual algorithms, and it is s...
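As a rough illustration of the idea in this abstract, the sketch below performs one weight update for a linear value function on a single deterministic transition. The function names, the feature vectors, and the mixing parameter `phi` are assumptions made for this example and are not taken from the listing: `phi = 0` is meant to correspond to a direct update and `phi = 1` to gradient descent on the squared Bellman residual.

```python
def bellman_residual(w, feat_s, feat_next, reward, gamma):
    """Residual r + gamma * V(s') - V(s) for a linear V(s) = w . feat(s)."""
    v_s = sum(wi * fi for wi, fi in zip(w, feat_s))
    v_next = sum(wi * fi for wi, fi in zip(w, feat_next))
    return reward + gamma * v_next - v_s


def residual_update(w, feat_s, feat_next, reward, gamma=0.9, alpha=0.1, phi=1.0):
    """One update on a deterministic transition (hypothetical helper).

    phi = 0.0: direct update, only V(s) is moved toward the target.
    phi = 1.0: residual gradient update, i.e. a gradient step on
               0.5 * residual**2 with respect to the weights.
    """
    delta = bellman_residual(w, feat_s, feat_next, reward, gamma)
    return [
        wi - alpha * delta * (phi * gamma * fn - fs)
        for wi, fs, fn in zip(w, feat_s, feat_next)
    ]


if __name__ == "__main__":
    w = [0.0, 0.0]
    s, s_next = [1.0, 0.0], [0.0, 1.0]
    for phi in (0.0, 1.0):
        w_new = residual_update(w, s, s_next, reward=1.0, phi=phi)
        print(phi, w_new, bellman_residual(w_new, s, s_next, 1.0, 0.9))
```

Intermediate values of `phi` blend the two updates, which is one way to read the abstract's statement that both are special cases of a larger class.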
Near-optimal reinforcement learning in polynomial time
Machine Learning, 1998
Cited by 237 (3 self)
We present new algorithms for reinforcement learning, and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the Exploration-Exploitation tradeoff.
Learning policies for partially observable environments: Scaling up
1995
Cited by 234 (11 self)
Partially observable Markov decision processes (pomdp's) model decision problems in which an agent tries to maximize its reward in the face of limited and/or noisy sensor feedback. While the study of pomdp's is motivated by a need to address realistic problems, existing techniques for finding optimal behavior do not appear to scale well and have been unable to find satisfactory policies for problems with more than a dozen states. After a brief review of pomdp's, this paper discusses several simple solution methods and shows that all are capable of finding near-optimal policies for a selection of extremely small pomdp's taken from the learning literature. In contrast, we show that none are able to solve a slightly larger and noisier problem based on robot navigation. We find that a combination of two novel approaches performs well on these problems and suggest methods for scaling to even larger and more complicated domains. 1 Introduction Mobile robots must act on the basis of thei...
Convergence of Stochastic Iterative Dynamic Programming Algorithms
Neural Computation, 1994
Cited by 207 (8 self)
Increasing attention has recently been paid to algorithms based on dynamic programming (DP) due to the suitability of DP for learning problems involving control. In stochastic environments where the system being controlled is only incompletely known, however, a unifying theoretical account of the behavior of these methods has been missing. In this paper we relate DP-based learning algorithms to powerful techniques of stochastic approximation via a new convergence theorem, enabling us to establish a class of convergent algorithms to which both TD(λ) and Q-learning belong.
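For readers unfamiliar with the update this abstract refers to, here is a minimal sketch of one tabular Q-learning step written as a stochastic-approximation update: the old estimate is moved a fraction `alpha` of the way toward a sampled target. The table layout (a list of per-state lists) and the function name are assumptions of this example, not details from the paper.

```python
def q_learning_update(q, state, action, reward, next_state, alpha, gamma=0.9):
    """Apply one tabular Q-learning step in place and return the table.

    q[s][a] holds the current estimate for taking action a in state s.
    The sampled target is reward + gamma * max_a' q[next_state][a'].
    """
    target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])
    return q


if __name__ == "__main__":
    # Two states, two actions; state 1 already has some value estimates.
    q = [[0.0, 0.0], [1.0, 2.0]]
    q_learning_update(q, state=0, action=1, reward=1.0, next_state=1, alpha=0.5)
    print(q[0][1])
```

Convergence results of the stochastic-approximation kind usually assume step sizes that shrink over time (for example `alpha = 1 / n` for the n-th visit to a state-action pair); the fixed `alpha` above is only for readability.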
On the Optimality of Solutions of the Max-Product Belief Propagation Algorithm in Arbitrary Graphs
2001
Cited by 185 (15 self)
Graphical models, such as Bayesian networks and Markov random fields, represent statistical dependencies of variables by a graph. The max-product "belief propagation" algorithm is a local-message-passing algorithm on this graph that is known to converge to a unique fixed point when the graph is a tree. Furthermore, when the graph is a tree, the assignment based on the fixed point yields the most probable a posteriori (MAP) values of the unobserved variables given the observed ones. Recently, good
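The tree case described in this abstract can be illustrated on the simplest tree, a chain. The sketch below passes max-product messages left to right and backtracks to read off an assignment, then compares it with exhaustive search. The potential tables, function names, and the restriction to a chain are assumptions of this example; it does not cover the loopy graphs the paper is about.

```python
from itertools import product


def max_product_chain(unary, pairwise):
    """MAP assignment on a chain via max-product messages.

    unary[i][x]       potential of variable i taking value x
    pairwise[i][x][y] potential linking variable i (value x) to i+1 (value y)
    """
    n, k = len(unary), len(unary[0])
    msg = [list(unary[0])]  # msg[i][y]: best score of a prefix ending in x_i = y
    back = []               # back[i-1][y]: best x_{i-1} given x_i = y
    for i in range(1, n):
        scores, argbest = [], []
        for y in range(k):
            cands = [msg[i - 1][x] * pairwise[i - 1][x][y] for x in range(k)]
            best_x = max(range(k), key=cands.__getitem__)
            scores.append(cands[best_x] * unary[i][y])
            argbest.append(best_x)
        msg.append(scores)
        back.append(argbest)
    assignment = [max(range(k), key=msg[-1].__getitem__)]
    for argbest in reversed(back):
        assignment.append(argbest[assignment[-1]])
    return assignment[::-1]


def brute_force_map(unary, pairwise):
    """Exhaustive search over all assignments, for checking small cases."""
    n, k = len(unary), len(unary[0])

    def score(xs):
        s = 1.0
        for i, x in enumerate(xs):
            s *= unary[i][x]
        for i in range(n - 1):
            s *= pairwise[i][xs[i]][xs[i + 1]]
        return s

    return list(max(product(range(k), repeat=n), key=score))


if __name__ == "__main__":
    unary = [[0.6, 0.4], [0.5, 0.5], [0.3, 0.7]]
    pairwise = [[[0.9, 0.1], [0.2, 0.8]]] * 2
    print(max_product_chain(unary, pairwise), brute_force_map(unary, pairwise))
```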
Correctness of Local Probability Propagation in Graphical Models with Loops
2000
Cited by 178 (9 self)
This article analyzes the behavior of local propagation rules in graphical models with a loop.