Results 1–10 of 82
Reinforcement learning: a survey
 Journal of Artificial Intelligence Research
, 1996
"... This paper surveys the field of reinforcement learning from a computerscience perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem ..."
Abstract

Cited by 1298 (23 self)
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
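The trade-off between exploration and exploitation that the survey highlights can be sketched with a minimal epsilon-greedy bandit. This is an illustration, not an example from the paper; the arm probabilities, step count, and epsilon value are invented for the sketch.

```python
import random

def epsilon_greedy_bandit(arm_probs, steps=5000, epsilon=0.1, seed=0):
    """Estimate arm values online; explore with probability epsilon,
    otherwise exploit the current best estimate."""
    rng = random.Random(seed)
    n_arms = len(arm_probs)
    values = [0.0] * n_arms   # running mean reward per arm
    counts = [0] * n_arms
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                        # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean
        total_reward += reward
    return values, total_reward

# Hypothetical 3-armed bandit with success probabilities 0.2, 0.5, 0.8
values, total = epsilon_greedy_bandit([0.2, 0.5, 0.8])
```

With a small epsilon the agent mostly exploits its current estimates, yet still samples every arm often enough to identify the best one over time.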
Algorithms for Sequential Decision Making
, 1996
"... Sequential decision making is a fundamental task faced by any intelligent agent in an extended interaction with its environment; it is the act of answering the question "What should I do now?" In this thesis, I show how to answer this question when "now" is one of a finite set of states, "do" is one ..."
Abstract

Cited by 175 (8 self)
Sequential decision making is a fundamental task faced by any intelligent agent in an extended interaction with its environment; it is the act of answering the question "What should I do now?" In this thesis, I show how to answer this question when "now" is one of a finite set of states, "do" is one of a finite set of actions, "should" is maximize a long-run measure of reward, and "I" is an automated planning or learning system (agent). In particular,
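The finite-state, finite-action setting described above is exactly where value iteration applies. The sketch below runs it on a hypothetical two-state MDP; the transition model and rewards are invented for illustration.

```python
def value_iteration(n_states, n_actions, P, R, gamma=0.9, tol=1e-8):
    """P[s][a] is a list of (prob, next_state) pairs; R[s][a] is the
    immediate reward. Iterates the Bellman optimality backup to a fixed point."""
    V = [0.0] * n_states
    while True:
        V_new = [
            max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(n_actions))
            for s in range(n_states)
        ]
        if max(abs(V_new[s] - V[s]) for s in range(n_states)) < tol:
            return V_new
        V = V_new

# Hypothetical two-state, two-action MDP: action 1 moves to (or stays in)
# state 1; staying in state 1 is the only rewarding choice.
P = [
    [[(1.0, 0)], [(1.0, 1)]],   # state 0: action 0 stays, action 1 moves to 1
    [[(1.0, 0)], [(1.0, 1)]],   # state 1: action 0 returns to 0, action 1 stays
]
R = [[0.0, 0.0], [0.0, 1.0]]
V = value_iteration(2, 2, P, R)
```

For gamma = 0.9 the fixed point is V = [9, 10]: state 1 earns 1 per step forever (1 / (1 - 0.9) = 10), and state 0 reaches it one discounted step later.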
Recent advances in hierarchical reinforcement learning
, 2003
"... A preliminary unedited version of this paper was incorrectly published as part of Volume ..."
Abstract

Cited by 161 (23 self)
A preliminary unedited version of this paper was incorrectly published as part of Volume
Machine-Learning Research: Four Current Directions
"... Machine Learning research has been making great progress in many directions. This article summarizes four of these directions and discusses some current open problems. The four directions are (a) improving classification accuracy by learning ensembles of classifiers, (b) methods for scaling up super ..."
Abstract

Cited by 114 (1 self)
Machine Learning research has been making great progress in many directions. This article summarizes four of these directions and discusses some current open problems. The four directions are (a) improving classification accuracy by learning ensembles of classifiers, (b) methods for scaling up supervised learning algorithms, (c) reinforcement learning, and (d) learning complex stochastic models.
Hierarchical Control and Learning for Markov Decision Processes
, 1998
"... This dissertation investigates the use of hierarchy and problem decomposition as a means of solving large, stochastic, sequential decision problems. These problems are framed as Markov decision problems (MDPs). The new technical content of this dissertation begins with a discussion of the concept o ..."
Abstract

Cited by 108 (2 self)
This dissertation investigates the use of hierarchy and problem decomposition as a means of solving large, stochastic, sequential decision problems. These problems are framed as Markov decision problems (MDPs). The new technical content of this dissertation begins with a discussion of the concept of temporal abstraction. Temporal abstraction is shown to be equivalent to the transformation of a policy defined over a region of an MDP to an action in a semi-Markov decision problem (SMDP). Several algorithms are presented for performing this transformation efficiently. This dissertation introduces the HAM method for generating hierarchical, temporally abstract actions. This method permits the partial specification of abstract actions in a way that corresponds to an abstract plan or strategy. Abstr...
A Generalized Reinforcement-Learning Model: Convergence and Applications
 In Proceedings of the 13th International Conference on Machine Learning
, 1996
"... Reinforcement learning is the process by which an autonomous agent uses its experience interacting with an environment to improve its behavior. The Markov decision process (mdp) model is a popular way of formalizing the reinforcementlearning problem, but it is by no means the only way. In this pap ..."
Abstract

Cited by 44 (5 self)
Reinforcement learning is the process by which an autonomous agent uses its experience interacting with an environment to improve its behavior. The Markov decision process (MDP) model is a popular way of formalizing the reinforcement-learning problem, but it is by no means the only way. In this paper, we show how many of the important theoretical results concerning reinforcement learning in MDPs extend to a generalized MDP model that includes MDPs, two-player games and MDPs under a worst-case optimality criterion as special cases. The basis of this extension is a stochastic-approximation theorem that reduces asynchronous convergence to synchronous convergence. 1 INTRODUCTION Reinforcement learning is the process by which an agent improves its behavior in an environment via experience. A reinforcement-learning scenario is defined by the experience presented to the agent at each step, and the criterion for evaluating the agent's behavior. One particularly well-studied reinforcement-le...
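The flavor of the generalized model can be sketched by parameterizing value iteration over the operator that summarizes next-state values: an expectation recovers the ordinary MDP backup, while a minimum gives a worst-case criterion. The tiny model below is a hypothetical illustration under those assumptions, not an example taken from the paper.

```python
def generalized_vi(n_states, actions, model, R, gamma, outcome_op, tol=1e-10):
    """model[s][a]: list of (weight, next_state) pairs. outcome_op summarizes
    the weighted next-state values; swapping it changes the optimality
    criterion while the iteration itself stays the same."""
    V = [0.0] * n_states
    while True:
        V_new = [
            max(R[s][a] + gamma * outcome_op([(w, V[s2]) for w, s2 in model[s][a]])
                for a in actions)
            for s in range(n_states)
        ]
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new
        V = V_new

def expectation(pairs):            # ordinary MDP backup
    return sum(w * v for w, v in pairs)

def worst_case(pairs):             # adversarial / worst-case backup
    return min(v for _, v in pairs)

# Hypothetical model: state 0's single action is a coin flip between staying
# and reaching absorbing state 1, which pays 1 per step.
model = [
    [[(0.5, 0), (0.5, 1)]],
    [[(1.0, 1)]],
]
R = [[0.0], [1.0]]
V_mdp = generalized_vi(2, range(1), model, R, 0.9, expectation)
V_min = generalized_vi(2, range(1), model, R, 0.9, worst_case)
```

Under the expectation operator state 0 is worth about 8.18, since the coin flip eventually reaches the rewarding state; under the worst-case operator the adversary always keeps the agent in state 0, whose value is therefore 0.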
Self-Improving Factory Simulation using Continuous-time Average-Reward Reinforcement Learning
 Proceedings of the 14th International Conference on Machine Learning
, 1997
"... Many factory optimization problems, from inventory control to scheduling and reliability, can be formulated as continuoustime Markov decision processes. A primary goal in such problems is to find a gainoptimal policy that minimizes the longrun average cost. This paper describes a new averagerewar ..."
Abstract

Cited by 44 (11 self)
Many factory optimization problems, from inventory control to scheduling and reliability, can be formulated as continuous-time Markov decision processes. A primary goal in such problems is to find a gain-optimal policy that minimizes the long-run average cost. This paper describes a new average-reward algorithm called SMART for finding gain-optimal policies in continuous-time semi-Markov decision processes. The paper presents a detailed experimental study of SMART on a large unreliable production inventory problem. SMART outperforms two well-known reliability heuristics from industrial engineering. A key feature of this study is the integration of the reinforcement-learning algorithm directly into two commercial discrete-event simulation packages, ARENA and CSIM, paving the way for this approach to be applied to many other factory optimization problems for which there already exist simulation models. 1 Introduction Many problems in industrial design and manufacturing, such as schedulin...
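The gain-optimality idea can be sketched with a schematic average-reward Q-update in the spirit of SMART. This reward-maximizing toy (a single-state SMDP with two invented actions and sojourn times) is an illustrative assumption, not the paper's algorithm, which targets cost minimization on full simulation models.

```python
import random

def smart_style_learning(arms, steps=20000, alpha=0.05, epsilon=0.1, seed=0):
    """Schematic average-reward Q-learning for a one-state SMDP: each action
    a yields reward r(a) over sojourn time tau(a), and the gain rho is
    estimated from reward per unit time on greedy (non-exploratory) steps."""
    rng = random.Random(seed)
    Q = {a: 0.0 for a in arms}
    rho, total_r, total_t = 0.0, 0.0, 0.0
    for _ in range(steps):
        greedy = max(Q, key=Q.get)
        a = rng.choice(list(arms)) if rng.random() < epsilon else greedy
        r, tau = arms[a]
        # Average-reward update: charge rho * tau against the sojourn's reward
        Q[a] += alpha * (r - rho * tau + max(Q.values()) - Q[a])
        if a == greedy:                # update the rate on greedy steps only
            total_r += r
            total_t += tau
            rho = total_r / total_t
    return Q, rho

# Hypothetical machine: "fast" earns 1 per 1 time unit (rate 1.0),
# "slow" earns 3 per 4 time units (rate 0.75)
Q, rho = smart_style_learning({"fast": (1.0, 1.0), "slow": (3.0, 4.0)})
```

Although "slow" pays more per decision, "fast" has the higher reward rate, so the learned Q-values and the estimated gain rho should come to favor it.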
Learning and Value Function Approximation in Complex Decision Processes
, 1998
"... In principle, a wide variety of sequential decision problems  ranging from dynamic resource allocation in telecommunication networks to financial risk management  can be formulated in terms of stochastic control and solved by the algorithms of dynamic programming. Such algorithms compute and sto ..."
Abstract

Cited by 36 (4 self)
In principle, a wide variety of sequential decision problems, ranging from dynamic resource allocation in telecommunication networks to financial risk management, can be formulated in terms of stochastic control and solved by the algorithms of dynamic programming. Such algorithms compute and store a value function, which evaluates expected future reward as a function of current state. Unfortunately, exact computation of the value function typically requires time and storage that grow proportionately with the number of states, and consequently, the enormous state spaces that arise in practical applications render the algorithms intractable. In this thesis, we study tractable methods that approximate the value function. Our work builds on research in an area of artificial intelligence known as reinforcement learning. A point of focus of this thesis is temporal-difference learning, a stochastic algorithm inspired to some extent by phenomena observed in animal behavior. Given a selection of...
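Temporal-difference learning, the thesis's point of focus, can be sketched in tabular form on a classic random-walk prediction task; the chain size, step size, and episode count below are illustrative choices, not taken from the thesis.

```python
import random

def td0_random_walk(n_states=5, episodes=20000, alpha=0.05, seed=0):
    """Tabular TD(0) on a random walk over states 0..n_states-1. States 0 and
    n_states-1 are terminal; reward 1 is received only on exiting right."""
    rng = random.Random(seed)
    V = [0.0] * n_states          # terminal values stay 0 by convention
    for _ in range(episodes):
        s = n_states // 2         # every episode starts in the middle
        while 0 < s < n_states - 1:
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == n_states - 1 else 0.0
            V[s] += alpha * (r + V[s_next] - V[s])   # TD(0) update, gamma = 1
            s = s_next
    return V

V = td0_random_walk()
```

The update bootstraps from the next state's current estimate rather than waiting for the episode's outcome. For this 5-state walk the true values of the three nonterminal states are 0.25, 0.5, and 0.75, and the estimates settle near them.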
A unified analysis of value-function-based reinforcement-learning algorithms. Neural Computation
, 1997
"... Reinforcement learning is the problem of generating optimal behavior in a sequential decisionma.king environment given the opportunity of interacting,vith it. Many algorithms for solving reinforcementlearning problems work by computing improved estimates of the optimal value function. \Ve extend p ..."
Abstract

Cited by 32 (7 self)
Reinforcement learning is the problem of generating optimal behavior in a sequential decision-making environment given the opportunity of interacting with it. Many algorithms for solving reinforcement-learning problems work by computing improved estimates of the optimal value function. We extend prior analyses of reinforcement-learning algorithms and present a powerful new theorem that can provide a unified analysis of value-function-based reinforcement-learning algorithms. The usefulness of the theorem lies in how it allows the convergence of a complex asynchronous reinforcement-learning algorithm to be proven by verifying that a simpler synchronous algorithm converges. We illustrate the application of the theorem by analyzing the convergence of Q-learning, model-based reinforcement learning, Q-learning with multi-state updates, Q-learning for Markov games, and risk-sensitive reinforcement learning.
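A minimal tabular Q-learning sketch shows the kind of asynchronous update the theorem covers: each step updates one state-action estimate toward a bootstrapped target. The two-state deterministic chain here is a hypothetical example, not one from the paper.

```python
import random

def q_learning(n_states, n_actions, transition, gamma=0.9, steps=30000,
               alpha=0.1, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration. The asynchronous
    updates converge toward the same fixed point that a synchronous
    value-iteration sweep would compute."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda x: Q[s][x])
        s2, r = transition(s, a)
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])  # one-step backup
        s = s2
    return Q

# Hypothetical 2-state chain: action 1 moves right (or stays right);
# reward 1 only for choosing action 1 while already in state 1.
def transition(s, a):
    s2 = 1 if a == 1 else 0
    return s2, (1.0 if (s == 1 and a == 1) else 0.0)

Q = q_learning(2, 2, transition)
```

For gamma = 0.9 the exact fixed point is Q(1,1) = 10 and Q(0,1) = 9, matching the value-iteration solution for the same chain, so the greedy policy picks action 1 in both states.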
Auto-exploratory Average Reward Reinforcement Learning
 Artificial Intelligence
, 1996
"... We introduce a modelbased average reward Reinforcement Learning method called Hlearning and compare it with its discounted counterpart, Adaptive RealTime Dynamic Programming, in a simulated robot scheduling task. We also introduce an extension to Hlearning, which automatically explores the unexp ..."
Abstract

Cited by 29 (8 self)
We introduce a model-based average-reward Reinforcement Learning method called H-learning and compare it with its discounted counterpart, Adaptive Real-Time Dynamic Programming, in a simulated robot scheduling task. We also introduce an extension to H-learning, which automatically explores the unexplored parts of the state space, while always choosing greedy actions with respect to the current value function. We show that this "Auto-exploratory H-learning" performs better than the original H-learning under previously studied exploration methods such as random, recency-based, or counter-based exploration. Introduction Reinforcement Learning (RL) is the study of learning agents that improve their performance at some task by receiving rewards and punishments from the environment. Most approaches to reinforcement learning, including Q-learning (Watkins and Dayan 92) and Adaptive Real-Time Dynamic Programming (ARTDP) (Barto, Bradtke, & Singh 95), optimize the total discounted reward the ...