Reinforcement learning: a survey
- Journal of Artificial Intelligence Research, 1996
Cited by 1134 (21 self)
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
Reinforcement Learning
1998
Cited by 649 (7 self)
How should a robot decide what to do?
- It should plan for each move (Planning)
- It should plan for all moves and compile its results into a set of rapid reactions (Reactive Systems)
- It should learn a set of reactions by trial-and-error
Algorithms for Sequential Decision Making
1996
Cited by 158 (7 self)
Sequential decision making is a fundamental task faced by any intelligent agent in an extended interaction with its environment; it is the act of answering the question "What should I do now?" In this thesis, I show how to answer this question when "now" is one of a finite set of states, "do" is one of a finite set of actions, "should" is maximize a long-run measure of reward, and "I" is an automated planning or learning system (agent). In particular,
Efficient Learning and Planning Within the Dyna Framework
- Adaptive Behavior, 1993
Cited by 85 (3 self)
Sutton's Dyna framework provides a novel and computationally appealing way to integrate learning, planning, and reacting in autonomous agents. Examined here is a class of strategies designed to enhance the learning and planning power of Dyna systems by increasing their computational efficiency. The benefit of using these strategies is demonstrated on some simple abstract learning tasks.
1 Introduction
Many problems faced by an autonomous agent in an unknown environment can be cast in the form of reinforcement learning tasks. Recent work in this area has led to a clearer understanding of the relationship between algorithms found useful for such tasks and asynchronous approaches to dynamic programming (Bertsekas & Tsitsiklis, 1989), and this understanding has led in turn to both new results relevant to the theory of dynamic programming (Barto, Bradtke, & Singh, 1991; Watkins & Dayan, 1991; Williams & Baird, 1990) and the creation of new reinforcement learning algorithms, such as Qlearn...
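The integration of learning, planning, and reacting described in this abstract can be illustrated with a minimal sketch of basic Dyna-style Q-learning. This is the plain scheme (uniformly random replay from a learned model), not the efficiency-oriented strategies the paper itself examines. The corridor environment, the function names `dyna_q` and `corridor_step`, and all parameter values are hypothetical choices made for illustration.

```python
import random

def dyna_q(step, start, actions, episodes=30, planning_steps=10,
           alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    """Basic Dyna-style Q-learning on a deterministic environment (sketch)."""
    rng = random.Random(seed)
    Q = {}       # (state, action) -> estimated action value
    model = {}   # (state, action) -> (reward, next_state, done)

    def q(s, a):
        return Q.get((s, a), 0.0)

    def update(s, a, r, s2, done):
        # One-step Q-learning backup toward r + gamma * max_b Q(s2, b).
        target = r if done else r + gamma * max(q(s2, b) for b in actions)
        Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))

    for _ in range(episodes):
        s, done = start, False
        while not done:
            # Reacting: epsilon-greedy action choice with random tie-breaking.
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                best = max(q(s, b) for b in actions)
                a = rng.choice([b for b in actions if q(s, b) == best])
            r, s2, done = step(s, a)
            update(s, a, r, s2, done)        # learning from real experience
            model[(s, a)] = (r, s2, done)    # record the observed transition
            for _ in range(planning_steps):  # planning from simulated experience
                ps, pa = rng.choice(list(model))
                update(ps, pa, *model[(ps, pa)])
            s = s2
    return Q

# Hypothetical 5-cell corridor: states 0..4, actions -1 (left) and +1 (right),
# reward 1.0 on reaching cell 4, which ends the episode.
def corridor_step(s, a):
    s2 = min(max(s + a, 0), 4)
    done = s2 == 4
    return (1.0 if done else 0.0), s2, done

Q = dyna_q(corridor_step, start=0, actions=[-1, 1])
```

Each real step is followed by `planning_steps` simulated backups, so value information spreads back from the goal with far fewer real interactions than plain Q-learning would need on the same task.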
Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions
1993
Cited by 72 (1 self)
Consider a given value function on states of a Markov decision problem, as might result from applying a reinforcement learning algorithm. Unless this value function equals the corresponding optimal value function, at some states there will be a discrepancy, which is natural to call the Bellman residual, between what the value function specifies at that state and what is obtained by a one-step lookahead along the seemingly best action at that state using the given value function to evaluate all succeeding states. This paper derives a tight bound on how far from optimal the discounted return for a greedy policy based on the given value function will be as a function of the maximum norm magnitude of this Bellman residual. A corresponding result is also obtained for value functions defined on state-action pairs, as are used in Q-learning. One significant application of these results is to problems where a function approximator is used to learn a value function, with training of the approxi...
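The quantities named in this abstract (the Bellman residual of an imperfect value function, the greedy policy derived from it, and that policy's loss relative to optimal) can be computed directly on a small example. The sketch below uses a made-up two-state, two-action deterministic MDP; every number and function name is an assumption for illustration. The final comparison uses the bound 2γε/(1−γ), which follows from a standard contraction argument. The abstract is truncated before stating its exact result, so that form should be read as an assumption here.

```python
GAMMA = 0.9

# Toy MDP, invented for illustration.
# P[s][a] = list of (next_state, probability); R[s][a] = immediate reward.
P = {0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)], 1: [(1, 1.0)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}

def q_value(V, s, a):
    """One-step lookahead value of action a in state s under V."""
    return R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a])

def greedy_policy(V):
    """Pick the seemingly best action in each state according to V."""
    return {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}

def bellman_residual(V):
    """Max-norm gap between V and its one-step greedy lookahead."""
    return max(abs(max(q_value(V, s, a) for a in P[s]) - V[s]) for s in P)

def evaluate_policy(policy, sweeps=2000):
    """Discounted return of a fixed policy, by repeated backups."""
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        V = {s: q_value(V, s, policy[s]) for s in P}
    return V

def optimal_values(sweeps=2000):
    """Optimal value function, by value iteration."""
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        V = {s: max(q_value(V, s, a) for a in P[s]) for s in P}
    return V

V_approx = {0: 5.0, 1: 3.0}          # deliberately imperfect value function
policy = greedy_policy(V_approx)
residual = bellman_residual(V_approx)
V_pi = evaluate_policy(policy)
V_star = optimal_values()
loss = max(V_star[s] - V_pi[s] for s in P)

# Assumed form of the bound (contraction argument, see lead-in).
bound = 2 * GAMMA * residual / (1 - GAMMA)
print(policy, round(residual, 3), round(loss, 3), round(bound, 3))
```

In this example the imperfect value function makes the greedy policy stay in state 0, so its loss is large even though the residual is modest. The division by 1−γ in the bound is what allows for this.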
Reinforcement Learning with a Hierarchy of Abstract Models
- In Proceedings of the Tenth National Conference on Artificial Intelligence, 1992
Cited by 61 (8 self)
Reinforcement learning (RL) algorithms have traditionally been thought of as trial and error learning methods that use actual control experience to incrementally improve a control policy. Sutton's DYNA architecture demonstrated that RL algorithms can work as well using simulated experience from an environment model, and that the resulting computation was similar to doing one-step lookahead planning. Inspired by the literature on hierarchical planning, I propose learning a hierarchy of models of the environment that abstract temporal detail as a means of improving the scalability of RL algorithms. I present H-DYNA (Hierarchical DYNA), an extension to Sutton's DYNA architecture that is able to learn such a hierarchy of abstract models. H-DYNA differs from hierarchical planners in two ways: first, the abstract models are learned using experience gained while...
Optimal Motion Planning for Multiple Robots Having Independent Goals
- IEEE Trans. on Robotics and Automation, 1996
Cited by 58 (5 self)
This work makes two contributions to geometric motion planning for multiple robots: i) Motion plans are computed that simultaneously optimize an independent performance measure for each robot; ii) A general spectrum is defined between decoupled and centralized planning, in which we introduce coordination along independent roadmaps. By considering independent performance measures, we introduce a form of optimality that is consistent with concepts from multi-objective optimization and game theory literature. Previous multiple-robot motion planning approaches that consider optimality combine individual performance measures into a scalar criterion. As a result, these methods can fail to find many potentially useful motion plans. We present implemented, multiple-robot motion planning algorithms that are derived from the principle of optimality, for three problem classes along the spectrum between centralized and decoupled planning: i) coordination along fixed, independent paths; ii) coordin...
An Application of Reinforcement Learning to Dialogue Strategy Selection in a Spoken Dialogue System for Email
- Journal of Artificial Intelligence Research, 2000
Cited by 47 (7 self)
This paper describes a novel method by which a spoken dialogue system can learn to choose an optimal dialogue strategy from its experience interacting with human users. The method is
Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for Email
- In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, COLING/ACL 98, 1998
Cited by 47 (11 self)
This paper describes a novel method by which a dialogue agent can learn to choose an optimal dialogue strategy. While it is widely agreed that dialogue strategies should be formulated in terms of communicative intentions, there has been little work on automatically optimizing an agent's choices when there are multiple ways to realize a communicative intention. Our method is based on a combination of learning algorithms and empirical evaluation techniques. The learning component of our method is based on algorithms for reinforcement learning, such as dynamic programming and Q-learning. The empirical component uses the PARADISE evaluation framework (Walker et al., 1997) to identify the important performance factors and to provide the performance function needed by the learning algorithm. We illustrate our method with a dialogue agent named ELVIS (EmaiL Voice Interactive System), that supports access to email over the phone. We show how ELVIS can learn to choose among alternate strategies for agent initiative, for reading messages, and for summarizing email folders.

