Results 1  10
of
309
Reinforcement Learning I: Introduction
, 1998
"... In which we try to give a basic intuitive sense of what reinforcement learning is and how it differs and relates to other fields, e.g., supervised learning and neural networks, genetic algorithms and artificial life, control theory. Intuitively, RL is trial and error (variation and selection, search ..."
Abstract

Cited by 5500 (120 self)
 Add to MetaCart
In which we try to give a basic intuitive sense of what reinforcement learning is and how it differs and relates to other fields, e.g., supervised learning and neural networks, genetic algorithms and artificial life, control theory. Intuitively, RL is trial and error (variation and selection, search) plus learning (association, memory). We argue that RL is the only field that seriously addresses the special features of the problem of learning from interaction to achieve longterm goals.
Reinforcement learning: a survey
 Journal of Artificial Intelligence Research
, 1996
"... This paper surveys the field of reinforcement learning from a computerscience perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem ..."
Abstract

Cited by 1690 (26 self)
 Add to MetaCart
This paper surveys the field of reinforcement learning from a computerscience perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trialanderror interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
Cognitive networks
 in Proc. of IEEE DySPAN 2005
, 2005
"... Abstract — This paper presents a definition and framework for a novel type of adaptive data network: the cognitive network. In a cognitive network, the collection of elements that make up the network observes network conditions and then, using prior knowledge gained from previous interactions with t ..."
Abstract

Cited by 1090 (7 self)
 Add to MetaCart
(Show Context)
Abstract — This paper presents a definition and framework for a novel type of adaptive data network: the cognitive network. In a cognitive network, the collection of elements that make up the network observes network conditions and then, using prior knowledge gained from previous interactions with the network, plans, decides and acts on this information. Cognitive networks are different from other “intelligent ” communication technologies because these actions are taken with respect to the endtoend goals of a data flow. In addition to the cognitive aspects of the network, a specification language is needed to translate the user’s endtoend goals into a form understandable by the cognitive process. The cognitive network also depends on a Software Adaptable Network that has both an external interface accessible to the cognitive network and network status sensors. These devices are used to provide control and feedback. The paper concludes by presenting a simple case study to illustrate a cognitive network and its framework. I.
Ant algorithms for discrete optimization
 ARTIFICIAL LIFE
, 1999
"... This article presents an overview of recent work on ant algorithms, that is, algorithms for discrete optimization that took inspiration from the observation of ant colonies’ foraging behavior, and introduces the ant colony optimization (ACO) metaheuristic. In the first part of the article the basic ..."
Abstract

Cited by 475 (42 self)
 Add to MetaCart
(Show Context)
This article presents an overview of recent work on ant algorithms, that is, algorithms for discrete optimization that took inspiration from the observation of ant colonies’ foraging behavior, and introduces the ant colony optimization (ACO) metaheuristic. In the first part of the article the basic biological findings on real ants are reviewed and their artificial counterparts as well as the ACO metaheuristic are defined. In the second part of the article a number of applications of ACO algorithms to combinatorial optimization and routing in communications networks are described. We conclude with a discussion of related work and of some of the most important aspects of the ACO metaheuristic.
Simple statistical gradientfollowing algorithms for connectionist reinforcement learning
 Machine Learning
, 1992
"... Abstract. This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms, called REINFORCE algorithms, are shown to make weight adjustments in a direction that lies along the gradient of expected reinfor ..."
Abstract

Cited by 444 (0 self)
 Add to MetaCart
Abstract. This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms, called REINFORCE algorithms, are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediatereinforcement tasks and certain limited forms of delayedreinforcement tasks, and they do this without explicitly computing gradient estimates or even storing information from which such estimates could be computed. Specific examples of such algorithms are presented, some of which bear a close relationship to certain existing algorithms while others are novel but potentially interesting in their own right. Also given are results that show how such algorithms can be naturally integrated with backpropagation. We close with a brief discussion of a number of additional issues surrounding the use of such algorithms, including what is known about their limiting behaviors as well as further considerations that might be used to help develop similar but potentially more powerful reinforcement learning algorithms.
Learning to coordinate behaviors
 In Proceedings of AAAI90
, 1990
"... We describe an algorithm which allows a behaviorbased robot to learn on the basis of positive and negative feedback when to activate its behaviors. In accordance with the philosophy of behaviorbased robots, the algorithm is completely distributed: each of the behaviors independently tries to find ..."
Abstract

Cited by 231 (3 self)
 Add to MetaCart
(Show Context)
We describe an algorithm which allows a behaviorbased robot to learn on the basis of positive and negative feedback when to activate its behaviors. In accordance with the philosophy of behaviorbased robots, the algorithm is completely distributed: each of the behaviors independently tries to find out (i) whether it is relevant (ie. whether it is at all correlated to positive feedback) and (ii) what the conditions are under which it becomes reliable (i.e. the conditions under which it maximizes the probability of receiving positive feedback and minimizes the probability of receiving negative feedback). The algorithm has been tested successfully on an autonomous 6legged robot which had to learn how to coordinate its legs so as to walk forward. Situation of the Problem Since 1985, the MIT Mobile Robot group has advocated a radically different architecture for autonomous intelligent agents (Brooks, 1986). Instead of decomposing the architecture into functional modules, such as perception, modeling, and planning (figure 1), the architecture is decomposed into taskachieving modules, also called behaviors (figure 2). This novel approach has already demonstrated to be very successful and similar approaches have become more
Task Decomposition Through Competition in a Modular Connectionist Architecture
 COGNITIVE SCIENCE
, 1990
"... A novel modular connectionist architecture is presented in which the networks composing the architecture compete to learn the training patterns. As a result of the competition, different networks learn different training patterns and, thus, learn to compute different functions. The architecture pe ..."
Abstract

Cited by 211 (7 self)
 Add to MetaCart
A novel modular connectionist architecture is presented in which the networks composing the architecture compete to learn the training patterns. As a result of the competition, different networks learn different training patterns and, thus, learn to compute different functions. The architecture performs task decomposition in the sense that it learns to partition a task into two or more functionally independent vii tasks and allocates distinct networks to learn each task. In addition, the architecture tends to allocate to each task the network whose topology is most appropriate to that task, and tends to allocate the same network to similar tasks and distinct networks to dissimilar tasks. Furthermore, it can be easily modified so as to...
Learning and Sequential Decision Making
 LEARNING AND COMPUTATIONAL NEUROSCIENCE
, 1989
"... In this report we show how the class of adaptive prediction methods that Sutton called "temporal difference," or TD, methods are related to the theory of squential decision making. TD methods have been used as "adaptive critics" in connectionist learning systems, and have been pr ..."
Abstract

Cited by 204 (11 self)
 Add to MetaCart
(Show Context)
In this report we show how the class of adaptive prediction methods that Sutton called "temporal difference," or TD, methods are related to the theory of squential decision making. TD methods have been used as "adaptive critics" in connectionist learning systems, and have been proposed as models of animal learning in classical conditioning experiments. Here we relate TD methods to decision tasks formulated in terms of a stochastic dynamical system whose behavior unfolds over time under the influence of a decision maker's actions. Strategies are sought for selecting actions so as to maximize a measure of longterm payoff gain. Mathematically, tasks such as this can be formulated as Markovian decision problems, and numerous methods have been proposed for learning how to solve such problems. We show how a TD method can be understood as a novel synthesis of concepts from the theory of stochastic dynamic programming, which comprises the standard method for solving such tasks when a model of the dynamical system is available, and the theory of parameter estimation, which provides the appropriate context for studying learning rules in the form of equations for updating associative strengths in behavioral models, or connection weights in connectionist networks. Because this report is oriented primarily toward the nonengineer interested in animal learning, it presents tutorials on stochastic sequential decision tasks, stochastic dynamic programming, and parameter estimation.
Gradient Descent for General Reinforcement Learning
 In Advances in Neural Information Processing Systems 11
, 1998
"... A simple learning rule is derived, the VAPS algorithm, which can be instantiated to generate a wide range of new reinforcementlearning algorithms. These algorithms solve a number of open problems, define several new approaches to reinforcement learning, and unify different approaches to reinforcemen ..."
Abstract

Cited by 142 (0 self)
 Add to MetaCart
A simple learning rule is derived, the VAPS algorithm, which can be instantiated to generate a wide range of new reinforcementlearning algorithms. These algorithms solve a number of open problems, define several new approaches to reinforcement learning, and unify different approaches to reinforcement learning under a single theory. These algorithms all have guaranteed convergence, and include modifications of several existing algorithms that were known to fail to converge on simple MDPs. These include Q learning, SARSA, and advantage learning. In addition to these valuebased algorithms it also generates pure policysearch reinforcementlearning algorithms, which learn optimal policies without learning a value function. In addition, it allows policysearch and valuebased algorithms to be combined, thus unifying two very different approaches to reinforcement learning into a single Value and Policy Search (VAPS) algorithm. And these algorithms converge for POMDPs without requiring a ...
Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
, 1996
"... This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dyna ..."
Abstract

Cited by 129 (14 self)
 Add to MetaCart
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called ndiscountoptimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gainoptimal policies that maximize average reward, none of them can reliably filter these to produce biasoptimal (or Toptimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of Rlearning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of Rlearning is carried out to test its dependence on learning rates and exploration levels. The results suggest that Rlearning is quite sensitive to exploration strategies, and can fall into suboptimal limit cycles. The performance of Rlearning is also compared with that of Qlearning, the best studied discounted RL method. Here, the results suggest that Rlearning can be finetuned to give better performance than Qlearning in both domains.