Reinforcement learning: a survey
Journal of Artificial Intelligence Research, 1996
Cited by 1693 (27 self)
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
Simple statistical gradient-following algorithms for connectionist reinforcement learning
Machine Learning, 1992
Cited by 447 (0 self)
This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms, called REINFORCE algorithms, are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates or even storing information from which such estimates could be computed. Specific examples of such algorithms are presented, some of which bear a close relationship to certain existing algorithms while others are novel but potentially interesting in their own right. Also given are results that show how such algorithms can be naturally integrated with backpropagation. We close with a brief discussion of a number of additional issues surrounding the use of such algorithms, including what is known about their limiting behaviors as well as further considerations that might be used to help develop similar but potentially more powerful reinforcement learning algorithms.
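As a rough illustration of the kind of update this abstract describes, the sketch below trains a single Bernoulli-logistic unit with a REINFORCE-style rule, w += alpha * (r - baseline) * (y - p), where (y - p) is the derivative of the log-probability of the sampled output with respect to w. The two-outcome reward task, the reward probabilities, the step sizes and the running-average baseline are all assumptions chosen for the example, not details taken from the article.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_reinforce_unit(reward_probs=(0.2, 0.8), steps=5000, alpha=0.1, seed=0):
    """Train one stochastic unit on an immediate-reinforcement task.

    reward_probs[y] is the (assumed) probability of reward 1 when the
    unit emits output y. Returns the final probability of emitting 1.
    """
    rng = random.Random(seed)
    w = 0.0          # the unit's single weight (a bias)
    baseline = 0.0   # running average of reward, used as reinforcement baseline
    for _ in range(steps):
        p = sigmoid(w)
        y = 1 if rng.random() < p else 0                    # stochastic output
        r = 1.0 if rng.random() < reward_probs[y] else 0.0  # immediate reward
        # (y - p) is d/dw of log Pr(y) for a Bernoulli-logistic unit.
        w += alpha * (r - baseline) * (y - p)
        baseline += 0.05 * (r - baseline)
    return sigmoid(w)
```

Calling `train_reinforce_unit()` should drive the unit toward the output with the higher reward probability; no gradient of expected reward is ever computed explicitly, only the sampled output, its probability and the received reward.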
Learning to Cooperate via Policy Search
2000
Cited by 141 (4 self)
Cooperative games are those in which both agents share the same payoff structure. Value-based reinforcement-learning algorithms, such as variants of Q-learning, have been applied to learning cooperative games, but they only apply when the game state is completely observable to both agents. Policy search methods are a reasonable alternative to value-based methods for partially observable environments. In this paper, we provide a gradient-based distributed policy-search method for cooperative games and compare the notion of local optimum to that of Nash equilibrium. We demonstrate the effectiveness of this method experimentally in a small, partially observable simulated soccer domain.
1 INTRODUCTION
The interaction of decision makers who share an environment is traditionally studied in game theory and economics. The game theoretic formalism is very general, and analyzes the problem in terms of solution concepts such as Nash equilibrium [12], but usually works under the assu...
Learning Policies with External Memory
2001
Cited by 50 (8 self)
In order for an agent to perform well in partially observable domains, it is usually necessary for actions to depend on the history of observations. In this paper, we explore a stigmergic approach, in which the agent's actions include the ability to set and clear bits in an external memory, and the external memory is included as part of the input to the agent. In this case, we need to learn a reactive policy in a highly non-Markovian domain. We explore two algorithms: SARSA(λ), which has had empirical success in partially observable domains, and VAPS, a new algorithm due to Baird and Moore, with convergence guarantees in partially observable domains. We compare the performance of these two algorithms on benchmark problems.
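The external-memory idea can be sketched as a wrapper that adds "set bit" and "clear bit" actions and appends the memory contents to each observation. Everything named here (`ExternalMemoryWrapper`, `AliasedEnv`, the `reset`/`step` interface, and the choice that memory actions yield zero reward and do not advance the environment) is a hypothetical illustration, not the construction used in the paper.

```python
class ExternalMemoryWrapper:
    """Augments an environment with writable memory bits (illustrative only)."""

    def __init__(self, env, n_bits=1):
        self.env = env
        self.n_bits = n_bits
        self.memory = [0] * n_bits
        self._last_obs = None

    def actions(self, env_actions):
        # Original actions plus one set and one clear action per memory bit.
        acts = [("env", a) for a in env_actions]
        for i in range(self.n_bits):
            acts.append(("set", i))
            acts.append(("clear", i))
        return acts

    def _augment(self, obs):
        # The agent's input is the raw observation plus the memory contents.
        return (obs, tuple(self.memory))

    def reset(self):
        self.memory = [0] * self.n_bits
        self._last_obs = self.env.reset()
        return self._augment(self._last_obs)

    def step(self, action):
        kind, arg = action
        if kind in ("set", "clear"):
            # Assumption: memory actions cost nothing and leave the world unchanged.
            self.memory[arg] = 1 if kind == "set" else 0
            return self._augment(self._last_obs), 0.0, False
        obs, reward, done = self.env.step(arg)
        self._last_obs = obs
        return self._augment(obs), reward, done


class AliasedEnv:
    """Hypothetical toy environment whose observation is always 0."""

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        return 0, 0.0, self.t >= 3
```

With this wrapper a purely reactive policy over `(observation, memory)` pairs can behave differently in states that look identical, because it can leave itself a note in the memory bits.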
Reinforcement Learning by Policy Search
2000
Cited by 31 (2 self)
One objective of artificial intelligence is to model the behavior of an intelligent agent interacting with its environment. The environment's transformations could be modeled as a Markov chain, whose state is partially observable to the agent and affected by its actions; such processes are known as partially observable Markov decision processes (POMDPs). While the environment's dynamics are assumed to obey certain rules, the agent does not know them and must learn. In this dissertation we focus on the agent's adaptation as captured by the reinforcement learning framework. Reinforcement learning means learning a policy (a mapping of observations into actions) based on feedback from the environment. The learning can be viewed as browsing a set of policies while evaluating them by trial through interaction with the environment. The set of policies being searched is constrained by the architecture of the agent's controller. POMDPs require a controller to have a memory. We investigate various architectures for controllers with memory, including controllers with external memory, finite state controllers and distributed controllers for multi-agent systems. For these various controllers we work out the details of the algorithms which learn by ascending the gradient of expected cumulative reinforcement. Building on statistical learning theory and experiment design theory, a policy evaluation algorithm is developed for the case of experience reuse. We address the question of sufficient experience for uniform convergence of policy evaluation and obtain sample complexity bounds for various estimators. Finally, we demonstrate the performance of the proposed algorithms on several domains, the most complex of which is simulated adaptive packet routing in a telecommunication network.
Reinforcement Learning for Adaptive Routing
In Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2002
Cited by 29 (0 self)
Reinforcement learning means learning a policy (a mapping of observations into actions) based on feedback from the environment. The learning can be viewed as browsing a set of policies while evaluating them by trial through interaction with the environment. We present an application of a gradient ascent algorithm for reinforcement learning to a complex domain of packet routing in network communication and compare the performance of this algorithm to other routing methods on a benchmark problem.
Reinforcement Learning Through Gradient Descent
1999
Cited by 26 (0 self)
Reinforcement learning is often done using parameterized function approximators to store value functions. Algorithms are typically developed for lookup tables, and then applied to function approximators by using backpropagation. This can lead to algorithms diverging on very small, simple MDPs and Markov chains, even with linear function approximators and epoch-wise training. These algorithms are also very difficult to analyze, and difficult to combine with other algorithms. A series of new families of algorithms are derived based on stochastic gradient descent. Since they are derived from first principles with function approximators in mind, they have guaranteed convergence to local minima, even on general nonlinear function approximators. For both residual algorithms and VAPS algorithms, it is possible to take any of the standard algorithms in the field, such as Q-learning or SARSA or value iteration, and rederive a new form of it with provable convergence. In addition to better conve...
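To make the "derived from stochastic gradient descent" idea concrete, here is a minimal sketch of a residual-gradient value update for a linear approximator: the weights descend the gradient of half the squared Bellman residual, so both the current and the successor state's features appear in the step. The toy chain, feature vectors, step size and helper names are assumptions for illustration, and the example uses deterministic transitions, the setting in which a single sampled successor gives an unbiased gradient.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def residual_gradient(features, transitions, gamma=0.9, alpha=0.1, sweeps=2000):
    """Fit V(s) = w . phi(s) by gradient descent on 0.5 * delta**2.

    features: dict mapping state -> feature list
    transitions: list of (state, reward, next_state); next_state None = terminal
    """
    n = len(next(iter(features.values())))
    w = [0.0] * n
    zero = [0.0] * n
    for _ in range(sweeps):
        for s, r, s_next in transitions:
            phi = features[s]
            phi_next = features[s_next] if s_next is not None else zero
            delta = r + gamma * dot(w, phi_next) - dot(w, phi)  # Bellman residual
            # d(0.5 * delta**2)/dw = delta * (gamma * phi_next - phi)
            w = [wi - alpha * delta * (gamma * pn - p)
                 for wi, pn, p in zip(w, phi_next, phi)]
    return w


# Hypothetical 3-state chain 0 -> 1 -> 2 -> terminal, reward 1 on the last step.
FEATURES = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.0, 0.0, 1.0]}
TRANSITIONS = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, None)]
```

On this chain `residual_gradient(FEATURES, TRANSITIONS)` should approach the true values 0.81, 0.9 and 1.0. The difference from the usual TD(0) step is the extra `gamma * phi_next` term, which makes the update a true gradient of a fixed objective.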
Reinforcement learning for an ART-based fuzzy adaptive learning control network
IEEE Trans. Neural Networks, 1996
Cited by 14 (2 self)
This paper proposes a reinforcement fuzzy adaptive learning control network (RFALCON) for solving various reinforcement learning problems. The proposed RFALCON is constructed by integrating two fuzzy adaptive learning control networks (FALCONs), each of which is a connectionist model with a feedforward multilayer network developed for the realization of a fuzzy controller. One FALCON performs as a critic network (fuzzy predictor), and the other as an action network (fuzzy controller). Using the temporal difference prediction method, the critic network can predict the external reinforcement signal and provide a more informative internal reinforcement signal to the action network. The action network performs a stochastic exploratory algorithm to adapt itself according to the internal reinforcement signal. An ART-based reinforcement structure/parameter-learning algorithm is developed for constructing the RFALCON dynamically. During the learning process, both structure learning and parameter learning are performed simultaneously in the two FALCONs. The proposed RFALCON can construct a fuzzy control system dynamically and automatically through a reward/penalty signal (i.e., a "good" or "bad" signal). It is best applied to learning environments where obtaining exact training data is expensive. The proposed RFALCON has two important features. First, it reduces the combinatorial demands placed by the standard methods for adaptive linearization of a system. Second, the RFALCON is a highly autonomous system. Initially, there are no hidden nodes (i.e., no membership functions or fuzzy rules). They are created and begin to grow as learning proceeds. The RFALCON can also dynamically partition the input-output spaces, tune activation (membership) functions, and find proper network connection types (fuzzy rules). Computer simulations have been conducted to illustrate the performance and applicability of the proposed learning scheme.
GA-based fuzzy reinforcement learning for control of a magnetic bearing system
IEEE Trans. Syst. Man Cybern. B, 2000
Cited by 13 (1 self)
This paper proposes a TD (temporal difference) and GA (genetic algorithm)-based reinforcement (TDGAR) learning method and applies it to the control of a real magnetic bearing system. The TDGAR learning scheme is a new hybrid GA, which integrates the TD prediction method and the GA to perform the reinforcement learning task. The TDGAR learning system is composed of two integrated feedforward networks. One neural network acts as a critic network to guide the learning of the other network (the action network) which determines the outputs (actions) of the TDGAR learning system. The action network can be a normal neural network or a neural fuzzy network. Using the TD prediction method, the critic network can predict the external reinforcement signal and provide a more informative internal reinforcement signal to the action network. The action network uses the GA to adapt itself according to the internal reinforcement signal. The key concept of the TDGAR learning scheme is to formulate the internal reinforcement signal as the fitness function for the GA such that the GA can evaluate the candidate solutions (chromosomes) regularly, even during periods without external feedback from the environment. This enables the GA to proceed to new generations regularly without waiting for the arrival of the external reinforcement signal. This can usually accelerate the GA learning since a reinforcement signal may only be available at a time long after a sequence of actions has occurred in the reinforcement learning problem. The proposed TDGAR learning system has been used to control an active magnetic bearing (AMB) system in practice. A systematic design procedure is developed to achieve successful integration of all the subsystems including magnetic suspension, mechanical structure, and controller training. The results show that the TDGAR learning scheme can successfully find a neural controller or a neural fuzzy controller for a self-designed magnetic bearing system.
Index Terms: Action network, active magnetic bearing, adaptive heuristic critic, critic network.
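The abstract does not give the formula for the internal reinforcement signal. A common choice in the adaptive heuristic critic literature is the TD error, r_hat = r + gamma * V(s') - V(s), and the sketch below uses that form with a tabular critic on a hypothetical five-state chain where external reward arrives only at the very end. The chain, the parameter values and the function name are assumptions for illustration and are not taken from the paper.

```python
def run_episode(values, n_states=5, gamma=0.9, beta=0.1):
    """Walk right along a chain, updating the critic in place.

    Returns the internal reinforcement (TD error) produced at each step.
    External reward is 1 only on the transition into the last state.
    """
    internal = []
    for s in range(n_states - 1):
        s_next = s + 1
        terminal = s_next == n_states - 1
        r = 1.0 if terminal else 0.0                  # external reinforcement
        v_next = 0.0 if terminal else values[s_next]  # terminal state has value 0
        r_hat = r + gamma * v_next - values[s]        # internal reinforcement
        values[s] += beta * r_hat                     # TD(0) critic update
        internal.append(r_hat)
    return internal


critic = [0.0] * 5
first = run_episode(critic)   # only the final step carries any signal
second = run_episode(critic)  # the step before it now receives a signal too
```

On the first pass the internal signal is non-zero only where the external reward occurs; on later passes the critic's predictions move that signal back to earlier steps, which is the sense in which it can be evaluated without waiting for delayed external feedback.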