Results 1  10
of
36
Reinforcement learning: a survey
 Journal of Artificial Intelligence Research
, 1996
"... This paper surveys the field of reinforcement learning from a computerscience perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem ..."
Abstract

Cited by 1298 (23 self)
 Add to MetaCart
This paper surveys the field of reinforcement learning from a computerscience perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trialanderror interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
Locally weighted learning
 ARTIFICIAL INTELLIGENCE REVIEW
, 1997
"... This paper surveys locally weighted learning, a form of lazy learning and memorybased learning, and focuses on locally weighted linear regression. The survey discusses distance functions, smoothing parameters, weighting functions, local model structures, regularization of the estimates and bias, ass ..."
Abstract

Cited by 448 (52 self)
 Add to MetaCart
This paper surveys locally weighted learning, a form of lazy learning and memorybased learning, and focuses on locally weighted linear regression. The survey discusses distance functions, smoothing parameters, weighting functions, local model structures, regularization of the estimates and bias, assessing predictions, handling noisy data and outliers, improving the quality of predictions by tuning t parameters, interference between old and new data, implementing locally weighted learning e ciently, and applications of locally weighted learning. A companion paper surveys how locally weighted learning can be used in robot learning and control.
Algorithms for Sequential Decision Making
, 1996
"... Sequential decision making is a fundamental task faced by any intelligent agent in an extended interaction with its environment; it is the act of answering the question "What should I do now?" In this thesis, I show how to answer this question when "now" is one of a finite set of states, "do" is one ..."
Abstract

Cited by 175 (8 self)
 Add to MetaCart
Sequential decision making is a fundamental task faced by any intelligent agent in an extended interaction with its environment; it is the act of answering the question "What should I do now?" In this thesis, I show how to answer this question when "now" is one of a finite set of states, "do" is one of a finite set of actions, "should" is maximize a longrun measure of reward, and "I" is an automated planning or learning system (agent). In particular,
Locally Weighted Learning for Control
, 1996
"... Lazy learning methods provide useful representations and training algorithms for learning about complex phenomena during autonomous adaptive control of complex systems. This paper surveys ways in which locally weighted learning, a type of lazy learning, has been applied by us to control tasks. We ex ..."
Abstract

Cited by 159 (17 self)
 Add to MetaCart
Lazy learning methods provide useful representations and training algorithms for learning about complex phenomena during autonomous adaptive control of complex systems. This paper surveys ways in which locally weighted learning, a type of lazy learning, has been applied by us to control tasks. We explain various forms that control tasks can take, and how this affects the choice of learning paradigm. The discussion section explores the interesting impact that explicitly remembering all previous experiences has on the problem of learning to control.
KernelBased Reinforcement Learning
 Machine Learning
, 1999
"... We present a kernelbased approach to reinforcement learning that overcomes the stability problems of temporaldifference learning in continuous statespaces. First, our algorithm converges to a unique solution of an approximate Bellman's equation regardless of its initialization values. Second, the ..."
Abstract

Cited by 102 (1 self)
 Add to MetaCart
We present a kernelbased approach to reinforcement learning that overcomes the stability problems of temporaldifference learning in continuous statespaces. First, our algorithm converges to a unique solution of an approximate Bellman's equation regardless of its initialization values. Second, the method is consistent in the sense that the resulting policy converges asymptotically to the optimal policy. Parametric value function estimates such as neural networks do not possess this property. Our kernelbased approach also allows us to show that the limiting distribution of the value function estimate is a Gaussian process. This information is useful in studying the biasvariance tradeo in reinforcement learning. We find that all reinforcement learning approaches to estimating the value function, parametric or nonparametric, are subject to a bias. This bias is typically larger in reinforcement learning than in a comparable regression problem.
Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions
, 1993
"... Consider a given value function on states of a Markov decision problem, as might result from applying a reinforcement learning algorithm. Unless this value function equals the corresponding optimal value function, at some states there will be a discrepancy, which is natural to call the Bellman resid ..."
Abstract

Cited by 83 (1 self)
 Add to MetaCart
Consider a given value function on states of a Markov decision problem, as might result from applying a reinforcement learning algorithm. Unless this value function equals the corresponding optimal value function, at some states there will be a discrepancy, which is natural to call the Bellman residual, between what the value function specifies at that state and what is obtained by a onestep lookahead along the seemingly best action at that state using the given value function to evaluate all succeeding states. This paper derives a tight bound on how far from optimal the discounted return for a greedy policy based on the given value function will be as a function of the maximum norm magnitude of this Bellman residual. A corresponding result is also obtained for value functions defined on stateaction pairs, as are used in Qlearning. One significant application of these results is to problems where a function approximator is used to learn a value function, with training of the approxi...
Advantage Updating
, 1993
"... A new algorithm for reinforcement learning, advantage updating, is proposed. Advantage updating is a direct learning technique; it does not require a model to be given or learned. It is incremental, requiring only a constant amount of calculation per time step, independent of the number of possible ..."
Abstract

Cited by 44 (0 self)
 Add to MetaCart
A new algorithm for reinforcement learning, advantage updating, is proposed. Advantage updating is a direct learning technique; it does not require a model to be given or learned. It is incremental, requiring only a constant amount of calculation per time step, independent of the number of possible actions, possible outcomes from a given action, or number of states. Analysis and simulation indicate that advantage updating is applicable to reinforcement learning systems working in continuous time (or discrete time with small time steps) for which Qlearning is not applicable. Simulation results are presented indicating that for a simple linear quadratic regulator (LQR) problem with no noise and large time steps, advantage updating learns slightly faster than Q learning. When there is noise or small time steps, advantage updating learns more quickly than Qlearning by a factor of more than 100,000. Convergence properties and implementation issues are discussed. New convergence results...
A Comparison of Direct and ModelBased Reinforcement Learning
 IN INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION
, 1997
"... This paper compares direct reinforcement learning (no explicit model) and modelbased reinforcement learning on a simple task: pendulum swing up. We find that in this task modelbased approaches support reinforcement learning from smaller amounts of training data and efficient handling of changing g ..."
Abstract

Cited by 42 (1 self)
 Add to MetaCart
This paper compares direct reinforcement learning (no explicit model) and modelbased reinforcement learning on a simple task: pendulum swing up. We find that in this task modelbased approaches support reinforcement learning from smaller amounts of training data and efficient handling of changing goals. 1 Introduction Many proposed reinforcement learning algorithms require large amounts of training data before achieving acceptable performance. This paper explores the training data requirements of two kinds of reinforcement learning algorithms, direct (modelfree) and indirect (modelbased), when continuous actions are available. Direct reinforcement learning algorithms learn a policy or value function without explicitly representing a model of the controlled system (Sutton et al., 1992). Modelbased approaches learn an explicit model of the system simultaneously with a value function and policy (Sutton, 1990, 1991a,b; Barto et al., 1995; Kaelbling et al., 1996). We find that in the p...
QLearning in Continuous State and Action Spaces
 IN AUSTRALIAN JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE
, 1999
"... Qlearning can be used to learn a control policy that maximises a scalar reward through interaction with the environment. Q learning is commonly applied to problems with discrete states and actions. We describe a method suitable for control tasks which require continuous actions, in response to con ..."
Abstract

Cited by 28 (5 self)
 Add to MetaCart
Qlearning can be used to learn a control policy that maximises a scalar reward through interaction with the environment. Q learning is commonly applied to problems with discrete states and actions. We describe a method suitable for control tasks which require continuous actions, in response to continuous states. The system consists of a neural network coupled with a novel interpolator. Simulation results are presented for a nonholonomic control task. Advantage Learning, a variation of Qlearning, is shown enhance learning speed and reliability for this task.
Reinforcement Learning Through Gradient Descent
, 1999
"... Reinforcement learning is often done using parameterized function approximators to store value functions. Algorithms are typically developed for lookup tables, and then applied to function approximators by using backpropagation. This can lead to algorithms diverging on very small, simple MDPs and Ma ..."
Abstract

Cited by 22 (0 self)
 Add to MetaCart
Reinforcement learning is often done using parameterized function approximators to store value functions. Algorithms are typically developed for lookup tables, and then applied to function approximators by using backpropagation. This can lead to algorithms diverging on very small, simple MDPs and Markov chains, even with linear function approximators and epochwise training. These algorithms are also very difficult to analyze, and difficult to combine with other algorithms. A series of new families of algorithms are derived based on stochastic gradient descent. Since they are derived from first principles with function approximators in mind, they have guaranteed convergence to local minima, even on general nonlinear function approximators. For both residual algorithms and VAPS algorithms, it is possible to take any of the standard algorithms in the field, such as Qlearning or SARSA or value iteration, and rederive a new form of it with provable convergence. In addition to better conve...