Results 1 - 10 of 64
Linear least-squares algorithms for temporal difference learning
 Machine Learning
, 1996
Cited by 182 (0 self)
Abstract. We introduce two new temporal difference (TD) algorithms based on the theory of linear least-squares function approximation. We define an algorithm we call Least-Squares TD (LS TD) for which we prove probability-one convergence when it is used with a function approximator linear in the adjustable parameters. We then define a recursive version of this algorithm, Recursive Least-Squares TD (RLS TD). Although these new TD algorithms require more computation per time step than do Sutton's TD(λ) algorithms, they are more efficient in a statistical sense because they extract more information from training experiences. We describe a simulation experiment showing the substantial improvement in learning rate achieved by RLS TD in an example Markov prediction problem. To quantify this improvement, we introduce the TD error variance of a Markov chain, σ_TD, and experimentally conclude that the convergence rate of a TD algorithm depends linearly on σ_TD. In addition to converging more rapidly, LS TD and RLS TD do not have control parameters, such as a learning rate parameter, thus eliminating the possibility of achieving poor performance by an unlucky choice of parameters.
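The batch least-squares idea in this abstract can be illustrated with a minimal sketch. It uses a commonly cited textbook form of least-squares TD, not necessarily the exact formulation of the paper: accumulate A = Σ φ(s)(φ(s) − γ φ(s'))ᵀ and b = Σ φ(s) r over observed transitions, then solve A θ = b. The discount factor `gamma`, the helper `solve_linear`, the `one_hot` features, and the two-state chain are all assumptions made for illustration.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build an augmented matrix so the inputs are not mutated.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; collect more transitions")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def lstd(transitions, features, gamma):
    """Estimate weights theta such that V(s) is approximated by theta . features(s)."""
    n = len(features(transitions[0][0]))
    a = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for state, reward, next_state in transitions:
        phi = features(state)
        phi_next = features(next_state)
        for i in range(n):
            b[i] += phi[i] * reward
            for j in range(n):
                a[i][j] += phi[i] * (phi[j] - gamma * phi_next[j])
    return solve_linear(a, b)


def one_hot(state):
    """Hypothetical tabular features for a two-state chain."""
    return [1.0 if state == i else 0.0 for i in range(2)]


# Hypothetical deterministic chain: 0 -> 1 with reward 1, 1 -> 0 with reward 0.
transitions = [(0, 1.0, 1), (1, 0.0, 0)]
theta = lstd(transitions, one_hot, gamma=0.5)
```

Note that the sketch has no step-size parameter, which matches the abstract's remark that these methods avoid tuning a learning rate. The price is the linear solve, whose cost grows with the number of features.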
Recent advances in hierarchical reinforcement learning
, 2003
Cited by 161 (23 self)
A preliminary unedited version of this paper was incorrectly published as part of Volume
Learning to Solve Markovian Decision Processes
, 1994
Cited by 49 (3 self)
This dissertation is about building learning control architectures for agents embedded in finite, stationary, and Markovian environments. Such architectures give embedded agents the ability to improve autonomously the efficiency with which they can achieve goals. Machine learning researchers have developed reinforcement learning (RL) algorithms based on dynamic programming (DP) that use the agent's experience in its environment to improve its decision policy incrementally. This is achieved by adapting an evaluation function in such a way that the decision policy that is "greedy" with respect to it improves with experience. This dissertation focuses on finite, stationary and Markovian environments for two reasons: it allows the develop...
Learning and Value Function Approximation in Complex Decision Processes
, 1998
Cited by 36 (4 self)
In principle, a wide variety of sequential decision problems, ranging from dynamic resource allocation in telecommunication networks to financial risk management, can be formulated in terms of stochastic control and solved by the algorithms of dynamic programming. Such algorithms compute and store a value function, which evaluates expected future reward as a function of current state. Unfortunately, exact computation of the value function typically requires time and storage that grow proportionately with the number of states, and consequently, the enormous state spaces that arise in practical applications render the algorithms intractable. In this thesis, we study tractable methods that approximate the value function. Our work builds on research in an area of artificial intelligence known as reinforcement learning. A point of focus of this thesis is temporal-difference learning, a stochastic algorithm inspired to some extent by phenomena observed in animal behavior. Given a selection of...
Q-Learning in Continuous State and Action Spaces
 IN AUSTRALIAN JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE
, 1999
Cited by 28 (5 self)
Q-learning can be used to learn a control policy that maximises a scalar reward through interaction with the environment. Q-learning is commonly applied to problems with discrete states and actions. We describe a method suitable for control tasks which require continuous actions, in response to continuous states. The system consists of a neural network coupled with a novel interpolator. Simulation results are presented for a nonholonomic control task. Advantage Learning, a variation of Q-learning, is shown to enhance learning speed and reliability for this task.
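For contrast with the continuous method described in this abstract, here is a minimal sketch of the standard tabular Q-learning update for the discrete case. It is not the paper's neural-network method. The function name `q_update`, the step size `alpha`, the discount `gamma`, and the two-state table are assumptions made for illustration.

```python
def q_update(q, state, action, reward, next_state, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Bootstrap from the best action value available in the next state.
    best_next = max(q[next_state])
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
    return q[state][action]


# Hypothetical table: 2 states x 2 actions, initialised to zero.
q = [[0.0, 0.0], [0.0, 0.0]]
first = q_update(q, 0, 1, 1.0, 1, alpha=0.5, gamma=0.9)
second = q_update(q, 0, 1, 1.0, 1, alpha=0.5, gamma=0.9)
```

The table indexing and the `max` over a finite action list are what tie this form to discrete states and actions. A continuous-action method needs some other way to represent and maximise the action values.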
Differential Training Of Rollout Policies
, 1997
Cited by 23 (0 self)
We consider the approximate solution of stochastic optimal control problems using a neuro-dynamic programming/reinforcement learning methodology. We focus on the computation of a rollout policy, which is obtained by a single policy iteration starting from some known base policy and using some form of exact or approximate policy improvement. We indicate that, in a stochastic environment, the popular methods ... Q-factor and cost-to-go values. In particular, we propose a method, called differential training, that can be used to obtain an approximation to cost-to-go differences rather than cost-to-go values by using standard methods such as TD(λ) and λ-policy iteration. This method is suitable for recursively generating rollout policies in the context of simulation-based policy iteration methods.
Incremental Dynamic Programming for On-Line Adaptive Optimal Control
, 1994
Cited by 20 (2 self)
Reinforcement learning algorithms based on the principles of Dynamic Programming (DP) have enjoyed a great deal of recent attention both empirically and theoretically. These algorithms have been referred to generically as Incremental Dynamic Programming (IDP) algorithms. IDP algorithms are intended for use in situations where the information or computational resources needed by traditional dynamic programming algorithms are not available. IDP algorithms attempt to find a global solution to a DP problem by incrementally improving local constraint satisfaction properties as experience is gained through interaction with the environment. This class of algorithms is not new, going back at least as far as Samuel's adaptive checkers-playing programs,...
Comparison of Heuristic Dynamic Programming and Dual Heuristic Programming Adaptive Critics for Neurocontrol of a Turbogenerator
 IEEE Transactions on Neural Networks
, 2000
Cited by 19 (7 self)
This paper presents the design of an optimal neurocontroller that replaces the conventional automatic voltage regulator (AVR) and the turbine governor for a turbogenerator connected to the power grid. The neurocontroller design uses a novel technique based on the adaptive critic designs (ACDs), specifically on heuristic dynamic programming (HDP) and dual heuristic programming (DHP). Results show that both neurocontrollers are robust, but that DHP outperforms HDP or conventional controllers, especially when the system conditions and configuration change. This paper also shows how to design optimal neurocontrollers for nonlinear systems, such as turbogenerators, without continual online training of the neural networks, thus avoiding risks of instability.
Building a basic block instruction scheduler with reinforcement learning and rollouts
 Machine Learning
, 2002