Results 11 - 20 of 288
R.: Incremental multi-step Q-learning
, 1996
Cited by 93 (2 self)
Abstract. This paper presents a novel incremental algorithm that combines Q-learning, a well-known dynamic-programming based reinforcement learning method, with the TD(λ) return estimation process, which is typically used in actor-critic learning, another well-known dynamic-programming based reinforcement learning method. The parameter λ is used to distribute credit throughout sequences of actions, leading to faster learning and also helping to alleviate the non-Markovian effect of coarse state-space quantization. The resulting algorithm, Q(λ)-learning, thus combines some of the best features of the Q-learning and actor-critic learning paradigms. The behavior of this algorithm has been demonstrated through computer simulations.
Rollout algorithms for stochastic scheduling problems
 Journal of Heuristics
, 1999
Cited by 74 (3 self)
Abstract. Stochastic scheduling problems are difficult stochastic control problems with combinatorial decision spaces. In this paper we focus on a class of stochastic scheduling problems, the quiz problem and its variations. We discuss the use of heuristics for their solution, and we propose rollout algorithms based on these heuristics which approximate the stochastic dynamic programming algorithm. We show how the rollout algorithms can be implemented efficiently, with considerable savings in computation over optimal algorithms. We delineate circumstances under which the rollout algorithms are guaranteed to perform better than the heuristics on which they are based. We also show computational results which suggest that the performance of the rollout policies is near-optimal, and is substantially better than the performance of their underlying heuristics.
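The rollout idea in this abstract can be illustrated with a small sketch. It is not the paper's implementation. It assumes a quiz-problem variation in which each question has a success probability and a reward, answering stops at the first wrong answer, and only a limited number of questions (`slots`) may be attempted. The base heuristic (sort by probability times reward) and all function names are hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Question = Tuple[float, float]  # (probability of a correct answer, reward)

def expected_reward(order: Sequence[Question]) -> float:
    """Expected total reward when answering stops at the first wrong answer."""
    total, p_alive = 0.0, 1.0
    for p, v in order:
        p_alive *= p          # probability that every question so far was answered correctly
        total += p_alive * v
    return total

def greedy_heuristic(remaining: List[Question], slots: int) -> List[Question]:
    """Base heuristic (an assumption): take questions in decreasing order of p * v."""
    return sorted(remaining, key=lambda q: q[0] * q[1], reverse=True)[:slots]

def rollout(questions: List[Question], slots: int) -> List[Question]:
    """One-step rollout: try each candidate next, complete the order with the base heuristic."""
    chosen: List[Question] = []
    remaining = list(questions)
    while remaining and len(chosen) < slots:
        def score(i: int) -> float:
            rest = remaining[:i] + remaining[i + 1:]
            tail = greedy_heuristic(rest, slots - len(chosen) - 1)
            return expected_reward(chosen + [remaining[i]] + tail)
        best = max(range(len(remaining)), key=score)
        chosen.append(remaining.pop(best))
    return chosen

if __name__ == "__main__":
    qs = [(0.9, 1.0), (0.5, 3.0), (0.2, 10.0), (0.8, 2.0)]
    print(expected_reward(greedy_heuristic(qs, 3)), expected_reward(rollout(qs, 3)))
```

Because this base heuristic ranks questions by a fixed index, the completion it produces after the rollout's own choice stays available as a candidate at every later step, so in this sketch the rollout order never scores below the heuristic order.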
Approximate Solutions to Markov Decision Processes
, 1999
Cited by 69 (9 self)
One of the basic problems of machine learning is deciding how to act in an uncertain world. For example, if I want my robot to bring me a cup of coffee, it must be able to compute the correct sequence of electrical impulses to send to its motors to navigate from the coffee pot to my office. In fact, since the results of its actions are not completely predictable, it is not enough just to compute the correct sequence; instead the robot must sense and correct for deviations from its intended path. In order for any machine learner to act reasonably in an uncertain environment, it must solve problems like the above one quickly and reliably. Unfortunately, the world is often so complicated that it is difficult or impossible to find the optimal sequence of actions to achieve a given goal. So, in order to scale our learners up to real-world problems, we usually must settle for approximate solutions. One representation for a learner's environment and goals is a Markov decision process or MDP. ...
Reinforcement Learning with a Hierarchy of Abstract Models
 In Proceedings of the Tenth National Conference on Artificial Intelligence
, 1992
Cited by 66 (8 self)
Reinforcement learning (RL) algorithms have traditionally been thought of as trial-and-error learning methods that use actual control experience to incrementally improve a control policy. Sutton's DYNA architecture demonstrated that RL algorithms can work as well using simulated experience from an environment model, and that the resulting computation was similar to doing one-step lookahead planning. Inspired by the literature on hierarchical planning, I propose learning a hierarchy of models of the environment that abstract temporal detail as a means of improving the scalability of RL algorithms. I present H-DYNA (Hierarchical DYNA), an extension to Sutton's DYNA architecture that is able to learn such a hierarchy of abstract models. H-DYNA differs from hierarchical planners in two ways: first, the abstract models are learned using experience gained while...
TD(λ) Converges with Probability 1
, 1994
Cited by 60 (2 self)
The methods of temporal differences (Samuel, 1959; Sutton, 1984, 1988) allow an agent to learn accurate predictions of stationary stochastic future outcomes. The learning is effectively stochastic approximation based on samples extracted from the process generating the agent's future. Sutton (1988) proved that for a special case of temporal differences, the expected values of the predictions converge to their correct values, as larger samples are taken, and Dayan (1992) extended his proof to the general case. This article proves the stronger result that the predictions of a slightly modified form of temporal difference learning converge with probability one, and shows how to quantify the rate of convergence.
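For orientation, the kind of update this abstract refers to can be sketched as standard tabular TD(λ) with accumulating eligibility traces. This is the textbook form, not the slightly modified form analysed in the article. The five-state random-walk task, the constant step size, and the function name are assumptions made for illustration.

```python
import random

def td_lambda_random_walk(episodes=5000, alpha=0.01, lam=0.5, gamma=1.0, seed=0):
    """Tabular TD(lambda) on a hypothetical 5-state random walk.

    Episodes start in the middle state and move left or right with equal
    probability. Leaving on the right pays reward 1, leaving on the left pays 0,
    so the true value of state i (0-based) is (i + 1) / 6.
    """
    rng = random.Random(seed)
    n = 5
    V = [0.5] * n                      # initial predictions
    for _ in range(episodes):
        e = [0.0] * n                  # eligibility traces, reset each episode
        s = n // 2
        while True:
            s2 = s + (1 if rng.random() < 0.5 else -1)
            if s2 < 0:                 # left terminal
                r, v2, done = 0.0, 0.0, True
            elif s2 >= n:              # right terminal
                r, v2, done = 1.0, 0.0, True
            else:
                r, v2, done = 0.0, V[s2], False
            delta = r + gamma * v2 - V[s]   # temporal-difference error
            e[s] += 1.0                     # accumulate trace for the visited state
            for i in range(n):
                V[i] += alpha * delta * e[i]
                e[i] *= gamma * lam         # decay all traces
            if done:
                break
            s = s2
    return V

if __name__ == "__main__":
    print([round(v, 3) for v in td_lambda_random_walk()])
```

With a constant step size the predictions only hover near the true values; the probability-one convergence discussed in the abstract relies on conditions that this sketch does not attempt to reproduce.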
Q-learning
 Machine Learning
, 1992
Cited by 60 (0 self)
Abstract. Q-learning (Watkins, 1989) is a simple way for agents to learn how to act optimally in controlled Markovian domains. It amounts to an incremental method for dynamic programming which imposes limited computational demands. It works by successively improving its evaluations of the quality of particular actions at particular states. This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989). We show that Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely. We also sketch extensions to the cases of non-discounted, but absorbing, Markov environments, and where many Q values can be changed each iteration, rather than just one.
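The incremental update described in this abstract can be sketched as tabular Q-learning with a lookup table, which matches the "represented discretely" condition. The chain environment, the uniform random behaviour policy (used so that every action is sampled repeatedly in every state), and the parameter values are assumptions chosen for illustration, not taken from the paper.

```python
import random

def q_learning(n_states=5, n_actions=2, episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning on a hypothetical deterministic chain.

    Action 1 moves one state to the right, action 0 one state to the left
    (staying put at state 0). Entering the last state pays reward 1 and ends
    the episode.
    """
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        s = rng.randrange(goal)              # random non-terminal start state
        for _ in range(100):                 # cap on episode length
            a = rng.randrange(n_actions)     # random behaviour: all actions get sampled
            s2 = min(s + 1, goal) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == goal else 0.0
            # One-step target: reward plus discounted best action-value of the next state.
            target = r if s2 == goal else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
            if s == goal:
                break
    return Q

if __name__ == "__main__":
    for state, row in enumerate(q_learning()):
        print(state, [round(q, 3) for q in row])
```

Because the learned table is independent of the behaviour policy, the greedy policy read off from `Q` moves right in every state even though the data were gathered by acting at random.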
Learning to Solve Markovian Decision Processes
, 1994
Cited by 49 (3 self)
This dissertation is about building learning control architectures for agents embedded in finite, stationary, and Markovian environments. Such architectures give embedded agents the ability to improve autonomously the efficiency with which they can achieve goals. Machine learning researchers have developed reinforcement learning (RL) algorithms based on dynamic programming (DP) that use the agent's experience in its environment to improve its decision policy incrementally. This is achieved by adapting an evaluation function in such a way that the decision policy that is "greedy" with respect to it improves with experience. This dissertation focuses on finite, stationary and Markovian environments for two reasons: it allows the develop...
Problem Solving With Reinforcement Learning
, 1995
Cited by 48 (0 self)
This dissertation is submitted for consideration for the degree of Doctor of Philosophy at the University of Cambridge. Summary: This thesis is concerned with practical issues surrounding the application of reinforcement learning techniques to tasks that take place in high-dimensional continuous state-space environments. In particular, the extension of online updating methods is considered, where the term implies systems that learn as each experience arrives, rather than storing the experiences for use in a separate offline learning phase. Firstly, the use of alternative update rules in place of standard Q-learning (Watkins 1989) is examined to provide faster convergence rates. Secondly, the use of multilayer perceptron (MLP) neural networks (Rumelhart, Hinton and Williams 1986) is investigated to provide suitable generalising function approximators. Finally, consideration is given to the combination of Adaptive Heuristic Critic (AHC) methods and Q-learning to produce systems combining the benefits of real-valued actions and discrete switching
Advantage Updating
, 1993
Cited by 47 (0 self)
A new algorithm for reinforcement learning, advantage updating, is proposed. Advantage updating is a direct learning technique; it does not require a model to be given or learned. It is incremental, requiring only a constant amount of calculation per time step, independent of the number of possible actions, possible outcomes from a given action, or number of states. Analysis and simulation indicate that advantage updating is applicable to reinforcement learning systems working in continuous time (or discrete time with small time steps) for which Q-learning is not applicable. Simulation results are presented indicating that for a simple linear quadratic regulator (LQR) problem with no noise and large time steps, advantage updating learns slightly faster than Q-learning. When there is noise or small time steps, advantage updating learns more quickly than Q-learning by a factor of more than 100,000. Convergence properties and implementation issues are discussed. New convergence results...
A Lyapunov Bound for Solutions of Poisson's Equation
 Ann. Probab
, 1996
Cited by 45 (24 self)
In this paper we consider ψ-irreducible Markov processes evolving in discrete or continuous time, on a general state space. We develop a Lyapunov function criterion that permits one to obtain explicit bounds on the solution to Poisson's equation and, in particular, obtain conditions under which the solution is square integrable. These results are applied to obtain sufficient conditions that guarantee the validity of a functional central limit theorem for the Markov process. As a second consequence of the bounds obtained, a perturbation theory for Markov processes is developed which gives conditions under which both the solution to Poisson's equation and the invariant probability for the process are continuous functions of its transition kernel. The techniques are illustrated with applications to queueing theory and autoregressive processes. AMS subject classifications: 68M20, 60J10. Running head: Poisson's Equation. Keywords: Markov chain, Markov process, Poisson's equation, Lyapunov f...