Results 11-20 of 206
Incremental Multi-Step Q-Learning
Machine Learning, 1996
Cited by 88 (2 self)
Abstract:
This paper presents a novel incremental algorithm that combines Q-learning, a well-known dynamic programming-based reinforcement learning method, with the TD(λ) return estimation process, which is typically used in actor-critic learning, another well-known dynamic programming-based reinforcement learning method. The parameter λ is used to distribute credit throughout sequences of actions, leading to faster learning and also helping to alleviate the non-Markovian effect of coarse state-space quantization. The resulting algorithm, Q(λ)-learning, thus combines some of the best features of the Q-learning and actor-critic learning paradigms. The behavior of this algorithm has been demonstrated through computer simulations. Keywords: reinforcement learning, temporal difference learning. 1. Introduction. The incremental multi-step Q-learning (Q(λ)-learning) method is a new direct (or model-free) algorithm that extends the one-step Q-learning algorithm (Watkins 1989) by combining it with TD(λ) ret...
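The combination the abstract describes can be sketched in a few lines of tabular code. The toy chain environment, the epsilon-greedy policy, and the "naive" trace decay below (traces are decayed uniformly rather than cut after exploratory actions, as Watkins' variant would) are illustrative assumptions, not the paper's experimental setup.

```python
import random

class Chain:
    """Toy 5-state chain: action 1 moves right, action 0 moves left.
    Reaching state 4 yields reward 1 and ends the episode."""
    actions = (1, 0)

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s = min(self.s + 1, 4) if a == 1 else max(self.s - 1, 0)
        done = self.s == 4
        return self.s, (1.0 if done else 0.0), done

def q_lambda_episode(env, Q, alpha=0.1, gamma=0.95, lam=0.9, eps=0.1):
    """One episode of tabular Q(lambda): one-step Q-learning plus
    TD(lambda)-style eligibility traces that spread each TD error
    backwards over recently visited state-action pairs."""
    e = {}                                   # eligibility traces
    s, done = env.reset(), False
    while not done:
        a = (random.choice(env.actions) if random.random() < eps
             else max(env.actions, key=lambda u: Q.get((s, u), 0.0)))
        s2, r, done = env.step(a)
        best = max(Q.get((s2, u), 0.0) for u in env.actions)
        delta = r + (0.0 if done else gamma * best) - Q.get((s, a), 0.0)
        e[(s, a)] = e.get((s, a), 0.0) + 1.0  # accumulating trace
        for sa in e:
            Q[sa] = Q.get(sa, 0.0) + alpha * delta * e[sa]
            e[sa] *= gamma * lam              # credit decays with recency
        s = s2

random.seed(0)
Q = {}
for _ in range(200):
    q_lambda_episode(Chain(), Q)
```

With λ = 0 this reduces to one-step Q-learning; larger λ lets the terminal reward reach early states in a single episode rather than one step per episode.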
Rollout algorithms for stochastic scheduling problems
Journal of Heuristics, 1999
Cited by 70 (3 self)
Abstract: Stochastic scheduling problems are difficult stochastic control problems with combinatorial decision spaces. In this paper we focus on a class of stochastic scheduling problems, the quiz problem and its variations. We discuss the use of heuristics for their solution, and we propose rollout algorithms based on these heuristics which approximate the stochastic dynamic programming algorithm. We show how the rollout algorithms can be implemented efficiently, with considerable savings in computation over optimal algorithms. We delineate circumstances under which the rollout algorithms are guaranteed to perform better than the heuristics on which they are based. We also show computational results which suggest that the performance of the rollout policies is near-optimal, and is substantially better than the performance of their underlying heuristics.
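A minimal sketch of the rollout idea on the quiz problem the abstract mentions: question i pays v[i] with success probability p[i], and the quiz stops at the first failure. The base heuristic (largest p[i]*v[i] first), the exact expected-value evaluation, and the two-question instance are invented for illustration, not the paper's formulation.

```python
def expected_value(p, v, order):
    """Exact expected reward of answering questions in the given order."""
    total, alive = 0.0, 1.0
    for i in order:
        alive *= p[i]          # probability of surviving to collect v[i]
        total += alive * v[i]
    return total

def base_order(p, v, remaining):
    """Base heuristic: highest immediate expected value p*v first."""
    return sorted(remaining, key=lambda i: p[i] * v[i], reverse=True)

def rollout_schedule(p, v):
    """Build a schedule by one-step lookahead over the base heuristic:
    try each candidate as the next question, complete the remainder with
    the base heuristic, and commit to the best-scoring candidate."""
    remaining, schedule = list(range(len(p))), []
    while remaining:
        def score(q):
            rest = base_order(p, v, [i for i in remaining if i != q])
            return expected_value(p, v, schedule + [q] + rest)
        best = max(remaining, key=score)
        schedule.append(best)
        remaining.remove(best)
    return schedule

p, v = [0.2, 0.9], [30.0, 5.0]
greedy = base_order(p, v, [0, 1])   # answers the risky question first
rolled = rollout_schedule(p, v)     # lookahead reverses the order
```

On this instance the base heuristic's order is worth 6.9 in expectation while the rollout order is worth 9.9, illustrating the guarantee the abstract states: the rollout policy does at least as well as the heuristic it is built on.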
Approximate Solutions to Markov Decision Processes
1999
Cited by 66 (9 self)
Abstract:
One of the basic problems of machine learning is deciding how to act in an uncertain world. For example, if I want my robot to bring me a cup of coffee, it must be able to compute the correct sequence of electrical impulses to send to its motors to navigate from the coffee pot to my office. In fact, since the results of its actions are not completely predictable, it is not enough just to compute the correct sequence; instead the robot must sense and correct for deviations from its intended path. In order for any machine learner to act reasonably in an uncertain environment, it must solve problems like the above one quickly and reliably. Unfortunately, the world is often so complicated that it is difficult or impossible to find the optimal sequence of actions to achieve a given goal. So, in order to scale our learners up to real-world problems, we usually must settle for approximate solutions. One representation for a learner's environment and goals is a Markov decision process or MDP. ...
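The MDP representation the passage introduces can be made concrete with value iteration, the exact dynamic-programming baseline that approximate solvers try to scale up. The two-state MDP below is an invented toy example, not from the thesis.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[s][a] is a list of (prob, next_state) pairs; R[s][a] is the
    expected immediate reward.  Returns optimal values and a greedy policy."""
    n = len(P)
    V = [0.0] * n
    while True:
        # Bellman optimality backup for every state
        V_new = [
            max(R[s][a] + gamma * sum(pr * V[s2] for pr, s2 in P[s][a])
                for a in range(len(P[s])))
            for s in range(n)
        ]
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            break
        V = V_new
    # the policy that is greedy with respect to the converged values
    policy = [
        max(range(len(P[s])),
            key=lambda a: R[s][a] + gamma * sum(pr * V[s2] for pr, s2 in P[s][a]))
        for s in range(n)
    ]
    return V, policy

P = [[[(1.0, 0)], [(1.0, 1)]],   # state 0: stay (reward 1) or move to state 1
     [[(1.0, 1)]]]               # state 1: single action, stay (reward 2)
R = [[1.0, 0.0], [2.0]]
V, policy = value_iteration(P, R)
```

The optimal policy forgoes the immediate reward in state 0 to reach the better absorbing state, exactly the kind of delayed-reward trade-off that makes greedy action choice insufficient.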
Reinforcement Learning with a Hierarchy of Abstract Models
In Proceedings of the Tenth National Conference on Artificial Intelligence, 1992
Cited by 65 (8 self)
Abstract:
Reinforcement learning (RL) algorithms have traditionally been thought of as trial-and-error learning methods that use actual control experience to incrementally improve a control policy. Sutton's DYNA architecture demonstrated that RL algorithms can work as well using simulated experience from an environment model, and that the resulting computation was similar to doing one-step lookahead planning. Inspired by the literature on hierarchical planning, I propose learning a hierarchy of models of the environment that abstract temporal detail as a means of improving the scalability of RL algorithms. I present H-DYNA (Hierarchical DYNA), an extension to Sutton's DYNA architecture that is able to learn such a hierarchy of abstract models. H-DYNA differs from hierarchical planners in two ways: first, the abstract models are learned using experience gained while...
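The flat DYNA loop that H-DYNA extends is simple to sketch: each real transition trains the Q-function directly, is stored in a learned model, and then funds extra "planning" backups on simulated transitions replayed from that model. The deterministic-model assumption, the toy transitions, and the update counts below are illustrative, not Sutton's or the paper's configuration.

```python
import random

def dyna_q_update(Q, model, s, a, r, s2, actions,
                  alpha=0.1, gamma=0.95, n_planning=10):
    """One Dyna step: direct RL on the real transition, model learning,
    then n_planning one-step backups on transitions drawn from the model."""
    def backup(x, u, rew, x2):
        best = max(Q.get((x2, b), 0.0) for b in actions)
        old = Q.get((x, u), 0.0)
        Q[(x, u)] = old + alpha * (rew + gamma * best - old)

    backup(s, a, r, s2)             # learn from real experience
    model[(s, a)] = (r, s2)         # remember it (deterministic model)
    for _ in range(n_planning):     # planning with simulated experience
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        backup(ps, pa, pr, ps2)

random.seed(0)
Q, model = {}, {}
dyna_q_update(Q, model, 1, 0, 1.0, 2, actions=[0, 1])
dyna_q_update(Q, model, 0, 0, 0.0, 1, actions=[0, 1], n_planning=50)
```

After only two real transitions, the replayed backups have already propagated value from the rewarded transition back to the earlier state, which is the effect planning from a model buys.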
TD(λ) Converges with Probability 1
1994
Cited by 55 (2 self)
Abstract:
The methods of temporal differences (Samuel, 1959; Sutton, 1984, 1988) allow an agent to learn accurate predictions of stationary stochastic future outcomes. The learning is effectively stochastic approximation based on samples extracted from the process generating the agent's future. Sutton (1988) proved that for a special case of temporal differences, the expected values of the predictions converge to their correct values, as larger samples are taken, and Dayan (1992) extended his proof to the general case. This article proves the stronger result that the predictions of a slightly modified form of temporal difference learning converge with probability one, and shows how to quantify the rate of convergence.
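The prediction method whose convergence is at issue can be sketched in tabular form; eligibility traces spread each temporal-difference error over recently visited states. The two-state reward chain and the accumulating-trace variant below are illustrative assumptions, not the specific modified form the article analyzes.

```python
def td_lambda(episodes, V, alpha=0.1, gamma=1.0, lam=0.8):
    """Tabular TD(lambda) prediction.  Each episode is a list of
    (state, reward, next_state) steps, with next_state None on the
    terminal transition."""
    for episode in episodes:
        e = {}                                  # eligibility traces
        for s, r, s2 in episode:
            v_next = 0.0 if s2 is None else V.get(s2, 0.0)
            delta = r + gamma * v_next - V.get(s, 0.0)
            e[s] = e.get(s, 0.0) + 1.0          # accumulating trace
            for x in e:
                V[x] = V.get(x, 0.0) + alpha * delta * e[x]
                e[x] *= gamma * lam
    return V

# Chain A -> B -> terminal with reward 1 on the final step; the true
# predictions are V(A) = V(B) = 1 when gamma = 1.
V = td_lambda([[("A", 0.0, "B"), ("B", 1.0, None)]] * 500, {})
```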
Learning to Solve Markovian Decision Processes
1994
Cited by 49 (3 self)
Abstract:
This dissertation is about building learning control architectures for agents embedded in finite, stationary, and Markovian environments. Such architectures give embedded agents the ability to improve autonomously the efficiency with which they can achieve goals. Machine learning researchers have developed reinforcement learning (RL) algorithms based on dynamic programming (DP) that use the agent's experience in its environment to improve its decision policy incrementally. This is achieved by adapting an evaluation function in such a way that the decision policy that is "greedy" with respect to it improves with experience. This dissertation focuses on finite, stationary and Markovian environments for two reasons: it allows the develop...
Problem Solving With Reinforcement Learning
1995
Cited by 45 (0 self)
Abstract:
This dissertation is submitted for consideration for the degree of Doctor of Philosophy at the University of Cambridge. Summary: This thesis is concerned with practical issues surrounding the application of reinforcement learning techniques to tasks that take place in high-dimensional continuous state-space environments. In particular, the extension of online updating methods is considered, where the term implies systems that learn as each experience arrives, rather than storing the experiences for use in a separate offline learning phase. Firstly, the use of alternative update rules in place of standard Q-learning (Watkins 1989) is examined to provide faster convergence rates. Secondly, the use of multi-layer perceptron (MLP) neural networks (Rumelhart, Hinton and Williams 1986) is investigated to provide suitable generalising function approximators. Finally, consideration is given to the combination of Adaptive Heuristic Critic (AHC) methods and Q-learning to produce systems combining the benefits of real-valued actions and discrete switching
Advantage Updating
1993
Cited by 44 (0 self)
Abstract:
A new algorithm for reinforcement learning, advantage updating, is proposed. Advantage updating is a direct learning technique; it does not require a model to be given or learned. It is incremental, requiring only a constant amount of calculation per time step, independent of the number of possible actions, possible outcomes from a given action, or number of states. Analysis and simulation indicate that advantage updating is applicable to reinforcement learning systems working in continuous time (or discrete time with small time steps) for which Q-learning is not applicable. Simulation results are presented indicating that for a simple linear quadratic regulator (LQR) problem with no noise and large time steps, advantage updating learns slightly faster than Q-learning. When there is noise or small time steps, advantage updating learns more quickly than Q-learning by a factor of more than 100,000. Convergence properties and implementation issues are discussed. New convergence results...
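A simplified sketch in the spirit of this idea: store advantages A(s, a) relative to the state value V(s) = max_a A(s, a), and scale the temporal-difference correction by the time step dt so that small steps do not wash out the differences between actions. This omits the separate value-function and normalization updates of the full algorithm, so it illustrates the time-step scaling rather than reproducing Baird's exact method; all numbers below are invented.

```python
def advantage_update(A, s, a, r, s2, actions, alpha=0.1, gamma=0.99, dt=0.1):
    """Advantage-style backup: TD error measured per unit time.
    With dt == 1 the update reduces to one-step Q-learning."""
    v_s = max(A.get((s, b), 0.0) for b in actions)
    v_s2 = max(A.get((s2, b), 0.0) for b in actions)
    # discount over a step of length dt, TD error divided by dt
    target = v_s + (r + (gamma ** dt) * v_s2 - v_s) / dt
    A[(s, a)] = A.get((s, a), 0.0) + alpha * (target - A.get((s, a), 0.0))

A, B = {}, {}
advantage_update(A, 0, 0, 1.0, 1, [0, 1])           # dt = 0.1 (default)
advantage_update(B, 0, 0, 1.0, 1, [0, 1], dt=1.0)   # Q-learning limit
```

The same reward moves the advantage ten times further when dt = 0.1 than when dt = 1, which is the mechanism that keeps action preferences learnable as the time step shrinks.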
A Lyapunov Bound for Solutions of Poisson's Equation
Ann. Probab., 1996
Cited by 43 (25 self)
Abstract:
In this paper we consider ψ-irreducible Markov processes evolving in discrete or continuous time, on a general state space. We develop a Lyapunov function criterion that permits one to obtain explicit bounds on the solution to Poisson's equation and, in particular, obtain conditions under which the solution is square integrable. These results are applied to obtain sufficient conditions that guarantee the validity of a functional central limit theorem for the Markov process. As a second consequence of the bounds obtained, a perturbation theory for Markov processes is developed which gives conditions under which both the solution to Poisson's equation and the invariant probability for the process are continuous functions of its transition kernel. The techniques are illustrated with applications to queueing theory and autoregressive processes. AMS subject classifications: 68M20, 60J10. Running head: Poisson's Equation. Keywords: Markov chain, Markov process, Poisson's equation, Lyapunov f...
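Schematically, and hedging on the paper's precise hypotheses, the objects involved can be written in the discrete-time case as follows. For a transition kernel $P$, invariant probability $\pi$, and cost function $f$ with $\pi(|f|) < \infty$, Poisson's equation asks for $\hat{f}$ satisfying

$$(I - P)\hat{f} = f - \pi(f),$$

and a Lyapunov drift condition of the form

$$PV \le V - f + b\,\mathbf{1}_C,$$

for a suitable set $C$ and constant $b < \infty$, is the kind of criterion that yields explicit bounds such as $|\hat{f}| \le c\,(V + 1)$. The exact conditions and constants are those stated in the paper, not this sketch.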
Hidden Markov modeling for network communication channels
In Proceedings of the ACM SIGMETRICS, 2001
Cited by 41 (0 self)
Abstract:
In this paper we perform the statistical analysis of an Internet communication channel. Our study is based on a Hidden Markov Model (HMM). The channel switches between different states; to each state corresponds the probability that a packet sent by the transmitter will be lost. The transition between the different states of the channel is governed by a Markov chain; this Markov chain is not observed directly, but the received packet flow provides some probabilistic information about the current state of the channel, as well as some information about the parameters of the model. In this paper we detail some useful algorithms for the estimation of the channel parameters, and for making inference about the state of the channel. We discuss the relevance of the Markov model of the channel; we also discuss how many states are required to adequately model a real communication channel.
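The channel model and the state-inference step can be sketched with a two-state (Gilbert-style good/bad) instance. The transition and loss probabilities below are invented for illustration, and the forward filter shown is the standard HMM recursion rather than the paper's specific estimation algorithms.

```python
import random

def simulate_channel(n, trans, loss, rng):
    """Two-state hidden Markov loss channel.  trans[s] is the probability
    of leaving state s at each step; loss[s] is the packet-loss
    probability while in state s.  Returns the loss sequence (1 = lost)."""
    s, obs = 0, []
    for _ in range(n):
        obs.append(1 if rng.random() < loss[s] else 0)
        if rng.random() < trans[s]:
            s = 1 - s
    return obs

def filter_state(obs, trans, loss, prior=(0.5, 0.5)):
    """Forward algorithm: posterior distribution over the hidden channel
    state given the loss observations so far."""
    p = list(prior)
    for o in obs:
        like = [loss[s] if o else 1.0 - loss[s] for s in (0, 1)]
        p = [p[s] * like[s] for s in (0, 1)]          # condition on obs
        z = sum(p)
        p = [x / z for x in p]
        p = [p[0] * (1 - trans[0]) + p[1] * trans[1],  # predict next state
             p[0] * trans[0] + p[1] * (1 - trans[1])]
    return p

trans, loss = [0.05, 0.05], [0.01, 0.9]     # invented "good"/"bad" states
obs = simulate_channel(200, trans, loss, random.Random(0))
posterior = filter_state([1] * 10, trans, loss)   # a burst of losses
```

After a burst of consecutive losses the filter places almost all posterior mass on the lossy state, which is exactly the kind of inference about the hidden channel state the abstract describes.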