Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results, 1996
Cited by 80 (12 self)
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies, and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
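The "independent estimation of the average reward and the relative values" mentioned in this abstract can be illustrated with a minimal tabular sketch of R-learning in one commonly stated form. The environment function `step`, the parameter names, and the default learning rates below are assumptions made for illustration; they are not taken from the paper.

```python
import random
from collections import defaultdict


def r_learning(step, actions, start_state, n_steps=10000,
               beta=0.1, alpha=0.01, epsilon=0.1, seed=0):
    """Tabular R-learning sketch (one common formulation, not the paper's code).

    `step(state, action, rng)` is a hypothetical environment function that
    returns a `(reward, next_state)` pair.
    """
    rng = random.Random(seed)
    R = defaultdict(float)  # relative action values R(s, a)
    rho = 0.0               # separate running estimate of the average reward
    s = start_state
    for _ in range(n_steps):
        greedy_a = max(actions, key=lambda a: R[(s, a)])
        # epsilon-greedy exploration
        a = rng.choice(actions) if rng.random() < epsilon else greedy_a
        r, s2 = step(s, a, rng)
        # Both maxima are taken before R is changed in this step.
        best_next = max(R[(s2, b)] for b in actions)
        best_here = max(R[(s, b)] for b in actions)
        # Relative-value update: the reward is measured against rho.
        R[(s, a)] += beta * (r - rho + best_next - R[(s, a)])
        # The average-reward estimate is updated only on greedy steps.
        if a == greedy_a:
            rho += alpha * (r + best_next - best_here - rho)
        s = s2
    return R, rho
```

Updating `rho` only on greedy steps keeps exploratory actions from biasing the average-reward estimate, which is one reason the method is sensitive to the exploration strategy.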
Approximate Solutions to Markov Decision Processes, 1999
Cited by 62 (9 self)
One of the basic problems of machine learning is deciding how to act in an uncertain world. For example, if I want my robot to bring me a cup of coffee, it must be able to compute the correct sequence of electrical impulses to send to its motors to navigate from the coffee pot to my office. In fact, since the results of its actions are not completely predictable, it is not enough just to compute the correct sequence; instead the robot must sense and correct for deviations from its intended path. In order for any machine learner to act reasonably in an uncertain environment, it must solve problems like the above one quickly and reliably. Unfortunately, the world is often so complicated that it is difficult or impossible to find the optimal sequence of actions to achieve a given goal. So, in order to scale our learners up to real-world problems, we usually must settle for approximate solutions. One representation for a learner's environment and goals is a Markov decision process or MDP. ...
Reinforcement Learning with a Hierarchy of Abstract Models - In Proceedings of the Tenth National Conference on Artificial Intelligence, 1992
Cited by 61 (8 self)
Reinforcement learning (RL) algorithms have traditionally been thought of as trial and error learning methods that use actual control experience to incrementally improve a control policy. Sutton's DYNA architecture demonstrated that RL algorithms can work as well using simulated experience from an environment model, and that the resulting computation was similar to doing one-step lookahead planning. Inspired by the literature on hierarchical planning, I propose learning a hierarchy of models of the environment that abstract temporal detail as a means of improving the scalability of RL algorithms. I present H-DYNA (Hierarchical DYNA), an extension to Sutton's DYNA architecture that is able to learn such a hierarchy of abstract models. H-DYNA differs from hierarchical planners in two ways: first, the abstract models are learned using experience gained while...
Rollout algorithms for stochastic scheduling problems - Journal of Heuristics, 1999
Cited by 55 (2 self)
Stochastic scheduling problems are difficult stochastic control problems with combinatorial decision spaces. In this paper we focus on a class of stochastic scheduling problems, the quiz problem and its variations. We discuss the use of heuristics for their solution, and we propose rollout algorithms based on these heuristics which approximate the stochastic dynamic programming algorithm. We show how the rollout algorithms can be implemented efficiently, with considerable savings in computation over optimal algorithms. We delineate circumstances under which the rollout algorithms are guaranteed to perform better than the heuristics on which they are based. We also show computational results which suggest that the performance of the rollout policies is near-optimal, and is substantially better than the performance of their underlying heuristics.
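The idea of a rollout policy built on a base heuristic can be sketched as follows. The abstract does not define the quiz problem, so this sketch assumes its usual basic statement: question i is answered correctly with probability p[i] and earns reward v[i], and the quiz ends at the first wrong answer. The greedy base heuristic (largest p[i] * v[i] first) and all function names are illustrative choices, not the paper's implementation.

```python
def expected_reward(order, p, v):
    """Expected total reward of answering questions in `order`,
    assuming the quiz stops at the first incorrect answer."""
    total, alive = 0.0, 1.0
    for i in order:
        alive *= p[i]          # probability of still being in the quiz
        total += alive * v[i]
    return total


def greedy_order(items, p, v):
    """Base heuristic (an assumption): largest expected one-step reward first."""
    return sorted(items, key=lambda i: p[i] * v[i], reverse=True)


def rollout_order(p, v):
    """Rollout: at each stage, try every remaining question as the next one,
    complete the schedule with the base heuristic, and keep the best choice."""
    chosen, remaining = [], list(range(len(p)))
    while remaining:
        def value_if_next(c):
            rest = [i for i in remaining if i != c]
            return expected_reward(chosen + [c] + greedy_order(rest, p, v), p, v)
        best = max(remaining, key=value_if_next)
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Because this base heuristic is a fixed sort with stable tie-breaking, the schedule it builds from any partial state is a suffix of the one it built a stage earlier, so the rollout schedule is never worse than the heuristic schedule. Each stage costs one heuristic completion per remaining question, which is far cheaper than enumerating all orderings.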
TD(λ) Converges with Probability 1, 1994
Cited by 49 (1 self)
The methods of temporal differences (Samuel, 1959; Sutton, 1984, 1988) allow an agent to learn accurate predictions of stationary stochastic future outcomes. The learning is effectively stochastic approximation based on samples extracted from the process generating the agent's future. Sutton (1988) proved that for a special case of temporal differences, the expected values of the predictions converge to their correct values, as larger samples are taken, and Dayan (1992) extended his proof to the general case. This article proves the stronger result that the predictions of a slightly modified form of temporal difference learning converge with probability one, and shows how to quantify the rate of convergence.
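For orientation, a minimal tabular sketch of the standard TD(λ) prediction update with accumulating eligibility traces is given below. It is the textbook form, not the "slightly modified form" whose convergence the article proves; the episode format and parameter defaults are assumptions for illustration.

```python
def td_lambda(episodes, n_states, alpha=0.1, gamma=1.0, lam=0.8):
    """Tabular TD(lambda) prediction with accumulating traces.

    Each episode is a list of (state, reward, next_state) transitions,
    with next_state set to None when the episode terminates
    (a hypothetical data format chosen for this sketch).
    """
    V = [0.0] * n_states
    for episode in episodes:
        e = [0.0] * n_states  # eligibility traces, reset every episode
        for s, r, s2 in episode:
            v_next = 0.0 if s2 is None else V[s2]
            delta = r + gamma * v_next - V[s]  # temporal-difference error
            e[s] += 1.0                        # accumulate trace for s
            for i in range(n_states):
                V[i] += alpha * delta * e[i]   # credit recently visited states
                e[i] *= gamma * lam            # decay all traces
    return V
```

With `lam=0` this reduces to one-step TD, and with `lam=1` and `gamma=1` the updates within an episode sum to a Monte Carlo style correction.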
Learning to Solve Markovian Decision Processes, 1994
Cited by 43 (3 self)
This dissertation is about building learning control architectures for agents embedded in finite, stationary, and Markovian environments. Such architectures give embedded agents the ability to improve autonomously the efficiency with which they can achieve goals. Machine learning researchers have developed reinforcement learning (RL) algorithms based on dynamic programming (DP) that use the agent's experience in its environment to improve its decision policy incrementally. This is achieved by adapting an evaluation function in such a way that the decision policy that is "greedy" with respect to it improves with experience. This dissertation focuses on finite, stationary and Markovian environments for two reasons: it allows the develop...
Advantage Updating, 1993
Cited by 40 (0 self)
A new algorithm for reinforcement learning, advantage updating, is proposed. Advantage updating is a direct learning technique; it does not require a model to be given or learned. It is incremental, requiring only a constant amount of calculation per time step, independent of the number of possible actions, possible outcomes from a given action, or number of states. Analysis and simulation indicate that advantage updating is applicable to reinforcement learning systems working in continuous time (or discrete time with small time steps) for which Q-learning is not applicable. Simulation results are presented indicating that for a simple linear quadratic regulator (LQR) problem with no noise and large time steps, advantage updating learns slightly faster than Q-learning. When there is noise or small time steps, advantage updating learns more quickly than Q-learning by a factor of more than 100,000. Convergence properties and implementation issues are discussed. New convergence results...
A Lyapunov Bound for Solutions of Poisson's Equation - Ann. Probab., 1996
Cited by 33 (23 self)
In this paper we consider ψ-irreducible Markov processes evolving in discrete or continuous time, on a general state space. We develop a Lyapunov function criterion that permits one to obtain explicit bounds on the solution to Poisson's equation and, in particular, obtain conditions under which the solution is square integrable. These results are applied to obtain sufficient conditions that guarantee the validity of a functional central limit theorem for the Markov process. As a second consequence of the bounds obtained, a perturbation theory for Markov processes is developed which gives conditions under which both the solution to Poisson's equation and the invariant probability for the process are continuous functions of its transition kernel. The techniques are illustrated with applications to queueing theory and autoregressive processes. AMS subject classifications: 68M20, 60J10. Running head: Poisson's Equation. Keywords: Markov chain, Markov process, Poisson's equation, Lyapunov f...
Measurement-Based Usage Charges in Communications Networks - Operations Research, 1997
Cited by 32 (6 self)
This paper describes methods of computing usage charges from simple measurements and relating these to bounds on the effective bandwidth. Thus we show that charging for usage on the basis of effective bandwidths can be well-approximated by charges based on simple measurements. Charging and pricing are essential requirements in the operation of a communication network. They are needed not only to recover costs and make a profit. Even if a generous operator is willing to offer a network for free, there are still compelling reasons to charge for services in order to exercise control. The congestion that has plagued the Internet because it lacks any mechanism for charging and pricing highlights the fact that without charges it is difficult to control congestion or divide network resources amongst users in a workable and stable way. Subject classifications: Communications: measurement-based charging. Of course there are many considerations that influence the prices at which an operator will choose to sell network services. Marketing and regulation are certainly important, but these considerations are not unique to the operation of a communications network. Special considerations do, however, arise from the fact that a broadband communications network is intended simultaneously to carry a wide variety of traffic types. Our conception of a broadband network is that of a collection of resources (links, buffers, switches, etc.) which can be used to provide a wide variety of communications services. These services are distinguished by traffic contracts, which specify parameters to which the traffic must adhere (a maximum peak rate, for example), and the quality of service which the network undertakes to guarantee (typically, cell loss or delay). These concepts are accepted as ...

