Results 1–10 of 48
The o.d.e. method for convergence of stochastic approximation and reinforcement learning
SIAM J. Control Optim., 2000
Cited by 98 (19 self)
It is shown here that stability of the stochastic approximation algorithm is implied by the asymptotic stability of the origin for an associated ODE. This in turn implies convergence of the algorithm. Several specific classes of algorithms are considered as applications. It is found that the results provide (i) a simpler derivation of known results for reinforcement learning algorithms; (ii) a proof for the first time that a class of asynchronous stochastic approximation algorithms is convergent without using any a priori assumption of stability; (iii) a proof for the first time that asynchronous adaptive critic and Q-learning algorithms are convergent for the average-cost optimal control problem.
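The ODE connection summarized in this abstract can be illustrated with a minimal sketch (a hypothetical toy, not the paper's general setting): the iterates x_{n+1} = x_n + a_n (h(x_n) + w_n) track the ODE dx/dt = h(x), so an asymptotically stable origin pulls the algorithm to zero.

```python
import random

def stochastic_approximation(h, x0, n_steps=20000, noise=0.1, seed=0):
    """Run x_{n+1} = x_n + a_n * (h(x_n) + w_n) with a_n = 1/(n+1)."""
    rng = random.Random(seed)
    x = x0
    for n in range(n_steps):
        a_n = 1.0 / (n + 1)
        w_n = rng.gauss(0.0, noise)   # zero-mean noise term
        x = x + a_n * (h(x) + w_n)
    return x

# Illustrative choice h(x) = -x: the associated ODE dx/dt = -x has a
# globally asymptotically stable origin, so the iterates settle near 0.
x_final = stochastic_approximation(lambda x: -x, x0=5.0)
```

All parameters here (step sizes, noise level, the map h) are illustrative choices, not the paper's assumptions.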
Stochastic Approximation for Nonexpansive Maps: Application to Q-Learning Algorithms
2002
Cited by 20 (4 self)
We discuss synchronous and asynchronous iterations of the form x_{k+1} = x_k + γ(k)(h(x_k) + w_k), where h is a suitable map and {w_k} is a deterministic or stochastic sequence satisfying suitable conditions. In particular, in the stochastic case, these are stochastic approximation iterations that can be analyzed using the ODE approach, based either on Kushner and Clark’s lemma for the synchronous case or on Borkar’s theorem for the asynchronous case. However, the analysis requires that the iterates {x_k} be bounded, a fact which is usually hard to prove. We develop a novel framework for proving boundedness in the deterministic framework, which is also applicable to the stochastic case when the deterministic hypotheses can be verified in the almost sure sense. This is based on scaling ideas and on the properties of Lyapunov functions. We then combine the boundedness property with Borkar’s stability analysis of ODEs involving nonexpansive mappings to prove convergence (with probability 1 in the stochastic case). We also apply our convergence analysis to Q-learning algorithms for stochastic shortest path problems and are able to relax some of the assumptions of the currently available results.
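A minimal sketch of the synchronous iteration discussed above, under illustrative assumptions of my own (F a contraction, hence nonexpansive; Gaussian noise; step sizes γ(k) = (k+1)^(-0.6) so that Σγ diverges while Σγ² converges):

```python
import random

def nonexpansive_iteration(F, x0, n_steps=50000, noise=0.05, seed=1):
    """Run x_{k+1} = x_k + gamma(k) * (h(x_k) + w_k), with h(x) = F(x) - x,
    so the iteration seeks a fixed point of F."""
    rng = random.Random(seed)
    x = x0
    for k in range(n_steps):
        gamma = 1.0 / (k + 1) ** 0.6   # diverging sum, summable squares
        w = rng.gauss(0.0, noise)
        x = x + gamma * (F(x) - x + w)
    return x

# F(x) = 0.9*x + 1 is a contraction (hence nonexpansive); its fixed
# point is x* = 10, which the iterates approach with probability 1.
x_star = nonexpansive_iteration(lambda x: 0.9 * x + 1.0, x0=0.0)
```

The paper's framework covers much weaker hypotheses (nonexpansive rather than contractive maps, asynchronous updates); this sketch only shows the shape of the iteration.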
Reinforcement Learning: A Tutorial Survey and Recent Advances
INFORMS Journal on Computing, 2009
Cited by 18 (2 self)
In the last few years, Reinforcement Learning (RL), also called adaptive (or approximate) dynamic programming (ADP), has emerged as a powerful tool for solving complex sequential decision-making problems in control theory. Although seminal research in this area was performed in the artificial intelligence (AI) community, more recently, it has attracted the attention of optimization theorists because of several noteworthy success stories from operations management. It is on large-scale and complex problems of dynamic optimization, in particular the Markov decision problem (MDP) and its variants, that the power of RL becomes more obvious. It has been known for many years that on large-scale MDPs, the curse of dimensionality and the curse of modeling render classical dynamic programming (DP) ineffective. The excitement in RL stems from its direct attack on these curses, allowing it to solve problems that were considered intractable via classical DP in the past. The success of RL is due to its strong mathematical roots in the principles of DP, Monte Carlo simulation, function approximation, and AI. Topics treated in some detail in this survey are: temporal differences, Q-Learning, semi-MDPs and stochastic games. Several recent advances in RL, e.g., policy gradients and hierarchical RL, are covered along with references. Pointers to numerous examples of applications are provided. This overview is aimed at uncovering the mathematical roots of this science, so that readers gain a clear understanding of the core concepts and are able to use them in their own research. The survey points to more than 100 references from the literature.
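As a concrete illustration of the Q-Learning topic covered by the survey, here is a minimal tabular sketch on a hypothetical 2-state MDP (the MDP, rewards, and all parameters are illustrative, not taken from the survey):

```python
import random

def q_learning(n_steps=5000, alpha=0.1, gamma=0.9, eps=0.3, seed=0):
    """Tabular Q-learning on a toy 2-state chain: in state 0, action 1
    moves to state 1 (reward 0); in state 1, action 1 pays reward 1 and
    returns to state 0; action 0 always stays put."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
    s = 0
    for _ in range(n_steps):
        # epsilon-greedy action selection
        if rng.random() < eps:
            a = rng.choice((0, 1))
        else:
            a = max((0, 1), key=lambda act: Q[(s, act)])
        # deterministic toy dynamics
        if s == 0:
            s2, r = (1, 0.0) if a == 1 else (0, 0.0)
        else:
            s2, r = (0, 1.0) if a == 1 else (1, 0.0)
        # Q-learning update toward the bootstrapped target
        target = r + gamma * max(Q[(s2, 0)], Q[(s2, 1)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2
    return Q

Q = q_learning()
```

After learning, the greedy policy prefers action 1 in both states, which is the optimal behavior in this toy chain.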
An On-Line Learning Algorithm for Energy Efficient Delay Constrained Scheduling over a Fading Channel
Cited by 16 (5 self)
In this paper, we consider the problem of energy efficient scheduling under an average delay constraint for a single-user fading channel. We propose a new approach for online implementation of the optimal packet scheduling algorithm. This approach is based on reformulating the value iteration equation by introducing a virtual state called the post-decision state. The resultant value iteration equation becomes amenable to online implementation based on stochastic approximation. This approach has the advantage that explicit knowledge of the probability distribution of the channel state as well as the arrivals is not required for the implementation. We prove that the online algorithm indeed converges to the optimal policy.
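The post-decision-state reformulation described above can be sketched on a hypothetical toy buffer model (the costs, arrival process, and state space below are illustrative stand-ins for the paper's fading-channel setting). The key point survives: the update uses only an observed arrival sample, never the arrival distribution.

```python
import random

def learn_postdecision_values(n_steps=20000, alpha=0.05, beta=0.9,
                              qmax=5, seed=0):
    """Learn post-decision-state values for a toy buffer. State: queue
    length q in {0..qmax}. Action: serve s <= q packets at energy cost
    s*s, with holding cost q - s on what remains. The post-decision
    state is q - s; a random arrival then occurs."""
    rng = random.Random(seed)
    V = [0.0] * (qmax + 1)   # V[j] = value of post-decision state j

    def lookahead(q):
        # deterministic one-step minimization: the expectation over
        # arrivals is replaced by the single observed sample below
        return min(s * s + (q - s) + beta * V[q - s] for s in range(q + 1))

    for _ in range(n_steps):
        a = rng.randint(0, 1)          # one observed arrival sample
        for j in range(qmax + 1):      # stochastic-approximation sweep
            q = min(j + a, qmax)       # resulting pre-decision state
            V[j] += alpha * (lookahead(q) - V[j])
    return V

V = learn_postdecision_values()
```

The learned values are nonnegative and increase with backlog, as the cost structure dictates; the paper proves convergence to the optimal policy in its actual model.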
Stochastic Optimization for Markov Modulated Networks with Application to Delay Constrained Wireless Scheduling
2009
Cited by 13 (8 self)
We consider a wireless system with a small number of delay constrained users and a larger number of users without delay constraints. We develop a scheduling algorithm that reacts to time-varying channels and maximizes throughput utility (to within a desired proximity), stabilizes all queues, and satisfies the delay constraints. The problem is solved by reducing the constrained optimization to a set of weighted stochastic shortest path problems, which act as natural generalizations of max-weight policies to Markov modulated networks. We also present approximation results that do not require a priori statistical knowledge, and discuss the additional complexity and delay incurred as compared to systems without delay constraints. The solution technique is general and applies to other constrained stochastic network optimization problems.
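For reference, the classical max-weight rule that the abstract generalizes can be sketched in a few lines (a minimal illustration; the paper's weighted stochastic-shortest-path policies are considerably more elaborate):

```python
def max_weight_schedule(queues, rates):
    """Serve the user maximizing backlog * current service rate --
    the classical max-weight scheduling rule."""
    return max(range(len(queues)), key=lambda i: queues[i] * rates[i])

# Queue 1 wins: weight 2 * 3.0 = 6.0 beats 5 * 1.0 = 5.0 and 8 * 0.5 = 4.0.
chosen = max_weight_schedule([5, 2, 8], [1.0, 3.0, 0.5])
```

Max-weight stabilizes the queues whenever the arrival rates are inside the capacity region, which is why it is the natural baseline the paper extends to delay-constrained, Markov-modulated settings.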
A new QoS provisioning method for adaptive multimedia in cellular wireless networks
In Proc. IEEE Infocom ’04, Hong Kong, 2004
Cited by 13 (3 self)
Future wireless networks are designed to support adaptive multimedia by controlling individual ongoing flows to increase or decrease their bandwidths in response to changes in traffic load. There is growing interest in quality-of-service (QoS) provisioning under this adaptive multimedia framework, in which a bandwidth adaptation algorithm needs to be used in conjunction with the call admission control algorithm. This paper presents a novel method for QoS provisioning via average reward reinforcement learning in conjunction with stochastic approximation, which can maximize the network revenue subject to several predetermined QoS constraints. Unlike other model-based algorithms (e.g., linear programming), our scheme does not require explicit state transition probabilities, and therefore, the assumptions behind the underlying system model are more realistic than those in previous schemes. In addition, when we consider the status of neighboring cells, the proposed scheme can dynamically adapt to changes in traffic conditions. Moreover, the algorithm can control the bandwidth adaptation frequency effectively by accounting for the cost of bandwidth switching in the model. The effectiveness of the proposed approach is demonstrated using simulation results in adaptive multimedia wireless networks.
Index Terms: adaptive multimedia, QoS, reinforcement learning, wireless networks.
Hierarchically Optimal Average Reward Reinforcement Learning
In Proceedings of the Nineteenth International Conference on Machine Learning, 2002
Cited by 11 (4 self)
Two notions of optimality have been explored in previous work on hierarchical reinforcement learning (HRL): hierarchical optimality, or the optimal policy in the space defined by a task hierarchy, and a weaker local model called recursive optimality. In this paper, we introduce two new average-reward HRL algorithms for finding hierarchically optimal policies.
Dynamic pricing models for electronic business
Sadhana, 2005
Cited by 8 (1 self)
Dynamic pricing is the dynamic adjustment of prices to consumers depending upon the value these customers attribute to a product or service. Today’s digital economy is ready for dynamic pricing; however, recent research has shown that the prices will have to be adjusted in fairly sophisticated ways, based on sound mathematical models, to derive the benefits of dynamic pricing. This article attempts to survey different models that have been used in dynamic pricing. We first motivate dynamic pricing and present underlying concepts, with several examples, and explain conditions under which dynamic pricing is likely to succeed. We then bring out the role of models in computing dynamic prices. The models surveyed include inventory-based models, data-driven models, auctions, and machine learning. We present a detailed example of an e-business market to show the use of reinforcement learning in dynamic pricing.
Reinforcement Learning for Resource Allocation in LEO satellite networks
IEEE Transactions on Systems, Man, and Cybernetics, Part B, Volume 37, 2007
Cited by 5 (1 self)
In this paper, we develop and assess online decision-making algorithms for call admission and routing for low Earth orbit (LEO) satellite networks. It has been shown in a recent paper that, in a LEO satellite system, a semi-Markov decision process formulation of the call admission and routing problem can achieve better performance in terms of an average revenue function than existing routing methods. However, the conventional dynamic programming (DP) numerical solution becomes prohibitive as the problem size increases. In this paper, two solution methods based on reinforcement learning (RL) are proposed in order to circumvent the computational burden of DP. The first method is based on an actor–critic method with temporal-difference (TD) learning. The second method is based on a critic-only method, called optimistic TD learning. The algorithms enhance performance in terms of requirements in storage, computational complexity and computational time, and in terms of an overall long-term average revenue function that penalizes blocked calls. Numerical studies are carried out, and the results obtained show that the RL framework can achieve up to 56% higher average revenue over existing routing methods used in LEO satellite networks with reasonable storage and computational requirements.
Index Terms: call admission control (CAC), low Earth orbit (LEO) satellite network, reinforcement learning (RL), routing, temporal-difference (TD) learning.
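The critic side of the TD methods mentioned above works in the average-reward setting. A minimal critic-only sketch on a hypothetical 2-state chain (the chain, rewards, and step sizes are illustrative, not the paper's satellite model):

```python
import random

def average_reward_td(n_steps=50000, alpha=0.05, beta=0.005, seed=0):
    """Critic-only temporal-difference learning of the long-run average
    reward rho. Reward is 1 in state 0 and 0 in state 1; the state flips
    with probability 0.5, so the true average reward is 0.5."""
    rng = random.Random(seed)
    V = [0.0, 0.0]   # differential values (defined up to a constant)
    rho = 0.0        # running estimate of the average reward
    s = 0
    for _ in range(n_steps):
        r = 1.0 if s == 0 else 0.0
        s2 = 1 - s if rng.random() < 0.5 else s
        delta = r - rho + V[s2] - V[s]   # average-reward TD error
        V[s] += alpha * delta
        rho += beta * delta              # slower timescale for rho
        s = s2
    return rho, V

rho, V = average_reward_td()
```

The rho estimate settles near the true average reward 0.5, and the differential values rank state 0 above state 1, since it earns the immediate reward.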