Results 11–20 of 50
Error Propagation for Approximate Policy and Value Iteration
Abstract

Cited by 4 (0 self)
We address the question of how the approximation error/Bellman residual at each iteration of the Approximate Policy/Value Iteration algorithms influences the quality of the resulting policy. We quantify the performance loss as the L_p norm of the approximation error/Bellman residual at each iteration. Moreover, we show that the performance loss depends on the expectation of the squared Radon-Nikodym derivative of a certain distribution, rather than its supremum, as opposed to what has been suggested by previous results. Our results also indicate that the contribution of the approximation/Bellman error to the performance loss is more prominent in the later iterations of API/AVI, and the effect of an error term in the earlier iterations decays exponentially fast.
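The claim that later-iteration errors dominate can be checked numerically. The sketch below (a hypothetical single-action two-state chain, not taken from the paper) injects one approximation error of size eps into a single backup of value iteration; because the transition matrix is row-stochastic and the error is a constant vector, an error injected with k backups remaining decays to exactly gamma**k * eps.

```python
import numpy as np

# Hypothetical single-action two-state chain, used only for illustration.
P = np.array([[0.9, 0.1], [0.2, 0.8]])   # row-stochastic transition matrix
r = np.array([1.0, 0.0])
gamma = 0.9

V_exact = np.linalg.solve(np.eye(2) - gamma * P, r)  # fixed point of the backup

def gap_after_error(K, err_at, eps=0.5):
    """Run K backups V <- r + gamma * P @ V starting from the fixed point,
    but inject an error of size eps right after backup `err_at`.
    Return the final sup-norm distance from the fixed point."""
    V = V_exact.copy()
    for k in range(K):
        V = r + gamma * P @ V
        if k == err_at:
            V = V + eps          # the approximation error at this iteration
    return np.max(np.abs(V - V_exact))

early = gap_after_error(K=50, err_at=5)    # decays by gamma**44
late = gap_after_error(K=50, err_at=45)    # decays only by gamma**4
```

The late error dominates the final gap, matching the abstract's claim about the later iterations of API/AVI.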
Switch Packet Arbitration via Queue-Learning
 In Proc. NIPS 14, 2001
Abstract

Cited by 3 (1 self)
In packet switches, packets queue at switch inputs and contend for outputs.
Dynamic Policy Programming
Abstract

Cited by 3 (1 self)
In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in infinite-horizon Markov decision processes. DPP is an incremental algorithm that forces a gradual change in the policy update. This allows us to prove finite-iteration and asymptotic ℓ∞-norm performance-loss bounds in the presence of approximation/estimation error which depend on the average accumulated error, as opposed to the standard bounds, which are expressed in terms of the supremum of the errors. The dependency on the average error is important in problems with a limited number of samples per iteration, for which the average of the errors can be significantly smaller than their supremum. Based on these theoretical results, we prove that a sampling-based variant of DPP (DPP-RL) asymptotically converges to the optimal policy. Finally, we numerically illustrate the applicability of these results on some benchmark problems and compare the performance of the approximate variants of DPP with some existing reinforcement learning (RL) methods.
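A minimal sketch of a DPP-style incremental preference update on a hypothetical two-state MDP; the toy problem, the step count, and the temperature eta are illustrative assumptions, not the paper's setup. M_eta below is the Boltzmann soft-max operator, and each iteration changes the action preferences gradually rather than jumping to a greedy policy.

```python
import numpy as np

# Hypothetical deterministic MDP: action a moves to state a;
# action 0 pays reward 1, action 1 pays 0, so "always 0" is optimal.
n_states, n_actions, gamma, eta = 2, 2, 0.9, 5.0
r = np.array([[1.0, 0.0], [1.0, 0.0]])       # r[s, a]
nxt = np.array([[0, 1], [0, 1]])             # nxt[s, a] = successor state

def m_eta(prefs):
    """Boltzmann soft-max operator: (1/eta) * log sum_a exp(eta * prefs[a])."""
    m = np.max(prefs)
    return m + np.log(np.sum(np.exp(eta * (prefs - m)))) / eta

psi = np.zeros((n_states, n_actions))        # action preferences
for _ in range(200):
    new = np.empty_like(psi)
    for s in range(n_states):
        for a in range(n_actions):
            # Gradual, incremental preference update (DPP style):
            new[s, a] = (psi[s, a] - m_eta(psi[s])
                         + r[s, a] + gamma * m_eta(psi[nxt[s, a]]))
    psi = new

greedy = psi.argmax(axis=1)   # suboptimal preferences drift toward -inf
```

On this toy problem the preference of the optimal action settles near the optimal value 1/(1 - gamma) = 10, while the suboptimal preference diverges downward, so the induced soft-max policy concentrates on the optimal action.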
Risk-sensitive reinforcement learning applied to chance-constrained control
 JAIR, 2005
Abstract

Cited by 2 (1 self)
In this paper, we consider Markov Decision Processes (MDPs) with error states. Error states are states that are undesirable or dangerous to enter. We define the risk with respect to a policy as the probability of entering such a state when the policy is pursued. We consider the problem of finding good policies whose risk is smaller than some user-specified threshold, and formalize it as a constrained MDP with two criteria. The first criterion corresponds to the value function originally given. We show that the risk can be formulated as a second criterion function based on a cumulative return, whose definition is independent of the original value function. We present a model-free, heuristic reinforcement learning algorithm that aims at finding good deterministic policies. It is based on weighting the original value function and the risk. The weight parameter is adapted in order to find a feasible solution for the constrained problem that has good performance with respect to the value function. The algorithm was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column. This control task was originally formulated as an optimal control problem with chance constraints, and it was solved under certain assumptions on the model to obtain an optimal solution. The power of our learning algorithm is that it can be used even when some of these restrictive assumptions are relaxed.
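The weighted-criterion idea can be sketched on a hypothetical one-step problem. The numbers, the "safe"/"risky" actions, and the simple weight-adaptation loop below are illustrative assumptions, not the paper's algorithm or benchmark: the value criterion and the risk criterion are combined with a weight xi, and xi is increased until the induced greedy policy satisfies the risk threshold.

```python
# Hypothetical one-step decision problem:
#   "safe"  -> reward 0.6, never enters the error state
#   "risky" -> reward 1.0, but with probability 0.2 ends in the error state
RISK = {"safe": 0.0, "risky": 0.2}   # P(enter error state) under each policy
THRESHOLD = 0.1                      # user-specified risk threshold

def greedy_policy(xi):
    """Greedy policy for the weighted criterion: value minus xi * risk."""
    q_safe = 0.6
    q_risky = 0.8 * 1.0 + 0.2 * (1.0 - xi)   # xi penalizes the error state
    return "safe" if q_safe > q_risky else "risky"

# Adapt the weight upward until the weighted-greedy policy is feasible.
xi = 0.0
while RISK[greedy_policy(xi)] > THRESHOLD:
    xi += 0.5
policy = greedy_policy(xi)
```

A risk-neutral agent (xi = 0) prefers "risky" (expected reward 1.0 versus 0.6); once xi is large enough, the weighted criterion flips to the feasible "safe" policy.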
Computing Optimal Stationary Policies for Multi-Objective Markov Decision Processes
Abstract

Cited by 2 (0 self)
This paper describes a novel algorithm called CON-MODP for computing Pareto-optimal policies for deterministic multi-objective sequential decision problems. CON-MODP is a value-iteration-based multi-objective dynamic programming algorithm that only computes stationary policies. We observe that, to guarantee convergence to the unique Pareto-optimal set of deterministic stationary policies, the algorithm needs to perform a policy evaluation step on particular policies that are inconsistent in a single state that is being expanded. We prove that the algorithm converges to the Pareto-optimal set of value functions and policies for deterministic infinite-horizon discounted multi-objective Markov decision processes. Experiments show that CON-MODP is much faster than previous multi-objective value iteration algorithms.
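The core operation inside a multi-objective value-iteration backup is pruning Pareto-dominated value vectors. A minimal standalone sketch of that pruning step (not the paper's full algorithm):

```python
def pareto_front(vectors):
    """Keep exactly the vectors not Pareto-dominated by another vector
    (maximization in every objective). A vector w dominates v when
    w >= v componentwise and w != v."""
    front = []
    for v in vectors:
        dominated = any(
            all(w[i] >= v[i] for i in range(len(v))) and w != v
            for w in vectors
        )
        if not dominated:
            front.append(v)
    return front

# Candidate value vectors for one state, as (objective 1, objective 2):
candidates = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (0.2, 0.2)]
front = pareto_front(candidates)   # (0.2, 0.2) is dominated by (0.5, 0.5)
```

A multi-objective dynamic programming algorithm applies such a pruning step after every backup so that only non-dominated value vectors (and the policies achieving them) are carried forward.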
Linear fitted-Q iteration with multiple reward functions
 Journal of Machine Learning Research
Abstract

Cited by 2 (0 self)
We present a general and detailed development of an algorithm for finite-horizon fitted-Q iteration with an arbitrary number of reward signals and linear value function approximation using an arbitrary number of state features. This includes a detailed treatment of the three-reward-function case using triangulation primitives from computational geometry and a method for identifying globally dominated actions. We also present an example of how our methods can be used to construct a real-world decision aid by considering symptom reduction, weight gain, and quality of life in sequential treatments for schizophrenia. Finally, we discuss future directions for this work that will further enable our methods to make a positive impact on the field of evidence-based clinical decision support.
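A sketch of the single-reward backbone of such a method: finite-horizon fitted-Q iteration with linear least-squares regression on state features. The one-dimensional toy problem and the feature choice are assumptions for illustration, and the paper's multi-reward machinery is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem: state x in [0, 1], actions {-1, +1} move x by 0.1 * a;
# the reward -|x' - 0.8| pushes the agent toward x = 0.8.
def step(x, a):
    x2 = float(np.clip(x + 0.1 * a, 0.0, 1.0))
    return x2, -abs(x2 - 0.8)

def features(x):
    return np.array([1.0, x, x * x])   # linear-in-parameters approximation

H = 10                                  # horizon
X = rng.uniform(0.0, 1.0, 500)          # sampled training states
W = {a: np.zeros(3) for a in (-1, 1)}   # weights fitted for the stage ahead

# Backward induction: at each stage, regress Q_h(x, a) onto r + max_b Q_{h+1}
for _ in range(H):
    new_W = {}
    for a in (-1, 1):
        targets = []
        for x in X:
            x2, r = step(x, a)
            v_next = max(features(x2) @ W[b] for b in (-1, 1))
            targets.append(r + v_next)
        Phi = np.stack([features(x) for x in X])
        new_W[a], *_ = np.linalg.lstsq(Phi, np.array(targets), rcond=None)
    W = new_W

# Greedy action at x = 0.2: move right, toward the rewarding region.
best = max((-1, 1), key=lambda a: features(0.2) @ W[a])
```

Each backward step is one least-squares fit per action; the multi-reward extension in the paper would instead carry weights that are functions of the reward-preference vector.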
Efficient Bayes-adaptive reinforcement learning using sample-based search
 In Advances in Neural Information Processing Systems 25, 2012
Abstract

Cited by 2 (0 self)
Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayes-optimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems, because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration.
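The key trick, lazily sampling one concrete model from the current posterior at the root of each simulation instead of applying Bayes rule inside the tree, can be caricatured in a depth-1 planning problem. The two-armed bandit, the Beta posteriors, and the uniform root action choice are illustrative assumptions, not the paper's method in full or its benchmarks.

```python
import random

random.seed(0)

# Two-armed Bernoulli bandit; the belief over each arm's success
# probability is a Beta(alpha, beta) posterior.
posterior = {0: [1, 1], 1: [9, 1]}   # arm 1 has been observed to succeed often

def root_sample():
    """Sample one concrete model (a success probability per arm) at the root."""
    return {a: random.betavariate(*posterior[a]) for a in posterior}

def simulate(model, a):
    return 1 if random.random() < model[a] else 0

def plan(n_sims=2000):
    """Monte-Carlo planning: each simulation runs through a freshly sampled
    model, so no posterior update is ever performed inside the tree."""
    total = {0: 0.0, 1: 0.0}
    count = {0: 0, 1: 0}
    for _ in range(n_sims):
        model = root_sample()
        a = random.choice([0, 1])
        total[a] += simulate(model, a)
        count[a] += 1
    return max((0, 1), key=lambda a: total[a] / count[a])

best_arm = plan()   # arm 1's posterior mean (0.9) dominates arm 0's (0.5)
```

In the full method, the same root-sampled model is used for an entire tree-search simulation, which is what makes each simulation cheap relative to maintaining beliefs at every tree node.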
A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
Abstract

Cited by 1 (1 self)
A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view of this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm, which can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance of all attempted approximate dynamic programming methods in the Tetris benchmark problem. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic-programming-based results.
Algorithms for Fast Gradient Temporal Difference Learning
Abstract

Cited by 1 (0 self)
Temporal difference learning is one of the oldest and most widely used techniques in reinforcement learning for estimating value functions. Many modifications and extensions of the classical TD methods have been proposed. Recent examples are TDC and GTD2 (Sutton et al., 2009b), the first approaches that are as fast as classical TD and have proven convergence for linear function approximation in on- and off-policy cases. This paper introduces these methods to novices of TD learning by presenting the important concepts of the new algorithms. Moreover, the methods are compared against each other and against alternative approaches, both theoretically and empirically. Finally, the experimental results call into question the practical relevance of convergence guarantees for off-policy prediction by TDC and GTD2.