Results 11–20 of 51
Coarticulation: An approach for generating concurrent plans in Markov decision processes
In Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), 2005
Cited by 4 (1 self)
Abstract: We study an approach for performing concurrent activities in Markov decision processes (MDPs) based on the coarticulation framework. We assume that the agent has multiple degrees of freedom (DOF) in the action space, which enable it to perform activities simultaneously. We demonstrate that one natural way to generate concurrency in the system is to coarticulate among the set of learned activities available to the agent. In general, due to the multiple DOF in the system, there often exists a redundant set of admissible suboptimal policies associated with each learned activity. Such flexibility enables the agent, given a new task defined in terms of a set of prioritized subgoals, to commit to several subgoals concurrently according to their priority levels. We present efficient approximate algorithms for computing such policies and for generating concurrent plans, and we evaluate our approach in a simulated domain.
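The redundancy idea in this abstract can be conveyed as a simple set computation. The sketch below is an illustrative assumption, not the paper's algorithm: each subgoal is taken to expose a set of admissible (near-optimal) actions, and coarticulation commits to lower-priority subgoals only insofar as they stay compatible with higher-priority ones.

```python
# A minimal sketch of prioritized coarticulation, assuming each subgoal
# exposes its set of admissible (near-optimal) actions. Higher-priority
# subgoals come first; lower-priority subgoals may only narrow the set
# of actions, never override higher-priority choices.
def coarticulate(admissible_sets):
    current = set(admissible_sets[0])
    for candidates in admissible_sets[1:]:
        narrowed = current & set(candidates)
        if narrowed:  # commit only if compatible with higher priorities
            current = narrowed
    return current

# Subgoal 1 admits {N, E}; subgoal 2 admits {E, S}; subgoal 3 admits {W}.
# Subgoals 1 and 2 are served concurrently via "E"; subgoal 3 is dropped.
actions = coarticulate([{"N", "E"}, {"E", "S"}, {"W"}])
```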
Error Propagation for Approximate Policy and Value Iteration
Cited by 4 (0 self)
Abstract: We address the question of how the approximation error/Bellman residual at each iteration of approximate policy/value iteration (API/AVI) influences the quality of the resulting policy. We quantify the performance loss as the Lp-norm of the approximation error/Bellman residual at each iteration. Moreover, we show that the performance loss depends on the expectation of the squared Radon-Nikodym derivative of a certain distribution, rather than on its supremum, as previous results had suggested. Our results also indicate that the contribution of the approximation/Bellman error to the performance loss is more prominent in the later iterations of API/AVI, and that the effect of an error term in the earlier iterations decays exponentially fast.
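The exponential-decay claim can be illustrated numerically. This is a simplified sketch of the propagation effect under an assumed discount factor, not the paper's exact bound:

```python
# Illustrative sketch: the weight each iteration's error carries in the
# final performance loss, assuming a discount factor gamma and K
# iterations, with an error at iteration k passing through the remaining
# K-1-k backups. (A simplified model, not the paper's exact result.)
def loss_weights(K, gamma):
    return [gamma ** (K - 1 - k) for k in range(K)]

w = loss_weights(10, 0.9)
# The last iteration's error enters with weight 1.0, while the first
# iteration's error is down-weighted by gamma**9: early errors matter
# exponentially less, matching the qualitative claim in the abstract.
```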
Switch Packet Arbitration via Queue-Learning
In Proc. NIPS 14, 2001
Cited by 3 (1 self)
Abstract: In packet switches, packets queue at switch inputs and contend for outputs.
Dynamic Policy Programming
Cited by 3 (1 self)
Abstract: In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in infinite-horizon Markov decision processes. DPP is an incremental algorithm that forces a gradual change in the policy at each update. This allows us to prove finite-iteration and asymptotic ℓ∞-norm performance-loss bounds in the presence of approximation/estimation error that depend on the average accumulated error, as opposed to the standard bounds, which are expressed in terms of the supremum of the errors. The dependency on the average error is important in problems with a limited number of samples per iteration, for which the average of the errors can be significantly smaller than their supremum. Based on these theoretical results, we prove that a sampling-based variant of DPP (DPP-RL) asymptotically converges to the optimal policy. Finally, we numerically illustrate the applicability of these results on some benchmark problems and compare the performance of the approximate variants of DPP with some existing reinforcement learning (RL) methods.
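The incremental flavour of a DPP-style update can be sketched on a toy problem. The MDP below and the exact form of the softmax operator are illustrative assumptions, not taken from the paper; the point is that action preferences change by a bounded increment per iteration, so the induced policy shifts gradually.

```python
import numpy as np

def softmax_op(psi, eta):
    # Boltzmann-weighted average of action preferences, M_eta Psi(s).
    w = np.exp(eta * (psi - psi.max(axis=1, keepdims=True)))
    pi = w / w.sum(axis=1, keepdims=True)
    return (pi * psi).sum(axis=1)

def dpp_iteration(psi, P, R, gamma, eta):
    # Incremental preference update (sketch of the DPP-style recursion):
    # Psi'(s,a) = Psi(s,a) - M Psi(s) + R(s,a) + gamma * E_{s'}[M Psi(s')]
    m = softmax_op(psi, eta)
    return psi - m[:, None] + R + gamma * P.dot(m)

# Toy deterministic 2-state, 2-action MDP (illustrative numbers):
# action 0 keeps state 0, action 1 keeps state 1; "staying put" pays 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])  # P[s, a, s']
R = np.array([[1.0, 0.0], [0.0, 1.0]])
psi = np.zeros((2, 2))
for _ in range(200):
    psi = dpp_iteration(psi, P, R, gamma=0.9, eta=2.0)
greedy = psi.argmax(axis=1)  # preferred action per state
```

Here the preferences of suboptimal actions drift downward over iterations, so the greedy policy converges to staying put in each state.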
Efficient Bayes-adaptive reinforcement learning using sample-based search
In Advances in Neural Information Processing Systems 25, 2012
Cited by 3 (1 self)
Abstract: Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayes-optimal planning which exploits Monte Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems, because it avoids expensive applications of Bayes' rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite-state-space domain that is qualitatively out of reach of almost all previous work in Bayesian exploration.
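The "lazily sampling models" idea can be conveyed with a much simpler stand-in. The sketch below is a Thompson-sampling step on a two-armed Bernoulli bandit, not the paper's tree-search algorithm: rather than applying Bayes' rule inside planning, one model is drawn from the current posterior per decision and the agent acts greedily as if it were true.

```python
import random

# Sketch of posterior sampling, assuming Beta(1, 1) priors over each
# arm's success probability (an illustrative stand-in, not the paper's
# method). One model is sampled per decision; no Bayesian update
# happens inside the planning step itself.
def sample_based_step(successes, failures):
    sampled_model = [random.betavariate(1 + s, 1 + f)
                     for s, f in zip(successes, failures)]
    return max(range(len(sampled_model)), key=lambda a: sampled_model[a])

random.seed(0)
# Arm 0 has 50 observed successes; arm 1 has 50 observed failures,
# so the sampled model almost always favours arm 0.
choice = sample_based_step([50, 0], [0, 50])
```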
Risk-sensitive reinforcement learning applied to chance-constrained control
JAIR, 2005
Cited by 2 (1 self)
Abstract: In this paper, we consider Markov decision processes (MDPs) with error states. Error states are states that are undesirable or dangerous to enter. We define the risk with respect to a policy as the probability of entering such a state when the policy is pursued. We consider the problem of finding good policies whose risk is smaller than some user-specified threshold, and formalize it as a constrained MDP with two criteria. The first criterion corresponds to the value function originally given. We show that the risk can be formulated as a second criterion function based on a cumulative return, whose definition is independent of the original value function. We present a model-free, heuristic reinforcement learning algorithm that aims at finding good deterministic policies. It is based on weighting the original value function and the risk; the weight parameter is adapted in order to find a feasible solution for the constrained problem that performs well with respect to the value function. The algorithm was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column. This control task was originally formulated as an optimal control problem with chance constraints and was solved, under certain assumptions on the model, to obtain an optimal solution. The power of our learning algorithm is that it can be used even when some of these restrictive assumptions are relaxed.
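The weighted-criterion idea can be sketched over a fixed set of candidate policies. The names, the candidate values, and the simple additive adaptation rule below are illustrative assumptions, not the paper's exact algorithm: score each policy by value minus weighted risk, and grow the weight until the best-scoring policy meets the user-specified risk threshold.

```python
# A minimal sketch of the weight-adaptation loop (illustrative, not the
# paper's algorithm): candidates are (name, value, risk) triples, where
# risk is the probability of reaching an error state under that policy.
def select_policy(candidates, risk_threshold, step=0.5, max_iters=100):
    xi = 0.0
    for _ in range(max_iters):
        best = max(candidates, key=lambda c: c[1] - xi * c[2])
        if best[2] <= risk_threshold:  # feasible for the constraint
            return best, xi
        xi += step                     # penalize risk more heavily
    return None, xi

candidates = [("aggressive", 10.0, 0.30),
              ("moderate",    8.0, 0.10),
              ("cautious",    5.0, 0.01)]
chosen, xi = select_policy(candidates, risk_threshold=0.05)
# The weight grows until the highest-value feasible policy wins.
```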
Computing Optimal Stationary Policies for Multi-Objective Markov Decision Processes
Cited by 2 (0 self)
Abstract: This paper describes a novel algorithm called CON-MODP for computing Pareto-optimal policies for deterministic multi-objective sequential decision problems. CON-MODP is a value-iteration-based multi-objective dynamic programming algorithm that computes only stationary policies. We observe that, to guarantee convergence to the unique Pareto-optimal set of deterministic stationary policies, the algorithm needs to perform a policy-evaluation step on particular policies that are inconsistent in the single state being expanded. We prove that the algorithm converges to the Pareto-optimal set of value functions and policies for deterministic infinite-horizon discounted multi-objective Markov decision processes. Experiments show that CON-MODP is much faster than previous multi-objective value iteration algorithms.
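Any multi-objective value iteration of this kind relies on a Pareto-dominance filter over vector-valued returns. The sketch below is a generic dominance filter, not CON-MODP itself:

```python
# Generic Pareto filter over value vectors (one component per objective).
# A vector is dropped if another vector is at least as good everywhere
# and strictly better somewhere.
def dominates(u, v):
    return (all(a >= b for a, b in zip(u, v))
            and any(a > b for a, b in zip(u, v)))

def pareto_front(vectors):
    return [v for v in vectors
            if not any(dominates(u, v) for u in vectors if u is not v)]

# (1, 1) is dominated by (2, 2); the other three are mutually incomparable.
front = pareto_front([(3, 1), (1, 3), (2, 2), (1, 1)])
```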
Linear fitted-Q iteration with multiple reward functions
 Journal of Machine Learning Research
Cited by 2 (0 self)
Abstract: We present a general and detailed development of an algorithm for finite-horizon fitted-Q iteration with an arbitrary number of reward signals and linear value function approximation using an arbitrary number of state features. This includes a detailed treatment of the three-reward-function case using triangulation primitives from computational geometry, and a method for identifying globally dominated actions. We also present an example of how our methods can be used to construct a real-world decision aid, by considering symptom reduction, weight gain, and quality of life in sequential treatments for schizophrenia. Finally, we discuss future directions for this work that will further enable our methods to make a positive impact on the field of evidence-based clinical decision support.
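The notion of a globally dominated action can be illustrated with a brute-force check over sampled preference weights. This is a finite-grid approximation for intuition, not the paper's computational-geometry method, and the Q-vectors below are made up:

```python
# Sketch: an action is flagged as globally dominated if it is never the
# strict best under any sampled convex weighting of the reward signals.
# (A finite-grid check for intuition; the paper uses exact geometry.)
def globally_dominated(q_vectors, action, weights):
    for w in weights:
        score = lambda a: sum(wi * qi for wi, qi in zip(w, q_vectors[a]))
        if all(score(action) > score(b)
               for b in range(len(q_vectors)) if b != action):
            return False  # strictly best under this weighting
    return True

# Illustrative Q-vectors for 3 actions under 2 reward signals.
q = {0: (1.0, 0.0), 1: (0.0, 1.0), 2: (0.4, 0.4)}
weights = [(t / 10, 1 - t / 10) for t in range(11)]
dominated = [a for a in sorted(q) if globally_dominated(q, a, weights)]
# Action 2 scores 0.4 under every weighting, while max(w, 1-w) >= 0.5,
# so it is never best and can be pruned from the policy search.
```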
A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
Cited by 1 (1 self)
Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view of this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm, which can interpolate between the two approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance of all attempted approximate dynamic programming methods on the Tetris benchmark problem. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic-programming-based results.