Arbitrarily modulated Markov decision processes (2009)
Venue: Joint 48th IEEE Conference on Decision and Control and 28th Chinese Control Conference
Citations: 12 (2 self)
Citations
875 | The weighted majority algorithm
- Littlestone, Warmuth
- 1989
Citation Context: ...uitable when the transitions are so arbitrary as to make the notion of states meaningless. This model has been extensively studied in the context of repeated games [17], prediction with expert advice [18] and the adversarial multiarmed bandit [7]. In MDPs with arbitrarily varying transition functions, but a fixed reward function, [6] proposes a solution using robust dynamic programming. However, this meth...
788 | Markov Decision Processes
- Puterman
- 1994
Citation Context: ...work considers a model more general than that of [10]. We give comparisons and pose open problems in Section V. II. SETTING: A Markov decision process is a standard sequential decision-making problem ([11], [12]). At each time step t ∈ {1, 2, . . .}, an agent takes an action a_t from a finite set A. Starting from a fixed state s_1, the agent occupies at each time step t a state s_t belonging to a finite s...
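A minimal Python sketch of this sequential protocol (finite state set S, finite action set A, time-varying reward r_t and transition model P_t) may help fix the notation; all names below are illustrative and not taken from the paper.

    import random

    # Illustrative finite state and action sets, mirroring the snippet's S and A.
    S = [0, 1, 2, 3]
    A = ["L", "R"]

    def interact(policy, reward_fn, transition_fn, T=100, s1=0):
        """Run the protocol: at each step t the agent picks a_t, receives
        r_t(s_t, a_t), and moves to s_{t+1} ~ P_t(. | s_t, a_t)."""
        s, total = s1, 0.0
        for t in range(1, T + 1):
            a = policy(t, s)                 # action a_t from the finite set A
            total += reward_fn(t, s, a)      # reward r_t(s_t, a_t)
            probs = transition_fn(t, s, a)   # distribution P_t(. | s_t, a_t) over S
            s = random.choices(S, weights=probs)[0]
        return total / T                     # average reward over the horizon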
491 | The nonstochastic multi-armed bandit problem
- Auer, Cesa-Bianchi, et al.
- 2002
Citation Context: ... in [10], the ORDP algorithm can be modified to cope with limited observation of the reward functions and to reduce computational complexity. The first modification allows a bandit-like setting as in [7], where instead of observing the full reward function r_t after time step t, the agent observes only the actual reward r_t(s_t, a_t) that it received. The second modification provides a way to tradeoff co...
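To make the feedback distinction concrete, here is a hedged sketch: under full information the whole function a ↦ r_t(s_t, a) is revealed after step t, while under bandit feedback only the scalar r_t(s_t, a_t) is seen. The importance-weighted estimate below is the generic construction from the adversarial-bandit literature, not necessarily the modification used in the paper, and all names are hypothetical.

    import random

    def bandit_feedback_step(t, s, action_probs, reward_fn, A):
        """One step with bandit feedback: sample a_t, observe only r_t(s_t, a_t),
        and form a standard importance-weighted estimate of the full reward
        vector at state s (illustrative, not the paper's exact construction)."""
        a_t = random.choices(A, weights=[action_probs[a] for a in A])[0]
        observed = reward_fn(t, s, a_t)               # only r_t(s_t, a_t) is revealed
        estimate = {a: 0.0 for a in A}
        estimate[a_t] = observed / action_probs[a_t]  # unbiased under these sampling probs
        return a_t, observed, estimate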
491 | Stochastic approximation algorithms and applications
- Kushner, Yin
- 1997
Citation Context: ...DP algorithm. For the purpose of analysis, we consider the synchronous version of the Q-learning iteration in the Q-FPL algorithm and write it in the form of a stochastic approximation algorithm (cf. [22]):

Q_{t+1} = Q_t + γ_m (h_t(Q_t) + M_{t+1}),   t ∈ α_m,

h_t(Q)(s, a) = r_{t_m}(s, a) + ∑_{s'∈S} P_t(s' | s, a) max_{a''∈A} Q(s', a'') − Q(s, a) − Q(s_0, a_0),

M_{t+1}(s, a) = max_{a'∈A} Q_t(s_{t+1}, a') − ∑_{s'∈S} P_t(s' | ...
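Reading the iteration as a sampled update (and completing the truncated noise term M_{t+1} from the neighbouring snippet for reference [23] below), h_t(Q_t) + M_{t+1} reduces to r(s, a) + max_{a'} Q_t(s_{t+1}, a') − Q_t(s, a) − Q_t(s_0, a_0) with s_{t+1} drawn from P_t(· | s, a). A hedged Python sketch of one synchronous step, with array shapes chosen purely for illustration:

    import numpy as np

    def synchronous_q_update(Q, reward, P, gamma_m, s0=0, a0=0, rng=np.random):
        """One synchronous relative Q-learning step mirroring
        Q_{t+1} = Q_t + gamma_m (h_t(Q_t) + M_{t+1}).

        Q:      (|S|, |A|) array of Q-values
        reward: (|S|, |A|) array, reward function used on interval alpha_m
        P:      (|S|, |A|, |S|) array, current transition model P_t
        The (s0, a0) entry plays the role of the normalization term."""
        nS, nA = Q.shape
        Q_next = Q.copy()
        for s in range(nS):
            for a in range(nA):
                s_next = rng.choice(nS, p=P[s, a])        # s_{t+1} ~ P_t(. | s, a)
                target = reward[s, a] + Q[s_next].max()   # sampled h_t + M_{t+1} part
                Q_next[s, a] = Q[s, a] + gamma_m * (target - Q[s, a] - Q[s0, a0])
        return Q_next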
337 | Competitive Markov Decision Processes
- Filar, Vrieze
- 1997
Citation Context: ...metry in the extent of arbitrary uncertainty in the rewards and transitions, as illustrated by the following game. Example II.3 (AR-AT Games). Additive-reward additive-transition stochastic games (cf. [15]) can provide natural ex... [Figure: (a) transition model P_t(· | ·, L) at even time steps; (b) transition model P_t(· | ...; four states with transition probabilities 1 − ǫ, ǫ and ǫ/2.]
270 | Uncertainty principles and signal recovery
- Donoho, Stark
- 1989
Citation Context: ...e.g., the next-state distribution following each arm-selection is only specified within some range. This uncertainty may arise from inaccurate measurement or estimation, from an uncertainty principle [13], or from nonstationary variations in the probabilities. The latter occur, for instance, in systems that rely on replaceable components whose specifications fall within some ǫ-tolerance range (e.g., bat...
192 | Efficient algorithms for the online decision problem.
- Kalai, Vempala
- 2003
Citation Context: ...ed by Q_{α_{m−1}} and used to derive a new policy (cf. (3)) to be used throughout the next interval α_m. As in the ORDP algorithm, this policy is computed according to the concept of "following the perturbed leader" ([17], [21]), where continuity between successive policies is ensured by adding a vanishing noise term n_t to the Q-function Q_{α_{m−1}}. However, the Q-FPL algorithm is computationally more efficient: at each time st...
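A hedged sketch of this idea: perturb the current Q-function with a noise term whose magnitude shrinks over time and act greedily with respect to the perturbed values. The exponential perturbation and the 1/√t decay below are illustrative choices in the spirit of follow-the-perturbed-leader, not the exact schedule from the paper.

    import numpy as np

    def perturbed_leader_policy(Q, t, noise_scale=1.0, rng=np.random):
        """Greedy policy w.r.t. a perturbed Q-function: add a vanishing noise
        term n_t, then pick the best action in every state (illustrative)."""
        n_t = rng.exponential(scale=noise_scale / np.sqrt(t), size=Q.shape)
        return (Q + n_t).argmax(axis=1)   # one action per state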
150 | Approximation to Bayes risk in repeated play
- Hannan
- 1957
Citation Context: ...a notion of state. Such a model is suitable when the transitions are so arbitrary as to make the notion of states meaningless. This model has been extensively studied in the context of repeated games [17], prediction with expert advice [18] and the adversarial multiarmed bandit [7]. In MDPs with arbitrarily varying transition functions, but a fixed reward function, [6] proposes a solution using robust...
98 | The O.D.E. method for convergence of stochastic approximation and reinforcement learning
- Borkar, Meyn
- 2000
Citation Context: ...(s' | s, a) max_{a''∈A} Q_t(s', a''), where the state s_{t+1} is distributed according to P_t(· | s, a). Our analysis of synchronous Q-learning can be extended to the asynchronous version, as done in [23]. In contrast to the convergence analysis for average-reward Q-learning ([23]), the function h_t is not fixed, but may change arbitrarily in our setting. We also define the sequence Q^δ_t with the fixed...
96 | Perturbation theory and finite Markov chains
- Schweitzer
- 1968
80 | Robust control of Markov decision processes with uncertain transition matrices
- Nilim, El Ghaoui
- 2005
Citation Context: ...to the system (probabilistic transitions, uncertainty principles), or may arise from measurements. In many decision-making problems, uncertainty exists in the rewards and transition probabilities (cf. [6] and references therein). When this uncertainty follows a stochastic model ([7], [8]), sampling can give estimates on the parameters of the transition model, but some residual uncertainty always remains as a re...
69 | R-max: a general polynomial time algorithm for near-optimal reinforcement learning
- Brafman, Tennenholtz
- 2003
Citation Context: ...from measurements. In many decision-making problems, uncertainty exists in the rewards and transition probabilities (cf. [6] and references therein). When this uncertainty follows a stochastic model ([7], [8]), sampling can give estimates on the parameters of the transition model, but some residual uncertainty always remains as a result of limited samples. Under these circumstances, if it is imperative to take ...
68 | Dynamic Programming and Optimal Control (2nd Ed.)
- Bertsekas
- 2001
Citation Context: ...h state space S, action space A, an uncertainty set D ⊆ ∆(C), and with reward function r̂_t. In other words, solve the following Bellman equations for MDPs with the infinite-horizon average-reward objective [19] (via linear programming or otherwise):

V_t(s) = max_{a∈A} [ r̂_t(s, a) + inf_{δ∈D} ∑_{s'∈S} V_t(s') ∑_{c∈C} δ(c) P^c(s' | s, a) − V_t(s_0) ],   s ∈ S,   (2)

where s_0 ∈ S is a fixed state and V_t(s_0) is a normalizat...
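Solving (2) requires an inner optimization over the uncertainty set D. A hedged sketch of evaluating its right-hand side when D is approximated by a finite grid of candidate mixtures δ over the basic models P^c is given below; the grid-based inner minimization is an illustrative simplification, not the linear-programming route mentioned in the snippet, and all array shapes are assumptions.

    import numpy as np

    def robust_bellman_rhs(V, r_hat, P_basic, D_grid, s0=0):
        """Right-hand side of the robust Bellman equation (2) for every state.

        V:       (|S|,)  current value estimates
        r_hat:   (|S|, |A|) reward function
        P_basic: (|C|, |S|, |A|, |S|) basic transition models P^c
        D_grid:  (K, |C|) rows are candidate mixtures delta in D"""
        nS, nA = r_hat.shape
        rhs = np.empty(nS)
        for s in range(nS):
            best = -np.inf
            for a in range(nA):
                worst = np.inf
                for delta in D_grid:
                    # Mix the basic models according to delta, then score against V.
                    P_mix = np.tensordot(delta, P_basic[:, s, a, :], axes=1)
                    worst = min(worst, P_mix @ V)
                best = max(best, r_hat[s, a] + worst - V[s0])
            rhs[s] = best
        return rhs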
26 |
Experts in a Markov decision process.
- Even-Dar, Kakade, et al.
- 2005
Citation Context: ...ture). This generalized version is reminiscent of stochastic games [1]. However, in stochastic games, one usually assumes that there is an adversary whose utility is well-defined. In our setup, as in [2], [3], and more generally in the online learning setting [4], we do not assume that such an adversary exists, but rather that the reward and transition processes are modulated by arbitrary individua...
21 | Markov decision processes with arbitrary reward processes.
- Yu, Mannor, et al.
- 2009
Citation Context: ...otions have previously been studied separately. Online learning yields solutions that are robust against arbitrary variations in the reward functions when the transition probabilities are fixed ([2], [9]). Robust dynamic programming has been used to control MDPs where the transition probabilities may vary arbitrarily, but where the reward functions may not [6]. In this work, we address both uncertain...
19 | The empirical Bayes envelope and regret minimization in competitive Markov decision processes
- Mannor, Shimkin
- 2003
14 | Asymptotic Operating Characteristics of an Optimal Change Point Detection in Hidden Markov Models
- Fuh
- 2004
Citation Context: ...se specifications fall within some ǫ-tolerance range (e.g., battery life, physical dimensions). Our model also encompasses MDPs where the transition function may change abruptly at unknown time instants [14], e.g., as a result of a mechanical break-down, or triggered by an adversary. Consequently, the true state-transition model may deviate unpredictably from the nominal model. In the next example, we sh...
8 | Online learning in Markov decision processes with arbitrarily changing rewards and transitions.
- Yu, Mannor
- 2009
Citation Context: ...nsition-uncertainty by using the notion introduced in Assumption II.2. B. Robust Dynamic Programming: In this section, we present, as a basis for comparison, the algorithm and performance guarantee of [10], adapted to our setting. For this section, we assume that the set of basic transition functions {P^c : c ∈ C} and the modulating sequence uncertainty set D are known to the agent. For ease of exposit...
6 | The robustness-performance tradeoff in Markov decision processes
- Xu, Mannor
- 2007
Citation Context: ... have been studied before. MDPs with fixed, but unknown, reward functions and transition probabilities have been solved using robust dynamic programming with finite- and infinite-horizon objectives ([6], [16]). In general, the models for nonstationary settings limit the nonstationarity to either the rewards or the transition probabilities. MDPs where the reward functions may change arbitrarily, but the tr...
4 | Markov modulated Bernoulli process
- Özekici
- 1997
Citation Context: ... random processes whose parameters change according to another process; e.g., a Markov modulated Bernoulli process is a Bernoulli process whose success probability changes according to a Markov chain [5]. ...our setting, the standard robust approach does not offer a satisfying solution. In particular, it only guarantees optimal performance against the worst realization of the environment. It does not promis...
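The footnote's example is easy to make concrete; the following sketch simulates a Markov modulated Bernoulli process under assumed inputs (a row-stochastic matrix for the modulating chain and one success probability per chain state), none of which come from the paper.

    import numpy as np

    def markov_modulated_bernoulli(T, trans, success_prob, c0=0, rng=np.random):
        """Simulate T trials of a Bernoulli process whose success probability
        is selected by a hidden Markov chain c_t (the modulating process).

        trans:        (|C|, |C|) row-stochastic transition matrix of the chain
        success_prob: (|C|,) success probability for each chain state"""
        c, outcomes = c0, []
        for _ in range(T):
            outcomes.append(rng.random() < success_prob[c])  # Bernoulli(p_{c_t})
            c = rng.choice(len(trans), p=trans[c])           # modulating Markov step
        return outcomes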