Results 1–10 of 12
The simplex method is strongly polynomial for deterministic Markov Decision Processes
In Proceedings of the 24th ACM-SIAM Symposium on Discrete Algorithms (SODA), 2013
Abstract

Cited by 6 (0 self)
We prove that the simplex method with the highest-gain/most-negative-reduced-cost pivoting rule converges in strongly polynomial time for deterministic Markov decision processes (MDPs) regardless of the discount factor. For a deterministic MDP with n states and m actions, we prove the simplex method runs in O(n³m² log² n) iterations if the discount factor is uniform and O(n⁵m³ log² n) iterations if each action has a distinct discount factor. Previously the simplex method was known to run in polynomial time only for discounted MDPs where the discount was bounded away from 1 [Ye11]. Unlike in the discounted case, the algorithm does not greedily converge to the optimum, and we require a more complex measure of progress. We identify a set of layers in which the values of primal variables must lie and show that the simplex method always makes progress optimizing one layer, and when the upper layer is updated the algorithm makes a substantial amount of progress. In the case of non-uniform discounts, we define a polynomial number of “milestone” policies and we prove that, while the objective function may not improve substantially overall, the value of at least one dual variable is always making progress towards some milestone, and the algorithm will reach the next milestone in a polynomial number of steps.
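The pivoting rule named in this abstract has a concrete reading in the policy view of a deterministic MDP: the reduced cost of a non-policy action is its advantage (gain), and the rule pivots on the action with the largest gain. The following is a minimal sketch of that computation on a hypothetical 3-state MDP; all state names, actions, and rewards are our own illustration, not the paper's construction or analysis.

```python
# Hypothetical 3-state deterministic MDP: actions[(s, a)] = (next_state, reward).
GAMMA = 0.9
actions = {
    (0, 'x'): (1, 1.0), (0, 'y'): (2, 0.0),
    (1, 'x'): (0, 0.0), (1, 'y'): (2, 2.0),
    (2, 'x'): (2, 1.0),
}

def evaluate(policy, iters=2000):
    """Fixed-policy values: v(s) = r(s, pi(s)) + gamma * v(next)."""
    v = {s: 0.0 for s, _ in actions}
    for _ in range(iters):
        v = {s: actions[(s, policy[s])][1]
                + GAMMA * v[actions[(s, policy[s])][0]]
             for s in v}
    return v

def reduced_costs(policy):
    """Gain of each non-policy action; the pivoting rule discussed above
    switches the pair with the largest gain (most negative reduced cost)."""
    v = evaluate(policy)
    return {(s, a): r + GAMMA * v[s2] - v[s]
            for (s, a), (s2, r) in actions.items() if a != policy[s]}
```

For the policy {0: 'y', 1: 'x', 2: 'x'}, the only positive-gain pair is (1, 'y'), so the simplex method would pivot there next.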
Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
, 2013
Abstract

Cited by 2 (0 self)
Given a Markov Decision Process (MDP) with n states and m actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal γ-discounted policy. We consider two variations of PI: Howard's PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most n(m − 1)⌈(1/(1 − γ)) log(1/(1 − γ))⌉ iterations, improving by a factor O(log n) a result by Hansen et al. (2013), while Simplex-PI terminates after at most n(m − 1)⌈(n/(1 − γ)) log(n/(1 − γ))⌉ iterations, improving by a factor 2 a result by Ye (2011). Under some structural assumptions on the MDP, we then consider bounds that are independent of the discount factor γ. When the MDP is deterministic, we show that Simplex-PI terminates after at most 2n²m(m − 1)⌈2(n − 1) log n⌉⌈2n log n⌉ = O(n⁴m² log² n) iterations, improving by a factor O(n) a bound obtained by Post and Ye (2012). We generalize this result to stochastic MDPs: given a measure τ_t of the maximal transient time and τ_r of the maximal time to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most n²m(m − 1)(⌈τ_r log(nτ_r)⌉ + ⌈τ_r log(nτ_t)⌉)⌈τ_t log(n(τ_t + 1))⌉ = Õ(n²τ_tτ_r m²) iterations. We explain why similar results seem hard to derive for Howard's PI. Finally, under the additional (restrictive) assumption that the state space is partitioned in two sets, corresponding to states that are transient (respectively recurrent) for all policies, we show that Simplex-PI and Howard's PI terminate after at most n(m − 1)(⌈τ_t log(nτ_t)⌉ + ⌈τ_r log(nτ_r)⌉) = Õ(nm(τ_t + τ_r)) iterations.
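The two PI variants compared in this abstract differ only in how many states they switch per iteration: Howard's PI updates every state with a positive advantage, Simplex-PI only the single state-action pair of maximal advantage. A minimal sketch of that difference on a hypothetical two-state deterministic MDP (the MDP and its rewards are our illustration, not from the paper):

```python
# Toy deterministic MDP: trans[(s, a)] = (next_state, reward). Hypothetical numbers.
GAMMA = 0.9
trans = {
    (0, 'x'): (1, 0.0), (0, 'y'): (0, 1.0),
    (1, 'x'): (0, 0.0), (1, 'y'): (1, 2.0),
}

def values(pi, iters=2000):
    """Iteratively evaluate v_pi(s) = r(s, pi(s)) + gamma * v_pi(next)."""
    v = {s: 0.0 for s, _ in trans}
    for _ in range(iters):
        v = {s: trans[(s, pi[s])][1] + GAMMA * v[trans[(s, pi[s])][0]] for s in v}
    return v

def advantages(pi):
    """Advantage of each non-policy action relative to pi."""
    v = values(pi)
    return {(s, a): r + GAMMA * v[s2] - v[s]
            for (s, a), (s2, r) in trans.items() if a != pi[s]}

def howard_step(pi):
    """Howard's PI: switch every state that has a positive-advantage action."""
    adv = advantages(pi)
    new = dict(pi)
    for s in pi:
        best = max(((a, g) for (s2, a), g in adv.items() if s2 == s),
                   key=lambda t: t[1], default=None)
        if best and best[1] > 1e-9:
            new[s] = best[0]
    return new

def simplex_step(pi):
    """Simplex-PI: switch only the single state-action pair of maximal advantage."""
    adv = advantages(pi)
    if not adv or max(adv.values()) <= 1e-9:
        return pi  # no improving switch: pi is optimal
    s, a = max(adv, key=adv.get)
    return {**pi, s: a}
```

Starting from {0: 'x', 1: 'x'}, a Simplex-PI step switches only state 1 (the maximal advantage), while a Howard step switches both states at once; the iteration bounds quoted above count exactly these steps.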
RAAM: The Benefits of Robustness in Approximating Aggregated MDPs in Reinforcement Learning
Abstract

Cited by 1 (1 self)
We describe how to use robust Markov decision processes for value function approximation with state aggregation. The robustness serves to reduce the sensitivity to the approximation error of suboptimal policies in comparison to classical methods such as fitted value iteration. This results in reducing the bounds on the γ-discounted infinite-horizon performance loss by a factor of 1/(1 − γ) while preserving polynomial-time computational complexity. Our experimental results show that using the robust representation can significantly improve the solution quality with minimal additional computational cost.
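One way to picture the robustness idea in this abstract is a pessimistic aggregated Bellman backup: instead of averaging over the underlying states of a cluster (as plain fitted value iteration with aggregation does), the robust backup takes the worst case over them. The sketch below is our own toy illustration of that worst-case-over-members choice on a hypothetical 4-state chain, not RAAM's exact formulation.

```python
# Hypothetical 4-state deterministic chain, aggregated into 2 clusters.
GAMMA = 0.9
N_STATES = 4
# P[a][s] = next state, R[a][s] = reward, for two actions.
P = {'left': [0, 0, 1, 2], 'right': [1, 2, 3, 3]}
R = {'left': [0.0, 0.0, 0.0, 0.0], 'right': [0.0, 0.0, 0.0, 1.0]}
cluster = [0, 0, 1, 1]  # state -> aggregate index

def aggregated_backup(v_agg, robust):
    """One backup on aggregate values. Mean aggregation mimics plain fitted
    value iteration; robust=True instead takes the worst case over the
    member states of each cluster (adversarial choice of representative)."""
    new = []
    for c in range(2):
        members = [s for s in range(N_STATES) if cluster[s] == c]
        backups = [max(R[a][s] + GAMMA * v_agg[cluster[P[a][s]]] for a in P)
                   for s in members]
        new.append(min(backups) if robust else sum(backups) / len(backups))
    return new
```

Iterating the robust backup yields values no larger than the mean-aggregation values: the robust estimate is a pessimistic lower bound, which is what makes the resulting greedy policy less sensitive to the aggregation error.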
Approximate Dynamic Programming for Two-Player Zero-Sum Markov Games
, 2015
Abstract

Cited by 1 (1 self)
This paper provides an analysis of error propagation in Approximate Dynamic Programming applied to zero-sum two-player Stochastic Games. We provide a novel and unified error propagation analysis in L_p-norm of three well-known algorithms adapted to Stochastic Games (namely …
Safe Policy Improvement by Minimizing Robust Baseline Regret
Abstract
An important problem in sequential decision-making under uncertainty is to use limited data to compute a safe policy, which is guaranteed to outperform a given baseline strategy. In this paper, we develop and analyze a new model-based approach that computes a safe policy, given an inaccurate model of the system's dynamics and guarantees on the accuracy of this model. The new robust method uses this model to directly minimize the (negative) regret w.r.t. the baseline policy. Contrary to existing approaches, minimizing the regret allows one to improve the baseline policy in states with accurate dynamics and to seamlessly fall back to the baseline policy otherwise. We show that our formulation is NP-hard and propose a simple approximate algorithm. Our empirical results on several domains further show that even the simple approximate algorithm can outperform standard approaches.
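The fall-back behavior described in this abstract can be sketched qualitatively: adopt a candidate action only where its estimated improvement over the baseline survives the worst case allowed by the model-error bound, and keep the baseline action elsewhere. All names, estimates, and error radii below are hypothetical; the paper's actual method minimizes a robust regret objective (and is NP-hard), so this is only the qualitative idea, not its algorithm.

```python
# Hypothetical per-state improvement estimates and model-error bounds.
baseline = {'s0': 'a', 's1': 'a', 's2': 'a'}
candidate = {'s0': 'b', 's1': 'b', 's2': 'b'}
est_improvement = {'s0': 0.5, 's1': 0.3, 's2': 0.4}  # point estimates
model_error = {'s0': 0.1, 's1': 0.6, 's2': 0.2}      # uncertainty radii

def safe_policy(baseline, candidate, est, err):
    """Keep the candidate action only where the pessimistic (robust)
    improvement est - err is still positive; otherwise fall back to the
    baseline action, incurring zero regret w.r.t. the baseline there."""
    return {s: candidate[s] if est[s] - err[s] > 0 else baseline[s]
            for s in baseline}
```

Here the candidate is kept in the states with tight error bounds ('s0', 's2') and the policy falls back to the baseline in the poorly modeled state 's1'.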
STOCHASTIC MEAN PAYOFF GAMES WITH BOUNDED FIRST RETURN TIMES IS STRONGLY POLYNOMIAL
An improved version of the Random-Facet pivoting rule for the simplex algorithm
, 2015
Abstract
The Random-Facet pivoting rule of Kalai and of Matoušek, Sharir and Welzl is an elegant randomized pivoting rule for the simplex algorithm, the classical combinatorial algorithm for solving linear programs (LPs). The expected number of pivoting steps performed by the simplex algorithm when using this rule, on any linear program involving n inequalities in d variables, is …