Results 1–10 of 12
Approximate Modified Policy Iteration
Abstract

Cited by 17 (13 self)
In this paper, we propose three implementations of AMPI (Sec. 3) that generalize the AVI implementations of Ernst et al. (2005), Antos et al. (2007), and Munos & Szepesvári (2008), and the classification-based API algorithm of Lagoudakis & Parr (2003), Fern et al. (2006), Lazaric et al. (2010), and Gabillon et al. (2011). We then provide an error propagation analysis of AMPI (Sec. 4), which shows how the Lp-norm of its performance loss can be controlled by the error at each iteration of the algorithm. We show that the error propagation analysis of AMPI is more involved than that of AVI and API. This is due to the fact that neither the contraction nor the monotonicity argument, on which the error propagation analyses of these two algorithms rely, holds for AMPI. The analysis of this section unifies those for AVI and API and is applied to the AMPI implementations presented in Sec. 3. We detail the analysis of the classification-based implementation of MPI (CBMPI) of Sec. 3 by providing its finite sample analysis in Sec. 5. Our analysis indicates that the parameter m allows us to balance the estimation error of the classifier with the overall quality of the value approximation.
On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
 In: Advances in Neural Information Processing Systems (NIPS
, 2012
Abstract

Cited by 12 (9 self)
HAL is a multidisciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Q-learning and policy iteration algorithms for stochastic shortest path problems
, 2012
Abstract

Cited by 11 (8 self)
We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in [BY10b]. The main difference from the standard policy iteration approach is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm solves an optimal stopping problem inexactly with a finite number of value iterations. The main advantage over the standard Q-learning approach is lower overhead: most iterations do not require a minimization over all controls, in the spirit of modified policy iteration. We prove the convergence of asynchronous deterministic and stochastic lookup table implementations of our method for undiscounted, total cost stochastic shortest path problems. These implementations overcome some of the traditional convergence difficulties of asynchronous modified policy iteration, and provide policy-iteration-like alternative Q-learning schemes with convergence as reliable as that of classical Q-learning. We also discuss methods that use basis function approximations of Q-factors, and we give an associated error bound.
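The core idea in this abstract, performing the expensive minimization over controls only occasionally and running cheap fixed-policy sweeps in between, can be illustrated with a tabular sketch. This is a simplified illustration, not the authors' algorithm: it replaces their inexact optimal-stopping evaluation with plain fixed-policy Q sweeps, and all names and the toy problem below are hypothetical.

```python
import numpy as np

def optimistic_q_iteration(P, g, term, m=3, iters=50):
    """Modified-policy-iteration-style Q iteration for a stochastic
    shortest path problem (tabular sketch; hypothetical names).
    P: (A, S, S) transition probabilities, g: (A, S) one-stage costs,
    term: boolean (S,) mask of termination states (cost-to-go 0).
    A minimization over all controls happens once per outer iteration;
    the m inner sweeps evaluate the fixed greedy policy mu."""
    A, S, _ = P.shape
    Q = np.zeros((A, S))
    for _ in range(iters):
        mu = np.argmin(Q, axis=0)            # the only min over controls
        for _ in range(m):                   # cheap fixed-policy sweeps
            v_mu = np.where(term, 0.0, Q[mu, np.arange(S)])
            Q = g + P @ v_mu                 # undiscounted total cost backup
    return Q, np.argmin(Q, axis=0)
```

On a 3-state chain with a unit-cost "move right" action and a cost-2 "stay" action, the iteration recovers the optimal costs-to-go 2, 1, 0 while minimizing over actions only once per outer iteration.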
Exploiting Policy Knowledge in Online Least-Squares Policy Iteration: An Empirical Study
Abstract

Cited by 2 (0 self)
Exploiting policy knowledge in online least-squares policy iteration: An empirical study. L. Buşoniu, B. De Schutter, R. Babuška, and D. Ernst. If you want to cite this report, please use the following reference instead:
Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
, 2011
Abstract

Cited by 1 (0 self)
We consider the discrete-time infinite-horizon optimal control problem formalized by Markov Decision Processes (Puterman, 1994; Bertsekas and Tsitsiklis, 1996). We revisit the work of Bertsekas and Ioffe (1996), which introduced λ Policy Iteration, a family of algorithms parameterized by λ that generalizes the standard Value Iteration and Policy Iteration algorithms and has deep connections with the Temporal Differences algorithm TD(λ) described by Sutton and Barto (1998). We deepen the original theory developed by the authors by providing convergence rate bounds that generalize the standard bounds for Value Iteration described, for instance, by Puterman (1994). The main contribution of this paper is then to develop the theory of this algorithm when it is used in an approximate form, and to show that this use is sound. In doing so, we extend and unify the separate analyses developed by Munos for Approximate Value Iteration (Munos, 2007) and Approximate Policy Iteration (Munos, 2003). Finally, we revisit the use of this algorithm in the training of a Tetris-playing controller, as originally done by Bertsekas and Ioffe (1996). We provide an original performance bound that can be applied to such an undiscounted control problem.
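The interpolation described here can be made concrete with an exact tabular sketch of λ Policy Iteration: λ = 0 gives Value Iteration and λ → 1 approaches Policy Iteration, via the geometrically weighted evaluation v_{k+1} = (1−λ) Σ_{t≥0} λ^t (T_μ)^{t+1} v_k. This is a minimal illustration under my own naming, with the infinite series truncated; it is not the authors' implementation.

```python
import numpy as np

def lambda_policy_iteration(P, R, gamma=0.9, lam=0.5, iters=80, trunc=100):
    """Tabular λ Policy Iteration sketch (hypothetical names).
    P: (A, S, S) transitions, R: (A, S) rewards.
    lam=0 recovers value iteration; lam -> 1 approaches policy iteration."""
    A, S, _ = P.shape
    v = np.zeros(S)
    mu = np.zeros(S, dtype=int)
    for _ in range(iters):
        q = R + gamma * P @ v
        mu = np.argmax(q, axis=0)              # greedy policy mu_{k+1}
        P_mu = P[mu, np.arange(S), :]
        r_mu = R[mu, np.arange(S)]
        # Truncated λ-weighted evaluation: (1-λ) Σ_t λ^t (T_μ)^{t+1} v
        w, acc = v.copy(), np.zeros(S)
        for t in range(trunc):
            w = r_mu + gamma * P_mu @ w        # one more application of T_μ
            acc += (1 - lam) * lam**t * w
        v = acc
    return v, mu
```

On a 2-state MDP where action 1 moves to a state yielding reward 1 forever, the optimal values are v = (γ/(1−γ), 1/(1−γ)) = (9, 10) for γ = 0.9, which the sketch recovers.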
Journal of Machine Learning Research (Submitted 2013/07; Revised 2014/05)
Approximate Modified Policy Iteration and its Application to the Game of Tetris
Abstract
Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods. Despite its generality, MPI has not been thoroughly studied, especially in its approximate form, which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that are extensions of the well-known approximate DP algorithms: fitted-value iteration, fitted-Q iteration, and classification-based policy iteration. We provide an error propagation analysis that unifies those for approximate policy and value iteration. We develop the finite-sample analysis of these algorithms, which highlights the influence of their parameters. In the classification-based version of the algorithm (CBMPI), the analysis shows that MPI's main parameter controls the balance between the estimation error of the classifier and the overall value function approximation. We illustrate and evaluate the behavior of these new algorithms in the Mountain Car and Tetris problems. Remarkably, in Tetris, CBMPI outperforms the existing DP approaches by a large margin, and competes with the current state-of-the-art methods while using fewer samples.
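The exact MPI scheme that this abstract builds on is easy to state in tabular form: greedily improve the policy, then apply its Bellman operator m times instead of evaluating it exactly. A minimal sketch (names mine; this is the exact DP version, not the approximate one studied in the paper) is:

```python
import numpy as np

def modified_policy_iteration(P, R, gamma=0.9, m=5, iters=100):
    """Tabular modified policy iteration (sketch, hypothetical names).
    P: (A, S, S) transition probabilities, R: (A, S) rewards.
    m=1 recovers value iteration; m -> infinity approaches policy iteration."""
    A, S, _ = P.shape
    v = np.zeros(S)
    pi = np.zeros(S, dtype=int)
    for _ in range(iters):
        # Greedy step: policy that is greedy w.r.t. the current v.
        q = R + gamma * P @ v            # shape (A, S)
        pi = np.argmax(q, axis=0)
        # Partial evaluation: m applications of T_pi instead of solving for v_pi.
        P_pi = P[pi, np.arange(S), :]    # (S, S) transitions under pi
        r_pi = R[pi, np.arange(S)]       # (S,) rewards under pi
        for _ in range(m):
            v = r_pi + gamma * P_pi @ v
    return v, pi
```

The parameter m here is the one the abstract says controls the trade-off in CBMPI: small m leans on value backups, large m on policy evaluation.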
Stochastic Shortest Path Problems Under Weak Conditions
Abstract
In this paper we weaken the conditions under which some of the basic analytical and algorithmic results for finite-state stochastic shortest path problems hold. We provide an analysis under three types of assumptions, under all of which the standard form of policy iteration may fail and other anomalies may occur. In the first type of assumptions, we require a standard compactness and continuity condition, as well as the existence of an optimal proper policy, thereby allowing positive and negative costs per stage, and improper policies with finite cost at all states. The analysis is based on introducing an additive perturbation δ > 0 to the cost per stage, which drives the cost of improper policies to infinity. By considering the δ-perturbed problem and taking the limit as δ ↓ 0, we show the validity of Bellman's equation and value iteration, and we construct a convergent policy iteration algorithm that uses a diminishing sequence of perturbations. In the second type of assumptions, we require nonpositive one-stage costs, and we give policy iteration algorithms that are optimistic and do not require the use of perturbations. In the third type of assumptions, we require nonnegative one-stage costs, as well as the compactness and continuity condition, and we convert the problem to an equivalent stochastic shortest path problem for which the existing theory applies. Using this transformation, we address the uniqueness of the solution of Bellman's equation, the convergence of value iteration, and the convergence of some variants of policy iteration. Our analysis and algorithms under the second and third types of assumptions fully apply to finite-state positive (reward) and negative (reward) dynamic programming models.
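The δ-perturbation device described above can be sketched numerically: add δ > 0 to every one-stage cost so that any improper policy accumulates infinite cost, solve the perturbed problem by value iteration, and shrink δ toward 0. This is only an illustration of the mechanism on a lookup-table problem (names and the δ schedule are mine), not the paper's policy iteration algorithm.

```python
import numpy as np

def perturbed_ssp_values(P, g, term, deltas=(1.0, 0.1, 0.01, 0.001), sweeps=500):
    """Value iteration on δ-perturbed stochastic shortest path problems
    for a diminishing sequence of δ (sketch, hypothetical names).
    P: (A, S, S) transitions, g: (A, S) one-stage costs,
    term: boolean (S,) mask of termination states (cost-to-go fixed at 0)."""
    A, S, _ = P.shape
    J = np.zeros(S)
    for d in deltas:                     # diminishing perturbations
        for _ in range(sweeps):
            Q = (g + d) + P @ np.where(term, 0.0, J)  # δ added to every stage cost
            J = np.where(term, 0.0, Q.min(axis=0))
    return J
```

On the 3-state chain with unit-cost "move right" and cost-2 "stay" actions, the perturbed values converge to the true costs-to-go 2, 1, 0 as δ shrinks.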
Weighted Bellman Equations and their Applications in Approximate Dynamic Programming
, 2012
Abstract
We consider approximation methods for Markov decision processes in the learning and simulation context. For policy evaluation based on solving approximate versions of a Bellman equation, we propose the use of weighted Bellman mappings. Such mappings comprise weighted sums of one-step and multi-step Bellman mappings, where the weights depend on both the step and the state. For projected versions of the associated Bellman equations, we show that their solutions have the same nature and essential approximation properties as the commonly used approximate solutions from TD(λ). The most important feature of our framework is that each state can be associated with a different type of mapping. Compared with the standard TD(λ) framework, this gives a more flexible way to combine multi-stage costs and state transition probabilities in approximate policy evaluation, and provides alternative means for bias-variance control. With weighted Bellman mappings, there is also greater flexibility in designing learning and simulation-based algorithms. We demonstrate this with examples, including new TD-type algorithms with state-dependent λ parameters, as well as block versions of the algorithms. Weighted Bellman mappings can also ...
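One simple way to realize a state-dependent mixing of one-step and multi-step backups, in the spirit of the abstract above, is a tabular TD rule whose eligibility trace decays with a per-state λ(s). This is my own minimal illustration of the idea, not the authors' algorithm; the function name and episode format are assumptions.

```python
import numpy as np

def td_statedep_lambda(episodes, lam, alpha=0.1, gamma=0.9, n_states=3):
    """Tabular TD with a state-dependent λ (sketch, hypothetical names).
    episodes: list of episodes, each a list of (s, r, s_next) transitions;
    lam: array mapping state -> λ value. The trace is decayed by λ(s) of the
    state being visited, so each state blends one-step and multi-step
    backups differently. Terminal states keep value 0 by convention."""
    v = np.zeros(n_states)
    for ep in episodes:
        z = np.zeros(n_states)           # eligibility traces
        for s, r, s_next in ep:
            delta = r + gamma * v[s_next] - v[s]
            z *= gamma * lam[s]          # state-dependent trace decay
            z[s] += 1.0
            v += alpha * delta * z
    return v
```

On a deterministic 2-step chain (state 2 terminal, reward 1 per transition, γ = 0.9), the values converge to v(1) = 1 and v(0) = 1 + 0.9 = 1.9 regardless of the per-state λ choices, while the λ profile changes how credit flows during learning.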