Results 1 – 10 of 15
Approximate Modified Policy Iteration
Abstract

Cited by 18 (14 self)
In this paper, we propose three implementations of AMPI (Sec. 3) that generalize the AVI implementations of Ernst et al. (2005); Antos et al. (2007); Munos & Szepesvári (2008) and the classification-based API algorithm of Lagoudakis & Parr (2003); Fern et al. (2006); Lazaric et al. (2010); Gabillon et al. (2011). We then provide an error propagation analysis of AMPI (Sec. 4), which shows how the Lp-norm of its performance loss can be controlled by the error at each iteration of the algorithm. We show that the error propagation analysis of AMPI is more involved than that of AVI and API, because neither the contraction nor the monotonicity arguments on which the error propagation analyses of these two algorithms rely hold for AMPI. The analysis of this section unifies those for AVI and API and is applied to the AMPI implementations presented in Sec. 3. We detail the analysis of the classification-based implementation of MPI (CBMPI) of Sec. 3 by providing its finite-sample analysis in Sec. 5. Our analysis indicates that the parameter m allows us to balance the estimation error of the classifier with the overall quality of the value approximation.
Tight performance bounds for approximate modified policy iteration with nonstationary policies
, 2013
Abstract

Cited by 3 (3 self)
HAL is a multidisciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Approximate Policy Iteration Schemes: A Comparison
Abstract

Cited by 2 (0 self)
We consider the infinite-horizon discounted optimal control problem formalized by Markov Decision Processes. We focus on several approximate variations of the Policy Iteration algorithm:
Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
, 2011
Abstract

Cited by 1 (0 self)
We consider the discrete-time infinite-horizon optimal control problem formalized by Markov Decision Processes (Puterman, 1994; Bertsekas and Tsitsiklis, 1996). We revisit the work of Bertsekas and Ioffe (1996), which introduced λ Policy Iteration, a family of algorithms parameterized by λ that generalizes the standard Value Iteration and Policy Iteration algorithms and has deep connections with the Temporal Differences algorithm TD(λ) described by Sutton and Barto (1998). We deepen the original theory developed by the authors by providing convergence rate bounds which generalize the standard bounds for Value Iteration described, for instance, by Puterman (1994). The main contribution of this paper is then to develop the theory of this algorithm when it is used in an approximate form, and to show that this is sound. In doing so, we extend and unify the separate analyses developed by Munos for Approximate Value Iteration (Munos, 2007) and Approximate Policy Iteration (Munos, 2003). Finally, we revisit the use of this algorithm in the training of a Tetris-playing controller, as originally done by Bertsekas and Ioffe (1996). We provide an original performance bound that can be applied to such an undiscounted control problem.
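The λ Policy Iteration scheme this abstract summarizes admits a compact sketch: each iteration computes a greedy policy, then applies a λ-weighted evaluation step solving (I − λγP_π) v_{k+1} = r_π + (1−λ)γP_π v_k, which interpolates between Value Iteration (λ = 0) and Policy Iteration (λ = 1). The toy 2-state, 2-action MDP below (P, R, gamma) is invented for illustration and is not taken from the paper.

```python
import numpy as np

gamma = 0.9
# Made-up transition matrices P[a][s, s'] and expected rewards R[a][s].
P = [np.array([[0.8, 0.2], [0.1, 0.9]]),   # action 0
     np.array([[0.5, 0.5], [0.6, 0.4]])]   # action 1
R = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]

def lambda_pi(lam, iters=300):
    v = np.zeros(2)
    I = np.eye(2)
    for _ in range(iters):
        # Greedy step: pick the action maximizing the one-step lookahead.
        q = np.array([R[a] + gamma * P[a] @ v for a in range(2)])
        pi = q.argmax(axis=0)
        P_pi = np.array([P[pi[s]][s] for s in range(2)])
        r_pi = np.array([R[pi[s]][s] for s in range(2)])
        # lambda-weighted evaluation: lam = 0 recovers Value Iteration,
        # lam = 1 recovers Policy Iteration (full evaluation of pi).
        v = np.linalg.solve(I - lam * gamma * P_pi,
                            r_pi + (1 - lam) * gamma * P_pi @ v)
    return v

# All lambda settings converge to the same optimal value function.
v0, v_half, v1 = lambda_pi(0.0), lambda_pi(0.5), lambda_pi(1.0)
```

Larger λ does more evaluation work per iteration (a linear solve here) in exchange for fewer iterations, which is the trade-off the paper's bounds quantify.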
On the Performance Bounds of some Policy Search Dynamic Programming Algorithms
, 2013
Abstract

Cited by 1 (0 self)
We consider the infinite-horizon discounted optimal control problem formalized by Markov Decision Processes. We focus on Policy Search algorithms, which compute an approximately optimal policy by following the standard Policy Iteration (PI) scheme via an ɛ-approximate greedy operator (Kakade and Langford, 2002; Lazaric et al., 2010). We describe existing and a few new performance bounds for Direct Policy Iteration (DPI) (Lagoudakis and Parr, 2003; Fern et al., 2006; Lazaric et al., 2010) and Conservative Policy Iteration (CPI) (Kakade and Langford, 2002). By paying particular attention to the concentrability constants involved in such guarantees, we notably argue that the guarantee of CPI is much better than that of DPI, but this comes at the cost of a relative increase of time complexity that is exponential in 1/ɛ. We then describe an algorithm, Non-Stationary Direct Policy Iteration (NSDPI), that can be seen either as 1) an adaptation of Policy Search by Dynamic Programming by Bagnell et al. (2003) to the infinite-horizon situation, or 2) a simplified version of the Non-Stationary PI with growing period of Scherrer and Lesner (2012). We provide an analysis of this algorithm, which shows in particular that it enjoys the best of both worlds: its performance guarantee is similar to that of CPI, but within a time complexity similar to that of DPI.
Loss Bounds for Uncertain Transition Probabilities in Markov Decision Processes
Abstract

Cited by 1 (1 self)
We analyze losses resulting from uncertain transition probabilities in Markov decision processes with bounded nonnegative rewards. We assume that policies are precomputed using exact dynamic programming with the estimated transition probabilities, but the system evolves according to different, true transition probabilities. Given a bound on the total variation error of the estimated transition probability distributions, we derive upper bounds on the loss of expected total reward. The approach analyzes the growth of errors incurred by stepping backwards in time while precomputing value functions, which requires bounding a multilinear program. Loss bounds are given for the finite-horizon undiscounted, finite-horizon discounted, and infinite-horizon discounted cases, and a tight example is shown.
The Dependence of Effective Planning Horizon on Model Accuracy
Abstract

Cited by 1 (0 self)
For Markov decision processes with long horizons (i.e., discount factors close to one), it is common in practice to use reduced horizons during planning to speed computation. However, perhaps surprisingly, when the model available to the agent is estimated from data, as will be the case in most real-world problems, the policy found using a shorter planning horizon can actually be better than a policy learned with the true horizon. In this paper we provide a precise explanation for this phenomenon based on principles of learning theory. We show formally that the planning horizon is a complexity control parameter for the class of policies to be learned. In particular, it has an intuitive, monotonic relationship with a simple counting measure of complexity, and a similar relationship can be observed empirically with a more general and data-dependent Rademacher complexity measure. Each complexity measure gives rise to a bound on the planning loss predicting that a planning horizon shorter than the true horizon can reduce overfitting and improve test performance, and we confirm these predictions empirically.
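The experimental setup this abstract describes can be sketched in a few lines: plan by value iteration in a *noisy estimate* of the model with a reduced planning discount (a shorter effective horizon), then score the resulting policy in the true model at the true discount. Everything below (the random MDP, the noise level, the candidate planning discounts) is an illustrative assumption, not the paper's actual benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma_true = 4, 2, 0.99

# A made-up "true" model and a noisy estimate of its transition kernel.
P_true = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.random((n_actions, n_states))
P_hat = np.clip(P_true + 0.1 * rng.normal(size=P_true.shape), 1e-6, None)
P_hat /= P_hat.sum(axis=-1, keepdims=True)   # renormalize rows

def greedy_policy(P, gamma, iters=2000):
    # Value iteration in model P at planning discount gamma.
    v = np.zeros(n_states)
    for _ in range(iters):
        v = np.max([R[a] + gamma * P[a] @ v for a in range(n_actions)], axis=0)
    return np.argmax([R[a] + gamma * P[a] @ v for a in range(n_actions)], axis=0)

def evaluate(pi, P, gamma):
    # Exact policy evaluation: v = (I - gamma * P_pi)^{-1} r_pi.
    P_pi = np.array([P[pi[s], s] for s in range(n_states)])
    r_pi = np.array([R[pi[s], s] for s in range(n_states)])
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Plan in the estimated model at short and full horizons; score both
# policies in the true model at the true discount.
scores = {g: evaluate(greedy_policy(P_hat, g), P_true, gamma_true).mean()
          for g in (0.5, 0.99)}
```

Whether the shorter planning horizon wins depends on the noise in `P_hat`; the paper's point is that it often does when the model error is large relative to the horizon.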
Approximate Modified Policy Iteration and its Application to the Game of Tetris
, Journal of Machine Learning Research (submitted 2013/07; revised 2014/05)
Abstract
Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods. Despite its generality, MPI has not been thoroughly studied, especially its approximate form, which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that are extensions of the well-known approximate DP algorithms: fitted-value iteration, fitted-Q iteration, and classification-based policy iteration. We provide an error propagation analysis that unifies those for approximate policy and value iteration. We develop the finite-sample analysis of these algorithms, which highlights the influence of their parameters. In the classification-based version of the algorithm (CBMPI), the analysis shows that MPI's main parameter controls the balance between the estimation error of the classifier and the overall value function approximation. We illustrate and evaluate the behavior of these new algorithms in the Mountain Car and Tetris problems. Remarkably, in Tetris, CBMPI outperforms the existing DP approaches by a large margin, and competes with the current state-of-the-art methods while using fewer samples.
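As a rough illustration of how MPI contains both classical methods, exact MPI alternates a greedy step with m applications of the Bellman operator of the current policy: m = 1 recovers value iteration, and m → ∞ recovers policy iteration. The toy 3-state, 2-action MDP below is invented for the sketch and is not taken from the paper.

```python
import numpy as np

gamma = 0.95
# Made-up transition matrices P[a][s, s'] and expected rewards R[a][s].
P = [np.array([[0.9, 0.1, 0.0],
               [0.0, 0.9, 0.1],
               [0.1, 0.0, 0.9]]),
     np.array([[0.2, 0.8, 0.0],
               [0.0, 0.2, 0.8],
               [0.8, 0.0, 0.2]])]
R = [np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 0.5])]

def mpi(m, iters=400):
    v = np.zeros(3)
    for _ in range(iters):
        # Greedy step, as in policy iteration.
        q = np.array([R[a] + gamma * P[a] @ v for a in range(2)])
        pi = q.argmax(axis=0)
        P_pi = np.array([P[pi[s]][s] for s in range(3)])
        r_pi = np.array([R[pi[s]][s] for s in range(3)])
        # Partial evaluation: m applications of the Bellman operator T_pi.
        # m = 1 is value iteration; m -> infinity is policy iteration.
        for _ in range(m):
            v = r_pi + gamma * P_pi @ v
    return v

# Different m values converge to the same optimal value function.
v_vi, v_mpi = mpi(1), mpi(20)
```

The parameter m is the one the abstract refers to: in the approximate setting, it trades the per-iteration evaluation effort against the quality of each greedy step.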
Adaptive Sparse Grids in Reinforcement Learning
Abstract
We propose a model-based online reinforcement learning approach for continuous domains with deterministic transitions, using a spatially adaptive sparse grid in the planning stage. The model learning employs Gaussian process regression and allows for low sample complexity. The adaptive sparse grid is introduced to allow the representation of the value function in the planning stage in higher-dimensional state spaces. This work gives numerical evidence that adaptive sparse grids are applicable in the case of reinforcement learning.
Asymptotic Performance Guarantee for Online Reinforcement Learning with Least-Squares Regression
, 2011
Abstract
We introduce a new online reinforcement learning (RL) method called least-squares action preference learning (LSAPL) for learning the near-optimal policy in Markovian decision processes (MDPs) (Bertsekas, 2007). Online RL aims at learning a policy to control a system in an incremental fashion, such that some measure of long-term performance is maximized by that (optimal) policy. A typical setting where online RL operates is as follows: given the state x and the behavior policy π̄, the controller calculates a control action a, which is sent back to the system. The system then makes a transition to the new state x′ and issues a control feedback (reward), and the cycle is repeated. The learning problem is to gradually improve the estimate of the optimal control based on a history of observations (state, action, reward). Although many online RL algorithms with various levels of success have been proposed during the last 20 years (Maei et al., 2010; Melo et al., 2008; Szepesvari and Smart, 2004), we know of no theoretical guarantee in terms of performance loss for general function approximation. This paper provides an asymptotic performance guarantee for online RL, relying on a new variant of dynamic programming (DP) for
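The interaction cycle the abstract describes (state → action → transition → reward → update) can be sketched with tabular Q-learning standing in for the learner; this is an illustrative stand-in, not the paper's LSAPL method, and the 2-state chain MDP below is made up for the example.

```python
import random

def step(s, a):
    # Hypothetical deterministic dynamics: action 1 moves the agent from
    # state 0 to state 1, and pays reward 1 from state 1 (returning to 0).
    if s == 1 and a == 1:
        return 0, 1.0
    return (1 if a == 1 else 0), 0.0

def online_rl(steps=5000, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0], [0.0, 0.0]]   # Q[state][action] estimates
    s = 0
    for _ in range(steps):
        # Behavior policy: epsilon-greedy w.r.t. the current estimate.
        if rng.random() < eps:
            a = rng.randrange(2)
        else:
            a = max((0, 1), key=lambda act: Q[s][act])
        s2, r = step(s, a)               # system transition + reward
        # Incremental update from the (state, action, reward, next state)
        # observation, closing the online RL cycle.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
    return Q

Q = online_rl()
```

After enough cycles the estimate prefers the rewarding action in each state, which is the "gradual improvement of the estimate of the optimal control" the abstract refers to.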