Results 1 - 10 of 15
Approximate Modified Policy Iteration
"... In this paper, we propose three implementations of AMPI (Sec. 3) that generalize the AVI implementations of Ernst et al. (2005); Antos et al. (2007); Munos & Szepesvári (2008) and the classification-based API algorithm of Lagoudakis & Parr (2003); Fern et al. (2006); Lazaric et al. (2010); G ..."
Abstract - Cited by 18 (14 self)
In this paper, we propose three implementations of AMPI (Sec. 3) that generalize the AVI implementations of Ernst et al. (2005), Antos et al. (2007), and Munos & Szepesvári (2008) and the classification-based API algorithm of Lagoudakis & Parr (2003), Fern et al. (2006), Lazaric et al. (2010), and Gabillon et al. (2011). We then provide an error propagation analysis of AMPI (Sec. 4), which shows how the Lp-norm of its performance loss can be controlled by the error at each iteration of the algorithm. We show that the error propagation analysis of AMPI is more involved than that of AVI and API. This is due to the fact that neither the contraction nor the monotonicity arguments that the error propagation analyses of these two algorithms rely on hold for AMPI. The analysis of this section unifies those for AVI and API and is applied to the AMPI implementations presented in Sec. 3. We detail the analysis of the classification-based implementation of MPI (CBMPI) of Sec. 3 by providing its finite-sample analysis in Sec. 5. Our analysis indicates that the parameter m allows us to balance the estimation error of the classifier with the overall quality of the value approximation.
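To make the role of the parameter m concrete, here is a minimal tabular sketch of the exact Modified Policy Iteration scheme that AMPI approximates; it is illustrative only (the array layout and function name are assumptions, not the authors' AMPI implementations).

```python
import numpy as np

def modified_policy_iteration(P, r, gamma, m, n_iters):
    """Exact tabular Modified Policy Iteration (the scheme AMPI approximates).

    P: (A, S, S) transition tensor, r: (S, A) rewards, gamma: discount in (0, 1).
    m = 1 recovers Value Iteration; letting m grow recovers Policy Iteration.
    """
    S, A = r.shape
    v = np.zeros(S)
    pi = np.zeros(S, dtype=int)
    for _ in range(n_iters):
        # Greedy step: pi_{k+1} is greedy with respect to v_k.
        q = r + gamma * np.einsum("ast,t->sa", P, v)
        pi = q.argmax(axis=1)
        # Partial evaluation step: apply the Bellman operator T_{pi_{k+1}} m times to v_k.
        r_pi = r[np.arange(S), pi]
        P_pi = P[pi, np.arange(S), :]
        for _ in range(m):
            v = r_pi + gamma * P_pi @ v
    return v, pi
```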
Tight performance bounds for approximate modified policy iteration with non-stationary policies
, 2013
"... HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte p ..."
Abstract
-
Cited by 3 (3 self)
- Add to MetaCart
HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et a ̀ la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
Approximate Policy Iteration Schemes: A Comparison
"... We consider the infinite-horizon discounted opti-mal control problem formalized by Markov De-cision Processes. We focus on several approxi-mate variations of the Policy Iteration algorithm: ..."
Abstract - Cited by 2 (0 self)
We consider the infinite-horizon discounted optimal control problem formalized by Markov Decision Processes. We focus on several approximate variations of the Policy Iteration algorithm:
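For reference, the exact Policy Iteration scheme that such approximate variations relax can be written, in standard notation (which may differ from the paper's), as
\[
\pi_{k+1} \in \arg\max_{\pi}\big(r_\pi + \gamma P_\pi v_{\pi_k}\big),
\qquad
v_{\pi_{k+1}} = (I - \gamma P_{\pi_{k+1}})^{-1} r_{\pi_{k+1}},
\]
where the greedy step, the evaluation step, or both are replaced by approximations in the variations being compared.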
Loss Bounds for Uncertain Transition Probabilities in Markov Decision Processes
"... Abstract — We analyze losses resulting from uncertain transition probabilities in Markov decision processes with bounded nonnegative rewards. We assume that policies are precomputed using exact dynamic programming with the estimated transition probabilities, but the system evolves according to diffe ..."
Abstract - Cited by 1 (1 self)
We analyze losses resulting from uncertain transition probabilities in Markov decision processes with bounded nonnegative rewards. We assume that policies are precomputed using exact dynamic programming with the estimated transition probabilities, but the system evolves according to different, true transition probabilities. Given a bound on the total variation error of estimated transition probability distributions, we derive upper bounds on the loss of expected total reward. The approach analyzes the growth of errors incurred by stepping backwards in time while precomputing value functions, which requires bounding a multilinear program. Loss bounds are given for the finite horizon undiscounted, finite horizon discounted, and infinite horizon discounted cases, and a tight example is shown.
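The setting can be reproduced numerically: precompute a policy with exact dynamic programming in the estimated model, act in the true model, and measure the loss of expected total reward. A minimal sketch for the infinite-horizon discounted case follows (function names and the tabular representation are illustrative assumptions; the paper's bounds themselves come from a multilinear-program analysis, not from this computation).

```python
import numpy as np

def policy_value(P, r, pi, gamma):
    """Exact value of the deterministic policy pi in the model (P, r) with discount gamma."""
    S = r.shape[0]
    P_pi, r_pi = P[pi, np.arange(S), :], r[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def greedy_policy(P, r, gamma, n_iters=1000):
    """Greedy policy from value iteration in the (possibly estimated) model (P, r)."""
    v = np.zeros(r.shape[0])
    for _ in range(n_iters):
        v = (r + gamma * np.einsum("ast,t->sa", P, v)).max(axis=1)
    return (r + gamma * np.einsum("ast,t->sa", P, v)).argmax(axis=1)

def planning_loss(P_true, P_hat, r, gamma):
    """Loss of expected discounted reward from planning with P_hat but acting under P_true."""
    pi_hat = greedy_policy(P_hat, r, gamma)    # policy precomputed with estimated transitions
    pi_star = greedy_policy(P_true, r, gamma)  # reference policy computed with true transitions
    return policy_value(P_true, r, pi_star, gamma) - policy_value(P_true, r, pi_hat, gamma)
```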
On the Performance Bounds of some Policy Search Dynamic Programming Algorithms
, 2013
"... We consider the infinite-horizon discounted optimal control problem formalized by Markov Decision Processes. We focus on Policy Search algorithms, that compute an approximately optimal policy by following the standard Policy Iteration (PI) scheme via an ɛ-approximate greedy operator (Kakade and Lang ..."
Abstract - Cited by 1 (0 self)
We consider the infinite-horizon discounted optimal control problem formalized by Markov Decision Processes. We focus on Policy Search algorithms that compute an approximately optimal policy by following the standard Policy Iteration (PI) scheme via an ε-approximate greedy operator (Kakade and Langford, 2002; Lazaric et al., 2010). We describe existing and a few new performance bounds for Direct Policy Iteration (DPI) (Lagoudakis and Parr, 2003; Fern et al., 2006; Lazaric et al., 2010) and Conservative Policy Iteration (CPI) (Kakade and Langford, 2002). By paying particular attention to the concentrability constants involved in such guarantees, we notably argue that the guarantee of CPI is much better than that of DPI, but this comes at the cost of a relative increase of time complexity that is exponential in 1/ε. We then describe an algorithm, Non-Stationary Direct Policy Iteration (NSDPI), that can either be seen as 1) a variation of Policy Search by Dynamic Programming of Bagnell et al. (2003) adapted to the infinite-horizon setting or 2) a simplified version of the Non-Stationary PI with growing period of Scherrer and Lesner (2012). We provide an analysis of this algorithm, which shows in particular that it enjoys the best of both worlds: its performance guarantee is similar to that of CPI, but within a time complexity similar to that of DPI.
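In standard notation (not necessarily the paper's), the two schemes being compared differ only in how they use an ε-approximate greedy policy, i.e. a policy π' satisfying T_{π'} v ≥ T v - ε·1 for the current value v:
\[
\text{DPI: } \pi_{k+1} = \mathcal{G}_{\epsilon}(v_{\pi_k}),
\qquad
\text{CPI: } \pi_{k+1} = (1-\alpha_k)\,\pi_k + \alpha_k\,\mathcal{G}_{\epsilon}(v_{\pi_k}),
\]
where \(\mathcal{G}_{\epsilon}\) denotes the ε-approximate greedy operator; the small mixing coefficients \(\alpha_k\) are the source of the extra time complexity of CPI mentioned above.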
Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
, 2011
"... We consider the discrete-time infinite-horizon optimal control problem formalized by Markov Decision Processes (Puterman, 1994; Bertsekas and Tsitsiklis, 1996). We revisit the work of Bertsekas and Ioffe (1996), that introduced λ Policy Iteration, a family of algorithms parameterized by λ that gener ..."
Abstract - Cited by 1 (0 self)
We consider the discrete-time infinite-horizon optimal control problem formalized by Markov Decision Processes (Puterman, 1994; Bertsekas and Tsitsiklis, 1996). We revisit the work of Bertsekas and Ioffe (1996), which introduced λ Policy Iteration, a family of algorithms parameterized by λ that generalizes the standard algorithms Value Iteration and Policy Iteration and has deep connections with the Temporal Differences algorithm TD(λ) described by Sutton and Barto (1998). We deepen the original theory developed by the authors by providing convergence rate bounds that generalize standard bounds for Value Iteration described, for instance, by Puterman (1994). The main contribution of this paper is then to develop the theory of this algorithm when it is used in an approximate form and to show that this is sound. In doing so, we extend and unify the separate analyses developed by Munos for Approximate Value Iteration (Munos, 2007) and Approximate Policy Iteration (Munos, 2003). Finally, we revisit the use of this algorithm in the training of a Tetris-playing controller, as originally done by Bertsekas and Ioffe (1996). We provide an original performance bound that can be applied to such an undiscounted control problem.
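In standard notation (which may differ from the paper's), the λ Policy Iteration update can be written as
\[
\pi_{k+1} \in \mathcal{G}(v_k),
\qquad
v_{k+1} = (1-\lambda)\sum_{j \ge 0} \lambda^{j}\,\big(T_{\pi_{k+1}}\big)^{j+1} v_k,
\]
so that λ = 0 performs a single Bellman backup and recovers Value Iteration, while λ → 1 performs a full evaluation of π_{k+1} and recovers Policy Iteration.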
The dependence of effective planning horizon on model accuracy.
- In International Conference on Autonomous Agents and Multiagent Systems, AAMAS.
, 2015
"... ABSTRACT For Markov decision processes with long horizons (i.e., discount factors close to one), it is common in practice to use reduced horizons during planning to speed computation. However, perhaps surprisingly, when the model available to the agent is estimated from data, as will be the case in ..."
Abstract - Cited by 1 (0 self)
For Markov decision processes with long horizons (i.e., discount factors close to one), it is common in practice to use reduced horizons during planning to speed computation. However, perhaps surprisingly, when the model available to the agent is estimated from data, as will be the case in most real-world problems, the policy found using a shorter planning horizon can actually be better than a policy learned with the true horizon. In this paper we provide a precise explanation for this phenomenon based on principles of learning theory. We show formally that the planning horizon is a complexity control parameter for the class of policies to be learned. In particular, we show that it has an intuitive, monotonic relationship with a simple counting measure of complexity, and that a similar relationship can be observed empirically with a more general, data-dependent Rademacher complexity measure. Each complexity measure gives rise to a bound on the planning loss predicting that a planning horizon shorter than the true horizon can reduce overfitting and improve test performance, and we confirm these predictions empirically.
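A minimal numerical sketch of the setup (with illustrative names, not the paper's experiments): plan in an estimated model with a reduced planning discount and evaluate the resulting policy in the true model under the true discount.

```python
import numpy as np

def plan(P, r, gamma, n_iters=1000):
    """Greedy policy from value iteration under the planning discount gamma."""
    v = np.zeros(r.shape[0])
    for _ in range(n_iters):
        v = (r + gamma * np.einsum("ast,t->sa", P, v)).max(axis=1)
    return (r + gamma * np.einsum("ast,t->sa", P, v)).argmax(axis=1)

def true_performance(P_true, r, pi, gamma_true):
    """Average value of the deterministic policy pi in the true model, true discount."""
    S = r.shape[0]
    P_pi, r_pi = P_true[pi, np.arange(S), :], r[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma_true * P_pi, r_pi).mean()

# The phenomenon described above: when P_hat is a noisy estimate of P_true,
#   true_performance(P_true, r, plan(P_hat, r, gamma_plan), gamma_true)
# can exceed
#   true_performance(P_true, r, plan(P_hat, r, gamma_true), gamma_true)
# for some gamma_plan < gamma_true, i.e. the shorter planning horizon overfits less.
```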
Approximation of Stationary Control Policies by Quantized Control in Markov Decision Processes
"... Abstract — We consider the problem of approximating opti-mal stationary control policies by quantized control. Stationary quantizer policies are introduced and it is shown that such policies are ε-optimal among stationary policies under mild technical conditions. Quantitative bounds on the approxima ..."
Abstract
We consider the problem of approximating optimal stationary control policies by quantized control. Stationary quantizer policies are introduced, and it is shown that such policies are ε-optimal among stationary policies under mild technical conditions. Quantitative bounds on the approximation error in terms of the rate of the approximating quantizers are also derived. Thus, one can search for ε-optimal policies within quantized control policies. These results pave the way for applications in the optimal design of networked control systems where controller actions need to be quantized, as well as for a new computational method for the generation of approximately optimal Markov decision policies in general (Borel) state and action spaces for both discounted-cost and average-cost infinite-horizon optimal control problems.
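As a toy illustration of the quantization step (uniform scalar quantization of a one-dimensional action only; the paper works with general Borel spaces and bounds the error in terms of the quantizer rate), a stationary policy's continuous actions might be discretized as follows. The names are assumptions for the sketch.

```python
import numpy as np

def quantize_policy(actions, action_low, action_high, rate_bits):
    """Map each continuous action to the nearest level of a uniform 2**rate_bits-point quantizer."""
    levels = np.linspace(action_low, action_high, 2 ** rate_bits)
    idx = np.abs(np.asarray(actions)[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

# Increasing rate_bits shrinks the quantization cells and, with them, the loss of the
# quantized stationary policy relative to the original one.
```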
Performance Bounds for λ Policy Iteration
, 2011
"... HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte p ..."
Abstract
- Add to MetaCart
HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et a ̀ la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
Asymptotic Performance Guarantee for Online Reinforcement Learning with the Least-Squares Regression
, 2011
"... We introduce a new online reinforcement learning (RL) method called least-squares action preference learning (LS-APL) for learning the near-optimal policy in Markovian decision processes (MDPs) (Bertsekas, 2007). Online RL aims at learning a policy to control a system in an incremental fashion, such ..."
Abstract
We introduce a new online reinforcement learning (RL) method called least-squares action preference learning (LS-APL) for learning the near-optimal policy in Markovian decision processes (MDPs) (Bertsekas, 2007). Online RL aims at learning a policy to control a system in an incremental fashion, such that some measure of long-term performance is maximized by that (optimal) policy. A typical setting where online RL operates is as follows: given the state x and the behavior policy π̄, the controller calculates a control action a, which is sent back to the system. The system then makes a transition to the new state x′ and issues a control feedback (reward), and the cycle is repeated. The learning problem is to gradually improve the estimate of optimal control based on a history of observations (state-action-reward). Although many online RL algorithms with various levels of success have been proposed during the last 20 years (Maei et al., 2010; Melo et al., 2008; Szepesvari and Smart, 2004), we know of no theoretical guarantee in terms of performance loss for general function approximation. This paper provides an asymptotic performance guarantee for online RL, relying on a new variant of dynamic programming (DP) for
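The interaction cycle described above can be sketched generically as follows (the interface names are illustrative assumptions; this is not the LS-APL update itself, which is specific to the paper).

```python
def run_online_rl(reset, step, learner, n_steps):
    """Generic online RL loop.

    reset() -> initial state x; step(x, a) -> (x_next, reward);
    learner.act(x) -> action a; learner.update(x, a, r, x_next) refines the control estimate.
    """
    x = reset()
    for _ in range(n_steps):
        a = learner.act(x)               # behavior policy chooses a control action
        x_next, r = step(x, a)           # the system transitions and issues a reward (feedback)
        learner.update(x, a, r, x_next)  # incrementally improve the estimate of optimal control
        x = x_next
    return learner
```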