Results 1–10 of 15
Random sampling of states in dynamic programming
 in Proc. NIPS Conf., 2007
Abstract

Cited by 15 (4 self)
We combine three threads of research on approximate dynamic programming: sparse random sampling of states, value function and policy approximation using local models, and using local trajectory optimizers to globally optimize a policy and associated value function. Our focus is on finding steady-state policies for deterministic, time-invariant, discrete-time control problems with continuous states and actions, as often found in robotics. In this paper, we describe our approach and provide initial results on several simulated robotics problems. Index Terms—Dynamic programming, optimal control, random sampling.
Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view
Abstract

Cited by 14 (3 self)
We investigate projection methods for evaluating a linear approximation of the value function of a policy in a Markov Decision Process context. We consider two popular approaches, the one-step Temporal Difference fixed-point computation (TD(0)) and Bellman Residual (BR) minimization. We describe examples where each method outperforms the other. We highlight a simple relation between the objective functions they minimize, and show that while BR enjoys a performance guarantee, TD(0) does not in general. We then propose a unified view in terms of oblique projections of the Bellman equation, which substantially simplifies and extends the characterization of Schoknecht (2002) and the recent analysis of Yu & Bertsekas (2008). Finally, we describe simulations suggesting that although the TD(0) solution is usually slightly better than the BR solution, its inherent numerical instability can make it very poor in some cases, and thus worse on average.
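The two solutions compared in this abstract can be computed side by side on a toy problem. The sketch below (not from the paper; the chain, rewards, and features are made up) evaluates a fixed policy on a 3-state Markov chain with a linear architecture, solving the TD(0) fixed-point equations and the weighted least-squares BR problem:

```python
import numpy as np

# Illustrative comparison of TD(0) vs Bellman Residual (BR) policy evaluation
# with linear value approximation. All numbers are made up for the sketch.
gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],      # transition matrix under the fixed policy
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
r = np.array([1.0, 0.0, 2.0])       # expected one-step rewards
Phi = np.array([[1.0, 0.0],         # feature matrix (2 features per state)
                [1.0, 1.0],
                [0.0, 1.0]])
D = np.diag([1/3, 1/3, 1/3])        # state-weighting distribution (uniform)

# TD(0) fixed point: solve Phi' D (Phi - gamma P Phi) w = Phi' D r
A_td = Phi.T @ D @ (Phi - gamma * P @ Phi)
w_td = np.linalg.solve(A_td, Phi.T @ D @ r)

# BR: least-squares minimizer of the D-weighted Bellman residual norm
M = Phi - gamma * P @ Phi
w_br = np.linalg.solve(M.T @ D @ M, M.T @ D @ r)

V_exact = np.linalg.solve(np.eye(3) - gamma * P, r)   # exact value function
print("TD(0):", Phi @ w_td)
print("BR:   ", Phi @ w_br)
print("exact:", V_exact)
```

By construction the BR solution has the smaller weighted Bellman residual, while the TD(0) solution makes the residual orthogonal to the feature space — the "oblique projection" relation the paper formalizes.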
Adaptive-resolution reinforcement learning with polynomial exploration in deterministic domains
 Mach. Learn
Abstract

Cited by 5 (0 self)
We propose a model-based learning algorithm, the Adaptive-resolution Reinforcement Learning (ARL) algorithm, that aims to solve the online, continuous-state-space reinforcement learning problem in a deterministic domain. Our goal is to combine an adaptive-resolution approximation scheme with efficient exploration in order to obtain fast (polynomial) learning rates. The proposed algorithm uses an adaptive approximation of the optimal value function using kernel-based averaging, going from a coarse to a fine kernel-based representation of the state space, which makes it possible to use finer resolution in the “important” areas of the state space and coarser resolution elsewhere. We consider an online learning approach in which we discover these important areas online, using an uncertainty-intervals exploration technique. Polynomial learning rates in terms of mistake bound (in a PAC framework) are established for this algorithm under appropriate continuity assumptions.
Convergence analysis of on-policy LSPI for multidimensional continuous state- and action-space MDPs and extension with orthogonal polynomial approximation. Working paper
, 2010
Abstract

Cited by 5 (1 self)
We propose an online, on-policy least-squares policy iteration (LSPI) algorithm which can be applied to infinite-horizon problems where states and controls are vector-valued and continuous. We do not assume special structure such as linear dynamics or additive noise, and we assume that the expectation cannot be computed exactly. We use the concept of the post-decision state variable to eliminate the expectation inside the optimization problem. We provide a formal convergence analysis of the algorithm under the assumption that value functions are spanned by finitely many known basis functions. Furthermore, the convergence result extends to … Central to the solution of Markov decision processes is Bellman’s equation, which is often written in the standard form (Puterman, 1994): Vt(xt) = max_{ut ∈ U} { C(xt, ut) + γ ∑ … }
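The standard-form Bellman equation quoted at the end of this snippet can be applied directly as a fixed-point update, which is value iteration. A minimal sketch on a made-up discrete 3-state, 2-action MDP (a toy illustration, not the paper's continuous post-decision algorithm):

```python
import numpy as np

# Value iteration: iterate V(x) <- max_u { C(x,u) + gamma * sum_x' P(x'|x,u) V(x') }.
# The MDP below is invented purely for illustration.
gamma = 0.9
P = np.array([[[0.8, 0.2, 0.0],      # P[u, x, x']: transitions for action u = 0
               [0.1, 0.8, 0.1],
               [0.0, 0.2, 0.8]],
              [[0.2, 0.8, 0.0],      # action u = 1
               [0.0, 0.2, 0.8],
               [0.8, 0.0, 0.2]]])
C = np.array([[0.0, 1.0, 2.0],       # C[u, x]: one-step reward for u in state x
              [1.0, 0.0, 3.0]])

V = np.zeros(3)
for _ in range(500):                 # iterate to numerical convergence
    Q = C + gamma * (P @ V)          # Q[u, x] = C(x,u) + gamma * E[V(x')]
    V_new = Q.max(axis=0)            # maximize over actions
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

policy = (C + gamma * (P @ V)).argmax(axis=0)   # greedy policy from V
print(V, policy)
```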
Approximate Modified Policy Iteration
Abstract

Cited by 5 (4 self)
In this paper, we propose three implementations of AMPI (Sec. 3) that generalize the AVI implementations of Ernst et al. (2005), Antos et al. (2007), and Munos & Szepesvári (2008), and the classification-based API algorithm of Lagoudakis & Parr (2003), Fern et al. (2006), Lazaric et al. (2010), and Gabillon et al. (2011). We then provide an error propagation analysis of AMPI (Sec. 4), which shows how the Lp-norm of its performance loss can be controlled by the error at each iteration of the algorithm. We show that the error propagation analysis of AMPI is more involved than that of AVI and API. This is due to the fact that neither the contraction nor the monotonicity arguments that the error propagation analyses of these two algorithms rely on hold for AMPI. The analysis of this section unifies those for AVI and API and is applied to the AMPI implementations presented in Sec. 3. We detail the analysis of the classification-based implementation of MPI (CBMPI) of Sec. 3 by providing its finite-sample analysis in Sec. 5. Our analysis indicates that the parameter m allows us to balance the estimation error of the classifier with the overall quality of the value approximation.
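AMPI is the approximate counterpart of exact modified policy iteration (MPI), in which each greedy step is followed by m partial evaluation backups. A tabular sketch on a made-up 3-state, 2-action MDP (illustration only; the paper's contribution is the approximate setting):

```python
import numpy as np

# Tabular modified policy iteration (MPI): a greedy step followed by m
# applications of the chosen policy's Bellman backup. MDP numbers are made up.
gamma, m = 0.9, 5
P = np.array([[[0.7, 0.3, 0.0],          # P[u, x, x'] for action u = 0
               [0.2, 0.6, 0.2],
               [0.0, 0.3, 0.7]],
              [[0.3, 0.7, 0.0],          # action u = 1
               [0.0, 0.3, 0.7],
               [0.7, 0.0, 0.3]]])
r = np.array([[1.0, 0.0, 0.0],           # r[u, x]: reward for action u in state x
              [0.0, 2.0, 1.0]])

V = np.zeros(3)
for _ in range(200):                     # outer (greedy) iterations
    pi = (r + gamma * (P @ V)).argmax(axis=0)     # greedy policy w.r.t. V
    P_pi = P[pi, np.arange(3)]           # transition rows under pi
    r_pi = r[pi, np.arange(3)]
    for _ in range(m):                   # m partial evaluation backups
        V = r_pi + gamma * (P_pi @ V)

print(V, pi)
```

Setting m = 1 recovers value iteration; letting m grow without bound recovers policy iteration, which is why the parameter m interpolates between AVI and API in the paper's analysis.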
Reinforcement learning algorithms for MDPs
, 2009
Abstract

Cited by 3 (0 self)
This article presents a survey of reinforcement learning algorithms for Markov Decision Processes (MDPs). In the first half of the article, the problem of value estimation is considered. Here we start by describing the idea of bootstrapping and temporal difference learning. Next, we compare incremental and batch algorithmic variants and discuss the impact of the choice of the function approximation method on the success of learning. In the second half, we describe methods that target the problem of learning to control an MDP. Here, online and active learning are discussed first, followed by a description of direct and actor-critic methods.
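The bootstrapping idea at the heart of this survey is easiest to see in tabular TD(0), where the estimate V(s) is nudged toward the bootstrapped target r + γV(s'). A sketch on a toy 3-state chain (the chain, rewards, and step size are invented for illustration):

```python
import numpy as np

# Tabular TD(0): update V(s) toward the bootstrapped target r + gamma * V(s').
rng = np.random.default_rng(0)
gamma, alpha = 0.9, 0.1
P = np.array([[0.5, 0.5, 0.0],       # Markov chain under the evaluated policy
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
r = np.array([1.0, 0.0, 2.0])        # reward received on leaving each state

V = np.zeros(3)
s = 0
for _ in range(100_000):
    s_next = rng.choice(3, p=P[s])
    td_error = r[s] + gamma * V[s_next] - V[s]   # temporal-difference error
    V[s] += alpha * td_error                      # bootstrap update
    s = s_next

V_exact = np.linalg.solve(np.eye(3) - gamma * P, r)  # exact value function
print(V, V_exact)
```

With a constant step size the estimate fluctuates around the exact values; decaying the step size would make it converge, which is one of the incremental-versus-batch trade-offs the survey discusses.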
Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
, 2011
Abstract

Cited by 1 (0 self)
We consider the discrete-time infinite-horizon optimal control problem formalized by Markov Decision Processes (Puterman, 1994; Bertsekas and Tsitsiklis, 1996). We revisit the work of Bertsekas and Ioffe (1996), which introduced λ Policy Iteration, a family of algorithms parameterized by λ that generalizes the standard algorithms Value Iteration and Policy Iteration and has deep connections with the Temporal Differences algorithm TD(λ) described by Sutton and Barto (1998). We deepen the original theory developed by the authors by providing convergence rate bounds that generalize standard bounds for Value Iteration described, for instance, by Puterman (1994). The main contribution of this paper is then to develop the theory of this algorithm when it is used in an approximate form, and to show that this is sound. In doing so, we extend and unify the separate analyses developed by Munos for Approximate Value Iteration (Munos, 2007) and Approximate Policy Iteration (Munos, 2003). Finally, we revisit the use of this algorithm in the training of a Tetris-playing controller, as originally done by Bertsekas and Ioffe (1996). We provide an original performance bound that can be applied to such an undiscounted control problem.
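The way λ interpolates between Value Iteration and Policy Iteration can be sketched in the exact tabular case (a standard reconstruction of the Bertsekas–Ioffe recursion on a made-up 3-state, 2-action MDP; the paper itself treats the approximate setting):

```python
import numpy as np

# Tabular lambda-policy iteration. Each iteration takes the greedy policy pi
# w.r.t. V, then applies the damped evaluation
#   V <- (I - lam*gamma*P_pi)^{-1} (r_pi + (1 - lam)*gamma*P_pi V),
# which reduces to Value Iteration at lam = 0 and Policy Iteration at lam = 1.
gamma, lam = 0.9, 0.5
P = np.array([[[0.8, 0.2, 0.0],          # P[u, x, x'] for action u = 0
               [0.1, 0.8, 0.1],
               [0.0, 0.2, 0.8]],
              [[0.2, 0.8, 0.0],          # action u = 1
               [0.0, 0.2, 0.8],
               [0.8, 0.0, 0.2]]])
r = np.array([[0.0, 1.0, 2.0],           # r[u, x]: reward for action u in state x
              [1.0, 0.0, 3.0]])

V = np.zeros(3)
for _ in range(200):
    pi = (r + gamma * (P @ V)).argmax(axis=0)    # greedy policy
    P_pi = P[pi, np.arange(3)]                   # transition rows under pi
    r_pi = r[pi, np.arange(3)]
    V = np.linalg.solve(np.eye(3) - lam * gamma * P_pi,
                        r_pi + (1.0 - lam) * gamma * (P_pi @ V))

print(V, pi)
```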
Model-Free Monte Carlo-like Policy Evaluation
Abstract

Cited by 1 (1 self)
We propose an algorithm for estimating the finite-horizon expected return of a closed-loop control policy from an a priori given (off-policy) sample of one-step transitions. It averages cumulated rewards along a set of “broken trajectories” made of one-step transitions selected from the sample on the basis of the control policy. Under some Lipschitz continuity assumptions on the system dynamics, reward function, and control policy, we provide bounds on the bias and variance of the estimator that depend only on the Lipschitz constants, on the number of broken trajectories used in the estimator, and on the sparsity of the sample of one-step transitions.
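A rough sketch of the "broken trajectories" idea (an illustration under toy assumptions, not the paper's exact algorithm): from an off-policy batch of one-step transitions (x, u, r, x'), rebuild p disjoint pseudo-trajectories by repeatedly jumping to the unused transition whose (state, action) pair is closest to the current state and the action the evaluated policy would take, then average the cumulated rewards. The system, policy, and distance metric below are all made up:

```python
import numpy as np

rng = np.random.default_rng(1)

def policy(x):
    return -0.5 * x                      # deterministic policy to evaluate

# Batch of one-step transitions from a toy 1-D system x' = 0.9 x + u,
# with reward -x^2 (penalizing distance from the origin).
X = rng.uniform(-2.0, 2.0, size=500)
U = rng.uniform(-1.0, 1.0, size=500)
R = -X**2
Xn = 0.9 * X + U
batch = list(zip(X, U, R, Xn))

def mfmc_estimate(x0, T, p):
    """Average return of `policy` over p disjoint broken trajectories of length T."""
    used, returns = set(), []
    for _ in range(p):
        x, total = x0, 0.0
        for _ in range(T):
            u = policy(x)
            # nearest unused transition in (state, action) space (L1 metric)
            i = min((j for j in range(len(batch)) if j not in used),
                    key=lambda j: abs(batch[j][0] - x) + abs(batch[j][1] - u))
            used.add(i)
            total += batch[i][2]         # cumulate the stored reward
            x = batch[i][3]              # jump to the stored successor state
        returns.append(total)
    return float(np.mean(returns))

est = mfmc_estimate(x0=1.0, T=5, p=10)
print(est)
```

The denser the sample, the closer each selected transition matches what the policy would actually do, which is the intuition behind the paper's Lipschitz-based bias bound.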
Kernel-Based Reinforcement Learning Using Bellman Residual Elimination
 JOURNAL OF MACHINE LEARNING RESEARCH
Abstract

Cited by 1 (0 self)
This paper presents a class of new approximate policy iteration algorithms for solving infinite-horizon, discounted Markov decision processes (MDPs) for which a model of the system is available. The algorithms are similar in spirit to Bellman residual minimization methods. However, by exploiting kernel-based regression techniques with nondegenerate kernel functions as the underlying cost-to-go function approximation architecture, the new algorithms are able to explicitly construct cost-to-go solutions for which the Bellman residuals are identically zero at a set of chosen sample states. For this reason, we have named our approach Bellman residual elimination (BRE). Since the Bellman residuals are zero at the sample states, our BRE algorithms can be proven to reduce to exact policy iteration in the limit of sampling the entire state space. Furthermore, by exploiting knowledge of the model, the BRE algorithms eliminate the need to perform trajectory simulations and therefore do not suffer from simulation noise effects. The theoretical basis of our approach is a pair of reproducing kernel Hilbert spaces corresponding to the cost and Bellman residual function spaces, respectively. By constructing an invertible linear mapping between …
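A minimal sketch of the residual-elimination idea for the simplest case, deterministic policy evaluation with a known model (an illustrative reconstruction, not the paper's algorithm): represent the cost-to-go as a kernel expansion J(x) = Σ_i α_i k(x_i, x) and solve a linear system so that the Bellman residual J(x) − g(x) − γJ(f(x)) is exactly zero at every sample state. Dynamics f, stage cost g, and kernel are invented for the sketch:

```python
import numpy as np

gamma = 0.9
f = lambda x: 0.8 * x                 # known closed-loop dynamics (toy)
g = lambda x: x**2                    # stage cost
k = lambda a, b: np.exp(-4.0 * (a[:, None] - b[None, :])**2)  # Gaussian kernel

xs = np.linspace(-2.0, 2.0, 9)        # chosen sample states
K = k(xs, xs)                         # K[i, j] = k(x_i, x_j)
Kf = k(f(xs), xs)                     # kernel rows evaluated at successor states
# Zero Bellman residual at every sample state: (K - gamma*Kf) alpha = g(xs)
alpha = np.linalg.solve(K - gamma * Kf, g(xs))

J = lambda x: k(np.atleast_1d(x), xs) @ alpha   # cost-to-go approximation
print(J(xs))
```

Because the residual is zero at the samples by construction, densifying the sample set drives the method toward exact policy evaluation, mirroring the paper's limit argument for BRE.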