Agent capability in persistent mission planning using approximate dynamic programming
in American Control Conference, 2010
Abstract

Cited by 7 (2 self)
This paper presents an extension of our previous work on the persistent surveillance problem. An extended problem formulation incorporates real-time changes in agent capabilities, as estimated by an onboard health-monitoring system, in addition to the existing communication constraints, stochastic sensor-failure and fuel-flow models, and the basic constraints of providing surveillance coverage using a team of autonomous agents. An approximate policy for the persistent surveillance problem is computed using a parallel, distributed implementation of the approximate dynamic programming algorithm known as Bellman Residual Elimination. This paper also presents flight-test results which demonstrate that this approximate policy correctly coordinates the team to simultaneously provide reliable surveillance coverage and a communications link for the duration of the mission, and appropriately re-tasks agents to maintain these services in the event of agent capability degradation.
Model-Free Monte Carlo-like Policy Evaluation
Abstract

Cited by 6 (3 self)
We propose an algorithm for estimating the finite-horizon expected return of a closed-loop control policy from an a priori given (off-policy) sample of one-step transitions. It averages cumulated rewards along a set of “broken trajectories” made of one-step transitions selected from the sample on the basis of the control policy. Under some Lipschitz-continuity assumptions on the system dynamics, reward function, and control policy, we provide bounds on the bias and variance of the estimator that depend only on the Lipschitz constants, on the number of broken trajectories used in the estimator, and on the sparsity of the sample of one-step transitions.
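The broken-trajectory estimator described in this abstract can be sketched in a few lines. This is a minimal illustration assuming a one-dimensional state and action space, a deterministic policy, and a simple absolute-distance metric for selecting transitions (all our simplifications, not the paper's exact method):

```python
def mfmc_estimate(transitions, policy, x0, horizon, n_traj):
    """Model-Free Monte Carlo-like return estimate (illustrative sketch).

    transitions: list of one-step samples (x, u, r, x_next), floats here.
    policy:      deterministic control policy, state -> action.
    Averages cumulated rewards along `n_traj` "broken trajectories": each
    step consumes the yet-unused sample whose (state, action) pair is
    closest to the current state and the policy's chosen action.
    """
    used = set()
    returns = []
    for _ in range(n_traj):
        x, total = x0, 0.0
        for _ in range(horizon):
            u = policy(x)
            # nearest unused transition in (state, action) space
            i = min(
                (j for j in range(len(transitions)) if j not in used),
                key=lambda j: abs(transitions[j][0] - x) + abs(transitions[j][1] - u),
            )
            used.add(i)
            _, _, r, x_next = transitions[i]
            total += r
            x = x_next
        returns.append(total)
    return sum(returns) / len(returns)
```

Each sample transition is consumed at most once across all broken trajectories, which is why the bias and variance bounds in the abstract depend on the sparsity of the sample.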
Convergence analysis of on-policy LSPI for multidimensional continuous state- and action-space MDPs and extension with orthogonal polynomial approximation. Working paper
, 2010
Abstract

Cited by 5 (1 self)
We propose an online, on-policy least-squares policy iteration (LSPI) algorithm which can be applied to infinite-horizon problems where states and controls are vector-valued and continuous. We do not assume special structure such as linear dynamics or additive noise, and we assume that the expectation cannot be computed exactly. We use the concept of the post-decision state variable to eliminate the expectation inside the optimization problem. We provide a formal convergence analysis of the algorithm under the assumption that value functions are spanned by finitely many known basis functions. Furthermore, the convergence result extends to the ... Central to the solution of Markov decision processes is Bellman's equation, which is often written in the standard form (Puterman, 1994): V_t(x_t) = max_{u_t ∈ U} { C(x_t, u_t) + γ ∑_{x′} P(x′ | x_t, u_t) V_{t+1}(x′) }
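The standard-form Bellman equation quoted above, and the post-decision-state reformulation that removes the expectation from inside the maximization, can be written as follows. This is a sketch in Powell-style notation, which is our assumption about the paper's conventions:

```latex
% Standard form: the expectation over the next state x_{t+1} sits inside the max
V_t(x_t) = \max_{u_t \in U} \Big\{ C(x_t, u_t)
           + \gamma \, \mathbb{E}\big[ V_{t+1}(x_{t+1}) \mid x_t, u_t \big] \Big\}

% Post-decision state x_t^u: the state immediately after choosing u_t,
% before the exogenous noise arrives
V_t^u(x_t^u) = \mathbb{E}\big[ V_{t+1}(x_{t+1}) \mid x_t^u \big]

% The inner problem is now deterministic; the expectation has moved outside the max
V_t(x_t) = \max_{u_t \in U} \big\{ C(x_t, u_t) + \gamma \, V_t^u(x_t^u) \big\}
```

Because the maximization no longer contains an expectation, the value function around the post-decision state can be fitted by standard least-squares regression from observed transitions, which is the opening the LSPI algorithm above exploits.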
Scaling up approximate value iteration with options: Better policies with fewer iterations
 in Proceedings of the 31st International Conference on Machine Learning
, 2014
Abstract

Cited by 4 (2 self)
We show how options, a class of control structures encompassing primitive and temporally extended actions, can play a valuable role in planning in MDPs with continuous state-spaces. Analyzing the convergence rate of Approximate Value Iteration with options reveals that, for pessimistic initial value-function estimates, options can speed up convergence compared to planning with only primitive actions, even when the temporally extended actions are suboptimal and sparsely scattered throughout the state-space. Our experimental results in an optimal replacement task and a complex inventory-management task demonstrate the potential for options to speed up convergence in practice. We show that options induce faster convergence to the optimal value function, which implies deriving better policies with fewer iterations.
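The claim that options speed up convergence from pessimistic initializations can be illustrated on a toy chain MDP (our own example, not one of the paper's benchmarks): a backup through a multi-step option model propagates value across the option's whole duration in one iteration:

```python
# Toy 4-state chain: moving right costs nothing until the move into the
# absorbing goal state 3, which pays reward 1.  One option jumps 0 -> 3
# in 3 steps; its model stores the cumulated discounted reward gamma**2
# and the duration-3 discount gamma**3.

GAMMA = 0.9

def backup(V, use_option):
    """One synchronous value-iteration backup, optionally with the option."""
    new = list(V)
    for s in range(3):                       # state 3 is absorbing
        r = 1.0 if s == 2 else 0.0           # reward for entering state 3
        new[s] = r + GAMMA * V[s + 1]        # primitive action backup
    if use_option:
        # multi-step option backup from state 0 (duration 3)
        new[0] = max(new[0], GAMMA**2 * 1.0 + GAMMA**3 * V[3])
    return new

def run(n_iters, use_option):
    V = [0.0, 0.0, 0.0, 0.0]                 # pessimistic initialization
    for _ in range(n_iters):
        V = backup(V, use_option)
    return V
```

With the option, a single backup already assigns state 0 its optimal value gamma**2; with primitive actions alone, the same value needs three backups to propagate down the chain, matching the abstract's convergence-rate claim.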
Practical Reinforcement Learning Using Representation Learning and Safe Exploration for Large Scale Markov Decision Processes
, 2012
Kernel-Based Reinforcement Learning Using Bellman Residual Elimination
in Journal of Machine Learning Research
Abstract

Cited by 2 (1 self)
This paper presents a class of new approximate policy iteration algorithms for solving infinite-horizon, discounted Markov decision processes (MDPs) for which a model of the system is available. The algorithms are similar in spirit to Bellman residual minimization methods. However, by exploiting kernel-based regression techniques with nondegenerate kernel functions as the underlying cost-to-go function approximation architecture, the new algorithms are able to explicitly construct cost-to-go solutions for which the Bellman residuals are identically zero at a set of chosen sample states. For this reason, we have named our approach Bellman residual elimination (BRE). Since the Bellman residuals are zero at the sample states, our BRE algorithms can be proven to reduce to exact policy iteration in the limit of sampling the entire state space. Furthermore, by exploiting knowledge of the model, the BRE algorithms eliminate the need to perform trajectory simulations and therefore do not suffer from simulation noise effects. The theoretical basis of our approach is a pair of reproducing kernel Hilbert spaces corresponding to the cost and Bellman residual function spaces, respectively. By constructing an invertible linear mapping between ...
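The core step — forcing the Bellman residual to vanish exactly at the sample states via kernel regression — reduces, for a fixed policy on a deterministic model, to solving one linear system. The following is a minimal sketch under those simplifying assumptions; the paper's full algorithms handle stochastic models and wrap this inside policy iteration:

```python
import numpy as np

def bre_fit(xs, f, g, kernel, gamma):
    """Bellman-residual-elimination sketch for a fixed policy.

    xs:     sample states.
    f:      deterministic closed-loop dynamics, x -> next state.
    g:      one-step cost under the fixed policy.
    Represents the cost-to-go as J(x) = sum_j alpha_j k(x_j, x) and solves
    the linear system forcing the Bellman residual
    J(x_i) - g(x_i) - gamma * J(f(x_i)) to equal zero at every sample x_i.
    """
    n = len(xs)
    A = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            A[i, j] = kernel(xs[i], xs[j]) - gamma * kernel(f(xs[i]), xs[j])
    alpha = np.linalg.solve(A, np.array([g(x) for x in xs]))
    return lambda x: float(sum(a * kernel(xj, x) for a, xj in zip(alpha, xs)))
```

By construction the residual is zero (up to round-off) at every sample state, which is the defining property of BRE; no trajectory simulation is needed because the model f and cost g are queried directly.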
Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
, 2011
Abstract

Cited by 1 (0 self)
We consider the discrete-time infinite-horizon optimal control problem formalized by Markov Decision Processes (Puterman, 1994; Bertsekas and Tsitsiklis, 1996). We revisit the work of Bertsekas and Ioffe (1996), which introduced λ Policy Iteration, a family of algorithms parameterized by λ that generalizes the standard algorithms Value Iteration and Policy Iteration and has some deep connections with the Temporal Differences algorithm TD(λ) described by Sutton and Barto (1998). We deepen the original theory developed by the authors by providing convergence-rate bounds which generalize standard bounds for Value Iteration described, for instance, by Puterman (1994). The main contribution of this paper is then to develop the theory of this algorithm when it is used in an approximate form, and to show that this is sound. In doing so, we extend and unify the separate analyses developed by Munos for Approximate Value Iteration (Munos, 2007) and Approximate Policy Iteration (Munos, 2003). Finally, we revisit the use of this algorithm in the training of a Tetris-playing controller, as originally done by Bertsekas and Ioffe (1996). We provide an original performance bound that can be applied to such an undiscounted control problem.
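The way λ Policy Iteration interpolates between the two standard algorithms can be shown in one line of linear algebra for the tabular case. This sketch is our reconstruction of the family's defining value update for a fixed policy π (the algorithms above add greedy policy improvement around it): λ = 0 gives a Value Iteration-style backup T_π V, while λ = 1 jumps straight to T_π's fixed point, i.e. exact policy evaluation.

```python
import numpy as np

def lambda_pi_update(V, P_pi, r_pi, gamma, lam):
    """One lambda-Policy-Iteration value update (tabular sketch):
        V <- V + (I - lam*gamma*P_pi)^{-1} (T_pi V - V),
    where T_pi V = r_pi + gamma * P_pi V for transition matrix P_pi and
    reward vector r_pi of the fixed policy pi.
    """
    n = len(V)
    residual = r_pi + gamma * (P_pi @ V) - V      # T_pi V - V
    M = np.eye(n) - lam * gamma * P_pi
    return V + np.linalg.solve(M, residual)
```

Intermediate values of λ trade off the cheap, contraction-by-γ step of Value Iteration against the expensive but exact solve of Policy Iteration, which is exactly the dial the performance bounds in the abstract quantify.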
Approximate Dynamic Programming for Two-Player Zero-Sum Markov Games
, 2015
Abstract

Cited by 1 (1 self)
This paper provides an analysis of error propagation in Approximate Dynamic Programming applied to zero-sum two-player Stochastic Games. We provide a novel and unified error propagation analysis in L_p-norm of three well-known algorithms adapted to Stochastic Games (namely ...