Results 1–10 of 36
Projected equation methods for approximate solution of large linear systems
Journal of Computational and Applied Mathematics
Kalman Temporal Differences
Journal of Artificial Intelligence Research (JAIR), 2010
Cited by 17 (15 self)
Abstract
Because reinforcement learning suffers from a lack of scalability, online value (and Q-) function approximation has received increasing interest this last decade. This contribution introduces a novel approximation scheme, namely the Kalman Temporal Differences (KTD) framework, which exhibits the following features: sample-efficiency, nonlinear approximation, non-stationarity handling and uncertainty management. A first KTD-based algorithm is provided for deterministic Markov Decision Processes (MDP), which produces biased estimates in the case of stochastic transitions. Then the eXtended KTD framework (XKTD), solving stochastic MDP, is described. Convergence is analyzed for special cases, for both deterministic and stochastic transitions. Related algorithms are evaluated on classical benchmarks. They compare favorably to the state of the art while exhibiting the announced features.
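The Kalman-filter view of TD learning described in this abstract can be illustrated with a minimal linear, KTD-V-style update for deterministic transitions. This is an illustrative sketch, not the paper's exact algorithm; the function name and noise settings are assumptions:

```python
import numpy as np

def ktd_v_update(theta, P, phi_s, phi_next, r, gamma=0.9,
                 evo_noise=1e-3, obs_noise=0.1):
    """One Kalman-style update for a linear value function
    V(s) = theta @ phi(s), treating the TD relation
    r = (phi(s) - gamma * phi(s')) @ theta + noise as the observation."""
    h = phi_s - gamma * phi_next                  # observation features
    P_pred = P + evo_noise * np.eye(len(theta))   # random-walk prediction step
    innovation = r - h @ theta                    # temporal-difference error
    S = h @ P_pred @ h + obs_noise                # innovation variance
    K = P_pred @ h / S                            # Kalman gain
    theta = theta + K * innovation                # correct the parameter mean
    P = P_pred - np.outer(K, h @ P_pred)          # shrink parameter covariance
    return theta, P
```

On a one-state deterministic loop with reward 1 and gamma = 0.9, repeated updates drive theta toward the true value 1/(1 - 0.9) = 10, while the covariance P doubles as the uncertainty estimate the abstract mentions.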
New Error Bounds for Approximations from Projected Linear Equations
2008
Cited by 16 (9 self)
Abstract
We consider linear fixed point equations and their approximations by projection on a low-dimensional subspace. We derive new bounds on the approximation error of the solution, which are expressed in terms of low-dimensional matrices and can be computed by simulation. When the fixed point mapping is a contraction, as is typically the case in Markovian decision processes (MDP), one of our bounds is always sharper than the standard worst-case bounds, and another one is often sharper. Our bounds also apply to the non-contraction case, including policy evaluation in MDP with nonstandard projections that enhance exploration. There are no error ...
Kalman Temporal Differences: the deterministic case
In IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2009), 2009
Cited by 15 (11 self)
Abstract
This paper deals with value function and Q-function approximation in deterministic Markovian decision processes. A general statistical framework based on the Kalman filtering paradigm is introduced. Its principle is to adopt a parametric representation of the value function, to model the associated parameter vector as a random variable, and to minimize the mean-squared error of the parameters conditioned on past observed transitions. From this general framework, which will be called Kalman Temporal Differences (KTD), and using an approximation scheme called the unscented transform, a family of algorithms is derived, namely KTD-V, KTD-SARSA and KTD-Q, which aim respectively at estimating the value function of a given policy, the Q-function of a given policy and the optimal Q-function. The proposed approach holds for linear and nonlinear parameterizations. This framework is discussed and potential advantages and shortcomings are highlighted.
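The unscented transform the abstract relies on can be sketched independently of KTD: instead of linearizing a nonlinearity, it propagates a small set of deterministically chosen sigma points through it. Below is the standard textbook formulation; KTD's exact scaling parameters may differ:

```python
import numpy as np

def sigma_points(mean, cov, kappa=1.0):
    """Return 2n+1 sigma points and weights whose weighted mean and
    covariance reproduce (mean, cov) exactly (standard unscented transform)."""
    n = len(mean)
    L = np.linalg.cholesky((n + kappa) * cov)     # matrix square root
    pts = [mean]
    for i in range(n):
        pts.append(mean + L[:, i])
        pts.append(mean - L[:, i])
    w = np.full(2 * n + 1, 1.0 / (2 * (n + kappa)))
    w[0] = kappa / (n + kappa)
    return np.array(pts), w
```

Pushing each sigma point through a nonlinear function and re-averaging with the weights gives the approximate mean and covariance of the transformed random variable, which is how KTD handles nonlinear value parameterizations.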
Predictive state temporal difference learning
Cited by 13 (8 self)
Abstract
We propose a new approach to value function approximation which combines linear temporal difference reinforcement learning with subspace identification. In practical applications, reinforcement learning (RL) is complicated by the fact that state is either high-dimensional or partially observable. Therefore, RL methods are designed to work with features of state rather than state itself, and the success or failure of learning is often determined by the suitability of the selected features. By comparison, subspace identification (SSID) methods are designed to select a feature set which preserves as much information as possible about state. In this paper we connect the two approaches, looking at the problem of reinforcement learning with a large set of features, each of which may only be marginally useful for value function approximation. We introduce a new algorithm for this situation, called Predictive State Temporal Difference (PSTD) learning. As in SSID for predictive state representations, PSTD finds a linear compression operator that projects a large set of features down to a small set that preserves the maximum amount of predictive information. As in RL, PSTD then uses a Bellman recursion to estimate a value function. We discuss the connection between PSTD and prior approaches in RL and SSID. We prove that PSTD is statistically consistent, perform several experiments that illustrate its properties, and demonstrate its potential on a difficult optimal stopping problem.
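The compression-then-TD pipeline described above can be sketched in a few lines: compress a large feature set with an SVD of the features' one-step cross-covariance, then solve the projected fixed-point (LSTD-style) equation in the compressed space. This is a simplified illustration of the idea, not the paper's exact PSTD estimator:

```python
import numpy as np

def compressed_lstd(Phi, Phi_next, R, k, gamma=0.9):
    """Phi, Phi_next: (T, d) feature matrices for s_t and s_{t+1};
    R: (T,) rewards. Compress d features down to k via an SVD of the
    one-step cross-covariance, then run LSTD in the compressed space."""
    C = Phi.T @ Phi_next / len(Phi)        # predictive cross-covariance
    U, _, _ = np.linalg.svd(C)
    Uk = U[:, :k]                          # linear compression operator
    Z, Z_next = Phi @ Uk, Phi_next @ Uk    # compressed features
    A = Z.T @ (Z - gamma * Z_next)         # LSTD normal equations
    b = Z.T @ R
    w = np.linalg.solve(A, b)
    return Uk, w                           # value estimate: phi(s) @ Uk @ w
```

On a toy 3-state cycle whose 10 features are redundant linear mixtures of the state indicator, compressing to k = 3 loses nothing and the LSTD solve recovers the exact values.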
Q-Learning and Policy Iteration Algorithms for Stochastic Shortest Path Problems
Lab. for Information and Decision Systems Report LIDS-P-2871, MIT, 2011
Cited by 9 (8 self)
Abstract
We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in [BY10b]. The main difference from the standard policy iteration approach is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm solves an optimal stopping problem inexactly with a finite number of value iterations. The main advantage over the standard Q-learning approach is lower overhead: most iterations do not require a minimization over all controls, in the spirit of modified policy iteration. We prove the convergence of asynchronous deterministic and stochastic lookup table implementations of our method for undiscounted, total cost stochastic shortest path problems. These implementations overcome some of the traditional convergence difficulties of asynchronous modified policy iteration, and provide policy iteration-like alternative Q-learning schemes with as reliable convergence as classical Q-learning. We also discuss methods that use basis function approximations of Q-factors and we give an associated error bound.
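The family of methods sketched in this abstract interpolates between policy iteration and value iteration by evaluating each policy only approximately. A minimal model-based analogue over Q-factors is plain modified policy iteration, shown below without the optimal-stopping evaluation or the asynchronous sampling the abstract describes; the deterministic shortest-path instance in the test is invented for illustration:

```python
import numpy as np

def modified_policy_iteration_q(next_state, cost, n_states, n_actions,
                                m=3, iters=30):
    """Deterministic shortest-path sketch: next_state[s][a] is the
    successor of s under a (index n_states means 'terminal', with
    cost-to-go 0); m inexact evaluation sweeps per policy improvement."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        mu = Q.argmin(axis=1)                               # policy improvement
        for _ in range(m):                                  # inexact evaluation
            V = np.append(Q[np.arange(n_states), mu], 0.0)  # V under mu + terminal
            Q = np.array([[cost[s][a] + V[next_state[s][a]]
                           for a in range(n_actions)]
                          for s in range(n_states)])
    return Q
```

With m = 1 this degenerates to value iteration on Q-factors and with m = infinity to exact policy iteration; intermediate m trades per-iteration cost against the number of improvement steps, which is the trade-off the abstract exploits.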
Performance loss bounds for approximate value iteration with state aggregation
Mathematics of Operations Research, 2005
Cited by 9 (1 self)
Abstract
We consider approximate value iteration with a parameterized approximator in which the state space is partitioned and the optimal cost-to-go function over each partition is approximated by a constant. We establish performance loss bounds for policies derived from approximations associated with fixed points. These bounds identify benefits to using invariant distributions of appropriate policies as projection weights. Such projection weighting relates to what is done by temporal-difference learning. Our analysis also leads to the first performance loss bound for approximate value iteration with an average-cost objective. Key words: approximate value iteration; state aggregation; temporal-difference learning
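The aggregation scheme analyzed in this abstract can be written down compactly: apply the Bellman (cost-minimizing) operator, then project onto piecewise-constant functions by taking a weighted average inside each partition. A minimal sketch, where the small MDP in the test and the uniform weight choice are illustrative assumptions:

```python
import numpy as np

def aggregated_value_iteration(P, c, groups, weights, gamma=0.9, iters=50):
    """P: (S, A, S) transition probabilities, c: (S, A) costs,
    groups: (S,) partition index per state, weights: (S,) projection
    weights. The value function is constrained to be constant on each
    partition; theta holds one value per partition."""
    n_groups = groups.max() + 1
    theta = np.zeros(n_groups)
    for _ in range(iters):
        V = theta[groups]                         # lift to a per-state function
        TV = (c + gamma * (P @ V)).min(axis=1)    # Bellman operator (costs -> min)
        for g in range(n_groups):                 # weighted projection per partition
            mask = groups == g
            theta[g] = np.average(TV[mask], weights=weights[mask])
    return theta
```

Choosing `weights` as the invariant distribution of a good policy, rather than uniform, is exactly the weighting the abstract argues improves the performance loss bounds.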
Tracking Value Function Dynamics to Improve Reinforcement Learning with Piecewise Linear Function Approximation
Cited by 8 (0 self)
Abstract
Reinforcement learning algorithms can become unstable when combined with linear function approximation. Algorithms that minimize the mean-square Bellman error are guaranteed to converge, but often do so slowly or are computationally expensive. In this paper, we propose to improve the convergence speed of piecewise linear function approximation by tracking the dynamics of the value function with the Kalman filter using a random-walk model. We cast this as a general framework in which we implement the TD, Q-Learning and MAXQ algorithms for different domains, and report empirical results demonstrating improved learning speed over previous methods.
Qlearning and enhanced policy iteration in discounted dynamic programming
2012
Cited by 7 (3 self)
Abstract
We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy iteration-like algorithm for finding the optimal state costs or Q-factors. The main difference is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm requires solving an optimal stopping problem. The solution of this problem may be inexact, with a finite number of value iterations, in the spirit of modified policy iteration. The stopping problem structure is incorporated into the standard Q-learning algorithm, to obtain a new method that is intermediate between policy iteration and Q-learning/value iteration. Thanks to its special contraction properties, our method overcomes some of the traditional convergence difficulties of modified policy iteration, and admits asynchronous deterministic and stochastic iterative implementations, with lower overhead and/or more reliable convergence than existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, our algorithm effectively addresses the inherent difficulties of approximate policy iteration due to inadequate exploration of the state and control spaces.
Parametric Value Function Approximation: a Unified View
Cited by 6 (5 self)
Abstract
Reinforcement learning (RL) is a machine learning answer to the optimal control problem. It consists of learning an optimal control policy through interactions with the system to be controlled, the quality of this policy being quantified by the so-called value function. An important RL subtopic is to approximate this function when the system is too large for an exact representation. This survey reviews and unifies state-of-the-art methods for parametric value function approximation by grouping them into three main categories: bootstrapping, residual and projected fixed-point approaches. Related algorithms are derived by considering one of the associated cost functions and a specific way to minimize it, almost always a stochastic gradient descent or a recursive least-squares approach. Index Terms—Reinforcement learning, value function approximation, survey.