Results 1–10 of 18
Least Squares Policy Evaluation Algorithms With Linear Function Approximation: Theory and Applications
, 2002
Abstract
Cited by 65 (9 self)
We consider policy evaluation algorithms within the context of infinite-horizon dynamic programming problems with discounted cost. We focus on discrete-time dynamic systems with a large number of states, and we discuss two methods, which use simulation, temporal differences, and linear cost function approximation. The first method is a new gradient-like algorithm involving least-squares subproblems and a diminishing stepsize, which is based on the λ-policy iteration method of Bertsekas and Ioffe. The second method is the LSTD(λ) algorithm recently proposed by Boyan, which for λ = 0 coincides with the linear least-squares temporal-difference algorithm of Bradtke and Barto. At present, there is only a convergence result by Bradtke and Barto for the LSTD(0) algorithm. Here, we strengthen this result by showing the convergence of LSTD(λ), with probability 1, for every λ ∈ [0, 1].
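The LSTD(λ) computation this abstract describes, accumulating a matrix and vector from simulated transitions with eligibility traces and then solving a linear system, can be sketched roughly as follows. This is an illustrative sketch only: the trajectory format, the feature-map interface, and the regularization term are my own assumptions, not taken from the paper.

```python
import numpy as np

def lstd_lambda(trajectory, phi, gamma=0.99, lam=0.9, n_features=4, reg=1e-6):
    """Estimate weights theta with V(s) ≈ phi(s)·theta from one sampled trajectory.

    trajectory: list of (state, reward, next_state) transitions.
    phi: feature map, state -> np.ndarray of length n_features
         (by convention here, phi(terminal) is the zero vector).
    """
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    z = np.zeros(n_features)             # eligibility trace
    for s, r, s_next in trajectory:
        f, f_next = phi(s), phi(s_next)
        z = gamma * lam * z + f          # decay and accumulate the trace
        A += np.outer(z, f - gamma * f_next)
        b += z * r
    # small ridge term keeps the solve well-posed on short trajectories
    return np.linalg.solve(A + reg * np.eye(n_features), b)
```

On a deterministic two-state chain with one-hot features (s0 -> s1 with reward 0, s1 -> terminal with reward 1), a single episode already recovers V(s1) = 1 and V(s0) = γ.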
Automatic basis function construction for approximate dynamic programming and reinforcement learning
 In Cohen and Moore (2006)
, 2006
Abstract
Cited by 62 (2 self)
We address the problem of automatically constructing basis functions for linear approximation of the value function of a Markov Decision Process (MDP). Our work builds on results by Bertsekas and Castañon (1989), who proposed a method for automatically aggregating states to speed up value iteration. We propose to use neighborhood component analysis (Goldberger et al., 2005), a dimensionality reduction technique created for supervised learning, in order to map a high-dimensional state space to a low-dimensional space, based on the Bellman error or on the temporal difference (TD) error. We then place basis functions in the lower-dimensional space. These are added as new features for the linear function approximator. This approach is applied to a high-dimensional inventory control problem.
Off-policy temporal-difference learning with function approximation
 Proceedings of the 18th International Conference on Machine Learning
, 2001
Abstract
Cited by 45 (10 self)
We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms the basis for popular reinforcement learning methods such as Q-learning, which has been known to diverge with linear function approximation, and because it is critical to the practical utility of multi-scale, multi-goal learning frameworks such as options, HAMs, and MAXQ. Our new algorithm combines TD(λ) over state–action pairs with importance sampling ideas from our previous work. We prove that, given training under any ε-soft policy, the algorithm converges w.p.1 to a close approximation (as in Tsitsiklis and Van Roy, 1997; Tadic, 2001) to the action-value function for an arbitrary target policy. Variations of the algorithm designed to reduce variance introduce additional bias but are also guaranteed convergent. We also illustrate our method empirically on a small policy evaluation problem. Our current results are limited to episodic tasks with episodes of bounded length. Although Q-learning remains the most popular of all reinforcement learning algorithms, it has been known since about 1996 that it is unsound with linear function approximation (see Gordon, 1995; Bertsekas and Tsitsiklis, 1996). The most telling counterexample, due to Baird (1995), is a seven-state Markov decision process with linearly independent feature vectors, for which an exact solution exists, yet ... (This is a retypeset version of an article published in the Proceedings.)
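The combination this abstract describes, TD(λ) over state–action pairs weighted by importance sampling, is commonly realized with per-decision importance ratios folded into the eligibility trace. The sketch below shows that generic form; the transition format, the trace weighting, and the function name are my own assumptions and may differ from the paper's exact algorithm.

```python
import numpy as np

def off_policy_td_lambda_episode(transitions, theta, alpha=0.1, gamma=0.9, lam=0.8):
    """One episode of off-policy TD(lambda) with per-decision importance sampling.

    transitions: list of (phi, rho, r, phi_next), where phi and phi_next are
    feature vectors of the current and next state-action pair, and
    rho = pi(a|s) / b(a|s) is the importance ratio for the action taken.
    """
    z = np.zeros_like(theta)
    for phi, rho, r, phi_next in transitions:
        delta = r + gamma * phi_next @ theta - phi @ theta   # TD error
        z = rho * (gamma * lam * z + phi)                    # ratio-weighted trace
        theta = theta + alpha * delta * z
    return theta
```

With rho = 1 everywhere this reduces to ordinary on-policy linear TD(λ).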
A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning
 Proceedings of the International Joint Conference on Machine Learning
, 2001
Abstract
Cited by 36 (2 self)
The traditional Kalman filter can be viewed as a recursive stochastic algorithm that approximates an unknown function via a linear combination of prespecified basis functions given a sequence of noisy samples. In this paper, we generalize the algorithm to one that approximates the fixed point of an operator that is known to be a Euclidean-norm contraction. Instead of noisy samples of the desired fixed point, the algorithm updates parameters based on noisy samples of functions generated by application of the operator, in the spirit of Robbins–Monro stochastic approximation. The algorithm is motivated by temporal-difference learning, and our developments lead to a possibly more efficient variant of temporal-difference learning. We establish convergence of the algorithm and explore efficiency gains through computational experiments involving optimal stopping and queueing problems.
An Analysis of Reinforcement Learning with Function Approximation
Abstract
Cited by 22 (0 self)
We address the problem of computing the optimal Q-function in Markov decision problems with infinite state space. We analyze the convergence properties of several variations of Q-learning when combined with function approximation, extending the analysis of TD-learning in (Tsitsiklis & Van Roy, 1996a) to stochastic control settings. We identify conditions under which such approximate methods converge with probability 1. We conclude with a brief discussion on the general applicability of our results and compare them with several related works.
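The Q-learning-with-function-approximation setting this abstract analyzes is usually instantiated as a semi-gradient update on linear action-value features. The following is a generic sketch of that baseline update, not the paper's specific variations; the interface and names are my own.

```python
import numpy as np

def q_learning_step(theta, phi_sa, r, phi_next_all, alpha=0.1, gamma=0.9):
    """One semi-gradient Q-learning update with linear action-value features.

    phi_sa: feature vector of the (state, action) pair just taken.
    phi_next_all: feature vectors, one per action available in the next state
                  (empty list if the next state is terminal).
    """
    # max-backup over next actions; 0 at terminal states
    q_next = max((phi @ theta for phi in phi_next_all), default=0.0)
    delta = r + gamma * q_next - phi_sa @ theta   # TD error
    return theta + alpha * delta * phi_sa
```

It is exactly this max backup, combined with linear approximation, that the divergence counterexamples cited in the listing above (e.g., Baird, 1995) exploit.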
A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation
 Advances in Neural Information Processing Systems 21 (to appear)
Abstract
Cited by 20 (7 self)
We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm. We prove that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by LSTD, but without LSTD's quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods. Off-policy methods have an important role to play in the larger ambitions of modern reinforcement learning. In general, updates to a statistic of a dynamical process are said to be "off-policy" if their distribution does not match the dynamics of the process, particularly if the mismatch is due to the way actions are chosen. The prototypical example in reinforcement learning is the learning of the value function for one policy, the target policy, using data obtained while following another policy, the behavior policy. For example, the popular Q-learning algorithm (Watkins 1989) is an off-policy temporal-difference algorithm in which the target policy is greedy with respect to estimated action values, and the behavior policy is something more exploratory, such as a corresponding ε-greedy policy. Off-policy methods are also critical to reinforcement-learning-based efforts to model human-level world knowledge and state representations as predictions of option outcomes (e.g.,
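The GTD scheme sketched in this abstract maintains a second parameter vector that tracks the expected TD update, while the main weights descend the norm of that estimate. The code below reconstructs that two-parameter O(n) update in the spirit described; the exact update equations, step-size choices, and names here are my own reading and may differ in detail from the paper.

```python
import numpy as np

def gtd_step(theta, w, phi, r, phi_next, alpha=0.05, beta=0.05, gamma=0.9):
    """One O(n) GTD-style update for off-policy linear policy evaluation.

    theta: value-function weights, w: auxiliary estimate of the
    expected TD(0) update vector. Both updates are O(n) per step.
    """
    delta = r + gamma * phi_next @ theta - phi @ theta    # TD(0) error
    # descend the squared norm of the expected TD update, using w as its estimate
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    # move w toward the sampled TD update delta * phi
    w = w + beta * (delta * phi - w)
    return theta, w
```

Note there are no likelihood-ratio products anywhere in the update, consistent with the abstract's claim about avoiding importance-sampling corrections.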
Kernel Least-Squares Temporal Difference Learning
Abstract
Cited by 8 (0 self)
Kernel methods have attracted much research interest recently since, by utilizing Mercer kernels, nonlinear and nonparametric versions of conventional supervised or unsupervised learning algorithms can be implemented, and better generalization abilities can usually be obtained. However, kernel methods in reinforcement learning have not been widely studied in the literature. In this paper, we present a novel kernel-based least-squares temporal-difference (TD) learning algorithm called KLSTD(λ), which can be viewed as the kernel version or nonlinear form of the previous linear LSTD(λ) algorithms. By introducing a kernel-based nonlinear mapping, the KLSTD(λ) algorithm is superior to conventional linear TD(λ) algorithms in value function prediction or policy evaluation problems with nonlinear value functions. Furthermore, in KLSTD(λ), eligibility traces in kernel-based TD learning are derived to make use of data more efficiently, which differs from recent work on Gaussian processes in reinforcement learning. Experimental results on a typical value-function prediction problem for a Markov chain demonstrate the
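The core idea in this abstract, replacing LSTD's fixed feature vectors with kernel evaluations at the sampled states, can be sketched for the λ = 0 case as below. This is a schematic reconstruction under my own assumptions (batch of transitions, ridge regularization, no eligibility traces), not the paper's KLSTD(λ) algorithm itself.

```python
import numpy as np

def klstd(states, rewards, next_states, kernel, gamma=0.9, reg=1e-3):
    """Kernelized least-squares TD, lambda = 0 sketch.

    The value function is represented as
        V(s) ≈ sum_i coef[i] * kernel(states[i], s),
    i.e. the "features" of a state are its kernel values against the
    sampled states, and the usual LSTD system is solved in that basis.
    """
    n = len(states)
    K = np.array([[kernel(si, sj) for sj in states] for si in states])
    K_next = np.array([[kernel(si, sj) for sj in next_states] for si in states])
    A = K @ (K - gamma * K_next).T           # sum_t k_t (k_t - gamma k'_t)^T
    b = K @ np.array(rewards)                # sum_t k_t r_t
    return np.linalg.solve(A + reg * np.eye(n), b)
```

With a delta kernel on a finite chain this collapses back to tabular LSTD, which makes a convenient sanity check.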
Convergence analysis of on-policy LSPI for multidimensional continuous state- and action-space MDPs and extension with orthogonal polynomial approximation. Working paper
, 2010
Abstract
Cited by 5 (1 self)
We propose an online, on-policy least-squares policy iteration (LSPI) algorithm which can be applied to infinite-horizon problems in which states and controls are vector-valued and continuous. We do not exploit special structure such as linearity or additive noise, and we assume that the expectation cannot be computed exactly. We use the concept of the post-decision state variable to eliminate the expectation inside the optimization problem. We provide a formal convergence analysis of the algorithm under the assumption that value functions are spanned by finitely many known basis functions. Furthermore, the convergence result extends to the ... Central to the solution of Markov decision processes is Bellman's equation, which is often written in the standard form (Puterman, 1994): V_t(x_t) = max_{u_t ∈ U} { C(x_t, u_t) + γ ∑ ...
A policy gradient method for semi-Markov decision processes
, 2002
Abstract
Cited by 3 (1 self)
Solving a semi-Markov decision process (SMDP) using value or policy iteration requires precise knowledge of the probabilistic model and suffers from the curse of dimensionality. To overcome these limitations, we present a reinforcement learning approach in which one optimizes the SMDP performance criterion with respect to a family of parameterized policies. We propose an online algorithm that simultaneously estimates the gradient of the performance criterion and optimizes it using stochastic approximation. We apply our algorithm to call admission control. Index Terms: Stochastic processes; Semi-Markov decision process; Policy gradient; Two-time-scale; Call admission control.