Results 1 - 9 of 9
An analysis of temporal-difference learning with function approximation
 IEEE Transactions on Automatic Control
, 1997
Abstract
Cited by 218 (7 self)
We discuss the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain. The algorithm we analyze updates parameters of a linear function approximator online, during a single endless trajectory of an irreducible aperiodic Markov chain with a finite or infinite state space. We present a proof of convergence (with probability 1), a characterization of the limit of convergence, and a bound on the resulting approximation error. Furthermore, our analysis is based on a new line of reasoning that provides new intuition about the dynamics of temporal-difference learning. In addition to proving new and stronger positive results than those previously available, we identify the significance of online updating and potential hazards associated with the use of nonlinear function approximators. First, we prove that divergence may occur when updates are not based on trajectories of the Markov chain. This fact reconciles positive and negative results that have been discussed in the literature, regarding the soundness of temporal-difference learning. Second, we present an example illustrating the possibility of divergence when temporal-difference learning is used in the presence of a nonlinear function approximator.
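The online linear update this abstract describes can be sketched as follows. The three-state chain, per-stage costs, feature vectors, and step-size schedule are illustrative assumptions, not taken from the paper; only the update rule itself is the standard temporal-difference step for discounted cost-to-go.

```python
import numpy as np

# Hypothetical Markov chain with per-stage costs (illustrative only).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.0, 0.8]])    # irreducible, aperiodic transition matrix
cost = np.array([1.0, 0.0, 2.0])   # per-stage cost c(s)
gamma = 0.9                        # discount factor
phi = np.array([[1.0, 0.0],        # feature vector phi(s) for each state
                [0.0, 1.0],
                [1.0, 1.0]])

rng = np.random.default_rng(0)
theta = np.zeros(2)                # weights of the linear approximator
s = 0
for t in range(200_000):           # one endless trajectory, updated online
    s_next = rng.choice(3, p=P[s])
    # TD(0) error for the discounted cost-to-go approximation phi(s) @ theta
    delta = cost[s] + gamma * phi[s_next] @ theta - phi[s] @ theta
    theta += (100.0 / (1000.0 + t)) * delta * phi[s]  # diminishing step size
    s = s_next

V_approx = phi @ theta             # approximate cost-to-go at each state
```

Because there are three states but only two basis functions, the weights converge to the fixed point of a projected Bellman equation rather than to the exact cost-to-go, consistent with the approximation-error bound the abstract mentions.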
Least Squares Policy Evaluation Algorithms With Linear Function Approximation
 Theory and Applications
, 2002
Abstract
Cited by 65 (9 self)
We consider policy evaluation algorithms within the context of infinite-horizon dynamic programming problems with discounted cost. We focus on discrete-time dynamic systems with a large number of states, and we discuss two methods, which use simulation, temporal differences, and linear cost function approximation. The first method is a new gradient-like algorithm involving least-squares subproblems and a diminishing step size, which is based on the λ-policy iteration method of Bertsekas and Ioffe. The second method is the LSTD(λ) algorithm recently proposed by Boyan, which for λ = 0 coincides with the linear least-squares temporal-difference algorithm of Bradtke and Barto. At present, there is only a convergence result by Bradtke and Barto for the LSTD(0) algorithm. Here, we strengthen this result by showing the convergence of LSTD(λ), with probability 1, for every λ ∈ [0, 1].
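For λ = 0, the LSTD algorithm mentioned above admits a compact sketch: instead of taking stochastic steps, it accumulates a matrix and a vector along the trajectory and finishes with a single linear solve. The two-state chain, costs, and features below are assumptions for illustration.

```python
import numpy as np

# Illustrative two-state chain; the numbers are assumptions, not from the paper.
P = np.array([[0.5, 0.5],
              [0.3, 0.7]])
cost = np.array([1.0, 0.0])
gamma = 0.95
phi = np.array([[1.0, 0.5],
                [0.2, 1.0]])

rng = np.random.default_rng(1)
A = np.zeros((2, 2))
b = np.zeros(2)
s = 0
for _ in range(200_000):
    s_next = rng.choice(2, p=P[s])
    # LSTD(0): A accumulates phi_t (phi_t - gamma phi_{t+1})^T, b accumulates phi_t c_t
    A += np.outer(phi[s], phi[s] - gamma * phi[s_next])
    b += phi[s] * cost[s]
    s = s_next

theta = np.linalg.solve(A, b)   # one linear solve replaces many stochastic steps
```

The solution coincides with the limit point of online TD(0) on the same trajectory, but requires no step-size tuning.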
Learning and Value Function Approximation in Complex Decision Processes
, 1998
Abstract
Cited by 36 (4 self)
In principle, a wide variety of sequential decision problems, ranging from dynamic resource allocation in telecommunication networks to financial risk management, can be formulated in terms of stochastic control and solved by the algorithms of dynamic programming. Such algorithms compute and store a value function, which evaluates expected future reward as a function of the current state. Unfortunately, exact computation of the value function typically requires time and storage that grow proportionately with the number of states, and consequently, the enormous state spaces that arise in practical applications render the algorithms intractable. In this thesis, we study tractable methods that approximate the value function. Our work builds on research in an area of artificial intelligence known as reinforcement learning. A point of focus of this thesis is temporal-difference learning, a stochastic algorithm inspired to some extent by phenomena observed in animal behavior. Given a selection of...
A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning
 Proceedings of the International Joint Conference on Machine Learning
, 2001
Abstract
Cited by 36 (2 self)
The traditional Kalman filter can be viewed as a recursive stochastic algorithm that approximates an unknown function via a linear combination of prespecified basis functions given a sequence of noisy samples. In this paper, we generalize the algorithm to one that approximates the fixed point of an operator that is known to be a Euclidean norm contraction. Instead of noisy samples of the desired fixed point, the algorithm updates parameters based on noisy samples of functions generated by application of the operator, in the spirit of Robbins-Monro stochastic approximation. The algorithm is motivated by temporal-difference learning, and our developments lead to a possibly more efficient variant of temporal-difference learning. We establish convergence of the algorithm and explore efficiency gains through computational experiments involving optimal stopping and queueing problems.
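A minimal Robbins-Monro sketch of the underlying idea, in one dimension rather than the paper's basis-function setting: given only noisy evaluations of a known contraction F, a diminishing-step iteration still converges to the fixed point. The particular F, the noise level, and the step-size schedule are assumptions chosen for illustration.

```python
import numpy as np

def F(x):
    # A Euclidean-norm contraction with modulus 0.5; its fixed point is x* = 6.
    return 0.5 * x + 3.0

rng = np.random.default_rng(2)
x = 0.0
for t in range(50_000):
    noisy = F(x) + rng.normal(scale=1.0)   # only a noisy sample of F(x) is available
    x += (1.0 / (t + 10)) * (noisy - x)    # Robbins-Monro step toward the fixed point
```

The paper's generalized Kalman filter plays the same role for fixed points represented as linear combinations of basis functions, with a more efficient (second-order) use of each sample.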
Improved Temporal Difference Methods with Linear Function Approximation
Abstract
Cited by 26 (4 self)
This chapter considers temporal difference algorithms within the context of infinite-horizon finite-state dynamic programming problems with discounted cost and linear cost function approximation. This problem arises as a subproblem in the policy iteration method of dynamic programming. Additional discussions of such problems can be found in Chapters 12 and 6. The advantage of the method presented here is that it is the first iterative temporal difference method that converges without requiring a diminishing step size. The chapter discusses the connections with Sutton's TD(λ) and with various versions of least-squares methods that are based on value iteration. It is shown using both analysis and experiments that the proposed method is substantially faster, simpler, and more reliable than TD(λ). Comparisons are also made with the LSTD method of Boyan and of Bradtke and Barto.
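A hedged sketch of an LSPE(0)-style iteration of the kind the chapter describes: simulation-based least-squares estimates drive a projected value-iteration step with a constant (unit) step size, so no diminishing step-size schedule is needed. The two-state chain, costs, features, and regularization are illustrative assumptions.

```python
import numpy as np

# Illustrative problem data (assumptions, not from the chapter).
P = np.array([[0.6, 0.4],
              [0.2, 0.8]])
cost = np.array([2.0, 0.0])
gamma = 0.9
phi = np.array([[1.0, 0.0],
                [1.0, 1.0]])

rng = np.random.default_rng(3)
B = np.eye(2)          # regularized Gram-matrix estimate; the bias washes out
A = np.zeros((2, 2))
b = np.zeros(2)
theta = np.zeros(2)
s = 0
for t in range(100_000):
    s_next = rng.choice(2, p=P[s])
    B += np.outer(phi[s], phi[s])
    A += np.outer(phi[s], gamma * phi[s_next] - phi[s])
    b += phi[s] * cost[s]
    # constant step size 1: a simulation-based projected value-iteration step
    theta = theta + np.linalg.solve(B, A @ theta + b)
    s = s_next
```

The iterates converge to the same limit as TD(0) and LSTD(0), but each step is a full least-squares correction rather than a small stochastic step, which is the source of the speed and reliability advantage the chapter reports.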
Average Cost Temporal-Difference Learning
 Automatica
, 1997
Abstract
Cited by 18 (3 self)
We propose a variant of temporal-difference learning that approximates average and differential costs of an irreducible aperiodic Markov chain. Approximations comprise linear combinations of fixed basis functions whose weights are incrementally updated during a single endless trajectory of the Markov chain. We present a proof of convergence (with probability 1), and a characterization of the limit of convergence. We also provide a bound on the resulting approximation error that exhibits an interesting dependence on the "mixing time" of the Markov chain. The results parallel previous work by the authors, involving approximations of discounted cost-to-go.

1 Introduction

Temporal-difference learning, originally proposed by Sutton (1988), is an algorithm for approximating the cost-to-go function of a Markov chain (the expected future cost, as a function of the initial state). Given a set of basis functions, the algorithm tunes a vector of weights so that the weighted combi...
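A sketch of the average-cost variant described above: a scalar estimate of the average cost is maintained alongside linear weights for the differential cost, and the temporal difference uses the per-stage cost minus the current average-cost estimate. The chain, costs, single basis function, and step sizes are illustrative assumptions, not the paper's.

```python
import numpy as np

# Illustrative irreducible aperiodic chain (assumptions, not from the paper).
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
cost = np.array([1.0, 3.0])
phi = np.array([[1.0],
                [0.0]])   # one fixed basis function (indicator of state 0)

rng = np.random.default_rng(4)
mu = 0.0                  # running estimate of the average cost
theta = np.zeros(1)       # weights of the differential-cost approximation
s = 0
for t in range(200_000):
    s_next = rng.choice(2, p=P[s])
    alpha = 10.0 / (100.0 + t)
    mu += alpha * (cost[s] - mu)   # track the long-run average cost
    # TD error for the differential-cost Bellman equation: c - mu + h(s') - h(s)
    delta = cost[s] - mu + phi[s_next] @ theta - phi[s] @ theta
    theta += alpha * delta * phi[s]
    s = s_next
```

Because the basis does not span the constant function, the weights settle at a well-defined fixed point of the projected differential-cost equation, while `mu` converges to the chain's long-run average cost.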
Convergence Results for Some Temporal Difference Methods Based on Least Squares
 Lab. for Information and Decision Systems Report 2697
, 2006
Abstract
Cited by 18 (9 self)
We consider finite-state Markov decision processes, and prove convergence and rate of convergence results for certain least squares policy evaluation algorithms of the type known as LSPE(λ). These are temporal difference methods for constructing a linear function approximation of the cost function of a stationary policy, within the context of infinite-horizon discounted and average cost dynamic programming. We introduce an average cost method, patterned after the known discounted cost method, and we prove its convergence for a range of constant step-size choices. We also show that the convergence rate of both the discounted and the average cost methods is optimal within the class of temporal difference methods. Analysis and experiment indicate that our methods are substantially and often dramatically faster than TD(λ), as well as more reliable.
Title of the Book!
Abstract
Cited by 3 (0 self)
Neuro-dynamic programming comprises algorithms for solving large-scale stochastic control problems. Many ideas underlying these algorithms originated in the field of artificial intelligence and were motivated to some extent by descriptive models of animal behavior. This chapter provides an overview of the history and state of the art in neuro-dynamic programming, as well as a review of recent results involving two classes of algorithms that have been the subject of much recent research activity: temporal-difference learning and actor-critic methods.
Brief paper: Average cost temporal-difference learning
, 1999
Abstract
We propose a variant of temporal-difference learning that approximates average and differential costs of an irreducible aperiodic Markov chain. Approximations comprise linear combinations of fixed basis functions whose weights are incrementally updated during a single endless trajectory of the Markov chain. We present a proof of convergence (with probability 1) and a characterization of the limit of convergence. We also provide a bound on the resulting approximation error that exhibits an interesting dependence on the "mixing time" of the Markov chain. The results parallel previous work by the authors, involving...