Results 1–10 of 27
Actor-Critic Algorithms
SIAM Journal on Control and Optimization, 2001
Cited by 244 (1 self)
In this paper, we propose and analyze a class of actor-critic algorithms. These are two-timescale algorithms in which the critic uses temporal difference (TD) learning with a linearly parameterized approximation architecture, and the actor is updated in an approximate gradient direction based on information provided by the critic. We show that the features for the critic should ideally span a subspace prescribed by the choice of parameterization of the actor. We study actor-critic algorithms for Markov decision processes with general state and action spaces. We state and prove two results regarding their convergence.
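The two-timescale scheme this abstract describes can be illustrated with a minimal sketch: a critic running TD(0) on a fast step size and a softmax actor taking smaller approximate gradient steps driven by the critic's TD error. The toy two-state MDP, step sizes, and all names below are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP: P[a, s] is the next-state
# distribution under action a, R[s, a] the one-step reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])

n_actions = 2
theta = np.zeros((2, 2))  # actor parameters (softmax policy per state)
w = np.zeros(2)           # critic parameters (tabular, trivially linear)
gamma = 0.95

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

s = 0
for t in range(20000):
    pi = policy(s)
    a = rng.choice(n_actions, p=pi)
    s2 = rng.choice(2, p=P[a, s])
    delta = R[s, a] + gamma * w[s2] - w[s]   # TD error from the critic
    # Two timescales: the critic step size is larger than the actor's.
    w[s] += 0.05 * delta                     # critic: TD(0) update
    grad_log = -pi
    grad_log[a] += 1.0                       # grad of log softmax at (s, a)
    theta[s] += 0.005 * delta * grad_log     # actor: approximate gradient step
    s = s2
```

In this toy chain, action 1 in state 1 yields the largest reward and mostly stays in state 1, so the learned softmax policy should come to prefer it.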
New Error Bounds for Approximations from Projected Linear Equations
2008
Cited by 20 (7 self)
We consider linear fixed point equations and their approximations by projection on a low-dimensional subspace. We derive new bounds on the approximation error of the solution, which are expressed in terms of low-dimensional matrices and can be computed by simulation. When the fixed point mapping is a contraction, as is typically the case in Markovian decision processes (MDP), one of our bounds is always sharper than the standard worst case bounds, and another one is often sharper. Our bounds also apply to the non-contraction case, including policy evaluation in MDP with nonstandard projections that enhance exploration.
Finite-Sample Analysis of Least-Squares Policy Iteration
Journal of Machine Learning Research (JMLR), 2011
Cited by 19 (6 self)
In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm.
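The LSTD(0) estimator at the heart of LSPI can be sketched in a few lines: accumulate the simulation estimates A = Σ φ(s)(φ(s) − γφ(s'))ᵀ and b = Σ φ(s)r(s), then solve Aw = b. The three-state chain, the two features per state, and the sample size below are hypothetical stand-ins, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 3-state Markov chain with a fixed policy folded in.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
r = np.array([1.0, 0.0, 2.0])   # expected one-step reward per state
gamma = 0.9
Phi = np.array([[1.0, 0.0],     # two linear features per state (an assumption)
                [0.5, 0.5],
                [0.0, 1.0]])

# LSTD(0) accumulation along a single simulated trajectory.
A = np.zeros((2, 2))
b = np.zeros(2)
s = 0
for t in range(100000):
    s2 = rng.choice(3, p=P[s])
    phi = Phi[s]
    A += np.outer(phi, phi - gamma * Phi[s2])
    b += phi * r[s]
    s = s2

w = np.linalg.solve(A, b)   # LSTD solution
V_hat = Phi @ w             # approximate value function
```

As the sample size grows, w approaches the solution of the projected Bellman equation, with projection weighted by the chain's stationary distribution.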
Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice
Cited by 17 (2 self)
Many approaches to reinforcement learning combine neural networks or other parametric function approximators with a form of temporal-difference learning to estimate the value function of a Markov Decision Process. A significant disadvantage of those procedures is that the resulting learning algorithms are frequently unstable. In this work, we present a new, kernel-based approach to reinforcement learning which overcomes this difficulty and provably converges to a unique solution. By contrast to existing algorithms, our method can also be shown to be consistent in the sense that its costs converge to the optimal costs asymptotically. Our focus is on learning in an average-cost framework and on a practical application to the optimal portfolio choice problem.
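The stability claim rests on kernel smoothers being "averagers": row-normalized kernel weights are nonexpansive, so the kernel-based backup remains a contraction and converges to a unique fixed point. A minimal sketch of that idea follows, in a discounted-cost setting for brevity (the paper itself works with average cost); the 1-D process, bandwidth, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Sampled transitions (x_i, r_i, x_i') from a hypothetical 1-D process.
n = 200
X = rng.uniform(-1, 1, size=n)
Xp = np.clip(0.8 * X + 0.1 * rng.standard_normal(n), -1, 1)
Rw = 1.0 - X**2          # reward favors states near zero
gamma, h = 0.9, 0.2      # discount factor and kernel bandwidth

def kernel_weights(a, b):
    # Row-normalized Gaussian kernel weights: an "averager" approximator,
    # so each row is a convex combination (rows sum to 1).
    K = np.exp(-(a[:, None] - b[None, :])**2 / (2 * h**2))
    return K / K.sum(axis=1, keepdims=True)

W = kernel_weights(X, X)    # smooth the backup over the sample points
S = kernel_weights(Xp, X)   # evaluate the current V at the next states

# Kernel-based value iteration: gamma * W @ S is a sup-norm contraction,
# so the iteration converges to a unique fixed point.
V = np.zeros(n)
for _ in range(300):
    V = W @ (Rw + gamma * S @ V)
```

Because both weight matrices are stochastic, the composite operator has modulus at most gamma, which is exactly the stability property that unconstrained parametric approximators lack.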
Performance loss bounds for approximate value iteration with state aggregation
Mathematics of Operations Research, 2005
Cited by 14 (1 self)
We consider approximate value iteration with a parameterized approximator in which the state space is partitioned and the optimal cost-to-go function over each partition is approximated by a constant. We establish performance loss bounds for policies derived from approximations associated with fixed points. These bounds identify benefits to using invariant distributions of appropriate policies as projection weights. Such projection weighting relates to what is done by temporal-difference learning. Our analysis also leads to the first performance loss bound for approximate value iteration with an average-cost objective. Key words: approximate value iteration; state aggregation; temporal-difference learning
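The aggregation scheme in this abstract, a constant value per partition with chosen projection weights, can be sketched as follows. For simplicity the sketch evaluates a fixed policy (so the backup is linear) rather than the optimal cost-to-go; the four-state chain, partitions, and uniform weights are hypothetical.

```python
import numpy as np

# Hypothetical 4-state chain under a fixed policy, aggregated into
# two groups {0, 1} and {2, 3}.
P = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.3, 0.3, 0.4, 0.0],
              [0.0, 0.2, 0.4, 0.4],
              [0.0, 0.0, 0.5, 0.5]])
g = np.array([1.0, 1.2, 3.0, 3.1])  # one-step costs
alpha = 0.9                         # discount factor
groups = [np.array([0, 1]), np.array([2, 3])]
weights = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]  # projection weights

# Approximate value iteration: Bellman backup, then project onto
# functions that are constant over each partition.
V = np.zeros(4)
for _ in range(500):
    T_V = g + alpha * P @ V
    for grp, wt in zip(groups, weights):
        V[grp] = wt @ T_V[grp]      # weighted average, constant per group
```

The paper's point is that the choice of `weights` matters: using an invariant distribution of an appropriate policy as projection weights (rather than the uniform weights assumed here) yields the stated performance loss bounds.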
On average versus discounted reward temporal-difference learning
Machine Learning, 2002
Cited by 13 (2 self)
Abstract. We provide an analytical comparison between discounted and average reward temporal-difference (TD) learning with linearly parameterized approximations. We first consider the asymptotic behavior of the two algorithms. We show that as the discount factor approaches 1, the value function produced by discounted TD approaches the differential value function generated by average reward TD. We further argue that if the constant function—which is typically used as one of the basis functions in discounted TD—is appropriately scaled, the transient behaviors of the two algorithms are also similar. Our analysis suggests that the computational advantages of average reward TD that have been observed in some prior empirical work may have been caused by inappropriate basis function scaling rather than fundamental differences in problem formulations or algorithms. Keywords: average reward, dynamic programming, function approximation, temporal-difference learning
Reinforcement learning algorithms for MDPs
2009
Cited by 11 (0 self)
This article presents a survey of reinforcement learning algorithms for Markov Decision Processes (MDP). In the first half of the article, the problem of value estimation is considered. Here we start by describing the idea of bootstrapping and temporal difference learning. Next, we compare incremental and batch algorithmic variants and discuss the impact of the choice of the function approximation method on the success of learning. In the second half, we describe methods that target the problem of learning to control an MDP. Here online and active learning are discussed first, followed by a description of direct and actor-critic methods.
Temporal Difference Methods for General Projected Equations
2011
Cited by 10 (4 self)
We consider projected equations for approximate solution of high-dimensional fixed point problems within low-dimensional subspaces. We introduce an analytical framework based on an equivalence with variational inequalities, and algorithms that may be implemented with low-dimensional simulation. These algorithms originated in approximate dynamic programming (DP), where they are collectively known as temporal difference (TD) methods. Even when specialized to DP, our methods include extensions/new versions of TD methods, which offer special implementation advantages and reduced overhead over the standard LSTD and LSPE methods, and can deal with near singularity in the associated matrix inversion. We develop deterministic iterative methods and their simulation-based versions, and we discuss a sharp qualitative distinction between them: the performance of the former is greatly affected by direction and feature scaling, yet the latter have the same asymptotic convergence rate regardless of scaling, because of their common simulation-induced performance bottleneck.
Interference-based dynamic pricing and radio resource management for WCDMA networks, Vehicular Technology Conference
2005
Cited by 5 (0 self)
Abstract — In this paper, a new parameter, Noise Rise Factor, that indicates the amount of interference generated by a call is suggested as a basis for setting price. We study the problem of optimal integrated dynamic pricing and radio resource management, in terms of resource allocation and call admission control, in an interference-limited network. The methods of Dynamic Programming are unsuitable for problems with large state spaces due to the "curse of dimensionality". To overcome this, we solve the problem using the simulation-based methods of Neuro-Dynamic Programming. The results show that the method suggested provides significant average reward and congestion improvement over conventional policies that charge users based on their load factor.