A Formal Framework and Extensions for Function Approximation in Learning Classifier Systems
, 2006
Cited by 8 (5 self)
In this paper we introduce part of a formal framework for Learning Classifier Systems (LCS) which, as a whole, aims at incorporating all components of LCS: function approximation, reinforcement learning and classifier replacement. The part introduced here concerns function approximation, and provides a formal problem definition, a formalisation of the LCS function approximation architecture, and a definition of the approximation aim. Additionally, we provide definitions of optimality and what conditions need to be fulfilled for a classifier to be optimal. Furthermore, as a demonstration of the usefulness of the framework, we derive commonly used algorithmic approaches that aim at reaching optimality from first principles, and introduce a new Kalman filter-based method that outperforms all currently implemented methods. How to mix classifiers to reach an overall approximation is simplified when compared to current LCS, and is justified by the Maximum Likelihood Estimate of a combination of all classifiers.
Kernelizing LSPE(λ)
- In Proc. of IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning
, 2007
Cited by 6 (2 self)
We propose the use of kernel-based methods as the underlying function approximator in the least-squares based policy evaluation framework of LSPE(λ) and LSTD(λ). In particular, we present the 'kernelization' of model-free LSPE(λ). The kernelization is made computationally feasible by the subset of regressors approximation, which approximates the kernel using a vastly reduced number of basis functions. The core of our proposed solution is an efficient recursive implementation with automatic supervised selection of the relevant basis functions. The LSPE method is well suited to optimistic policy iteration and can thus be used in the context of online reinforcement learning. We use the high-dimensional Octopus benchmark to demonstrate this.
Statistically Linearized Recursive Least Squares
- in Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2010), Kittilä (Finland), August-September 2010
Cited by 5 (5 self)
This article proposes a new interpretation of the sigma-point Kalman filter (SPKF) for parameter estimation as a statistically linearized recursive least-squares algorithm. This gives new insight into the SPKF for parameter estimation and, in particular, provides an alternative proof of a result of Van der Merwe. It also legitimizes the use of statistical linearization and suggests many ways to use it for parameter estimation, not necessarily in a least-squares sense. Index Terms: recursive least-squares, statistical linearization, parameter estimation.
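For orientation, the recursive least-squares update that this abstract builds on can be sketched as follows. This is only the standard linear RLS recursion, not the statistically linearized variant proposed in the paper; the function names `rls_update` and `fit_rls` and the `prior_variance` parameter are hypothetical choices made for this illustration.

```python
def rls_update(theta, P, x, y):
    """One recursive least-squares step for the linear model y ~ theta . x."""
    n = len(theta)
    # P x, reused for both the gain and the covariance update.
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = 1.0 + sum(x[i] * Px[i] for i in range(n))
    gain = [v / denom for v in Px]
    # Prediction error on the new sample.
    error = y - sum(theta[i] * x[i] for i in range(n))
    new_theta = [theta[i] + gain[i] * error for i in range(n)]
    # P stays symmetric, so x^T P equals (P x)^T.
    new_P = [[P[i][j] - gain[i] * Px[j] for j in range(n)] for i in range(n)]
    return new_theta, new_P


def fit_rls(samples, dim, prior_variance=1e6):
    """Run RLS over (x, y) pairs, starting from theta = 0 and P = prior_variance * I."""
    theta = [0.0] * dim
    P = [[prior_variance if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    for x, y in samples:
        theta, P = rls_update(theta, P, x, y)
    return theta


# Noise-free samples generated by y = 2*x1 - 3*x2 (an assumed toy problem).
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -3.0), ([1.0, 1.0], -1.0), ([2.0, 1.0], 1.0)]
theta = fit_rls(samples, dim=2)
```

A large `prior_variance` makes the initial estimate nearly uninformative, so on this toy data the estimate should approach the generating coefficients after a few samples.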
Projected Equations, Variational Inequalities, and Temporal Difference Methods
, 2009
Cited by 4 (1 self)
We consider projected equations for approximate solution of high-dimensional fixed point problems within low-dimensional subspaces. We introduce an analytical framework based on an equivalence with variational inequalities (VIs), and a class of iterative feasible direction methods that may be implemented with low-dimensional simulation. These methods originated in approximate dynamic programming (DP), where they are collectively known as temporal difference (TD) methods. Even when specialized to DP, our methods include extensions/new versions of TD algorithms, which offer special implementation advantages and reduced overhead over the standard LSTD and LSPE methods. We demonstrate a sharp qualitative distinction between the deterministic and the simulation-based versions: the performance of the former is greatly affected by direction and feature scaling, yet the latter asymptotically perform identically, regardless of scaling.
Importance Sampling Actor-Critic Algorithms
Cited by 3 (0 self)
Importance Sampling (IS) and actor-critic are two methods which have been used to reduce the variance of gradient estimates in policy gradient optimization methods. We show how IS can be used with Temporal Difference methods to estimate a cost function parameter for one policy using the entire history of system interactions incorporating many different policies. The resulting algorithm is then applied to improving gradient estimates in a policy gradient optimization. The empirical results demonstrate a 20-40× reduction in variance over the IS estimator for an example queueing problem, resulting in a similar factor of improvement in convergence for a gradient search.
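The basic importance sampling idea referred to above, reweighting samples collected under one policy to estimate an expectation under another, can be sketched as follows. This is the plain IS estimator only, not the paper's combination of IS with temporal difference methods; the two-action policies, the costs, and the function name are assumptions made up for the illustration.

```python
def importance_sampling_estimate(actions, costs, target_prob, behavior_prob):
    """Estimate the expected cost under the target policy from behavior-policy samples."""
    total = 0.0
    for action, cost in zip(actions, costs):
        # Likelihood ratio: how much more (or less) likely the target policy
        # is to pick this action than the policy that generated the sample.
        weight = target_prob[action] / behavior_prob[action]
        total += weight * cost
    return total / len(actions)


# Hypothetical two-action example; the samples were "drawn" under `behavior`.
behavior = {"serve": 0.5, "idle": 0.5}
target = {"serve": 0.2, "idle": 0.8}
actions = ["serve", "idle", "serve", "idle"]
costs = [10.0, 0.0, 10.0, 0.0]

estimate = importance_sampling_estimate(actions, costs, target, behavior)
```

The estimator is unbiased as long as the behavior policy gives nonzero probability to every action the target policy can take, but its variance grows with the size of the likelihood ratios, which is the problem the abstract's actor-critic combination aims to reduce.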
Statistically Linearized Least-Squares Temporal Differences
- in Proceedings of the IEEE International Conference on Ultra Modern Control Systems (ICUMT 2010). Moscow (Russia): IEEE
, 2010
Cited by 3 (3 self)
A common drawback of standard reinforcement learning algorithms is their inability to scale up to real-world problems. For this reason, an important current research trend is (state-action) value function approximation. A prominent value function approximator is the least-squares temporal differences (LSTD) algorithm. However, for technical reasons, linearity is mandatory: the parameterization of the value function must be linear (compact nonlinear representations are not allowed) and only the Bellman evaluation operator can be considered (imposing policy-iteration-like schemes). In this paper, this restriction of LSTD is lifted thanks to a derivative-free statistical linearization approach. This way, nonlinear parameterizations and the Bellman optimality operator can be taken into account (the latter allows value-iteration-like schemes). The efficiency of the resulting algorithms is demonstrated using a linear parameterization and neural networks, as well as on a Q-learning-like problem. A theoretical analysis is also provided. Index Terms: reinforcement learning, value function approximation, statistical linearization, neural networks.
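As background for the linearity restriction mentioned above, standard batch LSTD with a linear parameterization can be sketched as follows. This is the ordinary linear algorithm, not the statistically linearized extension from the paper; the helper names, the optional `ridge` term, and the two-state toy chain are assumptions introduced for the illustration.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def lstd(transitions, dim, gamma, ridge=0.0):
    """Batch LSTD: solve A w = b with A = sum phi (phi - gamma phi')^T, b = sum phi r."""
    A = [[ridge if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    b = [0.0] * dim
    for phi, reward, phi_next in transitions:
        diff = [phi[j] - gamma * phi_next[j] for j in range(dim)]
        for i in range(dim):
            b[i] += phi[i] * reward
            for j in range(dim):
                A[i][j] += phi[i] * diff[j]
    return solve_linear(A, b)


# Toy chain with one-hot features: state 0 -> state 1 (reward 1), state 1 -> state 0 (reward 0).
transitions = [([1.0, 0.0], 1.0, [0.0, 1.0]), ([0.0, 1.0], 0.0, [1.0, 0.0])]
weights = lstd(transitions, dim=2, gamma=0.5)
```

With one-hot features the weights are the state values themselves, so on this chain they should satisfy V(0) = 1 + 0.5 V(1) and V(1) = 0.5 V(0). The system being solved is linear in `w` only because the value function is linear in the features, which is the restriction the abstract describes.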
Using reinforcement learning to adapt an imitation task
- in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS’07
, 2007
Cited by 3 (0 self)
The goal of developing algorithms for programming robots by demonstration is to create an easy way of programming robots that can be accomplished by everyone. When a demonstrator teaches a task to a robot, he/she shows some ways of fulfilling the task, but not all the possibilities. The robot must then be able to reproduce the task even when unexpected perturbations occur. In this case, it has to learn a new solution. In this paper, we describe a system that allows a robot to re-learn constrained reaching tasks by combining the knowledge acquired during the demonstration with that acquired through reinforcement learning.
Scaling Reinforcement Learning Paradigms for Motor Control
, 2003
Cited by 3 (2 self)
Reinforcement learning offers a general framework to explain reward related learning in artificial and biological motor control. However, current reinforcement learning methods rarely scale to high dimensional movement systems and mainly operate in discrete, low dimensional domains like game-playing, artificial toy problems, etc. This drawback makes them unsuitable for application to human or bio-mimetic motor control. In this poster, we look at promising approaches that can potentially scale and suggest a novel formulation of the actor-critic algorithm which takes steps towards alleviating the current shortcomings. We argue that methods based on greedy policies are not likely to scale into high-dimensional domains as they are problematic when used with function approximation, a must when dealing with continuous domains. We adopt the path of direct policy gradient based policy improvements since they avoid the problems of destabilizing dynamics encountered in traditional value iteration based updates. While regular policy gradient methods have demonstrated promising results in the domain of humanoid motor control, we demonstrate that these methods can be significantly improved using the natural policy gradient instead of the regular policy gradient. Based on this, it is proved that Kakade's `average natural policy gradient' is indeed the true natural gradient. A general algorithm for estimating the natural gradient, the Natural Actor-Critic algorithm, is introduced. This algorithm converges with probability one to the nearest local minimum in Riemannian space of the cost function. The algorithm outperforms non-natural policy gradients by far in a cart-pole balancing evaluation, and offers a promising route for the development of reinforcement learning for truly high-dime...
Temporal Difference Methods for General Projected Equations
Cited by 3 (3 self)
We consider projected equations for approximate solution of high-dimensional fixed point problems within low-dimensional subspaces. We introduce an analytical framework based on an equivalence with variational inequalities, and algorithms that may be implemented with low-dimensional simulation. These algorithms originated in approximate dynamic programming (DP), where they are collectively known as temporal difference (TD) methods. Even when specialized to DP, our methods include extensions/new versions of TD methods, which offer special implementation advantages and reduced overhead over the standard LSTD and LSPE methods, and can deal with near singularity in the associated matrix inversion. We develop deterministic iterative methods and their simulation-based versions, and we discuss a sharp qualitative distinction between them: the performance of the former is greatly affected by direction and feature scaling, yet the latter have the same asymptotic convergence rate regardless of scaling, because of their common simulation-induced performance bottleneck. Index Terms: dynamic programming, Markov decision processes, approximation methods, temporal difference methods, reinforcement learning.
A unified view of td algorithms – introducing full-gradient td and equi-gradient descent td
Cited by 2 (0 self)
This paper addresses policy evaluation in MDP. It provides a unified view of algorithms such as TD(λ), LSTD(λ), iLSTD, and residual-gradient TD. We assert that they all consist of minimizing a gradient function and differ in the form of this function and their means of minimizing it. Building on this unified view, two new schemes are introduced: Full-gradient TD, which uses a generalization of the principle introduced in iLSTD, and EGD TD, which reduces the gradient by successive equi-gradient descents. These three algorithms share the worthy property of using samples much more efficiently than TD, while keeping the good properties of gradient descent schemes.
1 The policy evaluation problem
A Markov Decision Process (MDP) [1] describes a dynamical system in which an agent has to learn a behavior so as to reach a given goal in an optimal way. In this paper, the state of the system s ∈ S may be either discrete or continuous. The agent applies an action u ∈ U at each time step t ∈ N. This drives the

