Results 1–10 of 18
Path integral policy improvement with covariance matrix adaptation
In ICML, 2012
Abstract
Cited by 36 (10 self)
There has been a recent focus in reinforcement learning on addressing continuous state and action problems by optimizing parameterized policies. PI2 is a recent example of this approach. It combines a derivation from first principles of stochastic optimal control with tools from statistical estimation theory. In this paper, we consider PI2 as a member of the wider family of methods that share the concept of probability-weighted averaging to iteratively update parameters to optimize a cost function. We compare PI2 to other members of the same family, Cross-Entropy Methods and CMA-ES, at the conceptual level and in terms of performance. The comparison suggests the derivation of a novel algorithm which we call PI2-CMA, for "Path Integral Policy Improvement with Covariance Matrix Adaptation". PI2-CMA's main advantage is that it determines the magnitude of the exploration noise automatically. This is a double submission with ICML 2012 paper 171.
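The probability-weighted averaging update shared by this family of methods can be sketched as follows. This is a minimal illustration, not code from the paper; the function name, the temperature `h`, and the covariance rule are simplifying assumptions:

```python
import numpy as np

def probability_weighted_update(theta, costs, samples, h=10.0):
    """One iteration of probability-weighted averaging, the update rule
    shared by PI2, Cross-Entropy Methods, and CMA-ES (illustrative sketch;
    the temperature h and normalization are assumptions).

    theta:   current parameter vector, shape (d,)
    costs:   cost of each sampled rollout, shape (K,)
    samples: perturbed parameter vectors, shape (K, d)
    """
    costs = np.asarray(costs, dtype=float)
    # Map costs to [0, 1], then exponentiate negative costs so that
    # low-cost samples receive high probability (softmax-style weighting).
    c = (costs - costs.min()) / (costs.max() - costs.min() + 1e-12)
    w = np.exp(-h * c)
    w /= w.sum()
    # The new mean is the probability-weighted average of the samples.
    new_theta = w @ samples
    # Covariance adaptation (the "CMA" in PI2-CMA): the weighted scatter
    # of the samples around the old mean sets the next exploration noise.
    diffs = samples - theta
    new_cov = (w[:, None] * diffs).T @ diffs
    return new_theta, new_cov
```

On a toy quadratic cost, one such update moves the mean toward the low-cost region and shrinks or reshapes the exploration covariance accordingly.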
Parameter-exploring Policy Gradients
2009
Abstract
Cited by 30 (4 self)
We present a model-free reinforcement learning method for partially observable Markov decision problems. Our method estimates a likelihood gradient by sampling directly in parameter space, which leads to lower-variance gradient estimates than those obtained by regular policy gradient methods. We show that for several complex control tasks, including robust standing with a humanoid robot, this method outperforms well-known algorithms from the fields of standard policy gradients, finite difference methods and population-based heuristics. We also show that the improvement is largest when the parameter samples are drawn symmetrically. Lastly, we analyse the importance of the individual components of our method by incrementally incorporating them into the other algorithms, and measuring the gain in performance after each step.
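The core idea of sampling directly in parameter space with symmetric (antithetic) perturbations can be sketched as follows. This is a simplified assumed estimator for illustration, not the paper's implementation:

```python
import numpy as np

def pgpe_gradient(objective, mu, sigma, n_pairs=10, rng=None):
    """Sketch of a PGPE-style likelihood-gradient estimate with symmetric
    parameter samples (illustrative assumption of the estimator's form).

    objective: maps a parameter vector to a scalar return (higher = better)
    mu, sigma: mean and std of the Gaussian distribution over parameters
    """
    rng = rng or np.random.default_rng()
    grad_mu = np.zeros_like(mu)
    for _ in range(n_pairs):
        eps = rng.normal(size=mu.shape) * sigma
        # Symmetric pair: evaluate mu + eps and mu - eps and use the
        # return difference; shared baseline terms cancel, lowering variance.
        r_plus = objective(mu + eps)
        r_minus = objective(mu - eps)
        grad_mu += (r_plus - r_minus) / 2.0 * eps / (sigma ** 2)
    return grad_mu / n_pairs  # ascent direction on the expected return
```

Because each perturbation is evaluated in both directions, noise common to the pair cancels in the return difference, which is why symmetric sampling gives the largest improvement in the abstract's comparison.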
Analysis and Improvement of Policy Gradient Estimation
Abstract
Cited by 9 (7 self)
Policy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. We also show theoretically that PGPE with the optimal baseline is preferable to REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments.
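The optimal-baseline idea can be illustrated with the standard textbook form for likelihood-ratio estimators. This is an assumed generic version for illustration; the paper derives the PGPE-specific baseline:

```python
import numpy as np

def optimal_baseline(returns, grad_log_probs):
    """Optimal constant baseline for a likelihood-ratio gradient estimator
    (generic textbook form, assumed here for illustration).

    b* = E[R * ||grad log p||^2] / E[||grad log p||^2]
    minimizes the variance of the estimate (R - b) * grad log p,
    whereas b = 0 recovers the plain (higher-variance) estimator.
    """
    sq_norms = np.sum(grad_log_probs ** 2, axis=1)
    return np.sum(returns * sq_norms) / np.sum(sq_norms)
```

Subtracting b* leaves the estimator unbiased (the baseline term has zero mean) while provably minimizing its second moment over all constant baselines.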
A Natural Evolution Strategy for Multi-Objective Optimization
Abstract
Cited by 6 (5 self)
The recently introduced family of natural evolution strategies (NES), a novel stochastic descent method employing the natural gradient, provides a more principled alternative to the well-known covariance matrix adaptation evolution strategy (CMA-ES). Until now, NES could only be used for single-objective optimization. This paper extends the approach to the multi-objective case by first deriving a (1+1) hill-climber version of NES, which is then used as the core component of a multi-objective optimization algorithm. We empirically evaluate the approach on a battery of benchmark functions and find it to be competitive with the state of the art.
Emergent proximodistal maturation through adaptive exploration
In International Conference on Development and Learning (ICDL), 2012
Abstract
Cited by 5 (1 self)
Lifelong robot learning in the high-dimensional real world requires guided and structured exploration mechanisms. In this developmental context, we investigate the use of the recently proposed PI2-CMA-ES episodic reinforcement learning algorithm, which is able to learn high-dimensional motor tasks through adaptive control of exploration. By studying PI2-CMA-ES in a reaching task on a simulated arm, we observe two developmental properties. First, we show how PI2-CMA-ES autonomously and continuously tunes the global exploration/exploitation trade-off, allowing it to re-adapt to changing tasks. Second, we show how PI2-CMA-ES spontaneously self-organizes a maturational structure while exploring the degrees of freedom (DOFs) of the motor space. In particular, it automatically demonstrates the so-called proximodistal maturation observed in humans: after first freezing distal DOFs while exploring predominantly the most proximal DOF, it progressively frees exploration in DOFs along the proximodistal body axis. These emergent properties suggest the use of PI2-CMA-ES as a general tool for studying reinforcement learning of skills in lifelong developmental learning contexts.
Policy Improvement Methods: Between Black-Box Optimization and Episodic Reinforcement Learning
Adaptive exploration for continual reinforcement learning
In International Conference on Intelligent Robots and Systems (IROS), 2012
Abstract
Cited by 3 (3 self)
Most experiments on policy search for robotics focus on isolated tasks, where the experiment is split into two distinct phases: 1) the learning phase, where the robot learns the task through exploration; 2) the exploitation phase, where exploration is turned off and the robot demonstrates its performance on the task it has learned. In this paper, we present an algorithm that enables robots to continually and autonomously alternate between these phases. We do so by combining the ‘Policy Improvement with Path Integrals’ direct reinforcement learning algorithm with the covariance matrix adaptation rule from the ‘Cross-Entropy Method’ optimization algorithm. This integration is possible because both algorithms iteratively update parameters with probability-weighted averaging. A practical advantage of the novel algorithm, called PI2-CMA, is that it spares the user from having to manually tune the degree of exploration. We evaluate PI2-CMA's ability to continually and autonomously tune exploration on two tasks.
Bayesian Nonparametric Multi-optima Policy Search in Reinforcement Learning
Abstract
Cited by 1 (1 self)
Skills can often be performed in many different ways. In order to provide robots with human-like adaptation capabilities, it is of great interest to learn several ways of achieving the same skill in parallel, since changes in the environment or in the robot can make some solutions unfeasible. In this case, the knowledge of multiple solutions can avoid relearning the task. This problem is addressed in this paper within the framework of Reinforcement Learning, as the automatic determination of multiple optimal parameterized policies. For this purpose, a model handling a variable number of policies is built using a Bayesian nonparametric approach. The algorithm is first compared to single-policy algorithms on known benchmarks. It is then applied to a typical robotic problem presenting multiple solutions.
Comparative Evaluation of Reinforcement Learning with Scalar Rewards and Linear Regression with Multidimensional Feedback
Abstract
Cited by 1 (0 self)
This paper presents a comparative evaluation of two learning approaches. The first approach is a conventional reinforcement learning algorithm for direct policy search, which by definition uses scalar rewards. The second approach is a custom linear-regression-based algorithm that uses multidimensional feedback instead of a scalar reward. The two approaches are evaluated in simulation on a common benchmark problem: an aiming task where the goal is to learn the optimal aiming parameters that result in hitting as close as possible to a given target. The comparative evaluation shows that the multidimensional feedback provides a significant advantage over the scalar reward, resulting in an order-of-magnitude speedup in convergence. A real-world experiment with a humanoid robot confirms the simulation results and highlights the importance of multidimensional feedback for fast learning.
Policy Improvement: Between Black-Box Optimization and Episodic Reinforcement Learning
Abstract
Cited by 1 (0 self)
Policy improvement methods seek to optimize the parameters of a policy with respect to a utility function. There are two main approaches to performing this optimization: reinforcement learning (RL) and black-box optimization (BBO). In recent years, benchmark comparisons between RL and BBO have been made, and there have been several attempts to specify which approach works best for which types of problem classes. In this article, we make several contributions to this line of research by: 1) classifying several RL algorithms in terms of their algorithmic properties; 2) showing how the derivation of ever more powerful RL algorithms displays a trend towards BBO; 3) continuing this trend by applying two modifications to the state-of-the-art PI2 algorithm, which yields an algorithm we denote PI-BB, and showing that PI-BB is a BBO algorithm; 4) demonstrating that PI-BB achieves similar or better performance than PI2 on several evaluation tasks; 5) analyzing why BBO outperforms RL on these tasks. Rather than making the case for BBO or RL (in general, we expect their relative performance to depend on the task considered), we provide two algorithms in which such cases can be made, as the algorithms are identical in all respects except in being RL or BBO approaches to policy improvement.