Results 1  10
of
208
Perseus: Randomized pointbased value iteration for POMDPs
 Journal of Artificial Intelligence Research
, 2005
"... Partially observable Markov decision processes (POMDPs) form an attractive and principled framework for agent planning under uncertainty. Pointbased approximate techniques for POMDPs compute a policy based on a finite set of points collected in advance from the agent’s belief space. We present a ra ..."
Abstract

Cited by 204 (17 self)
 Add to MetaCart
(Show Context)
Partially observable Markov decision processes (POMDPs) form an attractive and principled framework for agent planning under uncertainty. Pointbased approximate techniques for POMDPs compute a policy based on a finite set of points collected in advance from the agent’s belief space. We present a randomized pointbased value iteration algorithm called Perseus. The algorithm performs approximate value backup stages, ensuring that in each backup stage the value of each point in the belief set is improved; the key observation is that a single backup may improve the value of many belief points. Contrary to other pointbased methods, Perseus backs up only a (randomly selected) subset of points in the belief set, sufficient for improving the value of each belief point in the set. We show how the same idea can be extended to dealing with continuous action spaces. Experimental results show the potential of Perseus in large scale POMDP problems. 1.
Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion
 in Proceedings of the IEEE International Conference on Robotics and Automation
, 2004
"... This paper presents a machine learning approach to optimizing a quadrupedal trot gait for forward speed. Given a parameterized walk designed for a specific robot, we propose using a form of policy gradient reinforcement learning to automatically search the set of possible parameters with the goal of ..."
Abstract

Cited by 149 (15 self)
 Add to MetaCart
(Show Context)
This paper presents a machine learning approach to optimizing a quadrupedal trot gait for forward speed. Given a parameterized walk designed for a specific robot, we propose using a form of policy gradient reinforcement learning to automatically search the set of possible parameters with the goal of finding the fastest possible walk. We implement and test our approach on a commercially available quadrupedal robot platform, namely the Sony Aibo robot. After about three hours of learning, all on the physical robots and with no human intervention other than to change the batteries, the robots achieved a gait faster than any previously known gait for the Aibo, significantly outperforming a variety of existing handcoded and learned solutions.
Machine learning for fast quadrupedal locomotion
 in The Nineteenth National Conference on Artificial Intelligence
, 2004
"... For a robot, the ability to get from one place to another is one of the most basic skills. However, locomotion on legged robots is a challenging multidimensional control problem. This paper presents a machine learning approach to legged locomotion, with all training done on the physical robots. The ..."
Abstract

Cited by 123 (34 self)
 Add to MetaCart
(Show Context)
For a robot, the ability to get from one place to another is one of the most basic skills. However, locomotion on legged robots is a challenging multidimensional control problem. This paper presents a machine learning approach to legged locomotion, with all training done on the physical robots. The main contributions are a specification of our fully automated learning environment and a detailed empirical comparison of four different machine learning algorithms for learning quadrupedal locomotion. The resulting learned walk is considerably faster than all previously reported handcoded walks for the same robot platform.
A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning
, 2010
"... We present a tutorial on Bayesian optimization, a method of finding the maximum of expensive cost functions. Bayesian optimization employs the Bayesian technique of setting a prior over the objective function and combining it with evidence to get a posterior function. This permits a utilitybased se ..."
Abstract

Cited by 91 (11 self)
 Add to MetaCart
We present a tutorial on Bayesian optimization, a method of finding the maximum of expensive cost functions. Bayesian optimization employs the Bayesian technique of setting a prior over the objective function and combining it with evidence to get a posterior function. This permits a utilitybased selection of the next observation to make on the objective function, which must take into account both exploration (sampling from areas of high uncertainty) and exploitation (sampling areas likely to offer improvement over the current best observation). We also present two detailed extensions of Bayesian optimization, with experiments—active user modelling with preferences, and hierarchical reinforcement learning— and a discussion of the pros and cons of Bayesian optimization based on our experiences.
Experiments with InfiniteHorizon, PolicyGradient Estimation
 Journal of Artificial Intelligence Research
, 2001
"... In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter & Bartlett, 2001), which computes biased estimates of t ..."
Abstract

Cited by 81 (3 self)
 Add to MetaCart
(Show Context)
In this paper, we present algorithms that perform gradient ascent of the average reward in a partially observable Markov decision process (POMDP). These algorithms are based on GPOMDP, an algorithm introduced in a companion paper (Baxter & Bartlett, 2001), which computes biased estimates of the performance gradient in POMDPs. The algorithm's chief advantages are that it uses only one free parameter 2 [0; 1), which has a natural interpretation in terms of biasvariance tradeoff, it requires no knowledge of the underlying state, and it can be applied to infinite state, control and observation spaces. We show how the gradient estimates produced by GPOMDP can be used to perform gradient ascent, both with a traditional stochasticgradient algorithm, and with an algorithm based on conjugategradients that utilizes gradient information to bracket maxima in line searches. Experimental results are presented illustrating both the theoretical results of Baxter and Bartlett (2001) on a toy problem, and practical aspects of the algorithms on a number of more realistic problems. 1.
Stochastic policy gradient reinforcement learning on a simple 3D biped
 Proc. of the 10th Int. Conf. on Intelligent Robots and Systems
, 2004
"... Abstract — We present a learning system which is able to quickly and reliably acquire a robust feedback control policy for 3D dynamic walking from a blankslate using only trials implemented on our physical robot. The robot begins walking within a minute and learning converges in approximately 20 mi ..."
Abstract

Cited by 79 (7 self)
 Add to MetaCart
(Show Context)
Abstract — We present a learning system which is able to quickly and reliably acquire a robust feedback control policy for 3D dynamic walking from a blankslate using only trials implemented on our physical robot. The robot begins walking within a minute and learning converges in approximately 20 minutes. This success can be attributed to the mechanics of our robot, which are modeled after a passive dynamic walker, and to a dramatic reduction in the dimensionality of the learning problem. We reduce the dimensionality by designing a robot with only 6 internal degrees of freedom and 4 actuators, by decomposing the control system in the frontal and sagittal planes, and by formulating the learning problem on the discrete return map dynamics. We apply a stochastic policy gradient algorithm to this reduced problem and decrease the variance of the update using a statebased estimate of the expected cost. This optimized learning system works quickly enough that the robot is able to continually adapt to the terrain as it walks. I.
Incremental Natural ActorCritic Algorithms
"... We present four new reinforcement learning algorithms based on actorcritic and naturalgradient ideas, and provide their convergence proofs. Actorcritic reinforcement learning methods are online approximations to policy iteration in which the valuefunction parameters are estimated using temporal ..."
Abstract

Cited by 75 (8 self)
 Add to MetaCart
(Show Context)
We present four new reinforcement learning algorithms based on actorcritic and naturalgradient ideas, and provide their convergence proofs. Actorcritic reinforcement learning methods are online approximations to policy iteration in which the valuefunction parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior twotimescale convergence results for actorcritic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actorcritic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. 1
PointBased Value Iteration for Continuous POMDPs
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2006
"... We propose a novel approach to optimize Partially Observable Markov Decisions Processes (POMDPs) defined on continuous spaces. To date, most algorithms for modelbased POMDPs are restricted to discrete states, actions, and observations, but many realworld problems such as, for instance, robot na ..."
Abstract

Cited by 74 (4 self)
 Add to MetaCart
We propose a novel approach to optimize Partially Observable Markov Decisions Processes (POMDPs) defined on continuous spaces. To date, most algorithms for modelbased POMDPs are restricted to discrete states, actions, and observations, but many realworld problems such as, for instance, robot navigation, are naturally defined on continuous spaces. In this work, we demonstrate that the value function for continuous POMDPs is convex in the beliefs over continuous state spaces, and piecewiselinear convex for the particular case of discrete observations and actions but still continuous states. We also demonstrate that continuous Bellman backups are contracting and isotonic ensuring the monotonic convergence of valueiteration algorithms. Relying on those properties, we extend the PERSEUS algorithm, originally developed for discrete POMDPs, to work in continuous state spaces by representing the observation, transition, and reward models using Gaussian mixtures, and the beliefs using Gaussian mixtures or particle sets. With these representations, the integrals that appear in the Bellman backup can be computed in closed form and, therefore, the algorithm is computationally feasible. Finally, we further extend PERSEUS to deal with continuous action and observation sets by designing effective sampling approaches.
A generalized path integral control approach to reinforcement learning
 The Journal of Machine Learning Research
, 2010
"... With the goal to generate more scalable algorithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classical techniques from optimal control and dynamic programming with modern learning techniques from statistical estimation theory. ..."
Abstract

Cited by 70 (14 self)
 Add to MetaCart
With the goal to generate more scalable algorithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classical techniques from optimal control and dynamic programming with modern learning techniques from statistical estimation theory. In this vein, this paper suggests to use the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parameterized policies. While solidly grounded in value function estimation and optimal control based on the stochastic HamiltonJacobiBellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path integral which has no open algorithmic parameters other than the exploration noise. The resulting algorithm can be conceived of as modelbased, semimodelbased, or even model free, depending on how the learning problem is structured. The update equations have no danger of numerical instabilities as neither matrix inversions nor gradient learning rates are required. Our new algorithm demonstrates interesting similarities with previous RL research in the framework of probability matching and provides intuition why the slightly heuristically motivated probability matching approach can actually perform well. Empirical evaluations demonstrate significant performance improvements over gradientbased policy learning and scalability to highdimensional control problems. Finally, a learning experiment on a simulated 12 degreeoffreedom robot dog illustrates the functionality of our algorithm in a complex robot learning scenario. We believe that Policy Improvement with Path Integrals (PI 2) offers currently one of the most efficient, numerically robust, and easy to implement algorithms for RL based on trajectory rollouts.