Results 1 - 9 of 9
Modelling transition dynamics in MDPs with RKHS embeddings
In arXiv, 2012
Abstract

Cited by 20 (9 self)
We propose a new, nonparametric approach to learning and representing transition dynamics in Markov decision processes (MDPs), which can be combined easily with dynamic programming methods for policy optimisation and value estimation. This approach makes use of a recently developed representation of conditional distributions as embeddings in a reproducing kernel Hilbert space (RKHS). Such representations bypass the need for estimating transition probabilities or densities, and apply to any domain on which kernels can be defined. This avoids the need to calculate intractable integrals, since expectations are represented as RKHS inner products whose computation has linear complexity in the number of points used to represent the embedding. We provide guarantees for the proposed applications in MDPs: in the context of a value iteration algorithm, we prove convergence to either the optimal policy, or to the closest projection of the optimal policy in our model class (an RKHS), under reasonable assumptions. In experiments, we investigate a learning task in a typical classical control setting (the underactuated pendulum), and on a navigation problem where only images from a sensor are observed. For policy optimisation we compare with least-squares policy iteration where a Gaussian process is used for value function estimation. For value estimation we also compare to the NPDP method. Our approach achieves better performance in all experiments.
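The inner-product trick this abstract describes can be sketched in a few lines: fit the conditional embedding from sampled transitions via kernel ridge regression, then evaluate E[V(X') | X = x] as a weighted sum over the sampled successor states. This is a minimal 1-D illustration, not the paper's implementation; the toy dynamics, kernel bandwidth, and regulariser are arbitrary choices.

```python
import numpy as np

def rbf(a, b, sigma=0.5):
    """Gaussian (RBF) kernel between two batches of 1-D points."""
    a = np.atleast_1d(a)[:, None]
    b = np.atleast_1d(b)[None, :]
    return np.exp(-(a - b) ** 2 / (2 * sigma ** 2))

# Toy dynamics: x' = 0.9 x + noise, observed only through (x_i, x'_i) pairs.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 50)
Xp = 0.9 * X + 0.05 * rng.standard_normal(50)

lam = 1e-3
K = rbf(X, X)                                    # Gram matrix on the inputs
W = np.linalg.inv(K + lam * len(X) * np.eye(len(X)))

def expected_value(x, V):
    """E[V(X') | X = x] as an RKHS inner product, linear in the sample size."""
    alpha = W @ rbf(X, x).ravel()                # embedding weights for query x
    return float(alpha @ V(Xp))                  # inner product with V at successors

# With V(x) = x this recovers the conditional mean, roughly 0.9 * 0.3.
print(expected_value(0.3, lambda s: s))
```

No transition density is ever estimated: after the one-off matrix inverse, each expectation is a dot product whose cost is linear in the number of stored transitions, which is the point the abstract makes.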
Towards learning hierarchical skills for multi-phase manipulation tasks
In International Conference on Robotics and Automation (ICRA), 2015
Abstract

Cited by 5 (1 self)
Most manipulation tasks can be decomposed into a sequence of phases, where the robot’s actions have different effects in each phase. The robot can perform actions to transition between phases and, thus, alter the effects of its actions, e.g. grasp an object in order to then lift it. The robot can thus reach a phase that affords the desired manipulation. In this paper, we present an approach for exploiting the phase structure of tasks in order to learn manipulation skills more efficiently. Starting with human demonstrations, the robot learns a probabilistic model of the phases and the phase transitions. The robot then employs model-based reinforcement learning to create a library of motor primitives for transitioning between phases. The learned motor primitives generalize to new situations and tasks. Given this library, the robot uses a value function approach to learn a high-level policy for sequencing the motor primitives. The proposed method was successfully evaluated on a real robot performing a bimanual grasping task.
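The sequencing step described here (a high-level policy over phases driven by a value function) can be illustrated with a toy phase graph. Everything below is hypothetical: the phases, primitives, and success probabilities are invented for illustration, and tabular value iteration stands in for the paper's learned value function.

```python
import numpy as np

# Hypothetical phase graph: 0 = approach, 1 = grasped, 2 = lifted (goal).
# primitives[name] maps phase -> (next_phase, success_prob); numbers are made up.
primitives = {
    "reach": {0: (0, 1.0)},
    "grasp": {0: (1, 0.8)},
    "lift":  {1: (2, 0.9)},
}
gamma, n_phases = 0.95, 3
reward = np.array([0.0, 0.0, 1.0])     # reward only for reaching the goal phase
V = np.zeros(n_phases)

for _ in range(100):                    # value iteration over the phase graph
    for s in range(n_phases):
        qs = []
        for prim in primitives.values():
            if s not in prim:
                continue                # primitive not applicable in this phase
            sp, p = prim[s]             # success keeps us moving; failure stays put
            qs.append(p * (reward[sp] + gamma * V[sp]) + (1 - p) * gamma * V[s])
        if qs:
            V[s] = max(qs)

def best_primitive(s):
    """Greedy high-level policy: pick the primitive with the highest Q-value."""
    scored = {name: p * (reward[sp] + gamma * V[sp]) + (1 - p) * gamma * V[s]
              for name, prim in primitives.items() if s in prim
              for sp, p in [prim[s]]}
    return max(scored, key=scored.get)

print(best_primitive(0), best_primitive(1))   # grasp first, then lift
```

The value function orders the primitives correctly even though "reach" is the safest action in phase 0: grasping is worth the failure risk because it opens the phase that affords lifting.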
Value Function Approximation in Noisy Environments Using Locally Smoothed Regularized Approximate Linear Programs
Abstract

Cited by 3 (1 self)
Recently, Petrik et al. demonstrated that L1-Regularized Approximate Linear Programming (RALP) could produce value functions and policies which compared favorably to established linear value function approximation techniques like LSPI. RALP’s success primarily stems from the ability to solve the feature selection and value function approximation steps simultaneously. RALP’s performance guarantees become looser if sampled next states are used. For very noisy domains, RALP requires an accurate model rather than samples, which can be unrealistic in some practical scenarios. In this paper, we demonstrate this weakness, and then introduce Locally Smoothed L1-Regularized Approximate Linear Programming (LS-RALP). We demonstrate that LS-RALP mitigates inaccuracies stemming from noise even without an accurate model. We show that, given some smoothness assumptions, as the number of samples increases, error from noise approaches zero, and provide experimental examples of LS-RALP’s success on common reinforcement learning benchmark problems.
Online Selective Kernel-Based Temporal Difference Learning
Abstract

Cited by 2 (0 self)
In this paper, an online selective kernel-based temporal difference (OSKTD) learning algorithm is proposed to deal with large-scale and/or continuous reinforcement learning problems. OSKTD includes two online procedures: online sparsification and parameter updating for the selective kernel-based value function. A new sparsification method (i.e., a kernel distance-based online sparsification method) is proposed based on selective ensemble learning, which is computationally less complex compared with other sparsification methods. With the proposed sparsification method, the sparsified dictionary of samples is constructed online by checking if a sample needs to be added to the sparsified dictionary. In addition, based on local validity, a selective kernel-based value function is proposed to select the best samples from the sample dictionary for the selective kernel-based value function approximator. The parameters of the selective kernel-based value function are iteratively updated by using the temporal difference (TD) learning algorithm combined with the gradient descent technique. The complexity of the online sparsification procedure in the OSKTD algorithm is O(n). In addition, two typical experiments (Maze and Mountain Car) are used to compare with both traditional and up-to-date O(n) algorithms (GTD, GTD2, and TDC using the kernel-based value function), and the results demonstrate the effectiveness of our proposed algorithm. In the Maze problem, OSKTD converges to an optimal policy and converges faster than both traditional and up-to-date algorithms. In the Mountain Car problem, OSKTD converges, requires less computation time compared with other sparsification methods, reaches a better local optimum than the traditional algorithms, and converges much faster than the up-to-date algorithms. In addition, OSKTD can reach a competitive final optimum compared with the up-to-date algorithms.
Index Terms — Function approximation, online sparsification, reinforcement learning (RL), selective ensemble learning,
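The kernel distance-based sparsification credited above with O(n) complexity admits a very small sketch: a sample joins the dictionary only if its kernel-induced distance to every stored entry exceeds a threshold, so each admission test is a single pass over the dictionary. The threshold, bandwidth, and data below are illustrative, not taken from the paper.

```python
import numpy as np

def rbf(x, y, sigma=0.5):
    """Gaussian kernel between two points of any (matching) dimension."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * sigma ** 2))

def sparsify(stream, threshold=0.5, sigma=0.5):
    """Kernel distance-based online sparsification: keep a sample only if its
    kernel distance to every dictionary entry exceeds the threshold.
    Each admission test is one pass over the dictionary, i.e. O(n) per sample."""
    dictionary = []
    for x in stream:
        # Squared kernel distance: k(x,x) - 2 k(x,d) + k(d,d) = 2(1 - k(x,d)) for RBF.
        if all(2 * (1 - rbf(x, d, sigma)) > threshold ** 2 for d in dictionary):
            dictionary.append(x)
    return dictionary

rng = np.random.default_rng(1)
stream = rng.uniform(-1, 1, size=(200, 2))
D = sparsify(stream)
print(len(D))    # far fewer dictionary entries than the 200 raw samples
```

The dictionary size depends on the threshold, not on the stream length, which is what keeps the downstream value-function approximator compact.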
Sample Complexity and Performance Bounds for Nonparametric Approximate Linear Programming
Abstract

Cited by 1 (0 self)
One of the most difficult tasks in value function approximation for Markov Decision Processes is finding an approximation architecture that is expressive enough to capture the important structure in the value function, while at the same time not overfitting the training samples. Recent results in nonparametric approximate linear programming (NP-ALP) have demonstrated that this can be done effectively using nothing more than a smoothness assumption on the value function. In this paper we extend these results to the case where samples come from real world transitions instead of the full Bellman equation, adding robustness to noise. In addition, we provide the first max-norm, finite sample performance guarantees for any form of ALP. NP-ALP is amenable to problems with large (multidimensional) or even infinite (continuous) action spaces, and does not require a model to select actions using the resulting approximate solution.
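The smoothness assumption NP-ALP relies on can be made concrete with a Lipschitz-style bound: the tightest value estimate at a new state that is consistent with sampled values and a Lipschitz constant L is a minimum over the sampled points. The sketch below is a generic illustration of that idea, not the paper's algorithm; the states, values, and constant are made up.

```python
import numpy as np

def lipschitz_value(x, X, V, L):
    """Nonparametric value estimate from smoothness alone: the tightest upper
    bound consistent with sampled values V(x_i) and Lipschitz constant L is
    min_i [ V(x_i) + L * d(x, x_i) ]."""
    d = np.abs(np.asarray(X) - x)          # distances to the sampled states
    return float(np.min(np.asarray(V) + L * d))

X = [0.0, 1.0, 2.0]      # sampled states (hypothetical)
V = [0.0, 0.5, 2.0]      # their estimated values (hypothetical)
print(lipschitz_value(1.5, X, V, L=1.0))   # -> 1.0, from the sample at x = 1.0
```

No basis functions are chosen anywhere: the "architecture" is just the samples plus the smoothness constant, which is what makes the approach nonparametric.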
Intelligent Autonomous Systems Machine Learning for Robot Grasping and Manipulation
2014
Abstract
Robotics as a technology has an incredible potential for improving our everyday lives. Robots could perform household chores, such as cleaning, cooking, and gardening, in order to give us more time for other pursuits. Robots could also be used to perform tasks in hazardous environments, such as turning off a valve in an emergency or safely sorting our more dangerous trash. However, all of these applications would require the robot to perform manipulation tasks with various objects. Today’s robots are used primarily for performing specialized tasks in controlled scenarios, such as manufacturing. The robots that are used in today’s applications are typically designed for a single purpose and they have been preprogrammed with all of the necessary task information. In contrast, a robot working in a more general environment will often be confronted with new objects and scenarios. Therefore, in order to reach their full potential as autonomous physical agents, robots must be capable of learning versatile manipulation skills for different objects and situations. Hence, we have worked on a variety of manipulation skills to improve those capabilities of robots, and the results have led to several new approaches, which are presented in this thesis. Learning manipulation skills is, however, an open problem with many challenges that still need to
Learning Transition Dynamics in MDPs with Online Regression and Greedy Feature Selection∗
Abstract
We present an approach to reinforcement learning in which the system dynamics are modelled using online linear regression between feature spaces, and a compact feature representation for the dynamics model is built incrementally using greedy feature selection. Candidate features are built online using kernels centred at datapoints as they are discovered. We implement the model learning method in a policy iteration scheme. The complexity of each policy iteration (feature learning, model learning, value estimation and policy improvement) is independent of the total amount of data observed, and only linear in the amount of new data added per iteration. The approach therefore scales up to complex problems requiring a huge amount of data to learn well. We validate the approach on benchmark MDPs and simulated quadrocopter navigation.
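The greedy selection loop described above (candidate features are kernels centred at observed data points; keep whichever most reduces the regression residual) can be sketched as follows. The target function, bandwidth, and feature budget are illustrative, and batch least squares stands in for the paper's online/incremental bookkeeping.

```python
import numpy as np

def rbf_feats(X, centres, sigma=0.3):
    """Kernel features: one RBF centred at each selected data point."""
    X, C = np.asarray(X), np.asarray(centres)
    return np.exp(-(X[:, None] - C[None, :]) ** 2 / (2 * sigma ** 2))

def greedy_select(X, y, k, sigma=0.3):
    """Greedy forward selection over kernels centred at the observed data:
    at each step keep the candidate centre that most reduces the residual."""
    remaining, chosen = list(np.asarray(X)), []
    for _ in range(k):
        errs = []
        for c in remaining:
            F = rbf_feats(X, chosen + [c], sigma)
            w = np.linalg.lstsq(F, y, rcond=None)[0]
            errs.append(np.sum((F @ w - y) ** 2))
        chosen.append(remaining.pop(int(np.argmin(errs))))
    return chosen

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, 40)
y = np.sin(3 * X)                       # toy regression target for the dynamics
centres = greedy_select(X, y, k=6)
F = rbf_feats(X, centres)
w = np.linalg.lstsq(F, y, rcond=None)[0]
print(np.mean((F @ w - y) ** 2))        # small residual from only 6 features
```

The representation stays compact because only the centres that pay for themselves in residual reduction survive, regardless of how much data streams past.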
Practical Kernel-Based Reinforcement Learning†
Abstract
Kernel-based reinforcement learning (KBRL) stands out among approximate reinforcement learning algorithms for its strong theoretical guarantees. By casting the learning problem as a local kernel approximation, KBRL provides a way of computing a decision policy which is statistically consistent and converges to a unique solution. Unfortunately, the model constructed by KBRL grows with the number of sample transitions, resulting in a computational cost that precludes its application to large-scale or online domains. In this paper we introduce an algorithm that turns KBRL into a practical reinforcement learning tool. Kernel-based stochastic factorization (KBSF) builds on a simple idea: when a transition probability matrix is represented as the product of two stochastic matrices, one can swap the factors of the multiplication to obtain another transition matrix, potentially much smaller than the original, which retains some fundamental properties of its precursor. KBSF exploits such an insight to compress the information contained in KBRL’s model into an approximator of fixed size. This makes it possible to build an approximation that takes into account both
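The factor-swapping idea at the heart of KBSF is easy to check numerically: if D (n×m) and K (m×n) are both row-stochastic, then both D·K (n×n) and the swapped product K·D (m×m) are valid transition matrices, and the latter can be far smaller. A toy verification, with random stochastic factors standing in for KBRL's kernel-derived ones:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 1000, 20                  # n sampled transitions, m representative states

def row_stochastic(shape):
    """Random matrix with non-negative entries whose rows each sum to 1."""
    M = rng.random(shape)
    return M / M.sum(axis=1, keepdims=True)

D = row_stochastic((n, m))       # e.g. normalised kernel weights, states -> reps
K = row_stochastic((m, n))       # and representatives -> states
P = D @ K                        # KBRL-style n x n transition matrix
P_small = K @ D                  # swapped factors: an m x m transition matrix

# Both products are row-stochastic; the swapped one is m x m instead of n x n.
print(P.shape, P_small.shape)
print(P_small.sum(axis=1))       # every row sums to 1
```

Row-stochasticity of the swapped product follows directly: each row of K·D sums to Σ_k K_ik (Σ_j D_kj) = Σ_k K_ik = 1, so planning can proceed in the compressed m-state model at fixed cost.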