Results 1–10 of 14
Kernel-based least squares policy iteration for reinforcement learning
IEEE Transactions on Neural Networks, 2007
Cited by 33 (0 self)
Abstract
Abstract—In this paper, we present a kernel-based least squares policy iteration (KLSPI) algorithm for reinforcement learning (RL) in large or continuous state spaces, which can be used to realize adaptive feedback control of uncertain dynamic systems. By using KLSPI, near-optimal control policies can be obtained without much a priori knowledge of the dynamic models of control plants. In KLSPI, Mercer kernels are used in the policy evaluation of a policy iteration process, where a new kernel-based least squares temporal-difference algorithm called KLSTD-Q is proposed for efficient policy evaluation. To keep the sparsity and improve the generalization ability of KLSTD-Q solutions, a kernel sparsification procedure based on approximate linear dependency (ALD) is performed. Compared to previous work on approximate RL methods, KLSPI makes two advances that address the main difficulties of existing approaches. One is the better convergence and (near) optimality guarantee obtained by using the KLSTD-Q algorithm for high-precision policy evaluation. The other is automatic feature selection via ALD-based kernel sparsification. The KLSPI algorithm therefore provides a general RL method with generalization performance and a convergence guarantee for large-scale Markov decision problems (MDPs). Experimental results on a typical RL task, a stochastic chain problem, demonstrate that KLSPI consistently achieves better learning efficiency and policy quality than the previous least squares policy iteration (LSPI) algorithm. Furthermore, the KLSPI method was also evaluated on two nonlinear feedback control problems: a ship heading control problem and the swing-up control of a double-link underactuated pendulum called the acrobot. Simulation results illustrate that the proposed method can optimize controller performance using little a priori information about uncertain dynamic systems.
It is also demonstrated that KLSPI can be applied to online learning control by incorporating an initial controller to ensure online performance. Index Terms—Approximate dynamic programming, kernel methods, least squares, Markov decision problems (MDPs), reinforcement learning (RL).
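To make the ALD-based sparsification step concrete, here is a minimal Python sketch; the Gaussian kernel, the threshold `nu`, and all function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # RBF kernel between two state vectors (an assumed kernel choice)
    d = x - y
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

def ald_sparsify(samples, nu=0.1, sigma=1.0):
    """Select a sparse dictionary via the approximate-linear-dependency test:
    a sample is added only if it cannot be reconstructed, in feature space,
    from the current dictionary to within tolerance nu."""
    dictionary = [samples[0]]
    K_inv = np.array([[1.0 / gaussian_kernel(samples[0], samples[0], sigma)]])
    for x in samples[1:]:
        k_vec = np.array([gaussian_kernel(x, d, sigma) for d in dictionary])
        a = K_inv @ k_vec                       # best reconstruction weights
        delta = gaussian_kernel(x, x, sigma) - k_vec @ a
        if delta > nu:                          # x is poorly represented: add it
            dictionary.append(np.array(x, dtype=float))
            # grow the inverse Gram matrix with the block-inverse update
            K_inv = np.block([
                [K_inv + np.outer(a, a) / delta, -a[:, None] / delta],
                [-a[None, :] / delta, np.array([[1.0 / delta]])],
            ])
    return dictionary
```

The block-inverse update keeps the dependency test at quadratic cost in the current dictionary size, which is what makes the procedure practical inside a policy-iteration loop.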
A unifying framework for computational reinforcement learning theory
2009
Cited by 23 (7 self)
Abstract
Computational learning theory studies mathematical models that allow one to formally analyze and compare the performance of supervised-learning algorithms, for example their sample complexity. While existing models such as PAC (Probably Approximately Correct) have played an influential role in understanding the nature of supervised learning, they have not been as successful in reinforcement learning (RL). Here, the fundamental barrier is the need for active exploration in sequential decision problems. An RL agent tries to maximize long-term utility by exploiting its knowledge about the problem, but this knowledge has to be acquired by the agent itself through exploring the problem, which may reduce short-term utility. The need for active exploration is common in many problems in daily life, engineering, and the sciences. For example, a Backgammon program strives to make good moves to maximize the probability of winning a game, but sometimes it may try novel and possibly harmful moves to discover how the opponent reacts, in the hope of discovering a better game-playing strategy. It has been known since the early days of RL that a good tradeoff between exploration and exploitation is critical for the agent to learn fast (i.e., to reach near-optimal strategies ...
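The exploration–exploitation tradeoff described above can be illustrated with a minimal epsilon-greedy sketch on a two-armed bandit; this toy is not from the paper, and all names and parameters are assumptions:

```python
import numpy as np

def epsilon_greedy_bandit(means, steps=5000, eps=0.1, seed=0):
    """Epsilon-greedy on a Bernoulli bandit: exploit the best empirical arm,
    but explore a uniformly random arm with probability eps."""
    rng = np.random.default_rng(seed)
    n = len(means)
    counts, values = np.zeros(n), np.zeros(n)
    for _ in range(steps):
        if rng.random() < eps:
            a = int(rng.integers(n))            # explore: random arm
        else:
            a = int(np.argmax(values))          # exploit: best estimate so far
        r = float(rng.random() < means[a])      # Bernoulli reward draw
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]  # incremental mean update
    return counts, values
```

With eps = 0, the agent can lock onto a suboptimal arm forever; even this small amount of forced exploration lets the empirical estimates converge to the true arm means.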
Modelling transition dynamics in MDPs with RKHS embeddings
arXiv, 2012
Cited by 20 (9 self)
Abstract
We propose a new, nonparametric approach to learning and representing transition dynamics in Markov decision processes (MDPs), which can be combined easily with dynamic programming methods for policy optimisation and value estimation. This approach makes use of a recently developed representation of conditional distributions as embeddings in a reproducing kernel Hilbert space (RKHS). Such representations bypass the need for estimating transition probabilities or densities, and apply to any domain on which kernels can be defined. This avoids the need to calculate intractable integrals, since expectations are represented as RKHS inner products whose computation has linear complexity in the number of points used to represent the embedding. We provide guarantees for the proposed applications in MDPs: in the context of a value iteration algorithm, we prove convergence to either the optimal policy, or to the closest projection of the optimal policy in our model class (an RKHS), under reasonable assumptions. In experiments, we investigate a learning task in a typical classical control setting (the underactuated pendulum), and on a navigation problem where only images from a sensor are observed. For policy optimisation we compare with least-squares policy iteration where a Gaussian process is used for value function estimation. For value estimation we also compare to the NPDP method. Our approach achieves better performance in all experiments.
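A minimal sketch of how an empirical conditional-mean embedding replaces the intractable expectation with an RKHS inner product, assuming a Gaussian kernel on a one-dimensional state space and ridge regularization (names and parameters are illustrative, not the paper's code):

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    # Gaussian kernel, vectorized over NumPy arrays (assumed kernel choice)
    return np.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))

def embedding_weights(s_query, S, lam=1e-2, sigma=1.0):
    """Weights beta(s) of the empirical conditional mean embedding,
    estimated from source states S with ridge parameter lam."""
    K = rbf(S[:, None], S[None, :], sigma)       # Gram matrix on source states
    k = rbf(S, s_query, sigma)                   # kernel vector at the query
    return np.linalg.solve(K + lam * np.eye(len(S)), k)

def expected_next_value(s_query, S, S_next, V, lam=1e-2, sigma=1.0):
    """E[V(s') | s = s_query] approximated as the inner product
    beta(s)^T V(s'_i) over observed transitions (S[i] -> S_next[i])."""
    beta = embedding_weights(s_query, S, lam, sigma)
    return float(beta @ V(S_next))
```

No transition density is ever estimated: the expectation costs one linear solve plus an inner product over the sample points, matching the linear-in-samples evaluation the abstract describes.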
A Non-Parametric Approach to Dynamic Programming
Cited by 9 (1 self)
Abstract
In this paper, we consider the problem of policy evaluation for continuous-state systems. We present a non-parametric approach to policy evaluation, which uses kernel density estimation to represent the system. The true form of the value function for this model can be determined, and can be computed using Galerkin’s method. Furthermore, we also present a unified view of several well-known policy evaluation methods. In particular, we show that the same Galerkin method can be used to derive Least-Squares Temporal Difference learning, Kernelized Temporal Difference learning, and a discrete-state Dynamic Programming solution, as well as our proposed method. In a numerical evaluation of these algorithms, the proposed approach performed better than the other methods.
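Since the abstract positions Least-Squares Temporal Difference learning as one instance of the same Galerkin construction, a minimal LSTD sketch may help; the feature map, regularization, and data layout are assumptions for illustration:

```python
import numpy as np

def lstd(features, transitions, gamma=0.95, reg=1e-6):
    """Least-Squares Temporal Difference learning for a linear value function
    V(s) = w^T phi(s). Solves A w = b with
      A = sum_t phi(s_t) (phi(s_t) - gamma * phi(s_{t+1}))^T,
      b = sum_t phi(s_t) * r_t,
    where transitions is a list of (s, r, s_next) tuples and a small ridge
    term reg keeps A invertible."""
    d = len(features(transitions[0][0]))
    A = reg * np.eye(d)
    b = np.zeros(d)
    for s, r, s_next in transitions:
        phi, phi_next = features(s), features(s_next)
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * r
    return np.linalg.solve(A, b)
```

With one-hot (tabular) features this recovers the exact value function of a small Markov reward process, which is a convenient sanity check for any implementation.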
Stochastic Kernel Temporal Difference for Reinforcement Learning
IEEE International Workshop on Machine Learning for Signal Processing
Cited by 4 (3 self)
Abstract
This paper introduces a kernel adaptive filter using the stochastic gradient on temporal differences, kernel TD(λ), to estimate the state-action value function Q in reinforcement learning. Kernel methods are powerful for solving nonlinear problems, but their growing computational complexity and memory requirements limit their applicability in practical scenarios. To overcome this, the quantization approach introduced in [1] is applied. To help understand the behavior and illustrate the role of the parameters, we apply the algorithm to a two-dimensional spatial navigation task. Eligibility traces are commonly applied in TD learning to improve data efficiency, so the interplay between the eligibility-trace parameter λ, the step size, and the filter size is examined. Moreover, kernel TD(0) is applied to neural decoding of an eight-target center-out reaching task performed by a monkey. Results show the method can effectively learn the brain-state-to-action mapping for this task. Index Terms—Temporal difference learning, kernel methods, reinforcement learning, adaptive filtering.
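A rough sketch of a quantized kernel TD(0) update in the spirit described above: states that land within a quantization radius of an existing kernel centre update that centre's coefficient instead of growing the expansion. The Gaussian kernel, the radius `eps`, and the class interface are assumptions, not the authors' code:

```python
import numpy as np

class QuantizedKernelTD:
    """Kernel TD(0) value estimator with quantization to bound memory growth."""

    def __init__(self, step=0.5, gamma=0.9, sigma=1.0, eps=0.1):
        self.centers, self.coeffs = [], []
        self.step, self.gamma, self.sigma, self.eps = step, gamma, sigma, eps

    def value(self, s):
        # V(s) = sum_i c_i * k(s, z_i) over the retained centres
        return sum(c * np.exp(-np.sum((s - z) ** 2) / (2 * self.sigma ** 2))
                   for c, z in zip(self.coeffs, self.centers))

    def update(self, s, r, s_next):
        td = r + self.gamma * self.value(s_next) - self.value(s)
        if self.centers:
            dists = [np.linalg.norm(s - z) for z in self.centers]
            j = int(np.argmin(dists))
            if dists[j] <= self.eps:            # quantize: reuse nearest centre
                self.coeffs[j] += self.step * td
                return
        self.centers.append(np.array(s, dtype=float))   # otherwise grow
        self.coeffs.append(self.step * td)
```

Repeated visits to (nearly) the same state then cost no extra memory, which is the point of the quantization approach the abstract cites.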
Reinforcement learning and dynamic programming using function approximators
Abstract
Preface. Control systems are making a tremendous impact on our society. Though invisible to most users, they are essential for the operation of nearly all devices – from basic home appliances to aircraft and nuclear power plants. Apart from technical systems, the principles of control are routinely applied and exploited in a variety of disciplines such as economics, medicine, social sciences, and artificial intelligence. A common denominator in the diverse applications of control is the need to influence or modify the behavior of dynamic systems to attain pre-specified goals. One approach to achieve this is to assign a numerical performance index to each state trajectory of the system. The control problem is then solved by searching for a control policy that drives the system along trajectories corresponding to the best value of the performance index. This approach essentially reduces the problem of finding good control policies to the search for solutions of a mathematical optimization problem.
Robust Data-Driven Dynamic Programming
Abstract
In stochastic optimal control the distribution of the exogenous noise is typically unknown and must be inferred from limited data before solution schemes based on dynamic programming (DP) can be applied. If the conditional expectations in the DP recursions are estimated via kernel regression, however, the historical sample paths enter the solution procedure directly, as they determine the evaluation points of the cost-to-go functions. The resulting data-driven DP scheme is asymptotically consistent and admits an efficient computational solution when combined with parametric value function approximations. If training data are sparse, however, the estimated cost-to-go functions display high variability and an optimistic bias, while the corresponding control policies perform poorly in out-of-sample tests. To mitigate these small-sample effects, we propose a robust data-driven DP scheme, which replaces the expectations in the DP recursions with worst-case expectations over a set of distributions close to the best estimate. We show that the resulting min-max problems in the DP recursions reduce to tractable conic programs. We also demonstrate that the proposed robust DP algorithm dominates various non-robust schemes in out-of-sample tests across several application domains.
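The worst-case expectation step can be illustrated for the simplest ambiguity set, a total-variation ball around the nominal weights, where the inner maximization has a greedy closed form. The conic reformulations in the paper cover far more general sets; this toy version is an assumption for illustration only:

```python
import numpy as np

def worst_case_expectation(values, p_nominal, radius):
    """Largest expected cost over distributions q within total-variation
    distance `radius` of the nominal weights p_nominal.
    Greedy optimum: move up to `radius` of probability mass from the
    cheapest outcomes onto the most expensive one."""
    v = np.asarray(values, dtype=float)
    q = np.asarray(p_nominal, dtype=float).copy()
    budget = radius                      # mass we are allowed to relocate
    worst = int(np.argmax(v))            # the outcome an adversary inflates
    for i in np.argsort(v):              # drain the cheapest outcomes first
        if i == worst or budget <= 0:
            continue
        take = min(q[i], budget)
        q[i] -= take
        q[worst] += take
        budget -= take
    return float(q @ v)
```

At radius 0 this reduces to the nominal expectation; as the radius grows, the estimate interpolates toward the worst single outcome, which is exactly the pessimism that counteracts the optimistic bias described above.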
Kernel Temporal Differences for Reinforcement Learning with Applications to Brain Machine Interfaces
2013
Value Function Approximation through Sparse Bayesian Modeling
Abstract
In this study we present a sparse Bayesian framework for value function approximation. The proposed method is based on the online construction of a dictionary of states, which are collected during the agent's exploration of the environment. A linear regression model is established for the observed partial discounted returns of such dictionary states, where we employ the Relevance Vector Machine (RVM) and exploit its enhanced modeling capability due to its embedded sparsity properties. In order to speed up the optimization procedure and allow dealing with large-scale problems, an incremental strategy is adopted. A number of experiments have been conducted on both simulated and real environments, where we obtained promising results in comparison with another Bayesian approach that uses Gaussian processes.
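As a simplified stand-in for the RVM regression on partial discounted returns, one might sketch a plain Bayesian ridge over kernel features on the dictionary states; this deliberately omits the ARD pruning that gives the RVM its sparsity, and every name and parameter here is an assumption:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.95):
    """Monte-Carlo discounted return observed from each visited state."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return np.array(out[::-1])

def bayesian_value_fit(states, returns, sigma=0.5, alpha=1.0, noise=0.1):
    """Posterior-mean weights for V(s) = sum_i w_i k(s, s_i), using a fixed
    isotropic Gaussian prior (precision alpha) and Gaussian noise: a plain
    Bayesian ridge stand-in for the paper's RVM."""
    S = np.asarray(states, dtype=float)
    Phi = np.exp(-(S[:, None] - S[None, :]) ** 2 / (2 * sigma ** 2))
    A = alpha * np.eye(len(S)) + Phi.T @ Phi / noise ** 2   # posterior precision
    w = np.linalg.solve(A, Phi.T @ returns / noise ** 2)    # posterior mean
    return lambda s: float(np.exp(-(s - S) ** 2 / (2 * sigma ** 2)) @ w)
```

The RVM would replace the single prior precision `alpha` with one precision per weight and iteratively re-estimate them, driving most weights to zero and keeping only the relevant dictionary states.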