Results 1-8 of 8
Off-policy reinforcement learning with Gaussian processes. Acta Automatica Sinica, 2014
Abstract

Cited by 2 (2 self)
An off-policy Bayesian nonparametric approximate reinforcement learning framework, termed GPQ, that employs a Gaussian process (GP) model of the value (Q) function is presented in both the batch and online settings. Sufficient conditions on GP hyperparameter selection are established to guarantee convergence of off-policy GPQ in the batch setting, and theoretical and practical extensions are provided for the online case. Empirical results demonstrate that GPQ has competitive learning speeds in addition to its convergence guarantees and its ability to automatically choose its own basis locations.
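As a rough illustration of the idea (not the authors' GPQ algorithm), the sketch below runs fitted Q iteration on a toy one-dimensional problem, using a hand-rolled squared-exponential GP posterior mean as the Q-function model over (state, action) pairs. The dynamics, reward, kernel length-scale, and all other names are invented for the example.

```python
import numpy as np

def rbf(X, Y, ls=0.5):
    # Squared-exponential kernel between row-stacked inputs.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gp_fit(X, y, noise=1e-2):
    # GP regression weights: alpha = (K + sigma^2 I)^{-1} y.
    K = rbf(X, X) + noise * np.eye(len(X))
    return np.linalg.solve(K, y)

def gp_mean(Xq, X, alpha):
    # Posterior mean at query points Xq.
    return rbf(Xq, X) @ alpha

# Toy batch setting: inputs are sampled (state, action) pairs.
rng = np.random.default_rng(0)
gamma, actions = 0.9, np.array([-1.0, 0.0, 1.0])
SA = rng.uniform(-1, 1, size=(40, 2))                 # sampled (s, a)
r = -SA[:, 0] ** 2                                    # reward from state only
s_next = np.clip(SA[:, 0] + 0.2 * SA[:, 1], -1, 1)    # simple dynamics

q = np.zeros(len(SA))
for _ in range(25):                                   # fitted Q iteration
    alpha = gp_fit(SA, q)
    Xq = np.array([[s, a] for s in s_next for a in actions])
    Qn = gp_mean(Xq, SA, alpha).reshape(len(s_next), len(actions))
    q = r + gamma * Qn.max(axis=1)                    # Bellman backup targets
```

Note that the sampled (s, a) inputs themselves serve as the GP's basis locations, which is the property the abstract highlights.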
Acknowledgments, 2014
Abstract
I hereby affirm that I have produced this Master's thesis without the assistance of third parties, using only the sources and aids indicated. All passages taken from sources are identified as such. This thesis has not previously been submitted in the same or a similar form to any examination authority.
Basis Adaptation for Sparse Nonlinear Reinforcement Learning
Abstract
This paper presents a new approach to representation discovery in reinforcement learning (RL) using basis adaptation. We introduce a general framework for basis adaptation as nonlinear separable least-squares value function approximation based on finding Fréchet gradients of an error function using variable projection functionals. We then present a scalable proximal gradient-based approach for basis adaptation using the recently proposed mirror-descent framework for RL. Unlike traditional temporal-difference (TD) methods for RL, mirror-descent RL methods undertake proximal gradient updates of weights in a dual space, which is linked to the primal space through a Legendre transform involving the gradient of a strongly convex function. Mirror-descent RL can be viewed as a proximal TD algorithm using a Bregman divergence as the distance-generating function. We present a new class of regularized proximal-gradient TD methods, which combine feature selection through sparse L1 regularization with basis adaptation. Experimental results are provided to illustrate and validate the approach.
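A minimal sketch of the sparse proximal-gradient TD idea described above, not the paper's full mirror-descent algorithm with basis adaptation: a TD(0) update followed by soft-thresholding, the proximal operator of the L1 penalty. The random-feature stream, step size, and regularization weight are invented for illustration.

```python
import numpy as np

def soft_threshold(w, t):
    # Proximal operator of t * ||w||_1: shrinks weights toward zero.
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

rng = np.random.default_rng(1)
n_feat, gamma, eta, lam = 10, 0.9, 0.02, 0.02
w = np.zeros(n_feat)

for _ in range(10000):
    phi = rng.normal(size=n_feat)        # features of the current state
    phi_next = rng.normal(size=n_feat)   # features of the next state
    r = phi[0] + 0.5 * phi[1]            # reward touches only two features
    delta = r + gamma * (phi_next @ w) - phi @ w          # TD error
    w = soft_threshold(w + eta * delta * phi, eta * lam)  # proximal TD step
```

With the reward depending on only two of the ten features, the thresholding step keeps the remaining weights near zero, which is the feature-selection effect the abstract refers to.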
Finite-Sample Analysis of Proximal Gradient TD Algorithms
Abstract
In this paper, we show for the first time how gradient TD (GTD) reinforcement learning methods can be formally derived as true stochastic gradient algorithms, not with respect to their original objective functions as previously attempted, but rather using derived primal-dual saddle-point objective functions. We then conduct a saddle-point error analysis to obtain finite-sample bounds on their performance. Previous analyses of this class of algorithms use stochastic approximation techniques to prove asymptotic convergence; no finite-sample analysis had been attempted. Two novel GTD algorithms are also proposed, namely projected GTD2 and GTD2-MP, which use proximal "mirror maps" to yield improved convergence guarantees and acceleration, respectively. The results of our theoretical analysis imply that the GTD family of algorithms is comparable to, and may indeed be preferred over, existing least-squares TD methods for off-policy learning, due to their linear complexity. We provide experimental results showing the improved performance of our accelerated gradient TD methods.
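The standard GTD2 updates give a concrete picture of the primal-dual structure: a fast ascent-like step on an auxiliary dual vector `w` and a slower step on the primal weights `theta`. This is a generic textbook-style sketch on a synthetic off-policy stream, not the projected GTD2 or GTD2-MP variants proposed in the paper; all problem parameters are invented.

```python
import numpy as np

rng = np.random.default_rng(2)
n, gamma = 8, 0.9
a_step, b_step = 0.01, 0.05    # slow primal step, faster dual step
theta = np.zeros(n)            # primal: value-function weights
w = np.zeros(n)                # dual: tracks the projected TD error

for _ in range(20000):
    phi = rng.normal(size=n)             # current-state features
    phi_next = rng.normal(size=n)        # next-state features
    r = phi[0]                           # reward depends on one feature
    delta = r + gamma * (phi_next @ theta) - phi @ theta
    w += b_step * (delta - phi @ w) * phi                   # dual ascent-like step
    theta += a_step * (phi - gamma * phi_next) * (phi @ w)  # primal descent step
```

Both updates cost O(n) per sample, which is the linear complexity the abstract contrasts with least-squares TD methods.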
Self-Supervised Online Metric Learning With Low Rank Constraint for Scene Categorization, 2012
Abstract
Conventional visual recognition systems usually train an image classifier in a batch mode, with all training data provided in advance. However, in many practical applications, only a small amount of training samples are available at the beginning, and many more arrive sequentially during online recognition. Because the characteristics of the image data can change over time, it is important for the classifier to adapt to new data incrementally. In this paper, we present an online metric learning method that addresses the online scene recognition problem via adaptive similarity measurement. Given a number of labeled samples followed by a sequential input of unseen test samples, the similarity metric is learned to maximize the margin of the distance between samples of different classes. By incorporating a low-rank constraint, our online metric learning model not only provides competitive performance compared with state-of-the-art methods but also guarantees convergence. A bilinear graph is also defined to model pairwise similarity: an unseen sample is labeled by graph-based label propagation, and the model self-updates using the more confident new samples. With its online learning ability, our method can handle large-scale streaming video data through incremental self-updating. We apply our model to online scene categorization; experiments on various benchmark datasets and comparisons with state-of-the-art methods demonstrate the effectiveness and efficiency of our algorithm. Index Terms — Low rank, online learning, metric learning, semi-supervised learning, scene categorization
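One simple way to realize a low-rank online metric update (a hedged sketch, not the paper's algorithm) is to parametrize the metric as M = L L^T with a d x k factor L, so the rank constraint holds by construction, and take hinge-style gradient steps on streamed pairs. The data model, margin, and step size below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, lr, margin = 6, 2, 0.01, 1.0
L = 0.3 * rng.normal(size=(d, k))   # metric M = L @ L.T has rank <= k

def dist2(x, y, L):
    # Squared Mahalanobis distance under M = L L^T.
    z = (x - y) @ L
    return float(z @ z)

for _ in range(2000):
    x = rng.normal(size=d)
    same = rng.random() < 0.5
    # same-class pairs are small perturbations; different-class pairs are fresh draws
    y = x + 0.1 * rng.normal(size=d) if same else rng.normal(size=d)
    grad = 2.0 * np.outer(x - y, (x - y) @ L)    # gradient of dist2 w.r.t. L
    if same and dist2(x, y, L) > margin:
        L -= lr * grad                           # pull similar pairs inside the margin
    elif not same and dist2(x, y, L) < margin:
        L += lr * grad                           # push dissimilar pairs outside it
```

Because only the d x k factor is ever updated, the low-rank constraint never has to be enforced by a separate projection step.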
Self-Supervised Online Metric Learning with Low Rank Constraint for Scene Categorization
Abstract
© 2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at:
An Analysis of State-Relevance Weights and Sampling Distributions on L1 Regularized Approximate Linear Programming Approximation Accuracy
Abstract
Recent interest in the use of L1 regularization in value function approximation includes Petrik et al.'s introduction of L1 Regularized Approximate Linear Programming (RALP). RALP is unique among L1-regularized approaches in that it approximates the optimal value function using off-policy samples. Additionally, it produces policies that outperform those of previous methods, such as LSPI. RALP's value function approximation quality is affected heavily by the choice of state-relevance weights in the objective function of the linear program and by the distribution from which samples are drawn; however, there has been no discussion of these considerations in the previous literature. In this paper, we discuss and explain the effects of the choice of state-relevance weights and sampling distribution on approximation quality, using both theoretical and experimental illustrations. The results provide insight into these effects and intuition into the types of MDPs that are especially well suited to approximation with RALP.
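A tiny tabular instance makes the role of the state-relevance weights concrete. The sketch below, which assumes `scipy` is available, solves an RALP-style linear program for a two-state deterministic chain: minimize the c-weighted values subject to sampled Bellman inequalities and an L1 budget, with the weight vector split as w = u - v to keep the program linear. The MDP, the weights `c`, and the budget `psi` are invented for the example.

```python
import numpy as np
from scipy.optimize import linprog

# Two-state deterministic chain: state 0 pays reward 1 and moves to state 1,
# state 1 pays 0 and moves back to state 0. Identity (tabular) features.
gamma, psi = 0.9, 20.0                    # discount, L1 budget
samples = [(0, 1.0, 1), (1, 0.0, 0)]      # sampled (s, r, s') transitions
Phi = np.eye(2)
c = np.array([0.6, 0.4])                  # state-relevance weights

# Split w = u - v with u, v >= 0 so that ||w||_1 = sum(u + v) is linear.
A_ub, b_ub = [], []
for s, r, s2 in samples:                  # Bellman constraints: Phi(s)w - gamma Phi(s')w >= r
    row = Phi[s] - gamma * Phi[s2]
    A_ub.append(np.concatenate([-row, row]))
    b_ub.append(-r)
A_ub.append(np.ones(4))                   # L1 budget: sum(u + v) <= psi
b_ub.append(psi)

obj = np.concatenate([c @ Phi, -(c @ Phi)])   # minimize c^T Phi w
res = linprog(obj, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=(0, None))
w = res.x[:2] - res.x[2:]                 # approximate values V(0), V(1)
```

Here the budget is slack, so the program returns the exact values V(0) = 1/(1 - 0.9^2) and V(1) = 0.9 V(0); shrinking `psi` below the L1 norm of the true values forces a sparse approximation, and changing `c` shifts where the resulting error is tolerated.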
Dantzig Selector with an Approximately Optimal Denoising Matrix and its Application in Sparse Reinforcement Learning
Abstract
The Dantzig Selector (DS) is widely used in compressed sensing and sparse learning for feature selection and sparse signal recovery. Since the DS formulation is essentially a linear programming optimization, many existing linear programming solvers can be applied directly for scaling up. The DS formulation can be interpreted as a basis pursuit denoising problem, wherein the data matrix (or measurement matrix) is employed as the denoising matrix to eliminate the observation noise. However, we notice that the data matrix may not be the optimal denoising matrix, as shown by a simple counterexample. This motivates us to pursue a better denoising matrix for defining a general DS formulation. We first define the optimal denoising matrix through a minimax optimization, which turns out to be an NP-hard problem. To make the problem computationally tractable, we propose a novel algorithm, termed the "Optimal" Denoising Dantzig Selector (ODDS), to approximately estimate the optimal denoising matrix. Empirical experiments validate the proposed method. Finally, a novel sparse reinforcement learning algorithm is formulated by extending the proposed ODDS algorithm to temporal difference learning, and empirical results demonstrate that it outperforms the conventional "vanilla" DS-TD algorithm.
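For reference, the classic DS that the paper takes as its baseline, with the data matrix itself as the denoising matrix D, can be written as a linear program (a sketch assuming `scipy` is available; the problem sizes, noise level, and `lam` are invented):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
n, p, lam = 60, 8, 1.0
X = rng.normal(size=(n, p))
w_true = np.zeros(p)
w_true[:2] = [2.0, -1.5]                   # sparse ground truth
y = X @ w_true + 0.1 * rng.normal(size=n)
D = X                                      # classic DS: denoising matrix = data matrix

# min ||w||_1  s.t.  ||D^T (y - X w)||_inf <= lam, with w = u - v, u, v >= 0
G = D.T @ X
rhs = D.T @ y
A_ub = np.vstack([np.hstack([G, -G]),      #  G w - rhs <= lam
                  np.hstack([-G, G])])     # -G w + rhs <= lam
b_ub = np.concatenate([lam + rhs, lam - rhs])
res = linprog(np.ones(2 * p), A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
w = res.x[:p] - res.x[p:]
```

Swapping a different matrix into `D` changes only the constraint rows, which is what makes the paper's search for a better denoising matrix a drop-in generalization of this formulation.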