Results 1–10 of 50
A Natural Policy Gradient
Cited by 106 (0 self)
We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris.
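The core update this abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's method in full: it assumes a softmax policy over three actions with a made-up reward vector, computes the Fisher information exactly from the policy, and uses a pseudo-inverse to handle the Fisher matrix's rank deficiency.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_log_pi(theta, a):
    # Score function of a softmax policy: one_hot(a) - pi
    g = -softmax(theta)
    g[a] += 1.0
    return g

theta = np.zeros(3)
pi = softmax(theta)
r = np.array([1.0, 0.0, 0.0])  # hypothetical per-action reward

# Fisher information F = E_a[ grad log pi . grad log pi^T ]
F = sum(p * np.outer(grad_log_pi(theta, a), grad_log_pi(theta, a))
        for a, p in enumerate(pi))

# "Vanilla" policy gradient of J = sum_a pi(a) r(a)
g = sum(p * r[a] * grad_log_pi(theta, a) for a, p in enumerate(pi))

# Natural gradient: precondition by F^{-1} (pseudo-inverse: F is singular)
g_nat = np.linalg.pinv(F) @ g
theta_new = theta + 0.5 * g_nat
```

The natural-gradient step moves probability mass toward the greedy action (action 0 here) rather than merely in a direction of local improvement.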
Policy gradient methods for robotics
 In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS
, 2006
Cited by 79 (19 self)
Abstract — The acquisition and improvement of motor skills and control policies for robotics from trial and error is of essential importance if robots are ever to leave precisely prestructured environments. However, to date only a few existing reinforcement learning methods have been scaled to the domains of high-dimensional robots such as manipulator, legged, or humanoid robots. Policy gradient methods remain one of the few exceptions and have found a variety of applications. Nevertheless, the application of such methods is not without peril if done in an uninformed manner. In this paper, we give an overview of learning with policy gradient methods for robotics, with a strong focus on recent advances in the field. We outline previous applications to robotics and show how the most recently developed methods can significantly improve learning performance. Finally, we evaluate our most promising algorithm in the application of hitting a baseball with an anthropomorphic arm.
Reinforcement Learning in POMDP's via Direct Gradient Ascent
 In Proc. 17th International Conf. on Machine Learning
, 2000
Cited by 63 (2 self)
This paper discusses theoretical and experimental aspects of gradient-based approaches to the direct optimization of policy performance in controlled POMDPs. We introduce GPOMDP, a REINFORCE-like algorithm for estimating an approximation to the gradient of the average reward as a function of the parameters of a stochastic policy. The algorithm's chief advantages are that it requires only a single sample path of the underlying Markov chain, it uses only one free parameter β ∈ [0, 1), which has a natural interpretation in terms of the bias-variance tradeoff, and it requires no knowledge of the underlying state. We prove convergence of GPOMDP and show how the gradient estimates produced by GPOMDP can be used in a conjugate-gradient procedure to find local optima of the average reward. 1. Introduction: "Reinforcement learning" is used to describe the general problem of training an agent to choose its actions so as to increase its long-term average reward. The structure of th...
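The estimator described above can be sketched as follows. This is an illustrative toy, not the paper's experimental setup: a one-parameter Bernoulli policy, made-up dynamics and reward, with `beta` playing the role of the free parameter in [0, 1). The eligibility trace accumulates discounted score-function terms, and a running average of reward-weighted traces approximates the average-reward gradient.

```python
import numpy as np

rng = np.random.default_rng(1)

def policy_prob(theta):
    return 1.0 / (1.0 + np.exp(-theta))  # P(action = 1), logistic policy

def grad_log_pi(theta, a):
    return a - policy_prob(theta)        # score of a Bernoulli policy

def step(state, action):
    # Toy dynamics/reward, purely illustrative
    reward = 1.0 if action == 1 else 0.0
    return int(rng.integers(0, 2)), reward

theta, beta = 0.0, 0.9   # beta in [0, 1): bias-variance knob
z, delta, state = 0.0, 0.0, 0
for t in range(1, 5001):
    a = int(rng.random() < policy_prob(theta))
    state, r = step(state, a)
    z = beta * z + grad_log_pi(theta, a)  # eligibility trace
    delta += (r * z - delta) / t          # running average of r_t * z_t
# delta now approximates d(average reward)/d(theta)
```

Here increasing theta raises the chance of the rewarded action, so the estimate should come out positive.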
Covariant Policy Search
, 2003
Cited by 48 (4 self)
We investigate the problem of non-covariant behavior of policy gradient reinforcement learning algorithms.
Learning to Trade via Direct Reinforcement
, 2001
Cited by 35 (1 self)
We present methods for optimizing portfolios, asset allocations, and trading systems based on direct reinforcement (DR). In this approach, investment decision making is viewed as a stochastic control problem, and strategies are discovered directly. We present an adaptive algorithm called recurrent reinforcement learning (RRL) for discovering investment policies. The need to build forecasting models is eliminated, and better trading performance is obtained. The direct reinforcement approach differs from dynamic programming and reinforcement algorithms such as TD-learning and Q-learning, which attempt to estimate a value function for the control problem. We find that the RRL direct reinforcement framework enables a simpler problem representation, avoids Bellman's curse of dimensionality, and offers compelling advantages in efficiency. We demonstrate how direct reinforcement can be used to optimize risk-adjusted investment returns (including the differential Sharpe ratio), while accounting for the effects of transaction costs. In extensive simulation work using real financial data, we find that our approach based on RRL produces better trading strategies than systems utilizing Q-learning (a value function method). Real-world applications include an intra-daily currency trader and a monthly asset allocation system for the S&P 500 Stock Index and T-Bills.
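The differential Sharpe ratio mentioned above is an incremental performance measure: it expresses the marginal effect of the latest return on an exponentially weighted Sharpe ratio, using running estimates A and B of the first and second moments of returns. A minimal sketch, with illustrative initial values and adaptation rate:

```python
def differential_sharpe(returns, eta=0.01, A0=0.0, B0=1.0):
    """Differential Sharpe ratio per return: for each R_t, compute
    D_t = (B * dA - 0.5 * A * dB) / (B - A^2)^(3/2),
    where dA = R_t - A, dB = R_t^2 - B, then update the moment
    estimates A and B with adaptation rate eta."""
    A, B, out = A0, B0, []
    for R in returns:
        dA, dB = R - A, R * R - B
        denom = (B - A * A) ** 1.5
        D = (B * dA - 0.5 * A * dB) / denom if denom > 1e-12 else 0.0
        out.append(D)
        A += eta * dA
        B += eta * dB
    return out
```

Because D_t is differentiable in the return, and the return is differentiable in the trading decision, it can serve directly as an online objective for gradient-based policy updates as in RRL.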
Binet-Cauchy kernels on dynamical systems and its application to the analysis of dynamic scenes
 International Journal of Computer Vision
, 2005
Cited by 32 (12 self)
Abstract. We derive a family of kernels on dynamical systems by applying the Binet-Cauchy theorem to trajectories of states. Our derivation provides a unifying framework for all kernels on dynamical systems currently used in machine learning, including kernels derived from the behavioral framework, diffusion processes, marginalized kernels, kernels on graphs, and the kernels on sets arising from the subspace angle approach. In the case of linear time-invariant systems, we derive explicit formulae for computing the proposed Binet-Cauchy kernels by solving Sylvester equations, and relate the proposed kernels to existing kernels based on cepstrum coefficients and subspace angles. Besides their theoretical appeal, these kernels can be used efficiently in the comparison of video sequences of dynamic scenes that can be modeled as the output of a linear time-invariant dynamical system. One advantage of our kernels is that they take the initial conditions of the dynamical systems into account. As a first example, we use our kernels to compare video sequences of dynamic textures. As a second example, we apply our kernels to the problem of clustering short clips of a movie. Experimental evidence shows superior performance of our kernels. Keywords: Binet-Cauchy theorem, ARMA models and dynamical systems, Sylvester
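One member of this family, the trace kernel between two noise-free LTI systems x_{t+1} = A x_t, y_t = C x_t, can be sketched directly from the abstract's description. Under the assumption of an exponential discount exp(-lam) and small state dimensions, the Sylvester-type equation M = exp(-lam) A1ᵀ M A2 + C1ᵀ C2 can be solved by Kronecker vectorization; the kernel then incorporates the initial states:

```python
import numpy as np

def binet_cauchy_trace_kernel(A1, C1, x1, A2, C2, x2, lam=0.5):
    """Trace kernel k = sum_t exp(-lam*t) y1_t . y2_t = x1^T M x2,
    where M solves M = exp(-lam) * A1^T M A2 + C1^T C2.
    Solved by Kronecker vectorization: vec(A^T M B) = (B^T kron A^T) vec(M).
    Requires exp(-lam) * rho(A1) * rho(A2) < 1 for convergence."""
    n1, n2 = A1.shape[0], A2.shape[0]
    e = np.exp(-lam)
    Q = C1.T @ C2
    K = np.eye(n1 * n2) - e * np.kron(A2.T, A1.T)
    M = np.linalg.solve(K, Q.reshape(-1, order="F")).reshape(n1, n2, order="F")
    return float(x1 @ M @ x2)
```

For scalar systems this reduces to the geometric series c1*c2 / (1 - exp(-lam)*a1*a2), which gives an easy sanity check; for larger state dimensions a proper Sylvester solver would replace the O(n^6) Kronecker solve.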
Exploration in Metric State Spaces
, 2003
Cited by 31 (2 self)
We present metric-E³, a provably near-optimal algorithm for reinforcement learning in Markov decision processes in which there is a natural metric on the state space that allows the construction of accurate local models. The algorithm is a generalization of the E³ algorithm of Kearns and Singh, and assumes a black box for approximate planning. Unlike the original E³, metric-E³ finds a near-optimal policy in an amount of time that does not directly depend on the size of the state space, but instead depends on the covering number of the state space. Informally, the covering number is the number of neighborhoods required for accurate local modeling.
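The covering-number notion in the last sentence is easy to make concrete. A greedy epsilon-net over a finite sample of states gives an upper bound on the covering number under the chosen metric; this is a generic illustration of the concept, not the paper's algorithm:

```python
import numpy as np

def greedy_cover(points, eps):
    """Greedy epsilon-net under the Euclidean metric: each point is within
    eps of some chosen center, and centers are only added when needed.
    The number of centers upper-bounds the eps-covering number of the set."""
    centers = []
    for p in points:
        if not any(np.linalg.norm(p - c) <= eps for c in centers):
            centers.append(p)
    return centers
```

For states sampled on an interval, the bound scales with the interval length divided by eps rather than with the number of sampled states, which mirrors the abstract's point that runtime depends on the covering number, not the state-space size.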
Reinforcement learning through modulation of spike-timing-dependent synaptic plasticity
 Neural Computation
, 2007
Cited by 24 (0 self)
The persistent modification of synaptic efficacy as a function of the relative timing of pre- and postsynaptic spikes is a phenomenon known as spike-timing-dependent plasticity (STDP). Here we show that the modulation of STDP by a global reward signal leads to reinforcement learning. We first derive analytically learning rules involving reward-modulated spike-timing-dependent synaptic and intrinsic plasticity, by applying a reinforcement learning algorithm to the stochastic Spike Response Model of spiking neurons. These rules have several features common to plasticity mechanisms experimentally found in the brain. We then demonstrate in simulations of networks of integrate-and-fire neurons the efficacy of two simple learning rules involving modulated STDP. One rule is a direct extension of the standard STDP model (modulated STDP), while the other one involves an eligibility trace stored at each synapse that keeps a decaying memory of the relationships between the recent pairs of pre- and postsynaptic spike pairs (modulated STDP with eligibility trace). This latter rule permits learning even if the reward signal is delayed. The proposed rules are able to solve the XOR problem with both rate-coded and temporally coded input and to learn a target output firing rate pattern. These learning rules are biologically plausible, may be used for training generic artificial spiking neural networks, regardless of the neural model used, and suggest the experimental investigation in animals of the existence of reward-modulated ...
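The second rule described above (modulated STDP with eligibility trace) has a simple skeleton: STDP pairings are accumulated into a decaying per-synapse trace, and the weight changes only when the global reward arrives. All constants below are illustrative, not the paper's fitted parameters:

```python
import numpy as np

A_PLUS, A_MINUS = 1.0, 1.0      # STDP amplitudes (illustrative)
TAU_STDP, TAU_ELIG = 20.0, 200.0  # ms: STDP window and trace time constants
ETA = 0.01                        # learning rate

def stdp(dt):
    """Pair-based STDP window: potentiation when the presynaptic spike
    precedes the postsynaptic one (dt = t_post - t_pre > 0)."""
    if dt > 0:
        return A_PLUS * np.exp(-dt / TAU_STDP)
    return -A_MINUS * np.exp(dt / TAU_STDP)

w, e = 0.5, 0.0
# Three pre-before-post pairings; reward arrives only on the last step,
# yet the trace still carries the memory of the earlier pairings.
for dt, reward in [(5.0, 0.0), (5.0, 0.0), (5.0, 1.0)]:
    e = e * np.exp(-1.0 / TAU_ELIG) + stdp(dt)  # trace accumulates STDP
    w += ETA * reward * e                        # update gated by reward
```

Because the trace decays with TAU_ELIG rather than vanishing instantly, the delayed reward can still credit the earlier causal pairings, which is exactly the property the abstract highlights.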
A Survey of POMDP Solution Techniques
, 2000
Cited by 21 (0 self)
In this paper, we assume all actions take one unit of discrete time at some (unspecified) time scale. If we allow actions to take variable lengths of time, we end up with a semi-Markov model; see e.g., [SPS99].
Policy Search using Paired Comparisons
 Journal of Machine Learning Research
, 2002
Cited by 19 (3 self)
Direct policy search is a practical way to solve reinforcement learning (RL) problems involving continuous state and action spaces. The goal becomes finding policy parameters that maximize a noisy objective function. The Pegasus method converts this stochastic optimization problem into a deterministic one, by using fixed start states and fixed random number sequences for comparing policies (Ng and Jordan, 2000). We evaluate Pegasus, and new paired comparison methods, using the mountain car problem, and a difficult pursuer-evader problem. We conclude that: (i) paired tests can improve performance of optimization procedures; (ii) several methods are available to reduce the 'overfitting' effect found with Pegasus; (iii) adapting the number of trials used for each comparison yields faster learning; (iv) pairing also helps stochastic search methods such as differential evolution.
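The fixed-random-sequence idea at the heart of Pegasus can be sketched with a toy rollout. The policy and dynamics below are hypothetical stand-ins: fixing the seed fixes the "scenario", so when two policies are evaluated on the same seeds the noise cancels in every paired difference and the comparison becomes deterministic.

```python
import random

def rollout(policy_gain, seed, steps=50):
    """Toy noisy rollout whose return depends on a scalar policy
    parameter plus seeded noise. Same seed => same noise sequence."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        total += policy_gain + rng.gauss(0.0, 1.0)
    return total

def paired_wins(gain_a, gain_b, seeds):
    """Count scenarios where policy A beats policy B when both are
    evaluated on the SAME fixed seeds (common random numbers)."""
    return sum(rollout(gain_a, s) > rollout(gain_b, s) for s in seeds)
```

With common seeds, a policy whose gain is even slightly higher wins every paired scenario, whereas with independent noise each comparison would be a coin flip dominated by variance.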