Results 1 - 10
of
39
A Natural Policy Gradient
"... We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy opt ..."
Abstract
-
Cited by 85 (0 self)
- Add to MetaCart
We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as deo/ned by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris.
Reinforcement Learning in POMDP's via Direct Gradient Ascent
- In Proc. 17th International Conf. on Machine Learning
, 2000
"... This paper discusses theoretical and experimental aspects of gradient-based approaches to the direct optimization of policy performance in controlled POMDPs. We introduce GPOMDP, a REINFORCE-like algorithm for estimating an approximation to the gradient of the average reward as a function of ..."
Abstract
-
Cited by 61 (2 self)
- Add to MetaCart
This paper discusses theoretical and experimental aspects of gradient-based approaches to the direct optimization of policy performance in controlled POMDPs. We introduce GPOMDP, a REINFORCE-like algorithm for estimating an approximation to the gradient of the average reward as a function of the parameters of a stochastic policy. The algorithm's chief advantages are that it requires only a single sample path of the underlying Markov chain, it uses only one free parameter 2 [0; 1), which has a natural interpretation in terms of bias-variance trade-off, and it requires no knowledge of the underlying state. We prove convergence of GPOMDP and show how the gradient estimates produced by GPOMDP can be used in a conjugate-gradient procedure to find local optima of the average reward. 1. Introduction "Reinforcement learning" is used to describe the general problem of training an agent to choose its actions so as to increase its long-term average reward. The structure of th...
Policy gradient methods for robotics
- In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS
, 2006
"... Abstract — The aquisition and improvement of motor skills and control policies for robotics from trial and error is of essential importance if robots should ever leave precisely pre-structured environments. However, to date only few existing reinforcement learning methods have been scaled into the d ..."
Abstract
-
Cited by 52 (14 self)
- Add to MetaCart
Abstract — The aquisition and improvement of motor skills and control policies for robotics from trial and error is of essential importance if robots should ever leave precisely pre-structured environments. However, to date only few existing reinforcement learning methods have been scaled into the domains of highdimensional robots such as manipulator, legged or humanoid robots. Policy gradient methods remain one of the few exceptions and have found a variety of applications. Nevertheless, the application of such methods is not without peril if done in an uninformed manner. In this paper, we give an overview on learning with policy gradient methods for robotics with a strong focus on recent advances in the field. We outline previous applications to robotics and show how the most recently developed methods can significantly improve learning performance. Finally, we evaluate our most promising algorithm in the application of hitting a baseball with an anthropomorphic arm. I.
Covariant Policy Search
, 2003
"... We investigate the problem of non-covariant behavior of policy gradient reinforcement learning algorithms. ..."
Abstract
-
Cited by 37 (2 self)
- Add to MetaCart
We investigate the problem of non-covariant behavior of policy gradient reinforcement learning algorithms.
Learning to Trade via Direct Reinforcement
, 2001
"... We present methods for optimizing portfolios, asset allocations, and trading systems based on direct reinforcement (DR). In this approach, investment decision making is viewed as a stochastic control problem, and strategies are discovered directly. We present an adaptive algorithm called recurrent r ..."
Abstract
-
Cited by 32 (1 self)
- Add to MetaCart
We present methods for optimizing portfolios, asset allocations, and trading systems based on direct reinforcement (DR). In this approach, investment decision making is viewed as a stochastic control problem, and strategies are discovered directly. We present an adaptive algorithm called recurrent reinforcement learning (RRL) for discovering investment policies. The need to build forecasting models is eliminated, and better trading performance is obtained. The direct reinforcement approach differs from dynamic programming and reinforcement algorithms such as TD-learning and Q-learning, which attempt to estimate a value function for the control problem. We find that the RRL direct reinforcement framework enables a simpler problem representation, avoids Bellman's curse of dimensionality and offers compelling advantages in efficiency. We demonstrate how direct reinforcement can be used to optimize risk-adjusted investment returns (including the differential Sharpe ratio), while accounting for the effects of transaction costs. In extensive simulation work using real financial data, we find that our approach based on RRL produces better trading strategies than systems utilizing Q-Learning (a value function method). Real-world applications include an intra-daily currency trader and a monthly asset allocation system for the S&P 500 Stock Index and T-Bills.
Exploration in Metric State Spaces
, 2003
"... We present metric-E , a provably near-optimal algorithm for reinforcement learning in Markov decision processes in which there is a natural metric on the state space that allows the construction of accurate local models. The algorithm is a generalization of the E algorithm of Kearns and ..."
Abstract
-
Cited by 24 (1 self)
- Add to MetaCart
We present metric-E , a provably near-optimal algorithm for reinforcement learning in Markov decision processes in which there is a natural metric on the state space that allows the construction of accurate local models. The algorithm is a generalization of the E algorithm of Kearns and Singh, and assumes a black box for approximate planning. Unlike the original E , metric- finds a near optimal policy in an amount of time that does not directly depend on the size of the state space, but instead depends on the covering number of the state space. Informally, the covering number is the number of neighborhoods required for accurate local modeling.
Binet-cauchy kernels on dynamical systems and its application to the analysis of dynamic scenes
- International Journal of Computer Vision
, 2005
"... Abstract. We derive a family of kernels on dynamical systems by applying the Binet-Cauchy theorem to trajectories of states. Our derivation provides a unifying framework for all kernels on dynamical systems currently used in machine learning, including kernels derived from the behavioral framework, ..."
Abstract
-
Cited by 22 (6 self)
- Add to MetaCart
Abstract. We derive a family of kernels on dynamical systems by applying the Binet-Cauchy theorem to trajectories of states. Our derivation provides a unifying framework for all kernels on dynamical systems currently used in machine learning, including kernels derived from the behavioral framework, diffusion processes, marginalized kernels, kernels on graphs, and the kernels on sets arising from the subspace angle approach. In the case of linear time-invariant systems, we derive explicit formulae for computing the proposed Binet-Cauchy kernels by solving Sylvester equations, and relate the proposed kernels to existing kernels based on cepstrum coefficients and subspace angles. Besides their theoretical appeal, these kernels can be used efficiently in the comparison of video sequences of dynamic scenes that can be modeled as the output of a linear time-invariant dynamical system. One advantage of our kernels is that they take the initial conditions of the dynamical systems into account. As a first example, we use our kernels to compare video sequences of dynamic textures. As a second example, we apply our kernels to the problem of clustering short clips of a movie. Experimental evidence shows superior performance of our kernels. Keywords: Binet-Cauchy theorem, ARMA models and dynamical systems, Sylvester
Policy Search using Paired Comparisons
- Journal of Machine Learning Research
, 2002
"... Direct policy search is a practical way to solve reinforcement learning (RL) problems involving continuous state and action spaces. The goal becomes finding policy parameters that maximize a noisy objective function. The Pegasus method converts this stochastic optimization problem into a determinist ..."
Abstract
-
Cited by 16 (3 self)
- Add to MetaCart
Direct policy search is a practical way to solve reinforcement learning (RL) problems involving continuous state and action spaces. The goal becomes finding policy parameters that maximize a noisy objective function. The Pegasus method converts this stochastic optimization problem into a deterministic one, by using fixed start states and fixed random number sequences for comparing policies (Ng and Jordan, 2000). We evaluate Pegasus, and new paired comparison methods, using the mountain car problem, and a difficult pursuer-evader problem. We conclude that: (i) paired tests can improve performance of optimization procedures; (ii) several methods are available to reduce the `overfitting' effect found with Pegasus; (iii) adapting the number of trials used for each comparison yields faster learning; (iv) pairing also helps stochastic search methods such as differential evolution.
Estimation and Approximation Bounds for Gradient-Based Reinforcement Learning
- In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory
, 2000
"... We model reinforcement learning as the problem of learning to control a Partially Observable Markov Decision Process ( ¢¡¤£¦¥ § ), and focus on gradient ascent approaches to this problem. In [3] ¢¡¤£¦¥§ we introduced, an algorithm for ¨ estimating the performance gradient of a ©¡¤£¦¥¤ ..."
Abstract
-
Cited by 15 (8 self)
- Add to MetaCart
We model reinforcement learning as the problem of learning to control a Partially Observable Markov Decision Process ( ¢¡¤£¦¥ § ), and focus on gradient ascent approaches to this problem. In [3] ¢¡¤£¦¥§ we introduced, an algorithm for ¨ estimating the performance gradient of a ©¡¤£¦¥¤
A Multi-Agent, Policy-Gradient approach to Network Routing
- In: Proc. of the 18th Int. Conf. on Machine Learning
, 2001
"... Network routing is a distributed decision problem which naturally admits numerical performance measures, such as the average time for a packet to travel from source to destination. Olpomdp, a policy-gradient reinforcement learning algorithm, was successfully applied to simulated network routing unde ..."
Abstract
-
Cited by 15 (1 self)
- Add to MetaCart
Network routing is a distributed decision problem which naturally admits numerical performance measures, such as the average time for a packet to travel from source to destination. Olpomdp, a policy-gradient reinforcement learning algorithm, was successfully applied to simulated network routing under a number of network models. Multiple distributed agents (routers) learned cooperative behavior without explicit interagent communication, and they avoided behavior which was individually desirable, but detrimental to the group's overall performance. Furthermore, shaping the reward signal by explicitly penalizing certain patterns of sub-optimal behavior was found to dramatically improve the convergence rate.

