Results 1-10 of 142
Least-Squares Policy Iteration
 Journal of Machine Learning Research
, 2003
Cited by 301 (9 self)
We propose a new approach to reinforcement learning for control problems which combines value-function approximation with linear architectures and approximate policy iteration.
Infinite-horizon policy-gradient estimation
 Journal of Artificial Intelligence Research
, 2001
Cited by 153 (5 self)
Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm’s chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter β ∈ [0, 1) (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter β is related to the mixing time of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multiple agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter, Bartlett, & Weaver, 2001) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.
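The estimator summarized in this abstract (GPOMDP) can be illustrated with a minimal sketch. The code below is an illustration under stated assumptions, not the authors' implementation: the softmax policy, the stateless two-action toy problem, and the names `softmax`, `grad_log_softmax`, and `gpomdp_estimate` are all hypothetical. It keeps only two vectors the size of the parameter vector (an eligibility trace `z` and a running estimate `delta`), which is the "twice the number of policy parameters" storage the abstract mentions, and it exposes the single free parameter `beta` in [0, 1).

```python
import math
import random


def softmax(theta):
    """Action probabilities of a softmax policy (an assumed policy class)."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(theta, action):
    """Gradient of log mu(action | theta): indicator minus probability."""
    probs = softmax(theta)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def gpomdp_estimate(theta, reward_fn, beta, steps, rng):
    """Running estimate of the average-reward gradient from one sample path.

    z     : eligibility trace, discounted by beta at every step
    delta : running average of reward * trace
    """
    z = [0.0] * len(theta)
    delta = [0.0] * len(theta)
    for t in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = reward_fn(action)  # toy assumption: reward depends on the action only
        g = grad_log_softmax(theta, action)
        z = [beta * zi + gi for zi, gi in zip(z, g)]
        delta = [d + (reward * zi - d) / (t + 1) for d, zi in zip(delta, z)]
    return delta


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy problem: action 1 pays reward 1, action 0 pays nothing.
    estimate = gpomdp_estimate([0.0, 0.0], lambda a: float(a == 1), 0.0, 20000, rng)
    print(estimate)
```

In this toy problem the average reward equals the probability of action 1, so at `theta = [0, 0]` the exact gradient is `[-0.25, 0.25]` and the estimate should land near it. Because the toy problem has no state, `beta = 0` introduces no bias here; in a real POMDP a larger `beta` reduces bias at the cost of higher variance.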
A Natural Policy Gradient
Cited by 106 (0 self)
We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris.
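As a reading aid, the natural gradient referred to in this abstract is usually written as the ordinary gradient preconditioned by the inverse Fisher information matrix of the policy. The notation below is an assumption chosen for illustration (average reward \eta, policy \pi with parameters \theta, stationary state distribution \rho^{\pi}); the abstract itself does not fix any symbols.

```latex
\tilde{\nabla}\eta(\theta) = F(\theta)^{-1}\,\nabla\eta(\theta),
\qquad
F(\theta) = \mathbb{E}_{s \sim \rho^{\pi},\; a \sim \pi(\cdot\,;\,s,\theta)}
\left[ \nabla_{\theta}\log\pi(a; s, \theta)\,
       \nabla_{\theta}\log\pi(a; s, \theta)^{\top} \right]
```

Preconditioning by F(\theta)^{-1} makes the update direction independent of how the policy happens to be parameterized, which is the sense in which it follows "the underlying structure of the parameter space".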
A tutorial on the cross-entropy method
 Annals of Operations Research
, 2005
Cited by 104 (15 self)
The cross-entropy method is a recent versatile Monte Carlo technique. This article provides a brief introduction to the cross-entropy method and discusses how it can be used for rare-event probability estimation and for solving combinatorial, continuous, constrained and noisy optimization problems. A comprehensive list of references on cross-entropy methods and applications is included.
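For continuous optimization, the cross-entropy method is commonly presented as a loop: sample candidates from a parametric distribution, keep the best-scoring "elite" fraction, and refit the distribution to the elite. The sketch below assumes a one-dimensional Gaussian sampling distribution; the function name `cross_entropy_maximize`, its parameters, and the quadratic test objective are hypothetical choices for illustration and are not taken from the tutorial.

```python
import random
import statistics


def cross_entropy_maximize(score, mean, std, rng,
                           n_samples=100, n_elite=10, iterations=30):
    """Maximize `score` over the reals with a Gaussian sampling distribution."""
    for _ in range(iterations):
        # 1. Sample candidates from the current distribution.
        samples = [rng.gauss(mean, std) for _ in range(n_samples)]
        # 2. Keep the highest-scoring candidates (the elite set).
        elite = sorted(samples, key=score, reverse=True)[:n_elite]
        # 3. Refit the Gaussian to the elite set (maximum-likelihood update).
        mean = statistics.mean(elite)
        std = statistics.pstdev(elite) + 1e-6  # small floor keeps std positive
    return mean, std


if __name__ == "__main__":
    rng = random.Random(1)
    best_mean, best_std = cross_entropy_maximize(
        lambda x: -(x - 2.0) ** 2, mean=0.0, std=5.0, rng=rng
    )
    print(best_mean, best_std)
```

On this toy objective the sampling mean should settle near the maximizer at 2 while the standard deviation shrinks toward the floor. Practical variants often smooth the parameter updates to avoid collapsing too early.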
Kernel-Based Reinforcement Learning
 Machine Learning
, 1999
Cited by 102 (1 self)
We present a kernel-based approach to reinforcement learning that overcomes the stability problems of temporal-difference learning in continuous state-spaces. First, our algorithm converges to a unique solution of an approximate Bellman's equation regardless of its initialization values. Second, the method is consistent in the sense that the resulting policy converges asymptotically to the optimal policy. Parametric value function estimates such as neural networks do not possess this property. Our kernel-based approach also allows us to show that the limiting distribution of the value function estimate is a Gaussian process. This information is useful in studying the bias-variance tradeoff in reinforcement learning. We find that all reinforcement learning approaches to estimating the value function, parametric or non-parametric, are subject to a bias. This bias is typically larger in reinforcement learning than in a comparable regression problem.
Reinforcement learning for humanoid robotics
 Autonomous Robot
, 2003
Cited by 91 (20 self)
The complexity of the kinematic and dynamic structure of humanoid robots makes conventional analytical approaches to control increasingly unsuitable for such systems. Learning techniques offer a possible way to aid controller design if insufficient analytical knowledge is available, and learning approaches seem mandatory when humanoid systems are supposed to become completely autonomous. While recent research in neural networks and statistical learning has focused mostly on learning from finite data sets without stringent constraints on computational efficiency, learning for humanoid robots requires a different setting, characterized by the need for real-time learning performance from an essentially infinite stream of incrementally arriving data. This paper demonstrates how even high-dimensional learning problems of this kind can successfully be dealt with by techniques from nonparametric regression and locally weighted learning. As an example, we describe the application of one of the most advanced of such algorithms, Locally Weighted Projection Regression (LWPR), to the online learning of three problems in humanoid motor control: the learning of inverse dynamics models for model-based control, the learning of inverse kinematics of redundant manipulators, and the learning of oculomotor reflexes. All these examples demonstrate fast, i.e., within seconds or minutes, learning convergence with highly accurate final performance. We conclude that real-time learning for complex motor systems like humanoid robots is possible with appropriately tailored algorithms, such that increasingly autonomous robots with massive learning abilities should be achievable in the near future.
Coordinated Reinforcement Learning
 In Proceedings of ICML-2002, the Nineteenth International Conference on Machine Learning
, 2002
Cited by 84 (6 self)
We present several new algorithms for multiagent reinforcement learning. A common feature of these algorithms is a parameterized, structured representation of a policy or value function. This structure is leveraged in an approach we call coordinated reinforcement learning, by which agents coordinate both their action selection activities and their parameter updates. Within the limits of our parametric representations, the agents will determine a jointly optimal action without explicitly considering every possible action in their exponentially large joint action space. Our methods differ from many previous reinforcement learning approaches to multiagent coordination in that structured communication and coordination between agents appears at the core of both the learning algorithm and the execution architecture. Our experimental results, comparing our approach to other RL methods, illustrate both the quality of the policies obtained and the additional benefits of coordination.
Evolutionary function approximation for reinforcement learning
 Journal of Machine Learning Research
, 2006
Cited by 72 (15 self)
In many machine learning problems, an agent must learn [...] Reinforcement learning problems are the subset of these tasks in which the agent never sees examples of correct behavior. Instead, it receives only positive and negative rewards [...] problems such as robot control, game playing, and system optimization fall in this category. Developing effective [...]tional decisions. This thesis investigates evolutionary function approximation, a novel approach to automatically selecting function approximator representations that enable efficient individual learning. This method evolves individuals that are better able to learn. I present a fully implemented instantiation of evolutionary function approximation which combines NEAT, a neuroevolutionary optimization technique, with Q-learning, a popular TD method. [...]
Natural Actor-Critic
, 2007
Cited by 64 (10 self)
In this paper, we suggest a novel reinforcement learning architecture, the Natural Actor-Critic. The actor updates are achieved using stochastic policy gradients employing Amari’s natural gradient approach, while the critic obtains both the natural policy gradient and additional parameters of a value function simultaneously by linear regression. We show that actor improvements with natural policy gradients are particularly appealing as these are independent of the coordinate frame of the chosen policy representation, and can be estimated more efficiently than regular policy gradients. The critic makes use of a special basis function parameterization motivated by the policy-gradient compatible function approximation. We show that several well-known reinforcement learning methods such as the original Actor-Critic and Bradtke’s Linear Quadratic Q-Learning are in fact Natural Actor-Critic algorithms. Empirical evaluations illustrate the effectiveness of our techniques in comparison to previous methods, and also demonstrate their applicability for learning control on an anthropomorphic robot arm.
Reinforcement Learning in POMDP's via Direct Gradient Ascent
 In Proc. 17th International Conf. on Machine Learning
, 2000
Cited by 63 (2 self)
This paper discusses theoretical and experimental aspects of gradient-based approaches to the direct optimization of policy performance in controlled POMDPs. We introduce GPOMDP, a REINFORCE-like algorithm for estimating an approximation to the gradient of the average reward as a function of the parameters of a stochastic policy. The algorithm's chief advantages are that it requires only a single sample path of the underlying Markov chain, it uses only one free parameter β ∈ [0, 1), which has a natural interpretation in terms of bias-variance trade-off, and it requires no knowledge of the underlying state. We prove convergence of GPOMDP and show how the gradient estimates produced by GPOMDP can be used in a conjugate-gradient procedure to find local optima of the average reward. 1. Introduction "Reinforcement learning" is used to describe the general problem of training an agent to choose its actions so as to increase its long-term average reward. The structure of th...