Results 1-10 of 116
Reinforcement learning for humanoid robotics
 Autonomous Robot
, 2003
Cited by 132 (21 self)
Abstract. The complexity of the kinematic and dynamic structure of humanoid robots makes conventional analytical approaches to control increasingly unsuitable for such systems. Learning techniques offer a possible way to aid controller design if insufficient analytical knowledge is available, and learning approaches seem mandatory when humanoid systems are supposed to become completely autonomous. While recent research in neural networks and statistical learning has focused mostly on learning from finite data sets without stringent constraints on computational efficiency, learning for humanoid robots requires a different setting, characterized by the need for real-time learning performance from an essentially infinite stream of incrementally arriving data. This paper demonstrates how even high-dimensional learning problems of this kind can successfully be dealt with by techniques from nonparametric regression and locally weighted learning. As an example, we describe the application of one of the most advanced of such algorithms, Locally Weighted Projection Regression (LWPR), to the online learning of three problems in humanoid motor control: the learning of inverse dynamics models for model-based control, the learning of inverse kinematics of redundant manipulators, and the learning of oculomotor reflexes. All these examples demonstrate fast, i.e., within seconds or minutes, learning convergence with highly accurate final performance. We conclude that real-time learning for complex motor systems like humanoid robots is possible with appropriately tailored algorithms, such that increasingly autonomous robots with massive learning abilities should be achievable in the near future.
Natural Actor-Critic
, 2007
Cited by 95 (10 self)
In this paper, we suggest a novel reinforcement learning architecture, the Natural Actor-Critic. The actor updates are achieved using stochastic policy gradients employing Amari’s natural gradient approach, while the critic obtains both the natural policy gradient and additional parameters of a value function simultaneously by linear regression. We show that actor improvements with natural policy gradients are particularly appealing as these are independent of the coordinate frame of the chosen policy representation, and can be estimated more efficiently than regular policy gradients. The critic makes use of a special basis function parameterization motivated by the policy-gradient compatible function approximation. We show that several well-known reinforcement learning methods such as the original Actor-Critic and Bradtke’s Linear Quadratic Q-Learning are in fact Natural Actor-Critic algorithms. Empirical evaluations illustrate the effectiveness of our techniques in comparison to previous methods, and also demonstrate their applicability for learning control on an anthropomorphic robot arm.
Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes
 Journal of Machine Learning Research
, 2006
Cited by 92 (10 self)
This paper introduces a novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies. The major components of the framework described in this paper include: (i) a general scheme for constructing representations or basis functions by diagonalizing symmetric diffusion operators; (ii) a specific instantiation of this approach where global basis functions called proto-value functions (PVFs) are formed using the eigenvectors of the graph Laplacian on an undirected graph formed from state transitions induced by the MDP; (iii) a three-phased procedure called representation policy iteration (RPI) comprising a sample collection phase, a representation learning phase that constructs basis functions from samples, and a final parameter estimation phase that determines an (approximately) optimal policy within the (linear) subspace spanned by the (current) basis functions; (iv) a specific instantiation of the RPI framework using least-squares policy iteration (LSPI) as the parameter estimation method; (v) several strategies for scaling the proposed approach to large discrete and continuous state spaces, including the Nyström extension for out-of-sample interpolation of eigenfunctions, and the use of Kronecker sum factorization to construct compact eigenfunctions in product spaces such as factored MDPs; (vi) finally, a series of illustrative discrete and continuous control tasks, which both illustrate the concepts and provide a benchmark for evaluating the proposed approach. Many challenges remain to be addressed in scaling the proposed framework to large MDPs, and several elaborations of the proposed framework are briefly summarized at the end.
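As a rough illustration of component (ii), the sketch below builds basis functions from the eigenvectors of a graph Laplacian. It is not the paper's implementation: the 10-state chain graph, the function name `proto_value_functions`, and the use of the combinatorial Laplacian L = D - W (rather than a normalized variant) are all assumptions made for the example.

```python
import numpy as np

def proto_value_functions(adjacency, k):
    """Return the k smallest eigenvalues and eigenvectors of L = D - W.

    `adjacency` is a symmetric weight matrix W of an undirected state graph.
    The eigenvectors with the smallest eigenvalues are the smoothest
    functions on the graph and serve as the basis functions.
    """
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
    return eigenvalues[:k], eigenvectors[:, :k]

# Hypothetical state graph: a 10-state chain where neighbours are connected,
# standing in for a graph built from sampled state transitions.
n_states = 10
W = np.zeros((n_states, n_states))
for s in range(n_states - 1):
    W[s, s + 1] = W[s + 1, s] = 1.0

eigenvalues, basis = proto_value_functions(W, k=4)
print(basis.shape)  # one row per state, one column per basis function
```

A value function would then be approximated as `basis @ weights`, with the weights fitted by a method such as LSPI in the parameter estimation phase.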
Incremental Natural Actor-Critic Algorithms
Cited by 75 (8 self)
We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms.
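The general actor-critic pattern described in this abstract (a critic trained by temporal difference learning, and an actor moved along a stochastic policy gradient scaled by the TD error) can be sketched on a toy problem. This is not one of the paper's four algorithms and uses the ordinary rather than the natural gradient; the single-state, two-action task, the step sizes, and all variable names are assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([1.0, 0.0])  # hypothetical deterministic reward per action
theta = np.zeros(2)             # actor parameters: softmax action preferences
v = 0.0                         # critic parameter: value of the single state
alpha_critic, alpha_actor = 0.1, 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)
    # TD error; each episode ends after one step, so there is no bootstrap term
    delta = rewards[a] - v
    # Critic: temporal difference update of the value estimate
    v += alpha_critic * delta
    # Actor: gradient of log pi(a) for a softmax policy is one_hot(a) - pi
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += alpha_actor * delta * grad_log_pi

final_policy = softmax(theta)
print(final_policy)  # probability mass should concentrate on action 0
```

Using the TD error in place of the raw reward is what supplies the variance reduction mentioned in the abstract: the critic's estimate acts as a baseline for the actor's gradient.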
Algorithm Selection using Reinforcement Learning
 In ICML
, 2000
Cited by 65 (6 self)
Many computational problems can be solved by multiple algorithms, with different algorithms fastest for different problem sizes, input distributions, and hardware characteristics. We consider the problem of algorithm selection: dynamically choose an algorithm to attack an instance of a problem with the goal of minimizing the overall execution time. We formulate the problem as a kind of Markov decision process (MDP), and use ideas from reinforcement learning to solve it. This paper introduces a kind of MDP that models the algorithm selection problem by allowing multiple state transitions. The well-known Q-learning algorithm is adapted for this case in a way that combines both Monte-Carlo and Temporal Difference methods. Also, this work uses, and extends in a way to control problems, the Least-Squares Temporal Difference algorithm (LSTD(0)) of Boyan. The experimental study focuses on the classic problems of order statistic selection and sorting. The encouraging resu...
An Analysis of Linear Models, Linear Value-Function Approximation, and Feature Selection for Reinforcement Learning
Cited by 45 (4 self)
We show that linear value-function approximation is equivalent to a form of linear model approximation. We then derive a relationship between the model-approximation error and the Bellman error, and show how this relationship can guide feature selection for model improvement and/or value-function improvement. We also show how these results give insight into the behavior of existing feature-selection algorithms.
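The equivalence claimed in the first sentence can be checked numerically in a small case. The sketch below assumes an unweighted least-squares projection, a randomly generated Markov chain, and random features; the variable names and the specific construction of the linear model (least-squares prediction of next features and reward) are illustrative assumptions, not taken from the paper's text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_features, gamma = 6, 3, 0.9

# Hypothetical Markov chain (policy already folded into P), rewards, features
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)
R = rng.random(n_states)
Phi = rng.random((n_states, n_features))

# Route 1: linear fixed-point value function,
# w = (Phi^T Phi - gamma Phi^T P Phi)^{-1} Phi^T R
w_fixed_point = np.linalg.solve(
    Phi.T @ Phi - gamma * Phi.T @ P @ Phi, Phi.T @ R
)

# Route 2: fit a linear model in feature space by least squares,
# then solve that model exactly.
P_phi = np.linalg.lstsq(Phi, P @ Phi, rcond=None)[0]  # predicts next features
r_phi = np.linalg.lstsq(Phi, R, rcond=None)[0]        # predicts reward
w_model = np.linalg.solve(np.eye(n_features) - gamma * P_phi, r_phi)

print(np.allclose(w_fixed_point, w_model))
```

The two weight vectors agree because multiplying `I - gamma * P_phi` on the left by `Phi.T @ Phi` recovers the matrix used in the first route.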
A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning
 Proceedings of the International Joint Conference on Machine Learning
, 2001
Cited by 37 (2 self)
The traditional Kalman filter can be viewed as a recursive stochastic algorithm that approximates an unknown function via a linear combination of prespecified basis functions given a sequence of noisy samples. In this paper, we generalize the algorithm to one that approximates the fixed point of an operator that is known to be a Euclidean norm contraction. Instead of noisy samples of the desired fixed point, the algorithm updates parameters based on noisy samples of functions generated by application of the operator, in the spirit of Robbins–Monro stochastic approximation. The algorithm is motivated by temporal–difference learning, and our developments lead to a possibly more efficient variant of temporal–difference learning. We establish convergence of the algorithm and explore efficiency gains through computational experiments involving optimal stopping and queueing problems.
Sparse temporal difference learning using lasso
 In IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning
, 2007
Cited by 35 (1 self)
Incremental least-squares temporal difference learning
 In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI)
, 2006
Cited by 33 (9 self)
Approximate policy evaluation with linear function approximation is a commonly arising problem in reinforcement learning, usually solved using temporal difference (TD) algorithms. In this paper we introduce a new variant of linear TD learning, called incremental least-squares TD learning, or iLSTD. This method is more data efficient than conventional TD algorithms such as TD(0) and is more computationally efficient than non-incremental least-squares TD methods such as LSTD (Bradtke & Barto 1996; Boyan 1999). In particular, we show that the per-time-step complexities of iLSTD and TD(0) are O(n), where n is the number of features, whereas that of LSTD is O(n^2). This difference can be decisive in modern applications of reinforcement learning where the use of a large number of features has proven to be an effective solution strategy. We present empirical comparisons, using the test problem introduced by Boyan (1999), in which iLSTD converges faster than TD(0) and almost as fast as LSTD.
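To make the complexity comparison concrete, the sketch below contrasts the two baselines named in the abstract: a TD(0) update, which touches only vectors of length n, and LSTD maintained recursively with a Sherman-Morrison update, which touches an n-by-n matrix. iLSTD itself is not shown. The function and class names, the regularizer `epsilon`, and the randomly generated transitions are assumptions made for the example.

```python
import numpy as np

def td0_step(theta, phi, reward, phi_next, gamma, alpha):
    """One TD(0) update; only vector operations, so O(n) per step."""
    delta = reward + gamma * (theta @ phi_next) - theta @ phi
    return theta + alpha * delta * phi

class RecursiveLSTD:
    """LSTD that maintains A^{-1} by Sherman-Morrison; O(n^2) per step."""

    def __init__(self, n, epsilon=1.0):
        self.A_inv = np.eye(n) / epsilon  # inverse of the initial A = epsilon * I
        self.b = np.zeros(n)

    def step(self, phi, reward, phi_next, gamma):
        # A grows by the rank-one term phi (phi - gamma * phi_next)^T
        v = phi - gamma * phi_next
        A_inv_phi = self.A_inv @ phi
        v_A_inv = v @ self.A_inv
        self.A_inv -= np.outer(A_inv_phi, v_A_inv) / (1.0 + v_A_inv @ phi)
        self.b += reward * phi
        return self.A_inv @ self.b

# Hypothetical stream of transitions with random features and rewards
rng = np.random.default_rng(0)
n, gamma = 4, 0.9
lstd = RecursiveLSTD(n)
theta_td = np.zeros(n)
A, b = np.eye(n), np.zeros(n)  # explicit accumulators, kept only for checking
for _ in range(200):
    phi, phi_next, reward = rng.random(n), rng.random(n), rng.random()
    theta_td = td0_step(theta_td, phi, reward, phi_next, gamma, alpha=0.05)
    theta_lstd = lstd.step(phi, reward, phi_next, gamma)
    A += np.outer(phi, phi - gamma * phi_next)
    b += reward * phi

theta_batch = np.linalg.solve(A, b)  # non-incremental LSTD solution
print(np.allclose(theta_lstd, theta_batch, atol=1e-6))
```

The recursive form reproduces the batch least-squares solution at every step, which is the source of LSTD's data efficiency, while the outer-product update is the source of its quadratic per-step cost.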