Results 1  10
of
59
LeastSquares Policy Iteration
 Journal of Machine Learning Research
, 2003
"... We propose a new approach to reinforcement learning for control problems which combines valuefunction approximation with linear architectures and approximate policy iteration. ..."
Abstract

Cited by 301 (9 self)
 Add to MetaCart
We propose a new approach to reinforcement learning for control problems which combines valuefunction approximation with linear architectures and approximate policy iteration.
Reinforcement learning for humanoid robotics
 Autonomous Robot
, 2003
"... Abstract. The complexity of the kinematic and dynamic structure of humanoid robots make conventional analytical approaches to control increasingly unsuitable for such systems. Learning techniques offer a possible way to aid controller design if insufficient analytical knowledge is available, and lea ..."
Abstract

Cited by 91 (20 self)
 Add to MetaCart
Abstract. The complexity of the kinematic and dynamic structure of humanoid robots make conventional analytical approaches to control increasingly unsuitable for such systems. Learning techniques offer a possible way to aid controller design if insufficient analytical knowledge is available, and learning approaches seem mandatory when humanoid systems are supposed to become completely autonomous. While recent research in neural networks and statistical learning has focused mostly on learning from finite data sets without stringent constraints on computational efficiency, learning for humanoid robots requires a different setting, characterized by the need for realtime learning performance from an essentially infinite stream of incrementally arriving data. This paper demonstrates how even highdimensional learning problems of this kind can successfully be dealt with by techniques from nonparametric regression and locally weighted learning. As an example, we describe the application of one of the most advanced of such algorithms, Locally Weighted Projection Regression (LWPR), to the online learning of three problems in humanoid motor control: the learning of inverse dynamics models for modelbased control, the learning of inverse kinematics of redundant manipulators, and the learning of oculomotor reflexes. All these examples demonstrate fast, i.e., within seconds or minutes, learning convergence with highly accurate final peformance. We conclude that realtime learning for complex motor system like humanoid robots is possible with appropriately tailored algorithms, such that increasingly autonomous robots with massive learning abilities should be achievable in the near future. 1.
Protovalue functions: A laplacian framework for learning representation and control in markov decision processes
 Journal of Machine Learning Research
, 2006
"... This paper introduces a novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies. The major components of the framework described in this paper include: (i) A general scheme for constructing representations or basis functions by d ..."
Abstract

Cited by 66 (10 self)
 Add to MetaCart
This paper introduces a novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies. The major components of the framework described in this paper include: (i) A general scheme for constructing representations or basis functions by diagonalizing symmetric diffusion operators (ii) A specific instantiation of this approach where global basis functions called protovalue functions (PVFs) are formed using the eigenvectors of the graph Laplacian on an undirected graph formed from state transitions induced by the MDP (iii) A threephased procedure called representation policy iteration comprising of a sample collection phase, a representation learning phase that constructs basis functions from samples, and a final parameter estimation phase that determines an (approximately) optimal policy within the (linear) subspace spanned by the (current) basis functions. (iv) A specific instantiation of the RPI framework using leastsquares policy iteration (LSPI) as the parameter estimation method (v) Several strategies for scaling the proposed approach to large discrete and continuous state spaces, including the Nyström extension for outofsample interpolation of eigenfunctions, and the use of Kronecker sum factorization to construct compact eigenfunctions in product spaces such as factored MDPs (vi) Finally, a series of illustrative discrete and continuous control tasks, which both illustrate the concepts and provide a benchmark for evaluating the proposed approach. Many challenges remain to be addressed in scaling the proposed framework to large MDPs, and several elaboration of the proposed framework are briefly summarized at the end.
Basis function adaptation in temporal difference reinforcement learning
 Annals of Operations Research
, 2005
"... Reinforcement Learning (RL) is an approach for solving complex multistage decision problems that fall under the general framework of Markov Decision Problems (MDPs), with possibly unknown parameters. Function approximation is essential for problems with a large state space, as it facilitates compact ..."
Abstract

Cited by 56 (3 self)
 Add to MetaCart
Reinforcement Learning (RL) is an approach for solving complex multistage decision problems that fall under the general framework of Markov Decision Problems (MDPs), with possibly unknown parameters. Function approximation is essential for problems with a large state space, as it facilitates compact representation and enables generalization. Linear approximation architectures (where the adjustable parameters are the weights of prefixed basis functions) have recently gained prominence due to efficient algorithms and convergence guarantees. Nonetheless, an appropriate choice of basis function is important for the success of the algorithm. In the present paper we examine methods for adapting the basis function during the learning process in the context of evaluating the value function under a fixed control policy. Using the Bellman approximation error as an optimization criterion, we optimize the weights of the basis function while simultaneously adapting the (nonlinear) basis function parameters. We present two algorithms for this problem. The first uses a gradientbased approach and the second applies the Cross Entropy method. The performance of the proposed algorithms is evaluated and compared in simulations.
Algorithms and Representations for Reinforcement Learning
, 2005
"... “If we knew what it was we were doing, it would not be called research, would it?” ..."
Abstract

Cited by 37 (7 self)
 Add to MetaCart
“If we knew what it was we were doing, it would not be called research, would it?”
A Generalized Kalman Filter for Fixed Point Approximation and Efficient TemporalDifference
 Learning,” Proceedings of the International Joint Conference on Machine Learning
, 2001
"... The traditional Kalman filter can be viewed as a recursive stochastic algorithm that approximates an unknown function via a linear combination of prespecified basis functions given a sequence of noisy samples. In this paper, we generalize the algorithm to one that approximates the fixed point of an ..."
Abstract

Cited by 36 (2 self)
 Add to MetaCart
The traditional Kalman filter can be viewed as a recursive stochastic algorithm that approximates an unknown function via a linear combination of prespecified basis functions given a sequence of noisy samples. In this paper, we generalize the algorithm to one that approximates the fixed point of an operator that is known to be a Euclidean norm contraction. Instead of noisy samples of the desired fixed point, the algorithm updates parameters based on noisy samples of functions generated by application of the operator, in the spirit of Robbins–Monro stochastic approximation. The algorithm is motivated by temporal–difference learning, and our developments lead to a possibly more efficient variant of temporal–difference learning. We establish convergence of the algorithm and explore efficiency gains through computational experiments involving optimal stopping and queueing problems.
Reinforcement learning for imitating constrained reaching movements
 RSJ Advanced Robotics
, 2007
"... The goal of developing algorithms for programming robots by demonstration is to create an easy way of programming robots such that it can be accomplished by anyone. When a demonstrator teaches a task to a robot, he/she shows some ways of fulfilling the task, but not all the possibilities. The robot ..."
Abstract

Cited by 35 (10 self)
 Add to MetaCart
The goal of developing algorithms for programming robots by demonstration is to create an easy way of programming robots such that it can be accomplished by anyone. When a demonstrator teaches a task to a robot, he/she shows some ways of fulfilling the task, but not all the possibilities. The robot must then be able to reproduce the task even when unexpected perturbations occur. In this case, it has to learn a new solution. In this paper, we describe a system to teach to the robot constrained reaching tasks. Our system is based on a dynamical system generator modulated by a learned speed trajectory. This system is combined with a reinforcement learning module to allow the robot to adapt the trajectory when facing a new situation, for example in the presence of obstacles.
Improved Temporal Difference Methods with Linear Function Approximation
"... This chapter considers temporal difference algorithms within the context of infinitehorizon finitestate dynamic programming problems with discounted cost and linear cost function approximation. This problem arises as a subproblem in the policy iteration method of dynamic programming. Additional d ..."
Abstract

Cited by 26 (4 self)
 Add to MetaCart
This chapter considers temporal difference algorithms within the context of infinitehorizon finitestate dynamic programming problems with discounted cost and linear cost function approximation. This problem arises as a subproblem in the policy iteration method of dynamic programming. Additional discussions of such problems can be found in Chapters 12 and 6. The advantage of the method presented here is that this is the first iterative temporal difference method that converges without requiring a diminishing step size. The chapter discusses the connections with Suttonfls TD(λ) and with various versions of leastsquares that are based on valueiteration. It is shown using both analysis and experiments that the proposed method is substantially faster, simpler, and more reliable than TD(λ). Comparisons are also made with the LSTD method of Boyan and Bradtke and Barto.
Convergence Results for Some Temporal Difference Methods Based on Least Squares,” Lab. for Information and Decision Systems Report 2697
, 2006
"... We consider finitestate Markov decision processes, and prove convergence and rate of convergence results for certain least squares policy evaluation algorithms of the type known as LSPE(λ). These are temporal difference methods for constructing a linear function approximation of the cost function o ..."
Abstract

Cited by 18 (9 self)
 Add to MetaCart
We consider finitestate Markov decision processes, and prove convergence and rate of convergence results for certain least squares policy evaluation algorithms of the type known as LSPE(λ). These are temporal difference methods for constructing a linear function approximation of the cost function of a stationary policy, within the context of infinitehorizon discounted and average cost dynamic programming. We introduce an average cost method, patterned after the known discounted cost method, and we prove its convergence for a range of constant stepsize choices. We also show that the convergence rate of both the discounted and the average cost methods is optimal within the class of temporal difference methods. Analysis and experiment indicate that our methods are substantially and often dramatically faster than TD(λ), as well as more reliable.