Results 1-10 of 77
Reinforcement learning for humanoid robotics
Autonomous Robot, 2003
Cited by 91 (20 self)
The complexity of the kinematic and dynamic structure of humanoid robots makes conventional analytical approaches to control increasingly unsuitable for such systems. Learning techniques offer a possible way to aid controller design if insufficient analytical knowledge is available, and learning approaches seem mandatory when humanoid systems are supposed to become completely autonomous. While recent research in neural networks and statistical learning has focused mostly on learning from finite data sets without stringent constraints on computational efficiency, learning for humanoid robots requires a different setting, characterized by the need for real-time learning performance from an essentially infinite stream of incrementally arriving data. This paper demonstrates how even high-dimensional learning problems of this kind can successfully be dealt with by techniques from nonparametric regression and locally weighted learning. As an example, we describe the application of one of the most advanced of such algorithms, Locally Weighted Projection Regression (LWPR), to the online learning of three problems in humanoid motor control: the learning of inverse dynamics models for model-based control, the learning of inverse kinematics of redundant manipulators, and the learning of oculomotor reflexes. All these examples demonstrate fast learning convergence, i.e., within seconds or minutes, with highly accurate final performance. We conclude that real-time learning for complex motor systems like humanoid robots is possible with appropriately tailored algorithms, such that increasingly autonomous robots with massive learning abilities should be achievable in the near future.
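As a rough illustration of the locally weighted learning idea named in this abstract, the sketch below fits a line around a query point using Gaussian distance weights. It is plain one-dimensional locally weighted linear regression, not LWPR itself (which adds projection directions and incremental updates); the function name `lwr_predict` and the `bandwidth` parameter are hypothetical choices made for this example.

```python
import math

def lwr_predict(xs, ys, query, bandwidth=0.5):
    """Predict y at `query` from a line fitted by Gaussian-weighted least squares."""
    # Samples close to the query get weights near 1, distant ones near 0.
    w = [math.exp(-0.5 * ((x - query) / bandwidth) ** 2) for x in xs]
    sw = sum(w)
    # Weighted means of inputs and targets.
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    # Weighted covariance over weighted variance gives the local slope.
    cov = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    var = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    slope = cov / var if var > 0 else 0.0
    return my + slope * (query - mx)

# Toy data (an assumption for illustration): samples of sin(x) on [-3, 3].
xs = [i / 10 for i in range(-30, 31)]
ys = [math.sin(x) for x in xs]
estimate = lwr_predict(xs, ys, 1.0, bandwidth=0.2)
```

A smaller `bandwidth` tracks curvature more closely but averages over fewer samples, which is the usual bias and variance trade-off for local models.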
Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes
Journal of Machine Learning Research, 2006
Cited by 67 (9 self)
This paper introduces a novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies. The major components of the framework described in this paper include: (i) a general scheme for constructing representations or basis functions by diagonalizing symmetric diffusion operators; (ii) a specific instantiation of this approach where global basis functions called proto-value functions (PVFs) are formed using the eigenvectors of the graph Laplacian on an undirected graph formed from state transitions induced by the MDP; (iii) a three-phased procedure called representation policy iteration (RPI), comprising a sample collection phase, a representation learning phase that constructs basis functions from samples, and a final parameter estimation phase that determines an (approximately) optimal policy within the (linear) subspace spanned by the (current) basis functions; (iv) a specific instantiation of the RPI framework using least-squares policy iteration (LSPI) as the parameter estimation method; (v) several strategies for scaling the proposed approach to large discrete and continuous state spaces, including the Nyström extension for out-of-sample interpolation of eigenfunctions, and the use of Kronecker sum factorization to construct compact eigenfunctions in product spaces such as factored MDPs; (vi) finally, a series of illustrative discrete and continuous control tasks, which both illustrate the concepts and provide a benchmark for evaluating the proposed approach. Many challenges remain to be addressed in scaling the proposed framework to large MDPs, and several elaborations of the proposed framework are briefly summarized at the end.
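To make component (ii) concrete, here is a small sketch for the simplest possible case: an undirected chain of states, assumed here purely for illustration. For this graph the eigenvectors of the combinatorial Laplacian L = D - W are known in closed form (cosines), so the code only builds L and checks that the cosine vectors satisfy L v = lambda v. In the setting the abstract describes, the graph would instead be built from sampled transitions and the eigenvectors computed numerically; all function names below are hypothetical.

```python
import math

def chain_laplacian(n):
    """Combinatorial Laplacian L = D - W of an undirected n-state chain."""
    L = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        # Undirected edge between neighbouring states i and i + 1.
        L[i][i] += 1.0
        L[i + 1][i + 1] += 1.0
        L[i][i + 1] -= 1.0
        L[i + 1][i] -= 1.0
    return L

def chain_basis(n, k):
    """k-th Laplacian eigenvector of the chain (a cosine) and its eigenvalue."""
    vec = [math.cos(math.pi * k * (i + 0.5) / n) for i in range(n)]
    lam = 2.0 - 2.0 * math.cos(math.pi * k / n)
    return vec, lam

def residual(L, vec, lam):
    """Largest entry of |L v - lam v|; close to zero for a true eigenpair."""
    n = len(vec)
    return max(
        abs(sum(L[i][j] * vec[j] for j in range(n)) - lam * vec[i])
        for i in range(n)
    )

n = 10
L = chain_laplacian(n)
# The four smoothest basis functions: k = 0 is constant, higher k oscillates more.
basis = [chain_basis(n, k) for k in range(4)]
residuals = [residual(L, v, lam) for v, lam in basis]
```

Low-order eigenvectors vary slowly along the graph, which is why they are a natural candidate basis for smooth value functions.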
Natural Actor-Critic
2007
Cited by 65 (10 self)
In this paper, we suggest a novel reinforcement learning architecture, the Natural Actor-Critic. The actor updates are achieved using stochastic policy gradients employing Amari's natural gradient approach, while the critic obtains both the natural policy gradient and additional parameters of a value function simultaneously by linear regression. We show that actor improvements with natural policy gradients are particularly appealing as these are independent of the coordinate frame of the chosen policy representation, and can be estimated more efficiently than regular policy gradients. The critic makes use of a special basis function parameterization motivated by the policy-gradient compatible function approximation. We show that several well-known reinforcement learning methods such as the original Actor-Critic and Bradtke's Linear Quadratic Q-Learning are in fact Natural Actor-Critic algorithms. Empirical evaluations illustrate the effectiveness of our techniques in comparison to previous methods, and also demonstrate their applicability for learning control on an anthropomorphic robot arm.
Algorithms and Representations for Reinforcement Learning
2005
Cited by 36 (7 self)
“If we knew what it was we were doing, it would not be called research, would it?”
A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning
Proceedings of the International Joint Conference on Machine Learning, 2001
Cited by 33 (2 self)
The traditional Kalman filter can be viewed as a recursive stochastic algorithm that approximates an unknown function via a linear combination of prespecified basis functions given a sequence of noisy samples. In this paper, we generalize the algorithm to one that approximates the fixed point of an operator that is known to be a Euclidean norm contraction. Instead of noisy samples of the desired fixed point, the algorithm updates parameters based on noisy samples of functions generated by application of the operator, in the spirit of Robbins–Monro stochastic approximation. The algorithm is motivated by temporal–difference learning, and our developments lead to a possibly more efficient variant of temporal–difference learning. We establish convergence of the algorithm and explore efficiency gains through computational experiments involving optimal stopping and queueing problems.
An Analysis of Linear Models, Linear Value-Function Approximation, and Feature Selection for Reinforcement Learning
Cited by 32 (3 self)
We show that linear value-function approximation is equivalent to a form of linear model approximation. We then derive a relationship between the model-approximation error and the Bellman error, and show how this relationship can guide feature selection for model improvement and/or value-function improvement. We also show how these results give insight into the behavior of existing feature-selection algorithms.
Sparse temporal difference learning using lasso
In IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, 2007
Cited by 27 (1 self)
We consider the problem of online value function estimation in reinforcement learning. We concentrate on the choice of function approximator. To try to break the curse of dimensionality, we focus on nonparametric function approximators. We propose to incorporate kernels into temporal difference algorithms by using regression via the LASSO. We introduce the equigradient descent algorithm (EGD), which is a direct adaptation of the one recently introduced in the LARS algorithm family for solving the LASSO. We advocate our choice of the EGD as a judicious algorithm for these tasks. We present the EGD algorithm in detail as well as some experimental results.
Learning to select branching rules in the DPLL procedure for satisfiability
In LICS/SAT, 2001
Cited by 26 (3 self)
The DPLL procedure is the most popular complete satisfiability (SAT) solver. While its worst-case complexity is exponential, the actual running time is greatly affected by the ordering of branch variables during the search. Several branching rules have been proposed, but none is the best in all cases. This work investigates the use of automated methods for choosing the most appropriate branching rule at each node in the search tree. We consider a reinforcement-learning approach where a value function, which predicts the performance of each branching rule in each case, is learned through trial runs on a typical problem set of the target class of SAT problems. Our results indicate that, provided sufficient training on a given class, the resulting strategy performs as well as (and, in some cases, better than) the best branching rule for that class.
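The idea of choosing a branching rule at each search node can be sketched with a small DPLL solver whose rule selector is pluggable. In the sketch below, `pick_rule` is a hand-written placeholder standing in for the learned value function described in the abstract; it, the two branching rules, and all other names are hypothetical and are not taken from the paper. Clauses are lists of nonzero integers, with a negative number denoting a negated variable.

```python
def simplify(clauses, lit):
    """Assign `lit` true: drop satisfied clauses, delete the negated literal."""
    return [[x for x in c if x != -lit] for c in clauses if lit not in c]

def most_frequent(clauses):
    """Branching rule: the variable that occurs most often."""
    counts = {}
    for c in clauses:
        for x in c:
            counts[abs(x)] = counts.get(abs(x), 0) + 1
    return max(counts, key=counts.get)

def first_literal(clauses):
    """Branching rule: the variable of the first literal of the first clause."""
    return abs(clauses[0][0])

def pick_rule(clauses):
    """Placeholder selector; a learned value function would go here."""
    return most_frequent if len(clauses) > 3 else first_literal

def dpll(clauses, choose_rule, assignment=None):
    """Return a satisfying partial assignment {var: bool}, or None if unsatisfiable."""
    assignment = dict(assignment or {})
    while True:
        if any(len(c) == 0 for c in clauses):
            return None  # an empty clause means a conflict
        units = [c[0] for c in clauses if len(c) == 1]
        if not units:
            break
        # Unit propagation: a one-literal clause forces its literal.
        lit = units[0]
        assignment[abs(lit)] = lit > 0
        clauses = simplify(clauses, lit)
    if not clauses:
        return assignment
    # Select a branching rule for this node, then let it pick the variable.
    var = choose_rule(clauses)(clauses)
    for lit in (var, -var):
        result = dpll(simplify(clauses, lit), choose_rule,
                      {**assignment, var: lit > 0})
        if result is not None:
            return result
    return None

def satisfies(clauses, model):
    """True if every clause contains a literal made true by `model`."""
    return all(any(model.get(abs(x)) == (x > 0) for x in c) for c in clauses)

formula = [[1, 2], [-1, 3], [-2, -3], [2, 3]]
model = dpll(formula, pick_rule)
```

Keeping the selector as a function of the current clause set means a trained predictor could be dropped in without touching the search itself.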
Reinforcement learning for sensing strategies
In Proceedings of the International Conference on Intelligent Robots and Systems (IROS), 2004
Cited by 24 (0 self)
Mobile robots often have to make decisions on where to point their sensors, which have limited range and coverage. A good sensing strategy allows the robot to collect useful information for its tasks. Most existing solutions to this active sensing problem choose the direction that maximally reduces the uncertainty in a single state variable. In more complex problem domains, however, uncertainties exist in multiple state variables, and they affect the performance of the robot in different ways. The robot thus needs more sophisticated sensing strategies in order to decide which uncertainties to reduce, and to make the correct tradeoffs. In this work, we apply least-squares reinforcement learning methods to solve this problem. We implemented and tested the learning approach in the RoboCup domain, where the robot attempts to reach a ball and accurately kick it into the goal. We present experimental results that suggest our approach is able to learn highly effective sensing strategies.
Incremental least-squares temporal difference learning
In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI), 2006
Cited by 24 (6 self)
Approximate policy evaluation with linear function approximation is a commonly arising problem in reinforcement learning, usually solved using temporal difference (TD) algorithms. In this paper we introduce a new variant of linear TD learning, called incremental least-squares TD learning, or iLSTD. This method is more data efficient than conventional TD algorithms such as TD(0) and is more computationally efficient than non-incremental least-squares TD methods such as LSTD (Bradtke & Barto 1996; Boyan 1999). In particular, we show that the per-time-step complexities of iLSTD and TD(0) are O(n), where n is the number of features, whereas that of LSTD is O(n^2). This difference can be decisive in modern applications of reinforcement learning where the use of a large number of features has proven to be an effective solution strategy. We present empirical comparisons, using the test problem introduced by Boyan (1999), in which iLSTD converges faster than TD(0) and almost as fast as LSTD.
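The trade-off this abstract describes can be seen in a toy comparison of the two baselines it mentions. The sketch below is not iLSTD; it contrasts TD(0), whose update is one O(n) vector operation, with a recursive form of LSTD that maintains an inverse matrix through an O(n^2) Sherman-Morrison update. The three-state deterministic chain, the one-hot features, the step size, and the regularizer `eps` are all assumptions chosen for illustration.

```python
GAMMA = 0.9
N = 3  # states 0 -> 1 -> 2 -> terminal, reward 1 on every step

def phi(s):
    """One-hot features; the terminal state (s == N) maps to the zero vector."""
    return [1.0 if i == s else 0.0 for i in range(N)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def td0(episodes, alpha=0.1):
    """Linear TD(0): O(n) work per transition."""
    theta = [0.0] * N
    for _ in range(episodes):
        for s in range(N):
            f, f_next = phi(s), phi(s + 1)
            delta = 1.0 + GAMMA * dot(theta, f_next) - dot(theta, f)
            theta = [t + alpha * delta * x for t, x in zip(theta, f)]
    return theta

def recursive_lstd(episodes, eps=1e-3):
    """LSTD via Sherman-Morrison: O(n^2) work per transition."""
    # P tracks the inverse of eps * I + sum of phi (phi - gamma * phi')^T.
    P = [[1.0 / eps if i == j else 0.0 for j in range(N)] for i in range(N)]
    b = [0.0] * N
    for _ in range(episodes):
        for s in range(N):
            f, f_next = phi(s), phi(s + 1)
            v = [x - GAMMA * y for x, y in zip(f, f_next)]
            Pf = [dot(row, f) for row in P]  # P times phi
            vP = [sum(v[i] * P[i][j] for i in range(N)) for j in range(N)]
            denom = 1.0 + dot(v, Pf)
            # Rank-one correction of the stored inverse.
            P = [[P[i][j] - Pf[i] * vP[j] / denom for j in range(N)]
                 for i in range(N)]
            b = [bi + x * 1.0 for bi, x in zip(b, f)]  # reward is 1
    return [dot(row, b) for row in P]

TRUE_VALUES = [2.71, 1.9, 1.0]  # 1 + 0.9 * 1.9, 1 + 0.9 * 1.0, 1
td_one, td_many = td0(1), td0(500)
lstd_one = recursive_lstd(1)
```

On this chain the least-squares estimate is already close to the true values after a single episode, while TD(0) needs many passes over the same data; the price is the quadratic per-step cost.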