Results 11–20 of 87
Reinforcement learning algorithms for MDPs
, 2009
Cited by 11 (0 self)
This article presents a survey of reinforcement learning algorithms for Markov Decision Processes (MDPs). In the first half of the article, the problem of value estimation is considered. Here we start by describing the idea of bootstrapping and temporal difference learning. Next, we compare incremental and batch algorithmic variants and discuss the impact of the choice of the function approximation method on the success of learning. In the second half, we describe methods that target the problem of learning to control an MDP. Here online and active learning are discussed first, followed by a description of direct and actor-critic methods.
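The bootstrapping idea surveyed above can be made concrete with a minimal sketch: tabular TD(0) value estimation on a toy deterministic chain. The chain, step size, and episode count are illustrative assumptions, not from the survey.

```python
# Tabular TD(0) value estimation on a toy 3-state chain: states 0 and 1
# deterministically step right; entering the terminal state 2 yields reward 1.
# Each update bootstraps: the target r + gamma * V[s'] reuses the current
# estimate V[s'] instead of waiting for the full observed return.

def td0(episodes=2000, alpha=0.1, gamma=1.0):
    V = [0.0, 0.0, 0.0]  # V[2] stays 0: terminal state
    for _ in range(episodes):
        s = 0
        while s < 2:
            s_next = s + 1
            r = 1.0 if s_next == 2 else 0.0
            V[s] += alpha * (r + gamma * V[s_next] - V[s])  # TD(0) update
            s = s_next
    return V
```

With gamma = 1 the true values are V(0) = V(1) = 1, which the incremental updates approach; a batch variant would instead refit all recorded transitions at once, the comparison the survey's first half draws.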
Batch Mode Reinforcement Learning based on the Synthesis of Artificial Trajectories
 Annals of Operations Research
, 2012
Adaptive Reactive Job-Shop Scheduling with Learning Agents
 International Journal of Information Technology and Intelligent Computing
, 2007
Cited by 10 (5 self)
Traditional approaches to solving job-shop scheduling problems assume full knowledge of the problem and search for a centralized solution for a single problem instance. Finding optimal solutions, however, requires an enormous computational effort, which becomes critical for large problem instance sizes and, in particular, in situations where frequent changes in the environment occur. In this article, we adopt an alternative view on production scheduling problems by modelling them as multi-agent reinforcement learning problems. In fact, we interpret job-shop scheduling problems as sequential decision processes and attach to each resource an adaptive agent that makes its job dispatching decisions independently of the other agents and improves its dispatching behavior by trial and error, employing a reinforcement learning algorithm. The utilization of concurrently and independently learning agents requires special care in the design of the reinforcement learning algorithm to be applied. Therefore, we develop a novel multi-agent learning algorithm that combines data-efficient batch-mode reinforcement learning, neural network-based value function approximation, and the use of an optimistic inter-agent coordination scheme. The evaluation of our learning framework focuses on numerous established Operations Research benchmark problems and shows that our approach can compete very well with alternative solution methods.
Parametric Value Function Approximation: a Unified View
Cited by 10 (6 self)
Reinforcement learning (RL) is a machine learning answer to the optimal control problem. It consists of learning an optimal control policy through interactions with the system to be controlled, the quality of this policy being quantified by the so-called value function. An important RL subtopic is to approximate this function when the system is too large for an exact representation. This survey reviews and unifies state-of-the-art methods for parametric value function approximation by grouping them into three main categories: bootstrapping, residual and projected fixed-point approaches. Related algorithms are derived by considering one of the associated cost functions and a specific way to minimize it, almost always a stochastic gradient descent or a recursive least-squares approach. Index Terms—Reinforcement learning, value function approximation, survey.
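As a sketch of the projected fixed-point category this survey names, here is batch LSTD with a linear parameterization V(s) = theta · phi(s); the toy transition format and the regularization constant are assumptions for illustration, not details from the survey.

```python
import numpy as np

# Batch LSTD: solves the projected fixed-point equation A theta = b, where
# A = sum_i phi(s_i) (phi(s_i) - gamma * phi(s_i'))^T  and  b = sum_i r_i phi(s_i).
# A small ridge term keeps A invertible when the sample is degenerate.

def lstd(transitions, phi, dim, gamma=0.9, reg=1e-6):
    A = reg * np.eye(dim)
    b = np.zeros(dim)
    for s, r, s_next, done in transitions:
        phis = phi(s)
        phin = np.zeros(dim) if done else phi(s_next)
        A += np.outer(phis, phis - gamma * phin)
        b += r * phis
    return np.linalg.solve(A, b)
```

On a two-state chain with one-hot features, a single pass through the batch recovers the discounted values exactly, which is the appeal of least-squares over stochastic-gradient variants on small problems.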
Bridging the Gap: Learning in the RoboCup Simulation and Mid-Size League
 In Proceedings of the 7th Portuguese Conference on Automatic Control (Controlo)
, 2006
Cited by 9 (3 self)
In this paper, we discuss the application of reinforcement learning for autonomous robots using the RoboCup domain as a benchmark. The paper compares successful learning approaches in simulation with learning on real robots and develops methodologies to overcome the additional problems in the real world.
A Cautious Approach to Generalization in Reinforcement Learning
Cited by 7 (4 self)
In the context of a deterministic Lipschitz continuous environment over continuous state spaces, finite action spaces, and a finite optimization horizon, we propose an algorithm of polynomial complexity which exploits weak prior knowledge about its environment for computing, from a given sample of trajectories and for a given initial state, a sequence of actions. The proposed Viterbi-like algorithm maximizes a recently proposed lower bound on the return depending on the initial state, and uses to this end prior knowledge about the environment provided in the form of upper bounds on its Lipschitz constants. It thereby avoids, in a way depending on the initial state and on the prior knowledge, those regions of the state space where the sample is too sparse to make safe generalizations. Our experiments show that it can lead to more cautious policies than algorithms combining dynamic programming with function approximators. We also give a condition on the sample sparsity ensuring that, for a given initial state, the proposed algorithm produces an optimal sequence of actions in open loop.
Statistically Linearized Least-Squares Temporal Differences
 in Proceedings of the IEEE International Conference on Ultra Modern Control Systems (ICUMT 2010). Moscow (Russia): IEEE
, 2010
Cited by 7 (7 self)
A common drawback of standard reinforcement learning algorithms is their inability to scale up to real-world problems. For this reason, a current important trend of research is (state-action) value function approximation. A prominent value function approximator is the least-squares temporal differences (LSTD) algorithm. However, for technical reasons, linearity is mandatory: the parameterization of the value function must be linear (compact nonlinear representations are not allowed) and only the Bellman evaluation operator can be considered (imposing policy-iteration-like schemes). In this paper, this restriction of LSTD is lifted thanks to a derivative-free statistical linearization approach. This way, nonlinear parameterizations and the Bellman optimality operator can be taken into account (this last point allows value-iteration-like schemes). The efficiency of the resulting algorithms is demonstrated using a linear parametrization and neural networks, as well as on a Q-learning-like problem. A theoretical analysis is also provided. Index Terms—reinforcement learning, value function approximation, statistical linearization, neural networks.
Model-Free Monte Carlo-like Policy Evaluation
Cited by 6 (3 self)
We propose an algorithm for estimating the finite-horizon expected return of a closed-loop control policy from an a priori given (off-policy) sample of one-step transitions. It averages cumulated rewards along a set of “broken trajectories” made of one-step transitions selected from the sample on the basis of the control policy. Under some Lipschitz continuity assumptions on the system dynamics, reward function and control policy, we provide bounds on the bias and variance of the estimator that depend only on the Lipschitz constants, on the number of broken trajectories used in the estimator, and on the sparsity of the sample of one-step transitions.
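The broken-trajectory construction can be illustrated with a small sketch: stitch one-step transitions together by nearest-state matching under the policy's chosen action, using each transition at most once. The 1-D state space, distance metric, and discrete actions are illustrative assumptions that only loosely mirror the paper's setting.

```python
# Model-Free Monte Carlo-style estimator: average cumulated rewards over
# "broken trajectories" stitched from a sample of one-step transitions
# (s, a, r, s'), each used at most once, chosen by nearest state match
# among transitions whose action agrees with the policy.

def mfmc_return(sample, policy, s0, horizon, n_traj=5):
    returns = []
    pool = list(sample)  # copy: transitions are consumed without replacement
    for _ in range(n_traj):
        s, ret = s0, 0.0
        for _ in range(horizon):
            a = policy(s)
            cands = [i for i, (si, ai, _, _) in enumerate(pool) if ai == a]
            if not cands:
                break  # sample too sparse to continue this trajectory
            i = min(cands, key=lambda i: abs(pool[i][0] - s))
            _, _, r, s_next = pool.pop(i)
            ret += r
            s = s_next
        returns.append(ret)
    return sum(returns) / len(returns)
```

When the sample densely covers the states the policy visits, each broken trajectory closely tracks a true rollout; the paper's bias and variance bounds quantify the degradation as the sample gets sparser.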
A Brief Survey of Parametric Value Function Approximation
Cited by 6 (2 self)
Reinforcement learning is a machine learning answer to the optimal control problem. It consists of learning an optimal control policy through interactions with the system to be controlled, the quality of this policy being quantified by the so-called value function. An important subtopic of reinforcement learning is to compute an approximation of this value function when the system is too large for an exact representation. This survey reviews state-of-the-art methods for (parametric) value function approximation by grouping them into three main categories: bootstrapping, residual and projected fixed-point approaches. Related algorithms are derived by considering one of the associated cost functions and a specific way to minimize it, almost always a stochastic gradient descent or a recursive least-squares approach.
Learning to Drive a Real Car in 20 Minutes
Cited by 5 (4 self)
The paper describes our first experiments on Reinforcement Learning to steer a real robot car. The applied method, Neural Fitted Q Iteration (NFQ), is purely data-driven, based on data directly collected from real-life experiments, i.e. no transition model and no simulation is used. The RL approach is based on learning a neural Q value function, which means that no prior selection of the structure of the control law is required. We demonstrate that the controller is able to learn a steering task in less than 20 minutes directly on the real car. We consider this an important step towards the competitive application of neural Q function based RL methods in real-life environments.
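NFQ's core loop, regressing Q onto the batch targets r + gamma * max Q(s', ·), can be sketched as follows. NFQ fits a multilayer perceptron to the targets; this toy stands in an exact per-cell least-squares fit on a tiny discrete problem, and the environment and constants are illustrative assumptions.

```python
import numpy as np

# Fitted Q iteration skeleton in the spirit of NFQ: from a fixed batch of
# transitions (s, a, r, s', done), repeatedly (1) compute bootstrapped targets
# y = r + gamma * max_a' Q(s', a') using the current Q, then (2) refit Q to
# those targets. Here the "regressor" is the exact mean target per (s, a) cell.

def fitted_q_iteration(transitions, n_states, n_actions, gamma=0.95, iters=50):
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        targets = np.zeros_like(Q)
        counts = np.zeros_like(Q)
        for s, a, r, s_next, done in transitions:
            y = r if done else r + gamma * Q[s_next].max()
            targets[s, a] += y
            counts[s, a] += 1
        mask = counts > 0           # refit only the cells the batch covers
        Q[mask] = targets[mask] / counts[mask]
    return Q
```

Because the whole batch is reused at every iteration, the method is data-efficient in the sense the paper relies on: a few minutes of driving data can be replayed through many fitting sweeps rather than consumed once.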