Doubly robust off-policy evaluation for reinforcement learning.
, 2015
"... Abstract We study the problem of evaluating a policy that is different from the one that generates data. Such a problem, known as offpolicy evaluation in reinforcement learning (RL), is encountered whenever one wants to estimate the value of a new solution, based on historical data, before actuall ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
(Show Context)
Abstract We study the problem of evaluating a policy that is different from the one that generates data. Such a problem, known as off-policy evaluation in reinforcement learning (RL), is encountered whenever one wants to estimate the value of a new solution, based on historical data, before actually deploying it in the real system, which is a critical step of applying RL in most real-world applications. Despite the fundamental importance of the problem, existing general methods either have uncontrolled bias or suffer high variance. In this work, we extend the so-called doubly robust estimator for bandits to sequential decision-making problems, which gets the best of both worlds: it is guaranteed to be unbiased and has low variance, and as a point estimator, it outperforms the most popular importance-sampling estimator and its variants on most occasions. We also provide theoretical results on the hardness of the problem, and show that our estimator can match the asymptotic lower bound in certain scenarios.
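The doubly robust idea in the bandit setting that this abstract builds on can be sketched as follows, assuming logged (context, action, reward) triples, known behavior-policy probabilities, and a fitted reward model `q_hat`. All names are illustrative, not the authors' code: the estimator combines a model-based baseline with an importance-weighted correction, so it stays unbiased even when `q_hat` is wrong, and has low variance when `q_hat` is accurate.

```python
def dr_bandit_value(logs, target_prob, behavior_prob, q_hat, actions):
    """Doubly robust off-policy value estimate for a contextual bandit.

    logs: list of (context, action, reward) collected by the behavior policy.
    target_prob / behavior_prob: action probabilities of the two policies.
    q_hat: fitted (possibly imperfect) reward model q_hat(context, action).
    """
    total = 0.0
    for x, a, r in logs:
        # Model-based baseline: expected reward of the target policy
        # under the fitted reward model.
        baseline = sum(target_prob(x, b) * q_hat(x, b) for b in actions)
        # Importance-weighted correction using the observed reward.
        rho = target_prob(x, a) / behavior_prob(x, a)
        total += baseline + rho * (r - q_hat(x, a))
    return total / len(logs)
```

With a perfect reward model the correction term vanishes and the estimate reduces to the model-based value; with a biased model, the correction removes the bias in expectation.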
Reinforcement Learning of Heuristic EV Fleet Charging in a Day-Ahead Electricity Market
"... Abstract—This paper addresses the problem of defining a dayahead consumption plan for charging a fleet of electric vehicles (EVs), and following this plan during operation. A challenge herein is the beforehand unknown charging flexibility of EVs, which depends on numerous details about each EV (e. ..."
Abstract
 Add to MetaCart
(Show Context)
Abstract—This paper addresses the problem of defining a day-ahead consumption plan for charging a fleet of electric vehicles (EVs), and following this plan during operation. A challenge herein is the beforehand-unknown charging flexibility of EVs, which depends on numerous details about each EV (e.g., plug-in times, power limitations, battery size, power curve, etc.). To cope with this challenge, EV charging is controlled during operation by a heuristic scheme, and the resulting charging behavior of the EV fleet is learned by using batch mode reinforcement learning. Based on this learned behavior, a cost-effective day-ahead consumption plan can be defined. In simulation experiments, our approach is benchmarked against a multi-stage stochastic programming solution, which uses an exact model of each EV's charging flexibility. Results show that our approach is able to find a day-ahead consumption plan
Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
"... Abstract We study the problem of offpolicy value evaluation in reinforcement learning (RL), where one aims to estimate the value of a new policy based on data collected by a different policy. This problem is often a critical step when applying RL to realworld problems. Despite its importance, exi ..."
Abstract
 Add to MetaCart
(Show Context)
Abstract We study the problem of off-policy value evaluation in reinforcement learning (RL), where one aims to estimate the value of a new policy based on data collected by a different policy. This problem is often a critical step when applying RL to real-world problems. Despite its importance, existing general methods either have uncontrolled bias or suffer high variance. In this work, we extend the doubly robust estimator for bandits to sequential decision-making problems, which gets the best of both worlds: it is guaranteed to be unbiased and can have a much lower variance than the popular importance sampling estimators. We demonstrate the estimator's accuracy in several benchmark problems, and illustrate its use as a subroutine in safe policy improvement. We also provide theoretical results on the inherent hardness of the problem, and show that our estimator can match the lower bound in certain scenarios.
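The extension from bandits to sequential decision-making can be sketched as a backward recursion along one trajectory. The following is an illustrative sketch, not the paper's implementation, assuming fitted models `v_hat` and `q_hat` and known behavior-policy probabilities:

```python
def dr_trajectory(traj, target_prob, behavior_prob, q_hat, v_hat, gamma=1.0):
    """Recursive doubly robust value estimate for one trajectory.

    traj: list of (state, action, reward) in time order.
    q_hat / v_hat: fitted action-value and state-value models.
    At each step the importance ratio corrects the model prediction
    using the observed reward and the DR estimate of the tail.
    """
    est = 0.0
    for s, a, r in reversed(traj):
        rho = target_prob(s, a) / behavior_prob(s, a)
        est = v_hat(s) + rho * (r + gamma * est - q_hat(s, a))
    return est
```

When the target and behavior policies coincide and the models are exact, the recursion returns the observed discounted return, which is a quick sanity check for an implementation.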
Batch Mode Reinforcement Learning (slides)
, 2012
"... Reinforcement Learning (RL) aims at finding a policy maximizing received rewards by interacting with the environmentBatch Mode Reinforcement Learning All the available information is contained in a batch collection of data Batch mode RL aims at computing a (near)optimal policy from this collection ..."
Abstract
 Add to MetaCart
Reinforcement Learning (RL) aims at finding a policy maximizing received rewards by interacting with the environment. In batch mode reinforcement learning, all the available information is contained in a batch collection of data, and batch mode RL aims at computing a (near-)optimal policy from this finite collection of trajectories of the agent. Examples of BMRL problems: dynamic treatment regimes (inferred from clinical data), marketing optimization (based on customer histories), finance, etc. Main goal: finding a "good" policy. Many associated sub-goals:
– Evaluating the performance of a given policy
– Computing performance guarantees
– Computing safe policies
– Choosing how to generate additional transitions
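The batch-mode setting sketched above is commonly addressed with algorithms such as Fitted Q Iteration; a minimal tabular sketch follows. The general algorithm replaces the per-key averaging with an arbitrary supervised regressor; the names and the finite state/action assumption here are illustrative:

```python
from collections import defaultdict

def fitted_q_iteration(transitions, actions, n_iters=50, gamma=0.9):
    """Tabular sketch of Fitted Q Iteration.

    transitions: batch of one-step tuples (s, a, r, s_next).
    Each iteration regresses (here: averages) the bootstrapped
    targets r + gamma * max_b Q(s_next, b) onto (s, a) pairs.
    """
    q = defaultdict(float)
    for _ in range(n_iters):
        targets = defaultdict(list)
        for s, a, r, s2 in transitions:
            targets[(s, a)].append(
                r + gamma * max(q[(s2, b)] for b in actions))
        q = defaultdict(float,
                        {k: sum(v) / len(v) for k, v in targets.items()})
    return q
```

On a toy two-state chain this converges to the Bellman fixed point in a couple of iterations; with continuous states, the averaging step would be replaced by, e.g., a tree-based regressor.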
Structured Kernel-Based Reinforcement Learning
"... Kernelbased reinforcement learning (KBRL) is a popular approach to learning nonparametric value function approximations. In this paper, we present structured KBRL, a paradigm for kernelbased RL that allows for modeling independencies in the transition and reward models of problems. Realworld pro ..."
Abstract
 Add to MetaCart
Kernel-based reinforcement learning (KBRL) is a popular approach to learning nonparametric value function approximations. In this paper, we present structured KBRL, a paradigm for kernel-based RL that allows for modeling independencies in the transition and reward models of problems. Real-world problems often exhibit this structure and can be solved more efficiently when it is modeled. We make three contributions. First, we motivate our work, define a structured backup operator, and prove that it is a contraction. Second, we show how to evaluate our operator efficiently. Our analysis reveals that the fixed point of the operator is the optimal value function in a special factored MDP. Finally, we evaluate our method on a synthetic problem and compare it to two KBRL baselines. In most experiments, we learn better policies than the baselines from an order of magnitude less training data.
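The (unstructured) KBRL backup this abstract builds on can be sketched as value iteration over the sampled successor states, with normalized kernel weights over the sampled transitions for each action. A 1-D illustrative sketch, where the Gaussian kernel, bandwidth, and names are assumptions:

```python
import math

def kbrl_values(samples, gamma=0.9, bandwidth=0.5, n_iters=100):
    """Value-iteration sketch of kernel-based RL (KBRL) on 1-D states.

    samples[a]: list of sampled transitions (s, r, s_next) for action a.
    Value estimates live at the sampled successor states; each backup
    takes the best action under normalized kernel-weighted targets.
    The backup is a contraction, so iteration converges.
    """
    def kernel(x, y):
        return math.exp(-((x - y) ** 2) / (2 * bandwidth ** 2))

    succ = [s2 for trans in samples.values() for (_, _, s2) in trans]
    v = {s2: 0.0 for s2 in succ}
    for _ in range(n_iters):
        new_v = {}
        for s in v:
            best = -float("inf")
            for a, trans in samples.items():
                w = [kernel(s, si) for (si, _, _) in trans]
                z = sum(w) or 1.0  # guard against all-zero weights
                q = sum(wi * (ri + gamma * v[s2i])
                        for wi, (_, ri, s2i) in zip(w, trans)) / z
                best = max(best, q)
            new_v[s] = best
        v = new_v
    return v
```

The structured variant in the paper exploits independencies so that this backup can be evaluated over factored state components rather than whole states.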
On Periodic Reference Tracking Using Batch-Mode Reinforcement Learning with Application to Gene Regulatory Network Control
"... Abstract—In this paper, we consider the periodic reference tracking problem in the framework of batchmode reinforcement learning, which studies methods for solving optimal control problems from the sole knowledge of a set of trajectories. In particular, we extend an existing batchmode reinforcemen ..."
Abstract
 Add to MetaCart
(Show Context)
Abstract—In this paper, we consider the periodic reference tracking problem in the framework of batch-mode reinforcement learning, which studies methods for solving optimal control problems from the sole knowledge of a set of trajectories. In particular, we extend an existing batch-mode reinforcement learning algorithm, known as Fitted Q Iteration, to the periodic reference tracking problem. The presented periodic reference tracking algorithm explicitly exploits a priori knowledge of the future values of the reference trajectory and its periodicity. We discuss the properties of our approach and illustrate it on the problem of reference tracking for a synthetic biology gene regulatory network known as the generalised repressilator. This system can produce decaying but long-lived oscillations, which makes it an interesting system for the tracking problem. In our companion paper we also consider the regulation problem of the toggle switch system, where the main goal is to drive the system's states to a specific bounded region in the state space. Index Terms—batch-mode reinforcement learning; reference tracking; fitted Q iteration; synthetic biology; gene regulatory networks; generalised repressilator
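One way to exploit the reference's known periodicity, as the abstract describes, is to augment the state with the phase t mod period before running a Fitted-Q-style iteration. The sketch below does this with tabular averaging standing in for a generic regressor and a tracking reward penalizing distance to the reference; all names are illustrative, not the authors' implementation:

```python
from collections import defaultdict

def tracking_fqi(transitions, actions, period, reference,
                 n_iters=60, gamma=0.9):
    """Sketch of Fitted Q Iteration for periodic reference tracking.

    transitions: list of (x, t, a, x_next) observed in the batch.
    The Q-function is indexed by (x, t mod period, a), so the learned
    policy can anticipate the known periodic reference trajectory.
    """
    def reward(x, t):
        # Tracking reward: penalize distance to the reference at phase t.
        return -abs(x - reference[t % period])

    q = defaultdict(float)
    for _ in range(n_iters):
        targets = defaultdict(list)
        for x, t, a, x2 in transitions:
            p, p2 = t % period, (t + 1) % period
            y = reward(x2, t + 1) + gamma * max(q[(x2, p2, b)]
                                                for b in actions)
            targets[(x, p, a)].append(y)
        q = defaultdict(float,
                        {k: sum(v) / len(v) for k, v in targets.items()})
    return q
```

On a toy two-state system with reference [0, 1], the learned Q correctly prefers the action that alternates the state in phase with the reference.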