Results 1–10 of 33
Regularization and feature selection in least-squares temporal difference learning, 2009
Cited by 80 (1 self)
Abstract:
We consider the task of reinforcement learning with linear value function approximation. Temporal difference algorithms, and in particular the Least-Squares Temporal Difference (LSTD) algorithm, provide a method for learning the parameters of the value function, but when the number of features is large this algorithm can overfit to the data and is computationally expensive. In this paper, we propose a regularization framework for the LSTD algorithm that overcomes these difficulties. In particular, we focus on the case of l1 regularization, which is robust to irrelevant features and also serves as a method for feature selection. Although the l1-regularized LSTD solution cannot be expressed as a convex optimization problem, we present an algorithm similar to the Least Angle Regression (LARS) algorithm that can efficiently compute the optimal solution. Finally, we demonstrate the performance of the algorithm experimentally.
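The abstract above builds on the standard LSTD estimator, which solves a linear system assembled from sampled transitions. As a hedged background sketch (not the paper's LARS-TD algorithm, which handles the l1 penalty via a LARS-style homotopy), plain LSTD with a small illustrative ridge term can be written as:

```python
import numpy as np

def lstd(phi, phi_next, rewards, gamma=0.95, ridge=1e-3):
    """Plain LSTD: solve A w = b with A = Phi^T (Phi - gamma Phi')
    and b = Phi^T r. The small ridge term only keeps A invertible;
    the paper replaces it with an l1 penalty solved LARS-style."""
    A = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ rewards
    return np.linalg.solve(A + ridge * np.eye(phi.shape[1]), b)

# Toy two-state chain with one-hot features: state 0 steps to state 1
# with reward 1; state 1 is absorbing with reward 0.
phi      = np.array([[1.0, 0.0], [0.0, 1.0]])
phi_next = np.array([[0.0, 1.0], [0.0, 1.0]])
r        = np.array([1.0, 0.0])
w = lstd(phi, phi_next, r, gamma=0.9)  # close to the true values [1, 0]
```

Replacing the quadratic ridge term with an l1 term breaks this closed form, which is why the paper resorts to a homotopy method instead.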
Regularized Policy Iteration
Cited by 46 (8 self)
Abstract:
In this paper we consider approximate policy-iteration-based reinforcement learning algorithms. In order to implement a flexible function approximation scheme, we propose the use of nonparametric methods with regularization, providing a convenient way to control the complexity of the function approximator. We propose two novel regularized policy iteration algorithms by adding L2-regularization to two widely used policy evaluation methods: Bellman residual minimization (BRM) and least-squares temporal difference learning (LSTD). We derive efficient implementations for our algorithms when the approximate value functions belong to a reproducing kernel Hilbert space. We also provide finite-sample performance bounds for our algorithms and show that they are able to achieve optimal rates of convergence under the studied conditions.
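As a minimal parametric illustration of one of the two penalized evaluation methods named above, L2-regularized Bellman residual minimization reduces to ridge regression on a difference of feature matrices (linear features only; the paper's actual algorithms operate in a reproducing kernel Hilbert space):

```python
import numpy as np

def brm_l2(phi, phi_next, rewards, gamma=0.95, lam=1e-2):
    """L2-regularized Bellman residual minimization (linear sketch):
    minimize ||(Phi - gamma Phi') w - r||^2 + lam ||w||^2, i.e.
    ridge regression on the matrix D = Phi - gamma Phi'."""
    D = phi - gamma * phi_next
    return np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ rewards)

# Toy two-state chain with one-hot features: state 0 steps to the
# absorbing state 1 with reward 1; state 1 loops with reward 0.
phi      = np.array([[1.0, 0.0], [0.0, 1.0]])
phi_next = np.array([[0.0, 1.0], [0.0, 1.0]])
r        = np.array([1.0, 0.0])
w = brm_l2(phi, phi_next, r, gamma=0.9, lam=1e-6)  # roughly [1, 0]
```

Note that single-transition BRM targets are biased in stochastic environments; the deterministic toy chain sidesteps that issue.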
A unifying framework for computational reinforcement learning theory, 2009
Cited by 23 (7 self)
Abstract:
Computational learning theory studies mathematical models that allow one to formally analyze and compare the performance of supervised-learning algorithms, such as their sample complexity. While existing models such as PAC (Probably Approximately Correct) have played an influential role in understanding the nature of supervised learning, they have not been as successful in reinforcement learning (RL). Here, the fundamental barrier is the need for active exploration in sequential decision problems. An RL agent tries to maximize long-term utility by exploiting its knowledge about the problem, but this knowledge has to be acquired by the agent itself through exploration, which may reduce short-term utility. The need for active exploration is common in many problems in daily life, engineering, and the sciences. For example, a Backgammon program strives to take good moves to maximize the probability of winning a game, but sometimes it may try novel and possibly harmful moves to discover how the opponent reacts, in the hope of discovering a better game-playing strategy. It has been known since the early days of RL that a good trade-off between exploration and exploitation is critical for the agent to learn fast (i.e., to reach near-optimal strategies).
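The exploration/exploitation trade-off described above is easiest to see in its simplest form, epsilon-greedy action selection on a multi-armed bandit (an illustrative toy, not an algorithm from the paper):

```python
import numpy as np

def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy on a Gaussian bandit: with probability epsilon
    pick a random arm (explore), otherwise pick the arm with the
    highest running-mean reward estimate (exploit)."""
    rng = np.random.default_rng(seed)
    n = len(true_means)
    counts = np.zeros(n)
    estimates = np.zeros(n)
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(n))        # explore a random arm
        else:
            a = int(np.argmax(estimates))   # exploit the best estimate
        reward = true_means[a] + rng.normal()
        counts[a] += 1
        estimates[a] += (reward - estimates[a]) / counts[a]  # running mean
    return estimates, counts

est, cnt = epsilon_greedy_bandit([0.1, 0.5, 0.9])  # arm 2 is best
```

After a few thousand steps the best arm dominates the pull counts, while the residual epsilon exploration is exactly the short-term utility sacrificed to keep learning.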
Sample-Efficient Batch Reinforcement Learning for Dialogue Management Optimization
ACM Transactions on Speech and Language Processing, 2011
Cited by 19 (16 self)
Abstract:
Spoken Dialogue Systems (SDS) are systems which have the ability to interact with human beings using natural language as the medium of interaction. A dialogue policy plays a crucial role in determining the functioning of the dialogue management module. Handcrafting the dialogue policy is not always an option, considering the complexity of the dialogue task and the stochastic behavior of users. In recent years, approaches based on Reinforcement Learning (RL) have proved to be efficient for dialogue policy optimization. Yet most conventional RL algorithms are data intensive and demand techniques such as user simulation, which is likely to introduce additional modeling errors. This paper explores the possibility of using a set of approximate dynamic programming algorithms for policy optimization in SDS. Moreover, these algorithms are combined with a method for learning a sparse representation of the value function. Experimental results show that these algorithms, when applied to dialogue management optimization, are particularly sample efficient, since they learn from a few hundred dialogue examples. These algorithms learn in an off-policy manner, meaning that they can learn optimal policies from dialogue examples generated with a quite simple strategy. Thus they can learn good dialogue policies directly from data, avoiding user modeling errors.
Error Propagation for Approximate Policy and Value Iteration
Cited by 15 (0 self)
Abstract:
We address the question of how the approximation error/Bellman residual at each iteration of the Approximate Policy/Value Iteration algorithms influences the quality of the resulting policy. We quantify the performance loss as the Lp norm of the approximation error/Bellman residual at each iteration. Moreover, we show that the performance loss depends on the expectation of the squared Radon-Nikodym derivative of a certain distribution rather than its supremum, as opposed to what has been suggested by previous results. Our results also indicate that the contribution of the approximation/Bellman error to the performance loss is more prominent in the later iterations of API/AVI, and the effect of an error term in the earlier iterations decays exponentially fast.
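For context, the classical supremum-norm propagation bound that results like this refine is the standard textbook statement for approximate policy iteration (not the paper's Lp result):

```latex
\limsup_{k \to \infty} \left\| V^{*} - V^{\pi_k} \right\|_{\infty}
  \;\le\; \frac{2\gamma}{(1-\gamma)^{2}}
  \limsup_{k \to \infty} \left\| \varepsilon_k \right\|_{\infty}
```

where $\varepsilon_k$ is the approximation error at iteration $k$; replacing the supremum norm with expectation-based Lp quantities is exactly the refinement the abstract describes.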
Reinforcement learning algorithms for MDPs, 2009
Cited by 11 (0 self)
Abstract:
This article presents a survey of reinforcement learning algorithms for Markov Decision Processes (MDPs). In the first half of the article, the problem of value estimation is considered. Here we start by describing the idea of bootstrapping and temporal difference learning. Next, we compare incremental and batch algorithmic variants and discuss the impact of the choice of the function approximation method on the success of learning. In the second half, we describe methods that target the problem of learning to control an MDP. Here, online and active learning are discussed first, followed by a description of direct and actor-critic methods.
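The bootstrapping idea the survey starts from is easiest to see in tabular TD(0), where each update moves a state's value toward a target built from the current estimate of the successor state (a generic textbook sketch, not code from the survey):

```python
import numpy as np

def td0(transitions, n_states, alpha=0.1, gamma=0.9):
    """Tabular TD(0). Each (s, r, s_next, done) step updates V[s]
    toward the bootstrapped target r + gamma * V[s_next]."""
    V = np.zeros(n_states)
    for s, r, s_next, done in transitions:
        target = r + (0.0 if done else gamma * V[s_next])
        V[s] += alpha * (target - V[s])  # step toward the TD target
    return V

# Repeating one terminal transition (reward 1 from state 0) drives
# V[0] toward the true value 1.
V = td0([(0, 1.0, 1, True)] * 100, n_states=2)
```

Batch variants such as LSTD solve for the same fixed point in one linear-algebra step instead of by incremental stochastic updates.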
Parametric Value Function Approximation: a Unified View
Cited by 10 (6 self)
Abstract:
Reinforcement learning (RL) is a machine learning answer to the optimal control problem. It consists of learning an optimal control policy through interactions with the system to be controlled, the quality of this policy being quantified by the so-called value function. An important RL subtopic is to approximate this function when the system is too large for an exact representation. This survey reviews and unifies state-of-the-art methods for parametric value function approximation by grouping them into three main categories: bootstrapping, residuals, and projected fixed-point approaches. Related algorithms are derived by considering one of the associated cost functions and a specific way to minimize it, almost always a stochastic gradient descent or a recursive least-squares approach.
Regularized Fitted Q-iteration: Application to Planning
Cited by 6 (1 self)
Abstract:
We consider planning in a Markovian decision problem, i.e., the problem of finding a good policy given access to a generative model of the environment. We propose to use fitted Q-iteration with penalized (or regularized) least-squares regression as the regression subroutine to address the problem of controlling model complexity. The algorithm is presented in detail for the case when the function space is a reproducing-kernel Hilbert space underlying a user-chosen kernel function. We derive bounds on the quality of the solution and argue that data-dependent penalties can lead to almost optimal performance. A simple example is used to illustrate the benefits of using a penalized procedure.
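A linear-features caricature of the procedure just described, with ridge ("penalized least-squares") regression as the subroutine, might look as follows (the toy MDP and feature map below are illustrative assumptions; the paper works in an RKHS):

```python
import numpy as np

def fitted_q_iteration(samples, phi, n_actions, gamma=0.95, lam=1e-2,
                       iters=50):
    """Fitted Q-iteration with ridge regression as the regression
    subroutine. `samples` holds (s, a, r, s_next) tuples; `phi(s, a)`
    maps a state-action pair to a feature vector."""
    d = len(phi(samples[0][0], samples[0][1]))
    w = np.zeros(d)
    X = np.array([phi(s, a) for s, a, _, _ in samples])
    for _ in range(iters):
        # Regression targets bootstrap on the greedy value of s_next.
        y = np.array([r + gamma * max(phi(s2, a2) @ w
                                      for a2 in range(n_actions))
                      for _, _, r, s2 in samples])
        # Penalized least-squares (ridge) regression step.
        w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return w

# One-state MDP, two self-loop actions: action 0 pays 1, action 1 pays 0.
# With gamma = 0.5 the optimal Q-values are Q(a=0) = 2 and Q(a=1) = 1.
feat = lambda s, a: np.eye(2)[a]
samples = [(0, 0, 1.0, 0), (0, 1, 0.0, 0)]
w = fitted_q_iteration(samples, feat, n_actions=2, gamma=0.5, lam=1e-6)
```

The penalty `lam` is what controls model complexity here; the paper's contribution is choosing it in a data-dependent way with provable guarantees.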
A Brief Survey of Parametric Value Function Approximation
Cited by 6 (2 self)
Abstract:
Reinforcement learning is a machine learning answer to the optimal control problem. It consists of learning an optimal control policy through interactions with the system to be controlled, the quality of this policy being quantified by the so-called value function. An important subtopic of reinforcement learning is to compute an approximation of this value function when the system is too large for an exact representation. This survey reviews state-of-the-art methods for (parametric) value function approximation by grouping them into three main categories: bootstrapping, residuals, and projected fixed-point approaches. Related algorithms are derived by considering one of the associated cost functions and a specific way to minimize it, almost always a stochastic gradient descent or a recursive least-squares approach.