Results 1  10
of
83
LeastSquares Policy Iteration
 Journal of Machine Learning Research
, 2003
"... We propose a new approach to reinforcement learning for control problems which combines valuefunction approximation with linear architectures and approximate policy iteration. ..."
Abstract

Cited by 301 (9 self)
 Add to MetaCart
We propose a new approach to reinforcement learning for control problems which combines valuefunction approximation with linear architectures and approximate policy iteration.
Treebased batch mode reinforcement learning
 Journal of Machine Learning Research
, 2005
"... Reinforcement learning aims to determine an optimal control policy from interaction with a system or from observations gathered from a system. In batch mode, it can be achieved by approximating the socalled Qfunction based on a set of fourtuples (xt,ut,rt,xt+1) where xt denotes the system state a ..."
Abstract

Cited by 134 (28 self)
 Add to MetaCart
Reinforcement learning aims to determine an optimal control policy from interaction with a system or from observations gathered from a system. In batch mode, it can be achieved by approximating the socalled Qfunction based on a set of fourtuples (xt,ut,rt,xt+1) where xt denotes the system state at time t, ut the control action taken, rt the instantaneous reward obtained and xt+1 the successor state of the system, and by determining the control policy from this Qfunction. The Qfunction approximation may be obtained from the limit of a sequence of (batch mode) supervised learning problems. Within this framework we describe the use of several classical treebased supervised learning methods (CART, Kdtree, tree bagging) and two newly proposed ensemble algorithms, namely extremely and totally randomized trees. We study their performances on several examples and find that the ensemble methods based on regression trees perform well in extracting relevant information about the optimal control policy from sets of fourtuples. In particular, the totally randomized trees give good results while ensuring the convergence of the sequence, whereas by relaxing the convergence constraint even better accuracy results are provided by the extremely randomized trees.
Least Squares Policy Evaluation Algorithms With Linear Function Approximation
 Theory and Applications
, 2002
"... We consider policy evaluation algorithms within the context of infinitehorizon dynamic programming problems with discounted cost. We focus on discretetime dynamic systems with a large number of states, and we discuss two methods, which use simulation, temporal differences, and linear cost function ..."
Abstract

Cited by 65 (9 self)
 Add to MetaCart
We consider policy evaluation algorithms within the context of infinitehorizon dynamic programming problems with discounted cost. We focus on discretetime dynamic systems with a large number of states, and we discuss two methods, which use simulation, temporal differences, and linear cost function approximation. The first method is a new gradientlike algorithm involving leastsquares subproblems and a diminishing stepsize, which is based on the #policy iteration method of Bertsekas and Ioffe. The second method is the LSTD(#) algorithm recently proposed by Boyan, which for # =0coincides with the linear leastsquares temporaldifference algorithm of Bradtke and Barto. At present, there is only a convergence result by Bradtke and Barto for the LSTD(0) algorithm. Here, we strengthen this result by showing the convergence of LSTD(#), with probability 1, for every # [0, 1].
Automatic basis function construction for approximate dynamic programming and reinforcement learning
 In Cohen and Moore (2006
, 2006
"... We address the problem of automatically constructing basis functions for linear approximation of the value function of a Markov Decision Process (MDP). Our work builds on results by Bertsekas and Castañon (1989) who proposed a method for automatically aggregating states to speed up value iteration. ..."
Abstract

Cited by 62 (2 self)
 Add to MetaCart
We address the problem of automatically constructing basis functions for linear approximation of the value function of a Markov Decision Process (MDP). Our work builds on results by Bertsekas and Castañon (1989) who proposed a method for automatically aggregating states to speed up value iteration. We propose to use neighborhood component analysis (Goldberger et al., 2005), a dimensionality reduction technique created for supervised learning, in order to map a highdimensional state space to a lowdimensional space, based on the Bellman error, or on the temporal difference (TD) error. We then place basis function in the lowerdimensional space. These are added as new features for the linear function approximator. This approach is applied to a highdimensional inventory control problem. 1.
Learning Evaluation Functions to Improve Optimization by Local Search
 Journal of Machine Learning Research
, 2000
"... This paper describes algorithms that learn to improve search performance on largescale optimization tasks. The main algorithm, Stage, works by learning an evaluation function that predicts the outcome of a local search algorithm, such as hillclimbing or Walksat, from features of states visited durin ..."
Abstract

Cited by 56 (0 self)
 Add to MetaCart
This paper describes algorithms that learn to improve search performance on largescale optimization tasks. The main algorithm, Stage, works by learning an evaluation function that predicts the outcome of a local search algorithm, such as hillclimbing or Walksat, from features of states visited during search. The learned evaluation function is then used to bias future search trajectories toward better optima on the same problem. Another algorithm, XStage, transfers previously learned evaluation functions to new, similar optimization problems. Empirical results are provided on seven largescale optimization domains: binpacking, channel routing, Bayesian network structurefinding, radiotherapy treatment planning, cartogram design, Boolean satisfiability, and Boggle board setup.
Basis function adaptation in temporal difference reinforcement learning
 Annals of Operations Research
, 2005
"... Reinforcement Learning (RL) is an approach for solving complex multistage decision problems that fall under the general framework of Markov Decision Problems (MDPs), with possibly unknown parameters. Function approximation is essential for problems with a large state space, as it facilitates compact ..."
Abstract

Cited by 56 (3 self)
 Add to MetaCart
Reinforcement Learning (RL) is an approach for solving complex multistage decision problems that fall under the general framework of Markov Decision Problems (MDPs), with possibly unknown parameters. Function approximation is essential for problems with a large state space, as it facilitates compact representation and enables generalization. Linear approximation architectures (where the adjustable parameters are the weights of prefixed basis functions) have recently gained prominence due to efficient algorithms and convergence guarantees. Nonetheless, an appropriate choice of basis function is important for the success of the algorithm. In the present paper we examine methods for adapting the basis function during the learning process in the context of evaluating the value function under a fixed control policy. Using the Bellman approximation error as an optimization criterion, we optimize the weights of the basis function while simultaneously adapting the (nonlinear) basis function parameters. We present two algorithms for this problem. The first uses a gradientbased approach and the second applies the Cross Entropy method. The performance of the proposed algorithms is evaluated and compared in simulations.
Regularization and feature selection in leastsquares temporal difference learning (full version). Available at http://ai.stanford.edu/˜kolter
, 2009
"... We consider the task of reinforcement learning with linear value function approximation. Temporal difference algorithms, and in particular the LeastSquares Temporal Difference (LSTD) algorithm, provide a method for learning the parameters of the value function, but when the number of features is la ..."
Abstract

Cited by 48 (1 self)
 Add to MetaCart
We consider the task of reinforcement learning with linear value function approximation. Temporal difference algorithms, and in particular the LeastSquares Temporal Difference (LSTD) algorithm, provide a method for learning the parameters of the value function, but when the number of features is large this algorithm can overfit to the data and is computationally expensive. In this paper, we propose a regularization framework for the LSTD algorithm that overcomes these difficulties. In particular, we focus on the case of l1 regularization, which is robust to irrelevant features and also serves as a method for feature selection. Although the l1 regularized LSTD solution cannot be expressed as a convex optimization problem, we present an algorithm similar to the Least Angle Regression (LARS) algorithm that can efficiently compute the optimal solution. Finally, we demonstrate the performance of the algorithm experimentally. 1.
Algorithms and Representations for Reinforcement Learning
, 2005
"... “If we knew what it was we were doing, it would not be called research, would it?” ..."
Abstract

Cited by 37 (7 self)
 Add to MetaCart
“If we knew what it was we were doing, it would not be called research, would it?”
Temporaldifference networks
 In Advances in Neural Information Processing Systems 17
, 2005
"... We introduce a generalization of temporaldifference (TD) learning to networks of interrelated predictions. Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set ..."
Abstract

Cited by 36 (7 self)
 Add to MetaCart
We introduce a generalization of temporaldifference (TD) learning to networks of interrelated predictions. Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set at a later time. TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. Using a randomwalk example, we show that these networks can be used to learn to predict by a fixed interval, which is not possible with conventional TD methods. Secondly, we show that if the interpredictive relationships are made conditional on action, then the usual learningefficiency advantage of TD methods over Monte Carlo (supervised learning) methods becomes particularly pronounced. Thirdly, we demonstrate that TD networks can learn predictive state representations that enable exact solution of a nonMarkov problem. A very broad range of interpredictive temporal relationships can be expressed in these networks. Overall we argue that TD networks represent a substantial extension of the abilities of TD methods and bring us closer to the goal of representing world knowledge in entirely predictive, grounded terms. Temporaldifference (TD) learning is widely used in reinforcement learning methods to learn momenttomoment predictions of total future reward (value functions). In this setting, TD learning is often simpler and more dataefficient than other methods. But the idea of TD learning can be used more generally than it is in reinforcement learning. TD learning is a general method for learning predictions whenever multiple predictions are made of the same event over time, value functions being just one example. The most pertinent of the more general uses of TD learning have been in learning models of an environment or