## Tree-based batch mode reinforcement learning (2005)

Venue: Journal of Machine Learning Research

Citations: 145 (29 self)

### BibTeX

@ARTICLE{Ernst05tree-basedbatch,
    author = {Damien Ernst and Pierre Geurts and Louis Wehenkel},
    title = {Tree-based batch mode reinforcement learning},
    journal = {Journal of Machine Learning Research},
    year = {2005},
    volume = {6},
    pages = {503--556}
}

### Abstract

Reinforcement learning aims to determine an optimal control policy from interaction with a system or from observations gathered from a system. In batch mode, it can be achieved by approximating the so-called Q-function based on a set of four-tuples $(x_t, u_t, r_t, x_{t+1})$, where $x_t$ denotes the system state at time $t$, $u_t$ the control action taken, $r_t$ the instantaneous reward obtained and $x_{t+1}$ the successor state of the system, and by determining the control policy from this Q-function. The Q-function approximation may be obtained from the limit of a sequence of (batch mode) supervised learning problems. Within this framework we describe the use of several classical tree-based supervised learning methods (CART, Kd-tree, tree bagging) and two newly proposed ensemble algorithms, namely extremely and totally randomized trees. We study their performances on several examples and find that the ensemble methods based on regression trees perform well in extracting relevant information about the optimal control policy from sets of four-tuples. In particular, the totally randomized trees give good results while ensuring the convergence of the sequence, whereas by relaxing the convergence constraint even better accuracy results are provided by the extremely randomized trees.
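The loop described in the abstract (a sequence of regression problems over a fixed set of four-tuples) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not the authors' implementation: the helper names `build_tree`, `predict_tree`, `fit_ensemble` and `fitted_q_iteration`, the scalar state, the five-state chain used as data, and the parameter values. The regressor is a simplified stand-in for totally randomized trees, where cut-directions and cut-points are drawn at random without looking at the outputs. Because the seed and the inputs are the same at every iteration, each iteration regrows the same tree structures and only the leaf values change.

```python
import random

def build_tree(X, y, rng, n_min=2):
    """Grow one totally randomized tree; cuts are drawn without looking at y."""
    dims = [d for d in range(len(X[0]))
            if min(x[d] for x in X) < max(x[d] for x in X)]
    if len(X) < n_min or not dims:
        return sum(y) / len(y)                    # leaf: mean of the outputs
    d = rng.choice(dims)                          # random cut-direction
    cut = rng.uniform(min(x[d] for x in X), max(x[d] for x in X))
    left = [i for i, x in enumerate(X) if x[d] < cut]
    right = [i for i, x in enumerate(X) if x[d] >= cut]
    if not left or not right:                     # degenerate cut: stop here
        return sum(y) / len(y)
    return (d, cut,
            build_tree([X[i] for i in left], [y[i] for i in left], rng, n_min),
            build_tree([X[i] for i in right], [y[i] for i in right], rng, n_min))

def predict_tree(node, x):
    """Follow the cuts down to a leaf and return its stored mean."""
    while isinstance(node, tuple):
        d, cut, left, right = node
        node = left if x[d] < cut else right
    return node

def fit_ensemble(X, y, n_trees=20, seed=0):
    """Fit an ensemble and return a prediction function (average over trees)."""
    rng = random.Random(seed)
    trees = [build_tree(X, y, rng) for _ in range(n_trees)]
    return lambda x: sum(predict_tree(t, x) for t in trees) / len(trees)

def fitted_q_iteration(four_tuples, actions, gamma=0.9, n_iterations=50):
    """Approximate Q from four-tuples (x_t, u_t, r_t, x_{t+1}), scalar states."""
    inputs = [(x, u) for x, u, _, _ in four_tuples]
    q = lambda xu: 0.0                            # Q_0 is zero everywhere
    for _ in range(n_iterations):
        # Regression targets: r_t + gamma * max_u' Q_{N-1}(x_{t+1}, u')
        targets = [r + gamma * max(q((x_next, a)) for a in actions)
                   for _, _, r, x_next in four_tuples]
        q = fit_ensemble(inputs, targets)         # one supervised problem
    return q

# Hypothetical toy system: a chain of states 0..4, reward 1 for landing on 4.
states, actions = range(5), (-1, 1)
step = lambda x, u: min(max(x + u, 0), 4)
four_tuples = [(x, u, 1.0 if step(x, u) == 4 else 0.0, step(x, u))
               for x in states for u in actions]

q_hat = fitted_q_iteration(four_tuples, actions)
policy = {x: max(actions, key=lambda u: q_hat((x, u))) for x in states}
print(policy)
```

The control policy is read off the final approximation by taking, in each state, the action with the largest estimated Q-value. Swapping `fit_ensemble` for another regressor changes only the supervised learning step, which is the point of the framework.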

### Citations

4363 |
Classification and Regression Trees
- Breiman, Friedman, et al.
- 1984
Citation Context ...e. it does not change from one iteration to another of the fitted Q iteration algorithm. 4.2.2 PRUNED CART TREE The classical CART algorithm is used to grow completely the tree from the training set (Breiman et al., 1984). This algorithm selects at a node the test (i.e., the cut-direction and cut-point) that maximizes the average variance reduction of the output variable (see Eqn (25) in Appendix A). The tree is prun... |

4118 | Reinforcement Learning: An Introduction - Sutton, Barto - 1998 |

2857 |
Dynamic Programming
- Bellman
- 1957
Citation Context ...after the batch mode reinforcement learning problem in this context and we restate some classical results stemming from Bellman’s dynamic programming approach to optimal control theory (introduced in Bellman, 1957) and from which the fitted Q iteration algorithm takes its roots. 2.1 Batch Mode Reinforcement Learning Problem Formulation Let us consider a system having a discrete-time dynamics described by $x_{t+1}$ ... |

2727 | Bagging predictors - Breiman - 1996 |

1623 | Random forests - Breiman - 2001 |

1398 |
Learning from delayed rewards
- Watkins
- 1989
Citation Context ...from this set a control policy which is as close as possible to an optimal policy. Inspired by the on-line Q-learning paradigm (Watkins, 1989), we will approach this batch mode learning problem by computing from the set of four-tuples an approximation of the so-called Q-function defined on the state-action space and by deriving from this l... |

1388 | Reinforcement learning: A survey - Kaelbling, Littman, et al. - 1996 |

1315 | Learning to predict by methods of temporal differences
- Sutton
- 1988
Citation Context ...s it possible to take full advantage in the context of reinforcement learning of the generalization capabilities of any regression algorithm, and this contrary to stochastic approximation algorithms (Sutton, 1988; Tsitsiklis, 1994) which can only use parametric function approximators (for example, linear combinations of feature vectors or neural networks). In the rest of this paper we will call this framework... |

864 | Optimization by Vector Space Methods
- Luenberger
Citation Context ...$\le \gamma \max_{(x,u) \in X \times U} \Big| \sum_{l=1}^{\#\mathcal{F}} k_{TS}\big((x_t^l, u_t^l), (x,u)\big) \max_{u' \in U} \big[K(x_{t+1}^l, u') - \overline{K}(x_{t+1}^l, u')\big] \Big| \le \gamma \max_{(x,u) \in X \times U} |K(x,u) - \overline{K}(x,u)| = \gamma \|K - \overline{K}\|_\infty < \|K - \overline{K}\|_\infty$. By virtue of the fixed-point theorem (Luenberger, 1969) the sequence converges, independently of the initial conditions, to the function $\hat{Q} : X \times U \to \mathbb{R}$ which is unique solution of the equation $\hat{Q} = \hat{H}\hat{Q}$. Appendix C. Definition of the Benchmark Optimal Co... |

375 | Generalization in reinforcement learning: Successful examples using sparse coarse coding - Sutton - 1996 |

334 | Prioritized sweeping: reinforcement learning with less data and less real time. Machine Learning 13(1):102–130 - Moore, Atkeson - 1993 |

320 | Least-squares policy iteration - Lagoudakis, Parr - 2003 |

291 |
Reinforcement Learning with Selective Perception and Hidden State
- McCallum
- 1995
Citation Context ...ction. Several authors who adapted regression trees in other ways to reinforcement learning have suggested the use of other score criteria for example based on the violation of the Markov assumption (McCallum, 1996; Uther and Veloso, 1998) or on the combination of several error terms like the supervised, the Bellman, and the advantage error terms (Wang and Dietterich, 1999). Investigating the effect of such scor... |

268 | Generalization in Reinforcement Learning: Safely Approximating the Value Function
- Boyan, Moore
- 1995
Citation Context ...methods, divergence to infinity problems were plaguing the fitted Q iteration algorithm (Section 5.3.3); such problems have already been highlighted in the context of approximate dynamic programming (Boyan and Moore, 1995). 4.5 Computation of $\max_{u \in U} \hat{Q}_N(x,u)$ when $u$ Continuous In the case of a single regression tree, $\hat{Q}_N(x,u)$ is a piecewise-constant function of its argument $u$, when fixing the state value $x$. Thus, to de... |

255 | Residual Algorithms: Reinforcement Learning with Function Approximation
- Baird
- 1995
Citation Context ...function computed by the RL algorithm. For a given function $\hat{Q}$ and a given state-action pair $(x,u)$, the Bellman residual is defined to be the difference between the two sides of the Bellman equation (Baird, 1995), the Q-function being the only function leading to a zero Bellman residual for every state-action pair. In our simulation, to estimate the quality of a function $\hat{Q}$, we exploit the Bellman residual co... |

236 | The parti-game algorithm for variable resolution reinforcement learning in multidimensional statespaces - Moore, Atkeson - 1995 |

225 | Stable function approximation in dynamic programming
- Gordon
- 1995
Citation Context ... a separate kernel for each value of the action. The work of Ormoneit and Sen is related to earlier work aimed to solve large-scale dynamic programming problems (see for example Bellman et al., 1973; Gordon, 1995b; Tsitsiklis and Van Roy, 1996; Rust, 1997). The main difference is that in these works the various elements that compose the optimal control problem are supposed to be known. We gave the name fitted... |

217 | PEGASUS:A policy search method for large MDPs and POMDPs
- Ng, Jordan
- 2000
Citation Context ...ve fallen down, is now equal to $c_{reward}(d_{angle}(\psi_t) - d_{angle}(\psi_{t+1}))$ with 22. Several other papers treat the problems of balancing and/or balancing and riding a bicycle (e.g. Randløv and Alstrøm, 1998; Ng and Jordan, 1999; Lagoudakis and Parr, 2003b,a). The reader can refer to them in order to put the performances of fitted Q iteration in comparison with some other RL algorithms. In particular, he could refer to Randl... |

195 | Reinforcement Learning for Robots Using Neural Networks - Lin - 1993 |

168 | Asynchronous stochastic approximation and Q-learning
- Tsitsiklis
- 1993
Citation Context ...to take full advantage in the context of reinforcement learning of the generalization capabilities of any regression algorithm, and this contrary to stochastic approximation algorithms (Sutton, 1988; Tsitsiklis, 1994) which can only use parametric function approximators (for example, linear combinations of feature vectors or neural networks). In the rest of this paper we will call this framework the fitted Q iter... |

144 | Extremely randomized trees
- Geurts, Ernst, et al.
- 2006
Citation Context ... randomly) can be adapted to ensure the convergence of the sequence while leading to good approximation performances. On the other hand, another tree-based algorithm named extremely randomized trees (Geurts et al., 2004), will be found to perform consistently better than totally randomized trees even though it does not strictly ensure the convergence of the sequence of Q-function approximations. The remainder of thi... |

120 | Reinforcement learning with soft state aggregation - Singh, Jaakkola, et al. - 1995 |

109 | Kernel-based reinforcement learning
- Ormoneit, Sen
- 2002
Citation Context ...ighbors method, partition and multi-partition methods, locally weighted averaging, linear, and multi-linear interpolation. They are collectively referred to as kernel-based methods (see Gordon, 1999; Ormoneit and Sen, 2002). 3.6 Related Work As stated in the Introduction, the idea of trying to approximate the Q-function from a set of four-tuples by solving a sequence of supervised learning problems may already be found ... |

96 | Technical update: Least-squares temporal difference learning - Boyan - 2002 |

93 | Using Randomization to Break the Curse of Dimensionality
- Rust
- 1997
Citation Context ...on. The work of Ormoneit and Sen is related to earlier work aimed to solve large-scale dynamic programming problems (see for example Bellman et al., 1973; Gordon, 1995b; Tsitsiklis and Van Roy, 1996; Rust, 1997). The main difference is that in these works the various elements that compose the optimal control problem are supposed to be known. We gave the name fitted Q iteration to our algorithm given in Figu... |

89 | Practical Reinforcement Learning in Continuous Spaces - Smart, Kaelbling |

69 | Approximate Solutions to Markov Decision Processes
- Gordon
- 1999
Citation Context ...y sparse sets of four-tuples. To overcome this generalization problem, a particularly attractive framework is the one used by Ormoneit and Sen (2002) which applies the idea of fitted value iteration (Gordon, 1999) to kernel-based reinforcement learning, and reformulates the Q-function determination problem as a sequence of kernel-based regression problems. Actually, this framework makes it possible to take ful... |

56 | Policy search by dynamic programming
- Bagnell, Kakade, et al.
- 2003
Citation Context ...e complete generative model assumption of Gordon (footnote 9, page 511). ...standard supervised classification problems has been developed (see Lagoudakis and Parr, 2003b; Bagnell et al., 2003), taking its roots from the policy iteration algorithm, another classical dynamic programming algorithm. Within this “reductionist” framework, the fitted Q iteration algorithm can be considered as a ... |

46 | Approximately optimal approximate reinforcement learning
- Kakade, Langford
Citation Context ...nt works based on the idea of reductions of reinforcement learning to supervised learning (classification or regression) with various assumptions concerning the available a priori knowledge (see e.g. Kakade and Langford, 2002; Langford and Zadrozny, 2004, and the references therein). For example, assuming that a generative model is available, an approach to solve the optimal control problem by reformulating it as a seq... |

46 |
Swing Up Control of the Acrobot
- Spong
- 1994
Citation Context ...oes not. The system has four continuous state variables: two joint positions ($\theta_1$ and $\theta_2$) and two joint velocities ($\dot{\theta}_1$ and $\dot{\theta}_2$). This system has been extensively studied by control engineers (e.g. Spong, 1994) as well as machine learning researchers (e.g. Yoshimoto et al., 1999). We have stated this control problem so that the optimal stationary policy brings the Acrobot quickly into a specified neighborh... |

44 | Tree based discretization for continuous state space reinforcement learning
- Uther, Veloso
- 1998
Citation Context ...uthors who adapted regression trees in other ways to reinforcement learning have suggested the use of other score criteria for example based on the violation of the Markov assumption (McCallum, 1996; Uther and Veloso, 1998) or on the combination of several error terms like the supervised, the Bellman, and the advantage error terms (Wang and Dietterich, 1999). Investigating the effect of such score measures within the fi... |

35 |
Polynomial Approximation: A New Computational Technique
- Bellman, R, et al.
- 1973
Citation Context ... action spaces and use a separate kernel for each value of the action. The work of Ormoneit and Sen is related to earlier work aimed to solve large-scale dynamic programming problems (see for example Bellman et al., 1973; Gordon, 1995b; Tsitsiklis and Van Roy, 1996; Rust, 1997). The main difference is that in these works the various elements that compose the optimal control problem are supposed to be known. We gave t... |

30 |
Some infinity theory for predictor ensembles
- Breiman
- 2000
Citation Context ...tions for kernel-based supervised learning methods within the context of fitted Q iteration, and also in some of the material published in the supervised learning literature (e.g. Lin and Jeon, 2002; Breiman, 2000). More specifically, further investigation in order to characterize ensembles of regression trees with respect to consistency is particularly wishful, because of their good practical performances. In... |

28 | Random Forests and adaptive nearest neighbor
- Lin, Jeon
Citation Context ...de consistency conditions for kernel-based supervised learning methods within the context of fitted Q iteration, and also in some of the material published in the supervised learning literature (e.g. Lin and Jeon, 2002; Breiman, 2000). More specifically, further investigation in order to characterize ensembles of regression trees with respect to consistency is particularly wishful, because of their good practical p... |

20 | Online fitted reinforcement learning
- Gordon
- 1995
Citation Context ... a separate kernel for each value of the action. The work of Ormoneit and Sen is related to earlier work aimed to solve large-scale dynamic programming problems (see for example Bellman et al., 1973; Gordon, 1995b; Tsitsiklis and Van Roy, 1996; Rust, 1997). The main difference is that in these works the various elements that compose the optimal control problem are supposed to be known. We gave the name fitted... |

17 | Kernel-based reinforcement learning in average-cost problems - Ormoneit, Glynn - 2002 |

17 | Efficient Value Function Approximation Using Regression Trees - Dietterich, Wang - 1999 |

15 |
Iteratively Extending Time Horizon Reinforcement Learning
- Ernst, Geurts, et al.
- 2003
Citation Context ...ning algorithm which yields an approximation of the Q-function corresponding to an infinite horizon optimal control problem with discounted rewards, by iteratively extending the optimization horizon (Ernst et al., 2003): • At the first iteration it produces an approximation of a $Q_1$-function corresponding to a 1-step optimization. Since the true $Q_1$-function is the conditional expectation of the instantaneous reward ... |

9 | Reducing T-step reinforcement learning to classification. Preprint at http://hunch.net/~jl/projects/reductions/RL_to_class/colt_submission.ps
- Langford, Zadrozny
- 2003
Citation Context ...of reductions of reinforcement learning to supervised learning (classification or regression) with various assumptions concerning the available a priori knowledge (see e.g. Kakade and Langford, 2002; Langford and Zadrozny, 2004, and the references therein). For example, assuming that a generative model is available, an approach to solve the optimal control problem by reformulating it as a sequence of 9. Gordon supposes t... |

5 |
Approximate value iteration in the reinforcement learning context. application to electrical power system control
- Ernst, Glavic, et al.
- 2005
Citation Context ...would be the same as the one estimated by a model-based algorithm using the same grid (see Ernst (2003), page 131 for the proof) which in turn can be shown to be equivalent to fitted Q iteration (see Ernst et al., 2005). [Figure 17: Score of $\hat{\mu}^*_{50}$; vertical axis $J_\infty^{\hat{\mu}^*_{50}}$; curves: Pruned CART Tree, Totally Rand. Trees, Kd-Tree (Best $n_{min}$), kNN (k = 2... |