## Basis function adaptation in temporal difference reinforcement learning (2005)

### Download Links

- [www.ee.technion.ac.il]
- [iew3.technion.ac.il]
- [webee.technion.ac.il]
- DBLP

### Other Repositories/Bibliography

Venue: Annals of Operations Research

Citations: 56 (3 self)

### BibTeX

```bibtex
@ARTICLE{Menache05basisfunction,
  author  = {Ishai Menache and Shie Mannor and Nahum Shimkin},
  title   = {Basis function adaptation in temporal difference reinforcement learning},
  journal = {Annals of Operations Research},
  year    = {2005},
  volume  = {134},
  pages   = {215--238}
}
```


### Abstract

Reinforcement Learning (RL) is an approach for solving complex multistage decision problems that fall under the general framework of Markov Decision Problems (MDPs), with possibly unknown parameters. Function approximation is essential for problems with a large state space, as it facilitates compact representation and enables generalization. Linear approximation architectures (where the adjustable parameters are the weights of pre-fixed basis functions) have recently gained prominence due to efficient algorithms and convergence guarantees. Nonetheless, an appropriate choice of basis functions is important for the success of the algorithm. In the present paper we examine methods for adapting the basis functions during the learning process, in the context of evaluating the value function under a fixed control policy. Using the Bellman approximation error as an optimization criterion, we optimize the weights of the basis functions while simultaneously adapting their (non-linear) parameters. We present two algorithms for this problem: the first uses a gradient-based approach, and the second applies the Cross Entropy method. The performance of the proposed algorithms is evaluated and compared in simulations.
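The two-level scheme described in the abstract can be sketched as follows. This is a minimal, hypothetical Python illustration (not the authors' code): linear weights are fit by least squares for fixed Gaussian RBF centers, and the Bellman approximation error serves as the score for adapting the nonlinear center parameters by a finite-difference gradient step. The chain MDP, feature widths, and step sizes are all illustrative assumptions.

```python
import numpy as np

n, gamma = 10, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n), size=n)      # transition matrix of the fixed policy
R = rng.uniform(-1.0, 1.0, size=n)         # expected one-step rewards
states = np.arange(n, dtype=float)

def phi(theta):
    """Gaussian RBF features with adjustable centers theta (widths fixed at 1)."""
    return np.exp(-0.5 * (states[:, None] - theta[None, :]) ** 2)

def bellman_score(theta):
    """Bellman approximation error of the best linear weights for these bases."""
    Phi = phi(theta)
    A = Phi - gamma * P @ Phi                    # residual operator: Phi r - gamma P Phi r
    r, *_ = np.linalg.lstsq(A, R, rcond=None)    # inner, linear-weight optimization
    return float(np.sum((A @ r - R) ** 2))

# Outer, nonlinear adaptation of the centers by finite-difference gradient descent.
theta = np.array([2.0, 5.0, 8.0])
best = bellman_score(theta)
for _ in range(100):
    eps, grad = 1e-4, np.zeros_like(theta)
    for k in range(theta.size):
        d = np.zeros_like(theta)
        d[k] = eps
        grad[k] = (bellman_score(theta + d) - bellman_score(theta - d)) / (2 * eps)
    theta -= 0.05 * grad
    best = min(best, bellman_score(theta))
```

The inner problem is convex in the weights, while the outer problem over the centers is non-convex, which is what motivates the paper's gradient-based and Cross Entropy variants.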

### Citations

4206 | Neural Networks: A Comprehensive Foundation - Haykin - 1998

Citation Context: ...res using adjustable weights. The mapping from states to features is often referred to as a basis function. A notable class of linear function approximators is that of Radial Basis Function networks (Haykin, 1998). In the RL context, linear architectures uniquely enjoy convergence results and performance guarantees, particularly for the problem of approximating the value of a fixed stationary policy (see Tsit...

4114 | Reinforcement Learning: An Introduction - Sutton, Barto - 1998

Citation Context: ...ch for solving hard Markov Decision Problems (MDPs). This framework addresses in a unified manner the problems posed by an unknown environment and a large state space (Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998). The underlying methods are based on Dynamic Programming, and include adaptive schemes that mimic either value iteration (such as Q-learning) or policy iteration (actor-critic methods). While the fo...
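The adaptive value-iteration scheme mentioned in this excerpt, Q-learning, can be sketched in tabular form. The two-state toy chain, step size, and exploration rate below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def q_learning(n_states, n_actions, step, episodes=500, gamma=0.9, alpha=0.1,
               eps=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration on a toy MDP."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
            s2, r, done = step(s, a)
            # Q-learning update: bootstrap from the greedy next-state value.
            Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s2].max()) - Q[s, a])
            s = s2
    return Q

# Toy chain: action 1 advances toward the terminal state 2, which pays reward 1.
def step(s, a):
    if a == 1:
        return s + 1, (1.0 if s + 1 == 2 else 0.0), s + 1 == 2
    return s, 0.0, False

Q = q_learning(n_states=3, n_actions=2, step=step)
```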

1387 | Reinforcement learning: A survey - Kaelbling, Littman, et al. - 1996

Citation Context: ...We finally describe experiments with CE based adaptation in Section 4.3. 4.1 General Setup: The domain which has been chosen for the experiments is a discrete two-dimensional maze-world domain (e.g. Kaelbling et al., 1996). In this domain an agent roams around the maze, trying to reach goal states as fast as possible. The agent receives a small negative reward for each step (−0.5), and a positive reward (+8) for reach...
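A minimal sketch of the maze-world described in this excerpt, using the quoted rewards (−0.5 per step, +8 at the goal); the grid size, goal location, and class interface are illustrative assumptions.

```python
# A two-dimensional maze-world: the agent moves on a grid, pays -0.5 per
# step, and earns +8 on reaching the goal state, ending the episode.
class MazeWorld:
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self, size=10, goal=(9, 9)):
        self.size, self.goal = size, goal
        self.pos = (0, 0)

    def step(self, a):
        dr, dc = self.ACTIONS[a]
        r = min(max(self.pos[0] + dr, 0), self.size - 1)   # clip to the grid
        c = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (r, c)
        if self.pos == self.goal:
            return self.pos, 8.0, True       # goal reward, episode ends
        return self.pos, -0.5, False         # small per-step penalty

env = MazeWorld()
state, reward, done = env.step(3)            # move right from (0, 0)
```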

1313 | Learning to predict by the methods of temporal differences - Sutton - 1988

Citation Context: ...ther is the generalization problem, assuming that limited experience does not provide sufficient data for each and every state. Both these issues are addressed by the Function Approximation approach (Sutton, 1988), which involves approximating the value functions by functional approximators with given architectures and a manageable number of adjustable parameters. Obviously, the success of this approach rests...

1058 | The EM Algorithm and Extensions - McLachlan, Krishnan - 1997

Citation Context: ..., it is only natural to perform a maximum likelihood estimation for the steady-state occupancy measure. We solved the above problem using the celebrated Expectation-Maximization (EM) algorithm (e.g., McLachlan and Krishnan, 1997), which gives a convenient solution for density estimation with a mixture of Gaussians. The linear weights of the system are still calculated by the LSTD algorithm. We compared the above approaches ...
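A compact sketch of EM for a mixture of Gaussians in the spirit of this excerpt (density estimation, here in one dimension); the synthetic data and initialization are illustrative assumptions, not the paper's occupancy data.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 1-D samples from two Gaussians, standing in for the
# steady-state occupancy samples mentioned above.
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 300)])

# Parameters of a 2-component Gaussian mixture.
w = np.array([0.5, 0.5])          # mixing weights
mu = np.array([-1.0, 1.0])        # means
var = np.array([1.0, 1.0])        # variances

for _ in range(50):
    # E-step: responsibility of each component for each sample.
    dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, and variances in closed form.
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
```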

796 | Neuro-Dynamic Programming (Athena Scientific) - Bertsekas, Tsitsiklis - 1996

Citation Context: ... last decade into a major approach for solving hard Markov Decision Problems (MDPs). This framework addresses in a unified manner the problems posed by an unknown environment and a large state space (Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998). The underlying methods are based on Dynamic Programming, and include adaptive schemes that mimic either value iteration (such as Q-learning) or policy iteration (actor-criti...

505 | Neuronlike adaptive elements that can solve difficult learning control problems - Barto, Sutton, et al. - 1983

Citation Context: ... be employed to estimate V^π. The resulting algorithm, where on-line estimates of V^π are used for policy improvement steps, is often referred to as an actor-critic architecture (e.g., Witten, 1977; Barto et al., 1983). In what follows we focus on the task of learning V^π for a fixed policy π. Suppose we wish to approximate V^π using a functional approximator of the form Ṽ_r : S → ℝ, where r ∈ ℝ^K is a tunable p...
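Learning V^π for a fixed policy with a linear approximator Ṽ_r(s) = r·φ(s), as discussed in this excerpt, is commonly done with semi-gradient TD(0). A minimal sketch follows; the two-state chain, one-hot features, and step size are illustrative assumptions.

```python
import numpy as np

def td0_linear(phi, sample_transition, K, steps=5000, gamma=0.9, alpha=0.05):
    """Semi-gradient TD(0) policy evaluation with a linear approximator."""
    r, s = np.zeros(K), 0
    for _ in range(steps):
        s2, reward = sample_transition(s)
        delta = reward + gamma * r @ phi(s2) - r @ phi(s)   # TD error
        r += alpha * delta * phi(s)                          # semi-gradient step
        s = s2
    return r

# Deterministic two-state chain 0 -> 1 -> 0; reward 1 on leaving state 0.
phi = lambda s: np.eye(2)[s]
trans = lambda s: (1 - s, 1.0 if s == 0 else 0.0)
r = td0_linear(phi, trans, K=2)
# With one-hot features this is tabular TD(0), so r approaches the true
# values V0 = 1/(1 - 0.81), V1 = 0.9 * V0.
```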

498 | Dynamic Programming and Optimal Control (Athena Scientific) - Bertsekas - 2002

Citation Context: ...Here 0 < γ < 1 is the discount factor, π is the agent's policy (to be optimized), E^π is the expectation operator induced by π, and s denotes the initial state. As is well known (Puterman, 1994; Bertsekas, 1995), an optimal policy exists within the class of (deterministic) stationary policies, that map states into actions. A randomized stationary policy can be identified with conditional probabilities π(a|s...
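For a fixed stationary policy, the discounted value function in the excerpt's MDP formulation satisfies the linear Bellman equation V = R + γPV, i.e. V = (I − γP)⁻¹R. A small numerical sketch, with a toy transition matrix and rewards chosen purely for illustration:

```python
import numpy as np

gamma = 0.9
# Transition matrix induced by the fixed policy and expected rewards (toy values).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
R = np.array([1.0, 0.0, -1.0])

# Solve (I - gamma P) V = R directly.
V = np.linalg.solve(np.eye(3) - gamma * P, R)

# Sanity check: the Bellman equation holds.
assert np.allclose(V, R + gamma * (P @ V))
```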

239 | An analysis of temporal-difference learning with function approximation - Tsitsiklis, Roy - 1997

191 | Linear least-squares algorithms for temporal difference learning - Bradtke, Barto - 1996

148 | The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte Carlo Simulation - Rubinstein, Kroese - 2004

Citation Context: ...orithms for adjusting the basis function parameters: The first is a local, gradient-based method, while the second is based on the Cross Entropy (CE) method for global optimization (Rubinstein, 1999; Rubinstein and Kroese, 2004). To evaluate the performance of these algorithms, we use the common "grid world" shortest path problem as a convenient test case. Our first simulations show that the gradient-based algorithm quickly...
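A generic sketch of the Cross Entropy method for continuous global optimization, as referenced here: sample candidates from a parametric family, keep the elite fraction with the best scores, refit the family to the elites, and repeat. The Gaussian family, population size, elite fraction, and quadratic test objective are illustrative choices, not the paper's setup.

```python
import numpy as np

def cross_entropy_minimize(score, dim, iters=50, pop=100, elite_frac=0.1, seed=0):
    """CE method: iteratively refit a Gaussian to the elite samples."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.full(dim, 5.0)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, dim))
        scores = np.array([score(x) for x in samples])
        elites = samples[np.argsort(scores)[:n_elite]]   # lowest scores win
        mu = elites.mean(axis=0)
        sigma = elites.std(axis=0) + 1e-8                # avoid degenerate sampling
    return mu

# Example: minimize a shifted quadratic; the optimum is at (3, -2).
best = cross_entropy_minimize(
    lambda x: np.sum((x - np.array([3.0, -2.0])) ** 2), dim=2)
```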

144 | Feature-based methods for large scale dynamic programming - Tsitsiklis, Roy - 1996

125 | Automatic discovery of subgoals in reinforcement learning using diverse density - McGovern, Barto - 2001

Citation Context: ...(see Bertsekas and Tsitsiklis, 1996). The representative states (RS) should typically be separated relative to some state space metric, and possibly represent significant states, such as bottlenecks (McGovern and Barto, 2001; Menache et al., 2002) or uncommon states (see Ratitch and Precup, 2002 for novel criteria for "complex" states). The actual number of RS is determined according to available memory and computation r...

120 | Reinforcement learning with soft state aggregation - Singh, Jaakkola, et al. - 1995

96 | Technical update: Least-squares temporal difference learning - Boyan - 2002

Citation Context: ...ble, that the above algorithm converges to a unique parameter vector r*, and further that the approximation error is bounded by a fixed multiple of the optimal one. Finally, the LSTD(λ) algorithm (Boyan, 2002) is a batch variant of TD(λ), which converges to the same weight vector as the iterative algorithm above. This algorithm computes the following K-dimensional vector and a K × K matrix: b_t = Σ_i z_i R_i ...
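The LSTD computation described in this excerpt, accumulating the vector b and the K × K matrix A along a trajectory and then solving Ar = b, can be sketched for λ = 0, where the eligibility vector reduces to the current feature vector. The feature map and toy chain below are illustrative assumptions.

```python
import numpy as np

def lstd0(trajectory, phi, gamma=0.9):
    """Batch LSTD(0): b = sum z_t R_t, A = sum z_t (phi(s_t) - gamma phi(s_t+1))^T."""
    K = phi(trajectory[0][0]).size
    A, b = np.zeros((K, K)), np.zeros(K)
    for s, reward, s_next in trajectory:
        z = phi(s)                          # eligibility vector for lambda = 0
        A += np.outer(z, phi(s) - gamma * phi(s_next))
        b += z * reward
    return np.linalg.solve(A, b)            # weight vector r = A^{-1} b

# Toy deterministic chain 0 -> 1 -> 0 with one-hot features; reward 1
# on leaving state 0, so the true values are V0 = 1/0.19, V1 = 0.9/0.19.
phi = lambda s: np.eye(2)[s]
traj = [(0, 1.0, 1), (1, 0.0, 0)] * 50
r = lstd0(traj, phi)
```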

66 | Least squares policy evaluation algorithms with linear function approximation (Discrete Event Dynamic Systems) - Nedic, Bertsekas - 2003

Citation Context: ...ar architectures uniquely enjoy convergence results and performance guarantees, particularly for the problem of approximating the value of a fixed stationary policy (see Tsitsiklis and Van Roy, 1997, Nedic and Bertsekas, 2001). Yet, the approximation quality obviously hinges on the proper choice of the basis functions. In this paper we consider the possibility of on-line tuning of the basis functions. As is common in RBF ...

62 | Nonlinear Programming, 2nd edition (Athena Scientific) - Bertsekas - 1999

58 | Reinforcement learning applied to linear quadratic regulation - Bradtke - 1993

56 | Error bounds for approximate policy iteration - Munos - 2003

Citation Context: ...Markov chain induced by π is irreducible, the above algorithm converges to a unique parameter vector r*. Furthermore, the approximation error is bounded by a fixed multiple of the optimal one (see Munos, 2003 for tighter bounds). Finally, the LSTD(λ) algorithm (Boyan, 2002) is a batch variant of TD(λ), which converges to the same weight vector as the iterative algorithm above. This algorithm computes th...

47 | Consistency of HDP Applied to a Simple Reinforcement Learning Problem - Werbos - 1990

40 | Q-cut - dynamic discovery of sub-goals in reinforcement learning - Menache, Mannor, et al. - 2002

Citation Context: ...lis, 1996). The representative states (RS) should typically be separated relative to some state space metric, and possibly represent significant states, such as bottlenecks (McGovern and Barto, 2001; Menache et al., 2002) or uncommon states (see Ratitch and Precup, 2002 for novel criteria for "complex" states). The actual number of RS is determined according to available memory and computation resources. The score fu...

31 | Exponentially Many Local Minima for Single Neurons - Auer, Herbster, et al. - 1996

Citation Context: ...iffer between different trials. This clearly indicates the existence of multiple local minima, as could be expected by the non-linear dependence of the basis functions on their parameters (see, e.g., Auer et al., 1996). To illustrate the extent of this problem, we plot in Figure 3 the mean square error of the estimated value with respect to the true value for the value estimation to which the algorithm converged. ...

25 | The cross entropy method for fast policy search - Mannor, Rubinstein, et al. - 2003

21 | Model-Free Least-Squares Policy Iteration - Lagoudakis, Parr - 2001

Citation Context: ...The eligibility vector z_t is updated as in (5). The approximating weight vector is calculated, when required, via r = A⁻¹b. This algorithm has been shown to give favorable convergence rates (e.g., Lagoudakis and Parr, 2001). 3 Evaluation Criteria: Assume now that each of the basis functions φ_k in (3) has some pre-determined parametric form φ_{θ_k}. For example, for a Gaussian radial basis function of the form φ_{θ_k}(s) = ...
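The Gaussian radial basis function whose definition is truncated in this excerpt has the standard parametric form φ_θ(s) = exp(−‖s − c‖² / (2σ²)), with center c and width σ as the tunable parameters; since the paper's exact parameterization is cut off here, this sketch uses the common textbook form.

```python
import numpy as np

def gaussian_rbf(s, c, sigma):
    """Standard Gaussian RBF: exp(-||s - c||^2 / (2 sigma^2))."""
    s, c = np.asarray(s, float), np.asarray(c, float)
    return np.exp(-np.sum((s - c) ** 2) / (2.0 * sigma ** 2))

# The feature peaks at 1 when the state coincides with the center and
# decays with squared distance, at a rate set by the width sigma.
val = gaussian_rbf([1.0, 1.0], c=[1.0, 1.0], sigma=0.5)
```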

20 | Using the Cross-Entropy Method to Guide/Govern Mobile Agent’s Path Finding in Networks - Helvik, Wittner - 2001

Citation Context: ...e that several works on optimization of systems that are essentially MDPs were conducted by the CE community. Specifically, the buffer allocation problem (Allon et al., 2001) and robot path planning (Helvik and Wittner, 2001) were recently studied. The paper is organized as follows. We start in Section 1 with a short summary of necessary ideas concerning RL in large MDPs. We then present the basis function optimization p...

17 | An overview of radial basis function networks - Ghosh, Nag

8 | Characterizing Markov Decision Processes - Ratitch, Precup - 2002

Citation Context: ...uld typically be separated relative to some state space metric, and possibly represent significant states, such as bottlenecks (McGovern and Barto, 2001; Menache et al., 2002) or uncommon states (see Ratitch and Precup, 2002 for novel criteria for "complex" states). The actual number of RS is determined according to available memory and computation resources. The score function now becomes M(θ) = Σ_{s∈RS} β(s) (Ṽ^π_θ(s)...

5 | Application of the cross entropy method for buffer allocation problem in simulation based environment - Allon, Ravin, et al. - 2001

Citation Context: ...extended by Mannor et al. (2003). We also note that several works on optimization of systems that are essentially MDPs were conducted by the CE community. Specifically, the buffer allocation problem (Allon et al., 2001) and robot path planning (Helvik and Wittner, 2001) were recently studied. The paper is organized as follows. We start in Section 1 with a short summary of necessary ideas concerning RL in large MDPs...

3 | The cross-entropy method for combinatorial and continuous optimization - Rubinstein - 1999

Citation Context: ...sider here two algorithms for adjusting the basis function parameters: The first is a local, gradient-based method, while the second is based on the Cross Entropy (CE) method for global optimization (Rubinstein, 1999). We present experiments with both adaptation methods for a grid world. The experiments demonstrate that both methods manage to improve the initial estimation quality. Our experiments show that the g...

1 | A tutorial on the cross-entropy method (available from http://www.cs.utwente.nl/˜ptdeboer/ce/, submitted to the Annals of Operations Research) - de-Boer, Kroese, et al. - 2002

Citation Context: ...Recall that the score function of (9) may be estimated for a fixed vector of parameters θ. This score can naturally serve as the score function for the CE method for optimization (Rubinstein, 1999; de-Boer et al., 2002). We follow the conventions used in de-Boer et al. (2002) and refer the reader to that tutorial. We assume that the set of parameters θ is drawn from probability density functions (pdfs), which have s...

1 | Hierarchical structures in reinforcement learning (unpublished master’s thesis) - Menache - 2002

1 | A tutorial on the cross-entropy method (available from http://iew3.technion.ac.il/CE/, to appear in the Annals of Operations Research) - de-Boer, Kroese, et al. - 2004

Citation Context: ...n the score is small for d consecutive iterations, see Eq. (22). The stopping rule is slightly different than the standard stopping rule, where γ̂_m is required not to change for a few iterations (e.g. de-Boer et al., 2004). The reason for this deviation from the standard stopping rule is that the score function is stochastic, so the score of the elite samples might fluctuate randomly even after effective convergence w...