## Fast online policy gradient learning with SMD gain vector adaptation (2006)

### Other Repositories/Bibliography

Venue: Advances in Neural Information Processing Systems 18

Citations: 12 (1 self)

### BibTeX

@INPROCEEDINGS{Schraudolph06fastonline,
    author = {Nicol N. Schraudolph and Jin Yu and Douglas Aberdeen},
    title = {Fast online policy gradient learning with SMD gain vector adaptation},
    booktitle = {Advances in Neural Information Processing Systems 18},
    year = {2006},
    pages = {110--119},
    publisher = {MIT Press}
}

### Abstract

Reinforcement learning by direct policy gradient estimation is attractive in theory but in practice leads to notoriously ill-behaved optimization problems. We improve its robustness and speed of convergence with stochastic meta-descent, a gain vector adaptation method that employs fast Hessian-vector products. In our experiments the resulting algorithms outperform previously employed online stochastic, offline conjugate, and natural policy gradient methods.

### Citations

644 | A direct adaptive method for faster backpropagation learning: The RPROP algorithm - Riedmiller, Braun - 1993 |

339 | Increased rates of convergence through learning rate adaptation. Neural Networks 1(4):295--308 - Jacobs - 1988 |

Citation Context: ... of θ with its own positive gradient step size. We adapt γ by a simultaneous meta-level gradient ascent in the objective $R_t$. A straightforward implementation of this idea is the delta-delta algorithm [5], which would update γ via

$$\gamma_{t+1} = \gamma_t + \mu \frac{\partial R_{t+1}(\theta_{t+1})}{\partial \gamma_t} = \gamma_t + \mu \frac{\partial R_{t+1}(\theta_{t+1})}{\partial \theta_{t+1}} \cdot \frac{\partial \theta_{t+1}}{\partial \gamma_t} = \gamma_t + \mu\, g_{t+1} \cdot g_t, \quad (3)$$

where $\mu \in \mathbb{R}$ is a scalar meta-step size. In a nutshell, gains are decreased where...
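The delta-delta rule (3) quoted above can be sketched in a few lines of Python. This is only an illustration: it reads the product in (3) as element-wise (an assumption, based on each parameter having its own gain), and the function name `delta_delta_step`, the gradient values, and the meta-step size are hypothetical.

```python
def delta_delta_step(gains, grad_prev, grad_new, mu):
    """One delta-delta update: gamma_{t+1} = gamma_t + mu * g_{t+1} * g_t (element-wise)."""
    return [g + mu * gn * gp for g, gp, gn in zip(gains, grad_prev, grad_new)]

gains = [0.1, 0.1]
# First coordinate: consecutive gradients agree in sign, so its gain grows.
# Second coordinate: the gradient flipped sign, so its gain shrinks.
updated = delta_delta_step(gains, grad_prev=[1.0, 2.0], grad_new=[0.5, -1.0], mu=0.01)
print(updated)
```

The gain of a parameter rises when successive gradients point the same way and falls when they oscillate, which is the intuition behind per-parameter gain adaptation.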

295 | Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation - Griewank - 2000 |

Citation Context: ... parameters has $O(n^2)$ entries, efficient indirect methods from algorithmic differentiation are available to compute its product with an arbitrary vector in the same time as 2–3 gradient evaluations [12, 13]. To improve stability, SMD employs an extended Gauss-Newton approximation of $H_t$ for which a similar (even faster) technique is available [4]. An iteration of SMD, comprising (5), (2), and (7), thus...

290 | Natural gradient works efficiently in learning - Amari - 1998 |

Citation Context: ... of high noise tolerance and online learning. Experiments (Section 4) show that the resulting SMDPOMDP algorithm can greatly outperform OLPOMDP and CONJPOMDP. Kakade [15] has applied natural gradient [16] to GPOMDP, premultiplying the policy gradient by the inverse of the online estimate

$$F_t = \left(1 - \tfrac{1}{t}\right) F_{t-1} + \tfrac{1}{t}\left(\delta_t \delta_t^\top + \epsilon I\right) \quad (15)$$

of the Fisher information matrix for the parameter update: $\theta_{t+1} = \theta_t \ldots$
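The running estimate (15) can be sketched as follows. Plain Python lists stand in for matrices. The function name `update_fisher`, the sample vectors, and the value of ε are assumptions for illustration. The parameter update is truncated in the excerpt and is not reproduced here.

```python
def update_fisher(F_prev, delta, t, eps):
    """F_t = (1 - 1/t) F_{t-1} + (1/t) (delta delta^T + eps I), for t >= 1."""
    n = len(delta)
    w = 1.0 / t
    return [
        [(1.0 - w) * F_prev[i][j] + w * (delta[i] * delta[j] + (eps if i == j else 0.0))
         for j in range(n)]
        for i in range(n)
    ]

F = [[0.0, 0.0], [0.0, 0.0]]
samples = [[1.0, 0.0], [0.0, 2.0]]  # hypothetical delta_t vectors
for t, delta in enumerate(samples, start=1):
    F = update_fisher(F, delta, t, eps=0.01)
print(F)
```

With weights 1/t the recursion is a running average of the outer products plus εI, so the εI term keeps the estimate invertible even when few samples have been seen.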

153 | Infinite-horizon policy-gradient estimation - Baxter, Bartlett |

Citation Context: ... methods. 1 Introduction. Policy gradient reinforcement learning (RL) methods train controllers by estimating the gradient of a long-term reward measure with respect to the parameters of the controller [1]. The advantage of policy gradient methods, compared to value-based RL, is that we avoid the often redundant step of accurately estimating a large number of values. Policy gradient methods are particu...

134 | Additive versus exponentiated gradient updates for linear prediction - Kivinen, Warmuth - 1997 |

Citation Context: ...nately such a simplistic approach has several problems: Firstly, (3) allows gains to become negative. This can be avoided by updating γ multiplicatively, e.g. via the exponentiated gradient algorithm [6]. Secondly, delta-delta’s cure is worse than the disease: individual gains are meant to address ill-conditioning, but (3) actually squares the condition number. The autocorrelation of the gradient mus...
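To illustrate the first problem named above, the sketch below contrasts the additive rule (3) with a multiplicative update of the exponentiated-gradient kind. The specific form γ ← γ · exp(µ g_{t+1} g_t) is an assumption chosen for illustration, since the excerpt does not spell out the multiplicative rule. The function names and numbers are hypothetical.

```python
import math

def additive_update(gain, g_prev, g_new, mu):
    # Additive delta-delta rule (3): the gain can cross zero.
    return gain + mu * g_new * g_prev

def multiplicative_update(gain, g_prev, g_new, mu):
    # Assumed exponentiated-gradient-style rule: scaling by exp(.) > 0 keeps the gain positive.
    return gain * math.exp(mu * g_new * g_prev)

gain, g_prev, g_new, mu = 0.1, 2.0, -1.0, 0.1
additive = additive_update(gain, g_prev, g_new, mu)
multiplicative = multiplicative_update(gain, g_prev, g_new, mu)
print(additive)        # negative gain
print(multiplicative)  # smaller than before, but still positive
```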

105 | A natural policy gradient - Kakade - 2002 |

Citation Context: ...te it while retaining the benefits of high noise tolerance and online learning. Experiments (Section 4) show that the resulting SMDPOMDP algorithm can greatly outperform OLPOMDP and CONJPOMDP. Kakade [15] has applied natural gradient [16] to GPOMDP, premultiplying the policy gradient by the inverse of the online estimate

$$F_t = \left(1 - \tfrac{1}{t}\right) F_{t-1} + \tfrac{1}{t}\left(\delta_t \delta_t^\top + \epsilon I\right) \quad (15)$$

of the Fisher information matrix fo...

71 | Fast exact multiplication by the Hessian - Pearlmutter - 1994 |

Citation Context: ... parameters has $O(n^2)$ entries, efficient indirect methods from algorithmic differentiation are available to compute its product with an arbitrary vector in the same time as 2–3 gradient evaluations [12, 13]. To improve stability, SMD employs an extended Gauss-Newton approximation of $H_t$ for which a similar (even faster) technique is available [4]. An iteration of SMD, comprising (5), (2), and (7), thus...
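A Hessian-vector product can be obtained from gradient evaluations alone, without ever forming the n × n matrix. The sketch below uses a central finite difference of the gradient. This is a simpler stand-in for the exact algorithmic-differentiation technique cited above, not that technique itself. The quadratic toy objective and all names are assumptions for illustration.

```python
def grad(theta):
    """Gradient of the toy objective f(x, y) = x**2 + 3*x*y + 2*y**2."""
    x, y = theta
    return [2.0 * x + 3.0 * y, 3.0 * x + 4.0 * y]

def hessian_vector_product(grad_fn, theta, v, eps=1e-4):
    """Approximate H v as (grad(theta + eps v) - grad(theta - eps v)) / (2 eps)."""
    plus = grad_fn([t + eps * vi for t, vi in zip(theta, v)])
    minus = grad_fn([t - eps * vi for t, vi in zip(theta, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]

# The Hessian of f is [[2, 3], [3, 4]], so H v for v = [1, -1] is [-1, -1].
hv = hessian_vector_product(grad, theta=[0.5, -0.2], v=[1.0, -1.0])
print(hv)
```

The cost is two gradient evaluations per product, which matches the order of magnitude quoted in the excerpt. The finite-difference version is only approximate for non-quadratic objectives.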

63 | SuperSAB: fast adaptive backpropagation with good scaling properties - Tollenaere - 1990 |

60 | Experiments with infinite-horizon, policy-gradient estimation - Baxter, Bartlett, et al. - 2001 |

Citation Context: ...ptive controller that receives two state-dependent feature values as input, and is trained to maximize the expected average reward by policy gradient methods. Using the original code of Baxter et al. [2], we replicated their experimental results for the OLPOMDP and CONJPOMDP algorithms on this simple POMDP. We can accurately reproduce all essential features of their graphed results on this problem [2...

58 | Local gain adaptation in stochastic gradient descent - Schraudolph - 1999 |

Citation Context: ... practice policy gradient methods have shown slow convergence [2], not least due to the stochastic nature of the gradients being estimated. The stochastic meta-descent (SMD) gain adaptation algorithm [3, 4] can considerably accelerate the convergence of stochastic gradient descent. In contrast to other gain adaptation methods, SMD copes well not only with stochasticity, but also with non-i.i.d. sampling...

42 | Gain adaptation beats least squares - Sutton - 1992 |

Citation Context: ...radients. Though such ad-hoc smoothing does improve performance, it does not properly capture long-term dependences, the average still being one of immediate, single-step effects. By contrast, Sutton [11] modeled the long-term effect of gains on future parameter values in a linear system by carrying the relevant partials forward in time, and found that the resulting gain adaptation can outperform a le...

38 | Fast curvature matrix-vector products for second-order gradient descent - Schraudolph - 2002 |

Citation Context: ... practice policy gradient methods have shown slow convergence [2], not least due to the stochastic nature of the gradients being estimated. The stochastic meta-descent (SMD) gain adaptation algorithm [3, 4] can considerably accelerate the convergence of stochastic gradient descent. In contrast to other gain adaptation methods, SMD copes well not only with stochasticity, but also with non-i.i.d. sampling...

38 | Acceleration techniques for the back-propagation algorithm - Silva, Almeida - 1990 |

15 | Parameter adaptation in stochastic optimization - Almeida, Langlois, et al. - 1999 |

Citation Context: ...orrelation. Such sign-based methods [5, 7–9], however, do not cope well with stochastic approximation of the gradient since the non-linear sign function does not commute with the expectation operator [10]. More recent algorithms [3, 4, 10] therefore use multiplicative (hence linear) normalization factors to condition the meta-level update. Finally, (3) fails to take into account that gain changes affe...
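A small numeric example makes the point about the sign function concrete: the sign of an averaged gradient can differ from the average of the per-sample signs. The sample values below are hypothetical and chosen only to show the effect.

```python
def sign(x):
    return (x > 0) - (x < 0)

# Hypothetical noisy gradient samples for one parameter: the mean is positive,
# but most individual samples are negative.
samples = [3.0, -1.0, -1.0]

sign_of_mean = sign(sum(samples) / len(samples))
mean_of_signs = sum(sign(g) for g in samples) / len(samples)

print(sign_of_mean)   # sign of the expected gradient
print(mean_of_signs)  # expected sign of the sampled gradient
```

Here the expected gradient is positive while the expected sign is negative, so a sign-based rule driven by individual noisy samples would on average move the wrong way.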

8 | Combining conjugate direction methods with stochastic approximation of gradients - Schraudolph, Graepel - 2003 |

Citation Context: ...tion of search directions, using a noise-tolerant line search to find the approximately best scalar step size in a given search direction. Since conjugate gradient methods are very sensitive to noise [14], CONJPOMDP must average $g_t$ over many steps to obtain a reliable gradient measurement; this makes the algorithm inherently inefficient (cf. Section 4). OLPOMDP, on the other hand, is robust to noise b...