## Kernel-Based Reinforcement Learning (1999)

Venue: Machine Learning

Citations: 103 (1 self)

### BibTeX

```bibtex
@ARTICLE{Ormoneit99kernel-basedreinforcement,
  author  = {Dirk Ormoneit and Saunak Sen},
  title   = {Kernel-Based Reinforcement Learning},
  journal = {Machine Learning},
  year    = {1999},
  pages   = {161--178}
}
```

### Abstract

We present a kernel-based approach to reinforcement learning that overcomes the stability problems of temporal-difference learning in continuous state-spaces. First, our algorithm converges to a unique solution of an approximate Bellman's equation regardless of its initialization values. Second, the method is consistent in the sense that the resulting policy converges asymptotically to the optimal policy. Parametric value function estimates such as neural networks do not possess this property. Our kernel-based approach also allows us to show that the limiting distribution of the value function estimate is a Gaussian process. This information is useful in studying the bias-variance tradeoff in reinforcement learning. We find that all reinforcement learning approaches to estimating the value function, parametric or non-parametric, are subject to a bias. This bias is typically larger in reinforcement learning than in a comparable regression problem.
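The local-averaging construction behind these claims can be sketched in code. The following is a minimal illustration, not the authors' implementation: the Gaussian kernel, the function names, and the data layout are all assumptions. The point it demonstrates is the one the abstract makes: the approximate Bellman backup averages observed rewards and successor values with normalized kernel weights, so the update is a contraction and converges to a unique fixed point from any initialization.

```python
import numpy as np

def kernel_weights(centers, x, bandwidth):
    # Normalized Gaussian kernel weights of the sample states `centers` around x.
    # The weights are non-negative and sum to one (a stochastic averaging rule).
    d = np.linalg.norm(centers - x, axis=1)
    w = np.exp(-0.5 * (d / bandwidth) ** 2)
    return w / w.sum()

def kernel_based_value_iteration(transitions, bandwidth=0.3, gamma=0.9, n_iter=100):
    """transitions maps each action a to (X, Y, R): observed states, successor
    states, and rewards for that action. Returns (ys, J): the stacked successor
    states and the value-function estimate at each of them."""
    actions = list(transitions)
    ys = np.vstack([transitions[a][1] for a in actions])
    # Index slice of each action's successors inside the stacked array ys.
    slices, start = {}, 0
    for a in actions:
        n = len(transitions[a][1])
        slices[a] = slice(start, start + n)
        start += n
    J = np.zeros(len(ys))
    for _ in range(n_iter):
        J_new = np.empty_like(J)
        for i, x in enumerate(ys):
            # Approximate Bellman backup: max over actions of a kernel-weighted
            # average of (reward + gamma * value of observed successor).
            q = [kernel_weights(transitions[a][0], x, bandwidth)
                 @ (transitions[a][2] + gamma * J[slices[a]])
                 for a in actions]
            J_new[i] = max(q)
        J = J_new
    return ys, J
```

Because each weight vector is a probability distribution and gamma < 1, each iteration contracts by a factor gamma in the sup-norm, which is why the initialization does not matter.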

### Citations

2633 | Dynamic programming - Bellman - 1957
Citation Context: ...of being in x is independent of t in this case because we are always facing an infinite planning horizon, and the value function can be defined as the unique solution of the fixed-point equation J = ΓJ [3]. Intuitively, J(x) equals the expected utility of being in x at any time and choosing optimal actions at any future state as in the finite-horizon case. In the rest of the paper, when there is no dan...

1212 | Markov Decision Processes: Discrete Stochastic Dynamic Programming - Puterman - 1994
Citation Context: ...with the states i = 1, ..., N whose transition probability matrix is known. Hence, the usual contraction arguments to demonstrate the convergence of dynamic programming apply to this approximation [21, 4]. In principle, we may thus recover the value function of an arbitrary MDP by using finer and finer partitions of the state space, which corresponds to the construction of a sequence of piecewise constant ...

1003 | A Probabilistic Theory of Pattern Recognition - Devroye, Györfi, et al. - 1996
Citation Context: ...such that it is zero outside a fixed range. Using a nearest neighbor kernel, for example, the weight matrix is sparse and the complexity of (5) reduces to O(lmM) where l is the number of neighbors [9]. This computational requirement is sufficiently small to accommodate many real-world applications where the number of observations is fixed. For online problems, however, we require an algorithm whose com...
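The complexity remark in this excerpt follows because a nearest-neighbor kernel zeroes out all but l entries of each weight vector, so each backup only touches l samples. A hypothetical sketch (the function name and the uniform weighting are illustrative, not from the paper):

```python
import numpy as np

def knn_weights(centers, x, k):
    # Uniform weights on the k nearest sample states; all other entries are
    # exactly zero, so the full weight matrix is sparse with k nonzeros per row.
    d = np.linalg.norm(centers - x, axis=1)
    idx = np.argpartition(d, k - 1)[:k]
    w = np.zeros(len(centers))
    w[idx] = 1.0 / k
    return w
```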

563 | A stochastic approximation method - Robbins, Monro - 1951
Citation Context: ...Estimates of Q_{t,a}(x) may be computed by stochastic approximation of the parameters in a suitable model of the action-value function Q, for example, in a neural network architecture [22, 34]. Unfortunately, proofs for the convergence of the resulting algorithms can only be obtained in special cases [6, 14, 32]. Furthermore, these estimates are typically inconsistent in the sense that the...

523 | Local polynomial modelling and its applications - Fan, Gijbels - 1996
Citation Context: ...ction of a sequence of piecewise constant approximations of J. Practically, it is well known that piecewise constant function estimates are typically inferior to more elaborate "smoothing" methods [10]. Suppose S_a is a collection of m_a historical state transitions from x_s to y_s^a, generated using action a. To apply smoothing to reinforcement learning, we replace the events B_i with "fuzzy even...

342 | Dynamic Programming and Optimal Control - Bertsekas - 2000
Citation Context: ...with the states i = 1, ..., N whose transition probability matrix is known. Hence, the usual contraction arguments to demonstrate the convergence of dynamic programming apply to this approximation [21, 4]. In principle, we may thus recover the value function of an arbitrary MDP by using finer and finer partitions of the state space, which corresponds to the construction of a sequence of piecewise constant ...

319 | Policy gradient methods for reinforcement learning with function approximation - Sutton, McAllester, et al.
Citation Context: ...a continuous-space framework. By contrast, recently advocated "direct" policy search or perturbation methods can, by construction, be optimal at most in a local sense [28, 33]. Besides establishing consistency, we also derive the limiting distribution of the value function estimate. This is useful in studying the bias-variance tradeoff in reinforcement learning. We also find t...

252 | Generalization in reinforcement learning: Safely approximating the value function (Advances in Neural Information Processing Systems) - Boyan, Moore - 1995
Citation Context: ...orks) to represent the value function of the underlying Markov Decision Process (MDP). For a detailed discussion of this problem, as well as a list of exceptions, the interested reader is referred to [5, 31]. By adopting a non-parametric perspective on reinforcement learning, we suggest an algorithm that always converges to a unique solution. This algorithm assigns value function estimates to the states ...

235 | Optimal global rates of convergence for nonparametric regression - Stone - 1982
Citation Context: ...nt learning algorithm that does not incorporate prior knowledge can "break" the "curse of dimensionality". This is because the lower bounds for the complexity of non-linear regression are exponential [26] and reinforcement learning is at least as hard as nonlinear regression (an arbitrary regression problem can be reduced to a trivial one-step MDP; see also [23]). How...

177 | On actor-critic algorithms - Konda, Tsitsiklis - 2003
Citation Context: ...a continuous-space framework. By contrast, recently advocated "direct" policy search or perturbation methods can, by construction, be optimal at most in a local sense [28, 33]. Besides establishing consistency, we also derive the limiting distribution of the value function estimate. This is useful in studying the bias-variance tradeoff in reinforcement learning. We also find t...

153 | Learning to predict by the methods of temporal differences - Sutton - 1988
Citation Context: ...al control problems. The temporal-difference learning algorithm is due to Sutton and the idea to directly approximate the "action-value function" that is also used in this work was first used by Watkins [27, 34]. With regard to theoretical results, Tsitsiklis and Van Roy [32] prove the convergence of a stochastic algorithm for the estimation of the value function in optimal stopping problems. For practical a...

121 | Reinforcement learning for dynamic channel allocation in cellular telephone systems - Singh, Bertsekas - 1997

120 | Estimating Portfolio and Consumption Choice: A Conditional Euler Equations Approach - Brandt - 1999
Citation Context: ...ng problems. For practical applications of simulation-based optimal control in Finance, see Longstaff and Schwartz's paper on American option pricing [15] and Brandt's work on Optimal Portfolio Choice [7]. While our method addresses both discounted- and average-cost problems, we focus on discounted costs here and refer the reader interested in average costs to other work [16, 17]. The remainder of thi...

94 | Efficient Learning and Planning Within the Dyna Framework - Peng, Williams - 1993
Citation Context: ...is more widely applicable. Other related ideas can be found in [35, 30, 6, 14, 19]. Baird and Klopf [2] apply the nearest neighbor algorithm to reinforcement learning, and Connell and Utgoff [8], Peng [20], and Atkeson, Moore, and Schaal [1] apply locally weighted regression to physical control problems. The temporal-difference learning algorithm is due to Sutton and the idea to directly approximate the...

86 | Using randomization to break the curse of dimensionality - Rust - 1997
Citation Context: ...ymptotic formula for the bias increase which could help understand this issue in a more general framework. In the context of reinforcement learning, local averaging has been suggested in work by Rust [23] and Gordon [11], making the assumption that the transition probabilities of the MDP are known and can be used for learning. Our approach is fundamentally different in that kernel-based reinforcement l...

82 | Practical reinforcement learning in continuous spaces - Smart, Kaelbling - 2000
Citation Context: ...sed approximations, and trees. Yet another interesting possibility is to use locally weighted regression in place of the local averaging rule (4). Practical applications of this idea are described in [25]. Locally weighted regression can be shown to eliminate much of the bias at the boundaries of the state space and it is sometimes believed to lead to superior performance in regression problems [12, 1...

77 | Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives - Tsitsiklis, Van Roy - 1999
Citation Context: ...to Sutton and the idea to directly approximate the "action-value function" that is also used in this work was first used by Watkins [27, 34]. With regard to theoretical results, Tsitsiklis and Van Roy [32] prove the convergence of a stochastic algorithm for the estimation of the value function in optimal stopping problems. For practical applications of simulation-based optimal control in Finance, see L...

66 | Approximate Solutions to Markov Decision Processes - Gordon - 1999
Citation Context: ...for the bias increase which could help understand this issue in a more general framework. In the context of reinforcement learning, local averaging has been suggested in work by Rust [23] and Gordon [11], making the assumption that the transition probabilities of the MDP are known and can be used for learning. Our approach is fundamentally different in that kernel-based reinforcement learning only rel...

60 | Active exploration in dynamic environments - Thrun, Möller
Citation Context: ...ch is fundamentally different in that kernel-based reinforcement learning only relies on the sample trajectories of the MDP. Therefore it is more widely applicable. Other related ideas can be found in [35, 30, 6, 14, 19]. Baird and Klopf [2] apply the nearest neighbor algorithm to reinforcement learning, and Connell and Utgoff [8], Peng [20], and Atkeson, Moore, and Schaal [1] apply locally weighted regression to phys...

55 | Reinforcement Learning Applied to Linear Quadratic Regulation - Bradtke - 1993
Citation Context: ...ch is fundamentally different in that kernel-based reinforcement learning only relies on the sample trajectories of the MDP. Therefore it is more widely applicable. Other related ideas can be found in [35, 30, 6, 14, 19]. Baird and Klopf [2] apply the nearest neighbor algorithm to reinforcement learning, and Connell and Utgoff [8], Peng [20], and Atkeson, Moore, and Schaal [1] apply locally weighted regression to phys...

53 | Reinforcement learning with high-dimensional continuous actions - Baird, Klopf - 1993
Citation Context: ...licit knowledge of the transition density p(y | x, a). For this purpose it is convenient to introduce the action-value function Q_{t,a}(x) = E_t[r(X_{t+1}, x, a) + J_{t+1}(X_{t+1}) | X_t = x, a_t = a] (2). Q_{t,a} is the value of taking action a at time t and optimal actions at all future times. Definition (2) can be abbreviated by using an operator Γ_a that maps J_{t+1} to Q_{t,a}, i.e. Q_{t,a} = Γ_a J...
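Restated in display form, with \(\Gamma_a\) as an assumed name for the operator (the scanned source dropped the original glyph):

```latex
% Action-value function (eq. 2) and its operator form; \Gamma_a maps the
% next-period value function J_{t+1} to the action-value function Q_{t,a}.
\[
  Q_{t,a}(x) = \mathbb{E}_t\!\left[\, r(X_{t+1}, x, a) + J_{t+1}(X_{t+1}) \,\middle|\, X_t = x,\; a_t = a \,\right],
  \qquad
  Q_{t,a} = \Gamma_a J_{t+1}.
\]
```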

46 | Consistency of HDP applied to simple reinforcement learning problems - Werbos - 1990
Citation Context: ...ch is fundamentally different in that kernel-based reinforcement learning only relies on the sample trajectories of the MDP. Therefore it is more widely applicable. Other related ideas can be found in [35, 30, 6, 14, 19]. Baird and Klopf [2] apply the nearest neighbor algorithm to reinforcement learning, and Connell and Utgoff [8], Peng [20], and Atkeson, Moore, and Schaal [1] apply locally weighted regression to phys...

27 | Feature-based methods for large-scale dynamic programming - Tsitsiklis, Van Roy - 1996
Citation Context: ...orks) to represent the value function of the underlying Markov Decision Process (MDP). For a detailed discussion of this problem, as well as a list of exceptions, the interested reader is referred to [5, 31]. By adopting a non-parametric perspective on reinforcement learning, we suggest an algorithm that always converges to a unique solution. This algorithm assigns value function estimates to the states ...

22 | Learning to control a dynamic physical system - Connell, Utgoff - 1987
Citation Context: ...refore it is more widely applicable. Other related ideas can be found in [35, 30, 6, 14, 19]. Baird and Klopf [2] apply the nearest neighbor algorithm to reinforcement learning, and Connell and Utgoff [8], Peng [20], and Atkeson, Moore, and Schaal [1] apply locally weighted regression to physical control problems. The temporal-difference learning algorithm is due to Sutton and the idea to directly appr...

20 | Neurogammon Wins Computer Olympiad - Tesauro - 1989
Citation Context: ...learning has been applied successfully to a variety of practical applications including prominent examples such as Tesauro's Neurogammon or Singh and Bertsekas' dynamic channel allocation algorithms [29, 24]. A fundamental obstacle to a widespread application of reinforcement learning to industrial problems is that reinforcement learning algorithms frequently fail to converge to a solution. This is parti...

16 | Reinforcement Learning and Distributed Local Model Synthesis - Landelius - 1997

16 | Valuing American options by simulation: A simple least-squares approach (Review of Financial Studies 14(1)) - Longstaff, Schwartz - 2001
Citation Context: ...estimation of the value function in optimal stopping problems. For practical applications of simulation-based optimal control in Finance, see Longstaff and Schwartz's paper on American option pricing [15] and Brandt's work on Optimal Portfolio Choice [7]. While our method addresses both discounted- and average-cost problems, we focus on discounted costs here and refer the reader interested in average-...

15 | Kernel-based reinforcement learning in average-cost problems - Ormoneit
Citation Context: ...n Optimal Portfolio Choice [7]. While our method addresses both discounted- and average-cost problems, we focus on discounted costs here and refer the reader interested in average costs to other work [16, 17]. The remainder of this work is organized as follows. In Section 2, we review basic facts about Markov Decision Processes. In Sections 3 and 4, we introduce the kernel-based reinforcement learning ope...

10 | Central limit theorems for C(S)-valued random variables - Jain, Marcus - 1975
Citation Context: ...) converges to k̃(z, x) uniformly almost surely as m goes to infinity. We will show that the random function ψ̃_a(x) = ψ_a(x) − E[ψ_a(x)] satisfies a Central Limit Theorem by applying Theorem 1 of [13], where ψ_a(x) = k̃(Z, x)[r(Y, Z, a) + J(Y)]. Then it follows that (1/√m) Σ_{s=1}^{m} ψ̃_{a,s}(x) = √m(Γ̂_a J(x) − E[Γ̂_a J(x)]) converges in distribution to a Gaussian measure on C([b, 1−b]^d) as m...

10 | Convergence of reinforcement learning with general function approximators - Papavassiliou, Russell - 1999

8 | Optimal kernel shapes for local linear regression - Ormoneit, Hastie - 1999
Citation Context: ...n [25]. Locally weighted regression can be shown to eliminate much of the bias at the boundaries of the state space and it is sometimes believed to lead to superior performance in regression problems [12, 18]. From a mathematical perspective, it is well known that locally weighted regression can be interpreted as a special case of local averaging using the notion of "equivalent kernels" [10]. However, loc...

1 | Locally weighted regression for control - Atkeson, Moore, et al. - 1997
Citation Context: ...ted ideas can be found in [35, 30, 6, 14, 19]. Baird and Klopf [2] apply the nearest neighbor algorithm to reinforcement learning, and Connell and Utgoff [8], Peng [20], and Atkeson, Moore, and Schaal [1] apply locally weighted regression to physical control problems. The temporal-difference learning algorithm is due to Sutton and the idea to directly approximate the "action-value function" that is als...

1 | Local regression: Automatic kernel carpentry - Hastie, Loader - 1993
Citation Context: ...n [25]. Locally weighted regression can be shown to eliminate much of the bias at the boundaries of the state space and it is sometimes believed to lead to superior performance in regression problems [12, 18]. From a mathematical perspective, it is well known that locally weighted regression can be interpreted as a special case of local averaging using the notion of "equivalent kernels" [10]. However, loc...

1 | Kernel-based reinforcement learning in average-reward problems - Ormoneit, Glynn - 2000
Citation Context: ...n Optimal Portfolio Choice [7]. While our method addresses both discounted- and average-cost problems, we focus on discounted costs here and refer the reader interested in average costs to other work [16, 17]. The remainder of this work is organized as follows. In Section 2, we review basic facts about Markov Decision Processes. In Sections 3 and 4, we introduce the kernel-based reinforcement learning ope...