Results 1 - 10 of 53
Tree-based batch mode reinforcement learning
- Journal of Machine Learning Research
, 2005
Cited by 93 (22 self)
Reinforcement learning aims to determine an optimal control policy from interaction with a system or from observations gathered from a system. In batch mode, this can be achieved by approximating the so-called Q-function from a set of four-tuples $(x_t, u_t, r_t, x_{t+1})$, where $x_t$ denotes the system state at time $t$, $u_t$ the control action taken, $r_t$ the instantaneous reward obtained and $x_{t+1}$ the successor state of the system, and then determining the control policy from this Q-function. The Q-function approximation may be obtained as the limit of a sequence of (batch mode) supervised learning problems. Within this framework we describe the use of several classical tree-based supervised learning methods (CART, Kd-tree, tree bagging) and two newly proposed ensemble algorithms, namely extremely and totally randomized trees. We study their performance on several examples and find that the ensemble methods based on regression trees perform well in extracting relevant information about the optimal control policy from sets of four-tuples. In particular, the totally randomized trees give good results while ensuring the convergence of the sequence, whereas the extremely randomized trees, which relax the convergence constraint, give even better accuracy.
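A minimal sketch of the sequence of supervised learning problems described above (often called fitted Q iteration), assuming discrete states and actions and a discount factor `gamma`. To keep it dependency-free, the tree-based regressors are swapped for a hypothetical averaging regressor that stores the mean regression target per (state, action) pair; a tree ensemble would take the place of the `sums`/`counts` step.

```python
from collections import defaultdict


def fitted_q_iteration(tuples, actions, gamma=0.9, n_iter=50):
    """Fitted Q iteration over four-tuples (x, u, r, x_next).

    The "regressor" is a hypothetical stand-in for CART / extra-trees:
    it averages the targets observed for each discrete (x, u) pair.
    """
    q = {}  # current approximation Q_N, keyed by (state, action)

    def q_value(x, u):
        # Pairs never seen in the batch default to 0.
        return q.get((x, u), 0.0)

    for _ in range(n_iter):
        # Training set for Q_{N+1}: inputs (x, u), targets r + gamma * max_u' Q_N(x', u').
        sums = defaultdict(float)
        counts = defaultdict(int)
        for x, u, r, x_next in tuples:
            target = r + gamma * max(q_value(x_next, a) for a in actions)
            sums[(x, u)] += target
            counts[(x, u)] += 1
        # "Fit" the regressor: mean target per (x, u) pair.
        q = {key: sums[key] / counts[key] for key in sums}
    return q


def greedy_policy(q, x, actions):
    """Derive the control policy from the fitted Q-function."""
    return max(actions, key=lambda u: q.get((x, u), 0.0))
```

On a two-state chain where staying in state 1 earns reward 1 and every other move earns 0, the greedy policy moves out of state 0 and stays in state 1.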
Kernel-Based Reinforcement Learning
- Machine Learning
, 1999
Cited by 79 (1 self)
We present a kernel-based approach to reinforcement learning that overcomes the stability problems of temporal-difference learning in continuous state spaces. First, our algorithm converges to a unique solution of an approximate Bellman equation regardless of its initialization values. Second, the method is consistent in the sense that the resulting policy converges asymptotically to the optimal policy. Parametric value function estimates such as neural networks do not possess this property. Our kernel-based approach also allows us to show that the limiting distribution of the value function estimate is a Gaussian process. This information is useful in studying the bias-variance tradeoff in reinforcement learning. We find that all reinforcement learning approaches to estimating the value function, parametric or non-parametric, are subject to a bias. This bias is typically larger in reinforcement learning than in a comparable regression problem.
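The approximate Bellman equation can be illustrated with a kernel-weighted backup: for each action, the value at a state x is a normalized-kernel average of r + gamma * V(y) over sampled transitions (x_s, r, y). This is a sketch under assumptions not stated in the abstract: scalar states, a Gaussian kernel with a hypothetical `bandwidth`, and a discounted criterion. Because the normalized weights sum to 1, each backup is a gamma-contraction, so the iteration reaches the same fixed point from any initial values.

```python
import math


def kernel_weights(x, centers, bandwidth):
    """Normalized Gaussian kernel weights over the sample states (they sum to 1)."""
    exponents = [-((x - c) / bandwidth) ** 2 for c in centers]
    top = max(exponents)  # subtract the largest exponent to avoid underflow
    raw = [math.exp(e - top) for e in exponents]
    total = sum(raw)
    return [w / total for w in raw]


def kernel_value_iteration(samples, gamma=0.9, bandwidth=0.5, n_iter=300):
    """Iterate a kernel-based approximate Bellman operator.

    samples maps each action to a list of transitions (x, r, y):
    start state, reward, successor state (all scalar).
    Returns V as a function that can be evaluated at any state.
    """
    # V only has to be stored at the sampled successor states.
    support = sorted({y for trans in samples.values() for (_, _, y) in trans})
    v = {y: 0.0 for y in support}

    def backup(x, v):
        # Max over actions of the kernel-weighted average of r + gamma * V(y).
        best = -math.inf
        for trans in samples.values():
            weights = kernel_weights(x, [s for (s, _, _) in trans], bandwidth)
            q = sum(w * (r + gamma * v[y]) for w, (_, r, y) in zip(weights, trans))
            best = max(best, q)
        return best

    for _ in range(n_iter):
        v = {y: backup(y, v) for y in support}
    return lambda x: backup(x, v)
```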
A stochastic mesh method for pricing high-dimensional American options
- Journal of Computational Finance
, 1997
Cited by 60 (6 self)
High-dimensional problems frequently arise in the pricing of derivative securities, for example in pricing options on multiple underlying assets and in pricing term structure derivatives. American versions of these options, i.e., those whose owner has the right to exercise early, are particularly challenging to price. We introduce a stochastic mesh method for pricing high-dimensional American options when there is a finite, but possibly large, number of exercise dates. The algorithm provides point estimates and confidence intervals; we provide conditions under which these estimates converge to the correct values as the computational effort increases. Numerical results illustrate the performance of the method.
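For orientation, the backward recursion behind a mesh estimator with $m$ exercise dates can be written as below. The notation is assumed here and may differ from the paper: $h_i$ is the exercise payoff at date $i$ (discounting absorbed into the payoff), $X_{i,1}, \dots, X_{i,b}$ are the mesh nodes at date $i$, $f_i(x, y)$ is the transition density from date $i$ to date $i+1$, and $g_{i+1}$ is the density from which the date-$(i+1)$ nodes were drawn.

$$\hat{Q}_m(X_{m,j}) = h_m(X_{m,j}), \qquad j = 1, \dots, b,$$

$$\hat{Q}_i(X_{i,j}) = \max\left\{ h_i(X_{i,j}),\ \frac{1}{b} \sum_{k=1}^{b} \frac{f_i(X_{i,j}, X_{i+1,k})}{g_{i+1}(X_{i+1,k})}\, \hat{Q}_{i+1}(X_{i+1,k}) \right\}, \qquad i = m-1, \dots, 1.$$

The ratio $f_i / g_{i+1}$ reweights nodes drawn from $g_{i+1}$ so that the inner sum is an importance-sampling estimate of the continuation value $\mathbb{E}[\hat{Q}_{i+1}(X_{i+1}) \mid X_i = X_{i,j}]$; the price estimate at the initial state follows from one more step of the same recursion.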
Bayes Meets Bellman: The Gaussian Process Approach to Temporal Difference Learning
- Proc. of the 20th International Conference on Machine Learning
, 2003
Cited by 42 (7 self)
We present a novel Bayesian approach to the problem of value function estimation in continuous state spaces. We define a probabilistic generative model for the value function by imposing a Gaussian prior over value functions and assuming a Gaussian noise model.
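A minimal sketch of the resulting posterior mean, assuming scalar states, a squared-exponential prior covariance with a hypothetical `length` scale, and independent reward noise of variance `noise**2` (the paper's exact noise model may differ). Stacking the model r_t = V(x_t) - gamma * V(x_{t+1}) + N_t over a trajectory as r = H v + N gives the posterior mean V(x) = k(x)^T H^T (H K H^T + noise^2 I)^{-1} r. The small linear solver only keeps the block dependency-free.

```python
import math


def rbf(a, b, length=1.0):
    """Squared-exponential prior covariance between two scalar states."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def gptd_posterior_mean(states, rewards, gamma=0.9, noise=0.1, length=1.0):
    """Posterior mean of V under r_t = V(x_t) - gamma * V(x_{t+1}) + white noise.

    states: visited states x_0..x_T; rewards: r_0..r_{T-1}.
    """
    T = len(rewards)
    n = T + 1
    K = [[rbf(a, b, length) for b in states] for a in states]
    # H maps V(x_0..x_T) to the T differences V(x_t) - gamma * V(x_{t+1}).
    H = [[1.0 if k == t else (-gamma if k == t + 1 else 0.0) for k in range(n)]
         for t in range(T)]
    HK = [[sum(H[i][k] * K[k][j] for k in range(n)) for j in range(n)] for i in range(T)]
    G = [[sum(HK[i][k] * H[j][k] for k in range(n)) + (noise ** 2 if i == j else 0.0)
          for j in range(T)] for i in range(T)]
    alpha = solve(G, rewards)  # (H K H^T + noise^2 I)^{-1} r

    def mean(x):
        # k(x)^T H^T alpha
        kx = [rbf(x, s, length) for s in states]
        return sum(alpha[i] * sum(H[i][k] * kx[k] for k in range(n)) for i in range(T))

    return mean
```

With small noise, the posterior mean nearly reproduces each observed reward through V(x_t) - gamma * V(x_{t+1}).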
Comparing Solution Methods for Dynamic Equilibrium Economies
- Journal of Economic Dynamics and Control
, 2006
Cited by 39 (15 self)
This paper compares solution methods for dynamic equilibrium economies. We compute and simulate the stochastic neoclassical growth model with leisure choice using Undetermined Coefficients in levels and in logs, Finite Elements, Chebyshev Polynomials, Second and Fifth Order Perturbations and Value Function Iteration for several calibrations. We document the performance of the methods in terms of computing time, implementation complexity and accuracy, and we present some conclusions about our preferred approaches based on the reported evidence.
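Of the methods compared, value function iteration is the simplest to sketch. The block below is a simplified stand-in for the paper's model (an assumption, not its calibration): a deterministic growth model without leisure, with log utility, output k**alpha and full depreciation. This special case has the known closed-form policy k' = alpha * beta * k**alpha, which makes the grid-search result easy to check.

```python
import math


def value_function_iteration(alpha=0.3, beta=0.95, n_grid=101, tol=1e-6, max_iter=1000):
    """Grid-search value function iteration for a deterministic growth model.

    Simplified model: V(k) = max_{k'} log(k**alpha - k') + beta * V(k').
    Returns the capital grid, the value function and the policy k'(k).
    """
    k_ss = (alpha * beta) ** (1.0 / (1.0 - alpha))  # steady-state capital
    # Capital grid from 0.5 * k_ss to 1.5 * k_ss.
    grid = [k_ss * (0.5 + i / (n_grid - 1)) for i in range(n_grid)]
    v = [0.0] * n_grid
    policy = [0] * n_grid
    for _ in range(max_iter):
        new_v = []
        for i, k in enumerate(grid):
            output = k ** alpha
            best, best_j = -math.inf, 0
            for j, k_next in enumerate(grid):
                c = output - k_next  # consumption
                if c <= 0:
                    break  # the grid is increasing, so larger k' is infeasible too
                value = math.log(c) + beta * v[j]
                if value > best:
                    best, best_j = value, j
            new_v.append(best)
            policy[i] = best_j
        diff = max(abs(a - b) for a, b in zip(new_v, v))  # sup-norm change
        v = new_v
        if diff < tol:
            break
    return grid, v, [grid[j] for j in policy]
```

The grid search is the crudest maximization step; the finite-element and Chebyshev methods in the comparison replace the grid with smooth basis functions.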
Learning and Value Function Approximation in Complex Decision Processes
, 1998
Cited by 34 (4 self)
In principle, a wide variety of sequential decision problems -- ranging from dynamic resource allocation in telecommunication networks to financial risk management -- can be formulated in terms of stochastic control and solved by the algorithms of dynamic programming. Such algorithms compute and store a value function, which evaluates expected future reward as a function of current state. Unfortunately, exact computation of the value function typically requires time and storage that grow proportionately with the number of states, and consequently, the enormous state spaces that arise in practical applications render the algorithms intractable. In this thesis, we study tractable methods that approximate the value function. Our work builds on research in an area of artificial intelligence known as reinforcement learning. A point of focus of this thesis is temporal-difference learning -- a stochastic algorithm inspired to some extent by phenomena observed in animal behavior. Given a selection of...
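A minimal sketch of temporal-difference learning in its simplest form, TD(0) with a linear value function V(x) = theta . phi(x), assuming sampled transitions (x, r, x_next) and a constant step size. The feature map `features` is a hypothetical input; with one-hot features, as in the check below, the update reduces to tabular TD(0).

```python
def td0_linear(transitions, features, n_features, gamma=0.9, step=0.1, sweeps=1000):
    """TD(0) with a linear value function V(x) = theta . phi(x).

    transitions: list of (x, r, x_next); x_next is None at termination.
    features: function mapping a state to a list of n_features floats.
    """
    theta = [0.0] * n_features

    def value(x):
        if x is None:  # a terminal state has value 0
            return 0.0
        return sum(t * f for t, f in zip(theta, features(x)))

    for _ in range(sweeps):
        for x, r, x_next in transitions:
            # Temporal difference: bootstrapped target minus current estimate.
            delta = r + gamma * value(x_next) - value(x)
            phi = features(x)
            for i in range(n_features):
                theta[i] += step * delta * phi[i]
    return theta, value
```

On a three-state chain 0 -> 1 -> 2 -> end with reward 1 on the last step, the estimates approach V(2) = 1, V(1) = gamma and V(0) = gamma**2.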
Modeling Model Uncertainty
- Journal of the European Economic Association
, 2003
Cited by 28 (5 self)
Recently there has been a great deal of interest in studying monetary policy under model uncertainty. We point out that different assumptions about the uncertainty may result in drastically different “robust” policy recommendations. Therefore, we develop new methods to analyze uncertainty about the parameters of a model, the lag specification, the serial correlation of shocks, and the effects of real-time data in one coherent structure. We consider both parametric and nonparametric specifications of this structure and use them to estimate the uncertainty in a small model of the US economy. We then use our estimates to compute robust Bayesian and minimax monetary policy rules, which are designed to perform well in the face of uncertainty. Our results suggest that the aggressiveness recently found in robust policy rules is likely to be caused by overemphasizing uncertainty about economic dynamics at low frequencies.
A unifying framework for computational reinforcement learning theory
, 2009
Cited by 13 (6 self)
Computational learning theory studies mathematical models that allow one to formally analyze and compare the performance of supervised-learning algorithms, for example in terms of sample complexity. While existing models such as PAC (Probably Approximately Correct) have played an influential role in understanding the nature of supervised learning, they have not been as successful in reinforcement learning (RL). Here, the fundamental barrier is the need for active exploration in sequential decision problems. An RL agent tries to maximize long-term utility by exploiting its knowledge about the problem, but this knowledge has to be acquired by the agent itself through exploration, which may reduce short-term utility. The need for active exploration is common in many problems in daily life, engineering, and the sciences. For example, a Backgammon program strives to make good moves to maximize the probability of winning a game, but sometimes it may try novel and possibly harmful moves to discover how the opponent reacts, in the hope of discovering a better game-playing strategy. It has been known since the early days of RL that a good tradeoff between exploration and exploitation is critical for the agent to learn fast (i.e., to reach near-optimal strategies ...
Solving Factored MDPs with Hybrid State and Action Variables
- Journal of Artificial Intelligence Research
, 2006
Cited by 13 (2 self)
Efficient representations and solutions for large decision problems with continuous and discrete variables are among the most important challenges faced by the designers of automated decision support systems. In this paper, we describe a novel hybrid factored Markov decision process (MDP) model that allows for a compact representation of these problems, and a new hybrid approximate linear programming (HALP) framework that permits their efficient solution. The central idea of HALP is to approximate the optimal value function by a linear combination of basis functions and optimize its weights by linear programming.
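The central idea can be made concrete with the standard approximate linear program for a finite-state, finite-action discounted MDP, which HALP roughly generalizes to hybrid variables. The notation is assumed here: basis functions $f_1, \dots, f_K$, weights $w$, state-relevance weights $\psi(x) > 0$, reward $R(x, a)$, transition model $P(x' \mid x, a)$ and discount factor $\gamma$.

$$\min_{w}\ \sum_{x} \psi(x) \sum_{i=1}^{K} w_i f_i(x)$$

$$\text{s.t.}\quad \sum_{i=1}^{K} w_i f_i(x) \ \ge\ R(x, a) + \gamma \sum_{x'} P(x' \mid x, a) \sum_{i=1}^{K} w_i f_i(x') \qquad \forall x, a.$$

The constraints force $\hat{V} = \sum_i w_i f_i$ to dominate its own Bellman backup, so every feasible $\hat{V}$ upper-bounds the optimal value function, and the objective selects the tightest such bound in the $\psi$-weighted sense. With continuous state components, the sums over states become integrals and the constraint set becomes infinite.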
Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice
- Advances in Neural Information Processing Systems
, 2000
Cited by 11 (2 self)
Many approaches to reinforcement learning combine neural networks or other parametric function approximators with a form of temporal-difference learning to estimate the value function of a Markov Decision Process. A significant disadvantage of those procedures is that the resulting learning algorithms are frequently unstable.

