Results 1-10 of 13
Tree-based batch mode reinforcement learning
 Journal of Machine Learning Research
, 2005
Cited by 134 (28 self)
Reinforcement learning aims to determine an optimal control policy from interaction with a system or from observations gathered from a system. In batch mode, it can be achieved by approximating the so-called Q-function based on a set of four-tuples (x_t, u_t, r_t, x_{t+1}), where x_t denotes the system state at time t, u_t the control action taken, r_t the instantaneous reward obtained and x_{t+1} the successor state of the system, and by determining the control policy from this Q-function. The Q-function approximation may be obtained from the limit of a sequence of (batch mode) supervised learning problems. Within this framework we describe the use of several classical tree-based supervised learning methods (CART, Kd-tree, tree bagging) and two newly proposed ensemble algorithms, namely extremely and totally randomized trees. We study their performances on several examples and find that the ensemble methods based on regression trees perform well in extracting relevant information about the optimal control policy from sets of four-tuples. In particular, the totally randomized trees give good results while ensuring the convergence of the sequence, whereas by relaxing the convergence constraint even better accuracy results are provided by the extremely randomized trees.
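The batch-mode scheme in this abstract (a sequence of supervised learning problems built from four-tuples) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the regressor `fit_averager` is a dependency-free stand-in for a tree-based learner, and the toy system in `chain_four_tuples`, like every other name here, is hypothetical.

```python
from collections import defaultdict


def fit_averager(inputs, targets):
    """Stand-in regressor: predict the mean target of each distinct input.

    A tree-based learner would generalize across nearby inputs; this
    lookup table only keeps the sketch dependency-free.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for key, y in zip(inputs, targets):
        sums[key] += y
        counts[key] += 1
    table = {key: sums[key] / counts[key] for key in sums}
    return lambda key: table.get(key, 0.0)


def fitted_q_iteration(four_tuples, actions, gamma, n_iterations, fit=fit_averager):
    """Approximate Q by solving a sequence of batch regression problems."""
    inputs = [(x, u) for x, u, _, _ in four_tuples]
    q = lambda key: 0.0  # Q_0 is identically zero
    for _ in range(n_iterations):
        # Regression targets: r_t + gamma * max_u' Q_{k-1}(x_{t+1}, u')
        targets = [r + gamma * max(q((x_next, a)) for a in actions)
                   for _, _, r, x_next in four_tuples]
        q = fit(inputs, targets)
    return q


def greedy_policy(q, actions):
    """Control policy derived from the Q-function approximation."""
    return lambda x: max(actions, key=lambda a: q((x, a)))


def chain_four_tuples(n_states=5, actions=(-1, 1)):
    """Hypothetical toy system: a walk on 0..n_states-1 that pays
    reward 1 whenever the successor is the last state."""
    data = []
    for x in range(n_states):
        for u in actions:
            x_next = min(max(x + u, 0), n_states - 1)
            reward = 1.0 if x_next == n_states - 1 else 0.0
            data.append((x, u, reward, x_next))
    return data


if __name__ == "__main__":
    actions = (-1, 1)
    q = fitted_q_iteration(chain_four_tuples(), actions, gamma=0.5, n_iterations=60)
    policy = greedy_policy(q, actions)
    print([policy(x) for x in range(5)])
```

Any regressor with the same `fit(inputs, targets)` shape could be passed in place of `fit_averager`, which is where a tree ensemble would plug in.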
Kernel-Based Reinforcement Learning
 Machine Learning
, 1999
Cited by 102 (1 self)
We present a kernel-based approach to reinforcement learning that overcomes the stability problems of temporal-difference learning in continuous state-spaces. First, our algorithm converges to a unique solution of an approximate Bellman's equation regardless of its initialization values. Second, the method is consistent in the sense that the resulting policy converges asymptotically to the optimal policy. Parametric value function estimates such as neural networks do not possess this property. Our kernel-based approach also allows us to show that the limiting distribution of the value function estimate is a Gaussian process. This information is useful in studying the bias-variance tradeoff in reinforcement learning. We find that all reinforcement learning approaches to estimating the value function, parametric or nonparametric, are subject to a bias. This bias is typically larger in reinforcement learning than in a comparable regression problem.
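One way to picture the approach is value iteration in which each backup is a kernel-weighted average over sampled transitions. The sketch below is only an illustration under assumptions not taken from the abstract: one-dimensional states, a Gaussian-shaped kernel, and a hypothetical toy system (`line_transitions`). Because each backup is an average scaled by the discount factor, the iteration reaches the same fixed point from different initial values, which is the property the abstract highlights.

```python
import math


def kernel_weights(x, centers, bandwidth):
    """Normalized Gaussian-shaped kernel weights of x relative to sample states."""
    raw = [math.exp(-((x - c) / bandwidth) ** 2) for c in centers]
    total = sum(raw)  # assumed > 0, i.e. x is not too far from every center
    return [w / total for w in raw]


def kbrl(transitions, gamma, bandwidth, n_sweeps=200, v_init=0.0):
    """transitions maps each action to a list of samples (x, r, y)."""

    def q_value(x, a, v):
        samples = transitions[a]
        weights = kernel_weights(x, [s[0] for s in samples], bandwidth)
        # Kernel-weighted average of r + gamma * V(y) over the samples of action a
        return sum(w * (r + gamma * v[(a, i)])
                   for i, (w, (_, r, _)) in enumerate(zip(weights, samples)))

    # v stores the value estimate at every sampled successor state y
    v = {(a, i): v_init
         for a, samples in transitions.items() for i in range(len(samples))}
    for _ in range(n_sweeps):
        v = {(a, i): max(q_value(y, b, v) for b in transitions)
             for a, samples in transitions.items()
             for i, (_, _, y) in enumerate(samples)}
    return lambda x, a: q_value(x, a, v)


def line_transitions(n_points=11, step=0.1):
    """Hypothetical toy system on [0, 1]; the reward equals the successor position."""
    data = {"left": [], "right": []}
    for i in range(n_points):
        x = i / (n_points - 1)
        for action, delta in (("left", -step), ("right", step)):
            y = min(max(x + delta, 0.0), 1.0)
            data[action].append((x, y, y))
    return data


if __name__ == "__main__":
    q = kbrl(line_transitions(), gamma=0.9, bandwidth=0.1)
    print(q(0.5, "left"), q(0.5, "right"))
```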
Reinforcement Learning by Policy Search
, 2000
Cited by 27 (2 self)
One objective of artificial intelligence is to model the behavior of an intelligent agent interacting with its environment. The environment's transformations could be modeled as a Markov chain, whose state is partially observable to the agent and affected by its actions; such processes are known as partially observable Markov decision processes (POMDPs). While the environment's dynamics are assumed to obey certain rules, the agent does not know them and must learn. In this dissertation we focus on the agent's adaptation as captured by the reinforcement learning framework. Reinforcement learning means learning a policy (a mapping of observations into actions) based on feedback from the environment. The learning can be viewed as browsing a set of policies while evaluating them by trial through interaction with the environment. The set of policies being searched is constrained by the architecture of the agent's controller. POMDPs require a controller to have a memory. We investigate various architectures for controllers with memory, including controllers with external memory, finite state controllers and distributed controllers for multi-agent systems. For these various controllers we work out the details of the algorithms which learn by ascending the gradient of expected cumulative reinforcement. Building on statistical learning theory and experiment design theory, a policy evaluation algorithm is developed for the case of experience reuse. We address the question of sufficient experience for uniform convergence of policy evaluation and obtain sample complexity bounds for various estimators. Finally, we demonstrate the performance of the proposed algorithms on several domains, the most complex of which is simulated adaptive packet routing in a telecommunication network.
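The idea of ascending the gradient of expected reinforcement can be shown in its simplest form. The sketch below is far simpler than the controllers with memory studied in the dissertation: it assumes a single-step problem with known rewards and a memoryless softmax policy, so the gradient can be computed exactly rather than estimated from experience. All names and numbers are hypothetical.

```python
import math


def softmax(prefs):
    """Action probabilities of a softmax policy with preferences prefs."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def expected_reward(prefs, rewards):
    """J(theta) = sum_a pi(a) * r(a) for a single-step problem."""
    return sum(p * r for p, r in zip(softmax(prefs), rewards))


def gradient_ascent(rewards, step_size=0.5, n_steps=200):
    """Ascend the exact gradient of the expected reward."""
    prefs = [0.0] * len(rewards)
    for _ in range(n_steps):
        probs = softmax(prefs)
        j = sum(p * r for p, r in zip(probs, rewards))
        # For a softmax policy, dJ/dtheta_a = pi(a) * (r(a) - J)
        prefs = [th + step_size * p * (r - j)
                 for th, p, r in zip(prefs, probs, rewards)]
    return prefs


if __name__ == "__main__":
    rewards = [0.2, 1.0, 0.5]  # hypothetical per-action rewards
    prefs = gradient_ascent(rewards)
    print(softmax(prefs), expected_reward(prefs, rewards))
```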
Learning from scarce experience
 Proceedings of the Nineteenth International Conference on Machine Learning
, 2002
Cited by 23 (0 self)
Searching the space of policies directly for the optimal policy has been one popular method for solving partially observable reinforcement learning problems. Typically, with each change of the target policy, its value is estimated from the results of following that very policy. This requires a large number of interactions with the environment as different policies are considered. We present a family of algorithms based on likelihood ratio estimation that use data gathered when executing one policy (or collection of policies) to estimate the value of a different policy. The algorithms combine estimation and optimization stages. The former utilizes experience to build a nonparametric representation of an optimized function. The latter performs optimization on this estimate. We show positive empirical results and provide the sample complexity bound.
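Reusing data gathered under one policy to evaluate another can be illustrated with the ordinary importance-sampling estimator, which reweights each observed return by the likelihood ratio of the trajectory under the two policies. This is a generic sketch, not necessarily one of the estimators proposed in the paper; the policies and data below are hypothetical.

```python
def likelihood_ratio(trajectory, target, behavior):
    """Product over steps of target(a | o) / behavior(a | o)."""
    ratio = 1.0
    for obs, action, _ in trajectory:
        ratio *= target(obs, action) / behavior(obs, action)
    return ratio


def estimate_value(trajectories, target, behavior):
    """Importance-sampling estimate of the target policy's expected return.

    Each trajectory is a list of (observation, action, reward) steps that
    were generated while following the behavior policy.
    """
    total = 0.0
    for trajectory in trajectories:
        ret = sum(reward for _, _, reward in trajectory)
        total += likelihood_ratio(trajectory, target, behavior) * ret
    return total / len(trajectories)


if __name__ == "__main__":
    # Hypothetical one-step data: action 1 pays 1.0, action 0 pays 0.0,
    # collected under a uniform behavior policy.
    data = [[(0, 1, 1.0)], [(0, 1, 1.0)], [(0, 0, 0.0)], [(0, 0, 0.0)]]
    behavior = lambda obs, action: 0.5
    target = lambda obs, action: 0.9 if action == 1 else 0.1
    print(estimate_value(data, target, behavior))
```

The estimator is unbiased only when the behavior policy gives nonzero probability to every action the target policy can take.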
Hoeffding's Inequality for Uniformly Ergodic Markov Chains
, 2002
Cited by 19 (1 self)
We provide a generalization of Hoeffding's inequality to partial sums that are derived from a uniformly ergodic Markov chain. Our exponential inequality on the deviation of these sums from their expectation is particularly useful in situations where we require uniform control on the constants appearing in the bound.
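For reference, the classical inequality being generalized can be stated as follows. This is the standard form for independent bounded random variables, given here only as background; the paper's bound for uniformly ergodic Markov chains is not reproduced.

```latex
\[
  \Pr\bigl(S_n - \mathbb{E}[S_n] \ge t\bigr)
  \le \exp\!\left(-\frac{2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right),
  \qquad S_n = \sum_{i=1}^{n} X_i,\quad a_i \le X_i \le b_i,\quad t > 0 .
\]
```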
Reinforcement Learning using Kernel-Based Stochastic Factorization
Cited by 6 (2 self)
Kernel-based reinforcement-learning (KBRL) is a method for learning a decision policy from a set of sample transitions which stands out for its strong theoretical guarantees. However, the size of the approximator grows with the number of transitions, which makes the approach impractical for large problems. In this paper we introduce a novel algorithm to improve the scalability of KBRL. We resort to a special decomposition of a transition matrix, called stochastic factorization, to fix the size of the approximator while at the same time incorporating all the information contained in the data. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster but still converges to a unique solution. We derive a theoretical upper bound for the distance between the value functions computed by KBRL and KBSF. The effectiveness of our method is illustrated with computational experiments on four reinforcement-learning problems, including a difficult task in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures in epileptic rat brains. We empirically demonstrate that the proposed approach is able to compress the information contained in KBRL’s model. Also, on the tasks studied, KBSF outperforms two of the most prominent reinforcement-learning algorithms, namely least-squares policy iteration and fitted Q-iteration.
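The decomposition can be illustrated on a small example. The sketch assumes the usual reading of a stochastic factorization: an n x n transition matrix P is written as P = D K with D (n x m) and K (m x n) both stochastic, and swapping the factors gives a smaller m x m stochastic matrix K D. The matrices below are hypothetical, and how KBSF builds its factors from kernels is not shown.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def is_stochastic(m, tol=1e-12):
    """True if all entries are nonnegative and every row sums to one."""
    return all(min(row) >= 0.0 and abs(sum(row) - 1.0) <= tol for row in m)


# Hypothetical factors: D is 3 x 2 and K is 2 x 3, both stochastic.
D = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
K = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]

P = matmul(D, K)        # 3 x 3 transition matrix, P = D K
P_small = matmul(K, D)  # 2 x 2 matrix obtained by swapping the factors

if __name__ == "__main__":
    print(is_stochastic(P), is_stochastic(P_small))
    print(P_small)
```

The size of the swapped matrix depends only on the inner dimension m, not on the number of rows of P, which is the sense in which the size of the model can be fixed in advance.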
Discretized Approximations for POMDP with Average Cost
Cited by 4 (2 self)
In this paper, we propose a new lower approximation scheme for POMDP with discounted and average cost criterion. The approximating functions are determined by their values at a finite number of belief points, and can be computed efficiently using value iteration algorithms for finite-state MDP. While for discounted problems several lower approximation schemes have been proposed earlier, ours seems the first of its kind for average cost problems. We focus primarily on the average cost case, and we show that the corresponding approximation can be computed efficiently using multichain algorithms for finite-state MDP. We give a preliminary analysis showing that regardless of the existence of the optimal average cost J* in the POMDP, the approximation obtained is a lower bound of the liminf optimal average cost function, and can also be used to calculate an upper bound on the limsup optimal average cost function, as well as bounds on the cost of executing the stationary policy associated with the approximation. We show the convergence of the cost approximation, when the optimal average cost is constant and the optimal differential cost is continuous.
A Study on Architecture, Algorithms, and Applications of Approximate Dynamic Programming Based Approach to Optimal Control
, 2004
to Electrical Power System Control. ∗
Copyright ©2005 by the authors. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, bepress, which has been given certain exclusive rights by the
Creating Algorithmic Traders with Hierarchical Reinforcement Learning
, 2008
There has recently been a considerable amount of research into algorithmic traders that learn [7, 27, 21, 19]. A variety of machine learning techniques have been used, including reinforcement learning [20, 11, 19, 5, 21]. We propose a reinforcement learning agent that can adapt to underlying market regimes by observing the market through signals generated at short and long timescales, and by using the CHQ algorithm [23], a hierarchical method which allows the agent to change its strategies after observing certain signals. We hypothesise that reinforcement learning agents using hierarchical reinforcement learning are superior to standard reinforcement learning agents in markets with regime change. This was tested through a market simulation based on data from the Russell 2000 index [4]. A significant difference was only found in the trivial case, and we concluded that a difference does not exist for our agent design. It was also observed and empirically verified that our standard agent learns different strategies depending on how much information it is given and whether it is charged a commission cost for trading. We therefore provide a novel example of an adaptive algorithmic trader.
Acknowledgements
First and foremost, I must thank Dr. Subramanian Ramamoorthy for supervising and motivating this research, and for providing sensible suggestions when I found myself low on ideas. I would also like to thank my family and friends for cheering me up and giving me moral support.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.