Results 1 - 6 of 6
Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
Abstract

Cited by 2 (0 self)
We study the problem of online learning Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves O(√(T log |Π|) + log |Π|) regret with respect to a comparison set of policies Π. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set Π has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem. Here, in each episode an adversary may choose a weighted directed acyclic graph with an identified start and finish node. The goal of the learning algorithm is to choose a path that minimizes the loss while traversing from the start to the finish node. At the end of each episode the loss function (given by weights on the edges) is revealed to the learning algorithm. The goal is to minimize regret with respect to a fixed policy for selecting paths. This problem is a special case of the online MDP problem. For randomly chosen graphs and adversarial losses, this problem can be efficiently solved. We show that it can also be efficiently solved for adversarial graphs and randomly chosen losses. When both graphs and losses are adversarially chosen, we present an efficient algorithm whose regret scales linearly with the number of distinct graphs. Finally, we show that designing efficient algorithms for the adversarial online shortest path problem (and hence for the adversarial MDP problem) is as hard as learning parity with noise, a notoriously difficult problem that has been used to design efficient cryptographic schemes.
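A regret bound of the form O(√(T log |Π|)) over a finite comparison set of policies is the flavor achieved by exponential-weights (Hedge) updates. The sketch below is not the paper's algorithm (which must additionally handle adversarially chosen transitions under a mixing assumption); it only illustrates the underlying expert-style update over Π, with all names chosen here for illustration:

```python
import math

def hedge(num_policies, loss_rounds, eta):
    """Exponential-weights (Hedge) over a finite policy set.

    loss_rounds: iterable of per-round loss vectors with entries in [0, 1],
                 one loss per policy, revealed after each round.
    eta:         learning rate, typically ~ sqrt(log(num_policies) / T).
    Returns the algorithm's expected cumulative loss.
    """
    weights = [1.0] * num_policies
    expected_loss = 0.0
    for losses in loss_rounds:
        z = sum(weights)
        probs = [w / z for w in weights]
        # Expected loss of a policy drawn from the current distribution.
        expected_loss += sum(p * l for p, l in zip(probs, losses))
        # Multiplicative update against the revealed loss vector.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    return expected_loss
```

With two policies where the first always incurs loss 0 and the second loss 1, the distribution quickly concentrates on the first, so cumulative expected loss stays bounded while the best fixed policy incurs zero loss.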
Linear programming for large-scale Markov decision problems
In Proceedings of the International Conference on Machine Learning, 2014
Abstract

Cited by 1 (1 self)
We consider the problem of controlling a Markov decision process (MDP) with a large state space, so as to minimize average cost. Since it is intractable to compete with the optimal policy for large-scale problems, we pursue the more modest goal of competing with a low-dimensional family of policies. We use the dual linear programming formulation of the MDP average cost problem, in which the variable is a stationary distribution over state-action pairs, and we consider a neighborhood of a low-dimensional subset of the set of stationary distributions (defined in terms of state-action features) as the comparison class. We propose a technique based on stochastic convex optimization and give bounds that show that the performance of our algorithm approaches the best achievable by any policy in the comparison class. Most importantly, this result depends on the size of the comparison class, but not on the size of the state space. Preliminary experiments show the effectiveness of the proposed algorithm in a queuing application.
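For reference, the dual LP the abstract refers to can be written with the stationary state-action distribution μ as the variable. This is a standard textbook form under the usual average-cost assumptions, with notation (ℓ for the cost, P for the transition kernel) assumed here rather than taken from the paper:

```latex
\begin{aligned}
\min_{\mu \ge 0} \quad & \sum_{x,a} \ell(x,a)\,\mu(x,a) \\
\text{s.t.} \quad & \sum_{a'} \mu(x',a') = \sum_{x,a} P(x' \mid x,a)\,\mu(x,a) \quad \forall x', \\
& \sum_{x,a} \mu(x,a) = 1 .
\end{aligned}
```

The paper's comparison class then restricts μ to a neighborhood of a low-dimensional set parameterized by state-action features, which is what keeps the bounds independent of the size of the state space.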
Bayesian Optimal Control of Smoothly Parameterized Systems
Abstract
We study Bayesian optimal control of a general class of smoothly parameterized Markov decision problems (MDPs). We propose a lazy version of the so-called posterior sampling method, a method that goes back to Thompson and Strens, more recently studied by Osband, Russo and van Roy. While Osband et al. derived a bound on the (Bayesian) regret of this method for undiscounted total-cost episodic, finite state and action problems, we consider the continuing, average cost setting with no cardinality restrictions on the state or action spaces. While in the episodic setting it is natural to switch to a new policy at the episode ends, in the continuing average cost framework we must introduce switching points explicitly and in a principled fashion, or the regret could grow linearly. Our lazy method introduces these switching points based on monitoring the uncertainty left about the unknown parameter. To develop a suitable and easy-to-compute uncertainty measure, we introduce a new “average local smoothness” condition, which is shown to be satisfied in common examples. Under this, and some additional mild conditions, we derive rate-optimal bounds on the regret of our algorithm. Our general approach allows us to use a single algorithm and a single analysis for a wide range of problems, such as finite MDPs or linear quadratic regulation, both being instances of smoothly parameterized MDPs. The effectiveness of our method is illustrated by means of a simulated example.
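The idea of uncertainty-triggered switching can be illustrated on a toy Bernoulli-parameterized system: keep a posterior over the unknown parameter, act under the last sampled parameter, and resample (i.e., switch policies) only when posterior uncertainty has shrunk by a fixed factor. This is a minimal sketch of the lazy-switching principle, not the paper's algorithm or its uncertainty measure; all names and the Beta/Bernoulli model are assumptions made here for illustration:

```python
import random

def lazy_posterior_sampling(env_step, horizon, shrink=0.5):
    """Lazy posterior sampling for a toy Bernoulli-parameterized system.

    env_step(theta): returns a 0/1 observation when acting under
                     the currently sampled parameter theta.
    A new parameter is sampled (a "policy switch") only when the
    Beta-posterior variance drops below `shrink` times its value at
    the previous switch, so switches become increasingly rare.
    """
    a, b = 1.0, 1.0                      # uniform Beta(1, 1) prior

    def variance(a, b):
        s = a + b
        return a * b / (s * s * (s + 1.0))

    theta = random.betavariate(a, b)     # initial sampled parameter
    var_at_switch = variance(a, b)
    switches = 1
    for _ in range(horizon):
        reward = env_step(theta)         # act under the sampled parameter
        a += reward                      # conjugate Beta posterior update
        b += 1.0 - reward
        if variance(a, b) < shrink * var_at_switch:
            theta = random.betavariate(a, b)   # switch to a fresh sample
            var_at_switch = variance(a, b)
            switches += 1
    return theta, switches
```

Because the posterior variance shrinks roughly like 1/t, the number of switches grows only logarithmically in the horizon, which is the kind of behavior needed to keep regret from growing linearly in the continuing setting.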
Online learning in MDPs with side information
, 2014
Abstract
We study online learning of finite Markov decision process (MDP) problems when a side information vector is available. The problem is motivated by applications such as clinical trials, recommendation systems, etc. Such applications have an episodic structure, where each episode corresponds to a patient/customer. Our objective is to compete with the optimal dynamic policy that can take side information into account. We propose a computationally efficient algorithm and show that its regret is at most O(√T), where T is the number of rounds. To the best of our knowledge, this is the first regret bound for this setting.