Results 1 – 5 of 5
Online Markov decision processes under bandit feedback.
In Advances in Neural Information Processing Systems 23, 2010.
Abstract

Cited by 18 (6 self)
Abstract We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition; however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state of the art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of the new algorithm is O(T^{2/3} (ln T)^{1/3}), giving the first rigorously proved regret bound for the problem.
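The regret notion used in this abstract is the gap between the learner's total reward and that of the best single stationary policy chosen in hindsight. A minimal sketch of that comparison, with illustrative names and toy numbers not taken from the paper:

```python
def regret(reward_per_step, rewards_of_stationary_policies):
    """Total reward of the learner vs. the best single stationary policy in hindsight.

    reward_per_step: rewards the learner actually collected, one per time step.
    rewards_of_stationary_policies: per-step rewards each fixed policy would have earned.
    """
    learner_total = sum(reward_per_step)
    best_stationary = max(sum(r) for r in rewards_of_stationary_policies)
    return best_stationary - learner_total

# Toy example: 3 time steps, two candidate stationary policies.
learner = [0.5, 0.7, 0.2]
policies = [[0.6, 0.6, 0.6], [0.2, 0.9, 0.1]]
r = regret(learner, policies)  # best fixed policy earns 1.8, learner earns 1.4
```

The "no-regret" guarantee means this quantity grows sublinearly in T, so the per-step gap vanishes.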
An Online Policy Gradient Algorithm for Markov Decision Processes with Continuous States and Actions
Neural Computation, to appear.
Abstract
We consider the learning problem under an online Markov decision process (MDP), which is aimed at learning the time-dependent decision-making policy of an agent that minimizes the regret, the difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this paper, we show that a simple online policy gradient algorithm achieves regret O(√T) for T steps under a certain concavity assumption and O(log T) under a strong concavity assumption. To the best of our knowledge, this is the first work to give an online MDP algorithm that can handle continuous state, action, and parameter spaces with guarantee. We also illustrate the behavior of the proposed online policy gradient method through experiments.
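The core loop of an online policy gradient method is one gradient step per round on that round's loss, typically with a projection to keep the parameter in a bounded set. A hedged sketch under those assumptions; the toy quadratic losses and all names are illustrative, not the paper's algorithm:

```python
import numpy as np

def online_policy_gradient(grad_fns, theta0, eta=0.1, radius=1.0):
    """One gradient step per round on that round's loss, with projection
    onto an L2 ball (boundedness of the parameter space is assumed)."""
    theta = np.asarray(theta0, dtype=float)
    for grad in grad_fns:
        theta = theta - eta * grad(theta)       # descend on this round's loss
        norm = np.linalg.norm(theta)
        if norm > radius:
            theta = theta * (radius / norm)     # project back onto the feasible set
    return theta

# Toy run: time-varying quadratic losses (theta - c_t)^2 with drifting targets c_t,
# standing in for the adversarially changing reward function.
targets = [0.2, 0.4, 0.6]
grads = [lambda th, c=c: 2.0 * (th - c) for c in targets]
theta_final = online_policy_gradient(grads, [0.0])
```

The concavity (resp. strong concavity) assumptions in the abstract are what let such a per-round update achieve O(√T) (resp. O(log T)) regret.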
Reinforcement Learning in Robust Markov Decision Processes
Abstract
An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case.
Online learning in MDPs with side information
, 2014
Abstract
We study online learning of finite Markov decision process (MDP) problems when a side information vector is available. The problem is motivated by applications such as clinical trials, recommendation systems, etc. Such applications have an episodic structure, where each episode corresponds to a patient/customer. Our objective is to compete with the optimal dynamic policy that can take side information into account. We propose a computationally efficient algorithm and show that its regret is at most O(√T), where T is the number of rounds. To the best of our knowledge, this is the first regret bound for this setting.
Tracking Adversarial Targets
Abstract
We study linear control problems with quadratic losses and adversarially chosen tracking targets. We present an efficient algorithm for this problem and show that, under standard conditions on the linear system, its regret with respect to an optimal linear policy grows as O(log² T), where T is the number of rounds of the game. We also study a problem with adversarially chosen transition dynamics; we present an exponentially-weighted average algorithm for this problem, and we give regret bounds that grow as O(√T).
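The exponentially-weighted average scheme mentioned here is the classical experts construction: each candidate (here, a transition model or policy) gets weight proportional to exp(-η × cumulative loss). A minimal sketch; the step size and toy losses are illustrative assumptions, not the paper's instantiation:

```python
import math

def exp_weights(losses_per_round, n_experts, eta=0.5):
    """Return the final distribution over experts, with weight of expert i
    proportional to exp(-eta * cumulative loss of expert i)."""
    cum = [0.0] * n_experts
    for losses in losses_per_round:
        for i in range(n_experts):
            cum[i] += losses[i]
    weights = [math.exp(-eta * c) for c in cum]
    z = sum(weights)
    return [w / z for w in weights]

# Toy run: expert 0 consistently suffers lower loss, so it receives most of the mass.
dist = exp_weights([[0.1, 0.9], [0.2, 0.8]], n_experts=2)
```

The regret of this scheme against the best single expert grows only logarithmically in the number of experts, which is what makes it usable over large policy classes.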