Results 1–10 of 21
Online Markov decision processes under bandit feedback.
 In Advances in Neural Information Processing Systems 23
, 2010
Abstract

Cited by 18 (6 self)
Abstract We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition, however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state of the art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of the new algorithm is O(T^{2/3} (ln T)^{1/3}), giving the first rigorously proved regret bound for the problem.
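The comparator in this regret definition, the best stationary policy in hindsight, can be made concrete with a small sketch (the toy two-state MDP and all parameters below are illustrative, not from the paper): with known transitions, each deterministic policy induces a Markov chain whose stationary distribution determines its long-run expected reward, and the comparator maximizes the total expected reward over all such policies.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, T = 2, 2, 50

# Known transition kernel: P[s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
# Oblivious adversary fixes the whole reward sequence r_t(s, a) up front.
rewards = rng.random((T, n_states, n_actions))

def stationary_dist(policy):
    """Stationary distribution of the chain induced by a deterministic policy."""
    M = np.array([P[s, policy[s]] for s in range(n_states)])  # row-stochastic
    evals, evecs = np.linalg.eig(M.T)
    v = np.real(evecs[:, np.argmax(np.real(evals))])          # eigenvalue-1 vector
    return v / v.sum()

# Best stationary policy in hindsight: total expected reward under each
# policy's stationary distribution, maximized over deterministic policies.
states = list(range(n_states))
def score(pi):
    return sum(stationary_dist(pi) @ rewards[t, states, list(pi)]
               for t in range(T))

best_pi = max(itertools.product(range(n_actions), repeat=n_states), key=score)
print("best stationary policy in hindsight:", best_pi)
```

The learner's expected regret is its total reward minus `score(best_pi)`; the paper's algorithm drives this gap to O(T^{2/3} (ln T)^{1/3}).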
Online bandit learning against an adaptive adversary: from regret to policy regret
 IN PROCEEDINGS OF THE 29TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING
, 2012
Abstract

Cited by 16 (5 self)
Online learning algorithms are designed to learn even when their input is generated by an adversary. The widely accepted formal definition of an online algorithm’s ability to learn is the game-theoretic notion of regret. We argue that the standard definition of regret becomes inadequate if the adversary is allowed to adapt to the online algorithm’s actions. We define the alternative notion of policy regret, which attempts to provide a more meaningful way to measure an online algorithm’s performance against adaptive adversaries. Focusing on the online bandit setting, we show that no bandit algorithm can guarantee a sublinear policy regret against an adaptive adversary with unbounded memory. On the other hand, if the adversary’s memory is bounded, we present a general technique that converts any bandit algorithm with a sublinear regret bound into an algorithm with a sublinear policy regret bound. We extend this result to other variants of regret, such as switching regret, internal regret, and swap regret.
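The distinction between the two regret notions can be seen in a toy illustration (entirely hypothetical, not a construction from the paper): a memory-one adversary that charges loss 1 for repeating the previous action. Standard regret scores fixed actions against the realized loss functions; policy regret replays the adversary against the counterfactual fixed-action sequence, and the two quantities diverge.

```python
def adversary_loss(prev_action, action):
    """Memory-one adaptive adversary: loss 1 for repeating the last action."""
    return 1.0 if action == prev_action else 0.0

T = 100
# A learner that alternates actions never repeats, so its realized loss is 0.
actions = [t % 2 for t in range(T)]
realized = sum(adversary_loss(actions[t - 1] if t else None, actions[t])
               for t in range(T))

# Standard regret: fixed action a is scored on the *realized* loss functions.
standard_comp = min(sum(adversary_loss(actions[t - 1] if t else None, a)
                        for t in range(T)) for a in (0, 1))

# Policy regret: replay the adversary against the constant-action sequence.
policy_comp = min(sum(adversary_loss(a if t else None, a)
                      for t in range(T)) for a in (0, 1))

print(realized - standard_comp, realized - policy_comp)  # the two notions differ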
Online Learning for Global Cost Functions
, 2009
Abstract

Cited by 10 (3 self)
We consider an online learning setting where at each time step the decision maker has to choose how to distribute the future loss between k alternatives, and then observes the loss of each alternative. Motivated by load balancing and job scheduling, we consider a global cost function (over the losses incurred by each alternative), rather than a summation of the instantaneous losses as done traditionally in online learning. Such global cost functions include the makespan (the maximum over the alternatives) and the L_d norm (over the alternatives). Based on approachability theory, we design an algorithm that guarantees vanishing regret for this setting, where the regret is measured with respect to the best static decision that selects the same distribution over alternatives at every time step. For the special case of makespan cost we devise a simple and efficient algorithm. In contrast, we show that for concave global cost functions, such as L_d norms for d < 1, the worst-case average regret does not vanish.
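For the makespan cost, the static comparator even has a closed form, which the following sketch uses (illustrative numbers only; this is not the paper's approachability-based online algorithm): weights inversely proportional to each alternative's cumulative loss equalize the weighted loads.

```python
import numpy as np

rng = np.random.default_rng(1)
k, T = 3, 200
losses = rng.random((T, k))          # loss of each alternative at each step

# Online baseline: distribute uniformly at every step.
uniform_makespan = (losses / k).sum(axis=0).max()

# Best static distribution in hindsight for the makespan cost:
# alpha_i proportional to 1 / L_i equalizes alpha_i * L_i across
# alternatives, giving makespan 1 / sum_i (1 / L_i).
L = losses.sum(axis=0)               # cumulative loss of each alternative
alpha = (1.0 / L) / (1.0 / L).sum()
best_static_makespan = (alpha * L).max()

print(f"uniform: {uniform_makespan:.3f}, best static: {best_static_makespan:.3f}")
```

The regret in the paper is the gap between an online allocation's makespan and `best_static_makespan`.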
Prediction by RandomWalk Perturbation
Abstract

Cited by 9 (3 self)
We propose a version of the follow-the-perturbed-leader online prediction algorithm in which the cumulative losses are perturbed by independent symmetric random walks. The forecaster is shown to achieve an expected regret of the optimal order O(√(n log N)) where n is the time horizon and N is the number of experts. More importantly, it is shown that the forecaster changes its prediction at most O(√(n log N)) times, in expectation. We also extend the analysis to online combinatorial optimization and show that even in this more general setting, the forecaster rarely switches between experts while having a regret of near-optimal order.
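A minimal sketch of the idea (the ±1 increments and all parameters are illustrative choices, not the paper's exact construction): perturb the cumulative losses with one symmetric random walk per expert, follow the perturbed leader, and count how rarely the leader changes.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 5, 1000                       # experts, time horizon
losses = rng.random((n, N))

cum = np.zeros(N)                    # cumulative losses so far
walk = np.zeros(N)                   # one symmetric random walk per expert
total, switches, prev = 0.0, 0, None
for t in range(n):
    walk += rng.choice([-1.0, 1.0], size=N)  # symmetric +/-1 increment
    leader = int(np.argmin(cum + walk))      # follow the perturbed leader
    if prev is not None and leader != prev:
        switches += 1
    prev = leader
    total += losses[t, leader]
    cum += losses[t]

regret = total - cum.min()           # vs. the best expert in hindsight
print(f"regret={regret:.1f}, switches={switches}")
```

Because the perturbation is a random walk rather than fresh noise each round, consecutive perturbed leaders are correlated, which is what keeps the switch count small.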
The adversarial stochastic shortest path problem with unknown transition probabilities
, 2012
Online Markov decision processes with KullbackLeibler control cost
 Proceedings of the American Control Conference
, 2012
Abstract

Cited by 2 (1 self)
Abstract—This paper considers an online (real-time) control problem that involves an agent performing a discrete-time random walk over a finite state space. The agent’s action at each time step is to specify the probability distribution for the next state given the current state. Following the setup of Todorov, the state-action cost at each time step is a sum of a state cost and a control cost given by the Kullback-Leibler (KL) divergence between the agent’s next-state distribution and that determined by some fixed passive dynamics. The online aspect of the problem is due to the fact that the state cost functions are generated by a dynamic environment, and the agent learns the current state cost only after selecting an action. An explicit construction of a computationally efficient strategy with small regret (i.e., expected difference between its actual total cost and the smallest cost attainable using non-causal knowledge of the state costs) under mild regularity conditions is presented, along with a demonstration of the performance of the proposed strategy on a simulated target tracking problem. A number of new results on Markov decision processes with KL control cost are also obtained.
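The one-step version of this KL control cost has a closed-form minimizer, which a short sketch (toy numbers, illustrative only) makes concrete: minimizing E_u[c] + KL(u ‖ p) over next-state distributions u tilts the passive dynamics p by exp(-c) and renormalizes, with optimal value -log Σ p exp(-c).

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q); assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

passive = np.array([0.5, 0.3, 0.2])     # fixed passive next-state dynamics p
state_cost = np.array([1.0, 0.2, 0.5])  # current state cost c, revealed online

# Optimal next-state distribution: tilt the passive dynamics by exp(-c).
u = passive * np.exp(-state_cost)
u /= u.sum()

total = state_cost @ u + kl(u, passive)  # state cost + KL control cost
print("chosen distribution:", u, "cost:", total)
```

The resulting cost equals `-np.log((passive * np.exp(-state_cost)).sum())`, and by Jensen's inequality it is never worse than passively following `passive`; this log-sum-exp structure is what makes the KL-cost MDPs linearly solvable in Todorov's framework.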
Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
Abstract

Cited by 2 (0 self)
We study the problem of online learning Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves O(√(T log |Π|) + log |Π|) regret with respect to a comparison set of policies Π. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set Π has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem. Here, in each episode an adversary may choose a weighted directed acyclic graph with an identified start and finish node. The goal of the learning algorithm is to choose a path that minimizes the loss while traversing from the start to finish node. At the end of each episode the loss function (given by weights on the edges) is revealed to the learning algorithm. The goal is to minimize regret with respect to a fixed policy for selecting paths. This problem is a special case of the online MDP problem. For randomly chosen graphs and adversarial losses, this problem can be efficiently solved. We show that it also can be efficiently solved for adversarial graphs and randomly chosen losses. When both graphs and losses are adversarially chosen, we present an efficient algorithm whose regret scales linearly with the number of distinct graphs. Finally, we show that designing efficient algorithms for the adversarial online shortest path problem (and hence for the adversarial MDP problem) is as hard as learning parity with noise, a notoriously difficult problem that has been used to design efficient cryptographic schemes.
An Online Policy Gradient Algorithm for Markov Decision Processes with Continuous States and Actions
Abstract

Cited by 1 (1 self)
Abstract. We consider the learning problem under an online Markov decision process (MDP), which is aimed at learning the time-dependent decision-making policy of an agent that minimizes the regret, i.e., the difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this paper, we show that a simple online policy gradient algorithm achieves regret O(√T) for T steps under a certain concavity assumption and O(log T) under a strong concavity assumption. To the best of our knowledge, this is the first work to give an online MDP algorithm that can handle continuous state, action, and parameter spaces with guarantees. We also illustrate the behavior of the online policy gradient method through experiments.
Better Rates for Any Adversarial Deterministic MDP
 In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence,
, 2012
Abstract

Cited by 1 (1 self)
Abstract We consider regret minimization in adversarial deterministic Markov Decision Processes (ADMDPs) with bandit feedback. We devise a new algorithm that pushes the state of the art forward in two ways: First, it attains a regret of O(T^{2/3}) with respect to the best fixed policy in hindsight, whereas the previous best regret bound was O(T^{3/4}). Second, the algorithm and its analysis are compatible with any feasible ADMDP graph topology, while all previous approaches required additional restrictions on the graph topology.
Online Learning in Markov Decision Processes with Changing Cost Sequences
Abstract

Cited by 1 (0 self)
In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences under full and bandit information. We propose to view this problem as an instance of online linear optimization. We propose two methods for this problem: MD^2 (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks. We provide a rigorous complexity analysis of these techniques, while providing near-optimal regret bounds (in particular, we take into account the computational costs of performing approximate projections in MD^2). In the case of full-information feedback, our results complement existing ones. In the case of bandit-information feedback we consider the online stochastic shortest path problem, a special case of the above MDP problems, and manage to improve the existing results by removing the previous restrictive assumption that the state-visitation probabilities are uniformly bounded away from zero under all policies.
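The online-linear-optimization view can be sketched in the simplest possible setting (illustrative parameters; in the paper the decision set is an occupancy-measure polytope and projections are only approximate, whereas on the probability simplex the entropic projection below is exact):

```python
import numpy as np

rng = np.random.default_rng(3)
d, T, eta = 4, 500, 0.1
costs = rng.random((T, d))           # changing cost vectors, revealed online

x = np.full(d, 1.0 / d)              # start at the simplex center
total = 0.0
for t in range(T):
    total += x @ costs[t]            # pay the linear cost, then observe it
    x *= np.exp(-eta * costs[t])     # entropic mirror descent step
    x /= x.sum()                     # exact projection back to the simplex

best_fixed = costs.sum(axis=0).min() # best fixed vertex in hindsight
print(f"regret = {total - best_fixed:.1f} over T = {T} rounds")
```

On the simplex this update is the classical exponential weights algorithm; the paper's MD^2 replaces the simplex with the set of occupancy measures of an MDP, where the Bregman projection can only be computed approximately.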