### Random-Walk Perturbations for Online Combinatorial Optimization

2013

"... Abstract-We study online combinatorial optimization problems where a learner is interested in minimizing its cumulative regret in the presence of switching costs. To solve such problems, we propose a version of the follow-the-perturbedleader algorithm in which the cumulative losses are perturbed by ..."

Abstract
- Add to MetaCart

(Show Context)
We study online combinatorial optimization problems where a learner is interested in minimizing its cumulative regret in the presence of switching costs. To solve such problems, we propose a version of the follow-the-perturbed-leader algorithm in which the cumulative losses are perturbed by independent symmetric random walks. In the general setting, our forecaster is shown to enjoy near-optimal guarantees on both quantities of interest, making it the best known efficient algorithm for the studied problem. In the special case of prediction with expert advice, we show that the forecaster achieves an expected regret of the optimal order O(√(n log N)), where n is the time horizon and N is the number of experts, while guaranteeing that the predictions are switched at most O(√(n log N)) times in expectation.

Index Terms: online learning, online combinatorial optimization, Follow the Perturbed Leader, random walk.

I. PRELIMINARIES

In this paper we study the problem of online prediction with expert advice. The protocol is the following.

Parameters: set of actions S ⊆ R^d, number of rounds n. The environment chooses the loss vectors ℓ_t ∈ [0, 1]^d for all t = 1, ..., n. For all t = 1, 2, ..., n, repeat:
1) The forecaster chooses a probability distribution p_t over S.
2) The forecaster draws an action V_t randomly according to p_t.
3) The environment reveals ℓ_t.
4) The forecaster suffers loss V_t^T ℓ_t.

The usual goal for the standard prediction problem is to devise an algorithm such that the cumulative loss L_n = Σ_{t=1}^n V_t^T ℓ_t is small with high probability (where probability is with respect to the forecaster's randomization). Since we do not make any assumption on how the environment generates the losses ℓ_t, we cannot hope to minimize the above loss. Instead, a meaningful goal is to minimize the performance gap between our algorithm and the strategy that selects the best action in hindsight. This performance gap is called the regret and is defined formally as R_n = L_n - L*_n, where L*_n = min_{v ∈ S} v^T Σ_{t=1}^n ℓ_t.

To simplify the presentation, we restrict our attention to the case of online combinatorial optimization, in which S ⊂ {0, 1}^d, that is, each action is represented as a binary vector. This special case arguably contains the most important applications, such as the online shortest path problem. In this example, a fixed directed acyclic graph with d edges is given, with two distinguished vertices u and w. The forecaster, at every time instant t, chooses a directed path from u to w. Such a path is represented by its binary incidence vector v ∈ {0, 1}^d. The components of the loss vector ℓ_t ∈ [0, 1]^d represent losses assigned to the d edges, and v^T ℓ_t is the total loss assigned to the path v. Another (non-essential) simplifying assumption is that every action v ∈ S has the same number of 1's: ‖v‖_1 = m for all v ∈ S. The value of m plays an important role in the bounds presented in the paper. A fundamental special case of the framework above is prediction with expert advice. In this setting we have m = 1 and d = N, and the learner has access to the unit vectors S = {e_1, ..., e_N} as the decision set. Minimizing the regret in this setting is a well-studied problem (see the book of Cesa-Bianchi and Lugosi).
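The expert-advice special case (m = 1) of this idea can be sketched in a few lines: each expert's cumulative loss is perturbed by its own independent symmetric ±1 random walk, and the perturbed leader is followed. This is a minimal illustrative sketch, not the authors' exact forecaster; the function and parameter names are ours.

```python
import random

def fpl_random_walk(losses, n_experts, seed=0):
    """Follow the Perturbed Leader where cumulative losses are perturbed
    by independent symmetric random walks (one walk per expert).
    `losses` is a sequence of loss vectors, one per round."""
    rng = random.Random(seed)
    cum_loss = [0.0] * n_experts
    walk = [0.0] * n_experts          # one symmetric random walk per expert
    picks, switches, prev = [], 0, None
    for loss_vec in losses:
        # advance each random walk by an independent +/-1 step
        for i in range(n_experts):
            walk[i] += rng.choice((-1.0, 1.0))
        # follow the leader on perturbed cumulative losses
        leader = min(range(n_experts), key=lambda i: cum_loss[i] + walk[i])
        picks.append(leader)
        if prev is not None and leader != prev:
            switches += 1
        prev = leader
        for i in range(n_experts):
            cum_loss[i] += loss_vec[i]
    return picks, switches
```

Because the perturbations form random walks rather than being freshly drawn each round, consecutive perturbed leaders tend to coincide, which is what keeps the expected number of switches small.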

### Heteroscedastic Sequences: Beyond Gaussianity

Technion, Haifa, Israel

"... Abstract We address the problem of sequential prediction in the heteroscedastic setting, when both the signal and its variance are assumed to depend on explanatory variables. By applying regret minimization techniques, we devise an efficient online learning algorithm for the problem, without assumi ..."

Abstract
- Add to MetaCart

(Show Context)
We address the problem of sequential prediction in the heteroscedastic setting, where both the signal and its variance are assumed to depend on explanatory variables. By applying regret minimization techniques, we devise an efficient online learning algorithm for the problem, without assuming that the error terms comply with a specific distribution. We show that our algorithm can be adjusted to provide confidence bounds for its predictions, and provide an application to ARCH models. The theoretical results are corroborated by an empirical study.
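As one concrete instantiation of this setting, an online-gradient learner can track both a linear mean model and a linear log-variance model under the per-round negative Gaussian log-likelihood. This is a generic sketch under assumed parameterizations, not the paper's algorithm; the names, step size, and clamping are illustrative.

```python
import math

def heteroscedastic_ogd(data, dim, eta=0.05):
    """Online gradient descent that learns to predict both a signal and
    its variance from explanatory variables. The per-round loss is the
    negative Gaussian log-likelihood (up to a constant)."""
    w_mu = [0.0] * dim   # weights for the mean prediction
    w_s = [0.0] * dim    # weights for the log-variance prediction
    total_loss = 0.0
    for x, y in data:
        mu = sum(wi * xi for wi, xi in zip(w_mu, x))
        s = sum(wi * xi for wi, xi in zip(w_s, x))
        s = max(-2.0, min(2.0, s))          # clamp log-variance for stability
        var = math.exp(s)
        total_loss += 0.5 * (s + (y - mu) ** 2 / var)
        g_mu = -(y - mu) / var                   # d(loss)/d(mu)
        g_s = 0.5 * (1.0 - (y - mu) ** 2 / var)  # d(loss)/d(s)
        for i in range(dim):
            w_mu[i] -= eta * g_mu * x[i]
            w_s[i] -= eta * g_s * x[i]
    return w_mu, w_s, total_loss
```

Predicting the log-variance rather than the variance keeps the variance estimate positive without constrained optimization; no distributional assumption on the errors is used by the update itself, only by the choice of surrogate loss.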

### Regret Minimization in Nonstationary Markov Decision Processes

"... We consider decision-making problems in Markov decision processes where both the rewards and the transition probabilities vary in an arbitrary (e.g., nonstationary) fashion to some extent. We propose online learning algorithms and provide guarantees on their performance evaluated in retrospect again ..."

Abstract
- Add to MetaCart

We consider decision-making problems in Markov decision processes where both the rewards and the transition probabilities vary in an arbitrary (e.g., nonstationary) fashion to some extent. We propose online learning algorithms and provide guarantees on their performance evaluated in retrospect against stationary policies. Unlike previous works, the guarantees depend critically on the variability of the uncertainty in the transition probabilities, but hold regardless of arbitrary changes in rewards and transition probabilities. First, we use an approach based on robust dynamic programming and extend it to the case where reward observation is limited to the actual state-action trajectory. Next, we present a computationally efficient simulation-based Q-learning style algorithm that requires neither prior knowledge nor estimation of the transition probabilities. We show both probabilistic performance guarantees and deterministic guarantees on the expected performance.

### TTIC

"... We consider a Markov decision process with deterministic state transition dynamics, adversarially generated rewards that change arbitrarily from round to round, and a bandit feedback model in which the decision maker only observes the rewards it receives. In this setting, we present a novel and effi ..."

Abstract
- Add to MetaCart

We consider a Markov decision process with deterministic state transition dynamics, adversarially generated rewards that change arbitrarily from round to round, and a bandit feedback model in which the decision maker only observes the rewards it receives. In this setting, we present a novel and efficient online decision making algorithm named MarcoPolo. Under mild assumptions on the structure of the transition dynamics, we prove that MarcoPolo enjoys a regret of O(T^(3/4) √(log T)) against the best deterministic policy in hindsight. Notably, our analysis does not rely on the stringent unichain assumption, which dominates much of the previous work on this topic.

### The Online Loop-free . . .

"... We consider a stochastic extension of the loop-free shortest path problem with adversarial rewards. In this episodic Markov decision problem an agent traverses through an acyclic graph with random transitions: at each step of an episode the agent chooses an action, receives some reward, and arrives ..."

Abstract
- Add to MetaCart

We consider a stochastic extension of the loop-free shortest path problem with adversarial rewards. In this episodic Markov decision problem an agent traverses an acyclic graph with random transitions: at each step of an episode the agent chooses an action, receives some reward, and arrives at a random next state, where the reward and the distribution of the next state depend on the current state and the chosen action. We consider the bandit situation, when only the reward of the just-visited state-action pair is revealed to the agent. For this problem we develop algorithms that perform asymptotically as well as the best stationary policy in hindsight. Assuming that all states are reachable with probability α > 0 under all policies, we give an algorithm and prove that its regret is O(L² √(T|A|)/α), where T is the number of episodes, A denotes the (finite) set of actions, and L is the length of the longest path in the graph. Variants of the algorithm are given that improve the dependence on the transition probabilities under specific conditions. The results are also extended to variations of the problem, including the case when the agent competes with time-varying policies.
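A generic building block for such bandit episodic problems is an Exp3-style learner run at each state, using importance-weighted estimates of the observed rewards. The sketch below illustrates that idea; it is not the paper's algorithm, and the `env(state, action, rng)` interface returning `(reward, next_state or None)` is an assumption of ours.

```python
import math
import random

def exp3_episodic(env, n_states, n_actions, n_episodes, gamma=0.1, seed=0):
    """Exp3-style exponential weights maintained at every state of a
    loop-free episodic MDP, with importance-weighted reward estimates
    for bandit feedback (only the played action's reward is observed)."""
    rng = random.Random(seed)
    w = [[1.0] * n_actions for _ in range(n_states)]
    total_reward = 0.0
    for _ in range(n_episodes):
        state = 0                      # episodes start at a fixed initial state
        while state is not None:       # None marks the end of the episode
            ws = w[state]
            z = sum(ws)
            # mix in uniform exploration so every action keeps probability
            # at least gamma / n_actions
            probs = [(1 - gamma) * wi / z + gamma / n_actions for wi in ws]
            a = rng.choices(range(n_actions), weights=probs)[0]
            reward, nxt = env(state, a, rng)
            total_reward += reward
            # unbiased estimate: nonzero only for the played action
            est = reward / probs[a]
            w[state][a] *= math.exp(gamma * est / n_actions)
            state = nxt
    return w, total_reward
```

Dividing the observed reward by the probability of the played action keeps the reward estimates unbiased, which is what makes the exponential-weights analysis carry over to bandit feedback.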


### An Online Policy Gradient Algorithm for Markov Decision Processes with Continuous States and Actions

Neural Computation, to appear.

"... We consider the learning problem under an online Markov decision process (MDP), which is aimed at learning the time-dependent decision-making policy of an agent that minimizes the regret | the difference from the best xed policy. The difficulty of online MDP learning is that the reward function chan ..."

Abstract
- Add to MetaCart

(Show Context)
We consider the learning problem under an online Markov decision process (MDP), which is aimed at learning the time-dependent decision-making policy of an agent that minimizes the regret, i.e., the difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this paper, we show that a simple online policy gradient algorithm achieves regret O(√T) for T steps under a certain concavity assumption and O(log T) under a strong concavity assumption. To the best of our knowledge, this is the first work to give an online MDP algorithm that can handle continuous state, action, and parameter spaces with guarantees. We also illustrate the behavior of the proposed online policy gradient method through experiments.
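The core update behind such a method is projected online gradient ascent on the policy parameter with step sizes decaying like 1/√t, which is the standard regime for O(√T) regret under concave per-round rewards. A minimal sketch, with an interface and names of our own choosing rather than the paper's:

```python
import math

def online_policy_gradient(reward_grads, theta0, radius=1.0):
    """Projected online gradient ascent on a policy parameter.
    `reward_grads` yields, for each round t, a function returning the
    gradient of that round's (concave) expected reward at the current
    parameter; the parameter is kept inside a Euclidean ball."""
    theta = list(theta0)
    for t, grad_fn in enumerate(reward_grads, start=1):
        eta = radius / math.sqrt(t)      # decaying step size ~ 1/sqrt(t)
        g = grad_fn(theta)
        theta = [th + eta * gi for th, gi in zip(theta, g)]
        # project back onto the Euclidean ball of the given radius
        norm = math.sqrt(sum(th * th for th in theta))
        if norm > radius:
            theta = [th * radius / norm for th in theta]
    return theta
```

The continuous state and action spaces enter only through the gradient oracle, which is why the parameter update itself stays this simple.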

### Reinforcement Learning in Robust Markov Decision Processes

"... An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions ..."

Abstract
- Add to MetaCart

An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves regret bounds similar to those of the purely stochastic case.