Results 1–10 of 18
Online bandit learning against an adaptive adversary: from regret to policy regret
In Proceedings of the 29th International Conference on Machine Learning, 2012
Cited by 16 (5 self)
Abstract
Online learning algorithms are designed to learn even when their input is generated by an adversary. The widely accepted formal definition of an online algorithm’s ability to learn is the game-theoretic notion of regret. We argue that the standard definition of regret becomes inadequate if the adversary is allowed to adapt to the online algorithm’s actions. We define the alternative notion of policy regret, which attempts to provide a more meaningful way to measure an online algorithm’s performance against adaptive adversaries. Focusing on the online bandit setting, we show that no bandit algorithm can guarantee sublinear policy regret against an adaptive adversary with unbounded memory. On the other hand, if the adversary’s memory is bounded, we present a general technique that converts any bandit algorithm with a sublinear regret bound into an algorithm with a sublinear policy regret bound. We extend this result to other variants of regret, such as switching regret, internal regret, and swap regret.
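As a toy illustration of the gap this abstract describes, consider a memory-1 "switching cost" adversary that charges a unit loss whenever the learner's action differs from its previous one. The adversary, the alternating learner, and the loss values below are illustrative assumptions for this sketch, not the paper's construction: standard regret compares against fixed actions on the realized loss sequence, while policy regret replays the fixed action against the reacting adversary.

```python
def switching_cost_loss(prev_action, action):
    # Memory-1 adaptive adversary (illustrative): charges a unit loss
    # whenever the played action differs from the learner's previous action.
    return 1.0 if action != prev_action else 0.0

def regrets(T=100):
    prev = 0
    learner_loss = 0.0
    realized = {0: 0.0, 1: 0.0}  # losses each fixed action incurs on the realized history
    for _ in range(T):
        action = 1 - prev  # an alternating learner, the worst case for this adversary
        learner_loss += switching_cost_loss(prev, action)
        for a in (0, 1):   # realized losses reflect the adversary's reaction to the learner
            realized[a] += switching_cost_loss(prev, a)
        prev = action
    standard_regret = learner_loss - min(realized.values())
    # Policy regret replays a fixed action against the (reacting) adversary:
    # a constant action never triggers the switching charge, so its
    # counterfactual loss is 0.
    policy_regret = learner_loss - 0.0
    return standard_regret, policy_regret
```

On this example the alternating learner's policy regret is twice its standard regret: the realized losses make fixed actions look worse than they would have been had the adversary actually faced them.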
An Efficient Algorithm for Learning with Semi-Bandit Feedback
, 1305
Cited by 11 (4 self)
Abstract
We consider the problem of online combinatorial optimization under semi-bandit feedback. The goal of the learner is to sequentially select its actions from a combinatorial decision set so as to minimize its cumulative loss. We propose a learning algorithm for this problem based on combining the Follow-the-Perturbed-Leader (FPL) prediction method with a novel loss estimation procedure called Geometric Resampling (GR). Contrary to previous solutions, the resulting algorithm can be efficiently implemented for any decision set where efficient offline combinatorial optimization is possible at all. Assuming that the elements of the decision set can be described with d-dimensional binary vectors with at most m non-zero entries, we show that the expected regret of our algorithm after T rounds is O(m√(dT log d)). As a side result, we also improve the best known regret bounds for FPL in the full information setting to O(m^{3/2}√(T log d)), gaining a factor of √(d/m) over previous bounds for this algorithm.
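The core trick of Geometric Resampling can be sketched in a few lines: instead of dividing an observed loss by an action probability the learner cannot compute, redraw fresh actions from the learner's own sampling routine until the coordinate of interest reappears; the number of draws is geometric with the right mean. The `sample_action` interface (returning a set of chosen coordinates) and the truncation constant are illustrative assumptions, not the paper's exact formulation.

```python
import random

def geometric_resample(sample_action, coord, max_draws=1000):
    # Redraw actions until `coord` appears again.  The draw count K is a
    # geometric random variable with mean 1 / P(coord is played), so
    # observed_loss * K serves as a (truncated) importance-weighted
    # estimate of the coordinate's loss without knowing the probability.
    for k in range(1, max_draws + 1):
        if coord in sample_action():
            return k
    return max_draws  # truncation keeps the estimate bounded
```

A typical use would be `loss_estimate = observed_loss * geometric_resample(sampler, i)`, requiring only sampling access to the learner's action distribution.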
Prediction by Random-Walk Perturbation
Cited by 9 (3 self)
Abstract
We propose a version of the follow-the-perturbed-leader online prediction algorithm in which the cumulative losses are perturbed by independent symmetric random walks. The forecaster is shown to achieve an expected regret of the optimal order O(√(n log N)), where n is the time horizon and N is the number of experts. More importantly, it is shown that the forecaster changes its prediction at most O(√(n log N)) times, in expectation. We also extend the analysis to online combinatorial optimization and show that even in this more general setting, the forecaster rarely switches between experts while having a regret of near-optimal order.
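A minimal sketch of the scheme in the experts setting: each round, one fresh ±1 step is added to each expert's running perturbation (rather than redrawing the perturbation from scratch, as plain FPL does), and the perturbed leader is followed. The specific losses and switch-counting below are illustrative, not taken from the paper.

```python
import random

def fpl_random_walk(loss_matrix, rng):
    # Follow the perturbed leader where each expert's perturbation is an
    # independent symmetric random walk: one +/-1 step per round is ADDED
    # to the previous perturbation, which is what keeps the leader (and
    # hence the prediction) stable across rounds.
    n_rounds, n_experts = len(loss_matrix), len(loss_matrix[0])
    cum = [0.0] * n_experts   # cumulative losses
    walk = [0.0] * n_experts  # random-walk perturbations
    preds, switches, prev = [], 0, None
    for t in range(n_rounds):
        for i in range(n_experts):
            walk[i] += rng.choice((-1.0, 1.0))
        leader = min(range(n_experts), key=lambda i: cum[i] + walk[i])
        preds.append(leader)
        if prev is not None and leader != prev:
            switches += 1
        prev = leader
        for i in range(n_experts):
            cum[i] += loss_matrix[t][i]
    return preds, switches
```

Once an expert's cumulative lead exceeds the typical √t range of the walks, the prediction stops changing, which is the mechanism behind the small switch count.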
Online Learning under Delayed Feedback
"... Online learning with delayed feedback has received increasing attention recently due to its several applications in distributed, webbased learning problems. In this paper we provide a systematic study of the topic, and analyze the effect of delay on the regret of online learning algorithms. Somewha ..."
Cited by 5 (0 self)
Abstract
Online learning with delayed feedback has received increasing attention recently due to its several applications in distributed, web-based learning problems. In this paper we provide a systematic study of the topic, and analyze the effect of delay on the regret of online learning algorithms. Somewhat surprisingly, it turns out that delay increases the regret multiplicatively in adversarial problems, and additively in stochastic problems. We give meta-algorithms that transform, in a black-box fashion, algorithms developed for the non-delayed case into ones that can handle the presence of delays in the feedback loop. Modifications of the well-known UCB algorithm are also developed for the bandit problem with delayed feedback, with the advantage over the meta-algorithms that they can be implemented with lower complexity.
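One standard black-box reduction for bounded delays can be sketched as follows: run max_delay + 1 independent copies of the base learner round-robin, so each copy has its feedback back before its next turn (this is also where the multiplicative blow-up in the adversarial case comes from). The act/feed interface and the toy greedy base learner are assumptions for illustration, not the paper's API.

```python
class GreedyBase:
    # Toy stand-in for a non-delayed 2-armed bandit algorithm: plays the
    # arm with the lowest average observed loss (illustrative only).
    def __init__(self):
        self.sums = [0.0, 0.0]
        self.counts = [0, 0]
    def act(self):
        avg = [self.sums[a] / self.counts[a] if self.counts[a] else 0.0
               for a in (0, 1)]
        self.last = 0 if avg[0] <= avg[1] else 1
        return self.last
    def feed(self, loss):
        self.sums[self.last] += loss
        self.counts[self.last] += 1

class DelayedWrapper:
    # Meta-algorithm sketch: with feedback delayed by at most max_delay
    # rounds, copy c plays rounds c, c + (max_delay+1), ..., so its
    # feedback has always arrived before it acts again.
    def __init__(self, base_factory, max_delay):
        self.copies = [base_factory() for _ in range(max_delay + 1)]
        self.t = 0
    def act(self):
        copy = self.copies[self.t % len(self.copies)]
        self.t += 1
        return copy.act()
    def feed(self, played_round, loss):
        # feedback for round s is routed back to the copy that played it
        self.copies[played_round % len(self.copies)].feed(loss)
```

Each copy then faces an ordinary non-delayed problem on a 1/(max_delay+1) fraction of the rounds, so its original regret guarantee applies unchanged to its subsequence.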
The adversarial stochastic shortest path problem with unknown transition probabilities
, 2012
Multi-Armed Bandit Problems under Delayed Feedback
, 2012
Cited by 4 (0 self)
Abstract
In this thesis, the multi-armed bandit (MAB) problem in online learning is studied, when the feedback information is not observed immediately but rather after arbitrary, unknown, random delays. In the “stochastic” setting, when the rewards come from a fixed distribution, an algorithm is given that uses a non-delayed MAB algorithm as a black box. We also give a method to generalize the theoretical guarantees of non-delayed UCB-type algorithms to the delayed stochastic setting. Assuming the delays are independent of the rewards, we upper bound the penalty in the performance of these algorithms (measured by “regret”) by an additive term depending on the delays. When the rewards are chosen in an adversarial manner, we give a black-box style algorithm using multiple instances
Online Markov decision processes with Kullback-Leibler control cost
In Proceedings of the American Control Conference, 2012
Cited by 2 (1 self)
Abstract
This paper considers an online (real-time) control problem that involves an agent performing a discrete-time random walk over a finite state space. The agent’s action at each time step is to specify the probability distribution for the next state given the current state. Following the setup of Todorov, the state-action cost at each time step is a sum of a state cost and a control cost given by the Kullback-Leibler (KL) divergence between the agent’s next-state distribution and that determined by some fixed passive dynamics. The online aspect of the problem is due to the fact that the state cost functions are generated by a dynamic environment, and the agent learns the current state cost only after selecting an action. An explicit construction of a computationally efficient strategy with small regret (i.e., expected difference between its actual total cost and the smallest cost attainable using non-causal knowledge of the state costs) under mild regularity conditions is presented, along with a demonstration of the performance of the proposed strategy on a simulated target-tracking problem. A number of new results on Markov decision processes with KL control cost are also obtained.
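The one-step cost structure the abstract describes is easy to write down. The sketch below computes the Todorov-style cost of one action (a chosen next-state distribution); encoding distributions as dicts mapping next states to probabilities is an illustrative assumption.

```python
import math

def kl_control_cost(state_cost, next_dist, passive_dist):
    # One-step cost in the linearly-solvable setting: state cost plus the
    # KL divergence between the agent's chosen next-state distribution and
    # the fixed passive dynamics.  Zero-probability next states contribute
    # nothing to the KL sum.
    kl = sum(p * math.log(p / passive_dist[y])
             for y, p in next_dist.items() if p > 0.0)
    return state_cost + kl
```

Following the passive dynamics is free in control cost, while forcing a near-deterministic transition that the passive dynamics makes unlikely is expensive; this trade-off is what makes the class of problems analytically tractable.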
An Online Policy Gradient Algorithm for Markov Decision Processes with Continuous States and Actions
Cited by 1 (1 self)
Abstract
We consider the learning problem under an online Markov decision process (MDP), which is aimed at learning the time-dependent decision-making policy of an agent that minimizes the regret, i.e., the difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this paper, we show that a simple online policy gradient algorithm achieves regret O(√T) for T steps under a certain concavity assumption and O(log T) under a strong concavity assumption. To the best of our knowledge, this is the first work to give an online MDP algorithm that can handle continuous state, action, and parameter spaces with guarantees. We also illustrate the behavior of the online policy gradient method through experiments.
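The update style behind such O(√T) guarantees can be sketched as projected online gradient ascent with step size η_t = 1/√t. Reducing the policy to a single scalar parameter and passing per-round gradient oracles `grad_fns` are illustrative simplifications, not the paper's actual algorithm over continuous-state MDPs.

```python
import math

def online_policy_gradient(grad_fns, radius=1.0):
    # Projected online gradient ascent on a scalar policy parameter:
    # each round, step in the direction of that round's reward gradient
    # with eta_t = 1/sqrt(t), then project back onto [-radius, radius].
    theta, history = 0.0, []
    for t, grad in enumerate(grad_fns, start=1):
        theta += grad(theta) / math.sqrt(t)       # ascend round t's reward
        theta = max(-radius, min(radius, theta))  # projection step
        history.append(theta)
    return history
```

The 1/√t schedule is the standard choice that balances adapting to changing rewards against stability, and it is what yields √T-type regret under concavity.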
Better Rates for Any Adversarial Deterministic MDP
In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, 2012
Cited by 1 (1 self)
Abstract
We consider regret minimization in adversarial deterministic Markov Decision Processes (ADMDPs) with bandit feedback. We devise a new algorithm that pushes the state of the art forward in two ways: First, it attains a regret of O(T^{2/3}) with respect to the best fixed policy in hindsight, whereas the previous best regret bound was O(T^{3/4}). Second, the algorithm and its analysis are compatible with any feasible ADMDP graph topology, while all previous approaches required additional restrictions on the graph topology.
Online Learning in Markov Decision Processes with Changing Cost Sequences
Cited by 1 (0 self)
Abstract
In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences under full and bandit information. We propose to view this problem as an instance of online linear optimization, and propose two methods for it: MD^2 (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks. We provide a rigorous complexity analysis of these techniques, while providing near-optimal regret bounds (in particular, we take into account the computational costs of performing approximate projections in MD^2). In the case of full-information feedback, our results complement existing ones. In the case of bandit-information feedback we consider the online stochastic shortest path problem, a special case of the above MDP problems, and manage to improve the existing results by removing the previous restrictive assumption that the state-visitation probabilities are uniformly bounded away from zero under all policies.