Results 1–10 of 18
Online bandit learning against an adaptive adversary: from regret to policy regret
In Proceedings of the 29th International Conference on Machine Learning, 2012
Cited by 16 (5 self)
Abstract
Online learning algorithms are designed to learn even when their input is generated by an adversary. The widely accepted formal definition of an online algorithm’s ability to learn is the game-theoretic notion of regret. We argue that the standard definition of regret becomes inadequate if the adversary is allowed to adapt to the online algorithm’s actions. We define the alternative notion of policy regret, which attempts to provide a more meaningful way to measure an online algorithm’s performance against adaptive adversaries. Focusing on the online bandit setting, we show that no bandit algorithm can guarantee sublinear policy regret against an adaptive adversary with unbounded memory. On the other hand, if the adversary’s memory is bounded, we present a general technique that converts any bandit algorithm with a sublinear regret bound into an algorithm with a sublinear policy regret bound. We extend this result to other variants of regret, such as switching regret, internal regret, and swap regret.
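As a toy illustration of the gap this abstract describes, consider a memory-1 "switching cost" adversary that charges a unit loss whenever the learner's action differs from its previous one. The adversary, the alternating learner, and the loss values below are illustrative assumptions for this sketch, not the paper's construction: standard regret compares against fixed actions on the realized loss sequence, while policy regret replays the fixed action against the reacting adversary.

```python
def switching_cost_loss(prev_action, action):
    # Memory-1 adaptive adversary (illustrative): charges a unit loss
    # whenever the played action differs from the learner's previous action.
    return 1.0 if action != prev_action else 0.0

def regrets(T=100):
    prev = 0
    learner_loss = 0.0
    realized = {0: 0.0, 1: 0.0}  # losses each fixed action incurs on the realized history
    for _ in range(T):
        action = 1 - prev  # an alternating learner, the worst case for this adversary
        learner_loss += switching_cost_loss(prev, action)
        for a in (0, 1):   # realized losses reflect the adversary's reaction to the learner
            realized[a] += switching_cost_loss(prev, a)
        prev = action
    standard_regret = learner_loss - min(realized.values())
    # Policy regret replays a fixed action against the (reacting) adversary:
    # a constant action never triggers the switching charge, so its
    # counterfactual loss is 0.
    policy_regret = learner_loss - 0.0
    return standard_regret, policy_regret
```

On this example the alternating learner's policy regret is twice its standard regret: the realized losses make fixed actions look worse than they would have been had the adversary actually faced them.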
An Efficient Algorithm for Learning with Semi-Bandit Feedback
, 1305
Cited by 11 (4 self)
Abstract
We consider the problem of online combinatorial optimization under semi-bandit feedback. The goal of the learner is to sequentially select its actions from a combinatorial decision set so as to minimize its cumulative loss. We propose a learning algorithm for this problem based on combining the Follow-the-Perturbed-Leader (FPL) prediction method with a novel loss estimation procedure called Geometric Resampling (GR). Contrary to previous solutions, the resulting algorithm can be efficiently implemented for any decision set where efficient offline combinatorial optimization is possible at all. Assuming that the elements of the decision set can be described with d-dimensional binary vectors with at most m non-zero entries, we show that the expected regret of our algorithm after T rounds is O(m√(dT log d)). As a side result, we also improve the best known regret bounds for FPL in the full information setting to O(m^{3/2}√(T log d)), gaining a factor of √(d/m) over previous bounds for this algorithm.
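The core trick of Geometric Resampling can be sketched in a few lines: instead of dividing an observed loss by an action probability the learner cannot compute, redraw fresh actions from the learner's own sampling routine until the coordinate of interest reappears; the number of draws is geometric with the right mean. The `sample_action` interface (returning a set of chosen coordinates) and the truncation constant are illustrative assumptions, not the paper's exact formulation.

```python
import random

def geometric_resample(sample_action, coord, max_draws=1000):
    # Redraw actions until `coord` appears again.  The draw count K is a
    # geometric random variable with mean 1 / P(coord is played), so
    # observed_loss * K serves as a (truncated) importance-weighted
    # estimate of the coordinate's loss without knowing the probability.
    for k in range(1, max_draws + 1):
        if coord in sample_action():
            return k
    return max_draws  # truncation keeps the estimate bounded
```

A typical use would be `loss_estimate = observed_loss * geometric_resample(sampler, i)`, requiring only sampling access to the learner's action distribution.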
Prediction by Random-Walk Perturbation
Cited by 9 (3 self)
Abstract
We propose a version of the follow-the-perturbed-leader online prediction algorithm in which the cumulative losses are perturbed by independent symmetric random walks. The forecaster is shown to achieve an expected regret of the optimal order O(√(n log N)), where n is the time horizon and N is the number of experts. More importantly, it is shown that the forecaster changes its prediction at most O(√(n log N)) times, in expectation. We also extend the analysis to online combinatorial optimization and show that even in this more general setting, the forecaster rarely switches between experts while having a regret of near-optimal order.
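A minimal sketch of the scheme in the experts setting: each round, one fresh ±1 step is added to each expert's running perturbation (rather than redrawing the perturbation from scratch, as plain FPL does), and the perturbed leader is followed. The specific losses and switch-counting below are illustrative, not taken from the paper.

```python
import random

def fpl_random_walk(loss_matrix, rng):
    # Follow the perturbed leader where each expert's perturbation is an
    # independent symmetric random walk: one +/-1 step per round is ADDED
    # to the previous perturbation, which is what keeps the leader (and
    # hence the prediction) stable across rounds.
    n_rounds, n_experts = len(loss_matrix), len(loss_matrix[0])
    cum = [0.0] * n_experts   # cumulative losses
    walk = [0.0] * n_experts  # random-walk perturbations
    preds, switches, prev = [], 0, None
    for t in range(n_rounds):
        for i in range(n_experts):
            walk[i] += rng.choice((-1.0, 1.0))
        leader = min(range(n_experts), key=lambda i: cum[i] + walk[i])
        preds.append(leader)
        if prev is not None and leader != prev:
            switches += 1
        prev = leader
        for i in range(n_experts):
            cum[i] += loss_matrix[t][i]
    return preds, switches
```

Once an expert's cumulative lead exceeds the typical √t range of the walks, the prediction stops changing, which is the mechanism behind the small switch count.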
Online Learning under Delayed Feedback
"... Online learning with delayed feedback has received increasing attention recently due to its several applications in distributed, webbased learning problems. In this paper we provide a systematic study of the topic, and analyze the effect of delay on the regret of online learning algorithms. Somewha ..."
Cited by 5 (0 self)
Abstract
Online learning with delayed feedback has received increasing attention recently due to its several applications in distributed, web-based learning problems. In this paper we provide a systematic study of the topic, and analyze the effect of delay on the regret of online learning algorithms. Somewhat surprisingly, it turns out that delay increases the regret multiplicatively in adversarial problems, and additively in stochastic problems. We give meta-algorithms that transform, in a black-box fashion, algorithms developed for the non-delayed case into ones that can handle the presence of delays in the feedback loop. Modifications of the well-known UCB algorithm are also developed for the bandit problem with delayed feedback, with the advantage over the meta-algorithms that they can be implemented with lower complexity.
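One standard black-box reduction for bounded delays can be sketched as follows: run max_delay + 1 independent copies of the base learner round-robin, so each copy has its feedback back before its next turn (this is also where the multiplicative blow-up in the adversarial case comes from). The act/feed interface and the toy greedy base learner are assumptions for illustration, not the paper's API.

```python
class GreedyBase:
    # Toy stand-in for a non-delayed 2-armed bandit algorithm: plays the
    # arm with the lowest average observed loss (illustrative only).
    def __init__(self):
        self.sums = [0.0, 0.0]
        self.counts = [0, 0]
    def act(self):
        avg = [self.sums[a] / self.counts[a] if self.counts[a] else 0.0
               for a in (0, 1)]
        self.last = 0 if avg[0] <= avg[1] else 1
        return self.last
    def feed(self, loss):
        self.sums[self.last] += loss
        self.counts[self.last] += 1

class DelayedWrapper:
    # Meta-algorithm sketch: with feedback delayed by at most max_delay
    # rounds, copy c plays rounds c, c + (max_delay+1), ..., so its
    # feedback has always arrived before it acts again.
    def __init__(self, base_factory, max_delay):
        self.copies = [base_factory() for _ in range(max_delay + 1)]
        self.t = 0
    def act(self):
        copy = self.copies[self.t % len(self.copies)]
        self.t += 1
        return copy.act()
    def feed(self, played_round, loss):
        # feedback for round s is routed back to the copy that played it
        self.copies[played_round % len(self.copies)].feed(loss)
```

Each copy then faces an ordinary non-delayed problem on a 1/(max_delay+1) fraction of the rounds, so its original regret guarantee applies unchanged to its subsequence.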
The adversarial stochastic shortest path problem with unknown transition probabilities
, 2012
Multi-Armed Bandit Problems under Delayed Feedback
, 2012
Cited by 4 (0 self)
Abstract
In this thesis, the multi-armed bandit (MAB) problem in online learning is studied, when the feedback information is not observed immediately but rather after arbitrary, unknown, random delays. In the “stochastic” setting, when the rewards come from a fixed distribution, an algorithm is given that uses a non-delayed MAB algorithm as a black box. We also give a method to generalize the theoretical guarantees of non-delayed UCB-type algorithms to the delayed stochastic setting. Assuming the delays are independent of the rewards, we upper bound the penalty in the performance of these algorithms (measured by “regret”) by an additive term depending on the delays. When the rewards are chosen in an adversarial manner, we give a black-box style algorithm using multiple instances
Online Markov decision processes with Kullback-Leibler control cost
In Proceedings of the American Control Conference, 2012
Cited by 2 (1 self)
Abstract
This paper considers an online (real-time) control problem that involves an agent performing a discrete-time random walk over a finite state space. The agent’s action at each time step is to specify the probability distribution for the next state given the current state. Following the setup of Todorov, the state-action cost at each time step is a sum of a state cost and a control cost given by the Kullback-Leibler (KL) divergence between the agent’s next-state distribution and that determined by some fixed passive dynamics. The online aspect of the problem is due to the fact that the state cost functions are generated by a dynamic environment, and the agent learns the current state cost only after selecting an action. An explicit construction of a computationally efficient strategy with small regret (i.e., expected difference between its actual total cost and the smallest cost attainable using non-causal knowledge of the state costs) under mild regularity conditions is presented, along with a demonstration of the performance of the proposed strategy on a simulated target-tracking problem. A number of new results on Markov decision processes with KL control cost are also obtained.
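The one-step cost structure the abstract describes is easy to write down. The sketch below computes the Todorov-style cost of one action (a chosen next-state distribution); encoding distributions as dicts mapping next states to probabilities is an illustrative assumption.

```python
import math

def kl_control_cost(state_cost, next_dist, passive_dist):
    # One-step cost in the linearly-solvable setting: state cost plus the
    # KL divergence between the agent's chosen next-state distribution and
    # the fixed passive dynamics.  Zero-probability next states contribute
    # nothing to the KL sum.
    kl = sum(p * math.log(p / passive_dist[y])
             for y, p in next_dist.items() if p > 0.0)
    return state_cost + kl
```

Following the passive dynamics is free in control cost, while forcing a near-deterministic transition that the passive dynamics makes unlikely is expensive; this trade-off is what makes the class of problems analytically tractable.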
An Online Policy Gradient Algorithm for Markov Decision Processes with Continuous States and Actions
Cited by 1 (1 self)
Abstract
We consider the learning problem under an online Markov decision process (MDP), which is aimed at learning the time-dependent decision-making policy of an agent that minimizes the regret, i.e., the difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this paper, we show that a simple online policy gradient algorithm achieves regret O(√T) for T steps under a certain concavity assumption and O(log T) under a strong concavity assumption. To the best of our knowledge, this is the first work to give an online MDP algorithm that can handle continuous state, action, and parameter spaces with guarantees. We also illustrate the behavior of the online policy gradient method through experiments.
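The update style behind such O(√T) guarantees can be sketched as projected online gradient ascent with step size η_t = 1/√t. Reducing the policy to a single scalar parameter and passing per-round gradient oracles `grad_fns` are illustrative simplifications, not the paper's actual algorithm over continuous-state MDPs.

```python
import math

def online_policy_gradient(grad_fns, radius=1.0):
    # Projected online gradient ascent on a scalar policy parameter:
    # each round, step in the direction of that round's reward gradient
    # with eta_t = 1/sqrt(t), then project back onto [-radius, radius].
    theta, history = 0.0, []
    for t, grad in enumerate(grad_fns, start=1):
        theta += grad(theta) / math.sqrt(t)       # ascend round t's reward
        theta = max(-radius, min(radius, theta))  # projection step
        history.append(theta)
    return history
```

The 1/√t schedule is the standard choice that balances adapting to changing rewards against stability, and it is what yields √T-type regret under concavity.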
Better Rates for Any Adversarial Deterministic MDP
In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, 2012
Cited by 1 (1 self)
Abstract
We consider regret minimization in adversarial deterministic Markov Decision Processes (ADMDPs) with bandit feedback. We devise a new algorithm that pushes the state of the art forward in two ways: First, it attains a regret of O(T^{2/3}) with respect to the best fixed policy in hindsight, whereas the previous best regret bound was O(T^{3/4}). Second, the algorithm and its analysis are compatible with any feasible ADMDP graph topology, while all previous approaches required additional restrictions on the graph topology.
Online Learning in Markov Decision Processes with Changing Cost Sequences
Cited by 1 (0 self)
Abstract
In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences under full and bandit information. We propose to view this problem as an instance of online linear optimization, and propose two methods for it: MD^2 (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks. We provide a rigorous complexity analysis of these techniques, while providing near-optimal regret bounds (in particular, we take into account the computational costs of performing approximate projections in MD^2). In the case of full-information feedback, our results complement existing ones. In the case of bandit-information feedback we consider the online stochastic shortest path problem, a special case of the above MDP problems, and manage to improve the existing results by removing the previous restrictive assumption that the state-visitation probabilities are uniformly bounded away from zero under all policies.