Improved second-order bounds for prediction with expert advice
In COLT, 2005
Cited by 46 (9 self)
Abstract. This work studies external regret in sequential prediction games with both positive and negative payoffs. External regret measures the difference between the payoff obtained by the forecasting strategy and the payoff of the best action. In this setting, we derive new and sharper regret bounds for the well-known exponentially weighted average forecaster and for a new forecaster with a different multiplicative update rule. Our analysis has two main advantages: first, no preliminary knowledge about the payoff sequence is needed, not even its range; second, our bounds are expressed in terms of sums of squared payoffs, replacing larger first-order quantities appearing in previous bounds. In addition, our most refined bounds have the natural and desirable property of being stable under rescalings and general translations of the payoff sequence.
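The exponentially weighted average forecaster referenced in this abstract admits a very compact sketch: each expert's weight is exponential in its cumulative payoff, normalized into a probability distribution. This is a minimal illustration only; the paper's contribution is a second-order tuning of the learning rate, which is omitted here, and the fixed `eta` is an assumption.

```python
import math

def ewa_weights(payoffs, eta):
    """Exponentially weighted average forecaster (sketch): given each
    expert's cumulative past payoff, return the probability of following
    each expert. Weights are exp(eta * payoff), normalized."""
    m = max(payoffs)  # subtract the max before exponentiating, for stability
    w = [math.exp(eta * (p - m)) for p in payoffs]
    z = sum(w)
    return [wi / z for wi in w]

# Expert 1 has accumulated the most payoff, so it receives the most weight.
probs = ewa_weights([1.0, 3.0, 2.0], eta=0.5)
```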
Minimizing regret with label efficient prediction
IEEE Trans. Inform. Theory, 2005
Cited by 39 (6 self)
Abstract. We investigate label efficient prediction, a variant of the problem of prediction with expert advice, proposed by Helmbold and Panizza, in which the forecaster does not have access to the outcomes of the sequence to be predicted unless he asks for them, which he can do a limited number of times. We determine matching upper and lower bounds for the best possible excess error when the number of allowed queries is a constant. We also prove that a query rate of order (ln n)(ln ln n)^2 / n is sufficient for achieving Hannan consistency, a fundamental property in game-theoretic prediction models. Finally, we apply the label efficient framework to pattern classification and prove a label efficient mistake bound for a randomized variant of Littlestone's zero-threshold Winnow algorithm.
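A standard device in label-efficient prediction is to query the label only with some small probability and compensate with importance weighting, so the resulting loss estimate stays unbiased. The round below is a generic sketch of that device under the abstract's setting, not the paper's specific forecaster; `predict` and `reveal_outcome` are hypothetical callbacks introduced for illustration.

```python
import random

def label_efficient_round(predict, reveal_outcome, eps, rng=random):
    """One round of label-efficient prediction (sketch): the forecaster
    predicts, then asks for the true outcome only with probability eps.
    When it queries, loss/eps is an unbiased estimate of the round's loss;
    when it does not, the estimate is zero."""
    y_hat = predict()
    if rng.random() < eps:            # query the label (limited-budget event)
        y = reveal_outcome()
        loss = 1.0 if y_hat != y else 0.0
        return loss / eps             # importance-weighted loss estimate
    return 0.0                        # no query this round

rng = random.Random(0)
# With eps=1.0 the label is always queried; a wrong prediction yields estimate 1.0.
est = label_efficient_round(lambda: 1, lambda: 0, eps=1.0, rng=rng)
```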
The K-armed Dueling Bandits Problem
Cited by 15 (6 self)
We study a partial-information online-learning problem where actions are restricted to noisy comparisons between pairs of strategies (also known as bandits). In contrast to conventional approaches that require the absolute reward of the chosen strategy to be quantifiable and observable, our setting assumes only that (noisy) binary feedback about the relative reward of two chosen strategies is available. This type of relative feedback is particularly appropriate in applications where absolute rewards have no natural scale or are difficult to measure (e.g., user-perceived quality of a set of retrieval results, taste of food, product attractiveness), but where pairwise comparisons are easy to make. We propose a novel regret formulation in this setting, as well as present an algorithm that achieves (almost) information-theoretically optimal regret bounds (up to a constant factor).
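One common way to make the dueling-bandit regret notion concrete: fix a preference matrix p where p[i][j] is the chance that arm i beats arm j, take the best arm as the one winning most comparisons, and charge the learner the excess win-probability of that arm over both arms it chose to duel. This exact definition and the toy matrix below are illustrative assumptions, not necessarily the paper's formulation.

```python
def dueling_regret(history, p):
    """Cumulative regret for K-armed dueling bandits (illustrative sketch).
    p[i][j] = probability that arm i beats arm j; the benchmark arm b is
    the one with the largest total win-probability. Each dueled pair (i, j)
    incurs (p[b][i] - 1/2) + (p[b][j] - 1/2)."""
    k = len(p)
    b = max(range(k), key=lambda i: sum(p[i][j] for j in range(k)))
    r = 0.0
    for (i, j) in history:
        r += (p[b][i] - 0.5) + (p[b][j] - 0.5)
    return r

# Two arms; arm 0 beats arm 1 with probability 0.7.
p = [[0.5, 0.7], [0.3, 0.5]]
# Dueling (0, 0) costs nothing; dueling (0, 1) costs 0 + 0.2.
reg = dueling_regret([(0, 0), (0, 1)], p)
```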
FPL analysis for adaptive bandits
In 3rd Symposium on Stochastic Algorithms, Foundations and Applications (SAGA'05), 2005
Cited by 7 (0 self)
A main problem of “Follow the Perturbed Leader” strategies for online decision problems is that regret bounds are typically proven against an oblivious adversary. In partial observation cases, it was not clear how to obtain performance guarantees against an adaptive adversary without worsening the bounds. We propose a conceptually simple argument to resolve this problem. Using it, a regret bound of O(t^{2/3}) for FPL in the adversarial multi-armed bandit problem is shown. This bound holds for the common FPL variant using only the observations from designated exploration rounds. Using all observations allows for the stronger bound of O(√t), matching the best bound known so far (and essentially the known lower bound) for adversarial bandits. Surprisingly, this variant does not even need explicit exploration; it is self-stabilizing. However, the sampling probabilities have to be either externally provided or approximated to sufficient accuracy, using O(t^2 log t) samples in each step.
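The core FPL step the abstract analyzes is simple to state: perturb each arm's estimated cumulative payoff with independent random noise and follow the leader of the perturbed totals. Below is a minimal sketch using the standard exponential perturbation; the scale parameter and distribution are the textbook FPL choices, not the paper's exact tuning.

```python
import random

def fpl_choose(cum_payoffs, eta, rng=random):
    """Follow the Perturbed Leader (sketch): add an independent Exp(eta)
    perturbation to each arm's estimated cumulative payoff and play the
    argmax of the perturbed totals."""
    perturbed = [g + rng.expovariate(eta) for g in cum_payoffs]
    return max(range(len(perturbed)), key=lambda i: perturbed[i])

rng = random.Random(42)
# Arm 0 leads by a wide margin, so small perturbations almost never flip it.
arm = fpl_choose([10.0, 0.0, 0.0], eta=5.0, rng=rng)
```

Randomizing the leader is what gives FPL its hedging behavior; with a deterministic argmax an adversary could force linear regret.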
Strategies for Prediction under Imperfect Monitoring
Mimeo, 2007
Cited by 6 (2 self)
We propose simple randomized strategies for sequential prediction under imperfect monitoring, that is, when the forecaster does not have access to the past outcomes but rather to a feedback signal. The proposed strategies are consistent in the sense that they achieve, asymptotically, the best possible average reward. It was Rustichini [25] who first proved the existence of such consistent predictors. The forecasters presented here offer the first constructive proof of consistency. Moreover, the proposed algorithms are computationally efficient. We also establish upper bounds for the rates of convergence. In the case of deterministic feedback, these rates are optimal up to logarithmic terms.
Learning to play partially-specified equilibrium
2007
Cited by 6 (0 self)
In a partially-specified correlated equilibrium (PSCE) the players are partially informed of the conditional strategies of the other players, and they best respond to the worst-case possible strategy. We construct a decentralized procedure that converges to PSCE when the monitoring is imperfect. This procedure is based on minimizing conditional regret when players obtain noisy signals that depend on the actions that have been previously played.
Reduced-Variance Payoff Estimation in Adversarial Bandit Problems
In Proceedings of the ECML'05 Workshop on Reinforcement Learning in Non-Stationary Environments (in print), 2005
Cited by 5 (1 self)
Abstract. A natural way to compare learning methods in non-stationary environments is to compare their regret. In this paper we consider the regret of algorithms in adversarial multi-armed bandit problems. We propose several methods to improve the performance of the baseline exponentially weighted average forecaster by changing the payoff-estimation methods. We argue that improved performance can be achieved by constructing payoff estimation methods that produce estimates with low variance. Our arguments are backed up by both theoretical and empirical results. In fact, our empirical results show that significant performance gains are possible over the baseline algorithm.
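For context on what "payoff-estimation methods" means here: the classic Exp3 estimate divides the observed payoff of the chosen arm by its selection probability, which is unbiased but high-variance. Shifting by a baseline keeps the estimate unbiased while shrinking variance when payoffs cluster near the baseline. The sketch below shows that generic baseline device; it is one illustrative variance-reduction mechanism under stated assumptions, not necessarily the paper's specific method.

```python
def payoff_estimates(chosen, observed_payoff, probs, baseline=0.0):
    """Importance-weighted payoff estimates for adversarial bandits (sketch).
    With baseline b, the chosen arm gets b + (x - b)/p and every other arm
    gets b. Expectation over the arm choice recovers each arm's true payoff
    component, so the estimate stays unbiased for any fixed b."""
    k = len(probs)
    est = [baseline] * k
    est[chosen] = baseline + (observed_payoff - baseline) / probs[chosen]
    return est

# Arm 1 was played with probability 0.5 and paid 0.9.
plain = payoff_estimates(1, 0.9, [0.5, 0.5], baseline=0.0)    # classic x/p
shifted = payoff_estimates(1, 0.9, [0.5, 0.5], baseline=1.0)  # near-payoff baseline
```

With payoffs near 1, the shifted estimate (0.8) sits far closer to the truth than the plain one (1.8), which is exactly the low-variance behavior the abstract argues for.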
Learning, regret minimization, and equilibria
2007
Cited by 5 (0 self)
Many situations involve repeatedly making decisions in an uncertain environment: for instance, deciding what route to drive to work each day, or repeated play of a game against an opponent with an unknown strategy. In this chapter we describe learning algorithms with strong guarantees for settings of this type, along with connections to game-theoretic equilibria when all players in a system are simultaneously adapting in such a manner. We begin by presenting algorithms for repeated play of a matrix game with the guarantee that against any opponent, they will perform nearly as well as the best fixed action in hindsight (also called the problem of combining expert advice or minimizing external regret). In a zero-sum game, such algorithms are guaranteed to approach or exceed the minimax value of the game, and even provide a simple proof of the minimax theorem. We then turn to algorithms that minimize an even stronger form of regret, known as internal or swap regret. We present a general reduction showing how to convert any …
Robust approachability and regret minimization in games with partial monitoring
In Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of JMLR Workshop and Conference Proceedings, 2011
Cited by 5 (4 self)
Approachability has become a standard tool in analyzing learning algorithms in the adversarial online learning setup. We develop a variant of approachability for games where there is ambiguity in the obtained reward, which belongs to a set rather than being a single vector. Using this variant we tackle the problem of approachability in games with partial monitoring and develop simple and efficient algorithms (i.e., with constant per-step complexity) for this setup. We finally consider external regret and internal regret in repeated games with partial monitoring and derive regret-minimizing strategies based on approachability theory.