Contextual Bandits with Similarity Information
 24th Annual Conference on Learning Theory
, 2011
Abstract

Cited by 57 (9 self)
In a multi-armed bandit (MAB) problem, an online algorithm makes a sequence of choices. In each round it chooses from a time-invariant set of alternatives and receives the payoff associated with this alternative. While the case of small strategy sets is by now well-understood, a lot of recent work has focused on MAB problems with exponentially or infinitely large strategy sets, where one needs to assume extra structure in order to make the problem tractable. In particular, recent literature considered information on similarity between arms. We consider similarity information in the setting of contextual bandits, a natural extension of the basic MAB problem where before each round an algorithm is given the context – a hint about the payoffs in this round. Contextual bandits are directly motivated by placing advertisements on webpages, one of the crucial problems in sponsored search. A particularly simple way to represent similarity information in the contextual bandit setting is via a similarity distance between the context-arm pairs which bounds from above the difference between the respective expected payoffs.
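The similarity distance described above can drive an optimistic policy: an observation at one context-arm pair yields an upper bound on the payoff of every nearby pair. The following toy sketch is not the paper's algorithm; the pair-indexed statistics, the [0, 1] payoff range, and the confidence radius are all illustrative assumptions, but it makes the transfer of information through the distance explicit:

```python
import math
import random

def lipschitz_ucb(contexts, arms, dist, pull, horizon, seed=0):
    """Toy optimistic policy for a contextual bandit with similarity
    information. dist((x, a), (x2, a2)) is assumed to upper-bound the
    difference between the expected payoffs of the two context-arm pairs,
    so every observation also constrains nearby pairs. Payoffs in [0, 1]."""
    rng = random.Random(seed)
    stats = {}   # (context, arm) -> (pull count, reward sum)
    total = 0.0
    for t in range(1, horizon + 1):
        x = contexts[t % len(contexts)]   # toy context sequence
        best_arm, best_index = None, -1.0
        for a in arms:
            index = 1.0                   # a-priori upper bound on any payoff
            for (xh, ah), (n, s) in stats.items():
                conf = math.sqrt(2.0 * math.log(t + 1) / n)
                # transfer the (xh, ah) estimate through the similarity distance
                index = min(index, s / n + conf + dist((x, a), (xh, ah)))
            if index > best_index:
                best_arm, best_index = a, index
        r = pull(x, best_arm, rng)
        n, s = stats.get((x, best_arm), (0, 0.0))
        stats[(x, best_arm)] = (n + 1, s + r)
        total += r
    return total
```

In a toy run with two contexts, two arms, and a deterministic payoff 1 − |x − a|, the policy settles on the matching arm for each context apart from occasional exploratory rounds.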
On the fundamental limits of adaptive sensing
, 2011
Abstract

Cited by 25 (3 self)
Suppose we can sequentially acquire arbitrary linear measurements of an n-dimensional vector x resulting in the linear model y = Ax + z, where z represents measurement noise. If the signal is known to be sparse, one would expect the following folk theorem to be true: choosing an adaptive strategy which cleverly selects the next row of A based on what has been previously observed should do far better than a non-adaptive strategy which sets the rows of A ahead of time, thus not trying to learn anything about the signal in between observations. This paper shows that the folk theorem is false. We prove that the advantages offered by clever adaptive strategies and sophisticated estimation procedures—no matter how intractable—over classical compressed acquisition/recovery schemes are, in general, minimal.
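The measurement model in this abstract is easy to state in code. A minimal sketch (the function name and interface are illustrative, not from the paper): the only difference between the two regimes is whether the rows of A are committed to in advance (non-adaptive) or chosen one at a time after inspecting earlier measurements (adaptive).

```python
import random

def sense(x, rows, noise_sigma, rng):
    """Acquire linear measurements y_i = <a_i, x> + z_i of a vector x.
    The rows a_i may be fixed ahead of time (non-adaptive sensing) or
    chosen sequentially after looking at earlier measurements (adaptive)."""
    measurements = []
    for a in rows:
        y = sum(ai * xi for ai, xi in zip(a, x)) + rng.gauss(0.0, noise_sigma)
        measurements.append(y)
    return measurements
```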
Multiclass classification with bandit feedback using adaptive regularization
 In ICML
, 2011
Abstract

Cited by 20 (5 self)
We present a new multiclass algorithm in the bandit framework, where after making a prediction, the learning algorithm receives only partial feedback, i.e., a single bit of right-or-wrong, rather than the true label. Our algorithm is based on the second-order Perceptron, and uses upper-confidence bounds to trade off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where instances are chosen adversarially, while the labels are chosen according to a linear probabilistic model, which is also chosen adversarially. We show a regret of O(√(T log T)), which improves over the current best bounds of O(T^(2/3)) in the fully adversarial setting. We evaluate our algorithm on nine real-world text classification problems, obtaining state-of-the-art results, even compared with non-bandit online algorithms, especially when label noise is introduced.
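To make the feedback model concrete, here is the simpler first-order Banditron of Kakade et al. (2008), not the second-order algorithm of this abstract: the learner announces a label, observes only whether it was right, and uses importance weighting to build an unbiased Perceptron-style update from that single bit.

```python
import random

def banditron(examples, n_classes, gamma=0.2, seed=0):
    """Banditron (Kakade et al., 2008): multiclass prediction when the
    learner sees only whether its announced label was correct.

    With probability ~gamma the announced label is drawn uniformly;
    importance weighting keeps the resulting update unbiased."""
    rng = random.Random(seed)
    d = len(examples[0][0])
    W = [[0.0] * d for _ in range(n_classes)]
    mistakes = 0
    for x, y in examples:
        scores = [sum(wi * xi for wi, xi in zip(W[k], x))
                  for k in range(n_classes)]
        y_hat = max(range(n_classes), key=lambda k: scores[k])
        probs = [gamma / n_classes + (1.0 - gamma) * (k == y_hat)
                 for k in range(n_classes)]
        y_tilde = rng.choices(range(n_classes), weights=probs)[0]
        correct = (y_tilde == y)          # the single bit of bandit feedback
        if not correct:
            mistakes += 1
        for i in range(d):                # unbiased Perceptron-style update
            W[y_hat][i] -= x[i]
            if correct:
                W[y_tilde][i] += x[i] / probs[y_tilde]
    return mistakes
```

On easy, linearly separable data the mistake count stays well below the number of rounds; residual mistakes come mostly from the forced exploration.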
The Multi-armed Bandit Problem with Covariates
, 2013
Abstract

Cited by 7 (1 self)
We consider a multi-armed bandit problem in a setting where each arm produces a noisy reward realization which depends on an observable random covariate. As opposed to the traditional static multi-armed bandit problem, this setting allows for dynamically changing rewards that better describe applications where side information is available. We adopt a nonparametric model where the expected rewards are smooth functions of the covariate and where the hardness of the problem is captured by a margin parameter. To maximize the expected cumulative reward, we introduce a policy called Adaptively Binned Successive Elimination (ABSE) that adaptively decomposes the global problem into suitably “localized” static bandit problems. This policy constructs an adaptive partition using a variant of the Successive Elimination (SE) policy. Our results include sharper regret bounds for the SE policy in a static bandit problem and minimax optimal regret bounds for the ABSE policy in the dynamic problem.
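The Successive Elimination building block mentioned above can be sketched in a few lines. This is the generic static-bandit version with an illustrative confidence radius, not the paper's ABSE policy: sample every surviving arm round-robin and permanently drop any arm whose upper confidence bound falls below the empirical leader's lower confidence bound.

```python
import math
import random

def successive_elimination(means, horizon, delta=0.05, seed=0):
    """Successive Elimination on a static Bernoulli bandit: sample every
    surviving arm once per pass, then permanently eliminate any arm whose
    upper confidence bound falls below the empirical leader's lower bound.
    Returns the arms still alive after `horizon` total pulls."""
    rng = random.Random(seed)
    K = len(means)
    alive = list(range(K))
    counts, sums = [0] * K, [0.0] * K
    t = 0
    while t < horizon:
        for a in alive:                   # one pass over surviving arms
            if t >= horizon:
                break
            counts[a] += 1
            sums[a] += 1.0 if rng.random() < means[a] else 0.0
            t += 1

        def radius(a):                    # illustrative any-time radius
            return math.sqrt(math.log(4.0 * K * counts[a] ** 2 / delta)
                             / (2.0 * counts[a]))

        best = max(alive, key=lambda a: sums[a] / counts[a])
        alive = [a for a in alive
                 if sums[a] / counts[a] + radius(a)
                    >= sums[best] / counts[best] - radius(best)]
    return alive
```

With a large gap between the best arm and the rest, only the optimal arm survives well before a few thousand pulls are spent.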
Bounded regret in stochastic multi-armed bandits
 JMLR: Workshop and Conference Proceedings (2013), 1–13
, 2013
Abstract

Cited by 5 (2 self)
We study the stochastic multi-armed bandit problem when one knows the value µ⋆ of an optimal arm, as well as a positive lower bound on the smallest positive gap ∆. We propose a new randomized policy that attains a regret uniformly bounded over time in this setting. We also prove several lower bounds, which show in particular that bounded regret is not possible if one only knows ∆, and bounded regret of order 1/∆ is not possible if one only knows µ⋆.
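A rough illustration of why knowing both µ⋆ and ∆ helps (this is a hypothetical deterministic-switching sketch, not the paper's randomized policy): an arm whose empirical mean is confidently below µ⋆ − ∆/2 cannot be optimal and can be discarded outright, so exploration can stop entirely once an optimal arm is found.

```python
import math
import random

def play_with_known_mu_star(pull, n_arms, mu_star, Delta, horizon, seed=0):
    """Hypothetical policy exploiting knowledge of the optimal mean mu_star
    and a lower bound Delta on the smallest positive gap: keep playing the
    current arm while its empirical mean remains statistically consistent
    with mu_star; once it is confidently below mu_star - Delta / 2 it
    cannot be optimal, so move on to the next arm."""
    rng = random.Random(seed)
    order = list(range(n_arms))
    rng.shuffle(order)
    total, idx, n, s = 0.0, 0, 0, 0.0
    for _ in range(horizon):
        a = order[idx % n_arms]
        r = pull(a, rng)
        total += r
        n, s = n + 1, s + r
        conf = math.sqrt(math.log(2.0 * (n + 1) ** 2) / (2.0 * n))
        if s / n + conf < mu_star - Delta / 2.0:
            idx, n, s = idx + 1, 0, 0.0   # arm rejected for good
    return total
```

A suboptimal arm is abandoned after a number of pulls that does not grow with the horizon, which is the intuition behind regret bounded uniformly over time.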
Ranked Bandits in Metric Spaces: Learning Diverse Rankings over Large Document Collections
, 2013
Abstract

Cited by 3 (0 self)
Most learning to rank research has assumed that the utility of different documents is independent, which results in learned ranking functions that return redundant results. The few approaches that avoid this have rather unsatisfyingly lacked theoretical foundations, or do not scale. We present a learning-to-rank formulation that optimizes the fraction of satisfied users, with several scalable algorithms that explicitly take document similarity and ranking context into account. Our formulation is a non-trivial common generalization of two multi-armed bandit models from the literature: ranked bandits (Radlinski et al., 2008) and Lipschitz bandits (Kleinberg et al., 2008b). We present theoretical justifications for this approach, as well as a near-optimal algorithm. Our evaluation adds optimizations that improve empirical performance, and shows that our algorithms learn orders of magnitude more quickly than previous approaches.
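One of the two ingredients being generalized, ranked bandits (Radlinski et al., 2008), is easy to sketch: run one independent bandit per rank position and credit a position's bandit only when its document receives the first click. The UCB1 index and the cascade-style click simulation below are illustrative choices, not details from this paper.

```python
import math
import random

def ranked_bandits(click_prob, n_positions, horizon, seed=0):
    """Ranked bandits: one UCB1 instance per rank position. Each round
    every position proposes a not-yet-used document; a position's bandit
    is credited only if its document receives the first click.
    `click_prob` (doc -> probability) defines a toy user who scans
    top-down and clicks the first satisfying document."""
    rng = random.Random(seed)
    docs = list(click_prob)
    counts = [{d: 0 for d in docs} for _ in range(n_positions)]
    sums = [{d: 0.0 for d in docs} for _ in range(n_positions)]

    def pick(i, used, t):
        def index(d):                     # UCB1 index for position i
            if counts[i][d] == 0:
                return float("inf")
            mean = sums[i][d] / counts[i][d]
            return mean + math.sqrt(2.0 * math.log(t) / counts[i][d])
        return max((d for d in docs if d not in used), key=index)

    for t in range(1, horizon + 1):
        ranking, used = [], set()
        for i in range(n_positions):
            d = pick(i, used, t)
            ranking.append(d)
            used.add(d)
        # toy user: first satisfying document gets the click
        clicked = next((d for d in ranking if rng.random() < click_prob[d]),
                       None)
        for i, d in enumerate(ranking):
            counts[i][d] += 1
            sums[i][d] += 1.0 if d == clicked else 0.0

    # greedy ranking from the learned empirical means
    ranking, used = [], set()
    for i in range(n_positions):
        d = max((x for x in docs if x not in used),
                key=lambda x: sums[i][x] / max(counts[i][x], 1))
        ranking.append(d)
        used.add(d)
    return ranking
```

Because each position's bandit ignores document similarity, this baseline is exactly what the abstract's Lipschitz-bandit generalization is designed to improve on.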
Contextual Multi-armed Bandits for Web Server Defense
Abstract
In this paper we argue that contextual multi-armed bandit algorithms could open avenues for designing self-learning security modules for computer networks and related tasks. The paper has two contributions: a conceptual and an algorithmic one. The conceptual contribution is to formulate the real-world problem of preventing HTTP-based attacks on web servers as a one-shot sequential learning problem, namely as a contextual multi-armed bandit. Our second contribution is to present CMAB-FAS, a new and computationally very cheap algorithm for general contextual multi-armed bandit learning that specifically targets domains with finite actions. We illustrate how CMAB-FAS could be used to design a fully self-learning meta filter for web servers that does not rely on feedback from the end-user (i.e., does not require labeled data) and report first convincing simulation results.