Results 1 - 10 of 11
Contextual Bandits with Similarity Information
24th Annual Conference on Learning Theory, 2011
"... In a multi-armed bandit (MAB) problem, an online algorithm makes a sequence of choices. In each round it chooses from a time-invariant set of alternatives and receives the payoff associated with this alternative. While the case of small strategy sets is by now wellunderstood, a lot of recent work ha ..."
Abstract
-
Cited by 57 (9 self)
In a multi-armed bandit (MAB) problem, an online algorithm makes a sequence of choices. In each round it chooses from a time-invariant set of alternatives and receives the payoff associated with this alternative. While the case of small strategy sets is by now well understood, a lot of recent work has focused on MAB problems with exponentially or infinitely large strategy sets, where one needs to assume extra structure in order to make the problem tractable. In particular, recent literature considered information on similarity between arms. We consider similarity information in the setting of contextual bandits, a natural extension of the basic MAB problem where before each round an algorithm is given the context – a hint about the payoffs in this round. Contextual bandits are directly motivated by placing advertisements on webpages, one of the crucial problems in sponsored search. A particularly simple way to represent similarity information in the contextual bandit setting is via a similarity distance between the context-arm pairs which bounds from above the difference between the respective expected payoffs. Prior work …
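A minimal sketch in the spirit of this similarity assumption (not the paper's own algorithm): a UCB policy on a uniform discretization of a [0,1] context space and a [0,1] arm space, appropriate when the similarity distance upper-bounds differences in expected payoff. The bin count, noise level, and toy Lipschitz payoff function are illustrative assumptions.

```python
import numpy as np

def uniform_grid_ucb(payoff, T=10_000, bins=10, seed=0):
    rng = np.random.default_rng(seed)
    counts = np.zeros((bins, bins))          # pulls per (context bin, arm bin)
    means = np.zeros((bins, bins))           # empirical mean payoff per cell
    total = 0.0
    for t in range(1, T + 1):
        x = rng.random()                     # context revealed before the round
        i = min(int(x * bins), bins - 1)
        radius = np.sqrt(2.0 * np.log(t) / np.maximum(counts[i], 1.0))
        ucb = means[i] + radius              # optimistic index per arm bin
        ucb[counts[i] == 0] = np.inf         # force each arm bin to be tried once
        j = int(np.argmax(ucb))
        y = (j + 0.5) / bins                 # play the center of the chosen bin
        r = payoff(x, y) + 0.1 * rng.standard_normal()   # noisy payoff
        counts[i, j] += 1
        means[i, j] += (r - means[i, j]) / counts[i, j]
        total += r
    return total / T

# Toy 1-Lipschitz payoff: the best arm tracks the context.
print(uniform_grid_ucb(lambda x, y: 1.0 - abs(x - y)))
```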
On the fundamental limits of adaptive sensing
2011
"... Suppose we can sequentially acquire arbitrary linear measurements of an n-dimensional vector x resulting in the linear model y = Ax + z, where z represents measurement noise. If the signal is known to be sparse, one would expect the following folk theorem to be true: choosing an adaptive strategy wh ..."
Abstract
-
Cited by 25 (3 self)
Suppose we can sequentially acquire arbitrary linear measurements of an n-dimensional vector x resulting in the linear model y = Ax + z, where z represents measurement noise. If the signal is known to be sparse, one would expect the following folk theorem to be true: choosing an adaptive strategy which cleverly selects the next row of A based on what has been previously observed should do far better than a nonadaptive strategy which sets the rows of A ahead of time, thus not trying to learn anything about the signal in between observations. This paper shows that the folk theorem is false. We prove that the advantages offered by clever adaptive strategies and sophisticated estimation procedures—no matter how intractable—over classical compressed acquisition/recovery schemes are, in general, minimal.
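A small illustration of the measurement model described in the abstract, assuming a nonadaptive Gaussian sensing matrix: a k-sparse signal is observed through y = Ax + z and decoded by least squares on the (oracle) support. The dimensions, noise level, and oracle decoder are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 500, 100, 5                              # ambient dim, measurements, sparsity
x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = 5.0 * rng.standard_normal(k)          # k nonzero entries

A = rng.standard_normal((m, n)) / np.sqrt(m)       # nonadaptive rows, fixed in advance
z = 0.1 * rng.standard_normal(m)                   # measurement noise
y = A @ x + z                                      # the linear model from the abstract

# Oracle decoder that knows the support (an upper bound on any practical scheme).
x_hat = np.zeros(n)
x_hat[support] = np.linalg.lstsq(A[:, support], y, rcond=None)[0]
print("relative error:", np.linalg.norm(x_hat - x) / np.linalg.norm(x))
```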
Multiclass classification with bandit feedback using adaptive regularization
In ICML, 2011
"... We present a new multiclass algorithm in the bandit framework, where after making a prediction, the learning algorithm receives only partial feedback, i.e., a single bit of right-orwrong, rather then the true label. Our algorithm is based on the 2nd-order Perceptron, and uses upper-confidence bounds ..."
Abstract
-
Cited by 20 (5 self)
We present a new multiclass algorithm in the bandit framework, where after making a prediction, the learning algorithm receives only partial feedback, i.e., a single bit of right-or-wrong, rather than the true label. Our algorithm is based on the second-order Perceptron, and uses upper-confidence bounds to trade off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where instances are chosen adversarially, while the labels are chosen according to a linear probabilistic model, which is also chosen adversarially. We show a regret of O(√T log T), which improves over the current best bounds of O(T^{2/3}) in the fully adversarial setting. We evaluate our algorithm on nine real-world text classification problems, obtaining state-of-the-art results, even compared with non-bandit online algorithms, especially when label noise is introduced.
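A hedged sketch of the general approach named here (second-order linear predictors with an upper-confidence exploration bonus under one-bit feedback), not the paper's exact algorithm: each class keeps a regularized least-squares estimate, the predicted class is the one with the highest optimistic score, and only the predicted class is updated. The bonus scale, regularization, and synthetic data are illustrative assumptions.

```python
import numpy as np

def bandit_multiclass(X, labels, num_classes, alpha=1.0, lam=1.0):
    d = X.shape[1]
    A = [lam * np.eye(d) for _ in range(num_classes)]   # per-class correlation matrices
    b = [np.zeros(d) for _ in range(num_classes)]
    mistakes = 0
    for x, y in zip(X, labels):
        scores = []
        for c in range(num_classes):
            Ainv = np.linalg.inv(A[c])
            w = Ainv @ b[c]                              # ridge-regression weight vector
            bonus = alpha * np.sqrt(x @ Ainv @ x)        # confidence width for this class
            scores.append(w @ x + bonus)                 # optimistic score
        pred = int(np.argmax(scores))
        correct = (pred == y)                            # the single bit of feedback
        mistakes += not correct
        # second-order update for the predicted class only (bandit feedback)
        A[pred] += np.outer(x, x)
        b[pred] += x if correct else -x
    return mistakes

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 10))
X = rng.standard_normal((2000, 10))
labels = np.argmax(X @ W.T, axis=1)                      # linear synthetic labels
print("mistakes:", bandit_multiclass(X, labels, num_classes=3))
```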
The Multi-Armed Bandit Problem with Covariates
2013
"... We consider a multi-armed bandit problem in a setting where each arm produces a noisy reward realization which depends on an observable random covariate. As opposed to the traditional static multi-armed bandit prob-lem, this setting allows for dynamically changing rewards that better describe applic ..."
Abstract
-
Cited by 7 (1 self)
We consider a multi-armed bandit problem in a setting where each arm produces a noisy reward realization which depends on an observable random covariate. As opposed to the traditional static multi-armed bandit problem, this setting allows for dynamically changing rewards that better describe applications where side information is available. We adopt a nonparametric model where the expected rewards are smooth functions of the covariate and where the hardness of the problem is captured by a margin parameter. To maximize the expected cumulative reward, we introduce a policy called Adaptively Binned Successive Elimination (ABSE) that adaptively decomposes the global problem into suitably “localized” static bandit problems. This policy constructs an adaptive partition using a variant of the Successive Elimination (SE) policy. Our results include sharper regret bounds for the SE policy in a static bandit problem and minimax optimal regret bounds for the ABSE policy in the dynamic problem.
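A minimal sketch of the static Successive Elimination building block named in this abstract (not the full adaptively binned ABSE policy): arms are sampled in round-robin sweeps and dropped once their confidence interval falls entirely below the best arm's lower bound. The confidence radius and the Bernoulli arms are illustrative choices.

```python
import numpy as np

def successive_elimination(means, T=20_000, seed=0):
    rng = np.random.default_rng(seed)
    K = len(means)
    active = list(range(K))
    counts = np.zeros(K)
    est = np.zeros(K)
    reward, t = 0.0, 0
    while t < T:
        for a in active:                                 # one sweep over surviving arms
            if t >= T:
                break
            r = float(rng.random() < means[a])           # Bernoulli reward draw
            counts[a] += 1
            est[a] += (r - est[a]) / counts[a]
            reward += r
            t += 1
        # eliminate arms whose upper confidence bound is below the best lower bound
        rad = np.sqrt(np.log(T) / np.maximum(counts, 1.0))
        best_lower = max(est[a] - rad[a] for a in active)
        active = [a for a in active if est[a] + rad[a] >= best_lower]
    return reward / T

print(successive_elimination([0.3, 0.5, 0.7]))
```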
Bounded regret in stochastic multi-armed bandits
JMLR: Workshop and Conference Proceedings (2013) 1–13
"... We study the stochastic multi-armed bandit problem when one knows the valueµ (⋆) of an optimal arm, as a well as a positive lower bound on the smallest positive gap∆. We propose a new randomized policy that attains a regret uniformly bounded over time in this setting. We also prove several lower bou ..."
Abstract
-
Cited by 5 (2 self)
We study the stochastic multi-armed bandit problem when one knows the value µ⋆ of an optimal arm, as well as a positive lower bound on the smallest positive gap ∆. We propose a new randomized policy that attains a regret uniformly bounded over time in this setting. We also prove several lower bounds, which show in particular that bounded regret is not possible if one only knows ∆, and bounded regret of order 1/∆ is not possible if one only knows µ⋆.
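An illustrative sketch of the kind of policy this knowledge enables, not the paper's randomized policy: with µ⋆ and a gap lower bound eps known in advance, sample arms round-robin and commit to an arm once its empirical mean looks indistinguishable from µ⋆. The commitment threshold, minimum sample count, and Gaussian noise model are assumptions, and the simple rule never revisits a commitment.

```python
import numpy as np

def known_optimum_policy(arm_means, mu_star, eps, T=5_000, seed=0):
    rng = np.random.default_rng(seed)
    K = len(arm_means)
    counts = np.zeros(K)
    est = np.zeros(K)
    committed = None
    pulls = []
    for t in range(T):
        a = committed if committed is not None else t % K   # round-robin until commit
        r = arm_means[a] + 0.1 * rng.standard_normal()       # noisy reward
        counts[a] += 1
        est[a] += (r - est[a]) / counts[a]
        pulls.append(a)
        if committed is None and counts[a] >= 20 and abs(est[a] - mu_star) < eps / 2:
            committed = a                                    # looks optimal: stop exploring
    return pulls

pulls = known_optimum_policy([0.4, 0.6, 0.9], mu_star=0.9, eps=0.3)
print("fraction of pulls on the best arm:", np.mean(np.array(pulls) == 2))
```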
Ranked Bandits in Metric Spaces: Learning Diverse Rankings over Large Document Collections
2013
"... Most learning to rank research has assumed that the utility of different documents is independent, which results in learned ranking functions that return redundant results. The few approaches that avoid this have rather unsatisfyingly lacked theoretical foundations, or do not scale. We present a lea ..."
Abstract
-
Cited by 3 (0 self)
Most learning to rank research has assumed that the utility of different documents is independent, which results in learned ranking functions that return redundant results. The few approaches that avoid this have rather unsatisfyingly lacked theoretical foundations, or do not scale. We present a learning-to-rank formulation that optimizes the fraction of satisfied users, with several scalable algorithms that explicitly take document similarity and ranking context into account. Our formulation is a non-trivial common generalization of two multi-armed bandit models from the literature: ranked bandits (Radlinski et al., 2008) and Lipschitz bandits (Kleinberg et al., 2008b). We present theoretical justifications for this approach, as well as a near-optimal algorithm. Our evaluation adds optimizations that improve empirical performance, and shows that our algorithms learn orders of magnitude more quickly than previous approaches.
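A sketch of the basic ranked-bandits construction cited in the abstract (Radlinski et al., 2008), without this paper's metric-space machinery: each rank position runs its own epsilon-greedy bandit over documents, and a position is rewarded only when its document receives the user's first click. The click model, epsilon, and document set are illustrative assumptions.

```python
import numpy as np

def ranked_bandits(click_prob, k=3, T=20_000, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = len(click_prob)
    counts = np.zeros((k, n))
    est = np.zeros((k, n))
    satisfied = 0
    for _ in range(T):
        ranking, used = [], set()
        for pos in range(k):                             # one bandit per rank position
            if rng.random() < eps:
                doc = int(rng.integers(n))               # explore
            else:
                doc = int(np.argmax(est[pos]))           # exploit
            if doc in used:                              # avoid duplicate documents
                doc = next(d for d in range(n) if d not in used)
            ranking.append(doc)
            used.add(doc)
        rewards = np.zeros(k)
        for pos, doc in enumerate(ranking):              # user scans top-down
            if rng.random() < click_prob[doc]:
                rewards[pos] = 1.0                       # first clicked position wins
                satisfied += 1
                break
        for pos, doc in enumerate(ranking):              # update every position's bandit
            counts[pos, doc] += 1
            est[pos, doc] += (rewards[pos] - est[pos, doc]) / counts[pos, doc]
    return satisfied / T

print(ranked_bandits([0.05, 0.2, 0.5, 0.3, 0.1]))
```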
Contextual Multi-armed Bandits for Web Server Defense
"... Abstract—In this paper we argue that contextual multi-armed bandit algorithms could open avenues for designing self-learning security modules for computer networks and related tasks. The paper has two contributions: a conceptual and an algorithmical one. The conceptual contribution is to formulate t ..."
Abstract
- Add to MetaCart
(Show Context)
In this paper we argue that contextual multi-armed bandit algorithms could open avenues for designing self-learning security modules for computer networks and related tasks. The paper has two contributions: a conceptual one and an algorithmic one. The conceptual contribution is to formulate the real-world problem of preventing HTTP-based attacks on web servers as a one-shot sequential learning problem, namely as a contextual multi-armed bandit. Our second contribution is to present CMABFAS, a new and computationally very cheap algorithm for general contextual multi-armed bandit learning that specifically targets domains with finite actions. We illustrate how CMABFAS could be used to design a fully self-learning meta filter for web servers that does not rely on feedback from the end-user (i.e., does not require labeled data) and report first convincing simulation results.
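A generic sketch of the finite-action contextual-bandit framing described here, not the CMABFAS algorithm itself (whose details are not given in this abstract): each incoming request is mapped to a discrete context, the filter picks one of a finite set of responses, and a reward signal drives learning. The context buckets, action set, and simulated reward table are all illustrative assumptions.

```python
import numpy as np

ACTIONS = ["allow", "block", "escalate"]                 # hypothetical filter responses

def web_filter_bandit(T=10_000, num_contexts=4, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Simulated probability that each action is the right response per context.
    true_reward = rng.uniform(0.1, 0.9, size=(num_contexts, len(ACTIONS)))
    counts = np.zeros_like(true_reward)
    est = np.zeros_like(true_reward)
    total = 0.0
    for _ in range(T):
        ctx = int(rng.integers(num_contexts))            # features of the HTTP request
        if rng.random() < eps:
            a = int(rng.integers(len(ACTIONS)))          # explore
        else:
            a = int(np.argmax(est[ctx]))                 # exploit current estimates
        r = float(rng.random() < true_reward[ctx, a])    # simulated feedback signal
        counts[ctx, a] += 1
        est[ctx, a] += (r - est[ctx, a]) / counts[ctx, a]
        total += r
    return total / T

print(web_filter_bandit())
```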