Results 1 
7 of
7
Combinatorial multiarmed bandit and its extension to probabilistically triggered arms.
 Journal of Machine Learning Research,
, 2016
"... Abstract We define a general framework for a large class of combinatorial multiarmed bandit (CMAB) problems, where subsets of base arms with unknown distributions form super arms. In each round, a super arm is played and the base arms contained in the super arm are played and their outcomes are ob ..."
Abstract

Cited by 5 (3 self)
 Add to MetaCart
Abstract We define a general framework for a large class of combinatorial multiarmed bandit (CMAB) problems, where subsets of base arms with unknown distributions form super arms. In each round, a super arm is played and the base arms contained in the super arm are played and their outcomes are observed. We further consider the extension in which more base arms could be probabilistically triggered based on the outcomes of already triggered arms. The reward of the super arm depends on the outcomes of all played arms, and it only needs to satisfy two mild assumptions, which allow a large class of nonlinear reward instances. We assume the availability of an offline (α, β)approximation oracle that takes the means of the outcome distributions of arms and outputs a super arm that with probability β generates an α fraction of the optimal expected reward. The objective of an online learning algorithm for CMAB is to minimize (α, β)approximation regret, which is the difference in total expected reward between the αβ fraction of expected reward when always playing the optimal super arm, and the expected reward of playing super arms according to the algorithm. We provide CUCB algorithm that achieves O(log n) distributiondependent regret, where n is the number of rounds played, and we further provide distributionindependent bounds for a large class of reward functions. Our regret analysis is tight in that it matches the bound of UCB1 algorithm (up to a constant factor) for the classical MAB problem, and it significantly improves the regret bound in an earlier paper on combinatorial bandits * . A preliminary version of this paper has appeared in ICML
Cascading bandits: Learning to rank in the cascade model
 In Proceedings of the 32nd International Conference on Machine Learning
"... A search engine usually outputs a list of K web pages. The user examines this list, from the first web page to the last, and chooses the first attractive page. This model of user behavior is known as the cascade model. In this paper, we propose cascading bandits, a learning variant of the cascade ..."
Abstract

Cited by 3 (1 self)
 Add to MetaCart
(Show Context)
A search engine usually outputs a list of K web pages. The user examines this list, from the first web page to the last, and chooses the first attractive page. This model of user behavior is known as the cascade model. In this paper, we propose cascading bandits, a learning variant of the cascade model where the objective is to identify K most attractive items. We formulate our problem as a stochastic combinatorial partial monitoring problem. We propose two algorithms for solving it, CascadeUCB1 and CascadeKLUCB. We also prove gapdependent upper bounds on the regret of these algorithms and derive a lower bound on the regret in cascading bandits. The lower bound matches the upper bound of CascadeKLUCB up to a logarithmic factor. We experiment with our algorithms on several problems. The algorithms perform surprisingly well even when our modeling assumptions are violated. 1.
Stochastic online greedy learning with semibandit feedbacks.
 In Advances in Neural Information Processing Systems,
, 2015
"... Abstract The greedy algorithm is extensively studied in the field of combinatorial optimization for decades. In this paper, we address the online learning problem when the input to the greedy algorithm is stochastic with unknown parameters that have to be learned over time. We first propose the gre ..."
Abstract

Cited by 2 (2 self)
 Add to MetaCart
(Show Context)
Abstract The greedy algorithm is extensively studied in the field of combinatorial optimization for decades. In this paper, we address the online learning problem when the input to the greedy algorithm is stochastic with unknown parameters that have to be learned over time. We first propose the greedy regret and quasi greedy regret as learning metrics comparing with the performance of offline greedy algorithm. We then propose two online greedy learning algorithms with semibandit feedbacks, which use multiarmed bandit and pure exploration bandit policies at each level of greedy learning, one for each of the regret metrics respectively. Both algorithms achieve O(log T ) problemdependent regret bound (T being the time horizon) for a general class of combinatorial structures and reward functions that allow greedy solutions. We further show that the bound is tight in T and other problem instance parameters.
Efficient learning in largescale combinatorial semibandits.
 In Proceedings of the 32nd International Conference on Machine Learning,
, 2015
"... Abstract A stochastic combinatorial semibandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to combinatorial constraints, and then observes stochastic weights of these items and receives their sum as a payoff. In this paper, we consider ..."
Abstract

Cited by 2 (1 self)
 Add to MetaCart
(Show Context)
Abstract A stochastic combinatorial semibandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to combinatorial constraints, and then observes stochastic weights of these items and receives their sum as a payoff. In this paper, we consider efficient learning in largescale combinatorial semibandits with linear generalization, and as a solution, propose two learning algorithms called Combinatorial Linear Thompson Sampling (CombLinTS) and Combinatorial Linear UCB (CombLinUCB). Both algorithms are computationally efficient as long as the offline version of the combinatorial problem can be solved efficiently. We establish that CombLinTS and CombLinUCB are also provably statistically efficient under reasonable assumptions, by developing regret bounds that are independent of the problem scale (number of items) and sublinear in time. We also evaluate CombLinTS on a variety of problems with thousands of items. Our experiment results demonstrate that CombLinTS is scalable, robust to the choice of algorithm parameters, and significantly outperforms the best of our baselines.
MultiplePlay Bandits in the PositionBased Model
"... Abstract Sequentially learning to place items in multiposition displays or lists is a task that can be cast into the multipleplay semibandit setting. However, a major concern in this context is when the system cannot decide whether the user feedback for each item is actually exploitable. Indeed, ..."
Abstract
 Add to MetaCart
Abstract Sequentially learning to place items in multiposition displays or lists is a task that can be cast into the multipleplay semibandit setting. However, a major concern in this context is when the system cannot decide whether the user feedback for each item is actually exploitable. Indeed, much of the content may have been simply ignored by the user. The present work proposes to exploit available information regarding the display position bias under the socalled Positionbased click model (PBM). We first discuss how this model differs from the Cascade model and its variants considered in several recent works on multipleplay bandits. We then provide a novel regret lower bound for this model as well as computationally efficient algorithms that display good empirical and theoretical performance.
Playing Repeated Security Games with No Prior Knowledge *
"... ABSTRACT This paper investigates repeated security games with unknown (to the defender) game payoffs and attacker behaviors. As existing work assumes prior knowledge about either the game payoffs or the attacker's behaviors, they are not suitable for tackling our problem. Given this, we propos ..."
Abstract
 Add to MetaCart
(Show Context)
ABSTRACT This paper investigates repeated security games with unknown (to the defender) game payoffs and attacker behaviors. As existing work assumes prior knowledge about either the game payoffs or the attacker's behaviors, they are not suitable for tackling our problem. Given this, we propose the first efficient defender strategy, based on an adversarial online learning framework, that can provably achieve good performance guarantees without any prior knowledge. In particular, we prove that our algorithm can achieve low performance loss against the best fixed strategy on hindsight (i.e., having full knowledge of the attacker's moves). In addition, we prove that our algorithm can achieve an efficient competitive ratio against the optimal adaptive defender strategy. We also show that for zerosum security games, our algorithm achieves efficient results in approximating a number of solution concepts, such as algorithmic equilibria and the minimax value. Finally, our extensive numerical results demonstrate that, without having any prior information, our algorithm still achieves good performance, compared to stateoftheart algorithms from the literature on security games, such as SUQR
Contextual Combinatorial Cascading Bandits
"... Abstract We propose the contextual combinatorial cascading bandits, a combinatorial online learning game, where at each time step a learning agent is given a set of contextual information, then selects a list of items, and observes stochastic outcomes of a prefix in the selected items by some stopp ..."
Abstract
 Add to MetaCart
Abstract We propose the contextual combinatorial cascading bandits, a combinatorial online learning game, where at each time step a learning agent is given a set of contextual information, then selects a list of items, and observes stochastic outcomes of a prefix in the selected items by some stopping criterion. In online recommendation, the stopping criterion might be the first item a user selects; in network routing, the stopping criterion might be the first edge blocked in a path. We consider position discounts in the list order, so that the agent's reward is discounted depending on the position where the stopping criterion is met. We design a UCBtype algorithm, C 3 UCB, for this problem, prove an nstep regret bound O( √ n) in the general setting, and give finer analysis for two special cases. Our work generalizes existing studies in several directions, including contextual information, position discounts, and a more general cascading bandit model. Experiments on synthetic and real datasets demonstrate the advantage of involving contextual information and position discounts.