## Multi-armed Bandit Problems with Dependent Arms (2007)

Venue: Proceedings of the 24th International Conference on Machine Learning

Citations: 21 (1 self)

### BibTeX

    @INPROCEEDINGS{Pandey07multi-armedbandit,
        author = {Sandeep Pandey and Deepayan Chakrabarti and Deepak Agarwal},
        title = {Multi-armed Bandit Problems with Dependent Arms},
        booktitle = {Proceedings of the 24th International Conference on Machine Learning},
        year = {2007}
    }

### Abstract

We provide a framework to exploit dependencies among arms in multi-armed bandit problems when the dependencies take the form of a generative model on clusters of arms. We find an optimal MDP-based policy for the discounted reward case and also give an approximation of it with a formal error guarantee. We discuss lower bounds on regret in the undiscounted reward scenario and propose a general two-level bandit policy for it. We propose three different instantiations of our general policy and provide theoretical justifications of how the regret of the instantiated policies depends on the characteristics of the clusters. Finally, we empirically demonstrate the efficacy of our policies on large-scale real-world and synthetic data, and show that they significantly outperform classical policies designed for bandits with independent arms.
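To make the idea of a two-level bandit policy concrete, the following is a minimal sketch, not the authors' implementation. It assumes Bernoulli arms, summarizes each cluster by its pooled success rate, and uses the UCB1 index at both levels; the function names, the input format, and the pooled-mean summary are all illustrative assumptions rather than details taken from the paper.

```python
import math
import random


def ucb1_index(successes, pulls, total_pulls):
    """UCB1 index: empirical mean plus an exploration bonus."""
    if pulls == 0:
        return float("inf")  # force at least one pull of every option
    return successes / pulls + math.sqrt(2.0 * math.log(total_pulls) / pulls)


def two_level_ucb(clusters, horizon, rng):
    """Run a two-level UCB1 policy on Bernoulli arms (illustrative only).

    clusters: list of lists of success probabilities (hypothetical format).
    Returns (total_reward, pulls); pulls[c][a] counts pulls of arm a in cluster c.
    """
    succ = [[0] * len(c) for c in clusters]
    pulls = [[0] * len(c) for c in clusters]
    total_reward = 0
    for t in range(1, horizon + 1):
        # Level 1: treat each cluster as a single arm with pooled statistics.
        c = max(
            range(len(clusters)),
            key=lambda j: ucb1_index(sum(succ[j]), sum(pulls[j]), t),
        )
        # Level 2: run UCB1 among the arms of the chosen cluster only.
        n_c = sum(pulls[c]) + 1
        a = max(
            range(len(clusters[c])),
            key=lambda i: ucb1_index(succ[c][i], pulls[c][i], n_c),
        )
        reward = 1 if rng.random() < clusters[c][a] else 0
        succ[c][a] += reward
        pulls[c][a] += 1
        total_reward += reward
    return total_reward, pulls
```

Calling `two_level_ucb([[0.1, 0.2], [0.8, 0.9]], 2000, random.Random(0))` runs the sketch on two clusters of two arms each. Because the first level compares clusters rather than individual arms, exploration effort is spent on a few clusters before any single arm is examined in detail, which is where a clustered structure can help.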

### Citations

1198 | Markov Decision Processes: Discrete Stochastic Dynamic Programming
- Puterman
- 1994

Citation Context: ... policy measures the loss it incurs compared to a policy that always pulls the optimal arm, i.e., the arm with the highest θ_i. Next, we give an equivalent formulation of our dependent bandit problem (Puterman, 2005), which is later used in deriving the optimal solution. Equivalent State-space formulation: Associated with each arm i at time t is a state x_i(t) containing sufficient statistics for the posterior di...

324 | Information-based objective functions for active data selection
- MacKay
- 1992

Citation Context: ..., where the goal is typically to build a classifier over the entire input space by sequentially choosing new examples to get labeled from portions of the space where the classifier confidence is low (MacKay, 1992; Schneider & Moore, 2002). The number of examples needed to achieve a given prediction accuracy has been studied thoroughly (Dasgupta, 2005). However, dependent arms have not been studied in this con...

266 | Asymptotically efficient adaptive allocation rules
- Lai, Robbins
- 1985

Citation Context: ...to be independent of each other. The objective is to pull arms sequentially so as to maximize the total reward. Many policies have been proposed for this problem under the independent-arm assumption (Lai & Robbins, 1985; Auer et al., 2002). In this paper we drop this assumption and focus on the bandit problem where the arms are dependent. For example, consider a simple bandit instance which has 3 arms, with succes...

89 | Coarse sample complexity bounds for active learning
- Dasgupta
- 2005

Citation Context: ...portions of the space where the classifier confidence is low (MacKay, 1992; Schneider & Moore, 2002). The number of examples needed to achieve a given prediction accuracy has been studied thoroughly (Dasgupta, 2005). However, dependent arms have not been studied in this context. [Figure 1. State evolution in the dependent b...]

71 | Sample mean based index policies with O(log n) regret for the multi-armed bandit problem
- Agrawal
- 1995

Citation Context: ...(log T)^2) for a large class of priors, where T is the total number of arm pulls (Lai & Robbins, 1985; Lai, 1987). Policies to achieve the lower bound have also been developed (Lai & Robbins, 1985; Agrawal, 1995; Auer et al., 2002; Kocsis & Szepesvári, 2006). In particular, the UCB1 scheme (Auer et al., 2002) achieves the O(log T) bound on regret uniformly instead of asymptotically. The dependent bandit...

34 | Adaptive treatment allocation and the multi-armed bandit problem
- Lai
- 1987

Citation Context: ...nd on regret has been shown to be Ω(log T) while the average Bayes risk is bounded below by Ω((log T)^2) for a large class of priors, where T is the total number of arm pulls (Lai & Robbins, 1985; Lai, 1987). Policies to achieve the lower bound have also been developed (Lai & Robbins, 1985; Agrawal, 1995; Auer et al., 2002; Kocsis & Szepesvári, 2006). In particular, the UCB1 scheme (Auer et al., 200...

24 | Bandit problems with side observations
- Wang, Kulkarni, et al.
- 2005

Citation Context: ...e UCB1 scheme (Auer et al., 2002) achieves the O(log T) bound on regret uniformly instead of asymptotically. The dependent bandit problem is also related to bandit problems with side observations (Wang et al., 2005). However, the latter assumes a separate process {X_t} that provides additional information about the reward process at each time point; no such separate information about the reward process is present...

17 | Bandits for taxonomies: A model-based approach
- Pandey, Agarwal, et al.
- 2007

Citation Context: ...) is unknown, TLP still uses the fact that the arms are partitioned into clusters, and performs well as a result. We note that one specific instance of TLP was proposed in (Kocsis & Szepesvári, 2006; Pandey et al., 2007) but our work is significantly different: (1) our TLP formulation is more general, (2) we propose other instantiations of the general TLP, and (3) we investigate the performance of this model in term...

14 | Four proofs of Gittins’ multi-armed bandit theorem. Applied Probability Trust
- Frostig, Weiss
- 1999

Citation Context: ...d problem that maximizes the expected total discounted reward is obtained by decoupling and solving k independent one-armed problems, dramatically reducing the dimension of the state space (see also (Frostig & Weiss, 1999)). For the finite horizon undiscounted reward scenario, the asymptotic lower bound on regret has been shown to be Ω(log T) while the average Bayes risk is bounded below by Ω((log T)^2) for a large...

8 | Optimal stopping and dynamic allocation
- Chang, Lai
- 1987

Citation Context: ...oblems are not just specific to our MDP-based policy; even the best known approximations for Gittins’ index policy under the independence assumption break down when observations are few and α > 0.95 (Chang & Lai, 1987). Such long time horizons are better handled under the undiscounted reward scenario; indeed, several policies for undiscounted reward actually approximate the Gittins’ index for discounted reward, in...

6 | Finite-time analysis of the multiarmed bandit problem
- Auer, Cesa-Bianchi, et al.
- 2002

2 | Bandit based Monte-Carlo planning. ECML
- Kocsis, Szepesvári
- 2006

Citation Context: ...iors, where T is the total number of arm pulls (Lai & Robbins, 1985; Lai, 1987). Policies to achieve the lower bound have also been developed (Lai & Robbins, 1985; Agrawal, 1995; Auer et al., 2002; Kocsis & Szepesvári, 2006). In particular, the UCB1 scheme (Auer et al., 2002) achieves the O(log T) bound on regret uniformly instead of asymptotically. The dependent bandit problem is also related to bandit problems with...

2 | Active learning in discrete input spaces. The 34th Interface Symposium
- Schneider, Moore
- 2002

Citation Context: ...al is typically to build a classifier over the entire input space by sequentially choosing new examples to get labeled from portions of the space where the classifier confidence is low (MacKay, 1992; Schneider & Moore, 2002). The number of examples needed to achieve a given prediction accuracy has been studied thoroughly (Dasgupta, 2005). However, dependent arms have not been studied in this context. ...
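
The citation contexts above refer to regret relative to always pulling the best arm and to the UCB1 scheme. As a hedged reading aid, these can be written as follows. The symbols θ*, i(t), x̄_i, n_i and n are notation assumed here for illustration and may differ from the paper's own.

$$
R(T) \;=\; T\,\theta^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} \theta_{i(t)}\right],
\qquad \theta^{*} = \max_i \theta_i ,
$$

where i(t) is the arm pulled at time t. The asymptotic lower bound quoted in the contexts is R(T) = Ω(log T). The UCB1 scheme of Auer et al. (2002) pulls the arm that maximizes

$$
\bar{x}_i \;+\; \sqrt{\frac{2 \ln n}{n_i}} ,
$$

where x̄_i is the average reward observed so far from arm i, n_i is the number of times arm i has been pulled, and n is the total number of pulls so far.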