## The K-armed Dueling Bandits Problem

Citations: 29 (7 self)

### Citations

12170 | Elements of Information Theory
- Cover, Thomas
- 1991
Citation Context: ...KL(p; q) ≥ (1/3) ln(1/(3q(E))) − 1/e. Applying this lemma with the event Ej, we have KL(q1; qj) ≥ (1/3) ln(1/(3 o(T^(a−1)))) − 1/e = Ω(log T) (16). On the other hand, by the chain rule for KL divergence [CT99], we have KL(q1; qj) = E_q1[n_{j,T}] · KL(1/2 + ɛ; 1/2 − ɛ) ≤ 16ɛ² E_q1[n_{j,T}] (17). Combining (16) and (17) proves the lemma. Proof of Theorem 4. Let φ be any algorithm for the dueling bandits problem. If ...
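The chain-rule step in (17) relies on the bound KL(1/2 + ɛ; 1/2 − ɛ) ≤ 16ɛ². A minimal numerical check of that inequality (the `bernoulli_kl` helper is our naming, not from the paper):

```python
import math

def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Check KL(1/2 + eps; 1/2 - eps) <= 16 * eps^2 over a range of gaps.
for eps in (0.01, 0.05, 0.1, 0.2):
    kl = bernoulli_kl(0.5 + eps, 0.5 - eps)
    assert kl <= 16 * eps ** 2
    print(f"eps={eps}: KL={kl:.5f}, 16*eps^2={16 * eps ** 2:.5f}")
```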

2166 | Probability inequalities for sums of bounded random variables
- Hoeffding
- 1963
Citation Context: ...e that E[P̂t] = 1/2 + ɛi,j. This tells us P(1/2 + ɛi,j ∉ Ĉt) is bounded above by the probability that P̂t deviates from its expected value by at least ct. An application of Hoeffding's inequality [Hoe63] shows that this probability is bounded above by 2 exp(−2t ct²) = 2 exp(−2 log(1/δ)) = 2δ² ≤ δ for δ ≤ 1/2. (Footnote: P̂i,j is the fraction of these t comparisons in which bi was the winner.) Thus...
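The radius implied by the displayed bound is ct = √(log(1/δ)/t), so that 2 exp(−2t ct²) = 2δ². A small simulation sketch of this confidence interval (function and variable names are ours):

```python
import math
import random

def confidence_radius(t: int, delta: float) -> float:
    # With c_t = sqrt(log(1/delta) / t), Hoeffding gives
    # P(|P_hat - E[P_hat]| >= c_t) <= 2 exp(-2 t c_t^2) = 2 delta^2 <= delta
    # whenever delta <= 1/2.
    return math.sqrt(math.log(1 / delta) / t)

def estimate_preference(win_prob: float, t: int, delta: float, seed: int = 0):
    """Simulate t duels where one bandit wins each with probability win_prob;
    return the empirical winning fraction and its confidence interval."""
    rng = random.Random(seed)
    wins = sum(rng.random() < win_prob for _ in range(t))
    p_hat = wins / t
    c = confidence_radius(t, delta)
    return p_hat, (p_hat - c, p_hat + c)

p_hat, (lo, hi) = estimate_preference(0.6, t=1000, delta=0.05)
print(f"P_hat = {p_hat:.3f}, interval = ({lo:.3f}, {hi:.3f})")
```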

789 | Finite-time analysis of the multiarmed bandit problem
- Auer, Cesa-Bianchi, et al.
Citation Context: ...LR85] and nonstochastic [ACBFS02] cases. The vast literature on this topic includes algorithms whose regret is within a constant factor of the information-theoretic lower bound in the stochastic case [ACBF02] and within an O(√log n) factor of the best such lower bound in the non-stochastic case [ACBFS02]. Our use of upper confidence bounds in designing algorithms for the dueling bandits problem is prefig...
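The upper-confidence-bound idea mentioned in this context is easiest to see in UCB1 from [ACBF02]: pull each arm once, then always pull the arm maximizing empirical mean plus an exploration bonus. A compact sketch (the deterministic reward function below is a stand-in for illustration):

```python
import math

def ucb1(pull, num_arms: int, horizon: int):
    """UCB1: after one pull of each arm, choose the arm maximizing
    mean_i + sqrt(2 ln t / n_i). Returns per-arm pull counts."""
    counts = [0] * num_arms
    totals = [0.0] * num_arms
    for t in range(horizon):
        if t < num_arms:
            arm = t  # initialization: pull every arm once
        else:
            arm = max(
                range(num_arms),
                key=lambda i: totals[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]),
            )
        totals[arm] += pull(arm)
        counts[arm] += 1
    return counts

# Stand-in rewards: arm 1 is better (0.8 vs 0.2), so it gets most pulls.
counts = ucb1(lambda a: (0.2, 0.8)[a], num_arms=2, horizon=500)
print(counts)
```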

701 | An efficient boosting algorithm for combining preferences - Freund, Iyer, et al. - 2004

500 | Asymptotically efficient adaptive allocation rules
- Lai, Robbins
- 1985
Citation Context: ...t attention to other points. 2 Related Work Regret-minimizing algorithms for multi-armed bandit problems and their generalizations have been intensively studied for many years, both in the stochastic [LR85] and nonstochastic [ACBFS02] cases. The vast literature on this topic includes algorithms whose regret is within a constant factor of the information-theoretic lower bound in the stochastic case [ACBF...

463 | Some aspects of the sequential design of experiments
- Robbins
- 1952
Citation Context: ...orithm that achieves (almost) information-theoretically optimal regret bounds (up to a constant factor). 1 Introduction In partial information online learning problems (also known as bandit problems) [Rob52], an algorithm must choose, in each of T consecutive iterations, one of K possible bandits (strategies). For conventional bandit problems, in every iteration, each bandit receives a real-valued payoff...

404 | Learning to order things
- Cohen, Schapire, et al.
- 1999
Citation Context: ...ed learning to rank. Typically, a preference function is first learned using a set of i.i.d. training examples, and subsequent predictions are made to minimize the number of mis-ranked pairs (e.g., [CSS99]). Most prior work assumes access to a training set with absolute labels (e.g., of relevance or utility) on individual examples, with pairwise preferences generated using inputs with labels from differ...

298 | A support vector method for multivariate performance measures - Joachims - 2005

172 | Using confidence bounds for exploitation-exploration trade-offs - Auer

95 | How Does Click-through Data Reflect Retrieval Quality?
- Radlinski, Kurup, et al.
- 2008
Citation Context: ...at such relative comparison statements can be derived from observable user behavior. For example, to elicit whether a search-engine user prefers ranking r1 over r2 for a given query, Radlinski et al. [RKJ08] showed how to present an interleaved ranking of r1 and r2 so that clicks indicate which of the two is preferred by the user. This ready availability of pairwise comparison feedback in applications wh...
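The interleaving mechanism from [RKJ08] is commonly realized as team-draft interleaving: the side that has contributed fewer documents (coin flip on ties) drafts its highest not-yet-picked document, and clicks are credited to the team that contributed the clicked document. A simplified sketch, not the paper's exact procedure (names are ours):

```python
import random

def team_draft_interleave(r1, r2, rng):
    """Merge rankings r1 and r2; return (merged docs, contributing team)."""
    merged, teams, used = [], [], set()
    n1 = n2 = 0  # documents contributed so far by each ranking
    docs = set(r1) | set(r2)

    def next_unused(ranking):
        return next((d for d in ranking if d not in used), None)

    while len(merged) < len(docs):
        # The side that has contributed less (coin flip on ties) drafts next.
        r1_drafts = n1 < n2 or (n1 == n2 and rng.random() < 0.5)
        order = [(r1, 1), (r2, 2)] if r1_drafts else [(r2, 2), (r1, 1)]
        for ranking, team in order:
            doc = next_unused(ranking)
            if doc is not None:
                used.add(doc)
                merged.append(doc)
                teams.append(team)
                if team == 1:
                    n1 += 1
                else:
                    n2 += 1
                break
    return merged, teams

merged, teams = team_draft_interleave(["a", "b", "c"], ["b", "a", "c"],
                                      random.Random(0))
print(merged, teams)
```

Clicks on documents with team label 1 count as wins for r1, and vice versa, which yields exactly the kind of pairwise preference signal the dueling bandits setting assumes.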

79 | Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems
- Even-Dar, Mannor, et al.
Citation Context: ...n rather than the PAC objective) but there are clear similarities between our IF1 and IF2 algorithms and the Successive Elimination and Median Elimination algorithms developed for the PAC setting in [EDMM06]. There are also some clear differences between the algorithms: these are discussed in Section 5.1. The difficulty of the dueling bandits problem stems from the fact that the algorithm has no way of d...

78 | The epoch-greedy algorithm for contextual multi-armed bandits
- Langford, Zhang
Citation Context: ...d is motivated by practical considerations from information retrieval applications. Future directions include finding other reasonable notions of regret in this framework (e.g., via contextualization [LZ07]), and designing algorithms that achieve low regret when the set of bandits is very large (a special case of this is addressed in [YJ09]). Acknowledgments The work is funded by NSF Award IIS-0812091. ...

71 | Computing with noisy information
- Feige, Peleg, et al.
- 1994
Citation Context: ...t. We call this the K-armed Dueling Bandits Problem, which can also be viewed as a regret-minimization version of the classical problem of finding the maximum element of a set using noisy comparisons [FRPU94]. A canonical application example is an intranet-search system that is installed for a new customer. Among K built-in retrieval functions, the search engine needs to select the one that provides the b...
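Finding the maximum with noisy comparisons, the problem attributed here to [FRPU94], can be illustrated with a naive champion tournament that boosts each duel by majority vote. This is a sketch under a simplified noise model (the truly larger element wins each comparison with probability 1/2 + ɛ), not the paper's algorithm:

```python
import random

def noisy_max(items, eps, repeats, rng):
    """Return a likely maximum of `items` using noisy pairwise comparisons.
    Each single comparison reports the truly larger element with probability
    1/2 + eps; majority vote over `repeats` duels drives down the error."""
    champion = items[0]
    for challenger in items[1:]:
        challenger_wins = 0
        for _ in range(repeats):
            winner = max(champion, challenger)
            loser = min(champion, challenger)
            reported = winner if rng.random() < 0.5 + eps else loser
            if reported == challenger:
                challenger_wins += 1
        if challenger_wins > repeats // 2:
            champion = challenger
    return champion

# With eps = 0.5 every comparison is noiseless, so the true max is returned.
print(noisy_max([3, 1, 4, 1, 5, 9, 2, 6], eps=0.5, repeats=1,
                rng=random.Random(0)))  # prints 9
```

Unlike this fixed-repeats sketch, regret-minimizing algorithms must avoid spending so many duels on pairs that are far from the maximum, which is exactly the tension the dueling bandits formulation addresses.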

66 | The sample complexity of exploration in the multi-armed bandit problem - Mannor, Tsitsiklis

44 | Regret bounds for sleeping experts and bandits
- Kleinberg, Niculescu-Mizil, et al.
- 2010
Citation Context: ...than a reduction from the standard case. Theorem 4. Any algorithm φ for the dueling bandits problem has R_φ(T) = Ω((K/ɛ) log T), where ɛ = min_{b≠b*} ɛ(b*, b). The proof is motivated by Lemma 5 of [KNMS08]. Fix ɛ > 0 and define the following family of problem instances. In instance q̃j, let bj be the best bandit, and order the remaining bandits by their indices. Let P(bi ≻ bk) = 1/2 + ɛ whenever bi precedes bk in this order. No...

40 | Regret minimization under partial monitoring
- Cesa-Bianchi, Lugosi, et al.
- 2004
Citation Context: ...the fact that the algorithm has no way of directly observing the costs of the actions it chooses. It is an example of a partial monitoring problem, a class of regret-minimization problems defined in [CBLS06], in which an algorithm (the “forecaster”) chooses actions and then observes feedback signals that depend on the actions chosen by the forecaster and by an unseen opponent (the “environment”). This pa...

38 | Interactively optimizing information retrieval systems as a dueling bandits problem
- Yue, Joachims
- 2009
Citation Context: ...the regret minimization setting considered here, because they devote undue effort to comparing elements that are far from the maximum. This point is discussed further in Section 5.1. Yue and Joachims [YJ09] simultaneously studied a continuous version of the Dueling Bandits Problem, where bandits (e.g., retrieval functions) are characterized using a compact parameter space. For that setting, they propose...

36 | Noisy binary search and its applications
- Karp, Kleinberg
- 2007
Citation Context: ...oretic optimum up to a 1 + o(1) factor. When the probability of error depends on the pair of elements being compared (as in our dueling bandits problem), Adler et al. [AGHB+94] and Karp and Kleinberg [KK07] present algorithms that achieve the information-theoretic optimum (up to constant factors) for the problem of selecting the maximum and for binary search, respectively. Our results can be seen as ext...
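A naive way to make binary search robust to comparison noise is to repeat each query and take a majority vote; [KK07] and [BOH08] achieve strictly better query complexity, so the sketch below (our naming, simplified noise model) is only for intuition:

```python
import random

def noisy_lower_bound(arr, target, p_correct, repeats, rng):
    """Binary search for the insertion point of `target` in sorted `arr`
    when each comparison answer is correct only with probability p_correct.
    Each step asks the same question `repeats` times and trusts the majority."""
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        truth = arr[mid] < target
        yes_votes = sum(
            truth if rng.random() < p_correct else not truth
            for _ in range(repeats)
        )
        if yes_votes > repeats // 2:
            lo = mid + 1
        else:
            hi = mid
    return lo

# With p_correct = 0.8 and enough repeats, the answer is correct w.h.p.
print(noisy_lower_bound([1, 3, 5, 7, 9], 7, p_correct=0.8, repeats=31,
                        rng=random.Random(0)))
```

The fixed repeat count wastes queries on easy steps; the cited algorithms adapt the number of repetitions, which is where the 1 + o(1) optimality comes from.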

34 | Robust reductions from ranking to classification - Balcan, Bansal, et al. - 2007

16 | The Bayesian learner is optimal for noisy binary search (and pretty good for quantum as well)
- Ben-Or, Hassidim
- 2008
Citation Context: ...expected cost (up to constant factors) for many basic problems such as sorting, searching, and selecting the maximum. The upper bound for noisy binary search has been improved in a very recent paper [BOH08] that achieves the information-theoretic optimum up to a 1 + o(1) factor. When the probability of error depends on the pair of elements being compared (as in our dueling bandits problem), Adler et al. [A...

13 | Selection in the presence of noise: The design of playoff systems - Adler, Gemmell, et al. - 1994

8 | Boosting the area under the ROC curve - Long, Servedio - 2007

6 | Bandits for taxonomies: A model-based approach - Pandey, Agarwal, et al. - 2007

3 | How do we get weak action dependence for learning with partial observations? Blog post: http://hunch.net/?p=421
- Langford
- 2008
Citation Context: ...g K) comparisons in expectation. In Section 5 we provide insight about why existing methods suffer high regret in our setting. Thus, our results provide theoretical support for Langford's observation [Lan08] about a qualitative difference between algorithms for supervised learning and those for learning from partial observations: in the supervised setting, “holistic information is often better,” whereas ...

1 | The nonstochastic multiarmed bandit problem
- Auer, Cesa-Bianchi, et al.
Citation Context: ...ts. 2 Related Work Regret-minimizing algorithms for multi-armed bandit problems and their generalizations have been intensively studied for many years, both in the stochastic [LR85] and nonstochastic [ACBFS02] cases. The vast literature on this topic includes algorithms whose regret is within a constant factor of the information-theoretic lower bound in the stochastic case [ACBF02] and within an O(√log n)...