Finite-time analysis of the multiarmed bandit problem (2002)

by Peter Auer, Nicolò Cesa-Bianchi, Paul Fischer
Venue: Machine Learning
Citations: 817 (15 self)

BibTeX

@ARTICLE{Auer02finite-timeanalysis,
    author = {Peter Auer and Nicol\`{o} Cesa-Bianchi and Paul Fischer},
    title = {Finite-time analysis of the multiarmed bandit problem},
    journal = {Machine Learning},
    year = {2002}
}


Abstract

Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions and taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is, the loss incurred because the globally optimal policy is not followed at all times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies that asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.

Keywords: bandit problems, adaptive allocation rules, finite horizon regret
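
One of the simple index policies analyzed in this paper, UCB1, plays each arm once and then always pulls the arm maximizing the empirical mean reward plus the confidence term sqrt(2 ln n / n_j), where n is the total number of plays so far and n_j the number of plays of arm j. The following is a minimal illustrative sketch of that rule, not the authors' code; the Bernoulli test arms, the function names, and the horizon are made-up assumptions for the example, and rewards are assumed to lie in [0, 1] (bounded support, as the abstract requires).

    import math
    import random

    def ucb1(pull, n_arms, horizon):
        """Run a UCB1-style policy for `horizon` plays.
        `pull(arm)` must return a reward in [0, 1].
        Returns the total collected reward and per-arm play counts."""
        counts = [0] * n_arms       # n_j: number of times arm j was played
        means = [0.0] * n_arms      # empirical mean reward of arm j
        total = 0.0

        # Initialization: play each arm once.
        for arm in range(n_arms):
            r = pull(arm)
            counts[arm] = 1
            means[arm] = r
            total += r

        for t in range(n_arms, horizon):
            # Upper confidence index: empirical mean + sqrt(2 ln n / n_j).
            arm = max(range(n_arms),
                      key=lambda j: means[j] + math.sqrt(2.0 * math.log(t) / counts[j]))
            r = pull(arm)
            counts[arm] += 1
            means[arm] += (r - means[arm]) / counts[arm]  # incremental mean update
            total += r

        return total, counts

    if __name__ == "__main__":
        # Hypothetical Bernoulli arms used only to exercise the sketch.
        probs = [0.3, 0.5, 0.7]
        pull = lambda a: 1.0 if random.random() < probs[a] else 0.0
        reward, plays = ucb1(pull, len(probs), 10_000)
        # Expected regret is horizon * best mean minus collected reward;
        # the paper's result says it grows only logarithmically in the horizon.
        print(plays, 10_000 * max(probs) - reward)

Running the sketch, the suboptimal arms accumulate only a logarithmic number of plays, which is exactly the finite-time regret guarantee the abstract refers to.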

Keyphrases

bandit problem, finite-time analysis, efficient policy, popular measure, exploration versus exploitation dilemma, reward distribution, bounded support, optimal policy, policy success, many others, exploration exploitation dilemma, multi-armed bandit problem, finite horizon, optimal logarithmic regret, profitable action, first one, adaptive allocation rule
