
## Adaptive ε-greedy exploration in reinforcement learning based on value differences (2010)

Venue: Berlin / Heidelberg

Citations: 3 (1 self)

### Citations

5471 | Reinforcement Learning: An Introduction
- Sutton, Barto
- 1998
Citation Context ...nowledge prevents the agent from maximizing the long-term reward because selected actions may not be optimal. For this reason, the described problem is well known as the dilemma of exploration and exploitation [1]. This paper addresses the issue of adaptive exploration in RL and elaborates on a method for controlling the amount of exploration on the basis of the agent's uncertainty. For this, the proposed VDBE met...

1670 | Learning from Delayed Rewards - Watkins - 1989

473 | Dynamic Programming: Deterministic and Stochastic Models
- Bertsekas
- 1987
Citation Context ...orithms are Q-learning [2] or Sarsa [9] from the temporal-difference approach, which are typically used when the environment model is unknown. Other algorithms, from the dynamic programming approach [10], are used when a model of the environment is available and therefore usually converge faster. In the following, we use a version of temporal difference learning with respect to a single-state MDP, whi...

464 | Some aspects of the sequential design of experiments
- Robbins
- 1952
Citation Context ... Fig. 1. Graph of f(s, a, σ) for various sensitivities. 4 Experiments and Results A typical scenario for evaluating exploration/exploitation methods is the multi-armed bandit problem [1, 12]. In this example a casino player can choose at each time step among n different levers of an n-armed bandit (slot-machine analogy) with the goal of maximizing the cumulative reward within a series of...
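The n-armed bandit setting described in this context can be sketched in a few lines. This is an illustrative reconstruction, not code from the paper; class and method names are our own, and the Gaussian reward model follows the standard testbed convention.

```python
import random

class NArmedBandit:
    """Minimal n-armed bandit: lever i pays a stochastic reward
    centered on a hidden true mean q*(i). A sketch for illustration,
    not the paper's implementation."""

    def __init__(self, n=10, seed=None):
        self.rng = random.Random(seed)
        # Hidden true action values, drawn once per task.
        self.means = [self.rng.gauss(0.0, 1.0) for _ in range(n)]

    def pull(self, lever):
        # Observed reward = true mean plus unit-variance Gaussian noise.
        return self.rng.gauss(self.means[lever], 1.0)
```

The player only sees the noisy rewards from `pull`, so it must estimate the hidden means while still collecting reward, which is exactly the exploration/exploitation dilemma.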

379 | On-line Q-learning using connectionist systems
- Rummery, Niranjan
Citation Context ...ntinuous learning tasks. 2.1 Learning the Q function Value functions are learned through the agent's interaction with its environment. For this, frequently used algorithms are Q-learning [2] or Sarsa [9] from the temporal-difference approach, which are typically used when the environment model is unknown. Other algorithms, from the dynamic programming approach [10], are used when a model of the envir...
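For the single-state MDP mentioned in this context, the temporal-difference update loses its bootstrapped successor term and reduces to an incremental move of Q(a) toward the observed reward. A minimal sketch, with variable names of our choosing:

```python
def td_update(Q, action, reward, alpha=0.1):
    """Single-state TD update: Q(a) <- Q(a) + alpha * (r - Q(a)).
    With no successor state, the discounted future-value term
    vanishes and the target is simply the reward r."""
    Q[action] += alpha * (reward - Q[action])
    return Q
```

With a constant step-size alpha this is an exponential recency-weighted average; using alpha = 1/k (the visit count of the action) instead recovers the plain sample average.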

293 | R-max - A general polynomial time algorithm for near-optimal reinforcement learning
- Brafman, Tennenholtz
Citation Context ...e requirements. 1.1 Related Work In the literature, many different approaches exist for balancing the ratio of exploration to exploitation in RL: many methods utilize counters [3], model learning [4], or reward comparison in a biologically inspired manner [5]. In practice, however, it turns out that the ε-greedy [2] method is often the method of first choice, as reported by Sutton [6]. The reason f...

148 | Efficient exploration in reinforcement learning
- Thrun
- 1992
Citation Context ... and computation time requirements. 1.1 Related Work In the literature, many different approaches exist for balancing the ratio of exploration to exploitation in RL: many methods utilize counters [3], model learning [4], or reward comparison in a biologically inspired manner [5]. In practice, however, it turns out that the ε-greedy [2] method is often the method of first choice, as reported by Sutt...

95 | Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches
- Awerbuch, Kleinberg
- 2004
Citation Context ...res” one of the other levers to improve the estimates. Some real-world problems analogous to the multi-armed bandit problem are, e.g., adaptive routing in networks with the goal of delay minimization [13] or the economic problem of selecting the best supplier on the basis of incomplete information [14]. 4.1 Experiment Setup The VDBE method is compared and evaluated on a set of 2000 randomly generated ...

62 | Multi-armed bandit algorithms and empirical evaluation
- Vermorel, Mohri
- 2005
Citation Context ...(1) the method does not require memorizing any exploration-specific data and (2) it is known to achieve near-optimal results in many applications by hand-tuning only a single parameter, see e.g. [7]. Even though the ε-greedy method is reported to be widely used, the literature still lacks methods for adapting the method's exploration rate on the basis of the learning progress. Only a few methods s...

34 | Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming
- George, Powell
- 2006
Citation Context ...since small step-sizes cause long learning times, whereas large step-sizes cause oscillations in the value function. A more detailed overview of these and other step-size methods can be found in [11]. 2.2 Exploration/Exploitation Strategies Two widely used methods for balancing exploration/exploitation are ε-greedy and softmax [1]. With ε-greedy, the agent selects at each time step a random actio...
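The two selection strategies named in this context can be sketched directly. This is an illustrative implementation under standard definitions (ε-greedy: random action with probability ε, else greedy; softmax: Boltzmann selection with temperature τ), not code from the paper:

```python
import math
import random

def epsilon_greedy(Q, epsilon, rng=random):
    """With probability epsilon explore (uniform random action),
    otherwise exploit (greedy action)."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q))
    return max(range(len(Q)), key=Q.__getitem__)

def softmax(Q, tau, rng=random):
    """Boltzmann selection: P(a) proportional to exp(Q(a)/tau).
    High tau -> near-uniform exploration; low tau -> near-greedy."""
    prefs = [math.exp(q / tau) for q in Q]
    total = sum(prefs)
    r, acc = rng.random() * total, 0.0
    for a, p in enumerate(prefs):
        acc += p
        if r <= acc:
            return a
    return len(Q) - 1  # guard against floating-point rounding
```

Both strategies expose one tunable parameter (ε or τ), which is precisely what VDBE later adapts automatically instead of fixing by hand.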

31 | Control of exploitation-exploration metaparameter in reinforcement learning
- Ishii, Yoshida, et al.
- 2002
Citation Context ...fferent approaches exist for balancing the ratio of exploration to exploitation in RL: many methods utilize counters [3], model learning [4], or reward comparison in a biologically inspired manner [5]. In practice, however, it turns out that the ε-greedy [2] method is often the method of first choice, as reported by Sutton [6]. The reason for this seems to be that (1) the method doe...

10 | Exploitation vs. exploration: Choosing a supplier in an environment of incomplete information, Decision Support Systems
- Azoulay-Schwartz, Kraus, et al.
Citation Context ...lti-armed bandit problem are, e.g., adaptive routing in networks with the goal of delay minimization [13] or the economic problem of selecting the best supplier on the basis of incomplete information [14]. 4.1 Experiment Setup The VDBE method is compared and evaluated on a set of 2000 randomly generated 10-armed bandit tasks as described in [1]. Each selection of a lever returns a stochastic reward dra...
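The experiment setup in this context follows the classic 10-armed testbed of [1]: many randomly generated tasks, with performance averaged per time step. A minimal sketch of that testbed loop using ε-greedy with sample-average estimates; this reconstructs the benchmark, not the paper's VDBE agent, and all names here are our own:

```python
import random

def run_testbed(n_tasks=2000, n_arms=10, steps=1000, epsilon=0.1, seed=0):
    """Average-reward curve over many random bandit tasks, in the
    style of the 10-armed testbed described in [1]. Illustrative
    sketch only."""
    rng = random.Random(seed)
    avg = [0.0] * steps
    for _ in range(n_tasks):
        # Fresh task: hidden true means for each lever.
        means = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]
        Q = [0.0] * n_arms
        counts = [0] * n_arms
        for t in range(steps):
            if rng.random() < epsilon:
                a = rng.randrange(n_arms)           # explore
            else:
                a = max(range(n_arms), key=Q.__getitem__)  # exploit
            r = rng.gauss(means[a], 1.0)
            counts[a] += 1
            Q[a] += (r - Q[a]) / counts[a]          # sample average
            avg[t] += r
    return [s / n_tasks for s in avg]
```

Averaging over thousands of independent tasks smooths out per-task noise, so the resulting curve shows the learning behavior of the strategy rather than of any particular bandit.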

2 | Interview with Richard
- Heidrich-Meisner
- 2009
Citation Context ...odel learning [4] or reward comparison in a biologically inspired manner [5]. In practice, however, it turns out that the ε-greedy [2] method is often the method of first choice, as reported by Sutton [6]. The reason for this seems to be that (1) the method does not require memorizing any exploration-specific data and (2) it is known to achieve near-optimal results in many applications b...

2 | Improving the exploration strategy in bandit algorithms
- Caelen, Bontempi
- 2007
Citation Context ...od is reported to be widely used, the literature still lacks methods for adapting the method's exploration rate on the basis of the learning progress. Only a few methods such as ε-first or decreasing-ε [8] consider “time” in order to reduce the exploration probability, which is known to be only loosely related to the true learning progress. For example, why should an agent be less explorative in unknown par...