A Comparison of Exploration/Exploitation Techniques for a Q-Learning Agent in the Wumpus World
BibTeX
@MISC{Friesen_acomparison,
author = {A. Friesen},
title = {A Comparison of Exploration/Exploitation Techniques for a Q-Learning Agent in the Wumpus World},
year = {}
}
Abstract
The Q-Learning algorithm, suggested by Watkins [1], has become one of the most popular reinforcement learning algorithms due to its relatively simple implementation and the complexity reduction gained by the use of a model-free method. However, Q-Learning does not specify how to trade off exploration of the world against exploitation of the developed policy. Multiple such tradeoffs are possible, and the preference for one over another should depend mainly on whether a fast but less accurate convergence to a policy is desired, or whether a slower convergence to a more accurate policy is preferable. This paper presents the results of several exploration vs. exploitation (EE) methods within the contrived environment of the Wumpus World [2].
1. Wumpus World
The Wumpus World is a simple grid-world environment that can contain multiple hazards (in the form of pits and the wumpus itself) and a goal. The hazard and goal reward values are fairly arbitrary, but the values chosen were -100 and 100, respectively. Some experimentation was done with unbalanced hazard and goal rewards (i.e., -10000 for a hazard and +100 for a goal), but this did not significantly affect the results. In the actual Wumpus World, the agent can move in any of the four directions (North, South, East, and West) and can shoot a single arrow in one of these directions to attempt to eliminate the wumpus. The arrow action was removed from the version used in these tests, as it added too much complexity. Thus the wumpus functions like a pit, in that if the agent moves onto the wumpus square, the agent is killed. However, to add some complexity to the world, the agent's actions were altered to be non-deterministic. Thus, if the agent attempts to move East, there is a certain probability that it will actually move South instead. Some experimentation was done with the different possible combinations of transition probabilities. More random transitions lowered the agent's average score, while more deterministic transitions had the opposite effect. Furthermore, more random transitions caused the agent to play more cautiously, moving as far away from hazards as possible so as to lessen the probability that it would accidentally be killed.
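To make the setup concrete, the sketch below shows tabular Q-learning with one common exploration/exploitation rule (epsilon-greedy) in a stochastic grid of the kind described above. This is an illustrative sketch, not the paper's implementation: the grid size, hazard and goal locations, slip probability, step cost, and learning parameters are all assumed values chosen only for demonstration.

import random

ACTIONS = ['N', 'S', 'E', 'W']
MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}

SIZE = 4
HAZARDS = {(2, 2), (3, 0)}   # pits and the wumpus square (assumed locations)
GOAL = (3, 3)
SLIP_PROB = 0.2              # chance the agent slips into a random other direction

def step(state, action):
    """Apply a (possibly slipped) move and return (next_state, reward, done)."""
    if random.random() < SLIP_PROB:
        action = random.choice([a for a in ACTIONS if a != action])
    dx, dy = MOVES[action]
    x = min(max(state[0] + dx, 0), SIZE - 1)
    y = min(max(state[1] + dy, 0), SIZE - 1)
    nxt = (x, y)
    if nxt in HAZARDS:
        return nxt, -100.0, True
    if nxt == GOAL:
        return nxt, 100.0, True
    return nxt, -1.0, False  # small step cost (assumed) to encourage short paths

def epsilon_greedy(Q, state, epsilon):
    """Explore with probability epsilon, otherwise exploit the current policy."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def train(episodes=5000, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = {((x, y), a): 0.0 for x in range(SIZE) for y in range(SIZE) for a in ACTIONS}
    for _ in range(episodes):
        state, done = (0, 0), False
        while not done:
            action = epsilon_greedy(Q, state, epsilon)
            nxt, reward, done = step(state, action)
            best_next = max(Q[(nxt, a)] for a in ACTIONS)
            # Watkins' update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = nxt
    return Q

A fixed epsilon is only one of the tradeoffs the paper compares; other schemes (e.g., decaying epsilon or optimistic initial values) plug into the same loop by replacing the action-selection step.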