Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions
1993
Abstract

Consider a given value function on states of a Markov decision problem, as might result from applying a reinforcement learning algorithm. Unless this value function equals the corresponding optimal value function, at some states there will be a discrepancy, which it is natural to call the Bellman residual, between what the value function specifies at that state and what is obtained by a one-step lookahead along the seemingly best action at that state using the given value function to evaluate all succeeding states. This paper derives a tight bound on how far from optimal the discounted return for a greedy policy based on the given value function will be as a function of the maximum norm magnitude of this Bellman residual. A corresponding result is also obtained for value functions defined on state-action pairs, as are used in Q-learning. One significant application of these results is to problems where a function approximator is used to learn a value function, with training of the approxi...
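
The two quantities the abstract relates, the maximum-norm Bellman residual of a given value function and the loss in discounted return of the greedy policy derived from it, can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the two-state toy problem (`P`, `R`, `GAMMA`), the helper names, and the use of repeated sweeps to approximate the optimal and policy value functions. The sketch only computes the two quantities; it does not reproduce the paper's bound.

```python
# Hypothetical toy Markov decision problem (not from the paper): 2 states, 2 actions.
# P[s][a] is a list of (probability, next_state); R[s][a] is the expected reward.
P = [
    [[(1.0, 0)], [(1.0, 1)]],
    [[(1.0, 0)], [(1.0, 1)]],
]
R = [
    [0.0, 1.0],
    [0.0, 2.0],
]
GAMMA = 0.9  # discount factor (assumed value)


def backup(V, s, a):
    """One-step lookahead value of action a in state s, evaluating successors with V."""
    return R[s][a] + GAMMA * sum(p * V[t] for p, t in P[s][a])


def greedy_policy(V):
    """Action with the best one-step lookahead value in each state."""
    return [max(range(len(R[s])), key=lambda a: backup(V, s, a))
            for s in range(len(V))]


def bellman_residual(V):
    """Maximum over states of |best one-step lookahead value - V(s)|."""
    return max(abs(max(backup(V, s, a) for a in range(len(R[s]))) - V[s])
               for s in range(len(V)))


def evaluate_policy(pi, sweeps=2000):
    """Approximate the discounted return of policy pi by repeated backups."""
    V = [0.0] * len(pi)
    for _ in range(sweeps):
        V = [backup(V, s, pi[s]) for s in range(len(pi))]
    return V


def optimal_values(sweeps=2000):
    """Approximate the optimal value function by value iteration."""
    V = [0.0] * len(R)
    for _ in range(sweeps):
        V = [max(backup(V, s, a) for a in range(len(R[s])))
             for s in range(len(V))]
    return V


def greedy_loss(V):
    """Largest shortfall, over states, of the greedy policy's return from optimal."""
    V_star = optimal_values()
    V_pi = evaluate_policy(greedy_policy(V))
    return max(vs - vp for vs, vp in zip(V_star, V_pi))


if __name__ == "__main__":
    V_imperfect = [10.0, 0.0]  # an arbitrary imperfect value function
    print("Bellman residual:", bellman_residual(V_imperfect))
    print("greedy policy:", greedy_policy(V_imperfect))
    print("loss of greedy policy:", greedy_loss(V_imperfect))
```

In this sketch the optimal value function has a residual of essentially zero and its greedy policy incurs no loss, while an imperfect value function generally has a positive residual and may induce a suboptimal greedy policy.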