## A generalization error for Q-learning

Venue: Journal of Machine Learning Research

Citations: 14 (5 self)

### BibTeX

```bibtex
@ARTICLE{Murphy_ageneralization,
  author  = {S. A. Murphy},
  title   = {A generalization error for Q-learning},
  journal = {Journal of Machine Learning Research},
  year    = {2005}
}
```

### Abstract

Planning problems that involve learning a policy from a single training set of finite-horizon trajectories arise in both social science and medical fields. We consider Q-learning with function approximation for this setting and derive an upper bound on the generalization error. This upper bound is in terms of quantities minimized by a Q-learning algorithm, the complexity of the approximation space, and an approximation term due to the mismatch between Q-learning and the goal of learning a policy that maximizes the value function.
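The setting the abstract describes — fitting Q-functions from a fixed batch of finite-horizon trajectories, with function approximation — can be sketched as a backward least-squares recursion. This is a minimal illustration, not the paper's algorithm; the names (`batch_q_learning`, `featurize`, the trajectory format) are assumptions of this sketch:

```python
import numpy as np

def batch_q_learning(trajectories, horizon, n_features, featurize):
    """Backward-recursive batch Q-learning with linear function
    approximation (illustrative sketch, not the paper's notation).

    trajectories: list of trajectories, each a list of
        (observation, action, reward) tuples of length `horizon`.
    featurize(o, a): maps an observation-action pair to a feature vector.
    """
    # Actions observed anywhere in the training set.
    actions = {a for traj in trajectories for (_, a, _) in traj}
    weights = [np.zeros(n_features) for _ in range(horizon)]
    # Work backward from the final decision time, as in finite-horizon
    # dynamic programming: fit Q_t by least squares against the
    # immediate reward plus the fitted max of Q_{t+1}.
    for t in reversed(range(horizon)):
        X, y = [], []
        for traj in trajectories:
            o, a, r = traj[t]
            target = r
            if t + 1 < horizon:
                o_next = traj[t + 1][0]
                # Greedy backup over the observed actions.
                target += max(featurize(o_next, b) @ weights[t + 1]
                              for b in actions)
            X.append(featurize(o, a))
            y.append(target)
        weights[t], *_ = np.linalg.lstsq(np.asarray(X), np.asarray(y),
                                         rcond=None)
    return weights
```

A greedy policy is then read off by choosing, at each time t, the action maximizing `featurize(o, a) @ weights[t]`; the paper's generalization error measures how far the value of that greedy policy falls short of the optimal value.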

### Citations

3773 | Reinforcement Learning: An Introduction - Sutton, Barto - 1998

Citation Context: ...ting is Q-learning (Watkins, 1989) since the actions in the training set are chosen according to a (non-optimal) exploration policy; Q-learning is an off-policy method (Sutton and Barto, 1998). When the observables are vectors of continuous variables or are otherwise of high dimension, Q-learning must be combined with function approximation. The contributions of this paper are as follows...

2611 | Dynamic Programming - Bellman - 1957

1321 | Learning from Delayed Rewards - Watkins - 1989

745 | Nonlinear Programming - Bertsekas - 2004

721 | Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods, The Annals of Statistics - Bartlett, Freund, et al.

Citation Context: ...e in value functions or more specifically the average generalization error. Here the generalization error for batch Q-learning is defined analogously to the generalization error in supervised learning (Schapire et al., 1998); it is the average difference in value when using the optimal policy as compared to using the greedy policy (from Q-learning) in generating a separate test set. The performance guarantees are anal...

341 | Weak Convergence and Empirical Processes - van der Vaart, Wellner - 1996

310 | Neural Network Learning: Theoretical Foundations - Anthony, Bartlett - 1999

Citation Context: ...policy as compared to using the greedy policy (from Q-learning) in generating a separate test set. The performance guarantees are analogous to performance guarantees available in supervised learning (Anthony and Bartlett, 1999). The upper bounds on the average generalization error permit an additional contribution. These upper bounds illuminate the mismatch between Q-learning with function approximation and the goal of fin...

217 | An Analysis of Temporal-Difference Learning with Function Approximation - Tsitsiklis, Roy - 1997

171 | A sparse sampling algorithm for near-optimal planning in large Markov decision processes - Kearns, Mansour, et al. - 2002

153 | Infinite-horizon policy-gradient estimation - Baxter, Bartlett - 2001

134 | Feature-based methods for large scale dynamic programming - Tsitsiklis, Roy - 1996

111 | Approximate planning in large POMDPs via reusable trajectories - Kearns, Mansour, et al. - 2000

102 | Kernel-based reinforcement learning - Ormoneit, Sen

89 | On the sample complexity of reinforcement learning - Kakade - 2003

45 | Advantage Updating - Baird - 1993

Citation Context: ...At), ft+1; this conditional distribution does not depend on the policy.) In Section 4 we express the difference in value functions for policy π̃ and policy π in terms of the advantages (as defined in Baird, 1993). The time t advantage is µπ,t(ot, at) = Qπ,t(ot, at) − Vπ,t(ot, at−1). The advantage can be interpreted as the gain in performance obtained by following action at at time t and thereafter policy π as co...
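The advantage defined in the excerpt above can be illustrated numerically. The sketch below simplifies the excerpt's history-dependent Vπ,t(ot, at−1) to a value V(o) over a discretized observation space; the array layout and names are assumptions of this illustration, not the paper's:

```python
import numpy as np

def advantage(Q_t, V_t):
    """Time-t advantage: mu(o, a) = Q_t(o, a) - V_t(o).

    Q_t: array of shape (n_obs, n_actions), the action-value function.
    V_t: array of shape (n_obs,), the value of following the policy.

    The entry at (o, a) is the gain from taking action a at time t and
    following the policy thereafter, relative to following the policy
    from time t on.
    """
    return Q_t - V_t[:, None]
```

Since Vπ,t(o) = Qπ,t(o, π(o)), the advantage is zero at the action the policy itself selects and measures the gain or loss from deviating at time t only.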

32 | Efficient reinforcement learning - Fiechter

23 | Learning from scarce experience - Peshkin, Shelton

Citation Context: ...tion error: max_o [V*(o) − V_π(o)] (Bertsekas and Tsitsiklis, 1996). However here we consider an average generalization error as in Kakade (2003) (see also Fiechter, 1997; Kearns, Mansour and Ng, 2000; Peshkin and Shelton, 2002); that is ∫ [V*(o) − V_π(o)] dF(o) for a specified distribution F on the observation space. The choice of F with density f = f0 (f0 is the density of O0 in likelihoods (1) and (2)) is particularly ...

12 | Expected Mistake Bound Model for On-Line Reinforcement Learning - Fiechter - 1997

Citation Context: ... π is evaluated in terms of maximum generalization error: max_o [V*(o) − V_π(o)] (Bertsekas and Tsitsiklis, 1996). However here we consider an average generalization error as in Kakade (2003) (see also Fiechter, 1997; Kearns, Mansour and Ng, 2000; Peshkin and Shelton, 2002); that is ∫ [V*(o) − V_π(o)] dF(o) for a specified distribution F on the observation space. The choice of F with density f = f0 (f0 is the d...
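The average generalization error in the two excerpts above, ∫ [V*(o) − V_π(o)] dF(o), can be estimated by Monte Carlo whenever both value functions can be evaluated. A minimal sketch, where the names (`v_star`, `v_pi`, `sample_obs`) are assumptions of this illustration:

```python
import numpy as np

def avg_generalization_error(v_star, v_pi, sample_obs, n=10_000, seed=0):
    """Monte Carlo estimate of the average generalization error
    int [V*(o) - V_pi(o)] dF(o).

    v_star, v_pi: vectorized value functions over observations.
    sample_obs(rng, n): draws n observations from the reference
        distribution F.
    """
    rng = np.random.default_rng(seed)
    obs = sample_obs(rng, n)
    # Average the per-observation value shortfall of policy pi.
    return float(np.mean(v_star(obs) - v_pi(obs)))
```

Unlike the worst-case criterion max_o [V*(o) − V_π(o)], this averages the shortfall over observations drawn from F, so a policy can score well even if it performs poorly on observations that are rare under F.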

10 | Background and rationale for the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study. Psychiatric Clinics of North America 26(3):457-494 - Fava, Rush - 2003

Citation Context: ...number of ongoing large clinical trials for chronic disorders in which, each time an individual relapses, the individual is re-randomized to one of several further treatments (Schneider et al., 2001; Fava et al., 2003; Thall et al., 2000). These are finite horizon problems with T generally quite small, T = 2 − 4, with known exploration policy. Scientists want to estimate the best "strategies," i.e. policies, for m...

6 | National Institute of Mental Health clinical antipsychotic trials of intervention effectiveness (CATIE) - Schneider, Tariot, et al.

5 | Using behavioral reinforcement to improve methadone treatment participation - Brooner, Kidorf - 2002

5 | Selecting therapeutic strategies based on efficacy and death in multicourse clinical trials - Thall, Sung, et al. - 2002

4 | Evaluating multiple treatment courses in clinical trials. Statistics in Medicine - Thall, Millikan, et al. - 2000

3 | Texas Medication Algorithm Project, Phase 3 (TMAP-3): Rationale and study design - Rush, Crismon, et al. - 2003

2 | Dynamic catalog mailing policies. Unpublished manuscript, available electronically at http://web.mit.edu/jnt/www/Papers/P-03-sun-catalogrev2.pdf - Simester, Sun, et al. - 2003

2 | Less is more? STI in acute and chronic HIV-1 infection. Nature Medicine 7:881–884 - Altfeld, Walker - 2001

Citation Context: ...t, Ot+1) for rt a reward function and for each 0 ≤ t ≤ T (if the Markov assumption holds then replace Ot with Ot and At with At). We assume that the rewards are bounded, taking values in the interval [0, 1]. We assume the trajectories are sampled at random according to a fixed distribution denoted by P. Thus the trajectories are generated by one fixed distribution. This distribution is composed of the ...

Citation Context: ... training set of trajectories are not unusual and can be expected to increase due to the widespread use of policies in the social and behavioral/medical sciences (see, for example, Rush et al., 2003; Altfeld and Walker, 2001; Brooner and Kidorf, 2002); at this time these policies are formulated using expert opinion, clinical experience and/or theoretical models. However there is growing interest in formulating these pol...
