Learning from delayed rewards (1989)

by C. Watkins