Learning from delayed rewards (1989)

by C J C H Watkins