## Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding (1996)

### Download Links

- [ftp.cs.umass.edu]
- [www-anw.cs.umass.edu]
- [www.cs.ualberta.ca]
- [webdocs.cs.ualberta.ca]
- CiteULike
- DBLP

### Other Repositories/Bibliography

Venue: Advances in Neural Information Processing Systems 8

Citations: 354 (18 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Sutton96generalizationin,
  author    = {Richard S. Sutton},
  title     = {Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding},
  booktitle = {Advances in Neural Information Processing Systems 8},
  year      = {1996},
  pages     = {1038--1044},
  publisher = {MIT Press}
}
```

### Abstract

On large problems, reinforcement learning systems must use parameterized function approximators such as neural networks in order to generalize between similar situations and actions. In these cases there are no strong theoretical results on the accuracy of convergence, and computational results have been mixed. In particular, Boyan and Moore reported at last year's meeting a series of negative results in attempting to apply dynamic programming together with function approximation to simple control problems with continuous state spaces. In this paper, we present positive results for all the control tasks they attempted, and for one that is significantly larger. The most important differences are that we used sparse-coarse-coded function approximators (CMACs) whereas they used mostly global function approximators, and that we learned online whereas they learned offline. Boyan and Moore and others have suggested that the problems they encountered could be solved by using actual outcomes (...
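The online, sparse-feature setting the abstract describes can be sketched as a linear Sarsa-style update over binary features, where each backup touches only the weights of the currently active features. This is a minimal illustration of that idea; the feature indices, step sizes, and task below are assumptions, not the paper's exact setup.

```python
import numpy as np

def q_value(w, active_features):
    """With sparse binary features, Q is the sum of the active weights."""
    return sum(w[i] for i in active_features)

def sarsa_update(w, feats, r, next_feats, alpha=0.1, gamma=1.0):
    """One online Sarsa backup applied to the active features' weights."""
    delta = r + gamma * q_value(w, next_feats) - q_value(w, feats)
    for i in feats:
        w[i] += alpha * delta  # credit only the features active in this state
    return w

w = np.zeros(16)
w = sarsa_update(w, [0, 3, 7], r=1.0, next_feats=[1, 3, 8])
```

Because only a handful of features are active at any state, each online update is cheap, which is one reason this style of learning scales to the continuous control tasks discussed here.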

### Citations

1321 | Learning from Delayed Rewards - Watkins - 1989
Citation Context: ...pproximation Reinforcement learning is a broad class of optimal control methods based on estimating value functions from experience, simulation, or search (Barto, Bradtke & Singh, 1995; Sutton, 1988; Watkins, 1989). Many of these methods, e.g., dynamic programming and temporal-difference learning, build their estimates in part on the basis of other estimates. This may be worrisome because, in practice, the est...

1227 | Learning to predict by the methods of temporal differences - Sutton - 1988
Citation Context: ...and Function Approximation Reinforcement learning is a broad class of optimal control methods based on estimating value functions from experience, simulation, or search (Barto, Bradtke & Singh, 1995; Sutton, 1988; Watkins, 1989). Many of these methods, e.g., dynamic programming and temporal-difference learning, build their estimates in part on the basis of other estimates. This may be worrisome because, in pr...
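The bootstrapping this context describes, estimates built in part on other estimates, is visible in the one-line tabular TD(0) backup. This is a generic illustration, not code from the paper; the chain task and step sizes are assumptions.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) backup: move V[s] toward the target r + gamma * V[s_next].

    The target itself contains another estimate, V[s_next] -- this is the
    bootstrapping that the cited passage describes.
    """
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V

# Two-state chain: state 0 transitions to terminal state 1 with reward 1.
V = {0: 0.0, 1: 0.0}
for _ in range(100):
    td0_update(V, 0, 1.0, 1)
# V[0] approaches the target r + gamma * V[1] = 1.0
```

When the target's own estimate (here `V[s_next]`) is produced by a function approximator rather than a table, errors can feed back on themselves, which is exactly the stability concern the surrounding citations debate.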

368 | Practical issues in temporal difference learning - Tesauro - 1992
Citation Context: ...iklis & Van Roy, 1994) and in practice (Boyan & Moore, 1995). On the other hand, other methods have been proven stable in theory (Sutton, 1988; Dayan, 1992) and very effective in practice (Lin, 1991; Tesauro, 1992; Zhang & Dietterich, 1995; Crites & Barto, 1996). What are the key requirements of a method or task in order to obtain good performance? The experiments in this paper are part of narrowing the answer...

311 | Robot Dynamics and Control - Spong, Vidyasagar - 2004

287 | On-line Q-learning using connectionist systems - Rummery, Niranjan - 1994

279 | Improving elevator performance using reinforcement learning - Crites, Barto - 1996

276 | Self-Improving Reactive Agents Based On Reinforcement Learning, Planning and Teaching - Lin - 1992

251 | Generalization in reinforcement learning: Safely approximating the value function - Boyan, Moore - 1995

246 | Temporal credit assignment in reinforcement learning - Sutton - 1984

237 | Residual Algorithms: Reinforcement Learning with Function Approximation - Baird - 1995
Citation Context: ...sed as the targets for other estimates, it seems possible that the ultimate result might be very poor estimates, or even divergence. Indeed some such methods have been shown to be unstable in theory (Baird, 1995; Gordon, 1995; Tsitsiklis & Van Roy, 1994) and in practice (Boyan & Moore, 1995). On the other hand, other methods have been proven stable in theory (Sutton, 1988; Dayan, 1992) and very effective in ...

207 | Stable Function Approximation in Dynamic Programming - Gordon - 1995
Citation Context: ...rgets for other estimates, it seems possible that the ultimate result might be very poor estimates, or even divergence. Indeed some such methods have been shown to be unstable in theory (Baird, 1995; Gordon, 1995; Tsitsiklis & Van Roy, 1994) and in practice (Boyan & Moore, 1995). On the other hand, other methods have been proven stable in theory (Sutton, 1988; Dayan, 1992) and very effective in practice (Lin,...

187 | Reinforcement learning with replacing eligibility traces - Singh, Sutton - 1996

176 | Neuronlike Elements that Can Solve Difficult Learning Control Problems - Barto, Sutton, et al. - 1983

109 | Real-time learning and control using asynchronous dynamic programming (Technical Report 91-57) - Barto, et al. - 1991

65 | CMAC: An Associative Neural Network Alternative to Backpropagation, Proc. IEEE, Vol. 78 - Glanz, Kraft - 1990

61 | The convergence of TD(λ) for general λ - Dayan - 1992
Citation Context: ... be unstable in theory (Baird, 1995; Gordon, 1995; Tsitsiklis & Van Roy, 1994) and in practice (Boyan & Moore, 1995). On the other hand, other methods have been proven stable in theory (Sutton, 1988; Dayan, 1992) and very effective in practice (Lin, 1991; Tesauro, 1992; Zhang & Dietterich, 1995; Crites & Barto, 1996). What are the key requirements of a method or task in order to obtain good performance? The ...

49 | Online learning with random representations - Sutton, Whitehead - 1993

27 | CMAC-based adaptive critic self-learning control - Lin, Kim - 1991
Citation Context: ...1995; Tsitsiklis & Van Roy, 1994) and in practice (Boyan & Moore, 1995). On the other hand, other methods have been proven stable in theory (Sutton, 1988; Dayan, 1992) and very effective in practice (Lin, 1991; Tesauro, 1992; Zhang & Dietterich, 1995; Crites & Barto, 1996). What are the key requirements of a method or task in order to obtain good performance? The experiments in this paper are part of narro...

26 | Feature-based methods for large-scale dynamic programming - Tsitsiklis, Roy - 1996

22 | Counter-Example to Temporal Differences Learning - Bertsekas - 1994
Citation Context: ...ods and as in TD(λ) with λ = 1, or to learn on the basis of interim estimates, as in TD(λ) with λ < 1. Theoretically, the former has asymptotic advantages when function approximators are used (Dayan, 1992; Bertsekas, 1995), but empirically the latter is thought to achieve better learning rates (Sutton, 1988). However, hitherto this question has not been put to an empirical test using function approximators. Figures 6 ...
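The λ = 1 versus λ < 1 distinction in this context can be made concrete with accumulating eligibility traces: at λ = 1 every earlier state in an episode receives the full outcome-based correction, while smaller λ leans on interim estimates. A generic tabular sketch (not the paper's experiment code; the chain task and step size are assumptions):

```python
def td_lambda_episode(V, episode, alpha=0.1, gamma=1.0, lam=0.9):
    """Run one episode of tabular TD(lambda) with accumulating traces.

    episode: list of (state, reward, next_state) transitions.
    """
    e = {s: 0.0 for s in V}            # eligibility traces, one per state
    for s, r, s_next in episode:
        delta = r + gamma * V[s_next] - V[s]
        e[s] += 1.0                    # accumulate trace on the visited state
        for st in V:
            V[st] += alpha * delta * e[st]
            e[st] *= gamma * lam       # traces decay by gamma * lambda
    return V

# Two-step chain 0 -> 1 -> 2 with a single terminal reward of 1.
episode = [(0, 0.0, 1), (1, 1.0, 2)]
V_mc = td_lambda_episode({0: 0.0, 1: 0.0, 2: 0.0}, list(episode), lam=1.0)
V_td = td_lambda_episode({0: 0.0, 1: 0.0, 2: 0.0}, list(episode), lam=0.0)
# With lam=1 the reward propagates all the way back to state 0 within one
# episode; with lam=0 only the immediately preceding state moves.
```

This is the mechanism behind the empirical question raised here: whether full actual-outcome credit (λ = 1) or bootstrapped interim estimates (λ < 1) learn faster under function approximation.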

18 | Reinforcement Learning for planning and Control - Dean, Basye, et al. - 1993

13 | Swinging up the acrobot: An example of intelligent control - DeJong, Spong - 1994
Citation Context: ...mented with a larger and more difficult task not attempted by Boyan and Moore. The acrobot is a two-link under-actuated robot (Figure 5) roughly analogous to a gymnast swinging on a highbar (DeJong & Spong, 1994; Spong & Vidyasagar, 1989). The first joint (corresponding to the gymnast's hands on the bar) cannot exert torque, but the second joint (corresponding to the gymnast bending at the waist) can. The ob...

10 | Modular On-line Function Approximation for Scaling up Reinforcement Learning - Tham - 1994
Citation Context: ...ntinuous state space, we combined it with a sparse, coarse-coded function approximator known as the CMAC (Albus, 1980; Miller, Gordon & Kraft, 1990; Watkins, 1989; Lin & Kim, 1991; Dean et al., 1992; Tham, 1994). A CMAC uses multiple overlapping tilings of the state space to produce a feature representation for a final linear mapping where all the learning takes place. See Figure 2. The overall effect is mu...
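The multiple-overlapping-tilings idea this context describes can be sketched as follows: several offset grid tilings over a 2-D state in [0, 1)² each contribute one active binary feature, and the value is a linear (here purely additive) function of those features. The tiling count, resolution, and offset scheme below are illustrative choices, not the paper's parameters.

```python
import numpy as np

def active_tiles(x, y, n_tilings=4, tiles_per_dim=8):
    """Return one active tile index per tiling for a point in [0, 1)^2."""
    indices = []
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_dim)   # shift each tiling slightly
        col = int((x + offset) * tiles_per_dim) % tiles_per_dim
        row = int((y + offset) * tiles_per_dim) % tiles_per_dim
        base = t * tiles_per_dim * tiles_per_dim   # disjoint range per tiling
        indices.append(base + row * tiles_per_dim + col)
    return indices

def value(w, x, y):
    """Linear value function: the sum of the weights of the active tiles."""
    return sum(w[i] for i in active_tiles(x, y))

w = np.zeros(4 * 8 * 8)           # one weight per tile, across all tilings
feats = active_tiles(0.37, 0.52)  # 4 active features, one per tiling
```

Because nearby points fall into the same cell in most tilings but different cells in a few, learning at one point generalizes to its neighbors with a resolution finer than any single tiling — the "sparse coarse coding" of the paper's title.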