## Reinforcement Learning with Replacing Eligibility Traces (1996)

Venue: Machine Learning

Citations: 188 (11 self)

### BibTeX

@ARTICLE{Singh96reinforcementlearning,
  author  = {Satinder Singh and Richard S. Sutton},
  title   = {Reinforcement Learning with Replacing Eligibility Traces},
  journal = {Machine Learning},
  year    = {1996},
  pages   = {123--158}
}

### Abstract

The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it theoretically, and show that it results in faster, more reliable learning than the conventional trace. Both kinds of trace assign credit to prior events according to how recently they occurred, but only the conventional trace gives greater credit to repeated events. Our analysis is for conventional and replace-trace versions of the offline TD(1) algorithm applied to undiscounted absorbing Markov chains. First, we show that these methods converge under repeated presentations of the training set to the same predictions as two well known Monte Carlo methods. We then analyze the relative efficiency of the two Monte Carlo methods. We show that the method corresponding to conventional TD is biased, whereas the method corresponding to replace-trace TD is unbiased. In addition, we show that t...
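The distinction the abstract draws can be made concrete. Below is a minimal sketch (not code from the paper) of the two trace update rules on a short state sequence in which state s is visited twice before termination; the values of gamma and lam are illustrative, with gamma = 1 matching the paper's undiscounted setting:

```python
# Accumulating (conventional) vs. replacing eligibility traces.
# gamma: discount factor; lam: trace-decay parameter (illustrative values).
gamma, lam = 1.0, 0.9
states = ["s", "s", "t"]       # state s is revisited before reaching t
index = {"s": 0, "t": 1}

acc = [0.0, 0.0]               # accumulating (conventional) traces
rep = [0.0, 0.0]               # replacing traces

for x in states:
    i = index[x]
    # decay all traces each step
    acc = [gamma * lam * e for e in acc]
    rep = [gamma * lam * e for e in rep]
    # conventional rule: increment the visited state's trace
    acc[i] += 1.0
    # replacing rule: reset the visited state's trace to 1
    rep[i] = 1.0
```

After the repeated visit to s, the accumulating trace for s exceeds 1 (it ends at 0.9 × 1.9 = 1.71) while the replacing trace never exceeds 1 (it ends at 0.9), illustrating the abstract's point that only the conventional trace gives extra credit to repeated events.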

### Citations

2613 | Applied Dynamic Programming - Bellman, Dreyfus - 1962 |
Citation Context: ...every-visit MC as a function of the number of trials, n. The mean squared error (MSE) is Bias² + Var. 3.4. Bias Results. First consider the true value of state s in Figure 3. From Bellman's equation (Bellman, 1957): V(s) = P_s(R_s + V(s)) + P_T(R_T + V_T), or (1 − P_s)V(s) = P_s R_s + P_T R_T, and therefore V(s) = (P_s / P_T) R_s + R_T. Theorem 6: First-visit MC is unbiased, i.e., Bias_F^n(s) = V(s) − ...
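The closed form cited in this context can be checked numerically. A minimal sketch, using illustrative values of P_s, R_s, and R_T (not the paper's), for the simple absorbing chain in which state s returns to itself with probability P_s (reward R_s) and terminates with probability P_T = 1 − P_s (reward R_T, with V_T = 0):

```python
# Verify V(s) = (P_s / P_T) * R_s + R_T by iterating Bellman's equation
# V = P_s * (R_s + V) + P_T * R_T to its fixed point.
# P_s, R_s, R_T are illustrative values, not taken from the paper.
P_s, R_s, R_T = 0.6, 1.0, 2.0
P_T = 1.0 - P_s

closed_form = (P_s / P_T) * R_s + R_T   # 1.5 + 2.0 = 3.5 here

V = 0.0
for _ in range(200):                    # contraction with rate P_s < 1
    V = P_s * (R_s + V) + P_T * R_T
```

The iteration converges geometrically at rate P_s, so after 200 sweeps V agrees with the closed form to well beyond machine-relevant precision.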

1228 | Learning to predict by the method of temporal differences - Sutton - 1988 |
Citation Context: ...kov chain, CMAC. 1. Eligibility Traces. Two fundamental mechanisms have been used in reinforcement learning to handle delayed reward. One is temporal-difference (TD) learning, as in the TD(λ) algorithm (Sutton, 1988) and in Q-learning (Watkins, 1989). TD learning in effect constructs an internal reward signal that is less delayed than the original, external one. However, TD methods can eliminate the delay comple...

413 | Simulation and the Monte Carlo Method - Rubinstein - 1981 |
Citation Context: ...the expected return. This suggests that one might estimate a state's value simply by averaging all the returns that follow it. This is what is classically done in Monte Carlo (MC) prediction methods (Rubinstein, 1981; Curtiss, 1954; Wasow, 1952; Barto & Duff, 1994). We distinguish two specific algorithms: Every-visit MC: Estimate the value of a state as the average of the returns that have followed all visits to...
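The two estimators named in this context are easy to state in code. A minimal sketch on hand-made episodes (the episodes and state names are illustrative, not the paper's); returns are undiscounted, matching the paper's setting:

```python
# First-visit vs. every-visit Monte Carlo value estimates for one state.
# Each episode is a list of (state, reward) steps, in time order.
def mc_estimates(episodes, target):
    first_returns, every_returns = [], []
    for episode in episodes:
        rewards = [r for _, r in episode]
        seen_first = False
        for t, (s, _) in enumerate(episode):
            if s != target:
                continue
            g = sum(rewards[t:])          # undiscounted return from time t
            every_returns.append(g)       # every-visit: count every occurrence
            if not seen_first:
                first_returns.append(g)   # first-visit: first occurrence only
                seen_first = True
    avg = lambda xs: sum(xs) / len(xs)
    return avg(first_returns), avg(every_returns)

episodes = [
    [("s", 1.0), ("s", 0.0), ("t", 2.0)],   # s visited twice
    [("s", 0.0), ("t", 4.0)],               # s visited once
]
fv, ev = mc_estimates(episodes, "s")
```

On these two episodes the estimates differ (first-visit averages the returns 3 and 4; every-visit averages 3, 2, and 4), which is exactly the distinction the paper's bias analysis turns on.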

368 | Practical issues in temporal difference learning - Tesauro - 1992 |
Citation Context: ...ndling delay is the eligibility trace.¹ Introduced by Klopf (1972), eligibility traces have been used in a variety of reinforcement learning systems (e.g., Barto, Sutton & Anderson, 1983; Lin, 1992; Tesauro, 1992; Peng & Williams, 1994). Systematic empirical studies of eligibility traces in conjunction with TD methods were made by Sutton (1984), and theoretical results have been obtained by several authors (e...

287 | On-line Q-learning using connectionist systems - Rummery, Niranjan - 1994 |

279 | Escaping brittleness: The possibilities of general-purpose learning algorithms applied to parallel rule-based systems - Holland - 1986 |
Citation Context: ...number of features, m, present on a typical time step. Here m is 5. ... (Watkins, 1989) and to various simplified forms of the bucket brigade (Holland, 1986; Wilson, to appear). It is also identical to the TD(λ) algorithm applied to state-action pairs rather than to states.⁶ The mountain-car task has a continuous two-dimensional state space with an infin...

276 | Self-Improving Reactive Agents Based On Reinforcement Learning, Planning and Teaching - Lin - 1992 |
Citation Context: ...used for handling delay is the eligibility trace.¹ Introduced by Klopf (1972), eligibility traces have been used in a variety of reinforcement learning systems (e.g., Barto, Sutton & Anderson, 1983; Lin, 1992; Tesauro, 1992; Peng & Williams, 1994). Systematic empirical studies of eligibility traces in conjunction with TD methods were made by Sutton (1984), and theoretical results have been obtained by sev...

246 | Temporal credit assignment in reinforcement learning - Sutton - 1984 |

207 | Convergence of stochastic iterative dynamic programming algorithms - Jaakkola, Jordan, et al. - 1994 |

176 | Neuronlike Elements that Can Solve Difficult Learning Control Problems - Barto, Sutton, et al. - 1983 |

153 | Asynchronous stochastic approximation and Q-learning - Tsitsiklis - 1994 |
Citation Context: ...dies of eligibility traces in conjunction with TD methods were made by Sutton (1984), and theoretical results have been obtained by several authors (e.g., Dayan, 1992; Jaakkola, Jordan & Singh, 1994; Tsitsiklis, 1994; Dayan & Sejnowski, 1994; Sutton & Singh, 1994). The idea behind all eligibility traces is very simple. Each time a state is visited it initiates a short-term memory process, a trace, which then deca...
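The trace mechanism described here is what drives the TD(λ) prediction update. A minimal sketch of one online tabular episode (the standard update, not code from the paper; states, rewards, and step sizes below are illustrative), with a `replacing` flag switching between the two trace rules the paper compares:

```python
# One online tabular TD(lambda) episode using eligibility traces.
# Each step is (state, reward, next_state); next_state=None means terminal.
def td_lambda_episode(V, episode, alpha=0.1, gamma=1.0, lam=0.9,
                      replacing=True):
    e = {s: 0.0 for s in V}                 # eligibility traces
    for s, r, s_next in episode:
        v_next = V.get(s_next, 0.0)         # terminal state has value 0
        delta = r + gamma * v_next - V[s]   # TD error at this step
        for k in e:
            e[k] *= gamma * lam             # decay every trace
        e[s] = 1.0 if replacing else e[s] + 1.0
        for k in e:
            V[k] += alpha * delta * e[k]    # credit earlier states via traces
    return V

V = {"a": 0.0, "b": 0.0}
td_lambda_episode(V, [("a", 0.0, "b"), ("b", 1.0, None)])
```

The decaying trace is what lets the single TD error at the rewarded final step propagate credit back to state "a", scaled by γλ, without waiting for the full return.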

150 | Reinforcement learning algorithm for partially observable Markov decision process - Jaakkola, Singh, et al. - 1994 |

145 | Computer Algorithms: Introduction to Design and Analysis - Baase - 1988 |
Citation Context: ...this reason, this estimate is sometimes also referred to as the certainty equivalent estimate (e.g., Kumar and Varaiya, 1986). 5. In theory it is possible to get this down to O(n^2.376) operations (Baase, 1988), but, even if practical, this is still far too complex for many applications. 6. Although this algorithm is indeed identical to TD(λ), the theoretical results for TD(λ) on stationary prediction proble...

88 | Incremental multi-step Q-learning - Peng, Williams - 1996 |
Citation Context: ...igibility trace.¹ Introduced by Klopf (1972), eligibility traces have been used in a variety of reinforcement learning systems (e.g., Barto, Sutton & Anderson, 1983; Lin, 1992; Tesauro, 1992; Peng & Williams, 1994). Systematic empirical studies of eligibility traces in conjunction with TD methods were made by Sutton (1984), and theoretical results have been obtained by several authors (e.g., Dayan, 1992; Jaakk...

78 | Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued state-spaces - Moore - 1991 |

65 | CMAC: An Associative Neural Network Alternative to Backpropagation, Proc. IEEE, Vol. 78 - Glanz, Kraft - 1990 |

63 | TD models: Modeling the world at a mixture of time scales - Sutton - 1995 |
Citation Context: ...rminates with the first position value that exceeds p_{t+1} > 0.5. Notes. 1. Arguably, yet a third mechanism for managing delayed reward is to change representations or world models (e.g., Dayan, 1993; Sutton, 1995). 2. In some previous work (e.g., Sutton & Barto, 1987, 1990) the traces were normalized by a factor of 1 − γ, which is equivalent to replacing the "1" in these equations by 1 − γ. In thi...

61 | The convergence of TD(λ) for general λ - Dayan - 1992 |
Citation Context: ...g & Williams, 1994). Systematic empirical studies of eligibility traces in conjunction with TD methods were made by Sutton (1984), and theoretical results have been obtained by several authors (e.g., Dayan, 1992; Jaakkola, Jordan & Singh, 1994; Tsitsiklis, 1994; Dayan & Sejnowski, 1994; Sutton & Singh, 1994). The idea behind all eligibility traces is very simple. Each time a state is visited it initiates a s...

55 | TD(λ) converges with probability 1 - Dayan, Sejnowski, et al. - 1994 |

49 | Online learning with random representations - Sutton, Whitehead - 1993 |

42 | A temporal-difference model of classical conditioning - Sutton, Barto - 1987 |

28 | Improving generalization for temporal difference learning: The successor representation - Dayan - 1993 |

20 | Monte Carlo matrix inversion and reinforcement learning - Barto, Duff - 1994 |

14 | On step-size and bias in temporal-difference learning - Sutton, Singh |

12 | Temporal-difference methods and Markov models - Barnard - 1993 |
Citation Context: ...C follows immediately from prior results. Batch TD(1) is a gradient-descent procedure known to converge to the estimate with minimum mean squared error on the training set (Sutton, 1988; Dayan, 1992; Barnard, 1993). In the case of discrete states, the minimum-MSE estimate for a state is the sample average of the returns from every visit to that state in the training set, and thus it is the same as the estimate...

12 | A theoretical comparison of the efficiencies of two classical methods and a Monte Carlo method for computing one component of the solution of a set of linear algebraic equations - Curtiss - 1954 |
Citation Context: ...n. This suggests that one might estimate a state's value simply by averaging all the returns that follow it. This is what is classically done in Monte Carlo (MC) prediction methods (Rubinstein, 1981; Curtiss, 1954; Wasow, 1952; Barto & Duff, 1994). We distinguish two specific algorithms: Every-visit MC: Estimate the value of a state as the average of the returns that have followed all visits to the state. Fir...

12 | Efficient Dynamic Programming-Based Learning for Control - Peng - 1993 |
Citation Context: ...briefly establish the asymptotic correctness of the TD methods. The asymptotic convergence of accumulate TD(λ) for general λ is well known (Dayan, 1992; Jaakkola, Jordan & Singh, 1994; Tsitsiklis, 1994; Peng, 1993). The main results appear to carry over to the replace-trace case with minimal modifications. In particular: Theorem 4: Offline (online) replace TD(λ) converges to the desired value function w.p.1 und...

12 | A note on the inversion of matrices by random walks - Wasow - 1952 |
Citation Context: ...s that one might estimate a state's value simply by averaging all the returns that follow it. This is what is classically done in Monte Carlo (MC) prediction methods (Rubinstein, 1981; Curtiss, 1954; Wasow, 1952; Barto & Duff, 1994). We distinguish two specific algorithms: Every-visit MC: Estimate the value of a state as the average of the returns that have followed all visits to the state. First-visit MC:...

11 | Time-derivative Models of Pavlovian Conditioning - Barto, Sutton - 1990 |

10 | Brain function and adaptive systems---A heterostatic theory - Klopf - 1972 |