## Learning and Sequential Decision Making (1989)

Venue: Learning and Computational Neuroscience

Citations: 200 (11 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Barto89learningand,
  author    = {Andrew G. Barto and R. S. Sutton and C. J. C. H. Watkins},
  title     = {Learning and Sequential Decision Making},
  booktitle = {Learning and Computational Neuroscience},
  year      = {1989},
  pages     = {539--602},
  publisher = {MIT Press}
}
```

### Abstract

In this report we show how the class of adaptive prediction methods that Sutton called "temporal difference," or TD, methods are related to the theory of sequential decision making. TD methods have been used as "adaptive critics" in connectionist learning systems, and have been proposed as models of animal learning in classical conditioning experiments. Here we relate TD methods to decision tasks formulated in terms of a stochastic dynamical system whose behavior unfolds over time under the influence of a decision maker's actions. Strategies are sought for selecting actions so as to maximize a measure of long-term payoff gain. Mathematically, tasks such as this can be formulated as Markovian decision problems, and numerous methods have been proposed for learning how to solve such problems. We show how a TD method can be understood as a novel synthesis of concepts from the theory of stochastic dynamic programming, which comprises the standard method for solving such tasks when a model of the dynamical system is available, and the theory of parameter estimation, which provides the appropriate context for studying learning rules in the form of equations for updating associative strengths in behavioral models, or connection weights in connectionist networks. Because this report is oriented primarily toward the non-engineer interested in animal learning, it presents tutorials on stochastic sequential decision tasks, stochastic dynamic programming, and parameter estimation.
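As a rough illustration of the synthesis the abstract describes, the sketch below estimates an evaluation function by a TD(0)-style update for a fixed policy. The three-state chain, payoffs, and parameter values are invented for this example and are not taken from the report.

```python
# Minimal TD(0)-style sketch: estimate the evaluation function V for a
# fixed policy on a toy deterministic chain 0 -> 1 -> 2 -> terminal.
# The chain, payoffs, and parameters here are invented for illustration.
def td0(transitions, payoffs, gamma=0.9, alpha=0.1, episodes=500):
    V = {s: 0.0 for s in transitions}
    V[None] = 0.0                              # terminal state has value 0
    for _ in range(episodes):
        s = 0                                  # each episode starts at state 0
        while s is not None:
            s_next = transitions[s]            # deterministic next state
            r = payoffs[s]
            # TD update: move V(s) toward the one-step target r + gamma*V(s')
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V

chain = {0: 1, 1: 2, 2: None}
payoff = {0: 0.0, 1: 0.0, 2: 1.0}
V = td0(chain, payoff)
# V(2) approaches 1, V(1) approaches gamma, V(0) approaches gamma**2
```

Because the only payoff arrives at the end of the chain, the update propagates the prediction backward one step per visit, which is the behavior the report analyzes.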

### Citations

4171 |
Pattern Classification and Scene Analysis
- Duda, Hart, et al.
- 2001
Citation Context ...D procedure is a method for estimating an evaluation function by means of parameter estimation. Methods for parameter estimation are central to the fields of pattern classification (e.g., Duda and Hart [16], Sklansky and Wassel [58]), adaptive signal processing (e.g., Widrow and Stearns [73]), and adaptive control (e.g., Goodwin and Sin [20]), as well as the field of connectionist modeling (e.g., refs. [...

3012 |
Learning internal representations by error propagation
- Rumelhart, Hinton, et al.
- 1986
Citation Context ...vative of f with respect to the ith dimension evaluated at the point x. This vector points in the direction of the steepest increase of the surface at the point x. 19 The error back--propagation method [52] is derived by computing this error gradient for a particular class of models in the form of layered connectionist networks. 27 objective is to adjust the parameter values of the linear model in order...

2891 |
Dynamic Programming
- Bellman
- 1957
Citation Context ... size (as determined by its horizon, number of states, and number of actions) that it is not feasible to perform this search for large tasks. Dynamic programming, a term introduced in 1957 by Bellman [10], consists of particular methods for organizing the search under the assumption that a complete model of the decision task is available. Although these methods are much more efficient than explicit ex...
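The dynamic-programming search this excerpt describes can be sketched as value iteration, which repeatedly backs up each state through its best action. The two-state decision task below is invented for illustration and is not from the report.

```python
# Value iteration on a toy Markovian decision problem (invented example).
# P[s][a] is a list of (probability, next_state); R[s][a] is expected payoff.
def value_iteration(P, R, gamma=0.9, tol=1e-9):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Back up the state's value through the best available action
            best = max(R[s][a] + gamma * sum(p * V[t] for p, t in P[s][a])
                       for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

P = {'s0': {'stay': [(1.0, 's0')], 'go': [(1.0, 's1')]},
     's1': {'stay': [(1.0, 's1')]}}
R = {'s0': {'stay': 0.0, 'go': 1.0}, 's1': {'stay': 0.0}}
V = value_iteration(P, R)
# The optimal evaluation gives V(s0) = 1 (take 'go'), V(s1) = 0
```

Note that, as the excerpt says, this search assumes a complete model (`P` and `R`) of the decision task; the report's TD procedure is concerned with the case where no such model is available.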

727 |
Heuristics: Intelligent Search Strategies for Computer Problem Solving
- Pearl
- 1984
Citation Context ...cial intelligence research concerns search strategies of this type, called "heuristic search" strategies, although their objective is usually not to maximize a measure of cumulative payoff. See Pearl [44]. 5 Most of the methods for the adaptive control of Markov processes described in the engineering literature are model--based. Examples are provided by Borkar and Varaiya [12], El-Fattah [17], Kumar a...

664 |
A theory of Pavlovian conditioning: Variations in effectiveness of reinforcement and non-reinforcement
- RESCORLA, WAGNER
- 1972
Citation Context ... the exchange of ideas between researchers studying natural learning and those studying synthetic learning. Sutton and Barto [63] pointed out that the Rescorla--Wagner model of classical conditioning [47] is identical (with some minor caveats) to the equation presented by Widrow and Hoff [72] as a procedure for approximating solutions of systems of linear equations. As a behavioral model, this equatio...
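The parallel this excerpt draws can be made concrete with the Widrow-Hoff (LMS) update, in which associative strengths change in proportion to prediction error. The compound-stimulus numbers below are invented for illustration.

```python
# Widrow-Hoff / LMS rule: the update the Rescorla-Wagner model parallels.
# Weights (associative strengths) move in proportion to prediction error.
def lms_step(w, x, target, alpha=0.1):
    error = target - sum(wi * xi for wi, xi in zip(w, x))
    return [wi + alpha * error * xi for wi, xi in zip(w, x)], error

# Invented example: two stimuli always presented together and reinforced.
w = [0.0, 0.0]
for _ in range(200):
    w, err = lms_step(w, [1.0, 1.0], 1.0)
# The two strengths come to share the prediction: each approaches 0.5
```

Read behaviorally, the shared prediction is the Rescorla-Wagner account of cue competition: each stimulus acquires only part of the total associative strength.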

563 |
Dynamic Programming and Markov Processes
- Howard
- 1960
Citation Context ...ructing an optimal policy that is more closely related to the learning method we describe in Section 7 than is value iteration. This method is called the policy improvement method or policy iteration [26, 51] because it generates a sequence of policies, each of which is an improvement over its predecessor. Although one needs to know the optimal evaluation function in order to define an optimal policy, the...
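Policy iteration, as the excerpt describes it, alternates evaluating the current policy with greedy improvement until the policy stops changing. This sketch and its two-state task are invented for illustration, not taken from the report.

```python
# Policy iteration sketch (invented toy task): alternate policy evaluation
# with greedy policy improvement until the policy stops changing.
def policy_iteration(P, R, gamma=0.9, sweeps=200):
    pi = {s: next(iter(P[s])) for s in P}      # arbitrary initial policy
    while True:
        # Evaluate the current policy by repeated expected backups
        V = {s: 0.0 for s in P}
        for _ in range(sweeps):
            V = {s: R[s][pi[s]] + gamma * sum(p * V[t] for p, t in P[s][pi[s]])
                 for s in P}
        # Improve: act greedily with respect to the evaluation function
        new_pi = {s: max(P[s], key=lambda a: R[s][a] +
                         gamma * sum(p * V[t] for p, t in P[s][a]))
                  for s in P}
        if new_pi == pi:
            return pi, V
        pi = new_pi

P = {'s0': {'stay': [(1.0, 's0')], 'go': [(1.0, 's1')]},
     's1': {'stay': [(1.0, 's1')]}}
R = {'s0': {'stay': 0.0, 'go': 1.0}, 's1': {'stay': 0.0}}
pi, V = policy_iteration(P, R)
```

Each improvement step is at least as good as its predecessor, which is the property the excerpt highlights; this model-based version is the baseline against which the report's model-free TD procedure is compared.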

433 |
Theory and Practice of Recursive Identification
- Ljung
- 1983
Citation Context ...heory of the LMS rule, and other parameter estimation methods, is very well--developed but beyond the scope of the present report. Theoretical treatments of parameter estimation can be found in refs. [16, 20, 33, 58, 73]. This theory specifies conditions under which a parameter estimation method converges to a final estimate and what criterion of best fit the final estimate satisfies. To obtain these theoretical resu...

290 |
Escaping brittleness: The possibilities of general-purpose learning algorithms applied to parallel rule-based systems
- Holland
- 1986
Citation Context ...adaptive prediction. A number of other researchers have independently developed and experimented with methods that use TD principles or closely related ones. The "bucket brigade" algorithm of Holland [24] is closely related to the TD procedure as discussed by Sutton [61] and Liepins, Hilliard, and Palmer [32]. Booker's [11] learning system employs a TD procedure, as does Hampson's [22] proposed learni...

281 |
Introduction to Stochastic Dynamic Programming
- Ross
- 1983
Citation Context ...g outlined in this report also has the potential for fostering communication between animal learning theorists and in the TD procedure, although there has been much research on related problems (Ross [51] and Dreyfus and Law [15] provide good expositions of dynamic programming, and Footnotes 5 and 6 provide references to some of the related engineering research on the adaptive control of Markov proces...

268 |
Connectionist models and their properties
- Feldman, Ballard
- 1982
Citation Context ... and selecting the action whose number is largest. This is a kind of competition among the actions that can be implemented by connectionist networks with lateral inhibition (e.g., Feldman and Ballard [19]). An additional consideration in parameterizing policies is that some policy adjustment methods require stochastic policies instead of deterministic policies. In this case, one has to parameterize fu...

263 |
Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. Spartan
- Rosenblatt
- 1962
Citation Context ...laining the TD procedure. The 17 If one replaces the linear model of Equation 15 with the linear threshold decision rule given by Expression 16, one obtains the Perceptron learning rule of Rosenblatt [50], which is identical to Equation 21 except that the weighted sum v_t^T φ_t is replaced by the threshold of the weighted sum as defined by Expression 16. These correspondences are explained in more de...

253 |
Learning Automata: An Introduction
- Narendra, Thathachar
- 1989
Citation Context ...1], Wheeler and Narendra [71], and Witten [77]. Most of these rely on results about the collective behavior of stochastic learning automata and ergodic Markov chains (see also Narendra and Thathachar [43]). 12 in a manner depending on x and a. This sequence of events repeats for an indefinite number of time steps. Because we shall be concerned only with the expectation of the total amount of payoff ac...

243 |
Adaptive Filtering Prediction and Control
- Goodwin, Sin
- 1984
Citation Context ...Ross [51]. The second major part of the report, consisting of Section 6, is a tutorial on parameter estimation based on the view taken in the field of adaptive control as described by Goodwin and Sin [20]. Some of this material also appears in Barto [5]. The report's third major part, consisting of Section 7, shows how the TD procedure emerges as a synthesis of the ideas from dynamic programming and p...

198 | Steps towards Artificial Intelligence
- Minsky
- 1963
Citation Context ...y to arise later in the game. Using this method, it was possible to "assign credit" to moves that were instrumental in setting the stage for later moves that directly captured opponent pieces. Minsky [40, 41] discussed the credit assignment problem and methods similar to Samuel's in terms of connectionist networks and animal learning. Mendel and McLaren [38] discussed similar methods in the context of contro...

190 |
Neuronlike Elements That Can Solve Difficult Learning Control Problems
- Barto, Sutton, et al.
- 1983
Citation Context ... including the TD model of conditioning, are useful for adaptive prediction, and additional publications illustrate how TD methods can be used as components of synthetic learning systems (e.g., refs. [3, 9, 60]). Here, we restrict attention to a slightly simplified version of the TD model, which we call the TD procedure in this report. We show how the TD procedure is related to theoretical principles which ...

173 |
Stochastic Models of Learning
- Bush, Mosteller
- 1955
Citation Context ...een inspired by the theory of stochastic learning automata developed by cyberneticians and engineers [43, 66]. A precursor of this theory is the statistical learning theory developed by psychologists [13, 18]. Barto and Anandan [8], and Barto [6, 7] discuss a learning rule of this kind called the Associative Reward/Penalty, or A_R-P, rule. Sutton [60], Anderson [2, 3], and Gullapalli [21] describe the res...

139 |
Dynamic modeling in behavioral ecology
- Mangel, Clark
- 1988
Citation Context ...as been used extensively in behavioral ecology for the analysis of animal behavior (see, for example, Krebs, Kacelnik, and Taylor [29], Houston, Clark, McNamara, and Mangel [25], and Mangel and Clark [36]). In these studies dynamic programming is used to determine decision strategies meeting certain definitions of optimality to which animal behavior is compared. Behavioral ecologists do not suggest th...

127 |
Boxes: An experiment in adaptive control
- Michie, Chambers
- 1968
Citation Context ... Barto, Sutton, and Anderson [9], where it was incorporated into a neuron--like unit called the "adaptive critic element." This system, which was inspired by the "Boxes" system of Michie and Chambers [39], was further studied by Selfridge, Sutton, and Barto [57] and Anderson [2, 3]. Since then, Sutton [61] has extended the theory and has proved a number of theoretical results. His results suggest that...

120 | Neurocomputing: Foundations of Research - Anderson, Rosenfeld - 1988

80 |
Contemporary Animal Learning Theory
- Dickinson
- 1980
Citation Context ...ct [65], it is not our aim to argue for the validity of this view of animal instrumental learning, which probably involves more than can be accounted for by an S--R model (see, for example, Dickinson [14] and Rescorla [46]). In Section 7 we provide an example of how the TD procedure can be used with this kind of reinforcement learning method, but the TD procedure can also be combined with model--based...

80 |
The utility driven dynamic error propagation network
- Robinson, Fallside
- 1987
Citation Context ...ions for state--action pairs instead of just states (Watkins [67]); the computation of evaluation gradients using the model of the evaluation function (Munro [42], Werbos [69, 70], Robinson and Fallside [49], and Williams [76]); and the use of systematic, instead of stochastic, variation in activity. 40 a stochastic policy. A stochastic policy selects actions probabilistically so that over time many acti...

79 |
Intelligent behavior as an adaptation to the task environment
- Booker
- 1982
Citation Context ... principles or closely related ones. The "bucket brigade" algorithm of Holland [24] is closely related to the TD procedure as discussed by Sutton [61] and Liepins, Hilliard, and Palmer [32]. Booker's [11] learning system employs a TD procedure, as does Hampson's [22] proposed learning system, which is very similar to the one we discuss here. Other related procedures have been proposed as models of cla...

77 | Strategy learning with multilayer connectionist representations
- Anderson
- 1987
Citation Context ... including the TD model of conditioning, are useful for adaptive prediction, and additional publications illustrate how TD methods can be used as components of synthetic learning systems (e.g., refs. [3, 9, 60]). Here, we restrict attention to a slightly simplified version of the TD model, which we call the TD procedure in this report. We show how the TD procedure is related to theoretical principles which ...

77 |
The art and theory of dynamic programming
- Dreyfus, Law
- 1977
Citation Context ... also has the potential for fostering communication between animal learning theorists and in the TD procedure, although there has been much research on related problems (Ross [51] and Dreyfus and Law [15] provide good expositions of dynamic programming, and Footnotes 5 and 6 provide references to some of the related engineering research on the adaptive control of Markov processes). To the best of our ...

76 |
Pattern recognizing stochastic learning automata
- Barto, Anandan
- 1985
Citation Context ...f stochastic learning automata developed by cyberneticians and engineers [43, 66]. A precursor of this theory is the statistical learning theory developed by psychologists [13, 18]. Barto and Anandan [8], and Barto [6, 7] discuss a learning rule of this kind called the Associative Reward/Penalty, or A_R-P, rule. Sutton [60], Anderson [2, 3], and Gullapalli [21] describe the results of computer simula...

73 |
Toward a statistical theory of learning
- Estes
- 1950
Citation Context ...een inspired by the theory of stochastic learning automata developed by cyberneticians and engineers [43, 66]. A precursor of this theory is the statistical learning theory developed by psychologists [13, 18]. Barto and Anandan [8], and Barto [6, 7] discuss a learning rule of this kind called the Associative Reward/Penalty, or A_R-P, rule. Sutton [60], Anderson [2, 3], and Gullapalli [21] describe the res...

68 |
Parallel models of associative memory
- Hinton, Anderson
- 1981
Citation Context ...signal processing and pattern classification, and are currently playing a major role in the emerging field of connectionist modeling (see, for example, Anderson and Rosenfeld [4], Hinton and Anderson [23], McClelland and Rumelhart [37], and Rumelhart and McClelland [37]). The connection between the experimental and computational literatures due to the parallel between the Rescorla--Wagner model and th...

56 |
Learning by statistical cooperation of self-interested neuron-like computing elements
- Barto
- 1985
Citation Context ...ntical to Equation 21 except that the weighted sum v_t^T φ_t is replaced by the threshold of the weighted sum as defined by Expression 16. These correspondences are explained in more detail in refs. [6, 16, 63]. 18 A function, f, from an n--dimensional space to the real numbers can be viewed as a surface. If x is a point in the n--dimensional space, then the gradient of f at x is the vector (∂f/∂x_1(x), ...

55 | Learning and Problem solving with multilayer connectionist systems
- Anderson
- 1986
Citation Context ...ike unit called the "adaptive critic element." This system, which was inspired by the "Boxes" system of Michie and Chambers [39], was further studied by Selfridge, Sutton, and Barto [57] and Anderson [2, 3]. Since then, Sutton [61] has extended the theory and has proved a number of theoretical results. His results suggest that TD procedures can have advantages over other methods for adaptive prediction....

52 | Connectionist learning for control: An overview
- Barto
- 1989
Citation Context ...nsisting of Section 6, is a tutorial on parameter estimation based on the view taken in the field of adaptive control as described by Goodwin and Sin [20]. Some of this material also appears in Barto [5]. The report's third major part, consisting of Section 7, shows how the TD procedure emerges as a synthesis of the ideas from dynamic programming and parameter estimation covered in the first two part...

33 |
Theory of Neural-Analog Reinforcement Systems and Its Application to the Brain-Model Problem
- Minsky
- 1954
Citation Context ...y to arise later in the game. Using this method, it was possible to "assign credit" to moves that were instrumental in setting the stage for later moves that directly captured opponent pieces. Minsky [40, 41] discussed the credit assignment problem and methods similar to Samuel's in terms of connectionist networks and animal learning. Mendel and McLaren [38] discussed similar methods in the context of contro...

27 |
A Dual Back-Propagation Scheme for Scalar Reward Learning
- Munro
- 1987
Citation Context ...port. These include the estimation of evaluations for state--action pairs instead of just states (Watkins [67]); the computation of evaluation gradients using the model of the evaluation function (Munro [42], Werbos [69, 70], Robinson and Fallside [49], and Williams [76]); and the use of systematic, instead of stochastic, variation in activity. 40 a stochastic policy. A stochastic policy selects actions ...

20 |
Test of optimal sampling by foraging great tits
- Krebs, Kacelnik, et al.
- 1978
Citation Context ...ited in refs. [64, 62]. 4 behavioral ecologists. Dynamic programming has been used extensively in behavioral ecology for the analysis of animal behavior (see, for example, Krebs, Kacelnik, and Taylor [29], Houston, Clark, McNamara, and Mangel [25], and Mangel and Clark [36]). In these studies dynamic programming is used to determine decision strategies meeting certain definitions of optimality to whic...

19 |
Estimation and control in Markov chains
- Mandl
- 1974
Citation Context ...lity that depends on the action a. We denote this probability P_xy(a). When this state transition occurs, the agent receives a payoff, denoted r, which is determined randomly and Poznyak [34], Mandl [35], Riordon [48], and Sato, Abe, and Takeda [56]. Most of these methods apply to the case in which return is the average payoff per--time--step and the underlying system is an ergodic Markov chain for e...

18 |
Adaptive control of Markov chains, I: Finite parameter set
- Borkar, Varaiya
- 1979
Citation Context ...lative payoff. See Pearl [44]. 5 Most of the methods for the adaptive control of Markov processes described in the engineering literature are model--based. Examples are provided by Borkar and Varaiya [12], El-Fattah [17], Kumar and Lin [30], Lyubchik 11 In this report our concern is with other approaches to learning how to solve sequential decision tasks, which we call direct approaches. Instead of le...

17 | Mechanisms of planning and problem solving in the brain
- Albus
- 1979
Citation Context ...and examples involving other representations are provided by Anderson's [3] use of layered networks in the pole--balancing problem and Watkins' [67] use of the representation method proposed by Albus [1] in a model of the cerebellum. 35 7.2 Learning an Optimal Decision Policy Section 7.1 addressed the problem of learning the evaluation function for a fixed policy in the absence of a model of the deci...

14 |
Reinforcement learning control and pattern recognition systems
- Mendel, McLaren
- 1970
Citation Context ...that directly captured opponent pieces. Minsky [40, 41] discussed the credit assignment problem and methods similar to Samuel's in terms of connectionist networks and animal learning. Mendel and McLaren [38] discussed similar methods in the context of control problems, and the learning method of Witten [77], presented in the context of Markov decision problems, is closely related to the method we describ...

10 |
From chemotaxis to cooperativity: Abstract exercises in neuronal learning strategies
- Barto
- 1989
Citation Context ...arning automata developed by cyberneticians and engineers [43, 66]. A precursor of this theory is the statistical learning theory developed by psychologists [13, 18]. Barto and Anandan [8], and Barto [6, 7] discuss a learning rule of this kind called the Associative Reward/Penalty, or A_R-P, rule. Sutton [60], Anderson [2, 3], and Gullapalli [21] describe the results of computer simulations of related m...

10 |
A neural model of adaptive behavior
- Hampson
- 1983
Citation Context ...ithm of Holland [24] is closely related to the TD procedure as discussed by Sutton [61] and Liepins, Hilliard, and Palmer [32]. Booker's [11] learning system employs a TD procedure, as does Hampson's [22] proposed learning system, which is very similar to the one we discuss here. Other related procedures have been proposed as models of classical conditioning in publications cited in refs. [64, 62]. 4 ...

10 |
The Hedonistic Neuron: A Theory of Memory
- Klopf
- 1982
Citation Context ...and abstracted away from the domain of game playing. This work began with the interest of Sutton and Barto in classical conditioning and the exploration of Klopf's idea of "generalized reinforcement" [27, 28], which emphasized the importance of sequentiality in a neuronal model of learning. The adaptive heuristic critic algorithm was used (although in slightly different form) in the reinforcement--learnin...

10 |
Optimal adaptive controllers for unknown Markov chains
- Kumar, Lin
- 1982
Citation Context ...t of the methods for the adaptive control of Markov processes described in the engineering literature are model--based. Examples are provided by Borkar and Varaiya [12], El-Fattah [17], Kumar and Lin [30], Lyubchik 11 In this report our concern is with other approaches to learning how to solve sequential decision tasks, which we call direct approaches. Instead of learning a model of the decision task,...

5 |
A stochastic algorithm for learning real-valued functions via reinforcement feedback
- Gullapalli
- 1988
Citation Context ...sychologists [13, 18]. Barto and Anandan [8], and Barto [6, 7] discuss a learning rule of this kind called the Associative Reward/Penalty, or A_R-P, rule. Sutton [60], Anderson [2, 3], and Gullapalli [21] describe the results of computer simulations of related methods. Williams [74, 75] provides theoretical analysis of a more general class of stochastic learning rules. There are other approaches to re...

5 |
A Pavlovian analysis of goal-directed behavior
- Rescorla
- 1987
Citation Context ... our aim to argue for the validity of this view of animal instrumental learning, which probably involves more than can be accounted for by an S--R model (see, for example, Dickinson [14] and Rescorla [46]). In Section 7 we provide an example of how the TD procedure can be used with this kind of reinforcement learning method, but the TD procedure can also be combined with model--based methods in a vari...

3 |
Dynamic models in behavioral and evolutionary ecology
- Houston, Clark, et al.
- 1988
Citation Context ...ists. Dynamic programming has been used extensively in behavioral ecology for the analysis of animal behavior (see, for example, Krebs, Kacelnik, and Taylor [29], Houston, Clark, McNamara, and Mangel [25], and Mangel and Clark [36]). In these studies dynamic programming is used to determine decision strategies meeting certain definitions of optimality to which animal behavior is compared. Behavioral e...

3 |
Brain function and adaptive systems: A heterostatic theory. Technical Report AFCRL-72-0164, Air Force Cambridge Research Laboratories
- Klopf
- 1972
Citation Context ...and abstracted away from the domain of game playing. This work began with the interest of Sutton and Barto in classical conditioning and the exploration of Klopf's idea of "generalized reinforcement" [27, 28], which emphasized the importance of sequentiality in a neuronal model of learning. The adaptive heuristic critic algorithm was used (although in slightly different form) in the reinforcement--learnin...

2 |
Recursive algorithms for adaptive control of finite Markov chains
- El-Fattah
- 1981
Citation Context ...ee Pearl [44]. 5 Most of the methods for the adaptive control of Markov processes described in the engineering literature are model--based. Examples are provided by Borkar and Varaiya [12], El-Fattah [17], Kumar and Lin [30], Lyubchik 11 In this report our concern is with other approaches to learning how to solve sequential decision tasks, which we call direct approaches. Instead of learning a model o...

2 |
Learning Algorithms and Applications
- Lakshmivarahan
- 1981
Citation Context ... estimates for the transition and payoff probabilities. 6 Examples of various direct methods for learning how to solve sequential decision tasks are those of Lyubchik and Poznyak [34], Lakshmivarahan [31], Wheeler and Narendra [71], and Witten [77]. Most of these rely on results about the collective behavior of stochastic learning automata and ergodic Markov chains (see also Narendra and Thathachar [4...

2 |
Credit Assignment and Discovery in Classifier Systems
- Liepins, Hilliard, et al.
- 1991
Citation Context ...ods that use TD principles or closely related ones. The "bucket brigade" algorithm of Holland [24] is closely related to the TD procedure as discussed by Sutton [61] and Liepins, Hilliard, and Palmer [32]. Booker's [11] learning system employs a TD procedure, as does Hampson's [22] proposed learning system, which is very similar to the one we discuss here. Other related procedures have been proposed a...

2 |
An adaptive automaton controller for discrete-time Markov processes
- Riordon
- 1969
Citation Context ...nds on the action a. We denote this probability P_xy(a). When this state transition occurs, the agent receives a payoff, denoted r, which is determined randomly and Poznyak [34], Mandl [35], Riordon [48], and Sato, Abe, and Takeda [56]. Most of these methods apply to the case in which return is the average payoff per--time--step and the underlying system is an ergodic Markov chain for each possible p...

1 |
Learning automata in stochastic plant control problems
- Lyubchik, Poznyak
- 1974
Citation Context ...th a probability that depends on the action a. We denote this probability P_xy(a). When this state transition occurs, the agent receives a payoff, denoted r, which is determined randomly and Poznyak [34], Mandl [35], Riordon [48], and Sato, Abe, and Takeda [56]. Most of these methods apply to the case in which return is the average payoff per--time--step and the underlying system is an ergodic Markov...