Results 1–10 of 61
Fast gradient-descent methods for temporal-difference learning with linear function approximation
 In Danyluk et al.
, 2009
Decision-Theoretic Military Operations Planning
, 2004
Abstract

Cited by 42 (7 self)
Military operations planning involves concurrent actions, resource assignment, and conflicting costs. Individual tasks sometimes fail with a known probability, promoting a decision-theoretic approach. The planner must choose between multiple tasks that achieve similar outcomes but have different costs. The military domain is particularly suited to automated methods because hundreds of tasks, specified by many planning staff, need to be quickly and robustly coordinated. The authors …
A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation
 Advances in Neural Information Processing Systems 21 (to appear)
Abstract

Cited by 41 (9 self)
We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm. We prove that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by LSTD, but without LSTD's quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods.

1. Off-policy learning methods

Off-policy methods have an important role to play in the larger ambitions of modern reinforcement learning. In general, updates to a statistic of a dynamical process are said to be "off-policy" if their distribution does not match the dynamics of the process, particularly if the mismatch is due to the way actions are chosen. The prototypical example in reinforcement learning is the learning of the value function for one policy, the target policy, using data obtained while following another policy, the behavior policy. For example, the popular Q-learning algorithm (Watkins, 1989) is an off-policy temporal-difference algorithm in which the target policy is greedy with respect to estimated action values, and the behavior policy is something more exploratory, such as a corresponding ε-greedy policy. Off-policy methods are also critical to reinforcement-learning-based efforts to model human-level world knowledge and state representations as predictions of option outcomes (e.g.,
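The GTD update described in this abstract can be sketched in a few lines. The sketch below is illustrative, not the paper's code: the auxiliary vector `u` tracks the expected TD(0) update E[δφ], and the weights follow the gradient of its squared L2 norm. The two-state chain, step sizes, and data format are invented for the example.

```python
import numpy as np

def gtd0(transitions, n_features, alpha=0.05, beta=0.05, gamma=0.5, sweeps=5000):
    """One-step GTD as sketched in the abstract: keep a running estimate u of
    the expected TD(0) update E[delta * phi] and do stochastic gradient
    descent on its squared L2 norm.  Step sizes are illustrative choices."""
    theta = np.zeros(n_features)   # value-function weights
    u = np.zeros(n_features)       # running estimate of the expected TD update
    for _ in range(sweeps):
        for r, phi, phi_next in transitions:
            delta = r + gamma * (theta @ phi_next) - theta @ phi
            theta = theta + alpha * (phi - gamma * phi_next) * (phi @ u)
            u = u + beta * (delta * phi - u)
    return theta

# Deterministic two-state cycle with tabular features and gamma = 0.5:
# true values are v0 = gamma / (1 - gamma**2) = 2/3 and v1 = 1 / (1 - gamma**2) = 4/3.
phi0, phi1 = np.eye(2)
theta = gtd0([(0.0, phi0, phi1), (1.0, phi1, phi0)], n_features=2)
```

On this on-policy tabular example GTD recovers the TD fixed point, which here equals the true value function; the abstract's point is that the same O(n) update remains stable off-policy, where plain TD(0) can diverge.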
An Analysis of Reinforcement Learning with Function Approximation
Abstract

Cited by 39 (5 self)
We address the problem of computing the optimal Q-function in Markov decision problems with infinite state-space. We analyze the convergence properties of several variations of Q-learning when combined with function approximation, extending the analysis of TD-learning in (Tsitsiklis & Van Roy, 1996a) to stochastic control settings. We identify conditions under which such approximate methods converge with probability 1. We conclude with a brief discussion on the general applicability of our results and compare them with several related works.
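The setting analyzed here, Q-learning combined with (linear) function approximation, can be illustrated with a minimal sketch. The two-state MDP, step sizes, and exploration rate below are invented for illustration; tabular features are used as the simplest linear special case, which is one of the regimes where convergence is well understood.

```python
import numpy as np

# Hypothetical two-state, two-action MDP: action 1 switches states and pays
# reward 1 when it lands in state 1; action 0 stays put with reward 0.
def step(s, a):
    s_next = 1 - s if a == 1 else s
    r = 1.0 if (a == 1 and s_next == 1) else 0.0
    return r, s_next

rng = np.random.default_rng(0)
gamma, alpha, epsilon = 0.9, 0.1, 0.2
features = np.eye(2)             # tabular features: a special case of linear FA
w = np.zeros((2, 2))             # one linear weight vector per action

s = 0
for _ in range(20000):
    q = w @ features[s]          # linear action-value estimates for state s
    a = int(rng.integers(2)) if rng.random() < epsilon else int(np.argmax(q))
    r, s_next = step(s, a)
    target = r + gamma * np.max(w @ features[s_next])   # max-operator bootstrap
    w[a] += alpha * (target - w[a] @ features[s]) * features[s]
    s = s_next
```

With richer (non-tabular) features the same update can diverge, which is exactly why conditions for convergence with probability 1 need the kind of analysis this paper provides.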
Importance sampling for reinforcement learning with multiple objectives
, 2001
Temporal abstraction in temporal-difference networks
 In Advances in Neural Information Processing Systems 18 (NIPS*05)
, 2006
Abstract

Cited by 22 (6 self)
We present a generalization of temporal-difference networks to include temporally abstract options on the links of the question network. Temporal-difference (TD) networks have been proposed as a way of representing and learning a wide variety of predictions about the interaction between an agent and its environment. These predictions are compositional in that their targets are defined in terms of other predictions, and subjunctive in that they are about what would happen if an action or sequence of actions were taken. In conventional TD networks, the interrelated predictions are at successive time steps and contingent on a single action; here we generalize them to accommodate extended time intervals and contingency on whole ways of behaving. Our generalization is based on the options framework for temporal abstraction. The primary contribution of this paper is to introduce a new algorithm for intra-option learning in TD networks with function approximation and eligibility traces.
Competitive-Cooperative-Concurrent Reinforcement Learning with Importance Sampling
 In Proc. of International Conference on Simulation of Adaptive Behavior: From Animals to Animats
, 2004
Abstract

Cited by 19 (3 self)
The speed and performance of learning depend on the complexity of the learner. A simple learner with few parameters and no internal states can quickly obtain a reactive policy, but its performance is limited. A learner with many parameters and internal states may finally achieve high performance, but it may take enormous time for learning. Therefore, it is difficult to decide in advance which architecture and algorithm should be used for a new task. In this paper, we propose a new framework for selecting an appropriate policy out of a set of heterogeneous reinforcement learning modules and for correctly improving the policies of all learning modules, including those not selected, using the method of importance sampling. In this framework, multiple heterogeneous learning modules sharing the same sensory-motor system can compete to act and cooperate to learn, allowing the overall learning system to obtain good performance faster. We show in a simulation of a partially observable pole-balancing task and in robotic experiments on battery-pack foraging and partially observable T-maze tasks that a complex learning module trained with the proposed method can actually learn faster than when it is trained alone, by exploiting task-relevant episodes generated by suboptimal but fast-learning modules.
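The importance-sampling correction that lets one module learn from another's episodes can be illustrated in the simplest possible setting. The bandit rewards and both policies below are invented for the example; the point is only the likelihood-ratio weighting π(a)/b(a) that reweights behavior-policy samples into an unbiased estimate for the target policy.

```python
import numpy as np

rng = np.random.default_rng(42)
rewards = np.array([1.0, 5.0, 2.0])    # hypothetical expected reward per action
b = np.array([0.5, 0.25, 0.25])        # behavior policy (the module that acted)
pi = np.array([0.1, 0.8, 0.1])         # target policy (the module that learns)

# Sample actions under b, then reweight each observed reward by pi(a)/b(a)
# so the average estimates the target policy's expected reward.
actions = rng.choice(3, size=200_000, p=b)
r = rewards[actions] + rng.normal(0.0, 0.1, size=actions.size)
estimate = float(np.mean((pi[actions] / b[actions]) * r))

true_value = float(pi @ rewards)       # 0.1*1 + 0.8*5 + 0.1*2 = 4.3
```

The estimator is unbiased whenever b gives nonzero probability to every action π might take, which is why the framework can improve the policies of modules that were not selected to act.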
A Survey of Multi-Objective Sequential Decision-Making
Abstract

Cited by 16 (5 self)
Sequential decision-making problems with multiple objectives arise naturally in practice and pose unique challenges for research in decision-theoretic planning and learning, which has largely focused on single-objective settings. This article surveys algorithms designed for sequential decision-making problems with multiple objectives. Though there is a growing body of literature on this subject, little of it makes explicit under what circumstances special methods are needed to solve multi-objective problems. Therefore, we identify three distinct scenarios in which converting such a problem to a single-objective one is impossible, infeasible, or undesirable. Furthermore, we propose a taxonomy that classifies multi-objective methods according to the applicable scenario, the nature of the scalarization function (which projects multi-objective values to scalar ones), and the type of policies considered. We show how these factors determine the nature of an optimal solution, which can be a single policy, a convex hull, or a Pareto front. Using this taxonomy, we survey the literature on multi-objective methods for planning and learning. Finally, we discuss key applications of such methods and outline opportunities for future work.
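Two ingredients of the survey's taxonomy, the scalarization function and the Pareto front, can be made concrete with a toy example. The policy labels and their two-objective values below are made up; a linear scalarization projects each value vector to a scalar, and the policy it selects (for any nonnegative weight) always lies on the Pareto front of non-dominated values.

```python
# Hypothetical policies with values in two objectives (e.g. reward vs. safety).
values = {"A": (3.0, 1.0), "B": (2.2, 2.2), "C": (1.0, 3.0), "D": (1.0, 1.0)}

def dominated(v, others):
    """v is Pareto-dominated if some other value is >= in every objective
    (and differs somewhere, guaranteed here by the inequality of tuples)."""
    return any(all(o >= x for o, x in zip(w, v)) and w != v for w in others)

pareto = {k for k, v in values.items() if not dominated(v, values.values())}

def best(weight):
    """Policy maximizing the linear scalarization  f(v) = weight . v."""
    return max(values, key=lambda k: sum(w * x for w, x in zip(weight, values[k])))
```

Here "D" is dominated by "A" and never optimal for any weight, while different weights pick out different members of the Pareto front, which is why a single scalar conversion can be undesirable when the weights are unknown in advance.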
Convergence of Least Squares Temporal Difference Methods Under General Conditions
Abstract

Cited by 15 (3 self)
We consider approximate policy evaluation for finite state and action Markov decision processes (MDP) in the off-policy learning context and with the simulation-based least squares temporal difference algorithm, LSTD(λ). We establish for the discounted cost criterion that the off-policy LSTD(λ) converges almost surely under mild, minimal conditions. We also analyze other convergence and boundedness properties of the iterates involved in the algorithm, and based on them, we suggest a modification in its practical implementation. Our analysis uses theories of both finite space Markov chains and Markov chains on topological spaces.

1. Overview

We consider approximate policy evaluation for finite state and action Markov decision processes (MDP) in an exploration-enhanced learning context, called "off-policy" learning. In this context, we employ a certain policy called the "behavior policy" to adequately explore the state and action space, and using the observations of costs and transitions generated under the behavior policy, we may approximately evaluate any suitable "target policy" of interest. This differs from the standard policy evaluation case, "on-policy" learning, where the behavior policy always coincides with the policy to be evaluated. The dichotomy between off-policy and on-policy learning stems from the exploration-exploitation trade-off in practical model-free/simulation-based methods for policy search. With their flexibility, off-policy methods form an important part of the model-free learning methodology (Sutton & Barto, 1998) and have been suggested as important
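The least-squares construction at the heart of LSTD can be sketched for the simplest case, λ = 0 and on-policy data, where the eligibility trace z_t reduces to the current feature vector φ_t; the two-state chain below is an invented example, not from the paper. LSTD accumulates A = Σ_t z_t (φ_t − γ φ_{t+1})ᵀ and b = Σ_t z_t r_t and solves A θ = b.

```python
import numpy as np

gamma = 0.5
phi = np.eye(2)                  # tabular features for a two-state chain
# Deterministic cycle: s0 -(r=0)-> s1 -(r=1)-> s0, as (phi_t, r_t, phi_{t+1}).
samples = [(phi[0], 0.0, phi[1]), (phi[1], 1.0, phi[0])]

# LSTD(0): with lambda = 0 the trace z_t is just phi_t.
A = sum(np.outer(f, f - gamma * f_next) for f, _, f_next in samples)
b = sum(r * f for f, r, _ in samples)
theta = np.linalg.solve(A, b)    # value estimates for s0, s1
```

Solving the linear system directly is what gives LSTD its quadratic (here, O(n^2) per sample, O(n^3) solve) cost in the number of features n, in contrast to the O(n) incremental methods discussed elsewhere on this page; the paper's off-policy analysis concerns when these accumulated iterates remain well behaved.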