Results 11 - 20
of
79
TD Models: Modeling the World at a Mixture of Time Scales
- PROCEEDINGS OF THE TWELFTH INTERNATIONAL CONFERENCE ON MACHINE LEARNING
, 1995
"... Temporal-difference (TD) learning can be used not just to predict rewards, as is commonly done in reinforcement learning, but also to predict states, i.e., to learn a model of the world's dynamics. We present theory and algorithms for intermixing TD models of the world at different levels of tem ..."
Abstract
-
Cited by 55 (14 self)
- Add to MetaCart
Temporal-difference (TD) learning can be used not just to predict rewards, as is commonly done in reinforcement learning, but also to predict states, i.e., to learn a model of the world's dynamics. We present theory and algorithms for intermixing TD models of the world at different levels of temporal abstraction within a single structure. Such multi-scale TD models can be used in model-based reinforcementlearning architectures and dynamic programming methods in place of conventional Markov models. This enables planning at higher and varied levels of abstraction, and, as such, may prove useful in formulating methods for hierarchical or multi-level planning and reinforcement learning. In this paper we treat only the prediction problem---that of learning a model and value function for the case of fixed agent behavior. Within this context, we establish the theoretical foundations of multi-scale models and derive TD algorithms for learning them. Two small computational experim...
Temporal Abstraction in Reinforcement Learning
, 2000
"... Decision making usually involves choosing among different courses of action over a broad range of time scales. For instance, a person planning a trip to a distant location makes high-level decisions regarding what means of transportation to use, but also chooses low-level actions, such as the moveme ..."
Abstract
-
Cited by 55 (2 self)
- Add to MetaCart
Decision making usually involves choosing among different courses of action over a broad range of time scales. For instance, a person planning a trip to a distant location makes high-level decisions regarding what means of transportation to use, but also chooses low-level actions, such as the movements for getting into a car. The problem of picking an appropriate time scale for reasoning and learning has been explored in artificial intelligence, control theory and robotics. In this dissertation we develop a framework that allows novel solutions to this problem, in the context of Markov Decision Processes (MDPs) and reinforcement learning. In this dissertation, we present a general framework for prediction, control and learning at multipl...
Between MDPs and semi-MDPs: Learning, planning, and representing knowledge at multiple temporal scales
- Journal of Artificial Intelligence Research
, 1998
"... Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key challenges for AI. In this paper we develop an approach to these problems based on the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We extend the usual notion o ..."
Abstract
-
Cited by 51 (7 self)
- Add to MetaCart
Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key challenges for AI. In this paper we develop an approach to these problems based on the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We extend the usual notion of action to include options—whole courses of behavior that may be temporally extended, stochastic, and contingent on events. Examples of options include picking up an object, going to lunch, and traveling to a distant city, as well as primitive actions such as muscle twitches and joint torques. Options may be given a priori, learned by experience, or both. They may be used interchangeably with actions in a variety of planning and learning methods. The theory of semi-Markov decision processes (SMDPs) can be applied to model the consequences of options and as a basis for planning and learning methods using them. In this paper we develop these connections, building on prior work by Bradtke and Duff (1995), Parr (in prep.) and others. Our main novel results concern the interface between the MDP and SMDP levels of analysis. We show how a set of options can be altered by changing only their termination conditions
Theoretical results on reinforcement learning with temporally abstract options
- in: Proc. 10th European Conference on Machine Learning
, 1998
"... Abstract. We present new theoretical results on planning within the framework of temporally abstract reinforcement learning (Precup & Sutton, 1997; Sutton, 1995). Temporal abstraction is a key step in any decision making system that involves planning and prediction. In temporally abstract reinforcem ..."
Abstract
-
Cited by 49 (9 self)
- Add to MetaCart
Abstract. We present new theoretical results on planning within the framework of temporally abstract reinforcement learning (Precup & Sutton, 1997; Sutton, 1995). Temporal abstraction is a key step in any decision making system that involves planning and prediction. In temporally abstract reinforcement learning, the agent is allowed to choose among ”options”, whole courses of action that may be temporally extended, stochastic, and contingent on previous events. Examples of options include closed-loop policies such as picking up an object, as well as primitive actions such as joint torques. Knowledge about the consequences of options is represented by special structures called multi-time models. In this paper we focus on the theory of planning with multi-time models. We define new Bellman equations that are satisfied for sets of multi-time models. As a consequence, multi-time models can be used interchangeably with models of primitive actions in a variety of well-known planning methods including value iteration, policy improvement and policy iteration. 1
Intra-option learning about temporally abstract actions
- In Proceedings of the Fifteenth International Conference on Machine Learning
, 1998
"... Several researchers have proposed modeling temporally abstract actions in reinforcement learning by the combination of a policy and a termination condition, which we refer to as an option. Value functions over options and models of options can be learned using methods designed for semi-Markov decisi ..."
Abstract
-
Cited by 49 (9 self)
- Add to MetaCart
Several researchers have proposed modeling temporally abstract actions in reinforcement learning by the combination of a policy and a termination condition, which we refer to as an option. Value functions over options and models of options can be learned using methods designed for semi-Markov decision processes (SMDPs). However, all these methods require an option to be executed to termination. In this paper we explore methods that learn about an option from small fragments of experience consistent with that option, even if the option itself is not executed. We call these methods intra-option learning methods because they learn from experience within an option. Intra-option methods are sometimes much more efficient than SMDP methods because they can use off-policy temporaldifference mechanisms to learn simultaneously about all the options consistent with an experience, not just the few that were actually executed. In this paper we present intra-option learning methods for learning value functions over options and for learning multi-time models of the consequences of options. We present computational examples in which these new methods learn much faster than SMDP methods and learn effectively when SMDP methods cannot learn at all. We also sketch a convergence proof for intraoption value learning. 1
Learning from observation using primitives
- In IEEE International Conference on Robotics and Automation
, 2001
"... This paper describes the use of task primitives in robot learning from observation. A framework has been developed that uses observed data to initially learn a task and then the agent goes on to increase its performance through repeated task performance (learning from practice). Data that is collect ..."
Abstract
-
Cited by 46 (2 self)
- Add to MetaCart
This paper describes the use of task primitives in robot learning from observation. A framework has been developed that uses observed data to initially learn a task and then the agent goes on to increase its performance through repeated task performance (learning from practice). Data that is collected while a human performs a task is parsed into small parts of the task called primitives. Modules are created for each primitive that encode the movements required during the performance of the primitive, and when and where the primitives are performed. The feasibility of this method is currently being tested with agents that learn to play a virtual and an actual air hockey game. 1
Eligibility Traces for Off-Policy Policy Evaluation
- Proceedings of the Seventeenth International Conference on Machine Learning
, 2000
"... Eligibility traces have been shown to speed reinforcement learning, to make it more robust to hidden states, and to provide a link between Monte Carlo and temporal-difference methods. Here we generalize eligibility traces to off-policy learning, in which one learns about a policy different from the ..."
Abstract
-
Cited by 32 (3 self)
- Add to MetaCart
Eligibility traces have been shown to speed reinforcement learning, to make it more robust to hidden states, and to provide a link between Monte Carlo and temporal-difference methods. Here we generalize eligibility traces to off-policy learning, in which one learns about a policy different from the policy that generates the data. Off-policy methods can greatly multiply learning, as many policies can be learned about from the same data stream, and have been identified as particularly useful for learning about subgoals and temporally extended macro-actions. In this paper we consider the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method. We analyze and compare this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling. Our main results are 1) to establish the consistency and bias properties of the new methods and 2) to empirically rank the new methods, showing improvement over one-step and Monte Carlo methods. Our results are restricted to model-free, table-lookup methods and to offline updating (at the end of each episode) although several of the algorithms could be applied more generally. 1.
Stratified Case-Based Reasoning: Reusing Hierarchical ProblemSolving Episodes
- In Proceedings of the Fourtheenth International Joint Conference on Artificial Intelligence, IJCAI-95
, 1995
"... Stratified case-based reasoning is a technique in which abstract solutions produced during hierarchical problem solving are used to assist casebased retrieval, matching, and adaptation. We describe the motivation for the integration of case-based reasoning with hierarchical problem solving, exemplif ..."
Abstract
-
Cited by 30 (2 self)
- Add to MetaCart
Stratified case-based reasoning is a technique in which abstract solutions produced during hierarchical problem solving are used to assist casebased retrieval, matching, and adaptation. We describe the motivation for the integration of case-based reasoning with hierarchical problem solving, exemplify its benefits, detail a set of algorithms that implement our approach, and present their comparative empirical evaluation on a path planning task. Our results show that stratified case-based reasoning significantly decreases the computational expense required to retrieve, match, and adapt cases, leading to performance superior both to simple case-based
Temporal-difference networks
- In Advances in Neural Information Processing Systems 17
, 2005
"... We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set ..."
Abstract
-
Cited by 30 (7 self)
- Add to MetaCart
We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set at a later time. TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. Using a random-walk example, we show that these networks can be used to learn to predict by a fixed interval, which is not possible with conventional TD methods. Secondly, we show that if the interpredictive relationships are made conditional on action, then the usual learning-efficiency advantage of TD methods over Monte Carlo (supervised learning) methods becomes particularly pronounced. Thirdly, we demonstrate that TD networks can learn predictive state representations that enable exact solution of a non-Markov problem. A very broad range of inter-predictive temporal relationships can be expressed in these networks. Overall we argue that TD networks represent a substantial extension of the abilities of TD methods and bring us closer to the goal of representing world knowledge in entirely predictive, grounded terms. Temporal-difference (TD) learning is widely used in reinforcement learning methods to learn moment-to-moment predictions of total future reward (value functions). In this setting, TD learning is often simpler and more data-efficient than other methods. But the idea of TD learning can be used more generally than it is in reinforcement learning. TD learning is a general method for learning predictions whenever multiple predictions are made of the same event over time, value functions being just one example. The most pertinent of the more general uses of TD learning have been in learning models of an environment or
Improved Switching among Temporally Abstract Actions
- ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 11
, 1999
"... In robotics and other control applications it is commonplace to have a preexisting set of controllers for solving subtasks, perhaps hand-crafted or previously learned or planned, and still face a difficult problem of how to choose and switch among the controllers to solve an overall task as well a ..."
Abstract
-
Cited by 20 (3 self)
- Add to MetaCart
In robotics and other control applications it is commonplace to have a preexisting set of controllers for solving subtasks, perhaps hand-crafted or previously learned or planned, and still face a difficult problem of how to choose and switch among the controllers to solve an overall task as well as possible. In this paper we present a framework based on Markov decision processes and semi-Markov decision processes for phrasing this problem, a basic theorem regarding the improvement in performance that can be obtained by switching flexibly between given controllers, and example applications of the theorem. In particular, we show how an agent can plan with these high-level controllers and then use the results of such planning to find an even better plan, by ...

