Results 11–20 of 408
Identifying useful subgoals in reinforcement learning by local graph partitioning
 In Proceedings of the Twenty-Second International Conference on Machine Learning
, 2005
Abstract

Cited by 58 (11 self)
We present a new subgoal-based method for automatically creating useful skills in reinforcement learning. Our method identifies subgoals by partitioning local state transition graphs—those that are constructed using only the most recent experiences of the agent. The local scope of our subgoal discovery method allows it to successfully identify the type of subgoals we seek—states that lie between two densely connected regions of the state space—while producing an algorithm with low computational cost.
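The cut-based idea in this abstract (partition a transition graph, treat border states as subgoal candidates) can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it uses a spectral bipartition via the graph Laplacian's Fiedler vector in place of the paper's local cut criterion, and all names are illustrative.

```python
import numpy as np

def find_subgoals(transitions, n_states):
    """Spectrally bipartition a state-transition graph and return the
    states with an edge crossing the cut (subgoal candidates)."""
    W = np.zeros((n_states, n_states))
    for s, s2 in transitions:
        W[s, s2] = W[s2, s] = 1.0           # undirected transition graph
    D = np.diag(W.sum(axis=1))
    L = D - W                               # graph Laplacian
    _, eigvecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                 # second-smallest eigenvector
    side = fiedler >= 0                     # two-way partition by sign
    # Subgoal candidates: states with at least one edge crossing the cut.
    return [s for s in range(n_states)
            if any(W[s, t] > 0 and side[s] != side[t]
                   for t in range(n_states))]
```

On a toy graph of two densely connected clusters joined by a single edge (two "rooms" and a "doorway"), the states on either end of the bridge edge come out as the candidates.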
Between MDPs and semi-MDPs: Learning, planning, and representing knowledge at multiple temporal scales
 Journal of Artificial Intelligence Research
, 1998
Abstract

Cited by 57 (7 self)
Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key challenges for AI. In this paper we develop an approach to these problems based on the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We extend the usual notion of action to include options—whole courses of behavior that may be temporally extended, stochastic, and contingent on events. Examples of options include picking up an object, going to lunch, and traveling to a distant city, as well as primitive actions such as muscle twitches and joint torques. Options may be given a priori, learned by experience, or both. They may be used interchangeably with actions in a variety of planning and learning methods. The theory of semi-Markov decision processes (SMDPs) can be applied to model the consequences of options and as a basis for planning and learning methods using them. In this paper we develop these connections, building on prior work by Bradtke and Duff (1995), Parr (in prep.) and others. Our main novel results concern the interface between the MDP and SMDP levels of analysis. We show how a set of options can be altered by changing only their termination conditions
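The abstract's central objects, options used "interchangeably with actions", can be made concrete with a small sketch. The field names and the one-step SMDP Q-learning backup below follow the standard options framework but are illustrative, not the paper's code.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    """An option: where it may start, how it behaves, and when it stops.
    (Field names are illustrative.)"""
    initiation: Callable[[int], bool]    # I: can the option start in state s?
    policy: Callable[[int], int]         # pi: primitive action while executing
    termination: Callable[[int], float]  # beta(s): probability of stopping in s

def smdp_q_update(Q, s, o, discounted_reward, s_next, k, alpha=0.1, gamma=0.9):
    """One SMDP Q-learning backup after option o ran for k steps from s,
    landed in s_next, and accrued (already gamma-discounted) reward."""
    target = discounted_reward + gamma ** k * max(Q[s_next].values())
    Q[s][o] += alpha * (target - Q[s][o])
    return Q[s][o]
```

The only change relative to one-step Q-learning is that the discount is raised to the option's duration k, which is exactly the SMDP view the abstract describes.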
Theoretical results on reinforcement learning with temporally abstract options
 in: Proc. 10th European Conference on Machine Learning
, 1998
Abstract

Cited by 56 (9 self)
We present new theoretical results on planning within the framework of temporally abstract reinforcement learning (Precup & Sutton, 1997; Sutton, 1995). Temporal abstraction is a key step in any decision making system that involves planning and prediction. In temporally abstract reinforcement learning, the agent is allowed to choose among “options”, whole courses of action that may be temporally extended, stochastic, and contingent on previous events. Examples of options include closed-loop policies such as picking up an object, as well as primitive actions such as joint torques. Knowledge about the consequences of options is represented by special structures called multi-time models. In this paper we focus on the theory of planning with multi-time models. We define new Bellman equations that are satisfied for sets of multi-time models. As a consequence, multi-time models can be used interchangeably with models of primitive actions in a variety of well-known planning methods including value iteration, policy improvement and policy iteration.
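The option-level Bellman equations the abstract refers to have a standard form in the options literature; as a hedged reconstruction consistent with the multi-time-model view (the paper's exact notation may differ):

$$
V^{*}_{\mathcal{O}}(s) \;=\; \max_{o \in \mathcal{O}_s}\Big[\, r(s,o) \;+\; \sum_{s'} p(s' \mid s, o)\, V^{*}_{\mathcal{O}}(s') \,\Big],
$$

where $r(s,o)$ is the expected discounted reward accumulated while $o$ executes from $s$, and $p(s' \mid s,o) = \sum_{k} \gamma^{k} \Pr(o \text{ terminates in } s' \text{ after } k \text{ steps})$ absorbs the discounting. Because $(r, p)$ have the same shape as a one-step model, multi-time models slot into value iteration exactly where primitive-action models would.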
Soar-RL: Integrating Reinforcement Learning with Soar
 Cognitive Systems
, 2005
Abstract

Cited by 55 (13 self)
In this paper, we describe an architectural modification to Soar that gives a Soar agent the opportunity to learn statistical information about the past success of its actions and utilize this information when selecting an operator. This mechanism serves the same purpose as production utilities in ACT-R, but the implementation is more directly tied to the standard definition of the reinforcement learning (RL) problem. The paper explains our implementation, gives a rationale for adding an RL capability to Soar, and shows results for Soar-RL agents' performance on two tasks.
SMDP Homomorphisms: An Algebraic Approach to Abstraction in Semi-Markov Decision Processes
, 2003
Abstract

Cited by 52 (9 self)
To operate effectively in complex environments, learning agents require the ability to selectively ignore irrelevant details and form useful abstractions.
Action understanding as inverse planning
 Cognition
, 2009
Abstract

Cited by 51 (5 self)
Humans are adept at inferring the mental states underlying other agents’ actions, such as goals, beliefs, desires, emotions and other thoughts. We propose a computational framework based on Bayesian inverse planning for modeling human action understanding. The framework represents an intuitive theory of intentional agents’ behavior based on the principle of rationality: the expectation that agents will plan approximately rationally to achieve their goals, given their beliefs about the world. The mental states that caused an agent’s behavior are inferred by inverting this model of rational planning using Bayesian inference, integrating the likelihood of the observed actions with the prior over mental states. This approach formalizes in precise probabilistic terms the essence of previous qualitative approaches to action understanding based on an “intentional stance” (Dennett, 1987) or a “teleological stance” (Gergely et al., 1995). In three psychophysical experiments using animated stimuli of agents moving in simple mazes, we assess how well different inverse planning models based on different goal priors can predict human goal inferences. The results provide quantitative evidence for an approximately rational inference mechanism in human goal inference within our simplified stimulus paradigm, and for the flexible nature of goal representations that human observers can adopt. We discuss the implications of our experimental results for human action understanding in real-world contexts, and suggest how our framework might be extended to capture other kinds of mental state inferences, such as inferences about beliefs, or inferring whether an entity is an intentional agent.
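The inversion the abstract describes, likelihood of observed actions combined with a prior over goals, can be sketched in a few lines. This assumes a Boltzmann-rational action model, a common choice in this literature; the function names, the `q_value` interface, and the rationality parameter `beta` are all illustrative, not the paper's API.

```python
import math

def goal_posterior(observations, goals, q_value, prior, beta=2.0):
    """Sketch of Bayesian inverse planning: invert a Boltzmann-rational
    policy model to get a posterior over goals from observed actions.
    observations: list of (state, chosen_action, available_actions)."""
    post = {}
    for g in goals:
        loglik = 0.0
        for state, chosen, available in observations:
            # Softmax ("approximately rational") action likelihood under goal g.
            logits = [beta * q_value(g, state, a) for a in available]
            log_z = math.log(sum(math.exp(x) for x in logits))
            loglik += beta * q_value(g, state, chosen) - log_z
        post[g] = prior[g] * math.exp(loglik)   # Bayes: likelihood * prior
    total = sum(post.values())
    return {g: p / total for g, p in post.items()}
```

Given an agent that takes the action most valuable under goal "L", the posterior shifts toward "L" while remaining a proper distribution over all goals.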
Autonomous Discovery Of Temporal Abstractions From Interaction With An Environment
, 2002
Abstract

Cited by 49 (2 self)
This dissertation is dedicated to my parents, Bill and Gaye, who have always loved and believed in me and to my husband, Andy, whose love and support made it possible. ACKNOWLEDGMENTS Andrew Barto has been a great thesis advisor. He has helped me to become a better researcher by shaping my critical thinking as well as by improving my expressive skills. I also benefited greatly from having Rich Sutton as my second advisor during my first two years at the University of Massachusetts. I would like to thank the members of my thesis committee, Eliot Moss, Rod Grupen, and Neil Berthier for their feedback. Doina Precup and Kiri Wagstaff have been wonderful friends and supporters of my research. It is very helpful to have such smart women friends in CS. They provided support when I needed it and they pushed me when I needed that. I feel privileged to know Doina both as a mentor and as a friend. I thank Kiri for helpful feedback on drafts of my dissertation as well as the motivation provided by exchanging and reviewing each other’s thesis
Rule-based Evolutionary Online Learning Systems: Learning Bounds, Classification, and Prediction
, 2004
Abstract

Cited by 48 (10 self)
Rule-based evolutionary online learning systems, often referred to as Michigan-style learning classifier systems (LCSs), were proposed nearly thirty years ago (Holland, 1976; Holland, 1977), originally under the name cognitive systems. LCSs combine the strength of reinforcement learning with the generalization capabilities of genetic algorithms, promising a flexible, online-generalizing learning system that depends solely on reinforcement. However, despite several initial successful applications of LCSs and their interesting relations with animal learning and cognition, understanding of the systems remained somewhat obscured. Questions concerning learning complexity or convergence remained unanswered. Performance in different problem types, problem structures, concept spaces, and hypothesis spaces stayed nearly unpredictable. This thesis has the following three major objectives: (1) to establish a facetwise theory approach for LCSs that promotes system analysis, understanding, and design; (2) to analyze, evaluate, and enhance the XCS classifier system (Wilson, 1995) by means of the facetwise approach, establishing a fundamental XCS learning theory; (3) to identify both the major advantages of an LCS-based learning approach as well as the most promising potential application areas. Achieving these three objectives leads to a rigorous understanding
Off-policy temporal-difference learning with function approximation
 Proceedings of the 18th International Conference on Machine Learning
, 2001
Abstract

Cited by 48 (10 self)
We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms the basis for popular reinforcement learning methods such as Q-learning, which has been known to diverge with linear function approximation, and because it is critical to the practical utility of multi-scale, multi-goal learning frameworks such as options, HAMs, and MAXQ. Our new algorithm combines TD(λ) over state–action pairs with importance sampling ideas from our previous work. We prove that, given training under any ε-soft policy, the algorithm converges w.p.1 to a close approximation (as in Tsitsiklis and Van Roy, 1997; Tadic, 2001) to the action-value function for an arbitrary target policy. Variations of the algorithm designed to reduce variance introduce additional bias but are also guaranteed convergent. We also illustrate our method empirically on a small policy evaluation problem. Our current results are limited to episodic tasks with episodes of bounded length. Although Q-learning remains the most popular of all reinforcement learning algorithms, it has been known since about 1996 that it is unsound with linear function approximation (see Gordon, 1995; Bertsekas and Tsitsiklis, 1996). The most telling counterexample, due to Baird (1995), is a seven-state Markov decision process with linearly independent feature vectors, for which an exact solution exists, yet (This is a retypeset version of an article published in the Proceedings.)
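Where the importance-sampling ratio enters a linear TD backup can be shown in a stripped-down form. The paper's actual algorithm is a TD(λ) method over state-action pairs with convergence guarantees; the one-step update below is only an illustrative sketch of the per-step reweighting idea, with hypothetical names.

```python
def off_policy_td_step(w, phi, reward, phi_next, rho, alpha=0.05, gamma=0.9):
    """One importance-weighted linear TD(0) backup (sketch).
    rho = pi(a|s) / b(a|s) reweights the behavior-policy sample toward
    the target policy; values are linear, v(s) = w . phi(s)."""
    v = sum(wi * fi for wi, fi in zip(w, phi))
    v_next = sum(wi * fi for wi, fi in zip(w, phi_next))
    delta = reward + gamma * v_next - v                     # TD error
    # The importance ratio scales the whole correction, so updates from
    # actions the target policy would rarely take are down-weighted.
    return [wi + alpha * rho * delta * fi for wi, fi in zip(w, phi)]
```

With rho fixed at 1 this reduces to ordinary on-policy linear TD(0), which makes the off-policy correction easy to see by comparison.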
Efficient structure learning in factored-state MDPs
, 2007
Abstract

Cited by 47 (9 self)
We consider the problem of reinforcement learning in factored-state MDPs in the setting in which learning is conducted in one long trial with no resets allowed. We show how to extend existing efficient algorithms that learn the conditional probability tables of dynamic Bayesian networks (DBNs) given their structure to the case in which DBN structure is not known in advance. Our method learns the DBN structures as part of the reinforcement-learning process and provably provides an efficient learning algorithm when combined with factored R-max.