Results 11-20 of 202
Least-Squares Policy Iteration
Journal of Machine Learning Research, 2003
Cited by 301 (9 self)
We propose a new approach to reinforcement learning for control problems which combines value-function approximation with linear architectures and approximate policy iteration.
Self-improving reactive agents based on reinforcement learning, planning and teaching
Machine Learning, 1992
Cited by 275 (2 self)
To date, reinforcement learning has mostly been studied solving simple learning tasks. Reinforcement learning methods that have been studied so far typically converge slowly. The purpose of this work is thus twofold: 1) to investigate the utility of reinforcement learning in solving much more complicated learning tasks than previously studied, and 2) to investigate methods that will speed up reinforcement learning. This paper compares eight reinforcement learning frameworks: adaptive heuristic critic (AHC) learning due to Sutton, Q-learning due to Watkins, and three extensions to both basic methods for speeding up learning. The three extensions are experience replay, learning action models for planning, and teaching. The frameworks were investigated using connectionism as an approach to generalization. To evaluate the performance of different frameworks, a dynamic environment was used as a testbed. The environment is moderately complex and nondeterministic. This paper describes these frameworks and algorithms in detail and presents empirical evaluation of the frameworks.
The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces
Machine Learning, 1995
Cited by 224 (8 self)
Parti-game is a new algorithm for learning feasible trajectories to goal regions in high-dimensional continuous state-spaces. In high dimensions it is essential that learning does not plan uniformly over a state-space. Parti-game maintains a decision-tree partitioning of state-space and applies techniques from game theory and computational geometry to efficiently and adaptively concentrate high resolution only on critical areas. The current version of the algorithm is designed to find feasible paths or trajectories to goal regions in high-dimensional spaces. Future versions will be designed to find a solution that optimizes a real-valued criterion. Many simulated problems have been tested, ranging from two-dimensional to nine-dimensional state-spaces, including mazes, path planning, nonlinear dynamics, and planar snake robots in restricted spaces. In all cases, a good solution is found in less than ten trials and a few minutes.
Learning to coordinate behaviors
In Proceedings of AAAI-90, 1990
Cited by 207 (3 self)
We describe an algorithm which allows a behavior-based robot to learn on the basis of positive and negative feedback when to activate its behaviors. In accordance with the philosophy of behavior-based robots, the algorithm is completely distributed: each of the behaviors independently tries to find out (i) whether it is relevant (i.e., whether it is at all correlated to positive feedback) and (ii) what the conditions are under which it becomes reliable (i.e., the conditions under which it maximizes the probability of receiving positive feedback and minimizes the probability of receiving negative feedback). The algorithm has been tested successfully on an autonomous 6-legged robot which had to learn how to coordinate its legs so as to walk forward.
Situation of the Problem. Since 1985, the MIT Mobile Robot group has advocated a radically different architecture for autonomous intelligent agents (Brooks, 1986). Instead of decomposing the architecture into functional modules, such as perception, modeling, and planning (figure 1), the architecture is decomposed into task-achieving modules, also called behaviors (figure 2). This novel approach has already been demonstrated to be very successful and similar approaches have become more
Learning and Sequential Decision Making
LEARNING AND COMPUTATIONAL NEUROSCIENCE, 1989
Cited by 195 (10 self)
In this report we show how the class of adaptive prediction methods that Sutton called "temporal difference," or TD, methods are related to the theory of sequential decision making. TD methods have been used as "adaptive critics" in connectionist learning systems, and have been proposed as models of animal learning in classical conditioning experiments. Here we relate TD methods to decision tasks formulated in terms of a stochastic dynamical system whose behavior unfolds over time under the influence of a decision maker's actions. Strategies are sought for selecting actions so as to maximize a measure of long-term payoff gain. Mathematically, tasks such as this can be formulated as Markovian decision problems, and numerous methods have been proposed for learning how to solve such problems. We show how a TD method can be understood as a novel synthesis of concepts from the theory of stochastic dynamic programming, which comprises the standard method for solving such tasks when a model of the dynamical system is available, and the theory of parameter estimation, which provides the appropriate context for studying learning rules in the form of equations for updating associative strengths in behavioral models, or connection weights in connectionist networks. Because this report is oriented primarily toward the non-engineer interested in animal learning, it presents tutorials on stochastic sequential decision tasks, stochastic dynamic programming, and parameter estimation.
Reinforcement Learning with Replacing Eligibility Traces
MACHINE LEARNING, 1996
Cited by 186 (11 self)
The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it theoretically, and show that it results in faster, more reliable learning than the conventional trace. Both kinds of trace assign credit to prior events according to how recently they occurred, but only the conventional trace gives greater credit to repeated events. Our analysis is for conventional and replace-trace versions of the offline TD(1) algorithm applied to undiscounted absorbing Markov chains. First, we show that these methods converge under repeated presentations of the training set to the same predictions as two well-known Monte Carlo methods. We then analyze the relative efficiency of the two Monte Carlo methods. We show that the method corresponding to conventional TD is biased, whereas the method corresponding to replace-trace TD is unbiased. In addition, we show that t...
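The difference between the two kinds of trace described in this abstract can be illustrated with a small sketch. This is not code from the paper: the function names `update_traces` and `run_episode`, the dictionary representation, and the single `decay` factor (standing in for the product of discount and trace-decay parameters) are assumptions made for illustration only.

```python
def update_traces(traces, visited_state, decay, kind):
    """Decay every trace, then mark the state that was just visited.

    kind="accumulating" is the conventional trace: a revisit adds 1 on top
    of whatever trace is left, so repeated visits earn extra credit.
    kind="replacing" resets the trace to 1, so repeats earn no extra credit.
    """
    new_traces = {state: decay * value for state, value in traces.items()}
    if kind == "accumulating":
        new_traces[visited_state] = new_traces.get(visited_state, 0.0) + 1.0
    elif kind == "replacing":
        new_traces[visited_state] = 1.0
    else:
        raise ValueError(f"unknown trace kind: {kind}")
    return new_traces


def run_episode(states, decay, kind):
    """Apply the trace update along a sequence of visited states."""
    traces = {}
    for state in states:
        traces = update_traces(traces, state, decay, kind)
    return traces


if __name__ == "__main__":
    visits = ["A", "A", "B"]  # state A is visited twice in a row
    print(run_episode(visits, decay=0.5, kind="accumulating"))
    print(run_episode(visits, decay=0.5, kind="replacing"))
```

With state A visited twice, the accumulating trace for A ends up larger than the replacing trace for A, which is the "greater credit to repeated events" that the abstract attributes to the conventional trace.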
Algorithms for Sequential Decision Making
1996
Cited by 175 (8 self)
Sequential decision making is a fundamental task faced by any intelligent agent in an extended interaction with its environment; it is the act of answering the question "What should I do now?" In this thesis, I show how to answer this question when "now" is one of a finite set of states, "do" is one of a finite set of actions, "should" is maximize a long-run measure of reward, and "I" is an automated planning or learning system (agent). In particular,
Efficient Exploration In Reinforcement Learning
1992
Cited by 122 (4 self)
Exploration plays a fundamental role in any active learning system. This study evaluates the role of exploration in active learning and describes several local techniques for exploration in finite, discrete domains, embedded in a reinforcement learning framework (delayed reinforcement). This paper distinguishes between two families of exploration schemes: undirected and directed exploration. While the former family is closely related to random walk exploration, directed exploration techniques memorize exploration-specific knowledge which is used for guiding the exploration search. In many finite deterministic domains, any learning technique based on undirected exploration is inefficient in terms of learning time, i.e. learning time is expected to scale exponentially with the size of the state space (Whitehead, 1991b). We prove that for all these domains, reinforcement learning using a directed technique can always be performed in polynomial time, demonstrating the important role of e...
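The contrast between the two families named in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's method: visit counts are used here as one plausible form of "exploration-specific knowledge", and the names `undirected_action`, `directed_action`, and `visit_counts` are hypothetical.

```python
import random
from collections import defaultdict


def undirected_action(actions, rng):
    """Undirected exploration: pick uniformly at random, using no memory."""
    return rng.choice(actions)


def directed_action(state, actions, visit_counts):
    """Directed exploration (assumed counter-based variant): prefer the
    action tried least often in this state. Ties go to the first action."""
    return min(actions, key=lambda action: visit_counts[(state, action)])


if __name__ == "__main__":
    actions = ["left", "right"]
    visit_counts = defaultdict(int)  # (state, action) -> times tried
    rng = random.Random(0)

    state = "s0"
    for _ in range(4):
        action = directed_action(state, actions, visit_counts)
        visit_counts[(state, action)] += 1

    # The directed rule spreads its trials evenly across the actions.
    print(dict(visit_counts))
    print(undirected_action(actions, rng))
```

The directed rule needs the extra `visit_counts` table, which is the memory cost that buys systematic coverage; the undirected rule keeps no such state.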
Finding Structure in Reinforcement Learning
Advances in Neural Information Processing Systems 7, 1995
Cited by 102 (4 self)
Reinforcement learning addresses the problem of learning to select actions in order to maximize one's performance in unknown environments. To scale reinforcement learning to complex real-world tasks, such as typically studied in AI, one must ultimately be able to discover the structure in the world, in order to abstract away the myriad of details and to operate in more tractable problem spaces. This paper presents the SKILLS algorithm. SKILLS discovers skills, which are partially defined action policies that arise in the context of multiple, related tasks. Skills collapse whole action sequences into single operators. They are learned by minimizing the compactness of action policies, using a description length argument on their representation. Empirical results in simple grid navigation tasks illustrate the successful discovery of structure in reinforcement learning.
1 Introduction. Reinforcement learning comprises a family of incremental planning algorithms that construct reactive con...
Incremental Multi-Step Q-Learning
Machine Learning, 1996
Cited by 88 (2 self)
This paper presents a novel incremental algorithm that combines Q-learning, a well-known dynamic programming-based reinforcement learning method, with the TD(λ) return estimation process, which is typically used in actor-critic learning, another well-known dynamic programming-based reinforcement learning method. The parameter λ is used to distribute credit throughout sequences of actions, leading to faster learning and also helping to alleviate the non-Markovian effect of coarse state-space quantization. The resulting algorithm, Q(λ)-learning, thus combines some of the best features of the Q-learning and actor-critic learning paradigms. The behavior of this algorithm has been demonstrated through computer simulations. Keywords: reinforcement learning, temporal difference learning.
1. Introduction. The incremental multi-step Q-learning (Q(λ)-learning) method is a new direct (or model-free) algorithm that extends the one-step Q-learning algorithm (Watkins 1989) by combining it with TD(λ) ret...