Results 1-10 of 31
Locally Weighted Learning for Control
1996
Cited by 159 (17 self)
Lazy learning methods provide useful representations and training algorithms for learning about complex phenomena during autonomous adaptive control of complex systems. This paper surveys ways in which locally weighted learning, a type of lazy learning, has been applied by us to control tasks. We explain various forms that control tasks can take, and how this affects the choice of learning paradigm. The discussion section explores the interesting impact that explicitly remembering all previous experiences has on the problem of learning to control.
Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
1996
Cited by 99 (12 self)
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies, and can fall into suboptimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
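The "independent estimation of the average reward and the relative values" mentioned in this abstract can be illustrated with a minimal sketch of a tabular R-learning update. This follows the commonly cited form of the rule and is not taken from the paper itself; the function name `r_learning_step`, the dictionary-based table, and the default step sizes `beta` and `alpha` are assumptions made for illustration.

```python
from collections import defaultdict


def r_learning_step(values, rho, state, action, reward, next_state, actions,
                    beta=0.1, alpha=0.01):
    """Apply one tabular R-learning update and return the new rho.

    values maps (state, action) pairs to relative action values.
    rho is the current estimate of the average reward per step.
    Names and default step sizes are illustrative assumptions.
    """
    best_next = max(values[(next_state, a)] for a in actions)
    best_here = max(values[(state, action_)] for action_ in actions)
    was_greedy = values[(state, action)] == best_here

    # Relative values are learned from the reward measured against rho.
    values[(state, action)] += beta * (
        reward - rho + best_next - values[(state, action)]
    )

    # The average reward is estimated separately, and only after greedy
    # actions, so that exploratory steps do not distort the estimate.
    if was_greedy:
        rho += alpha * (reward - rho + best_next - best_here)
    return rho


values = defaultdict(float)
rho = r_learning_step(values, 0.0, "s0", "a", 1.0, "s1", ["a", "b"],
                      beta=0.5, alpha=0.1)
```

Keeping `rho` outside the value table is what makes the two estimates independent: the table is updated on every step, while the average reward moves only on greedy steps.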
Practical Reinforcement Learning in Continuous Spaces
2000
Cited by 83 (3 self)
Dynamic control tasks are good candidates for the application of reinforcement learning techniques. However, many of these tasks inherently have continuous state or action variables. This can cause problems for traditional reinforcement learning algorithms which assume discrete states and actions. In this paper, we introduce an algorithm that safely approximates the value function for continuous state control tasks, and that learns quickly from a small amount of data. We give experimental results using this algorithm to learn policies for both a simulated task and also for a real robot, operating in an unaltered environment. The algorithm works well in a traditional learning setting, and demonstrates extremely good learning when bootstrapped with a small amount of human-provided data.
Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes
Journal of Machine Learning Research, 2006
Cited by 66 (10 self)
This paper introduces a novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies. The major components of the framework described in this paper include: (i) a general scheme for constructing representations or basis functions by diagonalizing symmetric diffusion operators; (ii) a specific instantiation of this approach where global basis functions called proto-value functions (PVFs) are formed using the eigenvectors of the graph Laplacian on an undirected graph formed from state transitions induced by the MDP; (iii) a three-phased procedure called representation policy iteration (RPI), comprising a sample collection phase, a representation learning phase that constructs basis functions from samples, and a final parameter estimation phase that determines an (approximately) optimal policy within the (linear) subspace spanned by the (current) basis functions; (iv) a specific instantiation of the RPI framework using least-squares policy iteration (LSPI) as the parameter estimation method; (v) several strategies for scaling the proposed approach to large discrete and continuous state spaces, including the Nyström extension for out-of-sample interpolation of eigenfunctions, and the use of Kronecker sum factorization to construct compact eigenfunctions in product spaces such as factored MDPs; (vi) finally, a series of illustrative discrete and continuous control tasks, which both illustrate the concepts and provide a benchmark for evaluating the proposed approach. Many challenges remain to be addressed in scaling the proposed framework to large MDPs, and several elaborations of the proposed framework are briefly summarized at the end.
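Component (ii) of this abstract can be illustrated with a rough sketch that builds an undirected graph from observed state transitions and takes the low-order eigenvectors of its Laplacian as basis functions. The sketch assumes NumPy, a small discrete state space indexed from 0, and the unnormalized (combinatorial) Laplacian; the function name `proto_value_functions` and its parameters are hypothetical, and the paper's own construction may differ (for example, by using a normalized Laplacian).

```python
import numpy as np


def proto_value_functions(transitions, num_states, k):
    """Return the k smallest eigenvalues and eigenvectors of a graph Laplacian.

    transitions is an iterable of (state, next_state) index pairs.
    Assumption: the combinatorial Laplacian L = D - A is used.
    """
    adjacency = np.zeros((num_states, num_states))
    for s, t in transitions:
        if s != t:
            # The graph is undirected: a transition in either direction
            # links the two states.
            adjacency[s, t] = adjacency[t, s] = 1.0
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    # eigh handles symmetric matrices and returns eigenvalues in ascending
    # order, so the first k columns are the smoothest functions on the graph.
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
    return eigenvalues[:k], eigenvectors[:, :k]


# A five-state chain, as might arise from a random walk along a corridor.
chain = [(i, i + 1) for i in range(4)]
eigenvalues, basis = proto_value_functions(chain, num_states=5, k=3)
```

Each column of `basis` is one candidate basis function over the five states; a linear value-function approximator would then fit weights over these columns.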
Inductive Learning of Reactive Action Models
Proceedings of the Twelfth International Conference on Machine Learning, 1995
Cited by 54 (1 self)
An important area of learning in autonomous agents is the ability to learn domain-specific models of actions to be used by planning systems. In this paper, we present methods by which an agent learns action models from its own experience and from its observation of a domain expert. These methods differ from previous work in the area in two ways: the use of an action model formalism which is better suited to the needs of a reactive agent, and successful implementation of noise-handling mechanisms. Training instances are generated from experience and observation, and a variant of GOLEM is used to learn action models from these instances. The integrated learning system has been experimentally validated in simulated construction and office domains. 1 INTRODUCTION Autonomous agents acting in complex environments must be capable of learning from experience, both to avoid the need for exhaustive preprogramming and to adapt to unanticipated or changing situations. Most such work has focused o...
Robot Shaping: Developing Situated Agents through Learning
1993
Cited by 53 (1 self)
Learning plays a vital role in the development of situated agents. In this paper, we explore the use of reinforcement learning to "shape" a robot to perform a predefined target behavior. We connect both simulated and real robots to ALECSYS, a parallel implementation of a learning classifier system with an extended genetic algorithm. After classifying different kinds of Animat-like behaviors, we explore the effects on learning of different types of agent architecture (monolithic, flat and hierarchical) and of training strategies. In particular, a hierarchical architecture requires the agent to learn how to coordinate basic learned responses. We show that the best results are achieved when both the agent's architecture and the training strategy match the structure of the behavior pattern to be learned. We report the results of a number of experiments carried out both in simulated and in real environments, and show that the results of simulations carry smoothly to real robots. While most o...
Online learning with random representations
In Proceedings of the Tenth International Conference on Machine Learning, 1993
Cited by 49 (5 self)
We consider the requirements of online learning, that is, learning which must be done incrementally and in real time, with the results of learning available soon after each new example is acquired. Despite the abundance of methods for learning from examples, there are few that can be used effectively for online learning, e.g., as components of reinforcement learning systems. Most of these few, including radial basis functions, CMACs, Kohonen's self-organizing maps, and those developed in this paper, share the same structure. All expand the original input representation into a higher dimensional representation in an unsupervised way, and then map that representation to the final answer using a relatively simple supervised learner, such as a perceptron or LMS rule. Such structures learn very rapidly and reliably, but have been thought either to scale poorly or to require extensive domain knowledge. To the contrary, some researchers (Rosenblatt,
Learning to Solve Markovian Decision Processes
1994
Cited by 49 (3 self)
This dissertation is about building learning control architectures for agents embedded in finite, stationary, and Markovian environments. Such architectures give embedded agents the ability to improve autonomously the efficiency with which they can achieve goals. Machine learning researchers have developed reinforcement learning (RL) algorithms based on dynamic programming (DP) that use the agent's experience in its environment to improve its decision policy incrementally. This is achieved by adapting an evaluation function in such a way that the decision policy that is "greedy" with respect to it improves with experience. This dissertation focuses on finite, stationary and Markovian environments for two reasons: it allows the develop...
Reacting, Planning, and Learning in an Autonomous Agent
Cited by 41 (4 self)
We present an autonomous agent architecture and its component subsystems that integrate important abilities needed for robust, flexible performance in dynamic environments. These abilities involve appropriate reaction to environmental situations given the agent's goals; selective attention to multiple, competing goals; planning new action routines when innovation beyond designer-provided routines is necessary; and learning the effects of actions so that the planner can use them to build ever more reliable plans. The teleo-reactive format allows actions to be closely coupled to continuous environmental feedback and is also especially compatible with conventional AI planning and learning mechanisms. The workings of the proposed architecture and its subsystems are illustrated in a simulated robot domain. We conclude by noting areas where future work is needed.
Finite-sample convergence rates for Q-learning and indirect algorithms
In Neural Information Processing Systems 12, 1999
Cited by 41 (5 self)
In this paper, we address two issues of longstanding interest in the reinforcement learning literature. First, what kinds of performance guarantees can be made for Q-learning after only a finite number of actions? Second, what quantitative comparisons can be made between Q-learning and model-based (indirect) approaches, which use experience to estimate next-state distributions for off-line value iteration? We first show that both Q-learning and the indirect approach enjoy rather rapid convergence to the optimal policy as a function of the number of state transitions observed. In particular, on the order of only $(N \log(1/\epsilon)/\epsilon^2)(\log(N) + \log\log(1/\epsilon))$ transitions are sufficient for both algorithms to come within $\epsilon$ of the optimal policy, in an idealized model that assumes the observed transitions are "well-mixed" throughout an $N$-state MDP. Thus, the two approaches have roughly the same sample complexity. Perhaps surprisingly, this sample complexity is far less than what is required for the model-based approach to actually construct a good approximation to the next-state distribution. The result also shows that the amount of memory required by the model-based approach is closer to $N$ than to $N^2$. For either approach, to remove the assumption that the observed transitions are well-mixed, we consider a model in which the transitions are determined by a fixed, arbitrary exploration policy. Bounds on the number of transitions required in order to achieve a desired level of performance are then related to the stationary distribution and mixing time of this policy.