Results 1  10
of
117
Reinforcement learning: a survey
 Journal of Artificial Intelligence Research
, 1996
"... This paper surveys the field of reinforcement learning from a computerscience perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem ..."
Abstract

Cited by 1309 (22 self)
 Add to MetaCart
This paper surveys the field of reinforcement learning from a computerscience perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trialanderror interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
Nearoptimal reinforcement learning in polynomial time
 Machine Learning
, 1998
"... We present new algorithms for reinforcement learning, and prove that they have polynomial bounds on the resources required to achieve nearoptimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the m ..."
Abstract

Cited by 237 (3 self)
 Add to MetaCart
We present new algorithms for reinforcement learning, and prove that they have polynomial bounds on the resources required to achieve nearoptimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the ExplorationExploitation tradeoff. 1
Packet Routing in Dynamically Changing Networks: A Reinforcement Learning Approach
 Advances in Neural Information Processing Systems 6
, 1994
"... This paper describes the Qrouting algorithm for packet routing, in which a reinforcement learning module is embedded into each node of a switching network. Only local communication is used by each node to keep accurate statistics on which routing decisions lead to minimal delivery times. In simple ..."
Abstract

Cited by 184 (2 self)
 Add to MetaCart
This paper describes the Qrouting algorithm for packet routing, in which a reinforcement learning module is embedded into each node of a switching network. Only local communication is used by each node to keep accurate statistics on which routing decisions lead to minimal delivery times. In simple experiments involving a 36node, irregularly connected network, Qrouting proves superior to a nonadaptive algorithm based on precomputed shortest paths and is able to route efficiently even when critical aspects of the simulation, such as the network load, are allowed to vary dynamically. The paper concludes with a discussion of the tradeoff between discovering shortcuts and maintaining stable policies. 1 INTRODUCTION The field of reinforcement learning has grown dramatically over the past several years, but with the exception of backgammon [8, 2], has had few successful applications to largescale, practical tasks. This paper demonstrates that the practical task of routing packets through...
Convergence Results for SingleStep OnPolicy ReinforcementLearning Algorithms
 MACHINE LEARNING
, 1998
"... An important application of reinforcement learning (RL) is to finitestate control problems and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform e ..."
Abstract

Cited by 112 (7 self)
 Add to MetaCart
An important application of reinforcement learning (RL) is to finitestate control problems and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of singlestep onpolicy RL algorithms for control. Onpolicy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related onpolicy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning that result in convergence to both optimal values and optimal policies.
Bayesian Qlearning
 In AAAI/IAAI
, 1998
"... A central problem in learning in complex environments is balancing exploration of untested actions against exploitation of actions that are known to be good. The benefit of exploration can be estimated using the classical notion of Value of Information the expected improvement in future decision ..."
Abstract

Cited by 104 (1 self)
 Add to MetaCart
A central problem in learning in complex environments is balancing exploration of untested actions against exploitation of actions that are known to be good. The benefit of exploration can be estimated using the classical notion of Value of Information the expected improvement in future decision quality that might arise from the information acquired by exploration. Estimating this quantity requires an assessment of the agent's uncertainty about its current value estimates for states. In this paper, we adopt a Bayesian approach to maintaining this uncertain information. We extend Watkins' Qlearning by maintaining and propagating probability distributions over the Qvalues. These distributions are used to compute a myopic approximation to the value of information for each action and hence to select the action that best balances exploration and exploitation. We establish the convergence properties of our algorithm and show experimentally that it can exhibit substantial improvements o...
Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
, 1996
"... This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dyna ..."
Abstract

Cited by 99 (12 self)
 Add to MetaCart
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called ndiscountoptimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gainoptimal policies that maximize average reward, none of them can reliably filter these to produce biasoptimal (or Toptimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of Rlearning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of Rlearning is carried out to test its dependence on learning rates and exploration levels. The results suggest that Rlearning is quite sensitive to exploration strategies, and can fall into suboptimal limit cycles. The performance of Rlearning is also compared with that of Qlearning, the best studied discounted RL method. Here, the results suggest that Rlearning can be finetuned to give better performance than Qlearning in both domains.
Hierarchical Learning in Stochastic Domains: Preliminary Results
 In Proceedings of the Tenth International Conference on Machine Learning
, 1993
"... This paper presents the HDG learning algorithm, which uses a hierarchical decomposition of the state space to make learning to achieve goals more efficient with a small penalty in path quality. Special care must be taken when performing hierarchical planning and learning in stochastic domains, ..."
Abstract

Cited by 99 (8 self)
 Add to MetaCart
This paper presents the HDG learning algorithm, which uses a hierarchical decomposition of the state space to make learning to achieve goals more efficient with a small penalty in path quality. Special care must be taken when performing hierarchical planning and learning in stochastic domains, because macrooperators cannot be executed ballistically. The HDG algorithm, which is a descendent of Watkins' Qlearning algorithm, is described here and preliminary empirical results are presented. 1 INTRODUCTION Reinforcement learning is a general tool for deriving strategies that optimize a fixed reinforcement function in a stochastic environment. A crucial problem in reinforcement learning is temporal credit assignment: how to choose actions based on good results that happen after (perhaps long after) the action is taken. This problem is solved well in the general case by temporal difference methods, such as Watkins' Q learning [Barto et al., 1989, Watkins, 1989] and Sutton's TD ...
Active Markov Localization for Mobile Robots
 Robotics and Autonomous Systems
, 1998
"... Localization is the problem of determining the position of a mobile robot from sensor data. Most existing localization approaches are passive, i.e., they do not exploit the opportunity to control the robot's effectors during localization. This paper proposes an active localization approach. The appr ..."
Abstract

Cited by 81 (7 self)
 Add to MetaCart
Localization is the problem of determining the position of a mobile robot from sensor data. Most existing localization approaches are passive, i.e., they do not exploit the opportunity to control the robot's effectors during localization. This paper proposes an active localization approach. The approach is based on Markov localization and provides rational criteria for (1) setting the robot's motion direction (exploration), and (2) determining the pointing direction of the sensors so as to most efficiently localize the robot. Furthermore, it is able to deal with noisy sensors and approximative world models. The appropriateness of our approach is demonstrated empirically using a mobile robot in a structured office environment. Key words: Robot Position Estimation, Autonomous Service Robots 1 Introduction To navigate reliably in indoor environments, a mobile robot must know where it is. Over the last few years, there has been a tremendous scientific interest in algorithms for estimating ...
When Push comes to Shove: A Computational Model of the Role of Motor Control in the Acquisition of Action Verbs
, 1997
"... Children learn a variety of verbs for hand actions starting in their second year of life. The semantic distinctions can be subtle, and they vary across languages, yet they are learned quickly. Howis this possible? This dissertation explores the hypothesis that to explain the acquisition and use of a ..."
Abstract

Cited by 62 (1 self)
 Add to MetaCart
Children learn a variety of verbs for hand actions starting in their second year of life. The semantic distinctions can be subtle, and they vary across languages, yet they are learned quickly. Howis this possible? This dissertation explores the hypothesis that to explain the acquisition and use of action verbs, motor control must be taken into account. It presents a model of embodied semanticsbased on the principles of neural computation in general and on the human motor system in particularwhich takes a set of labelled actions and learns both to label novel actions and to obey verbal commands. Akey feature of the model is the executing schema, anactivecontroller mechanism which, by actually driving behavior, allows the model to carry out verbal commands. A hardwired mechanism links the activity of executing schemas to a set of linguistically important features including hand posture, joint motions, force, aspect and goals. The feature set is relatively small and is xed, helping to make learning tractable. Moreover, the use of traditional feature structures facilitates the use of model merging, a Bayesian probabilistic learning algorithm which rapidly learns plausible word meanings, automatically determines an appropriate number of senses for each verb, and can plausibly be mapped to a connectionist recruitment