Results 1–10 of 34
Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms
 MACHINE LEARNING
, 1998
Abstract

Cited by 111 (7 self)
An important application of reinforcement learning (RL) is to finite-state control problems, and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning that result in convergence to both optimal values and optimal policies.
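The abstract does not spell the algorithms out; as a hedged illustration, here is a minimal tabular Sarsa sketch with the kind of decaying (GLIE-style) ε-greedy exploration such convergence results cover. The 4-state chain and all constants are our illustrative assumptions, not the paper's.

```python
import random

def sarsa(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Single-step on-policy Sarsa with decaying epsilon-greedy exploration
    on a toy 4-state chain: action 1 moves right, action 0 moves left, and
    reaching state 3 ends the episode with reward 1."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(4)]
    def policy(s, eps):
        # the behaviour policy is also the policy being learned (on-policy)
        if rng.random() < eps:
            return rng.randrange(2)
        return max(range(2), key=lambda a: Q[s][a])
    for ep in range(1, episodes + 1):
        eps = 1.0 / ep                            # decaying exploration
        s, a = 0, policy(0, eps)
        done = False
        while not done:
            s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
            r, done = (1.0, True) if s2 == 3 else (0.0, False)
            a2 = policy(s2, eps)
            target = r + (0.0 if done else gamma * Q[s2][a2])
            Q[s][a] += alpha * (target - Q[s][a])  # single-step update
            s, a = s2, a2
    return Q

Q = sarsa()
```

With the 1/episode decay, exploration vanishes while every pair keeps getting visited, so the greedy policy converges to "move right" in every non-terminal state.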
Learning to Drive a Bicycle using Reinforcement Learning and Shaping
, 1998
Abstract

Cited by 61 (3 self)
We present and solve a real-world problem of learning to drive a bicycle. We solve the problem by online reinforcement learning using the Sarsa(λ) algorithm. Then we solve the composite problem of learning to balance a bicycle and then drive to a goal. In our approach the reinforcement function is independent of the task the agent tries to learn to solve.

1 Introduction

Here we consider the problem of learning to balance on a bicycle. Having done this we want to drive the bicycle to a goal. The second problem is not as straightforward as it may seem. The learning agent has to solve two problems at the same time: Balancing on the bicycle and driving to a specific place. Recently, ideas from behavioural psychology have been adapted by reinforcement learning to solve this type of problem. We will return to this in section 3. In reinforcement learning an agent interacts with an environment or a system. At each time step the agent receives information on the state of the system and chooses ...
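For readers unfamiliar with Sarsa(λ), a minimal tabular sketch with accumulating eligibility traces, run on a toy chain rather than the paper's bicycle simulator; the chain, constants, and function names are illustrative assumptions only.

```python
import random

def sarsa_lambda(episodes=300, alpha=0.3, gamma=0.9, lam=0.8, eps=0.1, seed=1):
    """Tabular Sarsa(lambda) on a toy 4-state chain: action 1 moves right,
    action 0 moves left, and reaching state 3 pays reward 1. Eligibility
    traces let one TD error update every recently visited state-action pair."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(4)]
    def policy(s):
        if rng.random() < eps:                    # persistent exploration
            return rng.randrange(2)
        return max(range(2), key=lambda a: Q[s][a])
    for _ in range(episodes):
        E = [[0.0, 0.0] for _ in range(4)]        # eligibility traces
        s, a = 0, policy(0)
        done = False
        while not done:
            s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
            r, done = (1.0, True) if s2 == 3 else (0.0, False)
            a2 = policy(s2)
            delta = r + (0.0 if done else gamma * Q[s2][a2]) - Q[s][a]
            E[s][a] += 1.0                        # accumulate on visited pair
            for ss in range(4):
                for aa in range(2):
                    Q[ss][aa] += alpha * delta * E[ss][aa]
                    E[ss][aa] *= gamma * lam      # all traces decay each step
            s, a = s2, a2
    return Q

Q = sarsa_lambda()
```

Setting lam=0 recovers the single-step Sarsa update; larger λ propagates the terminal reward back along the episode faster.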
Fast Concurrent Reinforcement Learners
 IN PROCEEDINGS OF THE SEVENTEENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE
, 2001
Abstract

Cited by 38 (6 self)
When several agents learn concurrently, the payoff received by an agent is dependent on the behavior of the other agents. As the other agents learn, the reward of one agent becomes non-stationary. This makes learning in multi-agent systems more difficult than single-agent learning. A few methods, however, are known to guarantee convergence to equilibrium in the limit in such systems. In this paper we experimentally study one such technique, minimax-Q, in a competitive domain and prove its equivalence with another well-known method for competitive domains. We study the rate of convergence of minimax-Q and investigate possible ways of increasing it. We also present a variant of the algorithm, minimax-SARSA, and prove its convergence to minimax-Q values under appropriate conditions. Finally we show that this new algorithm performs better than simple minimax-Q in a general-sum domain as well.
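A sketch of the minimax-Q backup for a two-action zero-sum stage game. Littman's algorithm solves the stage game by linear programming; the grid search over mixed strategies and the matching-pennies payoffs here are simplifying assumptions for illustration.

```python
def minimax_value(M, grid=101):
    """Value of a zero-sum stage game M[a][o] (row player maximises), found
    by scanning the row player's mixed strategies p on a grid; against each
    p the opponent plays the minimising column."""
    best = -float("inf")
    for i in range(grid):
        p = i / (grid - 1)
        worst = min(p * M[0][o] + (1 - p) * M[1][o] for o in range(len(M[0])))
        best = max(best, worst)
    return best

def minimax_q_update(Q, s, a, o, r, s2, alpha=0.1, gamma=0.9):
    """One minimax-Q backup: like Q-learning, but the successor state is
    valued at its minimax (not max) value, so the agent prepares for the
    worst-case opponent response."""
    v2 = minimax_value(Q[s2])
    Q[s][a][o] += alpha * (r + gamma * v2 - Q[s][a][o])

# matching pennies: the stage-game value is 0 at the 50/50 mixed strategy
v = minimax_value([[1.0, -1.0], [-1.0, 1.0]])
```

The minimax-SARSA variant mentioned in the abstract would replace the minimax successor value with the value of the action pair actually taken next.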
Learning and Value Function Approximation in Complex Decision Processes
, 1998
Abstract

Cited by 36 (4 self)
In principle, a wide variety of sequential decision problems, ranging from dynamic resource allocation in telecommunication networks to financial risk management, can be formulated in terms of stochastic control and solved by the algorithms of dynamic programming. Such algorithms compute and store a value function, which evaluates expected future reward as a function of current state. Unfortunately, exact computation of the value function typically requires time and storage that grow proportionately with the number of states, and consequently, the enormous state spaces that arise in practical applications render the algorithms intractable. In this thesis, we study tractable methods that approximate the value function. Our work builds on research in an area of artificial intelligence known as reinforcement learning. A point of focus of this thesis is temporal-difference learning, a stochastic algorithm inspired to some extent by phenomena observed in animal behavior. Given a selection of...
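As a hedged sketch of the thesis's starting point, temporal-difference learning with a linear value approximator on a standard 5-state random walk; the one-hot feature map and all hyperparameters are illustrative choices, not from the thesis.

```python
import random

def td0_linear(episodes=2000, alpha=0.05, gamma=1.0, seed=0):
    """TD(0) with a linear value approximator v(s) = w . phi(s) on a 5-state
    random walk: start in the middle, step left or right uniformly; falling
    off the right end pays 1, the left end pays 0. True values are (s+1)/6."""
    rng = random.Random(seed)
    n = 5
    def phi(s):
        f = [0.0] * n        # one-hot features; any feature map plugs in here
        f[s] = 1.0
        return f
    def value(w, s):
        return sum(wi * fi for wi, fi in zip(w, phi(s)))
    w = [0.0] * n
    for _ in range(episodes):
        s = n // 2
        done = False
        while not done:
            s2 = s + (1 if rng.random() < 0.5 else -1)
            if s2 < 0:
                r, v2, done = 0.0, 0.0, True
            elif s2 >= n:
                r, v2, done = 1.0, 0.0, True
            else:
                r, v2 = 0.0, value(w, s2)
            delta = r + gamma * v2 - value(w, s)   # TD error
            for i, fi in enumerate(phi(s)):
                w[i] += alpha * delta * fi         # semi-gradient step
            s = s2
    return w

w = td0_linear()
```

With one-hot features this reduces to tabular TD(0); the point of the approximation setting is that a compact feature map can stand in for the enormous state spaces the abstract describes.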
Q-Learning in Continuous State and Action Spaces
 IN AUSTRALIAN JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE
, 1999
Abstract

Cited by 28 (5 self)
Q-learning can be used to learn a control policy that maximises a scalar reward through interaction with the environment. Q-learning is commonly applied to problems with discrete states and actions. We describe a method suitable for control tasks which require continuous actions, in response to continuous states. The system consists of a neural network coupled with a novel interpolator. Simulation results are presented for a non-holonomic control task. Advantage Learning, a variation of Q-learning, is shown to enhance learning speed and reliability for this task.
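A minimal tabular sketch of an advantage-learning backup in the style of Baird's formulation, which this line of work adapts; with k = 1 it reduces to the ordinary Q-learning target. The toy chain, the constant k, and the random-sampling training loop are our assumptions, not the paper's setup.

```python
import random

def advantage_update(A, s, a, r, s2, done, alpha=0.5, gamma=0.9, k=0.5):
    """One advantage-learning backup: the target is
    V(s) + (r + gamma*V(s') - V(s)) / k, with V(s) = max_a A(s, a).
    k = 1 gives exactly the Q-learning target; k < 1 widens the gap between
    the best and the other actions, which helps when values come from a
    function approximator."""
    V = max(A[s])
    V2 = 0.0 if done else max(A[s2])
    target = V + (r + gamma * V2 - V) / k
    A[s][a] += alpha * (target - A[s][a])

# toy 4-state chain: action 1 moves right, reaching state 3 pays reward 1
rng = random.Random(0)
A = [[0.0, 0.0] for _ in range(4)]
for _ in range(5000):
    s, a = rng.randrange(3), rng.randrange(2)
    s2 = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    advantage_update(A, s, a, 1.0 if s2 == 3 else 0.0, s2, s2 == 3)
```

After training, the greedy action ("move right") carries the state's value while the other action sits visibly below it, which is the sharpened contrast advantage learning is designed to produce.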
Temporal sequence learning, prediction and control: a review of different models and their relation to biological mechanisms
 Neural Computation
, 2004
Abstract

Cited by 26 (5 self)
In this article we compare methods for temporal sequence learning (TSL) across the disciplines of machine control, classical conditioning, neuronal models for TSL, and spike-timing-dependent plasticity. This review will briefly introduce the most influential models and focus on two questions: 1) To what degree are reward-based (e.g. TD-learning) and correlation-based (Hebbian) learning related? and 2) How do the different models correspond to possibly underlying biological mechanisms of synaptic plasticity? We will first compare the different models in an open-loop condition, where behavioral feedback does not alter the learning. Here we observe that reward-based and correlation-based learning are indeed very similar. Machine control is then used to introduce the problem of closed-loop control (e.g. “actor-critic architectures”). Here the problem of evaluative (“rewards”) versus non-evaluative (“correlations”) feedback from the environment will be discussed, showing that both learning approaches are fundamentally different in the closed-loop condition. In trying to answer the second question we will compare neuronal versions of the different learning architectures to the anatomy of the involved brain structures (basal ganglia, thalamus and ...
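The two rule families being compared can be put side by side as weight updates on a linear unit; this schematic sketch (the function names and constants are illustrative) shows the structural point: the TD rule is gated by a global prediction-error term, while the Hebbian rule uses only the local pre/post activity product.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def td_update(w, x, x_next, r, alpha=0.1, gamma=0.9):
    """Reward-based rule: the weight change is gated by a global TD error
    delta = r + gamma*v(x') - v(x), a reward-prediction-error signal."""
    delta = r + gamma * dot(w, x_next) - dot(w, x)
    return [wi + alpha * delta * xi for wi, xi in zip(w, x)]

def hebbian_update(w, x, y, alpha=0.1):
    """Correlation-based rule: the weight change is the product of
    presynaptic activity x_i and postsynaptic activity y; no reward or
    error term appears."""
    return [wi + alpha * y * xi for wi, xi in zip(w, x)]

w_td = td_update([0.0, 0.0], [1.0, 0.0], [0.0, 0.0], 1.0)
w_hebb = hebbian_update([0.0, 0.0], [1.0, 0.0], 2.0)
```

In the open-loop condition the abstract describes, the TD error δ plays the role of the postsynaptic factor y, which is why the two rules look so similar there; closing the loop makes evaluative and non-evaluative feedback behave differently.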
Generalized Markov decision processes: dynamic-programming and reinforcement-learning algorithms
 in: Proceedings of the 13th International Conference on Machine Learning (ICML-96)
, 1996
Abstract

Cited by 24 (6 self)
The problem of maximizing the expected total discounted reward in a completely observable Markovian environment, i.e., a Markov decision process (MDP), models a particular class of sequential decision problems. Algorithms have been developed for making optimal decisions in MDPs given either an MDP specification or the opportunity to interact with the MDP over time. Recently, other sequential decision-making problems have been studied, prompting the development of new algorithms and analyses. We describe a new generalized model that subsumes MDPs as well as many of the recent variations. We prove some basic results concerning this model and develop generalizations of value iteration, policy iteration, model-based reinforcement learning, and Q-learning that can be used to make optimal decisions in the generalized model under various assumptions. Applications of the theory to particular models are described, including risk-averse MDPs, exploration-sensitive MDPs, Sarsa, Q-learning with spreading, two-player games, and approximate max picking via sampling. Central to the results are the contraction property of the value operator and a stochastic-approximation theorem that reduces asynchronous convergence to synchronous convergence.
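A sketch of the generalized value-iteration idea: the backup is parameterized by a per-state summary operator over action values, and because any non-expansive summary keeps the backup a γ-contraction, iteration still converges. The two-state toy model is our assumption for illustration.

```python
def generalized_value_iteration(P, R, summary, gamma=0.9, iters=100):
    """Value iteration for a generalized MDP. P[s][a] is a list of
    (next_state, prob) pairs, R[s][a] the immediate reward, and `summary`
    the per-state operator over action values: max recovers the ordinary
    MDP, min models an adversarial chooser, and a fixed-policy average
    gives plain policy evaluation."""
    V = [0.0] * len(R)
    for _ in range(iters):
        V = [summary([R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                      for a in range(len(P[s]))])
             for s in range(len(R))]
    return V

# toy model: in state 0, action 0 loops (reward 0) and action 1 jumps to
# the absorbing state 1 (reward 1)
P = [[[(0, 1.0)], [(1, 1.0)]], [[(1, 1.0)]]]
R = [[0.0, 1.0], [0.0]]
V_max = generalized_value_iteration(P, R, max)   # optimiser:  V[0] -> 1.0
V_min = generalized_value_iteration(P, R, min)   # adversary:  V[0] -> 0.0
```

Swapping only the summary operator switches the model between an optimising agent and a worst-case chooser, which is the kind of unification the abstract's generalized model formalizes.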
Applications of the self-organising map to reinforcement learning
 Neural Networks
, 2002
Abstract

Cited by 16 (0 self)
Running Title: Applying the SOM to reinforcement learning
Shaping in Reinforcement Learning by Changing the Physics of the Problem
Abstract

Cited by 14 (1 self)
Children learn to ride a bicycle by using training wheels. They are actually trying to learn one task (riding without training wheels) by training another one. In general, solving a difficult problem can be facilitated by training other problems.