Results 1–10 of 22
Correlated Q-learning
 In Proceedings of the Twentieth International Conference on Machine Learning
, 2003
Abstract

Cited by 56 (2 self)
There have been several attempts to design multiagent Q-learning algorithms capable of learning equilibrium policies in general-sum Markov games, just as Q-learning learns optimal policies in Markov decision processes. We introduce correlated Q-learning, one such algorithm based on the correlated equilibrium solution concept. Motivated by a fixed point proof of the existence of stationary correlated equilibrium policies in Markov games, we present a generic multiagent Q-learning algorithm of which many popular algorithms are immediate special cases. We also prove that certain variants of correlated (and Nash) Q-learning are guaranteed to converge to stationary correlated (and Nash) equilibrium policies in two special classes of Markov games, namely zero-sum and common-interest. Finally, we show empirically that correlated Q-learning outperforms Nash Q-learning, further justifying the former beyond noting that it is less computationally expensive than the latter.
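The "generic multiagent Q-learning algorithm" this abstract describes can be sketched as a standard Q-update whose next-state value comes from a pluggable equilibrium selection function; swapping that function is what recovers the different named algorithms as special cases. The table layout and the `equilibrium_value` callback below are illustrative assumptions, not the paper's notation:

```python
def multiagent_q_update(Q, s, joint_action, rewards, s_next, equilibrium_value,
                        alpha=0.1, gamma=0.9):
    """One step of a generic multiagent Q-learning scheme (a sketch).

    Q[i] is player i's table mapping state -> {joint_action: value}.
    equilibrium_value(Q, s) returns each player's value of state s under the
    chosen solution concept (correlated, Nash, minimax, ...); plugging in
    different choices yields different multiagent Q-learning variants.
    """
    values = equilibrium_value(Q, s_next)  # per-player values of the next state
    for i in range(len(Q)):
        old = Q[i][s][joint_action]
        Q[i][s][joint_action] = (1 - alpha) * old + alpha * (rewards[i] + gamma * values[i])
    return Q
```

A trivial stand-in such as `lambda Q, s: [max(q[s].values()) for q in Q]` (each player optimistically takes the best joint action) is enough to exercise the update; a correlated-equilibrium version would instead solve a linear program over joint-action distributions at each state.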
Correlated-Q learning
 In NIPS Workshop on Multiagent Learning
, 2002
Abstract

Cited by 44 (1 self)
Bowling named two desiderata for multiagent learning algorithms: rationality and convergence. This paper introduces correlated-Q learning, a natural generalization of Nash-Q and FFQ that satisfies these criteria. Nash-Q satisfies rationality, but in general it does not converge. FFQ satisfies convergence, but in general it is not rational. Correlated-Q satisfies rationality by construction. This paper demonstrates the empirical convergence of correlated-Q on a standard testbed of general-sum Markov games.
Learning rates for Q-Learning
 Journal of Machine Learning Research
, 2001
Abstract

Cited by 28 (3 self)
In this paper we derive convergence rates for Q-learning. We show an interesting relationship between the convergence rate and the learning rate used in Q-learning. For a polynomial learning rate, one which is 1/t^ω at time t where ω ∈ (1/2, 1), we show that the convergence rate is polynomial in 1/(1−γ). In contrast, we show that for a linear learning rate, one which is 1/t at time t, the convergence rate has an exponential dependence on 1/(1−γ), where γ is the discount factor. In addition we show a simple example that proves that this behavior is inherent for a linear learning rate. 1 Introduction In Reinforcement Learning, an agent wanders in an unknown environment and tries to maximize its long term return by performing actions and receiving rewards. The challenge is to understand how a current action w...
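The two learning-rate schedules the abstract contrasts are simple to state in code. A minimal sketch, with the schedule applied inside a single tabular Q-learning step (the function names are illustrative):

```python
def learning_rate(t, schedule="polynomial", omega=0.6):
    """Learning-rate schedules contrasted in the abstract above.

    polynomial: alpha_t = 1/t^omega, omega in (1/2, 1) -> convergence
                polynomial in 1/(1 - gamma)
    linear:     alpha_t = 1/t -> convergence exponential in 1/(1 - gamma)
    """
    if schedule == "polynomial":
        return 1.0 / t ** omega
    return 1.0 / t  # linear schedule

def q_update(q_sa, target, t, schedule="polynomial", omega=0.6):
    """One tabular Q-learning step Q <- (1 - alpha_t) Q + alpha_t * target."""
    alpha = learning_rate(t, schedule, omega)
    return (1 - alpha) * q_sa + alpha * target
```

For example, `q_update(0.0, 1.0, t=2, schedule="linear")` moves the estimate halfway toward the target, since α₂ = 1/2 under the linear schedule.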
Multicriteria Reinforcement Learning
, 1998
Abstract

Cited by 19 (0 self)
We consider multicriteria sequential decision making problems where the vector-valued evaluations are compared by a given, fixed total ordering. Conditions for the optimality of stationary policies and the Bellman optimality equation are given. The analysis requires special care as the topology introduced by pointwise convergence and the order topology introduced by the preference order are in general incompatible. Reinforcement learning algorithms are proposed and analyzed. Preliminary computer experiments confirm the validity of the derived algorithms. It is observed that in the medium term multicriteria RL often converges to better solutions (measured by the first criterion) than its single-criterion counterparts. These types of multicriteria problems are most useful when there are several optimal solutions to a problem and one wants to choose the one among these which is optimal according to another fixed criterion. Example applications include alternating games, when in addition...
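One concrete instance of a "given, fixed total ordering" on vector-valued evaluations is the lexicographic order, which matches the use case described above: break ties on the primary criterion by a secondary one. A minimal sketch (the lexicographic choice and the dictionary layout are illustrative assumptions, not the paper's construction):

```python
def lexicographic_best(policies):
    """Pick the policy whose vector-valued evaluation is largest under a
    lexicographic total ordering (primary criterion first).

    `policies` maps policy name -> tuple of criterion values. Python tuples
    already compare lexicographically, so max() implements the ordering.
    """
    return max(policies, key=policies.get)
```

With `{"a": (1.0, 5.0), "b": (1.0, 7.0), "c": (0.9, 99.0)}`, policies "a" and "b" tie on the first criterion, and "b" wins on the second; "c" is excluded despite its large second component.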
Module Based Reinforcement Learning for a Real Robot
Abstract

Cited by 15 (0 self)
The behaviour of reinforcement learning (RL) algorithms is best understood in completely observable, finite state- and action-space, discrete-time controlled Markov chains. Robot-learning domains, on the other hand, are inherently infinite both in time and space, and moreover they are only partially observable. In this article we suggest a systematic method whose motivation comes from the desire to transform the task to be solved into a finite-state, discrete-time, "approximately" Markovian task which is completely observable too. The key idea is to break up the problem into subtasks and design controllers for each of the subtasks. Then operating conditions are attached to the controllers (together, the controllers and their operating conditions are called modules) and possible additional features are designed to facilitate observability. A new discrete time counter is introduced at the "module level" that clicks only when a change in the value of one of the features is observe...
The Asymptotic Convergence-Rate of Q-learning
, 1998
Abstract

Cited by 15 (3 self)
In this paper we show that for discounted MDPs with discount factor γ > 1/2 the asymptotic rate of convergence of Q-learning is O(1/t^(R(1−γ))) if R(1−γ) < 1/2 and O(√(log log t / t)) otherwise, provided that the state-action pairs are sampled from a fixed probability distribution. Here R = p_min/p_max is the ratio of the minimum and maximum state-action occupation frequencies. The results extend to convergent online learning provided that p_min > 0, where p_min and p_max now become the minimum and maximum state-action occupation frequencies corresponding to the stationary distribution. 1 INTRODUCTION Q-learning is a popular reinforcement learning (RL) algorithm whose convergence is well demonstrated in the literature (Jaakkola et al., 1994; Tsitsiklis, 1994; Littman and Szepesvári, 1996; Szepesvári and Littman, 1996). Our aim in this paper is to provide an upper bound for the convergence rate of (lookup-table based) Q-learning algorithms. Although this upper bound i...
Learning and Exploitation do not Conflict under Minimax Optimality
Abstract

Cited by 9 (4 self)
We show that adaptive real-time dynamic programming, extended with the action selection strategy which chooses the best action according to the latest estimate of the cost function, yields asymptotically optimal policies within finite time under the minimax optimality criterion. From this it follows that learning and exploitation do not conflict under this special optimality criterion. We relate this result to learning optimal strategies in repeated two-player zero-sum deterministic games. Keywords: reinforcement learning, self-optimizing systems, dynamic games 1 Introduction Reinforcement learning (RL) concerns practical problems related to learning of optimal behaviour in sequential decision tasks. The most popular theoretical framework adopted by RL researchers is that of Markovian Decision Problems (MDPs). One of the main questions in RL is how much exploration a learner needs so that the price of exploration does not become too demanding. Usually some exploration (e...
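The action selection strategy the abstract refers to — choose the best action according to the latest value estimate, under the minimax criterion — amounts to a max-min choice over opponent responses. A minimal sketch; the nested-dictionary table layout is an assumption for illustration, not the paper's notation:

```python
def minimax_greedy_action(Q, s):
    """Greedy action under the minimax optimality criterion: pick the action
    maximizing the worst-case estimated value over opponent responses.

    Q[s][a][o] is the current value estimate when we play action a and the
    opponent plays action o in state s (an assumed table layout).
    """
    return max(Q[s], key=lambda a: min(Q[s][a].values()))
```

For instance, with values `{"l": {"x": 1.0, "y": 0.0}, "r": {"x": 0.5, "y": 0.6}}` the worst cases are 0.0 for "l" and 0.5 for "r", so "r" is the minimax-greedy choice even though "l" has the highest best-case value.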
Heuristically Accelerated Q–Learning: a new approach to speed up Reinforcement Learning
 Advances in AI – SBIA
, 2004
Abstract

Cited by 7 (3 self)
Abstract. This work presents a new algorithm, called Heuristically Accelerated Q–Learning (HAQL), that allows the use of heuristics to speed up the well-known Reinforcement Learning algorithm Q–learning. A heuristic function H that influences the choice of the actions characterizes the HAQL algorithm. The heuristic function is strongly associated with the policy: it indicates which action should be taken in preference to another. This work also proposes an automatic method for the extraction of the heuristic function H from the learning process, called Heuristic from Exploration. Finally, experimental results show that even a very simple heuristic results in a significant enhancement of the performance of the reinforcement learning algorithm.
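A heuristic function that "influences the choice of the actions" can be sketched as an exploitation rule that adds a weighted heuristic bonus to the Q-values before taking the argmax. This is a minimal sketch in that spirit; the additive combination, the weight `xi`, and the epsilon-greedy wrapper are illustrative assumptions rather than the exact HAQL rule:

```python
import random

def heuristic_action(Q, H, s, actions, xi=1.0, epsilon=0.1):
    """Epsilon-greedy action choice biased by a heuristic function H.

    With probability epsilon, explore uniformly; otherwise exploit
    argmax_a [Q(s, a) + xi * H(s, a)]. H steers ties and near-ties toward
    the heuristically preferred action without replacing the learned Q.
    """
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)] + xi * H[(s, a)])
```

Note that with `epsilon=0` the choice is deterministic, so a heuristic bonus on one of two equally valued actions breaks the tie in its favor.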
Neural Networks for RealTime Traffic Signal Control
 IEEE Transactions on Intelligent Transportation Systems
Abstract

Cited by 6 (0 self)
Abstract—Real-time traffic signal control is an integral part of the urban traffic control system, and providing effective real-time traffic signal control for a large complex traffic network is an extremely challenging distributed control problem. This paper adopts the multiagent system approach to develop distributed unsupervised traffic-responsive signal control models, where each agent in the system is a local traffic signal controller for one intersection in the traffic network. The first multiagent system is developed using hybrid computational intelligent techniques. Each agent employs a multistage online learning process to update and adapt its knowledge base and decision-making mechanism. The second multiagent system is developed by integrating the simultaneous perturbation stochastic approximation (SPSA) theorem in fuzzy neural networks (NN). The problem of real-time traffic signal control is especially challenging if the agents are used for an infinite horizon problem, where online learning has to take place continuously once the agent-based traffic signal controllers are implemented into the traffic network. A comprehensive simulation model of a section of the Central Business District of Singapore has been developed using the PARAMICS microscopic simulation program. Simulation results show that the hybrid multiagent system provides significant improvement in traffic conditions when evaluated against an existing traffic signal control algorithm as well as the SPSA-NN-based multiagent system as the complexity of the simulation scenario increases. Using the hybrid NN-based multiagent system, the mean delay of each vehicle was reduced by 78% and the mean stoppage time by 85% compared to the existing traffic signal control algorithm. The promising results demonstrate the efficacy of the hybrid NN-based multiagent system in solving large-scale traffic signal control problems in a distributed manner.
Index Terms—Distributed control, hybrid model, neural control, online learning, traffic signal control. I.
Heuristic Reinforcement Learning applied to RoboCup Simulation Agents
Abstract

Cited by 5 (1 self)
Abstract. This paper describes the design and implementation of robotic agents for the RoboCup Simulation 2D category that learn using a recently proposed Heuristic Reinforcement Learning algorithm, the Heuristically Accelerated Q–Learning (HAQL). This algorithm allows the use of heuristics to speed up the well-known Reinforcement Learning algorithm Q–Learning. A heuristic function that influences the choice of the actions characterizes the HAQL algorithm. A set of empirical evaluations was conducted in the RoboCup 2D Simulator, and experimental results show that even very simple heuristics significantly enhance the performance of the agents.