Correlated Q-learning
- In Proceedings of the Twentieth International Conference on Machine Learning, 2003
Cited by 49 (2 self)
There have been several attempts to design multiagent Q-learning algorithms capable of learning equilibrium policies in general-sum Markov games, just as Q-learning learns optimal policies in Markov decision processes. We introduce correlated Q-learning, one such algorithm based on the correlated equilibrium solution concept. Motivated by a fixed point proof of the existence of stationary correlated equilibrium policies in Markov games, we present a generic multiagent Q-learning algorithm of which many popular algorithms are immediate special cases. We also prove that certain variants of correlated (and Nash) Q-learning are guaranteed to converge to stationary correlated (and Nash) equilibrium policies in two special classes of Markov games, namely zero-sum and common-interest. Finally, we show empirically that correlated Q-learning outperforms Nash Q-learning, further justifying the former beyond noting that it is less computationally expensive than the latter.
Correlated-Q learning
- In AAAI Spring Symposium, 2003
Cited by 34 (0 self)
This paper introduces Correlated-Q (CE-Q) learning, a multiagent Q-learning algorithm based on the correlated equilibrium (CE) solution concept. CE-Q generalizes both Nash-Q and Friend-and-Foe-Q: in general-sum games, the set of correlated equilibria contains the set of Nash equilibria; in constant-sum games, the set of correlated equilibria contains the set of minimax equilibria. This paper describes experiments with four variants of CE-Q, demonstrating empirical convergence to equilibrium policies on a testbed of general-sum Markov games.
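To make the correlated equilibrium concept concrete, the following is a minimal sketch of a check for the standard correlated equilibrium constraints in a two-player matrix game. It is not the CE-Q algorithm from the papers above; the function name, the payoff layout, and the Chicken payoffs in the usage note are illustrative assumptions.

```python
def is_correlated_equilibrium(p, u1, u2, tol=1e-9):
    """Check the correlated equilibrium constraints for a two-player game.

    p[a][b]  : joint probability that the row player is told to play a
               and the column player is told to play b (hypothetical layout).
    u1[a][b] : row player's payoff, u2[a][b] : column player's payoff.
    Returns True when no player gains by deviating from any recommendation.
    """
    n_rows, n_cols = len(p), len(p[0])

    # Row player: told to play a, considers deviating to dev.
    for a in range(n_rows):
        for dev in range(n_rows):
            gain = sum(p[a][b] * (u1[dev][b] - u1[a][b]) for b in range(n_cols))
            if gain > tol:
                return False

    # Column player: told to play b, considers deviating to dev.
    for b in range(n_cols):
        for dev in range(n_cols):
            gain = sum(p[a][b] * (u2[a][dev] - u2[a][b]) for a in range(n_rows))
            if gain > tol:
                return False

    return True
```

As a usage example under assumed payoffs, take the game of Chicken with actions (Dare, Chicken), `u1 = [[0, 7], [2, 6]]` and `u2 = [[0, 2], [7, 6]]`. The joint distribution `[[0, 1/3], [1/3, 1/3]]` passes the check, while placing all mass on (Dare, Dare) does not.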
Learning rates for Q-Learning
- Journal of Machine Learning Research, 2001
Cited by 22 (3 self)
In this paper we derive convergence rates for Q-learning. We show an interesting relationship between the convergence rate and the learning rate used in Q-learning. For a polynomial learning rate, one which is $1/t^{\omega}$ at time $t$ where $\omega \in (1/2, 1)$, we show that the convergence rate is polynomial in $1/(1-\gamma)$. In contrast, we show that for a linear learning rate, one which is $1/t$ at time $t$, the convergence rate has an exponential dependence on $1/(1-\gamma)$, where $\gamma$ is the discount factor. In addition we show a simple example that proves that this behavior is inherent for a linear learning rate. School of Computer Science, Tel-Aviv University, e-mail: evend@cs.tau.ac.il. School of Computer Science, Tel-Aviv University, e-mail: mansour@cs.tau.ac.il. 1 Introduction. In Reinforcement Learning, an agent wanders in an unknown environment and tries to maximize its long term return by performing actions and receiving rewards. The challenge is to understand how a current action w...
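The two learning-rate schedules contrasted in this abstract can be sketched in a tabular Q-learning update as follows. This is a minimal illustration, not the paper's analysis: the function names are hypothetical, and counting visits separately per state-action pair is an assumption made here.

```python
def learning_rate(t, omega=1.0):
    """Step size 1 / t**omega at visit count t.

    omega = 1 gives the linear rate 1/t; omega in (1/2, 1) gives the
    polynomial rate described in the abstract.
    """
    return 1.0 / (t ** omega)


def q_learning(transitions, n_states, n_actions, gamma=0.9, omega=0.8):
    """Tabular Q-learning over an iterable of (s, a, r, s_next) samples.

    Assumption: each state-action pair keeps its own visit counter, and
    that counter plays the role of t in the step size.
    """
    q = [[0.0] * n_actions for _ in range(n_states)]
    visits = [[0] * n_actions for _ in range(n_states)]
    for s, a, r, s_next in transitions:
        visits[s][a] += 1
        alpha = learning_rate(visits[s][a], omega)
        # Standard Q-learning target: reward plus discounted best next value.
        target = r + gamma * max(q[s_next])
        q[s][a] = (1.0 - alpha) * q[s][a] + alpha * target
    return q
```

Calling `q_learning(samples, n_states, n_actions, omega=1.0)` uses the linear schedule, while a value such as `omega=0.8` uses a polynomial one; only the step size changes between the two runs.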
Module Based Reinforcement Learning for a Real Robot
Cited by 14 (0 self)
The behaviour of reinforcement learning (RL) algorithms is best understood in completely observable, finite state- and action-space, discrete-time controlled Markov-chains. Robot-learning domains, on the other hand, are inherently infinite both in time and space, and moreover they are only partially observable. In this article we suggest a systematic method whose motivation comes from the desire to transform the task-to-be-solved into a finite-state, discrete-time, "approximately" Markovian task, which is completely observable too. The key idea is to break up the problem into subtasks and design controllers for each of the subtasks. Then operating conditions are attached to the controllers (together, the controllers and their operating conditions are called modules) and possible additional features are designed to facilitate observability. A new discrete time-counter is introduced at the "module-level" that clicks only when a change in the value of one of the features is observe...
Multi-criteria Reinforcement Learning
, 1998
Cited by 10 (0 self)
We consider multi-criteria sequential decision making problems where the vector-valued evaluations are compared by a given, fixed total ordering. Conditions for the optimality of stationary policies and the Bellman optimality equation are given. The analysis requires special care as the topology introduced by pointwise convergence and the order-topology introduced by the preference order are in general incompatible. Reinforcement learning algorithms are proposed and analyzed. Preliminary computer experiments confirm the validity of the derived algorithms. It is observed that in the medium term, multi-criteria RL often converges to better solutions (measured by the first criterion) than its single-criterion counterparts. These types of multi-criteria problems are most useful when there are several optimal solutions to a problem and one wants to choose the one among these which is optimal according to another fixed criterion. Example applications include alternating games, when in addition...
The Asymptotic Convergence-Rate of Q-learning
, 1998
Cited by 10 (3 self)
In this paper we show that for discounted MDPs with discount factor $\gamma > 1/2$ the asymptotic rate of convergence of Q-learning is $O(1/t^{R(1-\gamma)})$ if $R(1-\gamma) < 1/2$ and $O(\sqrt{\log\log t / t})$ otherwise, provided that the state-action pairs are sampled from a fixed probability distribution. Here $R = p_{\min}/p_{\max}$ is the ratio of the minimum and maximum state-action occupation frequencies. The results extend to convergent on-line learning provided that $p_{\min} > 0$, where $p_{\min}$ and $p_{\max}$ now become the minimum and maximum state-action occupation frequencies corresponding to the stationary distribution. 1 INTRODUCTION. Q-learning is a popular reinforcement learning (RL) algorithm whose convergence is well demonstrated in the literature (Jaakkola et al., 1994; Tsitsiklis, 1994; Littman and Szepesvári, 1996; Szepesvári and Littman, 1996). Our aim in this paper is to provide an upper bound for the convergence rate of (lookup-table based) Q-learning algorithms. Although, this upper bound i...
Learning and Exploitation do not Conflict under Minimax Optimality
Cited by 9 (4 self)
We show that adaptive real time dynamic programming extended with the action selection strategy which chooses the best action according to the latest estimate of the cost function yields asymptotically optimal policies within finite time under the minimax optimality criterion. From this it follows that learning and exploitation do not conflict under this special optimality criterion. We relate this result to learning optimal strategies in repeated two-player zero-sum deterministic games. Keywords: reinforcement learning, self-optimizing systems, dynamic games. 1 Introduction. Reinforcement learning (RL) concerns practical problems related to learning of optimal behaviour in sequential decision tasks. The most popular theoretical framework adopted by RL researchers is that of Markovian Decision Problems (MDPs). One of the main questions in RL is what extent of exploration is needed for a learner so that the price of exploration does not become too demanding. Usually some exploration (e...
Heuristic Reinforcement Learning applied to RoboCup Simulation Agents
Cited by 5 (1 self)
This paper describes the design and implementation of robotic agents for the RoboCup Simulation 2D category that learn using a recently proposed Heuristic Reinforcement Learning algorithm, the Heuristically Accelerated Q-Learning (HAQL). This algorithm allows the use of heuristics to speed up the well-known Reinforcement Learning algorithm Q-Learning. A heuristic function that influences the choice of the actions characterizes the HAQL algorithm. A set of empirical evaluations was conducted in the RoboCup 2D Simulator, and experimental results show that even very simple heuristics significantly enhance the performance of the agents.
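The idea of a heuristic function influencing action choice can be sketched as below. This is only an illustration under stated assumptions: the abstract does not give the selection rule, so the additive combination `Q + xi * H`, the epsilon-greedy exploration, and all names (`haql_action`, `xi`, `epsilon`) are hypothetical choices made here.

```python
import random


def haql_action(q_row, h_row, xi=1.0, epsilon=0.1, rng=random):
    """Choose an action for one state, biased by a heuristic.

    q_row[a] : current Q-value estimate for action a in this state.
    h_row[a] : heuristic bonus for action a (assumed to be given).
    xi       : weight of the heuristic (assumed additive combination).
    With probability epsilon a uniformly random action is explored.
    """
    n_actions = len(q_row)
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    # Greedy choice with respect to the heuristic-adjusted score.
    scores = [q + xi * h for q, h in zip(q_row, h_row)]
    return max(range(n_actions), key=scores.__getitem__)
```

In this sketch the heuristic changes only which action is tried, not the Q-value update itself, so setting `xi=0` falls back to plain epsilon-greedy Q-learning action selection.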
Heuristically Accelerated Q-Learning: a new approach to speed up Reinforcement Learning
- Lecture Notes in Artificial Intelligence
Cited by 4 (3 self)
This work presents a new algorithm, called Heuristically Accelerated Q-Learning (HAQL), that allows the use of heuristics to speed up the well-known Reinforcement Learning algorithm Q-learning.
Non-Markovian Policies in Sequential Decision Problems
, 1997
Cited by 4 (2 self)
In this article we prove the validity of the Bellman Optimality Equation and related results for sequential decision problems with a general recursive structure. The characteristic feature of our approach is that non-Markovian policies are also taken into account. The theory is motivated by some experiments with a learning robot. 1 Introduction. The theory of sequential decision problems is an important mathematical tool for studying some problems of cybernetics, e.g. control of robots. Consider for example the robot shown in Figure 1. This robot, called Khepera, is equipped with eight infra-red sensors, six in the front and two at the back, the infra-red sensors measuring the proximity of objects in the range 0-5 cm. The robot has two wheels driven by two independent DC motors and a gripper that has two degrees of freedom and is equipped with a resistivity sensor and an object-presence sensor. The robot has a vision turret mounted on its top. The vision turret has an image se...

