## To Discount or not to Discount in Reinforcement Learning: A Case Study Comparing R Learning and Q Learning (1994)

Venue: Proceedings of the Eleventh International Conference on Machine Learning

Citations: 17 (3 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Mahadevan94todiscount,
  author    = {Sridhar Mahadevan},
  title     = {To Discount or not to Discount in Reinforcement Learning: A Case Study Comparing R Learning and Q Learning},
  booktitle = {Proceedings of the Eleventh International Conference on Machine Learning},
  year      = {1994},
  pages     = {164--172},
  publisher = {Morgan Kaufmann}
}
```

### Abstract

Most work in reinforcement learning (RL)

### Citations

3909 | Classification and Regression Trees - Breiman, Friedman, et al. - 1984
Citation Context: ...he optimal policy. In comparing two RL techniques, the fundamental problem is that in contrast to inductive learning techniques, where there exist accepted evaluation methods such as cross-validation [4], there are no corresponding schemes for RL. One unique characteristic of RL is that the learner has to select actions stochastically to trade off exploration of the environment vs. exploitation of the...

1321 | Learning from Delayed Rewards - Watkins - 1989

1227 | Learning to predict by the methods of temporal differences - Sutton - 1988
Citation Context: ...for comparing learning data from two RL techniques. 1 Introduction and Motivation Recently, several different reinforcement learning (RL) techniques have been developed, such as Q learning [20], TD(λ) [14], and R learning [12]. Although the former two techniques have been applied with some success to diverse domains, ranging from backgammon [18] to robotics [5, 9, 10], very little is known about the emp...

848 | Introduction to Reinforcement Learning - Sutton, Barto - 1998
Citation Context: ...2 Two Reinforcement Learning Methods We begin by briefly reviewing Q learning [20] and R learning [12]. In general, reinforcement learning is based on the framework of Markov decision problems (MDPs) [8, 16]. These consist of a set of states S, a set of actions A, and a reward function mapping states (and possibly actions) to real numbers. The goal of learning is to determine a policy $\Pi : S \rightarrow A$ that ma...
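The two tabular update rules being compared can be sketched side by side. This is a generic sketch, not the paper's implementation: the toy one-state MDP, the learning rate `ALPHA`, and the discount `GAMMA` are illustrative assumptions, and the R-learning rule follows the general average-adjusted form proposed by Schwartz [12].

```python
import random

GAMMA = 0.9      # discount factor (used by Q learning only)
ALPHA = 0.1      # learning rate (illustrative choice)

def q_update(Q, x, a, r, x2):
    """One tabular Q-learning step: move Q(x,a) toward r + gamma * max_a' Q(x',a')."""
    Q[x][a] += ALPHA * (r + GAMMA * max(Q[x2]) - Q[x][a])

def r_update(R, rho, x, a, r, x2, greedy):
    """One R-learning step: undiscounted, average-adjusted values; the
    average-reward estimate rho is updated only on greedy steps."""
    R[x][a] += ALPHA * (r - rho + max(R[x2]) - R[x][a])
    if greedy:
        rho += ALPHA * (r + max(R[x2]) - max(R[x]) - rho)
    return rho

# Toy 1-state, 2-action MDP: action 0 pays +1, action 1 pays 0; state never changes.
def step(x, a):
    return (1.0 if a == 0 else 0.0), 0

random.seed(0)
Q, R, rho, x = [[0.0, 0.0]], [[0.0, 0.0]], 0.0, 0
for _ in range(2000):
    a = random.randrange(2)                  # pure random exploration for the sketch
    r, x2 = step(x, a)
    greedy = (R[x][a] == max(R[x]))          # was the chosen action greedy w.r.t. R?
    q_update(Q, x, a, r, x2)
    rho = r_update(R, rho, x, a, r, x2, greedy)
    x = x2

print(Q[0][0] > Q[0][1], R[0][0] > R[0][1])  # both methods prefer the rewarding action
```

Both value tables end up ranking the rewarding action first; the difference lies in what they estimate (discounted return vs. average-adjusted value plus the gain `rho`).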

773 | Finite Markov chains - Kemeny, Snell - 1960
Citation Context: ...2 Two Reinforcement Learning Methods We begin by briefly reviewing Q learning [20] and R learning [12]. In general, reinforcement learning is based on the framework of Markov decision problems (MDPs) [8, 16]. These consist of a set of states S, a set of actions A, and a reward function mapping states (and possibly actions) to real numbers. The goal of learning is to determine a policy $\Pi : S \rightarrow A$ that ma...

526 | Learning to act using real-time dynamic programming - Barto, Bradtke, et al. - 1995
Citation Context: ...te x and can be either Q(x, a) or R(x, a), respectively. In this experiment, the temperature T was initialized at 30, and gradually decayed to 0.5 using a decaying scheme similar to that described in [1]. Instead of simply showing the average curve, it is instructive to also view the variation over different runs. This data is shown in Figure 2, where an "errorbar" is used to display the maximum and ...
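Boltzmann exploration with a decaying temperature, as described in this context, can be sketched as follows. The multiplicative decay schedule is an illustrative stand-in for the scheme of [1], and the action values are hypothetical; only the endpoints (T from 30 down to 0.5) come from the text above.

```python
import math
import random

def boltzmann_action(values, temperature):
    """Pick an action with probability proportional to exp(value / T)."""
    m = max(values)  # subtract max before exponentiating, for numerical stability
    weights = [math.exp((v - m) / temperature) for v in values]
    return random.choices(range(len(values)), weights=weights)[0]

# Decay T from 30 down to 0.5; the decay rate 0.999 is an illustrative choice.
T, T_MIN, DECAY = 30.0, 0.5, 0.999
random.seed(1)
values = [1.0, 0.0]          # hypothetical Q(x, a) or R(x, a) values for two actions
counts = [0, 0]
for _ in range(5000):
    counts[boltzmann_action(values, T)] += 1
    T = max(T_MIN, T * DECAY)

print(counts[0] > counts[1])   # selection drifts toward the higher-valued action as T cools
```

At T = 30 the two actions are chosen almost uniformly; once T reaches 0.5 the odds ratio is e^2 ≈ 7.4 in favor of the higher-valued action, which is the exploration-to-exploitation transition the annealing schedule is meant to produce.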

474 | Neuronlike adaptive elements that can solve difficult learning control problems - Barto, Sutton, et al. - 1983
Citation Context: ...experiment is to compare R with Q learning using a directed exploration method. Thrun [19] found that directed exploration methods were markedly superior to undirected methods using the AHC technique [2] on a simulated robot navigation task. In this section we test if this result also holds for Q learning and R learning in the robot box-pushing domain. We implemented the recency-based exploration met...

473 | Integrated architectures for learning, planning, and reacting based on approximating dynamic programming - Sutton - 1990
Citation Context: ...ning and Q learning over multiple runs using two types of undirected exploration strategies [19], semi-uniform exploration and Boltzmann exploration, and one recency-based directed exploration method [15]. Computing averaged performance improvement plots does not reveal the actual variation across runs. For this reason, we will show the variation across runs in this paper. Also, we deviate from the us...

373 | Dynamic Programming: Deterministic and Stochastic Models - Bertsekas - 1987
Citation Context: ...ions) to real numbers. The goal of learning is to determine a policy $\Pi : S \rightarrow A$ that maximizes a performance measure. In discounted reinforcement learning, the performance measure being optimized is [3]:

$$ J^{\Pi}(x_0) = \lim_{n \to \infty} E\left( \sum_{t=0}^{n-1} \gamma^t R^{\Pi}_{x_t, a} \right) $$

where $\gamma < 1$ is the discount factor, and $R^{\Pi}_{x_t, a}$ is the reward received for doing action $a$ in state $x_t$, where actions are chosen usi...
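The discounted criterion above can be checked numerically: for a constant reward stream the sum is a geometric series converging to R / (1 − γ). A minimal sketch, where the reward stream and γ = 0.9 are illustrative choices:

```python
GAMMA = 0.9

def discounted_return(rewards, gamma=GAMMA):
    """Truncated discounted return: sum of gamma^t * r_t over the given rewards."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Constant reward of 1 per step: the series converges to 1 / (1 - 0.9) = 10,
# and 200 terms already get within ~1e-8 of the limit.
print(round(discounted_return([1.0] * 200), 4))   # -> 10.0
```

The speed of that convergence is exactly why γ controls the effective planning horizon, the point at issue in the discounted-vs-average-reward comparison.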

368 | Practical issues in temporal difference learning - Tesauro - 1992
Citation Context: ...niques have been developed, such as Q learning [20], TD(λ) [14], and R learning [12]. Although the former two techniques have been applied with some success to diverse domains, ranging from backgammon [18] to robotics [5, 9, 10], very little is known about the empirical effectiveness of R learning. In his paper proposing R learning [12], Schwartz argued convincingly that optimizing average reward is bet...

334 | Automatic programming of behavior-based robots using reinforcement learning - Mahadevan, Connell - 1990
Citation Context: ...undirected exploration method [19]. Basically, in this method the robot executes random actions with a fixed probability. We obtained good results using this exploration strategy in our previous work [10]. In the experiment below, we set the probability of doing a random action at 10%. As Figure 6 shows, both R learning and Q learning perform much better with the semi-uniform exploration strategy. Thi...
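Semi-uniform exploration as described here, a fixed 10% chance of a uniformly random action and the greedy choice otherwise, can be sketched in a few lines; the action values below are hypothetical.

```python
import random

def semi_uniform_action(values, p_random=0.1):
    """Semi-uniform exploration: with probability p_random take a uniformly
    random action, otherwise act greedily on the current value estimates."""
    if random.random() < p_random:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])

random.seed(0)
values = [0.0, 2.0, 1.0]     # hypothetical utilities for three actions
picks = [semi_uniform_action(values) for _ in range(1000)]
print(picks.count(1) > 850)  # greedy action taken ~90% of the time, plus its random share
```

The greedy action is selected with probability 0.9 + 0.1/3 ≈ 0.93, so the learner still exploits heavily while never starving any action of samples.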

303 | Learning in Embedded Systems - Kaelbling - 1993
Citation Context: ...Even though this simulator is a fairly inaccurate model of reality, it is sufficiently complex to illustrate the issues raised in this paper. 4 Experimental Design Similar to most previous work in RL [7, 9, 13], we used the following experimental design format. The overall learning run is broken down into trials. For our task domain, we will set a trial as consisting of 100 steps. After 100 steps, the envir...

188 | Reinforcement Learning for Robots using neural networks - Lin - 1993
Citation Context: ...developed, such as Q learning [20], TD(λ) [14], and R learning [12]. Although the former two techniques have been applied with some success to diverse domains, ranging from backgammon [18] to robotics [5, 9, 10], very little is known about the empirical effectiveness of R learning. In his paper proposing R learning [12], Schwartz argued convincingly that optimizing average reward is better than optimizing di...

137 | The role of exploration in learning control - Thrun - 1992
Citation Context: ...h type of exploration strategy used. Accordingly, in this paper we compare the average performance of R learning and Q learning over multiple runs using two types of undirected exploration strategies [19], semi-uniform exploration and Boltzmann exploration, and one recency-based directed exploration method [15]. Computing averaged performance improvement plots does not reveal the actual variation acro...
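The recency-based directed method is only named in this context. One common form of such a rule adds an exploration bonus that grows with the time since each action was last executed; the square-root bonus and the gain `k` below are illustrative assumptions, not necessarily the exact rule of [15] or [19].

```python
import math

def recency_based_action(values, last_tried, t, k=1.0):
    """Directed exploration sketch: score each action by its value estimate
    plus a bonus growing with the time since it was last executed, then act
    greedily on the combined score."""
    def score(a):
        return values[a] + k * math.sqrt(t - last_tried[a])
    return max(range(len(values)), key=score)

# An action left untried long enough wins despite a lower value estimate:
values = [1.0, 0.0]
last_tried = [99, 0]                 # action 1 has not been tried since step 0
print(recency_based_action(values, last_tried, t=100))   # -> 1
```

Unlike undirected strategies, which inject noise uniformly, this kind of rule steers exploration toward the parts of the state-action space that are stale, which is the property Thrun [19] credits for its superiority.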

94 | A Reinforcement Learning Method for Maximizing Undiscounted Rewards - Schwartz - 1993
Citation Context: ...g data from two RL techniques. 1 Introduction and Motivation Recently, several different reinforcement learning (RL) techniques have been developed, such as Q learning [20], TD(λ) [14], and R learning [12]. Although the former two techniques have been applied with some success to diverse domains, ranging from backgammon [18] to robotics [5, 9, 10], very little is known about the empirical effectiveness ...

53 | Robot learning - Connell, Mahadevan - 1993
Citation Context: ...developed, such as Q learning [20], TD(λ) [14], and R learning [12]. Although the former two techniques have been applied with some success to diverse domains, ranging from backgammon [18] to robotics [5, 9, 10], very little is known about the empirical effectiveness of R learning. In his paper proposing R learning [12], Schwartz argued convincingly that optimizing average reward is better than optimizing di...

49 | Learning to solve Markovian decision processes - Singh - 1994
Citation Context: ...a box, and unwedge from a stalled situation. This decomposition is given to the robot and is not learned, although techniques have been developed for automatically producing sequential decompositions [13]. Three separate reward functions are used, one for each behavior. For finding, the robot is rewarded for approaching objects; that is, it is rewarded by +3 when it goes forward and turns on the front ...

46 | Applied statistics. A handbook of techniques - Sachs - 1984
Citation Context: ...w the variation across runs in this paper. Also, we deviate from the usual tendency of estimating averages using the mean [9]. The median is a better distribution-free estimator of population average [11], and we introduce a simple non-parametric significance test for comparing two median-based performance curves. We show there are significant differences between R learning and Q learning in a robot b...
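The robustness argument for the median can be seen with a toy set of per-run scores at a single trial, where one divergent run drags the mean far from the typical run but barely moves the median. The numbers are invented for illustration only.

```python
import statistics

# Hypothetical per-run rewards at one trial: four typical runs, one divergent run.
runs_at_trial = [10.1, 9.8, 10.3, 9.9, -50.0]

print(round(statistics.mean(runs_at_trial), 2))  # -> -1.98, pulled far below the typical run
print(statistics.median(runs_at_trial))          # -> 9.9, still reflects the typical run
```

With only a handful of RL runs per condition, a single run that fails to converge can dominate a mean-based performance curve, which is exactly the distortion a median-based curve avoids.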

10 | Computationally efficient adaptive control algorithms for Markov chains - Jalali, Ferguson - 1989
Citation Context: ...hort-term behavior. • Other Average Reward Methods: Another direction to pursue is to study other average reward learning algorithms from the engineering literature. For example, Jalali and Ferguson [6] describe a model-based average reward learning method. The aforementioned paper by Tadepalli and Ok contains some initial results comparing this method with R learning in the AGV domain. We plan to s...

7 | H-learning: A reinforcement learning method to optimize undiscounted average reward - Tadepalli, Ok - 1994
Citation Context: ...of this study. Briefly, some of the directions for future work are: • Different domains: We plan to test additional task domains to see if similar results arise in them. Recently, Tadepalli and Ok [17] have found that R learning performs better than Q learning in an automated guided vehicle (AGV) domain using semi-uniform exploration, if the task parameters are fixed such that long-term optimal beh...