Near-optimal Regret Bounds for Reinforcement Learning
Abstract

Cited by 98 (11 self)
For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). We present a reinforcement learning algorithm with total regret Õ(DS√AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of Ω(√DSAT) on the total regret of any learning algorithm.
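The diameter D above is defined through expected travel times under the best policy between each pair of states; for deterministic dynamics it reduces to the longest shortest path in the transition graph. A minimal sketch of that deterministic special case (function and variable names are illustrative, not from the paper):

```python
from collections import deque

def diameter_deterministic(transitions, num_states):
    """Diameter of a deterministic MDP: the longest graph distance
    between any ordered pair of states when actions are chosen freely.
    `transitions[s]` maps each action to its successor state.
    (With deterministic dynamics, the expected travel time of the
    best policy is exactly the shortest-path length.)"""
    def dists_from(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            s = queue.popleft()
            for s_next in transitions[s].values():
                if s_next not in dist:
                    dist[s_next] = dist[s] + 1
                    queue.append(s_next)
        return dist

    worst = 0
    for s in range(num_states):
        d = dists_from(s)
        assert len(d) == num_states, "MDP must be communicating"
        worst = max(worst, max(d.values()))
    return worst

# 3-state cycle: one action per state, moving to the next state.
cycle = {0: {"a": 1}, 1: {"a": 2}, 2: {"a": 0}}
```

For the 3-cycle the farthest pair is two steps apart, so its diameter is 2; for general stochastic MDPs the computation would involve expected hitting times instead of path lengths.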
PAC model-free reinforcement learning
 In: ICML '06: Proceedings of the 23rd International Conference on Machine Learning
, 2006
Abstract

Cited by 66 (13 self)
For a Markov Decision Process with finite state (size S) and action spaces (size A per state), we propose a new algorithm, Delayed Q-learning. We prove it is PAC, achieving near-optimal performance except for Õ(SA) timesteps using O(SA) space, improving on the Õ(S²A) bounds of the best previous algorithms. This result proves efficient reinforcement learning is possible without learning a model of the MDP from experience. Learning takes place from a single continuous thread of experience; no resets or parallel sampling are used. Beyond its smaller storage and experience requirements, Delayed Q-learning's per-experience computation cost is much less than that of previous PAC algorithms.
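The delayed-update idea can be caricatured in a few lines: keep optimistic Q-values, and only commit an update once m samples of a state-action pair have accumulated, and only if their average shows a genuine drop. This is a simplified sketch of that mechanism, not the exact rule or constants from the paper's PAC analysis; all names and parameter settings here are illustrative:

```python
from collections import defaultdict

class DelayedQ:
    """Simplified sketch of a Delayed Q-learning style update.
    Q-values start optimistic and are lowered only after m samples
    of a (state, action) pair are in, and only if the batch average
    shows a drop of more than 2 * eps.  m, eps, and q_max are
    illustrative knobs, not the theoretical settings."""

    def __init__(self, gamma=0.95, m=20, eps=0.1, q_max=10.0):
        self.gamma, self.m, self.eps = gamma, m, eps
        self.Q = defaultdict(lambda: q_max)   # optimistic initialization
        self.acc = defaultdict(float)         # accumulated update targets
        self.count = defaultdict(int)         # samples seen per (s, a)

    def observe(self, s, a, r, s_next, actions):
        # Accumulate the one-step target instead of updating immediately.
        target = r + self.gamma * max(self.Q[(s_next, b)] for b in actions)
        self.acc[(s, a)] += target
        self.count[(s, a)] += 1
        if self.count[(s, a)] == self.m:      # delayed update point
            avg = self.acc[(s, a)] / self.m
            if self.Q[(s, a)] - avg >= 2 * self.eps:
                self.Q[(s, a)] = avg + self.eps   # attempted update succeeds
            self.acc[(s, a)] = 0.0
            self.count[(s, a)] = 0
```

Because each step only adds to an accumulator, the per-experience cost is constant, which is the point the abstract makes about computation.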
Reinforcement Learning in Finite MDPs: PAC Analysis
Abstract

Cited by 52 (6 self)
We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These "PAC-MDP" algorithms include the well-known E³ and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state of the art by presenting bounds for the problem in a unified theoretical framework. We also present a more refined analysis that yields insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX. Finally, we conclude with open problems.
An Analysis of Model-Based Interval Estimation for Markov Decision Processes
, 2007
Abstract

Cited by 46 (5 self)
Several algorithms for learning near-optimal policies in Markov Decision Processes have been analyzed and proven efficient. Empirical results have suggested that Model-based Interval Estimation (MBIE) learns efficiently in practice, effectively balancing exploration and exploitation. This paper presents a theoretical analysis of MBIE and a new variation called MBIE-EB, proving their efficiency even under worst-case conditions. The paper also introduces a new performance metric, average loss, and relates it to its less "online" cousins from the literature.
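The hallmark of the MBIE-EB variant is an exploration bonus added to the empirical Bellman backup that decays with the visit count. A hedged one-function sketch of such a backup (the β parameter, names, and interface are illustrative, not the paper's exact formulation):

```python
import math

def mbie_eb_q(r_hat, n, next_values, p_hat, gamma=0.95, beta=1.0):
    """One MBIE-EB style backup for a single (state, action) pair:
    empirical model estimate plus an exploration bonus that shrinks
    as the visit count n grows.  `p_hat` is the empirical next-state
    distribution, `next_values` the current value estimates of those
    states.  beta is a confidence parameter whose theoretical value
    comes from the PAC analysis; here it is a free knob."""
    backup = r_hat + gamma * sum(p * v for p, v in zip(p_hat, next_values))
    bonus = beta / math.sqrt(n)    # exploration bonus, order 1/sqrt(n)
    return backup + bonus
```

Rarely visited pairs thus look artificially attractive, which drives the exploration-exploitation balance the abstract refers to.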
Logarithmic online regret bounds for undiscounted reinforcement learning
 In B. Schölkopf, J. Platt, and T. Hofmann (Eds.), Advances in Neural Information Processing Systems 19
, 2007
Abstract

Cited by 42 (0 self)
We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm's online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy.
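The upper-confidence-bound principle amounts to acting on optimistic estimates whose confidence radius shrinks as observations accumulate. A minimal Hoeffding-style sketch for a single reward estimate, in the spirit of the intervals UCRL places on its empirical quantities (the exact radius and constants in the paper differ, and the names here are mine):

```python
import math

def optimistic_reward(r_hat, n, t, delta=0.05):
    """Upper confidence bound on a mean reward in [0, 1] after n
    observations at time t, with failure probability delta.
    A Hoeffding-style radius: optimism decays like 1/sqrt(n)."""
    radius = math.sqrt(math.log(2 * t / delta) / (2 * max(n, 1)))
    return r_hat + radius
```

Acting greedily with respect to such optimistic estimates ensures that under-sampled actions keep being tried until their confidence intervals tighten, which is what drives the logarithmic regret argument.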
REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs
 In Proceedings of the 25th Annual Conference on Uncertainty in Artificial Intelligence
, 2009
Abstract

Cited by 41 (1 self)
We provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP). The algorithm proceeds in episodes where, in each episode, it picks a policy using regularization based on the span of the optimal bias vector. For an MDP with S states and A actions whose optimal bias vector has span bounded by H, we show a regret bound of Õ(HS√AT). We also relate the span to various diameter-like quantities associated with the MDP, demonstrating how our results improve on previous regret bounds.
The Adaptive k-Meteorologists Problem and Its Application to Structure Learning and Feature Selection in Reinforcement Learning
Abstract

Cited by 36 (6 self)
The purpose of this paper is threefold. First, we formalize and study a problem of learning probabilistic concepts in the recently proposed KWIK framework. We give details of an algorithm, known as the Adaptive k-Meteorologists Algorithm, analyze its sample-complexity upper bound, and give a matching lower bound. Second, this algorithm is used to create a new reinforcement-learning algorithm for factored-state problems that enjoys a significant improvement over the previous state-of-the-art algorithm. Finally, we apply the Adaptive k-Meteorologists Algorithm to remove a limiting assumption in an existing reinforcement-learning algorithm. The effectiveness of our approaches is demonstrated empirically in a couple of benchmark domains as well as a robotics navigation problem.
Incremental model-based learners with formal learning-time guarantees
 In Proc. 21st UAI Conference
, 2006
Abstract

Cited by 34 (17 self)
Model-based learning algorithms have been shown to use experience efficiently when learning to solve Markov Decision Processes (MDPs) with finite state and action spaces. However, their high computational cost due to repeatedly solving an internal model inhibits their use in large-scale problems. We propose a method based on real-time dynamic programming (RTDP) to speed up two model-based algorithms, R-MAX and MBIE (model-based interval estimation), resulting in computationally much faster algorithms with little loss compared to existing bounds. Specifically, our two new learning algorithms, RTDP-RMAX and RTDP-IE, have considerably smaller computational demands than R-MAX and MBIE. We develop a general theoretical framework that allows us to prove that both are efficient learners in a PAC (probably approximately correct) sense. We also present an experimental evaluation of these new algorithms that helps quantify the tradeoff between computational and experience demands.
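The RTDP speedup replaces solving the whole internal model with backups of only the currently visited state. A rough sketch of such a single-state backup, with an assumed model interface (a reward and an outcome distribution per state-action pair; the function name and interface are mine, not the paper's):

```python
def rtdp_backup(Q, model, s, actions, gamma=0.95):
    """Real-time dynamic programming style backup: instead of solving
    the full model (as R-MAX or MBIE would), update only state s.
    `model[(s, a)]` is assumed to yield (reward, [(prob, s_next), ...]),
    and Q is a dict over (state, action) pairs."""
    for a in actions:
        r, outcomes = model[(s, a)]
        Q[(s, a)] = r + gamma * sum(
            p * max(Q[(s2, b)] for b in actions) for p, s2 in outcomes
        )
    return max(Q[(s, a)] for a in actions)
```

Each environment step then costs one backup over the current state's actions rather than a full planning pass, which is the computation-versus-experience tradeoff the abstract evaluates.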
Model-based reinforcement learning with nearly tight exploration complexity bounds
Abstract

Cited by 29 (1 self)
One might believe that model-based reinforcement learning algorithms can propagate the experience they obtain more quickly and are able to direct exploration better. As a consequence, fewer exploratory actions should be enough to learn a good policy. Strangely enough, current theoretical results for model-based algorithms do not support this claim: In a finite Markov decision process with N states, the best bounds on the number of exploratory steps necessary are of order O(N² log N), in contrast to the O(N log N) bound available for the model-free Delayed Q-learning algorithm. In this paper we show that Mormax, a modified version of the R-max algorithm, needs to make at most O(N log N) exploratory steps. This matches the lower bound up to logarithmic factors, as well as the upper bound of the state-of-the-art model-free algorithm, while our new bound improves the dependence on other problem parameters. In the reinforcement learning (RL) framework, an agent interacts with an unknown environment and tries to maximize its long-term profit. A standard way to measure the efficiency of the agent is sample complexity or exploration complexity. Roughly, this quantity tells how many non-optimal (exploratory) steps the agent makes at most. The best understood and most studied case is when the environment is a finite Markov decision process (MDP) with the expected total discounted reward criterion. Since the work of Kearns & Singh (1998), many algorithms have been published with bounds on their sam ...