Knows What It Knows: A Framework For Self-Aware Learning
Cited by 69 (20 self)
We introduce a learning framework that combines elements of the well-known PAC and mistake-bound models. The KWIK (knows what it knows) framework was designed particularly for its utility in learning settings where active exploration can impact the training examples the learner is exposed to, as is true in reinforcement-learning and active-learning problems. We catalog several KWIK-learnable classes and open problems.
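To make the "knows what it knows" idea concrete, here is a minimal sketch of the interaction protocol as it is commonly described: on each input the learner either commits to a prediction or admits "I don't know", and only in the latter case does it see the true label. The example uses a memorization learner for a deterministic target over a finite input set. The class name, the `run_kwik` helper, and the use of `None` as the "I don't know" symbol are assumptions made for illustration and are not taken from the abstract.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Optional


class MemorizationKWIKLearner:
    """Hypothetical KWIK-style learner for a deterministic target function.

    It answers only for inputs it has already seen labelled, so it says
    "I don't know" at most once per distinct input.
    """

    def __init__(self) -> None:
        self._memory: Dict[Hashable, object] = {}
        self.dont_know_count = 0

    def predict(self, x: Hashable) -> Optional[object]:
        # None stands for "I don't know" (assumes labels are never None).
        if x in self._memory:
            return self._memory[x]
        self.dont_know_count += 1
        return None

    def observe(self, x: Hashable, y: object) -> None:
        # Store the label revealed after an "I don't know" answer.
        self._memory[x] = y


def run_kwik(
    learner: MemorizationKWIKLearner,
    inputs: Iterable[Hashable],
    target: Callable[[Hashable], object],
) -> List[Optional[object]]:
    """Run the protocol: the label is revealed only when the learner abstains."""
    outputs: List[Optional[object]] = []
    for x in inputs:
        guess = learner.predict(x)
        if guess is None:
            learner.observe(x, target(x))
        outputs.append(guess)
    return outputs


if __name__ == "__main__":
    learner = MemorizationKWIKLearner()
    print(run_kwik(learner, [1, 2, 1, 3, 2], lambda x: x * x))
    print("abstentions:", learner.dont_know_count)
```

In this sketch the number of abstentions is bounded by the number of distinct inputs, which is the kind of quantity a KWIK-style analysis bounds.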
Reinforcement Learning in Finite MDPs: PAC Analysis
Cited by 45 (5 self)
We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These “PAC-MDP” algorithms include the well-known E³ and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state-of-the-art by presenting bounds for the problem in a unified theoretical framework. We also present a more refined analysis that yields insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX. Finally, we conclude with open problems.
Integrating Sample-based Planning and Model-based Reinforcement Learning
Cited by 37 (5 self)
Recent advancements in model-based reinforcement learning have shown that the dynamics of many structured domains (e.g. DBNs) can be learned with tractable sample complexity, despite their exponentially large state spaces. Unfortunately, these algorithms all require access to a planner that computes a near-optimal policy, and while many traditional MDP algorithms make this guarantee, their computation time grows with the number of states. We show how to replace these overmatched planners with a class of sample-based planners, whose computation time is independent of the number of states, without sacrificing the sample-efficiency guarantees of the overall learning algorithms. To do so, we define sufficient criteria for a sample-based planner to be used in such a learning system and analyze two popular sample-based approaches from the literature. We also introduce our own sample-based planner, which combines the strategies from these algorithms and still meets the criteria for integration into our learning system. In doing so, we define the first complete RL solution for compactly represented (exponentially sized) state spaces with efficiently learnable dynamics that is both sample efficient and whose computation time does not grow rapidly with the number of states.
The Adaptive k-Meteorologists Problem and Its Application to Structure Learning and Feature Selection in Reinforcement Learning
Cited by 35 (6 self)
The purpose of this paper is threefold. First, we formalize and study a problem of learning probabilistic concepts in the recently proposed KWIK framework. We give details of an algorithm, known as the Adaptive k-Meteorologists Algorithm, analyze its sample-complexity upper bound, and give a matching lower bound. Second, this algorithm is used to create a new reinforcement-learning algorithm for factored-state problems that enjoys significant improvement over the previous state-of-the-art algorithm. Finally, we apply the Adaptive k-Meteorologists Algorithm to remove a limiting assumption in an existing reinforcement-learning algorithm. The effectiveness of our approaches is demonstrated empirically in a couple of benchmark domains as well as a robotics navigation problem.
Model-based reinforcement learning with nearly tight exploration complexity bounds
Cited by 26 (1 self)
One might believe that model-based algorithms of reinforcement learning can propagate the obtained experience more quickly, and are able to direct exploration better. As a consequence, fewer exploratory actions should be enough to learn a good policy. Strangely enough, current theoretical results for model-based algorithms do not support this claim: In a finite Markov decision process with N states, the best bounds on the number of exploratory steps necessary are of order O(N² log N), in contrast to the O(N log N) bound available for the model-free, delayed Q-learning algorithm. In this paper we show that Mormax, a modified version of the Rmax algorithm, needs to make at most O(N log N) exploratory steps. This matches the lower bound up to logarithmic factors, as well as the upper bound of the state-of-the-art model-free algorithm, while our new bound improves the dependence on other problem parameters. In the reinforcement learning (RL) framework, an agent interacts with an unknown environment and tries to maximize its long-term profit. A standard way to measure the efficiency of the agent is sample complexity or exploration complexity. Roughly, this quantity tells how many non-optimal (exploratory) steps the agent makes at most. The best understood and most studied case is when the environment is a finite Markov decision process (MDP) with the expected total discounted reward criterion. Since the work of Kearns & Singh (1998), many algorithms have been published with bounds on their sam
Provably Efficient Learning with Typed Parametric Models
Cited by 15 (3 self)
To quickly achieve good performance, reinforcement-learning algorithms for acting in large continuous-valued domains must use a representation that is sufficiently powerful to capture important domain characteristics, yet simultaneously allows generalization, or sharing, among experiences. Our algorithm balances this tradeoff by using a stochastic, switching, parametric dynamics representation. We argue that this model characterizes a number of significant, real-world domains, such as robot navigation across varying terrain. We prove that this representational assumption allows our algorithm to be probably approximately correct with a sample complexity that scales polynomially with all problem-specific quantities including the state-space dimension. We also explicitly incorporate the error introduced by approximate planning in our sample complexity bounds, in contrast to prior Probably Approximately Correct (PAC) Markov Decision Processes (MDP) approaches, which typically assume the estimated MDP can be solved exactly. Our experimental results on constructing plans for driving to work using real car trajectory data, as well as a small robot experiment on navigating varying terrain, demonstrate that our dynamics representation captures real-world dynamics well enough to produce good performance.
Learning is planning: near Bayes-optimal reinforcement learning via Monte-Carlo tree search
Cited by 11 (1 self)
Bayes-optimal behavior, while well-defined, is often difficult to achieve. Recent advances in the use of Monte-Carlo tree search (MCTS) have shown that it is possible to act near-optimally in Markov Decision Processes (MDPs) with very large or infinite state spaces. Bayes-optimal behavior in an unknown MDP is equivalent to optimal behavior in the known belief-space MDP, although the size of this belief-space MDP grows exponentially with the amount of history retained, and is potentially infinite. We show how an agent can use one particular MCTS algorithm, Forward Search Sparse Sampling (FSSS), in an efficient way to act nearly Bayes-optimally for all but a polynomial number of steps, assuming that FSSS can be used to act efficiently in any possible underlying MDP.
Exploration in Relational Domains for Model-based Reinforcement Learning
Cited by 10 (1 self)
A fundamental problem in reinforcement learning is balancing exploration and exploitation. We address this problem in the context of model-based reinforcement learning in large stochastic relational domains by developing relational extensions of the concepts of the E³ and R-MAX algorithms. Efficient exploration in exponentially large state spaces needs to exploit the generalization of the learned model: what in a propositional setting would be considered a novel situation and worth exploration may in the relational setting be a well-known context in which exploitation is promising. To address this we introduce relational count functions which generalize the classical notion of state and action visitation counts. We provide guarantees on the exploration efficiency of our framework using count functions under the assumption that we had a relational KWIK learner and a near-optimal planner. We propose a concrete exploration algorithm which integrates a practically efficient probabilistic rule learner and a relational planner (for which there are no guarantees, however) and employs the contexts of learned relational rules as features to model the novelty of states and actions. Our results in noisy 3D simulated robot manipulation problems and in domains of the international planning competition demonstrate that our approach is more effective than existing propositional and factored exploration techniques.
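For orientation, the classical (propositional) notion of visitation counts that the abstract says is being generalized can be sketched as below. This is only an illustration of the baseline idea, not the paper's relational count functions: the class name, the `known_threshold` parameter, and the rule that a state-action pair counts as "known" after a fixed number of visits (in the style of count-based algorithms such as R-MAX) are assumptions made for the example.

```python
from collections import Counter
from typing import Hashable


class VisitationCounts:
    """Hypothetical tabular visitation counter for (state, action) pairs."""

    def __init__(self, known_threshold: int) -> None:
        if known_threshold < 1:
            raise ValueError("known_threshold must be at least 1")
        self.known_threshold = known_threshold
        self._counts: Counter = Counter()

    def record(self, state: Hashable, action: Hashable) -> None:
        # Count one more visit to this exact state-action pair.
        self._counts[(state, action)] += 1

    def count(self, state: Hashable, action: Hashable) -> int:
        # Counter returns 0 for pairs that were never visited.
        return self._counts[(state, action)]

    def is_known(self, state: Hashable, action: Hashable) -> bool:
        # A pair is treated as known once visited often enough; unknown
        # pairs are the ones an exploring agent would be drawn toward.
        return self.count(state, action) >= self.known_threshold


if __name__ == "__main__":
    counts = VisitationCounts(known_threshold=2)
    counts.record("s0", "left")
    counts.record("s0", "left")
    print(counts.is_known("s0", "left"), counts.is_known("s1", "left"))
```

Because every distinct state gets its own counter here, nothing learned about `"s0"` transfers to `"s1"`, which is the limitation in large state spaces that the abstract's relational generalization is aimed at.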
Efficient learning of relational models for sequential decision making
2010
Cited by 8 (1 self)
The exploration-exploitation tradeoff is crucial to reinforcement-learning (RL) agents, and a significant number of sample complexity results have been derived for agents in propositional domains. These results guarantee, with high probability, near-optimal behavior in all but a polynomial number of timesteps in the agent’s lifetime. In this work, we prove similar results for certain relational representations, primarily a class we call “relational action schemas”. These generalized models allow us to specify state transitions in a compact form, for instance describing the effect of picking up a generic block instead of picking up 10 different specific blocks. We present theoretical results on crucial subproblems in action-schema learning using the KWIK framework, which allows us to characterize the sample efficiency of an agent learning these models in a reinforcement-learning setting. These results are extended in an apprenticeship learning paradigm where an agent has access not only to its environment, but also to a teacher that can demonstrate traces of state/action/state sequences. We show that the class of action schemas that are efficiently learnable in this paradigm is strictly larger than those learnable in the online setting. We link
PAC Optimal Exploration in Continuous Space Markov Decision Processes
Cited by 6 (1 self)
Current exploration algorithms can be classified into two broad categories: heuristic and PAC optimal. While numerous researchers have used heuristic approaches such as ε-greedy exploration successfully, such approaches lack formal, finite sample guarantees and may need a significant amount of fine-tuning to produce good results. PAC optimal exploration algorithms, on the other hand, offer strong theoretical guarantees but are inapplicable in domains of realistic size. The goal of this paper is to bridge the gap between theory and practice, by introducing C-PACE, an algorithm which offers strong theoretical guarantees and can be applied to interesting, continuous space problems.