Results 1 - 10 of 79
A Bayesian Sampling Approach to Exploration in Reinforcement Learning
Cited by 40 (8 self)
We present a modular approach to reinforcement learning that uses a Bayesian representation of the uncertainty over models. The approach, BOSS (Best of Sampled Set), drives exploration by sampling multiple models from the posterior and selecting actions optimistically. It extends previous work by providing a rule for deciding when to resample and how to combine the models. We show that our algorithm achieves near-optimal reward with high probability with a sample complexity that is low relative to the speed at which the posterior distribution converges during learning. We demonstrate that BOSS performs quite favorably compared to state-of-the-art reinforcement-learning approaches and illustrate its flexibility by pairing it with a nonparametric model that generalizes across states.
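The core idea in this abstract (sample several models from the posterior, then act optimistically across them) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a small tabular MDP with known rewards and independent Dirichlet posteriors over transitions, and it omits the resampling rule the abstract mentions. The names `sample_dirichlet`, `sample_model`, and `optimistic_action` and all parameters are hypothetical.

```python
import random

def sample_dirichlet(counts, rng):
    # Draw one probability vector from a Dirichlet with the given (positive) pseudo-counts.
    draws = [rng.gammavariate(c, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]

def sample_model(counts, rng):
    # counts[s][a] holds Dirichlet pseudo-counts over next states (an assumed posterior form).
    return [[sample_dirichlet(row, rng) for row in counts[s]] for s in range(len(counts))]

def optimistic_action(counts, rewards, state, k=5, gamma=0.9, sweeps=200, seed=0):
    # Sample k transition models and plan as if the agent could pick, at every step,
    # both an action and whichever sampled model is most favorable for it.
    rng = random.Random(seed)
    models = [sample_model(counts, rng) for _ in range(k)]
    n_states, n_actions = len(counts), len(counts[0])
    values = [0.0] * n_states

    def q(s, a, m):
        # One-step backup of action a in state s under sampled model m.
        expected_next = sum(p * v for p, v in zip(models[m][s][a], values))
        return rewards[s][a] + gamma * expected_next

    for _ in range(sweeps):
        # Value iteration over the merged (action, model) choices.
        values = [max(q(s, a, m) for a in range(n_actions) for m in range(k))
                  for s in range(n_states)]

    return max(range(n_actions), key=lambda a: max(q(state, a, m) for m in range(k)))
```

Taking the maximum over sampled models is what makes the choice optimistic: an action looks attractive if any plausible model says it leads somewhere valuable.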
Near-Bayesian exploration in polynomial time (full version). Available at http://ai.stanford.edu/~kolter
, 2009
Cited by 37 (0 self)
We consider the exploration/exploitation problem in reinforcement learning (RL). The Bayesian approach to model-based RL offers an elegant solution to this problem, by considering a distribution over possible models and acting to maximize expected reward; unfortunately, the Bayesian solution is intractable for all but very restricted cases. In this paper we present a simple algorithm, and prove that with high probability it is able to perform ε-close to the true (intractable) optimal Bayesian policy after some small (polynomial in quantities describing the system) number of time steps. The algorithm and analysis are motivated by the so-called PAC-MDP approach, and extend such results into the setting of Bayesian RL. In this setting, we show that we can achieve lower sample complexity bounds than existing algorithms, while using an exploration strategy that is much greedier than the (extremely cautious) exploration of PAC-MDP algorithms.
Bayesian sparse sampling for online reward optimization
 In ICML ’05: Proceedings of the 22nd international conference on Machine learning
, 2005
Cited by 37 (5 self)
We present an efficient “sparse sampling” technique for approximating Bayes optimal decision making in reinforcement learning, addressing the well-known exploration versus exploitation tradeoff. Our approach combines sparse sampling with Bayesian exploration to achieve improved decision making while controlling computational cost. The idea is to grow a sparse lookahead tree, intelligently, by exploiting information in a Bayesian posterior, rather than enumerate action branches (standard sparse sampling) or compensate myopically (value of perfect information). The outcome is a flexible, practical technique for improving action selection in simple reinforcement learning scenarios.
Multi-task reinforcement learning: A hierarchical Bayesian approach
 In: ICML ’07: Proceedings of the 24th international conference on Machine learning
, 2007
Cited by 36 (3 self)
We consider the problem of multi-task reinforcement learning, where the agent needs to solve a sequence of Markov Decision Processes (MDPs) chosen randomly from a fixed but unknown distribution. We model the distribution over MDPs using a hierarchical Bayesian infinite mixture model. For each novel MDP, we use the previously learned distribution as an informed prior for model-based Bayesian reinforcement learning. The hierarchical Bayesian framework provides a strong prior that allows us to rapidly infer the characteristics of new environments based on previous environments, while the use of a nonparametric model allows us to quickly adapt to environments we have not encountered before. In addition, the use of infinite mixtures allows for the model to automatically learn the number of underlying MDP components. We evaluate our approach and show that it leads to significant speedups in convergence to an optimal policy after observing only a small number of tasks.
Multi-armed bandit algorithms and empirical evaluation
 In European Conference on Machine Learning
, 2005
Cited by 31 (0 self)
The multi-armed bandit problem for a gambler is to decide which arm of a K-slot machine to pull to maximize his total reward in a series of trials. Many real-world learning and optimization problems can be modeled in this way. Several strategies or algorithms have been proposed as a solution to this problem in the last two decades, but, to our knowledge, there has been no common evaluation of these algorithms. This paper provides a preliminary empirical evaluation of several multi-armed bandit algorithms. It also describes and analyzes a new algorithm, Poker (Price Of Knowledge and Estimated Reward) whose performance compares favorably to that of other existing algorithms in several experiments. One remarkable outcome of our experiments is that the most naive approach, the ε-greedy strategy, proves to be often hard to beat.
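For readers unfamiliar with the ε-greedy baseline named in this abstract, a minimal sketch of the standard textbook strategy follows. It is not taken from the paper: the function name `epsilon_greedy`, the `pull` callback, and the default `epsilon=0.1` are assumptions chosen for illustration.

```python
import random

def epsilon_greedy(pull, n_arms, n_rounds, epsilon=0.1, seed=0):
    # pull(arm) is a caller-supplied function returning the reward of one pull.
    rng = random.Random(seed)
    counts = [0] * n_arms      # number of pulls per arm
    means = [0.0] * n_arms     # running mean reward per arm
    total = 0.0
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest estimated mean.
            arm = max(range(n_arms), key=lambda i: means[i])
        reward = pull(arm)
        counts[arm] += 1
        # Incremental update of the running mean.
        means[arm] += (reward - means[arm]) / counts[arm]
        total += reward
    return means, counts, total
```

A call such as `epsilon_greedy(lambda arm: [0.1, 0.9][arm], n_arms=2, n_rounds=1000)` runs the strategy on a toy two-armed machine with fixed payouts; once the better arm has been explored, the greedy step keeps selecting it.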
Model-based Bayesian reinforcement learning in partially observable domains. ISAIM
, 2008
Cited by 27 (0 self)
Bayesian reinforcement learning in partially observable domains is notoriously difficult, in part due to the unknown form of the beliefs and the optimal value function. We show that beliefs represented by mixtures of products of Dirichlet distributions are closed under belief updates for factored domains. Belief monitoring algorithms that use this mixture representation are proposed. We also show that the optimal value function is a linear combination of products of Dirichlets for factored domains. Finally, we extend BEETLE, which is a point-based value iteration algorithm for Bayesian RL in fully observable domains, to partially observable domains.
State abstraction discovery from irrelevant state variables
 In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence
, 2005
Cited by 25 (3 self)
Abstraction is a powerful form of domain knowledge that allows reinforcement-learning agents to cope with complex environments, but in most cases a human must supply this knowledge. In the absence of such prior knowledge or a given model, we propose an algorithm for the automatic discovery of state abstraction from policies learned in one domain for use in other domains that have similar structure. To this end, we introduce a novel condition for state abstraction in terms of the relevance of state features to optimal behavior, and we exhibit statistical methods that detect this condition robustly. Finally, we show how to apply temporal abstraction to benefit safely from even partial state abstraction in the presence of generalization error.
Reinforcement learning with limited reinforcement: Using Bayes risk for active learning in POMDPs. ISAIM (online proceedings)
, 2008
Cited by 22 (7 self)
Partially Observable Markov Decision Processes (POMDPs) have succeeded in planning domains that require balancing actions that increase an agent’s knowledge and actions that increase an agent’s reward. Unfortunately, most POMDPs are defined with a large number of parameters which are difficult to specify only from domain knowledge. In this paper, we present an approximation approach that allows us to treat the POMDP model parameters as additional hidden state in a “model-uncertainty” POMDP. Coupled with model-directed queries, our planner actively learns good policies. We demonstrate our approach on several POMDP problems.
Bandits for taxonomies: A model-based approach
 In Proc. of the SIAM International Conference on Data Mining
, 2007
Cited by 18 (5 self)
We consider a novel problem of learning an optimal matching, in an online fashion, between two feature spaces that are organized as taxonomies. We formulate this as a multi-armed bandit problem where the arms of the bandit are dependent due to the structure induced by the taxonomies. We then propose a multi-stage hierarchical allocation scheme that improves the explore/exploit properties of the classical multi-armed bandit policies in this scenario. In particular, our scheme uses the taxonomy structure and performs shrinkage estimation in a Bayesian framework to exploit dependencies among the arms, thereby enhancing exploration without losing efficiency on short-term exploitation. We prove that our scheme asymptotically converges to the optimal matching. We conduct extensive experiments on real data to illustrate the efficacy of our scheme in practice.