Results 11–20 of 204
Bayes-Adaptive POMDPs
2007
Abstract

Cited by 35 (8 self)
Bayesian Reinforcement Learning has generated substantial interest recently, as it provides an elegant solution to the exploration-exploitation tradeoff in reinforcement learning. However, most investigations of Bayesian reinforcement learning to date focus on standard Markov Decision Processes (MDPs). Our goal is to extend these ideas to the more general Partially Observable MDP (POMDP) framework, where the state is a hidden variable. To address this problem, we introduce a new mathematical model, the Bayes-Adaptive POMDP. This new model allows us to (1) improve knowledge of the POMDP domain through interaction with the environment, and (2) plan optimal sequences of actions which can trade off between improving the model, identifying the state, and gathering reward. We show how the model can be finitely approximated while preserving the value function. We describe approximations for belief tracking and planning in this model. Empirical results on two domains show that the model estimate and the agent's return improve over time, as the agent learns better model estimates.
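The learning component of a Bayes-adaptive model can be sketched minimally with Dirichlet pseudo-counts over transitions, whose posterior mean sharpens with experience. The class and method names below are illustrative, not from the paper:

```python
from collections import defaultdict

class DirichletTransitionModel:
    """Dirichlet-count model of P(s' | s, a), the learning component of a
    Bayes-adaptive (PO)MDP sketch. Counts start at a uniform prior and grow
    with experience, so the expected model improves over time.
    All names here are illustrative, not the paper's notation."""

    def __init__(self, states, prior=1.0):
        self.states = list(states)
        self.counts = defaultdict(lambda: defaultdict(lambda: prior))

    def update(self, s, a, s_next):
        # One observed (or, in the POMDP case, one hypothesised) transition.
        self.counts[(s, a)][s_next] += 1.0

    def expected_prob(self, s, a, s_next):
        # Posterior mean of the Dirichlet: normalised pseudo-counts.
        row = self.counts[(s, a)]
        total = sum(row[t] for t in self.states)
        return row[s_next] / total

model = DirichletTransitionModel(states=["left", "right"])
model.update("left", "go", "right")
model.update("left", "go", "right")
# Prior 1.0 per state plus two observed moves to "right": (1+2)/(2+2) = 0.75
print(model.expected_prob("left", "go", "right"))  # → 0.75
```

In the full Bayes-Adaptive POMDP the counts themselves are uncertain and live inside the belief; the sketch shows only the count-update mechanics.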
A survey of point-based POMDP solvers
 AUTON AGENT MULTIAGENT SYST
2012
Abstract

Cited by 33 (5 self)
The past decade has seen a significant breakthrough in research on solving partially observable Markov decision processes (POMDPs). Where past solvers could not scale beyond perhaps a dozen states, modern solvers can handle complex domains with many thousands of states. This breakthrough was mainly due to the idea of restricting value function computations to a finite subset of the belief space, permitting only local value updates for this subset. This approach, known as point-based value iteration, avoids the exponential growth of the value function, and is thus applicable to domains with longer horizons, even with relatively large state spaces. Many extensions to this basic idea have been suggested, focusing on various aspects of the algorithm—mainly the selection of the belief space subset, and the order of value function updates. In this survey, we walk the reader through the fundamentals of point-based value iteration, explaining the main concepts and ideas. Then, we survey the major extensions to the basic algorithm, discussing their merits. Finally, we include an extensive empirical analysis using well-known benchmarks, in order to shed light on the strengths and limitations of the various approaches.
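The local update at the heart of these methods can be sketched as a single point-based backup at one belief: it builds a new alpha-vector for that belief only, rather than for the whole belief space. The array shapes below are an illustrative convention, not taken from the survey:

```python
import numpy as np

def point_based_backup(b, Gamma, T, O, R, gamma):
    """One point-based Bellman backup at belief b. Returns the new
    alpha-vector (and greedy action) for this belief only -- the local
    update that lets point-based solvers avoid the exponential growth of
    full value iteration. Shapes (illustrative): T[a,s,s'], O[a,s',o],
    R[a,s]; Gamma is the current list of alpha-vectors."""
    nA, nS, _ = T.shape
    nO = O.shape[2]
    best_val, best_alpha, best_a = -np.inf, None, None
    for a in range(nA):
        alpha_a = R[a].astype(float).copy()
        for o in range(nO):
            # g(s) = sum_{s'} T(s,a,s') O(s',a,o) alpha(s'), one g per alpha
            gs = [T[a] @ (O[a, :, o] * alpha) for alpha in Gamma]
            # Keep the candidate that scores best at this belief point.
            alpha_a += gamma * max(gs, key=lambda g: b @ g)
        if b @ alpha_a > best_val:
            best_val, best_alpha, best_a = b @ alpha_a, alpha_a, a
    return best_alpha, best_a

# Tiny 2-state, 2-action, 1-observation example (illustrative numbers):
T = np.array([np.eye(2), np.eye(2)])      # deterministic self-loops
O = np.ones((2, 2, 1))                    # a single, uninformative observation
R = np.array([[1.0, 0.0], [0.0, 1.0]])    # action a pays off in state a
Gamma = [np.zeros(2)]                     # start from the zero value function
alpha, a = point_based_backup(np.array([0.6, 0.4]), Gamma, T, O, R, 0.9)
print(a, alpha)  # → 0 [1. 0.]
```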
Solving POMDPs: RTDP-Bel vs. Point-based Algorithms
Abstract

Cited by 32 (7 self)
Point-based algorithms and RTDP-Bel are approximate methods for solving POMDPs that replace the full updates of parallel value iteration by faster and more effective updates at selected beliefs. An important difference between the two methods is that the former adopt Sondik's representation of the value function, while the latter uses a tabular representation and a discretization function. The algorithms, however, have not been compared up to now, because they target different POMDPs: discounted POMDPs on the one hand, and Goal POMDPs on the other. In this paper, we bridge this representational gap, showing how to transform discounted POMDPs into Goal POMDPs, and use the transformation to compare RTDP-Bel with point-based algorithms over the existing discounted benchmarks. The results appear to contradict the conventional wisdom in the area, showing that RTDP-Bel is competitive with, and sometimes superior to, point-based algorithms in both quality and time.
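The tabular representation the abstract mentions can be sketched with a simple discretization function: nearby beliefs are hashed to the same cell, so values can live in an ordinary dictionary instead of a set of alpha-vectors. The resolution parameter `D` is illustrative, not the paper's notation:

```python
def discretize(belief, D=10):
    """Map a continuous belief to a hashable key by rounding each
    probability to the nearest multiple of 1/D -- the kind of
    discretization that lets an RTDP-Bel-style solver store values in a
    plain table. (D is an illustrative resolution parameter.)"""
    return tuple(round(p * D) for p in belief)

V = {}  # tabular value function over discretized beliefs
b = (0.12, 0.88)
V[discretize(b)] = 4.2
# Nearby beliefs fall into the same cell and share a stored value.
print(discretize((0.14, 0.86)) == discretize(b))  # → True
```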
Optimizing Fixed-Size Stochastic Controllers for POMDPs and Decentralized POMDPs
Abstract

Cited by 30 (14 self)
POMDPs and their decentralized multi-agent counterparts, DEC-POMDPs, offer a rich framework for sequential decision making under uncertainty. Their computational complexity, however, presents an important research challenge. One approach that effectively addresses the intractable memory requirements of current algorithms is based on representing agent policies as finite-state controllers. In this paper, we propose a new approach that uses this representation and formulates the problem as a nonlinear program (NLP). The NLP defines an optimal policy of a desired size for each agent. This new representation allows a wide range of powerful nonlinear programming algorithms to be used to solve POMDPs and DEC-POMDPs. Although solving the NLP optimally is often intractable, the results we obtain using an off-the-shelf optimization method are competitive with state-of-the-art POMDP algorithms and outperform state-of-the-art DEC-POMDP algorithms. Our approach is easy to implement and opens up promising research directions for solving POMDPs and DEC-POMDPs using nonlinear programming methods.
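For fixed controller parameters, the Bellman-style constraints of such an NLP become linear in the value variables, so a given finite-state controller can be evaluated by a linear solve. A minimal single-agent sketch, with illustrative array shapes not taken from the paper:

```python
import numpy as np

def controller_value(psi, eta, T, O, R, gamma):
    """Evaluate a fixed stochastic finite-state controller on a POMDP by
    solving the linear system that the NLP's Bellman constraints encode
    when the controller parameters are held fixed.
    Shapes (illustrative): psi[q,a] action probabilities per node,
    eta[q,a,o,q'] node transitions, T[a,s,s'], O[a,s',o], R[a,s]."""
    nQ, nA = psi.shape
    nS = T.shape[1]
    nO = O.shape[2]
    n = nQ * nS
    A = np.eye(n)
    b = np.zeros(n)
    for q in range(nQ):
        for s in range(nS):
            i = q * nS + s
            for a in range(nA):
                b[i] += psi[q, a] * R[a, s]
                for s2 in range(nS):
                    for o in range(nO):
                        p = psi[q, a] * T[a, s, s2] * O[a, s2, o]
                        for q2 in range(nQ):
                            A[i, q2 * nS + s2] -= gamma * p * eta[q, a, o, q2]
    return np.linalg.solve(A, b).reshape(nQ, nS)

# Degenerate 1-node, 1-state check: reward 1 forever at gamma=0.9 is worth 10.
V = controller_value(psi=np.ones((1, 1)), eta=np.ones((1, 1, 1, 1)),
                     T=np.ones((1, 1, 1)), O=np.ones((1, 1, 1)),
                     R=np.array([[1.0]]), gamma=0.9)
print(V)  # → [[10.]]
```

The NLP itself additionally treats `psi` and `eta` as decision variables, which is what makes the full problem nonlinear.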
Scaling POMDPs for spoken dialog management
 Audio, Speech, and Language Processing 15(7):2116–2129
Abstract

Cited by 28 (13 self)
Control in spoken dialog systems is challenging largely because automatic speech recognition is unreliable, and hence the state of the conversation can never be known with certainty. Partially observable Markov decision processes (POMDPs) provide a principled mathematical framework for planning and control in this context; however, POMDPs face severe scalability challenges, and past work has been limited to trivially small dialog tasks. This paper presents a novel POMDP optimization technique – composite summary point-based value iteration (CSPBVI) – which enables optimization to be performed on slot-filling POMDP-based dialog managers of a realistic size. Using dialog models trained on data from a tourist information domain, simulation results show that CSPBVI scales effectively, outperforms non-POMDP baselines, and is robust to estimation errors. Index Terms – Decision theory, dialogue management, partially observable Markov decision process, planning under uncertainty, spoken dialogue system.
Forward search value iteration for POMDPs
 in: International Joint Conference on AI
2007
Abstract

Cited by 26 (6 self)
Recent scaling up of POMDP solvers towards realistic applications is largely due to point-based methods, which quickly converge to an approximate solution for medium-sized problems. Of this family, HSVI, which uses trial-based asynchronous value iteration, can handle the largest domains. In this paper we suggest a new algorithm, FSVI, that uses the underlying MDP to traverse the belief space towards rewards, finding sequences of useful backups, and show how it scales up better than HSVI on larger benchmarks.
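The idea of letting the underlying MDP guide belief-space traversal can be sketched in a few lines: a Q-function over *states* (computed once, without beliefs) picks actions that head toward reward, the belief is tracked alongside, and the visited beliefs are returned in reverse order for backup. `Q_mdp` and `step` are assumed helpers, not the paper's code:

```python
import numpy as np

def fsvi_trial(s0, b0, Q_mdp, step, horizon=20):
    """Sketch of one FSVI-style trial. Q_mdp[s, a] is the underlying MDP's
    Q-function; step(s, a, b) samples the next state and observation and
    returns the updated belief. Visited (belief, action) pairs come back
    in reverse visit order, ready for backups from the end of the trial."""
    s, b, visited = s0, b0, []
    for _ in range(horizon):
        a = int(np.argmax(Q_mdp[s]))      # greedy in the underlying MDP
        visited.append((b, a))
        s, o, b = step(s, a, b)           # advance state, track the belief
    return list(reversed(visited))

Q_mdp = np.array([[0.0, 1.0], [1.0, 0.0]])  # state 0 prefers action 1
step = lambda s, a, b: (1 - s, 0, b)        # toy dynamics, belief untouched
trail = fsvi_trial(0, (0.5, 0.5), Q_mdp, step, horizon=3)
print([a for _, a in trail])  # → [1, 0, 1]
```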
The Infinite Partially Observable Markov Decision Process
Abstract

Cited by 25 (2 self)
The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world, and only models visited states explicitly. We demonstrate the iPOMDP on several standard problems.
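The growing-state-space idea can be illustrated with a Chinese-restaurant-process-style draw: revisit an existing state with probability proportional to how often it has been seen, or instantiate a brand-new state with probability proportional to a concentration parameter. This is an illustrative sketch of the mechanism, not the paper's exact generative model:

```python
import random

def crp_next_state(visit_counts, alpha=1.0):
    """CRP-style draw over states. visit_counts maps integer state ids to
    visit counts; with probability proportional to alpha a previously
    unseen state id is created, so the state set grows with experience.
    (Illustrative sketch; alpha is a generic concentration parameter.)"""
    total = sum(visit_counts.values()) + alpha
    r = random.random() * total
    for state, count in visit_counts.items():
        r -= count
        if r < 0:
            return state
    return max(visit_counts, default=-1) + 1   # a previously unseen state

# With no history, the draw always instantiates the first state (id 0).
print(crp_next_state({}))  # → 0
```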
A point-based POMDP planner for target tracking
 in Proc. ICRA
2008
Abstract

Cited by 25 (7 self)
Target tracking has two variants that are often studied independently with different approaches: target searching requires a robot to find a target that is initially not visible, and target following requires a robot to maintain visibility of a target that is initially visible. In this work, we use a partially observable Markov decision process (POMDP) to build a single model that unifies target searching and target following. The POMDP solution exhibits interesting tracking behaviors, such as anticipatory moves that exploit target dynamics, information-gathering moves that reduce target position uncertainty, and energy-conserving actions that allow the target to get out of sight but do not compromise long-term tracking performance. To overcome the high computational complexity of solving POMDPs, we have developed SARSOP, a new point-based POMDP algorithm based on successively approximating the space reachable under optimal policies. Experimental results show that SARSOP is competitive with the fastest existing point-based algorithm on many standard test problems and many times faster on some.
Bayesian Reinforcement Learning in Continuous POMDPs with Application to Robot Navigation
Abstract

Cited by 23 (3 self)
We consider the problem of optimal control in continuous and partially observable environments when the parameters of the model are not known exactly. Partially Observable Markov Decision Processes (POMDPs) provide a rich mathematical model for handling such environments, but most approaches require a known model to solve them. This is a limitation in practice, as the model parameters are often difficult to specify exactly. We adopt a Bayesian approach in which a posterior distribution over the model parameters is maintained and updated through experience with the environment. We propose a particle filter algorithm to maintain the posterior distribution and an online planning algorithm, based on trajectory sampling, to plan the best action to perform under the current posterior. The resulting approach selects control actions that optimally trade off between (1) exploring the environment to learn the model, (2) identifying the system's state, and (3) exploiting its knowledge in order to maximize long-term rewards. Our preliminary results on a simulated robot navigation problem show that our approach is able to learn good models of the sensors and actuators, and performs as well as if it had the true model.
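A bootstrap particle-filter step for a joint posterior over (state, model parameters), as in the Bayesian approach above, can be sketched as: propagate each particle through its own model, weight by the observation likelihood, then resample. `transition` and `obs_lik` are assumed user-supplied functions, not the paper's interface:

```python
import random

def particle_filter_update(particles, a, o, transition, obs_lik):
    """One bootstrap particle-filter step over (state, theta) particles,
    where theta stands for a particle's model parameters. Each particle is
    propagated through its *own* model, weighted by the likelihood of the
    observation o, and multinomially resampled. (Illustrative sketch.)"""
    propagated = [(transition(s, a, theta), theta) for s, theta in particles]
    weights = [obs_lik(s, a, o, theta) for s, theta in propagated]
    total = sum(weights)
    if total == 0:                      # degenerate case: keep particles as-is
        return propagated
    # Multinomial resampling proportional to the observation likelihood.
    return random.choices(propagated, weights=weights, k=len(particles))

# Toy check: only the "good" model explains the observation, so it survives.
particles = [(0, "good"), (0, "bad")]
step_to_1 = lambda s, a, theta: s + 1
likes_good = lambda s, a, o, theta: 1.0 if theta == "good" else 0.0
new = particle_filter_update(particles, a=0, o=None,
                             transition=step_to_1, obs_lik=likes_good)
print(new)  # → [(1, 'good'), (1, 'good')]
```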
Robot planning in partially observable continuous domains
 In Robotics: Science and Systems I
2005
Abstract

Cited by 22 (4 self)
We present a value iteration algorithm for learning to act in Partially Observable Markov Decision Processes (POMDPs) with continuous state spaces. Mainstream POMDP research focuses on the discrete case, and this complicates its application to, e.g., robotic problems that are naturally modeled using continuous state spaces. The main difficulty in defining a (belief-based) POMDP in a continuous state space is that expected values over states must be defined using integrals that, in general, cannot be computed in closed form. In this paper, we first show that the optimal finite-horizon value function over the continuous infinite-dimensional POMDP belief space is piecewise linear and convex, and is defined by a finite set of supporting α-functions that are analogous to the α-vectors (hyperplanes) defining the value function of a discrete-state POMDP. Second, we show that, for a fairly general class of POMDP models in which all functions of interest are modeled by Gaussian mixtures, all belief updates and value iteration backups can be carried out analytically and exactly. A crucial difference with respect to the α-vectors of the discrete case is that, in the continuous case, the α-functions will typically grow in complexity (e.g., in the number of components) with each value iteration. Finally, we demonstrate PERSEUS, our previously proposed randomized point-based value iteration algorithm, on a simple robot planning problem in a continuous domain, where encouraging results are observed.
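The kind of closed-form belief update this Gaussian-mixture machinery generalizes can be shown in its simplest special case: a single 1-D Gaussian belief combined with a Gaussian observation likelihood, which yields another Gaussian (this illustrative special case is standard, not the paper's full mixture update):

```python
def gaussian_belief_update(mu, var, z, obs_var):
    """Closed-form Bayes update of a 1-D Gaussian belief N(mu, var) with a
    Gaussian observation z of noise variance obs_var. The posterior is
    again Gaussian -- the analytic tractability that Gaussian-mixture
    POMDP models exploit component by component."""
    k = var / (var + obs_var)           # Kalman-style gain
    mu_post = mu + k * (z - mu)
    var_post = (1 - k) * var
    return mu_post, var_post

# Prior N(0, 1), observation z = 2 with noise variance 1:
print(gaussian_belief_update(0.0, 1.0, 2.0, 1.0))  # → (1.0, 0.5)
```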