Results 1–10 of 26
Prioritizing Point-Based POMDP Solvers
"... Abstract. Recent scaling up of POMDP solvers towards realistic applications is largely due to pointbased methods such as PBVI, Perseus, and HSVI, which quickly converge to an approximate solution for mediumsized problems. These algorithms improve a value function by using backup operations over a ..."
Abstract

Cited by 37 (6 self)
 Add to MetaCart
Recent scaling up of POMDP solvers towards realistic applications is largely due to point-based methods such as PBVI, Perseus, and HSVI, which quickly converge to an approximate solution for medium-sized problems. These algorithms improve a value function by using backup operations over a single belief point. In the simpler domain of MDP solvers, prioritizing the order of equivalent backup operations on states is well known to speed up convergence. We generalize the notion of prioritized backups to the POMDP framework, and show that the ordering of backup operations on belief points is important. We also present a new algorithm, Prioritized Value Iteration (PVI), and show empirically that it outperforms current point-based algorithms. Finally, a new empirical evaluation measure, based on the number of backups and the number of belief points, is proposed, in order to provide more accurate benchmark comparisons.
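The MDP-side intuition this abstract builds on can be sketched in a few lines: keep a priority queue of states ordered by Bellman error and always back up the highest-error state first. The sketch below is a generic prioritized-sweeping-style illustration on an invented toy MDP, not the paper's PVI algorithm itself (which orders backups over belief points, not states):

```python
import heapq

# Prioritized backups on a toy MDP: repeatedly back up the state with the
# largest Bellman error. The MDP encoding (P, R dicts) is an assumption
# made for this sketch, not taken from the paper.
def prioritized_value_iteration(states, actions, P, R, gamma=0.95, tol=1e-6):
    """P[s][a] -> list of (next_state, prob); R[s][a] -> reward."""
    V = {s: 0.0 for s in states}

    def bellman_error(s):
        best = max(R[s][a] + gamma * sum(p * V[t] for t, p in P[s][a])
                   for a in actions)
        return abs(best - V[s]), best

    # Seed the queue with every state's initial Bellman error (max-heap
    # via negated keys).
    heap = [(-bellman_error(s)[0], s) for s in states]
    heapq.heapify(heap)
    while heap:
        _, s = heapq.heappop(heap)
        err, best = bellman_error(s)   # recompute: the entry may be stale
        if err < tol:
            continue
        V[s] = best                    # back up the highest-error state
        for s2 in states:              # re-prioritize affected states
            e2, _ = bellman_error(s2)
            if e2 >= tol:
                heapq.heappush(heap, (-e2, s2))
    return V
```

Re-scanning all states after each backup is the naive variant; practical implementations only re-queue predecessors of the backed-up state.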
Exponential family predictive representations of state
 In Neural Information Processing Systems (NIPS)
"... 2008 To my wife, Martha. ii Acknowledgments This work would not have been possible without generous help, both intellectually and financially. I am grateful to my advisor, Satinder Singh, for the long discussions we have had as he has patiently taught me to think clearly through my own ideas, sharpe ..."
Abstract

Cited by 15 (2 self)
 Add to MetaCart
(Show Context)
2008. To my wife, Martha. Acknowledgments: This work would not have been possible without generous help, both intellectual and financial. I am grateful to my advisor, Satinder Singh, for the long discussions we have had as he has patiently taught me to think clearly through my own ideas, to sharpen my writing, and to raise my sights. A special thanks also to my lab mates, Matt Rudary, Britton Wolfe, Vishal Soni, Erik Talvitie, Jonathan Sorg and Ishan Chaudhuri, for always letting me bounce ideas around, for listening, and for patient tutoring. Thanks to Andrew Nuxoll for being a kindred spirit, to Nick Gorski for the occasional foosball game, and to my collaborators at the University of Alberta. I would like to gratefully acknowledge the National Science Foundation for financially supporting me through most of my studies with a Graduate Research Fellowship. Finally, a special thank you to my wife Martha for her love, her constancy, her feistiness, and for always keeping me on the straight and narrow. Thank you, Grace, Peterson and Andrew for reminding
Learning policies for embodied virtual agents through demonstration
 In Proc. IJCAI ’07
"... Although many powerful AI and machine learning techniques exist, it remains difficult to quickly create AI for embodied virtual agents that produces visually lifelike behavior. This is important for applications (e.g., games, simulators, interactive displays) where an agent must behave in a manner t ..."
Abstract

Cited by 14 (0 self)
 Add to MetaCart
Although many powerful AI and machine learning techniques exist, it remains difficult to quickly create AI for embodied virtual agents that produces visually lifelike behavior. This is important for applications (e.g., games, simulators, interactive displays) where an agent must behave in a manner that appears human-like. We present a novel technique for learning reactive policies that mimic demonstrated human behavior. The user demonstrates the desired behavior by dictating the agent's actions during an interactive animation. Later, when the agent is to behave autonomously, the recorded data is generalized to form a continuous state-to-action mapping. Combined with an appropriate animation algorithm (e.g., motion capture), the learned policies realize stylized and natural-looking agent behavior. We empirically demonstrate the efficacy of our technique for quickly producing policies which result in lifelike virtual agent behavior.
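The simplest way to turn recorded (state, action) pairs into a continuous state-to-action mapping is nearest-neighbor lookup. The sketch below illustrates that generic idea; the demonstration data and action names are invented, and the paper's actual generalization scheme may differ:

```python
import math

# Generalize demonstrated (state, action) pairs into a policy by
# returning the action recorded at the nearest demonstrated state.
def make_knn_policy(demonstrations):
    """demonstrations: list of (state_vector, action) pairs."""
    def policy(state):
        # Pick the action whose recorded state is closest to the query.
        _, action = min(demonstrations,
                        key=lambda pair: math.dist(pair[0], state))
        return action
    return policy

# Hypothetical demonstration log: 2-D agent states mapped to actions.
demo = [((0.0, 0.0), "idle"), ((1.0, 0.0), "walk"), ((1.0, 1.0), "wave")]
policy = make_knn_policy(demo)
```

Querying `policy((0.9, 0.1))` selects the "walk" example, since (1.0, 0.0) is the closest recorded state.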
Scaling up: Solving POMDPs through value-based clustering
 In Proceedings of AAAI, 1295
"... Partially Observable Markov Decision Processes (POMDPs) provide an appropriately rich model for agents operating under partial knowledge of the environment. Since finding an optimal POMDP policy is intractable, approximation techniques have been a main focus of research, among them pointbased algor ..."
Abstract

Cited by 11 (3 self)
 Add to MetaCart
(Show Context)
Partially Observable Markov Decision Processes (POMDPs) provide an appropriately rich model for agents operating under partial knowledge of the environment. Since finding an optimal POMDP policy is intractable, approximation techniques have been a main focus of research, among them point-based algorithms, which scale relatively well up to thousands of states. An important decision in a point-based algorithm is the order of backup operations over belief states. Prioritization techniques for ordering the sequence of backup operations reduce the number of needed backups considerably, but involve significant overhead. This paper suggests a new way to order backups, based on a soft clustering of the belief space. Our novel soft clustering method relies on the solution of the underlying MDP. Empirical evaluation verifies that our method rapidly computes a good order of backups, showing orders of magnitude improvement in runtime over a number of benchmarks.
Topological Order Planner for POMDPs
"... Over the past few years, pointbased POMDP solvers scaled up to produce approximate solutions to midsized domains. However, to solve real world problems, solvers must exploit the structure of the domain. In this paper we focus on the topological structure of the problem, where the state space conta ..."
Abstract

Cited by 4 (1 self)
 Add to MetaCart
Over the past few years, point-based POMDP solvers have scaled up to produce approximate solutions to mid-sized domains. However, to solve real-world problems, solvers must exploit the structure of the domain. In this paper we focus on the topological structure of the problem, where the state space contains layers of states. We present the Topological Order Planner (TOP), which utilizes the topological structure of the domain to compute belief space trajectories. TOP rapidly produces trajectories focused on the solvable regions of the belief space, thus reducing the number of redundant backups considerably. We demonstrate that TOP produces good-quality policies faster than any other point-based algorithm on domains with sufficient structure.
Topological Value Iteration Algorithms
"... Value iteration is a powerful yet inefficient algorithm for Markov decision processes (MDPs) because it puts the majority of its effort into backing up the entire state space, which turns out to be unnecessary in many cases. In order to overcome this problem, many approaches have been proposed. Amon ..."
Abstract

Cited by 4 (0 self)
 Add to MetaCart
(Show Context)
Value iteration is a powerful yet inefficient algorithm for Markov decision processes (MDPs) because it puts the majority of its effort into backing up the entire state space, which turns out to be unnecessary in many cases. Many approaches have been proposed to overcome this problem; among them, ILAO* and variants of RTDP are state of the art. These methods use reachability analysis and heuristic search to avoid some unnecessary backups. However, none of these approaches builds the graphical structure of the state transitions in a preprocessing step or uses the structural information to systematically decompose a problem, thereby generating an intelligent backup sequence over the state space. In this paper, we present two optimal MDP algorithms. The first algorithm, topological value iteration (TVI), detects the structure of MDPs and backs up states based on topological sequences. It (1) divides an MDP into strongly connected components (SCCs), and (2) solves these components sequentially. TVI vastly outperforms VI and other state-of-the-art algorithms when an MDP has multiple, close-to-equal-sized SCCs. The second algorithm, focused topological value iteration (FTVI), is an extension of TVI. FTVI restricts its attention to connected components that are relevant for solving the MDP. Specifically, it uses a small amount of heuristic search to eliminate provably suboptimal actions; this pruning allows FTVI to find smaller connected components, thus running faster. We demonstrate that FTVI outperforms TVI by an order of magnitude, averaged across several domains. Surprisingly, FTVI also significantly outperforms popular 'heuristically informed' MDP algorithms such as ILAO*, LRTDP, BRTDP and Bayesian RTDP in many domains, sometimes by as much as two orders of magnitude. Finally, we characterize the type of domains where FTVI excels, suggesting a way to make an informed choice of solver.
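TVI's two steps as described above (SCC decomposition, then solving components sequentially) can be sketched compactly: Tarjan's algorithm emits SCCs in reverse topological order, which is exactly the order in which a component's values depend only on already-solved downstream components. The toy MDP encoding is an assumption of this sketch, not the paper's implementation:

```python
# Sketch of TVI's control flow: find the SCCs of the MDP's transition
# graph, then run plain value iteration on each SCC, sinks first.
def sccs(graph):
    """Tarjan's algorithm; graph: node -> iterable of successors."""
    index, low, on_stack, stack, out, counter = {}, {}, set(), [], [], [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            comp = []
            while True:
                w = stack.pop(); on_stack.discard(w); comp.append(w)
                if w == v:
                    break
            out.append(comp)  # components emitted in reverse topological order

    for v in graph:
        if v not in index:
            strongconnect(v)
    return out

def tvi(states, actions, P, R, gamma=0.9, tol=1e-8):
    """P[s][a] -> list of (next_state, prob); R[s][a] -> reward."""
    graph = {s: {t for a in actions for t, p in P[s][a] if p > 0}
             for s in states}
    V = {s: 0.0 for s in states}
    for comp in sccs(graph):       # downstream (sink) components first
        delta = tol
        while delta >= tol:        # value iteration restricted to this SCC
            delta = 0.0
            for s in comp:
                best = max(R[s][a] + gamma * sum(p * V[t] for t, p in P[s][a])
                           for a in actions)
                delta = max(delta, abs(best - V[s])); V[s] = best
    return V
```

Because each SCC is solved to convergence before any component upstream of it, no state is ever backed up with stale downstream values, which is the source of TVI's savings on layered problems.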
PAC Optimal Planning for Invasive Species Management: Improved Exploration for Reinforcement Learning from Simulator-Defined MDPs
"... Often the most practical way to define a Markov Decision Process (MDP) is as a simulator that, given a state and an action, produces a resulting state and immediate reward sampled from the corresponding distributions. Simulators in natural resource management can be very expensive to execute, so tha ..."
Abstract

Cited by 3 (1 self)
 Add to MetaCart
(Show Context)
Often the most practical way to define a Markov Decision Process (MDP) is as a simulator that, given a state and an action, produces a resulting state and immediate reward sampled from the corresponding distributions. Simulators in natural resource management can be very expensive to execute, so the time required to solve such MDPs is dominated by the number of calls to the simulator. This paper presents an algorithm, DDV, that combines improved confidence intervals on the Q values (as in interval estimation) with a novel upper bound on the discounted state occupancy probabilities to intelligently choose state-action pairs to explore. We prove that this algorithm terminates with a policy whose value is within ε of the optimal policy (with probability 1 − δ) after making only polynomially many calls to the simulator. Experiments on benchmark MDPs and on an MDP for invasive species management show 3- to 5-fold reductions in the number of simulator calls required.
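One ingredient of the approach, sketched in isolation: a confidence interval on a Q-value estimated from simulator samples. The Hoeffding-style interval below is a generic illustration under assumed inputs (a return bound `vmax` and failure probability `delta`); DDV's actual intervals and its occupancy-probability bound are more refined:

```python
import math
import random

# A Hoeffding-style confidence interval on a Q-value estimated from
# sampled discounted returns bounded in [0, vmax]. Illustrative only.
def q_confidence_interval(samples, vmax, delta):
    """samples: observed discounted returns for one (state, action)."""
    n = len(samples)
    mean = sum(samples) / n
    # With probability >= 1 - delta, the true Q lies in this interval.
    half_width = vmax * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return mean - half_width, mean + half_width

random.seed(0)
samples = [random.uniform(0.0, 1.0) for _ in range(1000)]  # stand-in returns
lo, hi = q_confidence_interval(samples, vmax=1.0, delta=0.05)
```

An exploration rule in this spirit would query the simulator at the state-action pair whose interval is widest (or whose upper bound is highest), shrinking uncertainty where it matters most.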
Action Understanding as Inverse Planning Appendix
"... Markov decision problems This section formalizes the encoding of an agent’s environment and goal into a Markov decision problem (MDP), and describes how this MDP can be solved efficiently by algorithms for rational planning. Let π be an agent’s plan, referred to here (and in the MDP literature) as a ..."
Abstract

Cited by 2 (0 self)
 Add to MetaCart
(Show Context)
Markov decision problems. This section formalizes the encoding of an agent's environment and goal into a Markov decision problem (MDP), and describes how this MDP can be solved efficiently by algorithms for rational planning. Let π be an agent's plan, referred to here (and in the MDP literature) as a policy, such that P_π(a_t | s_t, g, w) is a probability distribution over actions a_t at time t, given the agent's state s_t at time t, the agent's goal g and world state w. This distribution formalizes P(Actions | Goal, Environment), the expression for probabilistic planning sketched in the main text. The policy π encodes all goal-dependent plans the agent could make in a given environment. We assume that agents' policies follow the principle of rationality. Within a goal-based MDP, this means that agents choose action sequences that minimize the expected cost to achieve their goals, given their beliefs about the environment. Let C_{g,w}(a, s) be the environment- and goal-dependent cost to an agent of taking action a in state s. The expected cost to an agent of executing policy π starting from state s is given by the agent's value function V^π, which sums the costs the agent is expected to incur over an infinite horizon.
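The snippet breaks off at the value function itself. Under the definitions it gives (policy P_π, cost C_{g,w}), the standard infinite-horizon form being described would read as follows; this is a reconstruction rather than a quote, and the discount factor γ (or γ = 1 with absorbing goal states) is a common additional assumption:

```latex
V^{\pi}(s) \;=\; \mathbb{E}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, C_{g,w}(a_t, s_t) \;\middle|\; s_0 = s,\; a_t \sim P_{\pi}(a_t \mid s_t, g, w)\right]
```

Rational action choice then corresponds to following a policy π that minimizes V^π(s) at the agent's current state.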
Learning and solving partially observable Markov Decision Processes
, 2007
"... Partially Observable Markov Decision Processes (POMDPs) provide a rich representation for agents acting in a stochastic domain under partial observability. POMDPs optimally balance key properties such as the need for information and the sum of collected rewards. However, POMDPs are difficult to use ..."
Abstract

Cited by 2 (0 self)
 Add to MetaCart
Partially Observable Markov Decision Processes (POMDPs) provide a rich representation for agents acting in a stochastic domain under partial observability. POMDPs optimally balance key properties such as the need for information and the sum of collected rewards. However, POMDPs are difficult to use for two reasons: first, it is difficult to obtain the environment dynamics, and second, even given the environment dynamics, solving POMDPs optimally is intractable. This dissertation deals with both difficulties. We begin with a number of methods for learning POMDPs. Methods for learning POMDPs are usually categorized as either model-free or model-based. We show how model-free methods fail to provide good policies as noise in the environment increases. We then suggest how to transform model-free into model-based methods, thus improving their solutions. This transformation is first demonstrated in an offline process, after the model-free method has computed a policy, and then in an online setting, where a model of the environment is learned together with a policy through interactions with the environment. The second part of the dissertation focuses on ways to solve predefined POMDPs. Point
Partitioned External-Memory Value Iteration
, 2008
"... Dynamic programming methods (including value iteration, LAO*, RTDP, and derivatives) are popular algorithms for solving Markov decision processes (MDPs). Unfortunately, however, these techniques store the MDP model extensionally in a table and thus are limited by the amount of main memory available. ..."
Abstract

Cited by 1 (1 self)
 Add to MetaCart
Dynamic programming methods (including value iteration, LAO*, RTDP, and derivatives) are popular algorithms for solving Markov decision processes (MDPs). Unfortunately, these techniques store the MDP model extensionally in a table and thus are limited by the amount of main memory available. Since the required space is exponential in the number of domain features, these dynamic programming methods are ineffective for large problems. To address this problem, Edelkamp et al. devised the external-memory value iteration (EMVI) algorithm, which uses a clever sorting scheme to efficiently move parts of the model between disk and main memory. While EMVI can handle larger problems than previously addressed, the need to repeatedly perform external sorts still limits scalability. This paper proposes a new approach: we partition an MDP into smaller pieces (blocks), keeping just the relevant blocks in memory and performing Bellman backups block by block. Experiments show that our algorithm is able to solve large MDPs an order of magnitude faster than EMVI.
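The block-by-block idea can be sketched in memory, ignoring the disk layer: partition the states into blocks, "load" one block at a time, and sweep Bellman backups within it. The toy MDP and the fixed sweep count are assumptions of this sketch; a real implementation would stream blocks from disk and test convergence:

```python
# Block-partitioned Bellman backups: sweep the state space one block at
# a time, as a stand-in for loading each block from external storage.
def partitioned_value_iteration(blocks, actions, P, R, gamma=0.9, sweeps=60):
    """blocks: list of state lists; P[s][a] -> list of (next_state, prob)."""
    V = {s: 0.0 for block in blocks for s in block}
    for _ in range(sweeps):
        for block in blocks:            # "load" one block into memory
            for s in block:             # Bellman backup within the block
                V[s] = max(R[s][a] + gamma * sum(p * V[t] for t, p in P[s][a])
                           for a in actions)
    return V
```

Ordering the blocks so that successors are updated before their predecessors (as in the topological approaches above) makes each sweep propagate values further per disk load.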