## Planning and acting in partially observable stochastic domains (1998)

### Download Links

- [www.cis.upenn.edu]
- [www.cs.tufts.edu]
- [ftp.cs.brown.edu]
- [www.cs.duke.edu]
- [damas.ift.ulaval.ca]
- [www.damas.ift.ulaval.ca]
- [www.cs.brown.edu]
- [msl.cs.uiuc.edu]
- [www.cs.ubc.ca]
- [www.ai.mit.edu]
- [csail.mit.edu]
- [people.csail.mit.edu]
- [staff.science.uva.nl]
- [www.cs.rutgers.edu]
- [classes.engr.oregonstate.edu]
- [elite.polito.it]
- DBLP

### Other Repositories/Bibliography

Venue: Artificial Intelligence

Citations: 832 (30 self)

### BibTeX

@ARTICLE{Kaelbling98planningand,
  author  = {Leslie Pack Kaelbling and Michael L. Littman and Anthony R. Cassandra},
  title   = {Planning and acting in partially observable stochastic domains},
  journal = {Artificial Intelligence},
  year    = {1998},
  volume  = {101},
  pages   = {99--134}
}


### Abstract

In this paper, we bring techniques from operations research to bear on the problem of choosing optimal actions in partially observable stochastic domains. We begin by introducing the theory of Markov decision processes (mdps) and partially observable mdps (pomdps). We then outline a novel algorithm for solving pomdps offline and show how, in some cases, a finite-memory controller can be extracted from the solution to a pomdp. We conclude with a discussion of how our approach relates to previous work, the complexity of finding exact solutions to pomdps, and of some possibilities for finding approximate solutions.

### Citations

4286 | A tutorial on hidden Markov models and selected applications in speech recognition
- Rabiner
Citation Context ...o get good solutions to large problems. Another area that is not addressed in this paper is the acquisition of a world model. One approach is to extend techniques for learning hidden Markov models [43,53] to learn pomdp models. Then, we could apply algorithms of the type described in this paper to the learned models. Another approach is to combine the learning of the model with the computation of the ... |

2872 |
Genetic programming: On the programming of computers by means of natural selection
- Koza
- 1992
Citation Context ...resent optimal plans in general. This argues that, in the limit, a plan is actually a program. Several techniques have been proposed recently for searching for good program-like controllers in pomdps [46,23]. We restrict our attention to the simpler finite-horizon case and a small set of infinite-horizon problems that have optimal finite-state plans. 7 Extensions and Conclusions The pomdp model provides a firm f... |

2129 |
A new approach to linear filtering and prediction problems. Transactions of the ASME
- Kalman
- 1960
Citation Context ...the underlying dynamics of the world (the map and other information), to maintain an estimate of its location. Many engineering applications follow this approach, using methods like the Kalman filter [20] to maintain a running estimate of the robot's spatial uncertainty, expressed as an ellipsoid or normal distribution in Cartesian space. This approach will not do for our robot, though. Its uncertaint... |
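The Kalman-filter approach mentioned in this context can be sketched in a few lines. Below is a generic 1-D random-walk filter; the variable names and noise values are illustrative choices of our own, not anything from the paper:

```python
# Minimal 1-D Kalman filter sketch (illustrative; parameters are our own).
# It maintains a running Gaussian estimate (mean, variance) of a scalar
# state, as in the spatial-uncertainty estimate described above.

def kalman_step(mean, var, measurement, process_var=0.1, meas_var=0.5):
    """One predict-update cycle for a random-walk state model."""
    # Predict: state drifts, so uncertainty grows by the process noise.
    pred_mean, pred_var = mean, var + process_var
    # Update: blend prediction and measurement via the Kalman gain.
    gain = pred_var / (pred_var + meas_var)
    new_mean = pred_mean + gain * (measurement - pred_mean)
    new_var = (1.0 - gain) * pred_var
    return new_mean, new_var

mean, var = 0.0, 1.0  # broad initial belief
for z in [0.9, 1.1, 1.0]:  # noisy measurements near 1.0
    mean, var = kalman_step(mean, var, z)
```

After a few consistent measurements the variance shrinks and the mean moves toward the measured value, which is exactly the "running estimate" behavior the context describes.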

1462 |
Theory of linear and integer programming
- Schrijver
- 1986
Citation Context ...he number of bits of precision used in specifying the model is polynomial in these quantities since the polynomial running time of linear programming is expressed as a function of the input precision [48]. 4.5 Alternative Approaches One paragraph each on Cheng, Sondik 1 and 2, Incremental Pruning??. And a short discussion of their relative efficiencies. 4.6 The Infinite Horizon Be sure this is right, ... |

957 | Fast planning through planning graph analysis
- Blum, Furst
- 1995
Citation Context ... when the initial state is known and all actions are deterministic. A slightly more elaborate structure is the partially ordered plan (generated, for example, by snlp and ucpop), or the parallel plan [4]. In this type of plan, actions can be left unordered if all orderings are equivalent under the performance metric. When actions are stochastic, partially ordered plans can still be used (as in Burida... |

581 |
Markov Decision Processes
- Puterman
- 1994
Citation Context ...ct perceptual abilities. Figure 1: An mdp models the synchronous interaction between agent and world. Markov decision processes are described in depth in a variety of texts [2, 20]; we will just briefly cover the necessary background. 2.1 Basic Framework A Markov decision process can be described as a tuple ⟨S, A, T, R⟩, where • S is a finite set of states of the world; • A i... |
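The tuple ⟨S, A, T, R⟩ described in this context can be written out directly for a toy model. The two-state example below is hypothetical, chosen by us purely to make the components concrete:

```python
# A toy MDP expressed directly as the tuple <S, A, T, R> from the text.
# The two-state model and all numbers are our own illustration.
S = ["s0", "s1"]
A = ["stay", "go"]
# T[s][a][s2]: probability of reaching s2 from s under action a
T = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.1, "s1": 0.9}, "go": {"s0": 0.8, "s1": 0.2}},
}
# R[s][a]: expected immediate reward for taking a in s
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}

# Sanity check: every transition distribution sums to one.
for s in S:
    for a in A:
        assert abs(sum(T[s][a].values()) - 1.0) < 1e-9
```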

519 |
Dynamic Programming and Markov Processes
- Howard
- 1960
Citation Context ...derived from V_{t-1} and V_{t-2}. V_t(s) = max_a [ R(s, a) + γ Σ_{s'} T(s, a, s') V_{t-1}(s') ]. In the infinite-horizon discounted case, for any initial state s, we want to execute the policy π that maximizes V_π(s). Howard [18] showed that there exists a stationary policy, π*, that is optimal for every starting state. The value function for this policy, V_{π*}, also written V*, is defined by the set of equations V*(s) = max_a ... |
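The recurrence in this context, V_t(s) = max_a [R(s,a) + γ Σ_{s'} T(s,a,s') V_{t-1}(s')], is ordinary value iteration. A minimal sketch on a hypothetical two-state MDP (the model and numbers are ours, not from the paper):

```python
# Value iteration sketch for the Bellman recurrence quoted above.
# The deterministic two-state MDP below is our own toy example.

def value_iteration(S, A, T, R, gamma=0.9, eps=1e-6):
    V = {s: 0.0 for s in S}
    while True:
        V_new = {
            s: max(R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in S)
                   for a in A)
            for s in S
        }
        if max(abs(V_new[s] - V[s]) for s in S) < eps:
            return V_new
        V = V_new

S = ["s0", "s1"]
A = ["stay", "go"]
T = {"s0": {"stay": {"s0": 1.0, "s1": 0.0}, "go": {"s0": 0.0, "s1": 1.0}},
     "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "go": {"s0": 1.0, "s1": 0.0}}}
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}
V = value_iteration(S, A, T, R)
# Staying in s1 forever yields 2/(1-0.9) = 20; from s0, going to s1
# yields 1 + 0.9*20 = 19, so V converges near {s0: 19, s1: 20}.
```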

462 |
Dynamic programming and optimal control. Athena Scientific
- Bertsekas
- 1995
Citation Context ...gent's actions, there is never any uncertainty about the agent's current state; it has complete and perfect perceptual abilities. Markov decision processes are described in depth in a variety of texts [3,42]; we will just briefly cover the necessary background. Fig. 1. An mdp models the synchronous interaction between agent and world. 2.1 Basic Framework A Markov decision proc... |

415 | UCPOP: A sound, complete, partial order planner for ADL
- Penberthy, Weld
- 1992
Citation Context ...s will be known with certainty during plan execution. In the mdp framework, the agent is informed of the current state each time it takes an action. In many classical planners (e.g., snlp [32], ucpop [38]), the current state can be calculated trivially from the known initial state and knowledge of the deterministic operators. The assumption of perfect knowledge is not valid in many domains. Research o... |

386 | D.: Systematic nonlinear planning
- McAllester, Rosenblitt
- 1991
Citation Context ...f the process will be known with certainty during plan execution. In the mdp framework, the agent is informed of the current state each time it takes an action. In many classical planners (e.g., snlp [32], ucpop [38]), the current state can be calculated trivially from the known initial state and knowledge of the deterministic operators. The assumption of perfect knowledge is not valid in many domains... |

337 |
The Optimal Control of Partially Observable Markov Decision Processes
- Sondik
- 1971
Citation Context ...he amount of reward they produce, and how they change the state of the world. This paper is intended to make two contributions. The first is to recapitulate work from the operations-research literature [30,35,50,52,57] and to describe its connection to closely related work in AI. The second is to describe a novel algorithmic approach for solving pomdps exactly. We begin by introducing the theory of Markov decision pr... |

326 | Universal Plans for Reactive Robots in Unpredictable Environments
- Schoppers
- 1987
Citation Context ...entation is a policy which maps the current state (situation) to a choice of action. Because there is an action choice specified for all possible initial states, policies are also called universal plans [47]. This representation is not appropriate for pomdps, since the underlying state is not fully observable. However, pomdp policies can be viewed as universal plans over belief space. It is interesting t... |

313 |
A formal theory of knowledge and action
- Moore
- 1985
Citation Context ...d an optimal way to behave. In the artificial intelligence (AI) literature, a deterministic version of this problem has been addressed by adding knowledge preconditions to traditional planning systems [36]. Because we are interested in stochastic domains, however, we must depart from the traditional AI planning model. Rather than taking plans to be sequences of actions, which may only rarely execute as... |

295 |
The optimal control of partially observable markov decision processes over a finite horizon
- Smallwood, Sondik
- 1973
Citation Context ... amount of reward they produce, and how they change the state of the world. This paper is intended to make two contributions. The first is to recapitulate work from the operations-research literature [30,35,50,52,57] and to describe its connection to closely related work in AI. The second is to describe a novel algorithmic approach for solving pomdps exactly. We begin by introducing the theory of Markov decision ... |

275 | Acting optimally in partially observable stochastic domains
- Cassandra, Kaelbling, et al.
- 1994
Citation Context ...mately optimal. 5 Understanding Policies In this section we introduce a very simple example and use it to illustrate some properties of pomdp policies. Other examples are explored in an earlier paper [7]. 5.1 The Tiger Problem Imagine an agent standing in front of two closed doors. Behind one of the doors is a tiger and behind the other is a large reward. If the agent opens the door with the tiger, t... |

258 | An algorithm for probabilistic planning
- Kushmerick, Hanks, et al.
- 1995
Citation Context ...objective of planning, the representation of domains, and plan structures. The most closely related work to our own is that of Kushmerick, Hanks, and Weld [24] on the Buridan system, and Draper, Hanks and Weld [13] on the C-Buridan system. 6.1 Imperfect Knowledge Plans generated using standard mdp algorithms and classical (strips-like or partial-order) plan... |

232 | Learning policies for partially observable environments: Scaling up
- Littman, Cassandra, et al.
- 1995
Citation Context ...res the use of function-approximation methods for representing value functions and the use of simulation in order to concentrate the approximations on the frequently visited parts of the belief space [27]. The results of this work are encouraging and have allowed us to get a very good solution to an 89 state, 16 observation instance of a hallway navigation problem similar to the one described in the i... |

211 | Conditional nonlinear planning
- Peot, Smith
- 1992
Citation Context ...rovide makes efficient reasoning very difficult. A step towards building a working planning system that reasons about knowledge is to relax the generality of the logic-based schemes. The approach of cnlp [39] uses three-valued propositions where, in addition to true and false, there is a value unknown, which represents the state when the truth of the proposition is not known. Operators can then refer to w... |

200 | Probabilistic planning with information gathering and contingent execution
- Draper, Hanks, et al.
- 1994
Citation Context ...of domains, and plan structures. The most closely related work to our own is that of Kushmerick, Hanks, and Weld [24] on the Buridan system, and Draper, Hanks and Weld [13] on the C-Buridan system. 6.1 Imperfect Knowledge Plans generated using standard mdp algorithms and classical (strips-like or partial-order) planning algorithms assume that the underlying state of the... |

193 | Reinforcement Learning with Perceptual Aliasing: The Perceptual Distinctions Approach
- Chrisman
- 1992
Citation Context ...tential significant advantage of being able to learn a model that is complex enough to support optimal (or good) behavior without making irrelevant distinctions; this idea has been pursued by Chrisman [10] and McCallum [33,34]. A Appendix Theorem 1 Let U^a be a non-empty set of useful policy trees, and Q^a_t be the complete set of useful policy trees. Then U^a ≠ Q^a_t if and only if there is some tree... |

193 |
A survey of partially observable markov decision processes: Theory, models, and algorithms
- Monahan
- 1982
Citation Context ...he amount of reward they produce, and how they change the state of the world. This paper is intended to make two contributions. The first is to recapitulate work from the operations-research literature [30,35,50,52,57] and to describe its connection to closely related work in AI. The second is to describe a novel algorithmic approach for solving pomdps exactly. We begin by introducing the theory of Markov decision pr... |

176 | Algorithms for Sequential Decision Making
- Littman
- 1996
Citation Context ...ahan [35], is to test R(α, Ṽ) for every α in Ṽ and remove those that are nowhere dominant. A much more efficient pruning method was proposed by Lark and White [57] and is described in detail by Littman [29] and by Cassandra [8]. Because it has many subtle technical details, it is not described here. 4.3 One Step of Value Iteration The value function for a pomdp can be computed using value iteration, wit... |

176 |
A survey of algorithmic methods for partially observed markov decision processes
- Lovejoy
- 1991
Citation Context ...he amount of reward they produce, and how they change the state of the world. This paper is intended to make two contributions. The first is to recapitulate work from the operations-research literature [30,35,50,52,57] and to describe its connection to closely related work in AI. The second is to describe a novel algorithmic approach for solving pomdps exactly. We begin by introducing the theory of Markov decision pr... |

163 | Planning under time constraints in stochastic domains
- Dean, Kaelbling, et al.
- 1995
Citation Context ...what may happen. In many cases, we may not want a full policy; methods for developing partial policies and conditional plans for completely observable domains are the subject of much current interest [14,54,12]. A weakness of the methods described in this paper is that they require the states of the world to be represented enumeratively, rather than through compositional representations such as Bayes nets or ... |

158 | Incremental pruning: A simple, fast, exact method for partially observable markov decision processes
- Cassandra, Littman, et al.
- 1997
Citation Context ...the geometric approaches are useful only in pomdps with extremely small state spaces. Zhang and Liu [67] describe the incremental-pruning algorithm, later generalized by Cassandra, Littman, and Zhang [7]. This algorithm is simple to implement and empirically faster than the witness algorithm, while sharing its good worst-case complexity in terms of Σ_a |Q^a_t|. The basic algorithm works like the exh... |

153 | The complexity of stochastic games
- Condon
- 1992
Citation Context ...transformation holds in the opposite direction: any total expected discounted reward problem (completely observable or finite horizon) can be transformed into a goal-achievement problem of similar size [11,60]. Roughly, the transformation simulates the discount factor by introducing an absorbing state with a small probability of being entered on each step. Rewards are then simulated by normalizing all rewa... |

136 |
Exact and Approximate Algorithms for Partially Observable Markov Decision Processes. Unpublished doctoral dissertation
- Cassandra
- 1998
Citation Context ...ich dominates; that is, R(α, V) = {b | b · α > b · α̃ for all α̃ ∈ V, and b ∈ B}. It is relatively easy, using a linear program, to find a point in R(α, V) if one exists, or to determine that the region is empty [8]. The simplest pruning strategy, described by Monahan [35], is to test R(α, Ṽ) for every α in Ṽ and remove those that are nowhere dominant. A much more efficient pruning method was proposed by Lark and ... |
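The dominance test in this context is done with a linear program in the paper; as a rough stand-in (our simplification, not the paper's method), one can sample belief points and keep an alpha-vector only if it is strictly best at some sampled belief:

```python
# Approximate pruning of dominated alpha-vectors by sampling beliefs.
# This grid test is our crude substitute for the exact LP region test.

def dot(b, alpha):
    return sum(bi * ai for bi, ai in zip(b, alpha))

def approx_prune(vectors, beliefs):
    kept = []
    for alpha in vectors:
        others = [v for v in vectors if v is not alpha]
        # Keep alpha if it strictly dominates everywhere at some belief b.
        if any(all(dot(b, alpha) > dot(b, v) for v in others) for b in beliefs):
            kept.append(alpha)
    return kept

# Two-state example: the third vector (0.4, 0.4) is dominated everywhere,
# since max(b0, b1) >= 0.5 > 0.4 at every belief.
vectors = [(1.0, 0.0), (0.0, 1.0), (0.4, 0.4)]
beliefs = [(p / 10, 1 - p / 10) for p in range(11)]
pruned = approx_prune(vectors, beliefs)
```

Unlike the LP test, this sketch can miss vectors whose witness region falls between sampled beliefs, which is exactly why the paper uses an exact linear program.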

135 |
Information value theory
- Howard
- 1966
Citation Context ...f the simplex, the agent can take actions more likely to be appropriate for the current state of the world and, so, gain more reward. This has some connection to the notion of "value of information" [19], where an agent can incur a cost to move it from a high-entropy to a low-entropy state; this is only worthwhile when the value of the information (the difference in value between the two states) exceeds... |

134 | Hidden Markov model induction by Bayesian model merging
- Stolcke, Omohundro
- 1993
Citation Context ...o get good solutions to large problems. Another area that is not addressed in this paper is the acquisition of a world model. One approach is to extend techniques for learning hidden Markov models [43,53] to learn pomdp models. Then, we could apply algorithms of the type described in this paper to the learned models. Another approach is to combine the learning of the model with the computation of the ... |

122 |
Optimal control of markov decision processes with incomplete state estimation
- Astrom
- 1965
Citation Context ... agent: given the agent's current belief state (properly computed), no additional data about its past actions or observations would supply any further information about the current state of the world [1,50]. This means that the process over belief states is Markov, and that no additional data about the past would help to increase the agent's expected reward. To illustrate the evolution of a belief state... |
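The belief-state evolution this context refers to is a Bayes filter over states: b'(s') ∝ O(s', a, o) · Σ_s T(s, a, s') b(s). A minimal sketch with two-door, tiger-style numbers; the 0.85 listening accuracy is our illustrative assumption:

```python
# Belief-state update sketch: new belief after action a and observation o.
# The two-state model and the 0.85 observation accuracy are illustrative.

def update_belief(b, T_a, O_ao):
    """b: dict state -> prob; T_a[s][s2]: transition under the action;
    O_ao[s2]: probability of the received observation in state s2."""
    states = T_a[next(iter(T_a))]  # next-state labels
    unnorm = {
        s2: O_ao[s2] * sum(T_a[s][s2] * b[s] for s in b)
        for s2 in states
    }
    z = sum(unnorm.values())  # probability of the observation; assume > 0
    return {s2: p / z for s2, p in unnorm.items()}

b = {"left": 0.5, "right": 0.5}
# "listen" does not move the hidden state.
T_listen = {"left": {"left": 1.0, "right": 0.0},
            "right": {"left": 0.0, "right": 1.0}}
O_hear_left = {"left": 0.85, "right": 0.15}  # hear correctly 85% of the time
b = update_belief(b, T_listen, O_hear_left)  # belief shifts toward "left"
```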


113 | Computing optimal policies for partially observable decision processes using compact representations
- Boutilier, Poole
- 1996
Citation Context ...rough compositional representations such as Bayes nets or probabilistic operator descriptions. However, this work has served as a substrate for development of more complex and efficient representations [6]. Section 6 describes the relation between the present approach and prior research in more detail. One important facet of the pomdp approach is that there is no distinction drawn between actions taken... |

108 | Anytime synthetic projection: Maximizing the probability of goal satisfaction
- Drummond, Bresina
- 1990
Citation Context ...what may happen. In many cases, we may not want a full policy; methods for developing partial policies and conditional plans for completely observable domains are the subject of much current interest [14,54,12]. A weakness of the methods described in this paper is that they require the states of the world to be represented enumeratively, rather than through compositional representations such as Bayes nets or ... |

107 | Overcoming Incomplete Perception with Util Distinction Memory
- McCallum
- 1993
Citation Context ... advantage of being able to learn a model that is complex enough to support optimal (or good) behavior without making irrelevant distinctions; this idea has been pursued by Chrisman [10] and McCallum [33,34]. A Appendix Theorem 1 Let U^a be a non-empty set of useful policy trees, and Q^a_t be the complete set of useful policy trees. Then U^a ≠ Q^a_t if and only if there is some tree p ∈ U^a, o ∈ Ω, and p... |

98 | G.: Planning for contingencies: A decision-based approach
- Pryor, Collins
- 1996
Citation Context ...not easily modeled with deterministic actions, since an action can have different results, even when applied in exactly the same state. Extensions to classical planning, such as cnlp [39] and Cassandra [41], have considered operators with nondeterministic effects. For each operator, there is a set of possible next states that could occur. A drawback of this approach is that it gives no information about t... |

97 | The Complexity of Mean Payoff Games on Graphs
- Zwick, Paterson
- 1996
Citation Context ...ansformation holds in the opposite direction: any total expected discounted reward problem (completely observable or finite horizon) can be transformed into a goal-achievement problem of similar size [11,60]. Roughly, the transformation simulates the discount factor by introducing an absorbing state with a small probability of being entered on each step. Rewards are then simulated by normalizing all rewa... |

95 | Utility models for goal-directed, decision-theoretic planners
- Haddawy, Hanks
- 1998
Citation Context ...the duration of a run. Koenig and Simmons [22] examine risk-sensitive planning and showed how planners for the total-reward criterion could be used to optimize risk-sensitive behavior. Haddawy et al. [16] looked at a broad family of decision-theoretic objectives that make it possible to specify trade-offs between partially satisfying goals quickly and satisfying them completely. Bacchus, Boutilier, and... |


95 | The frame problem and knowledgeproducing actions
- Scherl, Levesque
- 1993
Citation Context ...te can be calculated trivially from the known initial state and knowledge of the deterministic operators. The assumption of perfect knowledge is not valid in many domains. Research on epistemic logic [36,37,45] relaxes this assumption by making it possible to reason about what is and is not known at a given time. Unfortunately, epistemic logics have not been used as a representation in automatic planning sy... |

93 | Memoryless policies: Theoretical limitations and practical results
- Littman
- 1994
Citation Context ...ions with the same appearance, increasing the probability that it might choose a good action; in practice deterministic observation-action mappings are prone to getting trapped in deterministic loops [26]. In order to behave truly effectively in a partially observable world, it is necessary to use memory of previous actions and observations to aid in the disambiguation of the states of the world. The p... |

90 | Instance-Based Utile Distinctions for Reinforcement Learning with Hidden State
- McCallum
- 1995
Citation Context ... advantage of being able to learn a model that is complex enough to support optimal (or good) behavior without making irrelevant distinctions; this idea has been pursued by Chrisman [10] and McCallum [33,34]. A Appendix Theorem 1 Let U^a be a non-empty set of useful policy trees, and Q^a_t be the complete set of useful policy trees. Then U^a ≠ Q^a_t if and only if there is some tree p ∈ U^a, o ∈ Ω, and p... |

84 | Tight performance bounds on greedy policies based on imperfect value functions
- Williams, Baird
- 1993
Citation Context ... optimal infinite-horizon policy, π*. Rather than calculating a bound on t in advance and running value iteration for that long, we instead use the following result regarding the Bellman error magnitude [58] in order to terminate with a near-optimal policy. If |V_t(s) − V_{t−1}(s)| < ε for all s, then the value of the greedy policy with respect to V_t does not differ from V* by more than 2εγ/(1 − γ) at any state. ... |
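The stopping rule in this context bounds the loss of the greedy policy once successive value functions agree to within ε everywhere; the bound itself is a one-line computation (the helper name and the sample values are ours):

```python
# Loss bound for the greedy policy when value iteration is stopped once
# successive value functions differ by less than eps at every state.
def greedy_policy_loss_bound(eps, gamma):
    assert 0.0 <= gamma < 1.0  # discounted case only
    return 2.0 * eps * gamma / (1.0 - gamma)

# Illustrative values: eps = 0.01, gamma = 0.9 gives a bound of 0.18.
bound = greedy_policy_loss_bound(0.01, 0.9)
```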


73 |
Algorithms for Partially Observable Markov Decision Processes
- Cheng
- 1988
Citation Context ...d, we would like to generate the elements of V_t directly. If we could do this, we might be able to reach a computation time per iteration that is polynomial in |S|, |A|, |Ω|, |V_{t−1}|, and |V_t|. Cheng [9] and Smallwood and Sondik [50] also try to avoid generating all of V_t^+ by constructing V_t directly. However, their algorithms still have worst-case running times exponential in at least one of the p... |

70 | MAXPLAN: A new approach to probabilistic planning
- Majercik, Littman
- 1998
Citation Context ...ning model is called "completely observable." The mdp model, as well as some planning systems such as cnlp and Plinth [18,19], assume complete observability. Other systems, such as Buridan and maxplan [37], have no observation model and can attack "completely unobservable" problems. Classical planning systems typically have no observation model, but the fact that the initial state is known and operator... |

64 |
Knowledge preconditions for actions and plans
- Morgenstern
- 1987
Citation Context ...te can be calculated trivially from the known initial state and knowledge of the deterministic operators. The assumption of perfect knowledge is not valid in many domains. Research on epistemic logic [36,37,45] relaxes this assumption by making it possible to reason about what is and is not known at a given time. Unfortunately, epistemic logics have not been used as a representation in automatic planning sy... |

52 |
Planning with external events
- Blythe
- 1994
Citation Context ...ossible to assess whether a plan is likely to reach the goal even if it is not guaranteed to do so. This type of action model is used in mdps and pomdps as well as in Buridan and C-Buridan. Other work [5,14] has used representations that can be used to compute probability distributions over future states. 6.4 Observation Model When the starting state is known and actions are deterministic, there is no... |


45 | The witness algorithm: Solving partially observable markov decision processes
- Littman
- 1994
Citation Context ... compute the maximum of their value functions to get V_t. If the value functions are represented by sets of policy trees, the test for termination can be implemented exactly using linear programming [12]. This is, of course, hopelessly computationally intractable. Each t-step policy tree contains (|Ω|^t − 1)/(|Ω| − 1) nodes (the branching factor is |Ω|, the number of possi... |
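The node count in this context, (|Ω|^t − 1)/(|Ω| − 1), is a geometric series over the tree's levels; a quick check (the function name is ours):

```python
# Nodes in a t-step policy tree with branching factor |Omega|:
# 1 + |Omega| + |Omega|^2 + ... + |Omega|^(t-1), summed in closed form.
def tree_nodes(num_obs, t):
    assert num_obs >= 2 and t >= 1
    return (num_obs ** t - 1) // (num_obs - 1)

# With 2 observations and a 4-step tree: 1 + 2 + 4 + 8 = 15 nodes.
n = tree_nodes(2, 4)
```

The exponential growth of this count in t is exactly why the context calls enumerating all policy trees "hopelessly computationally intractable."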

37 | Conditional linear planning
- Goldman, Boddy
- 1994
Citation Context ... state. If observations reveal the precise identity of the current state, the planning model is called "completely observable." The mdp model, as well as some planning systems such as cnlp and Plinth [18,19], assume complete observability. Other systems, such as Buridan and maxplan [37], have no observation model and can attack "completely unobservable" problems. Classical planning systems typically have ... |