## Learning policies for partially observable environments: Scaling up (1995)

### Download Links

- [www.cs.brown.edu]
- [www.cs.ubc.ca]
- DBLP

Citations: 232 (11 self)

### BibTeX

```bibtex
@MISC{Littman95learningpolicies,
  author = {Michael L. Littman and Anthony R. Cassandra and Leslie Pack Kaelbling},
  title  = {Learning policies for partially observable environments: Scaling up},
  year   = {1995}
}
```

### Abstract

Partially observable Markov decision processes (pomdp's) model decision problems in which an agent tries to maximize its reward in the face of limited and/or noisy sensor feedback. While the study of pomdp's is motivated by a need to address realistic problems, existing techniques for finding optimal behavior do not appear to scale well and have been unable to find satisfactory policies for problems with more than a dozen states. After a brief review of pomdp's, this paper discusses several simple solution methods and shows that all are capable of finding near-optimal policies for a selection of extremely small pomdp's taken from the learning literature. In contrast, we show that none are able to solve a slightly larger and noisier problem based on robot navigation. We find that a combination of two novel approaches performs well on these problems and suggest methods for scaling to even larger and more complicated domains.

### Citations

3734 | Artificial Intelligence: A Modern Approach - Russell, Norvig - 2002 |

1212 | Markov Decision Processes: Discrete Stochastic Dynamic Programming - Puterman - 1994 |

Citation Context: ...r the mdp consisting of the transitions and rewards only. These values can be computed extremely efficiently for problems with dozens to thousands of states, and a variety of approaches are available (Puterman, 1994). With the QMDP values in hand, we can treat all the QMDP values for each action as a single linear function and estimate the Q value for a belief state b as Q_a(b) = Σ_s b(s) Q_MDP(s, a). This esti... |
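As a concrete illustration of the QMDP estimate described in this context, here is a minimal NumPy sketch; the array layouts, function names, and the tiny two-state example are my assumptions, not taken from the paper:

```python
import numpy as np

def qmdp_values(T, R, gamma=0.95, iters=200):
    """Value iteration on the underlying fully observable mdp,
    ignoring observations entirely.
    T[a, s, s2]: transition probabilities; R[s, a]: expected reward."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)                             # greedy state values
        Q = R + gamma * np.einsum('ask,k->sa', T, V)  # one Bellman backup
    return Q

def qmdp_action(b, Q):
    """Treat each action's column of Q as a linear function of the belief:
    Q_a(b) = sum_s b(s) * Q_MDP(s, a), then act greedily."""
    return int(np.argmax(b @ Q))
```

For example, on a two-state problem where action i is rewarded only in state i, a belief weighted toward state 0 selects action 0. The heuristic is cheap precisely because the expensive computation happens on the |S|-state mdp, not on the belief simplex.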

374 | Dynamic Programming: Deterministic and Stochastic Models - Bertsekas - 1987 |

Citation Context: ...S is a set of states, A a set of actions, and Ω a set of observations. We will only consider the case in which these sets are finite. The functions T and R define a Markov decision process (mdp) (Bertsekas, 1987) with which the agent interacts without direct information as to the current state. The transition function, T : S × A → Π(S), specifies how the various actions affect the state of the environ... |

337 | The Optimal Control of Partially Observable Markov Decision Processes - Sondik - 1971 |

Citation Context: ...ed arbitrarily well by a piecewise-linear and convex (pwlc) function (Smallwood and Sondik, 1973; Littman, 1994). Further, there is a class of pomdp's that have value functions that are exactly pwlc (Sondik, 1978). These results apply to the optimal Q functions as well: the Q function for action a, Q_a(b), is the expected reward for a policy that starts in belief state b, takes action a, and then behaves opti... |

308 | The Complexity of Markov Decision Processes - Papadimitriou, Tsitsiklis - 1987 |

Citation Context: ...uation and provides a basis for computing optimal behavior. A variety of algorithms have been developed for solving pomdp's (Lovejoy, 1991), but because the problem is so computationally challenging (Papadimitriou and Tsitsiklis, 1987), most techniques are too inefficient to be used on all but the smallest problems (2 to 5 states (Cheng, 1988)). Recently, the Witness algorithm (Cassandra, 1994; Littman, 1994) has been used to solv... |

297 | The optimal control of partially observable Markov decision processes over a finite horizon - Smallwood, Sondik - 1973 |

Citation Context: ...ect, objects occlude one another from view, the robot might not know its initial status or precisely where it is. The theory of partially observable Markov decision processes (pomdp's) (Astrom, 1965; Smallwood and Sondik, 1973; Cassandra et al., 1994) models this situation and provides a basis for computing optimal behavior. A variety of algorithms have been developed for solving pomdp's (Lovejoy, 1991), but because the pr... |

275 | Acting optimally in partially observable stochastic domains - Cassandra, Kaelbling, et al. - 1994 |

Citation Context: ...ther from view, the robot might not know its initial status or precisely where it is. The theory of partially observable Markov decision processes (pomdp's) (Astrom, 1965; Smallwood and Sondik, 1973; Cassandra et al., 1994) models this situation and provides a basis for computing optimal behavior. A variety of algorithms have been developed for solving pomdp's (Lovejoy, 1991), but because the problem is so computationa... |

258 | An algorithm for probabilistic planning - Kushmerick, Hanks, et al. - 1995 |

Citation Context: .... During learning, actions were selected to maximize the current Q functions with Name |S| |A| |Ω| Noise: Shuttle (Chrisman, 1992) 8 3 5 T/O; Cheese Maze (McCallum, 1992) 11 4 7 --; Part Painting (Kushmerick et al., 1993) 4 4 2 T/O; 4x4 Grid (Cassandra et al., 1994) 16 4 2 --; Tiger (Cassandra et al., 1994) 2 3 2 O; 4x3 Grid (Parr and Russell, 1995) 11 4 6 T. Table 1: A suite of extremely small pomdp's. Shuttle Cheese Ma... |

248 | Introduction to Stochastic Dynamic Programming - Ross - 1983 |

Citation Context: ...n in the environment. We restrict our attention to stationary, deterministic policies on the belief state, since this class is relatively simple and we are assured that it includes an optimal policy (Ross, 1983). 2.3 PIECEWISE-LINEAR CONVEX FUNCTIONS A particularly powerful result of Sondik's is that the optimal value function for any pomdp can be approximated arbitrarily well by a piecewise-linear and conv... |

225 | Exploiting structure in policy construction - Boutilier, Dearden, et al. - 1995 |

224 | The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces - Moore, Atkeson - 1995 |

Citation Context: ...n combination with the updating rule, form a completely observable Markov decision process (mdp) with a continuous state space, similar to problems addressed in the reinforcement-learning literature (Moore, 1994). Our goal will be to find an approximation of the Q function over the continuous space of belief states and to use this as a basis for action in the environment. We restrict our attention to station... |
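The "updating rule" this context refers to is the standard Bayesian belief update that drives the belief-state mdp. A minimal sketch, where the array layouts and names are my assumptions:

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayes update after taking action a and observing o:
    b2(s2) is proportional to O(s2, a, o) * sum_s T(s, a, s2) * b(s).
    T[a, s, s2]: transition probs; O[a, s2, o]: observation probs."""
    b2 = O[a, :, o] * (b @ T[a])   # predict through T, weight by likelihood
    return b2 / b2.sum()           # renormalize to stay on the belief simplex
```

Successive beliefs produced this way are the continuous "states" on which the approximate Q functions are defined.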

209 | On the convergence of stochastic iterative dynamic programming algorithms - Jaakkola, Jordan, et al. - 1994 |

Citation Context: ...ays certain of its state (i.e., b(s) = 1 for some s at all times), this rule reduces exactly to standard Q-learning and can be shown to converge to the optimal Q function under the proper conditions (Jaakkola et al., 1994; Tsitsiklis, 1994). The rule itself is an extremely natural extension of Q-learning to vector-valued state spaces, since it basically consists of applying the Q-learning rule at every state where the... |

194 | Reinforcement learning with perceptual aliasing: the perceptual distinctions approach - Chrisman - 1992 |

Citation Context: ...rformed 75,000 steps of learning starting from the problem-specific belief state. During learning, actions were selected to maximize the current Q functions with Name |S| |A| |Ω| Noise: Shuttle (Chrisman, 1992) 8 3 5 T/O; Cheese Maze (McCallum, 1992) 11 4 7 --; Part Painting (Kushmerick et al., 1993) 4 4 2 T/O; 4x4 Grid (Cassandra et al., 1994) 16 4 2 --; Tiger (Cassandra et al., 1994) 2 3 2 O; 4x3 Grid (Parr a... |

176 | A survey of algorithmic methods for partially observed Markov decision processes - Lovejoy - 1991 |

Citation Context: ..., 1965; Smallwood and Sondik, 1973; Cassandra et al., 1994) models this situation and provides a basis for computing optimal behavior. A variety of algorithms have been developed for solving pomdp's (Lovejoy, 1991), but because the problem is so computationally challenging (Papadimitriou and Tsitsiklis, 1987), most techniques are too inefficient to be used on all but the smallest problems (2 to 5 states (Cheng... |

168 | Learning from Delayed Rewards - Watkins - 1989 |

Citation Context: ...learning a pomdp model in a reinforcement-learning setting. At the same time that their algorithms attempt to learn the transition and observation probabilities, they used an extension of Q-learning (Watkins, 1989) to learn approximate Q functions for the learned pomdp model. Although it was not the emphasis of their work, their "replicated Q-learning" rule is of independent interest. Replicated Q-learning gen... |

122 | Optimal control of Markov decision processes with incomplete state estimation - Astrom - 1965 |

Citation Context: ...ors are imperfect, objects occlude one another from view, the robot might not know its initial status or precisely where it is. The theory of partially observable Markov decision processes (pomdp's) (Astrom, 1965; Smallwood and Sondik, 1973; Cassandra et al., 1994) models this situation and provides a basis for computing optimal behavior. A variety of algorithms have been developed for solving pomdp's (Lovejo... |

119 | Approximating optimal policies for partially observable stochastic domains - Parr, Russell - 1995 |

Citation Context: ..., 1992) 8 3 5 T/O; Cheese Maze (McCallum, 1992) 11 4 7 --; Part Painting (Kushmerick et al., 1993) 4 4 2 T/O; 4x4 Grid (Cassandra et al., 1994) 16 4 2 --; Tiger (Cassandra et al., 1994) 2 3 2 O; 4x3 Grid (Parr and Russell, 1995) 11 4 6 T. Table 1: A suite of extremely small pomdp's. [Results, Trunc VI row: Shuttle 1.805 ± 0.014, Cheese Maze 0.188 ± 0.002, Part Painting 0.179 ± 0.012, 4x4 Grid 0.193 ± 0.003, 0...] |

84 | Tight performance bounds on greedy policies based on imperfect value functions - Williams, Baird - 1993 |

73 | Algorithms for Partially Observable Markov Decision Processes - Cheng - 1988 |

Citation Context: ... 1991), but because the problem is so computationally challenging (Papadimitriou and Tsitsiklis, 1987), most techniques are too inefficient to be used on all but the smallest problems (2 to 5 states (Cheng, 1988)). Recently, the Witness algorithm (Cassandra, 1994; Littman, 1994) has been used to solve pomdp's with up to 16 states. While this problem size is considerably larger than prior state of the art, th... |

56 | Learning internal representations by error propagation - Rumelhart, Hinton, et al. - 1986 |

Citation Context: ...rd the same value, the components of q_a are adjusted to match the coefficients of the linear function that predicts the Q values. This is accomplished by applying the delta rule for neural networks (Rumelhart et al., 1986), which, adapted to the belief mdp framework, becomes: Δq_a(s) = α b(s) (r + γ max_{a'} Q_{a'}(b') − q_a · b). Like the replicated Q-learning rule, this rule reduces to ordinary Q-... |
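The adapted delta rule quoted in this context can be sketched as follows, with q as the matrix of per-action weight vectors; the shapes and names are my assumptions:

```python
import numpy as np

def linear_q_update(q, a, b, r, b_next, alpha=0.1, gamma=0.95):
    """One delta-rule step for the linear approximation Q_a(b) = q_a . b.
    q[a]: weight vector for action a; b, b_next: successive belief states."""
    target = r + gamma * np.max(q @ b_next)  # bootstrapped return from b'
    error = target - q[a] @ b                # scalar temporal-difference error
    q[a] += alpha * b * error                # delta_q_a(s) = alpha * b(s) * error
    return q
```

With a one-hot belief (b(s) = 1 for a single s), b selects a single component of q[a] and the step is exactly tabular Q-learning, which is the reduction the surrounding text describes.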

45 | The witness algorithm: Solving partially observable Markov decision processes - Littman - 1994 |

33 | Optimal policies for partially observable Markov decision processes - Cassandra - 1994 |

Citation Context: ...nally challenging (Papadimitriou and Tsitsiklis, 1987), most techniques are too inefficient to be used on all but the smallest problems (2 to 5 states (Cheng, 1988)). Recently, the Witness algorithm (Cassandra, 1994; Littman, 1994) has been used to solve pomdp's with up to 16 states. While this problem size is considerably larger than prior state of the art, the algorithm is not efficient enough to be used for l... |

17 | Toward approximate planning in very large stochastic domains - Nicholson, Kaelbling - 1994 |

16 | First results with utile distinction memory for reinforcement learning - McCallum - 1992 |

Citation Context: ...ng from the problem-specific belief state. During learning, actions were selected to maximize the current Q functions with Name |S| |A| |Ω| Noise: Shuttle (Chrisman, 1992) 8 3 5 T/O; Cheese Maze (McCallum, 1992) 11 4 7 --; Part Painting (Kushmerick et al., 1993) 4 4 2 T/O; 4x4 Grid (Cassandra et al., 1994) 16 4 2 --; Tiger (Cassandra et al., 1994) 2 3 2 O; 4x3 Grid (Parr and Russell, 1995) 11 4 6 T. Table 1: A s... |

12 | Rapid task learning for real robots - Connell, Mahadevan - 1993 |

1 | Asynchronous stochastic approximation and Q-learning - Tsitsiklis - 1994 |

Citation Context: ...e (i.e., b(s) = 1 for some s at all times), this rule reduces exactly to standard Q-learning and can be shown to converge to the optimal Q function under the proper conditions (Jaakkola et al., 1994; Tsitsiklis, 1994). The rule itself is an extremely natural extension of Q-learning to vector-valued state spaces, since it basically consists of applying the Q-learning rule at every state where the magnitude of the c... |