Results 1  10
of
208
Learning policies for partially observable environments: Scaling up
, 1995
"... Partially observable Markov decision processes (pomdp's) model decision problems in which an agent tries to maximize its reward in the face of limited and/or noisy sensor feedback. While the study of pomdp's is motivated by a need to address realistic problems, existing techniques for finding optim ..."
Abstract

Cited by 234 (11 self)
 Add to MetaCart
Partially observable Markov decision processes (pomdp's) model decision problems in which an agent tries to maximize its reward in the face of limited and/or noisy sensor feedback. While the study of pomdp's is motivated by a need to address realistic problems, existing techniques for finding optimal behavior do not appear to scale well and have been unable to find satisfactory policies for problems with more than a dozen states. After a brief review of pomdp's, this paper discusses several simple solution methods and shows that all are capable of finding nearoptimal policies for a selection of extremely small pomdp's taken from the learning literature. In contrast, we show that none are able to solve a slightly larger and noisier problem based on robot navigation. We find that a combination of two novel approaches performs well on these problems and suggest methods for scaling to even larger and more complicated domains. 1 Introduction Mobile robots must act on the basis of thei...
Learning and Sequential Decision Making
 LEARNING AND COMPUTATIONAL NEUROSCIENCE
, 1989
"... In this report we show how the class of adaptive prediction methods that Sutton called "temporal difference," or TD, methods are related to the theory of squential decision making. TD methods have been used as "adaptive critics" in connectionist learning systems, and have been proposed as models of ..."
Abstract

Cited by 195 (10 self)
 Add to MetaCart
In this report we show how the class of adaptive prediction methods that Sutton called "temporal difference," or TD, methods are related to the theory of squential decision making. TD methods have been used as "adaptive critics" in connectionist learning systems, and have been proposed as models of animal learning in classical conditioning experiments. Here we relate TD methods to decision tasks formulated in terms of a stochastic dynamical system whose behavior unfolds over time under the influence of a decision maker's actions. Strategies are sought for selecting actions so as to maximize a measure of longterm payoff gain. Mathematically, tasks such as this can be formulated as Markovian decision problems, and numerous methods have been proposed for learning how to solve such problems. We show how a TD method can be understood as a novel synthesis of concepts from the theory of stochastic dynamic programming, which comprises the standard method for solving such tasks when a model of the dynamical system is available, and the theory of parameter estimation, which provides the appropriate context for studying learning rules in the form of equations for updating associative strengths in behavioral models, or connection weights in connectionist networks. Because this report is oriented primarily toward the nonengineer interested in animal learning, it presents tutorials on stochastic sequential decision tasks, stochastic dynamic programming, and parameter estimation.
Recent advances in hierarchical reinforcement learning
, 2003
"... A preliminary unedited version of this paper was incorrectly published as part of Volume ..."
Abstract

Cited by 161 (23 self)
 Add to MetaCart
A preliminary unedited version of this paper was incorrectly published as part of Volume
DecisionTheoretic Deliberation Scheduling for Problem Solving In . . .
 ARTIFICIAL INTELLIGENCE
, 1994
"... We are interested in the problem faced byanagent with limited computational capabilities, embedded in a complex environment with other agents and processes not under its control. Careful management of computational resources is important for complex problemsolving tasks in which the time spent in ..."
Abstract

Cited by 157 (3 self)
 Add to MetaCart
We are interested in the problem faced byanagent with limited computational capabilities, embedded in a complex environment with other agents and processes not under its control. Careful management of computational resources is important for complex problemsolving tasks in which the time spent in decision making affects the quality of the responses generated by a system.
Learning Without StateEstimation in Partially Observable Markovian Decision Processes
 In Proceedings of the Eleventh International Conference on Machine Learning
, 1994
"... Reinforcement learning (RL) algorithms provide a sound theoretical basis for building learning control architectures for embedded agents. Unfortunately all of the theory and much of the practice (see Barto et al., 1983, for an exception) of RL is limited to Markovian decision processes (MDPs). Many ..."
Abstract

Cited by 128 (5 self)
 Add to MetaCart
Reinforcement learning (RL) algorithms provide a sound theoretical basis for building learning control architectures for embedded agents. Unfortunately all of the theory and much of the practice (see Barto et al., 1983, for an exception) of RL is limited to Markovian decision processes (MDPs). Many realworld decision tasks, however, are inherently nonMarkovian, i.e., the state of the environment is only incompletely known to the learning agent. In this paper we consider only partially observable MDPs (POMDPs), a useful class of nonMarkovian decision processes. Most previous approaches to such problems have combined computationally expensive stateestimation techniques with learning control. This paper investigates learning in POMDPs without resorting to any form of state estimation. We present results about what TD(0) and Qlearning will do when applied to POMDPs. It is shown that the conventional discounted RL framework is inadequate to deal with POMDPs. Finally we develop a new fr...
Fundamental Concepts of Qualitative Probabilistic Networks
 ARTIFICIAL INTELLIGENCE
, 1990
"... Graphical representations for probabilistic relationships have recently received considerable attention in A1. Qualitative probabilistic networks abstract from the usual numeric representations by encoding only qualitative relationships, which are inequality constraints on the joint probability dist ..."
Abstract

Cited by 119 (6 self)
 Add to MetaCart
Graphical representations for probabilistic relationships have recently received considerable attention in A1. Qualitative probabilistic networks abstract from the usual numeric representations by encoding only qualitative relationships, which are inequality constraints on the joint probability distribution over the variables. Although these constraints are insufficient to determine probabilities uniquely, they are designed to justify the deduction of a class of relative likelihood conclusions that imply useful decisionmaking properties. Two types of qualitative relationship are defined, each a probabilistic form of monotonicity constraint over a group of variables. Qualitative influences describe the direction of the relationship between two variables. Qualitative synergies describe interactions among influences. The probabilistic definitions chosen justify sound and efficient inference procedures based on graphical manipulations of the network. These procedures answer queries about qualitative relationships among variables separated in the network and determine structural properties of optimal assignments to decision variables.
Model Checking for a Probabilistic Branching Time Logic with Fairness
 Distributed Computing
, 1998
"... We consider concurrent probabilistic systems, based on probabilistic automata of Segala & Lynch [55], which allow nondeterministic choice between probability distributions. These systems can be decomposed into a collection of "computation trees" which arise by resolving the nondeterministic, but n ..."
Abstract

Cited by 116 (37 self)
 Add to MetaCart
We consider concurrent probabilistic systems, based on probabilistic automata of Segala & Lynch [55], which allow nondeterministic choice between probability distributions. These systems can be decomposed into a collection of "computation trees" which arise by resolving the nondeterministic, but not probabilistic, choices. The presence of nondeterminism means that certain liveness properties cannot be established unless fairness is assumed. We introduce a probabilistic branching time logic PBTL, based on the logic TPCTL of Hansson [30] and the logic PCTL of [55], resp. pCTL of [14]. The formulas of the logic express properties such as "every request is eventually granted with probability at least p". We give three interpretations for PBTL on concurrent probabilistic processes: the first is standard, while in the remaining two interpretations the branching time quantifiers are taken to range over a certain kind of fair computation trees. We then present a model checking algorithm for...
Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
, 1996
"... This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dyna ..."
Abstract

Cited by 99 (12 self)
 Add to MetaCart
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called ndiscountoptimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gainoptimal policies that maximize average reward, none of them can reliably filter these to produce biasoptimal (or Toptimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of Rlearning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of Rlearning is carried out to test its dependence on learning rates and exploration levels. The results suggest that Rlearning is quite sensitive to exploration strategies, and can fall into suboptimal limit cycles. The performance of Rlearning is also compared with that of Qlearning, the best studied discounted RL method. Here, the results suggest that Rlearning can be finetuned to give better performance than Qlearning in both domains.
Memoryless Policies: Theoretical Limitations and Practical Results
 In
, 1994
"... One form of adaptive behavior is "goalseeking" in which an agent acts so as to minimize the time it takes to reach a goal state. This paper presents some theoretical and empirical findings on algorithms that devise goalseeking behaviors for "memoryless" agents who base their behavioral decisions s ..."
Abstract

Cited by 93 (3 self)
 Add to MetaCart
One form of adaptive behavior is "goalseeking" in which an agent acts so as to minimize the time it takes to reach a goal state. This paper presents some theoretical and empirical findings on algorithms that devise goalseeking behaviors for "memoryless" agents who base their behavioral decisions solely on current sensations. The basic results are that (1) the general problem of finding good deterministic memoryless policies is intractable, however, (2) simple branchandbound heuristics can be used to find optimal memoryless policies extremely quickly for some established example environments. 1 Introduction This paper looks at a class of behaviors, or policies, that can be called "memoryless" since action decisions are made solely on this basis of the agent's current sensation. In nature, it would seem that memoryless behavior makes little sense. What organism would possibly ignore recent events in deciding how to act? Research on artificial agents, however, is more apt to focus o...