Results 11–20 of 1,207
Efficient Solution Algorithms for Factored MDPs
, 2003
Abstract

Cited by 130 (4 self)
This paper addresses the problem of planning under uncertainty in large Markov Decision Processes (MDPs). Factored MDPs represent a complex state space using state variables and the transition model using a dynamic Bayesian network. This representation often allows an exponential reduction in the representation size of structured MDPs, but the complexity of exact solution algorithms for such MDPs can grow exponentially in the representation size. In this paper, we present two approximate solution algorithms that exploit structure in factored MDPs. Both use an approximate value function represented as a linear combination of basis functions, where each basis function involves only a small subset of the domain variables. A key contribution of this paper is that it shows how the basic operations of both algorithms can be performed efficiently in closed form, by exploiting both additive and context-specific structure in a factored MDP. A central element of our algorithms is a novel linear program decomposition technique, analogous to variable elimination in Bayesian networks, which reduces an exponentially large LP to a provably equivalent, polynomial-sized one. One algorithm uses approximate linear programming, and the second approximate dynamic programming. Our dynamic programming algorithm is novel in that it uses an approximation based on max-norm, a technique that more directly minimizes the terms that appear in error bounds for approximate MDP algorithms. We provide experimental results on problems with over 10^40 states, demonstrating a promising indication of the scalability of our approach, and compare our algorithm to an existing state-of-the-art approach, showing, in some problems, exponential gains in computation time.
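The core representational idea above, a value function written as a linear combination of basis functions that each touch only a few state variables, can be sketched as follows. This is an illustrative toy, not the paper's algorithm: the variables, basis functions, and weights are all hypothetical, and no LP decomposition is shown.

```python
# Hypothetical 4-variable binary domain; V(s) is a weighted sum of basis
# functions, each depending on only a small subset of the variables.
basis = [
    (("x0",),      lambda s: 2.0 if s["x0"] else 0.0),
    (("x1", "x2"), lambda s: 1.0 if s["x1"] and s["x2"] else -0.5),
    (("x3",),      lambda s: 0.25 * s["x3"]),
]

def value(state, weights):
    """V(s) = sum_i w_i * h_i(s); each h_i reads only its own variables."""
    return sum(w * h(state) for w, (_, h) in zip(weights, basis))

state = {"x0": 1, "x1": 1, "x2": 0, "x3": 1}
print(value(state, [1.0, 1.0, 1.0]))  # 2.0 + (-0.5) + 0.25 = 1.75
```

The point of the restricted scopes is that a Bellman backup of such a V only needs to touch the variables each basis function depends on, which is what makes the closed-form operations in the paper tractable.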
Value-function approximations for partially observable Markov decision processes
 Journal of Artificial Intelligence Research
, 2000
Abstract

Cited by 128 (0 self)
Partially observable Markov decision processes (POMDPs) provide an elegant mathematical framework for modeling complex decision and planning problems in stochastic domains in which states of the system are observable only indirectly, via a set of imperfect or noisy observations. The modeling advantage of POMDPs, however, comes at a price: exact methods for solving them are computationally very expensive and thus applicable in practice only to very simple problems. We focus on efficient approximation (heuristic) methods that attempt to alleviate the computational problem and trade off accuracy for speed. We have two objectives here. First, we survey various approximation methods, analyze their properties and relations and provide some new insights into their differences. Second, we present a number of new approximation methods and novel refinements of existing techniques. The theoretical results are supported by experiments on a problem from the agent navigation domain.
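The indirect observability mentioned above is handled in POMDPs by maintaining a belief state, updated with the standard Bayesian filter b'(s') ∝ O(o | s', a) · Σ_s T(s' | s, a) b(s). A minimal sketch of that update, with a hypothetical two-state toy model:

```python
def belief_update(b, a, o, T, O, states):
    """Exact Bayesian belief update for a discrete POMDP:
    b'(s') is proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s)."""
    unnorm = {s2: O[(s2, a)].get(o, 0.0) *
                  sum(T[(s, a)].get(s2, 0.0) * b[s] for s in states)
              for s2 in states}
    z = sum(unnorm.values())  # normalizing constant Pr(o | b, a)
    return {s2: p / z for s2, p in unnorm.items()}

# Hypothetical two-state problem: the action keeps the state fixed,
# and the sensor reports the correct side only noisily.
states = ["L", "R"]
T = {("L", "stay"): {"L": 1.0}, ("R", "stay"): {"R": 1.0}}
O = {("L", "stay"): {"obsL": 0.8, "obsR": 0.2},
     ("R", "stay"): {"obsL": 0.3, "obsR": 0.7}}
b = belief_update({"L": 0.5, "R": 0.5}, "stay", "obsL", T, O, states)
print(b)  # belief shifts toward "L" after seeing obsL
```

The approximation methods the paper surveys operate on value functions over exactly these belief states, which is where the computational expense comes from.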
Hierarchical solution of Markov decision processes using macro-actions
 In Proc. of Uncertainty in Artificial Intelligence (UAI)
, 1998
Abstract

Cited by 125 (10 self)
We investigate the use of temporally abstract actions, or macro-actions, in the solution of Markov decision processes. Unlike current models that combine both primitive actions and macro-actions and leave the state space unchanged, we propose a hierarchical model (using an abstract MDP) that works with macro-actions only, and that significantly reduces the size of the state space. This is achieved by treating macro-actions as local policies that act in certain regions of state space, and by restricting states in the abstract MDP to those at the boundaries of regions. The abstract MDP approximates the original and can be solved more efficiently. We discuss several ways in which macro-actions can be generated to ensure good solution quality. Finally, we consider ways in which macro-actions can be reused to solve multiple, related MDPs; and we show that this can justify the computational overhead of macro-action generation.
Nash Q-Learning for General-Sum Stochastic Games
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2003
Abstract

Cited by 107 (0 self)
We extend Q-learning to a noncooperative multiagent context, using the framework of general-sum stochastic games. A learning agent maintains Q-functions over joint actions, and performs updates based on assuming Nash equilibrium behavior over the current Q-values. This learning protocol provably converges given certain restrictions on the stage games (defined by Q-values) that arise during learning. Experiments with a pair of two-player grid games suggest that such restrictions on the game structure are not necessarily required. Stage games encountered during learning in both grid environments violate the conditions. However, learning consistently converges in the first grid game, which has a unique equilibrium Q-function, but sometimes fails to converge in the second, which has three different equilibrium Q-functions. In a comparison of offline learning performance in both games, we find agents are more likely to reach a joint optimal path with Nash Q-learning than with a single-agent Q-learning method. When at least one agent adopts Nash Q-learning, the performance of both agents is better than using single-agent Q-learning. We have also implemented an online version of Nash Q-learning that balances exploration with exploitation, yielding improved performance.
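The update described above can be sketched in two pieces: finding a Nash equilibrium of the stage game defined by the current Q-values, and moving Q toward the reward plus the discounted equilibrium value. This sketch only enumerates pure-strategy equilibria of a 2x2 game (the paper's algorithm also handles mixed equilibria), and the coordination game below is hypothetical.

```python
def pure_nash_payoffs(Q1, Q2):
    """Pure-strategy Nash equilibria of a 2x2 stage game given each
    agent's Q-values over joint actions; returns equilibrium payoffs."""
    eqs = []
    for a1 in range(2):
        for a2 in range(2):
            # Each agent's action must be a best response to the other's.
            if Q1[a1][a2] >= Q1[1 - a1][a2] and Q2[a1][a2] >= Q2[a1][1 - a2]:
                eqs.append((Q1[a1][a2], Q2[a1][a2]))
    return eqs

def nash_q_update(q, reward, nash_value, alpha=0.1, gamma=0.95):
    """One Nash-Q update: Q <- (1 - alpha) Q + alpha (r + gamma * NashQ(s'))."""
    return (1 - alpha) * q + alpha * (reward + gamma * nash_value)

# Hypothetical coordination game: both agents prefer matching actions.
Q1 = [[2, 0], [0, 1]]
Q2 = [[2, 0], [0, 1]]
print(pure_nash_payoffs(Q1, Q2))  # both matching joint actions are equilibria
```

The multiplicity of equilibria visible even in this toy game is exactly the issue the abstract raises: convergence is clean with a unique equilibrium Q-function and can fail with several.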
Maximum margin planning
 In Proceedings of the 23rd International Conference on Machine Learning (ICML’06)
, 2006
Abstract

Cited by 105 (26 self)
Imitation learning of sequential, goal-directed behavior by standard supervised techniques is often difficult. We frame learning such behaviors as a maximum margin structured prediction problem over a space of policies. In this approach, we learn mappings from features to costs so that an optimal policy in an MDP with these costs mimics the expert’s behavior. Further, we demonstrate a simple, provably efficient approach to structured maximum margin learning, based on the subgradient method, that leverages existing fast algorithms for inference. Although the technique is general, it is particularly relevant in problems where A* and dynamic programming approaches make learning policies tractable in problems beyond the limitations of a QP formulation. We demonstrate our approach applied to route planning for outdoor mobile robots, where the behavior a designer wishes a planner to execute is often clear, while specifying cost functions that engender this behavior is a much more difficult task.
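One subgradient step of the objective described above can be sketched as: run loss-augmented inference to find the competitor path that looks best under the current costs, then shift the weights so the expert path becomes relatively cheaper. In real use the inference step is an A*/dynamic-programming planner over all paths; here a hypothetical enumerated candidate set and feature vectors stand in for it.

```python
def mmp_subgradient_step(w, expert_feat, candidate_feats, losses, lr=0.1):
    """One subgradient step for the maximum margin planning objective
    (sketch). Path cost = w . features; lower cost is better."""
    def cost(f):
        return sum(wi * fi for wi, fi in zip(w, f))
    # Loss-augmented inference: competitor minimizing cost(path) - loss(path).
    star = min(range(len(candidate_feats)),
               key=lambda i: cost(candidate_feats[i]) - losses[i])
    # Subgradient = expert features minus the competitor's features;
    # stepping against it raises the competitor's cost relative to the expert's.
    grad = [fe - fc for fe, fc in zip(expert_feat, candidate_feats[star])]
    return [wi - lr * gi for wi, gi in zip(w, grad)]

# Hypothetical 2-feature problem with two candidate paths; the expert
# path has features [1, 0] and zero loss.
w = mmp_subgradient_step([1.0, 1.0], [1.0, 0.0],
                         [[1.0, 0.0], [0.0, 1.0]], [0.0, 1.0])
print(w)  # weight on the expert's feature drops, the competitor's rises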
Labeled RTDP: Improving the convergence of real-time dynamic programming
 In ICAPS’03, 12–21
Abstract

Cited by 104 (10 self)
RTDP is a recent heuristic-search DP algorithm for solving non-deterministic planning problems with full observability. In relation to other dynamic programming methods, RTDP has two benefits: first, it does not have to evaluate the entire state space in order to deliver an optimal policy, and second, it can often deliver good policies quickly. On the other hand, RTDP's final convergence is slow. In this paper we introduce a labeling scheme into RTDP that speeds up its convergence while retaining its good anytime behavior. The idea is to label a state s as solved when the heuristic values, and thus the greedy policy defined by them, have converged over s and the states that can be reached from s with the greedy policy. While, due to the presence of cycles, these labels cannot in general be computed in a recursive, bottom-up fashion, we show nonetheless that they can be computed quite fast, and that the overhead is compensated by the recomputations avoided. In addition, when the labeling procedure cannot label a state as solved, it improves the heuristic value of a relevant state. As a result, the number of Labeled RTDP trials needed for convergence, unlike the number of RTDP trials, is bounded. From a practical point of view, Labeled RTDP (LRTDP) converges orders of magnitude faster than RTDP, and also faster than another recent heuristic-search DP algorithm, LAO*. Moreover, LRTDP often converges faster than value iteration, even with the heuristic h = 0, suggesting that LRTDP has quite general scope.
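The labeling idea above can be sketched as a reachability check over the greedy policy: a state counts as solved when it and everything it can reach greedily have a small Bellman residual. This is a simplification, the real CheckSolved procedure also handles cycles with extra bookkeeping and updates values on failure, and the chain MDP below is hypothetical.

```python
def greedy_action(V, s, mdp):
    """Action minimizing the one-step lookahead cost under V."""
    return min(mdp["actions"][s],
               key=lambda a: mdp["cost"][(s, a)] +
                             sum(p * V[v] for v, p in mdp["T"][(s, a)]))

def check_solved(V, s, mdp, eps=1e-3):
    """Label s solved iff every state reachable from s under the greedy
    policy (goals excluded) has Bellman residual below eps."""
    stack, seen, ok = [s], set(), True
    while stack:
        u = stack.pop()
        if u in seen or u in mdp["goals"]:
            continue
        seen.add(u)
        a = greedy_action(V, u, mdp)
        q = mdp["cost"][(u, a)] + sum(p * V[v] for v, p in mdp["T"][(u, a)])
        if abs(V[u] - q) >= eps:
            ok = False
        stack.extend(v for v, _ in mdp["T"][(u, a)])
    return ok

# Hypothetical 3-state chain s0 -> s1 -> g with unit action costs.
mdp = {"actions": {"s0": ["go"], "s1": ["go"]},
       "cost": {("s0", "go"): 1.0, ("s1", "go"): 1.0},
       "T": {("s0", "go"): [("s1", 1.0)], ("s1", "go"): [("g", 1.0)]},
       "goals": {"g"}}
print(check_solved({"s0": 2.0, "s1": 1.0, "g": 0.0}, "s0", mdp))  # True
print(check_solved({"s0": 0.0, "s1": 1.0, "g": 0.0}, "s0", mdp))  # False
```

Once a state is labeled, RTDP trials can terminate on reaching it, which is what bounds the number of trials.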
Decision-theoretic, high-level agent programming in the situation calculus
 In: Proc. AAAI-00, AAAI Press
, 2000
Abstract

Cited by 104 (5 self)
We propose a framework for robot programming which allows the seamless integration of explicit agent programming with decision-theoretic planning. Specifically, the DTGolog model allows one to partially specify a control program in a high-level, logical language, and provides an interpreter that, given a logical axiomatization of a domain, will determine the optimal completion of that program (viewed as a Markov decision process). We demonstrate the utility of this model with results obtained in an office delivery robotics domain.
Kernel-Based Reinforcement Learning
 Machine Learning
, 1999
Abstract

Cited by 103 (1 self)
We present a kernel-based approach to reinforcement learning that overcomes the stability problems of temporal-difference learning in continuous state spaces. First, our algorithm converges to a unique solution of an approximate Bellman equation regardless of its initialization values. Second, the method is consistent in the sense that the resulting policy converges asymptotically to the optimal policy. Parametric value function estimates such as neural networks do not possess this property. Our kernel-based approach also allows us to show that the limiting distribution of the value function estimate is a Gaussian process. This information is useful in studying the bias-variance tradeoff in reinforcement learning. We find that all reinforcement learning approaches to estimating the value function, parametric or non-parametric, are subject to a bias. This bias is typically larger in reinforcement learning than in a comparable regression problem.
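The flavor of such a method can be sketched as a kernel-weighted Bellman backup: estimate the backed-up value at a query point as a smoothed average of r + γV(x') over sampled transitions near it. The Gaussian kernel, bandwidth, and data below are hypothetical; this is a one-step illustration, not the paper's full algorithm.

```python
import math

def kernel_backup(x, samples, v_next, bandwidth=0.5, gamma=0.9):
    """Kernel-weighted approximate Bellman backup at query point x:
    a Nadaraya-Watson-style average of r + gamma * V(x') over sampled
    transitions (x_i, r_i, x'_i), weighted by a Gaussian kernel."""
    ws = [math.exp(-((x - xi) / bandwidth) ** 2) for xi, _, _ in samples]
    z = sum(ws)
    return sum(w * (r + gamma * v_next(x2))
               for w, (_, r, x2) in zip(ws, samples)) / z

# Two hypothetical samples symmetric around the query point get equal
# weight, so the backup is their plain average when v_next is zero.
samples = [(-1.0, 1.0, 0.0), (1.0, 3.0, 0.0)]
print(kernel_backup(0.0, samples, lambda x2: 0.0))  # (1 + 3) / 2 = 2.0
```

Because the backup is an averaging (non-expansive) operator on the data, iterating it converges to a unique fixed point regardless of initialization, which is the stability property the abstract emphasizes.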
Planning under continuous time and resource uncertainty: A challenge for AI
 In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence
, 2002
Abstract

Cited by 102 (16 self)
(Each experiment is assigned a scientific value.) Different observations and experiments take differing amounts of time and consume differing amounts of power and data storage. There are, in general, a number of constraints that govern the rover's activities: there are time, power, data storage, and positioning constraints for performing different activities. Time constraints often result from illumination requirements; that is, experiments may require that a target rock or sample be illuminated with a certain intensity, or from a certain angle.
Planning, learning and coordination in multiagent decision processes
 In Proceedings of the Sixth Conference on Theoretical Aspects of Rationality and Knowledge (TARK-96)
, 1996
Abstract

Cited by 96 (1 self)
There has been a growing interest in AI in the design of multiagent systems, especially in multiagent cooperative planning. In this paper, we investigate the extent to which methods from single-agent planning and learning can be applied in multiagent settings. We survey a number of different techniques from decision-theoretic planning and reinforcement learning and describe a number of interesting issues that arise with regard to coordinating the policies of individual agents. To this end, we describe multiagent Markov decision processes as a general model in which to frame this discussion. These are special n-person cooperative games in which agents share the same utility function. We discuss coordination mechanisms based on imposed conventions (or social laws) as well as learning methods for coordination. Our focus is on the decomposition of sequential decision processes so that coordination can be learned (or imposed) locally, at the level of individual states. We also discuss the use of structured problem representations and their role in the generalization of learned conventions and in approximation.