Results 1 - 10
of
214
Decision-Theoretic Planning: Structural Assumptions and Computational Leverage
- JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH
, 1999
"... Planning under uncertainty is a central problem in the study of automated sequential decision making, and has been addressed by researchers in many different fields, including AI planning, decision analysis, operations research, control theory and economics. While the assumptions and perspectives ..."
Abstract
-
Cited by 342 (3 self)
- Add to MetaCart
Planning under uncertainty is a central problem in the study of automated sequential decision making, and has been addressed by researchers in many different fields, including AI planning, decision analysis, operations research, control theory and economics. While the assumptions and perspectives adopted in these areas often differ in substantial ways, many planning problems of interest to researchers in these fields can be modeled as Markov decision processes (MDPs) and analyzed using the techniques of decision theory. This paper presents an overview and synthesis of MDP-related methods, showing how they provide a unifying framework for modeling many classes of planning problems studied in AI. It also describes structural properties of MDPs that, when exhibited by particular classes of problems, can be exploited in the construction of optimal or approximately optimal policies or plans. Planning problems commonly possess structure in the reward and value functions used to de...
Synchronization and linearity : an algebra for discrete event systems. Wiley Series in probability and mathematical statistics
, 1992
"... The first edition of this book was published in 1992 by Wiley (ISBN 0 471 93609 X). Since this book is now out of print, and to answer the request of several colleagues, the authors have decided to make it available freely on the Web, while retaining the copyright, for the benefit of the scientific ..."
Abstract
-
Cited by 203 (8 self)
- Add to MetaCart
The first edition of this book was published in 1992 by Wiley (ISBN 0 471 93609 X). Since this book is now out of print, and to answer the request of several colleagues, the authors have decided to make it available freely on the Web, while retaining the copyright, for the benefit of the scientific community. Copyright Statement This electronic document is in PDF format. One needs Acrobat Reader (available freely for most platforms from the Adobe web site) to benefit from the full interactive machinery: using the package hyperref by Sebastian Rahtz, the table of contents and all LATEX cross-references are automatically converted into clickable hyperlinks, bookmarks are generated automatically, etc.. So, do not hesitate to click on references to equation or section numbers, on items of thetableofcontents and of the index, etc.. One may freely use and print this document for one’s own purpose or even distribute it freely, but not commercially, provided it is distributed in its entirety and without modifications, including this preface and copyright statement. Any use of thecontents should be acknowledged according to the standard scientific practice. The
PEGASUS: A policy search method for large MDPs and POMDPs
- In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence
, 2000
"... We propose a new approach to the problem of searching a space of policies for a Markov decision process (MDP) or a partially observable Markov decision process (POMDP), given a model. Our approach is based on the following observation: Any (PO)MDP can be transformed into an "equivalent" POMDP ..."
Abstract
-
Cited by 168 (7 self)
- Add to MetaCart
We propose a new approach to the problem of searching a space of policies for a Markov decision process (MDP) or a partially observable Markov decision process (POMDP), given a model. Our approach is based on the following observation: Any (PO)MDP can be transformed into an "equivalent" POMDP in which all state transitions (given the current state and action) are deterministic. This reduces the general problem of policy search to one in which we need only consider POMDPs with deterministic transitions. We give a natural way of estimating the value of all policies in these transformed POMDPs. Policy search is then simply performed by searching for a policy with high estimated value. We also establish conditions under which our value estimates will be good, recovering theoretical results similar to those of Kearns, Mansour and Ng [7], but with "sample complexity" bounds that have only a polynomial rather than exponential dependence on the horizon time. Our method appl...
Planning Under Time Constraints in Stochastic Domains
- ARTIFICIAL INTELLIGENCE
, 1993
"... We provide a method, based on the theory of Markov decision processes, for efficient planning in stochastic domains. Goals are encoded as reward functions, expressing the desirability of each world state; the planner must find a policy (mapping from states to actions) that maximizes future reward ..."
Abstract
-
Cited by 150 (17 self)
- Add to MetaCart
We provide a method, based on the theory of Markov decision processes, for efficient planning in stochastic domains. Goals are encoded as reward functions, expressing the desirability of each world state; the planner must find a policy (mapping from states to actions) that maximizes future rewards. Standard goals of achievement, as well as goals of maintenance and prioritized combinations of goals, can be specified in this way. An optimal policy can be found using existing methods, but these methods require time at best polynomial in the number of states in the domain, where the number of states is exponential in the number of propositions (or state variables). By using information about the starting state, the reward function, and the transition probabilities of the domain, we restrict the planner's attention to a set of world states that are likely to be encountered in satisfying the goal. Using this restricted set of states, the planner can generate more or less complete ...
Asynchronous Stochastic Approximation and Q-Learning
- Machine Learning
, 1994
"... Abstract. We provide some general results on the convergence of a class of stochastic approximation algorithms and their parallel and asynchronous variants. We then use these results to study the Q-learning algorithm, a reinforcement learning method for solving Markov decision problems, and establis ..."
Abstract
-
Cited by 129 (3 self)
- Add to MetaCart
Abstract. We provide some general results on the convergence of a class of stochastic approximation algorithms and their parallel and asynchronous variants. We then use these results to study the Q-learning algorithm, a reinforcement learning method for solving Markov decision problems, and establish its convergence under conditions more general than previously available. Keywords: Reinforcement learning, Q-learning, dynamic programming, stochastic approximation 1.
Minimizing the Average Cost of Paging Under Delay Constraints
- Wireless Networks
, 1995
"... Efficient paging procedures help minimize the amount of bandwidth expended in locating a mobile unit. Given a probability distribution on user location, it is shown that the optimal paging strategy which minimizes the expected number of locations polled E[L] is to query each location sequentially in ..."
Abstract
-
Cited by 96 (12 self)
- Add to MetaCart
Efficient paging procedures help minimize the amount of bandwidth expended in locating a mobile unit. Given a probability distribution on user location, it is shown that the optimal paging strategy which minimizes the expected number of locations polled E[L] is to query each location sequentially in order of decreasing probability. However, since sequential search over many locations may impose unacceptable polling delay, D, optimal paging subject to delay constraints is considered. It is shown that substantial reductions in E[L] can be had even after moderate constraints are imposed on acceptable D (i.e., D 3). Since all methods of mobility management eventually reduce to considering a time-varying probability distribution on user location, this work should be applicable to a wide range of problems in the area. most notably those with additive cost structures. 1 Introduction Paging and registration are necessary features of wireless communication networks because user locations va...
Planning Under Uncertainty: Structural Assumptions and Computational Leverage
- In Proceedings of the Second European Workshop on Planning
, 1996
"... The problem of planning under uncertainty has been addressed by researchers in many different fields, adopting rather different perspectives on the problem. Unfortunately, these researchers are not always aware of the relationships among these different problem formulations, often resulting in confu ..."
Abstract
-
Cited by 81 (8 self)
- Add to MetaCart
The problem of planning under uncertainty has been addressed by researchers in many different fields, adopting rather different perspectives on the problem. Unfortunately, these researchers are not always aware of the relationships among these different problem formulations, often resulting in confusion and duplicated effort. Many probabilistic planning or decision making problems can be characterized as a class of Markov decision processes that allow for significant compression in representing the underlying system dynamics. It is for this class of problems that we as experts in intensional representations are advantageously positioned to contribute efficient solution methods. This paper provides a general characterization of the representational requirements for this class of problems, and we describe how to achieve computational leverage using representations that make different types of dependency information explicit. Keywords: decision-theoretic planning, action repr...
Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
, 1996
"... This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dyna ..."
Abstract
-
Cited by 80 (12 self)
- Add to MetaCart
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies, and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
Kernel-Based Reinforcement Learning
- Machine Learning
, 1999
"... We present a kernel-based approach to reinforcement learning that overcomes the stability problems of temporal-difference learning in continuous state-spaces. First, our algorithm converges to a unique solution of an approximate Bellman's equation regardless of its initialization values. Second, the ..."
Abstract
-
Cited by 79 (1 self)
- Add to MetaCart
We present a kernel-based approach to reinforcement learning that overcomes the stability problems of temporal-difference learning in continuous state-spaces. First, our algorithm converges to a unique solution of an approximate Bellman's equation regardless of its initialization values. Second, the method is consistent in the sense that the resulting policy converges asymptotically to the optimal policy. Parametric value function estimates such as neural networks do not possess this property. Our kernel-based approach also allows us to show that the limiting distribution of the value function estimate is a Gaussian process. This information is useful in studying the bias-variance tradeo in reinforcement learning. We find that all reinforcement learning approaches to estimating the value function, parametric or non-parametric, are subject to a bias. This bias is typically larger in reinforcement learning than in a comparable regression problem.

