Results 11-20 of 557
Value-function approximations for partially observable Markov decision processes
 Journal of Artificial Intelligence Research
, 2000
Cited by 136 (1 self)
Partially observable Markov decision processes (POMDPs) provide an elegant mathematical framework for modeling complex decision and planning problems in stochastic domains in which states of the system are observable only indirectly, via a set of imperfect or noisy observations. The modeling advantage of POMDPs, however, comes at a price: exact methods for solving them are computationally very expensive and thus applicable in practice only to very simple problems. We focus on efficient approximation (heuristic) methods that attempt to alleviate the computational problem and trade off accuracy for speed. We have two objectives here. First, we survey various approximation methods, analyze their properties and relations and provide some new insights into their differences. Second, we present a number of new approximation methods and novel refinements of existing techniques. The theoretical results are supported by experiments on a problem from the agent navigation domain.
A Survey of Computational Complexity Results in Systems and Control
, 2000
Cited by 133 (20 self)
The purpose of this paper is twofold: (a) to provide a tutorial introduction to some key concepts from the theory of computational complexity, highlighting their relevance to systems and control theory, and (b) to survey the relatively recent research activity lying at the interface between these fields. We begin with a brief introduction to models of computation, the concepts of undecidability, polynomial-time algorithms, NP-completeness, and the implications of intractability results. We then survey a number of problems that arise in systems and control theory, some of them classical, some of them related to current research. We discuss them from the point of view of computational complexity and also point out many open problems. In particular, we consider problems related to stability or stabilizability of linear systems with parametric uncertainty, robust control, time-varying linear systems, nonlinear and hybrid systems, and stochastic optimal control.
The MAXQ Method for Hierarchical Reinforcement Learning
 In Proceedings of the Fifteenth International Conference on Machine Learning
, 1998
Cited by 127 (4 self)
This paper presents a new approach to hierarchical reinforcement learning based on the MAXQ decomposition of the value function. The MAXQ decomposition has both a procedural semantics (as a subroutine hierarchy) and a declarative semantics (as a representation of the value function of a hierarchical policy). MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, Kaelbling, and Dayan and Hinton. Conditions under which the MAXQ decomposition can represent the optimal value function are derived. The paper defines a hierarchical Q learning algorithm, proves its convergence, and shows experimentally that it can learn much faster than ordinary "flat" Q learning. Finally, the paper discusses some interesting issues that arise in hierarchical reinforcement learning including the hierarchical credit assignment problem and non-hierarchical execution of the MAXQ hierarchy. 1 Introduction Hierarchical approaches to reinforcement learning (RL) problems promise ma...
Distance transforms of sampled functions
 Cornell Computing and Information Science
, 2004
Cited by 125 (11 self)
This paper provides linear-time algorithms for solving a class of minimization problems involving a cost function with both local and spatial terms. These problems can be viewed as a generalization of classical distance transforms of binary images, where the binary image is replaced by an arbitrary sampled function. Alternatively they can be viewed in terms of the minimum convolution of two functions, which is an important operation in grayscale morphology. A useful consequence of our techniques is a simple, fast method for computing the Euclidean distance transform of a binary image. The methods are also applicable to Viterbi decoding, belief propagation and optimal control.
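As a rough illustration of the kind of problem this abstract describes, the sketch below computes a one-dimensional distance transform of a sampled function, D(p) = min over q of ((p - q)^2 + f(q)), in linear time by maintaining the lower envelope of parabolas. The squared-distance spatial term, the function name distance_transform_1d, and the use of a large finite value in place of infinity are assumptions made for this example; the abstract itself does not fix these details.

```python
import math


def distance_transform_1d(f):
    """Return d with d[p] = min_q ((p - q)**2 + f[q]) for a sampled function f.

    Assumes every f[q] is finite (use a large number such as 1e9 instead of
    infinity for "background" samples), otherwise the intersection formula
    below would produce NaN.
    """
    n = len(f)
    if n == 0:
        return []
    v = [0] * n            # sample positions of parabolas in the lower envelope
    z = [0.0] * (n + 1)    # z[k]..z[k+1] is the range where parabola v[k] is lowest
    k = 0
    z[0], z[1] = -math.inf, math.inf
    for q in range(1, n):
        while True:
            p = v[k]
            # Horizontal position where parabolas rooted at p and q intersect.
            s = ((f[q] + q * q) - (f[p] + p * p)) / (2 * q - 2 * p)
            if s <= z[k]:
                k -= 1     # parabola p is hidden by q; drop it and retry
            else:
                break
        k += 1
        v[k] = q
        z[k] = s
        z[k + 1] = math.inf
    d = [0.0] * n
    k = 0
    for q in range(n):
        while z[k + 1] < q:  # advance to the envelope segment containing q
            k += 1
        d[q] = (q - v[k]) ** 2 + f[v[k]]
    return d
```

For a binary signal, setting f to 0.0 at "on" samples and to a large finite value elsewhere yields squared distances to the nearest "on" sample; a two-dimensional transform can be obtained by applying the same routine along rows and then columns.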
Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms
 Machine Learning
, 1998
Cited by 124 (7 self)
An important application of reinforcement learning (RL) is to finite-state control problems and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning that result in convergence to both optimal values and optimal policies.
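To make "single-step on-policy with decaying exploration" concrete, here is a minimal sketch using SARSA(0), a well-known algorithm of this type; the abstract does not name a specific algorithm, so treat the choice as an illustration. The five-state chain environment, the epsilon = 1/episode schedule, the step cap, and all parameter values are assumptions invented for this example.

```python
import random


def sarsa_chain(episodes=500, n_states=5, alpha=0.5, gamma=0.9, seed=0):
    """Tabular SARSA(0) on a hypothetical chain: start at state 0, reward 1 at the right end."""
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]  # actions: 0 = left, 1 = right

    def choose(s, eps):
        # Epsilon-greedy with random tie-breaking.
        if rng.random() < eps or q[s][0] == q[s][1]:
            return rng.randrange(2)
        return 0 if q[s][0] > q[s][1] else 1

    for episode in range(1, episodes + 1):
        eps = 1.0 / episode                    # decaying exploration (assumed schedule)
        s = 0
        a = choose(s, eps)
        for _ in range(1000):                  # step cap so every episode ends
            s2 = max(0, s - 1) if a == 0 else s + 1
            if s2 == goal:
                # Terminal transition: reward 1, no bootstrap term.
                q[s][a] += alpha * (1.0 - q[s][a])
                break
            a2 = choose(s2, eps)               # next action comes from the same policy
            # On-policy update: bootstrap from the action actually selected.
            q[s][a] += alpha * (gamma * q[s2][a2] - q[s][a])
            s, a = s2, a2
    return q
```

Because the update bootstraps from the action the exploring policy actually takes, exploration and learning are coupled, which is the difficulty the abstract points to.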
A tutorial on the cross-entropy method
 Annals of Operations Research
, 2005
Cited by 121 (16 self)
The cross-entropy method is a recent versatile Monte Carlo technique. This article provides a brief introduction to the cross-entropy method and discusses how it can be used for rare-event probability estimation and for solving combinatorial, continuous, constrained and noisy optimization problems. A comprehensive list of references on cross-entropy methods and applications is included.
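For the continuous-optimization use mentioned in this abstract, a minimal sketch of the cross-entropy method is given below: sample candidates from a Gaussian, keep the best-scoring "elite" fraction, and refit the Gaussian to the elite. The one-dimensional setting, the function name cross_entropy_minimize, and every default parameter are assumptions chosen for this example, not values taken from the tutorial.

```python
import random
import statistics


def cross_entropy_minimize(objective, mean=0.0, std=5.0, n_samples=100,
                           n_elite=10, iterations=30, seed=0):
    """Minimize a 1-D objective by iteratively refitting a Gaussian to elite samples."""
    rng = random.Random(seed)
    for _ in range(iterations):
        # Draw candidates from the current sampling distribution.
        samples = [rng.gauss(mean, std) for _ in range(n_samples)]
        # Keep the n_elite candidates with the lowest objective value.
        elite = sorted(samples, key=objective)[:n_elite]
        # Refit the Gaussian parameters to the elite set.
        mean = statistics.fmean(elite)
        std = statistics.pstdev(elite)
        if std < 1e-9:      # distribution has collapsed; further sampling is pointless
            break
    return mean
```

A call such as cross_entropy_minimize(lambda x: (x - 3.0) ** 2) should return a value close to 3. In practice the parameter updates are often smoothed to reduce the risk of premature collapse; that refinement is omitted here for brevity.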
AntNet: A Mobile Agents Approach to Adaptive Routing
, 1997
Cited by 119 (7 self)
This paper introduces AntNet, a new routing algorithm for communications networks. AntNet is an adaptive, distributed, mobile-agents-based algorithm which was inspired by recent work on the ant colony metaphor. We apply AntNet to a datagram network and compare it with both static and adaptive state-of-the-art routing algorithms. We ran experiments for various paradigmatic temporal and spatial traffic distributions. AntNet showed both very good performance and robustness under all the experimental conditions with respect to its competitors.
Optimal Stopping of Markov Processes: Hilbert Space Theory, Approximation Algorithms, and an Application to Pricing High-Dimensional Financial Derivatives
 IEEE Transactions on Automatic Control
, 1997
Cited by 98 (6 self)
We develop a theory characterizing optimal stopping times for discrete-time ergodic Markov processes with discounted rewards. The theory differs from prior work by its view of per-stage and terminal reward functions as elements of a certain Hilbert space. In addition to a streamlined analysis establishing existence and uniqueness of a solution to Bellman's equation, this approach provides an elegant framework for the study of approximate solutions. In particular, we propose a stochastic approximation algorithm that tunes weights of a linear combination of basis functions in order to approximate a value function. We prove that this algorithm converges (almost surely) and that the limit of convergence has some desirable properties. We discuss how variations on this line of analysis can be used to develop similar results for other classes of optimal stopping problems, including those involving independent increment processes, finite horizons, and two-player zero-sum games. We illustrate...
Bayesian Learning in Negotiation
, 1996
Cited by 89 (8 self)
Recent growing interest in autonomous interacting software agents and their potential application in areas such as electronic commerce [Sandholm & Lesser 1995] has given increased importance to automated negotiation. Much DAI and game theoretic research [Rosenschein & Zlotkin 1994; Osborne & Rubinstein 1994] deals with coordination and negotiation issues by giving precomputed solutions to specific problems. There has been much research reported on developing theoretical models in which learning plays an eminent role, especially in the area of adaptive dynamics of games (e.g., [Jordan 1992; Kalai & Lehrer 1993]). However, to build autonomous agents that improve their negotiation competence based on learning from their interactions with other agents is still an emerging area. We are interested in developing autonomous agents capable of reasoning based on experience and improving their negotiation behavior incrementally. Learning in negotiation is closely coupled with...
Simulation-Based Optimization of Markov Reward Processes
 IEEE Transactions on Automatic Control
, 1998
Cited by 82 (1 self)
We propose a simulation-based algorithm for optimizing the average reward in a Markov Reward Process that depends on a set of parameters. As a special case, the method applies to Markov Decision Processes where optimization takes place within a parametrized set of policies. The algorithm involves the simulation of a single sample path, and can be implemented online. A convergence result (with probability 1) is provided.