Results 11–20 of 713
Feature-Based Methods for Large Scale Dynamic Programming
Machine Learning, 1994
"... We develop a methodological framework and present a few different ways in which dynamic programming and compact representations can be Combined to solve large scale stochastic control problems. In particular, we develop algorithms that employ two types of featurebased compact representations, that ..."
Abstract

Cited by 180 (9 self)
Abstract: We develop a methodological framework and present a few different ways in which dynamic programming and compact representations can be combined to solve large scale stochastic control problems. In particular, we develop algorithms that employ two types of feature-based compact representations, that is, representations that involve an arbitrarily complex feature extraction stage and a relatively simple approximation architecture. We prove the convergence of these algorithms and provide bounds on the approximation error. We also apply one of these algorithms to produce a computer program that plays Tetris at a respectable skill level. Furthermore, we provide a counterexample illustrating the difficulties of integrating compact representations and dynamic programming, which exemplifies the shortcomings of several methods in current practice, including Q-learning and temporal-difference learning.
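To make the idea concrete, here is a minimal sketch (not the paper's exact algorithm) of fitted value iteration with a linear feature-based architecture V(s) ~ phi(s) @ w: an arbitrary feature extraction stage followed by a simple least-squares approximation step. The toy MDP, features, and iteration count are illustrative assumptions.

```python
import numpy as np

n_states, n_actions, gamma = 6, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] is a distribution over next states
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # expected immediate rewards

# Feature extraction stage: here just two hand-made features per state.
Phi = np.stack([np.arange(n_states) / n_states,
                (np.arange(n_states) % 2).astype(float)], axis=1)

w = np.zeros(Phi.shape[1])
for _ in range(200):
    V = Phi @ w
    # One exact Bellman backup on the current approximate values...
    Q = R + gamma * P @ V                 # shape (n_states, n_actions)
    target = Q.max(axis=1)
    # ...followed by least-squares projection back onto the feature space.
    w, *_ = np.linalg.lstsq(Phi, target, rcond=None)

print("approximate values:", Phi @ w)
```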
Reinforcement Learning in Continuous Time and Space
Neural Computation, 2000
"... This paper presents a reinforcement learning framework for continuoustime dynamical systems without a priori discretization of time, state, and action. Based on the HamiltonJacobiBellman (HJB) equation for infinitehorizon, discounted reward problems, we derive algorithms for estimating value f ..."
Abstract

Cited by 176 (7 self)
Abstract: This paper presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and for improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived, and their correspondences with the conventional residual gradient, TD(0), and TD(λ) algorithms are shown. For policy improvement, two methods, namely, a continuous actor-critic method and a value-gradient-based greedy policy, are formulated. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived...
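A hedged sketch of the backward-Euler update described here, with a linear value approximator; the toy dynamics, Gaussian features, and constants (tau, kappa, eta) are assumptions for illustration, not taken from the paper.

```python
import numpy as np

dt, tau, kappa, eta = 0.01, 1.0, 0.1, 0.05    # time step, discount, trace, learning rate

def phi(x):                                   # hypothetical feature map: Gaussian bumps on [0, 1]
    centers = np.linspace(0.0, 1.0, 10)
    return np.exp(-((x - centers) ** 2) / 0.02)

w = np.zeros(10)
e = np.zeros(10)                              # exponential eligibility trace
x, v_prev = 0.0, 0.0
for step in range(5000):
    x = min(1.0, x + dt * 0.2)                # assumed deterministic drift toward a goal
    r = 1.0 if x >= 1.0 else 0.0              # reward only at the goal state
    v = phi(x) @ w
    # Continuous-time TD error, backward Euler: delta = r - V/tau + (V - V_prev)/dt
    delta = r - v / tau + (v - v_prev) / dt
    e = (1.0 - dt / kappa) * e + dt * phi(x)  # discretized trace dynamics e_dot = -e/kappa + dV/dw
    w += eta * delta * e * dt
    v_prev = v
```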
Distance transforms of sampled functions
Cornell Computing and Information Science, 2004
"... This paper provides lineartime algorithms for solving a class of minimization problems involving a cost function with both local and spatial terms. These problems can be viewed as a generalization of classical distance transforms of binary images, where the binary image is replaced by an arbitrary ..."
Abstract

Cited by 173 (11 self)
Abstract: This paper provides linear-time algorithms for solving a class of minimization problems involving a cost function with both local and spatial terms. These problems can be viewed as a generalization of classical distance transforms of binary images, where the binary image is replaced by an arbitrary sampled function. Alternatively, they can be viewed in terms of the minimum convolution of two functions, which is an important operation in grayscale morphology. A useful consequence of our techniques is a simple, fast method for computing the Euclidean distance transform of a binary image. The methods are also applicable to Viterbi decoding, belief propagation, and optimal control.
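The 1-D core of the method can be sketched as follows; this is a straightforward transcription of the lower-envelope-of-parabolas idea, not the authors' reference code.

```python
INF = 1e12   # large finite stand-in for +infinity keeps the arithmetic safe

def dt_1d(f):
    """d[p] = min_q ((p - q)**2 + f[q]): 1-D squared-distance transform of a
    sampled function via the lower envelope of parabolas."""
    n = len(f)
    v = [0] * n                  # indices of parabolas forming the lower envelope
    z = [-INF] + [INF] * n       # boundaries between adjacent envelope parabolas
    k = 0
    for q in range(1, n):
        while True:
            # Horizontal position where the parabola from q overtakes v[k].
            s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2 * q - 2 * v[k])
            if s <= z[k]:
                k -= 1           # parabola v[k] never touches the envelope
            else:
                break
        k += 1
        v[k] = q
        z[k], z[k + 1] = s, INF
    d, k = [0.0] * n, 0
    for p in range(n):
        while z[k + 1] < p:
            k += 1
        d[p] = (p - v[k]) ** 2 + f[v[k]]
    return d

# Binary-image special case: f is 0 on "on" pixels and INF elsewhere.
print(dt_1d([INF, 0.0, INF, INF, 0.0]))   # -> [1.0, 0.0, 1.0, 1.0, 0.0]
```

The 2-D squared Euclidean distance transform then follows by applying this pass to each row and afterwards to each column of the result.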
A tutorial on the cross-entropy method
Annals of Operations Research, 2005
"... Abstract: The crossentropy method is a recent versatile Monte Carlo technique. This article provides a brief introduction to the crossentropy method and discusses how it can be used for rareevent probability estimation and for solving combinatorial, continuous, constrained and noisy optimization ..."
Abstract

Cited by 173 (18 self)
Abstract: The cross-entropy method is a recent, versatile Monte Carlo technique. This article provides a brief introduction to the cross-entropy method and discusses how it can be used for rare-event probability estimation and for solving combinatorial, continuous, constrained, and noisy optimization problems. A comprehensive list of references on cross-entropy methods and applications is included.
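For the continuous-optimization case, a minimal sketch with a Gaussian sampling distribution and an elite-fraction update; the objective, population size, and elite fraction are illustrative assumptions.

```python
import numpy as np

def ce_minimize(objective, dim, iters=50, pop=100, elite_frac=0.1, seed=0):
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim) * 5.0
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, dim))
        scores = np.apply_along_axis(objective, 1, samples)
        elites = samples[np.argsort(scores)[:n_elite]]      # best-scoring samples
        # Cross-entropy update: refit the sampling distribution to the elites.
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-8
    return mu

# Example: minimize a shifted sphere function.
print(ce_minimize(lambda x: np.sum((x - 3.0) ** 2), dim=2))
```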
LAO*: A heuristic search algorithm that finds solutions with loops
2001
"... Classic heuristic search algorithms can find solutions that take the form of a simple path (A*), a tree, or an acyclic graph (AO*). In this paper, we describe a novel generalization of heuristic search, called LAO*, that can find solutions with loops. We show that LAO* can be used to solve Markov de ..."
Abstract

Cited by 170 (18 self)
Abstract: Classic heuristic search algorithms can find solutions that take the form of a simple path (A*), a tree, or an acyclic graph (AO*). In this paper, we describe a novel generalization of heuristic search, called LAO*, that can find solutions with loops. We show that LAO* can be used to solve Markov decision problems and that it shares the advantage heuristic search has over dynamic programming for other classes of problems. Given a start state, it can find an optimal solution without evaluating the entire state space.
Keywords: Heuristic search; Dynamic programming; Markov decision problems
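A much-simplified sketch of the LAO* loop: expand a fringe state of the current best partial solution graph, then redo dynamic programming. For brevity this version re-runs value iteration over the whole expanded envelope, whereas the paper's algorithm can restrict updates to ancestors of the newly expanded state; the tiny looping MDP and heuristic below are assumptions.

```python
GOAL = "g"
# actions[s] -> list of (cost, [(prob, next_state), ...]); loops are allowed.
actions = {
    "s": [(1.0, [(0.5, "a"), (0.5, "s")]),       # may loop back to s
          (3.0, [(1.0, "g")])],
    "a": [(1.0, [(1.0, "g")])],
}
h = {"s": 1.0, "a": 0.5, "g": 0.0}               # admissible heuristic

V = dict(h)                                      # fringe states are valued by h
expanded = set()

def q_value(act):
    cost, outcomes = act
    return cost + sum(p * V[t] for p, t in outcomes)

def best_action(s):
    return min(actions[s], key=q_value)

while True:
    # Walk the current best partial solution graph looking for fringe states.
    frontier, seen, fringe = ["s"], set(), None
    while frontier:
        s = frontier.pop()
        if s == GOAL or s in seen:
            continue
        seen.add(s)
        if s not in expanded:
            fringe = s
            break
        frontier.extend(t for _, t in best_action(s)[1])
    if fringe is None:
        break                                    # best solution graph fully expanded
    expanded.add(fringe)
    # Dynamic-programming step: value iteration over all expanded states.
    for _ in range(1000):
        delta = 0.0
        for s in expanded:
            new = q_value(best_action(s))
            delta = max(delta, abs(new - V[s]))
            V[s] = new
        if delta < 1e-9:
            break

print("V(s) =", V["s"])
```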
Value-function approximations for partially observable Markov decision processes
Journal of Artificial Intelligence Research, 2000
"... Partially observable Markov decision processes (POMDPs) provide an elegant mathematical framework for modeling complex decision and planning problems in stochastic domains in which states of the system are observable only indirectly, via a set of imperfect or noisy observations. The modeling advanta ..."
Abstract

Cited by 168 (1 self)
Abstract: Partially observable Markov decision processes (POMDPs) provide an elegant mathematical framework for modeling complex decision and planning problems in stochastic domains in which states of the system are observable only indirectly, via a set of imperfect or noisy observations. The modeling advantage of POMDPs, however, comes at a price: exact methods for solving them are computationally very expensive and thus applicable in practice only to very simple problems. We focus on efficient approximation (heuristic) methods that attempt to alleviate the computational problem and trade off accuracy for speed. We have two objectives here. First, we survey various approximation methods, analyze their properties and relations, and provide some new insights into their differences. Second, we present a number of new approximation methods and novel refinements of existing techniques. The theoretical results are supported by experiments on a problem from the agent navigation domain.
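One of the simplest approximations in this family is the QMDP heuristic: solve the underlying fully observable MDP exactly, then score actions against the current belief. A sketch, with an assumed two-state model:

```python
import numpy as np

n_s, n_a, gamma = 2, 2, 0.95
P = np.array([[[0.9, 0.1], [0.2, 0.8]],          # P[s, a, s']
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0],                        # R[s, a]
              [0.0, 1.0]])

# Value iteration on the fully observable MDP.
Q = np.zeros((n_s, n_a))
for _ in range(1000):
    Q_new = R + gamma * P @ Q.max(axis=1)
    if np.abs(Q_new - Q).max() < 1e-10:
        break
    Q = Q_new

def qmdp_action(belief):
    """Approximation: pretend all state uncertainty vanishes after one
    step, so the belief-state Q-value is just belief @ Q."""
    return int(np.argmax(belief @ Q))

print(qmdp_action(np.array([0.7, 0.3])))         # favors the action good in state 0
```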
Transmission with energy harvesting nodes in fading wireless channels: Optimal policies
IEEE Journal on Selected Areas in Communications, 2011
"... Wireless systems comprised of rechargeable nodes have a significantly prolonged lifetime and are sustainable. A distinct characteristic of these systems is the fact that the nodes can harvest energy throughout the duration in which communication takes place. As such, transmission policies of the nod ..."
Abstract

Cited by 168 (43 self)
Abstract: Wireless systems composed of rechargeable nodes have a significantly prolonged lifetime and are sustainable. A distinct characteristic of these systems is that the nodes can harvest energy throughout the duration in which communication takes place. As such, transmission policies of the nodes need to adapt to these harvested energy arrivals. In this paper, we consider optimization of point-to-point data transmission with an energy harvesting transmitter that has a limited battery capacity, communicating over a wireless fading channel. We consider two objectives: maximizing the throughput by a deadline, and minimizing the transmission completion time of the communication session. We optimize these objectives by controlling the time sequence of transmit powers subject to energy storage capacity and causality constraints. We first study optimal offline policies. We introduce a directional water-filling algorithm which provides a simple and concise interpretation of the necessary optimality conditions. We show the optimality of an adaptive directional water-filling algorithm for the throughput maximization problem. We solve the transmission completion time minimization problem by utilizing its equivalence to its throughput maximization counterpart. Next, we consider online policies. We use stochastic dynamic programming to solve for the optimal online policy that maximizes the average number of bits delivered by a deadline under stochastic fading and energy arrival processes with causal channel state feedback. We also propose near-optimal policies with reduced complexity and numerically study their performance, along with that of the offline and online optimal policies, under various configurations.
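The directional algorithm generalizes classic water-filling; a sketch of that building block gives the flavor (single energy budget, no arrival causality, assumed channel gains), deliberately omitting what makes the paper's version "directional":

```python
import numpy as np

def waterfill(h, E, iters=200):
    """Bisect on the water level nu so that sum_i max(0, nu - 1/h_i) = E."""
    inv = 1.0 / h
    lo, hi = 0.0, E + inv.max()          # hi guarantees the allocation exceeds E
    for _ in range(iters):
        nu = 0.5 * (lo + hi)
        p = np.maximum(0.0, nu - inv)
        lo, hi = (nu, hi) if p.sum() < E else (lo, nu)
    return np.maximum(0.0, 0.5 * (lo + hi) - inv)

h = np.array([2.0, 0.5, 1.0, 4.0])       # assumed channel gains per slot
p = waterfill(h, E=3.0)
print(p, "throughput:", np.sum(np.log2(1.0 + h * p)))
```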
Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms
Machine Learning, 1998
"... An important application of reinforcement learning (RL) is to finitestate control problems and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform e ..."
Abstract

Cited by 154 (7 self)
Abstract: An important application of reinforcement learning (RL) is to finite-state control problems, and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning that result in convergence to both optimal values and optimal policies.
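A representative single-step on-policy algorithm of the kind analyzed here is SARSA(0) with a decaying epsilon-greedy exploration schedule; the toy chain MDP and the specific schedules below are illustrative assumptions.

```python
import numpy as np

n_s, n_a, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
Q = np.zeros((n_s, n_a))
visits = np.zeros((n_s, n_a))

def step(s, a):
    """Assumed chain dynamics: action 1 moves right, action 0 resets."""
    s2 = min(s + 1, n_s - 1) if a == 1 else 0
    return s2, (1.0 if s2 == n_s - 1 else 0.0)

def policy(s, t):
    eps = 1.0 / (1.0 + 0.01 * t)               # decaying exploration
    return int(rng.integers(n_a)) if rng.random() < eps else int(np.argmax(Q[s]))

s, a = 0, policy(0, 0)
for t in range(1, 100_000):
    s2, r = step(s, a)
    a2 = policy(s2, t)                          # on-policy: next action from the same policy
    visits[s, a] += 1
    alpha = 1.0 / visits[s, a]                  # Robbins-Monro step sizes
    Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])
    s, a = s2, a2

print(np.argmax(Q, axis=1))                     # learned greedy policy
```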
The MAXQ Method for Hierarchical Reinforcement Learning
Proceedings of the Fifteenth International Conference on Machine Learning, 1998
"... This paper presents a new approach to hierarchical reinforcement learning based on the MAXQ decomposition of the value function. The MAXQ decomposition has both a procedural semanticsas a subroutine hierarchyand a declarative semanticsas a representation of the value function of a hierarchi ..."
Abstract

Cited by 146 (5 self)
Abstract: This paper presents a new approach to hierarchical reinforcement learning based on the MAXQ decomposition of the value function. The MAXQ decomposition has both a procedural semantics, as a subroutine hierarchy, and a declarative semantics, as a representation of the value function of a hierarchical policy. MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, Kaelbling, and Dayan and Hinton. Conditions under which the MAXQ decomposition can represent the optimal value function are derived. The paper defines a hierarchical Q-learning algorithm, proves its convergence, and shows experimentally that it can learn much faster than ordinary "flat" Q-learning. Finally, the paper discusses some interesting issues that arise in hierarchical reinforcement learning, including the hierarchical credit assignment problem and non-hierarchical execution of the MAXQ hierarchy.
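The decomposition itself is compact: Q(i, s, a) = V(a, s) + C(i, s, a), with V of a composite task taken as the max over its child tasks. A tiny evaluation sketch; the hierarchy and "learned" tables are assumed, and the learning algorithm is omitted.

```python
children = {"root": ["go_left", "go_right"]}      # composite task -> child tasks
V_prim = {"go_left": {0: 0.0, 1: 1.0},            # learned primitive values V(a, s)
          "go_right": {0: 1.0, 1: 0.0}}
C = {("root", 0, "go_left"): 0.5,                 # learned completion values C(i, s, a)
     ("root", 0, "go_right"): 0.0,
     ("root", 1, "go_left"): 0.0,
     ("root", 1, "go_right"): 0.5}

def V(task, s):
    if task not in children:                      # primitive action
        return V_prim[task][s]
    return max(Q(task, s, a) for a in children[task])

def Q(task, s, a):
    # Declarative semantics: value of doing a in s, then completing task.
    return V(a, s) + C[(task, s, a)]

print(V("root", 0))   # 1.0: go_right earns 1.0 now, completion adds 0.0
```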
AntNet: A Mobile Agents Approach to Adaptive Routing
1997
"... This paper introduces AntNet, a new routing algorithm for communications networks. AntNet is an adaptive, distributed, mobileagentsbased algorithm whichwas inspired by recentwork on the ant colony metaphor. We apply AntNet to a datagram network and compare it with both static and adaptive stateof ..."
Abstract

Cited by 134 (6 self)
Abstract: This paper introduces AntNet, a new routing algorithm for communications networks. AntNet is an adaptive, distributed, mobile-agents-based algorithm which was inspired by recent work on the ant colony metaphor. We apply AntNet to a datagram network and compare it with both static and adaptive state-of-the-art routing algorithms. We ran experiments for various paradigmatic temporal and spatial traffic distributions. AntNet showed both very good performance and robustness under all the experimental conditions with respect to its competitors.
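The heart of AntNet is a per-destination probabilistic routing table that backward ants reinforce. A simplified sketch; the topology and reinforcement constant are assumptions, and the real algorithm scales reinforcement by observed trip-time statistics rather than using a fixed constant.

```python
import random

neighbors = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
# table[node][dest] -> {neighbor: probability of choosing it as next hop}
table = {n: {d: {nb: 1.0 / len(neighbors[n]) for nb in neighbors[n]}
             for d in neighbors if d != n} for n in neighbors}

def forward_ant(src, dst):
    """Stochastically walk the routing tables from src to dst, recording the path."""
    path, node = [src], src
    while node != dst and len(path) < 20:
        probs = table[node][dst]
        node = random.choices(list(probs), weights=list(probs.values()))[0]
        path.append(node)
    return path

def backward_ant(path, reinforcement=0.3):
    """Walk the path back, boosting the neighbor that led toward the destination."""
    dst = path[-1]
    for i in range(len(path) - 2, -1, -1):
        node, chosen = path[i], path[i + 1]
        if node == dst:
            continue
        probs = table[node][dst]
        for nb in probs:                      # boost chosen, decay others; sum stays 1
            if nb == chosen:
                probs[nb] += reinforcement * (1.0 - probs[nb])
            else:
                probs[nb] -= reinforcement * probs[nb]

for _ in range(200):
    p = forward_ant("A", "D")
    if p[-1] == "D":                          # only successful ants reinforce
        backward_ant(p)
print(table["A"]["D"])                        # probabilities shift toward good next hops
```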