Results 1–10 of 23
Least Squares Policy Evaluation Algorithms With Linear Function Approximation: Theory and Applications
, 2002
Abstract

Cited by 65 (9 self)
We consider policy evaluation algorithms within the context of infinite-horizon dynamic programming problems with discounted cost. We focus on discrete-time dynamic systems with a large number of states, and we discuss two methods, which use simulation, temporal differences, and linear cost function approximation. The first method is a new gradient-like algorithm involving least-squares subproblems and a diminishing stepsize, which is based on the λ-policy iteration method of Bertsekas and Ioffe. The second method is the LSTD(λ) algorithm recently proposed by Boyan, which for λ = 0 coincides with the linear least-squares temporal-difference algorithm of Bradtke and Barto. At present, there is only a convergence result by Bradtke and Barto for the LSTD(0) algorithm. Here, we strengthen this result by showing the convergence of LSTD(λ), with probability 1, for every λ ∈ [0, 1].
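The batch LSTD(0) estimator referenced in this abstract can be sketched compactly. The following is a minimal illustration (not the authors' implementation): given a batch of transitions and a feature map, it accumulates and solves the linear system that characterizes the LSTD(0) fixed point.

```python
import numpy as np

def lstd0(transitions, phi, gamma=0.9):
    """Least-squares TD(0) sketch: solve A w = b from a batch of transitions.

    transitions: list of (s, r, s_next); phi: maps a state to a feature vector.
    Illustrative only; the function and argument names are assumptions.
    """
    k = len(phi(transitions[0][0]))
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)  # accumulate A = sum phi (phi - gamma phi')^T
        b += r * f                            # accumulate b = sum phi r
    return np.linalg.solve(A, b)              # weights of the approximate cost function
```

With tabular (one-hot) features this recovers the exact cost-to-go of a small chain, which is a quick sanity check of the estimator.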
Improved Temporal Difference Methods with Linear Function Approximation
Abstract

Cited by 26 (4 self)
This chapter considers temporal difference algorithms within the context of infinite-horizon finite-state dynamic programming problems with discounted cost and linear cost function approximation. This problem arises as a subproblem in the policy iteration method of dynamic programming. Additional discussions of such problems can be found in Chapters 12 and 6. The advantage of the method presented here is that it is the first iterative temporal difference method that converges without requiring a diminishing step size. The chapter discusses the connections with Sutton's TD(λ) and with various versions of least-squares methods that are based on value iteration. It is shown using both analysis and experiments that the proposed method is substantially faster, simpler, and more reliable than TD(λ). Comparisons are also made with the LSTD method of Boyan and of Bradtke and Barto.
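The constant-step-size idea can be illustrated with a minimal LSPE(0)-style sketch (a simplification over a fixed batch of transitions; the chapter's exact method and notation may differ): each iteration refits the weights by least squares against one-step TD targets computed with the current weights, using a unit step size rather than a diminishing one.

```python
import numpy as np

def lspe0(transitions, phi, gamma=0.9, iters=50):
    """LSPE(0)-style iteration with a constant (unit) step size.

    Illustrative sketch only. Each pass solves a least-squares fit of
    phi(s)^T w to the TD target phi(s)^T w_k + d_t(w_k).
    """
    k = len(phi(transitions[0][0]))
    B = np.zeros((k, k)); A = np.zeros((k, k)); b = np.zeros(k)
    for s, r, s_next in transitions:
        f, fn = phi(s), phi(s_next)
        B += np.outer(f, f)                 # Gram matrix of features
        A += np.outer(f, gamma * fn - f)    # TD operator part
        b += r * f                          # reward part
    w = np.zeros(k)
    for _ in range(iters):
        # w_{k+1} = w_k + B^{-1} (b + A w_k): least-squares refit, step size 1
        w = w + np.linalg.solve(B, b + A @ w)
    return w
```

On a small tabular example this iteration converges to the same weights as the batch LSTD solution, without any step-size schedule.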
Control With Limited Information
, 2001
Abstract

Cited by 7 (0 self)
... How does "information" interact with control of a system, in particular feedback control, and what is the value of "information" in achieving performance objectives for the system through the exercise of control? In answering this question we have to remember that, in contrast to a variety of communications settings, the issue of time delay is of primary importance for control problems, especially control of systems which are unstable. We discuss various issues arising from these fundamental questions.
The Capacity of Communication Channels with Memory
, 2004
Abstract

Cited by 5 (0 self)
For a state machine channel, a simple form of the feedback-capacity-achieving source distribution is revealed. A Markov source, whose memory length equals the channel memory length, achieves the feedback capacity. Given the posterior channel-state distribution, the optimal source Markov transition probabilities become independent of the whole history of past channel outputs. Further, when the feedback is delayed, the delayed feedback capacity is achieved by a Markov source whose memory length equals the sum of the channel memory length and the feedback delay. The Markov source optimization is formulated as a standard stochastic control problem and is solved by dynamic programming. The (delayed) feedback capacity is an upper bound on the feedforward channel capacity, and this bound can be made tight by increasing the feedback delay. The linear Gaussian channel with an average input power constraint can be equivalently modelled as a state machine channel. When the channel has feedback, by following similar procedures as developed for the state machine channel, it is shown that Gauss-Markov sources achieve the feedback capacity and a Kalman-Bucy filter is optimal for processing ...
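As an illustration of the filtering step mentioned above, here is a scalar discrete-time Kalman update (the paper's Kalman-Bucy filter is the continuous-time analogue; the model x' = a·x + w, y = c·x + v and all parameter names are assumptions for this sketch, not the paper's formulation).

```python
def kalman_step(x_hat, P, y, a, c, q, r):
    """One discrete-time Kalman update for x' = a x + w, y = c x + v.

    Illustrative sketch only. q and r are the process and measurement
    noise variances; x_hat, P are the prior state estimate and variance.
    """
    # predict through the state dynamics
    x_pred = a * x_hat
    P_pred = a * P * a + q
    # correct with the new observation y
    K = P_pred * c / (c * P_pred * c + r)   # Kalman gain
    x_hat = x_pred + K * (y - c * x_pred)   # updated estimate
    P = (1 - K * c) * P_pred                # updated variance
    return x_hat, P
```

With equal prior and measurement variance, one update moves the estimate halfway toward the observation, which matches the usual intuition for the gain.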
Opportunistic Spectrum Access for Energy-constrained Cognitive Radios
Abstract

Cited by 4 (0 self)
This paper considers a scenario in which a secondary user makes opportunistic use of a channel allocated to some primary network. The primary network operates in a time-slotted manner and switches between idle and active states according to a stationary Markovian process. At the beginning of each time slot, the secondary user can choose to stay idle or to carry out spectrum sensing to detect whether the primary network is idle or active. If the primary network is detected as idle, the secondary user can carry out data transmission. Spectrum sensing consumes time and energy and introduces false alarms and misdetections. Given the delay cost associated with staying idle, the energy costs associated with spectrum sensing and data transmission, and the throughput gain associated with successful transmissions, the objective is to decide, for each time slot, whether the secondary user should stay idle or carry out sensing, and if so, for how long, to maximize the expected net reward. We formulate this problem as a partially observable Markov decision process (POMDP) and prove several structural properties of the optimal spectrum sensing/accessing policies. Based on these properties, heuristic control policies with low complexity and good performance are proposed.
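The POMDP's sufficient statistic is the belief that the primary network is idle, updated through the Markov transition and an imperfect sensing outcome. A minimal sketch of that update (the parameter names p_ii, p_ai, p_fa, p_md are hypothetical, not the paper's notation):

```python
def belief_update(b, p_ii, p_ai, sensed_idle, p_fa, p_md):
    """Belief (P[primary idle]) update for a two-state Markov channel.

    Hypothetical sketch: p_ii = P(idle -> idle), p_ai = P(active -> idle),
    p_fa = false-alarm rate, p_md = misdetection rate of the sensor.
    """
    # Markov transition step: propagate the belief one slot forward
    b_pred = b * p_ii + (1 - b) * p_ai
    # Bayes correction from the (imperfect) sensing outcome
    if sensed_idle:
        num = b_pred * (1 - p_fa)            # truly idle, no false alarm
        den = num + (1 - b_pred) * p_md      # actually active but missed
    else:
        num = b_pred * p_fa                  # idle, but a false alarm fired
        den = num + (1 - b_pred) * (1 - p_md)
    return num / den
```

With a perfect sensor (p_fa = p_md = 0), an "idle" reading drives the belief to 1, as expected; nonzero error rates leave residual uncertainty.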
Optimal Sequential Exploration: Bandits, Clairvoyants, and Wildcats. Submitted; accessible at http://faculty.fuqua.duke.edu/ jes9/bio/OptimalSequentialExplorationBCW.pdf
, 2012
Abstract

Cited by 4 (0 self)
This paper was motivated by the problem of developing an optimal strategy for exploring a large oil and gas field in the North Sea. Where should we drill first? Where do we drill next? The problem resembles a classical multi-armed bandit problem, but probabilistic dependence plays a key role: outcomes at drilled sites reveal information about neighboring targets. Good exploration strategies will take advantage of this information as it is revealed. We develop heuristic policies for sequential exploration problems and complement these heuristics with upper bounds on the performance of an optimal policy. We begin by grouping the targets into clusters of manageable size. The heuristics are derived from a model that treats these clusters as independent. The upper bounds are given by assuming each cluster has perfect information about the results from all other clusters. The analysis relies heavily on results for bandit superprocesses, a generalization of the classical multi-armed bandit problem. We evaluate the heuristics and bounds using Monte Carlo simulation and, in our problem, we find that the heuristic policies are nearly optimal.
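The heuristic-versus-upper-bound idea can be illustrated in a stripped-down setting that ignores the probabilistic dependence the paper emphasizes: independent Bernoulli targets with known success probabilities, a fixed reward R per success, and a drilling cost c (all names and the model itself are hypothetical, for illustration only). The myopic policy drills a target iff its expected payoff exceeds the cost; a clairvoyant who already knows each outcome drills only the successes, which yields an upper bound.

```python
def heuristic_value(probs, R, c):
    """Expected value of the myopic policy: drill target i iff p_i * R > c.

    Hypothetical independent-target model, not the paper's dependent one.
    """
    return sum(p * R - c for p in probs if p * R > c)

def clairvoyant_bound(probs, R, c):
    """Perfect-information upper bound: only the successful wells are drilled."""
    return sum(p * (R - c) for p in probs) if R > c else 0.0
```

Since p·R − c ≤ p·(R − c) whenever p ≤ 1, the clairvoyant value always dominates the myopic one, mirroring the paper's use of perfect-information bounds to certify near-optimality.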
Label-Setting Methods for Multimode Stochastic Shortest Path Problems on Graphs
INFORMS, doi 10.1287/moor.1080.0321
Local Exploration: Online Algorithms and a Probabilistic Framework
 In Proc. IEEE Int. Conf. Robotics and Automation (ICRA 2003)
, 2003
Abstract

Cited by 3 (0 self)
Mapping an environment with an imaging sensor becomes very challenging if the environment to be mapped is unknown and has to be explored. Exploration involves the planning of views so that the entire environment is covered. The majority of implemented mapping systems use heuristic planning, while theoretical approaches regard only the traveled distance as cost. However, practical range acquisition systems spend a considerable amount of time on acquisition. In this paper, we address the problem of minimizing the cost of looking around a corner, involving the time spent in traveling as well as the time spent for reconstruction. Such a local exploration can be used as a subroutine for global algorithms. We prove competitive ratios for two online algorithms. Then, we provide two representations of local exploration as a Markov Decision Process and apply a known policy iteration algorithm. Simulation results show that for some distributions the probabilistic approach outperforms deterministic strategies.
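The "known policy iteration algorithm" the abstract refers to can be sketched generically for a finite MDP (this is textbook policy iteration, not the paper's specific exploration model; the array layouts P[a][s][s'] and R[a][s] are assumptions of this sketch).

```python
import numpy as np

def policy_iteration(P, R, gamma=0.95):
    """Policy iteration on a finite MDP.

    P[a][s][s'] are transition probabilities, R[a][s] expected rewards.
    Generic textbook sketch; alternates exact evaluation and greedy improvement.
    """
    n_a, n_s = len(P), len(P[0])
    policy = np.zeros(n_s, dtype=int)
    while True:
        # policy evaluation: solve (I - gamma P_pi) V = R_pi exactly
        P_pi = np.array([P[policy[s]][s] for s in range(n_s)])
        R_pi = np.array([R[policy[s]][s] for s in range(n_s)])
        V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)
        # policy improvement: act greedily with respect to V
        Q = np.array([[R[a][s] + gamma * np.dot(P[a][s], V)
                       for s in range(n_s)] for a in range(n_a)])
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, V  # stable policy is optimal
        policy = new_policy
```

Finiteness of the state and action sets guarantees termination, since each improvement step strictly increases the value until the policy is stable.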
Algorithms for Distributed and Mobile Sensing
, 2004
Abstract

Cited by 2 (0 self)
Sensing remote, complex, and large environments is an important task that arises in diverse applications including planetary exploration, monitoring forest fires, and the surveillance of large factories. Currently, automation of such sensing tasks in complex environments is achieved either by deploying many stationary sensors in the environment, or by mounting a sensor on a mobile device and using the device to sense the environment. ...
Modelling of power generation investment incentives under uncertainty in liberalised electricity markets
 Proceedings of the Sixth IAEE European Conference 2004
, 2004