Greedy linear value-approximation for factored Markov decision processes
- In Proceedings of the 18th National Conference on Artificial Intelligence
, 2002
Cited by 24 (7 self)
Significant recent work has focused on using linear representations to approximate value functions for factored Markov decision processes (MDPs). Current research has adopted linear programming as an effective means to calculate approximations for a given set of basis functions, tackling very large MDPs as a result. However, a number of issues remain unresolved: How accurate are the approximations produced by linear programs? How hard is it to produce better approximations? And where do the basis functions come from? To address these questions, we first investigate the complexity of minimizing the Bellman error of a linear value function approximation, showing that this is an inherently hard problem.
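As a rough illustration of the quantity this abstract refers to, the sketch below computes the max-norm Bellman error of a linear value function V = Phi w on a tiny flat (non-factored) MDP. It is not the paper's method: the function names, the two-state MDP, the rewards and the basis matrices are all hypothetical and chosen only to make the definition concrete.

```python
def linear_value(Phi, w):
    """Evaluate V(s) = sum_i Phi[s][i] * w[i] for every state s."""
    return [sum(f * wi for f, wi in zip(row, w)) for row in Phi]


def bellman_backup(P, R, V, gamma):
    """Apply the Bellman optimality operator T to V.

    P[a][s][t] is the probability of moving from s to t under action a,
    R[a][s] is the immediate reward for taking action a in state s.
    """
    n = len(V)
    return [
        max(
            R[a][s] + gamma * sum(P[a][s][t] * V[t] for t in range(n))
            for a in range(len(P))
        )
        for s in range(n)
    ]


def bellman_error(P, R, Phi, w, gamma):
    """Max-norm Bellman error ||V - TV||_inf of the linear approximation."""
    V = linear_value(Phi, w)
    TV = bellman_backup(P, R, V, gamma)
    return max(abs(v - tv) for v, tv in zip(V, TV))


# Hypothetical toy MDP: action 0 stays in place, action 1 switches state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0
    [[0.0, 1.0], [1.0, 0.0]],  # action 1
]
R = [
    [0.0, 1.0],  # action 0: reward 1 only when staying in state 1
    [0.0, 0.0],  # action 1: no reward
]
gamma = 0.5

# A basis rich enough to represent the optimal value function [1, 2].
rich_basis = [[1.0, 0.0], [1.0, 1.0]]
# A single constant basis function, which cannot represent it.
constant_basis = [[1.0], [1.0]]

print(bellman_error(P, R, rich_basis, [1.0, 1.0], gamma))
print(bellman_error(P, R, constant_basis, [1.0], gamma))
```

With the richer basis the weights [1, 1] reproduce the optimal values exactly, so the error vanishes; with only a constant basis function some error remains whatever weight is chosen, which is one way to see why the choice of basis functions matters.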
Pricing of Dialup Services: an Example of Congestion-Dependent Pricing in the Internet
- Proceedings of the 39th IEEE Conference on Decision and Control
, 2000
Cited by 8 (0 self)
Recent research on pricing multiclass loss networks [19] has shown that the performance of optimal static pricing approaches that of optimal dynamic (congestion-dependent) pricing in the many-small-sources limit. In our own work with similar models, we have found it difficult to obtain large gains over static pricing in realistic settings, even when the many-small-sources assumption is violated. In this paper we give an example which is a stochastic control model for congestion-dependent pricing of Internet services. The model describes a local Internet service provider (ISP) with a single link to a peer network and two types of customers: (1) large institutions who are refunded for loss-rate violations and (2) small dialup users who "pay per click" on the World Wide Web according to prices set by the ISP. To understand the limits of performance, we assume that price information can be communicated instantaneously to the users. Our formulation captures the basic tradeoff in allocating bandwidth to the two classes of users in maximizing average net revenue. Optimal pricing requires that the ISP anticipate and respond to changes in bandwidth consumption. Our goal is to quantify the gain that can be achieved through dynamic pricing over open-loop pricing strategies which may or may not account for time-of-day effects. We frame the problem as a continuous-time Markov decision process for which we numerically compute optimal solutions. We interpret the results for a wide range of parameter settings to isolate scenarios where real-time price feedback can substantially improve upon time-of-day pricing. Key Words: Network Pricing, Quality-of-Service, Discrete Stochastic Control, Markov Decision Processes This work is supported by the National Science Foundation through grants E...
An optimal flow assignment framework for heterogeneous network access
- in Proc. IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks
, 2007
Cited by 7 (5 self)
We consider a scenario where devices with multiple networking capabilities access networks with heterogeneous characteristics. In such a setting, we address the problem of efficient utilization of multiple access networks (wireless and/or wireline) by devices via optimal assignment of traffic flows with given utilities to different networks. We develop and analyze a device middleware functionality that monitors network characteristics and employs a Markov Decision Process (MDP) based control scheme that, in conjunction with stochastic characterization of the available bit rate and delay of the networks, generates an optimal policy for allocation of flows to different networks. The optimal policy maximizes, under available bit rate and delay constraints on the access networks, a discounted reward which is a function of the flow utilities. The flow assignment policy is periodically updated and is consulted by the flows to dynamically perform network selection during their lifetimes. We perform measurement tests to collect traces of available bit rate and delay characteristics on Ethernet and WLAN networks on a work day in a corporate work environment. We implement our flow assignment framework in ns-2 and simulate the system performance for a set of elastic video-like flows using the collected traces. We demonstrate that the MDP-based flow assignment policy leads to significant enhancement in the QoS provisioning (lower packet delays and packet loss rates) for the flows, as compared to policies which do not perform dynamic flow assignment but statically allocate flows to different networks using heuristics like average available bit rate on the networks.
Optimality inequalities for average cost Markov decision processes and the optimality of (s,S) policies
- Mathematics of Operations Research
, 2006
doi 10.1287/moor.1070.0269
Universal reinforcement learning
- IEEE Transactions on Information Theory
Cited by 5 (0 self)
We consider an agent interacting with an unmodeled environment. At each time, the agent makes an observation, takes an action, and incurs a cost. Its actions can influence future observations and costs. The goal is to minimize the long-term average cost. We propose an algorithm for optimal control based on ideas from the Lempel-Ziv scheme for universal data compression and prediction. We establish that, if there exists an integer K such that the future is conditionally independent of the past given a window of K consecutive actions and observations, then the average cost converges to the optimum. Experimental results involving the game of Rock-Paper-Scissors illustrate merits of the algorithm.
Now or Later: Simple Policy for Effective Dual Sourcing in Capacitated Systems
- Tepper Working Paper
, 2005
Cited by 4 (1 self)
We examine a possibly capacitated, periodically reviewed, single-stage inventory system where replenishment can be obtained either through a regular fixed lead time channel or, for a premium, via a channel with a smaller fixed lead time. We consider the case when the unsatisfied demands are back-ordered over an infinite horizon, introducing the easily implementable, yet informationally rich Dual Index policy. We show very general separability results for the optimal parameter values, providing a simulation-based optimization procedure that exploits these separability properties to calculate the optimal inventory parameters within seconds. We explore the performance of the Dual Index policy under stationary demands as well as capacitated production environments, demonstrating when the dual sourcing option is most valuable. We find that the optimal Dual Index policy mimics the behavior of the complex, globally optimal state-dependent policy found via dynamic programming: the Dual Index policy is nearly optimal (within one or two percent) for the majority of cases, and significantly outperforms Single Sourcing (up to 50% better). Our results on optimal Dual Index parameters are generic, extending to a variety of complex and realistic scenarios like non-stationary demand, random yields, demand spikes and supply disruptions.
Bounded Parameter Markov Decision Processes with Average Reward Criterion
Cited by 4 (0 self)
Bounded parameter Markov Decision Processes (BMDPs) address the issue of dealing with uncertainty in the parameters of a Markov Decision Process (MDP). Unlike the case of an MDP, the notion of an optimal policy for a BMDP is not entirely straightforward. We consider two notions of optimality based on optimistic and pessimistic criteria. These have been analyzed for discounted BMDPs. Here we provide results for average reward BMDPs. We establish a fundamental relationship between the discounted and the average reward problems, prove the existence of Blackwell optimal policies and, for both notions of optimality, derive algorithms that converge to the optimal value function.
Stable Dual Dynamic Programming
Cited by 4 (0 self)
Recently, we have introduced a novel approach to dynamic programming and reinforcement learning that is based on maintaining explicit representations of stationary distributions instead of value functions. In this paper, we investigate the convergence properties of these dual algorithms both theoretically and empirically, and show how they can be scaled up by incorporating function approximation.
Supply chain management with overtime and premium freight
- Working Paper
, 2002
Cited by 4 (1 self)
We consider a two-stage supply chain under centralized control. The downstream facility faces discrete stochastic demand and passes supply requests to the upstream facility. The upstream facility always meets the supply requests from downstream. If the upstream facility cannot meet the supply requests from inventory on hand, the shortage must be filled by overtime production and/or premium freight shipments, both incurring per-unit and setup costs. Overtime production occurs at the end of the period and incurs relatively high production costs; premium freight refers to building products at the beginning of the period they are needed and shipping them very quickly with relatively high shipping costs. Focusing primarily on the case where only one method of filling shortages is available, we determine novel optimal inventory policies under centralized control. At both stages, threshold policies that depend only on the current inventory in the system are optimal; for the total inventory in the system, a base-stock policy is optimal. Numerical analysis provides insight into the optimal policies and allows us to compare the supply chain under centralized ...
The Smoothed Approximate Linear Program
, 2009
Cited by 4 (0 self)
We present a novel linear program for the approximation of the dynamic programming cost-to-go function in high-dimensional stochastic control problems. LP approaches to approximate DP have typically relied on a natural ‘projection’ of a well-studied linear program for exact dynamic programming. Such programs restrict attention to approximations that are lower bounds to the optimal cost-to-go function. Our program, the ‘smoothed approximate linear program’, is distinct from such approaches and relaxes the restriction to lower bounding approximations in an appropriate fashion while remaining computationally tractable. Doing so appears to have several advantages: First, we demonstrate substantially superior bounds on the quality of approximation to the optimal cost-to-go function afforded by our approach. Second, experiments with our approach on a challenging problem (the game of Tetris) show that the approach outperforms the existing LP approach (which has previously been shown to be competitive with several ADP algorithms) by an order of magnitude.
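For orientation, the ‘projected’ linear program that this abstract contrasts itself with is usually written in the following form. This is the textbook approximate LP for a discounted cost-minimization problem, not a formulation taken from the paper; the symbols (basis matrix \Phi, weights r, state-relevance weights \nu, stage cost g, discount factor \alpha, transition probabilities P_a) are assumptions introduced here for illustration.

```latex
\begin{aligned}
\max_{r} \quad & \sum_{x} \nu(x)\, (\Phi r)(x) \\
\text{s.t.} \quad & (\Phi r)(x) \;\le\; g(x,a) + \alpha \sum_{x'} P_a(x,x')\, (\Phi r)(x')
\qquad \text{for all states } x \text{ and actions } a .
\end{aligned}
```

Any feasible \Phi r satisfies \Phi r \le T \Phi r, where T is the Bellman operator, and by monotonicity of T it is therefore a pointwise lower bound on the optimal cost-to-go function. This is the lower-bounding restriction that the abstract says the smoothed program relaxes.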

