Results 1 - 10 of 87
Managing data transfers in computer clusters with Orchestra
, 2011
"... Cluster computing applications like MapReduce and Dryad transfer massive amounts of data between their computation stages. These transfers can have a significant impact on job performance, accounting for more than 50 % of job completion times. Despite this impact, there has been relatively little wo ..."
Abstract
-
Cited by 61 (6 self)
- Add to MetaCart
Cluster computing applications like MapReduce and Dryad transfer massive amounts of data between their computation stages. These transfers can have a significant impact on job performance, accounting for more than 50% of job completion times. Despite this impact, there has been relatively little work on optimizing the performance of these data transfers, with networking researchers traditionally focusing on per-flow traffic management. We address this limitation by proposing a global management architecture and a set of algorithms that (1) improve the transfer times of common communication patterns, such as broadcast and shuffle, and (2) allow scheduling policies at the transfer level, such as prioritizing a transfer over other transfers. Using a prototype implementation, we show that our solution improves broadcast completion times by up to 4.5× compared to the status quo in Hadoop. We also show that transfer-level scheduling can reduce the completion time of high-priority transfers by 1.7×.
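A toy version of one natural transfer-level policy for shuffle makes "scheduling at the transfer level" concrete: on a shared link, give each flow of a shuffle a rate proportional to its remaining bytes, so all flows finish together and the transfer as a whole completes as early as possible. The Python below is a minimal sketch of that principle with made-up flow names and sizes, not necessarily the paper's algorithm.

def allocate_rates(flow_bytes, link_capacity):
    """flow_bytes: {flow_id: remaining bytes}; returns {flow_id: rate in bytes/s}."""
    total = sum(flow_bytes.values())
    if total == 0:
        return {f: 0.0 for f in flow_bytes}
    # Rate proportional to remaining data: every flow finishes at total/capacity.
    return {f: link_capacity * b / total for f, b in flow_bytes.items()}

# Example: two mapper-to-reducer flows sharing a 1 GB/s link.
print(allocate_rates({"m1->r1": 4e9, "m2->r1": 1e9}, link_capacity=1e9))

Because all flows finish at the same instant, no capacity is wasted on a flow that would otherwise finish early while its partner flows straggle.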
FairCloud: Sharing The Network In Cloud Computing
"... The network is a crucial resource in cloud computing, but in contrast to other resources such as CPU or memory, the network is currently shared in a best effort manner. However, sharing the network in a datacenter is more challenging than sharing the other resources. The key difficulty is that the n ..."
Abstract
-
Cited by 55 (4 self)
- Add to MetaCart
(Show Context)
The network is a crucial resource in cloud computing, but in contrast to other resources such as CPU or memory, the network is currently shared in a best effort manner. However, sharing the network in a datacenter is more challenging than sharing the other resources. The key difficulty is that the network allocation for a VM X depends not only on the VMs running on the same machine with X, but also on the other VMs that X communicates with, as well as on the cross-traffic on each link used by X. In this paper, we first propose a set of desirable properties for allocating the network bandwidth in a datacenter at the VM granularity, and show that there exists a fundamental tradeoff between the ability to share congested links in proportion to payment and the ability to provide minimal bandwidth guarantees to VMs. Second, we show that the existing allocation models violate one or more of these properties, and propose a mechanism that can select different points in the aforementioned tradeoff between payment proportionality and bandwidth guarantees.
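To make the allocation question concrete, here is a hypothetical Python sketch of one per-link sharing model in the spirit the paper analyzes: a flow between VMs A and B gets weight w(A)/n(A) + w(B)/n(B), where n(X) is the number of flows VM X has on the congested link, and the link is divided in proportion to these weights. The model, VM names, and numbers are illustrative assumptions, not the paper's definitive mechanism.

from collections import Counter

def link_shares(flows, vm_weight, capacity):
    """flows: list of (src, dst) VM pairs on one congested link.
    Returns per-flow bandwidth shares."""
    n = Counter()
    for s, d in flows:
        n[s] += 1
        n[d] += 1
    # Each flow inherits weight from both endpoints, split across their flows.
    w = [vm_weight[s] / n[s] + vm_weight[d] / n[d] for s, d in flows]
    total = sum(w)
    return [capacity * wi / total for wi in w]

# VM A has two flows on the link, so each carries half of A's weight.
print(link_shares([("A", "C"), ("A", "D"), ("B", "C")],
                  {"A": 1, "B": 1, "C": 1, "D": 1}, capacity=10e9))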
Performance isolation and fairness for multi-tenant cloud storage
- In OSDI
, 2012
"... Shared storage services enjoy wide adoption in commercial clouds. But most systems today provide weak performance isolation and fairness between tenants, if at all. Misbehaving or high-demand tenants can overload the shared service and disrupt other well-behaved tenants, leading to unpredictable per ..."
Abstract
-
Cited by 40 (2 self)
- Add to MetaCart
(Show Context)
Shared storage services enjoy wide adoption in commercial clouds. But most systems today provide weak performance isolation and fairness between tenants, if at all. Misbehaving or high-demand tenants can overload the shared service and disrupt other well-behaved tenants, leading to unpredictable performance and violating SLAs. This paper presents Pisces, a system for achieving datacenter-wide per-tenant performance isolation and fairness in shared key-value storage. Today's approaches for multi-tenant resource allocation are based either on per-VM allocations or hard rate limits that assume uniform workloads to achieve high utilization. Pisces achieves per-tenant weighted fair shares (or minimal rates) of the aggregate resources of the shared service, even when different tenants' partitions are co-located and when demand for different partitions is skewed, time-varying, or bottlenecked by different server resources. Pisces does so by decomposing the fair sharing problem into a combination of four complementary mechanisms (partition placement, weight allocation, replica selection, and weighted fair queuing) that operate on different time-scales and combine to provide system-wide max-min fairness. An evaluation of our Pisces storage prototype shows nearly ideal (0.99 Min-Max Ratio) weighted fair sharing, strong performance isolation, and robustness to skew and shifts in tenant demand. These properties are achieved with minimal overhead (<3%), even when running at high utilization (more than 400,000 requests/second/server for 10B requests).
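Of the four mechanisms the abstract names, weighted fair queuing is the easiest to show in a few lines. Below is a generic, simplified start-time-fair-queuing sketch in Python; the tenant names, per-request costs, and the virtual-time update are illustrative simplifications, not Pisces code.

import heapq

class WFQ:
    def __init__(self, weights):
        self.weights = weights                      # tenant -> weight
        self.finish = {t: 0.0 for t in weights}     # last finish tag per tenant
        self.vtime = 0.0                            # virtual time
        self.pq = []                                # (finish_tag, seq, tenant, req)
        self.seq = 0                                # tie-breaker for equal tags

    def enqueue(self, tenant, request, cost=1.0):
        start = max(self.vtime, self.finish[tenant])
        self.finish[tenant] = start + cost / self.weights[tenant]
        heapq.heappush(self.pq, (self.finish[tenant], self.seq, tenant, request))
        self.seq += 1

    def dequeue(self):
        tag, _, tenant, request = heapq.heappop(self.pq)
        self.vtime = tag                            # simplified virtual-time advance
        return tenant, request

q = WFQ({"gold": 2, "bronze": 1})
for i in range(3):
    q.enqueue("gold", f"g{i}")
    q.enqueue("bronze", f"b{i}")
print([q.dequeue() for _ in range(6)])   # gold is served roughly twice as often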
The Only Constant is Change: Incorporating Time-Varying Network Reservations in Data Centers
"... In multi-tenant datacenters, jobs of different tenants compete for theshareddatacenternetworkandcansufferpoorperformanceand high cost from varying, unpredictable network performance. Recently, several virtual network abstractions have been proposed to provide explicit APIs for tenant jobs to specify ..."
Abstract
-
Cited by 28 (1 self)
- Add to MetaCart
(Show Context)
In multi-tenant datacenters, jobs of different tenants compete for the shared datacenter network and can suffer poor performance and high cost from varying, unpredictable network performance. Recently, several virtual network abstractions have been proposed to provide explicit APIs for tenant jobs to specify and reserve virtual clusters (VC) with both explicit VMs and required network bandwidth between the VMs. However, all of the existing proposals reserve a fixed bandwidth throughout the entire execution of a job. In this paper, we first profile the traffic patterns of several popular cloud applications, and find that they generate substantial traffic during only 30%-60% of the entire execution, suggesting existing simple VC models waste precious networking resources. We then propose a fine-grained virtual network abstraction, Time-Interleaved Virtual Clusters (TIVC), that models the time-varying nature of the networking requirement of cloud applications. To demonstrate the effectiveness of TIVC, we develop PROTEUS, a system that implements the new abstraction. Using large-scale simulations of cloud application workloads and a prototype implementation running actual cloud applications, we show the new abstraction significantly increases the utilization of the entire datacenter and reduces the cost to the tenants, compared to previous fixed-bandwidth abstractions.
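The gain from time-varying reservations is easiest to see in admission control: a link can admit a bursty job whose phases dodge other jobs' peaks, where a fixed peak-rate reservation would be rejected. The Python sketch below assumes a TIVC-like request is simply a list of (start, end, bandwidth) phases; the shapes and numbers are illustrative, not the paper's models.

def admissible(existing, request, capacity):
    """existing/request: lists of (start, end, bw) reservations on one link."""
    events = sorted({t for s, e, _ in existing + request for t in (s, e)})
    for t0, t1 in zip(events, events[1:]):
        # Sum every reservation that covers the interval [t0, t1].
        load = sum(bw for s, e, bw in existing + request if s <= t0 and e >= t1)
        if load > capacity:
            return False
    return True

existing = [(0, 10, 5), (3, 6, 5)]             # link is saturated during [3, 6]
bursty = [(0, 3, 4), (6, 10, 4)]               # job that is idle during [3, 6]
print(admissible(existing, bursty, 10))        # True: the peaks never overlap
print(admissible(existing, [(0, 10, 4)], 10))  # False: fixed reservation rejected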
Fastpass: A Centralized “Zero-Queue” Datacenter Network
"... An ideal datacenter network should provide several properties, in-cluding low median and tail latency, high utilization (throughput), fair allocation of network resources between users or applications, deadline-aware scheduling, and congestion (loss) avoidance. Current datacenter networks inherit th ..."
Abstract
-
Cited by 19 (0 self)
- Add to MetaCart
(Show Context)
An ideal datacenter network should provide several properties, including low median and tail latency, high utilization (throughput), fair allocation of network resources between users or applications, deadline-aware scheduling, and congestion (loss) avoidance. Current datacenter networks inherit the principles that went into the design of the Internet, where packet transmission and path selection decisions are distributed among the endpoints and routers. Instead, we propose that each sender should delegate to a centralized arbiter control of when each packet should be transmitted and what path it should follow. This paper describes Fastpass, a datacenter network architecture built using this principle. Fastpass incorporates two fast algorithms: the first determines the time at which each packet should be transmitted, while the second determines the path to use for that packet. In addition, Fastpass uses an efficient protocol between the endpoints and the arbiter and an arbiter replication strategy for fault-tolerant failover. We deployed and evaluated Fastpass in a portion of Facebook's datacenter network. Our results show that Fastpass achieves throughput comparable to current networks with a 240× reduction in queue lengths (4.35 MBytes reduced to 18 KBytes), achieves much fairer and more consistent flow throughputs than the baseline TCP (5200× reduction in the standard deviation of per-flow throughput with five concurrent connections), scales from 1 to 8 cores in the arbiter implementation with the ability to schedule 2.21 Terabits/s of traffic in software on eight cores, and achieves a 2.5× reduction in the number of TCP retransmissions in a latency-sensitive service at Facebook.
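The arbiter's timeslot allocation can be caricatured in a few lines: per timeslot, admit at most one packet per source and one per destination (a matching), deferring the rest to later slots. The Python sketch below is only an illustration of that timeslot-matching idea with made-up demands; Fastpass's actual algorithms are pipelined and far more efficient.

from collections import deque

def schedule(demands, num_slots):
    """demands: list of (src, dst, packets); returns {slot: [(src, dst), ...]}."""
    queue = deque((s, d) for s, d, n in demands for _ in range(n))
    timeline = {}
    for slot in range(num_slots):
        busy_src, busy_dst, admitted, deferred = set(), set(), [], deque()
        while queue:
            s, d = queue.popleft()
            if s in busy_src or d in busy_dst:
                deferred.append((s, d))      # endpoint already used this slot
            else:
                busy_src.add(s)
                busy_dst.add(d)
                admitted.append((s, d))
        timeline[slot] = admitted
        queue = deferred
        if not queue:
            break
    return timeline

print(schedule([("A", "B", 2), ("A", "C", 1), ("D", "B", 1)], num_slots=4))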
Decentralized Task-aware Scheduling for Data Center Networks.
, 2013
"... ABSTRACT Many data center applications perform rich and complex tasks (e.g., executing a search query or generating a user's news-feed). From a network perspective, these tasks typically comprise multiple flows, which traverse different parts of the network at potentially different times. Most ..."
Abstract
-
Cited by 17 (2 self)
- Add to MetaCart
(Show Context)
Many data center applications perform rich and complex tasks (e.g., executing a search query or generating a user's news-feed). From a network perspective, these tasks typically comprise multiple flows, which traverse different parts of the network at potentially different times. Most network resource allocation schemes, however, treat all these flows in isolation, rather than as part of a task, and therefore only optimize flow-level metrics. In this paper, we show that task-aware network scheduling, which groups flows of a task and schedules them together, can reduce both the average and the tail completion time for typical data center applications. To achieve these benefits in practice, we design and implement Baraat, a decentralized task-aware scheduling system. Baraat schedules tasks in a FIFO order but avoids head-of-line blocking by dynamically changing the level of multiplexing in the network. Through experiments with Memcached on a small testbed and large-scale simulations, we show that Baraat outperforms state-of-the-art decentralized schemes (e.g., pFabric) as well as centralized schedulers (e.g., Orchestra) for a wide range of workloads (e.g., search, analytics).
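The FIFO-with-dynamic-multiplexing policy can be sketched directly: flows inherit their task's arrival rank, the lowest-ranked task gets the network, and when the head task turns out to be heavy, the next task is multiplexed in alongside it instead of being blocked behind it. The Python below is a rough illustration of that idea with an invented byte threshold, not Baraat's protocol.

def active_tasks(tasks, heavy_bytes=100e6):
    """tasks: list of (rank, remaining_bytes), FIFO by arrival rank.
    Returns the ranks allowed to send in this round."""
    tasks = sorted(t for t in tasks if t[1] > 0)
    allowed, level = [], 1                   # level = current multiplexing limit
    for rank, remaining in tasks:
        if len(allowed) >= level:
            break
        allowed.append(rank)
        if remaining > heavy_bytes:          # head task is heavy: admit one more
            level += 1
    return allowed

print(active_tasks([(1, 500e6), (2, 10e6), (3, 10e6)]))  # [1, 2]: no HoL blocking
print(active_tasks([(1, 50e6), (2, 10e6)]))              # [1]: strict FIFO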
Choosy: Max-Min Fair Sharing for Datacenter Jobs with Constraints
"... Max-Min Fairness is a flexible resource allocation mechanism used in most datacenter schedulers. However, an increasing number of jobs have hard placement constraints, restricting the machines they can run on due to special hardware or software requirements. It is unclear how to define, and achieve, ..."
Abstract
-
Cited by 15 (2 self)
- Add to MetaCart
Max-Min Fairness is a flexible resource allocation mechanism used in most datacenter schedulers. However, an increasing number of jobs have hard placement constraints, restricting the machines they can run on due to special hardware or software requirements. It is unclear how to define, and achieve, max-min fairness in the presence of such constraints. We propose Constrained Max-Min Fairness (CMMF), an extension to max-min fairness that supports placement constraints, and show that it is the only policy satisfying an important property that incentivizes users to pool resources. Optimally computing CMMF is challenging, but we show that a remarkably simple online scheduler, called Choosy, approximates the optimal scheduler well. Through experiments, analysis, and simulations, we show that Choosy differs on average by 2% from the optimal CMMF allocation, and lets jobs achieve their fair share quickly.
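The abstract does not spell out Choosy's rule, but one natural reading of "a remarkably simple online scheduler" is: whenever a machine slot frees up, give it to the constraint-eligible user with the smallest current allocation. The Python sketch below implements that hypothetical reading with made-up users and machines; it is a guess at the shape of the policy, not the paper's scheduler.

def assign_slot(machine, allocations, eligible):
    """eligible: {user: set of machines the user's constraints allow}."""
    candidates = [u for u, machines in eligible.items() if machine in machines]
    if not candidates:
        return None
    user = min(candidates, key=lambda u: allocations[u])  # most underserved user
    allocations[user] += 1
    return user

alloc = {"u1": 3, "u2": 1}
elig = {"u1": {"m1", "m2"}, "u2": {"m2"}}
print(assign_slot("m2", alloc, elig))  # u2: lowest allocation among the eligible
print(assign_slot("m1", alloc, elig))  # u1: the only user m1 can serve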
A Cooperative Game Based Allocation for Sharing Data Center Networks
"... Abstract—In current IaaS datacenters, tenants are suffering unfairness since the network bandwidth is shared in a besteffort manner. To achieve predictable network performance for rented virtual machines (VMs), cloud providers should guarantee minimum bandwidth for VMs or allocate the network bandwi ..."
Abstract
-
Cited by 14 (6 self)
- Add to MetaCart
(Show Context)
In current IaaS datacenters, tenants suffer unfairness because the network bandwidth is shared in a best-effort manner. To achieve predictable network performance for rented virtual machines (VMs), cloud providers should guarantee minimum bandwidth for VMs or allocate the network bandwidth fairly at VM-level. At the same time, the network should be efficiently utilized in order to maximize cloud providers' revenue. In this paper, we model the bandwidth sharing problem as a Nash bargaining game, and propose the allocation principles by defining a tunable base bandwidth for each VM. Specifically, we guarantee bandwidth for those VMs with lower network rates than their base bandwidth, while maintaining fairness among other VMs with higher network rates than their base bandwidth. Based on rigorous cooperative game-theoretic approaches, we design a distributed algorithm to achieve efficient and fair bandwidth allocation corresponding to the Nash bargaining solution (NBS). With simulations under typical scenarios, we show that our strategy can meet the two desirable requirements: predictable performance for tenants and high utilization for providers. By tuning the base bandwidth, our solution enables cloud providers to flexibly balance the tradeoff between minimum guarantees and fair sharing of datacenter networks.
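The stated allocation principle, guarantee VMs below their base bandwidth and share the rest fairly above it, can be sketched without any game theory. The Python below splits spare capacity in proportion to base bandwidth, iterating so no VM exceeds its demand; it illustrates the principle only, not the paper's distributed NBS algorithm.

def allocate(demand, base, capacity):
    """demand/base: {vm: bandwidth}; returns {vm: allocation}."""
    alloc = {v: min(demand[v], base[v]) for v in demand}   # guaranteed portion
    spare = capacity - sum(alloc.values())
    active = {v for v in demand if demand[v] > alloc[v]}
    while spare > 1e-9 and active:
        total_base = sum(base[v] for v in active)
        for v in list(active):
            give = min(spare * base[v] / total_base, demand[v] - alloc[v])
            alloc[v] += give
            if demand[v] - alloc[v] < 1e-9:
                active.discard(v)            # demand met: stop giving to this VM
        spare = capacity - sum(alloc.values())
    return alloc

# vm1 asks less than its base (fully guaranteed); vm2/vm3 split what is left.
print(allocate({"vm1": 2, "vm2": 8, "vm3": 8}, {"vm1": 4, "vm2": 4, "vm3": 4}, 14))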
ElasticSwitch: Practical Work-Conserving Bandwidth Guarantees for Cloud Computing
"... While cloud computing providers offer guaranteed allocations for resources such as CPU and memory, they do not offer any guarantees for network resources. The lack of network guarantees prevents tenants from predicting lower bounds on the performance of their applications. The research community has ..."
Abstract
-
Cited by 13 (0 self)
- Add to MetaCart
(Show Context)
While cloud computing providers offer guaranteed allocations for resources such as CPU and memory, they do not offer any guarantees for network resources. The lack of network guarantees prevents tenants from predicting lower bounds on the performance of their applications. The research community has recognized this limitation but, unfortunately, prior solutions have significant limitations: either they are inefficient, because they are not work-conserving, or they are impractical, because they require expensive switch support or congestion-free network cores. In this paper, we propose ElasticSwitch, an efficient and practical approach for providing bandwidth guarantees. ElasticSwitch is efficient because it utilizes the spare bandwidth from unreserved capacity or underutilized reservations. ElasticSwitch is practical because it can be fully implemented in hypervisors, without requiring a specific topology or any support from switches. Because hypervisors operate mostly independently, there is no need for complex coordination between them or with a central controller. Our experiments, with a prototype implementation on a 100-server testbed, demonstrate that ElasticSwitch provides bandwidth guarantees and is work-conserving, even in challenging situations.
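The core control loop is simple to sketch: a hypervisor rate-limits each VM-to-VM flow, never letting it fall below its guarantee, while probing above the guarantee for spare bandwidth and backing off when congestion is detected. The constants and the congestion signal below are illustrative assumptions, not ElasticSwitch's actual rate-allocation tuning.

def next_rate(rate, guarantee, congested, step=0.1, backoff=0.5):
    """One update of a guarantee-respecting, work-conserving rate limiter."""
    if congested:
        # Decrease multiplicatively, but never below the guarantee.
        return max(guarantee, rate * backoff)
    # Probe additively for spare bandwidth above the guarantee.
    return rate + step * guarantee

rate = 1.0                                  # Gbps; equal to the guarantee here
for congested in [False, False, False, True, False]:
    rate = next_rate(rate, guarantee=1.0, congested=congested)
    print(round(rate, 2))                   # 1.1, 1.2, 1.3, 1.0, 1.1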
Verifiable Resource Accounting for Cloud Computing Services
"... Cloud computing offers users the potential to reduce operating and capital expenses by leveraging the amortization benefits offered by large, managed infrastructures. However, the black-box and dynamic nature of the cloud infrastructure makes it difficult for them to reason about the expenses that t ..."
Abstract
-
Cited by 13 (2 self)
- Add to MetaCart
(Show Context)
Cloud computing offers users the potential to reduce operating and capital expenses by leveraging the amortization benefits offered by large, managed infrastructures. However, the black-box and dynamic nature of the cloud infrastructure makes it difficult for customers to reason about the expenses that their applications incur. At the same time, the profitability of cloud providers depends on their ability to multiplex several customer applications to maintain high utilization levels. This multiplexing, however, may cause providers to incorrectly attribute resource consumption to customers or to implicitly bear additional costs, thereby reducing their cost-effectiveness. Our position in this paper is that for cloud computing as a paradigm to be sustainable in the long term, we need a systematic approach for verifiable resource accounting. Verifiability here means that cloud customers can be assured that (a) their applications indeed physically consumed the resources they were charged for and (b) this consumption was justified based on an agreed policy. As a first step toward this vision, in this paper we articulate the challenges and opportunities for realizing such a framework.
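The two verifiability conditions translate directly into a checker, at least in caricature: every charged unit must be backed by an attested usage record, and the recorded consumption must pass an agreed policy predicate. The record format and policy hook in this Python sketch are hypothetical; the paper proposes the requirements, not this interface.

def verify_bill(bill, attested_usage, policy_ok):
    """bill/attested_usage: {resource: units}; policy_ok(resource, units) -> bool."""
    for resource, charged in bill.items():
        metered = attested_usage.get(resource, 0)
        if charged > metered:                # (a) charged more than consumed
            return False, f"{resource}: charged {charged}, metered only {metered}"
        if not policy_ok(resource, metered): # (b) consumption not policy-justified
            return False, f"{resource}: consumption not justified by policy"
    return True, "bill verified"

print(verify_bill({"cpu_hours": 10, "gb_out": 5},
                  {"cpu_hours": 10, "gb_out": 5},
                  lambda r, u: True))        # (True, 'bill verified')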