Hedera: Dynamic flow scheduling for data center networks
- In Proc. of Networked Systems Design and Implementation (NSDI) Symposium, 2010
Cited by 36 (1 self)
Today’s data centers offer tremendous aggregate bandwidth to clusters of tens of thousands of machines. However, because of limited port densities in even the highest-end switches, data center topologies typically consist of multi-rooted trees with many equal-cost paths between any given pair of hosts. Existing IP multipathing protocols usually rely on per-flow static hashing and can cause substantial bandwidth losses due to long-term collisions. In this paper, we present Hedera, a scalable, dynamic flow scheduling system that adaptively schedules a multi-stage switching fabric to efficiently utilize aggregate network resources. We describe our implementation using commodity switches and unmodified hosts, and show that for a simulated 8,192 host data center, Hedera delivers bisection bandwidth that is 96% of optimal and up to 113% better than static load-balancing methods.
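The collision problem attributed to per-flow static hashing in the abstract above can be illustrated with a minimal sketch. This is not Hedera's scheduler or any switch's actual hash function: the `ecmp_path` and `path_loads` helpers, the SHA-256 hash, and the sample 5-tuples are all assumptions made for illustration. The point is only that the path choice depends on the flow's header fields and never on current load, so several long-lived flows can stay pinned to one path while others sit idle.

```python
import hashlib
from collections import Counter


def ecmp_path(flow, num_paths):
    """Map a flow 5-tuple to a path index with a static hash (hypothetical helper).

    The result depends only on the header fields, not on how loaded each path is.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_paths


def path_loads(flows, num_paths):
    """Count how many flows the static hash assigns to each equal-cost path."""
    counts = Counter(ecmp_path(flow, num_paths) for flow in flows)
    return [counts.get(path, 0) for path in range(num_paths)]


# Eight illustrative flows: (src IP, dst IP, protocol, src port, dst port).
flows = [("10.0.0.%d" % i, "10.0.1.%d" % i, 6, 40000 + i, 80) for i in range(8)]
loads = path_loads(flows, 4)
print(loads)  # any entry above 2 means flows collided while another path had room
```

Because the mapping is deterministic, rerunning the script gives the same assignment, which is why such collisions persist for the lifetime of the flows involved.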
HyperX: Topology, Routing, and Packaging of Efficient Large-Scale Networks
Cited by 7 (0 self)
In the push to achieve exascale performance, systems will grow to over 100,000 sockets, as growing cores-per-socket and improved single-core performance provide only part of the speedup needed. These systems will need affordable interconnect structures that scale to this level. To meet the need, we consider an extension of the hypercube and flattened butterfly topologies, the HyperX, and give an adaptive routing algorithm, DAL. HyperX takes advantage of high-radix switch components that integrated photonics will make available. Our main contributions include a formal descriptive framework, enabling a search method that finds optimal HyperX configurations; DAL; and a low-cost packaging strategy for an exascale HyperX. Simulations show that HyperX can provide performance as good as a folded Clos, with fewer switches. We also describe a HyperX packaging scheme that reduces system cost. Our analysis of efficiency, performance, and packaging demonstrates that the HyperX is a strong competitor for exascale networks.
Improving datacenter performance and robustness with multipath TCP
- In Proceedings of SIGCOMM, 2011
Cited by 6 (1 self)
The latest large-scale data centers offer higher aggregate bandwidth and robustness by creating multiple paths in the core of the network. Utilizing this bandwidth requires that different flows take different paths, which poses a challenge. In short, a single-path transport seems ill-suited to such networks. We propose using Multipath TCP as a replacement for TCP in such data centers, as it can effectively and seamlessly use available bandwidth, giving improved throughput and better fairness on many topologies. We investigate what causes these benefits, teasing apart the contribution of each of the mechanisms used by MPTCP. Using MPTCP lets us rethink data center networks, with a different mindset as to the relationship between transport protocols, routing and topology. MPTCP enables topologies that single-path TCP cannot utilize. As a proof-of-concept, we present a dual-homed variant of the FatTree topology. With MPTCP, this outperforms FatTree for a wide range of workloads, but costs the same. In existing data centers, MPTCP is readily deployable, leveraging widely deployed technologies such as ECMP. We have run MPTCP on Amazon EC2 and found that it outperforms TCP by a factor of three when there is path diversity. But the biggest benefits will come when data centers are designed for multipath transports.
LogGOPSim – Simulating Large-Scale Applications in the LogGOPS Model
Cited by 1 (0 self)
We introduce LogGOPSim, a fast simulation framework for parallel algorithms at large scale. LogGOPSim utilizes a slightly extended version of the well-known LogGPS model in combination with full MPI message matching semantics and detailed simulation of collective operations. In addition, it enables simulation in the traditional LogP, LogGP, and LogGPS models. Its simple and fast single-queue design computes more than 1 million events per second on a single processor and enables large-scale simulations of more than 8 million processes. LogGOPSim also supports the simulation of full MPI applications by reading and simulating MPI profiling traces. We analyze the accuracy and the performance of the simulation and propose a simple extrapolation scheme for parallel applications. Our scheme extrapolates collective operations with high accuracy by rebuilding the communication pattern. Point-to-point operation patterns can be copied in the extrapolation and thus retain the main characteristics of scalable parallel applications.
Cluster Challenge 2008: Optimizing Cluster Configuration and Applications to Maximize Power Efficiency
The goal of the Cluster Challenge is to design, build and operate a compute cluster. Although it is an artificial environment for cluster computing, many of its key constraints on operation of cluster systems are important to real-world scenarios: high energy efficiency, reliability and scalability. In this paper, we describe our approach to accomplish these goals. We present our original system design and illustrate changes to that system as well as to applications and system settings in order to achieve maximum performance within the given power and time limits. Finally, we suggest how our conclusions can be used to improve current and future clusters. About our team: The ClusterMeister team consisted of six undergraduate students, three of them
ORCS: An Oblivious Routing Congestion Simulator
- 2009
Bisection bandwidth, defined by Hennessy and Patterson [4] as the bandwidth between the two equal-sized halves of the network for the worst-case partition, is widely used as a theoretical model for network performance.
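The definition above can be made concrete with a brute-force sketch for very small networks. It is not taken from ORCS; the `bisection_bandwidth` function, the `(u, v, capacity)` link format, and the four-node ring are assumptions chosen for illustration. The function tries every split of the nodes into two equal halves and returns the smallest total capacity crossing the split, which is the worst-case partition named in the definition.

```python
from itertools import combinations


def bisection_bandwidth(nodes, links):
    """Minimum capacity crossing any split of the nodes into two equal halves.

    nodes: iterable of hashable node names (even count required).
    links: list of (u, v, capacity) tuples for undirected links.
    Exhaustive search, so only practical for a handful of nodes.
    """
    nodes = list(nodes)
    if len(nodes) < 2 or len(nodes) % 2 != 0:
        raise ValueError("need an even, nonzero number of nodes")
    half = len(nodes) // 2
    first, rest = nodes[0], nodes[1:]
    best = None
    # Pin the first node to one side so each partition is examined only once.
    for chosen in combinations(rest, half - 1):
        side = set(chosen) | {first}
        # A link crosses the cut when exactly one endpoint lies in `side`.
        cut = sum(cap for u, v, cap in links if (u in side) != (v in side))
        if best is None or cut < best:
            best = cut
    return best


# Four nodes in a ring, every link with capacity 1.
ring = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 0, 1)]
print(bisection_bandwidth(range(4), ring))  # 2
```

For the ring, splitting into adjacent pairs cuts two links while splitting into opposite pairs cuts all four, so the worst case is 2. The number of partitions grows combinatorially with the node count, so this approach is only a way to check the definition on toy topologies.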
Processor Affinity and MPI Performance on SMP-CMP Clusters
with multi-core Chip-Multiprocessors (CMP), also known as SMP-CMP clusters, are becoming ubiquitous today. For Message Passing Interface (MPI) programs, such clusters have a multilayer hierarchical communication structure: the performance of intra-node communication is usually higher than that of inter-node communication, and the performance of intra-node communication is not uniform, with communications between cores within a chip offering higher performance than communications between cores in different chips. As a result, the mapping from MPI processes to cores within each compute node, that is, processor affinity, may significantly affect the performance of intra-node communication, which in turn may impact the overall performance of MPI applications. In this work, we study the impacts of processor affinity on MPI performance in SMP-CMP clusters through extensive benchmarking and identify the conditions when processor affinity is (or is not) a major factor that affects performance. Keywords: processor affinity; MPI; SMP-CMP clusters
Optimized Routing for Large-Scale InfiniBand Networks
Point-to-point metrics, such as latency and bandwidth, are often used to characterize network performance with the consequent assumption that optimizing for these metrics is sufficient to improve parallel application performance. However, these metrics can only provide limited insight into application behavior because they do not fully account for effects, such as network congestion, that significantly influence overall network performance. Because many high-performance networks use deterministic oblivious routing, one such effect is the choice of routing algorithm. In this paper, we analyze and compare practical and theoretical aspects of different routing algorithms that are used in today’s large-scale networks. We show that widely-used theoretical metrics, such as edge-forwarding index or bisection bandwidth, are not accurate predictors for average network bandwidth. Instead, we introduce an intuitive metric, which we call “effective bisection bandwidth”, to characterize quality of different routing algorithms. We present a simple algorithm that globally balances routes and therefore improves the effective bandwidth of the network. Compared to the best algorithm in use today, our new algorithm shows an improvement in effective bisection bandwidth of 40% on a 724-endpoint InfiniBand cluster.
DARD: Distributed Adaptive Routing for Datacenter Networks
Datacenter networks typically have many paths connecting each host pair to achieve high bisection bandwidth for arbitrary communication patterns. Fully utilizing the bisection bandwidth may require flows between the same source and destination pair to take different paths to avoid hot spots. However, the existing routing protocols have little support for load-sensitive adaptive routing. This work proposes DARD, a Distributed Adaptive Routing architecture for Datacenter networks. DARD allows each end host to shift traffic from overloaded paths to underloaded ones without central coordination. We use an OpenFlow implementation and simulations to show that DARD can effectively use the network’s bisection bandwidth. It outperforms previous solutions based on random flow-level scheduling by 10%, and performs similarly to previous work that assigns flows to paths using a centralized scheduler but without its scaling limitation. We use competitive game theory to show that DARD’s flow scheduling algorithm is stable. It makes progress in every step and converges to a Nash equilibrium in finite steps. Our evaluation results suggest its gap to the optimal solution is likely to be small in practice.
unknown title
Folded-Clos networks, also referred to as fat-trees, have been widely used as interconnects in large scale high performance computing clusters. The switching capability of such interconnects in the computer communication environment, however, is not well understood. In particular, the concept of nonblocking interconnects, which is often used by system vendors, has only been studied in the telephone communication environment with the assumption of a centralized controller. Such “nonblocking” networks do not support nonblocking communications in computer communication environments where the network control is distributed. In this paper, we investigate folded-Clos networks that are nonblocking in computer communication environments and establish nonblocking conditions for various routing schemes including deterministic routing and adaptive routing.

