Results 1 - 10 of 19
PVFS over InfiniBand: Design and Performance Evaluation
In the 2003 International Conference on Parallel Processing (ICPP '03), 2003
"... I/O is quickly emerging as the main bottleneck limiting performance in modern day clusters. The need for scalable parallel I/O and file systems is becoming more and more urgent. In this paper, we examine the feasibility of leveraging InfiniBand technology to improve I/O performance and scalability o ..."
Cited by 30 (13 self)
I/O is quickly emerging as the main bottleneck limiting performance in modern-day clusters. The need for scalable parallel I/O and file systems is becoming more and more urgent. In this paper, we examine the feasibility of leveraging InfiniBand technology to improve I/O performance and scalability of cluster file systems. We use Parallel Virtual File System (PVFS) as a basis for exploring these features.
Performance Comparison of MPI Implementations over InfiniBand, Myrinet and Quadrics
In Proceedings of the Int'l Conference on Supercomputing (SC '03), 2003
"... comparison of MPI implementations over InfiniBand, Myrinet and Quadrics. Our performance evaluation consists of two major parts. The first part consists of a set of MPI level micro-benchmarks that characterize different aspects of MPI implementations. The second part of the performance evaluation co ..."
Cited by 28 (0 self)
This paper presents a performance comparison of MPI implementations over InfiniBand, Myrinet and Quadrics. Our performance evaluation consists of two major parts. The first part consists of a set of MPI-level micro-benchmarks that characterize different aspects of MPI implementations. The second part consists of application-level benchmarks. We have used the NAS Parallel Benchmarks and the sweep3D benchmark. We not only present the overall performance results, but also relate application communication characteristics to the information we acquired from the micro-benchmarks. Our results show that the three MPI implementations all have their advantages and disadvantages. For our 8-node cluster, InfiniBand can offer significant performance improvements for a number of applications compared with Myrinet and Quadrics when using the PCI-X bus. Even with just the PCI bus, InfiniBand can still perform better if the applications are bandwidth-bound.
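The latency test in such MPI-level micro-benchmark suites is typically a simple ping-pong loop between two processes. The sketch below (plain MPI C, run with exactly two ranks) shows the general shape of such a test; the message size and iteration count are illustrative choices, not settings taken from the paper.

/* Minimal MPI ping-pong latency micro-benchmark (illustrative sketch;
 * run with exactly two ranks). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000, size = 4;      /* 4-byte messages */
    int rank;
    char *buf;
    double start, elapsed;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(size);

    MPI_Barrier(MPI_COMM_WORLD);
    start = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    elapsed = MPI_Wtime() - start;
    if (rank == 0)
        printf("average one-way latency: %.2f us\n",
               elapsed / (2.0 * iters) * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}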
Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store
In Proceedings of the USENIX Annual Technical Conference (ATC), 2013
"... Recent technological trends indicate that future datacen-ter networks will incorporate High Performance Com-puting network features, such as ultra-low latency and CPU bypassing. How can these features be exploited in datacenter-scale systems infrastructure? In this pa-per, we explore the design of a ..."
Cited by 27 (2 self)
Recent technological trends indicate that future datacenter networks will incorporate High Performance Computing network features, such as ultra-low latency and CPU bypassing. How can these features be exploited in datacenter-scale systems infrastructure? In this paper, we explore the design of a distributed in-memory key-value store called Pilaf that takes advantage of Remote Direct Memory Access to achieve high performance with low CPU overhead. In Pilaf, clients directly read from the server's memory via RDMA to perform gets, which commonly dominate key-value store workloads. By contrast, put operations are serviced by the server to simplify the task of synchronizing memory accesses. To detect inconsistent RDMA reads with concurrent CPU memory modifications, we introduce the notion of self-verifying data structures that can detect read-write races without client-server coordination. Our experiments show that Pilaf achieves low latency and high throughput while consuming few CPU resources. Specifically, Pilaf can surpass 1.3 million ops/sec (90% gets) using a single CPU core compared with 55K for Memcached and 59K for Redis.
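As a rough illustration of the self-verifying idea: each entry carries a checksum over its contents, so a client that fetched the entry with a one-sided read can detect that a server-side write raced with the read and simply retry. The entry layout, the hash function and the rdma_read() stand-in below are assumptions made for illustration; they are not Pilaf's actual data structures or verbs code.

/* Sketch of a self-verifying key-value entry: a checksum stored with the
 * entry lets an RDMA reader detect a concurrent write without any
 * client-server coordination.  Purely illustrative. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define VAL_LEN 64

struct kv_entry {
    uint64_t checksum;               /* covers key and value */
    char     key[16];
    char     value[VAL_LEN];
};

static uint64_t fnv1a(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 1469598103934665603ULL;
    while (len--) { h ^= *p++; h *= 1099511628211ULL; }
    return h;
}

static uint64_t entry_checksum(const struct kv_entry *e)
{
    return fnv1a(e->key, sizeof e->key) ^ fnv1a(e->value, sizeof e->value);
}

/* Stand-in for a one-sided RDMA read of the entry from server memory. */
static void rdma_read(struct kv_entry *dst, const struct kv_entry *remote)
{
    memcpy(dst, remote, sizeof *dst);
}

/* Client-side get: re-read until the checksum matches, i.e. until no
 * server-side write raced with the read. */
static int self_verifying_get(const struct kv_entry *remote, struct kv_entry *out)
{
    for (int attempt = 0; attempt < 8; attempt++) {
        rdma_read(out, remote);
        if (entry_checksum(out) == out->checksum)
            return 0;                /* consistent snapshot */
    }
    return -1;                       /* persistent race: fall back to a server RPC */
}

int main(void)
{
    struct kv_entry server = { 0 }, copy;
    strcpy(server.key, "answer");
    strcpy(server.value, "42");
    server.checksum = entry_checksum(&server);
    if (self_verifying_get(&server, &copy) == 0)
        printf("get %s -> %s\n", copy.key, copy.value);
    return 0;
}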
Fast and Scalable Barrier Using RDMA and Multicast Mechanisms for InfiniBand-Based Clusters
In EuroPVM/MPI, 2003
"... This paper describes a methodology for efficiently implementing the collective operations, in this case the barrier, on clusters with the emerging InfiniBand Architecture (IBA). IBA provides hardware level support for the Remote Direct Memory Access (RDMA) message passing model as well as the multic ..."
Cited by 24 (7 self)
This paper describes a methodology for efficiently implementing collective operations, in this case the barrier, on clusters with the emerging InfiniBand Architecture (IBA). IBA provides hardware-level support for the Remote Direct Memory Access (RDMA) message-passing model as well as the multicast operation. Exploiting these features of InfiniBand to efficiently implement the barrier operation is a challenge in itself. This paper describes the design, implementation and evaluation of three barrier algorithms that leverage these mechanisms. Performance evaluation studies indicate that considerable benefits can be achieved using these mechanisms compared to the traditional implementation based on the point-to-point message-passing model. Our experimental results show a performance benefit of up to 1.29 times for a 16-node barrier and up to 1.71 times for non-power-of-2 group sizes. Each proposed algorithm performs best for certain ranges of group sizes, and the optimal algorithm can be chosen based on this range. To the best of our knowledge, this is the first attempt to characterize multicast performance in IBA and to demonstrate the benefits of combining it with RDMA operations for efficient barrier implementations.
Efficient SMP-Aware MPI-Level Broadcast over InfiniBand's Hardware Multicast
In Communication Architecture for Clusters Workshop, Proceedings of IPDPS, 2006
"... Most of the high-end computing clusters found today feature multi-way SMP nodes interconnected by an ultra-low latency and high bandwidth network. InfiniBand is emerging as a high-speed network for such systems. InfiniBand provides a scalable and efficient hardware multicast primitive to efficiently ..."
Cited by 11 (0 self)
Most of the high-end computing clusters found today feature multi-way SMP nodes interconnected by an ultra-low latency and high bandwidth network. InfiniBand is emerging as a high-speed network for such systems. InfiniBand provides a scalable and efficient hardware multicast primitive that can be used to implement many MPI collective operations. However, employing hardware multicast as the communication method may not perform well in all cases, especially when more than one process is running per node. In this context, the shared memory channel becomes the desired communication medium within the node, as it delivers latencies that are an order of magnitude lower than inter-node message latencies. Thus, to deliver optimal collective performance, coupling hardware multicast with the shared memory channel becomes necessary. In this paper we propose mechanisms to address this issue. On a 16-node 2-way SMP cluster, the Leader-based scheme proposed in this paper improves the performance of the MPI_Bcast operation by factors of as much as 2.3 and 1.8 compared to the point-to-point solution and the original solution employing only hardware multicast, respectively. We have also evaluated our designs on a NUMA-based system and obtained a performance improvement of 1.7 on a 2-node 4-way system. We also propose a Dynamic Attach Policy as an enhancement to this scheme to mitigate the impact of process skew on the performance of the collective operation.
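In MPI terms, the leader-based scheme can be pictured roughly as follows: one process per node acts as a leader, receives the payload from the other leaders (the paper uses InfiniBand hardware multicast for that step; a plain inter-leader MPI_Bcast stands in for it here), and then forwards it over the node's shared-memory channel. This is a conceptual sketch, not the paper's implementation, and it assumes the broadcast root is rank 0 and that rank 0 is a node leader.

/* Leader-based broadcast sketch: inter-node delivery to one leader per
 * node, then intra-node delivery over a shared-memory communicator.
 * Assumes root == 0 and that rank 0 is node-local rank 0. */
#include <mpi.h>

void leader_based_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm;
    int rank, node_rank;

    MPI_Comm_rank(comm, &rank);

    /* Ranks that share a node (and hence shared memory). */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* One leader (node-local rank 0) per node. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, rank, &leader_comm);

    /* Step 1: inter-node delivery to the leaders (hardware multicast in
     * the paper; MPI_Bcast among leaders is used as a stand-in here). */
    if (node_rank == 0)
        MPI_Bcast(buf, count, type, 0, leader_comm);

    /* Step 2: intra-node delivery over the shared-memory channel. */
    MPI_Bcast(buf, count, type, 0, node_comm);

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}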
Toward Message Passing for a Million Processes: Characterizing MPI on a Massive-Scale Blue Gene/P
Computer Science - Research and Development, 2009
"... High-end computing (HEC) systems have passed the petaflop barrier and continue to move toward the next frontier of exascale computing. Systems with hundreds of thousands of cores are already available and upcoming exascale capable systems are expected to comprise more than a million processing eleme ..."
Cited by 11 (4 self)
High-end computing (HEC) systems have passed the petaflop barrier and continue to move toward the next frontier of exascale computing. Systems with hundreds of thousands of cores are already available, and upcoming exascale-capable systems are expected to comprise more than a million processing elements. As companies and research institutes continue to work toward architecting these enormous systems, it is becoming increasingly clear that such systems will utilize a significant amount of shared hardware between processing units, including shared caches, memory management engines and network infrastructure. Thus, understanding how effective the current message passing and communication infrastructure is in tying these processing elements together is critical to making educated guesses about what we can expect from such future machines. In this paper, we characterize the communication performance of the message passing interface (MPI) implementation on 32 racks (131,072 cores) of the largest Blue Gene/P (BG/P) system in the world (80% of the total system size). Our studies show various interesting insights into the communication characteristics of MPI on the BG/P.
Implementing Efficient and Scalable Flow Control Schemes in MPI over InfiniBand
"... In this paper, we present a detailed study of how to design efficient and scalable flow control mechanisms in MPI over the InfiniBand Architecture. Two of the central issues in flow control are performance and scalability in terms of buffer usage. We propose three different flow control schemes (har ..."
Cited by 8 (2 self)
In this paper, we present a detailed study of how to design efficient and scalable flow control mechanisms in MPI over the InfiniBand Architecture. Two of the central issues in flow control are performance and scalability in terms of buffer usage. We propose three different flow control schemes (hardware-based, user-level static and user-level dynamic) and describe their respective design issues. We have implemented all three schemes in our MPI implementation over InfiniBand and conducted a performance evaluation using both micro-benchmarks and the NAS Parallel Benchmarks. Our performance analysis shows that in our testbed, most NAS applications require only a very small number of pre-posted buffers per connection to achieve good performance. We also show that the user-level dynamic scheme can achieve both performance and buffer efficiency by adapting itself to the application's communication pattern. These results have significant impact on designing large-scale clusters (on the order of 1,000 to 10,000 nodes) with InfiniBand.
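The user-level schemes can be pictured as a small per-connection credit counter: credits correspond to the buffers the receiver has pre-posted, a send consumes one, and credits returned by the receiver (for example, piggybacked on incoming messages) replenish the pool and release queued sends. The sketch below shows only that bookkeeping; the names and queueing policy are assumptions for illustration, not the paper's code.

/* Sender-side bookkeeping for a credit-based flow control scheme
 * (illustrative sketch only). */
struct connection {
    int credits;   /* sends we may post without overflowing the receiver */
    int pending;   /* messages queued while waiting for credits */
};

/* Called when the application wants to send: returns 1 if the message may
 * be posted now, 0 if it must be queued until credits come back. */
int fc_try_send(struct connection *c)
{
    if (c->credits > 0) {
        c->credits--;
        return 1;
    }
    c->pending++;
    return 0;
}

/* Called when the receiver grants credits back (piggybacked or explicit);
 * returns how many queued messages are now eligible to be posted. */
int fc_credits_returned(struct connection *c, int n)
{
    int released = 0;
    c->credits += n;
    while (c->pending > 0 && c->credits > 0) {
        c->credits--;
        c->pending--;
        released++;
    }
    return released;
}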
Scalable Startup of Parallel Programs over InfiniBand
2004
"... Fast and scalable process startup is one of the major challenges in parallel computing over large scale clusters. The startup of a parallel job typically can be divided into two phases: process initiation and connection setup. Both of these phases can become performance bottlenecks. In this paper, w ..."
Cited by 4 (0 self)
Fast and scalable process startup is one of the major challenges in parallel computing over large-scale clusters. The startup of a parallel job can typically be divided into two phases: process initiation and connection setup. Both of these phases can become performance bottlenecks. In this paper, we characterize the startup of MPI programs in InfiniBand clusters and identify two startup scalability issues: serialized process initiation in the initiation phase and high communication overhead in the connection setup phase. We propose different approaches to reduce communication overhead and provide fast process initiation. Specifically, to reduce the connection setup time, we have developed one approach with data reassembly to reduce data volume and another with a bootstrap channel to parallelize the communication. Furthermore, we have exploited a process management framework, the Multi-purpose Daemon (MPD) system, to speed up the process initiation phase. The bootstrap channel is utilized to overcome the scalability limitations of MPD. Our experimental results show that job startup time is improved by more than 4 times for 128-process jobs in an InfiniBand cluster. Scalability models derived from these results suggest that the improvement can be more than two orders of magnitude for the startup of 2048-process jobs.
Supporting MPI-2 One-Sided Communication on Multi-Rail InfiniBand Clusters: Design Challenges and Performance Benefits
"... Abstract. In cluster computing, InfiniBand has emerged as a popular high performance interconnect with MPI as the de facto programming model. However, even with InfiniBand, bandwidth can become a bottleneck for clusters executing communication intensive applications. Multi-rail cluster configuration ..."
Cited by 3 (2 self)
In cluster computing, InfiniBand has emerged as a popular high performance interconnect with MPI as the de facto programming model. However, even with InfiniBand, bandwidth can become a bottleneck for clusters executing communication-intensive applications. Multi-rail cluster configurations with MPI-1 are being proposed to alleviate this problem. Recently, MPI-2 with support for one-sided communication is gaining significance. In this paper, we take on the challenge of designing high performance MPI-2 one-sided communication on multi-rail InfiniBand clusters. We propose a unified MPI-2 design for different configurations of multi-rail networks (multiple ports, multiple HCAs and combinations). We present various issues associated with one-sided communication, such as multiple synchronization messages, scheduling of RDMA (Read, Write) operations and ordering relaxation, and discuss their implications for our design. Our performance results show that multi-rail networks can significantly improve MPI-2 one-sided communication performance. Using PCI-Express with two ports, we can achieve a peak MPI_Put bidirectional bandwidth of 2620 MB/s, compared to 1910 MB/s for the single-rail implementation. For PCI-X with two HCAs, we can almost double the throughput and reduce the latency to half for large messages.
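One of the scheduling questions mentioned above is how to spread a single large put over several rails. A simple policy is to stripe the transfer in fixed-size chunks, round-robin across rails. The sketch below only records how many bytes land on each rail: post_rdma_write() is a stand-in for the per-rail InfiniBand verbs call, and the rail count and chunk size are arbitrary illustrative choices, not the paper's policy.

/* Striping a large one-sided transfer across multiple rails
 * (ports/HCAs), chunk by chunk.  Illustrative sketch only. */
#include <stdio.h>
#include <stddef.h>

#define NUM_RAILS 2
#define CHUNK     (64 * 1024)          /* striping unit; arbitrary */

static size_t bytes_on_rail[NUM_RAILS];

/* Stand-in for posting an RDMA write on one rail; a real implementation
 * would issue the per-rail verbs call here. */
static void post_rdma_write(int rail, size_t offset, size_t len)
{
    bytes_on_rail[rail] += len;
    printf("rail %d: write %zu bytes at offset %zu\n", rail, len, offset);
}

/* Stripe an MPI_Put-style transfer of total_len bytes across all rails. */
static void multirail_put(size_t total_len)
{
    size_t offset = 0;
    int rail = 0;
    while (offset < total_len) {
        size_t len = total_len - offset < CHUNK ? total_len - offset : CHUNK;
        post_rdma_write(rail, offset, len);
        rail = (rail + 1) % NUM_RAILS;
        offset += len;
    }
}

int main(void)
{
    multirail_put(1 << 20);            /* 1 MB transfer */
    for (int r = 0; r < NUM_RAILS; r++)
        printf("total on rail %d: %zu bytes\n", r, bytes_on_rail[r]);
    return 0;
}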
coNCePTuaL: A Network Correctness and Performance Testing Language
In Proceedings of the 18th International Parallel and Distributed Processing Symposium, IEEE, 2004
"... This paper introduces a new, domain-specific specification language called CONCEPTUAL. CONCEPTUAL enables the expression of sophisticated communication benchmarks and network validation tests in comparatively few lines of code. Besides helping programmers save time writing and debugging code, CONCEP ..."
Cited by 2 (0 self)
This paper introduces a new, domain-specific specification language called coNCePTuaL. coNCePTuaL enables the expression of sophisticated communication benchmarks and network validation tests in comparatively few lines of code. Besides helping programmers save time writing and debugging code, coNCePTuaL addresses the important, but largely unrecognized, problem of benchmark opacity. Benchmark opacity refers to the current impracticality of presenting performance measurements in a manner that promotes reproducibility and independent evaluation of the results. For example, stating that a performance graph was produced by a "bandwidth" test says nothing about whether that test measures the data rate during a round-trip transmission or the average data rate over a number of back-to-back unidirectional messages; whether the benchmark preregisters buffers, sends warm-up messages, and/or pre-posts asynchronous receives before starting the clock; how many runs were performed and whether these were aggregated by taking the mean, median, or maximum; or even whether a data unit such as "MB/s" indicates 10^6 bytes per second.
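To make the opacity point concrete, the two loops below both report a number labelled "MB/s" for the same pair of processes, yet measure different things: the first times a round-trip ping-pong, the second a run of back-to-back unidirectional sends. This is a generic two-process MPI sketch (run with two ranks), not coNCePTuaL code; the message size and iteration count are illustrative.

/* Two "bandwidth" tests that a bare MB/s figure cannot distinguish. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG   (1 << 20)   /* 1 MB messages */
#define ITERS 100

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(MSG);
    double t;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Variant 1: round-trip (ping-pong) bandwidth. */
    MPI_Barrier(MPI_COMM_WORLD);
    t = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("round-trip:     %.1f MB/s\n",
               2.0 * ITERS * MSG / (MPI_Wtime() - t) / 1e6);

    /* Variant 2: back-to-back unidirectional bandwidth. */
    MPI_Barrier(MPI_COMM_WORLD);
    t = MPI_Wtime();
    if (rank == 0) {
        for (int i = 0; i < ITERS; i++)
            MPI_Send(buf, MSG, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);  /* ack */
        printf("unidirectional: %.1f MB/s\n",
               (double)ITERS * MSG / (MPI_Wtime() - t) / 1e6);
    } else if (rank == 1) {
        for (int i = 0; i < ITERS; i++)
            MPI_Recv(buf, MSG, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);                     /* ack */
    }

    free(buf);
    MPI_Finalize();
    return 0;
}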