Results 1 - 4 of 4
Automatic Memory Optimizations for Improving MPI Derived Datatype Performance, in Proceedings of the 13th European PVM/MPI Users' Group Meeting (Euro PVM/MPI '06)
, 2006
"... Abstract. MPI derived datatypes allow users to describe noncontiguous memory layout and communicate noncontiguous data with a single communication function. This powerful feature enables an MPI implementation to optimize the transfer of noncontiguous data. In practice, however, many implementations ..."
Abstract. MPI derived datatypes allow users to describe noncontiguous memory layouts and communicate noncontiguous data with a single communication function. This powerful feature enables an MPI implementation to optimize the transfer of noncontiguous data. In practice, however, many implementations of MPI derived datatypes perform poorly, which makes application developers avoid using this feature. In this paper, we present a technique to automatically select templates that are optimized for memory performance based on the access pattern of derived datatypes. We implement this mechanism in the MPICH2 source code. The performance of our implementation is compared to well-written manual packing/unpacking routines and the original MPICH2 implementation. We show that performance for various derived datatypes is significantly improved and comparable to that of optimized manual routines.
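
To make the feature these papers discuss concrete, here is a minimal sketch (illustrative only, not the template-selection mechanism from the paper) that transfers a strided, noncontiguous array in two ways: with a hand-written pack loop and with an MPI_Type_vector derived datatype. The array shape and tags are arbitrary choices.

```c
/* Minimal sketch: sending a strided (noncontiguous) array either by manual
 * packing or by describing the layout with an MPI derived datatype.
 * Run with at least two ranks, e.g. mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdlib.h>

#define N 1024      /* number of strided elements to send           */
#define STRIDE 4    /* distance (in doubles) between those elements */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *data = malloc((size_t)N * STRIDE * sizeof(double));
    for (int i = 0; i < N * STRIDE; i++)
        data[i] = (double)i;

    if (rank == 0) {
        /* Variant 1: manual pack loop into a contiguous send buffer. */
        double *packed = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++)
            packed[i] = data[i * STRIDE];
        MPI_Send(packed, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        free(packed);

        /* Variant 2: describe the same layout with a derived datatype and
         * let the MPI library handle the (possibly zero-copy) packing. */
        MPI_Datatype strided;
        MPI_Type_vector(N, 1, STRIDE, MPI_DOUBLE, &strided);
        MPI_Type_commit(&strided);
        MPI_Send(data, 1, strided, 1, 1, MPI_COMM_WORLD);
        MPI_Type_free(&strided);
    } else if (rank == 1) {
        double *recvbuf = malloc(N * sizeof(double));
        MPI_Recv(recvbuf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(recvbuf, N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        free(recvbuf);
    }

    free(data);
    MPI_Finalize();
    return 0;
}
```

Whether the datatype path can match the manual loop is exactly what the paper's automatic memory optimizations inside MPICH2 aim to ensure.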
Micro-Applications for Communication Data Access Patterns and MPI Datatypes
"... Abstract. Data is often communicated from different locations in application memory and is commonly serialized (copied) to send buffers or from receive buffers. MPI datatypes are a way to avoid such intermediate copies and optimize communications, however, it is often unclear which implementation an ..."
Abstract. Data is often communicated from different locations in application memory and is commonly serialized (copied) to send buffers or from receive buffers. MPI datatypes are a way to avoid such intermediate copies and optimize communications; however, it is often unclear which implementation and optimization choices are most useful in practice. We extracted the send/recv-buffer access patterns of a representative set of scientific applications into micro-applications that isolate their data access patterns. We also observed that the buffer-access patterns in applications can be categorized into three different groups. Our micro-applications show that up to 90% of the total communication time can be spent on local serialization, and we found significant performance discrepancies between state-of-the-art MPI implementations. Our micro-applications aim to provide a standard benchmark for MPI datatype implementations to guide optimizations, much as SPEC CPU and the Livermore loops do for compiler optimizations.
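
As one hypothetical example of such a buffer-access pattern (the indices below are invented, not taken from the paper's micro-applications), an irregular gather of mesh cells can be described with MPI_Type_indexed so that a single send covers all cells without an explicit staging copy in user code:

```c
/* Sketch: describing an irregular, index-driven buffer access pattern
 * (e.g., gathering selected mesh cells) with MPI_Type_indexed instead of
 * copying the cells into a contiguous staging buffer by hand.
 * The cell indices are made up for illustration; run with two or more ranks. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    enum { NCELLS = 5, CELLSIZE = 3 };   /* 5 cells of 3 doubles each */
    double mesh[100] = {0};

    /* Hypothetical list of cells this rank exchanges, in array order. */
    int blocklens[NCELLS]     = { CELLSIZE, CELLSIZE, CELLSIZE, CELLSIZE, CELLSIZE };
    int displacements[NCELLS] = { 0, 9, 21, 30, 57 };   /* offsets in doubles */

    MPI_Datatype gather_type;
    MPI_Type_indexed(NCELLS, blocklens, displacements, MPI_DOUBLE, &gather_type);
    MPI_Type_commit(&gather_type);

    /* One send now moves all five cells with no intermediate copy in user
     * code; whether the MPI library copies internally is what benchmarks
     * of datatype implementations try to measure. */
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Send(mesh, 1, gather_type, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(mesh, 1, gather_type, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&gather_type);
    MPI_Finalize();
    return 0;
}
```

How efficiently an MPI library processes such an indexed layout internally is precisely what these micro-applications are designed to expose.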
Application-oriented ping-pong benchmarking: how to assess the real communication overheads (DOI 10.1007/s00607-013-0330-4)
, 2012
"... Abstract Moving data between processes has often been discussed as one of the major bottlenecks in parallel computing—there is a large body of research, striving to improve communication latency and bandwidth on different networks, measured with ping-pong benchmarks of different message sizes. In pr ..."
Abstract. Moving data between processes has often been discussed as one of the major bottlenecks in parallel computing; there is a large body of research striving to improve communication latency and bandwidth on different networks, measured with ping-pong benchmarks of different message sizes. In practice, the data to be communicated generally originates from application data structures and needs to be serialized before being communicated over serial network channels. This serialization is often done by explicitly copying the data to communication buffers. The Message Passing Interface (MPI) standard defines derived datatypes to allow zero-copy formulations of non-contiguous data access patterns. However, many applications still choose to implement manual pack/unpack loops, partly because they are more efficient than some MPI implementations. MPI implementers, on the other hand, do not have good benchmarks that represent important application access patterns. We demonstrate that data serialization can consume up to 80% of the total communication overhead for important applications. This indicates that most of the current research on optimizing serial network transfer times may be targeted at the smaller fraction of the communication overhead. To support the scientific community, we extracted the send/recv-buffer access patterns of a representative set of scientific applications to build a benchmark that includes the serialization and communication of application data and thus reflects all communication overheads. It can be used like traditional ping-pong benchmarks to determine the holistic communication latency and bandwidth.
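
In the spirit of such an application-oriented ping-pong, the sketch below (our own illustration, not the paper's benchmark; message shape and iteration count are arbitrary) times the same strided exchange between two ranks twice: once through an explicit pack/unpack path and once through a derived datatype, which makes the serialization share of the total time visible.

```c
/* Sketch of an application-oriented ping-pong: time a strided transfer
 * (a) with explicit pack/unpack copies and (b) with an MPI derived datatype.
 * Run with exactly two ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 4096
#define STRIDE 8
#define ITERS 1000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) { MPI_Finalize(); return 1; }   /* ping-pong needs two ranks */
    int peer = 1 - rank;

    double *data   = calloc((size_t)N * STRIDE, sizeof(double));
    double *packed = malloc(N * sizeof(double));

    MPI_Datatype strided;
    MPI_Type_vector(N, 1, STRIDE, MPI_DOUBLE, &strided);
    MPI_Type_commit(&strided);

    /* (a) manual pack -> send -> recv -> unpack ping-pong */
    double t0 = MPI_Wtime();
    for (int it = 0; it < ITERS; it++) {
        if (rank == 0) {
            for (int i = 0; i < N; i++) packed[i] = data[i * STRIDE];
            MPI_Send(packed, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(packed, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            for (int i = 0; i < N; i++) data[i * STRIDE] = packed[i];
        } else {
            MPI_Recv(packed, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            for (int i = 0; i < N; i++) data[i * STRIDE] = packed[i];
            for (int i = 0; i < N; i++) packed[i] = data[i * STRIDE];
            MPI_Send(packed, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
        }
    }
    double t_manual = MPI_Wtime() - t0;

    /* (b) the same exchange expressed with the derived datatype */
    t0 = MPI_Wtime();
    for (int it = 0; it < ITERS; it++) {
        if (rank == 0) {
            MPI_Send(data, 1, strided, peer, 1, MPI_COMM_WORLD);
            MPI_Recv(data, 1, strided, peer, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(data, 1, strided, peer, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(data, 1, strided, peer, 1, MPI_COMM_WORLD);
        }
    }
    double t_dtype = MPI_Wtime() - t0;

    if (rank == 0)
        printf("manual pack: %.3f s   derived datatype: %.3f s\n", t_manual, t_dtype);

    MPI_Type_free(&strided);
    free(packed);
    free(data);
    MPI_Finalize();
    return 0;
}
```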
RAMS: An RDMA-enabled I/O Cache Architecture for Clustered Network Servers
, 2004
"... Abstract: Previous studies show that intra-cluster communication easily becomes a major performance bottleneck for a wide range of small write-sharing workloads especially read-only workloads in modern clustered network servers. A Remote Direct Memory Access (RDMA) technique has been recommended by ..."
Abstract: Previous studies show that intra-cluster communication easily becomes a major performance bottleneck for a wide range of small write-sharing workloads, especially read-only workloads, in modern clustered network servers. Remote Direct Memory Access (RDMA) has been recommended by many researchers to address this problem, but understanding how to use RDMA well is still in its infancy. This paper proposes a novel solution to boost intra-cluster communication performance: an RDMA-enabled collaborative I/O cache architecture called RAMS, which caches the most recently used RDMA-based intra-cluster data transfers for future reuse. RAMS makes two major contributions to facilitate RDMA deployment: 1) it designs a novel RDMA-based user-level buffer cache architecture that caches both intra-cluster transferred data and data references; 2) it develops three propagated-update protocols to address the RDMA read-failure problem. Comprehensive experimental results show that the three proposed update protocols of RAMS reduce the RDMA read-failure rate by 75% and indirectly boost system throughput by more than 50%, compared with a baseline system using Remote Procedure Calls (RPC).
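
The paper itself provides no code; purely as a hypothetical illustration of the user-level cache layer described above, the sketch below keys locally cached blocks by (remote node, remote offset) and evicts with a simple LRU policy. The RDMA verbs, buffer registration, and RAMS's actual update protocols are deliberately omitted.

```c
/* Hypothetical sketch of a user-level cache for remotely fetched buffers,
 * keyed by (remote node, remote offset).  Illustrates the caching idea only;
 * RAMS itself, its update protocols, and the RDMA registration calls are
 * not reproduced here. */
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 64
#define BLOCK_SIZE  4096

struct cache_entry {
    int      valid;
    int      remote_node;      /* which server the block came from   */
    uint64_t remote_offset;    /* offset of the block on that server */
    uint64_t last_used;        /* logical clock for LRU eviction     */
    char     data[BLOCK_SIZE]; /* locally cached copy of the block   */
};

static struct cache_entry cache[CACHE_SLOTS];
static uint64_t clock_ticks;

/* Look up a cached block; returns NULL on a miss. */
char *cache_lookup(int node, uint64_t offset)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].valid && cache[i].remote_node == node &&
            cache[i].remote_offset == offset) {
            cache[i].last_used = ++clock_ticks;
            return cache[i].data;
        }
    }
    return NULL;
}

/* Insert a freshly transferred block, evicting the least recently used slot. */
void cache_insert(int node, uint64_t offset, const char *block)
{
    int victim = 0;
    for (int i = 1; i < CACHE_SLOTS; i++)
        if (!cache[i].valid || cache[i].last_used < cache[victim].last_used)
            victim = i;

    cache[victim].valid         = 1;
    cache[victim].remote_node   = node;
    cache[victim].remote_offset = offset;
    cache[victim].last_used     = ++clock_ticks;
    memcpy(cache[victim].data, block, BLOCK_SIZE);
}
```

In a real deployment each cached data region would additionally be registered with the NIC so that peers could fill or invalidate it directly via RDMA operations.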