Results 1 - 8 of 8
The Implementation of MPI-2 One-Sided Communication for the NEC SX-5
In Proceedings of Supercomputing, 2000
Cited by 35 (3 self)
We describe the MPI/SX implementation of the MPI-2 standard for one-sided communication (Remote Memory Access) for the NEC SX-5 vector supercomputer. MPI/SX is a non-threaded implementation of the full MPI-2 standard. Essential features of the implementation are presented, including the synchronization mechanisms, the handling of communication windows in global shared and in process local memory, as well as the handling of MPI derived datatypes. In comparative benchmarks the data transfer operations for one-sided communication and point-to-point message passing show very similar performance, both when data reside in global shared and when in process local memory. Derived datatypes, which are of particular importance for applications using one-sided communications, impose only a modest overhead and can be used without any significant loss of performance. Thus, the MPI/SX programmer can freely choose either the message passing or the one-sided communication model, whichever i...
High performance MPI-2 one-sided communication over InfiniBand
In Proceedings of 4th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid), 2004
Cited by 9 (4 self)
Many existing MPI-2 one-sided communication implementations are built on top of MPI send/receive operations. Although this approach achieves good portability, it suffers from high communication overhead and depends on the remote process for communication progress. To address these problems, we propose a high-performance MPI-2 one-sided communication design over the InfiniBand Architecture. In our design, MPI-2 one-sided communication operations such as MPI_Put, MPI_Get and MPI_Accumulate are directly mapped to InfiniBand Remote Direct Memory Access (RDMA) operations. Our design has been implemented based on MPICH2 over InfiniBand. We present detailed design issues for this approach and perform a set of micro-benchmarks to characterize different aspects of its performance. Our performance evaluation shows that, compared with the design based on MPI send/receive, our design can improve throughput by up to 77% and reduce latency and synchronization overhead by up to 19% and 13%, respectively. Under certain process skew, the negative impact can be significantly reduced by the new design, from 41% to nearly 0%. It can also achieve better overlap of communication and computation.
Designing Passive Synchronization for MPI-2 One-Sided Communication to Maximize Overlap
Cited by 3 (3 self)
Scientific computing has seen immense growth in recent years. MPI has become the de facto standard parallel programming model for distributed memory systems. The MPI-2 standard also introduced the one-sided programming model. Computation and communication overlap is an important goal for one-sided applications. While the passive synchronization mechanism for MPI-2 one-sided communication allows for good overlap, the actual overlap achieved is often limited by the design of both the MPI library and the application. In this paper we aim to improve the performance of MPI-2 one-sided communication. In particular, we focus on the following important aspects: (i) designing one-sided passive synchronization (Direct Passive) support using InfiniBand atomic operations to handle both exclusive and shared locks; (ii) enhancing one-sided communication progress to provide scope for better overlap that one-sided applications can leverage; (iii) studying the overlap potential of passive synchronization and its impact on applications. We demonstrate the possible benefits of our approaches for the MPI-2 SPLASH LU application benchmark. Our results show an improvement of up to 87% for a 64-process run over the existing design.
Design Alternatives for Implementing Fence Synchronization in MPI-2 One-sided Communication for InfiniBand Clusters
Cited by 1 (0 self)
Scientific computing has seen immense growth in recent years. The Message Passing Interface (MPI) has become the de facto standard parallel programming model for distributed memory systems. As the system scale increases, application writers often try to increase the overlap of computation and communication. The MPI-2 standard expanded MPI to include one-sided communication semantics that have the potential for overlapping computation with communication. In this model, synchronization between processes needs to be done explicitly to ensure completion before using the data. Fence is one of the mechanisms for providing such synchronization in the one-sided model. In this paper, we study a set of alternatives for designing the fence synchronization mechanism. We analyze the various trade-offs of these designs on networks like InfiniBand that provide Remote Direct Memory Access (RDMA) capabilities. We propose a novel design for implementing fence synchronization that uses the RDMA write with Immediate mechanism (Fence-Imm-RI) provided by InfiniBand networks. We then characterize the performance of the different designs with various one-sided communication pattern microbenchmarks for both latency and overlap capability. The new Fence-Imm-RI scheme performs best in scenarios that require low synchronization overhead as well as good overlap capability (close to 90% overlap for large messages), as opposed to the other designs, which can provide either low synchronization overhead or good overlap capability.
MPI-2 One-sided Usage and Implementation for Read Modify Write Operations: A Case Study with HPCC
MPI-2’s one-sided communication interface is being explored in scientific applications. One of the important operations in a one-sided model is read-modify-write. MPI-2 semantics provide MPI_Put, MPI_Get and MPI_Accumulate operations, which can be used to implement read-modify-write functionality. The different strategies yield varying performance benefits depending on the underlying one-sided implementation. In this paper, we use the HPCC Random Access benchmark, which primarily uses read-modify-write operations, as a case study for evaluating the different implementation strategies. Currently this benchmark is implemented based on MPI two-sided semantics. In this work we design and evaluate MPI-2 versions of the HPCC Random Access benchmark using one-sided operations. To improve performance, we explore two different optimizations: (i) software-based aggregation and (ii) hardware-based atomic operations. We implement aggregation techniques using MPI_Accumulate with datatypes to improve the performance of the one-sided implementation. To study the impact of hardware capabilities provided by modern interconnects, we implement a prototype of Accumulate for MPI_SUM (Direct Accumulate) using InfiniBand’s atomic fetch-and-add operation. We evaluate our different approaches on an InfiniBand cluster. The software-based aggregation outperforms the basic one-sided scheme without aggregation by a factor of 4.38. The hardware-based scheme shows an improvement by a factor of 2.62 compared to the basic one-sided scheme. Our study shows that the software-based aggregation performs best. We also demonstrate the potential and scalability of the hardware-based approach.
Keywords: MPI-2, One-sided, HPCC, Accumulate, InfiniBand
Ohio State University,
As high-end computing systems continue to grow in scale, the performance that applications can achieve on such large-scale systems depends heavily on their ability to avoid explicitly synchronized communication with other processes in the system. Accordingly, several modern and legacy parallel programming models (such as MPI, UPC, Global Arrays) have provided many programming constructs that enable implicit communication using one-sided communication operations. While MPI is the most widely used communication model for scientific computing, the usage of one-sided communication is restricted; this is mainly owing to the inefficiencies in current MPI implementations that internally rely on synchronization between processes even during one-sided communication, thus losing the potential of such constructs. In our previous work, we utilized native one-sided communication primitives offered by high-speed networks such as InfiniBand (IB) to allow for true one-sided communication in MPI. In this paper, we extend this work to natively take advantage of one-sided atomic operations on cache-coherent multi-core/multi-processor architectures while still utilizing the benefits of networks such as IB. Specifically, we present a sophisticated hybrid design that uses locks that migrate between IB hardware atomics and multi-core CPU atomics to take advantage of both. We demonstrate the capability of our proposed design with a wide range of experiments illustrating its benefits in performance as well as its potential to avoid explicit synchronization.
Natively Supporting True One-sided Communication in MPI on Multi-core Systems with InfiniBand
In 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
As high-end computing systems continue to grow in scale, the performance that applications can achieve on such large-scale systems depends heavily on their ability to avoid explicitly synchronized communication with other processes in the system. Accordingly, several modern and legacy parallel programming models (such as MPI, UPC, Global Arrays) have provided many programming constructs that enable implicit communication using one-sided communication operations. While MPI is the most widely used communication model for scientific computing, the usage of one-sided communication is restricted; this is mainly owing to the inefficiencies in current MPI implementations that internally rely on synchronization between processes even during one-sided communication, thus losing the potential of such constructs. In our previous work, we utilized native one-sided communication primitives offered by high-speed networks such as InfiniBand (IB) to allow for true one-sided communication in MPI. In this paper, we extend this work to natively take advantage of one-sided atomic operations on cache-coherent multi-core/multi-processor architectures while still utilizing the benefits of networks such as IB. Specifically, we present a sophisticated hybrid design that uses locks that migrate between IB hardware atomics and multi-core CPU atomics to take advantage of both. We demonstrate the capability of our proposed design with a wide range of experiments illustrating its benefits in performance as well as its potential to avoid explicit synchronization.
Dissertation Committee: Approved by
High-end computing (HEC) systems are enabling scientists and engineers to tackle grand challenge problems in their respective domains and make significant contributions to their fields. Examples of such problems include astrophysics, earthquake analysis, weather prediction, nanoscience modeling, multiscale and multiphysics modeling, biological computations, computational fluid dynamics, etc. There has been great emphasis on designing, building and deploying ultra-scale HEC systems to provide true petascale performance for these grand challenge problems. At the same time, clusters built from commodity PCs are predominantly used as mainstream tools for high-end computing owing to their cost-effectiveness and easy availability. The communication subsystem plays a pivotal role in achieving scalable performance in clusters. Of late there has been a lot of interest in the one-sided communication model, which is seen as a viable option for petascale applications. One-sided communication provides good potential for overlapping computation and communication. In order to provide high performance and scalability, the one-sided communication subsystem

