Results 1 - 2 of 2
Efficient Asynchronous Message Passing via SCI with Zero-Copying
"... Abstract. Passing messages between processes, as it is done when creating parallel applications based on MPI, does always involve copying data from the address space of the sending process to the address space in the receiving process. The fastest, and commonly most efficient way for this is a direc ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
Abstract. Passing messages between processes, as is done when creating parallel applications based on MPI, always involves copying data from the address space of the sending process to the address space of the receiving process. The fastest, and commonly most efficient, way to do this is a direct copy operation between these two locations without any intermediate copies. In this paper, we present different techniques, implemented in SCI-MPICH, to achieve such behavior for MPI on SCI-connected clusters. These techniques use both CPU-driven and DMA-driven data transfers, and either require support through MPI API functions or are transparent to the MPI application by making use of new advanced techniques in the SCI driver software. We describe the implementation details in SCI-MPICH and the underlying SMI (Shared Memory Interface) library and evaluate the performance achieved in different communication setups.
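The "support through MPI API functions" mentioned in the abstract typically amounts to allocating communication buffers through the MPI library, so that they can be placed in memory the interconnect can address directly. The following is only a minimal sketch of that usage pattern, using the standard MPI-2 calls MPI_Alloc_mem/MPI_Free_mem rather than any SCI-MPICH- or SMI-specific interface; whether a zero-copy or DMA path is actually taken depends on the MPI implementation and the interconnect.

/* Sketch: buffers allocated through MPI so the library may place them in
 * memory suitable for direct (zero-copy/DMA) transfers. Standard MPI-2
 * calls only; not the SCI-MPICH-specific interface described in the paper. */
#include <mpi.h>

#define N (1 << 20)   /* large message, typically sent via a rendezvous path */

int main(int argc, char **argv)
{
    int rank;
    double *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ask the MPI library for memory it can transfer efficiently. */
    MPI_Alloc_mem(N * sizeof(double), MPI_INFO_NULL, &buf);

    if (rank == 0)
        MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Free_mem(buf);
    MPI_Finalize();
    return 0;
}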
Impact of Latency on Applications' Performance, 2000
"... This paper investigates the impact of point-topoint latency on applications' performance on clusters of workstations interconnected with high-speed networks. At present, clusters are often evaluated through comparison of point-to-point latency and bandwidth obtained by ping-pong tests. This pap ..."
Abstract
- Add to MetaCart
This paper investigates the impact of point-to-point latency on applications' performance on clusters of workstations interconnected with high-speed networks. At present, clusters are often evaluated through comparison of point-to-point latency and bandwidth obtained by ping-pong tests. This paper shows that this approach to performance evaluation of clusters has limited validity and that latency has minimal impact on a large group of applications that use medium- to coarse-grain data-parallel algorithms. Message-passing systems with low latency often use polling for message completion, which leads to tight synchronization between the communicating processes and high CPU overhead. Systems with asynchronous message completion have higher point-to-point latency for short messages but offer a number of high-performance mechanisms such as overlapping of computation and communication, independent message progress, efficient collective algorithms, and asynchronous processing of communicating nodes...
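The overlap of computation and communication that asynchronous message completion enables is usually expressed with nonblocking MPI calls. Below is a minimal sketch using only standard MPI_Isend/MPI_Irecv/MPI_Waitall, not any system-specific API; compute_locally() is a hypothetical placeholder for application work that does not depend on the incoming message.

/* Sketch: overlapping computation with communication via nonblocking MPI.
 * With asynchronous message completion the transfer can progress while
 * compute_locally() runs; with polling-based completion much of this
 * overlap is lost. compute_locally() is a placeholder, not from the paper. */
#include <mpi.h>

#define N (1 << 18)

static double sendbuf[N], recvbuf[N];

static void compute_locally(void) { /* work independent of recvbuf */ }

int main(int argc, char **argv)
{
    int rank, size, right, left;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    right = (rank + 1) % size;
    left  = (rank + size - 1) % size;

    /* Post the communication first ... */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... do useful work while the messages are (ideally) in flight ... */
    compute_locally();

    /* ... and block only once the overlapping work is done. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}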