SMARTMAP: Operating System Support for Efficient Data Sharing Among Processes on a Multi-Core Processor
Cited by 2 (0 self)
Abstract: This paper describes SMARTMAP, an operating system technique that implements fixed offset virtual memory addressing. SMARTMAP allows the application processes on a multi-core processor to directly access each other’s memory without the overhead of kernel involvement. When used to implement MPI, SMARTMAP eliminates all extraneous memory-to-memory copies imposed by UNIX-based shared memory strategies. In addition, SMARTMAP can easily support operations that UNIX-based shared memory cannot, such as direct, in-place MPI reduction operations and one-sided get/put operations. We have implemented SMARTMAP in the Catamount lightweight kernel for the Cray XT and modified MPI and Cray SHMEM libraries to use it. Micro-benchmark performance results show that SMARTMAP allows for significant improvements in latency, bandwidth, and small message rate on a quad-core processor.
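The idea of fixed offset virtual addressing can be illustrated with a small sketch. The abstract does not give the offset or the slot layout, so the 39-bit slot size, the convention that slot 0 is a process's own address space, and the function name `remote_address` are all assumptions made purely for illustration.

```python
# Illustrative sketch only: SLOT_BITS and the slot numbering are assumptions,
# not values stated in the abstract.
SLOT_BITS = 39  # hypothetical: each process's memory occupies one 2**39-byte slot


def remote_address(local_addr: int, peer_slot: int) -> int:
    """Return the virtual address at which a peer's `local_addr` would be visible.

    Slot 0 is assumed to hold the caller's own address space, so peer
    number `peer_slot` is assumed to be mapped at slot `peer_slot + 1`.
    """
    if not 0 <= local_addr < (1 << SLOT_BITS):
        raise ValueError("local address does not fit in one slot")
    if peer_slot < 0:
        raise ValueError("peer slot must be non-negative")
    # A fixed offset per peer: no kernel call is needed to compute or use it.
    return ((peer_slot + 1) << SLOT_BITS) | local_addr


if __name__ == "__main__":
    print(hex(remote_address(0x1000, 0)))
```

Because the translation is pure arithmetic, a process could in principle read or write a peer's buffer with ordinary loads and stores once it knows the peer's local address, which is the property the abstract attributes to the technique.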
Processor Affinity and MPI Performance on SMP-CMP Clusters
with multi-core Chip-Multiprocessors (CMP), also known as SMP-CMP clusters, are becoming ubiquitous today. For Message Passing Interface (MPI) programs, such clusters have a multilayer hierarchical communication structure: intra-node communication usually performs better than inter-node communication, and intra-node performance is itself not uniform, because communication between cores within a chip is faster than communication between cores on different chips. As a result, the mapping from MPI processes to cores within each compute node, that is, processor affinity, may significantly affect the performance of intra-node communication, which in turn may impact the overall performance of MPI applications. In this work, we study the impact of processor affinity on MPI performance in SMP-CMP clusters through extensive benchmarking and identify the conditions under which processor affinity is (or is not) a major factor affecting performance.
Keywords: processor affinity; MPI; SMP-CMP clusters
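To make the notion of processor affinity concrete, the following sketch compares two hypothetical placement policies for MPI ranks on one node. The policy names "compact" and "scatter", the function names, and the neighbor-pair metric are assumptions chosen for illustration; they are not taken from the paper.

```python
# Illustrative sketch only: the policies and the metric are hypothetical.
def map_ranks(n_ranks, chips, cores_per_chip, policy="compact"):
    """Return a list of (chip, core) placements, one per MPI rank."""
    if n_ranks > chips * cores_per_chip:
        raise ValueError("more ranks than cores on the node")
    placement = []
    for rank in range(n_ranks):
        if policy == "compact":
            # Fill every core of one chip before moving to the next chip.
            chip, core = divmod(rank, cores_per_chip)
        elif policy == "scatter":
            # Deal ranks out round-robin across chips.
            core, chip = divmod(rank, chips)
        else:
            raise ValueError(f"unknown policy: {policy}")
        placement.append((chip, core))
    return placement


def neighbor_pairs_on_same_chip(placement):
    """Count consecutive rank pairs (i, i + 1) that share a chip."""
    return sum(1 for a, b in zip(placement, placement[1:]) if a[0] == b[0])


if __name__ == "__main__":
    for policy in ("compact", "scatter"):
        p = map_ranks(4, chips=2, cores_per_chip=2, policy=policy)
        print(policy, p, neighbor_pairs_on_same_chip(p))
```

On a node with two dual-core chips, the compact policy keeps two of the three neighboring rank pairs on the same chip, while the scatter policy keeps none, which is one simple way the same program can see different intra-node communication costs.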
Optimizing a Multi-Core Processor for Message-Passing Workloads
Future large-scale multi-cores will likely be best suited for use within high-performance computing (HPC) domains. A large fraction of HPC workloads employ the message-passing interface (MPI), yet multi-cores continue to be optimized for shared-memory workloads. In this position paper, we put forth the design of a unique chip that is optimized for MPI workloads. It introduces specialized hardware to optimize the transfer of messages between cores. It eliminates most aspects of on-chip cache coherence, not only to reduce complexity and power but also to improve shared-memory producer-consumer behavior and the efficiency of buffer copies used during message transfers. We also consider two optimizations (caching of read-only and private blocks) that alleviate the negative performance effects of a coherence-free system.
Cache-Efficient, Intranode, Large-Message MPI Communication with MPICH2-Nemesis
Author manuscript, published in "38th International Conference on Parallel Processing (ICPP-2009) (2009)". DOI: 10.1109/ICPP.2009.22
The emergence of multicore processors raises the need to efficiently transfer large amounts of data between local processes. MPICH2 is a highly portable MPI implementation whose large-message communication schemes suffer from high CPU utilization and cache pollution because of the use of a double-buffering strategy, common to many MPI implementations. We introduce two strategies offering a kernel-assisted, single-copy model with support for noncontiguous and asynchronous transfers. The first one uses the now widely available vmsplice Linux system call; the second one further improves performance thanks to a custom kernel module called KNEM. The latter also offers I/OAT copy offload, which is dynamically enabled depending on both hardware cache characteristics and message size. These new solutions outperform the standard transfer method in the MPICH2 implementation when no cache is shared between the processing cores or when very large messages are being transferred. Collective communication operations show a dramatic improvement, and the IS NAS parallel benchmark shows a 25% speedup and better cache efficiency.

