Results 1 - 10
of
26
Implementation and shared-memory evaluation of MPICH2 over the Nemesis communication subsystem
- Proceedings of the Euro PVM/MPI Conference
, 2006
"... Abstract. This paper presents the implementation of MPICH2 over the Nemesis communication subsystem and the evaluation of its sharedmemory performance. We describe design issues as well as some of the optimization techniques we employed. We conducted a performance evaluation over shared memory using ..."
Abstract
-
Cited by 20 (2 self)
- Add to MetaCart
Abstract. This paper presents the implementation of MPICH2 over the Nemesis communication subsystem and the evaluation of its sharedmemory performance. We describe design issues as well as some of the optimization techniques we employed. We conducted a performance evaluation over shared memory using microbenchmarks as well as application benchmarks. The evaluation shows that MPICH2 Nemesis has very low communication overhead, making it suitable for smaller-grained applications. 1
Understanding the Impact of Multi-Core Architecture in Cluster Computing: A Case Study with Intel Dual-Core System
- Seventh IEEE International Symposium on Cluster Computing and the Grid - Table of Contents
, 2007
"... Multi-core processor is a growing industry trend as single core processors rapidly reach the physical limits of possible complexity and speed. In the new Top500 supercomputer list, more than 20 % processors belong to multi-core processor family. However, without an in-depth study on application beha ..."
Abstract
-
Cited by 17 (1 self)
- Add to MetaCart
Multi-core processor is a growing industry trend as single core processors rapidly reach the physical limits of possible complexity and speed. In the new Top500 supercomputer list, more than 20 % processors belong to multi-core processor family. However, without an in-depth study on application behaviors and trends on multi-core cluster, we might not be able to understand the characteristics of multicore cluster in a comprehensive manner and hence not be able to get optimal performance. In this paper, we take on the challenges and design a set of experiments to study the impact of multi-core architecture on cluster computing. We choose to use one of the most advanced multi-core servers, Intel Bensley system with Woodcrest processors, as our evaluation platform, and use popular benchmarks including HPL, NAMD, and NAS as the applications to study. From our message distribution experiments, we find that on an average about 50 % messages are transferred through intra-node communication, which is much higher than intuition. This trend indicates that optimizing intra-node communication is as important as optimizing inter-node communication in a multi-core cluster. We also observe that cache and memory contention may be a potential bottleneck in multi-core cluster, and communication middleware and applications should be multi-core aware to alleviate this problem. We demonstrate that multi-core aware algorithm, e.g. data tiling, improves benchmark execution time by up to 70%. We also compare the scalability of multicore cluster with that of single-core cluster and find that the scalability of multi-core cluster is promising.
Data transfers between processes in an SMP system: performance study and application to MPI
- in: Proceedings of the 2006 International Conference on Parallel Processing (ICPP 2006), IEEE Computer Society
, 2006
"... Abstract — This paper focuses on the transfer of large data in SMP systems. Achieving good performance for intranode communication is critical for developing an efficient communication system, especially in the context of SMP clusters. We evaluate the performance of five transfer mechanisms: sharedm ..."
Abstract
-
Cited by 16 (3 self)
- Add to MetaCart
Abstract — This paper focuses on the transfer of large data in SMP systems. Achieving good performance for intranode communication is critical for developing an efficient communication system, especially in the context of SMP clusters. We evaluate the performance of five transfer mechanisms: sharedmemory buffers, message queues, the Ptrace system call, kernel module-based copy, and a high-speed network. We evaluate each mechanism based on latency, bandwidth, its impact on application cache usage, and its suitability to support MPI twosided and one-sided messages. I. MOTIVATION AND SCOPE Designing a communication system tailored for a particular architecture requires understanding the achievable performance levels of the underlying hardware and software. Such understanding is key to a more efficient design and
Virtual Machine Aware Communication Libraries for High Performance Computing
"... As the size and complexity of modern computing systems keep increasing to meet the demanding requirements of High Performance Computing (HPC) applications, manageability is becoming a critical concern to achieve both high performance and high productivity computing. Meanwhile, virtual machine (VM) t ..."
Abstract
-
Cited by 13 (2 self)
- Add to MetaCart
As the size and complexity of modern computing systems keep increasing to meet the demanding requirements of High Performance Computing (HPC) applications, manageability is becoming a critical concern to achieve both high performance and high productivity computing. Meanwhile, virtual machine (VM) technologies have become popular in both industry and academia due to various features designed to ease system management and administration. While a VM-based environment can greatly help manageability on large-scale computing systems, concerns over performance have largely blocked the HPC community from embracing VM technologies. In this paper, we follow three steps to demonstrate the ability to achieve near-native performance in a VM-based environment for HPC. First, we propose Inter-VM Communication
NewMadeleine: a Fast Communication Scheduling Engine for High Performance Networks
- in "CAC 2007: Workshop on Communication Architecture for Clusters, held in conjunction with IPDPS 2007
, 2007
"... Communication libraries have dramatically made progress over the fifteen years, pushed by the success of cluster architectures as the preferred platform for high performance distributed computing. However, many potential optimizations are left unexplored in the process of mapping application communi ..."
Abstract
-
Cited by 12 (4 self)
- Add to MetaCart
Communication libraries have dramatically made progress over the fifteen years, pushed by the success of cluster architectures as the preferred platform for high performance distributed computing. However, many potential optimizations are left unexplored in the process of mapping application communication requests onto low level network commands. The fundamental cause of this situation is that the design of communication subsystems is mostly focused on reducing the latency by shortening the critical path. In this paper, we present a new communication scheduling engine which dynamically optimizes application requests in accordance with the NICs capabilities and activity. The optimizing code is generic and portable. The database of optimizing strategies may be dynamically extended. ∗ This work was funded through the ANR-05-CIGC-11 grant. 1 Contents
Thread safety in an MPI implementation: Requirements and analysis
- Parallel Computing
, 2007
"... The MPI-2 Standard has carefully specified the interaction between MPI and usercreated threads. The goal of this specification is to allow users to write multithreaded MPI programs while also allowing MPI implementations to deliver high performance. However, a simple reading of the thread-safety spe ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
The MPI-2 Standard has carefully specified the interaction between MPI and usercreated threads. The goal of this specification is to allow users to write multithreaded MPI programs while also allowing MPI implementations to deliver high performance. However, a simple reading of the thread-safety specification does not reveal what its implications are for an implementation and what implementers must be aware (and careful) of. In this paper, we describe and analyze what the MPI Standard says about thread safety and what it implies for an implementation. We classify the MPI functions based on their thread-safety requirements and discuss several issues to consider when implementing thread safety in MPI. We use the example of generating new context ids (required for creating new communicators) to demonstrate how a simple solution for the single-threaded case does not naturally extend to the multithreaded case and how a naïve thread-safe algorithm can be expensive. We then present an algorithm for generating context ids that works efficiently in both single-threaded and multithreaded cases. Key words: Message Passing Interface (MPI), thread safety, MPI implementation, multithreaded programming 1
A Software Based Approach for Providing Network Fault Tolerance
- MPI Level Design and Performance Evaluation. In Proceedings of SuperComputing (SC’06
, 2006
"... In the arena of cluster computing, MPI has emerged as the de facto standard for writing parallel applications. At the same time, introduction of high speed RDMA-enabled interconnects like InfiniBand, Myrinet, Quadrics, RDMAenabled Ethernet has escalated the trends in cluster computing. Network APIs ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
In the arena of cluster computing, MPI has emerged as the de facto standard for writing parallel applications. At the same time, introduction of high speed RDMA-enabled interconnects like InfiniBand, Myrinet, Quadrics, RDMAenabled Ethernet has escalated the trends in cluster computing. Network APIs like uDAPL (user Direct Access Provider Library) are being proposed to provide a networkindependent interface to different RDMA-enabled interconnects. Clusters with combination(s) of these interconnects are being deployed to leverage their unique features, and network failover in wake of transmission errors. In this paper, we design a network fault tolerant MPI using uDAPL interface, making this design portable for existing and upcoming interconnects. Our design provides failover
Designing An Efficient Kernel-level and User-level Hybrid Approach for MPI Intra-node Communication on Multi-core Systems ∗
"... The emergence of multi-core processors has made MPI intra-node communication a critical component in high performance computing. In this paper, we use a three-step methodology to design an efficient MPI intra-node communication scheme from two popular approaches: shared memory and OS kernel-assisted ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
The emergence of multi-core processors has made MPI intra-node communication a critical component in high performance computing. In this paper, we use a three-step methodology to design an efficient MPI intra-node communication scheme from two popular approaches: shared memory and OS kernel-assisted direct copy. We use an Intel quad-core cluster for our study. We first run microbenchmarks to analyze the advantages and limitations of these two approaches, including the impacts of processor topology, communication buffer reuse, process skew effects, and L2 cache utilization. Based on the results and the analysis, we propose topology-aware and skew-aware thresholds to build an optimized hybrid approach. Finally, we evaluate the impact of the hybrid approach on MPI collective operations and applications using IMB, NAS, PSTSWM, and HPL benchmarks. We observe that the optimized hybrid approach can improve the performance of MPI collective operations by up to 60%, and applications by up to 17%. 1
Efficient Asynchronous Memory Copy Operations on Multi-Core Systems and I/OAT
"... Bulk memory copies incur large overheads such as CPU stalling (i.e., no overlap of computation with memory copy operation), small register-size data movement, cache pollution, etc. Asynchronous copy engines introduced by Intel’s I/O Acceleration Technology help in alleviating these overheads by offl ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
Bulk memory copies incur large overheads such as CPU stalling (i.e., no overlap of computation with memory copy operation), small register-size data movement, cache pollution, etc. Asynchronous copy engines introduced by Intel’s I/O Acceleration Technology help in alleviating these overheads by offloading the memory copy operations using several DMA channels. However, the startup overheads associated with these copy engines such as pinning the application buffers, posting the descriptors and checking for completion notifications, limit their overlap capability. In this paper, we propose two schemes to provide complete overlap of memory copy operation with computation by dedicating the critical tasks to a single core in a multi-core system. In the first scheme, MCI (Multi-Core with I/OAT), we offload the memory copy operation to the copy engine and onload the startup overheads to the dedicated core. For systems without any hardware copy engine support, we propose a second scheme, MCNI (Multi-Core with No I/OAT) that onloads the memory copy operation to the dedicated core. We further propose a mechanism for an applicationtransparent asynchronous memory copy operation using memory protection. We analyze our schemes based on overlap efficiency, performance and associated overheads using several micro-benchmarks and applications. Our microbenchmark results show that memory copy operations can be significantly overlapped (up to 100%) with computation using the MCI and MCNI schemes. Evaluation with MPI-based applications such as IS-B and PSTSWM-small using the MCNI scheme show up to 4 % and 5 % improvement, respectively, as compared to traditional implementations. Evaluations with data-centers using the MCI scheme show up to 37 % improvement compared to the traditional implementation. Our evaluations with gzip SPEC benchmark using application-transparent asynchronous memory copy show a lot of potential to use such mechanisms in several application domains. I.
A Synchronous Mode MPI Implementation on the Cell BE ™ Architecture
"... Abstract. The Cell Broadband Engine shows much promise in high performance computing applications. The Cell is a heterogeneous multi-core processor, with the bulk of the computational work load meant to be borne by eight co-processors called SPEs. Each SPE operates on a distinct 256 KB local store, ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
Abstract. The Cell Broadband Engine shows much promise in high performance computing applications. The Cell is a heterogeneous multi-core processor, with the bulk of the computational work load meant to be borne by eight co-processors called SPEs. Each SPE operates on a distinct 256 KB local store, and all the SPEs also have access to a shared 512 MB to 2 GB main memory through DMA. The unconventional architecture of the SPEs, and in particular their small local store, creates some programming challenges. We have provided an implementation of core features of MPI for the Cell to help deal with this. This implementation views each SPE as a node for an MPI process, with the local store used as if it were a cache. In this paper, we describe synchronous mode communication in our implementation, using the rendezvous protocol, which makes MPI communication for long messages efficient. We further present experimental results on the Cell hardware, where it demonstrates good performance, such as throughput up to 6.01 GB/s and latency as low as 0.65 μs on the pingpong test. This demonstrates that it is possible to efficiently implement MPI calls even on the simple SPE cores. 1.

