Results 1 - 10 of 16
Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided
"... Modern interconnects offer remote direct memory access (RDMA) features. Yet, most applications rely on explicit message passing for communications albeit their unwanted overheads. The MPI-3.0standarddefinesaprogramminginterface for exploiting RDMA networks directly, however, it’s scalability andprac ..."
Abstract
-
Cited by 8 (6 self)
- Add to MetaCart
(Show Context)
Modern interconnects offer remote direct memory access (RDMA) features. Yet, most applications rely on explicit message passing for communication despite its unwanted overheads. The MPI-3.0 standard defines a programming interface for exploiting RDMA networks directly; however, its scalability and practicability have to be demonstrated in practice. In this work, we develop scalable bufferless protocols that implement the MPI-3.0 specification. Our protocols support scaling to millions of cores with negligible memory consumption while providing the highest performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for all critical functions and demonstrate the usability of our library and models with several application studies with up to half a million processes. We show that our design is comparable to, or better than, UPC and Fortran Coarrays in terms of latency, bandwidth, and message rate. We also demonstrate application performance improvements with comparable programming complexity.
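The one-sided interface the abstract refers to can be pictured with a short, self-contained sketch in plain MPI-3.0 (not the authors' own library): each rank exposes one integer through a window and writes its rank id into its right neighbor's window using passive-target synchronization, the mode most directly backed by RDMA. At least two ranks are assumed; error handling is omitted.

    /* Illustrative MPI-3.0 one-sided sketch; not the paper's code. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *buf;                      /* window memory, allocated by MPI */
        MPI_Win win;
        MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &buf, &win);
        *buf = -1;
        MPI_Barrier(MPI_COMM_WORLD);   /* everyone is initialized before any put */

        int target = (rank + 1) % size;

        /* Passive-target epoch: the target issues no matching call. */
        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
        MPI_Put(&rank, 1, MPI_INT, target, 0, 1, MPI_INT, win);
        MPI_Win_flush(target, win);    /* MPI-3 addition: remote completion
                                          without ending the epoch */
        MPI_Win_unlock(target, win);

        MPI_Barrier(MPI_COMM_WORLD);   /* all puts have completed */

        /* Lock our own window before reading it, so the incoming value
           is guaranteed to be visible in local memory. */
        MPI_Win_lock(MPI_LOCK_SHARED, rank, 0, win);
        printf("rank %d received %d\n", rank, *buf);
        MPI_Win_unlock(rank, win);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }

The lock/put/flush/unlock sequence is what makes the communication genuinely one-sided: the target process never posts a matching receive or synchronization call for this transfer.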
An evaluation of implementation options for MPI one-sided communication
- In Recent Advances in Parallel Virtual Machine and Message Passing Interface, 12th European PVM/MPI Users' Group Meeting
, 2005
"... Abstract. MPI defines one-sided communication operations—put, get, and accumulate—together with three different synchronization mechanisms that define the semantics associated with the initiation and completion of these operations. In this paper, we analyze the requirements imposed by the MPI Standa ..."
Abstract
-
Cited by 8 (1 self)
- Add to MetaCart
Abstract. MPI defines one-sided communication operations—put, get, and accumulate—together with three different synchronization mechanisms that define the semantics associated with the initiation and completion of these operations. In this paper, we analyze the requirements imposed by the MPI Standard on any implementation of one-sided communication. We discuss options for implementing the synchronization mechanisms and analyze the cost associated with each. An MPI implementer can use this information to select the implementation method that is best suited (has the lowest cost) for a particular machine environment. We also report on experiments we ran on a Linux cluster and a Sun SMP to determine the gap between the performance that could be achieved and what is actually achieved with MPI.
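The three synchronization mechanisms the abstract enumerates are fence, post-start-complete-wait, and lock/unlock. The sketch below uses each of them to complete the same MPI_Put from rank 0 into rank 1's window; the helper names (put_with_fence, put_with_pscw, put_with_lock) are illustrative, not from the paper, at least two ranks are assumed, and error handling is omitted.

    /* Illustrative sketch of MPI one-sided synchronization; not the paper's code. */
    #include <mpi.h>

    /* 1. Fence: collective active-target synchronization over the whole
          window group. */
    static void put_with_fence(int val, MPI_Win win, int rank)
    {
        MPI_Win_fence(0, win);                            /* open epoch  */
        if (rank == 0)
            MPI_Put(&val, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_fence(0, win);                            /* close epoch */
    }

    /* 2. Post-start-complete-wait: generalized active target; only the
          processes involved synchronize. */
    static void put_with_pscw(int val, MPI_Win win, int rank,
                              MPI_Group origin, MPI_Group target)
    {
        if (rank == 1) {
            MPI_Win_post(origin, 0, win);  /* expose the window to rank 0 */
            MPI_Win_wait(win);             /* block until rank 0 is done  */
        } else if (rank == 0) {
            MPI_Win_start(target, 0, win); /* access epoch toward rank 1  */
            MPI_Put(&val, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
            MPI_Win_complete(win);
        }
    }

    /* 3. Lock/unlock: passive target; the target makes no matching call. */
    static void put_with_lock(int val, MPI_Win win, int rank)
    {
        if (rank == 0) {
            MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
            MPI_Put(&val, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
            MPI_Win_unlock(1, win);        /* put is complete at rank 1   */
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, winbuf = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Win win;
        MPI_Win_create(&winbuf, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        /* Single-member groups naming the peer on each side. */
        MPI_Group world, origin, target;
        int r0 = 0, r1 = 1;
        MPI_Comm_group(MPI_COMM_WORLD, &world);
        MPI_Group_incl(world, 1, &r0, &origin);
        MPI_Group_incl(world, 1, &r1, &target);

        put_with_fence(42, win, rank);
        put_with_pscw(42, win, rank, origin, target);
        put_with_lock(42, win, rank);

        MPI_Group_free(&origin);
        MPI_Group_free(&target);
        MPI_Group_free(&world);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }

Which mechanism is cheapest depends on how much synchronization the implementation must add beneath each of these calls, which is precisely the per-mechanism cost analysis the paper provides.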
An MPICH2 Channel Device Implementation over VAPI on InfiniBand
- In: Proceedings of the 2004 Workshop on Communication Architecture for Clusters
, 2004
"... Abstract — MPICH2, the successor of one of the most popular open source message passing implementations, aims to fully support the MPI-2 standard. Due to a complete redesign, MPICH2 is also cleaner, more flexible, and faster. The InfiniBand network technology is an open industry standard and provide ..."
Abstract
-
Cited by 6 (3 self)
- Add to MetaCart
(Show Context)
Abstract — MPICH2, the successor of one of the most popular open source message passing implementations, aims to fully support the MPI-2 standard. Due to a complete redesign, MPICH2 is also cleaner, more flexible, and faster. The InfiniBand network technology is an open industry standard and provides high bandwidth and low latency, as well as reliability, availability, and serviceability (RAS) features. It is currently spreading its influence on the market of cost-effective cluster computing. We expect for the near future that upcoming requirements in many cluster environments can only be satisfied by the functionality of MPICH2 and the performance of InfiniBand. Hence, there is a need for effective support of the InfiniBand interconnect technology in MPICH2. In this paper we present the experience gained during the implementation of our MPICH2 device for InfiniBand. Furthermore, a performance overview is given, as well as ideas for future developments. The device is implemented in terms of the Channel Interface (CH3) and uses both the channel semantics (Send/Receive) and memory semantics (RDMA) provided by Mellanox's Verbs implementation VAPI. With this combined approach a significant performance gain can be achieved. The design decisions discussed may also be of interest beyond the scope of this paper.
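The "combined approach" of channel semantics and memory semantics is easiest to picture as a size-based switch between an eager copy path and an RDMA rendezvous path. The sketch below is only a guess at that structure under stated assumptions: the ch3ib_* names, the 8 KB crossover, and the eager/rendezvous split are hypothetical illustrations, not the actual CH3 device code or the VAPI API, and the transport calls are stubbed out so the example runs standalone.

    /* Hypothetical sketch of combining send/receive and RDMA paths;
       not the paper's CH3 device and not the VAPI API. */
    #include <stdio.h>
    #include <stddef.h>

    #define EAGER_LIMIT (8 * 1024)   /* assumed eager/rendezvous crossover, bytes */

    /* Hypothetical transport hooks, stubbed for illustration. */
    static int ch3ib_send_eager(int dest, const void *buf, size_t len)
    {
        (void)buf;
        printf("eager send of %zu bytes to rank %d (copied via send/receive)\n",
               len, dest);
        return 0;
    }

    static int ch3ib_rdma_rendezvous(int dest, const void *buf, size_t len)
    {
        (void)buf;
        printf("rendezvous: RDMA-write %zu bytes into rank %d's buffer\n",
               len, dest);
        return 0;
    }

    /* Pick the path by message size, as combined designs commonly do. */
    static int ch3ib_send(int dest, const void *buf, size_t len)
    {
        if (len <= EAGER_LIMIT)
            return ch3ib_send_eager(dest, buf, len);     /* channel semantics */
        return ch3ib_rdma_rendezvous(dest, buf, len);    /* memory semantics  */
    }

    int main(void)
    {
        char small[64], large[64 * 1024];
        ch3ib_send(1, small, sizeof(small));
        ch3ib_send(1, large, sizeof(large));
        return 0;
    }

The attraction of such a split is that small messages avoid the registration and handshake cost of RDMA, while large messages avoid the extra copy of the send/receive path.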
Parallel programming models applicable to cluster computing and beyond
- In Are Magnus Bruaset and Aslak Tveito, editors, Numerical Solution of Partial Differential Equations on Parallel Computers
, 2005
"... Summary. This chapter centers mainly on successful programming models that map algorithms and simulations to computational resources used in high-performance computing. These resources range from group-based or departmental clusters to high-end resources available at the handful of supercomputer cen ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
(Show Context)
Summary. This chapter centers mainly on successful programming models that map algorithms and simulations to computational resources used in high-performance computing. These resources range from group-based or departmental clusters to high-end resources available at the handful of supercomputer centers around the world. Also covered are newer programming models that may change the way we program high-performance parallel computers.
Design and Implementation of Open MPI over Quadrics/Elan4
"... Open MPI is a project recently initiated to provide a fault-tolerant, multi-network capable, and productionquality implementation of MPI-2 [20] interface based on experiences gained from FT-MPI [8], LA-MPI [10], LAM/MPI [28], and MVAPICH [23] projects. Its initial communication architecture is layer ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
Open MPI is a project recently initiated to provide a fault-tolerant, multi-network capable, and production-quality implementation of the MPI-2 [20] interface, based on experience gained from the FT-MPI [8], LA-MPI [10], LAM/MPI [28], and MVAPICH [23] projects. Its initial communication architecture is layered on top of TCP/IP. In this paper, we design and implement the Open MPI point-to-point layer on top of a high-end interconnect, Quadrics/Elan4 [26]. Design challenges related to dynamic process/connection management, utilizing Quadrics RDMA capabilities, and supporting asynchronous communication progression are overcome with strategies that exploit Quadrics Queue-based Direct Memory Access (QDMA) and Remote Direct Memory Access (RDMA) operations, along with the chained-event mechanism. Experimental results indicate that the resulting point-to-point transport layer achieves performance comparable to the Quadrics native QDMA operations from which it is derived. While it does not take advantage of Quadrics/Elan4 [26, 2] NIC-based tag matching due to its design requirements, this point-to-point transport layer provides a high-performance implementation of MPI-2 [20] compliant message passing over Quadrics/Elan4.
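Comparisons such as "performance comparable to the Quadrics native QDMA operations" are typically made with a point-to-point ping-pong microbenchmark. The sketch below is a generic MPI-level version of that kind of measurement, not the authors' benchmark code; it assumes at least two ranks and a fixed 8-byte message.

    /* Generic MPI ping-pong latency sketch; illustrative only. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define ITERS 1000
    #define MSG_SIZE 8            /* bytes; vary to sweep message sizes */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char buf[MSG_SIZE];
        memset(buf, 0, sizeof(buf));

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("half round-trip latency: %.2f us\n",
                   (t1 - t0) * 1e6 / (2.0 * ITERS));

        MPI_Finalize();
        return 0;
    }

Sweeping MSG_SIZE from a few bytes up to megabytes turns the same loop into a bandwidth test, which is how point-to-point transport layers are usually contrasted with the native interconnect primitives.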
Transparent Checkpoint-Restart over InfiniBand
"... Transparently saving the state of the InfiniBand network as part of distributed checkpointing has been a long-standing challenge for researchers. The lack of a solution has forced typical MPI implementations to include custom checkpointrestart services that “tear down ” the network, checkpoint each ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
Transparently saving the state of the InfiniBand network as part of distributed checkpointing has been a long-standing challenge for researchers. The lack of a solution has forced typical MPI implementations to include custom checkpoint-restart services that “tear down” the network, checkpoint each node in isolation, and then re-connect the network again. This work presents the first example of transparent, system-initiated checkpoint-restart that directly supports InfiniBand. The new approach simplifies current practice by avoiding the need for a privileged kernel module. The generality of this approach is demonstrated by applying it both to MPI and to Berkeley UPC (Unified Parallel C) in its native mode (without MPI). Scalability is shown by checkpointing 2,048 MPI processes across 128 nodes (with 16 cores per node). The run-time overhead varies between 0.8% and 1.7%. While checkpoint times dominate, the network-only portion of the implementation is shown to require less than 100 milliseconds (not including the time to locally write application memory to stable storage).
Co-Designing MPI Library and Applications for InfiniBand Clusters
"... “Co-designing applications and communication libraries to leverage features of underlying communication network is imperative for achieving optimal performance on modern computing clusters.” ..."
Abstract
- Add to MetaCart
(Show Context)
“Co-designing applications and communication libraries to leverage features of underlying communication network is imperative for achieving optimal performance on modern computing clusters.”
Communication Characteristics of Message-Passing Applications, and Impact of RDMA on their Performance
, 2005
"... With the availability of Symmetric Multiprocessors (SMP) and high-speed interconnects, clusters of SMPs (CLUMPs) have become the ideal platform for performance computing. The performance of applications running on clusters mainly depends on the choice of parallel programming paradigm, workload chara ..."
Abstract
- Add to MetaCart
(Show Context)
With the availability of Symmetric Multiprocessors (SMPs) and high-speed interconnects, clusters of SMPs (CLUMPs) have become the ideal platform for high-performance computing. The performance of applications running on clusters mainly depends on the choice of parallel programming paradigm, the workload characteristics of the applications, and the performance of the communication subsystem. This thesis addresses these issues in detail. It is still open to debate whether pure message passing or mixed MPI-OpenMP is the programming model of choice for higher performance on SMP clusters. In this thesis we investigate the performance of the recently released NAS Multi-Zone (NPB-MZ) benchmarks, consisting of BT-MZ, SP-MZ, and LU-MZ, and of SMG2000 from the ASCI Purple benchmarks. Our studies show that the applications studied have better MPI performance on clusters of small SMPs interconnected by the Myrinet network. In this thesis, we examine the MPI characteristics of the three applications in the NPB-MZ suite as well as two applications (SPECseis and SPECenv) in the SPEChpc2002 suite.
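The pure-MPI versus mixed MPI-OpenMP question the thesis studies comes down to skeletons like the one below: either MPI processes everywhere, or fewer MPI processes with OpenMP threads inside each SMP node. This is a minimal, generic hybrid skeleton, not code from the thesis or from the NPB-MZ benchmarks.

    /* Minimal hybrid MPI+OpenMP skeleton; illustrative only. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;
        /* Request thread support; FUNNELED suffices if only the master
           thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* Work shared among threads inside one MPI process. */
            printf("rank %d, thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        /* Communication between processes stays in MPI. */
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }

Built with an MPI compiler wrapper and OpenMP enabled (e.g. mpicc -fopenmp), the same binary can be launched as pure MPI (one thread per process) or as a hybrid (several OpenMP threads per process), which is exactly the configuration space such studies compare.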