Efficient Structured Data Access in Parallel File Systems
In Proceedings of the IEEE International Conference on Cluster Computing, 2003
Cited by 15 (6 self)
Parallel scientific applications store and retrieve very large, structured datasets. Directly supporting these structured accesses is an important step in providing high-performance I/O solutions for these applications. High-level interfaces such as HDF5 and Parallel netCDF provide convenient APIs for accessing structured datasets, and the MPI-IO interface also supports efficient access to structured data. However, parallel file systems do not traditionally support such access. In this work we present an implementation...
Supporting Efficient Noncontiguous Access in PVFS over InfiniBand
In Proceedings of Cluster Computing ’03, Hong Kong, 2003
Cited by 12 (6 self)
Noncontiguous I/O access is the main access pattern in many scientific applications. Noncontiguity exists both in access to files and in access to target memory regions on the client. This characteristic imposes a requirement of native noncontiguous I/O access support in cluster file systems for high performance. In this paper, we address two main issues on supporting efficient noncontiguous I/O access in cluster file systems over a high performance network. One is noncontiguous data transmission between the client and the I/O server. The second is noncontiguous disk access on the I/O server itself.
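As a rough illustration of what a noncontiguous access looks like to an I/O server, the sketch below represents a request as a list of (offset, length) file regions and merges regions that touch or overlap, so that fewer separate disk accesses are needed. This is not the scheme from the paper; the function name `coalesce_regions` and the region representation are assumptions made for this example.

```python
def coalesce_regions(regions):
    """Merge (offset, length) file regions that touch or overlap.

    Hypothetical helper for illustration only; not taken from PVFS.
    Returns a sorted list of disjoint (offset, length) regions.
    """
    merged = []
    for offset, length in sorted(regions):
        if length <= 0:
            continue  # ignore empty regions
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            # Region starts inside or right at the end of the previous one.
            last_offset, last_length = merged[-1]
            new_end = max(last_offset + last_length, offset + length)
            merged[-1] = (last_offset, new_end - last_offset)
        else:
            merged.append((offset, length))
    return merged


if __name__ == "__main__":
    request = [(16, 8), (0, 4), (4, 4), (20, 8)]
    print(coalesce_regions(request))
```

Regions that are separated by a gap stay separate, so the result still describes a noncontiguous access, only with fewer pieces.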
Nonuniformly communicating noncontiguous data: A case study with PETSc and MPI
In 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007
Cited by 6 (3 self)
Due to the complexity associated with developing parallel applications, scientists and engineers rely on high-level software libraries such as PETSc, ScaLAPACK and PESSL to ease this task. Such libraries assist application developers by providing abstractions for mathematical operations, data representation and management of parallel layouts of the data, while internally using communication libraries such as MPI and PVM. With high-level libraries managing data layout and communication internally, it can be expected that they organize application data suitably for performing the library operations optimally. However, this places additional overhead on the underlying communication library by making the data layout noncontiguous in memory and communication volumes (data transferred to each process) nonuniform. In this paper, we analyze the overheads associated with these two aspects (noncontiguous data layouts and nonuniform communication volumes) in the context of the PETSc software toolkit over the MPI communication library. We describe the issues with the current approaches used by MPICH2 (an implementation of MPI), propose different approaches to handle these issues and evaluate these approaches with microbenchmarks as well as an application over the PETSc software library. Our experimental results demonstrate close to an order of magnitude improvement in the per-...
Designing a common communication subsystem
In Proceedings of the 12th European Parallel Virtual Machine and Message Passing Interface Conference (Euro PVM MPI), 2005
Cited by 6 (1 self)
Communication subsystems are used in high-performance parallel computing systems to abstract the lower network layer. By using a communication subsystem, an upper middleware library or runtime system can be more easily ported to different interconnects. However, by abstracting the network layer, the designer will typically make the communication subsystem more specialized for that particular middleware library, and less general, making it ineffective for supporting middleware for other programming models. In previous work we analyzed the requirements of various programming model middleware and the communication subsystems that support them. We found that although there are no mutually exclusive requirements, none of the existing communication subsystems could efficiently support the programming model middleware we considered. In this paper, we describe our design of a common communication subsystem, called CCS, that can efficiently support various programming model middleware.
Applying MPI Derived Datatypes to the NAS Benchmarks: A Case Study
2004
Cited by 4 (0 self)
MPI derived datatypes are a powerful method to define arbitrary collections of non-contiguous data in memory and to enable non-contiguous data communication in a single MPI function call. In this paper, we employ MPI datatypes in four NAS benchmarks (MG, LU, BT, and SP) to transfer non-contiguous data. Comprehensive performance evaluation was carried out on two clusters: an Itanium-2 Myrinet cluster and a Xeon InfiniBand cluster. Performance results show that using datatypes can achieve performance comparable to manual packing/unpacking in the original benchmarks, though the MPI implementations that were studied also perform internal packing and unpacking on noncontiguous datatype communication. In some cases, better performance can be achieved because of the reduced costs to transfer non-contiguous data. This is because optimizations in the MPI packing/unpacking implementations can be easily overlooked in manual packing and unpacking by users. Our case study demonstrates that MPI datatypes simplify the implementation of non-contiguous communication and lead to application code with portable performance. We expect that with further improvement of datatype processing and datatype communication such as [10, 24], datatypes can outperform the conventional methods of noncontiguous data communication. Our modified NAS benchmarks can be used to evaluate datatype processing and datatype communication in MPI implementations.
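To make the contrast with manual packing concrete, here is a minimal pure-Python sketch of the layout that a strided vector datatype describes. The parameters `count`, `blocklength`, and `stride` mirror those of `MPI_Type_vector`, but the code does not call MPI; the helper names and the `offset` parameter are assumptions made for this illustration. It shows the gather and scatter steps that an application otherwise writes by hand.

```python
def vector_indices(count, blocklength, stride, offset=0):
    """Element indices covered by `count` blocks of `blocklength`
    elements, with block starts `stride` elements apart."""
    return [offset + block * stride + i
            for block in range(count)
            for i in range(blocklength)]


def pack(buffer, count, blocklength, stride, offset=0):
    """Gather the noncontiguous elements into a contiguous list."""
    return [buffer[i] for i in vector_indices(count, blocklength, stride, offset)]


def unpack(packed, buffer, count, blocklength, stride, offset=0):
    """Scatter a contiguous list back into the noncontiguous layout."""
    for value, i in zip(packed, vector_indices(count, blocklength, stride, offset)):
        buffer[i] = value


if __name__ == "__main__":
    # A 4x4 row-major matrix stored as a flat list; column 1 is
    # four blocks of one element each, four elements apart.
    matrix = list(range(16))
    column = pack(matrix, count=4, blocklength=1, stride=4, offset=1)
    print(column)
```

With a derived datatype, the same (count, blocklength, stride) description is handed to the MPI library once, and the library decides how to move the data.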
High Performance Implementation of MPI Derived Datatype Communication over InfiniBand
2004
Cited by 2 (0 self)
In this paper, a systematic study of two main types of approach for MPI datatype communication (Pack/Unpack-based approaches and Copy-Reduced approaches) is carried out on the InfiniBand network. We focus on overlapping packing, network communication, and unpacking in the Pack/Unpack-based approaches. We use RDMA operations to avoid packing and/or unpacking in the Copy-Reduced approaches. Four schemes (Buffer-Centric Segment Pack/Unpack, RDMA Write Gather With Unpack, Pack with RDMA Read Scatter, and Multiple RDMA Writes) have been proposed. Three of them have been implemented and evaluated based on one MPI implementation over InfiniBand. Performance results of a vector microbenchmark demonstrate that latency is improved by a factor of up to 3.4 and bandwidth by a factor of up to 3.6 compared to the current datatype communication implementation. Collective operations like MPI_Alltoall are demonstrated to benefit. A factor of up to 2.0 improvement has been seen in our measurements of those collective operations on an 8-node system.
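The idea of packing in segments, so that one segment can be handed to the network before the next is packed, can be sketched as follows. This is only an illustration of segmentation under stated assumptions: the names `packed_segments` and `send_segmented` and the `send` callback are hypothetical, the code is serial, and real overlap would require asynchronous communication, which is not shown. It is not one of the four schemes from the paper.

```python
def packed_segments(buffer, indices, segment_size):
    """Lazily yield contiguous segments gathered from `buffer`.

    `indices` lists the noncontiguous element positions to send.
    Each segment is packed only when the consumer asks for it.
    """
    for start in range(0, len(indices), segment_size):
        chunk = indices[start:start + segment_size]
        yield [buffer[i] for i in chunk]


def send_segmented(buffer, indices, segment_size, send):
    """Pack and hand over one segment at a time; return the segment count.

    `send` stands in for a network send and is supplied by the caller.
    """
    sent = 0
    for segment in packed_segments(buffer, indices, segment_size):
        send(segment)  # segment k is handed over before segment k+1 is packed
        sent += 1
    return sent


if __name__ == "__main__":
    received = []
    data = list(range(10))
    n = send_segmented(data, [0, 2, 4, 6, 8], 2, received.append)
    print(n, received)
```

Because the generator packs on demand, a sender that issues nonblocking sends could start transmitting the first segment while later segments are still being gathered.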
Can MPI Be Used for Persistent Parallel Services?
Cited by 2 (0 self)
MPI is routinely used for writing parallel applications, but it is not commonly used for writing long-running parallel services, such as parallel file systems or job schedulers. Nonetheless, MPI does have many features that are potentially useful for writing such software. Using the PVFS2 parallel file system as a motivating example, we studied the needs of software that provides persistent parallel services and evaluated whether MPI is a good match for those needs. We also ran experiments to determine the gaps between what the MPI Standard enables and what MPI implementations currently support. The results of our study indicate that MPI can enable persistent parallel systems to be developed with less effort and can provide high performance, but MPI implementations will need to provide better support for certain features. We also describe an area where additions to the MPI Standard would be useful.
Automatic Memory Optimizations for Improving MPI Derived Datatype Performance
In Proceedings of the 13th European PVM/MPI Users' Group Meeting (Euro PVM/MPI ’06), 2006
Cited by 2 (2 self)
MPI derived datatypes allow users to describe noncontiguous memory layout and communicate noncontiguous data with a single communication function. This powerful feature enables an MPI implementation to optimize the transfer of noncontiguous data. In practice, however, many implementations of MPI derived datatypes perform poorly, which makes application developers avoid using this feature. In this paper, we present a technique to automatically select templates that are optimized for memory performance based on the access pattern of derived datatypes. We implement this mechanism in the MPICH2 source code. The performance of our implementation is compared to well-written manual packing/unpacking routines and the original MPICH2 implementation. We show that performance for various derived datatypes is significantly improved and comparable to that of optimized manual routines.
Non-Data-Communication Overheads in MPI: Analysis on Blue Gene/P
Cited by 1 (0 self)
... high performance by using the parallelism of a massive number of low-frequency/low-power processing cores. This means that the local pre- and post-communication processing required by the MPI stack might not be very fast, owing to the slow processing cores. Similarly, small amounts of serialization within the MPI stack that were acceptable on small/medium systems can be brutal on massively parallel systems. In this paper, we study different non-data-communication overheads within the MPI implementation on the IBM Blue Gene/P system.
Implementation and Evaluation of Shared-Memory Communication and Synchronization Operations in MPICH2 using the Nemesis Communication Subsystem
This paper presents the implementation of MPICH2 over the Nemesis communication subsystem and the evaluation of its shared-memory performance. We describe design issues as well as some of the optimization techniques we employed. We conducted a performance evaluation over shared memory using microbenchmarks. The evaluation shows that MPICH2 Nemesis has very low communication overhead, making it suitable for smaller-grained applications.

