Parallel Programmability and the Chapel Language
Int. J. High Perform. Comput. Appl.
Cited by 55 (5 self)
It is an increasingly common belief that the programmability of parallel machines is lacking, and that the high-end computing (HEC) community is suffering as a result of it. The population of users who can effectively program parallel machines comprises only a small fraction of those who can effectively program traditional sequential computers, and this gap seems only to be widening as time passes. The parallel computing community’s inability to tap the skills
Decoupling Synchronization and Data Transfer in Message Passing Systems of Parallel Computers
In Proc. Intl. Conf. on Supercomputing, 1995
Cited by 30 (9 self)
Synchronization is an important issue for the design of a scalable parallel computer, and some systems include special hardware support for control messages or barriers. The cost of synchronization has a high impact on the design of the message passing (communication) services. In this paper, we investigate three different communication libraries that are tailored toward the synchronization services available: (1) a version of generic send-receive message passing (PVM), which relies on traditional flow control and buffering to synchronize the data transfers; (2) message passing with pulling, i.e. a message is transferred only when the recipient is ready and requests it (as, e.g., used in NX for large messages); and (3) the decoupled direct deposit message passing that uses separate, global synchronization to ensure that nodes send messages only when the message data can be deposited directly into the final destination in the memory of the remote recipient. Measurements of these three st...
Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit
Intern. J. High Perf. Comp. Applications, 2005
Cited by 13 (8 self)
This paper describes capabilities, evolution, performance, and applications of the Global Arrays (GA) toolkit. GA was created to provide application programmers with an interface that allows them to distribute data while maintaining the type of global index space and programming syntax similar to that available when programming on a single processor. The goal of GA is to free the programmer from the low level management of communication and allow them to deal with their problems at the level at which they were originally formulated. At the same time, compatibility of GA with MPI enables the programmer to take advantage of the existing MPI software/libraries when available and appropriate. The variety of applications that have been implemented using Global Arrays attests to the
Protocols and Strategies for Optimizing Performance of Remote Memory Operations on Clusters
In: Proc. Workshop Communication Architecture for Clusters (CAC02) of IPDPS’02, Ft, 2002
Cited by 12 (2 self)
In this paper, we describe a software architecture for supporting remote memory operations on clusters with networks such as Myrinet or cLAN. When combined with protocols and strategies for efficient management of network and host resources, this architecture can both deliver high performance and match network protocols with requirements of remote memory operations. The protocols and strategies address issues such as buffer memory consumption, management of GM tokens, dynamic memory registration, zero-copy data transfers and adaptive data streaming. For example, the adaptive data streaming technique bridges the performance gap between remote memory operations that target registered memory and those that use regular memory. Our approach relies on the standard unmodified system software and drivers for Myrinet and cLAN rather than on custom/alternative drivers and interfaces (e.g., AM [1], PM [2], BIP [3], and FM [4]) that replace the standard Myrinet Control Program (MCP) on the network interface card
The high-level parallel language ZPL improves productivity and performance
In Proceedings of the IEEE International Workshop on Productivity and Performance in High-End Computing, 2004
Cited by 10 (2 self)
In this paper, we qualitatively address how high-level parallel languages improve productivity and performance. Using ZPL as a case study, we discuss advantages that stem from a language having both a global (rather than a per-processor) view of the computation and an underlying performance model that statically identifies communication in code. We also candidly discuss several disadvantages to ZPL.
Sparse Matrix Block-Cyclic Redistribution
Proceedings of IEEE Int'l. Parallel Processing Symposium (IPPS'99), 1999
Cited by 8 (0 self)
Run-time support for CYCLIC(k) redistribution in the SPMD computation model is currently very relevant to the scientific community. This work characterizes sparse matrix redistribution and the problems that arise from the use of compressed representations. Two main improvements, to the buffering and to the coordinate calculation, modify the original algorithm. Our solutions comprise a Collecting stage, a Communication stage, and a Mixing stage, whose influence on execution time depends on the sparsity of the matrix and the number of processors. Experiments were carried out on a Cray T3E for real matrices and different redistribution parameters. 1. Introduction Data distribution plays a very important role in the performance of parallel applications on a distributed-memory machine. Redistribution is therefore a very important problem for the scientific community. Related work about dynamic data-distribution modification is only foc...
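As a rough illustration of the CYCLIC(k) distribution mentioned in the abstract above, the sketch below maps a global index to its owning processor and local index, then counts how many elements each processor pair would exchange when the block size changes. It uses the standard block-cyclic mapping on a dense one-dimensional index range. The function names are hypothetical, and the paper's sparse, compressed-format algorithm is not reproduced here.

```python
def cyclic_k_owner(i, k, p):
    """Map global index i to (owner, local_index) under CYCLIC(k) on p processors.

    Standard block-cyclic mapping: blocks of k consecutive indices are dealt
    out to processors in round-robin order.
    """
    block = i // k                      # which block of size k holds index i
    owner = block % p                   # blocks are assigned round-robin
    local = (block // p) * k + i % k    # earlier local blocks, then offset in block
    return owner, local


def redistribution_plan(n, k_src, k_dst, p):
    """Count elements each (source, destination) pair exchanges for n indices.

    Hypothetical helper: only elements whose owner changes are counted.
    """
    plan = {}
    for i in range(n):
        src, _ = cyclic_k_owner(i, k_src, p)
        dst, _ = cyclic_k_owner(i, k_dst, p)
        if src != dst:
            plan[(src, dst)] = plan.get((src, dst), 0) + 1
    return plan


if __name__ == "__main__":
    print(cyclic_k_owner(5, 2, 2))
    print(redistribution_plan(8, 1, 2, 2))
```

A sparse version would apply the same owner calculation to the coordinates of the stored nonzeros only, which is where the cost of working with compressed representations would come in.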
Host-Assisted Zero-Copy Remote Memory Access Communication on InfiniBand
Cited by 8 (4 self)
This paper describes how RMA can be implemented efficiently over InfiniBand. The capabilities not offered directly by the InfiniBand verb layer can be implemented efficiently using the novel host-assisted approach while achieving zero-copy communication and supporting an excellent overlap of computation with communication. For contiguous data we are able to achieve a small message latency of 6 µs and a peak bandwidth of 830 MB/s for 'put' and a small message latency of 12 µs and a peak bandwidth of 765 MB/s for 'get'. These numbers are almost as good as the performance of the native VAPI layer. For noncontiguous data, the host-assisted approach can deliver bandwidth close to that for contiguous data. We also demonstrate the superior tolerance of host-assisted data-transfer operations to CPU-intensive tasks due to minimum host involvement in our approach as compared to the traditional host-based approach. Our implementation also supports a very high degree of overlap of computation and communication: for large message sizes, 99% overlap was achieved for contiguous data and up to 95% for noncontiguous data. The NAS MG and matrix multiplication benchmarks were used to validate the effectiveness of our approach, and demonstrated excellent overall performance.
A Compiler Abstraction for Machine Independent Parallel Communication Generation
In Tenth International Workshop on Languages and Compilers for Parallel Computing, 1998
Cited by 6 (2 self)
In this paper, we consider the problem of generating efficient, portable communication in compilers for parallel languages. We introduce the Ironman abstraction, which separates data transfer from its implementing communication paradigm. This is done by annotating the compiler-generated code with legal ranges for data transfer in the form of calls to the Ironman library. On each target platform, these library calls are instantiated to perform the transfer using the machine's optimal communication paradigm. We confirm arguments against generating message passing calls in the compiler based on our experiences using PVM and MPI, specifically the observation that these interfaces do not perform well on machines that are not built with a message passing communication paradigm. The overhead for using Ironman, as opposed to a machine-specific back end, is demonstrated to be negligible. We give performance results for a number of benchmarks running with PVM, MPI, and machin...
Index Translation Schemes for Adaptive Computations on Distributed Memory Multicomputers
Proceedings of the Ninth International Parallel Processing Symposium, IEEE Computer, 1995
Cited by 5 (1 self)
Current research in parallel programming is focused on closing the gap between globally indexed algorithms and the separate address spaces of processors on distributed memory multicomputers. A set of index translation schemes has been implemented as part of the CHAOS runtime support library, so that the library functions can be used for implementing a global index space across a collection of separate local index spaces. These schemes include two software-cached translation schemes aimed at adaptive irregular problems as well as a distributed translation table technique for statically irregular problems. To evaluate and demonstrate the efficiency of the software-cached translation schemes, experiments have been performed with an adaptively irregular loop kernel and a full-fledged 3D DSMC code from NASA Langley on the Intel Paragon and Cray T3D. This paper also discusses and analyzes the operational conditions under which each scheme can produce optimal performance. 1 Introduction Distr...
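The idea of translating global indices to (processor, local index) pairs through a table, with a software cache in front of it, can be sketched as follows. This is a minimal single-process illustration under stated assumptions. The class names `TranslationTable` and `CachedTranslator` are hypothetical and are not the CHAOS library API, and the in-memory dictionary merely stands in for a table that would be distributed across processors, so each `lookup` here represents what might be a remote query.

```python
class TranslationTable:
    """Hypothetical global-to-local index map for an irregular distribution."""

    def __init__(self, owner_of):
        # owner_of[g] is the processor that owns global index g.
        self.entries = {}
        next_local = {}
        for g, proc in enumerate(owner_of):
            loc = next_local.get(proc, 0)   # next free local slot on that processor
            self.entries[g] = (proc, loc)
            next_local[proc] = loc + 1
        self.lookups = 0                    # counts (potentially remote) table queries

    def lookup(self, g):
        self.lookups += 1
        return self.entries[g]


class CachedTranslator:
    """Software cache in front of the table: repeated indices skip the lookup."""

    def __init__(self, table):
        self.table = table
        self.cache = {}

    def translate(self, g):
        if g not in self.cache:
            self.cache[g] = self.table.lookup(g)
        return self.cache[g]


if __name__ == "__main__":
    table = TranslationTable([0, 1, 1, 0, 1])
    translator = CachedTranslator(table)
    for g in [4, 4, 3, 4]:
        print(g, translator.translate(g))
    print("table lookups:", table.lookups)
```

In an adaptive problem the ownership array changes between phases, so a real scheme would also need a way to invalidate or refresh cached entries. That step is omitted from this sketch.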

