Results 1 - 10
of
29
Implementation and performance analysis of non-blocking collective operations for MPI
- SC07
, 2007
"... Collective operations and non-blocking point-to-point operations have always been part of MPI. Although non-blocking collective operations are an obvious extension to MPI, there have been no comprehensive studies of this functionality. In this paper we present LibNBC, a portable high-performance lib ..."
Abstract
-
Cited by 21 (12 self)
- Add to MetaCart
Collective operations and non-blocking point-to-point operations have always been part of MPI. Although non-blocking collective operations are an obvious extension to MPI, there have been no comprehensive studies of this functionality. In this paper we present LibNBC, a portable high-performance library for implementing non-blocking collective MPI communication operations. LibNBC provides non-blocking versions of all MPI collective operations, is layered on top of MPI-1, and is portable to nearly all parallel architectures. To measure the performance characteristics of our implementation, we also present a microbenchmark for measuring both latency and overlap of computation and communication. Experimental results demonstrate that the blocking performance of the collective operations in our library is comparable to that of collective operations in other highperformance MPI implementations. Our library introduces a very low overhead between the application and the underlying MPI and thus, in conjunction with the potential to overlap communication with computation, offers the potential for optimizing real-world applications.
Productivity and performance using partitioned global address space languages
- PASCO'07
, 2007
"... Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing. One such language, Unified Parallel C (UPC) is an extension of ISO C defined by a consortium that boasts multiple proprietary and open ..."
Abstract
-
Cited by 10 (5 self)
- Add to MetaCart
Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing. One such language, Unified Parallel C (UPC) is an extension of ISO C defined by a consortium that boasts multiple proprietary and open source compilers. Another PGAS language, Titanium, is a dialect of Java T M designed for high performance scientific computation. In this paper we describe some of the highlights of two related projects, the Titanium project centered at U.C. Berkeley and the UPC project centered at Lawrence Berkeley National Laboratory. Both compilers use a source-to-source strategy that translates the parallel languages to C with calls to a communication layer called GASNet. The result is portable highperformance compilers that run on a large variety of shared and distributed memory multiprocessors. Both projects combine compiler, runtime, and application efforts to demonstrate some of the performance and productivity advantages to these languages.
Parallel languages and compilers: Perspective from the Titanium experience
- THE INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS
, 2007
"... We describe the rationale behind the design of key features of Titanium—an explicitly parallel dialect of Java for high-performance scientific programming—and our experiences in building applications with the language. Specifically, we address Titanium’s partitioned global address space model, singl ..."
Abstract
-
Cited by 7 (3 self)
- Add to MetaCart
We describe the rationale behind the design of key features of Titanium—an explicitly parallel dialect of Java for high-performance scientific programming—and our experiences in building applications with the language. Specifically, we address Titanium’s partitioned global address space model, single program multiple data parallelism support, multi-dimensional arrays and array-index calculus, memory management, immutable classes (class-like types that are value types rather than reference types), operator overloading, and generic programming. We provide an overview of the Titanium compiler implementation, covering various parallel analyses and optimizations, Titanium runtime technology and the GASNet network communication layer. We summarize results and lessons learned from
Performance without pain = productivity: Data layout and collective communication in upc
- In Principles and Practices of Parallel Programming (PPoPP
, 2008
"... The next generations of supercomputers are projected to have hundreds of thousands of processors. However, as the numbers of processors grow, the scalability of applications will be the dominant challenge. This forces us to reexamine some of our fundamental ways that we approach the design and use o ..."
Abstract
-
Cited by 5 (3 self)
- Add to MetaCart
The next generations of supercomputers are projected to have hundreds of thousands of processors. However, as the numbers of processors grow, the scalability of applications will be the dominant challenge. This forces us to reexamine some of our fundamental ways that we approach the design and use of parallel languages and runtime systems. In this paper we show how the globally shared arrays in a popular Partitioned Global Address Space (PGAS) language, Unified Parallel C (UPC), can be combined with a new collective interface to improve both performance and scalability. This interface allows subsets, or teams, of threads to perform a collective together. As opposed to MPI’s communicators, our interface allows set of threads to be placed in teams instantly rather than explicitly constructing communicators, thus allowing for a more dynamic team construction and manipulation. We motivate our ideas with three application kernels: Dense Matrix Multiplication, Dense Cholesky factorization and multidimensional Fourier transforms. We describe how the three aforementioned applications can be succinctly written in UPC thereby aiding productivity. We also show how such an interface allows for scalability by running on up to 16,384 processors on the Blue-Gene/L. In a few lines of UPC code, we wrote a dense matrix multiply routine achieves 28.8 TFlop/s and a 3D FFT that achieves 2.1 TFlop/s. We analyze our performance results through models and show that the machine resources rather than the interfaces themselves limit the performance.
Automatic nonblocking communication for partitioned global address space programs
- In Proceedings of the International Conference on Supercomputing (ICS
, 2007
"... Overlapping communication with computation is an important optimization on current cluster architectures; its importance is likely to increase as the doubling of processing power far outpaces any improvements in communication latency. PGAS languages offer unique opportunities for communication overl ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
Overlapping communication with computation is an important optimization on current cluster architectures; its importance is likely to increase as the doubling of processing power far outpaces any improvements in communication latency. PGAS languages offer unique opportunities for communication overlap, because their one-sided communication model enables low overhead data transfer. Recent results have shown the value of hiding latency by manually applying language-level nonblocking data transfer routines, but this process can be both tedious and error-prone. In this paper, we present a runtime framework that automatically schedules the data transfers to achieve overlap. The optimization framework is entirely transparent to the user, and aggressively reorders and aggregates both remote puts and gets. We preserve correctness via runtime conflict checks and temporary buffers, using several techniques to lower the overhead. Experimental results on application benchmarks suggest that our framework can be very effective at hiding communication latency on clusters, improving performance over the blocking code by an average of 16 % for some of the NAS Parallel Benchmarks, 48 % for GUPS, and over 25% for a multi-block fluid dynamics solver. While the system is not yet as effective as aggressive manual optimization, it increases programmers ’ productivity by freeing them from the details of communication management.
Optimizing Collective Communication on Multicores
"... As the gap in performance between the processors and the memory systems continue to grow, the communication component of an application will dictate the overall application performance and scalability. Therefore it is useful to abstract common communication operations across cores as collective comm ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
As the gap in performance between the processors and the memory systems continue to grow, the communication component of an application will dictate the overall application performance and scalability. Therefore it is useful to abstract common communication operations across cores as collective communication operations and tune them through a runtime library that can employ sophisticated automatic tuning techniques. Our focus of this paper is on collective communication in Partitioned Global Address Space languages which are a natural extension of the shared memory hardware of modern multicore systems. In particular we highlight how automatic tuning can lead to significant performance improvements and show how loosening the synchronization semantics of a collective can lead to a more efficient use of the memory system. We demonstrate that loosely synchronized collectives can realize consistent 3 × speedups over their strictly synchronized counterparts on the highly threaded Sun Niagara2 for message sizes ranging from 8 bytes to 64kB. We thus argue that the synchronization requirements for a collective must be exposed in the interface so the collective and the synchronization can be optimized together. 1
Scaling Communication-Intensive Applications on BlueGene/P Using One-Sided Communication and Overlap
"... In earlier work, we showed that the one-sided communication model found in PGAS languages (such as UPC) offers significant advantages in communication efficiency by decoupling data transfer from processor synchronization. We explore the use of the PGAS model on IBM Blue-Gene/P, an architecture that ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
In earlier work, we showed that the one-sided communication model found in PGAS languages (such as UPC) offers significant advantages in communication efficiency by decoupling data transfer from processor synchronization. We explore the use of the PGAS model on IBM Blue-Gene/P, an architecture that combines low-power, quad-core processors with extreme scalability. We demonstrate that the PGAS model, using a new port of the Berkeley UPC compiler and GASNet one-sided communication layer, outperforms two-sided (MPI) communication in both microbenchmarks and a case study of the communication-limited benchmark, NAS FT. We scale the benchmark up to 16,384 cores of the BlueGene/P and demonstrate that UPC consistently outperforms MPI by as much as 66 % for some processor configurations and an average of 32%. In addition, the results demonstrate the scalability of the PGAS model and the Berkeley implementation of UPC, the viability of using it on machines with multicore nodes, and the effectiveness of the BG/P communication layer for supporting one-sided communication and PGAS languages. 1.
Optimizing Communication Overlap for High-Speed Networks Abstract
"... Modern networking hardware supports true non-blocking communication and effective exploitation of this feature can lead to significant application performance improvements. We believe that algorithm design and optimization techniques that hide latency by taking advantage of communication overlap wil ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Modern networking hardware supports true non-blocking communication and effective exploitation of this feature can lead to significant application performance improvements. We believe that algorithm design and optimization techniques that hide latency by taking advantage of communication overlap will facilitate obtaining good parallel efficiency and performance on the highly concurrent contemporary systems. Finding an optimal, performance portable implementation when using non-blocking communication primitives is non-trivial and intimidating to many application developers. In this paper we present a methodology for discovering optimal message sizes and schedules for a variety of application scenarios. This is achieved by combining an analytic model that takes into account the variability of performance parameters with system scale and load with heuristics designed to avoid network congestion. We perform experiments to understand network behavior in the presence of overlap and purge the optimization space for any system based on either resource or implementation constraints. Our approach is able to choose optimal or nearly optimal implementation parameters for a variety of highly non-trivial scenarios and networks with different performance characteristics. Implementations based on parameters chosen by the models are able to hide over 90 % of communication overhead in all cases. 1.
Portable High Performance and Scalability for Partitioned Global Address Space Languages
, 2007
"... Large scale parallel simulations are fundamental tools for engineers and scientists. Con-sequently, it is critical to develop both programming models and tools that enhance devel-opment time productivity, enable harnessing of massively-parallel systems, and to guide the diagnosis of poorly scaling p ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Large scale parallel simulations are fundamental tools for engineers and scientists. Con-sequently, it is critical to develop both programming models and tools that enhance devel-opment time productivity, enable harnessing of massively-parallel systems, and to guide the diagnosis of poorly scaling programs. This thesis addresses this challenge in two ways. First, we show that Co-array Fortran (CAF), a shared-memory parallel program-ming model, can be used to write scientific codes that exhibit high performance on modern parallel systems. Second, we describe a novel technique for analyzing parallel program performance and identifying scalability bottlenecks, and apply it across multiple program-ming models. Although the message passing parallel programming model provides both portability and high performance, it is cumbersome to program. CAF eases this burden by providing a partitioned global address space, but has before now only been implemented on shared-memory machines. To significantly broaden CAF’s appeal, we show that CAF programs can deliver high-performance on commodity cluster platforms. We designed and imple-
Performance portable optimizations for loops containing communication operations
- In International Conference on Supercomputing (ICS
, 2008
"... Effective use of communication networks is critical to the performance and scalability of parallel applications. Partitioned Global Address Space languages like UPC bring the promise of performance and programmer productivity. Studies of well-tuned programs have suggested that PGAS languages are eff ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Effective use of communication networks is critical to the performance and scalability of parallel applications. Partitioned Global Address Space languages like UPC bring the promise of performance and programmer productivity. Studies of well-tuned programs have suggested that PGAS languages are effective at utilizing modern networks because their one-sided communication is a good match to the underlying network hardware. An open question is whether the manual optimizations required to achieve good performance can be performed automatically by the compiler in a performance portable manner. In this paper we present a compiler and runtime optimization framework for loops containing communication operations. Our framework performs compile time message vectorization and strip-mining and defers until runtime the selection of the actual communication operations. At runtime, the communication requirements of the program are analyzed, and communication is instantiated and scheduled based on highly tuned network and application performance models. The runtime analysis takes into account network flow control and quality-of-service restrictions, and it is able to select from a large class of available communication primitives the communication schedule best suited for the dynamic combination of input size and system parameters. The results indicate that our framework produces code that scales and performs better than that of manually optimized implementations. Our approach not only improves performance, but increases programmer productivity as well.

