Flexible collective communication tuning architecture applied to Open MPI
- In 2006 Euro PVM/MPI
, 2006
Cited by 6 (0 self)
Abstract. Collective communications are invaluable to modern high performance applications, although most users of these communication patterns do not want to know their innermost workings. The implementation of the collectives is often left to the middleware developer, such as those providing an MPI library. As many of these libraries are designed to be both generic and portable, the MPI developers commonly offer internal tuning options, suitable only for knowledgeable users, that allow some level of customization. The work presented in this paper aims not only to provide a very efficient set of collective operations for use with the Open MPI implementation but also to make the control and tuning of them straightforward and flexible. Additionally, this paper demonstrates a novel example of the proposed framework's flexibility by dynamically tuning an MPI_Alltoallv algorithm at runtime.
A Case for Non-Blocking Collective Operations
- In Frontiers of High Performance Computing and Networking - ISPA 2006 Workshops
, 2006
Cited by 6 (1 self)
Abstract. Non-blocking collective operations for MPI have been in discussion for a long time. We want to contribute to this discussion, give a rationale for the usage of these operations, and assess their possible benefits. A LogGP model for the CPU overhead of collective algorithms and a benchmark to measure it are provided and show a large potential to overlap communication and computation. We show that non-blocking collective operations can provide at least the same benefits as non-blocking point-to-point operations already do. Our claim is that the actual CPU overhead for non-blocking collective operations depends on the message size and the communicator size, and that these operations especially benefit highly scalable applications with huge communicators. We prove that the overhead's share of the overall communication time of current blocking collective operations gets smaller with bigger communicators and larger messages. We show that the user-level CPU overhead is less than 10% for MPICH2 and LAM/MPI using TCP/IP communication, which leads us to the conclusion that, by using non-blocking collective communication, ideally 90% idle CPU time can be freed for the application.
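The arithmetic behind the 10% and 90% figures can be illustrated with a minimal sketch. It assumes, as a simplification that is not taken from the paper, that the time of a blocking collective splits cleanly into a CPU overhead share and an idle waiting share, and that the idle share can be hidden completely behind computation. The function names are hypothetical.

```python
def freed_cpu_fraction(cpu_overhead_share: float) -> float:
    """Fraction of a blocking collective's time that an ideal
    non-blocking collective returns to the application (assumed model)."""
    if not 0.0 <= cpu_overhead_share <= 1.0:
        raise ValueError("cpu_overhead_share must be in [0, 1]")
    return 1.0 - cpu_overhead_share


def overlapped_time(compute: float, comm: float, cpu_overhead_share: float) -> float:
    """Idealized runtime when the idle part of the communication
    overlaps perfectly with computation (assumed model)."""
    overhead = comm * cpu_overhead_share  # CPU work that cannot be hidden
    idle = comm - overhead                # waiting time that can be overlapped
    return overhead + max(compute, idle)
```

Under these assumptions, `freed_cpu_fraction(0.10)` is about 0.9, and a run with 20 time units of computation and 10 of communication would take about 21 units overlapped instead of 30 units blocking.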
Automatic performance optimization of the discrete Fourier transform on distributed memory computers
- In Proc. International Symposium on Parallel and Distributed Processing and Applications (ISPA)
, 2006
Cited by 4 (4 self)
Abstract. This paper introduces a formal framework for automatically generating performance-optimized implementations of the discrete Fourier transform (DFT) for distributed memory computers. The framework is implemented as part of the program generation and optimization system SPIRAL. DFT algorithms are represented as mathematical formulas in SPIRAL’s internal language SPL. Using a tagging mechanism and formula rewriting, we extend SPIRAL to automatically generate parallelized formulas. Using the same mechanism, we enable the generation of rescaling DFT algorithms, which redistribute the data in intermediate steps to fewer processors to reduce communication overhead. It is a novel feature of these methods that the redistribution steps are merged with the communication steps of the algorithm to avoid additional communication overhead. Among the possible alternative algorithms, SPIRAL’s search mechanism now determines the fastest for a given platform, effectively generating adapted code without human intervention. Experiments with DFT MPI programs generated by SPIRAL show performance gains of up to 30% due to rescaling. Further, our generated programs compare favorably with FFTW-MPI 2.1.5.
Improving communication performance in dense linear algebra
"... via topology aware collectives ..."
Message Progression in Parallel Computing - To Thread or not to Thread
- In Proceedings of the 2008 IEEE International Conference on Cluster Computing. IEEE Computer Society
, 2008
Cited by 3 (2 self)
Abstract. Message progression schemes that enable communication and computation to be overlapped have the potential to improve the performance of parallel applications. With currently available high-performance networks there are several options for making progress: manual progression, use of a progress thread, and communication offload. In this paper we analyze threaded progression approaches, comparing the effects of using shared or dedicated CPU cores for progression. To perform these comparisons, we propose time-based and work-based benchmark schemes. As expected, threaded progression performs well when a spare core is available to be dedicated to communication progression, but a number of operating system effects prevent the same benefits from being obtained when communication progress must share a core with computation. We show that some limited performance improvement can be obtained in the shared-core case by real-time scheduling of the progress thread.
Decision trees and MPI collective algorithm selection problem
, 2006
Cited by 3 (1 self)
Selecting the close-to-optimal collective algorithm based on the parameters of the collective call at run time is an important step in achieving good performance of MPI applications. In this paper, we explore the applicability of C4.5 decision trees to the MPI collective algorithm selection problem. We construct C4.5 decision trees from the measured algorithm performance data and analyze the decision tree properties and expected run time performance penalty. In the cases we considered, results show that the C4.5 decision trees can be used to generate a reasonably small and very accurate decision function. For example, the Broadcast decision tree with only 21 leaves was able to achieve a mean performance penalty of 2.08%. Similarly, combining experimental data for Reduce and Broadcast and generating a decision function from the combined decision trees resulted in less than 2.5% relative performance penalty. The results indicate that C4.5 decision trees are applicable to this problem and should be more widely used in this domain.
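To make the terms "decision function" and "mean performance penalty" concrete, here is a minimal sketch. The thresholds and algorithm names are invented for illustration and do not come from the paper; the penalty is assumed to be the relative slowdown of the chosen algorithm against the fastest measured one, averaged over all measured points.

```python
from statistics import mean


def choose_broadcast(comm_size: int, msg_bytes: int) -> str:
    """Hypothetical decision function of the kind a decision tree could emit.
    Thresholds and algorithm names are illustrative assumptions."""
    if msg_bytes < 2048:
        return "binomial"
    if comm_size < 8:
        return "linear"
    return "pipeline"


def mean_penalty(measurements, decide) -> float:
    """Average relative slowdown of the chosen algorithm versus the best one.

    measurements maps (comm_size, msg_bytes) to {algorithm_name: time}.
    """
    penalties = []
    for (comm_size, msg_bytes), times in measurements.items():
        best = min(times.values())
        chosen = times[decide(comm_size, msg_bytes)]
        penalties.append((chosen - best) / best)
    return mean(penalties)
```

A penalty of 0 means the function always picked the fastest measured algorithm; a value of 0.0208 would correspond to the 2.08% quoted in the abstract.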
MPI collective algorithm selection and quadtree encoding
- In 2006 Euro PVM/MPI
, 2006
Cited by 2 (1 self)
We explore the applicability of the quadtree encoding method to the run-time MPI collective algorithm selection problem. Measured algorithm performance data was used to construct quadtrees with different properties. The quality and performance of generated decision functions and in-memory decision systems were evaluated. Experimental data shows that in some cases, a decision function based on a quadtree structure with a mean depth of 3 can incur as little as a 5% performance penalty on average. Experimentally measured data was fully represented using quadtrees with a maximum of 6 levels. Our results indicate that quadtrees may be a feasible choice for both processing of the performance data and automatic decision function generation.
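The idea of encoding a decision map as a quadtree can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the map is assumed to be a square grid of side 2^k whose cells hold the label of the fastest algorithm (the two axes could be communicator size and message size), and a region that is still mixed when the depth limit is reached is assumed to collapse to its most common label. The names `build_quadtree` and `lookup` are hypothetical.

```python
from collections import Counter


def build_quadtree(grid, x=0, y=0, size=None, max_depth=6):
    """Encode a square 2^k x 2^k grid of labels as a quadtree.

    A leaf is a label; an internal node is a 4-tuple of children ordered
    (top-left, top-right, bottom-left, bottom-right). Labels must not be tuples.
    """
    if size is None:
        size = len(grid)
    cells = [grid[r][c] for r in range(y, y + size) for c in range(x, x + size)]
    if len(set(cells)) == 1 or max_depth == 0 or size == 1:
        # Uniform region, or depth limit reached: keep the majority label.
        return Counter(cells).most_common(1)[0][0]
    half = size // 2
    return tuple(
        build_quadtree(grid, cx, cy, half, max_depth - 1)
        for cy in (y, y + half)
        for cx in (x, x + half)
    )


def lookup(tree, x, y, size):
    """Return the label stored for cell (column x, row y) of a size x size grid."""
    while isinstance(tree, tuple):
        size //= 2
        index = (2 if y >= size else 0) + (1 if x >= size else 0)
        x, y = x % size, y % size  # coordinates relative to the chosen quadrant
        tree = tree[index]
    return tree
```

Lowering `max_depth` gives a smaller tree at the cost of mislabeling some cells, which is the size versus accuracy trade-off the abstract describes.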
TOTAL EXCHANGE PERFORMANCE PREDICTION ON GRID ENVIRONMENTS modeling and algorithmic issues
Cited by 2 (0 self)
One of the most important collective communication patterns used in scientific applications is the complete exchange, also called All-to-All. Although efficient algorithms have been studied for specific networks, general solutions like those available in well-known MPI distributions (e.g. the MPI_Alltoall operation) are strongly influenced by the congestion of network resources. In this paper we address the problem of modeling the performance of Total Exchange communication operations in grid environments. Because traditional performance models are unable to predict the real completion time of an All-to-All operation, we try to cope with this problem by identifying the factors that can interfere in both local and distant transmissions. We observe that the traditional MPI_Alltoall implementation is not suited for grid environments, as it is both inefficient and hard to model. We therefore focus on an alternative algorithm for the total exchange redistribution problem. In our approach we perform communications in two different phases, aiming to minimize the number of communication steps through the wide-area network. This reduction has a direct impact on the performance modeling of the MPI_Alltoall operation, as we minimize the factors that interfere with wide-area communications. Hence, we are able to define an accurate performance model of a total exchange between two clusters.
Keywords: MPI, all-to-all, total exchange, network contention, performance modeling, computational grids, personalized many-to-many communications
Group Operation Assembly Language - A Flexible Way to Express Collective Communication
Cited by 2 (1 self)
The implementation and optimization of collective communication operations is an important field of active research. Such operations directly influence application performance and need to map the communication requirements in an optimal way to steadily changing network architectures. In this work, we define an abstract domain-specific language to express arbitrary group communication operations. We show the universality of this language and how all existing collective operations can be implemented with it. By design, it readily lends itself to blocking and nonblocking execution, as well as to off-loaded execution of complex group communication operations. We also define several offline and online optimizations (compiler transformations and scheduling decisions, respectively) to improve the overall performance of the operation. Performance results show that the overhead to express current collective operations is negligible in comparison to the potential gains in a highly optimized implementation.
Decision trees and MPI collective algorithm selection problem
, 2007
Cited by 2 (0 self)
Abstract. Selecting the close-to-optimal collective algorithm based on the parameters of the collective call at run time is an important step for achieving good performance of MPI applications. In this paper, we explore the applicability of C4.5 decision trees to the MPI collective algorithm selection problem. We construct C4.5 decision trees from the measured algorithm performance data and analyze both the decision tree properties and the expected run time performance penalty. In the cases we considered, results show that the C4.5 decision trees can be used to generate a reasonably small and very accurate decision function. For example, the broadcast decision tree with only 21 leaves was able to achieve a mean performance penalty of 2.08%. Similarly, combining experimental data for reduce and broadcast and generating a decision function from the combined decision trees resulted in less than 2.5% relative performance penalty. The results indicate that C4.5 decision trees are applicable to this problem and should be more widely used in this domain.

