Performance Analysis of MPI Collective Operations
- In: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS’05) - Workshop 15
, 2005
Cited by 36 (6 self)
Previous studies of application usage show that the performance of collective communications is critical for high performance computing and is often overlooked when compared to point-to-point performance. In this paper we attempt to analyze and improve collective communication in the context of the widely deployed MPI programming paradigm by extending accepted models of point-to-point communication, such as Hockney, LogP/LogGP, and PLogP. The predictions from the models were compared to the experimentally gathered data and our findings were used to optimize the implementation of collective operations in the FT-MPI library.
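As a rough illustration of the kind of modeling this abstract refers to, the sketch below applies the Hockney model, T = alpha + beta * m, to two textbook broadcast schedules. The function names, the parameter values, and the choice of schedules are assumptions made for illustration; they are not taken from the paper or from FT-MPI.

```python
def hockney_time(message_bytes, alpha, beta):
    """Hockney point-to-point estimate: latency plus size over bandwidth.

    alpha is the per-message latency, beta the time per byte
    (inverse bandwidth). Both are hypothetical inputs here.
    """
    return alpha + beta * message_bytes


def linear_bcast_time(procs, message_bytes, alpha, beta):
    """Root sends the full message to each of the other procs in turn."""
    return (procs - 1) * hockney_time(message_bytes, alpha, beta)


def binomial_bcast_time(procs, message_bytes, alpha, beta):
    """Binomial tree: ceil(log2(procs)) sequential full-message rounds."""
    rounds = (procs - 1).bit_length()  # integer ceil(log2(procs)) for procs >= 1
    return rounds * hockney_time(message_bytes, alpha, beta)


if __name__ == "__main__":
    # Assumed units: alpha in microseconds, beta in microseconds per byte.
    for procs in (2, 8, 64):
        print(procs,
              linear_bcast_time(procs, 100, 10, 1),
              binomial_bcast_time(procs, 100, 10, 1))
```

Under this simplified model the tree schedule grows logarithmically with the process count while the linear schedule grows linearly, which is the kind of prediction one would then compare against measurements.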
An Architecture for Optimal All-to-All Personalized Communication
, 1994
Cited by 32 (7 self)
In all-to-all personalized communication (AAPC), every node of a parallel system sends a potentially unique packet to every other node. AAPC is an important primitive operation for modern parallel compilers, since it is used to redistribute data structures during parallel computations. As an extremely dense communication pattern, AAPC causes congestion in many types of networks and therefore executes very poorly on general purpose, asynchronous message passing routers. We present and evaluate a network architecture that executes all-to-all communication optimally on a two-dimensional torus. The router combines optimal partitions of the AAPC step with a self-synchronizing switching mechanism integrated into a conventional wormhole router. Optimality is achieved by routing along shortest paths while fully utilizing all links. A simple hardware addition for synchronized message switching can guarantee optimal AAPC routing in many existing network architectures. The flexible communication agent of the iWarp VLSI component allowed us to implement an efficient prototype for the evaluation of the hardware complexity as well as possible software overheads. The measured performance on an 8 × 8 torus exceeded 2 GigaBytes/sec or 80% of the limit set by the raw speed of the interconnects. We make a quantitative comparison of the AAPC router with a conventional message passing system. The potential gain of such a router for larger parallel programs is illustrated with the example of a two-dimensional Fast Fourier Transform.
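The data movement that AAPC performs can be pictured as a transpose of a per-node packet table. The sketch below only models that end result in a single process; it says nothing about routing, the torus, or the hardware described in the abstract, and the function name is an assumption for illustration.

```python
def all_to_all_personalized(send):
    """Model the result of AAPC on n nodes.

    send[src][dst] is the packet node `src` holds for node `dst`.
    The returned table satisfies recv[dst][src] == send[src][dst].
    The diagonal entry stays on its own node.
    """
    n = len(send)
    if any(len(row) != n for row in send):
        raise ValueError("send must be an n-by-n table")
    return [[send[src][dst] for src in range(n)] for dst in range(n)]


if __name__ == "__main__":
    send = [["a0", "a1"], ["b0", "b1"]]
    print(all_to_all_personalized(send))
```

Every off-diagonal entry crosses the network, which is why the abstract calls the pattern extremely dense.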
Optimization of Collective communication operations in MPICH
- International Journal of High Performance Computing Applications
, 2005
Cited by 31 (2 self)
We describe our work on improving the performance of collective communication operations in MPICH for clusters connected by switched networks. For each collective operation, we use multiple algorithms depending on the message size, with the goal of minimizing latency for short messages and minimizing bandwidth use for long messages. Although we have implemented new algorithms for all MPI (Message Passing Interface) collective operations, because of limited space we describe only the algorithms for allgather, broadcast, all-to-all, reduce-scatter, reduce, and allreduce. Performance results on a Myrinet-connected Linux cluster and an IBM SP indicate that, in all cases, the new algorithms significantly outperform the old algorithms used in MPICH on the Myrinet cluster, and, in many cases, they outperform the algorithms used in IBM’s MPI on the SP. We also explore in further detail the optimization of two of the most commonly used collective operations, allreduce and reduce, particularly for long messages and non-power-of-two numbers of processes. The optimized algorithms for these operations perform several times better than the native algorithms on a Myrinet cluster, IBM SP, and Cray T3E. Our results indicate that to achieve the best performance for a collective communication operation, one needs to use a number of different algorithms and select the right algorithm for a particular message size and number of processes.
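The idea of switching algorithms by message size can be sketched with a simple latency/bandwidth cost model. The two cost expressions below are commonly quoted estimates for allreduce on a power-of-two number of processes, with computation time ignored; the parameter values and the model-based selection rule are assumptions for illustration and do not reproduce the thresholds used in MPICH.

```python
def recursive_doubling_cost(procs, nbytes, alpha, beta):
    """log2(p) rounds, each exchanging the full vector (latency-friendly)."""
    rounds = (procs - 1).bit_length()
    return rounds * (alpha + nbytes * beta)


def reduce_scatter_allgather_cost(procs, nbytes, alpha, beta):
    """Twice the rounds, but only about 2n bytes moved in total."""
    rounds = (procs - 1).bit_length()
    return 2 * rounds * alpha + 2 * (procs - 1) / procs * nbytes * beta


def pick_allreduce(procs, nbytes, alpha, beta):
    """Return the name of the cheaper algorithm under the model above."""
    costs = {
        "recursive_doubling": recursive_doubling_cost(procs, nbytes, alpha, beta),
        "reduce_scatter_allgather":
            reduce_scatter_allgather_cost(procs, nbytes, alpha, beta),
    }
    return min(costs, key=costs.get)


if __name__ == "__main__":
    # Hypothetical machine: alpha = 100 time units, beta = 1 unit per byte.
    for nbytes in (10, 1000, 100000):
        print(nbytes, pick_allreduce(8, nbytes, 100, 1))
```

Short messages are dominated by the latency term and long messages by the bandwidth term, so the preferred algorithm flips as the message grows.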
Practical Parallel Algorithms for Dynamic Data Redistribution, Median Finding, and Selection (Extended Abstract)
, 1996
Cited by 25 (10 self)
David A. Bader, Joseph JáJá. Institute for Advanced Computer Studies, and Department of Electrical Engineering, University of Maryland, College Park, MD 20742. E-mail: {dbader, joseph}@umiacs.umd.edu. Abstract: A common statistical problem is that of finding the median element in a set of data. This paper presents a fast and portable parallel algorithm for finding the median given a set of elements distributed across a parallel machine. In fact, our algorithm solves the general selection problem that requires the determination of the element of rank i, for an arbitrarily given integer i. Practical algorithms needed by our selection algorithm for the dynamic redistribution of data are also discussed. Our general framework is a distributed memory programming model enhanced by a set of communication primitives. We use efficient techniques for distributing, coalescing, and load balancing data as well as efficient combinations of task and data parallelism. The algorithms have been coded in SPLIT-C and run on a variety of platforms, including the Thinking Machines CM-5, IBM SP-1 and SP-2, Cray Research T3D, Meiko Scientific CS-2, Intel Paragon, and workstation clusters. Our experimental results illustrate the scalability and efficiency of our algorithms across different platforms and improve upon all the related experimental results known to the authors.
Modeling parallel bandwidth: Local vs. global restrictions
Cited by 15 (4 self)
Recently there has been an increasing interest in models of parallel computation that account for the bandwidth limitations in communication networks. Some models (e.g., BSP and LogP) account for bandwidth limitations using a per-processor parameter g > 1, such that each processor can send/receive at most h messages in g·h time. Other models (e.g., PRAM(m)) account for bandwidth limitations as an aggregate parameter m < p, such that the p processors can send at most m messages in total at each step. This paper provides the first detailed study of the algorithmic implications of modeling parallel bandwidth as a per-processor (local) limitation versus an aggregate (global) limitation. We consider a number of basic problems
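The difference between the two kinds of limit can be seen on a toy communication pattern. The sketch below is a simplified illustration, not the paper's analysis: it counts only messages sent, assumes a processor sends at most one message per step in the global model, and uses hypothetical function names.

```python
def local_limit_time(sent, g):
    """Per-processor limit: g * h, where h is the largest number of
    messages any single processor sends."""
    return g * max(sent)


def global_limit_steps(sent, m):
    """Aggregate limit: at most m messages in total per step.

    Assumption: a processor also sends at most one message per step,
    so the busiest sender is a second lower bound on the step count.
    """
    total = sum(sent)
    return max(max(sent), -(-total // m))  # -(-a // b) is integer ceil(a / b)


if __name__ == "__main__":
    p, m = 8, 2
    g = p // m  # matches the two models on a balanced pattern
    balanced = [1] * p            # every processor sends one message
    skewed = [4] + [0] * (p - 1)  # one processor sends four messages
    for pattern in (balanced, skewed):
        print(pattern, local_limit_time(pattern, g), global_limit_steps(pattern, m))
```

With g set to p/m the two estimates agree on the balanced pattern, while on the skewed pattern the per-processor model charges more, which hints at why the choice of model can affect algorithm design.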
Automatic Generation and Tuning of MPI Collective Communication Routines
- 19th ACM International Conference on Supercomputing (ICS’05)
, 2005
Cited by 14 (4 self)
In order for collective communication routines to achieve high performance on different platforms, they must be able to adapt to the system architecture and use different algorithms for different situations. Current Message Passing Interface (MPI) implementations, such as MPICH and LAM/MPI, are not fully adaptable to the system architecture and are not able to achieve high performance on many platforms. In this paper, we present a system that produces efficient MPI collective communication routines. By automatically generating topology-specific routines and using an empirical approach to select the best implementations, our system adapts to a given platform and constructs routines that are customized for the platform. The experimental results show that the tuned routines consistently achieve high performance on clusters with different network topologies.
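The empirical selection step can be sketched as timing each candidate on a representative workload and keeping the fastest. This is a generic sketch under stated assumptions: `pick_fastest`, its parameters, and the stand-in candidates are hypothetical and are not the interface of the system described in the abstract.

```python
import time


def pick_fastest(candidates, workload, repeats=3, timer=time.perf_counter):
    """Return the name of the candidate with the lowest best-of-`repeats` time.

    candidates maps a name to a callable taking `workload`.
    timer is injectable so the selection can be checked deterministically.
    """
    best_name, best_time = None, float("inf")
    for name, routine in candidates.items():
        samples = []
        for _ in range(repeats):
            start = timer()
            routine(workload)
            samples.append(timer() - start)
        if min(samples) < best_time:
            best_name, best_time = name, min(samples)
    return best_name


if __name__ == "__main__":
    # Two stand-in "implementations" of the same operation (summing a list).
    candidates = {
        "builtin_sum": lambda data: sum(data),
        "manual_loop": lambda data: [x for x in data] and sum(data),
    }
    print(pick_fastest(candidates, list(range(100000))))
```

Taking the minimum over several repeats is one common way to reduce the effect of timing noise; a real tuner would also vary message sizes and process counts.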
Exchange of Messages of Different Sizes
- In IRREGULAR '98
Cited by 11 (4 self)
In this paper, we study the exchange of messages among a set of processors linked through an interconnection network. We focus on general, non-uniform versions of all-to-all (or complete) exchange problems in asynchronous systems with a linear cost model and messages of arbitrary sizes. We extend previous complexity results to show that the general asynchronous problems are NP-complete. We present several approximation algorithms and determine which heuristics are best suited to several parallel systems. We conclude with experimental results that show that our algorithms outperform the native all-to-all exchange algorithm on an IBM SP2 when the number of processors is odd.
Efficient Communication Using Total-Exchange
Cited by 10 (0 self)
... programs using high-level, general-purpose, and architecture-independent programming language and have them executed on a variety of parallel and distributed architectures without sacrificing efficiency. A large body of research suggests that, at least in theory, general-purpose parallel computing is indeed possible provided certain conditions are met: an excess of logical parallelism in the program, and the ability of the target architecture to efficiently realize balanced communication patterns. The canonical example of a balanced communication pattern is an h-relation, in which each processor is the origin and destination of at most h messages. A plethora of protocols has been designed for routing h-relations in a variety of networks. The goal has been to minimize the value of h while guaranteeing delivery of the messages within time a constant factor from optimal. In this paper we describe protocols that meet the most stringent efficiency requirement, namely delivery of messages within time that is a lower order additive term from the best achievable. Such protocols are called 1-optimal. While these protocols achieve 1-optimality only for heavily loaded networks, that is, for large values of h, they are remarkable for their simplicity in that they only use the total-exchange communication primitive. The total-exchange can be realized in many networks using very simple, contention-free, and extremely efficient schemes. The technical contribution of this paper is a protocol to route random h-relations in an N-processor network using hN(1+o(1))+O(loglogN) total-exchange rounds with high probability. Using message duplication, we can improve the bound to hN(1+o(1))+O(logN). This improves upon the hN(1+o(1))+O(logN) bound of Gerbessiotis and Valiant. While our theoretical improvements are modest, our experimental results show an improvement over the protocol of Gerbessiotis and Valiant.
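The definition of an h-relation can be made concrete by computing the smallest h that a given set of messages satisfies. The sketch below covers only that definition, not the routing protocol of the paper; the function name and the (source, destination) message encoding are assumptions for illustration.

```python
from collections import Counter


def smallest_h(messages):
    """Return the smallest h such that `messages` form an h-relation.

    messages is an iterable of (source, destination) processor ids.
    A pattern is an h-relation when no processor is the origin of more
    than h messages and none is the destination of more than h messages.
    """
    messages = list(messages)
    sent = Counter(src for src, _ in messages)
    received = Counter(dst for _, dst in messages)
    return max(max(sent.values(), default=0),
               max(received.values(), default=0))


if __name__ == "__main__":
    # A total exchange on 3 processors: everyone sends to everyone else.
    total_exchange = [(s, d) for s in range(3) for d in range(3) if s != d]
    print(smallest_h(total_exchange))
```

In a total exchange on N processors each processor sends and receives N - 1 messages, so the pattern is perfectly balanced.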
Optimization of Collective Reduction Operations
- In ######## ###### ####### ######, Springer-Verlag LNCS 3036
, 2004
Efficient Implementation of Reduce-Scatter in MPI
- J. Syst. Archit
, 1998
Cited by 9 (0 self)
We discuss the efficient implementation of a collective operation called reduce-scatter, which is defined in the MPI standard. The reduce-scatter is equivalent to the combination of a reduction on vectors of length n with a scatter of the resulting n-vector to all processors. We describe the implementation issues and the performance characterization of two new algorithms for the reduce-scatter that have been proven to be highly efficient in theory under the assumption of a fully connected parallel system. A performance comparison with existing mainstream implementations of the operation is presented which confirms the practical advantage of the new algorithms. Experiments show that the two algorithms have different characteristics which make them complementary in providing a performance gain over standard algorithms. Our study has been carried out in the context of the MPI standard on two different platforms: an SP2 and a Myrinet interconnected cluster of Pentium PRO. However...
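The equivalence stated in this abstract, a reduction on length-n vectors followed by a scatter of the result, can be written down directly as reference semantics. The sketch below runs in a single process and is the naive composition, not either of the two new algorithms from the paper; the function name and argument layout are assumptions for illustration.

```python
import functools
import operator


def reduce_scatter(vectors, counts, op=operator.add):
    """Reference semantics of reduce-scatter, simulated in one process.

    vectors[r] is the length-n input of rank r; counts[r] is how many
    elements of the reduced vector rank r receives.
    Returns a list whose entry r is the segment delivered to rank r.
    """
    n = len(vectors[0])
    if any(len(v) != n for v in vectors) or sum(counts) != n:
        raise ValueError("vectors must share length n and counts must sum to n")
    # Step 1: element-wise reduction across all ranks.
    reduced = [functools.reduce(op, column) for column in zip(*vectors)]
    # Step 2: scatter consecutive segments of the result.
    segments, start = [], 0
    for count in counts:
        segments.append(reduced[start:start + count])
        start += count
    return segments


if __name__ == "__main__":
    print(reduce_scatter([[1, 2, 3, 4], [10, 20, 30, 40]], [2, 2]))
```

Efficient implementations avoid materializing the full reduced vector on any one rank, which is what makes dedicated algorithms worthwhile.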

