Performance Analysis of MPI Collective Operations
- In: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS’05) - Workshop 15
, 2005
Cited by 36 (6 self)
Previous studies of application usage show that the performance of collective communications is critical for high-performance computing, yet it is often overlooked in favor of point-to-point performance. In this paper we attempt to analyze and improve collective communication in the context of the widely deployed MPI programming paradigm by extending accepted models of point-to-point communication, such as Hockney, LogP/LogGP, and PLogP. The predictions from the models were compared to the experimentally gathered data and our findings were used to optimize the implementation of collective operations in the FT-MPI library.
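As a rough illustration of what extending a point-to-point model to a collective can look like, the sketch below uses the Hockney model, in which sending m bytes costs alpha + beta * m (alpha is the per-message latency, beta the per-byte transfer time). The choice of a binomial-tree broadcast, the function names, and the parameter values are assumptions made for this example; the abstract does not state which formulas the paper derives.

```python
def hockney_p2p(m_bytes: float, alpha: float, beta: float) -> float:
    """Hockney estimate for one point-to-point message of m_bytes bytes."""
    return alpha + beta * m_bytes


def binomial_bcast(p: int, m_bytes: float, alpha: float, beta: float) -> float:
    """Estimate for a binomial-tree broadcast among p processes.

    Assumption for this sketch: the tree needs ceil(log2(p)) rounds and
    every round forwards the full message once.
    """
    rounds = (p - 1).bit_length()  # integer ceil(log2(p)); 0 when p == 1
    return rounds * hockney_p2p(m_bytes, alpha, beta)


if __name__ == "__main__":
    # Hypothetical parameters: 1 microsecond latency, 1 ns per byte.
    for procs in (2, 8, 64):
        print(procs, binomial_bcast(procs, 1000, 1e-6, 1e-9))
```

Comparing such predictions against measured timings is the kind of check the abstract describes; real parameter values would have to be measured on the target network.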
Optimization of Collective communication operations in MPICH
- International Journal of High Performance Computing Applications
, 2005
Cited by 31 (2 self)
We describe our work on improving the performance of collective communication operations in MPICH for clusters connected by switched networks. For each collective operation, we use multiple algorithms depending on the message size, with the goal of minimizing latency for short messages and minimizing bandwidth use for long messages. Although we have implemented new algorithms for all MPI (Message Passing Interface) collective operations, because of limited space we describe only the algorithms for allgather, broadcast, all-to-all, reduce-scatter, reduce, and allreduce. Performance results on a Myrinet-connected Linux cluster and an IBM SP indicate that, in all cases, the new algorithms significantly outperform the old algorithms used in MPICH on the Myrinet cluster, and, in many cases, they outperform the algorithms used in IBM’s MPI on the SP. We also explore in further detail the optimization of two of the most commonly used collective operations, allreduce and reduce, particularly for long messages and non-power-of-two numbers of processes. The optimized algorithms for these operations perform several times better than the native algorithms on a Myrinet cluster, IBM SP, and Cray T3E. Our results indicate that to achieve the best performance for a collective communication operation, one needs to use a number of different algorithms and select the right algorithm for a particular message size and number of processes.
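The idea of selecting an algorithm per message size and process count can be sketched with a simple cost comparison. The two candidate allreduce algorithms and their cost expressions below are the ones commonly quoted for a latency/bandwidth model with a power-of-two process count; they are assumptions for this example, not something the abstract confirms, and all names and parameter values are hypothetical.

```python
def cost_recursive_doubling(p, n, alpha, beta, gamma):
    """Assumed cost: log2(p) rounds, each moving and reducing all n bytes."""
    rounds = (p - 1).bit_length()
    return rounds * (alpha + n * beta + n * gamma)


def cost_reduce_scatter_allgather(p, n, alpha, beta, gamma):
    """Assumed cost: twice the rounds, but each phase moves only (p-1)/p of n."""
    rounds = (p - 1).bit_length()
    frac = (p - 1) / p
    return 2 * rounds * alpha + 2 * frac * n * beta + frac * n * gamma


def pick_allreduce(p, n, alpha, beta, gamma):
    """Return the name of the cheaper algorithm under the assumed model."""
    candidates = {
        "recursive_doubling": cost_recursive_doubling(p, n, alpha, beta, gamma),
        "reduce_scatter_allgather": cost_reduce_scatter_allgather(p, n, alpha, beta, gamma),
    }
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    # Hypothetical parameters: alpha = latency, beta = s/byte, gamma = reduce s/byte.
    for size in (8, 10_000_000):
        print(size, pick_allreduce(8, size, 1e-5, 1e-9, 1e-10))
```

Under these assumed parameters the latency term dominates for short messages, so the variant with fewer rounds wins, while for long messages the variant that moves less data wins. This matches the trade-off the abstract describes.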
Implementing a Low Cost, Low Latency Parallel Platform
- Parallel Computing
, 1997
Cited by 13 (9 self)
The cost of high-performance parallel platforms prevents parallel processing techniques from spreading in present applications. Networks of Workstations (NOW) exploiting off-the-shelf communication hardware, high-end PCs and standard communication software provide much cheaper but poorly performing parallel platforms. In our NOW prototype called GAMMA (Genoa Active Message MAchine) every node is a PC running a Linux operating system kernel enhanced with efficient communication mechanisms based on the Active Message paradigm. Active Messages supply virtualization of the network interface close enough to the raw hardware to guarantee good performance. The preliminary performance measures obtained by GAMMA show how competitive such a cheap NOW is. 1 Introduction Historically Local Area Network (LAN) device drivers in the Operating System (OS) kernel of a workstation have never been optimized like other devices whose performance is critical for user applications (such as disk drivers, memo...
Assessing the Performance of the New IBM SP2 Communication Subsystem
, 1996
Cited by 7 (0 self)
IBM has recently launched an upgrade of the communication subsystem of its SP2 parallel computer. This change affects the communication hardware (high-performance switch and interface adapters) as well as the communication software (MPI implementation). In order to characterize to what extent the execution times of parallel applications will be affected by these changes, a collection of benchmarks has been run on an SP2 with the old communication subsystem and on the same machine after being upgraded. These benchmarks include point-to-point and collective communication tests as well as complete parallel applications. The performance indicators are the latency and throughput exhibited by the basic communication tests, and the execution time in the case of real applications. Keywords Communication subsystem, message passing networks, massively parallel computers, performance evaluation, IBM SP2. 1 Introduction A long time has passed since the high-performance computing community realized ...
Lonsdale: Comparative Efficiencies of Domain Decompositions
- Parallel Computing
, 1994
Cited by 5 (0 self)
We identify and compare two different classes of domain decomposition algorithms suitable for grid-based simulation programs, such as those commonly found in finite element and finite volume codes. The domain decomposition schemes are either logical methods that account only for the connectedness of the grid, or physical methods that take account only of the spatial separation of grid points. We use Engineering Systems International's finite element PAM-CRASH code on a number of grids of varying size and complexity. The recursive spectral partitioning method (a logical method) is consistently better than all the other methods for small to medium sized grids, while the physical methods are marginally better for comparatively larger grid sizes.
Architectural Issues and Preliminary Benchmarking of a Low-cost Network of Workstations based on Active Messages
- In Proc. 14th ITG/GI conference on Architecture of Computer Systems (ARCS'97)
, 1997
Cited by 5 (2 self)
The cheapest platforms for parallel processing are Networks of Workstations (NOWs) equipped with off-the-shelf communication hardware and high-end Personal Computers running an operating system with networking capabilities. However standard network protocols and mechanisms cannot deliver a satisfactory amount of the raw communication performance to the application level. The Genoa Active Message MAchine (GAMMA) overcomes such a limitation by implementing the Active Message communication paradigm in a NOW architecture. The Active Message layer of GAMMA drastically shortens the communication path between two communicating processes while preserving a high-level and easy-to-use interface to SPMD/MIMD programming. A Parallel Matrix Multiplication (PMM) algorithm was run on GAMMA as a preliminary benchmark. The comparison between the performance achieved by PMM on GAMMA and the performance achieved by PMM with Linux TCP/IP communications on the same hardware platform demonstrate the success...
A Performance-oriented Operating System Approach to Fast Communications in a Low-cost Network of Workstations
- In Proc. 1998 International Conference on Parallel and Distributed Processing, Techniques and Applications (PDPTA'98), volume I
, 1998
Cited by 4 (2 self)
The use of workstations connected by a fast Local Area Network (LAN) to form a so-called Network of Workstations (NOW) is a very appealing way to implement a low-cost parallel processing platform. Interprocess communication is the most difficult feature for such a system to implement with an acceptable level of performance. Standard protocols and mechanisms implemented at Operating System (OS) level usually do not provide satisfactory performance in a NOW architecture, especially with respect to the communication performance offered by the raw interconnection hardware. Two main solutions to this efficiency issue have been proposed so far, namely: standard OS mechanisms relying on simplified communication protocols, and user-level protected access to the raw communication hardware. We show that a third way, namely efficient OS mechanisms supporting an Active Message communication layer, can not only offer higher level communication primitives in a multiprogrammed environment but also o...
GAMMA: Architecture, Programming Interface and Preliminary Benchmarking
, 1996
Cited by 3 (2 self)
The cost of high-performance parallel platforms prevents parallel processing techniques from spreading in present applications. Networks of Workstations (NOW) exploiting off-the-shelf communication hardware, high-end PCs and standard communication software provide much cheaper but poorly performing parallel platforms. Indeed, standard network protocols and mechanisms cannot deliver a satisfactory amount of the communication performance of the raw hardware to applications. The GAMMA (Genoa Active Message MAchine) prototype is an attempt to overcome such a limitation by adopting a minimal communication protocol and the Active Message communication paradigm. The virtualization of the network interface provided by GAMMA is close enough to the raw hardware to guarantee good performance, while still providing a usable programming interface. This paper illustrates the software architecture of the communication layer of GAMMA. Two main optimizations are discussed and their remarkable impact on...
A comparison of MPI performance on different MPPs
, 1997
Cited by 3 (1 self)
Since MPI [1] has become a standard for message-passing on distributed memory machines, a number of implementations have evolved. Today there is an MPI implementation available for all relevant MPP systems, a number of which are based on MPICH [2]. In this paper we present a performance comparison of several implementations of MPI on different MPPs. Results for the Cray T3E, the IBM RS/6000 SP, the Hitachi SR2201 and the Intel Paragon are presented. In addition we compare those results to the NEC SX-4, a shared memory PVP. The results show latency and bandwidth for point-to-point communication. In addition, results for global communications and synchronization are given. This covers a wide range of MPI features used by typical numerical simulation codes. Finally we investigate a core conjugate gradient solver operation to show the behaviour of latency-hiding techniques on different platforms. 1 Introduction Since MPI has become the standard for message-passin...
Self adapting numerical software (SANS) effort
- University of Tennessee, Computer Science Department
, 2006
Cited by 2 (2 self)
The challenge for the development of next generation software is the successful management of the complex computational environment while delivering to the scientist the full power of flexible compositions of the available algorithmic alternatives. Self-Adapting Numerical Software (SANS) systems are intended to meet this significant challenge. The process of arriving at an efficient numerical solution of problems in computational science involves numerous decisions by a numerical expert. Attempts to automate such decisions distinguish three levels:
• Algorithmic decision;
• Management of the parallel environment;
• Processor-specific tuning of kernels.
Additionally, at any of these levels we can decide to rearrange the user’s data. In this paper we look at a number of efforts at the University of Tennessee that are investigating these areas.

