Design of OpenMP Compiler for an SMP Cluster
- In EWOMP ’99
, 1999
Cited by 34 (3 self)
In this paper, we present the design of an OpenMP compiler for an SMP cluster. Although clusters of SMPs are expected to be a cost-effective parallel computing platform, both inter-node and intra-node parallelism must be exploited to achieve high performance. These two levels of structure complicate parallel programming. OpenMP is an emerging standard for parallel programming on shared-memory multiprocessors. We extend the OpenMP model to an SMP cluster by means of a "compiler-directed" software distributed shared memory system. Our OpenMP compiler instruments an OpenMP program by inserting remote communication primitives to keep memory consistent between different nodes, and provides a shared-memory view of the SMP cluster. Unlike multithreaded programs on conventional software DSMs, an OpenMP program is so well structured that the compiler can analyze the extent of each parallel region to optimize communication and synchronization. We report some preliminary results for OpenMP programs on a Pentium Pro based SMP cluster, COMPaS.
Investigating the performance of two programming models for clusters of SMP PCs
, 2000
Cited by 12 (2 self)
Multiprocessors and high-performance networks make it possible to build CLUsters of MultiProcessors (CLUMPs). Their main distinctive feature compared with traditional parallel computers is their hybrid memory model (message passing between the nodes and shared memory inside the nodes). We evaluate the performance of a cluster of 2-way SMP PCs connected by a Myrinet network for the NAS benchmarks under two programming models: a Single Memory Model based on the MPICH-PM/CLUMP library of the RWCP, and a Hybrid Memory Model using MPICH-PM and OpenMP. We compare the speed-up of 2-way SMP configurations versus single-CPU configurations for each model. We demonstrate that the better model depends on the features of the application. In particular, we detail the speed-up results using breakdowns of the benchmark execution times and measurements of hardware counters. Then, we show that these two models give PC-based CLUMPs performance close to that of scalable high-end multicomputers, up to large configurations (36 nodes).
Performance characteristics of a network of commodity multiprocessors for the NAS benchmarks using a hybrid memory model
, 1998
Cited by 10 (3 self)
The availability of multiprocessors and high-performance networks offers an opportunity to construct CLUsters of MultiProcessors (CLUMPs) and use them as parallel computing platforms. The distinctive feature of CLUMPs compared with traditional parallel computers is their hybrid memory model (message passing between the nodes and shared memory inside the nodes). In this paper, we investigate the performance characteristics of a CLUMP using a programming model close to the hardware memory model. The programming model is based on MPI for the message-passing part and OpenMP for the shared-memory part. The paper makes three contributions: a) the performance potential of a biprocessor PC, both as a single node running shared-memory parallel programs and as the processing node of a parallel platform running MPI programs; b) performance measurements of a cluster of biprocessor PCs for the NAS 2.3 parallel benchmarks using the hybrid memory model; and c) some explanations of the performance results, obtained by examining a breakdown of the benchmark execution times and by showing the existence of a theoretical limit on the intra-multiprocessor speedup.
Efficient Communication Using Message Prediction for Cluster of Multiprocessors
- Proceedings of the CANPC’00, Fourth Workshop on Communication, Architecture, and Applications for Network-Based Parallel Computing, held in conjunction with HPCA-6
, 1999
Cited by 5 (1 self)
With the increasing uniprocessor and SMP computation power available today, interprocessor communication has become an important factor limiting the performance of clusters of workstations. Many factors, including communication hardware overhead, communication software overhead, and user environment overhead (multithreading, multiuser), affect the performance of the communication subsystems in such systems. A significant portion of the software communication overhead is due to repeated message copying. Ideally, it is desirable to have a true zero-copy protocol in which the message is moved directly from the send buffer in its user space to the receive buffer at the destination without any intermediate buffering. However, because message-passing applications at the send side do not know the final receive buffer addresses, early-arriving messages have to be buffered in a temporary area. In this paper, we show that there is a message reception communication locality in...
Task Pool Teams: A Hybrid Programming Environment for Irregular Algorithms on SMP Clusters
- Concurrency and Computation: Practice and Experience
, 2006
Cited by 2 (0 self)
Clusters of SMPs (symmetric multiprocessors) are popular platforms for parallel programming since they provide large computational power for a reasonable price. For irregular application programs with dynamically changing computation and data access behavior, a flexible programming model is needed to achieve efficiency. In this paper, we propose Task Pool Teams as a hybrid parallel programming environment for realizing irregular algorithms on clusters of SMPs. Task Pool Teams combine task pools on single cluster nodes through an explicit message-passing layer. They offer load balancing together with multi-threaded, asynchronous communication. Appropriate communication protocols and task pool implementations are provided and are accessible through an easy-to-use application programming interface. As application examples, we present a branch & bound algorithm and the hierarchical radiosity algorithm.
GENERIC PROGRAMMING FOR HIGH-PERFORMANCE SCIENTIFIC COMPUTING
, 2002
By Lie-Quan Lee. Generic programming is an important paradigm for software development, with an emphasis on reusability and performance, qualities that would seemingly make this paradigm especially suited for application to scientific computing. We apply generic programming to the development of a message passing framework (the Generic Message Passing library, GMP) for parallel computing in hybrid execution architectures (i.e., those having both shared and distributed memory). Although GMP supports both shared-memory and distributed-memory execution, it explicitly separates its programming and execution models, presenting a uniform message-based programming interface to enable source-code portability of parallel programs. At the same time, the implementation of GMP fully exploits the architectural characteristics of its execution target for maximum run-time performance. GMP is specifically designed to integrate seamlessly with modern generic C++ libraries such as the C++ Standard Library. C++ objects with complex data
Network Interface Active Messages for Low Overhead Communication on SMP PC Clusters
- In Proc. of HPCN'99
, 1999
NICAM is a communication layer for SMP PC clusters connected via Myrinet, designed to reduce overhead and latency by directly utilizing the microprocessor on the network interface. It adopts remote memory operations to eliminate much of the overhead found in message passing. NICAM employs an Active Messages framework for flexibility in programming the network interface, and this flexibility compensates for the large latency resulting from the relatively slow microprocessor. Running message handlers directly on the network interface reduces overhead by freeing the main processors from polling for incoming messages. The handlers also make synchronization faster by avoiding costly interactions between the main processors and the network interface. In addition, this implementation can completely hide the latency of barriers in data-parallel programs, because handlers running in the background of the main processors allow repositioning of barriers to any place where...
Understanding performance of SMP clusters running MPI programs
- Future Generation Computer Systems
, 2001
CLUsters of MultiProcessors (CLUMPs) have a hybrid memory model, with message passing between nodes and shared memory inside nodes. We examine the performance of Myrinet clusters of SMP PCs when using a Single Memory Model (SMM) based on the MPICH-PM/CLUMP library of the RWCP, which can directly run MPI programs written for a cluster of uniprocessors. The specificities of the communication patterns under the SMM approach are detailed. PC clusters with 2-way and 4-way nodes are considered and compared.
Parallelization of Sparse Cholesky Factorization on an SMP Cluster
- In Proc. HPCN Europe 1999, LNCS 1593
, 1999
In this paper, we present parallel implementations of the sparse Cholesky factorization kernel from the SPLASH-2 programs to evaluate the performance of a Pentium Pro based SMP cluster. Solaris threads and remote memory operations are used for intranode parallelism and internode communication, respectively. Sparse Cholesky factorization is a typical irregular application with a high communication-to-computation ratio and no global synchronization between steps. We parallelized it efficiently using asynchronous message handling instead of lock-based mutual exclusion between nodes, because synchronization between nodes reduces performance significantly. We also found that the mapping of processes to processors on an SMP cluster affects performance, especially when the communication latency cannot be hidden.
Push-Pull Messaging: A High-Performance Communication Mechanism for Commodity SMP Clusters
- Proc. of International Conference on Parallel Processing
, 1999
Push-Pull Messaging is a novel messaging mechanism for high-speed interprocess communication in a cluster of symmetric multiprocessor (SMP) machines. This messaging mechanism exploits the parallelism in SMP nodes by allowing the communication stages of a messaging event to execute on different processors to achieve maximum performance. Several optimization techniques were implemented along with Push-Pull Messaging to further improve its performance. Cross-space Zero Buffer provides a unified buffer management mechanism to achieve copy-less communication for data transfer among processes within an SMP node. Address Translation Overhead Masking removes the address translation overhead from the critical path of internode communication. Push-and-Acknowledge Overlapping overlaps the push and acknowledge phases to hide the acknowledge latency. Push-Pull Messaging utilizes the system resources effectively. It has been implemented to support high-speed communication for connecting quad ...

