### Table 1 Execution time in seconds of the initial and the tuned versions

"... In PAGE 5: ... Finally, these codes have been used as a platform for the implementation of the uniprocessor version of Level 3 BLAS on the BBN TC2000 (see next Section). We show in Table1 the MFlops rates of the parallel matrix-matrix multiplication, and in Table 2 the performance of the LU factorization (we use a blocked code similar to the LAPACK one) on the ALLIANT FX/80, the CRAY-2, and the IBM 3090-600J obtained using our parallel version of the Level 3 BLAS. Note that our parallel Level 3 BLAS uses the serial manufacturer-supplied versions of GEMM on all the computers.... In PAGE 6: ... This package is available without payment and will be sent to anyone who is interested. We show in Table1 the performance of the single and double precision GEMM on di erent numbers of processors. Table 2 shows the performance of the LAPACK codes corresponding to the blocked LU factorization (GETRF, right-looking variant), and the blocked Cholesky factorization (POTRF, top-looking variant).... In PAGE 8: ... The second part concerned the performance we obtained with tuning and parallelizing these codes, and by introducing library kernels. We give in Table1 a brief summary of the results we have obtained: One of the most important points to mention here is the great impact of the use of basic linear algebra kernels (Level 3 BLAS) and the LAPACK library. The conclusion involves recommendations for a methodology for both porting and developing codes on parallel computers, performance analysis of the target computers, and some comments relating to the numerical algorithms encountered.... In PAGE 12: ... Because of the depth rst search order, the contribution blocks required to build a new frontal matrix are always at the top of the stack. The minimum size of the LU area (see column 5 of Table1 ) is computed during during the symbolic factorization step. 
The comparison between columns 4 and 5 of Table 1 shows that the size of the LU area is only slightly larger than the space required to store the LU factors.... In PAGE 12: ... The minimum size of the LU area (see column 5 of Table 1) is computed during during the symbolic factorization step. The comparison between columns 4 and 5 of Table1 shows that the size of the LU area is only slightly larger than the space required to store the LU factors. Frontal matrices are stored in a part of the global working space that will be referred to as the additional space.... In PAGE 12: ... In a uniprocessor environment, only one active frontal matrix need be stored at a time. Therefore, the minimum real space (see column 7 of Table1 ) to run the numerical factorization is the sum of the LU area, the space to store the largest frontal matrix and the space to store the original matrix. Matrix Order Nb of nonzeros in Min.... In PAGE 13: ... In this case the size of the LU area can be increased using a user-selectable parameter. On our largest matrix (BBMAT), by increasing the space required to run the factorization (see column 7 in Table1 ) by less than 15 percent from the minimum, we could handle the ll-in due to numerical pivoting and run e ciently in a multiprocessor environment. We reached 1149 M ops during numerical factorization with a speed-up of 4.... In PAGE 14: ...ack after computation. Interleaving and cachability are also used for all shared data. Note that, to prevent cache inconsistency problems, cache ush instructions must be inserted in the code. We show, in Table1 , timings obtained for the numerical factorization of a medium- size (3948 3948) sparse matrix from the Harwell-Boeing set [3]. The minimum degree ordering is used during analysis.... In PAGE 14: ... -in rows (1) we exploit only parallelism from the tree; -in rows (2) we combine the two levels of parallelism. As expected, we rst notice, in Table1 , that version 1 is much faster than version 2... 
In PAGE 15: ... Results obtained on version 3 clearly illustrate the gain coming from the modifications of the code both in terms of speed-up and performance. Furthermore, when only parallelism from the elimination tree is used (see rows (1) in Table 1) all frontal matrices can be allocated in the private area of memory. Most operations are then performed from the private memory and we obtain speedups comparable to those obtained on shared memory computers with the same number of processors [1].... In PAGE 15: ... Most operations are then performed from the private memory and we obtain speedups comparable to those obtained on shared memory computers with the same number of processors [1]. We finally notice, in Table 1, that although the second level of parallelism nicely supplements that from the elimination tree it does not provide all the parallelism that could be expected [1]. The second level of parallelism can even introduce a small slow-down on a small number of processors as shown in column 3 of Table 1.... In PAGE 15: ... We finally notice, in Table 1, that although the second level of parallelism nicely supplements that from the elimination tree it does not provide all the parallelism that could be expected [1]. The second level of parallelism can even introduce a small slow-down on a small number of processors as shown in column 3 of Table 1. The main reason is that frontal matrices must be allocated in the shared area when the second level of parallelism is enabled.... In PAGE 17: ...5 28.2 Table 1: Results in Megaflops on parallel computers. In Table 1, it can be seen that the performance of the program on the Alliant FX/80 in double precision is better than the performance of the single precision version.... In PAGE 17: ...2 Table 1: Results in Megaflops on parallel computers. In Table 1, it can be seen that the performance of the program on the Alliant FX/80 in double precision is better than the performance of the single precision version.
The reason for this is that the single precision mathematical library routines are less optimized.... In PAGE 18: ... block diagonal) preconditioner appears to be very suitable and is superior to the Arnoldi-Chebyshev method. Table 1 shows the results of the computation on an Alliant FX/80 of the eight eigenpairs with largest real parts of a random sparse matrix of order 1000. The nonzero off-diagonal and the full diagonal entries are in the range [-1,+1] and [0,20] respectively.... In PAGE 19: ... A comparison with the block preconditioned conjugate gradient is presently being investigated. In Table 1, we compare three partitioning strategies of the number of right-hand sides for solving the system of equations M⁻¹AX = M⁻¹B, where A is the matrix BCSSTK27 from the Harwell-Boeing collection, B is a rectangular matrix with 16 columns, and M is the ILU(0) preconditioner. Method 1 2 3 1 block.... In PAGE 26: ...111 2000 lapack code 0.559 Table 1: Results on matrices of bandwidth 9.... In PAGE 30: ... We call "global approach" the use of a direct solver on the entire linear system at each outer iteration, and we want to compare it with the use of our mixed solver, in the case of a simple splitting into 2 subdomains. We show the timings (in seconds) in Table 1 on 1 processor and in Table 2 on 2 processors, for the following operations: construction & assembly: Construction and Assembly, 14% of the elapsed time; factorization: Local Factorization (Dirichlet + Neumann), 23%; substitution/pcg: Iterations of the PCG on the Schur complement, 55%; total time. The same code is used for the global direct solver and the local direct solvers, which takes advantage of the block-tridiagonal structure due to the privileged direction. Moreover, there has been no special effort for parallelizing the mono-domain approach.... ..."
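
The depth-first-search argument in the PAGE 12 excerpt above (the contribution blocks needed to assemble a new frontal matrix are always at the top of the stack) can be illustrated with a small sketch. This is not the authors' code: the elimination tree, the node names, and the string-tagged "contribution blocks" below are invented for illustration; only the stack discipline itself comes from the excerpt.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Illustrative sketch (not the paper's code) of the stack discipline the
// excerpt describes: with a depth-first (postorder) traversal of the
// elimination tree, the contribution blocks needed to assemble a node's
// frontal matrix are always the topmost entries of the stack.
public class FrontalStack {
    static class Node {
        final String name;
        final List<Node> children;
        Node(String name, List<Node> children) { this.name = name; this.children = children; }
    }

    // Returns the elimination (postorder) sequence; throws if the stack
    // discipline is ever violated, which the DFS order guarantees it is not.
    static String factorize(Node root) {
        Deque<String> stack = new ArrayDeque<>();
        StringBuilder order = new StringBuilder();
        visit(root, stack, order);
        return order.toString();
    }

    private static void visit(Node n, Deque<String> stack, StringBuilder order) {
        for (Node c : n.children) visit(c, stack, order);  // depth-first: children first
        // Assemble the frontal matrix of n: the children's contribution
        // blocks must be exactly the topmost stack entries, in reverse order.
        for (int i = n.children.size() - 1; i >= 0; i--) {
            String popped = stack.pop();
            if (!popped.equals("cb(" + n.children.get(i).name + ")"))
                throw new IllegalStateException("stack discipline violated at " + n.name);
        }
        order.append(n.name);              // eliminate the fully summed variables of n
        stack.push("cb(" + n.name + ")");  // push n's own contribution block for its parent
    }

    // Small sample tree: r has children c and d; c has children a and b.
    static String demo() {
        Node a = new Node("a", List.of());
        Node b = new Node("b", List.of());
        Node c = new Node("c", List.of(a, b));
        Node d = new Node("d", List.of());
        return factorize(new Node("r", List.of(c, d)));
    }

    public static void main(String[] args) {
        System.out.println(demo());  // prints the postorder elimination sequence: abcdr
    }
}
```

The check inside `visit` is the point of the sketch: because every child's subtree is fully processed (and its descendants' blocks consumed) before the parent assembles, the pops always succeed without searching the stack.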

### Table 2. Tuned performance for a parallel mesh application

2003

"... In PAGE 5: ... public void meshMethod() { preProcess(); while (notDone()) { (* includes prepare() before barrier *) operate(); } postProcess(); } Figure 6. The modified execution loop for mesh partitions After modifying the generated code, we re-ran the experiments and the results are shown in Table2 . This true anecdote is a good illustration of how performance tuning can work when coupled with the readable parallel structure code that is generated by CO2P3S.... ..."

Cited by 19
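
The quoted meshMethod() skeleton can be fleshed out into a runnable sketch. This is an assumption-laden illustration, not CO2P3S output: the CyclicBarrier, the worker/counter plumbing, and all method bodies are invented; only the preProcess/notDone/operate/postProcess structure and the "prepare() before the barrier" tuning come from the excerpt.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical expansion of the meshMethod() loop quoted above. Each mesh
// partition runs on its own thread; prepare() is called *before* the
// barrier (the tuning the excerpt describes), so every partition has
// published its boundary data before any partition calls operate().
public class MeshWorker implements Runnable {
    static final AtomicInteger operateCalls = new AtomicInteger();

    private final CyclicBarrier barrier;
    private int iterationsLeft;

    MeshWorker(CyclicBarrier barrier, int iterations) {
        this.barrier = barrier;
        this.iterationsLeft = iterations;
    }

    private void preProcess()  { /* partition-local setup */ }
    private boolean notDone()  { return iterationsLeft-- > 0; }
    private void prepare()     { /* publish boundary values to neighbours */ }
    private void operate()     { operateCalls.incrementAndGet(); /* update interior */ }
    private void postProcess() { /* collect partition results */ }

    public void run() {
        preProcess();
        while (notDone()) {
            prepare();           // moved before the barrier in the tuned version
            try {
                barrier.await(); // all partitions synchronize here each iteration
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
            operate();
        }
        postProcess();
    }

    // Run `partitions` workers for `iterations` steps; returns total operate() calls.
    static int runMesh(int partitions, int iterations) throws InterruptedException {
        operateCalls.set(0);
        CyclicBarrier barrier = new CyclicBarrier(partitions);
        Thread[] workers = new Thread[partitions];
        for (int i = 0; i < partitions; i++) {
            workers[i] = new Thread(new MeshWorker(barrier, iterations));
            workers[i].start();
        }
        for (Thread w : workers) w.join();
        return operateCalls.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runMesh(4, 10));  // 4 partitions x 10 steps = 40 operate() calls
    }
}
```

Because every partition runs the same number of iterations, the barrier is reached by all parties each round and the loop cannot deadlock; that invariant is what makes hoisting prepare() above the barrier safe.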

### Table 3: Statistics from the execution of the NAS benchmarks with different page placement schemes and our page migration engine.

"... In PAGE 22: ...y as much as a factor of 8. There are also cases (e.g. SP and MG on 32 processors) in which the worst-case page placement combined with our page migration engine performs considerably better than first-touch. Table3 provides some additional statistics collected by manually inserting event counters in the runtime system. The second, third and fourth columns of the table report the slowdown of the benchmarks in the last 75% of the iterations of the main parallel computation for round-robin, random and worst-case page placement on 16 processors6.... In PAGE 23: ... as an iterative parallel computation evolves in time. The fifth, sixth and seventh column of Table3 show the fraction of page migrations performed by our page migration engine in the first iteration of the parallel computation. In three out of five cases (CG, FT and MG), all page migrations are performed in the first iteration.... ..."

### Table 3. Performance of microbenchmark and benchmark programs. Each program executed 100 iterations, and the average cost is shown. Times are in seconds. Results where parallel performance is better than sequential performance are shown in boldface.

1999

"... In PAGE 6: ... Performance is similar for synchronized blocks. The top portion of Table3 presents the performance of these three pro- grams. Parallel execution of the forall statement introduced an overhead of 10% to 40%, because of object creation and synchronized methods introduced by the transformation.... In PAGE 7: ... Our program uses both forall statements and aggregate operations. The bottom of Table3 presents the results for these two algorithms. EM3d executed 100 iterations on a graph with 4000 H nodes and 4000 E nodes, where each node had 30 neighbors.... ..."

Cited by 8
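
A minimal sketch of the kind of transformation the excerpt alludes to: a forall statement compiled into worker threads that draw iteration indices from a shared object. The names below are invented and an AtomicInteger stands in for the synchronized methods the paper mentions; creating the shared counter and per-call objects is the sort of overhead the measurements attribute to the transformation.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;

// Hypothetical sketch of compiling a forall statement into threads.
// Iterations must be independent; each index is handed to exactly one
// worker via the shared atomic counter.
public class Forall {
    static void forall(int n, int threads, IntConsumer body) throws InterruptedException {
        AtomicInteger next = new AtomicInteger(0);   // shared iteration counter
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                int i;
                while ((i = next.getAndIncrement()) < n) {
                    body.accept(i);                  // run one independent iteration
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();           // join publishes all writes
    }

    public static void main(String[] args) throws InterruptedException {
        // Roughly the shape of the vector microbenchmarks: touch every
        // element of a large vector once, in parallel.
        int[] v = new int[1_000_000];
        forall(v.length, 4, i -> v[i] = 1);
        long sum = 0;
        for (int x : v) sum += x;
        System.out.println(sum);  // every index visited exactly once: 1000000
    }
}
```

The 10-40% overhead the excerpt reports is plausible for exactly this shape when the per-iteration body is tiny, since every index costs one contended counter operation.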

### Table 1. Relative performance results in parallel execution times (in seconds) of various parallel algorithms for the form-factor computation phase

"... In PAGE 14: ... These algorithms were tested on various room scenes containing various objects discretized into varying numbers of patches ranging from 496 to 2600. Table1 illustrates the relative performance results of various parallel algorithms for the form-factor computation phase. The execution times of the algorithms are illustrated in Fig.... In PAGE 14: ...nstances (e.g., 15 out of 19) than the random assignment in patch circulation due to the de- crease in communication overhead. As seen in Table1 and in Fig. 6a, the demand-driven scheme always performs better than the static assignment scheme due to a better load balance.... ..."

### Table 5. Average Performance of Genetically Tuned Heuristic

2005

"... In PAGE 8: ... 6.6 Summary of Results Table5 shows the average reductions for running times and total times for both our benchmark suites. Since our genetic al- gorithm was tuned over SPECjvm98, as to be expected we always outperform the default heuristic giving a 6% and 17% reduction in total execution time on the PowerPC and Intel platform respec- tively.... ..."

Cited by 5
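
The genetic tuning the excerpt summarizes can be sketched generically. Everything below is an illustrative assumption, not the paper's setup: the parameter encoding, population size, and GA settings are invented, and a synthetic cost function with a known optimum stands in for "run the benchmark suite with these heuristic parameters and measure time".

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

// Generic sketch of genetically tuning a heuristic's numeric parameters:
// a population of parameter vectors evolves under elitist selection,
// uniform crossover, and small Gaussian mutation; lower cost is better.
public class GeneticTuner {
    static final Random rng = new Random(42);      // fixed seed for reproducibility
    static final int GENES = 4, POP = 20, GENERATIONS = 60;

    // Stand-in fitness: synthetic cost with optimum at all parameters = 0.5.
    // In the real setting this would be measured benchmark running time.
    static double cost(double[] params) {
        double c = 0;
        for (double p : params) c += (p - 0.5) * (p - 0.5);
        return c;
    }

    static double[] tune() {
        double[][] pop = new double[POP][GENES];
        for (double[] ind : pop)
            for (int g = 0; g < GENES; g++) ind[g] = rng.nextDouble();

        for (int gen = 0; gen < GENERATIONS; gen++) {
            Arrays.sort(pop, Comparator.comparingDouble(GeneticTuner::cost));
            // Elitism: keep the best half, refill the rest from parent pairs.
            for (int i = POP / 2; i < POP; i++) {
                double[] a = pop[rng.nextInt(POP / 2)];
                double[] b = pop[rng.nextInt(POP / 2)];
                for (int g = 0; g < GENES; g++) {
                    pop[i][g] = rng.nextBoolean() ? a[g] : b[g];   // uniform crossover
                    if (rng.nextDouble() < 0.1)
                        pop[i][g] += rng.nextGaussian() * 0.05;    // small mutation
                }
            }
        }
        Arrays.sort(pop, Comparator.comparingDouble(GeneticTuner::cost));
        return pop[0];
    }

    public static void main(String[] args) {
        double[] best = tune();
        System.out.printf("best params %s, cost %.4f%n", Arrays.toString(best), cost(best));
    }
}
```

Because the best half of the population is always retained, the best cost never increases across generations; the interesting engineering in the paper's setting is that each fitness evaluation is an expensive benchmark run rather than a cheap formula.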

### Table 2. Statistics from the execution of the NAS benchmarks with different page placement schemes and our page migration engine.

2000

"... In PAGE 11: ... The slow- downs of the same page placement schemes with page mi- gration enabled in the IRIX kernel were 16%, 17% and 61% respectively. Table2 provides some additional statistics which were collected by manually inserting event counters in the run- time system. The second, third and fourth columns of the table report the slowdown of the benchmarks in the last 75% of the iterations of the main parallel computation for round-robin, random and worst-case page placement re- spectively4.... In PAGE 11: ....7%, while in most cases it was less than 1%. The results indicate that the page migration engine achieves robust and stable memory performance as the iterative computations evolve. The fifth, sixth and seventh column of Table2 show the fraction of page migrations performed by our page migra- tion engine after the first iteration of the parallel computa- tion. In three out of five cases, CG, FT and MG, all page migrations were performed after the first iteration of the program.... ..."

Cited by 12
