
### Table 1. COMPARISON OF SYSTOLIC VS. NON-SYSTOLIC ARCHITECTURES

1975

"... In PAGE 6: ... Since the interpolation points which are not re-encoded are usually unreliable, the average multiplicity among these points is about half the average multiplicity m_avg. With this assumption, (6) can be reduced to (7). The minimum total interpolation latency, obtained by optimizing over the parallelization factors, is given in Table1; A, D, E, G are constants given in terms of the code and multiplicity parameters, and the optimum P_D is expressed in terms of them.... In PAGE 6: ... For the high-throughput, block-pipelined systolic architecture, the block pipelining period as a function of the block pipelining depth B is given in Table 1. We now evaluate the formulae given in Table1 for a [255, 239] RS soft-decoder operating with m_avg = 6. Since we use the re-encoding technique [5], the constraints to be satisfied by interpolation reduce from n·C(m_avg+1, 2) to around (n − k)·C(m_avg+1, 2) at a high channel SNR.... ..."

Cited by 2
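The constraint counts quoted in the snippet above can be sanity-checked with a short sketch. It assumes the standard cost model in which an interpolation point of multiplicity m contributes C(m+1, 2) linear constraints, and that re-encoding shrinks the effective point count from n to n − k at high SNR; the function name and this exact reduction are reader assumptions, not taken from the paper.

```python
from math import comb

def interpolation_constraints(n: int, k: int, m_avg: int, re_encoding: bool) -> int:
    # Each interpolation point of multiplicity m is assumed to contribute
    # C(m + 1, 2) linear constraints; re-encoding is assumed to cut the
    # effective number of points from n to n - k at high channel SNR.
    points = n - k if re_encoding else n
    return points * comb(m_avg + 1, 2)

# The [255, 239] RS soft-decoder with m_avg = 6 from the snippet:
print(interpolation_constraints(255, 239, 6, re_encoding=False))  # 5355
print(interpolation_constraints(255, 239, 6, re_encoding=True))   # 336
```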


### Table 4: Correlation Coefficients of DLC and LC, DLC and MC, and DLC and TC for each partition of collected programs.

"... In PAGE 16: ... The test uses the ranks of the values of variables rather than the values themselves. Table4 shows the result of the correlation test, where a significance value corresponding to each correlation coefficient value is 0.0001.... ..."

### Table 2. Speed in MFLOPs/cell of parallel C ← C + AA using the non-systolic method on a 4 × 4 AP1000 with n × n matrices (single precision)

"... In PAGE 12: ... This idea can easily be integrated into the parallel `non-systolic' multiply-add, thus amortizing communication costs. The performance of this partitioning method is given in Table2. As the maximum matrix size corresponds to 4 MB, results for a 4 × 4 AP1000 are given; however, the results for an 8 × 8 AP1000 appear identical for the corresponding matrix sizes.... In PAGE 14: ...2, and hence it is only appropriate for large matrices. Table2 gives the results of our implementation; in parentheses are the MFLOPs ratings if 2n³ arithmetic operations are assumed. The actual efficiency decreases primarily because the FPU can operate at no more than half speed during the matrix addition and subtraction operations.... In PAGE 20: ... This version ran 7% slower even for large matrices, indicating the need for hardware support for these operations. The results in Table 4 are for n = 1000 and should be compared with those in Table2 of 7). The results in Table 5 are for n almost as large as possible (constrained by the storage of 16 MB/cell), and should be compared with those in Table 3 of 7).... ..."
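The parenthesized-rating convention the snippet describes — quoting MFLOPs as if the multiply cost 2n³ arithmetic operations in total — can be sketched as below; the function and its example timing are illustrative, not figures from Table 2.

```python
def mflops_per_cell(n: int, seconds: float, cells: int) -> float:
    # Rate an n x n multiply-add as if it cost 2*n**3 arithmetic
    # operations in total, then average that rate over all cells.
    return 2.0 * n**3 / seconds / cells / 1e6

# Illustrative only: a 1000 x 1000 multiply taking 10 s on 16 cells
print(mflops_per_cell(1000, 10.0, 16))  # 12.5
```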

### Table 4: Speed in MFLOPs/cell of parallel C ← C + AA using the non-systolic method on a 4 × 4 AP1000 with N × N matrices


"... In PAGE 8: ... This can cause ring buffer overflow unless the AP1000 cells are synchronized periodically. The performance of this partitioning method is given in Table4. As the maximum matrix size corresponds to 4 MB, results for a 4 × 4 AP1000 are given; the results for an 8 × 8 appear ²While some authors have implemented distributed matrix multiplication without copying of the input matrices [14], this is not in general possible for the BLAS-3, as the input matrices can be the same (e.g.... ..."

### Table 1. Speed in MFLOPs/cell of parallel multiply-add methods on an 8 × 8 AP1000 with n × n matrices (single precision)

1992

"... In PAGE 4: ... A third variation is the `full-systolic' method (also known as Cannon's algorithm), in which both A and B sub-blocks are rotated at each step; this, however, has the overhead that both A and B must be initially `aligned'. Table1 indicates the relative efficiency of each method for single precision. The overhead of the initial matrix alignment of the `full-systolic' method makes it the slowest.... In PAGE 5: ...5%). Table1 indicates that for square matrices, there is little difference between the explicit and implicit methods, except for small matrices, which favour the implicit method. This is due to the high relative speed of the AP1000 communication routines, which makes the choice of communication patterns less critical.... In PAGE 5: ... 4.1 Effect of Communication Comparison of Table1 with the results of Section 3 shows that the effect of communication on performance is appreciable, at least for moderate matrix sizes. In the AP1000's xy communication routines, copying of matrices is avoided on message send; however, upon message receipt, messages are copied from a `ring buffer' to user space.... In PAGE 5: ... Consider an m × k global matrix A having an m′ × k′ (sub-)matrix A′ on a particular AP1000 cell, where m′ = m/N_y, k′ = k/N_x. Partition A′ into k0 × k0 sub-blocks denoted A′_ij, where 0 ≤ i ≤ ⌈m′/k0⌉, 0 ≤ j ≤ ⌈k′/k0⌉, and the optimal block size k0 = 128 (for single precision) is chosen from Table1. Let B be a k × n global matrix partitioned in a similar way.... In PAGE 14: ... The high-bandwidth hardware row/column broadcast capability of the AP1000, extremely useful in linear algebra applications, and the low latency of the send/receive routines are also significant. As shown in Table1, the speed of the former makes the use of `systolic' versions of linear algebra algorithms unnecessary. The large, direct-mapped cache, while requiring extra effort for full optimization, and the large cell memory are also very important features.... ..."

Cited by 4
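The per-cell partitioning described in the PAGE 5 excerpt above — an m′ × k′ sub-matrix split into k0 × k0 sub-blocks, with k0 = 128 reported optimal for single precision — can be sketched as follows; the function name and the example dimensions are reader assumptions for illustration.

```python
from math import ceil

K0 = 128  # optimal single-precision block size reported for the AP1000

def sub_block_grid(m_prime: int, k_prime: int, k0: int = K0):
    # Enumerate the (i, j) indices of the k0 x k0 sub-blocks A'_ij
    # tiling an m' x k' per-cell sub-matrix (edge blocks may be ragged).
    return [(i, j)
            for i in range(ceil(m_prime / k0))
            for j in range(ceil(k_prime / k0))]

# Illustrative: a 256 x 384 per-cell sub-matrix tiles into a 2 x 3 grid
print(len(sub_block_grid(256, 384)))  # 6
```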



### Table 1. Speed in MFLOPs/cell of parallel multiply-add methods on an 8 × 8 AP1000 with n × n matrices (single precision)

"... In PAGE 10: ... Table1 indicates the relative efficiency of each method for single precision. The overhead of the initial matrix alignment of the `full-systolic' method makes it the slowest.... In PAGE 10: ...5%). Table1 indicates that for square matrices, there is little difference between the explicit and implicit methods, except for small matrices, which favour the implicit method. This is due to the high relative speed of the AP1000 communication routines, which makes the choice of communication patterns less critical.... In PAGE 10: ... 4.1 Effect of Communication Comparison of Table1 with the results of Section 3 shows that the effect of communication on performance is appreciable, at least for moderate matrix sizes. In the AP1000's xy communication routines, copying of matrices is avoided on message send; however, upon message receipt, messages are copied from a `ring buffer' to user space.... In PAGE 12: ... Consider an m × k global matrix A having an m′ × k′ (sub-)matrix A′ on a particular AP1000 cell, where m′ = m/N_y, k′ = k/N_x. Partition A′ into k0 × k0 sub-blocks denoted A′_ij, where 0 ≤ i ≤ ⌈m′/k0⌉, 0 ≤ j ≤ ⌈k′/k0⌉, and the optimal block size k0 = 128 (for single precision) is chosen from Table1. Let B be a k × n global matrix partitioned in a similar way.... In PAGE 24: ... The high-bandwidth hardware row/column broadcast capability of the AP1000, extremely useful in linear algebra applications, and the low latency of the send/receive routines are also significant. As shown in Table1, the speed of the former makes the use of `systolic' versions of linear algebra algorithms unnecessary. The large, direct-mapped cache, while requiring extra effort for full optimization, and the large cell memory are also very important features.... ..."