### Table 2: Parallel elapsed time for each linear algebra kernel involved in the numerical scheme.


"... In PAGE 14: ... We notice that this property is not necessary true for sparse linear systems, where the cost of the incremental preconditioner might dominate even for small values of p so that the preconditioner might not be e ective if it does not signi cantly reduce the number of iterations. In Table2 , we report on the parallel elapsed time required by each linear algebra kernel involved in the solution scheme. For MISLRU the application time corresponds the time to apply the preconditioner for the last linear system; that is, when the preconditioner is the most expensive.... ..."


### Table 6: Speed in MFLOPs/cell for BA⁻¹B for n × n matrices on the AP1000 (single precision)

"... and blocking (Sections 5.1 and 5.3) are also applicable to the implementation of other linear algebra applications, on the AP1000 and on similar architectures. The LINPACK Benchmark and BLAS-3 results show that the AP1000 is a good machine for numerical linear algebra, and that on moderate to large problems we can consistently achieve close to 80% of its theoretical peak performance for the former, and 85-90% for the latter. They signify that the AP1000 architecture is well balanced on all levels with respect to floating-point computation. The main reason for this is the high ratio of communication speed to floating-point speed compared to machines such as the Intel Delta and nCUBE. The high-bandwidth hardware row/column broadcast capability of the AP1000, extremely useful in linear algebra applications, and the low latency of the send/receive routines are also significant. As shown in ..."

1992

"... In PAGE 11: ...1. Table6 gives results for this computation for single precision, with ! = 4Nyqn=(2Ny). For the unblocked algorithm, the performance does not even approach that of Rank1Update(), due to communication overheads (for small n) and the fact that rank-1 update is a Level 2 operation and hence makes poor use of the cache (for large n).... In PAGE 11: ...2 must also be employed. The performance for moderate sized matrices of this `super-blocked apos; scheme is given in Table6 ; for larger matrices, performance steadily improves up to 7.3 MFLOPs for n=Nx = 1024.... ..."

Cited by 4

### Table 1: Speedups of automatically restructured linear algebra routines on Config-

1991

"... In PAGE 12: ... The initial results were encouraging. Table1 shows the speedup results for a set of linear algebra routines. The #0Crst routine is a conjugate gradient algorithm #5B23#5D; the other routines are from Numerical Recipes #5B27#5D.... ..."

Cited by 22

### Table 1: Elapsed time (sec) to perform BLAS-2 and LAPACK routines on various platforms as the matrix size n is varied (CRAY XD1, AMD Opteron processor; columns: n | DGEMV, SGEMV, Ratio | DPOTRF, SPOTRF, Ratio | DPOTRS, SPOTRS, Ratio)

"... In PAGE 2: ... This class of chip includes for instance the IBM PowerPC, the Power MAC G5, the AMD Opteron, the CELL, and the Intel Pentium. Table1 reports the performance of basic dense kernels involved in numerical linear algebra: the GEMV BLAS-2 matrix vector product and the POTRF/ POTRS LAPACK Cholesky factor- ization and backward/forward substitution. It can be seen that single precision calculation generally outperforms double precision.... ..."

### Table 3 Linear algebra and max-plus algebra

"... In PAGE 23: ... (1992). Table3 illustrates some similarities between linear algebra and max-plus algebra. Here, we will not give an extensive treatment of the algebraic and system theoretic properties of the max-plus algebra but restrict to the modelling issues.... ..."

### TABLE 1. Linear Algebra Kernel Descriptions

"... In PAGE 3: ... Sor improves more because it has a higher percentage of references removed. BLU Block LU BLUP Block LUP Chol Cholesky Decomposition Afold Adjoint Convolution Fold Convolution Seval Spline Evaluation Sor Successive Over Relaxation Linear Algebra Kernel Description TABLE1 . Linear Algebra Kernel Descriptions Linear Algebra Kernel Normalized Execution Time 92 93 61 69 90 95 85 90 VM MMk MMi LU LUP BLUBLUP Chol Afold Fold Seval Sor Mean 0 20 40 60 80 100 Original Optimized... ..."


### Table 3. Communication of linear algebra kernels

1995

"... In PAGE 5: ... Table 2 gives an overview of the data repre- sentation and layout for the dominating computations of the linear algebra kernels. Table3 shows the benchmarks clas- sified by the communication operations that they use, along with their associated array ranks. Finally, Table 4 demon- strates the computation (FLOP count) to communication ratio in the main loop of each linear algebra benchmark, memory usage for the implemented data types, as well as... ..."

Cited by 2