### Table 1. Parallelization of the major matrix operations that appear in the ARE solver.

"... In PAGE 5: ... Depending on the structure of the state matrix pair (A, E), either the banded linear system solver in ScaLAPACK or the sparse linear system solvers in MUMPS are invoked. Table 1 lists the specific routines employed for each of the major operations in the algorithm. In the table we do not include the operations required to compute the shift parameters for the LR-ADI iteration, as that part of the algorithm was not described. ... ..."

### Table 4. Performance results for the sparse conjugate gradient solver, using a data set from an unstructured-grid solver, concerning the parallelization of a pressure-correction method on unstructured grids

2001

"... In PAGE 18: ... A sparse conjugate gradient solver using our compressed schemes and intrinsic libraries is employed to solve the sparse linear systems. The performance results on 16-node IBM SP2 machines are shown in Table 4. All the times measured are in seconds. ... ..."

Cited by 3
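The excerpt above mentions a sparse conjugate gradient solver built on compressed storage schemes. The paper's exact scheme is not shown here, but a minimal sketch of compressed sparse row (CSR) storage and its matrix-vector product, the kernel such a solver iterates, would look like this; the small matrix is invented for illustration.

```python
# Sketch of compressed sparse row (CSR) storage and the matrix-vector
# product at the heart of a conjugate gradient solver. The matrix below
# is illustrative; real codes build these arrays from mesh connectivity.

def csr_matvec(vals, cols, rowptr, x):
    """y = A @ x for A stored in CSR form."""
    n = len(rowptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for k in range(rowptr[i], rowptr[i + 1]):
            s += vals[k] * x[cols[k]]
        y[i] = s
    return y

# A = [[4, 1, 0],
#      [1, 4, 1],
#      [0, 1, 4]]
vals   = [4.0, 1.0, 1.0, 4.0, 1.0, 1.0, 4.0]
cols   = [0, 1, 0, 1, 2, 1, 2]
rowptr = [0, 2, 5, 7]

print(csr_matvec(vals, cols, rowptr, [1.0, 1.0, 1.0]))  # → [5.0, 6.0, 5.0]
```

Only the nonzeros and their column indices are stored, which is what makes both the memory footprint and the matvec cost proportional to nnz(A) rather than N².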

### Table 6. Performance comparison on the linear equation solver.

"... In PAGE 9: ... CRAY's SCILIB provides a routine for matrix inversion, which runs at 300 to 400 MFLOPS. The times reported in Table 6 for the X-MP, however, are optimal. The linear equation solver used on the X-MP uses an out-of-core Gaussian elimination algorithm based on block matrix-matrix products [Grimes 1988]. ... In PAGE 9: ... The linear equation solver was executed five times on both machines. The remarkable result in Table 6 is not so much the actual performance as the considerable performance variation on the CRAY-2. While all the routines varied about 15 to 35% in performance depending on system load, a 70% difference was noted in the linear equation solver. ... ..."
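The X-MP solver above reorganizes Gaussian elimination into block matrix-matrix products so that panels of the matrix can stream from disk. As a minimal point of reference, here is the plain in-core elimination with partial pivoting that such blocked out-of-core codes restructure; this is a generic textbook sketch in pure Python, not the solver from [Grimes 1988].

```python
# In-core Gaussian elimination with partial pivoting; blocked out-of-core
# variants reorder the same arithmetic into matrix-matrix products.

def gauss_solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    A = [row[:] for row in A]   # work on copies
    b = b[:]
    for k in range(n):
        # bring the largest remaining entry in column k to the diagonal
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        # eliminate column k below the diagonal
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    # back substitution
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

print(gauss_solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]))  # → [0.8, 1.4]
```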

### Table 1: Performance of the Linear Equation Solver Application

"... In PAGE 17: ... To simulate moderate load, idle periods are between four and eight minutes and busy periods are from 20 to 30 minutes. The results of performance tests run using the linear equation solver application to solve a system of 1024 equations are shown in Table 1. The table shows the degradation that results when just two other processes per node are competing for computing resources. ... ..."

### Table 7: Comparison of Tafti's CGS code and HPFStab for a 64³ grid-size calculation. Speedups are calculated based on the measured CPU time for one processor.

Communication issues involved with distributed-memory matrix-vector multiplication have been addressed in [8]. Lewis and van de Geijn describe the data distribution and communication patterns of five general implementations of CG solvers and have applied these schemes to the problem of solving sparse symmetric linear systems. These realizations demonstrate that the cost of communication can be overcome to a much larger extent than is often assumed. Various types of preconditioners have also been implemented in an attempt to boost the efficiency of CG implementations on both parallel and non-parallel systems [14, 5, 22]. Hayami and Harada implement a diagonal-scaling preconditioned CG method similar to one contained in CMStab, and demonstrate that the method requires no interprocessor communication and is completely parallelizable. An efficient parallel CG method based on a multigrid [16] preconditioning algorithm has been implemented on a Fujitsu multicomputer [22]. This method has high parallelism and fast convergence; it is more than 10 times faster than the scaled CG method on the Fujitsu AP1000.

"... In PAGE 20: ... optimized for a 4 GByte, 16-processor SGI Power Challenge [20]. Their results for up to 8 processors are given in Table 7. Not only do their results show faster execution times than HPFStab, their speedups are more strongly superlinear as well. ... ..."
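The diagonal-scaling preconditioned CG method mentioned in the survey paragraph above can be sketched compactly: with M = diag(A), applying M⁻¹ is an elementwise divide, which is why the method needs no interprocessor communication. This is a generic textbook sketch in pure Python on a small dense matrix, not code from Hayami and Harada, CMStab, or HPFStab.

```python
# Jacobi (diagonal-scaling) preconditioned conjugate gradient for a
# symmetric positive-definite system. Dense row-lists keep the example
# short; a real parallel code would use a distributed sparse format.

def pcg_jacobi(A, b, tol=1e-10, maxit=200):
    n = len(b)
    x = [0.0] * n
    r = b[:]                                  # r = b - A x0 with x0 = 0
    z = [r[i] / A[i][i] for i in range(n)]    # z = M^{-1} r, M = diag(A)
    p = z[:]
    rz = sum(r[i] * z[i] for i in range(n))
    for _ in range(maxit):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rz / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            break
        z = [r[i] / A[i][i] for i in range(n)]     # no communication needed
        rz_new = sum(r[i] * z[i] for i in range(n))
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x

A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0]
x = pcg_jacobi(A, b)   # residual ||b - A x|| should be tiny
```

Every step except the matvec and the two dot products is purely local per row, which is the property the survey paragraph highlights.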

### Table 1. Characteristics of the sample matrices. Sparsity is measured as the average number of nonzeros per row (i.e., nnz(A)/N), and the fill ratio is the number of nonzeros in L+U divided by that in A. Here, MeTiS is used to reorder the equations to reduce fill.

"... In PAGE 2: ... .3. SuperLU efficiency with these applications: SuperLU [6] is a leading scalable solver for sparse linear systems using direct methods, whose development is mainly funded through the TOPS SciDAC project (led by David Keyes) [7]. Table 1 shows the characteristics of a few typical matrices taken from these simulation codes. Figure 1 shows the parallel runtime of the two important phases of SuperLU: factorization and triangular solution. ... ..."
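The two figures of merit defined in the caption are simple ratios once the nonzero counts are known: sparsity = nnz(A)/N (average nonzeros per row) and fill ratio = nnz(L+U)/nnz(A). The numbers below are made up for illustration and are not taken from the paper's matrices.

```python
# Compute the caption's two sparsity metrics from raw nonzero counts.

def sparsity_and_fill(nnz_A, n_rows, nnz_LU):
    avg_per_row = nnz_A / n_rows     # sparsity: nnz(A)/N
    fill_ratio = nnz_LU / nnz_A      # fill ratio: nnz(L+U)/nnz(A)
    return avg_per_row, fill_ratio

avg, fill = sparsity_and_fill(nnz_A=1_000_000, n_rows=100_000,
                              nnz_LU=25_000_000)
print(avg, fill)   # → 10.0 nonzeros/row, fill ratio 25.0
```

A large fill ratio is exactly what reordering schemes such as MeTiS try to reduce, since the factorization cost and memory scale with nnz(L+U), not nnz(A).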

### Table 3: Times (in s) for solving the equilibrium problem (19 linear systems).

"... In PAGE 7: ... We also make a comparison with MUMPS used as a black-box parallel sparse direct solver on the complete original problem, where the stiffness matrices are considered as in a distributed format (a feature of the MUMPS software). In Table 3, we display the elapsed times to solve the complete equilibrium problem, by using direct substructuring (DSS), MUMPS as a black-box direct parallel solver (MS), or iterative substructuring with preconditioned conjugate ... In PAGE 7: ... Table 3: Times (in s) for solving the equilibrium problem (19 linear systems). Table 3 shows that the substructuring algorithms are more efficient than the multifrontal parallel solver, which does not scale very well because of the relatively modest size of the linear systems. We also observe that iterative substructuring is slightly more efficient than direct substructuring. ... In PAGE 7: ... However, this behaviour might not be true for other equations, and to make a fair comparison we required the same accuracy for all the linear solvers. Table 3 shows that the one-level algorithms (AS and NN) perform better than the two-level algorithms, even though the latter require slightly fewer iterations for larger numbers of subdomains. This shows that a smaller number of iterations is not always an indication of better overall efficiency of a preconditioned Krylov solver. ... ..."

### Table 1: Cenju-3 matrix-vector product performance (n = 160,000 with 16 nonzeros/row).

Figure 5 shows the performance of PLUMP using the parspai preconditioner and the cgs solver when applied to a large system of linear equations (n = 16,384) arising from a PDE on a rectangular mesh. This problem could not be solved on a single processor because of memory limitations, hence the speedup is computed using 16 PEs as a base. The scaling behavior of the solver, which depends almost entirely on the performance of the matrix-vector product and the calculation of the preconditioner, seems promising for tackling larger systems. A good partitioning of the mesh, however, is important to exploit the data locality on each processor and to ensure a good compute-to-communication ratio on distributed-memory parallel machines. The performance of the preconditioner could be substantially improved by optimizing the communication, tuning the different parameters, and improving the basic algorithm used in parspai. (TR-96-15, May 1996)

"... In PAGE 17: ... This operation has been optimized with non-blocking MPI communication primitives for optimum performance. Table 1 shows the performance of the matrix-vector multiplication for a relatively large matrix (n = 1600000) on the Cenju-3. The efficient implementation of the matrix-vector product is indeed of utmost importance for proper scaling of the solvers. ... ..."

### Table 4: Relationship of memory conflicts, strides, and algorithms for the Perfect Club Benchmarks. G/S = gather/scatter. The algorithms are: 1) Sparse linear system solvers, 2) Nonlinear algebraic system solvers, 3) Fast Fourier transform, 4) Rapid elliptic problem solvers, 5) Multigrid schemes, 6) Ordinary differential equation solvers, 7) Monte Carlo schemes, 8) Integral transforms, 9) Convolution.

"... In PAGE 8: ... Stride information is not available from the hardware performance monitor, but this information can be extracted from the sim traces. Table 4 shows the percentage of some selected strides for the Perfect Club Benchmarks. The strides have been taken modulo 32 since only strides which are multiples of eight (0, 8, 16, and 24) cause internal memory conflicts. ... In PAGE 22: ... All counts are in millions. Table 4: Relationship of memory conflicts, strides, and algorithms for the Perfect Club Benchmarks. G/S = gather/scatter. ... ..."
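The stride analysis described above reduces each observed stride modulo 32 and tallies how often the result lands on a multiple of eight (0, 8, 16, or 24), the residues said to cause internal memory bank conflicts. A minimal sketch of that tally, over a synthetic trace rather than real sim traces, looks like this:

```python
# Tally the fraction of memory accesses whose stride (mod 32) falls on
# a conflicting residue (0, 8, 16, 24). The trace below is synthetic.

from collections import Counter

def conflict_fraction(strides):
    mod = Counter(s % 32 for s in strides)
    total = sum(mod.values())
    conflicting = sum(mod[s] for s in (0, 8, 16, 24))
    return conflicting / total

trace = [1, 8, 64, 33, 16, 2, 128, 24, 7, 8]
print(conflict_fraction(trace))  # → 0.6
```

In this toy trace, six of the ten strides (8, 64, 16, 128, 24, 8) reduce to a conflicting residue, so 60% of accesses would hit an internal memory conflict.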