### Table 4. Pseudocode for the parallel TRGRID algorithm for the CMTM environment

"... In PAGE 5: ... The pseudocode for our parallel algorithm is listed in Table 4. Remember that after completing the (parallel) Arnoldi iterations (Table 4, line 1), with the row distribution described earlier, the rows of the bases CF and CM CF are dis-... ..."
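The (parallel) Arnoldi step referenced in line 1 of the pseudocode builds an orthonormal Krylov basis whose rows can then be distributed. A minimal serial NumPy sketch of the iteration (illustrative only, not the paper's distributed TRGRID implementation; the function name and interface are assumptions):

```python
import numpy as np

def arnoldi(A, v0, m):
    """Build an orthonormal Krylov basis V (n x (m+1)) and a Hessenberg
    matrix H ((m+1) x m) satisfying A @ V[:, :m] == V @ H."""
    n = v0.shape[0]
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):              # modified Gram-Schmidt
            H[i, j] = V[:, i] @ w
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:             # happy breakdown: invariant subspace found
            return V[:, : j + 1], H[: j + 1, : j]
        V[:, j + 1] = w / H[j + 1, j]
    return V, H
```

In a row-distributed setting, the matrix-vector product and the inner products in the Gram-Schmidt loop are the communication points; everything else is local to each processor's rows.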

### Table 1. Comparison of results between grids with and without diagonals. New results

1994

"... In PAGE 2: ... For two-dimensional n × n meshes without diagonals, 1-1 problems have been studied for more than twenty years. The fastest solutions known so far for 1-1 problems and for h-h problems with small h 9 are summarized in Table 1. In that table we also present our new results on grids with diagonals and compare them with those for grids without diagonals.... ..."

Cited by 11

### Table 1 Performance in MFlops of parallel matrix-matrix multiplication on the BBN TC2000 using 1024-by-1024 matrices.

"... In PAGE 5: ... Finally, these codes have been used as a platform for the implementation of the uniprocessor version of Level 3 BLAS on the BBN TC2000 (see next Section). We show in Table 1 the MFlops rates of the parallel matrix-matrix multiplication, and in Table 2 the performance of the LU factorization (we use a blocked code similar to the LAPACK one) on the ALLIANT FX/80, the CRAY-2, and the IBM 3090-600J obtained using our parallel version of the Level 3 BLAS. Note that our parallel Level 3 BLAS uses the serial manufacturer-supplied versions of GEMM on all the computers.... In PAGE 6: ... This package is available free of charge and will be sent to anyone who is interested. We show in Table 1 the performance of the single and double precision GEMM on different numbers of processors. Table 2 shows the performance of the LAPACK codes corresponding to the blocked LU factorization (GETRF, right-looking variant), and the blocked Cholesky factorization (POTRF, top-looking variant).... In PAGE 8: ... The second part concerned the performance we obtained by tuning and parallelizing these codes, and by introducing library kernels. We give in Table 1 a brief summary of the results we have obtained: one of the most important points to mention here is the great impact of the use of basic linear algebra kernels (Level 3 BLAS) and the LAPACK library. The conclusion offers recommendations for a methodology for both porting and developing codes on parallel computers, performance analysis of the target computers, and some comments relating to the numerical algorithms encountered.... In PAGE 12: ... Because of the depth-first search order, the contribution blocks required to build a new frontal matrix are always at the top of the stack. The minimum size of the LU area (see column 5 of Table 1) is computed during the symbolic factorization step.
The comparison between columns 4 and 5 of Table 1 shows that the size of the LU area is only slightly larger than the space required to store the LU factors.... In PAGE 12: ... The minimum size of the LU area (see column 5 of Table 1) is computed during the symbolic factorization step. The comparison between columns 4 and 5 of Table 1 shows that the size of the LU area is only slightly larger than the space required to store the LU factors. Frontal matrices are stored in a part of the global working space that will be referred to as the additional space.... In PAGE 12: ... In a uniprocessor environment, only one active frontal matrix need be stored at a time. Therefore, the minimum real space (see column 7 of Table 1) to run the numerical factorization is the sum of the LU area, the space to store the largest frontal matrix, and the space to store the original matrix. Matrix Order Nb of nonzeros in Min.... In PAGE 13: ... In this case the size of the LU area can be increased using a user-selectable parameter. On our largest matrix (BBMAT), by increasing the space required to run the factorization (see column 7 in Table 1) by less than 15 percent from the minimum, we could handle the fill-in due to numerical pivoting and run efficiently in a multiprocessor environment. We reached 1149 MFlops during numerical factorization with a speed-up of 4.... In PAGE 14: ...ack after computation. Interleaving and cachability are also used for all shared data. Note that, to prevent cache inconsistency problems, cache flush instructions must be inserted in the code. We show, in Table 1, timings obtained for the numerical factorization of a medium-size (3948 × 3948) sparse matrix from the Harwell-Boeing set [3]. The minimum degree ordering is used during analysis.... In PAGE 14: ... - in rows (1) we exploit only parallelism from the tree; - in rows (2) we combine the two levels of parallelism. As expected, we first notice, in Table 1, that version 1 is much faster than version 2...
In PAGE 15: ... Results obtained on version 3 clearly illustrate the gain coming from the modifications of the code, both in terms of speed-up and performance. Furthermore, when only parallelism from the elimination tree is used (see rows (1) in Table 1), all frontal matrices can be allocated in the private area of memory. Most operations are then performed from the private memory and we obtain speedups comparable to those obtained on shared memory computers with the same number of processors [1].... In PAGE 15: ... Most operations are then performed from the private memory and we obtain speedups comparable to those obtained on shared memory computers with the same number of processors [1]. We finally notice, in Table 1, that although the second level of parallelism nicely supplements that from the elimination tree, it does not provide all the parallelism that could be expected [1]. The second level of parallelism can even introduce a small slowdown on a small number of processors, as shown in column 3 of Table 1.... In PAGE 15: ... We finally notice, in Table 1, that although the second level of parallelism nicely supplements that from the elimination tree, it does not provide all the parallelism that could be expected [1]. The second level of parallelism can even introduce a small slowdown on a small number of processors, as shown in column 3 of Table 1. The main reason is that frontal matrices must be allocated in the shared area when the second level of parallelism is enabled.... In PAGE 17: ...5 28.2 Table 1: Results in Megaflops on parallel computers. In Table 1, it can be seen that the performance of the program on the Alliant FX/80 in double precision is better than the performance of the single precision version.... In PAGE 17: ...2 Table 1: Results in Megaflops on parallel computers. In Table 1, it can be seen that the performance of the program on the Alliant FX/80 in double precision is better than the performance of the single precision version.
The reason for this is that the single precision mathematical library routines are less optimized.... In PAGE 18: ... block diagonal) preconditioner appears to be very suitable and is superior to the Arnoldi-Chebyshev method. Table 1 shows the results of the computation on an Alliant FX/80 of the eight eigenpairs with largest real parts of a random sparse matrix of order 1000. The nonzero off-diagonal and the full diagonal entries are in the ranges [-1,+1] and [0,20] respectively.... In PAGE 19: ... A comparison with the block preconditioned conjugate gradient is presently being investigated. In Table 1, we compare three partitioning strategies for the number of right-hand sides for solving the system of equations M⁻¹AX = M⁻¹B, where A is the matrix BCSSTK27 from the Harwell-Boeing collection, B is a rectangular matrix with 16 columns, and M is the ILU(0) preconditioner. Method 1 2 3 1 block.... In PAGE 26: ...111 2000 lapack code 0.559 Table 1: Results on matrices of bandwidth 9.... In PAGE 30: ... We call "global approach" the use of a direct solver on the entire linear system at each outer iteration, and we want to compare it with the use of our mixed solver, in the case of a simple splitting into 2 subdomains. We show the timings (in seconds) in Table 1 on 1 processor and in Table 2 on 2 processors, for the following operations: construction & assembly: construction and assembly, 14% of the elapsed time; factorization: local factorization (Dirichlet + Neumann), 23%; substitution/pcg: iterations of the PCG on the Schur complement, 55%; total time. The same code is used for the global direct solver and the local direct solvers, which takes advantage of the block-tridiagonal structure due to the privileged direction. Moreover, no special effort has been made to parallelize the mono-domain approach.... ..."
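The blocked right-looking LU factorization these snippets describe (the GETRF variant built on Level 3 BLAS) factors a panel, does a triangular solve for the block row, then updates the trailing submatrix with one GEMM. A minimal NumPy sketch, without the partial pivoting that GETRF also performs (function name and block size are illustrative assumptions):

```python
import numpy as np

def blocked_lu(A, nb=3):
    """Right-looking blocked LU without pivoting. Returns L and U packed
    in one matrix: strict lower part holds L (unit diagonal), upper holds U."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # 1) unblocked LU of the panel A[k:, k:e]
        for j in range(k, e):
            A[j + 1:, j] /= A[j, j]
            A[j + 1:, j + 1:e] -= np.outer(A[j + 1:, j], A[j, j + 1:e])
        # 2) triangular solve for the block row: U12 = L11^{-1} A12 (TRSM)
        L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
        A[k:e, e:] = np.linalg.solve(L11, A[k:e, e:])
        # 3) rank-nb trailing update A22 -= L21 @ U12 -- the Level 3 GEMM call
        A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A
```

Step 3 is where almost all the flops are, which is why routing it through a tuned GEMM dominates the performance of the whole factorization.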

### Table 1 The parallel multidirectional search algorithm.

1991

"... In PAGE 14: ... A distributed memory implementation. We begin with a statement of the basic algorithm, shown in Table 1. Each of the p processors constructs one vertex v_i and its function value f(v_i).... ..."
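The multidirectional search algorithm reflects the whole simplex through its best vertex, then expands or contracts; the trial-vertex evaluations are independent, which is exactly what the p processors exploit. A serial Python sketch of that structure (the coefficients 2, 3 and 1/2 follow the standard reflection/expansion/contraction rules; the names and interface are assumptions):

```python
import numpy as np

def mds(f, simplex, iters=200):
    """Serial sketch of multidirectional search for minimizing f.
    In the distributed version each processor evaluates one trial vertex."""
    V = np.array(simplex, dtype=float)          # rows are the n+1 vertices
    for _ in range(iters):
        fv = [f(v) for v in V]                  # embarrassingly parallel step
        best = int(np.argmin(fv))
        V[[0, best]] = V[[best, 0]]             # keep the best vertex in row 0
        refl = 2 * V[0] - V[1:]                 # reflect all other vertices
        if min(f(v) for v in refl) < f(V[0]):
            exp = 3 * V[0] - 2 * V[1:]          # try the expanded simplex
            V[1:] = exp if min(f(v) for v in exp) < min(f(v) for v in refl) else refl
        else:
            V[1:] = (V[0] + V[1:]) / 2          # contract toward the best vertex
    return V[int(np.argmin([f(v) for v in V]))]
```

Unlike Nelder-Mead, every non-best vertex moves at each step, so the method scales naturally to p = n processors.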

Cited by 101

### Table 10: Test matrix generated from a discretization on a 64 × 64 grid: Laplace's equation. Times shown in the table are in microseconds. The experiments are performed on the BBN TC2000. In the parallel distributed Cimmino solver, the number of generated subsystems is, for numerical reasons, related to the structure of the problem and not to the number of available computing elements. Therefore, we implemented a scheduler that statically distributes tasks to the computing elements, trying to keep the workload balanced among the processing elements and to take advantage of the available interconnection networks. Part of our current research objectives is to test the scheduler in a heterogeneous environment using ...

"... In PAGE 11: ... Finally, we developed an implementation where a single process performs the steps of the Block-CG, and only the matrix-matrix products that involve the iteration matrix are performed in parallel (master-slave: centralized). In Arioli, Drummond, Duff, and Ruiz (1994a), we present results obtained for the three implementations using PVM 3 on a BBN TC2000 computer (see Table 10) and a heterogeneous network of IBM RS6000 and SUN Sparc 10 workstations. Laplace matrix 4096 × 4096 (block size = 4, 171 iterations). Elapsed time of sequential version = 279142... ..."
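Each Cimmino step sums independent projections of the current residual onto the row spaces of the row blocks, which is why the subsystem solves parallelize so naturally. A minimal dense sketch of the basic block iteration (the solver in the snippet uses Block-CG acceleration rather than this plain fixed-point form, and would not use explicit pseudo-inverses; the names are assumptions):

```python
import numpy as np

def block_cimmino(A, b, nblocks=2, iters=80):
    """Basic block Cimmino iteration for Ax = b: average the projections
    of the residual onto the row spaces of the row blocks."""
    x = np.zeros(A.shape[1])
    blocks = np.array_split(np.arange(A.shape[0]), nblocks)
    pinvs = [np.linalg.pinv(A[rows]) for rows in blocks]   # A_i^+ precomputed
    for _ in range(iters):
        # the nblocks corrections below are independent -> natural parallel tasks
        x = x + sum(P @ (b[rows] - A[rows] @ x)
                    for P, rows in zip(pinvs, blocks)) / nblocks
    return x
```

Because the number of subsystems is fixed by the problem structure rather than by the machine, a scheduler (as described in the caption above) must map these independent block corrections onto the available processors.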

### Table 2. Parallel-algorithm characteristics.

"... In PAGE 10: ... Comparisons with a hierarchical implementation of Count Distribution/CCPD showed orders-of-magnitude improvements of ParMaxClique over Count Distribution. SUMMARY OF PARALLEL ALGORITHMS Table 2 shows the essential differences among the different methods reviewed earlier and groups together related algorithms. As you can see, there are only a handful of distinct paradigms.... ..."

### Table 4: Some parallel genetic algorithm implementations. Platform GA type Topology Researcher/year

"... In PAGE 63: ... For this reason, the goal of many parallel genetic algorithm implementations and experiments is to find out how the population should be divided among parallel processors and how the resulting sub-populations should interact for quick convergence. Some experiments are listed in Table 4, where it can be seen how the same platform has been used to study different topologies for centralized, distributed and network-model parallel genetic algorithms [44]. The most popular computing platforms have been transputer-based systems, which is unsurprising given their low cost and simple interconnection scheme [125].... ..."
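A distributed (island-model) parallel GA of the kind tabulated here evolves separate sub-populations and periodically migrates good individuals between them. A toy serial sketch of that structure (ring topology; all parameters, operators and names are illustrative assumptions, not any implementation from the table):

```python
import random

def island_ga(fitness, ndim, islands=4, pop=20, gens=60, migrate_every=10):
    """Toy island-model GA minimizing `fitness`. Each island evolves its own
    sub-population; every few generations the best individual of each island
    replaces one individual on its ring neighbour."""
    rng = random.Random(0)
    pops = [[[rng.uniform(-5, 5) for _ in range(ndim)] for _ in range(pop)]
            for _ in range(islands)]
    for g in range(gens):
        for isl in range(islands):          # independent: one process per island
            P = sorted(pops[isl], key=fitness)
            elite = P[: pop // 2]           # elitist truncation selection
            children = []
            while len(elite) + len(children) < pop:
                a, b = rng.sample(elite, 2)  # midpoint crossover + Gaussian mutation
                children.append([(x + y) / 2 + rng.gauss(0, 0.3)
                                 for x, y in zip(a, b)])
            pops[isl] = elite + children
        if g % migrate_every == 0:          # ring migration step
            bests = [min(P, key=fitness) for P in pops]
            for isl in range(islands):
                pops[isl][-1] = bests[(isl - 1) % islands]
    return min((min(P, key=fitness) for P in pops), key=fitness)
```

The migration interval and topology are exactly the design choices the experiments in Table 4 vary: frequent migration behaves like one big population, rare migration like fully independent runs.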

### Table 1: Routines with nested parallelism. Both the inner part and the outer part can be executed in parallel.

1993

"... In PAGE 9: ... The implementation of the permute function on a distributed-memory parallel machine could use its communication network, and the implementation on a shared-memory machine could use an indirect write into the memory. Table 1 lists some of the sequence functions available in Nesl. A subset of the functions (the starred ones) form a complete set of primitives.... In PAGE 12: ... Table 2 lists some divide-and-conquer algorithms and the source of their outer parallelism: Quicksort (the recursive calls on the lesser and greater elements), Mergesort (the first and second half), Closest Pair (each half of the space), Strassen's Matrix Multiply (each of the 7 sub-multiplications), and the Fast Fourier Transform (the two sets of interleaved points); in each case the inner parallelism is the same algorithm applied recursively. Table 1 lists several examples of routines that could take advantage of nested parallelism. Nested parallelism also appears in most divide-and-conquer algorithms.... In PAGE 16: ... The work complexity for most of the sequence functions is simply the size of one of its arguments. A complete list is given in Table 1. The size of an object is defined recursively: the size of a scalar value is 1, and the size of a sequence is the sum of the sizes of its elements plus 1.... ..."
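Quicksort illustrates both levels of parallelism at once: the partition filters are the inner (data) parallelism, and the two recursive sorts are the outer parallelism. A Python sketch using threads for the outer level (in Nesl both levels would be expressed as nested data-parallel operations; the cutoff depth is an illustrative assumption):

```python
import threading

def quicksort(seq, depth=2):
    """Nested-parallel quicksort sketch. Inner parallelism: the three
    partition filters (data-parallel packs in Nesl). Outer parallelism:
    the two recursive sorts run concurrently, down to a cutoff depth."""
    if len(seq) <= 1:
        return list(seq)
    pivot = seq[len(seq) // 2]
    lesser  = [x for x in seq if x < pivot]    # inner (data) parallelism
    equal   = [x for x in seq if x == pivot]
    greater = [x for x in seq if x > pivot]
    if depth == 0:                             # sequential below the cutoff
        return quicksort(lesser, 0) + equal + quicksort(greater, 0)
    out = {}
    def run(key, part):
        out[key] = quicksort(part, depth - 1)
    t = threading.Thread(target=run, args=("lo", lesser))
    t.start()                                  # one half in a new thread...
    run("hi", greater)                         # ...the other in this thread
    t.join()
    return out["lo"] + equal + out["hi"]
```

Using only the outer level leaves the final partitions serial; using only the inner level leaves the recursion serial; nesting them, as the snippet argues, exposes parallelism throughout the call tree.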

Cited by 87

### Table II: Implementation Results of Segmentation Algorithm on Image 3 from [13], seven grey circles (128 × 128). The source code for the parallel algorithms presented in this paper is available for distribution to interested parties.

1996

Cited by 19
