### Table 1. The BSP cost parameters for a variety of shared and distributed-memory parallel machines.

1997

"... In PAGE 3: ... Similarly, [10] shows how careful construction of barriers can reduce the value of l. Table 1 shows the values of l and g for a variety of parallel machines (the benchmarks used to calculate these constants are described in [7]). Returning to the problem of summing n values posed at the start of this section, it is natural to distribute the data amongst the processors in n/p-sized chunks when n > p.... In PAGE 3: ... Combining the cost of locally summing each processor's n/p-sized chunk of data with the cost of the summation of p values gives a total cost for summing n values on p processors of n/p + log p (1 + g + l). It is clear from this cost formula, and from the values of l and g in Table 1, that the logarithmic number of barrier synchronisations used in this algorithm dominates the cost unless n > p log p (1 + g + l). For a network of eight workstations, therefore, n must be greater than 20,000,000 elements before the computation time starts to dominate the communication time; even for an eight-processor Cray T3D, n must be greater than 4,200.... In PAGE 8: ... In general g and l are functions of p but, for purpose-built parallel machines, they are sub-linear in p. For example, Table 1 shows that g is approximately constant for the Cray T3E and l is logarithmic in p. Therefore, to provide a meaningful lower bound on the speedup, upper bounds on the values of l and g can be used as long as the dependence is not too great, as in the case of the Cray systems.... In PAGE 8: ... However, due to the shared-bus nature of Ethernet, only a single pair of processors can be involved in communication at any time. This can be observed in Table 1 as g ∝ p and l ∝ p log p for full h-relations, where the constants of proportionality are half the values of g and l for a two-processor configuration.
The speedup can now be refined to: k1 n log n / (k2 (n/p) log n + k1 p^2 log p^2 + p^2 g2 + 0.5 n g2 + 1.5 l2 p log p) (6) For reasonably large p and n ≫ p^2, this simplifies to:... ..."
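The BSP cost argument in the excerpt (local work n/p plus log p barrier-synchronised combining steps, each costing 1 + g + l) can be sketched numerically. The parameter values below are illustrative stand-ins, not the measured constants from Table 1:

```python
import math

def bsp_sum_cost(n, p, g, l):
    """Total BSP cost of summing n values on p processors:
    n/p local additions, then a log p tree of combining steps,
    each paying the per-step cost (1 + g + l)."""
    return n / p + math.log2(p) * (1 + g + l)

def break_even_n(p, g, l):
    """Smallest n at which local computation starts to dominate the
    synchronisation term, i.e. n > p * log p * (1 + g + l)."""
    return p * math.log2(p) * (1 + g + l)

# Illustrative (invented) BSP parameters, not benchmark values:
print(break_even_n(8, g=30.0, l=800_000.0))  # Ethernet-workstation-like
print(break_even_n(8, g=1.0, l=175.0))       # tightly coupled MPP-like
```

With a large l, as on a shared-bus workstation network, the break-even n runs into the millions of elements, matching the excerpt's observation; with a small l it drops to a few thousand.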

Cited by 8

### Table 2 shows the pre-processing times spent in AZ transform. The times for METIS, RCM, and SAW are comparable, and are usually an order of magnitude larger than the corresponding times for AZ matvec mult. The AZ transform times show some scalability up to 32 processors. However, for ORIG, the times are two to three orders of magnitude larger, and show very little scalability. Clearly, the ORIG ordering is too inefficient and unacceptable on distributed-memory machines.

2000

"... In PAGE 4: ... Table 2: Runtimes (in seconds) of AZ transform using different orderings on the Cray T3E. To better understand the various partitioning and ordering algorithms, we have built a simple performance model to predict the parallel runtime of AZ matvec mult.... ..."

Cited by 3

### Table 2 shows the pre-processing times spent in AZ transform. The times for METIS, RCM, and SAW are comparable, and are usually an order of magnitude larger than the corresponding times for AZ matvec mult. The AZ transform times show some scalability up to 32 processors. However, for ORIG, the times are two to three orders of magnitude larger, and show very little scalability. Clearly, the ORIG ordering is too inefficient and unacceptable on distributed-memory machines.

2000

"... In PAGE 7: ... Table 2: Runtimes (in seconds) of AZ transform using different orderings on the Cray T3E. To better understand the various partitioning and ordering algorithms, we have built a simple performance model to predict the parallel runtime of AZ matvec mult.... ..."

Cited by 3

### Table 2. Distributed-memory results.

2001

"... In PAGE 19: ... By partitioning the simulation over four processors, the system could complete the simulation using less than 500 Mbytes on each node. Table 2 shows maximum memory used and execution time for sequential and four-processor execution. The four nodes used for parallel execution had 512 Mbytes of memory each; the sequential execution was performed both on a node having 2 Gbytes of memory and on a 512-Mbyte node.... In PAGE 20: ... Owing to the large amount of communication in the model, this case also carries a performance penalty, whose severity depends on the size of the interaction regions between transmitters in the model. Table 2 shows results for the first circle (six closest interfering cells), second circle (18 closest interfering cells), and the whole system. Slowdowns range between 6 and 27 times.... ..."

Cited by 7

### Table 1, we report the runtimes of these two routines on the 450 MHz Cray T3E at NERSC. The original natural ordering (ORIG) is the slowest and clearly unacceptable on distributed-memory machines. For AZ matvec mult, the key kernel routine, RCM is slightly but consistently faster than SAW, while METIS requires almost twice the RCM execution time. However, METIS, RCM, and SAW, all demonstrate excellent scalability up to the 64 processors that were used for these experiments. The pre-processing times in AZ transform are more than an order of magnitude larger than the corresponding times for AZ matvec mult (except for ORIG where it is two to three orders of magnitude larger).

"... In PAGE 3: ... Table 1: Runtimes (in seconds) for different orderings on the Cray T3E. To better understand the various partitioning/ordering algorithms, we have built a simple performance model to predict the parallel runtime.... ..."

### Table 1: Distribution of memory request

in Analysis Of Interconnection Networks For Cache Coherent Multiprocessors With Scientific Applications

"... In PAGE 9: ... The parameters are measured from the simulator and are then fed to our queueing network model as inputs. Table 1 gives the values of p_{i,j} for the different applications in a 4×4 system. It may be observed that the memory accesses are almost equally distributed for all applications, except for FWA.... ..."
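The near-uniformity claim in the excerpt can be expressed as a simple check over the access-probability matrix. The matrix below is an invented placeholder, not the measured p_{i,j} values from Table 1:

```python
# Hypothetical p[i][j]: probability that processor i directs a memory
# request to module j in a 4x4 system (values invented for illustration).
uniform = [[0.25, 0.25, 0.25, 0.25] for _ in range(4)]

def is_near_uniform(p, tol=0.05):
    """True if every access probability lies within tol of 1/m, i.e.
    requests are almost equally distributed across the m modules."""
    m = len(p[0])
    return all(abs(x - 1.0 / m) <= tol for row in p for x in row)

print(is_near_uniform(uniform))  # True for the uniform placeholder
```

A skewed matrix, such as one where each processor favours a single module (as the excerpt reports for FWA), would fail this check.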

### Table 1: Run time, speedup and efficiency for p-processor steady state solution for the FMS model with k=7. Results are presented for an AP3000 distributed memory parallel computer and a PC cluster.

2002

"... In PAGE 5: ... Setting k (the number of unprocessed parts in the system) to 7 results in the underlying Markov chain of the GSPN having 1,639,440 tangible states and produces 13,552,968 off-diagonal entries in its generator matrix Q. Table 1 summarises the performance of the implementation on a distributed memory parallel computer and a cluster of workstations. The parallel computer is a Fujitsu AP3000 which has 60 processing nodes (each with an UltraSparc 300MHz processor and 256MB RAM) connected by a 2D wraparound mesh network.... ..."
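The speedup and efficiency columns of a table like Table 1 follow the standard definitions S = T1/Tp and E = S/p. A minimal sketch, with made-up timings rather than the paper's measured AP3000 or PC-cluster results:

```python
def speedup_and_efficiency(t_seq, t_par, p):
    """Speedup S = T1 / Tp and efficiency E = S / p for a p-processor
    run; t_seq and t_par are wall-clock times in seconds."""
    s = t_seq / t_par
    return s, s / p

# Invented example timings, not values from Table 1:
s, e = speedup_and_efficiency(t_seq=1200.0, t_par=100.0, p=16)
print(s, e)  # 12.0 0.75
```

Efficiency below 1.0 reflects communication and synchronisation overhead, which typically grows with p for a fixed problem size such as the k=7 FMS model.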

### Table 2: Summary of distributed memory Tuplespace implementations

"... In PAGE 42: ... There have been several other published implementations of the Linda tuplespace targeting distributed memory machines [6, 44, 3, 16, 7, 31]. The differences and similarities of these systems are summarized in Table 2. Blank entries in the table are due to inadequate information in the publication relating to that characteristic.... ..."

### Table 2: Summary of distributed memory Tuplespace implementations

"... In PAGE 33: ... IMPLEMENTATION COMPARISONS There have been several other published implementations of the Linda tuplespace targeting distributed memory machines [6, 44, 3, 16, 7, 31]. The differences and similarities of these systems are summarized in Table 2. Blank entries in the table are due to inadequate information in the publication relating to that characteristic.... ..."