### Table 2. Distributed-memory results.

2001

"... In PAGE 19: ... By partitioning the simulation over four processors, the system could complete the simulation using less than 500 Mbytes on each node. Table 2 shows maximum memory used and execution time for sequential and four-processor execution. The four nodes used for parallel execution had 512 Mbytes of memory each; the sequential execution was performed both on a node having 2 Gbytes of memory and on a 512-Mbyte node.... In PAGE 20: ... Owing to the large amount of communication in the model, this case also carries a performance penalty, whose severity depends on the size of the interaction regions between transmitters in the model. Table 2 shows results for the first circle (six closest interfering cells), second circle (18 closest interfering cells), and the whole system. Slowdowns range between 6 and 27 times.... ..."

Cited by 7

### Table 1: Distribution of memory request

in Analysis Of Interconnection Networks For Cache Coherent Multiprocessors With Scientific Applications

"... In PAGE 9: ... The parameters are measured from the simulator and are then fed to our queueing network model as inputs. Table 1 gives the values of p_{i,j} for the different applications in a 4×4 system. It may be observed that the memory accesses are almost equally distributed for all applications, except for FWA.... ..."
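
The excerpt describes feeding a measured access-probability matrix p_{i,j} (the probability that processor i directs a memory request to module j) into a queueing model, and observes near-uniform distributions for most applications. A minimal sketch of that uniformity check, using an illustrative matrix rather than the paper's measured data:

```python
# Hypothetical sketch: test whether a measured memory-request matrix
# p[i][j] is close to uniform across memory modules, as the excerpt
# observes for most applications in a 4x4 system. The matrices below
# are illustrative, not data from the paper.
def is_near_uniform(p, tolerance=0.05):
    """Return True if every entry of the row-stochastic matrix p is
    within `tolerance` of the uniform probability 1/n_modules."""
    n_modules = len(p[0])
    uniform = 1.0 / n_modules
    for row in p:
        assert abs(sum(row) - 1.0) < 1e-9, "each row must sum to 1"
        if any(abs(x - uniform) > tolerance for x in row):
            return False
    return True

# An almost-uniform access pattern (like most applications)...
p_uniform = [[0.25, 0.25, 0.25, 0.25]] * 4
# ...versus one skewed toward a single module (like FWA in the excerpt).
p_skewed = [[0.55, 0.15, 0.15, 0.15]] * 4

print(is_near_uniform(p_uniform))  # True
print(is_near_uniform(p_skewed))   # False
```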

### Table 2: Summary of distributed memory Tuplespace implementations

"... In PAGE 42: ...There have been several other published implementations of the Linda tuplespace targeting distributed memory machines [6, 44, 3, 16, 7, 31]. The differences and similarities of these systems are summarized in Table 2. Blank entries in the table are due to inadequate information in the publication relating to that characteristic.... ..."


### Table 1. The BSP cost parameters for a variety of shared and distributed-memory parallel machines.

1997

"... In PAGE 3: ... Similarly, [10] shows how careful construction of barriers can reduce the value of l. Table 1 shows the values for l and g for a variety of parallel machines (the benchmarks used to calculate these constants are described in [7]). Returning to the problem of summing n values posed at the start of this section, it is natural to distribute the data amongst the processors in n/p-sized chunks, when n > p.... In PAGE 3: ... Combining the cost of locally summing each processor's n/p-sized chunk of data with the cost of the summation of p values gives a total cost for summing n values on p processors of n/p + log p (1 + g + l). It is clear from this cost formula, and from the values of l and g in Table 1, that the logarithmic number of barrier synchronisations used in this algorithm dominates the cost unless n > p log p (1 + g + l). For a network of eight workstations, therefore, n must be greater than 20,000,000 elements before the computation time starts to dominate the communication time; even for an eight-processor Cray T3D, n must be greater than 4,200.... In PAGE 8: ... In general g and l are functions of p but, for purpose-built parallel machines, they are sub-linear in p. For example, Table 1 shows that g is approximately constant for the Cray T3E and l is logarithmic in p. Therefore, to provide a meaningful lower bound on the speedup, upper bounds on the values of l and g can be used as long as the dependence is not too great, as in the case of the Cray systems.... In PAGE 8: ... However, due to the shared-bus nature of Ethernet, only a single pair of processors can be involved in communication at any time. This can be observed in Table 1 as g ∝ p and l ∝ p log p for full h-relations, where the constants of proportionality are half the values of g and l for a two-processor configuration.
The speedup can now be refined to: k_1 n log n / (k_2 (n/p) log n + k_1 (p/2) log(p/2) + (p/2) g_2 + 0.5 n g_2 + 1.5 l_2 p log p) (6). For reasonably large p and n ≫ p², this simplifies to:... ..."

Cited by 8
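
The BSP cost formula quoted above, n/p + log p (1 + g + l), and its break-even condition n > p log p (1 + g + l) can be sketched directly. The (g, l) pairs below are illustrative placeholders chosen only to roughly reproduce the excerpt's thresholds (a slow workstation network versus a tightly coupled machine), not the benchmarked constants from the paper's Table 1:

```python
# Sketch of the BSP cost model from the excerpt: summing n values on
# p processors costs n/p (local summation) plus a log2(p)-depth
# combining tree, each level costing 1 + g + l, where g and l are the
# machine's BSP communication and synchronisation parameters.
import math

def bsp_sum_cost(n, p, g, l):
    """Total BSP cost of summing n values on p processors."""
    return n / p + math.log2(p) * (1 + g + l)

def break_even_n(p, g, l):
    """Smallest n at which local computation (n/p) starts to dominate
    the communication term, i.e. n > p * log2(p) * (1 + g + l)."""
    return p * math.log2(p) * (1 + g + l)

# Illustrative parameters only: a high-latency eight-node cluster
# yields a break-even n in the tens of millions, while a low-latency
# eight-processor machine needs only a few thousand elements.
print(break_even_n(8, g=30.0, l=800_000.0))
print(break_even_n(8, g=1.0, l=175.0))
```

This mirrors the excerpt's observation that barrier synchronisation cost dominates until n is very large on Ethernet-class networks.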

### Table 3: Data parallel performance on Thinking Machines CM-5 message-passing distributed-memory multiprocessor

"... In PAGE 22: ... It is important to notice that there are cases in which a combination of data and task parallelism provides better performance over either type of parallelism individually. Compared to Table 3, the results on the 128-processor runs show that the combined task... ..."

### Table 5.4: Data parallel performance on Thinking Machines CM-5 message-passing distributed-memory multiprocessor

### Table 5.8: Task parallel performance on Thinking Machines CM-5 message-passing distributed-memory multiprocessor

### Table 5.10: Data and task parallel performance on Thinking Machines CM-5 message-passing distributed-memory multiprocessor

### Table 1: Run time, speedup and efficiency for p-processor steady state solution for the FMS model with k=7. Results are presented for an AP3000 distributed memory parallel computer and a PC cluster.

2002

"... In PAGE 5: ... Setting k (the number of unprocessed parts in the system) to 7 results in the underlying Markov chain of the GSPN having 1 639 440 tangible states and produces 13 552 968 off-diagonal entries in its generator matrix Q. Table 1 summarises the performance of the implementation on a distributed memory parallel computer and a cluster of workstations. The parallel computer is a Fujitsu AP3000 which has 60 processing nodes (each with an UltraSparc 300MHz processor and 256MB RAM) connected by a 2D wraparound mesh network.... ..."
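
The caption's speedup and efficiency columns follow the standard definitions S_p = T_1 / T_p and E_p = S_p / p. A minimal sketch with made-up run times (not the AP3000 or PC-cluster figures from the paper's Table 1):

```python
# Hypothetical sketch of the derived columns in a run-time table:
# speedup S_p = T_1 / T_p and efficiency E_p = S_p / p, computed from
# a sequential time and per-processor-count parallel times. All
# numbers here are illustrative placeholders.
def speedup(t1, tp):
    """Speedup of a p-processor run with time tp over sequential t1."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Parallel efficiency: speedup divided by processor count."""
    return speedup(t1, tp) / p

t1 = 1200.0                             # hypothetical sequential time (s)
runs = {2: 640.0, 4: 340.0, 8: 190.0}   # hypothetical parallel times (s)

for p, tp in runs.items():
    print(p, round(speedup(t1, tp), 2), round(efficiency(t1, tp, p), 2))
```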