### Table 3: Computational times associated with parallel implementation of the finite element model.

2007

"... In PAGE 64: ... This ensured that the perturbed displacement data sets were still contained within the atlases. Results Parallel implementation of the Finite Element Model Table 3 illustrates the computational time necessary to solve the biphasic model on a finite element mesh containing 19468 nodes and 104596 elements, using 16 processors (2.8 GHz,... ..."

### Table 11: CFD Convection Diffusion Problem Implemented on NCUBE2

"... In PAGE 10: ... Therefore they appear to be the right compromise class of method for highly parallel architectures. This is demonstrated in the results in Table 11, which show the considerably faster performance of a multigrid algorithm.28 These results also demonstrate the fallacy of measuring the performance of a parallel machine by MFLOPS alone when comparing different algorithms.... ..."
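The snippet's point is that MFLOPS alone misleads when comparing algorithms: multigrid reaches a given accuracy with far less arithmetic than simpler iterations. As a rough illustration only (a minimal 1D geometric-multigrid V-cycle in Python for -u'' = f, not the paper's NCUBE2 convection-diffusion implementation), the idea looks like:

```python
import numpy as np

def smooth(u, f, h, sweeps=3, omega=2.0 / 3.0):
    """Weighted-Jacobi relaxation for -u'' = f with Dirichlet boundaries."""
    for _ in range(sweeps):
        u[1:-1] = (1 - omega) * u[1:-1] + omega * 0.5 * (u[:-2] + u[2:] + h * h * f[1:-1])
    return u

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
    return r

def restrict(r):
    """Full-weighting restriction onto the next-coarser grid (grids have 2^k+1 points)."""
    rc = np.zeros((len(r) - 1) // 2 + 1)
    rc[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
    return rc

def prolong(ec):
    """Linear interpolation of a coarse-grid correction back to the fine grid."""
    e = np.zeros(2 * len(ec) - 1)
    e[::2] = ec
    e[1::2] = 0.5 * (ec[:-1] + ec[1:])
    return e

def v_cycle(u, f, h):
    if len(u) <= 3:                          # coarsest grid: one unknown, solve exactly
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1])
        return u
    u = smooth(u, f, h)                      # pre-smoothing
    rc = restrict(residual(u, f, h))
    u += prolong(v_cycle(np.zeros_like(rc), rc, 2 * h))   # coarse-grid correction
    return smooth(u, f, h)                   # post-smoothing

# Solve -u'' = pi^2 sin(pi x) on [0, 1]; exact solution is u = sin(pi x).
n = 129
x = np.linspace(0.0, 1.0, n)
f = np.pi ** 2 * np.sin(np.pi * x)
u = np.zeros(n)
for _ in range(10):
    u = v_cycle(u, f, 1.0 / (n - 1))
```

Each V-cycle costs O(n) work yet damps all error frequencies, which is why a multigrid solver can win on wall-clock time even if its MFLOPS rate is lower.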

### Table 3 Evaluation of structural parameters using three different versions of the lattice model for sample 2. A: Lattice model with finite stack height (used as fitting parameter), neglecting instrumental broadening [equation (5)]. B: Lattice model with a finite stack height corresponding to the film thickness (400 nm), and with finite instrumental resolution [equation (6)]. C: Lattice model with a finite stack height corresponding to the film thickness (400 nm), neglecting instrumental broadening [equation (4)]. All three analyses are based on equations (10) and (11).

2004

"... In PAGE 7: ... The results are presented in Fig. 12 and Table 3. In column A, the parameters from Table 1 (lattice model with finite stack height) are repeated.... ..."

### Table 2: The performance of different implementations of the multilevel k-way partitioning algorithm. This table shows the performance of the MPI- and SHMEM-based parallel algorithm, of the coarse-grain parallel multilevel refinement algorithm, and of the serial algorithm on an SGI workstation. For the parallel algorithms, performance on each graph is shown for 16-, 32-, 64-, and 128-way partitions on 16, 32, 64, and 128 processors, respectively. All times are in seconds.

1997

"... In PAGE 6: ... of Edges Description AUTO 448695 3314611 3D Finite element mesh MDUAL 258569 513132 Dual of a 3D Finite element mesh MDUAL2 988605 1947069 Dual of a 3D Finite element mesh Table 1: Various graphs used in evaluating the parallel multilevel k-way graph partitioning algorithm. Table 2 shows the performance of various implementations of the multilevel k-way partitioning algorithm. The first... In PAGE 7: ... Also, because the coarse-grain implementation is memory efficient, this increases the amount of time the algorithm spends setting up the appropriate data structures. The third subtable in Table 2 shows the performance achieved by the coarse-grain parallel multilevel refinement algorithm. These results were obtained by using, as the initial graph distribution, the partitioning obtained by the parallel multilevel k-way partitioning algorithm.... ..."

Cited by 25
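The multilevel scheme these snippets describe coarsens the graph level by level, partitions the small coarsest graph, then projects the partition back while refining. As an illustrative sketch only (not the paper's MPI/SHMEM code), one coarsening level via heavy-edge matching can be written as:

```python
import random

def heavy_edge_matching(adj):
    """One coarsening level of a multilevel partitioner: greedily match each
    vertex with its heaviest-weight unmatched neighbor, then collapse pairs.
    adj: {u: {v: weight}}, undirected (each edge stored in both directions).
    Returns (coarse_adj, fine-to-coarse vertex map)."""
    matched = {}
    order = list(adj)
    random.shuffle(order)                  # randomized visit order
    for u in order:
        if u in matched:
            continue
        free = [(w, v) for v, w in adj[u].items() if v not in matched]
        if free:
            _, v = max(free)               # heaviest-weight free neighbor
            matched[u], matched[v] = v, u
        else:
            matched[u] = u                 # singleton: matched with itself
    cmap, nxt = {}, 0                      # fine vertex -> coarse vertex id
    for u in adj:
        if u not in cmap:
            cmap[u] = cmap[matched[u]] = nxt
            nxt += 1
    coarse = {c: {} for c in range(nxt)}   # sum weights of parallel edges
    for u, nbrs in adj.items():
        for v, w in nbrs.items():
            cu, cv = cmap[u], cmap[v]
            if cu != cv:
                coarse[cu][cv] = coarse[cu].get(cv, 0) + w
    return coarse, cmap
```

Repeating this until the graph is small, partitioning the coarsest graph k ways, and then projecting the partition back through `cmap` with local refinement at each level is the essence of the multilevel k-way algorithm.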

### Table 3: Evaluated GPU architecture configurations

"... In PAGE 9: ... As the Fragment Generator requires front-facing edge and z equations, the equation coefficients of the back-facing triangle are negated, becoming a front-facing triangle for the Fragment Generator. 5 Embedded GPU evaluation Table 3 shows the parameters for the different configurations we have tested, ranging from a middle-level PC GPU to our embedded TILA-rin GPU. The purpose of the different configurations is to evaluate how performance is affected by reducing the different hardware resources.... In PAGE 10: ... Configurations J and K are configured with 1 MB of GPU memory for the framebuffer (which could be implemented as embedded DRAM), with a maximum data rate of 16 bytes/cycle, and 128 MB of higher-latency, lower-bandwidth system memory. Table 3 shows the following parameters: the resolution (Res) at which the trace was simulated; an estimated working frequency in MHz for the architecture (400+ MHz is common for the middle and high-end PC GPU segments, while 200 MHz is in the middle to high-end segment of current low-power embedded GPUs); the number of vertex shaders (VS) and fragment shaders (FSh) for non-unified shader configurations, or the number of unified shaders (Sh) for unified shader configurations. The number of fragment shaders is specified as the number of fragment shader units multiplied by the number of fragments in a group (always 4).... ..."

### Table 5. Comparisons between different architectures (columns: Architecture, Full Adder Cells, Memory (Bytes), Time, Data Rates)

2000

"... In PAGE 9: ...4. Area-Time Efficient Architecture From comparing the above two architectures in Table 5, we see that the area-constrained architecture does not meet real-time requirements, while the time-constrained architecture is highly aggressive in area. So a tradeoff point in the design space needs to be found that meets the real-time requirements with minimum additional area.... In PAGE 10: ... Elements in the cross-correlation matrix block are built in hardware with XNOR gates, requiring on the order of 1000 XNOR gates, whereas the same computation may take on the order of 1000 cycles on a DSP and about 1 KB of memory. Assuming a 500 MHz clock for the VLSI architectures, the projected time required to compute the channel estimate, along with the hardware required for 32 users and a spreading code of length 32, is as shown in Table 5. This is compared with the implementation of the previously existing algorithm (equation 4) on a TI TMS320C6701 Evaluation Module operating at 166 MHz.... In PAGE 10: ...users, this corresponds to a time requirement of 0.97 ms or 1.02 Kbps. The inherent parallelism present in the algorithm can be seen from the ratio of the times taken for computation by the area-constrained and time-constrained architectures. The area estimates are compared using the number of Full Adder Cells needed in the design, as shown in Table 5. The time difference between the DSP and the VLSI architectures is due to the algorithm modifications and to the fact that bit-level and byte-level parallelism are not exploited on the DSP, along with the additional memory references.... ..."

Cited by 8
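The snippet's XNOR trick rests on a simple identity: for ±1-valued sequences, a product term is +1 exactly when the two sign bits agree, so a correlation reduces to an XNOR followed by a population count. A small sketch of that identity (illustrative only, not the paper's VLSI design; the function name is hypothetical):

```python
def xnor_correlation(b, r):
    """Correlation of two length-n sequences over {-1, +1}, computed the way
    the XNOR hardware would: pack sign bits, XNOR, popcount.
    Since x*y = +1 iff the sign bits of x and y agree,
    sum(x_i * y_i) = agreements - disagreements = 2*popcount(xnor) - n."""
    n = len(b)
    bb = sum(1 << i for i, x in enumerate(b) if x > 0)   # pack the +1 positions
    rr = sum(1 << i for i, x in enumerate(r) if x > 0)
    agree = ~(bb ^ rr) & ((1 << n) - 1)                  # XNOR: 1 where bits agree
    return 2 * bin(agree).count("1") - n

# One entry of a cross-correlation matrix such as Rbr(i, j) would be this
# value for a row of the bit matrix b against a window of the received r.
```

In hardware the popcount becomes an adder tree, which is why the gate count scales with the sequence length while the latency stays at a few gate delays.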
