### Table 3: Runtimes (in seconds) for 10 iterations for the case with no preconditioner (bold indicates fastest runtimes for P processors).

in Hybrid (OpenMP and MPI) Parallelization of MFIX: A Multiphase CFD Code for Modeling Fluidized Beds

"... In PAGE 6: ... It is very evident that DMP parallelization, for this problem on this architecture, is desirable. Table 3 gives the runtimes of the code for the test problem without the use of a preconditioner. This required 124 nonlinear iterations for ten time steps, compared to the 107 iterations when using the line relaxation preconditioning.... ..."

### Table 1: Floating point performance characteristics of individual cores of modern, multi-core processor architectures. DGESV and SGESV are the LAPACK subroutines for dense system solution in double precision and single precision, respectively. Columns: Architecture, Clock, DP Peak, SP Peak, time(DGESV)/time(SGESV).

2007

"... In PAGE 2: ... When combined with the size of the register file of 128 registers, it is capable of delivering close to peak performance on many common computationally intensive workloads. Table 1 shows the difference in peak performance between single precision (SP) and double precision (DP) of four modern processor architectures; the last column reports the ratio between the time needed to solve a dense linear system in double and single precision by means of the LAPACK DGESV and SGESV respectively. Following the recent trend in chip design, all of the presented processors are multi-core architectures.... In PAGE 6: ... For the Cell processor (see Figures 7 and 8), parallel implementations of Algorithms 2 and 3 have been produced in order to exploit the full computational power of the processor. Due to the large difference between the single precision and double precision floating point units (see Table 1), the mixed precision solver performs up to 7 and 11 times faster than the double precision peak in the unsymmetric and symmetric, positive definite cases respectively. Implementation details for this case can be found in [7, 8].... ..."
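The mixed precision technique quoted above — factor and solve in fast single precision, then recover double precision accuracy by iterative refinement of the residual — can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the paper's Cell implementation; `mixed_precision_solve` and the test matrix are made up for the example, and `np.linalg.solve` stands in for SGESV/DGESV.

```python
import numpy as np

def mixed_precision_solve(A, b, tol=1e-12, max_iters=30):
    """Solve Ax = b: solves run in single precision, refinement in double."""
    A32 = A.astype(np.float32)
    # Initial solve in single precision (stands in for SGESV).
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iters):
        r = b - A @ x                       # residual computed in double precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        # Correction solve in single precision, accumulated in double.
        d = np.linalg.solve(A32, r.astype(np.float32))
        x += d.astype(np.float64)
    return x

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```

For well-conditioned systems a handful of refinement steps is enough to reach full double precision accuracy, which is why the speed of the single precision units dominates the runtime.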

### Table 3. Parameters for the simulated multi-core system.

"... In PAGE 7: ...1 Environment We use an execution-driven simulator that models multi-core systems with MESI coherence and support for hardware or hybrid TM systems. Table 3 summarizes the parameters for the simulated CMP architecture. All operations, except loads and stores, have a CPI of 1.... ..."

### Table 5.3 presents the computational cost of critical steps. The nonlinear system solution time consists of the calculation of the Jacobian (matrix assembly), application of the nonlinear operator, formation of the multilevel preconditioner, and the solution of the linear system. The linear system solution time is dominated by matrix-vector multiplication, application of the multilevel preconditioner (preconditioning), and orthogonalization of the Krylov subspace vectors. The nonlinear solve takes most of the time and its parallelization is fairly well achieved.
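The cost breakdown described here — matrix-vector products and preconditioner applications dominating the linear solve — can be made concrete by instrumenting a small Krylov solver to count its dominant kernels. The sketch below uses Jacobi-preconditioned conjugate gradients on a 1-D Laplacian purely for brevity; the paper's solver uses a multilevel preconditioner and Krylov orthogonalization, which this toy example does not attempt to reproduce.

```python
import numpy as np

def pcg(A, b, M_inv_diag, tol=1e-10, max_iters=500):
    """Jacobi-preconditioned conjugate gradients, counting the dominant kernels."""
    counts = {"matvec": 0, "precond": 0}
    x = np.zeros_like(b)
    r = b.copy()
    z = M_inv_diag * r; counts["precond"] += 1
    p = z.copy()
    rz = r @ z
    for _ in range(max_iters):
        Ap = A @ p; counts["matvec"] += 1
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = M_inv_diag * r; counts["precond"] += 1
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, counts

n = 100
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1-D Laplacian (SPD)
b = np.ones(n)
x, counts = pcg(A, b, 1.0 / np.diag(A))
print(counts)   # one matvec and one preconditioner application per iteration
```

The counters make the point of the table: each iteration costs exactly one matrix-vector product and one preconditioner application, so these two kernels set the parallel efficiency of the linear solve.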

1998

Cited by 6

### Table 1: Preconditioners

2008

"... In PAGE 7: ... It is stored in a consistent representation. See Table 1 for a list of available sequential preconditioners. In the list, AMG can be used in sequential and parallel mode.... ..."

Cited by 1

### Table 3: Preconditioners available from parallel iterative solver packages.

"... In PAGE 12: ... The last row of the table shows the other preconditioners the packages also implement, with notes about the preconditioning schemes. The six parallel preconditioners listed in Table 3 are: diagonal scaling (diag), SOR/SSOR [20], incomplete LU (in the symmetric case, incomplete Cholesky) factorization [33], block-Jacobi iteration [44], overlapping domain decomposition (ODD) [44], and approximate inverse (AI) [44]. Diagonal scaling can be performed by itself as a preconditioner, but more frequently, it is combined with another one.... ..."
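Diagonal scaling, the simplest preconditioner in the list, can be shown in a few lines of NumPy: symmetrically scale the system by the inverse square roots of the diagonal and the condition number of a badly scaled matrix drops sharply. The matrix construction below is invented for the example and is not from the survey.

```python
import numpy as np

# Diagonal scaling (Jacobi) preconditioning: replace Ax = b by the scaled
# system (D^{-1/2} A D^{-1/2}) y = D^{-1/2} b with x = D^{-1/2} y, D = diag(A).
rng = np.random.default_rng(1)
n = 50
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)                    # SPD matrix
S = np.diag(10.0 ** rng.uniform(-3, 3, n))     # wildly varying row/column scales
A = S @ A @ S                                  # badly scaled SPD system
D_inv_sqrt = 1.0 / np.sqrt(np.diag(A))
A_prec = A * np.outer(D_inv_sqrt, D_inv_sqrt)  # symmetrically scaled system
print(np.linalg.cond(A), np.linalg.cond(A_prec))
```

This also illustrates why diagonal scaling is usually combined with another preconditioner, as the excerpt notes: it removes scaling artifacts cheaply, but the remaining conditioning still reflects the underlying operator.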

### Table 1. Comparison of parallel prefix network architectures

2002

"... In PAGE 3: ... IV. RESULTS Table 1 compares the parallel prefix networks under consideration. The delay depends on the number of logic levels, the fanout, and the wire capacitance.... ..."
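The trade-off mentioned in the excerpt — logic levels versus fanout and wiring — comes from the combining pattern of the prefix network. A minimal software sketch of one such pattern, Kogge-Stone (log2(n) levels, fanout of two per node, but many long wires), is below; it illustrates the scheme in general, not any specific circuit compared in the paper's Table 1.

```python
import numpy as np

def kogge_stone_scan(x):
    """Inclusive prefix sum via the Kogge-Stone pattern: ceil(log2(n)) levels,
    each level combining every element with the one a fixed stride behind it."""
    y = np.array(x, dtype=np.int64)
    stride = 1
    while stride < len(y):
        # All combinations at one level are independent -> parallel in hardware.
        y[stride:] = y[stride:] + y[:-stride]
        stride *= 2
    return y

print(kogge_stone_scan([3, 1, 4, 1, 5, 9, 2, 6]))  # prefix sums 3 4 8 9 14 23 25 31
```

Other networks in this design space (e.g. Brent-Kung, Sklansky) trade more levels or higher fanout for fewer or shorter wires, which is exactly the delay comparison the table captures.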

Cited by 2

### Table 1: Architectural attributes of MIMD parallel computers

"... In PAGE 9: ...operational in November 1999. Table 1 gives the definitions and values of the radar parameters and Table 2 shows the processor parameters. Figure 1 shows the block diagram of the GeoSAR range-Doppler signal processing and Figure 2 gives the numbers of floating-point operations per input sample at each processing stage ... In PAGE 21: ... Because of its centralized shared memory, the UMA model may limit scalability once built. Table 1 compares the architectural attributes and performance of these five parallel architectures. ... ..."

### Table 1: Computer architecture and parallel programming tools

"... In PAGE 2: ... These tools, corresponding documentation, and educational material are being integrated into computer education curricula. Table 1 shows the set of computer architecture and parallel programming tools currently available on... In PAGE 4: ... Both interfaces are provided to NETCARE users via conventional Web browsers: the graphical X Window interface is handled via a Java VNC [16] browser applet, while the text-based interface uses standard HTML language. Many tools that can be used in computer architecture education are text-based (Table 1); often, these tools are configured via command-line parameters and files that are unfriendly to a novice user. PUNCH supports the definition of metaprograms that generate a dynamic HTML interface to these tools.... ..."

### Table 1: Simulation parameters for a single core architecture. These parameters remain the same for the multicore case except for the L2 cache which becomes a 4MB, 4-way set-associative, 4-bank cache shared among all cores.

"... In PAGE 3: ... On average, they accounted for approximately 70% of the execution time of each benchmark. Table 1 presents the simulation parameters for the architecture we explore in this paper. We include an 8K-entry gshare branch predictor in our model.... In PAGE 3: ... In addition to modeling all of the structures and latencies in the architecture, we have extended SimpleScalar to include a cycle-accurate, execution-driven model of chip multiprocessing (CMP) [5]. All the parameters used in our multicore experiments are the same as in Table 1 for each core, except that we increase the size of the L2 cache to a 4MB, 4-way set-associative cache shared among all cores. Per-thread performance metrics are measured for execution up to a maximum per-thread instruction count.... ..."