### Table 2: Parallel elapsed time for each linear algebra kernel involved in the numerical scheme.

in Incremental

"... In PAGE 14: ... We notice that this property is not necessarily true for sparse linear systems, where the cost of the incremental preconditioner might dominate even for small values of p, so that the preconditioner might not be effective if it does not significantly reduce the number of iterations. In Table 2, we report on the parallel elapsed time required by each linear algebra kernel involved in the solution scheme. For MISLRU the application time corresponds to the time to apply the preconditioner for the last linear system; that is, when the preconditioner is the most expensive.... ..."

### Table 1. MTL linear algebra operations.

1998

"... In PAGE 4: ...The MTL Generic Algorithms for Linear Algebra The Matrix Template Library provides a rich set of basic linear algebra operations, roughly equivalent to the Level-1, Level-2 and Level-3 BLAS. Table 1 lists the principal algorithms included in MTL. In the table, alpha and s are scalars, x,y,z are 1-D containers, A,B,C,E are row or column oriented matrices, U, L are upper and lower triangular matrices, and i is an iterator.... In PAGE 4: ... With BLAIS, the blocking sizes can be modified at compile time through a few global constants, so that the algorithms can be customized for the memory hierarchy of a particular architecture. Note that in Table 1 different operations are not defined for each permutation of transpose, scaling, and striding. Instead, only one algorithm is provided, but it can be combined with the use of strided and scaled vector adapters, or the trans() method, to create the permutations.... ..."

Cited by 7

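The adapter mechanism the excerpt describes — one generic algorithm combined with scaled or strided views, instead of a separate routine for each permutation — can be sketched as follows. This is a minimal illustration of the idea in Python; the class names `Scaled` and `Strided` are hypothetical and are not the actual MTL C++ API.

```python
# One generic algorithm (a dot product) reused for scaled and strided
# permutations via lightweight adapter views, in the spirit of MTL's
# scaled/strided vector adapters. Illustrative names, not the real API.

class Scaled:
    """View of a vector with every element multiplied by alpha."""
    def __init__(self, vec, alpha):
        self.vec, self.alpha = vec, alpha
    def __len__(self):
        return len(self.vec)
    def __getitem__(self, i):
        return self.alpha * self.vec[i]

class Strided:
    """View exposing every stride-th element of a vector."""
    def __init__(self, vec, stride):
        self.vec, self.stride = vec, stride
    def __len__(self):
        return (len(self.vec) + self.stride - 1) // self.stride
    def __getitem__(self, i):
        return self.vec[i * self.stride]

def dot(x, y):
    """The single generic algorithm; adapters supply the permutations."""
    return sum(x[i] * y[i] for i in range(min(len(x), len(y))))

x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 1.0, 1.0, 1.0]
print(dot(x, y))                       # plain dot:            10.0
print(dot(Scaled(x, 2.0), y))          # scaled dot:           20.0
print(dot(Strided(x, 2), [1.0, 1.0]))  # elements x[0], x[2]:   4.0
```

The point is that `dot` is written once; composing it with a view object yields each scaled or strided variant without duplicating the algorithm, which is what lets a table of "principal algorithms" stay small.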

### Table 4: Data representation and layout for dominating computations in linear algebra kernels ("*" represents a local axis, "b" represents a parallel axis).

"... In PAGE 9: ... Language constructs: Table 1. Data layout: Table 4 for library functions and Table 9 for application kernels. Local memory accesses, types of array allocation: Table 7 for linear algebra codes, and Table 13 for application codes.... In PAGE 12: ... We summarize some of the important properties of our implementations of the linear algebra benchmarks by means of three tables. Table 4 gives an overview of the data representation and layout for the dominating computations. Table 5 shows the communication operations used along with their associated array ranks.... ..."

### Table 2: Configuration at The University of Queensland.

For the development of numerically intensive applications, SGI provides high-performance scientific libraries collating a large number of (serial and parallel) routines that have been optimised for its architecture. These include: BLAS, LAPACK, NAG, FFT, etc. Compilers are provided that support pivotal programming languages: f77, f90, High Performance Fortran (HPF), C, C++. The environment is equipped with (semi-)automatic tools that can generate a parallel version of sequential code automatically: PFA (Power Fortran Accelerator) and PCA (Power C Accelerator). The operating system is IRIX, which is a derivative of UNIX System V. All experiments were undertaken in serial mode and the various codes were compiled with the optimizing switch at level 3 (i.e., -O3).

1996

"... In PAGE 9: ... The user can request a job to be executed on either of the systems or can simply hand over to the operating system, which will then automatically select the more suitable system depending on the workload and the availability of resources required by the job. Table 2 gives the particularities of their composition. Features not indicated in Table 2 are identical to the default settings as given in Table 1.... In PAGE 9: ... Table 2 gives the particularities of their composition. Features not indicated in Table 2 are identical to the default settings as given in Table 1. Each microprocessor chip is a MIPS R8000 superscalar RISC 90MHz 64-bit arithmetic chip.... ..."

Cited by 3


### Table 3: High School Non-linear Production Function

in Enhancing our Understanding of the Complexities of Education: "Knowledge Extraction from Data" using

"... In PAGE 18: ... The parsimonious polynomial models paint a somewhat different picture of the predictors of performance for Vermont schools. Table 3 displays the significant predictors of SAT performance. While it too selected parent level of education and school size as significant, the shape of the relationship is changed to a third-order polynomial for school size.... ..."
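As a hedged illustration of what fitting such a third-order polynomial term by least squares involves (synthetic data and a pure-stdlib normal-equations solve; this is not the authors' model, data, or software):

```python
# Least-squares fit of a cubic polynomial via the normal equations
# (V^T V) c = V^T y, solved with Gaussian elimination. Synthetic data;
# illustrative only, not the study's actual model.

def polyfit3(xs, ys):
    """Return coefficients [c0, c1, c2, c3] of the least-squares cubic."""
    m = 4
    # Normal-equations matrix and right-hand side for the Vandermonde system.
    G = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for k in range(m):
        p = max(range(k, m), key=lambda r: abs(G[r][k]))
        G[k], G[p] = G[p], G[k]
        b[k], b[p] = b[p], b[k]
        for r in range(k + 1, m):
            f = G[r][k] / G[k][k]
            for c in range(k, m):
                G[r][c] -= f * G[k][c]
            b[r] -= f * b[k]
    # Back substitution.
    c = [0.0] * m
    for k in range(m - 1, -1, -1):
        c[k] = (b[k] - sum(G[k][j] * c[j] for j in range(k + 1, m))) / G[k][k]
    return c

# Sanity check on exact points from y = 1 + 2x - x^2 + 0.5x^3:
# the fit recovers [1.0, 2.0, -1.0, 0.5] up to rounding.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1 + 2 * x - x ** 2 + 0.5 * x ** 3 for x in xs]
coeffs = polyfit3(xs, ys)
```

A production analysis would use a library routine with better numerical behavior than explicit normal equations, but the structure — powers of the predictor entering a linear least-squares problem — is the same.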

### Table 1: Performance in MFlops of GEMM on shared memory multiprocessors using 512-by-512 matrices.

We have shown that the use of parallel kernels provides high performance while maintaining portability. We intend to pursue this activity in the future on most of the parallel architectures to which we have access. The ALLIANT FX/2800 provides a good opportunity for validating these ideas, and we intend to implement a version of the Level 3 BLAS based on our package on that machine.

"... In PAGE 5: ... Finally, these codes have been used as a platform for the implementation of the uniprocessor version of the Level 3 BLAS on the BBN TC2000 (see next Section). We show in Table 1 the MFlops rates of the parallel matrix-matrix multiplication, and in Table 2 the performance of the LU factorization (we use a blocked code similar to the LAPACK one) on the ALLIANT FX/80, the CRAY-2, and the IBM 3090-600J obtained using our parallel version of the Level 3 BLAS. Note that our parallel Level 3 BLAS uses the serial manufacturer-supplied versions of GEMM on all the computers.... In PAGE 6: ... This package is available without payment and will be sent to anyone who is interested. We show in Table 1 the performance of the single and double precision GEMM on different numbers of processors. Table 2 shows the performance of the LAPACK codes corresponding to the blocked LU factorization (GETRF, right-looking variant), and the blocked Cholesky factorization (POTRF, top-looking variant).... In PAGE 8: ... The second part concerned the performance we obtained with tuning and parallelizing these codes, and by introducing library kernels. We give in Table 1 a brief summary of the results we have obtained. One of the most important points to mention here is the great impact of the use of basic linear algebra kernels (Level 3 BLAS) and the LAPACK library. The conclusion involves recommendations for a methodology for both porting and developing codes on parallel computers, performance analysis of the target computers, and some comments relating to the numerical algorithms encountered.... In PAGE 12: ... Because of the depth-first search order, the contribution blocks required to build a new frontal matrix are always at the top of the stack. The minimum size of the LU area (see column 5 of Table 1) is computed during the symbolic factorization step. The comparison between columns 4 and 5 of Table 1 shows that the size of the LU area is only slightly larger than the space required to store the LU factors.... In PAGE 12: ... The minimum size of the LU area (see column 5 of Table 1) is computed during the symbolic factorization step. The comparison between columns 4 and 5 of Table 1 shows that the size of the LU area is only slightly larger than the space required to store the LU factors. Frontal matrices are stored in a part of the global working space that will be referred to as the additional space.... In PAGE 12: ... In a uniprocessor environment, only one active frontal matrix need be stored at a time. Therefore, the minimum real space (see column 7 of Table 1) to run the numerical factorization is the sum of the LU area, the space to store the largest frontal matrix, and the space to store the original matrix.... In PAGE 13: ... In this case the size of the LU area can be increased using a user-selectable parameter. On our largest matrix (BBMAT), by increasing the space required to run the factorization (see column 7 in Table 1) by less than 15 percent from the minimum, we could handle the fill-in due to numerical pivoting and run efficiently in a multiprocessor environment. We reached 1149 Mflops during numerical factorization with a speed-up of 4.... In PAGE 14: ... Interleaving and cacheability are also used for all shared data. Note that, to prevent cache inconsistency problems, cache flush instructions must be inserted in the code. We show, in Table 1, timings obtained for the numerical factorization of a medium-size (3948 x 3948) sparse matrix from the Harwell-Boeing set [3]. The minimum degree ordering is used during analysis.... In PAGE 14: ... In rows (1) we exploit only parallelism from the tree; in rows (2) we combine the two levels of parallelism. As expected, we first notice, in Table 1, that version 1 is much faster than version 2... In PAGE 15: ... Results obtained on version 3 clearly illustrate the gain coming from the modifications of the code, both in terms of speed-up and performance. Furthermore, when only parallelism from the elimination tree is used (see rows (1) in Table 1), all frontal matrices can be allocated in the private area of memory. Most operations are then performed from the private memory and we obtain speedups comparable to those obtained on shared memory computers with the same number of processors [1].... In PAGE 15: ... We finally notice, in Table 1, that although the second level of parallelism nicely supplements that from the elimination tree, it does not provide all the parallelism that could be expected [1]. The second level of parallelism can even introduce a small slowdown on a small number of processors, as shown in column 3 of Table 1. The main reason is that frontal matrices must be allocated in the shared area when the second level of parallelism is enabled.... In PAGE 17: ... Table 1: Results in Megaflops on parallel computers. In Table 1, it can be seen that the performance of the program on the Alliant FX/80 in double precision is better than the performance of the single precision version.... In PAGE 17: ... Table 1: Results in Megaflops on parallel computers. In Table 1, it can be seen that the performance of the program on the Alliant FX/80 in double precision is better than the performance of the single precision version. The reason for this is that the single precision mathematical library routines are less optimized.... In PAGE 18: ... The block diagonal preconditioner appears to be very suitable and is superior to the Arnoldi-Chebyshev method. Table 1 shows the results of the computation on an Alliant FX/80 of the eight eigenpairs with largest real parts of a random sparse matrix of order 1000. The nonzero off-diagonal and the full diagonal entries are in the range [-1,+1] and [0,20] respectively.... In PAGE 19: ... A comparison with the block preconditioned conjugate gradient is presently being investigated. In Table 1, we compare three partitioning strategies of the number of right-hand sides for solving the system of equations M^-1 A X = M^-1 B, where A is the matrix BCSSTK27 from the Harwell-Boeing collection, B is a rectangular matrix with 16 columns, and M is the ILU(0) preconditioner.... In PAGE 26: ... Table 1: Results on matrices of bandwidth 9.... In PAGE 30: ... We call "global approach" the use of a direct solver on the entire linear system at each outer iteration, and we want to compare it with the use of our mixed solver, in the case of a simple splitting into 2 subdomains. We show the timings (in seconds) in Table 1 on 1 processor and in Table 2 on 2 processors, for the following operations: construction & assembly (14% of the elapsed time), local factorization (Dirichlet+Neumann, 23%), substitution/pcg (iterations of the PCG on the Schur complement, 55%), and the total time. The same code is used for the global direct solver and the local direct solvers, which takes advantage of the block-tridiagonal structure due to the privileged direction. Moreover, there has been no special effort for parallelizing the mono-domain approach.... ..."
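The right-looking blocked LU factorization mentioned above (the GETRF variant) can be sketched as follows. This is a minimal no-pivoting illustration in pure Python, not the LAPACK code; a real implementation would cast the block-row solve and the trailing update as Level 3 BLAS calls (TRSM and GEMM), which is where the performance reported in the excerpt comes from.

```python
# Right-looking blocked LU factorization, A = L*U with unit lower L,
# stored in place (multipliers below the diagonal, U on and above).
# No pivoting, pure Python on lists of lists; illustrative only.

def lu_blocked(A, nb):
    """Factor the n-by-n matrix A in place with block size nb."""
    n = len(A)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # 1. Factor the tall panel A[k:n, k:k+kb] with unblocked LU.
        for j in range(k, k + kb):
            for i in range(j + 1, n):
                A[i][j] /= A[j][j]                 # multiplier l_ij
                for c in range(j + 1, k + kb):     # update within panel
                    A[i][c] -= A[i][j] * A[j][c]
        # 2. Block-row solve: A12 <- L11^{-1} * A12 (unit lower triangular),
        #    the step a tuned code performs with TRSM.
        for j in range(k + kb, n):
            for i in range(k, k + kb):
                for r in range(k, i):
                    A[i][j] -= A[i][r] * A[r][j]
        # 3. Trailing update A22 -= L21 * U12, the GEMM-rich part that
        #    dominates the flop count for large n.
        for i in range(k + kb, n):
            for j in range(k + kb, n):
                for r in range(k, k + kb):
                    A[i][j] -= A[i][r] * A[r][j]
    return A

A = [[4.0, 3.0], [6.0, 3.0]]
lu_blocked(A, 1)
print(A)  # [[4.0, 3.0], [1.5, -1.5]]: L = [[1,0],[1.5,1]], U = [[4,3],[0,-1.5]]
```

Because step 3 touches the whole trailing submatrix with matrix-matrix work, larger block sizes shift the arithmetic into GEMM, matching the excerpt's point about the impact of Level 3 BLAS kernels.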

### Table 2: Properties of the MPEG-4 decoding tasks that determine the mapping onto HW or SW.

2002

"... In PAGE 8: ....1. Processing characteristics To come to a partitioning of the functionality in HW and SW, we separate the MPEG-4 decoder functions into tasks with self-contained functionality and clear interfaces. Table 2 shows an overview of these tasks, including their processing characteristics. As explained in Section 2.... In PAGE 10: ... Figure 5: Block diagram of the Video Object Plane decoder, including data transport bandwidths (MByte/s). VLD: Variable Length Decoder; BAB: Binary Alpha Block; CAD: Context-based Arithmetic Decoder. The bottom row in Table 2 represents the rendering and composition of all objects into the final scene. The first part of this functionality comprises a BIFS browser, analogous to a VRML browser.... ..."

Cited by 5