Results 1 - 10 of 53,993

Table 1 Performance in MFlops of parallel matrix-matrix multiplication on the BBN TC2000 using 1024-by-1024 matrices.

in CERFACS Team "Parallel Algorithm" Scientific Report for 1991
by Rt For
"... In PAGE 5: ... Finally, these codes have been used as a platform for the implementation of the uniprocessor version of Level 3 BLAS on the BBN TC2000 (see next Section). We show in Table1 the MFlops rates of the parallel matrix-matrix multiplication, and in Table 2 the performance of the LU factorization (we use a blocked code similar to the LAPACK one) on the ALLIANT FX/80, the CRAY-2, and the IBM 3090-600J obtained using our parallel version of the Level 3 BLAS. Note that our parallel Level 3 BLAS uses the serial manufacturer-supplied versions of GEMM on all the computers.... In PAGE 6: ... This package is available without payment and will be sent to anyone who is interested. We show in Table1 the performance of the single and double precision GEMM on di erent numbers of processors. Table 2 shows the performance of the LAPACK codes corresponding to the blocked LU factorization (GETRF, right-looking variant), and the blocked Cholesky factorization (POTRF, top-looking variant).... In PAGE 8: ... The second part concerned the performance we obtained with tuning and parallelizing these codes, and by introducing library kernels. We give in Table1 a brief summary of the results we have obtained: One of the most important points to mention here is the great impact of the use of basic linear algebra kernels (Level 3 BLAS) and the LAPACK library. The conclusion involves recommendations for a methodology for both porting and developing codes on parallel computers, performance analysis of the target computers, and some comments relating to the numerical algorithms encountered.... In PAGE 12: ... Because of the depth rst search order, the contribution blocks required to build a new frontal matrix are always at the top of the stack. The minimum size of the LU area (see column 5 of Table1 ) is computed during during the symbolic factorization step. The comparison between columns 4 and 5 of Table 1 shows that the size of the LU area is only slightly larger than the space required to store the LU factors.... In PAGE 12: ... The minimum size of the LU area (see column 5 of Table 1) is computed during during the symbolic factorization step. The comparison between columns 4 and 5 of Table1 shows that the size of the LU area is only slightly larger than the space required to store the LU factors. Frontal matrices are stored in a part of the global working space that will be referred to as the additional space.... In PAGE 12: ... In a uniprocessor environment, only one active frontal matrix need be stored at a time. Therefore, the minimum real space (see column 7 of Table1 ) to run the numerical factorization is the sum of the LU area, the space to store the largest frontal matrix and the space to store the original matrix. Matrix Order Nb of nonzeros in Min.... In PAGE 13: ... In this case the size of the LU area can be increased using a user-selectable parameter. On our largest matrix (BBMAT), by increasing the space required to run the factorization (see column 7 in Table1 ) by less than 15 percent from the minimum, we could handle the ll-in due to numerical pivoting and run e ciently in a multiprocessor environment. We reached 1149 M ops during numerical factorization with a speed-up of 4.... In PAGE 14: ...ack after computation. Interleaving and cachability are also used for all shared data. Note that, to prevent cache inconsistency problems, cache ush instructions must be inserted in the code. 
We show, in Table1 , timings obtained for the numerical factorization of a medium- size (3948 3948) sparse matrix from the Harwell-Boeing set [3]. The minimum degree ordering is used during analysis.... In PAGE 14: ... -in rows (1) we exploit only parallelism from the tree; -in rows (2) we combine the two levels of parallelism. As expected, we rst notice, in Table1 , that version 1 is much faster than version 2... In PAGE 15: ... Results obtained on version 3 clearly illustrate the gain coming from the modi cations of the code both in terms of speed-up and performance. Furthermore, when only parallelism from the elimination tree is used (see rows (1) in Table1 ) all frontal matrices can be allocated in the private area of memory. Most operations are then performed from the private memory and we obtain speedups comparable to those obtained on shared memory computers with the same number of processors [1].... In PAGE 15: ... Most operations are then performed from the private memory and we obtain speedups comparable to those obtained on shared memory computers with the same number of processors [1]. We nally notice, in Table1 , that although the second level of parallelism nicely supplements that from the elimination tree it does not provide all the parallelism that could be expected [1]. The second level of parallelism can even introduce a small speed down on a small number of processors as shown in column 3 of Table 1.... In PAGE 15: ... We nally notice, in Table 1, that although the second level of parallelism nicely supplements that from the elimination tree it does not provide all the parallelism that could be expected [1]. The second level of parallelism can even introduce a small speed down on a small number of processors as shown in column 3 of Table1 . The main reason is that frontal matrices must be allocated in the shared area when the second level of parallelism is enabled.... In PAGE 17: ...5 28.2 Table1 : Results in Mega ops on parallel computers. In Table 1, it can be seen that the performance of the program on the Alliant FX/80 in double precision is better than the performance of the single precision ver- sion.... In PAGE 17: ...2 Table 1: Results in Mega ops on parallel computers. In Table1 , it can be seen that the performance of the program on the Alliant FX/80 in double precision is better than the performance of the single precision ver- sion. The reason for this is that the single precision mathematical library routines are less optimized.... In PAGE 18: ... block diagonal) preconditioner appears to be very suitable and is superior to the Arnoldi-Chebyshev method. Table1 shows the results of the computation on an Alliant FX/80 of the eight eigenpairs with largest real parts of a random sparse matrix of order 1000. The nonzero o -diagonal and the full diagonal entries are in the range [-1,+1] and [0,20] respectively.... In PAGE 19: ... A comparison with the block preconditioned conjugate gradient is presently being investigated.In Table1 , we compare three partitioning strategies of the number of right-hand sides for solving the system of equations M?1AX = M?1B, where A is the ma- trix BCSSTK27 from Harwell-Boeing collection, B is a rectangular matrix with 16 columns, and M is the ILU(0) preconditioner. Method 1 2 3 1 block.... In PAGE 26: ...111 2000 lapack code 0.559 Table1 : Results on matrices of bandwith 9.... In PAGE 30: ... 
We call \global approach quot; the use of a direct solver on the entire linear system at each outer iteration, and we want to compare it with the use of our mixed solver, in the case of a simple splitting into 2 subdomains. We show the timings (in seconds) in Table1 on 1 processor and in Table 2 on 2 processors, for the following operations : construction amp; assembly : Construction and Assembly, 14% of the elapsed time, factorization : Local Factorization (Dirichlet+Neumann), 23%, substitution/pcg : Iterations of the PCG on Schur complement, 55%, total time The same code is used for the global direct solver and the local direct solvers, which takes advantage of the block-tridiagonal structure due to the privileged direction. Moreover, there has been no special e ort for parallelizing the mono-domain approach.... ..."
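The blocked right-looking LU mentioned above (GETRF-style) spends almost all of its time in a single trailing-submatrix GEMM, which is why a parallel Level 3 BLAS pays off so directly. Below is a minimal sketch of that structure, not the code evaluated in the report: pivoting is omitted, the matrix is row-major, a CBLAS implementation (e.g. OpenBLAS) is assumed to provide cblas_dtrsm and cblas_dgemm, and blocked_lu and its arguments are illustrative names.

```c
/* Minimal sketch: right-looking blocked LU without pivoting, showing how the
 * trailing update is cast as Level 3 BLAS calls (TRSM + GEMM), which is where
 * a parallel Level 3 BLAS is exploited.  Link against a CBLAS library. */
#include <cblas.h>

/* Factor the n x n row-major matrix A (leading dimension lda) in place,
 * A = L*U with unit-diagonal L, using block size nb.  Illustration only. */
void blocked_lu(double *A, int n, int lda, int nb)
{
    for (int k = 0; k < n; k += nb) {
        int kb = (n - k < nb) ? n - k : nb;

        /* Unblocked LU of the diagonal block A[k:k+kb, k:k+kb]. */
        for (int j = k; j < k + kb; j++)
            for (int i = j + 1; i < k + kb; i++) {
                A[i*lda + j] /= A[j*lda + j];
                for (int c = j + 1; c < k + kb; c++)
                    A[i*lda + c] -= A[i*lda + j] * A[j*lda + c];
            }

        if (k + kb < n) {
            int m = n - k - kb;
            /* U12 = L11^{-1} * A12  (triangular solve from the left). */
            cblas_dtrsm(CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans,
                        CblasUnit, kb, m, 1.0,
                        &A[k*lda + k], lda, &A[k*lda + k + kb], lda);
            /* L21 = A21 * U11^{-1}  (triangular solve from the right). */
            cblas_dtrsm(CblasRowMajor, CblasRight, CblasUpper, CblasNoTrans,
                        CblasNonUnit, m, kb, 1.0,
                        &A[k*lda + k], lda, &A[(k + kb)*lda + k], lda);
            /* Trailing update A22 -= L21 * U12: the dominant GEMM. */
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        m, m, kb, -1.0,
                        &A[(k + kb)*lda + k], lda,
                        &A[k*lda + k + kb], lda,
                        1.0, &A[(k + kb)*lda + k + kb], lda);
        }
    }
}
```

With a threaded or parallel GEMM, the trailing update dominates the cost, which matches the role the report assigns to the Level 3 BLAS kernels.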

Table 2. Overhead for a virtual memory manager.

in A Case for Multi-Level Main Memory
by Magnus Ekman, Per Stenstrom 2004
"... In PAGE 8: ... Finally, the pageout thread writes the pages marked for evic- tion to disk. Table2 shows where the time is spent in these three parts in the existing virtual memory manager. The second column describes how many instructions it takes to run one loop in each part.... ..."
Cited by 1

Table 11: Performance summary of the multifrontal factorization on matrix BCSSTK15. In columns (1) we exploit only parallelism from the tree; in columns (2) we combine the two levels of parallelism. MUPS is designed for shared memory multiprocessors but it was not too difficult to develop a version that could run on the virtual shared memory environment of the BBN TC2000 ([4]). However, because access to memory on this machine is not uniform (remote access is not cached by default and takes about 3.5 times the time of local access for a read, and 3.0 for a write), the performance of the shared memory code was not good. It is necessary to pay more attention to data locality in order to get acceptable performance and we

in Porting Industrial Codes and Developing Sparse Linear Solvers on Parallel Computers
by Michel J. Daydé, Iain S. Duff
"... In PAGE 17: ... This is done simply by using parallel versions of the BLAS in the factorizations within the nodes. When we combine both tree and node parallelism the situation becomes much more encouraging and we show typical speed-ups for a range of computers in Table11 . A medium size sparse matrix, BCSSTK15 from the Harwell-Boeing set ([19]), is used to illustrate our discussion.... ..."

Table 6: Various remote memory access time measurements (µs) in the presence of the hot spot memory on the TC2000.

in Comparative Performance Evaluation of Hot Spot Contention Between MIN-based and Ring-based Shared-Memory Architectures
by Xiaodong Zhang, Yong Yan, Robert Castaneda 1995
"... In PAGE 25: ... There was one remote hot memory module, 11 remote cool memory modules. Table6 gives the measured access times on the TC2000. When there is a hot spot memory, remote accesses to cool memory were slowed down more than 3 to 4 times for all types of hot spot experiments.... ..."
Cited by 8

Table 3: Speed-up and efficiency of the PCG shared memory-like implementation on the TC2000

in Shared and Distributed Implementations of Block Preconditioned Conjugate Gradient Using Domain Decomposition on a Distributed Virtual Shared Memory Computer
by Luc Giraud

Table 16: Speed-up and efficiency of the basic operations involved in parallel PCG and PCGS on an RS/6000 network. The poor efficiencies of both dot-product and matrix-vector product show the worst observed performance and illustrate the requirement of both coarse grain parallelism and long messages for implementations on networks of workstations (a distributed dot-product sketch follows this entry). This constraint is more limiting when the CPU speed and communication speed are very unbalanced. All experiments shown in the previous tables have been done with very low traffic on the local area network. Of course, an increase in traffic would have a negative impact on the performance, further aggravating the imbalance between communication and computation because of Ethernet contention. 6 Concluding remarks: On the TC2000 the advantage of using the virtual shared memory is that it provides a convenient way for porting codes initially developed on shared memory architectures. Tuning the code to the characteristics of the memory hierarchy does not require too expensive an amount

in Shared and Distributed Implementations of Block Preconditioned Conjugate Gradient Using Domain Decomposition on a Distributed Virtual Shared Memory Computer
by Luc Giraud
"... In PAGE 27: ...47 (0.56) Table 15 : Speed-up and e ciency of both Jacobi PCG and PCGS computation part using P4 on a RS/6000 network Table16 shows the e ciencies of the basic operations used by conjugate gradient-like meth- ods, establishing once again the better performance of PCGS in comparison with PCG.... ..."

Table 5: Performance in Mflops for multifrontal factorisation phase on the BBN TC2000 (matrix BCSSTK15). 4 represents many refinements to this, including explicit copying of data to local memory so that it can be cached and effectively reused during the computation, so reducing the amount of remote access. The results in Table 3 show quite clearly that it is still vital to respect data locality when using virtual shared memory. If one does so, then good performance can be obtained.

in SPARSE LINEAR ALGEBRA in and around the APO-ENSEEIHT-IRIT group
by Algorithmique Parallèle

Table 2: Comparison between block Jacobi and block SSOR PCG methods on one processor of the TC2000

in Shared and Distributed Implementations of Block Preconditioned Conjugate Gradient Using Domain Decomposition on a Distributed Virtual Shared Memory Computer
by Luc Giraud
"... In PAGE 17: ... Furthermore, the Jacobi apos;s matrix is invertible in a natural parallel way. This natural parallelism can also be provided by the red-black ordering in the case of the SSOR preconditioning, denoted SSOR-RB in Table2 . The number of iterations and the sequential computation time on the TC2000 are shown in Table 2 for several sizes of the problem using the most e cient pre/post solver for each value of h.... In PAGE 17: ... This natural parallelism can also be provided by the red-black ordering in the case of the SSOR preconditioning, denoted SSOR-RB in Table 2. The number of iterations and the sequential computation time on the TC2000 are shown in Table2 for several sizes of the problem using the most e cient pre/post solver for each value of h. We note that the relaxation parameters ! involved in the S.... In PAGE 18: ...Table 2 : Comparison between block Jacobi and block SSOR PCG methods on one processor of the TC2000 Table2 shows that the Jacobi apos;s preconditioning provides an interesting compromise between the performance and the granularity of the parallel algorithm. Its sequential performance is not so bad and the degree of parallelism can be easily increased without decreasing the granularity of the tasks.... ..."

Table 2. Summary of target system architectures (column headings: Feature, Sequent Symmetry, BBN TC2000)

in MAD Kernels: An Experimental Testbed to Study Multiprocessor Memory System Behavior
by Arun K. Nanda, Lionel M. Ni 1992
"... In PAGE 13: ... 4 Target Systems We have used the framework described in Section 2 to measure the memory access performance of two multiprocessors. Table2 summarizes their architectural features and the measured charac- teristic times tc and tm (ts = 0 since gs = in our experiments). Note that on the TC2000, tm depends on whether the access is local or remote to a processor/memory node.... ..."

Table 1: Virtual Memory Segments Assignment

in Background Memory Management for Dynamic Data Structure Intensive Processing Systems
by Gjalt de Jong, Bill Lin, Carl Verdonck, Sven Wuytack, Francky Catthoor 1995
"... In PAGE 5: ...1, we have assigned to each dynamic data type a separate virtual memory segment and manager. The sizing of the segments is shown in Table1 ; frame size is the size of one element of each data structure; #frames is the maximal number of ele- ments for each data type; segment size is the total required memory size. The sizes of the LID and the MID tables are fixed for the application as they reflect the number of users connected to this ATM server, respectively the maximal number of multiplexed messages.... In PAGE 5: ... The sizing of the other data types, which are really used dynamically, reflects the net- work characteristics and can be determined statistically us- ing profiling information. The figures in Table1 reflect the choice that the SPP must work without loss for a peak per- formance of a continuous rate of single cell messages with a maximal lifetime of about 200ms for each message. Free lists are used for the packet and routingrecord managers; the cell FIFO is indeed managed as a FIFO, i.... ..."
Cited by 12
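As an illustration of the segment sizing and free-list management described above, here is a hedged sketch in C; segment_t, seg_init, seg_alloc, and seg_free are invented names, and a real design would back each segment with its own virtual memory manager rather than a single malloc.

```c
/* Hedged sketch (not the paper's implementation): fixed-size free-list
 * management.  Each dynamic data type gets its own segment of
 * frame_size * n_frames bytes, and freed frames are recycled through a
 * singly linked free list in O(1). */
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    size_t frame_size;   /* size of one element of the data structure */
    size_t n_frames;     /* maximal number of elements of this type   */
    char  *base;         /* the segment: frame_size * n_frames bytes  */
    void  *free_list;    /* head of the list of free frames           */
} segment_t;

static int seg_init(segment_t *s, size_t frame_size, size_t n_frames)
{
    /* Round the frame up so it can hold a pointer and stays pointer-aligned. */
    size_t a = sizeof(void *);
    s->frame_size = ((frame_size + a - 1) / a) * a;
    if (s->frame_size == 0) s->frame_size = a;
    s->n_frames = n_frames;
    s->base = malloc(s->frame_size * n_frames);   /* the "segment size" */
    if (s->base == NULL) return -1;
    s->free_list = NULL;
    for (size_t i = 0; i < n_frames; i++) {       /* thread every frame */
        void *f = s->base + i * s->frame_size;
        *(void **)f = s->free_list;
        s->free_list = f;
    }
    return 0;
}

static void *seg_alloc(segment_t *s)              /* pop a free frame, O(1) */
{
    void *f = s->free_list;
    if (f != NULL) s->free_list = *(void **)f;
    return f;                                     /* NULL when the segment is full */
}

static void seg_free(segment_t *s, void *f)       /* push the frame back, O(1) */
{
    *(void **)f = s->free_list;
    s->free_list = f;
}
```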