### Table 4: Matrix Multiply

"... In PAGE 6: ... To get a feeling for the overall cost of the new instrumentation, we instrumented a simple multithreaded application (matrix multiply with 150 lines of C code) and compared it with the cost of instrumenting a sequential version of the same algorithm by the non-threaded Paradyn. Table 4 shows elapsed times of the two versions repeatedly multiplying two 500x500 matrices of floating point numbers. In Table 4, we measure CPU time (inclusive) for the whole program, procedure call frequency and CPU time (inclusive) for the function innerp. The procedure call frequency of innerp is about 3,500 calls/second on the uniprocessor, and about 10,000 calls/second for the multithreaded version on the multiprocessor. ..."

### Table 3: Matrix Multiply

"... In PAGE 10: ... the same algorithm by the non-threaded Paradyn. Table 3 shows elapsed times of the two versions repeatedly multiplying two 500x500 matrices of floating point numbers. In Table 3, we measure CPU time (inclusive) for the whole program, procedure call frequency and CPU time (inclusive) for the function innerp. The procedure call frequency of innerp is about 3,500 calls/second on the uniprocessor, and about 10,000 calls/second for the multithreaded version on the multiprocessor. ..."

### Table VI. Performance of Matrix Multiply

1994

Cited by 20

### Table 4: Performance of Optimized Matrix Multiply (sec.)

1991

"... In PAGE 10: ... We also chose write-shared because it supports multiple writers and fine-grained sharing. The execution times for the unoptimized version of Matrix Multiply (see Table 4) and SOR, for the previous problem sizes and for 16 processors, are presented in Table 6. For Matrix Multiply, the use of result and ... ..."

Cited by 558


### Table 4 Performance of Optimized Matrix Multiply (sec.)

1991

"... In PAGE 10: ... We also chose write-shared because it supports multiple writers and fine-grained sharing. The execution times for the unoptimized version of Matrix Multiply (see Table 4) and SOR, for the previous problem sizes and for 16 processors, are presented in Table 6. For Matrix Multiply, the use of result and read only sped up the time required to load the input matrices and later purge the output matrix back to the root node and resulted in a 4... ..."

Cited by 558


### Table 3 Performance of Matrix Multiply (sec.)

1991

"... In PAGE 12: ... Since the output matrix is a result object, Munin sends the modifications only to the owner (the node where the root thread is executing), and invalidates the local copy. Table 3 gives the execution times of both the Munin and the message passing implementations for multiplying two 400x400 matrices. The System time represents the time spent executing Munin code on the root node, while the User time is that spent executing user code. ..."

Cited by 558
