Results 1 - 10 of 100,989
Table 2. Clock speed, on-chip and off-chip cache sizes, peak floating point and LINPACK performance of the processors.
"... In PAGE 6: ... In this study we consider two di erent shared memory architectures: parallel vector supercomputers, a 12-CPU NEC SX-4 and a 16-CPU Cray J90, and two multiprocessor servers, an 8-CPU DEC AlphaServer 8400 and an 8-CPU SGI Origin 2000. Table2 summarizes the characteristics of the individual processors. Table 2.... In PAGE 8: ... The speedups are computed with respect to a very e cient serial implementation of the supernode left-looking algorithm. To calibrate our speedup gures with respect to the peak performance, we compare the oating point performance of our implementation on a single processor with the single processor performance of the LINPACK benchmark in Table2 . Although the solver is designed for matrices with a very sparse structure, it delivers 160 M op/s on the Cray J90 or 81 % of the LINPACK performance.... In PAGE 8: ... The irregularity of the grids leads to more complicated structures of the linear systems that must be solved during the simulation. The machines are shared memory multiprocessors and the processor characteristics are summarized in Table2 . We used an 32-processor SGI Origin 2000 with six- teen Gbytes of main memory, and an eight-processor DEC AlphaServer with two Gbytes of main memory.... ..."
Table 5. Code size and hashing speeds of the different compression functions on a 90 MHz Pentium both for our Assembly implementations and a corresponding portable C implementation (Watcom C 10.0). The code size only refers to the Assembly implementations. Again both code and data are assumed to reside in the on-chip caches. The figures are independent of the buffer size as long as it, together with the local data, fits in the 8-Kbyte on-chip cache.
1996
"... In PAGE 4: ...g., Table5 ), and will never be larger than 8K. This means that it can be kept in the on-chip cache of most processors, leading to faster execution of the code from the second iteration onwards.... In PAGE 12: ... The speed-up factor is with respect to a (hypothetical) execution of the same code on a non-parallel architecture under otherwise unchanged conditions. The bandwidth figures of Table5 , obtained from actual timings, correspond exactly with the cycle figures of Table 4, if one allows for a few cycles overhead. A portable C implementation is, on average, twice as slow.... ..."
Cited by 35
Table 8: The Results from Profiling AES on-chip with SnoopP for Both 2 and 400 Keys.
2004
"... In PAGE 8: ... Table 7 summarizes what functions are chosen for pro ling and number of instructions comprising each. Table8 summarizes the results obtained using SnoopP to pro le AES when only two keys are used and when four hun- dred keys are used. There is only one column of results as there is no change in any of the values when the number of keys is increased, reinforcing the fact that the on-chip pro-... ..."
Cited by 7
Table 3: Weight discretization in multilayer neural networks: on-chip learning. ... by allowing a dynamic rescaling of the weights (and hence the weight range) by adapting the gain of the activation function. The calculation of an activation value a_j in a multilayer network is then done as follows:
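The formula itself falls outside the excerpt; a standard form of the computation the caption refers to, writing the adaptable gain explicitly as \(g_j\) (this notation is an assumption, not taken from the paper), is
\[
a_j = F\Big(g_j \sum_i w_{ij}\, a_i\Big),
\]
where \(F\) is the activation function, so that rescaling the weights \(w_{ij}\) by a common factor can be compensated by adapting the gain \(g_j\).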
"... In PAGE 5: ... This means in speci c that at least the weight values are represented with only a limited precision. Simulations have shown that the popular backpropagation algorithm (see for example [Rumelhart-86]) is highly sensitive to the use of limited precision weights and that training fails when the weight accuracy is lower than 16 bits ( rst two references in Table3 ). This is mainly because the weight updates are often smaller than the quantization step which prevents the weights from changing.... In PAGE 5: ... In order to reduce the chip area needed for weight storage and to overcome system noise, a further reduction of the number of allowed weight values is desirable. Several weight discretization algorithms have therefore been designed and an extensive list of them and the attainable reduction in required precision is given in Table3 . Some of these weight discretization algorithms have already proven their usefulness in hardware implementations.... ..."
Table 3. Performance figures on a Pentium for the improved implementations of the compression function of the 6 members of the MD4 hash function family. Both code and data are assumed to reside in the on-chip caches. All figures are independent of the processor's clock speed. The speed-up factor is with respect to a (hypothetical) execution of the same code on a non-parallel architecture under otherwise unchanged conditions.
1997
"... In PAGE 1: ... Miraculously, this turns out to be the case, as illustrated in Table 2 for a round 1 step of MD5, updating [BGV96, Table 3]. Table3 is the updated version of [BGV96, Table 4]. All implementations now only use 1-cycle instruc- tions, except for SHA-1 that uses the bswap instruction taking an additional cycle to decode due to the 0Fx-pre x.... ..."
Cited by 4
Table 8: Performance of Large Benchmarks on UltraSPARC-II (measured using on-chip counters)
1998
"... In PAGE 13: ... The number of stalls caused in the processor due to instruction cache misses, branch mispredicts, store bu er full conditions and data loading delays are also measured to analyze the bottleneck. The statistics obtained using perf-monitor while running the programs are shown in Table8 . The rst column in Table 8 lists the cycles per instruction(CPI) for the di erent programs.... In PAGE 13: ....57 compared to 1.17 for the C programs. The second column in Table8 lists the percentage of the instruction cache misses. It is expected that C++ programs would have a higher miss rate because of inferior locality due to its more distributed control and larger number of function calls.... In PAGE 13: ... In general, the instruction cache miss ratio is worse for C++ programs compared to C programs. The third column in Table8 lists the miss rate for the data cache for the di erent programs. From the harmonic mean it can be seen that C programs have a slightly higher miss rate than C++ programs.... ..."
Cited by 4
Table 2: Hardware complexity of the RPU prototype with a chunk size of 4, 32 hardware threads, and 512 entries for each direct mapped cache. For each unit the number of specific floating point units and the required on-chip memory is given. Dual ported memory bits are counted twice.
2005
"... In PAGE 7: ...ith them. This simple mechanism is sufficient for our test scenes. This design uses almost all FPGA logic slices (99%), about 88% of the on-chip block memories, and 20 of the block multipliers (13%). Almost all of the 48 floating point units are in the SPU unit with the remaining dedicated to the TPU (see Table2 ). On the FPGA we are limited to a 24 bit floating point representation, because we had to use the custom 18 bit fixed point multipliers already available on the chip.... In PAGE 7: ... Only floating point textures and 32 bit frame buffers are implemented. Table2 shows the number of floating point units and memory avail- able in the prototype. Only the memory bits actually used are shown and dual ported memories are counted twice.... ..."
Cited by 30
Table 1. Code size and hashing speeds of the different compression functions on a 90 MHz Pentium for our Assembly implementations of both [BGV96] and this note. The code size only refers to the improved implementations. Code and data are assumed to reside in the on-chip caches. The figures are independent of the buffer size as long as it, together with the local data, fits in the 8-Kbyte on-chip cache.
1997
"... In PAGE 1: ...Fx-prefix. A value for the cycles per instruction (CPI) of close to 0.5 is therefore an indication of the high percentage of simple paired instructions in the code. Table1 gives a better idea of the resulting improvement. Algorithm Size Speed (Mbit/s) Factor (bytes) [BGV96] this note this note-[BGV96]... ..."
Cited by 4
Table 8 shows the number of processors, the number of main memory modules, and the extra amount of on-chip memory needed for Architecture 2. We assume a conservative on-chip memory access time of 70 ns, and a main memory access time of 200 ns. Note that there is a drastic reduction in the number of processors and memory modules at the cost of a few kbytes of on-chip memory. [Column headings of Table 8: Block format | HBMA | Picture size | # of processors | # of external memory modules | on-chip memory]
1995
"... In PAGE 19: ... Table8 : Hardware resources required by Architecture 2. 3.... ..."
Cited by 5