### Table 7 Data and tag hardware cost for a direct-mapped conventional 16 KB cache

2004

### Table 4 ECC hardware performance comparisons.

"... In PAGE 9: ... As shown in the figure, our hardware architecture provides good scalability in terms of speed, area, and operator size. Table 4 shows performance comparisons with conventional hardware for EC scalar multiplications in GF(2^n) and GF(p). Our EC processor shows the fastest operation times in both fields, even though most of the conventional circuits support a single field and use special fields or EC parameters to boost speed and to reduce hardware resources.... ..."

### Table 3 Hardware Design Factors

2006

"... In PAGE 7: ... Translate (Yes/No): translate training documents using lexicons before WMT generation. Table 2: WMT and Score Table Generation Factors. The factors shown in Table 3 involve the precision of the hardware. Hardware-optimized circuits use much less memory than conventional processors.... ..."

Cited by 1

### Table 1: Methods exported by the NMM hardware interface

"... In PAGE 9: ...urther in section 5.2.2. The hardware interface is also responsible for providing access to page directory and page table related functionality. Table 1 presents a selection of the methods available in the hardware interface. Note that the level of abstraction is significantly lower than that of even the internal interfaces of conventional OS kernels.... ..."

### Table 1: Direct convolution timings with a 128×128 workspace

... imagine more complex environments where the gap in the running times of the FFT-based and the linear algorithm will be reduced or even inverted. We emphasize that we ran our experiments with algorithms implemented in software on a conventional single-processor architecture. A hardware or parallel implementation of the FFT algorithm would certainly lower the 'break-even' complexity where the FFT-based algorithm becomes preferable to the linear algorithm.

"... In PAGE 13: ... The direct algorithm augments the workspace bitmap and pads it with zeros to avoid problems at boundary configurations when the convolution is computed. Table 1 summarizes experimental results for a 128×128 workspace. The FFT-based algorithm computes the bitmap CSPACE in approximately 90 seconds.... In PAGE 13: ... The FFT-based algorithm computes the bitmap CSPACE in approximately 90 seconds. Column 2 of Table 1 shows the time required... ..."
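The direct algorithm the snippet describes can be sketched in a few lines: slide the robot footprint over the workspace bitmap and mark every placement that overlaps an obstacle. This is a minimal illustration only; the 8×8 grid, the 2×2 footprint, and the function name are assumptions, not the paper's implementation, and an FFT-based variant would compute the same bitmap via a frequency-domain convolution.

```python
# Hypothetical sketch: computing a configuration-space (CSPACE) bitmap by
# direct convolution of a workspace bitmap with a robot footprint.
# Grid size, footprint, and names are illustrative assumptions.

def direct_cspace(workspace, footprint):
    """Mark every placement of `footprint` that overlaps an obstacle.

    workspace: 2D list of 0/1 (1 = obstacle); footprint: 2D list of 0/1.
    Returns a bitmap where 1 means the robot collides when its reference
    corner is placed at that cell.
    """
    H, W = len(workspace), len(workspace[0])
    fh, fw = len(footprint), len(footprint[0])
    cspace = [[0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            hit = 0
            for dy in range(fh):          # direct (non-FFT) inner loops
                for dx in range(fw):
                    yy, xx = y + dy, x + dx
                    if footprint[dy][dx] and 0 <= yy < H and 0 <= xx < W:
                        hit |= workspace[yy][xx]
            cspace[y][x] = hit
    return cspace

workspace = [[0] * 8 for _ in range(8)]
workspace[3][4] = 1                      # one obstacle cell
footprint = [[1, 1], [1, 1]]             # 2x2 robot footprint
cs = direct_cspace(workspace, footprint)
print(cs[2][3], cs[0][0])                # colliding placement vs. free space
```

The direct loops cost O(H·W·fh·fw), which is why the snippet expects an FFT-based method (O(HW log HW), independent of footprint size) to win once the environment gets complex enough.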

### Table 1: Methods exported by the NMM hardware interface

Table 1 presents a selection of the methods available in the hardware interface. Note that the level of abstraction is significantly lower than that of even the internal interfaces of conventional OS kernels. Traditionally, this type of functionality is hard-coded into start-up sequences and is not available for interaction once the OS has initialised. However, in DEIMOS, this level of functionality must be continuously available so that memory modules can complete the hardware initialisation that was started by the DEIMOS bootstrap sequence in an application-specific, ongoing and open-ended way.

"... In PAGE 9: ...TableAddr CreatePDEntry TableAddr, Offset, PTPointer, Options CreatePT TableAddr CreatePTEntry TableAddr, Offset, FramePointer, Options Paging Status PhysToVirt PhysAddr Table 1: Methods exported by the NMM hardware interface. In the first group of methods, CreateSegDesc() creates a segment descriptor of a given Type (code, data or system), with a base of Base and limit of Limit. Privilege refers to the CPU-defined importance levels assigned to every segment.... ..."

### Table 5.1: Hardware Debugger Support

Comparing conventional debuggers and the one implemented for DISC2 provides insight into some of the unique challenges encountered in this effort. This discussion will only focus on three debugging operations of interest: break-points, printing and program execution control. A conventional debugger sets and clears break-points by saving the bytes in the instruction memory at the break-point location and replacing them with a trap instruction. When the program execution reaches that point, the trap instruction signals the operating system to switch to the debugger. The debugger figures out which line in the source program being debugged created the trap and ...
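The conventional break-point mechanism described above can be sketched as a save/overwrite/restore cycle over instruction memory. This is a simulation only: the byte-array "memory", the class name, and the choice of 0xCC (the x86 INT3 single-byte trap) are illustrative assumptions, not the DISC2 debugger's actual design.

```python
# Illustrative sketch of how a conventional debugger arms a break-point:
# save the instruction byte at the target address, overwrite it with a
# trap opcode, and restore the saved byte when the break-point is cleared.
# The simulated memory contents and 0xCC opcode are assumptions.

TRAP = 0xCC  # x86 INT3: a common single-byte software trap instruction

class Breakpoints:
    def __init__(self, text: bytearray):
        self.text = text          # simulated instruction memory
        self.saved = {}           # addr -> original byte

    def set(self, addr: int):
        self.saved[addr] = self.text[addr]   # save original instruction byte
        self.text[addr] = TRAP               # patch in the trap

    def clear(self, addr: int):
        self.text[addr] = self.saved.pop(addr)  # restore the saved byte

mem = bytearray([0x55, 0x89, 0xE5, 0x5D, 0xC3])  # arbitrary code bytes
bp = Breakpoints(mem)
bp.set(2)
print(hex(mem[2]))   # 0xcc while the break-point is armed
bp.clear(2)
print(hex(mem[2]))   # 0xe5 once the original byte is restored
```

When the CPU executes the trap, the OS delivers control to the debugger, which maps the faulting address back to a source line; that mapping step is what the truncated snippet goes on to discuss.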

### Table 1. Comparison of conventional and proposed BMU and SMU architectures for constraint length K=4

2005

"... In PAGE 4: ... This normalization method leads to a simplified SMU, but also to a more complex BMU as shown in Figure 5. However, the total hardware resources of the BMU and SMU are not much different from the conventional architecture, as shown in Table 1. More importantly, as can be seen in Figure 5 (a) and (b), our proposed architecture reduces the critical path delay significantly by eliminating the state metric normalization process used in the conventional SMU.... ..."

Cited by 1
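The state-metric normalization that the snippet says the proposed architecture eliminates can be illustrated with a toy add-compare-select step. Everything here is a made-up 4-state example (the trellis connectivity, branch metrics, and function names are assumptions); it only shows why the conventional SMU needs a rescaling step on its critical path.

```python
# Hedged sketch of conventional Viterbi state-metric handling: path metrics
# grow without bound, so a conventional SMU subtracts the minimum metric
# every step to prevent overflow -- an extra operation on the critical path.
# Trellis shape and branch metrics below are illustrative assumptions.

def acs_step(metrics, branch):
    """One add-compare-select step over a toy 4-state trellis."""
    nxt = []
    for s in range(4):
        p0, p1 = (s * 2) % 4, (s * 2 + 1) % 4   # two predecessor states
        nxt.append(min(metrics[p0] + branch[(p0, s)],
                       metrics[p1] + branch[(p1, s)]))
    return nxt

def normalize(metrics):
    """Conventional normalization: rescale so the best metric is zero."""
    m = min(metrics)
    return [x - m for x in metrics]

metrics = [0, 5, 5, 5]
branch = {(p, s): (p + s) % 3 for p in range(4) for s in range(4)}
metrics = normalize(acs_step(metrics, branch))
print(metrics)
```

Schemes like the one the paper proposes (or modulo normalization) keep metrics bounded without this explicit subtract, which is where the critical-path saving comes from.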

### Table 1: Dynamic Memory Management in Some Real-Time Operating Systems

2003

"... In PAGE 29: ... Table 1 classifies a few real-time operating systems according to which use the fixed-sized block allocation and which support heap allocation. Table 1: Dynamic Memory Management in Some Real-Time Operating Systems.... In PAGE 104: ...ion of the SoCDMMU in software. The comparison is shown in clock cycles. We assume that the hardware SoCDMMU and the microcontroller both have the same clock rate. As Table 10 shows, the hardwired SoCDMMU is more than 10X faster than the SoCDMMU that uses software running on a general-purpose microcontroller. Table 10: A comparison between the SoCDMMU and Microcontroller E.T.... In PAGE 106: ...3.2 Speedup of a single malloc() or free(). Table 11: E.T.... In PAGE 106: ... As Table 11 shows, memory allocation done by the SoCDMMU is faster than memory allocation using the malloc() function. Table 12: E.T.... In PAGE 107: ... Table 13: Required Memory Allocations MPEG-2 Player OFDM Receiver 2 KBytes 34 KBytes 500 KBytes 32 KBytes 5 KBytes 1 KBytes 1500 KBytes 1.5 KBytes 1.... In PAGE 107: ... During the transition from the MPEG-2 player to the OFDM receiver, six memory deallocations and seven memory allocations are executed. From the results, shown in Table 14, we can see that using the SoCDMMU yields a 4.4X improvement over the SDT2.... In PAGE 108: ... Table 14: Memory Management E.... In PAGE 108: ...4X Worst Case 1244 cycles 4851 cycles 3.9X Table 15 shows a 9.26X improvement over uClibc memory management functions in average-case execution time.... In PAGE 108: ...46X improvement over the uClibc memory management functions. Table 15: Memory Management E.T.... In PAGE 109: ... In this way, the benchmarks could be dynamically downloaded and run on a handheld device, which is the kind of ability we want this research to enable and make more practical. Table 16: E.T.... In PAGE 110: ... Table 17: E.T.... In PAGE 110: ... This reduction in the memory management execution time yields speed-ups in the benchmarks' execution times. As we can see in Table 17, using the SoCDMMU tends to speed up the application execution time, and this speed-up is almost equal to the percentage of time consumed by conventional software memory management techniques. 7.... In PAGE 111: ... Table 18: SoC Area Element Number of Transistors % 4 ARM9TDMI Cores 4 x 112K = 448K Transistors 4 L1 Caches (64KB+64KB) 4 x 6.5M = 26M Transistors Global On-Chip Memory (16MB) 134.... ..."

Cited by 2
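The fixed-sized block allocation that Table 1 contrasts with heap allocation can be sketched in a few lines: the pool is carved into equal blocks up front, so alloc and free are O(1) free-list operations with no searching or fragmentation, which is why many RTOSes prefer it. The class name and pool geometry below are illustrative assumptions, not any particular RTOS's API.

```python
# Minimal sketch (assumed names/sizes) of a fixed-sized block allocator,
# the RTOS-friendly alternative to a general heap: constant-time alloc
# and free, no coalescing, no fragmentation-dependent worst case.

class FixedBlockPool:
    def __init__(self, block_size: int, nblocks: int):
        self.block_size = block_size
        self.free = list(range(nblocks))   # free list of block indices

    def alloc(self):
        return self.free.pop() if self.free else None  # O(1), no search

    def free_block(self, idx: int):
        self.free.append(idx)                          # O(1)

pool = FixedBlockPool(block_size=64, nblocks=4)
blocks = [pool.alloc() for _ in range(4)]
print(blocks, pool.alloc())    # pool exhausted -> None
pool.free_block(blocks[0])
print(pool.alloc())            # the freed block is immediately reusable
```

The trade-off is internal fragmentation (every request consumes a whole block) and a fixed capacity, whereas heap allocation serves arbitrary sizes at the cost of unbounded, fragmentation-dependent timing, which the SoCDMMU work attacks in hardware.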

### Table 3: Delays assumed for hardware elements

"... In PAGE 22: ... Table 3: Delays assumed for hardware elements. The hardware elements have recognized delays in terms of the delay of one full adder. The delays and areas assumed for the different hardware elements are shown in Table 3. For the 4-to-2 carry-save adder/subtracter we have assumed the implementation given in [13], which possesses the same delay as a 4-to-2 carry-save adder.... In PAGE 22: ... We would like to emphasize that a true comparison between different implementations is possible only if an actual implementation is considered and logic-level simulations are carried out. Therefore, we present a rough, first-order approximation comparison based on Table 3. Nevertheless, we claim that it can express the general trend between different designs.... In PAGE 23: ... For the case of redundant arithmetic, when only angle calculation is considered, we have compared our architectures with the architecture proposed in [5] (only n iterations), but when the calculation of the magnitude of the vector is considered we compare our architectures with the architecture proposed in [9] (n iterations plus some repetitions, but with a constant scale factor). Based on Figure 10 and according to Table 3, the delay of a CORDIC iteration is t_{r2} = t_{2-1mux} + t_{reg} + max(t_{bs}, t_{sel}) + t_{buf} + t_{2-1mux} + t_{adder} (47), where t_{sel} is the delay of the logic to select the i value (about 3t_{fa} in [9] and 1.5t_{fa} in [5], and negligible for the radix-2 architecture with non-redundant arithmetic) and t_{adder} is the delay of the final adder.... In PAGE 23: ... According to the delays given in Table 3, this corresponds to a cycle time of t_{rd2} = (3.5 + max(t_{bs}, t_{sel}) + t_{adder}) t_{fa}. For n bits of precision, n iterations are needed in the classic radix-2 CORDIC [3] and [5], so the total computation time T_{rd2} is T_{rd2} = n t_{r2}, whereas in [9] some repetitions are needed, resulting in a total computation time of T_{rd2} = (n + ⌈n/2^{t-1}⌉) t_{rd2}. According to Table 3 and Figure 10, the total area in non-redundant arithmetic is A_{rd2} = (6n + n log_2 n + A_{sel} + 3 A_{adder}) a_{fa} (48), where A_{sel} is the area of the logic to select the i value (module SEL in Figure 10) and A_{adder} is the area of the adder selected. A_{sel} is practically negligible for non-redundant arithmetic, whereas in the redundant case it is about 1.... In PAGE 23: ... 6.1 Comparison of the architecture of the arithmetic comparison method. According to Table 3, the delay of one iteration for the design proposed in section 5.1 (see Figure 7) is given by t_{rd4comp} = t_{3-1mux} + t_{reg} + max{t_{2-1mux} + t_{modA}, t_{2-1mux} + t_{bs}} + t_{buf} + t_{6-1mux} + t_{adder} (49), and the area is A_{rd4comp} = (8.25n + A_{modA} + n log_2 n + 3 A_{adder}) a_{fa} (50), where t_{modA} and A_{modA} correspond to the delay and the area of Module A of Figure 7 respectively (the area is calculated without considering the compensation of the scale factor.... In PAGE 24: ... coordinates is increased by one bit (see subsection 3.1.3). Therefore, each one of the 7-CLA* of Figure 6 (Module A of Figure 7) must be substituted by two 4-2 CSAs of 8 bits followed by an 8-CLA in the redundant-arithmetic version. Therefore, according to Table 3, we estimate that Module A of Figure 7 has a delay of 4t_{fa} and an area of 18a_{fa} in conventional arithmetic, and a delay of 5.5t_{fa} and an area of 30a_{fa} in carry-save arithmetic.... In PAGE 24: ... Hence, for n ≤ 64 and according to Table 3, we have t_{rd4comp} = (4 + t_{modA} + t_{adder}) t_{fa}, and the total computation time T_{rd4comp} is T_{rd4comp} = (0.80 n/2 + 3) t_{rd4comp} if only angle computation is considered, and T_{rd4comp} = (0.80 n + 3) t_{rd4comp} if magnitude computation is also considered. Factor 0.... In PAGE 26: ...7 Table 5: Speedup and area ratio for the look-up table method of about 75a_{fa} for the digit-selection network without using the zero-skipping technique, and about 100a_{fa} if this technique is considered (note that it is only necessary to duplicate the indexing logic instead of the whole table). Therefore, according to Table 3, for n ≤ 64 bits we have t_{rd4table} = 10.5t_{fa}, and the area is 22n + 2n log_2 n + 100 if the zero-skipping technique is implemented. The total computation time is T_{rd4table} = (0.80 n/2 + 1) · 10.5 t_{fa}.... ..."
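The delay formulas above all multiply a per-iteration cycle time by an iteration count, because CORDIC reduces a rotation to n shift-and-add steps. As a reference point, here is the textbook non-redundant radix-2 vectoring mode (angle plus scaled magnitude); it is a generic sketch, not the paper's redundant or radix-4 architectures, and the function name and n=32 are assumptions.

```python
# Hedged sketch of classic radix-2 CORDIC in vectoring mode: n shift-and-add
# iterations drive y toward zero, accumulating the angle in z, while x
# converges to K * sqrt(x0^2 + y0^2). This is the textbook algorithm,
# offered only to ground the T = n * t_iter cost model quoted above.

import math

def cordic_vectoring(x, y, n=32):
    """Return (K-scaled magnitude, angle) of (x, y) after n iterations."""
    z = 0.0
    for i in range(n):
        d = -1 if y > 0 else 1                     # rotate toward the x-axis
        x, y = x - d * y * 2**-i, y + d * x * 2**-i  # shift-and-add step
        z -= d * math.atan(2**-i)                  # accumulate the angle
    return x, z

K = 1.0
for i in range(32):
    K *= math.sqrt(1 + 4**-i)                      # constant CORDIC gain

mag_scaled, angle = cordic_vectoring(3.0, 4.0)
print(round(mag_scaled / K, 6), round(angle, 6))   # ~ (5.0, atan2(4, 3))
```

Each iteration is one variable shift plus one addition per coordinate, which is exactly the hardware (barrel shifter, adder, muxes, register) whose delays t_{bs}, t_{adder}, t_{mux} appear in the quoted cycle-time expressions.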