### Table 1: Performance of the data-parallel programming style version of DISPER. A `'` indicates that there was not enough main memory available to run the problem size.

1993

"... In PAGE 11: ... For the chosen problem sizes and number of processors used, the computational load is perfectly balanced in every experiment. The lower part of Table 1 shows the execution times of the automatically generated node programs on the Intel Hypercube iPSC/860. For comparison purposes, the upper part of the table contains the execution times of the data-parallel programming style versions on a sequential workstation (Sparc2), two superscalar workstations (Sparc10 and RS6000/550), and a vector supercomputer (Cray Y-MP).... In PAGE 15: ... Table 1. The resulting numbers show that the SUN compilers for the Sparc2 and Sparc10 did not exploit the additional opportunities for compile-time optimizations available in the data-parallel programming style.... ..."

### Table 2. Exploiting Data Parallelism

1998

"... In PAGE 10: ... All performance numbers are sustainable for any data stream length that entirely fits within the SRF. Table 2 shows the speedup of the four kernels on the Imagine processor going from a single-cluster configuration to an eight-cluster configuration. The near-linear speedup of 7.... ..."

Cited by 97

### Table 1: Available Data Parallelism in Wireless Communication Workloads (U = Users, K = constraint length, N = spreading gain (fixed at 32), R = coding rate (fixed at rate 1/2)). The numbers in columns 3-5 represent the amount of data parallelism. A 32-cluster processor will require reconfiguration for all cases where the data parallelism drops below 32. The fully loaded case does not require any reconfiguration.

2004

"... In PAGE 7: ... voltage. However, base-stations rarely operate at full capacity [3]. At lower-capacity workloads, far fewer resources are required to meet the real-time constraints, so many of the resources will be used inefficiently. Table 1 shows the available data parallelism for the base-station with variation in the number of users (U) and the decoding constraint length (K). While it is possible to vary other system parameters such as the coding rate (R) and the spreading gain (N) in the table, we decided to keep them constant in order to fix the target data rate to 128 Kbps/user, as changing these two parameters affects the target data rate as well in wireless standards [6].... In PAGE 16: ... In order to compare the increase in execution time, a 32-user system with constraint length 9 Viterbi decoding is considered. The (32,9) case does not require any reconfiguration as it always has data parallelism greater than 32, as can be seen from Table 1. Hence, Figure 9 allows us to see the overhead of providing the ability to reconfigure using memory transfers (MEM), the multiplexer network (MUX) and conditional streams (CS).... ..."

### Table 2: Node and edge attributes for the data-parallel operations defined in Table 1.

1991

"... In PAGE 10: ... This is important as nodes can take inputs of several sizes, or produce outputs whose sizes differ from those of the input. The node and edge attributes of the operations shown in Table 1 are shown in Table 2. Note that the size of the output may be different from the iteration space size, as for the +/ operation.... In PAGE 13: ... To illustrate how the above algorithm works, consider the split operation. Given the initial labeling of the edges as shown in Figure 4, and using the node characteristics shown in Table 2, we get the following system of equations: E = {b = c; c = d; e = 1; b = f; g = a; g = f; h = g; b = d; b = h; i = b; a = i; j = a}, which upon solution yields the following two equivalence classes: EC1 = {e}; EC2 = {a, b, c, d, f, g, h, i, j}. This gives the edge labeling shown in Figure 5. The other node and edge attributes are easily computed given the size labels on the edges and are also shown in Figure 5.... In PAGE 19: ... Also, because a vector may be mapped to multiple edges with the same source (corresponding to fanout), the consumption pattern for a vector is defined as the most constrained consumption pattern among all the edges to which the vector is mapped, arb being more constrained than ind, which is more constrained than unused. Table 2 shows the edge attributes for various nodes. The language primitives can be divided into the following groups: elementwise operations, structure accessors (such as LENGTH), permutes and distributes, and scans and reductions.... ..."

Cited by 13
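The equivalence classes quoted in the excerpt can be reproduced with a standard union-find pass over the variable-to-variable size equations. This is a hedged sketch, not the paper's implementation; the edge names and equations are taken from the excerpt, and `e = 1` is treated as fixing a constant size rather than merging `e` with another edge.

```python
# Union-find over the size-label equations for the split-operation example.
def find(parent, x):
    # follow parent pointers to the class representative, with path halving
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[ra] = rb

edges = list("abcdefghij")
parent = {v: v for v in edges}

# Variable-to-variable equations from the excerpt; e = 1 assigns the
# constant size 1 to e's class and introduces no merge, so e stays alone.
equations = [("b", "c"), ("c", "d"), ("b", "f"), ("g", "a"), ("g", "f"),
             ("h", "g"), ("b", "d"), ("b", "h"), ("i", "b"), ("a", "i"),
             ("j", "a")]
for a, b in equations:
    union(parent, a, b)

# Group edges by representative to obtain the equivalence classes.
classes = {}
for v in edges:
    classes.setdefault(find(parent, v), set()).add(v)
# yields the two classes from the excerpt: {e} and {a,b,c,d,f,g,h,i,j}
```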


### Table 3: A data-parallel library of partitioning schemes for adaptive O(N) N-body methods.

| Method | Input | Load balancing quality | Nodal weights | Edge weights |
| --- | --- | --- | --- | --- |
| ORB | workload + coord. | good | unknown | … |
| Morton | … | … | … | … |

1997

"... In PAGE 7: ...2 Load Balancing and Partitioning Heuristics Previously, orthogonal recursive bisection (ORB) and Morton and Peano-Hilbert ordering have been used to partition particles in the Barnes-Hut method [17, 22, 18] and to partition boxes in an adaptive fast multipole method [18]. We have developed an extensive library of partitioning schemes together with their data-parallel implementations in HPF, as summarized in Table 3. We also developed an extension of ORB called rotational recursive bisection (RRB).... ..."

Cited by 8
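Orthogonal recursive bisection, one of the schemes summarized in the Table 3 caption, can be sketched in a few lines: recursively split the particle set in half along alternating coordinate axes until the desired number of partitions is reached. This is an illustrative sketch with unit particle weights, not the paper's HPF implementation; function and variable names are assumptions.

```python
# Minimal ORB sketch: 2-D points, power-of-two partition count, unit weights.
def orb(points, depth, parts):
    """Recursively bisect points into `parts` equal-sized partitions."""
    if parts == 1:
        return [points]
    axis = depth % 2                      # alternate x/y split axis per level
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                   # equal-workload split under unit weights
    return (orb(pts[:mid], depth + 1, parts // 2) +
            orb(pts[mid:], depth + 1, parts // 2))

particles = [(0, 0), (1, 3), (4, 1), (5, 2), (2, 2), (3, 0), (6, 5), (7, 4)]
partitions = orb(particles, 0, 4)
# four partitions of two particles each, spatially compact along the split axes
```

A weighted variant would split at the median of cumulative workload rather than at the element count midpoint, which is what makes ORB attractive for the nonuniform particle distributions in adaptive N-body codes.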


### Table 1: Data-parallel performance on a network of Sun Sparcstation 5 machines (column headings: Circuit, Processors)

"... In PAGE 26: ... Table 10: Effect of task-parallel scheduling on a network of Sun Sparcstations. Circuit Task Processors Ordering 1 2 3 4 5 plapart random 175.1 142.... In PAGE 28: ... Considering that the overall performance of the DRC is bounded by the performance of the most complex algorithms used by the DRC, the overall complexity for the DRC is O(N log N). Table 12: Complexities of the DRC operations: Boolean Operation, O(N log N); Sort, O(N log N) or O(log N); Grow, O(N log N); Width, O(N); Spacing, O(N); Square Test, O(N); Overall DRC, O(N log N). 6.2 Analysis of Parallel DRC: The performance results demonstrate that both data parallelism and task parallelism can be applied to the DRC problem to achieve better performance and reduced memory requirements as compared to serial algorithms.... In PAGE 29: ... Table 13: Comparison of parallelization methods on CM-5. Circuit Procs per Processors Cluster 16 32 64 128 haab1 1 100.3 44.... ..."

### Table 1. Available Data Parallelism in Wireless Communication algorithms

"... In PAGE 12: ... Hence, the total number of ALUs on the chip were fixed to 3 adders and 3 multipliers for every cluster. As shown in Table1 , the data parallelism in the algorithms varies with many factors such as the number of users, the spreading gain and the constraint length of Viterbi decoding. While it is possible to choose 64 clusters for constraint length 9 Viterbi decoding as the maximum number and de- sign a 64 cluster architecture, we see that estimation and detection will never use 64 clusters and hence, half the clusters will always be turned off during estimation and detection and possibly more clusters will be off if other constraint lengths are used.... ..."