## Towards an Optimal Bit-Reversal Permutation Program (1998)


Venue: | In Proceedings of IEEE Foundations of Computer Science |

Citations: | 11 - 2 self |

### BibTeX

@INPROCEEDINGS{Carter98towardsan,
author = {Larry Carter and Kang Su Gatlin},
title = {Towards an Optimal Bit-Reversal Permutation Program},
booktitle = {In Proceedings of IEEE Foundations of Computer Science},
year = {1998},
pages = {544--555},
publisher = {IEEE Computer Society Press}
}

### Abstract

The speed of many computations is limited not by the number of arithmetic operations but by the time it takes to move and rearrange data in the increasingly complicated memory hierarchies of modern computers. Array transpose and the bit-reversal permutation -- trivial operations on a RAM -- present non-trivial problems when designing highly tuned scientific library functions, particularly for the Fast Fourier Transform. We prove a precise bound for RoCol, a simple pebble-type game that is relevant to implementing these permutations. We use RoCol to give lower bounds on the amount of memory traffic in a computer with four levels of memory (registers, cache, TLB, and memory), taking into account such "messy" features as block moves and set-associative caches. The insights from this analysis lead to a bit-reversal algorithm whose performance is close to the theoretical minimum. Experiments show it performs significantly better than every program in a comprehensive study of 30 published algo...

### Citations

2446 |
The Design and Analysis of Computer Algorithms
- Aho, Hopcroft, et al.
- 1974
Citation Context ...r i = 0 to N-1 B[i] = A[i] Transpose(A,B): for i = 0 to N1-1 for j = 0 to N2-1 B[j,i] = A[i,j] BitReverse(A,B): for i = 0 to N-1 B[r(i)] = A[i] In the Random Access Machine (RAM) model of computation [AHU74], all three programs have the same complexity, $\Theta(N)$. If we only count the cost of Loads and Stores of array elements (i.e., we assume that all addressing and looping computations are free) then e... |
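The BitReverse loop quoted in this context can be sketched in Python as follows. This is a minimal illustration, not the paper's tuned program: it assumes that r(i) reverses the lg(N) low-order bits of i and that N is a power of two, and the helper names `bit_reverse_index` and `bit_reverse` are hypothetical.

```python
def bit_reverse_index(i: int, bits: int) -> int:
    """Return i with its `bits` low-order bits reversed (the r(i) of the pseudocode)."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)  # shift the lowest bit of i onto the end of r
        i >>= 1
    return r


def bit_reverse(a: list) -> list:
    """Naive BitReverse(A, B): B[r(i)] = A[i] for i = 0 .. N-1.

    Assumes len(a) is a power of two. This version ignores the memory
    hierarchy entirely, which is exactly the cost the paper analyses.
    """
    n = len(a)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a positive power of two")
    bits = n.bit_length() - 1
    b = [None] * n
    for i in range(n):
        b[bit_reverse_index(i, bits)] = a[i]
    return b
```

For example, `bit_reverse(list(range(8)))` yields `[0, 4, 2, 6, 1, 5, 3, 7]`. Every element is loaded once and stored once, matching the RAM-model count in the quoted text.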

318 |
Computational Frameworks for the Fast Fourier Transform
- Van Loan
- 1992
Citation Context ...s. This example highlights that a major difficulty with programming FFT's efficiently is choreographing the data movement. Because of the tremendous importance of FFT's,¹ many papers and books (e.g. [V-L92]) deal extensively with this question. In practice, most FFT implementations avoid bit reversals, using "autosort" methods instead, which weave the bit reversal into the rest of the computation. One o... |

166 |
I/O complexity: The red-blue pebble game
- Hong, Kung
- 1981
Citation Context ...ata must be moved from a large, slow memory to a small, fast memory for processing. Permutations involve no re-use of data (each element is used only once) so models (such as the Red-Blue pebble game [HK81]) that ignore the spatial structure of memory don't provide any insight. However, a two-level model becomes relevant when there is an added restriction that only contiguous blocks of data can be moved... |

129 |
FFTs in external or hierarchical memory
- Bailey
- 1990
Citation Context ...d bit reversals, using "autosort" methods instead, which weave the bit reversal into the rest of the computation. One of the more popular algorithms is the "Four Step" FFT [GS66], advocated by Bailey [B90] particularly for computers with hierarchical memories. The algorithm performs a one-dimensional FFT by storing the data in a 2-D array in column-major order, performing FFT's on the rows of the array,... |

110 |
Hierarchical memory with block transfer
- Aggarwal, Chandra, et al.
- 1987
Citation Context ...de as moving data from cache to registers. Modeling this accurately enough to determine the constants requires a hierarchical model of memory. In the Block Transfer (BT) model of hierarchical memory [ACS87], copying a block of consecutive locations takes one unit of time per element, after an initial access time that is a function of the source and target locations. The cited paper provides results for ... |

53 |
Fast Fourier transforms for fun and profit
- Gentleman, Sande
- 1966
Citation Context ...ost FFT implementations avoid bit reversals, using "autosort" methods instead, which weave the bit reversal into the rest of the computation. One of the more popular algorithms is the "Four Step" FFT [GS66], advocated by Bailey [B90] particularly for computers with hierarchical memories. The algorithm performs a one-dimensional FFT by storing the data in a 2-D array in column-major order, performing FFT'... |

42 |
Uniform Memory Hierarchies
- Alpern, Carter, et al.
- 1990 |

38 |
Permuting information in idealized two-level storage
- Floyd
- 1972
Citation Context ...ture of memory don't provide any insight. However, a two-level model becomes relevant when there is an added restriction that only contiguous blocks of data can be moved between the two levels. Floyd [F72] shows that if the small memory can hold only two blocks, each of size $B$ elements, and $B \le \min(N_1, N_2)$, then transposing an $N_1 \times N_2$ array requires exactly $2(N/B)\lg(B)$ block moves between the t... |

38 |
Extending the Hong-Kung model to memory hierarchies
- Savage
- 1995
Citation Context ... architectural scenarios. Further, many computers cannot overlap communication as required by the model. (Footnote 2: The specific model here assumes the initial access time of a block at address x is x.) Savage [S95] presents a multilevel pebble game and briefly suggests an extension that can model block moves, but his results don't apply to the issues we address in this paper. The rest of the paper is organized ... |

19 |
Virtual memory algorithms
- Aggarwal, Chandra
- 1988
Citation Context ...t functions. Unfortunately, there is no obvious way to translate these asymptotic analyses to specific results for the step-wise functions that occur in practice. A particularly provocative result in [AC88] is that when block transfers in a certain BT model² are controlled by a virtual memory system, then any transpose program that moves each element of the source array directly to its final destinatio... |

13 |
Bit reversal on uniprocessors
- Karp
- 1996
Citation Context ...early b 2 = p 2b 1 k 1 times. The final section presents an optimized BitReverse program, and shows it is better than any other known method. This last task is made easier since a comprehensive study [K96] shows that Alan Karp's "Hybrid" bit reversal is superior to the 29 other algorithms he found in a thorough literature search. Our program beats Hybrid significantly. 2. The RoCol™ pebble game EQUIP... |

11 |
Challenges of computing the fast Fourier transform
- Johnson, Johnson
- 1997
Citation Context ... to what has been known in practice since the earliest computers; that "tiling" an array into subarrays of size $B \times B$, and processing one tile at a time, allows transpose to be computed with each data element making only one trip into the smaller memory. (Footnote 1: It has been estimated [JJ97] that in 1990, 40% of all CPU cycles executed by Cray Research supercomputers were devoted to FFT's.) For some othe... |
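The tiling idea in this context can be sketched as below. It is only an illustration under stated assumptions: arrays are flat, row-major Python lists, the function name `tiled_transpose` and the `tile` parameter (playing the role of B) are hypothetical, and Python lists do not expose cache behaviour, so the sketch shows the loop structure rather than the performance effect.

```python
def tiled_transpose(a: list, n1: int, n2: int, tile: int) -> list:
    """Transpose a row-major n1 x n2 array one tile x tile block at a time.

    Each element of `a` is read once and written once; visiting the data
    tile by tile keeps the working set to roughly two tiles at a time.
    """
    if len(a) != n1 * n2:
        raise ValueError("array length must equal n1 * n2")
    if tile < 1:
        raise ValueError("tile size must be positive")
    b = [None] * (n1 * n2)
    for i0 in range(0, n1, tile):            # tile row origin
        for j0 in range(0, n2, tile):        # tile column origin
            for i in range(i0, min(i0 + tile, n1)):
                for j in range(j0, min(j0 + tile, n2)):
                    b[j * n1 + i] = a[i * n2 + j]   # B[j,i] = A[i,j]
    return b
```

For example, `tiled_transpose([0, 1, 2, 3, 4, 5], 2, 3, 2)` gives `[0, 3, 1, 4, 2, 5]`; the result does not depend on the tile size, only the order of memory accesses does.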

4 |
Fast Fourier Transform
- Gentleman, Sande
- 1966
Citation Context ...ost FFT implementations avoid bit reversals, using "autosort" methods instead, which weave the bit reversal into the rest of the computation. One of the more popular algorithms is the "Four Step" FFT [GS66], advocated by Bailey [B90] particularly for computers with hierarchical memories. The algorithm performs a one-dimensional FFT by storing the data in a 2-D array in column-major order, performing FFT'... |

3 |
IO Complexity of Sorting and Related Problems, CACM
- Aggarwal, Vitter
- 1988
Citation Context ...blocks, each of size $B$ elements, and $B \le \min(N_1, N_2)$, then transposing an $N_1 \times N_2$ array requires exactly $2(N/B)\lg(B)$ block moves between the two levels, where $N = N_1 N_2$. Aggarwal and Vitter [AV88] extend this result to show that if the small memory can hold $K \ge 2$ blocks, then transpose requires $\Theta\big((N/B)\,\lg(\min(KB, N_1, N_2, N/B)) / \lg(K)\big)$ block moves. In practice, when the two levels being ... |
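The two bounds quoted in this context can be written out as display formulas. The relation symbols are garbled in the extracted text, so this restatement assumes they read as $\le$ and $\ge$; the numeric instance ($N_1 = N_2 = 1024$, $B = 32$) is a hypothetical example, not a figure from the paper.

$$
\text{Floyd [F72], two blocks of size } B,\ B \le \min(N_1, N_2):\qquad
\#\text{block moves} = 2\,\frac{N}{B}\,\lg B, \qquad N = N_1 N_2
$$

$$
\text{Aggarwal and Vitter [AV88], } K \ge 2 \text{ blocks}:\qquad
\#\text{block moves} = \Theta\!\left(\frac{N}{B}\cdot\frac{\lg \min(KB,\ N_1,\ N_2,\ N/B)}{\lg K}\right)
$$

$$
N_1 = N_2 = 1024,\ B = 32:\qquad
N = 2^{20},\quad \frac{N}{B} = 2^{15} = 32768,\quad \lg B = 5,\quad
2 \cdot 32768 \cdot 5 = 327680 \text{ block moves}
$$

Under these assumptions, each of the 32768 blocks of the source array takes part in ten block moves on average in the two-block setting, which is why a larger fast memory (larger $K$) reduces the bound.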