## Bit Reversal On Uniprocessors (1996)

Venue: SIAM Rev

Citations: 13 (0 self)

### BibTeX

@ARTICLE{Karp96bitreversal,

author = {Alan H. Karp},

title = {Bit Reversal On Uniprocessors},

journal = {SIAM Rev},

year = {1996},

volume = {38},

pages = {1--26}

}

### Abstract

Many versions of the fast Fourier transform require a reordering of either the input or the output data that corresponds to reversing the order of the bits in the array index. There has been a surprisingly large number of papers on this subject in the recent literature.
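The reordering the abstract describes can be illustrated with a minimal sketch. This is not one of the algorithms surveyed in the paper; it is a naive baseline written for illustration, and the function names `bit_reverse` and `bit_reversal_permute` are hypothetical. It assumes the array length is a power of two (the radix-2 case).

```python
def bit_reverse(index: int, nbits: int) -> int:
    """Return `index` with its lowest `nbits` bits in reversed order."""
    reversed_index = 0
    for _ in range(nbits):
        # Shift the result left and append the current low bit of index.
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


def bit_reversal_permute(data: list) -> list:
    """Return a copy of `data` reordered by bit-reversed index.

    Assumes len(data) is a power of two; raises ValueError otherwise.
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    nbits = n.bit_length() - 1  # log2(n) for a power of two
    out = list(data)
    for i in range(n):
        j = bit_reverse(i, nbits)
        # Swap each pair only once; indices with i == j stay in place.
        if i < j:
            out[i], out[j] = out[j], out[i]
    return out


if __name__ == "__main__":
    print(bit_reversal_permute(list(range(8))))
```

For eight elements, index 1 (binary 001) trades places with index 4 (binary 100), and index 3 (011) with index 6 (110). Because bit reversal is its own inverse, applying the permutation twice restores the original order. A loop like this does one index computation per element and touches memory at power-of-two strides, which is the kind of access pattern the citation contexts below associate with poor cache behaviour.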

### Citations

631 | Cache memories
- Smith
- 1982
Citation Context ... held a given word[44]. 3.2. Hierarchical memory. Hierarchical memories are complicated because they use many tricks to shield the user from delays caused by the slower circuits lower in the hierarchy[36]. While these techniques speed up the average memory access time, they make understanding the system more difficult. At the top of the memory hierarchy are the registers. The time it takes to move data ...

324 |
Computational Frameworks for the Fast Fourier Transform
- Loan
- 1992
Citation Context ...ix problems, we limit our discussion to radix 2 problems only. The bit reversal reordering is also closely related to matrix transposition repeatedly applied on matrices of different shape[39]. One scheme is to reshape x into an $N/2 \times 2$ matrix, transpose it, and copy the elements in row major order. This new vector is reshaped into an $N/4 \times 4$ matrix. This procedure is repeated $n - 1$ times and ...

158 |
The fast Fourier transform
- Brigham
- 1988
Citation Context ...45.5 48.3 buneman[6] 23.3 23.8 23.3 23.4 23.5 33.9 50.4 57.1 59.4 61.0 rodrig[28] 19.6 19.6 19.6 19.8 19.9 31.7 48.1 57.0 58.5 58.8 r-rodr[31] 11.5 11.5 11.5 11.5 11.8 33.8 41.9 45.9 48.2 48.6 brigham[5] 113.8 122.1 130.8 141.3 147.1 167.8 193.2 211.8 219.0 233.9 duhamel[8] 11.5 11.4 11.3 11.5 11.4 35.2 44.9 45.1 46.2 46.3 middled[24] 21.8 21.8 21.7 21.8 21.9 33.7 49.8 57.1 60.0 62.5 cvl143[39] 31.2 ...

133 | FFT’s in External or Hierarchical Memory - Bailey - 1990 |

40 |
Algorithms for Matrix Transposition on Boolean N-Cube Configured Ensemble Architectures
- Johnsson, Ho
- 1988
Citation Context ...prehensive review of the literature in the last 15-20 years. (Rosel's[30] survey is more limited in scope than this one.) While there is a good deal of interest in bit reversals on parallel processors[9, 15, 19, 20], this paper considers only those written for uniprocessors. I have collected 30 bit reversal algorithms from the literature and from a request made over na-net. Each has been coded into a uniform sty...

25 | Optimal matrix transposition and bit reversal on hypercubes: all-to-all personalized communication - Edelman - 1991 |

19 | private communication
- Edelman, Reiner
Citation Context ...a top and a bottom half of the even and odd values. He breaks up the bit reversal into groups of elements that can be processed together, and an index vector of order $\sqrt{N}$ in size. The algorithm Heller[14] uses was designed for the CM-2, so some of the details are a bit obscure. However, the structure is similar to that of Khan's in the way groups of elements are handled. Heller starts with a base vers...

18 | Efficient communication primitives on hypercubes - Ho, Raghunath - 1992 |

12 |
private communication
- Bailey
- 1993
Citation Context ...e bit reversal index vector contains all possible power of two strides which leads to many bank conflicts. All is not lost, though. We can modify the data access patterns by scrambling the index vector[1]. Of course, since the index vector is scrambled, we will have to use both a scatter and a gather to do the bit reversal. We will gain by avoiding memory bank conflicts, but we lose because gathers and...

12 |
A method for computing the fast Fourier transform with auxiliary memory and limited high-speed storage
- Singleton
- 1967
Citation Context ...bit reversal step is needed because many of the FFT algorithms bit reverse the data in $\log_2 N$ passes. This observation means that we can also do a bit reversal in $\log_2 N$ passes as was done by Singleton[34], whose algorithm was designed to minimize the amount of words transferred between memory and backing storage. His code uses the perfect shuffle on successively smaller segments on each pass. The loops ar...

9 |
P: The fast fourier transform and its applications
- JW, Lewis, et al.
- 1969
Citation Context ...14.8 44.1 44.0 43.9 44.3 44.5 hybrid 8.7 8.5 8.6 8.7 8.6 37.3 37.5 37.4 37.7 38.1 Table 13 Bit reversals on Intel i860 in cycles per element. Version 8 9 10 11 12 13 14 15 16 17 cooley[7] 27.3 26.3 25.6 26.1 29.0 32.5 33.5 33.9 34.1 42.7 r-cooley[31] 14.9 13.4 12.5 13.3 16.2 18.8 19.3 19.4 19.2 23.7 yong[43] 14.1 12.5 12.1 12.9 16.8 18.8 19.9 19.9 20.0 28.1 buneman[6] 33.8 32.9 32.5 3...

8 |
R.: “IBM RISC System/6000 NIC Tuning Guide for Fortran and C,” Document GG24-3611-01
- Bell
- 1991
Citation Context ...ach other by 1024 words. 4.3. IBM RS/6000-520. The memory subsystem of the RS/6000 model 520 is similar to that of the 3090. Its cache is smaller, 32 KBytes, and its cache lines are shorter (64-bytes)[3]. It also has a smaller TLB. When a reference is made to a word not in cache, a line is transferred from main memory starting with the quadword (16 bytes) containing the requested byte. The first word i...

8 |
An improved digit-reversal permutation algorithm for the fast Fourier and Hartley transforms
- Evans
- 1987
Citation Context ...er $k = k_2 + k_1\sqrt{N}$ is $\tilde{k} = \tilde{k}_2\sqrt{N} + \tilde{k}_1$ when N is even. When k is odd, we do the same thing for the two halves with differing middle bits. This approach is used in Polge's usone algorithm[27]. Evans[10] independently discovered Polge's approach and produced a slightly simpler implementation. Evans later improved the algorithm very slightly by reducing the number of multiplications needed[11]. Evans'...

6 |
Conversion of FFT's to fast Hartley transforms
- Buneman
- 1986
Citation Context ...ile the other is modified until it contains the bit reversal of the other. At this point, the elements are exchanged if they have not already been exchanged in an earlier iteration of the loop. Buneman[6] made three improvements. First, he noted that the last two elements will never be exchanged with each other, so the loop could be shortened. Secondly, he replaced the divisions by two that Cooley use...

4 | The i860(TM) 64-bit supercomputing microprocessor - Kohn, Margulis - 1989 |

4 |
Efficient fast Fourier transform programs for arbitrary factors with one step loop unscrambling
- Polge, Bhagavan
- 1994
Citation Context ...g long cache lines. While this paper has concentrated on radix-2 bit reversals, most of the algorithms used have mixed-radix analogs, some of which have been reported in the literature[26, 35]. However, it is difficult to predict their performance because their memory access patterns are quite different from those of the radix-2 methods. Perhaps a followup study is needed. Acknowledgments. I w...

3 |
A Connection between Bit Reversal and Matrix Transposition: Hardware and Software Consequences
- Duhamel
- 1990
Citation Context ...pairs of elements. This code runs a bit faster than Rutkowska's because it does less integer work in the inner loop. Although we normally think of arithmetic operations as taking all the time, Duhamel[8] noted that the test in the inner loops of many bit reversals takes a substantial amount of time. He used the close relation between bit reversal and matrix transpose to completely eliminate the test ...

3 |
A fast recursive bit-reversal algorithm
- Jeong, Williams
- 1990
Citation Context ...proach. Other orderings can now be seen quite easily. For example, a tree-like reduction could be used on a parallel machine or a different ordering used for machines with direct mapped caches. Jeong's[18] approach combines the data movement with the computation of the index. While it seems unfair to compare it with a routine like gather which doesn't count the time to set up the index array, the algor...

3 |
Fast Computational Algorithms for Bit Reversal
- Polge, Bhagavan, et al.
- 1974
Citation Context ...pairs of elements. Repeat this process, doubling the size of the groups shuffled for $\log_2 N$ steps. Each step is implementable using the merge operation on the STAR 100 they used. Polge's usbin algorithm[27] does multiple passes over the data but does not need any auxiliary storage. However, some elements may be moved more than once. At step k elements are moved when the bits of k and $\log_2 N - k$ differ. Sin...

3 | A new bit reversal algorithm - Walker - 1990 |

2 |
On Estimating and Enhancing Cache Effectiveness
- Ferrante, Sarkar, et al.
- 1991
Citation Context ...on. Other power of two strides will also be bad. A stride of half the cache size gives only two rows; one quarter of the cache size, four rows; etc. Surprisingly, other strides can also cause problems[12]. A four-way set associative, 32 KByte cache with a 128 byte line size will perform poorly when the stride is 103! We can fix our example by changing the stride by a small amount. If the array in our si...

2 |
Computing the Fast Fourier Transform on a Vector
- Korn, Lambiotte
- 1979
Citation Context ...Figure 1 illustrates the problem. It shows the time it takes to do the bit reversal of an array of the indicated length in machine cycles per element. Two methods are shown, a simple scatter operation[23] and one of the first published methods[7]. Also shown is the time it takes to do one FFT butterfly using an algorithm tuned to the IBM 3090 vector processor used. It is clear that the bit reversal step ...

2 |
An Improved Bit-Reversal Algorithm for the Fast Fourier Transform
- Rodriguez
- 1988
Citation Context ... 34.7 35.8 36.1 r-cooley[31] 6.9 6.9 9.4 10.7 11.0 11.1 14.6 17.5 18.3 18.6 yong[43] 6.4 6.3 10.8 11.3 11.6 11.8 20.1 22.4 22.5 22.6 buneman[6] 27.0 26.5 29.9 31.1 31.7 32.0 41.4 45.7 46.9 47.2 rodrig[28] 16.5 17.0 20.9 21.6 22.1 22.5 31.6 35.9 37.1 37.5 r-rodr[31] 8.0 8.1 10.4 12.0 12.4 12.5 16.0 18.9 19.7 20.0 brigham[5] 59.9 63.8 71.9 77.1 81.9 86.0 99.7 107.7 135.5 117.5 duhamel[8] 7.5 7.5 12.5 12...

2 |
FFT Algorithms for Vector
- Swarztrauber
- 1984
Citation Context ...t done properly, it can take a substantial fraction of the total time to do the FFT. In fact, it is common wisdom that bit reversal reordering is too slow to be used on a machine with a hierarchical memory[37]. Figure 1 illustrates the problem. It shows the time it takes to do the bit reversal of an array of the indicated length in machine cycles per element. Two methods are shown, a simple scatter operati...

2 |
On Vectorizing the Fast Fourier Transform
- Wang
- 1980
Citation Context ...ts of the binary form of $k_j$ are 0 when $k_j$ is small. For example, we know that the last $n - 2$ bits of $\tilde{k}_j$ are zero for $k_j < 4$. This approach leads to the recursive algorithm illustrated in Table 1[42]. At each step we multiply the current list by 2 and concatenate 1 plus the list. We are primarily interested in bit reversals because many FFT algorithms leave the result bit reversed or require input i...

2 |
A better FFT bit-reversal algorithm without tables
- Yong
- 1991
Citation Context ...le identities of an integer and its bit reversal to do four swaps of elements for each index computation. A similar modification of Cooley's algorithm results in a similar performance improvement. Yong[43] used a similar trick to improve Cooley's algorithm by exchanging pairs of elements. This code runs a bit faster than Rutkowska's because it does less integer work in the inner loop. Although we normall...

1 |
Bit Reversal in FFT from Matrix Viewpoint
- Biswas
- 1991
Citation Context ...nly consider the subdiagonal part of this matrix. The observation allowed Walker to reduce the size of the index vector by a factor of two when $\log_2 N$ is odd and reduce the run time marginally. Biswas[4] presents an algorithm virtually identical to that of Walker, even including a matrix showing the symmetry of the transformation. Vesely[40] studied the properties of mixed radix bit reversals and con...

1 |
Bit-reversal and Generalized Sorting of
- Fraser
- 1985
Citation Context ...riginally cited[29]. Unfortunately, I no longer have access to the machines used for the extensive measurements. Instead, I'll report performance numbers on an HP-755 workstation. I stumbled on a paper[13] describing a scheme that is very similar to the hybrid method described in § 5.6. It was designed for out-of-core problems, which is not unlike the out-of-cache problem described in this paper. Howeve...

1 |
Bit Reversal Algorithms
- IEEE
Citation Context ...us difficulty when reading from a disk, but would be awkward to incorporate in a program working on memory resident data. Table 14 Index vector by recursion. 0 0 1 0 4 2 0 4 2 6 3 0 4 2 6 1 5 3 7 Elster[17] presented an interesting derivation of an algorithm that is identical to cvl143[39]. The bit reversed index is written as r n(k) =c k2 t;q , where n =2 r and 1 q < t. The c k can be computed recursiv...

1 |
Communication Channel Utilization for Matrix Transposition and Related Permutations on Boolean Cubes
- Optimal
- 1991
Citation Context ...prehensive review of the literature in the last 15-20 years. (Rosel's[30] survey is more limited in scope than this one.) While there is a good deal of interest in bit reversals on parallel processors[9, 15, 19, 20], this paper considers only those written for uniprocessors. I have collected 30 bit reversal algorithms from the literature and from a request made over na-net. Each has been coded into a uniform sty...

1 |
Another Fast Algorithm for Bit Reversal
- Khan
- 1992
Citation Context ... fully symmetric version used here is similar to Evans's algorithm for odd n except that the middle section is wider than one bit and an additional reversal stage is needed to process these bits. Khan[21] formulates the bit reversal in a manner similar to that of Rutkowska[32], splitting the indices into a top and a bottom half of the even and odd values. He breaks up the bit reversal into groups of e...

1 |
Private communication
- Middleditch
- 1992
Citation Context ...6 57.2 61.5 61.9 68.7 74.1 75.7 r-rodr[31] 22.8 23.0 27.5 29.5 30.3 33.3 36.1 37.2 brigham[5] 238.0 254.5 274.1 290.4 304.0 323.8 342.1 357.0 duhamel[8] 24.5 24.6 38.9 38.6 39.3 53.9 54.3 55.9 middled[24] 30.4 31.1 41.2 43.3 44.4 53.5 58.5 62.0 cvl143[39] 29.8 40.5 55.0 54.7 55.5 78.5 83.8 86.2 gather 15.4 22.0 40.5 42.1 41.8 59.9 64.0 67.2 scatter[23] 16.3 24.2 44.8 46.6 46.6 67.6 71.5 74.2 gatscat[1...

1 |
Fast Bit-reversal Algorithms Based on
- Orchard
- 1992
Citation Context ...m[17]. The elements to be swapped are just $x(r(i)+j)$ and $x(i + r(j))$ for $0 \le j < i$. Only $n/2$ stages are needed to complete the bit reversal for an array with $2^n$ elements. The bit reversal proposed by Orchard[25] is similar to the methods that step through the index values. However, instead of stepping through the integers by counting, these methods use the properties of Galois fields to step through the intege...

1 |
An Improved FFT Digit-reversal Algorithm
- Rodriguez
- 1989
Citation Context ...s found during a less than thorough literature search. In addition, I have added a citation to a version published in an archival publication in addition to the conference proceedings originally cited[29]. Unfortunately, I no longer have access to the machines used for the extensive measurements. Instead, I'll report performance numbers on an HP-755 workstation. I stumbled on a paper[13] describing a s...

1 |
Timing of Some Bit Reversal Algorithms
- Rosel
- 1989
Citation Context ...erest in bit reversals. There have been nearly 20 publications in the last several years on this subject. This study is the first comprehensive review of the literature in the last 15-20 years. (Rosel's[30] survey is more limited in scope than this one.) While there is a good deal of interest in bit reversals on parallel processors[9, 15, 19, 20], this paper considers only those written for uniprocessor...

1 |
A Simple Algorithm for the Bit-Reversal
- Rutkowska
- 1991
Citation Context ... as the upper bound on the loop. He showed that the loop was made shorter by about $\sqrt{N}$. Unfortunately, the savings are small for arrays of the length considered in this study. Rutkowska[31] improved Rodriguez's idea by using some simple identities of an integer and its bit reversal to do four swaps of elements for each index computation. A similar modification of Cooley's algorithm resul...

1 |
Recursive Unscrambling Algorithms
- Fast
- 1991
Citation Context ...vl153[39] 78.1 84.4 90.2 98.2 232.6 285.7 317.8 366.3 405.9 434.3 kornlam[23] 65.9 69.1 73.0 77.7 143.5 567.5 633.5 707.2 782.6 848.4 unsbin[27] 21.6 22.0 23.7 23.8 26.2 57.2 75.8 91.1 96.0 97.1 perm1[32] 20.5 21.0 20.5 20.3 20.3 40.7 41.0 41.7 42.3 41.9 perm2[32] 28.6 28.6 29.0 29.0 29.3 44.6 54.0 63.9 70.8 92.1 unsone[27] 8.7 8.6 8.5 8.6 8.4 31.2 42.5 42.7 43.6 45.2 evans[10] 9.6 9.5 9.5 9.6 9.5 17....

1 |
Digit-index Permutation Algorithms for FFT Computations: An Applicative Approach
- Seguel, Bellman, et al.
Citation Context ... been computed, it saves arithmetic and time, as shown in Table 15. Elster's algorithm and that in Table 1 are both special cases of a more general formulation that comes from the idea of a tensor sum[33]. The tensor sum of two vectors w = u v is defined as w = [u + v0, u + v1, ..., u + vn]. The general form of the bit reversal index vector is defined as B k(V k 2 )= M k;1 2 j=0 j V2, where V2 = [0, 1]. Now,...

1 |
Fast Cell-Structured Algorithm for Digit Reversal of Arbitrary Length
- Vesely
- 1991
Citation Context ...27] 10.0 10.2 22.8 23.2 23.3 33.7 35.3 37.4 evans[10] 9.9 10.3 20.4 24.6 24.9 32.4 37.3 38.8 walker[41] 11.7 11.4 22.8 23.1 26.4 34.2 42.6 41.2 biswas[4] 12.3 14.1 26.0 28.4 28.5 26.8 42.9 42.6 vesely[40] 29.9 80.6 82.1 86.6 95.2 102.1 108.4 121.8 khan[21] 12.8 14.2 23.3 24.2 25.1 34.4 36.9 37.8 heller[14] 18.7 20.8 43.3 44.3 47.3 63.6 63.3 64.6 hybrid 10.0 10.4 20.6 22.3 22.2 22.0 22.5 22.4 Table 10 ...

1 | On the Modulo N Translators for the Prime Memory System, J.Parallel Distrib - Yoon, Lee, et al. - 1990 |