## The design and implementation of FFTW3 (2005)

Venue: | Proceedings of the IEEE |

Citations: | 407 - 3 self |

### BibTeX

@INPROCEEDINGS{Frigo05thedesign,

author = {Matteo Frigo and Steven and G. Johnson},

title = {The design and implementation of FFTW3},

booktitle = {Proceedings of the IEEE},

year = {2005},

pages = {216--231}

}

### Years of Citing Articles

### OpenURL

### Abstract

FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize performance. This paper shows that such an approach can yield an implementation that is competitive with hand-optimized libraries, and describes the software structure that makes our current FFTW3 version flexible and adaptive. We further discuss a new algorithm for real-data DFTs of prime size, a new way of implementing DFTs by means of machine-specific single-instruction, multiple-data (SIMD) instructions, and how a special-purpose compiler can derive optimized implementations of the discrete cosine and sine transforms automatically from a DFT algorithm. Keywords—Adaptive software, cosine transform, fast Fourier transform (FFT), Fourier transform, Hartley transform, I/O tensor.

### Citations

8550 |
Introduction to Algorithms
- Cormen, Leiserson, et al.
- 1990
(Show Context)
Citation Context ...nd the strides denote a matrix-transposition problem, FFTW creates a plan that transposes the array in place. FFTW implements the square transposition by means of the “cache-oblivious” algorithm from =-=[35]-=-, which is fast and, in theory, uses the cache optimally regardless of the cache size. A generalization of this idea is employed for nonsquare transpositions with a large common factor or a small diff... |

451 | FFTW: An adaptive software architecture for the FFT
- Frigo, Johnson
- 1998
(Show Context)
Citation Context ... VIII). We summarize in Section VI, while a full description appears in [2]. We have produced three major implementations of FFTW, each building on the experience of the previous system. FFTW1 (1997) =-=[3]-=- introduced the idea of generating codelets automatically and of letting a planner search for the best combination of codelets. FFTW2 (1998) incorporated a new version of [2]. did not change much in F... |

371 | J.J.: Automatically tuned linear algebra software - Whaley, Dongarra - 1998 |

344 |
Discrete Cosine Transform: Algorithms, Advantages, Applications
- Rao, Yip
- 1990
(Show Context)
Citation Context ...DHT, there exist a number of other useful transforms of real inputs to real outputs, namely, DFTs of real-symmetric (or anti-symmetric) data, otherwise known as DCTs and DSTs, types I–VIII [27], [51]–=-=[53]-=-. We collectively refer to these transforms as trigonometric transforms. Types I–IV are equivalent to ( double-length) DFTs of even size with the different possible half-sample shifts in the input and... |

318 | Computational Frameworks for the Fast Fourier Transform - LOAN - 1987 |

226 | Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology - Bilmes, Asanovic, et al. - 1997 |

154 | A fast Fourier transform compiler
- Frigo
- 1999
(Show Context)
Citation Context ...struction multiple-data (SIMD) instructions (Section IX). Similarly, automatically derives codelets for the DCT and DST (Section VIII). We summarize in Section VI, while a full description appears in =-=[2]-=-. We have produced three major implementations of FFTW, each building on the experience of the previous system. FFTW1 (1997) [3] introduced the idea of generating codelets automatically and of letting... |

141 | SPIRAL: Code generation for DSP transforms
- Püschel, Moura, et al.
(Show Context)
Citation Context ...cribes the main ideas common to all FFTW systems, the runtime structure of FFTW3, and the modifications to since FFTW2. Previous work on adaptive systems includes [3]–[11]. In particular, SPIRAL [9], =-=[10]-=- is another system focused on optimization of Fourier transforms and related algorithms, but it has distinct differences from FFTW. SPIRAL searches at compile time over a space of mathematically equiv... |

133 |
An algorithm for the machine computation of the complex fourier series
- Cooley, Tukey
- 1965
(Show Context)
Citation Context ...hms to compute the same result. The most important FFT (and the one primarily used in FFTW) is known as the “Cooley–Tukey” algorithm, after the two authors who rediscovered and popularized it in 1965 =-=[14]-=-, although it had been previously known as early as 1805 by Gauss as well as by later reinventors [15]. The basic idea behind this FFT is that a DFT of a composite size can be reexpressed in terms of ... |

132 |
Fast fourier transform and convolution algorithms
- Nussbaumer
- 1982
(Show Context)
Citation Context ...ution of a permutation of the input array with a fixed real sequence. This cyclic convolution can be computed by means of two real DFTs, in which case the algorithm takes time, or by any other method =-=[49]-=-. (FFTW computes convolutions via DFTs.) The output element , which is the sum of all input elements, cannot be computed via (5) and must be calculated separately. An adaptation of Bluestein’s prime-s... |

129 | FFTs in external or hierarchical memory
- Bailey
- 1990
(Show Context)
Citation Context ... that are adapted to the hardware, and often uses much larger radices (radix-32 is typical) than were once common. (On the other end of the scale, a “radix” of roughly has been called a four-step FFT =-=[18]-=-, and we have found that one step of such a radix can be useful for large sizes in FFTW; see Section IV-D1.) A key difficulty in implementing the Cooley–Tukey FFT is that the dimension corresponds to ... |

90 | Superoptimizer: a look at the smallest program - Massalin - 1987 |

82 | Spl: A language and compiler for dsp algorithms - Xiong, Johnson, et al. |

79 |
Fast algorithms for the discrete W transform and for the discrete Fourier transform
- Wang
- 1984
(Show Context)
Citation Context ... the DHT, there exist a number of other useful transforms of real inputs to real outputs, namely, DFTs of real-symmetric (or anti-symmetric) data, otherwise known as DCTs and DSTs, types I–VIII [27], =-=[51]-=-–[53]. We collectively refer to these transforms as trigonometric transforms. Types I–IV are equivalent to ( double-length) DFTs of even size with the different possible half-sample shifts in the inpu... |

76 |
Fast Fourier transforms: a tutorial review and a state of the art
- Duhamel, Vetterli
- 1990
(Show Context)
Citation Context ...ny different ways to implement the data reorderings of the transpositions, have led to numerous implementation strategies for the Cooley–Tukey FFT, with many variants distinguished by their own names =-=[16]-=-, [17]. FFTW implements a space of many such variants, as described later, but here we derive the basic algorithm, identify its key features, and outline some important historical variations and their... |

75 |
On computing the discrete Fourier transform
- Winograd
- 1976
(Show Context)
Citation Context ...ection VI) and the latter two for general prime . A new generalization of Rader’s algorithm for prime-size real-data transforms is also discussed in Section VII. FFTW does not employ the Winograd FFT =-=[30]-=-, which minimizes the number of multiplications at the expense of a large number of additions. (This tradeoff is not beneficial on current processors that have specialized hardware multipliers.) III. ... |

74 |
Discrete Hartley transform
- Bracewell
- 1983
(Show Context)
Citation Context ...escribes this algorithm, which to our knowledge has not been published before. The algorithm first reduces the real DFT to the discrete Hartley transform (DHT) by means of the well-known reduction of =-=[48]-=-, and then it executes a DHT variant of Rader’s algorithm. The DHT was originally proposed by [48] as a faster alternative to the real DFT, but [45] argued that a well-implemented real DFT is always m... |

72 | Spiral: A generator for platform-adapted libraries of signal processing algorithms
- Puschel, Singer, et al.
(Show Context)
Citation Context ...r describes the main ideas common to all FFTW systems, the runtime structure of FFTW3, and the modifications to since FFTW2. Previous work on adaptive systems includes [3]–[11]. In particular, SPIRAL =-=[9]-=-, [10] is another system focused on optimization of Fourier transforms and related algorithms, but it has distinct differences from FFTW. SPIRAL searches at compile time over a space of mathematically... |

62 |
Discrete Fourier transforms when the number of data samples is prime
- Rader
- 1968
(Show Context)
Citation Context ...Section V-C. Finally, we should mention that there are many FFTs entirely distinct from Cooley–Tukey. Three notable such algorithms are the prime-factor algorithm for [27, p. 619], along with Rader’s =-=[28]-=- and Bluestein’s [27], [29] algorithms for prime . FFTW implements the first two in its codelet generator for hard-coded (Section VI) and the latter two for general prime . A new generalization of Rad... |

50 |
Realvalued fast Fourier transform algorithms
- Sorensen, Jones, et al.
- 1987
(Show Context)
Citation Context ...ensional arrays as well.) By exploiting this symmetry, one can save roughly a factor of two in storage and, by eliminating redundant operations within the FFT, roughly a factor of two in time as well =-=[45]-=-. The implementation of real-data DFTs in FFTW parallels that of complex DFTs discussed in Section IV. For direct plans, we use optimized codelets generated by , which automatically derives specialize... |

45 |
Symmetric convolution and the discrete sine and cosine transforms
- Martucci
- 1994
(Show Context)
Citation Context ...o these transforms as trigonometric transforms. Types I–IV are equivalent to ( double-length) DFTs of even size with the different possible half-sample shifts in the input and/or output. Types V–VIII =-=[52]-=- are similar, except that their “logical” DFTs are of odd size; these four types seem to see little practical use, so we do not implement them. (In order to make the transforms unitary, additional fac... |

44 |
A linear filtering approach to the computation of the Discrete Fourier Transform
- Bluestein
- 1970
(Show Context)
Citation Context ...ould mention that there are many FFTs entirely distinct from Cooley–Tukey. Three notable such algorithms are the prime-factor algorithm for [27, p. 619], along with Rader’s [28] and Bluestein’s [27], =-=[29]-=- algorithms for prime . FFTW implements the first two in its codelet generator for hard-coded (Section VI) and the latter two for general prime . A new generalization of Rader’s algorithm for prime-si... |

43 | Gauss and the history of the fast fourier transform
- Heideman, Johnson, et al.
- 1984
(Show Context)
Citation Context ...s the “Cooley–Tukey” algorithm, after the two authors who rediscovered and popularized it in 1965 [14], although it had been previously known as early as 1805 by Gauss as well as by later reinventors =-=[15]-=-. The basic idea behind this FFT is that a DFT of a composite size can be reexpressed in terms of smaller DFTs of sizes and —essentially, as a 2-D DFT of size where the output is transposed. The choic... |

41 |
The MD5 Message-Digest Algorithm,” Network Working Group RFC
- Rivest
- 1992
(Show Context)
Citation Context ... first place. When the hash of a problem matches a hash key in the table, the planner invokes the corresponding solver to obtain a plan. For hashing, we use the cryptographically strong MD5 algorithm =-=[41]-=-. In the extremely unlikely event of a hash collision, the planner would still return a valid plan, because the solver returned by the table lookup would either construct a valid plan or fail, and in ... |

37 |
An Algorithm for Computing the Mixed Radix Fast Fourier Transform
- Singleton
- 1969
(Show Context)
Citation Context ...egrated Performance Primitives, Signal Processing, v. 3.0 on the Pentium IV; numerical recipes, the C routine from [31]; ooura, a free code by T. Ooura (C and Fortran, 2001); singleton, a Fortran FFT =-=[32]-=-; sorensen, a split-radix FFT [33]; takahashi, the FFTE Fig. 3. Comparison of single-precision 1-D complex DFTs, power-of-two sizes, on a 2.8-GHz Pentium IV. Compiler and flags as in Fig. 1. Note that... |

33 |
A fast cosine transform in one and two dimensions
- Makhoul
- 1980
(Show Context)
Citation Context ...ile implementing ideas similar to those described in this section [54]. B. General Trigonometric Transforms Type II and III trigonometric transforms of length are computed using a trick from [22] and =-=[55]-=- to re-express them in terms of a size- real-input DFT. Types I and IV are more difficult, because we have observed that convenient algorithms to embed them in an equal-length real-input DFT have poor... |

31 |
Vectorizing the FFTs,” in Parallel Computations
- Swarztrauber
- 1982
(Show Context)
Citation Context ... autosort FFT [17], [21], which transforms back and forth between two arrays with each butterfly, transposing one digit each time, and was popular to improve contiguity of access for vector computers =-=[22]-=-. Alternatively, an explicitly recursive style, as in FFTW, performs the digit-reversal implicitly at the “leaves” of its computation when operating out-ofplace (Section IV-D1). To operate in-place wi... |

30 |
On computing the split-radix FFT
- Sorensen, Heideman, et al.
- 1986
(Show Context)
Citation Context ...gnal Processing, v. 3.0 on the Pentium IV; numerical recipes, the C routine from [31]; ooura, a free code by T. Ooura (C and Fortran, 2001); singleton, a Fortran FFT [32]; sorensen, a split-radix FFT =-=[33]-=-; takahashi, the FFTE Fig. 3. Comparison of single-precision 1-D complex DFTs, power-of-two sizes, on a 2.8-GHz Pentium IV. Compiler and flags as in Fig. 1. Note that fftpack, which was originally des... |

22 |
Fast Fourier transforms—for fun and profit
- Gentleman, Sande
- 1966
(Show Context)
Citation Context ...s or any similar prediction.) Technically, the asymptotically optimal “cache-oblivious” recursive algorithm would use a radix of for a transform of size , analogous to the “four-step” algorithm [18], =-=[38]-=-, but we have found that a bounded radix generally works better in practice, except for at most a single step of radix- . A depth-first style is also used for the multidimensional plans of Section IV-... |

17 |
Direct methods for computing discrete sinusoidal transforms
- Chan, Ho
- 1990
(Show Context)
Citation Context ...ause we have observed that convenient algorithms to embed them in an equal-length real-input DFT have poor numerical properties: the type-I algorithm from [22] and [31] and the type-IV algorithm from =-=[56]-=- both have (root mean square) relative errors that seem to grow as . We have not performed a detailed error analysis, but we believe the problem is due to the fact that both of these methods multiply ... |

16 | Accuracy of the discrete Fourier transform and the fast Fourier transform
- Schatzman
- 1996
(Show Context)
Citation Context ... an equal-length real-input DFT; thus, we resort to the “slow” method of embedding it in a real-input DFT of length . All of our methods are observed to achieve the same error as the Cooley–Tukey FFT =-=[59]-=-. One can also compute symmetric DFTs by directly specializing the Cooley–Tukey algorithm, removing redundant operations as we did for real inputs, to decompose the transform into smaller symmetric tr... |

15 | A framework for generating distributedmemory parallel programs for block recursive algorithms - Gupta, Huang, et al. - 1996 |

14 |
Self-sorting in-place fast fourier transforms
- Temperton
- 1991
(Show Context)
Citation Context ... [40]. As an optimization, we include DIF-transpose codelets that combine the radix- DIF twiddle codelet (in a loop of length ) with the transpose, for . (DIF-transpose is to DIF transpose roughly as =-=[24]-=- is to [25].) Another common special case is where , in which a size- direct plan (Section IV-C3a), not a DIF codelet, is required (the twiddle factors are unity), and the transposes are performed at ... |

14 |
A fast fourier transform algorithm for real-valued series
- Bergland
- 1968
(Show Context)
Citation Context ...n VI). For Cooley–Tukey plans, we use a mixed-radix generalization of [45], which works by eliminating the redundant computations in a standard Cooley–Tukey algorithm applied to real data [22], [46], =-=[47]-=-. When the transform length is a prime number, FFTW uses an adaptation of Rader’s algorithm [28] that reduces the storage and time requirements roughly by a factor of two with respect to the complex c... |

14 | Architecture independent short vector FFTs
- Franchetti, Karner, et al.
- 2001
(Show Context)
Citation Context ...t it is machine specific: it outputs assembly code, exploiting the peculiarities of the target instruction set. The second scheme relies on an abstraction layer consisting of C macros in the style of =-=[60]-=-, and it is, therefore, semiportable (the C compiler must support SIMD extensions in order for this scheme to work). To understand this SIMD scheme, consider first a machine with length-2 vectors, suc... |

13 | Bit reversal on uniprocessors
- Karp
- 1996
(Show Context)
Citation Context ...n lies in how the digit reversal is performed. The classic approach is a single, separate digit-reversal pass following or preceding the arithmetic computations. Although this pass requires only time =-=[20]-=-, it can still be nonnegligible, especially if the data is out of cache; moreover, it neglects the possibility that data reordering during the transform may improve memory locality. Perhaps the oldest... |

13 |
High speed convolution and correlation
- Stockham
- 1966
(Show Context)
Citation Context ...he data is out of cache; moreover, it neglects the possibility that data reordering during the transform may improve memory locality. Perhaps the oldest alternative is the Stockham autosort FFT [17], =-=[21]-=-, which transforms back and forth between two arrays with each butterfly, transposing one digit each time, and was popular to improve contiguity of access for vector computers [22]. Alternatively, an ... |

12 | Efficient utilization of SIMD extensions
- Franchetti, Kral, et al.
(Show Context)
Citation Context ...FFTW’s machine-independent codelets are no slower than machine-specific codelets generated by SPIRAL [43, Fig. 3]. In the stock implementation, the schedule is finally unparsed to C. A variation from =-=[44]-=- implements the rest of a compiler backend and outputs assembly code. VII. REAL-DATA TRANSFORMS In this section, we briefly outline how FFTW computes DFTs of real data (a real DFT), and we give a new ... |

11 |
On computing the fast fourier transform
- Singleton
- 1967
(Show Context)
Citation Context ...ee is traversed “depth-first”—one size- transform is solved completely before processing the other one, and so on. However, most traditional FFT implementations are nonrecursive (with rare exceptions =-=[19]-=-) and traverse the tree “breadth-first” [17]—in the radix-2 example, they would perform (trivial) size-1 transforms, then combinations into size-2 transforms, then combinations into size-4 transforms,... |

10 | Learning to construct fast signal processing implementations
- Singer, Veloso
- 2002
(Show Context)
Citation Context ... are machine independent. FFTW’s search uses dynamic programming [12, Ch. 16], while the SPIRAL project has experimented with a wider range of search strategies, including machine-learning techniques =-=[13]-=-. The remainder of this paper is organized as follows. We begin with a general overview of fast Fourier transforms (FFTs) in Section II. Then, in Section III, we compare the performance of FFTW and ot... |

10 |
Fast Mixed-Radix Real Fourier Transforms
- Temperton
- 1983
(Show Context)
Citation Context ...Section VI). For Cooley–Tukey plans, we use a mixed-radix generalization of [45], which works by eliminating the redundant computations in a standard Cooley–Tukey algorithm applied to real data [22], =-=[46]-=-, [47]. When the transform length is a prime number, FFTW uses an adaptation of Rader’s algorithm [28] that reduces the storage and time requirements roughly by a factor of two with respect to the com... |

9 | A self-sorting in-place fast fourier transform algorithm suitable for vector and parallel processing
- Hegland
- 1994
(Show Context)
Citation Context ...citly at the “leaves” of its computation when operating out-ofplace (Section IV-D1). To operate in-place with scratch storage, one can interleave small matrix transpositions with the butterflies [23]–=-=[26]-=-, and a related strategy in FFTW is described in Section IV-D3. FFTW can also perform intermediate reorderings that blend its in-place and out-of-place strategies, as described in Section V-C. Finally... |

7 |
Decimation-in-Time-Frequency FFT Algorithm
- Saidi
- 1994
(Show Context)
Citation Context ...i.e., ). Related strategies for in-place transforms based on small transposes were described in [23]–[26]; alternating DIT/DIF, without concern for in-place operation, was also considered in [39] and =-=[40]-=-. As an optimization, we include DIF-transpose codelets that combine the radix- DIF twiddle codelet (in a loop of length ) with the transpose, for . (DIF-transpose is to DIF transpose roughly as [24] ... |

7 | Code generators for automatic tuning of numerical kernels: experiences with FFTW
- Vuduc, Demmel
- 2000
(Show Context)
Citation Context ...T I and II in this way. Vuduc and Demmel independently rediscovered that could derive trigonometric transforms from the complex DFT while implementing ideas similar to those described in this section =-=[54]-=-. B. General Trigonometric Transforms Type II and III trigonometric transforms of length are computed using a trick from [22] and [55] to re-express them in terms of a size- real-input DFT. Types I an... |

6 |
On computing the discrete Fourier and cosine transforms
- Wang
- 1985
(Show Context)
Citation Context ...iddle factor), with a resulting loss of relative precision near the cosine zero. Instead, to compute a type-IV trigonometric transform, we use one of two algorithms: for even , we use the method from =-=[57]-=- to express it as pair of type-III problems of size , which are solved as above; for odd , we use a method from [58] to re-express the type-IV problem as a size- real-input DFT (with a complicated rei... |

5 | Portable High Performance Programming via ArchitectureCognizant Divide-and-Conquer Algorithms
- Gatlin
- 2000
(Show Context)
Citation Context ...er space of plans. This paper describes the main ideas common to all FFTW systems, the runtime structure of FFTW3, and the modifications to since FFTW2. Previous work on adaptive systems includes [3]–=-=[11]-=-. In particular, SPIRAL [9], [10] is another system focused on optimization of Fourier transforms and related algorithms, but it has distinct differences from FFTW. SPIRAL searches at compile time ove... |

5 | An in-place in-order radix-2 t - Johnson, Burrus - 1984 |

4 |
An improved fast fourier transform algorithm using mixed frequency and time decimations
- Nakayama
- 1988
(Show Context)
Citation Context ... square, i.e., ). Related strategies for in-place transforms based on small transposes were described in [23]–[26]; alternating DIT/DIF, without concern for in-place operation, was also considered in =-=[39]-=- and [40]. As an optimization, we include DIF-transpose codelets that combine the radix- DIF twiddle codelet (in a loop of length ) with the transpose, for . (DIF-transpose is to DIF transpose roughly... |

4 |
Fast algorithms for computing the discrete cosine transform
- Chan, Ho
- 1992
(Show Context)
Citation Context ...nometric transform, we use one of two algorithms: for even , we use the method from [57] to express it as pair of type-III problems of size , which are solved as above; for odd , we use a method from =-=[58]-=- to re-express the type-IV problem as a size- real-input DFT (with a complicated reindexing that requires no twiddle factors at all). For the type-I DCT/DST, however, we could not find any accurate al... |

3 |
Self-sorting in-place FFT algorithm with minimum working space
- Qian, Lu, et al.
- 1994
(Show Context)
Citation Context ...n optimization, we include DIF-transpose codelets that combine the radix- DIF twiddle codelet (in a loop of length ) with the transpose, for . (DIF-transpose is to DIF transpose roughly as [24] is to =-=[25]-=-.) Another common special case is where , in which a size- direct plan (Section IV-C3a), not a DIF codelet, is required (the twiddle factors are unity), and the transposes are performed at the leaves ... |