## FFTW: An Adaptive Software Architecture For The FFT (1998)

Citations: 451 (4 self)

### BibTeX

@INPROCEEDINGS{Frigo98fftw:an,

author = {Matteo Frigo and Steven G. Johnson},

title = {FFTW: An Adaptive Software Architecture For The FFT},

booktitle = {},

year = {1998},

pages = {1381--1384},

publisher = {IEEE}

}

### Abstract

FFT literature has been mostly concerned with minimizing the number of floating-point operations performed by an algorithm. Unfortunately, on present-day microprocessors this measure is far less important than it used to be, and interactions with the processor pipeline and the memory hierarchy have a larger impact on performance. Consequently, one must know the details of a computer architecture in order to design a fast algorithm. In this paper, we propose an adaptive FFT program that tunes the computation automatically for any particular hardware. We compared our program, called FFTW, with over 40 implementations of the FFT on 7 machines. Our tests show that FFTW's self-optimizing approach usually yields significantly better performance than all other publicly available software. FFTW also compares favorably with machine-specific, vendor-optimized libraries.

### Citations

8557 |
Introduction to Algorithms
- Cormen, Leiserson, et al.
- 2001
(Show Context)
Citation Context ...losion of the number of plans. Instead, the planner uses a dynamic-programming algorithm [4, chapter 16] to prune the search space. In order to use dynamic-programming, we assumed optimal substructures [4]: if an optimal plan for a size N is known, this plan is still optimal when size N is used as a subproblem of a larger transform. This assumption is in principle false because of the different states ... |
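
The planning idea quoted above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not FFTW's planner: the names `RADICES`, `MAX_LEAF`, `dft_direct`, `execute`, `measure`, `best_plan`, and `plan_size` are hypothetical, a plan is modeled as a plain tuple of radices ending in a leaf size, and a naive O(n^2) DFT stands in for the generated codelets. The dynamic-programming step is the memo lookup: once a best plan for a size is recorded, it is reused for every larger transform that contains that size as a subproblem, which is exactly the optimal-substructure assumption.

```python
import cmath
import math
import time

RADICES = (2, 3, 4, 5, 8)   # hypothetical radix choices tried by the planner
MAX_LEAF = 16               # hypothetical cutoff below which a direct DFT is also a candidate


def dft_direct(x):
    """Naive O(n^2) DFT, standing in for a hand-tuned leaf routine."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def execute(plan, x):
    """Run a plan: a tuple of radices whose last entry is the leaf size."""
    n = len(x)
    if len(plan) == 1:
        return dft_direct(x)
    r, m = plan[0], n // plan[0]
    # Decimation in time: r sub-transforms of size m on strided inputs.
    subs = [execute(plan[1:], x[s::r]) for s in range(r)]
    out = [0j] * n
    for k in range(m):
        # Apply twiddle factors, then combine with a size-r DFT.
        t = [subs[s][k] * cmath.exp(-2j * cmath.pi * s * k / n) for s in range(r)]
        for q in range(r):
            out[k + q * m] = sum(t[s] * cmath.exp(-2j * cmath.pi * s * q / r)
                                 for s in range(r))
    return out


def measure(plan, n, repeats=3):
    """Best-of-`repeats` wall-clock time of running `plan` on a size-n input."""
    x = [complex(i, 0) for i in range(n)]
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        execute(plan, x)
        best = min(best, time.perf_counter() - t0)
    return best


def best_plan(n, memo):
    """Pick the fastest measured plan for size n, reusing memoized sub-plans."""
    if n in memo:
        return memo[n]
    # Optimal-substructure assumption: the sub-plan for n // r is taken from
    # the memo as-is instead of being searched again in this context.
    candidates = [(r,) + best_plan(n // r, memo)
                  for r in RADICES if n % r == 0 and n // r > 1]
    if n <= MAX_LEAF or not candidates:
        candidates.append((n,))
    memo[n] = min(candidates, key=lambda p: measure(p, n))
    return memo[n]


def plan_size(plan):
    """Transform size a plan computes: the product of its radices and leaf."""
    return math.prod(plan)
```

Calling `best_plan(48, {})` returns some factorization of 48 such as a tuple of radices followed by a leaf size; which one wins depends on the machine and on timing noise, so no particular result should be expected.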

535 | Cilk: An efficient multithreaded runtime system
- Blumofe, Joerg, et al.
- 1996
(Show Context)
Citation Context ...dy enjoys many hundreds of users. FFTW performs one- and multidimensional transforms, and it is not restricted to input sizes that are powers of 2. A parallel version of the executor, written in Cilk [7], also exists. The rest of the paper is organized as follows. In Section 2 we outline the runtime structure of FFTW, consisting of the executor and the planner. In Section 3 we briefly describe the co... |

228 |
Numerical Recipes in FORTRAN, the art of scientific computing, 2nd. edition
- Press, Teukolsky, et al.
- 1992
(Show Context)
Citation Context ...an split-radix FFT by Sorensen [12]; a Fortran FFT by Singleton [13]; Temperton's Fortran GPFA code [14]; Bailey's "4-step" FFT implementation [15]; Sitton's QFT code [16]; and the four1 routine from [17] (NRF). We get similar numbers on other machines. For example, on an IBM RS/6000, FFTW ranges from 55% faster than IBM's ESSL library for N = 64, to 12% slower for N = 16384, to again 7% faster for N ... |

133 |
An algorithm for the machine computation of the complex Fourier series
- Cooley, Tukey
- 1965
(Show Context)
Citation Context ...re the program itself adapts the computation to the details of the hardware. We developed FFTW, an adaptive, high performance implementation of the Cooley-Tukey fast Fourier transform (FFT) algorithm [3], written in C. We have compared many C and Fortran implementations of the DFT on several machines, and our experiments show that FFTW typically yields significantly better performance than all other ... |

104 | An analysis of dag-consistent distributed shared-memory algorithms
- Blumofe, Frigo, et al.
- 1996
(Show Context)
Citation Context ...rast with the traditional loop-based implementations [1, page 608]. We chose an explicitly recursive implementation because of theoretical evidence that divide-and-conquer algorithms improve locality [8]. For example, as soon as a subproblem fits into the cache, no further cache misses are needed in order to solve that subproblem. ... |
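
To make the contrast with loop-based implementations concrete, here is a minimal sketch of an explicitly recursive radix-2 Cooley-Tukey transform in Python. It is an illustration only, not FFTW's executor: the function name `fft_recursive` is hypothetical, and the sketch assumes the input length is a power of two. Each call works on a subproblem half the size of its parent, so below some recursion depth the whole subproblem is small enough to stay in cache, which is the locality argument made in the context above.

```python
import cmath


def fft_recursive(x):
    """Recursive radix-2 decimation-in-time FFT (length must be a power of 2)."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of 2")
    # Divide: two independent half-size transforms on even and odd samples.
    even = fft_recursive(x[0::2])
    odd = fft_recursive(x[1::2])
    half = n // 2
    out = [0j] * n
    # Conquer: one butterfly pass combines the two halves.
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out
```

For example, `fft_recursive([1, 0, 0, 0])` yields four values equal to 1 (up to floating-point rounding), since the transform of a unit impulse is constant.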

76 |
Fast Fourier transforms: a tutorial review and a state of the art
- Duhamel, Vetterli
- 1990
(Show Context)
Citation Context ...achine-specific, vendor-optimized libraries. 1. INTRODUCTION The discrete Fourier transform (DFT) is an important tool in many branches of science and engineering [1] and has been studied extensively [2]. For many practical applications, it is important to have an implementation of the DFT that is as fast as possible. In the past, speed was the direct consequence of clever algorithms [2] that minimiz... |

62 |
Discrete Fourier transforms when the number of data samples is prime
- Rader
- 1968
(Show Context)
Citation Context ...ding Cooley-Tukey (in the form presented in [1, page 611]), a prime factor algorithm (as described in [1, page 619]), a split-radix algorithm [2], and Rader's algorithm for transforms of prime length [9]. Our first implementation of the Cooley-Tukey AST generator consisted of 60 lines of Caml code. The prime factor and split-radix algorithms were added using about 20 additional lines of code each. (T... |

37 |
An Algorithm for Computing the Mixed Radix Fast Fourier Transform
- Singleton
- 1969
(Show Context)
Citation Context ...ublic-domain code by T. Ooura (Fortran, 1996), J. Green (C, 1996), and R. H. Krukar (C, 1990); the Fortran FFTPACK library [11]; a Fortran split-radix FFT by Sorensen [12]; a Fortran FFT by Singleton [13]; Temperton's Fortran GPFA code [14]; Bailey's "4-step" FFT implementation [15]; Sitton's QFT code [16]; and the four1 routine from [17] (NRF). We get similar numbers on other machines. For example, o... |

35 | Computational Frameworks for the Fast Fourier Transform - Van Loan - 1992 |

34 | A high-performance FFT algorithm for vector supercomputers
- Bailey
- 1993
(Show Context)
Citation Context ...ukar (C, 1990); the Fortran FFTPACK library [11]; a Fortran split-radix FFT by Sorensen [12]; a Fortran FFT by Singleton [13]; Temperton's Fortran GPFA code [14]; Bailey's "4-step" FFT implementation [15]; Sitton's QFT code [16]; and the four1 routine from [17] (NRF). We get similar numbers on other machines. For example, on an IBM RS/6000, FFTW ranges from 55% faster than IBM's ESSL library for N = 6... |

31 |
Vectorizing the FFTs, in Parallel Computations
- Swarztrauber
- 1982
(Show Context)
Citation Context ...erformance. They include the Sun Performance Library version 1.2 (SUNPERF); public-domain code by T. Ooura (Fortran, 1996), J. Green (C, 1996), and R. H. Krukar (C, 1990); the Fortran FFTPACK library [11]; a Fortran split-radix FFT by Sorensen [12]; a Fortran FFT by Singleton [13]; Temperton’s Fortran GPFA code [14]; Bailey’s “4-step” FFT implementation [15]; Sitton’s QFT code [16]; and the four1 routi... |

30 |
On computing the split-radix FFT
- Sorensen, Heideman, et al.
- 1986
(Show Context)
Citation Context ... Library version 1.2 (SUNPERF); public-domain code by T. Ooura (Fortran, 1996), J. Green (C, 1996), and R. H. Krukar (C, 1990); the Fortran FFTPACK library [11]; a Fortran split-radix FFT by Sorensen [12]; a Fortran FFT by Singleton [13]; Temperton's Fortran GPFA code [14]; Bailey's "4-step" FFT implementation [15]; Sitton's QFT code [16]; and the four1 routine from [17] (NRF). We get similar numbers ... |

25 |
The Design of Optimal DFT Algorithms Using Dynamic Programming
- Johnson, Burrus
- 1983
(Show Context)
Citation Context ...f FFT programs include [20], which describes the generation of FFT programs for prime sizes. [21] presents a generator of Pascal programs implementing a prime factor FFT algorithm. Johnson and Burrus [22] applied dynamic programming to the design of optimal DFT modules. These systems all try to minimize the arithmetic complexity of the transform rather than its execution time. Adaptive techniques such... |

15 | A framework for generating distributed-memory parallel programs for block recursive algorithms
- Gupta, Huang, et al.
- 1996
(Show Context)
Citation Context ...x that manually optimizing software is difficult to the point of impracticality. Our FFTW system is a method of dealing with such complexity. Similar ideas have been incorporated by other researchers [18] into an interesting system called EXTENT which uses a tensor product framework to synthesize Fortran FFTs for multiprocessors. Like FFTW, EXTENT generates code optimized for speed, but unlike FFTW, t... |

13 | Automatic generation of prime length FFT programs
- Selesnick, Burrus
- 1996
(Show Context)
Citation Context ...rm size. The idea of using ML as a metalanguage for generating C applications first appeared, to the best of our knowledge, in [19]. Other automatic systems for the generation of FFT programs include [20], which describes the generation of FFT programs for prime sizes. [21] presents a generator of Pascal programs implementing a prime factor FFT algorithm. Johnson and Burrus [22] applied dynamic progra... |

9 | The quick discrete Fourier transform
- Guo, Sitton, et al.
- 1982
(Show Context)
Citation Context ...ran FFTPACK library [11]; a Fortran split-radix FFT by Sorensen [12]; a Fortran FFT by Singleton [13]; Temperton's Fortran GPFA code [14]; Bailey's "4-step" FFT implementation [15]; Sitton's QFT code [16]; and the four1 routine from [17] (NRF). We get similar numbers on other machines. For example, on an IBM RS/6000, FFTW ranges from 55% faster than IBM's ESSL library for N = 64, to 12% slower for N =... |

7 |
A prime factor FFT algorithm implementation using a program generation technique
- Perez, Takaoka
- 1987
(Show Context)
Citation Context ...cations first appeared, to the best of our knowledge, in [19]. Other automatic systems for the generation of FFT programs include [20], which describes the generation of FFT programs for prime sizes. [21] presents a generator of Pascal programs implementing a prime factor FFT algorithm. Johnson and Burrus [22] applied dynamic programming to the design of optimal DFT modules. These systems all try to m... |

4 |
A generalized prime factor FFT algorithm for any n = 2^p 3^q 5^r
- Temperton
- 1992
(Show Context)
Citation Context ...an, 1996), J. Green (C, 1996), and R. H. Krukar (C, 1990); the Fortran FFTPACK library [11]; a Fortran split-radix FFT by Sorensen [12]; a Fortran FFT by Singleton [13]; Temperton's Fortran GPFA code [14]; Bailey's "4-step" FFT implementation [15]; Sitton's QFT code [16]; and the four1 routine from [17] (NRF). We get similar numbers on other machines. For example, on an IBM RS/6000, FFTW ranges from 5... |

3 |
The Caml Light system release 0.71. Institut National de Recherche en Informatique et en Automatique (INRIA)
- Leroy
- 1996
(Show Context)
Citation Context ...reason, we found it convenient to generate the codelets automatically by means of a special-purpose compiler. FFTW's codelet generator, written in the Caml Light dialect of the functional language ML [5], is a sophisticated program that first produces a representation of the codelet in the form of abstract C syntax tree, and then "optimizes" the codelet by applying well known transformations such as ... |

1 |
Standard ML as a meta-programming language. Unpublished technical report, available from s-kamin@uiuc.edu
- Kamin
- 1996
(Show Context)
Citation Context ... speed, but unlike FFTW, the generated program only works for one transform size. The idea of using ML as a metalanguage for generating C applications first appeared, to the best of our knowledge, in [19]. Other automatic systems for the generation of FFT programs include [20], which describes the generation of FFT programs for prime sizes. [21] presents a generator of Pascal programs implementing a p... |