Results 1  10
of
24
FFTW: An Adaptive Software Architecture For The FFT
, 1998
"... FFT literature has been mostly concerned with minimizing the number of floatingpoint operations performed by an algorithm. Unfortunately, on presentday microprocessors this measure is far less important than it used to be, and interactions with the processor pipeline and the memory hierarchy have ..."
Abstract

Cited by 444 (4 self)
 Add to MetaCart
FFT literature has been mostly concerned with minimizing the number of floatingpoint operations performed by an algorithm. Unfortunately, on presentday microprocessors this measure is far less important than it used to be, and interactions with the processor pipeline and the memory hierarchy have a larger impact on performance. Consequently, one must know the details of a computer architecture in order to design a fast algorithm. In this paper, we propose an adaptive FFT program that tunes the computation automatically for any particular hardware. We compared our program, called FFTW, with over 40 implementations of the FFT on 7 machines. Our tests show that FFTW's selfoptimizing approach usually yields significantly better performance than all other publicly available software. FFTW also compares favorably with machinespecific, vendoroptimized libraries. 1. INTRODUCTION The discrete Fourier transform (DFT) is an important tool in many branches of science and engineering [1] and...
The design and implementation of FFTW3
 Proceedings of the IEEE
, 2005
"... FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize performance. This paper shows that such an approach can yield an implementation that is competitive with handoptimized libraries, and describes the software structure that makes our cu ..."
Abstract

Cited by 396 (6 self)
 Add to MetaCart
FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize performance. This paper shows that such an approach can yield an implementation that is competitive with handoptimized libraries, and describes the software structure that makes our current FFTW3 version flexible and adaptive. We further discuss a new algorithm for realdata DFTs of prime size, a new way of implementing DFTs by means of machinespecific singleinstruction, multipledata (SIMD) instructions, and how a specialpurpose compiler can derive optimized implementations of the discrete cosine and sine transforms automatically from a DFT algorithm. Keywords—Adaptive software, cosine transform, fast Fourier transform (FFT), Fourier transform, Hartley transform, I/O tensor.
CacheOblivious Algorithms
, 1999
"... This thesis presents "cacheoblivious" algorithms that use asymptotically optimal amounts of work, and move data asymptotically optimally among multiple levels of cache. An algorithm is cache oblivious if no program variables dependent on hardware configuration parameters, such as cache size and cac ..."
Abstract

Cited by 79 (1 self)
 Add to MetaCart
This thesis presents "cacheoblivious" algorithms that use asymptotically optimal amounts of work, and move data asymptotically optimally among multiple levels of cache. An algorithm is cache oblivious if no program variables dependent on hardware configuration parameters, such as cache size and cacheline length need to be tuned to minimize the number of cache misses. We show that the ordinary algorithms for matrix transposition, matrix multiplication, sorting, and Jacobistyle multipass filtering are not cache optimal. We present algorithms for rectangular matrix transposition, FFT, sorting, and multipass filters, which are asymptotically optimal on computers with multiple levels of caches. For a cache with size Z and cacheline length L, where Z =# (L 2 ), the number of cache misses for an m × n matrix transpose is #(1 + mn=L). The number of cache misses for either an npoint FFT or the sorting of n numbers is #(1 + (n=L)(1 + log Z n)). The cache complexity of computing n ...
The Fastest Fourier Transform in the West
 the Proceedings of the 1998 International Conference on Acoustics, Speech, and Signal Processing, ICASSP '98
, 1997
"... This paper describes FFTW, a portable C package for computing the one and multidimensional complex discrete Fourier transform (DFT). FFTW is typically faster than all other publicly available DFT software, including the wellknown FFTPACK and the code from Numerical Recipes. More interestingly, FFT ..."
Abstract

Cited by 66 (2 self)
 Add to MetaCart
This paper describes FFTW, a portable C package for computing the one and multidimensional complex discrete Fourier transform (DFT). FFTW is typically faster than all other publicly available DFT software, including the wellknown FFTPACK and the code from Numerical Recipes. More interestingly, FFTW is competitive with or better than proprietary, highlytuned codes such as Sun's Performance Library and IBM's ESSL library. FFTW implements the CooleyTukey fast Fourier transform, and is freely available on the Web at http://theory.lcs.mit.edu/fftw. Three main ideas are the keys to FFTW's performance. First, the computation of the transform is performed by an executor consisting of highlyoptimized, composable blocks of C code called codelets. Second, at runtime, a planner finds an efficient way (called a `plan') to compose the codelets. Through the planner, FFTW adapts itself to the architecture of the machine it is running on. Third, the codelets are automatically generated by a code...
Multidigit Multiplication For Mathematicians
"... . This paper surveys techniques for multiplying elements of various commutative rings. It covers Karatsuba multiplication, dual Karatsuba multiplication, Toom multiplication, dual Toom multiplication, the FFT trick, the twisted FFT trick, the splitradix FFT trick, Good's trick, the SchonhageStrass ..."
Abstract

Cited by 27 (9 self)
 Add to MetaCart
. This paper surveys techniques for multiplying elements of various commutative rings. It covers Karatsuba multiplication, dual Karatsuba multiplication, Toom multiplication, dual Toom multiplication, the FFT trick, the twisted FFT trick, the splitradix FFT trick, Good's trick, the SchonhageStrassen trick, Schonhage's trick, Nussbaumer's trick, the cyclic SchonhageStrassen trick, and the CantorKaltofen theorem. It emphasizes the underlying ring homomorphisms. 1.
The cost of cacheoblivious searching
 IN PROC. 44TH ANN. SYMP. ON FOUNDATIONS OF COMPUTER SCIENCE (FOCS
, 2003
"... This paper gives tight bounds on the cost of cacheoblivious searching. The paper shows that no cacheoblivious search structure can guarantee a search performance of fewer than lgelog B N memory transfers between any two levels of the memory hierarchy. This lower bound holds even if all of the bloc ..."
Abstract

Cited by 18 (8 self)
 Add to MetaCart
This paper gives tight bounds on the cost of cacheoblivious searching. The paper shows that no cacheoblivious search structure can guarantee a search performance of fewer than lgelog B N memory transfers between any two levels of the memory hierarchy. This lower bound holds even if all of the block sizes are limited to be powers of 2. The paper gives modified versions of the van Emde Boas layout, where the expected number of memory transfers between any two levels of the memory hierarchy is arbitrarily close to [lge+O(lglgB/lgB)]log B N +O(1). This factor approaches lge ≈ 1.443 as B increases. The expectation is taken over the random placement in memory of the first element of the structure. Because searching in the diskaccess machine (DAM) model can be performed in log B N+O(1) block transfers, thisresultestablishes aseparation between the (2level) DAM model and cacheoblivious model. The DAM model naturally extends to k levels. The paper also shows that as k grows, the search costs of the optimal klevel DAM search structure and the optimal cacheoblivious search structure rapidly converge. This result demonstrates that for a multilevel memory hierarchy, a simple cacheoblivious structure almost replicates the performance of an optimal parameterized klevel DAM structure.
Portable HighPerformance Programs
, 1999
"... right notice and this permission notice are preserved on all copies. ..."
Abstract

Cited by 17 (0 self)
 Add to MetaCart
right notice and this permission notice are preserved on all copies.
Cacheoblivious algorithms (Extended Abstract)
 In Proc. 40th Annual Symposium on Foundations of Computer Science
, 1999
"... This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cach ..."
Abstract

Cited by 12 (1 self)
 Add to MetaCart
This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cacheline length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size Z and cacheline length L where Z � Ω � L 2 � the number of cache misses for an m � n matrix transpose is Θ � 1 � mn � L �. The number of cache misses for either an npoint FFT or the sorting of n numbers is Θ � 1 �� � n � L � � 1 � log Z n �� �. We also give an Θ � mnp �work algorithm to multiply an m � n matrix by an n � p matrix that incurs Θ � 1 �� � mn � np � mp � � L � mnp � L � Z � cache faults. We introduce an “idealcache ” model to analyze our algorithms. We prove that an optimal cacheoblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the idealcache model can be simulated efficiently by LRU replacement. We also provide preliminary empirical results on the effectiveness of cacheoblivious algorithms in practice.
Fast Radix 2, 3, 4, And 5 Kernels For Fast Fourier Transformations On Computers With Overlapping MultiplyAdd Instructions
, 1997
"... . We present a new formulation of fast Fourier transformation (FFT) kernels for radix 2, 3, 4, and 5, which have a perfect balance of multiplies and adds. These kernels give higher performance on machines that have a single multiplyadd (multadd) instruction. We demonstrate the superiority of thi ..."
Abstract

Cited by 9 (0 self)
 Add to MetaCart
. We present a new formulation of fast Fourier transformation (FFT) kernels for radix 2, 3, 4, and 5, which have a perfect balance of multiplies and adds. These kernels give higher performance on machines that have a single multiplyadd (multadd) instruction. We demonstrate the superiority of this new kernel on IBM and SGI workstations. Key word. FFT kernels AMS subject classifications. 6504, 4204 PII. S1064827595281940 1. Introduction. Until several years ago, additions were faster on most computers than multiplications. Many attempts were therefore made to formulate FFT kernels with the smallest number of multiplies. Today multiplications are as fast as adds and most machines can do a multiplication and addition simultaneously. Among machines that can do this are highperformance workstations from IBM, HP, DEC, and SGI, as well as vectorsupercomputers from Cray, NEC, and Fujitsu. Some machines achieve this parallelism by a multadd instruction that is assigned during compil...
Performance models and search methods for optimal FFT implementations
, 2000
"... This work was supported by DARPA through the ARO grant # DABT639810004 This thesis considers systematic methodologies for finding optimized implementations for the fast Fourier transform (FFT). By employing rewrite rules (e.g., the CooleyTukey formula), we obtain a divide and conquer procedure (de ..."
Abstract

Cited by 8 (0 self)
 Add to MetaCart
This work was supported by DARPA through the ARO grant # DABT639810004 This thesis considers systematic methodologies for finding optimized implementations for the fast Fourier transform (FFT). By employing rewrite rules (e.g., the CooleyTukey formula), we obtain a divide and conquer procedure (decomposition) that breaks down the initial transform into combinations of different smaller size subtransforms, which are graphically represented as breakdown trees. Recursive application of the rewrite rules generates a set of algorithms and alternative codes for the FFT computation. The set of "all " possible implementations (within the given set of the rules) results in pairing the possible breakdown trees with the code implementation alternatives. To evaluate the quality of these implementations, we develop analytical and experimental performance models. Based on these models, we derive methods dynamic programming, soft decision dynamic programming and exhaustive search to find the implementation with minimal runtime. Our test results demonstrate that good algorithms and codes, accurate performance