FFTW: An Adaptive Software Architecture For The FFT
1998
Cited by 444 (4 self)
FFT literature has been mostly concerned with minimizing the number of floating-point operations performed by an algorithm. Unfortunately, on present-day microprocessors this measure is far less important than it used to be, and interactions with the processor pipeline and the memory hierarchy have a larger impact on performance. Consequently, one must know the details of a computer architecture in order to design a fast algorithm. In this paper, we propose an adaptive FFT program that tunes the computation automatically for any particular hardware. We compared our program, called FFTW, with over 40 implementations of the FFT on 7 machines. Our tests show that FFTW's self-optimizing approach usually yields significantly better performance than all other publicly available software. FFTW also compares favorably with machine-specific, vendor-optimized libraries. 1. INTRODUCTION The discrete Fourier transform (DFT) is an important tool in many branches of science and engineering [1] and...
The design and implementation of FFTW3
Proceedings of the IEEE, 2005
Cited by 396 (6 self)
FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize performance. This paper shows that such an approach can yield an implementation that is competitive with hand-optimized libraries, and describes the software structure that makes our current FFTW3 version flexible and adaptive. We further discuss a new algorithm for real-data DFTs of prime size, a new way of implementing DFTs by means of machine-specific single-instruction, multiple-data (SIMD) instructions, and how a special-purpose compiler can derive optimized implementations of the discrete cosine and sine transforms automatically from a DFT algorithm. Keywords: Adaptive software, cosine transform, fast Fourier transform (FFT), Fourier transform, Hartley transform, I/O tensor.
Superfast solution of real positive definite Toeplitz systems
SIAM J. Matrix Anal. Appl., 1988
Cited by 54 (1 self)
Abstract. We describe an implementation of the generalized Schur algorithm for the superfast solution of real positive definite Toeplitz systems of order $n + 1$, where $n = 2^\nu$. Our implementation uses the split-radix fast Fourier transform algorithms for real data of Duhamel. We are able to obtain the $n$th Szegő polynomial using fewer than $8n \log_2^2 n$ real arithmetic operations without explicit use of the bit-reversal permutation. Since Levinson’s algorithm requires slightly more than $2n^2$ operations to obtain this polynomial, we achieve crossover with Levinson’s algorithm at $n = 256$. Key words. Toeplitz matrix, Schur’s algorithm, split-radix Fast Fourier Transform
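The crossover point quoted in this abstract can be checked with a few lines of arithmetic. The sketch below treats the two figures from the abstract as if they were exact operation counts, which is an assumption (the abstract only says "fewer than" and "slightly more than"); the function names are illustrative and do not come from the paper.

```python
def superfast_bound(n):
    """Bound quoted in the abstract: 8 * n * (log2 n)^2, for n a power of two."""
    log2_n = n.bit_length() - 1  # exact log2 for powers of two
    return 8 * n * log2_n ** 2


def levinson_estimate(n):
    """Estimate quoted in the abstract: about 2 * n^2 operations."""
    return 2 * n * n


def crossover():
    """Smallest power of two at which the superfast bound stops exceeding Levinson."""
    n = 2
    while superfast_bound(n) > levinson_estimate(n):
        n *= 2
    return n


if __name__ == "__main__":
    n = crossover()
    print(n, superfast_bound(n), levinson_estimate(n))
```

Under these assumptions both expressions evaluate to 131072 at n = 256, which agrees with the crossover stated in the abstract.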
Multidigit Multiplication For Mathematicians
Cited by 27 (9 self)
This paper surveys techniques for multiplying elements of various commutative rings. It covers Karatsuba multiplication, dual Karatsuba multiplication, Toom multiplication, dual Toom multiplication, the FFT trick, the twisted FFT trick, the split-radix FFT trick, Good's trick, the Schönhage-Strassen trick, Schönhage's trick, Nussbaumer's trick, the cyclic Schönhage-Strassen trick, and the Cantor-Kaltofen theorem. It emphasizes the underlying ring homomorphisms.
A Modified Split-Radix FFT With Fewer Arithmetic Operations
2007
Cited by 24 (5 self)
Recent results by Van Buskirk et al. have broken the record set by Yavne in 1968 for the lowest exact count of real additions and multiplications to compute a power-of-two discrete Fourier transform (DFT). Here, we present a simple recursive modification of the split-radix algorithm that computes the DFT with asymptotically about 6% fewer operations than Yavne, matching the count achieved by Van Buskirk’s program-generation framework. We also discuss the application of our algorithm to real-data and real-symmetric (discrete cosine) transforms, where we are again able to achieve lower arithmetic counts than previously published algorithms.
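As a rough check on the "about 6%" figure, the ratio can be worked out from the commonly quoted operation counts. These counts are not given in the abstract and are assumed here: $4N\log_2 N - 6N + 8$ real operations for Yavne's split-radix count, and a leading term of $\tfrac{34}{9}N\log_2 N$ for the modified algorithm.

```latex
\[
\lim_{N\to\infty}
\frac{\tfrac{34}{9}\,N\log_2 N}{4N\log_2 N - 6N + 8}
= \frac{34}{36} = \frac{17}{18} \approx 0.944
\]
```

Under that assumption the asymptotic saving is about 5.6%, consistent with the abstract's rounded "about 6%".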
Portable High-Performance Programs
1999
Cited by 17 (0 self)
Performance models and search methods for optimal FFT implementations
2000
Cited by 8 (0 self)
This work was supported by DARPA through the ARO grant # DABT639810004. This thesis considers systematic methodologies for finding optimized implementations for the fast Fourier transform (FFT). By employing rewrite rules (e.g., the Cooley-Tukey formula), we obtain a divide-and-conquer procedure (decomposition) that breaks down the initial transform into combinations of different smaller-size subtransforms, which are graphically represented as breakdown trees. Recursive application of the rewrite rules generates a set of algorithms and alternative codes for the FFT computation. The set of "all" possible implementations (within the given set of rules) results from pairing the possible breakdown trees with the code implementation alternatives. To evaluate the quality of these implementations, we develop analytical and experimental performance models. Based on these models, we derive methods (dynamic programming, soft-decision dynamic programming, and exhaustive search) to find the implementation with minimal runtime. Our test results demonstrate that good algorithms and codes, accurate performance...
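The dynamic-programming search over breakdown trees described in this abstract can be sketched roughly as follows. This is a toy illustration, not the thesis's method: the leaf costs, the twiddle cost, and the names `LEAF_COST`, `twiddle_cost`, and `best_plan` are all hypothetical stand-ins for whatever performance model is actually used.

```python
from functools import lru_cache

# Hypothetical costs of unrolled leaf kernels for DFT sizes 2**k (assumed values).
LEAF_COST = {1: 4.0, 2: 16.0, 3: 60.0}


def twiddle_cost(k):
    """Assumed cost of the twiddle multiplications in one Cooley-Tukey step of size 2**k."""
    return 6.0 * 2 ** k


@lru_cache(maxsize=None)
def best_plan(k):
    """Return (cost, tree) for a DFT of size 2**k.

    A tree is either an int k (leaf kernel of size 2**k) or a pair
    (left, right) of subtrees for a Cooley-Tukey split 2**k = 2**a * 2**b.
    """
    candidates = []
    if k in LEAF_COST:
        candidates.append((LEAF_COST[k], k))
    for a in range(1, k):
        b = k - a
        cost_a, tree_a = best_plan(a)
        cost_b, tree_b = best_plan(b)
        # 2**b transforms of size 2**a, 2**a transforms of size 2**b, plus twiddles.
        cost = 2 ** b * cost_a + 2 ** a * cost_b + twiddle_cost(k)
        candidates.append((cost, (tree_a, tree_b)))
    return min(candidates, key=lambda c: c[0])


if __name__ == "__main__":
    print(best_plan(4))
```

Because each subproblem is memoized, the search visits every size once instead of enumerating all trees. The trade-off is that it assumes the best plan for a subtransform is independent of its context, which an exhaustive search does not assume.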
An Approach To Low-Power, High-Performance, Fast Fourier Transform Processor Design
Cited by 5 (0 self)
The Fast Fourier Transform (FFT) is one of the most widely used digital signal processing algorithms. While advances in semiconductor processing technology have enabled the performance and integration of FFT processors to increase steadily, these advances have also caused the power consumed by processors to increase as well. This power increase has resulted in a situation where the number of potential FFT applications limited by maximum power budgets (not performance) is significant and growing. We present...
Application-Specific Architecture For Fast Transforms Based On The Successive Doubling Method, Part I: A Constant Geometry Approach
1994
Cited by 1 (1 self)
The successive doubling method is an efficient procedure for the design of fast algorithms for orthogonal transforms of length $N = r^n$, where the radix $r$ is a power of 2. It reduces the algorithmic complexity from $N^2$ to $N \log_r N$. In this work we present a partitioned systolic architecture for the two standard radix successive doubling algorithms: ascend and descend communication patterns. The systolization and partitioning procedure we have used is made up of three actions. First, we transform the flow chart of the data for the successive doubling algorithm into a new chart of constant geometry in all its stages ($n$). We obtain the constant geometry by means of the perfect unshuffle (ascending algorithm) or shuffle (descending algorithm) permutations of order $\log_2 r$. We then carry out the decomposition of these permutations into elementary permutations, which can be implemented electronically. Finally, we project the index space of the data onto the index space associ...
Constant Geometry Split-Radix Algorithms
J. VLSI Signal Processing, 1995
Cited by 1 (1 self)
The split-radix algorithm (SR) is a highly efficient version of the successive doubling method. Its application to the Fourier transform results in an algorithm that brings together the advantages of the radix 2 and radix 4 algorithms. In this work we present the generalization of the method that leads to the SR algorithm in the FFT and the implementation of a constant geometry (CG) version of it. In particular, we develop a CG algorithm of the successive doubling method that factorizes a sequence of length $N$ into $p$ sequences of length $N/r$ and into $(r - p)r$ of length $N/r^2$ ($r \ge 2$; $0 < p < r$). After this, the method is generalized for its application to SR $r, r^2, \ldots, r^u$ algorithms, that is, to those based on the factorization of a sequence of length $N$ into $p_1$ subsequences of length $N/r$, $p_2 r$ of length $N/r^2$, ..., $p_u r^{u-1}$ of length $N/r^u$ ($p_1 + p_2 + \cdots + p_u = r$). The results are applied to the implementation of a pipeline with identical...
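For the classical case $r = 2$, $p = 1$ of the factorization above, a length-$N$ sequence splits into one subsequence of length $N/2$ and two of length $N/4$. A minimal recursive sketch of that textbook split-radix FFT is given below. It is not the constant geometry version the paper develops, it assumes the input length is a power of two, and the function names are illustrative only.

```python
import cmath


def split_radix_fft(x):
    """Textbook recursive split-radix DFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    u = split_radix_fft(x[0::2])   # even samples: one transform of length n/2
    z = split_radix_fft(x[1::4])   # samples 4m+1: transform of length n/4
    zp = split_radix_fft(x[3::4])  # samples 4m+3: transform of length n/4
    q = n // 4
    out = [0j] * n
    for k in range(q):
        w1 = cmath.exp(-2j * cmath.pi * k / n)      # twiddle w^k
        w3 = cmath.exp(-2j * cmath.pi * 3 * k / n)  # twiddle w^(3k)
        s = w1 * z[k] + w3 * zp[k]
        d = w1 * z[k] - w3 * zp[k]
        out[k] = u[k] + s
        out[k + 2 * q] = u[k] - s
        out[k + q] = u[k + q] - 1j * d
        out[k + 3 * q] = u[k + q] + 1j * d
    return out


def naive_dft(x):
    """Direct O(n^2) DFT, used only as a reference for checking."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


if __name__ == "__main__":
    data = [complex(i % 5, -i) for i in range(16)]
    err = max(abs(a - b) for a, b in zip(split_radix_fft(data), naive_dft(data)))
    print(err)
```

The subsequence lengths add up as the abstract's formula requires: $1 \cdot N/2 + 2 \cdot N/4 = N$.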