Results 1  10
of
128
The design and implementation of FFTW3
 PROCEEDINGS OF THE IEEE
, 2005
"... FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize performance. This paper shows that such an approach can yield an implementation that is competitive with handoptimized libraries, and describes the software structure that makes our cu ..."
Abstract

Cited by 722 (3 self)
 Add to MetaCart
FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize performance. This paper shows that such an approach can yield an implementation that is competitive with handoptimized libraries, and describes the software structure that makes our current FFTW3 version flexible and adaptive. We further discuss a new algorithm for realdata DFTs of prime size, a new way of implementing DFTs by means of machinespecific singleinstruction, multipledata (SIMD) instructions, and how a specialpurpose compiler can derive optimized implementations of the discrete cosine and sine transforms automatically from a DFT algorithm.
FFTW: An Adaptive Software Architecture For The FFT
, 1998
"... FFT literature has been mostly concerned with minimizing the number of floatingpoint operations performed by an algorithm. Unfortunately, on presentday microprocessors this measure is far less important than it used to be, and interactions with the processor pipeline and the memory hierarchy have ..."
Abstract

Cited by 605 (4 self)
 Add to MetaCart
(Show Context)
FFT literature has been mostly concerned with minimizing the number of floatingpoint operations performed by an algorithm. Unfortunately, on presentday microprocessors this measure is far less important than it used to be, and interactions with the processor pipeline and the memory hierarchy have a larger impact on performance. Consequently, one must know the details of a computer architecture in order to design a fast algorithm. In this paper, we propose an adaptive FFT program that tunes the computation automatically for any particular hardware. We compared our program, called FFTW, with over 40 implementations of the FFT on 7 machines. Our tests show that FFTW's selfoptimizing approach usually yields significantly better performance than all other publicly available software. FFTW also compares favorably with machinespecific, vendoroptimized libraries. 1. INTRODUCTION The discrete Fourier transform (DFT) is an important tool in many branches of science and engineering [1] and...
A Fast Fourier Transform Compiler
, 1999
"... FFTW library for computing the discrete Fourier transform (DFT) has gained a wide acceptance in both academia and industry, because it provides excellent performance on a variety of machines (even competitive with or faster than equivalent libraries supplied by vendors). In FFTW, most of the perform ..."
Abstract

Cited by 194 (5 self)
 Add to MetaCart
(Show Context)
FFTW library for computing the discrete Fourier transform (DFT) has gained a wide acceptance in both academia and industry, because it provides excellent performance on a variety of machines (even competitive with or faster than equivalent libraries supplied by vendors). In FFTW, most of the performancecritical code was generated automatically by a specialpurpose compiler, called genfft, that outputs C code. Written in Objective Caml, genfft can produce DFT programs for any input length, and it can specialize the DFT program for the common case where the input data are real instead of complex. Unexpectedly, genfft “discovered” algorithms that were previously unknown, and it was able to reduce the arithmetic complexity of some other existing algorithms. This paper describes the internals of this specialpurpose compiler in some detail, and it argues that a specialized compiler is a valuable tool.
Cyclic Prefixing or Zero Padding for Wireless Multicarrier Transmissions?
, 2002
"... Zero padding (ZP) of multicarrier transmissions has recently been proposed as an appealing alternative to the traditional cyclic prefix (CP) orthogonal frequencydivision multiplexing (OFDM) to ensure symbol recovery regardless of the channel zero locations. In this paper, both systems are studied t ..."
Abstract

Cited by 103 (10 self)
 Add to MetaCart
Zero padding (ZP) of multicarrier transmissions has recently been proposed as an appealing alternative to the traditional cyclic prefix (CP) orthogonal frequencydivision multiplexing (OFDM) to ensure symbol recovery regardless of the channel zero locations. In this paper, both systems are studied to delineate their relative merits in wireless systems where channel knowledge is not available at the transmitter. Two novel equalizers are developed for ZPOFDM to tradeoff performance with implementation complexity. Both CPOFDM and ZPOFDM are then compared in terms of transmitter nonlinearities and required power backoff. Next, both systems are tested in terms of channel estimation and tracking capabilities. Simulations tailored to the realistic context of the standard for wireless local area network HIPERLAN/2 illustrate the pertinent tradeoffs.
Cacheoblivious algorithms
, 1999
"... requirements for the degree of Master of Science. This thesis presents "cacheoblivious " algorithms that use asymptotically optimal amounts of work, and move data asymptotically optimally among multiple levels of cache. An algorithm is cache oblivious if no program variables dependent on ..."
Abstract

Cited by 90 (1 self)
 Add to MetaCart
(Show Context)
requirements for the degree of Master of Science. This thesis presents "cacheoblivious " algorithms that use asymptotically optimal amounts of work, and move data asymptotically optimally among multiple levels of cache. An algorithm is cache oblivious if no program variables dependent on hardware configuration parameters, such as cache size and cacheline length need to be tuned to minimize the number of cache misses. We show that the ordinary algorithms for matrix transposition, matrix multiplication, sorting, and Jacobistyle multipass filtering are not cache optimal. We present algorithms for rectangular matrix transposition, FFT, sorting, and multipass filters, which are asymptotically optimal on computers with multiple levels of caches. For a cache with size Z and cacheline length L, where Z = (L2), the number of cache misses for an m x n matrix transpose is E(1 + mn/L). The number of cache misses for either an npoint FFT or the sorting of n numbers is 0(1 + (n/L)(1 + logzn)). The cache complexity of computing n time steps of a Jacobistyle multipass filter on an array of size n is E(1 + n/L + n2 /ZL). We also give an 8(mnp)work algorithm to multiply an m x n matrix by an n x p matrix
Wavelet transforms versus Fourier transforms
 Department of Mathematics, MIT, Cambridge MA
, 213
"... Abstract. This note is a very basic introduction to wavelets. It starts with an orthogonal basis of piecewise constant functions, constructed by dilation and translation. The "wavelet transform " maps each f(x) to its coefficients with respect to this basis. The mathematics is simple and t ..."
Abstract

Cited by 82 (2 self)
 Add to MetaCart
(Show Context)
Abstract. This note is a very basic introduction to wavelets. It starts with an orthogonal basis of piecewise constant functions, constructed by dilation and translation. The "wavelet transform " maps each f(x) to its coefficients with respect to this basis. The mathematics is simple and the transform is fast (faster than the Fast Fourier Transform, which we briefly explain), but approximation by piecewise constants is poor. To improve this first wavelet, we are led to dilation equations and their unusual solutions. Higherorder wavelets are constructed, and it is surprisingly quick to compute with them — always indirectly and recursively. We comment informally on the contest between these transforms in signal processing, especially for video and image compression (including highdefinition television). So far the Fourier Transform — or its 8 by 8 windowed version, the Discrete Cosine Transform — is often chosen. But wavelets are already competitive, and they are ahead for fingerprints. We present a sample of this developing theory. 1. The Haar wavelet To explain wavelets we start with an example. It has every property we hope for, except one. If that one defect is accepted, the construction is simple and the computations are fast. By trying to remove the defect, we are led to dilation equations and recursively defined functions and a small world of fascinating new problems — many still unsolved. A sensible person would stop after the first wavelet, but fortunately mathematics goes on. The basic example is easier to draw than to describe: W(x)
The Fastest Fourier Transform in the West
 the Proceedings of the 1998 International Conference on Acoustics, Speech, and Signal Processing, ICASSP '98
, 1997
"... This paper describes FFTW, a portable C package for computing the one and multidimensional complex discrete Fourier transform (DFT). FFTW is typically faster than all other publicly available DFT software, including the wellknown FFTPACK and the code from Numerical Recipes. More interestingly, FFT ..."
Abstract

Cited by 73 (2 self)
 Add to MetaCart
This paper describes FFTW, a portable C package for computing the one and multidimensional complex discrete Fourier transform (DFT). FFTW is typically faster than all other publicly available DFT software, including the wellknown FFTPACK and the code from Numerical Recipes. More interestingly, FFTW is competitive with or better than proprietary, highlytuned codes such as Sun's Performance Library and IBM's ESSL library. FFTW implements the CooleyTukey fast Fourier transform, and is freely available on the Web at http://theory.lcs.mit.edu/fftw. Three main ideas are the keys to FFTW's performance. First, the computation of the transform is performed by an executor consisting of highlyoptimized, composable blocks of C code called codelets. Second, at runtime, a planner finds an efficient way (called a `plan') to compose the codelets. Through the planner, FFTW adapts itself to the architecture of the machine it is running on. Third, the codelets are automatically generated by a code...
A modified splitradix FFT with fewer arithmetic operations
 IEEE TRANS. SIGNAL PROCESSING
, 2006
"... ..."
Multidigit Multiplication For Mathematicians
, 2001
"... This paper surveys techniques for multiplying elements of various commutative rings. It covers Karatsuba multiplication, dual Karatsuba multiplication, Toom multiplication, dual Toom multiplication, the FFT trick, the twisted FFT trick, the splitradix FFT trick, Good's trick, the SchönhageStr ..."
Abstract

Cited by 35 (8 self)
 Add to MetaCart
This paper surveys techniques for multiplying elements of various commutative rings. It covers Karatsuba multiplication, dual Karatsuba multiplication, Toom multiplication, dual Toom multiplication, the FFT trick, the twisted FFT trick, the splitradix FFT trick, Good's trick, the SchönhageStrassen trick, Schönhage's trick, Nussbaumer's trick, the cyclic SchönhageStrassen trick, and the CantorKaltofen theorem. It emphasizes the underlying ring homomorphisms.
Oblivious algorithms for multicores and network of processors
, 2009
"... We address the design of parallel algorithms that are oblivious to machine parameters for two dominant machine configurations: the chip multiprocessor (or multicore) and the network of processors. First, and of independent interest, we propose HM, a hierarchical multilevel caching model for multic ..."
Abstract

Cited by 29 (9 self)
 Add to MetaCart
We address the design of parallel algorithms that are oblivious to machine parameters for two dominant machine configurations: the chip multiprocessor (or multicore) and the network of processors. First, and of independent interest, we propose HM, a hierarchical multilevel caching model for multicores, and we propose a multicoreoblivious approach to algorithms and schedulers for HM. We instantiate this approach with provably efficient multicoreoblivious algorithms for matrix and prefix sum computations, FFT, the Gaussian Elimination paradigm (which represents an important class of computations including FloydWarshall’s allpairs shortest paths, Gaussian Elimination and LU decomposition without pivoting), sorting, list ranking, Euler tours and connected components. We then use the network oblivious framework proposed earlier as an oblivious framework for a network of processors, and we present provably efficient networkoblivious algorithms for sorting, the Gaussian Elimination paradigm, list ranking, Euler tours and connected components. Many of these networkoblivious algorithms perform efficiently also when executed on the DecomposableBSP.