Results 1  10
of
61
A Fast Fourier Transform Compiler
, 1999
"... FFTW library for computing the discrete Fourier transform (DFT) has gained a wide acceptance in both academia and industry, because it provides excellent performance on a variety of machines (even competitive with or faster than equivalent libraries supplied by vendors). In FFTW, most of the perform ..."
Abstract

Cited by 155 (6 self)
 Add to MetaCart
FFTW library for computing the discrete Fourier transform (DFT) has gained a wide acceptance in both academia and industry, because it provides excellent performance on a variety of machines (even competitive with or faster than equivalent libraries supplied by vendors). In FFTW, most of the performancecritical code was generated automatically by a specialpurpose compiler, called genfft, that outputs C code. Written in Objective Caml, genfft can produce DFT programs for any input length, and it can specialize the DFT program for the common case where the input data are real instead of complex. Unexpectedly, genfft “discovered” algorithms that were previously unknown, and it was able to reduce the arithmetic complexity of some other existing algorithms. This paper describes the internals of this specialpurpose compiler in some detail, and it argues that a specialized compiler is a valuable tool.
SPIRAL: Code Generation for DSP Transforms
 PROCEEDINGS OF THE IEEE SPECIAL ISSUE ON PROGRAM GENERATION, OPTIMIZATION, AND ADAPTATION
, 2005
"... Abstract — Fast changing, increasingly complex, and diverse computing platforms pose central problems in scientific computing: How to achieve, with reasonable effort, portable optimal performance? We present SPIRAL that considers this problem for the performancecritical domain of linear digital sig ..."
Abstract

Cited by 143 (32 self)
 Add to MetaCart
Abstract — Fast changing, increasingly complex, and diverse computing platforms pose central problems in scientific computing: How to achieve, with reasonable effort, portable optimal performance? We present SPIRAL that considers this problem for the performancecritical domain of linear digital signal processing (DSP) transforms. For a specified transform, SPIRAL automatically generates high performance code that is tuned to the given platform. SPIRAL formulates the tuning as an optimization problem, and exploits the domainspecific mathematical structure of transform algorithms to implement a feedbackdriven optimizer. Similar to a human expert, for a specified transform, SPIRAL “intelligently ” generates and explores algorithmic and implementation choices to find the best match to the computer’s microarchitecture. The “intelligence” is provided by search and learning techniques that exploit the structure of the algorithm and implementation space to guide the exploration and optimization. SPIRAL generates high performance code for a broad set of DSP transforms including the discrete Fourier transform, other trigonometric transforms, filter transforms, and discrete wavelet transforms. Experimental results show that the code generated by SPIRAL competes with, and sometimes outperforms, the best available human tuned transform library code. Index Terms — library generation, code optimization, adaptation, automatic performance tuning, high performance computing, linear signal transform, discrete Fourier transform, FFT, discrete cosine transform, wavelet, filter, search, learning, genetic and evolutionary algorithm, Markov decision process I.
SPL: A Language and Compiler for DSP Algorithms
, 2001
"... We discuss the design and implementation of a compiler that translates formulas representing signal processing transforms into ecient C or Fortran programs. The formulas are represented in a language that we call SPL, an acronym from Signal Processing Language. The compiler is a component of the SPI ..."
Abstract

Cited by 82 (11 self)
 Add to MetaCart
We discuss the design and implementation of a compiler that translates formulas representing signal processing transforms into ecient C or Fortran programs. The formulas are represented in a language that we call SPL, an acronym from Signal Processing Language. The compiler is a component of the SPIRAL system which makes use of formula transformations and intelligent search strategies to automatically generate optimized digital signal processing (DSP) libraries. After a discussion of the translation and optimization techniques implemented in the compiler, we use SPL formulations of the fast Fourier transform (FFT) to evaluate the compiler. Our results show that SPIRAL, which can be used to implement many classes of algorithms, produces programs that perform as well as \hardwired" systems like FFTW.
SPIRAL: A Generator for PlatformAdapted Libraries of Signal Processing Algorithms
 Journal of High Performance Computing and Applications
, 2004
"... SPIRAL is a generator for libraries of fast software implementations of linear signal processing transforms. These libraries are adapted to the computing platform and can be reoptimized as the hardware is upgraded or replaced. This paper describes the main components of SPIRAL: the mathematical fra ..."
Abstract

Cited by 71 (20 self)
 Add to MetaCart
SPIRAL is a generator for libraries of fast software implementations of linear signal processing transforms. These libraries are adapted to the computing platform and can be reoptimized as the hardware is upgraded or replaced. This paper describes the main components of SPIRAL: the mathematical framework that concisely describes signal transforms and their fast algorithms; the formula generator that captures at the algorithmic level the degrees of freedom in expressing a particular signal processing transform; the formula translator that encapsulates the compilation degrees of freedom when translating a specific algorithm into an actual code implementation; and, finally, an intelligent search engine that finds within the large space of alternative formulas and implementations
Generalized FFTs  A Survey Of Some Recent Results
, 1995
"... In this paper we survey some recent work directed towards generalizing the fast Fourier transform (FFT). We work primarily from the point of view of group representation theory. In this setting the classical FFT can be viewed as a family of efficient algorithms for computing the Fourier transform of ..."
Abstract

Cited by 51 (8 self)
 Add to MetaCart
In this paper we survey some recent work directed towards generalizing the fast Fourier transform (FFT). We work primarily from the point of view of group representation theory. In this setting the classical FFT can be viewed as a family of efficient algorithms for computing the Fourier transform of either a function defined on a finite abelian group, or a bandlimited function on a compact abelian group. We discuss generalizations of the FFT to arbitrary finite groups and compact Lie groups.
A memory model for scientific algorithms on graphics processors
 in Proc. of the ACM/IEEE Conference on Supercomputing (SC’06
, 2006
"... We present a memory model to analyze and improve the performance of scientific algorithms on graphics processing units (GPUs). Our memory model is based on texturing hardware, which uses a 2D blockbased array representation to perform the underlying computations. We incorporate many characteristics ..."
Abstract

Cited by 50 (3 self)
 Add to MetaCart
We present a memory model to analyze and improve the performance of scientific algorithms on graphics processing units (GPUs). Our memory model is based on texturing hardware, which uses a 2D blockbased array representation to perform the underlying computations. We incorporate many characteristics of GPU architectures including smaller cache sizes, 2D block representations, and use the 3C’s model to analyze the cache misses. Moreover, we present techniques to improve the performance of nested loops on GPUs. In order to demonstrate the effectiveness of our model, we highlight its performance on three memoryintensive scientific applications – sorting, fast Fourier transform and dense matrixmultiplication. In practice, our cacheefficient algorithms for these applications are able to achieve memory throughput of 30–50 GB/s on a NVIDIA 7900 GTX GPU. We also compare our results with prior GPUbased and CPUbased implementations on highend processors. In practice, we are able to achieve 2–5× performance improvement.
QUANTUM ALGORITHMS FOR SOME HIDDEN SHIFT PROBLEMS
 SIAM J. COMPUT
, 2006
"... Almost all of the most successful quantum algorithms discovered to date exploit the ability of the Fourier transform to recover subgroup structures of functions, especially periodicity. The fact that Fourier transforms can also be used to capture shift structure has received far less attention in th ..."
Abstract

Cited by 42 (2 self)
 Add to MetaCart
Almost all of the most successful quantum algorithms discovered to date exploit the ability of the Fourier transform to recover subgroup structures of functions, especially periodicity. The fact that Fourier transforms can also be used to capture shift structure has received far less attention in the context of quantum computation. In this paper, we present three examples of “unknown shift” problems that can be solved efficiently on a quantum computer using the quantum Fourier transform. For one of these problems, the shifted Legendre symbol problem, we give evidence that the problem is hard to solve classically, by showing a reduction from breaking algebraically homomorphic cryptosystems. We also define the hidden coset problem, which generalizes the hidden shift problem and the hidden subgroup problem. This framework provides a unified way of viewing the ability of the Fourier transform to capture subgroup and shift structure.
Fast Discrete Polynomial Transforms with Applications to Data Analysis for Distance Transitive Graphs
, 1997
"... . Let P = fP 0 ; : : : ; Pn\Gamma1 g denote a set of polynomials with complex coefficients. Let Z = fz 0 ; : : : ; z n\Gamma1 g ae C denote any set of sample points. For any f = (f 0 ; : : : ; fn\Gamma1 ) 2 C n the discrete polynomial transform of f (with respect to P and Z) is defined as the col ..."
Abstract

Cited by 36 (7 self)
 Add to MetaCart
. Let P = fP 0 ; : : : ; Pn\Gamma1 g denote a set of polynomials with complex coefficients. Let Z = fz 0 ; : : : ; z n\Gamma1 g ae C denote any set of sample points. For any f = (f 0 ; : : : ; fn\Gamma1 ) 2 C n the discrete polynomial transform of f (with respect to P and Z) is defined as the collection of sums, f b f(P 0 ); : : : ; b f(Pn\Gamma1 )g, where f(P j ) = hf; P j i = P n\Gamma1 i=0 f i P j (z i )w(i) for some associated weight function w. These sorts of transforms find important applications in areas such as medical imaging and signal processing. In this paper we present fast algorithms for computing discrete orthogonal polynomial transforms. For a system of N orthogonal polynomials of degree at most N \Gamma 1 we give an O(N log 2 N) algorithm for computing a discrete polynomial transform at an arbitrary set of points instead of the N 2 operations required by direct evaluation. Our algorithm depends only on the fact that orthogonal polynomial sets satisfy a thre...
Short Vector Code Generation for the Discrete Fourier Transform
 In Proc. IEEE Int’l Parallel and Distributed Processing Symposium (IPDPS
"... In this paper we use a mathematical approach to automatically generate high performance short vector code for the discrete Fourier transform (DFT). We represent the wellknown CooleyTukey fast Fourier transform in a mathematical notation and formally derive a "short vector variant". Using this recu ..."
Abstract

Cited by 29 (16 self)
 Add to MetaCart
In this paper we use a mathematical approach to automatically generate high performance short vector code for the discrete Fourier transform (DFT). We represent the wellknown CooleyTukey fast Fourier transform in a mathematical notation and formally derive a "short vector variant". Using this recursion we generate for a given DFT a large number of different algorithms, represented as formulas, and translate them into short vector code. Then we present a vector code specific dynamic programming method that searches in the space of different implementations for the fastest on the given architecture. We implemented this approach as part of the SPIRAL library generator. On Pentium III and 4, our automatically generated SSE and SSE2 vector code compares favorably with the handtuned Intel vendor library.
A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms
 In Proc. IPDPS
, 2002
"... Short vector SIMD instructions on recent microprocessors, such as SSE on Pentium III and 4, speed up code but are a major challenge to software developers. We present a compiler that automatically generates C code enhanced with short vector instructions for digital signal processing (DSP) transforms ..."
Abstract

Cited by 27 (21 self)
 Add to MetaCart
Short vector SIMD instructions on recent microprocessors, such as SSE on Pentium III and 4, speed up code but are a major challenge to software developers. We present a compiler that automatically generates C code enhanced with short vector instructions for digital signal processing (DSP) transforms, such as the fast Fourier transform (FFT). The input to our compiler is a concise mathematical description of a DSP algorithm in the language SPL. SPL is used in the SPIRAL system (http://www.ece.cmu.edu/spiral) to generate highly optimized architecture adapted implementations of DSP transforms. Interfacing our compiler with SPIRAL yields speedups of more than a factor of 2 in several important cases including the FFT and the discrete cosine transform (DCT) used in the JPEG compression standard. For the FFT our automatically generated code is competitive with the handcoded Intel Math Kernel Library.