Results 1–10 of 43
FFTs in external or hierarchical memory
 Journal of Supercomputing
, 1990
Abstract

Cited by 134 (12 self)
Conventional algorithms for computing large one-dimensional fast Fourier transforms (FFTs), even those algorithms recently developed for vector and parallel computers, are largely unsuitable for systems with external or hierarchical memory. The principal reason for this is the fact that most FFT algorithms require at least m complete passes through the data set to compute a 2^m-point FFT. This paper describes some advanced techniques for computing an ordered FFT on a computer with external or hierarchical memory. These algorithms (1) require as few as two passes through the external data set, (2) employ strictly unit stride, long vector transfers between main memory and external storage, (3) require only a modest amount of scratch space in main memory, and (4) are well suited for vector and parallel computation. Performance figures are included for implementations of some of these algorithms on Cray supercomputers. Of interest is the fact that a main memory version outperforms the current Cray library FFT routines on the Cray-2, the Cray X-MP, and the Cray Y-MP systems. Using all eight processors on the Cray Y-MP, this main memory routine runs at nearly two gigaflops.
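The few-pass idea the abstract describes rests on factoring a length n = n1·n2 DFT into column DFTs, a twiddle multiplication, and row DFTs. The following is an illustrative Python sketch of that decomposition (often called the four-step FFT); the naive `dft` helper and all names are my own, not the paper's Cray implementation, which replaces the kernels with tuned in-core FFTs and streams the matrix to and from external storage.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT, standing in for a tuned in-core FFT kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def four_step_fft(x, n1, n2):
    """Compute an (n1*n2)-point DFT as n2 column DFTs, a twiddle pass,
    and n1 row DFTs -- the factorization that lets an out-of-core FFT
    touch the external data set in only a few passes."""
    n = n1 * n2
    assert len(x) == n
    # View x as an n1 x n2 matrix stored row-major.
    a = [[x[i * n2 + j] for j in range(n2)] for i in range(n1)]
    # Step 1: DFT each length-n1 column.
    for j in range(n2):
        col = dft([a[i][j] for i in range(n1)])
        for i in range(n1):
            a[i][j] = col[i]
    # Step 2: multiply by the twiddle factors e^{-2*pi*i*(row freq)*(col)/n}.
    for i in range(n1):
        for j in range(n2):
            a[i][j] *= cmath.exp(-2j * cmath.pi * i * j / n)
    # Step 3: DFT each length-n2 row.
    for i in range(n1):
        a[i] = dft(a[i])
    # Step 4: read out transposed: X[j*n1 + i] = a[i][j].
    return [a[i][j] for j in range(n2) for i in range(n1)]
```

In an external-memory setting, steps 1 and 3 become one pass each over the data set, with the transpose folded into the I/O pattern.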
A Survey of Out-of-Core Algorithms in Numerical Linear Algebra
 DIMACS SERIES IN DISCRETE MATHEMATICS AND THEORETICAL COMPUTER SCIENCE
, 1999
Abstract

Cited by 59 (3 self)
This paper surveys algorithms that efficiently solve linear equations or compute eigenvalues even when the matrices involved are too large to fit in the main memory of the computer and must be stored on disks. The paper focuses on scheduling techniques that result in mostly sequential data accesses and in data reuse, and on techniques for transforming algorithms that cannot be effectively scheduled. The survey covers out-of-core algorithms for solving dense systems of linear equations, for the direct and iterative solution of sparse systems, for computing eigenvalues, for fast Fourier transforms, and for N-body computations. The paper also discusses reasonable assumptions on memory size, approaches for the analysis of out-of-core algorithms, and relationships between out-of-core, cache-aware, and parallel algorithms.
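The scheduling idea the survey highlights (mostly sequential accesses plus data reuse) is the same one behind blocked kernels. A hedged Python sketch, not taken from the survey, of a block-partitioned matrix multiply illustrates it: each b×b block is brought in once per block-level product and fully reused, where an out-of-core version would stream whole blocks sequentially from disk.

```python
def blocked_matmul(A, B, n, b):
    """Multiply two n x n matrices (lists of lists) in b x b blocks.
    Illustrative sketch: the block loops model the schedule an
    out-of-core algorithm uses so each disk-resident block is read
    once per block product instead of via strided element accesses."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, b):
        for jj in range(0, n, b):
            for kk in range(0, n, b):
                # One block product: touches only b*b elements of each operand.
                for i in range(ii, min(ii + b, n)):
                    for k in range(kk, min(kk + b, n)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + b, n)):
                            C[i][j] += aik * B[k][j]
    return C
```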
The Fractional Fourier Transform and Applications
, 1995
Abstract

Cited by 43 (2 self)
This paper describes the "fractional Fourier transform", which admits computation by an algorithm that has complexity proportional to the fast Fourier transform algorithm. Whereas the discrete Fourier transform (DFT) is based on integral roots of unity e^{-2πi/n}, the fractional Fourier transform is based on fractional roots of unity e^{-2πiα}, where α is arbitrary. The fractional Fourier transform and the corresponding fast algorithm are useful for such applications as computing DFTs of sequences with prime lengths, computing DFTs of sparse sequences, analyzing sequences with non-integer periodicities, performing high-resolution trigonometric interpolation, detecting lines in noisy images and detecting signals with linearly drifting frequencies. In many cases, the resulting algorithms are faster by arbitrarily large factors than conventional techniques. Bailey is with the Numerical Aerodynamic Simulation (NAS) Systems Division at NASA Ames Research Center, Moffett Field,...
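For concreteness, the transform can be evaluated directly from its definition G_k = Σ_j x_j e^{-2πi j k α}. This illustrative O(n²) Python sketch is not the paper's fast algorithm (which achieves FFT-like complexity); it only shows that the parameter α is free, and that α = 1/n recovers the ordinary DFT.

```python
import cmath

def frft(x, alpha):
    """Direct O(n^2) evaluation of the fractional Fourier transform
    G_k = sum_j x_j * exp(-2*pi*i*j*k*alpha).  With alpha = 1/n this
    is the ordinary DFT; other alpha give fractional roots of unity."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k * alpha) for j in range(n))
            for k in range(n)]
```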
Faster Integer Multiplication
 STOC'07
, 2007
Abstract

Cited by 41 (0 self)
For more than 35 years, the fastest known method for integer multiplication has been the Schönhage-Strassen algorithm running in time O(n log n log log n). Under certain restrictive conditions there is a corresponding Ω(n log n) lower bound. The prevailing conjecture has always been that the complexity of an optimal algorithm is Θ(n log n). We present a major step towards closing the gap from above by presenting an algorithm running in time n log n · 2^{O(log* n)}. The main result holds for Boolean circuits as well as for multi-tape Turing machines, but it has consequences for other models of computation as well.
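The paper's n log n · 2^{O(log* n)} algorithm is well beyond a sketch, but the flavor of beating the schoolbook O(n²) bound by divide and conquer can be shown with the classical Karatsuba scheme. To be clear, this is NOT the paper's algorithm, only a much simpler O(n^{1.585}) predecessor, included as an illustrative Python example:

```python
def karatsuba(x, y):
    """Sub-quadratic integer multiplication via Karatsuba's trick:
    three half-size products instead of four.  Illustrative only --
    the FFT-based methods the abstract discusses refine this recursion
    much further."""
    if x < 10 or y < 10:
        return x * y
    m = max(len(str(x)), len(str(y))) // 2
    p = 10 ** m
    xh, xl = divmod(x, p)
    yh, yl = divmod(y, p)
    a = karatsuba(xh, yh)                    # high halves
    c = karatsuba(xl, yl)                    # low halves
    b = karatsuba(xh + xl, yh + yl) - a - c  # cross terms from one product
    return a * p * p + b * p + c
```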
Communication and Memory Requirements as the Basis for Mapping Task and Data Parallel Programs
, 1994
Abstract

Cited by 30 (7 self)
For a wide variety of applications, both task and data parallelism must be exploited to achieve the best possible performance on a multicomputer. Recent research has underlined the importance of exploiting task and data parallelism in a single compiler framework, and such a compiler can map a single source program in many different ways onto a parallel machine. The tradeoffs between task and data parallelism are complex and depend on the characteristics of the program to be executed, most significantly the memory and communication requirements, and the performance parameters of the target parallel machine. In this paper, we present a framework to isolate and examine the specific characteristics of programs that determine the performance for different mappings. Our focus is on applications that process a stream of input, and whose computation structure is fairly static and predictable. We describe three such applications that were developed with our compiler: fast Fourier transforms, nar...
Sequential and parallel complexity of approximate evaluation of polynomial zeros
 COMPUT. MATH. APPLIC
, 1987
Abstract

Cited by 18 (7 self)
Our new sequential and parallel algorithms establish new record upper bounds on both the arithmetic and Boolean complexity of approximating complex polynomial zeros. O(n² log b log n) arithmetic operations, or O(n log n log(bn)) parallel steps and n log b/log(bn) processors, suffice in order to approximate, with absolute errors ≤ 2^{m−b}, all the complex zeros of an nth-degree polynomial p(x) whose coefficients have moduli ≤ 2^m. If we only need such an approximation to a single zero of p(x), then O(n log b log n) arithmetic operations, or O(log² n log(bn)) steps and (n/log n) log b/log(bn) processors, suffice (which places the latter problem in NC, that is, in the class of problems that can be solved using polylogarithmic parallel time and a polynomial number of processors). Those estimates are reached in computations with O(bn) binary bits where the polynomial has integer coefficients. We also reach the sequential Boolean time bounds O(bn³ log(bn) log log(bn)) for approximating all the zeros (a very minor improvement of the bound announced in 1982 by Schönhage) and O(bn² log log n log(bn) log log(bn)) for approximating a single zero. Among further implications are improvements of the known algorithms and complexity estimates for computing matrix eigenvalues, for polynomial factorization over the field of complex numbers, and for solving systems of polynomial equations. The computations rely on recursive application of Turán's proximity test of 1968, on its more recent extensions to root radii computations, on contour integration via the fast Fourier transform (FFT) within geometric constructions for search and exclusion, and (for the final minor improvements of the complexity bounds) on the recursive factorization of p(x) over discs in the complex plane via numerical integration and Newton's iterations.
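Of the ingredients listed, the Newton's-iteration step is simple enough to sketch. The helper below is an illustrative, hypothetical Python implementation of the local refinement z ← z − p(z)/p′(z) via Horner evaluation; the paper's contribution lies in the global machinery (proximity tests, root radii, search and exclusion) that makes such local steps provably converge, none of which is shown here.

```python
def newton_root(coeffs, z0, iters=50, tol=1e-12):
    """Newton's iteration z <- z - p(z)/p'(z) on a polynomial given by
    its coefficient list, highest degree first.  Hypothetical helper:
    only the local refinement step of the zero-finding pipeline."""
    def horner(cs, z):
        # Evaluate a polynomial by Horner's rule.
        acc = 0j
        for c in cs:
            acc = acc * z + c
        return acc
    n = len(coeffs) - 1
    # Coefficients of the derivative p'(x), also highest degree first.
    dcoeffs = [c * (n - i) for i, c in enumerate(coeffs[:-1])]
    z = complex(z0)
    for _ in range(iters):
        pz = horner(coeffs, z)
        if abs(pz) < tol:
            break
        z = z - pz / horner(dcoeffs, z)
    return z
```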
Towards an Optimal Bit-Reversal Permutation Program
 In Proceeding of IEEE Foundations of Computer Science
, 1998
Abstract

Cited by 11 (2 self)
The speed of many computations is limited not by the number of arithmetic operations but by the time it takes to move and rearrange data in the increasingly complicated memory hierarchies of modern computers. Array transpose and the bit-reversal permutation, trivial operations on a RAM, present nontrivial problems when designing highly tuned scientific library functions, particularly for the fast Fourier transform. We prove a precise bound for RoCol, a simple pebble-type game that is relevant to implementing these permutations. We use RoCol to give lower bounds on the amount of memory traffic in a computer with four levels of memory (registers, cache, TLB, and memory), taking into account such "messy" features as block moves and set-associative caches. The insights from this analysis lead to a bit-reversal algorithm whose performance is close to the theoretical minimum. Experiments show it performs significantly better than every program in a comprehensive study of 30 published algo...
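The permutation in question sends index i to the index whose m binary digits are reversed. A minimal Python sketch of the obvious RAM-model version (not the cache-tuned algorithm the paper develops) makes the operation concrete:

```python
def bit_reverse_permute(x):
    """In-place bit-reversal permutation of a length-2^m list.
    Swapping only when i < j suffices because the permutation is an
    involution (reversing the bits twice gives back i)."""
    n = len(x)
    m = n.bit_length() - 1
    for i in range(n):
        j = int(format(i, f'0{m}b')[::-1], 2)  # reverse the m-bit index
        if i < j:
            x[i], x[j] = x[j], x[i]
    return x
```

On a RAM this is trivial, as the abstract says; the difficulty the paper studies is that the access pattern of `x[j]` is maximally cache-hostile for large n.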
A self-sorting in-place fast Fourier transform algorithm suitable for vector and parallel processing
, 1994
Abstract

Cited by 10 (2 self)
this paper point to software. Furthermore, a simple tutorial on FFTs is presented there without explicit usage of Kronecker products. Both complexity results and implementation ideas are presented. A broad introduction to FFTs featuring "matrix language" can be found in [Loa92]. It seems safe to guess that this book will have a large impact on the further development of FFT algorithms, as it gives a concise mathematical treatment and thorough discussion of most of the major FFT ideas known so far. The purpose of this paper is threefold: First, we present a new FFT algorithm. Then we discuss the theory of index-digit permutations on which such algorithms are based. Finally, we derive some old and new FFT algorithms using splitting lemmata. Our new algorithm implements a recursive Swarztrauber-type splitting. In contrast to Swarztrauber's, it is in-place for any order n. It is naturally self-sorting and essentially accesses data only with stride one. This is important for computers featuring very fast stride-one data access like the VP series of Fujitsu. As in Swarztrauber's algorithm, it also lends itself to parallel implementation. In a joint project between ANU and Fujitsu Ltd., implementations of this algorithm for the new VPP 500 supercomputer are being developed. The Fujitsu VPP 500 is a state-of-the-art vector-parallel computer with up to 222 vector processors and capable of up to 355 Gigaflops (from a product description by Fujitsu Ltd.). As a first stage we implemented this algorithm for n being a power of 2 on the VP 2200 at ANU. We will introduce some basic matrix families to describe the Fourier transform algorithms. These families are closely related to basic subroutines (similar to the BLAS in linear algebra) needed for the implementation of the algorithms. The basi...
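As an illustration of what "self-sorting" means, here is a Stockham-style autosort FFT in Python: it produces output in natural order without a separate bit-reversal pass by ping-ponging between two arrays. Note this is NOT the paper's algorithm (which is in-place for arbitrary order n); it is only a well-known point of comparison, restricted here to power-of-two lengths.

```python
import cmath

def stockham_fft(x):
    """Self-sorting FFT for power-of-two n: each stage permutes as it
    computes, so no bit-reversal pass is needed.  Out-of-place (uses a
    second buffer), unlike the in-place algorithm of the paper."""
    n = len(x)
    a, b = list(x), [0j] * n
    s, m = 1, n // 2          # s: sub-transform count/stride, m: half-length
    while m >= 1:
        for p in range(m):
            w = cmath.exp(-2j * cmath.pi * p / (2 * m))
            for q in range(s):
                u = a[q + s * p]
                v = a[q + s * (p + m)]
                b[q + s * 2 * p] = u + v
                b[q + s * (2 * p + 1)] = (u - v) * w
        a, b = b, a           # ping-pong the buffers
        s, m = 2 * s, m // 2
    return a
```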
Computing Isotypic Projections with the Lanczos Iteration
 SIAM J. Matrix Anal. Appl
Abstract

Cited by 8 (2 self)
Abstract. When the isotypic subspaces of a representation are viewed as the eigenspaces of a symmetric linear transformation, isotypic projections may be achieved as eigenspace projections and computed using the Lanczos iteration. In this paper, we show how this approach gives rise to an efficient isotypic projection method for permutation representations of distance-transitive graphs and the symmetric group.
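A generic Lanczos iteration, which the paper specializes to isotypic projections, can be sketched as follows. This is an illustrative Python version with hypothetical names (a `matvec` callback, plain-list vectors), not the authors' code; it builds the orthonormal Krylov basis and tridiagonal coefficients from which eigenspace (and hence isotypic) projections can be assembled.

```python
import random

def lanczos(matvec, n, k, seed=0):
    """k steps of the Lanczos iteration for a symmetric operator given
    only as a matvec callback.  Returns the orthonormal basis vectors Q
    and the tridiagonal coefficients (alphas on the diagonal, betas off
    it); Ritz vectors of the small tridiagonal problem then yield
    eigenspace projections."""
    rng = random.Random(seed)
    q = [rng.gauss(0, 1) for _ in range(n)]          # random start vector
    norm = sum(v * v for v in q) ** 0.5
    q = [v / norm for v in q]
    Q, alphas, betas = [q], [], []
    q_prev, beta = [0.0] * n, 0.0
    for _ in range(k):
        w = matvec(Q[-1])
        alpha = sum(wi * qi for wi, qi in zip(w, Q[-1]))
        # Three-term recurrence: orthogonalize against the last two vectors.
        w = [wi - alpha * qi - beta * pi
             for wi, qi, pi in zip(w, Q[-1], q_prev)]
        alphas.append(alpha)
        beta = sum(v * v for v in w) ** 0.5
        if beta < 1e-12:                             # invariant subspace found
            break
        betas.append(beta)
        q_prev = Q[-1]
        Q.append([v / beta for v in w])
    return Q, alphas, betas
```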