Results 1-10 of 72
FFTs in external or hierarchical memory
Journal of Supercomputing, 1990
Cited by 149 (11 self)
Conventional algorithms for computing large one-dimensional fast Fourier transforms (FFTs), even those algorithms recently developed for vector and parallel computers, are largely unsuitable for systems with external or hierarchical memory. The principal reason for this is the fact that most FFT algorithms require at least m complete passes through the data set to compute a 2^m-point FFT. This paper describes some advanced techniques for computing an ordered FFT on a computer with external or hierarchical memory. These algorithms (1) require as few as two passes through the external data set, (2) employ strictly unit stride, long vector transfers between main memory and external storage, (3) require only a modest amount of scratch space in main memory, and (4) are well suited for vector and parallel computation. Performance figures are included for implementations of some of these algorithms on Cray supercomputers. Of interest is the fact that a main memory version outperforms the current Cray library FFT routines on the Cray-2, the Cray X-MP, and the Cray Y-MP systems. Using all eight processors on the Cray Y-MP, this main memory routine runs at nearly two gigaflops.
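The abstract does not spell out how the pass count is reduced, so the following is only a sketch of one standard way to do it, the factorization often called the "four-step" FFT: a length-n transform with n = n1 * n2 is computed as n2 short column transforms, a twiddle multiplication, and n1 short row transforms. The names `dft` and `four_step_fft` are hypothetical, and a naive O(n^2) DFT stands in for the fast in-memory transform a real implementation would use.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT; stands in for an in-memory FFT and serves as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def four_step_fft(x, n1, n2):
    """DFT of x (length n1*n2) via the four-step factorization (illustrative sketch)."""
    n = n1 * n2
    assert len(x) == n
    # View x as an n1 x n2 row-major matrix: a[r][c] = x[r*n2 + c].
    a = [list(x[r * n2:(r + 1) * n2]) for r in range(n1)]
    # Step 1: n2 independent transforms of length n1, one per column.
    cols = [dft([a[r][c] for r in range(n1)]) for c in range(n2)]  # cols[c][k1]
    # Step 2: multiply entry (k1, c) by the twiddle factor exp(-2*pi*i*c*k1/n).
    for c in range(n2):
        for k1 in range(n1):
            cols[c][k1] *= cmath.exp(-2j * cmath.pi * c * k1 / n)
    # Step 3: n1 independent transforms of length n2, one per row k1.
    rows = [dft([cols[c][k1] for c in range(n2)]) for k1 in range(n1)]  # rows[k1][k2]
    # Step 4: transposed write-out, X[k1 + n1*k2] = rows[k1][k2].
    out = [0j] * n
    for k1 in range(n1):
        for k2 in range(n2):
            out[k1 + n1 * k2] = rows[k1][k2]
    return out
```

Each of steps 1 and 3 touches the whole data set once, which is where a two-pass structure can come from when the short transforms fit in main memory.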
Faster Integer Multiplication
STOC'07, 2007
Cited by 84 (0 self)
For more than 35 years, the fastest known method for integer multiplication has been the Schönhage-Strassen algorithm running in time O(n log n log log n). Under certain restrictive conditions there is a corresponding Ω(n log n) lower bound. The prevailing conjecture has always been that the complexity of an optimal algorithm is Θ(n log n). We present a major step towards closing the gap from above by presenting an algorithm running in time n log n · 2^O(log* n). The main result is for boolean circuits as well as for multitape Turing machines, but it has consequences to other models of computation as well.
The Fractional Fourier Transform and Applications
1995
Cited by 65 (2 self)
This paper describes the "fractional Fourier transform", which admits computation by an algorithm that has complexity proportional to the fast Fourier transform algorithm. Whereas the discrete Fourier transform (DFT) is based on integral roots of unity e^(-2πi/n), the fractional Fourier transform is based on fractional roots of unity e^(-2πiα), where α is arbitrary. The fractional Fourier transform and the corresponding fast algorithm are useful for such applications as computing DFTs of sequences with prime lengths, computing DFTs of sparse sequences, analyzing sequences with noninteger periodicities, performing high-resolution trigonometric interpolation, detecting lines in noisy images and detecting signals with linearly drifting frequencies. In many cases, the resulting algorithms are faster by arbitrarily large factors than conventional techniques. Bailey is with the Numerical Aerodynamic Simulation (NAS) Systems Division at NASA Ames Research Center, Moffett Field,...
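As a rough illustration of the definition above, the sketch below evaluates G_k = sum_s x_s e^(-2πi s k α) directly, and also through a chirp (Bluestein-style) rewriting that uses the identity 2sk = s^2 + k^2 - (k - s)^2 to turn the sum into a convolution computable with power-of-two FFTs. The abstract does not give the fast algorithm's details, so the chirp route is an assumption about how such an algorithm can work; the function names are hypothetical.

```python
import cmath

def frft_direct(x, alpha):
    """O(m^2) evaluation of G_k = sum_s x[s] * exp(-2*pi*i*s*k*alpha)."""
    m = len(x)
    return [sum(x[s] * cmath.exp(-2j * cmath.pi * s * k * alpha) for s in range(m))
            for k in range(m)]

def fft_pow2(a, sign):
    """Recursive radix-2 FFT for power-of-two lengths; sign=-1 forward, +1 unscaled inverse."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft_pow2(a[0::2], sign)
    odd = fft_pow2(a[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def frft_chirp(x, alpha):
    """Same values as frft_direct, via chirp multiplication and circular convolution."""
    m = len(x)
    p = 1
    while p < 2 * m:          # power-of-two length large enough to avoid wrap-around
        p *= 2
    chirp = [cmath.exp(-1j * cmath.pi * s * s * alpha) for s in range(m)]
    y = [x[s] * chirp[s] for s in range(m)] + [0j] * (p - m)
    z = [0j] * p              # z holds exp(+pi*i*d^2*alpha) for d = -(m-1)..(m-1)
    for s in range(m):
        z[s] = chirp[s].conjugate()
        if s > 0:
            z[p - s] = chirp[s].conjugate()
    fy = fft_pow2(y, -1)
    fz = fft_pow2(z, -1)
    conv = fft_pow2([u * v for u, v in zip(fy, fz)], +1)
    return [chirp[k] * conv[k] / p for k in range(m)]
```

With α = 1/m the result is the ordinary DFT of a length-m sequence, including prime m, since the padded length p is independent of the factorization of m.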
A Survey of Out-of-Core Algorithms in Numerical Linear Algebra
DIMACS SERIES IN DISCRETE MATHEMATICS AND THEORETICAL COMPUTER SCIENCE, 1999
Cited by 62 (3 self)
This paper surveys algorithms that efficiently solve linear equations or compute eigenvalues even when the matrices involved are too large to fit in the main memory of the computer and must be stored on disks. The paper focuses on scheduling techniques that result in mostly sequential data accesses and in data reuse, and on techniques for transforming algorithms that cannot be effectively scheduled. The survey covers out-of-core algorithms for solving dense systems of linear equations, for the direct and iterative solution of sparse systems, for computing eigenvalues, for fast Fourier transforms, and for N-body computations. The paper also discusses reasonable assumptions on memory size, approaches for the analysis of out-of-core algorithms, and relationships between out-of-core, cache-aware, and parallel algorithms.
The CMU Task Parallel Program Suite
1994
Cited by 61 (7 self)
The idea of exploiting both task and data parallelism in programs is appealing. However, identifying realistic, yet manageable example programs that can benefit from such a mix of task and data parallelism is a major problem for researchers. We address this problem by describing a suite of five applications from the domains of scientific, signal, and image processing that are of reasonable size, are representative of real codes, and can benefit from exploiting task and data parallelism. The suite includes fast Fourier transforms, narrowband tracking radar, multibaseline stereo imaging, and airshed simulation. Complete source code for each example program is available from the authors.
An algorithm for computing the mixed radix fast Fourier transform
IEEE Trans. on Audio and Electroacoustics, 1969
Cited by 58 (0 self)
This paper presents an algorithm for computing the fast Fourier transform, based on a method proposed by Cooley and Tukey. As in their algorithm, the dimension n of the transform is factored (if possible), and n/p elementary transforms of dimension p are computed for each factor p of n. An improved method of computing a transform step corresponding to an odd factor of n is given; with this method, the number of complex multiplications for an elementary transform of dimension p is reduced from (p-1)^2 to (p-1)^2/4 for odd p. The fast Fourier transform, when computed in place, requires a final permutation step to arrange the results in normal order. This algorithm includes an efficient method for permuting the results in place. The algorithm is described mathematically and illustrated by a FORTRAN subroutine.
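The factoring step described above can be sketched with a plain recursive Cooley-Tukey decomposition: pick a factor p of n, transform the p interleaved subsequences of length n/p, then combine them with n/p elementary transforms of dimension p. This is a minimal out-of-place illustration under that assumption only; it does not reproduce the paper's reduced-multiplication method for odd factors or its in-place permutation, and the function names are hypothetical.

```python
import cmath

def naive_dft(x):
    """O(n^2) reference DFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def smallest_factor(n):
    """Smallest prime factor of n (n itself if n is prime)."""
    p = 2
    while p * p <= n:
        if n % p == 0:
            return p
        p += 1
    return n

def mixed_radix_fft(x):
    """Recursive mixed-radix decimation-in-time FFT for any length (sketch)."""
    n = len(x)
    if n <= 1:
        return list(x)
    p = smallest_factor(n)
    m = n // p
    # p sub-transforms of length m on the subsequences x[r], x[r+p], x[r+2p], ...
    subs = [mixed_radix_fft(x[r::p]) for r in range(p)]
    out = [0j] * n
    for k in range(m):
        # Twiddled inputs to one elementary transform of dimension p.
        t = [subs[r][k] * cmath.exp(-2j * cmath.pi * r * k / n) for r in range(p)]
        for q in range(p):
            out[k + q * m] = sum(t[r] * cmath.exp(-2j * cmath.pi * r * q / p)
                                 for r in range(p))
    return out
```

For prime n the recursion bottoms out in a single elementary transform of dimension n, which here is just the direct O(n^2) sum.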
Communication and Memory Requirements as the Basis for Mapping Task and Data Parallel Programs
1994
Cited by 32 (8 self)
For a wide variety of applications, both task and data parallelism must be exploited to achieve the best possible performance on a multicomputer. Recent research has underlined the importance of exploiting task and data parallelism in a single compiler framework, and such a compiler can map a single source program in many different ways onto a parallel machine. The tradeoffs between task and data parallelism are complex and depend on the characteristics of the program to be executed, most significantly the memory and communication requirements, and the performance parameters of the target parallel machine. In this paper, we present a framework to isolate and examine the specific characteristics of programs that determine the performance for different mappings. Our focus is on applications that process a stream of input, and whose computation structure is fairly static and predictable. We describe three such applications that were developed with our compiler: fast Fourier transforms, nar...
Sequential and parallel complexity of approximate evaluation of polynomial zeros
COMPUT. MATH. APPLIC, 1987
Cited by 20 (7 self)
Our new sequential and parallel algorithms establish new record upper bounds on both arithmetic and Boolean complexity of approximating to complex polynomial zeros. O(n^2 log b log n) arithmetic operations or O(n log n log(bn)) parallel steps and n log b/log(bn) processors suffice in order to approximate with absolute errors ≤ 2^(m-b) to all the complex zeros of an nth degree polynomial p(x) whose coefficients have moduli ≤ 2^m. If we only need such an approximation to a single zero of p(x), then O(n log b log n) arithmetic operations or O(log^2 n log(bn)) steps and (n/log n) log b/log(bn) processors suffice (which places the latter problem in NC, that is, in the class of problems that can be solved using polylogarithmic parallel time and a polynomial number of processors). Those estimates are reached in computations with O(bn) binary bits where the polynomial has integer coefficients. We also reach the sequential Boolean time bounds O(bn^3 log(bn) log log(bn)) for approximating to all the zeros (very minor improvement of the bound announced in 1982 by Schönhage) and O(bn^2 log log n log(bn) log log(bn)) for approximating to a single zero. Among further implications are the improvements of the known algorithms and complexity estimates for computing matrix eigenvalues, for polynomial factorization over the field of complex numbers and for solving systems of polynomial equations. The computations rely on recursive application of Turan's proximity test of 1968, on its more recent extensions to root radii computations, on contour integration via Fast Fourier transform (FFT) within geometric constructions for search and exclusion, and (for the final minor improvements of the complexity bounds) on the recursive factorization of p(x) over discs on the complex plane via numerical integration and Newton's iterations.
Quantitative Performance Modeling of Scientific Computations and Creating Locality in Numerical Algorithms
1995
Cited by 19 (3 self)
... you design an efficient outofcore iterative algorithm? These are the two questions answered in this thesis. The first
Computing isotypic projections with the Lanczos iteration
SIAM Journal on Matrix Analysis and Applications
Cited by 15 (3 self)
Abstract. When the isotypic subspaces of a representation are viewed as the eigenspaces of a symmetric linear transformation, isotypic projections may be achieved as eigenspace projections and computed using the Lanczos iteration. In this paper, we show how this approach gives rise to an efficient isotypic projection method for permutation representations of distance transitive graphs and the symmetric group.