FFTs in external or hierarchical memory
Journal of Supercomputing, 1990
Cited by 124 (10 self)
Conventional algorithms for computing large one-dimensional fast Fourier transforms (FFTs), even those algorithms recently developed for vector and parallel computers, are largely unsuitable for systems with external or hierarchical memory. The principal reason for this is the fact that most FFT algorithms require at least m complete passes through the data set to compute a $2^m$-point FFT. This paper describes some advanced techniques for computing an ordered FFT on a computer with external or hierarchical memory. These algorithms (1) require as few as two passes through the external data set, (2) employ strictly unit stride, long vector transfers between main memory and external storage, (3) require only a modest amount of scratch space in main memory, and (4) are well suited for vector and parallel computation. Performance figures are included for implementations of some of these algorithms on Cray supercomputers. Of interest is the fact that a main memory version outperforms the current Cray library FFT routines on the Cray-2, the Cray X-MP, and the Cray Y-MP systems. Using all eight processors on the Cray Y-MP, this main memory routine runs at nearly two gigaflops.
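The abstract does not spell out how the number of passes is reduced. As a rough illustration only, the sketch below uses a generic Cooley-Tukey factorization n = n1 * n2, which splits one long transform into two sweeps of short transforms separated by a twiddle-factor multiplication. The function names `dft` and `two_sweep_dft` are hypothetical, the short transforms use a naive O(n^2) DFT for clarity, and nothing here is taken from the paper's own implementation.

```python
import cmath


def dft(v):
    """Naive O(n^2) DFT, used here only as a short-transform building block."""
    n = len(v)
    return [
        sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def two_sweep_dft(x, n1, n2):
    """DFT of length n1*n2 computed as two sweeps of short DFTs.

    Illustrative assumption: a plain Cooley-Tukey split with input index
    j = j1 + n1*j2 and output index k = k2 + n2*k1.
    """
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")

    # Sweep 1: n1 transforms of length n2 over the strided subsequences.
    rows = [dft(x[j1::n1]) for j1 in range(n1)]

    # Twiddle factors exp(-2*pi*i*j1*k2/n) couple the two sweeps.
    for j1 in range(n1):
        for k2 in range(n2):
            rows[j1][k2] *= cmath.exp(-2j * cmath.pi * j1 * k2 / n)

    # Sweep 2: n2 transforms of length n1 down the columns.
    out = [0j] * n
    for k2 in range(n2):
        col = dft([rows[j1][k2] for j1 in range(n1)])
        for k1 in range(n1):
            out[k2 + n2 * k1] = col[k1]
    return out
```

Each sweep touches every element once, so the whole transform needs two passes over the data rather than one pass per butterfly stage. A production version would replace the naive `dft` with a fast in-memory FFT and arrange the sweeps so that transfers are contiguous.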
A Survey of Out-of-Core Algorithms in Numerical Linear Algebra
DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 1999
Cited by 44 (2 self)
This paper surveys algorithms that efficiently solve linear equations or compute eigenvalues even when the matrices involved are too large to fit in the main memory of the computer and must be stored on disks. The paper focuses on scheduling techniques that result in mostly sequential data accesses and in data reuse, and on techniques for transforming algorithms that cannot be effectively scheduled. The survey covers out-of-core algorithms for solving dense systems of linear equations, for the direct and iterative solution of sparse systems, for computing eigenvalues, for fast Fourier transforms, and for N-body computations. The paper also discusses reasonable assumptions on memory size, approaches for the analysis of out-of-core algorithms, and relationships between out-of-core, cache-aware, and parallel algorithms.
The Fractional Fourier Transform and Applications
1995
Cited by 33 (2 self)
This paper describes the "fractional Fourier transform", which admits computation by an algorithm that has complexity proportional to the fast Fourier transform algorithm. Whereas the discrete Fourier transform (DFT) is based on integral roots of unity $e^{-2\pi i/n}$, the fractional Fourier transform is based on fractional roots of unity $e^{-2\pi i \alpha}$, where $\alpha$ is arbitrary. The fractional Fourier transform and the corresponding fast algorithm are useful for such applications as computing DFTs of sequences with prime lengths, computing DFTs of sparse sequences, analyzing sequences with non-integer periodicities, performing high-resolution trigonometric interpolation, detecting lines in noisy images and detecting signals with linearly drifting frequencies. In many cases, the resulting algorithms are faster by arbitrarily large factors than conventional techniques. Bailey is with the Numerical Aerodynamic Simulation (NAS) Systems Division at NASA Ames Research Center, Moffett Field,...
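To make the distinction between integral and fractional roots of unity concrete, here is a minimal sketch that evaluates the transform directly from its root of unity. It assumes the definition G_k = sum over j of x_j * exp(-2*pi*i*j*k*alpha), which is a natural reading of the abstract but is not quoted from it; the function name `fractional_dft` is hypothetical. This is the direct O(n^2) evaluation, not the fast algorithm the paper refers to.

```python
import cmath


def fractional_dft(x, alpha):
    """Directly evaluate G_k = sum_j x[j] * exp(-2*pi*i*j*k*alpha).

    Assumed definition for illustration; cost is O(n^2).
    """
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k * alpha) for j in range(n))
        for k in range(n)
    ]


def dft(x):
    """The ordinary DFT is the special case alpha = 1/n."""
    return fractional_dft(x, 1.0 / len(x))
```

With `alpha = 1/n` the sketch reduces to the ordinary DFT; any other real `alpha` samples the same trigonometric sum at a different frequency spacing.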
Communication and Memory Requirements as the Basis for Mapping Task and Data Parallel Programs
1994
Cited by 30 (7 self)
For a wide variety of applications, both task and data parallelism must be exploited to achieve the best possible performance on a multicomputer. Recent research has underlined the importance of exploiting task and data parallelism in a single compiler framework, and such a compiler can map a single source program in many different ways onto a parallel machine. The tradeoffs between task and data parallelism are complex and depend on the characteristics of the program to be executed, most significantly the memory and communication requirements, and the performance parameters of the target parallel machine. In this paper, we present a framework to isolate and examine the specific characteristics of programs that determine the performance for different mappings. Our focus is on applications that process a stream of input, and whose computation structure is fairly static and predictable. We describe three such applications that were developed with our compiler: fast Fourier transforms, nar...
Faster Integer Multiplication
STOC'07, 2007
Cited by 26 (0 self)
For more than 35 years, the fastest known method for integer multiplication has been the Schönhage-Strassen algorithm running in time O(n log n log log n). Under certain restrictive conditions there is a corresponding Ω(n log n) lower bound. The prevailing conjecture has always been that the complexity of an optimal algorithm is Θ(n log n). We present a major step towards closing the gap from above by presenting an algorithm running in time $n \log n \, 2^{O(\log^* n)}$. The main result is for boolean circuits as well as for multitape Turing machines, but it has consequences to other models of computation as well.
A self-sorting in-place fast Fourier transform algorithm suitable for vector and parallel processing
1994
Cited by 9 (2 self)
this paper point to software. Furthermore a simple tutorial on FFTs is presented there without explicit usage of Kronecker products. Both complexity results and implementation ideas are presented. A broad introduction to FFTs featuring "matrix language" can be found in [Loa92]. It seems safe to guess that this book will have large impact on the further development of FFT algorithms as it gives a concise mathematical treatment and thorough discussion of most of the major FFT ideas known so far. The purpose of this paper is threefold: First we present a new FFT algorithm. Then we discuss the theory of index digit permutations on which such algorithms are based. Finally we derive some old and new FFT algorithms using splitting lemmata. Our new algorithm implements a recursive Swarztrauber-type of splitting. In contrast to Swarztrauber's, it is in-place for any order n. It is naturally self-sorting and essentially accesses data only with stride one. This is important for computers featuring very fast stride one data access like the VP series of Fujitsu. As in Swarztrauber's algorithm, it also lends itself to parallel implementation. In a joint project between ANU and Fujitsu Ltd. implementations of this algorithm for the new VPP 500 supercomputer are being developed. The Fujitsu VPP 500 is a state-of-the-art vector parallel computer with up to 222 vector processors and capable of up to 355 Gigaflops (from a product description by Fujitsu Ltd.). As a first stage we implemented this algorithm for n being a power of 2 on the VP 2200 at ANU. We will introduce some basic matrix families to describe the Fourier transform algorithms. These families are closely related to basic subroutines (similar to the BLAS in linear algebra) needed for the implementation of the algorithms. The basi...
Towards an Optimal Bit-Reversal Permutation Program
In Proceedings of IEEE Foundations of Computer Science, 1998
Cited by 8 (2 self)
The speed of many computations is limited not by the number of arithmetic operations but by the time it takes to move and rearrange data in the increasingly complicated memory hierarchies of modern computers. Array transpose and the bit-reversal permutation -- trivial operations on a RAM -- present non-trivial problems when designing highly-tuned scientific library functions, particularly for the Fast Fourier Transform. We prove a precise bound for RoCol, a simple pebble-type game that is relevant to implementing these permutations. We use RoCol to give lower bounds on the amount of memory traffic in a computer with four levels of memory (registers, cache, TLB, and memory), taking into account such "messy" features as block moves and set-associative caches. The insights from this analysis lead to a bit-reversal algorithm whose performance is close to the theoretical minimum. Experiments show it performs significantly better than every program in a comprehensive study of 30 published algo...
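For readers unfamiliar with the permutation itself, the following is a minimal sketch of the naive bit-reversal permutation, the version that is "trivial on a RAM". It is not the memory-hierarchy-aware algorithm the abstract describes; the function names are hypothetical, and the input length is assumed to be a power of two.

```python
def bit_reverse(i, bits):
    """Return the integer whose low `bits` bits are those of i in reverse order."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def bit_reversal_permute(a):
    """Return a copy of `a` with element i moved to position bit_reverse(i).

    Assumes len(a) is a power of two. The scattered, large-stride accesses
    in the swap loop are what make this pattern unfriendly to caches and TLBs.
    """
    n = len(a)
    bits = n.bit_length() - 1
    if n == 0 or (1 << bits) != n:
        raise ValueError("length must be a power of two")
    out = list(a)
    for i in range(n):
        j = bit_reverse(i, bits)
        if i < j:  # swap each pair once; fixed points stay in place
            out[i], out[j] = out[j], out[i]
    return out
```

For example, `bit_reversal_permute(list(range(8)))` returns `[0, 4, 2, 6, 1, 5, 3, 7]`, and applying the permutation twice restores the original order.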
Computing Isotypic Projections with the Lanczos Iteration
SIAM J. Matrix Anal. Appl.
Cited by 8 (2 self)
When the isotypic subspaces of a representation are viewed as the eigenspaces of a symmetric linear transformation, isotypic projections may be achieved as eigenspace projections and computed using the Lanczos iteration. In this paper, we show how this approach gives rise to an efficient isotypic projection method for permutation representations of distance transitive graphs and the symmetric group.
Approximate Complex Polynomial Evaluation In Near Constant Work Per Point
1999
Cited by 5 (0 self)
Given the n complex coefficients of a degree n - 1 complex polynomial, we wish to evaluate the polynomial at a large number $m \geq n$ of points on the complex plane. This problem is required by many algebraic computations and so is considered in most basic algorithm texts (e.g., [A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, 1974]). We assume an arithmetic model of computation, where on each step we can execute an arithmetic operation, which is computed exactly. All previous exact algorithms [C. M. Fiduccia, Proceedings 4th Annual ACM Symposium on Theory of Computing, 1972, pp. 88--93; H. T. Kung, Fast Evaluation and Interpolation, Carnegie-Mellon, 1973; A. B. Borodin and I. Munro, The Computational Complexity of Algebraic and Numerical Problems, American Elsevier, 1975; V. Pan, A. Sadikou, E. Landowne, and O. Tiga, Comput. Math. Appl., 25 (1993), pp. 25--30] cost at least work $\Omega(\log^2 n)$ per point, and previously, the...

