Fast Radix 2, 3, 4, and 5 Kernels for Fast Fourier Transformations on Computers with Overlapping Multiply-Add Instructions (1997)

by S Goedecker
Venue: SIAM Journal on Scientific Computing, Vol

Parallel scaling of Teter’s minimization for Ab Initio calculations

by Torsten Hoefler, Wolfgang Rehm
Cited by 3 (3 self)
We propose a parallelization scheme for the conjugate gradient method of Teter et al. and report a detailed analysis of its scalability. We use MPI collective operations exclusively to take advantage of optimized collective implementations with possible hardware support. Our parallel conjugate gradient calculation can be applied in addition to the parallelism already implemented in the application ABINIT. We propose distribution schemes for the band vectors and the 3D-FFT, and provide both a detailed runtime and scalability analysis and a model for the collective operations used. We use this model of collective communication to predict the parallel scaling and to show that the scalability is mostly limited by communication. Our code scales up to 52 processors for a small 43-atom system and up to 120 processors for a larger 86-atom system for a single k-point on our test cluster. Our results suggest that non-blocking collective communication could be used to improve the application running time, especially on cluster computers.

NAS Benchmarks on the Tera MTA

by Jay Boisseau, Kang Su Gatlin, Amit Majumdar, Allan Snavely, Larry Carter, 1998
Cited by 2 (1 self)
The Tera MTA is a new, revolutionary commercial computer based on a multithreaded processor architecture. We have compiled and run the five NAS kernel parallel benchmarks on a prototype version of the MTA. This paper briefly describes the MTA architecture, our experience with the compiler, and some performance results. We compare a single-processor MTA's performance and ease of programming to that of the Cray T90, the most powerful vector supercomputer made by Cray Research. We found that both the MTA and the single-processor T90 required no tuning on four of the five benchmarks to get respectable performance. The production MTA should be faster on the CG and IS benchmarks, and the T90 is faster on FT and MG. Except for MG, where the T90's faster clock and higher memory-to-processor bandwidth give it an unbeatable advantage, the differences in performance are relatively small. We have defined four levels of tuning effort, ranging from "no tuning" to "heroic". The one remaining code, EP, was ea...

A Parallel 3-D FFT Algorithm on Clusters of Vector SMPs

by Daisuke Takahashi, 2000
Cited by 1 (0 self)
In this paper, we propose a high-performance parallel three-dimensional fast Fourier transform (FFT) algorithm on clusters of vector symmetric multiprocessor (SMP) nodes. The three-dimensional FFT algorithm can be altered into a multirow FFT algorithm to expand the innermost loop length. We use the multirow FFT algorithm to implement the parallel three-dimensional FFT algorithm. Performance results of three-dimensional power-of-two FFTs on clusters of (pseudo) vector SMP nodes, Hitachi SR8000, are reported. We succeeded in obtaining performance of about 40 GFLOPS on a 16-node Hitachi SR8000.
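
The multirow idea in the abstract above can be illustrated with a minimal single-process sketch. It is not the paper's implementation: it assumes power-of-two dimensions, uses plain Python lists, and the names `multirow_fft` and `fft3d` are hypothetical. The point it shows is that transforming along one axis while looping over all other indices in the innermost loop gives a long inner loop, and that a 3-D FFT is three such passes with an axis rotation in between.

```python
import cmath

def multirow_fft(data):
    """Radix-2 FFT along axis 0 of data[n][m] (n a power of two).

    All m columns are transformed together, so the innermost loop has length m.
    """
    n, m = len(data), len(data[0])
    assert n & (n - 1) == 0, "length must be a power of two"
    a = [row[:] for row in data]
    # Bit-reversal permutation of whole rows.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    size = 2
    while size <= n:
        half = size // 2
        for start in range(0, n, size):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / size)
                top, bot = a[start + k], a[start + k + half]
                for c in range(m):  # long innermost loop over independent transforms
                    t = w * bot[c]
                    bot[c] = top[c] - t
                    top[c] = top[c] + t
        size *= 2
    return a

def fft3d(x):
    """3-D FFT of x[n1][n2][n3] as three multirow passes with axis rotations."""
    dims = (len(x), len(x[0]), len(x[0][0]))
    n1, n2, n3 = dims
    # flat[i][j * n3 + k] = x[i][j][k]
    flat = [[x[i][j][k] for j in range(n2) for k in range(n3)] for i in range(n1)]
    for _ in range(3):
        a, b, c = dims
        flat = multirow_fft(flat)  # transform along the current leading axis
        # Rotate axes (a, b, c) -> (b, c, a) so the next axis becomes the leading one.
        flat = [[flat[i][j * c + k] for k in range(c) for i in range(a)]
                for j in range(b)]
        dims = (b, c, a)
    n1, n2, n3 = dims  # back to the original axis order after three rotations
    return [[[flat[i][j * n3 + k] for k in range(n3)] for j in range(n2)]
            for i in range(n1)]
```

In a distributed setting the axis rotation would correspond to a data exchange between nodes; this sketch leaves that out entirely.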

Performance Optimisations of the NPB FT Kernel by Special-Purpose Unroller

by Vladimir Getov, Yuan Wei, Larry Carter, Kang Su Gatlin, 1999
The fast Fourier transform (FFT) is the cornerstone of many supercomputer applications and therefore needs careful performance tuning. Most often, however, the real performance of FFT implementations is far below acceptable figures. In this paper, we explore several strategies for performance optimisation of the FFT computation, such as enhancing instruction-level parallelism, loop merging, and reducing the memory loads and stores by using a special-purpose automatic loop unroller. Our approach is based on the principle of complete unrolling, which we apply to modify the FT kernel of the NAS Parallel Benchmarks (NPB). In experiments on two different IBM SP2 platforms, our automatically generated unrolled FFT subroutine is shown to improve performance by between 40% and 53% in comparison with the original code. Further, the execution time of the entire 3-D FFT mega-step of the benchmark is faster than when calls to a similar FFT subroutine from the vendor-optimised PESSL numeri...
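
To make the principle of complete unrolling concrete, here is a small sketch of a code generator that emits a loop-free radix-2 FFT for a fixed power-of-two size. It is an illustration only and not the authors' unroller: the function name `generate_unrolled_fft`, the scalar temporary naming, and the choice of Python as the target are all assumptions. It shows two effects the abstract mentions: values live in scalar temporaries instead of being reloaded from an array, and trivial twiddle factors (w = 1) are removed at generation time.

```python
import cmath

def generate_unrolled_fft(n, name="fft_unrolled"):
    """Return Python source for a loop-free radix-2 FFT of fixed size n."""
    assert n >= 1 and n & (n - 1) == 0, "n must be a power of two"
    bits = n.bit_length() - 1
    lines = [f"def {name}(x):"]
    # Load the inputs once, in bit-reversed order, into scalar temporaries.
    for i in range(n):
        rev = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        lines.append(f"    t0_{i} = x[{rev}]")
    stage, size = 0, 2
    while size <= n:
        half = size // 2
        for start in range(0, n, size):
            for k in range(half):
                i, j = start + k, start + k + half
                if k == 0:
                    # Twiddle factor is exactly 1: emit no multiplication.
                    lines.append(f"    u = t{stage}_{j}")
                else:
                    w = cmath.exp(-2j * cmath.pi * k / size)
                    lines.append(f"    u = {w!r} * t{stage}_{j}")
                lines.append(f"    t{stage + 1}_{i} = t{stage}_{i} + u")
                lines.append(f"    t{stage + 1}_{j} = t{stage}_{i} - u")
        stage, size = stage + 1, size * 2
    outputs = ", ".join(f"t{stage}_{i}" for i in range(n))
    lines.append(f"    return [{outputs}]")
    return "\n".join(lines)

# Usage: generate the source for size 8 and compile it into a callable.
namespace = {}
exec(generate_unrolled_fft(8), namespace)
fft8 = namespace["fft_unrolled"]
```

The generated function contains only straight-line assignments, so every twiddle factor is a literal constant. Any speedup in Python itself would be incidental; the benefit described in the abstract applies to compiled code.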

Symbiotic Jobscheduling on the Tera MTA

by Allan Snavely, Larry Carter - In Proceedings of Third Workshop on Multi-Threaded Execution, Architecture, and Compilers, 2000
Symbiosis is a term from biology meaning the living together of dissimilar organisms in close proximity. We adapt that term to refer to an increase in throughput that can occur when jobs are coscheduled on multithreaded machines. On a multithreaded machine such as the Tera MTA (Multithreaded Architecture), coscheduled jobs share system resources very intimately on a cycle-by-cycle basis. This can increase system utilization and boost throughput, but it can also lead to pathological resource conflicts that lower overall system performance. We exhibit a number of job interactions, both beneficial and harmful, and explain observed phenomena in a framework of shared system resources. We describe a user-space jobscheduler called S.O.S. that dynamically determines which jobs ought to be coscheduled based on resource utilization measurements. S.O.S. can boost system throughput by more than 10% even when the job mix being scheduled is already highly tuned and efficient.