Results 1–10 of 11
Parallel scaling of Teter’s minimization for Ab Initio calculations
Cited by 5 (5 self)
Abstract — We propose a parallelization scheme for the conjugate gradient method by Teter et al. and report a detailed analysis of its scalability. We use MPI collective operations exclusively to take advantage of optimized collective implementations with possible hardware support. Our parallel conjugate gradient calculation can be applied on top of the parallelism already implemented in the application ABINIT. We propose distribution schemes for the band vectors and the 3D FFT, and provide a detailed runtime and scalability analysis together with a model for the collective operations used. We use this model of collective communication to predict the parallel scaling and to show that scalability is limited mostly by communication. Our code scales up to 52 processors for a small 43-atom system and up to 120 processors for a larger 86-atom system for a single k-point on our test cluster. Our results suggest that non-blocking collective communication could be used to reduce application running time, especially on cluster computers.
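The communication-limited scaling described in this abstract can be sketched with a simple latency–bandwidth model of an allreduce. This is an illustrative sketch only: the model form, the function names, and the `alpha`/`beta` values are hypothetical placeholders, not the paper's measured parameters.

```python
import math

# Hockney-style model of a tree-based allreduce: log2(p) rounds, each
# paying a startup latency alpha plus m bytes at per-byte cost beta.
# alpha and beta below are illustrative placeholders, not measured values.
def t_allreduce(p, m, alpha=5e-6, beta=1e-9):
    if p == 1:
        return 0.0
    return math.ceil(math.log2(p)) * (alpha + m * beta)

# Predicted speedup when compute shrinks as 1/p but each iteration pays
# one allreduce on an m-byte vector: communication eventually dominates.
def predicted_speedup(p, t_serial, m):
    return t_serial / (t_serial / p + t_allreduce(p, m))
```

Even this toy model reproduces the qualitative finding: speedup flattens once the `t_allreduce` term dominates the shrinking compute term, so scalability is bounded by communication rather than by processor count.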
NAS Benchmarks on the Tera MTA
, 1998
Cited by 3 (1 self)
The Tera MTA is a new, revolutionary commercial computer based on a multithreaded processor architecture. We have compiled and run the five NAS kernel parallel benchmarks on a prototype version of the MTA. This paper briefly describes the MTA architecture, our experience with the compiler, and some performance results. We compare a single-processor MTA's performance and ease of programming to that of the Cray T90, the most powerful vector supercomputer made by Cray Research. We found that both the MTA and the single-processor T90 required no tuning on four of the five benchmarks to get respectable performance. The production MTA should be faster on the CG and IS benchmarks, and the T90 is faster on FT and MG. Except for MG, where the T90's faster clock and higher memory-to-processor bandwidth give it an unbeatable advantage, the differences in performance are relatively small. We have defined four levels of tuning effort, ranging from "no tuning" to "heroic". The one remaining code, EP, was ea...
A Parallel 3D FFT Algorithm on Clusters of Vector SMPs
, 2000
Cited by 2 (0 self)
In this paper, we propose a high-performance parallel three-dimensional fast Fourier transform (FFT) algorithm on clusters of vector symmetric multiprocessor (SMP) nodes. The three-dimensional FFT algorithm can be altered into a multi-row FFT algorithm to expand the innermost loop length. We use the multi-row FFT algorithm to implement the parallel three-dimensional FFT algorithm. Performance results for three-dimensional power-of-two FFTs on clusters of (pseudo-)vector SMP nodes, the Hitachi SR8000, are reported. We obtained a performance of about 40 GFLOPS on a 16-node Hitachi SR8000.
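The axis-by-axis decomposition that such 3D FFT algorithms start from can be sketched in a few lines. This is a plain NumPy sketch of the mathematical structure only, not the paper's vectorized multi-row SR8000 kernel; the function name is illustrative.

```python
import numpy as np

# A 3D FFT is three passes of 1D FFTs, one along each axis.  The
# multi-row reorganization described in the paper reshapes the data
# between passes so the innermost loop runs over many rows at once.
def fft3d_by_axes(x):
    for axis in range(3):
        x = np.fft.fft(x, axis=axis)
    return x

# sanity check against the library's direct 3D transform
rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8, 8)) + 1j * rng.standard_normal((8, 8, 8))
assert np.allclose(fft3d_by_axes(a), np.fft.fftn(a))
```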
Mechanical derivation of fused multiply-add algorithms for linear transforms
 IEEE Transactions on Signal Processing
, 2007
Cited by 1 (0 self)
Abstract—Several computer architectures offer fused multiply-add (FMA), also called multiply-and-accumulate (MAC), instructions that are as fast as a single addition or multiplication. For the efficient implementation of linear transforms, such as the discrete Fourier transform or discrete cosine transforms, this poses a challenge to algorithm developers, as standard transform algorithms have to be manipulated into FMA algorithms that make optimal use of FMA instructions. We present a general method to convert any transform algorithm into an FMA algorithm. The method works both with algorithms given as directed acyclic graphs (DAGs) and with algorithms given as structured matrix factorizations. We prove bounds on the efficiency of the method. In particular, we show that it removes all single multiplications except at most as many as the transform has outputs. We implemented the DAG-based version of the method and show that we can generate many of the best-known hand-derived FMA algorithms from the literature as well as a few novel FMA algorithms. Index Terms—Automatic program generation, discrete co...
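The core rewrite such a method performs can be illustrated on a toy linear instruction list: a multiplication whose result feeds exactly one addition becomes a single fused operation. The tuple IR `(kind, dst, src1, src2)` and the function name here are hypothetical illustrations, not the paper's DAG formalism or its efficiency-preserving algorithm.

```python
# Toy pass illustrating the mul+add -> fma rewrite that the paper
# derives systematically on DAGs.  IR: (kind, dst, src1, src2).
def fuse_multiply_adds(ops):
    uses = {}
    for op in ops:
        for s in op[2:]:
            uses[s] = uses.get(s, 0) + 1
    out, i = [], 0
    while i < len(ops):
        op = ops[i]
        nxt = ops[i + 1] if i + 1 < len(ops) else None
        # fuse only when the mul result feeds exactly one op: the next add
        if (op[0] == 'mul' and nxt and nxt[0] == 'add'
                and op[1] in nxt[2:] and uses.get(op[1], 0) == 1):
            _, dst, x, y = nxt
            other = y if x == op[1] else x
            out.append(('fma', dst, op[2], op[3], other))  # dst = src1*src2 + other
            i += 2
        else:
            out.append(op)
            i += 1
    return out
```

The single-use check matters: a product consumed by two additions cannot be folded away without duplicating the multiplication, which is exactly the kind of trade-off the paper's bounds account for.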
unknown title
Direct minimization for calculating invariant subspaces in density functional computations of the electronic structure ∗
Sodium oxalate
, 2005
"... Powder diffraction structure analysis / X-ray diffraction / ..."
FFT ALGORITHMS FOR MULTIPLY-ADD ARCHITECTURES
Abstract. FFTs are the single most important algorithm in science and engineering. To utilize current hardware architectures, special techniques are needed. This paper introduces newly developed radix-2^n FFT kernels that efficiently take advantage of fused multiply-add (FMA) instructions. If a processor provides FMA instructions, the new radix-2^n kernels reduce the number of required twiddle factors from 2^n − 1 to n compared to conventional radix-2^n FFT kernels. The reduction in the number of twiddle factors is accomplished by a "hidden" computation of the twiddle factors (instead of accessing a twiddle-factor array), hiding the additional arithmetic operations in FMA instructions. The new FFT kernels are fully compatible with conventional FFT kernels and can therefore easily be incorporated into existing FFT software, such as Fftw or Spiral.
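For contrast with the FMA kernels the abstract describes, a conventional radix-2 kernel looks like the sketch below: every butterfly explicitly obtains a twiddle factor w before the multiply and add. This is a textbook reference implementation, not the paper's FMA-optimized kernel.

```python
import cmath

# Textbook recursive radix-2 decimation-in-time FFT (length must be a
# power of two).  A conventional kernel like this computes or loads one
# twiddle factor w per butterfly; the paper's FMA kernels instead hide
# the twiddle arithmetic inside fused multiply-add instructions.
def fft_radix2(x):
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # explicit twiddle factor
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out
```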
Performance Optimisations of the NPB FT Kernel by Special-Purpose Unroller
, 1999
The fast Fourier transform (FFT) is the cornerstone of many supercomputer applications and therefore needs careful performance tuning. Most often, however, the real performance of FFT implementations falls far below acceptable figures. In this paper, we explore several strategies for performance optimisation of the FFT computation, such as enhancing instruction-level parallelism, loop merging, and reducing memory loads and stores by using a special-purpose automatic loop unroller. Our approach is based on the principle of complete unrolling, which we apply to modify the FT kernel of the NAS Parallel Benchmarks (NPB). In experiments on two different IBM SP2 platforms, our automatically generated unrolled FFT subroutine improves performance by between 40% and 53% in comparison with the original code. Further, the entire 3D FFT megastep of the benchmark runs faster than when calls are made to a similar FFT subroutine from the vendor-optimised PESSL numeri...
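The complete-unrolling principle the abstract mentions, generating straight-line code with no loop overhead, can be sketched with a tiny source generator. The operation unrolled here (an axpy update) and all names are illustrative placeholders; the paper's unroller targets the NPB FT kernel bodies, not this toy.

```python
# Toy "special-purpose unroller": emit fully unrolled Python source for
# y[i] += a * x[i], i = 0..n-1, mimicking the complete-unrolling
# principle applied in the paper.  Straight-line code removes loop
# bookkeeping and exposes independent statements to the scheduler.
def gen_unrolled_saxpy(n):
    body = "\n".join(f"    y[{i}] += a * x[{i}]" for i in range(n))
    src = f"def saxpy_{n}(a, x, y):\n{body}\n    return y\n"
    namespace = {}
    exec(src, namespace)  # compile the generated straight-line function
    return namespace[f"saxpy_{n}"]
```

A generated `saxpy_4(2.0, [1, 2, 3, 4], [0, 0, 0, 0])` performs four independent update statements with no loop at all, which is the shape of code the unroller hands to the backend compiler.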
Symbiotic Jobscheduling on the Tera MTA
 In Proceedings of Third Workshop on MultiThreaded Execution, Architecture, and Compilers
, 2000
Symbiosis is a term from biology meaning the living together of dissimilar organisms in close proximity. We adapt that term to refer to an increase in throughput that can occur when jobs are co-scheduled on multithreaded machines. On a multithreaded machine such as the Tera MTA (Multithreaded Architecture), co-scheduled jobs share system resources intimately, on a cycle-by-cycle basis. This can increase system utilization and boost throughput, but it can also lead to pathological resource conflicts that lower overall system performance. We exhibit a number of job interactions, both beneficial and harmful, and explain the observed phenomena in a framework of shared system resources. We describe a user-space job scheduler called S.O.S. that dynamically determines which jobs ought to be co-scheduled based on resource-utilization measurements. S.O.S. can boost system throughput by more than 10% even when the job mix being scheduled is already highly tuned and efficient.