Results 1-10 of 16
Parallel Algorithms For The Spectral Transform Method
, 1994
Cited by 32 (13 self)
The spectral transform method is a standard numerical technique for solving partial differential equations on a sphere and is widely used in atmospheric circulation models. Recent research has identified several promising algorithms for implementing this method on massively parallel computers; however, no detailed comparison of the different algorithms has previously been attempted. In this paper, we describe these different parallel algorithms and report on computational experiments that we have conducted to evaluate their efficiency on parallel computers. The experiments used a testbed code that solves the nonlinear shallow water equations on a sphere; considerable care was taken to ensure that the experiments provide a fair comparison of the different algorithms and that the results are relevant to global models. We focus on hypercube and mesh-connected multicomputers with cut-through routing, such as the Intel iPSC/860, DELTA, and Paragon, and the nCUBE/2, but also indicate how th...
Index Transformation Algorithms in a Linear Algebra Framework
, 1992
Cited by 15 (3 self)
We present a linear algebraic formulation for a class of index transformations such as Gray code encoding and decoding, matrix transpose, bit reversal, vector reversal, shuffles, and other index or dimension permutations. This formulation unifies, simplifies, and can be used to derive algorithms for hypercube multiprocessors. We show how all the widely known properties of Gray codes, and some not so well-known properties as well, can be derived using this framework. Using this framework, we relate hypercube communications algorithms to Gauss-Jordan elimination on a matrix of 0's and 1's.
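As an illustration of the idea in this abstract, Gray code encoding and decoding can be written as matrix-vector products over GF(2) acting on the bits of an index. The following is a minimal sketch, not the paper's own formulation; the function names and the bit ordering (bit 0 is the least significant) are assumptions made for the example.

```python
def gray_matrix(n):
    """Encoding matrix over GF(2): output bit i is b_i XOR b_(i+1)."""
    return [[1 if j in (i, i + 1) else 0 for j in range(n)] for i in range(n)]


def inverse_gray_matrix(n):
    """Decoding matrix over GF(2): output bit i is the XOR of g_j for j >= i."""
    return [[1 if j >= i else 0 for j in range(n)] for i in range(n)]


def to_bits(x, n):
    """Bits of x, least significant first."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def apply_gf2(matrix, x, n):
    """Multiply the bit vector of x by a 0/1 matrix, with arithmetic mod 2."""
    bits = to_bits(x, n)
    return from_bits([sum(m * b for m, b in zip(row, bits)) % 2 for row in matrix])


def gray_encode(x, n):
    return apply_gf2(gray_matrix(n), x, n)


def gray_decode(g, n):
    return apply_gf2(inverse_gray_matrix(n), g, n)
```

The encoding matrix reproduces the familiar shortcut `x ^ (x >> 1)`, and the decoding matrix is its inverse over GF(2), so a round trip returns the original index.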
Bit Reversal On Uniprocessors
 SIAM Rev
, 1996
Cited by 13 (0 self)
Many versions of the fast Fourier transform require a reordering of either the input or the output data that corresponds to reversing the order of the bits in the array index. There has been a surprisingly large number of papers on this subject in the recent literature.
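The reordering described in this abstract can be sketched in a few lines. This is a straightforward illustrative version, not one of the optimized algorithms the paper surveys; the function names are hypothetical.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reverse_permute(data):
    """Return a copy of `data` in bit-reversed index order.

    The length must be a power of two. Each pair (i, j) is swapped once,
    which works because bit reversal is its own inverse.
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    out = list(data)
    for i in range(n):
        j = bit_reverse(i, bits)
        if i < j:
            out[i], out[j] = out[j], out[i]
    return out
```

For example, `bit_reverse_permute(list(range(8)))` gives `[0, 4, 2, 6, 1, 5, 3, 7]`.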
Optimal Communication Channel Utilization for Matrix Transposition and Related Permutations on Binary Cubes
 DISCRETE APPLIED MATHEMATICS
, 1992
Cited by 7 (2 self)
We present optimal schedules for permutations in which each node sends one or several unique messages to every other node. With concurrent communication on all channels of every node in binary cube networks, the number of element transfers in sequence for K elements per node is K/2, irrespective of the number of nodes over which the data set is distributed. For a succession of s permutations within disjoint subcubes of d dimensions each, our schedules yield min(K/2 + (s - 1)d, (s + 3)d, K/2 + 2d) exchanges in sequence. The algorithms can be organized to avoid indirect addressing in the internode data exchanges, a property that increases the performance on some architectures. For message-passing communication libraries, we present a blocking procedure that minimizes the number of block transfers while preserving the utilization of the communication channels. For schedules with optimal channel utilization, the number of block transfers for a binary d-cube is d. The maximum ...
Partial multinode broadcast and partial exchange algorithms for d-dimensional meshes
 Journal of Parallel and Distributed Computing
, 1994
Data Parallel Programming: A Survey and a Proposal for a New Model
, 1993
Cited by 5 (0 self)
We give a brief description of what we consider to be data parallel programming and processing, trying to pinpoint the typical problems and pitfalls that occur. We then proceed with a short annotated history of data parallel programming, and sketch a taxonomy in which data parallel languages can be classified. Finally we present our own model of data parallel programming, which is based on the view of parallel data collections as functions. We believe that this model has a number of distinct advantages, such as being abstract, independent of implicitly assumed machine models, and general.
Optimal All-to-All Personalized Communication with Minimum Span on Boolean Cubes
 in proceedings of the 6th Distributed Memory Computing Conf
, 1991
Cited by 4 (0 self)
All-to-all personalized communication is a class of permutations in which each processor sends a unique message to every other processor. We present optimal algorithms for concurrent communication on all channels in Boolean cube networks, both for the case with a single permutation, and the case where multiple permutations shall be performed on the same local data set, but on different sets of processors. For K elements per processor our algorithms give the optimal number of element transfers, K/2. For a succession of all-to-all personalized communications on disjoint subcubes of β dimensions each, our best algorithm yields K/2 + σ - β element exchanges in sequence, where σ is the total number of processor dimensions in the permutation. An implementation on the Connection Machine of one of the algorithms offers a maximum speedup of 50% compared to the previously best known algorithm. 1 Introduction We give simple, yet optimal, schedules for all-to-all personalized commun...
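To make the communication pattern concrete, the sketch below simulates all-to-all personalized communication on a Boolean cube using the simple dimension-by-dimension exchange. This is not the optimal all-channel schedule the paper presents: it uses one channel per node per step, and it is included only to show what the pattern delivers. The function name and the (source, destination) message representation are assumptions made for the example.

```python
def all_to_all_exchange(dims):
    """Simulate personalized all-to-all on a cube with 2**dims nodes.

    Every node starts with one message for each node (itself included).
    In step d, a node forwards across dimension d every message whose
    destination differs from the node's own address in bit d.
    Returns the final buffers and the number of messages node 0 sent per step.
    """
    n_nodes = 1 << dims
    # buffers[p] holds the (source, destination) pairs currently at node p.
    buffers = [[(p, q) for q in range(n_nodes)] for p in range(n_nodes)]
    sent_by_node0 = []
    for d in range(dims):
        mask = 1 << d
        next_buffers = [[] for _ in range(n_nodes)]
        sent = 0
        for p in range(n_nodes):
            for message in buffers[p]:
                destination = message[1]
                if (destination ^ p) & mask:
                    # Bit d is wrong: forward to the neighbor across dimension d.
                    next_buffers[p ^ mask].append(message)
                    if p == 0:
                        sent += 1
                else:
                    next_buffers[p].append(message)
        buffers = next_buffers
        sent_by_node0.append(sent)
    return buffers, sent_by_node0
```

After `dims` steps every node holds exactly the messages addressed to it, one from each source. In this simple scheme each node forwards half of its buffer in every step.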
Cooley-Tukey FFT on the Connection Machine
 In: Parallel Computing. Volume
, 1991
Cited by 3 (0 self)
We describe an implementation of the Cooley-Tukey complex-to-complex FFT on the Connection Machine. The implementation is designed to make effective use of the communications bandwidth of the architecture, its memory bandwidth, and storage with precomputed twiddle factors. The peak data motion rate that is achieved for the interprocessor communication stages is in excess of 7 Gbytes/s for a Connection Machine system CM-200 with 2048 floating-point processors. The peak rate of FFT computations local to a processor is 12.9 Gflops/s in 32-bit precision, and 10.7 Gflops/s in 64-bit precision. The same FFT routine is used to perform both one- and multidimensional FFT without any explicit data rearrangement. The peak performance for a one-dimensional FFT on data distributed over all processors is 5.4 Gflops/s in 32-bit precision and 3.2 Gflops/s in 64-bit precision. The peak performance for square, two-dimensional transforms, is 3.1 Gflops/s in 32-bit precision, and for cubic, three dimensi...
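For readers unfamiliar with the algorithm named in this abstract, here is a minimal serial sketch of a radix-2 Cooley-Tukey complex-to-complex FFT with a precomputed twiddle table. It says nothing about the Connection Machine implementation or its data layout; the function name and the decimation-in-time structure are choices made for the example.

```python
import cmath


def fft_radix2(x):
    """Iterative radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1

    # Reorder the input into bit-reversed index order.
    a = [0j] * n
    for i in range(n):
        j, t = 0, i
        for _ in range(bits):
            j = (j << 1) | (t & 1)
            t >>= 1
        a[j] = complex(x[i])

    # Precompute the twiddle factors exp(-2*pi*i*k/n) once for all stages.
    twiddles = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]

    size = 2
    while size <= n:
        half = size // 2
        stride = n // size  # step through the shared twiddle table
        for start in range(0, n, size):
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * twiddles[k * stride]
                a[start + k] = u + v
                a[start + k + half] = u - v
        size *= 2
    return a
```

The twiddle table is computed once and indexed with a stage-dependent stride, which is one simple way to reuse precomputed factors across all butterfly stages.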