Results 1–9 of 9
Scientific Computing on Bulk Synchronous Parallel Architectures
"... We theoretically and experimentally analyse the efficiency with which a wide range of important scientific computations can be performed on bulk synchronous parallel architectures. ..."
Abstract

Cited by 72 (13 self)
We theoretically and experimentally analyse the efficiency with which a wide range of important scientific computations can be performed on bulk synchronous parallel architectures.
A Two-Dimensional Data Distribution Method for Parallel Sparse Matrix-Vector Multiplication
 SIAM Review
"... A new method is presented for distributing data in sparse matrixvector multiplication. The method is twodimensional, tries to minimise the true communication volume, and also tries to spread the computation and communication work evenly over the processors. The method starts with a recursive bipar ..."
Abstract

Cited by 72 (8 self)
A new method is presented for distributing data in sparse matrix-vector multiplication. The method is two-dimensional, tries to minimise the true communication volume, and also tries to spread the computation and communication work evenly over the processors. The method starts with a recursive bipartitioning of the sparse matrix, each time splitting a rectangular matrix into two parts with a nearly equal number of nonzeros. The communication volume caused by the split is minimised. After the matrix partitioning, the input and output vectors are partitioned with the objective of minimising the maximum communication volume per processor. Experimental results of our implementation, Mondriaan, for a set of sparse test matrices show a reduction in communication compared to one-dimensional methods, and in general a good balance in the communication work.
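The load-balancing half of the recursive bipartitioning step can be sketched as follows. This is a toy illustration only, assuming a matrix given as a list of (row, col) nonzeros and splitting along columns to balance nonzero counts; Mondriaan additionally minimises the communication volume of each split via hypergraph partitioning, which this sketch omits.

```python
# Toy sketch of recursive bipartitioning: repeatedly split the column range
# at the point that best balances the number of nonzeros in the two halves.
# (Mondriaan also minimises communication volume per split; not shown here.)

def bipartition_by_columns(nonzeros, ncols):
    """Split columns [0, ncols) at the point that best balances nonzeros."""
    counts = [0] * ncols
    for _, c in nonzeros:
        counts[c] += 1
    total = sum(counts)
    best_split, best_imbalance, running = 1, total, 0
    for split in range(1, ncols):
        running += counts[split - 1]
        imbalance = abs(2 * running - total)   # |left - right| nonzeros
        if imbalance < best_imbalance:
            best_split, best_imbalance = split, imbalance
    left = [(r, c) for r, c in nonzeros if c < best_split]
    right = [(r, c) for r, c in nonzeros if c >= best_split]
    return left, right, best_split

def recursive_partition(nonzeros, ncols, parts):
    """Recursively bipartition until `parts` (a power of two) blocks remain."""
    if parts == 1:
        return [nonzeros]
    left, right, split = bipartition_by_columns(nonzeros, ncols)
    # Column indices stay global for simplicity.
    return (recursive_partition(left, split, parts // 2)
            + recursive_partition(right, ncols, parts // 2))
```

After the matrix is partitioned this way, the vector entries would still need to be assigned to processors, which is the second phase the abstract describes.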
An Efficient Parallel Algorithm for Matrix-Vector Multiplication
 International Journal of High Speed Computing
, 1995
"... . The multiplication of a vector by a matrix is the kernel operation in many algorithms used in scientific computation. A fast and efficient parallel algorithm for this calculation is therefore desirable. This paper describes a parallel matrixvector multiplication algorithm which is particularly ..."
Abstract

Cited by 38 (4 self)
The multiplication of a vector by a matrix is the kernel operation in many algorithms used in scientific computation. A fast and efficient parallel algorithm for this calculation is therefore desirable. This paper describes a parallel matrix-vector multiplication algorithm which is particularly well suited to dense matrices or matrices with an irregular sparsity pattern. Such matrices can arise from discretizing partial differential equations on irregular grids or from problems exhibiting nearly random connectivity between data structures. The communication cost of the algorithm is independent of the matrix sparsity pattern and is shown to scale as O(n/√p + log(p)) for an n × n matrix on p processors. The algorithm's performance is demonstrated by using it within the well-known NAS conjugate gradient benchmark. This resulted in the fastest run times achieved to date on both the 1024-node nCUBE 2 and the 128-node Intel iPSC/860. Additional improvements to the algorithm which ...
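The stated O(n/√p + log(p)) bound can be illustrated with a back-of-the-envelope cost model (constants here are illustrative, not taken from the paper): on a √p × √p processor grid, each processor exchanges about n/√p vector elements, plus an O(log p) term for a tree reduction.

```python
import math

# Illustrative communication cost per processor, up to constant factors:
# the n/sqrt(p) term is the vector data exchanged along a row/column of a
# sqrt(p) x sqrt(p) grid; the log2(p) term models the tree reduction.

def comm_cost(n, p):
    """Words communicated per processor (illustrative, unit constants)."""
    return n / math.sqrt(p) + math.log2(p)
```

For example, `comm_cost(10**6, 64)` gives 125006.0 while `comm_cost(10**6, 256)` gives 62508.0: quadrupling the processor count roughly halves the volume term, independent of where the matrix's nonzeros sit.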
A Parallel GMRES Version For General Sparse Matrices
 Electronic Transactions on Numerical Analysis
, 1995
"... . This paper describes the implementation of a parallel variant of GMRES on Paragon. This variant builds an orthonormal Krylov basis in two steps: it first computes a Newton basis then orthogonalises it. The first step requires matrixvector products with a general sparse unsymmetric matrix and the ..."
Abstract

Cited by 13 (0 self)
This paper describes the implementation of a parallel variant of GMRES on the Paragon. This variant builds an orthonormal Krylov basis in two steps: it first computes a Newton basis, then orthogonalises it. The first step requires matrix-vector products with a general sparse unsymmetric matrix and the second step is a QR factorisation of a rectangular matrix with few long vectors. The algorithm has been implemented for a distributed memory parallel computer. The distributed sparse matrix-vector product avoids global communications thanks to the initial setup of the communication pattern. The QR factorisation is distributed by using Givens rotations which require only local communications. Results on an Intel Paragon show the efficiency and the scalability of our algorithm. Key words. GMRES, parallelism, sparse matrix, Newton basis. AMS subject classifications. 65F10, 65F25, 65F50. 1. Introduction. Many scientific applications make use of sparse linear algebra. Because they are quite time ...
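The two-step basis construction can be sketched on a small dense example (the paper's setting is a distributed sparse matrix, and it uses a distributed Givens QR rather than the Gram-Schmidt shown here). The shift values would normally be Ritz value estimates; with zero shifts, as in this sketch's test case, the Newton basis degenerates to the power basis.

```python
# Sketch of building a Newton basis v_{k+1} = (A - shift_k I) v_k and then
# orthonormalising it in a separate pass. Dense matvec and classical
# Gram-Schmidt stand in for the paper's distributed sparse matvec and
# Givens-rotation QR.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def newton_basis(A, v0, shifts):
    """Unnormalised Newton basis vectors: v_{k+1} = (A - shift_k I) v_k."""
    basis = [v0]
    for s in shifts:
        w = matvec(A, basis[-1])
        basis.append([wi - s * vi for wi, vi in zip(w, basis[-1])])
    return basis

def orthonormalise(vectors):
    """Classical Gram-Schmidt QR (Q factor only), second step of the method."""
    q = []
    for v in vectors:
        w = v[:]
        for u in q:
            dot = sum(ui * wi for ui, wi in zip(u, w))
            w = [wi - dot * ui for wi, ui in zip(w, u)]
        norm = sum(wi * wi for wi in w) ** 0.5
        q.append([wi / norm for wi in w])
    return q
```

Separating basis generation from orthogonalisation is what removes the global synchronisation from the inner loop: the matrix-vector products need only neighbour communication, and the QR is done afterwards on a tall, skinny matrix.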
Performance Analysis of the IQMR Method on Bulk Synchronous Parallel Architectures
, 1997
"... For the solutions of unsymmetric linear systems of equations, we have proposed an improved version of the quasiminimal residual (IQMR) method [21] by using the Lanczos process as a major component combining elements of numerical stability and parallel algorithm design. For Lanczos process, stabilit ..."
Abstract

Cited by 3 (2 self)
For the solution of unsymmetric linear systems of equations, we have proposed an improved version of the quasi-minimal residual (IQMR) method [21] by using the Lanczos process as a major component, combining elements of numerical stability and parallel algorithm design. For the Lanczos process, stability is obtained by a coupled two-term procedure that generates Lanczos vectors scaled to unit length. The algorithm is derived such that all inner products and matrix-vector multiplications of a single iteration step are independent, and the communication time required for inner products can be overlapped efficiently with computation time. In this paper, we use the Bulk Synchronous Parallel (BSP) model to design a fully efficient, scalable and portable parallel IQMR algorithm and to provide accurate performance prediction of the algorithm for a wide range of architectures including the Cray T3D, the Parsytec GC/PowerPlus, and a cluster of workstations connected by an Ethernet. This performance model ...
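The restructuring idea behind the overlap can be sketched as follows: because the iteration's inner products are independent, each processor computes all of its partial sums locally and a single combined reduction replaces several separate ones, leaving latency that can be hidden behind the matrix-vector product. The reduction is simulated here over Python lists; the names are illustrative, not from the paper.

```python
# Two independent inner products (u,v) and (u,w) per iteration: compute the
# local partial sums on each "processor", then combine both in ONE simulated
# global reduction instead of two separate synchronisations.

def partial_inner_products(u_loc, v_loc, w_loc):
    """Per-processor partial sums for the independent products (u,v), (u,w)."""
    return (sum(a * b for a, b in zip(u_loc, v_loc)),
            sum(a * b for a, b in zip(u_loc, w_loc)))

def combined_allreduce(partials):
    """One combined reduction delivering both inner products at once."""
    return tuple(sum(p[i] for p in partials) for i in range(len(partials[0])))
```

In a real implementation the single reduction would be a non-blocking collective issued before the local part of the matrix-vector product, which is what makes the overlap with computation possible.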
Highly Scalable Parallel LinearlyImplicit Extrapolation Algorithms
, 1996
"... We present parallel formulations of the well established extrapolation algorithms EULSIM and LIMEX and its implementation on a distributed memory architecture. The discretization of partial differential equations by the method of lines yields large banded systems, which can be efficiently solved ..."
Abstract
We present parallel formulations of the well-established extrapolation algorithms EULSIM and LIMEX and their implementation on a distributed memory architecture. The discretization of partial differential equations by the method of lines yields large banded systems, which can be efficiently solved in parallel only by iterative methods. Polynomial preconditioning with a Neumann series expansion combined with an overlapping domain decomposition appears as a very efficient, robust and highly scalable preconditioner for different iterative solvers. A further advantage of this preconditioner is that all computation can be restricted to the overlap region as long as the subdomain problems are solved exactly. With this approach the iterative algorithms operate on very short vectors, whose length depends only on the number of gridpoints in the overlap region and the number of processors, but not on the size of the linear system. As the most reliable and fast iterative me...
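The Neumann-series idea can be sketched in isolation (a minimal illustration, without the overlapping domain decomposition): writing the scaled system as A = I − B with a convergent B, the inverse is approximated by the truncated series I + B + B² + … + Bᵐ and applied through repeated matrix-vector products, which is what makes the preconditioner attractive in parallel since no factorisation is needed.

```python
# Apply a truncated Neumann series as a polynomial preconditioner:
# (I - B)^{-1} r  ≈  sum_{k=0}^{m} B^k r, built from m matvecs with B.

def matvec(B, x):
    return [sum(b * xi for b, xi in zip(row, x)) for row in B]

def neumann_apply(B, r, m):
    """Approximate (I - B)^{-1} r by the degree-m truncated Neumann series."""
    y = r[:]
    term = r[:]
    for _ in range(m):
        term = matvec(B, term)               # next power of B applied to r
        y = [yi + ti for yi, ti in zip(y, term)]
    return y
```

Convergence requires the spectral radius of B to be below one, which is why the series is applied to a suitably scaled (e.g. Jacobi-preconditioned) operator.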
On analysis of partitioning models and metrics in parallel sparse matrix-vector multiplication
, 2013
"... ..."
FR +E N
"... On analysis of partitioning models and metrics in parallel sparse matrixvector multiplication ..."
Abstract
 Add to MetaCart
On analysis of partitioning models and metrics in parallel sparse matrixvector multiplication