Results 1–10 of 15
Highly scalable parallel algorithms for sparse matrix factorization
 IEEE Transactions on Parallel and Distributed Systems
, 1994
"... In this paper, we describe a scalable parallel algorithm for sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algo ..."
Abstract

Cited by 116 (29 self)
 Add to MetaCart
In this paper, we describe a scalable parallel algorithm for sparse matrix factorization, analyze its performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithm substantially improves the state of the art in parallel direct solution of sparse linear systems, both in terms of scalability and overall performance. It is a well-known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithm to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that is asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithm incurs less communication overhead and is more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of our sparse Cholesky factorization algorithm delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge, ...
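The kernel being parallelized here is Cholesky factorization, A = LLᵀ. As a point of reference, a minimal sequential column-Cholesky sketch in plain Python (illustrative only; the paper's contribution is the parallel sparse formulation, which this does not attempt to reproduce):

```python
import math

def cholesky(A):
    """Column Cholesky of a small SPD matrix: A = L L^T, L lower triangular."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # diagonal entry: subtract squares of already-computed row entries
        L[j][j] = math.sqrt(A[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
L = cholesky(A)
# reconstruct L L^T to check the factorization
recon = [[sum(L[i][k] * L[j][k] for k in range(3)) for j in range(3)]
         for i in range(3)]
```

In the sparse setting the same arithmetic applies, but only structurally nonzero entries are stored and updated, which is what makes the parallel formulation hard.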
SPOOLES: An Object-Oriented Sparse Matrix Library
 In Proceedings of the 9th SIAM Conference on Parallel Processing for Scientific Computing
, 1999
"... ction and multisection. The latter two orderings depend on a domain/separator tree that is constructed using a graph partitioning method. Domain decomposition is used to find an initial separator, and a sequence of network flow problems are solved to smooth the separator. The qualities of our nested ..."
Abstract

Cited by 35 (0 self)
 Add to MetaCart
... ction and multisection. The latter two orderings depend on a domain/separator tree that is constructed using a graph partitioning method. Domain decomposition is used to find an initial separator, and a sequence of network flow problems is solved to smooth the separator. The quality of our nested dissection and multisection orderings is comparable to other state-of-the-art packages. Factorizations of square matrices have the form A = PLDUQ and A = PLDL^T P^T, where P and Q are permutation matrices. Square systems of the form A + σB may also be factored and solved (as found in shift-and-invert eigensolvers), as well as full-rank overdetermined linear systems, where a QR factorization is computed and the solution found by solving the semi-normal equations.
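A minimal sketch of the unpivoted LDLᵀ kernel underlying factorizations of the form A = PLDLᵀPᵀ (taking P = I here for simplicity; SPOOLES itself handles permutations and pivoting, which this sketch omits):

```python
def ldlt(A):
    """Unpivoted LDL^T of a small SPD matrix: A = L D L^T, L unit lower."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for j in range(n):
        # d_j = a_jj - sum_k l_jk^2 d_k over previously eliminated columns k
        D[j] = A[j][j] - sum(L[j][k] ** 2 * D[k] for k in range(j))
        for i in range(j + 1, n):
            L[i][j] = (A[i][j]
                       - sum(L[i][k] * L[j][k] * D[k] for k in range(j))) / D[j]
    return L, D

A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
L, D = ldlt(A)
```

Keeping D separate avoids square roots, which is why LDLᵀ variants extend naturally to the symmetric indefinite case (with pivoting added).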
Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms
"... One can use extra memory to parallelize matrix multiplication by storing p 1/3 redundant copies of the input matrices on p processors in order to do asymptotically less communication than Cannon’s algorithm [2], and be faster in practice [1]. We call this algorithm “3D ” because it arranges the p pr ..."
Abstract

Cited by 23 (16 self)
 Add to MetaCart
One can use extra memory to parallelize matrix multiplication by storing p^(1/3) redundant copies of the input matrices on p processors in order to do asymptotically less communication than Cannon's algorithm [2], and be faster in practice [1]. We call this algorithm "3D" because it arranges the p processors in a 3D array, and Cannon's algorithm "2D" because it stores a single copy of the matrices on a 2D array of processors. We generalize these 2D and 3D algorithms by introducing a new class of "2.5D algorithms". For matrix multiplication, we can take advantage of any amount of extra memory to store c copies of the data, for any c ∈ {1, 2, ..., ⌊p^(1/3)⌋}, to reduce the bandwidth cost of Cannon's algorithm by a factor of c^(1/2) and the latency cost by a factor of c^(3/2). We also show that these costs reach the lower bounds [13, 3], modulo polylog(p) factors. We similarly generalize LU decomposition to 2.5D and 3D, including communication-avoiding pivoting, a stable alternative to partial pivoting [7]. We prove a novel lower bound on the latency cost of 2.5D and 3D LU factorization, showing that while c copies of the data can also reduce the bandwidth by a factor of c^(1/2), the latency must increase by a factor of c^(1/2), so that the 2D LU algorithm (c = 1) in fact minimizes latency. Preliminary results of 2.5D matrix multiplication on a Cray XT4 machine also demonstrate a performance gain of up to 3X with respect to Cannon's algorithm. Careful choice of c also yields up to a 2.4X speedup over 3D matrix multiplication, due to a better balance between communication costs.
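The bandwidth/latency trade-off stated above can be checked numerically. A sketch using the asymptotic 2.5D matrix-multiplication cost expressions with constants dropped (an illustration of the scaling, not measured costs):

```python
# Asymptotic per-processor costs of 2.5D matrix multiplication, constants
# dropped: words moved ~ n^2 / sqrt(c*p), message count ~ sqrt(p / c^3).
def words(p, c, n):
    return n * n / (c * p) ** 0.5      # bandwidth term

def messages(p, c):
    return (p / c ** 3) ** 0.5         # latency term

p, n = 64, 4096      # c may range over {1, ..., floor(p**(1/3))} = {1, ..., 4}
w1, w4 = words(p, 1, n), words(p, 4, n)
m1, m4 = messages(p, 1), messages(p, 4)
# c = 4 cuts the bandwidth term by 4**0.5 = 2x and the latency term by
# 4**1.5 = 8x relative to the 2D (c = 1) algorithm.
```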
PaStiX: A Parallel Sparse Direct Solver Based on a Static Scheduling for Mixed 1D/2D Block Distributions
 In Proceedings of Irregular'2000, Cancún, Mexico, number 1800 in Lecture Notes in Computer Science
, 2000
"... We present and analyze a general algorithm which computes an ecient static scheduling of block computations for a parallel L:D:L t factorization of sparse symmetric positive denite systems based on a combination of 1D and 2D block distributions. Our solver uses a supernodal fanin approach and ..."
Abstract

Cited by 12 (4 self)
 Add to MetaCart
We present and analyze a general algorithm which computes an efficient static scheduling of block computations for a parallel LDL^T factorization of sparse symmetric positive definite systems, based on a combination of 1D and 2D block distributions. Our solver uses a supernodal fan-in approach and is fully driven by this scheduling. We give an overview of the algorithm and present performance results and comparisons with PSPASES on an IBM SP2 with 120 MHz Power2SC nodes for a collection of irregular problems. This work is supported by the Commissariat à l'Énergie Atomique CEA/CESTA under contract No. 7V1555AC, and by the GDR ARP (iHPerf group) of the CNRS.

1 Introduction

Solving large sparse symmetric positive definite systems Ax = b of linear equations is a crucial and time-consuming step arising in many scientific and engineering applications. Consequently, many parallel formulations for sparse matrix factorization have been studied and implemented; one can refer t...
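A static scheduling of block computations of the kind described can be sketched as greedy list scheduling by earliest finish time. The task DAG, costs, and tie-breaking below are hypothetical, not PaStiX's actual mapping algorithm:

```python
# Hypothetical sketch of static list scheduling of block tasks: every task
# is mapped to a processor before any numerical work starts.
def schedule(tasks, deps, nproc):
    """tasks: {name: cost}, listed in topological order; deps: {name: preds}."""
    finish, placed = {}, {}
    proc_free = [0.0] * nproc          # next free instant of each processor
    for t, cost in tasks.items():
        # a task is ready once all of its predecessors have finished
        ready = max((finish[d] for d in deps.get(t, ())), default=0.0)
        # place on the processor giving the earliest start time
        p = min(range(nproc), key=lambda q: max(proc_free[q], ready))
        start = max(proc_free[p], ready)
        finish[t], placed[t] = start + cost, p
        proc_free[p] = finish[t]
    return placed, finish

tasks = {"A": 2.0, "B": 2.0, "C": 1.0}   # C: parent block updated by A and B
deps = {"C": ("A", "B")}
placed, finish = schedule(tasks, deps, 2)
```

With two processors the independent tasks A and B run concurrently and C starts as soon as both finish, giving a makespan of 3 instead of the sequential 5.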
A Mapping and Scheduling Algorithm for Parallel Sparse Fan-In Numerical Factorization
 In Euro-Par'99 Parallel Processing, Lecture Notes in Computer Science
, 2000
"... We present and analyze a general algorithm which computes ecient static schedulings of block computations for parallel sparse linear factorization. Our solver, based on a supernodal fanin approach, is fully driven by this scheduling. We give an overview of the algorithms and present performance ..."
Abstract

Cited by 11 (5 self)
 Add to MetaCart
We present and analyze a general algorithm which computes efficient static schedulings of block computations for parallel sparse linear factorization. Our solver, based on a supernodal fan-in approach, is fully driven by this scheduling. We give an overview of the algorithms and present performance results on a 16-node IBM SP2 with 66 MHz Power2 thin nodes for a collection of grid and irregular problems. This work is supported by the Commissariat à l'Énergie Atomique CEA/CESTA under contract No. 7V1555AC, and by the GDR ARP (iHPerf group) of the CNRS.

1 Introduction

Solving large sparse symmetric positive definite systems Ax = b of linear equations is a crucial and time-consuming step arising in many scientific and engineering applications. Consequently, many parallel formulations for sparse matrix factorization have been studied and implemented; one can refer to [6] for a complete survey on high performance sparse factorization. In this paper, we focus on the block par...
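The point of a fan-in approach is that updates destined for the same remote target are aggregated locally before being sent. A toy message-count comparison (the column-to-processor mapping below is made up, not the paper's scheme):

```python
# Toy fan-out vs fan-in message counts. owner[j] = processor owning column
# j; contribs[t] = columns whose updates target column t. Hypothetical data.
def fanout_msgs(contribs, owner):
    # fan-out: every off-processor contribution is sent as its own message
    return sum(1 for t, srcs in contribs.items()
               for s in srcs if owner[s] != owner[t])

def fanin_msgs(contribs, owner):
    # fan-in: one aggregated message per (remote processor, target column)
    return sum(len({owner[s] for s in srcs} - {owner[t]})
               for t, srcs in contribs.items())

owner = {0: 0, 1: 0, 2: 1, 3: 1, 4: 0}
contribs = {4: [0, 1, 2, 3]}   # columns 0-3 all update column 4
```

Here processor 1 owns two of the contributing columns, so fan-in sends one combined update where fan-out would send two.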
Analysis and Design of Scalable Parallel Algorithms for Scientific Computing
, 1995
"... This dissertation presents a methodology for understanding the performance and scalability of algorithms on parallel computers and the scalability analysis of a variety of numerical algorithms. We demonstrate the analytical power of this technique and show how it can guide the development of better ..."
Abstract

Cited by 8 (5 self)
 Add to MetaCart
This dissertation presents a methodology for understanding the performance and scalability of algorithms on parallel computers, together with the scalability analysis of a variety of numerical algorithms. We demonstrate the analytical power of this technique and show how it can guide the development of better parallel algorithms. We present new, highly scalable parallel algorithms for sparse matrix computations, which were widely considered ill-suited to large-scale parallel computers. We present some laws governing the performance and scalability properties that apply to all parallel systems. We show that our results generalize or extend a range of earlier research results concerning the performance of parallel systems. Our scalability analysis of algorithms such as the fast Fourier transform (FFT), dense matrix multiplication, sparse matrix-vector multiplication, and the preconditioned conjugate gradient (PCG) method provides many interesting insights into their behavior on parallel computer...
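The style of scalability analysis described can be illustrated with a toy efficiency model: if W is the serial work and To(W, p) the total overhead on p processors, then efficiency is E = W / (W + To). The overhead function below is a hypothetical example, not one of the dissertation's results:

```python
import math

def efficiency(W, To):
    # parallel efficiency: useful work over useful work plus total overhead
    return W / (W + To)

def E(W, p):
    # a hypothetical algorithm whose total overhead grows as p * log2(p)
    return efficiency(W, p * math.log2(p))

e4, e16 = E(1024, 4), E(1024, 16)
# growing p at fixed W lowers E; growing W with p restores it, and the
# required growth rate of W is the algorithm's isoefficiency function
```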
Parallel Direct Methods For Sparse Linear Systems
, 1997
"... We present an overview of parallel direct methods for solving sparse systems of linear equations, focusing on symmetric positive definite systems. We examine the performance implications of the important differences between dense and sparse systems. Our main emphasis is on parallel implementation of ..."
Abstract

Cited by 6 (0 self)
 Add to MetaCart
We present an overview of parallel direct methods for solving sparse systems of linear equations, focusing on symmetric positive definite systems. We examine the performance implications of the important differences between dense and sparse systems. Our main emphasis is on parallel implementation of the numerically intensive factorization process, but we also briefly consider the other major components of direct methods, such as parallel ordering.

Introduction

In this paper we present a brief overview of parallel direct methods for solving sparse linear systems. Paradoxically, sparse matrix factorization offers additional opportunities for exploiting parallelism beyond those available with dense matrices, yet it is often more difficult to attain good efficiency in the sparse case. We examine both sides of this paradox: the additional parallelism induced by sparsity, and the difficulty in achieving high efficiency in spite of it. We focus on Cholesky factorization, primarily because th...
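One of the sparse-specific components mentioned, parallel ordering, matters because the elimination order controls fill. A symbolic-elimination sketch on the classic "arrow" matrix example (an assumed illustration, not taken from the paper):

```python
# Symbolic elimination on the "arrow" matrix: vertex 0 is coupled to every
# other vertex. Eliminating 0 first turns its neighbors into a clique
# (maximal fill); eliminating it last creates no fill at all.
def fill(adj, order):
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    new_edges = 0
    for v in order:
        nbrs = [u for u in adj[v] if u in adj]    # not-yet-eliminated neighbors
        # eliminating v makes its remaining neighbors pairwise adjacent
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                a, b = nbrs[i], nbrs[j]
                if b not in adj[a]:               # a fill edge is created
                    adj[a].add(b)
                    adj[b].add(a)
                    new_edges += 1
        del adj[v]
    return new_edges

n = 5
arrow = {0: set(range(1, n)), **{i: {0} for i in range( 1, n)}}
worst = fill(arrow, [0, 1, 2, 3, 4])   # hub eliminated first
best = fill(arrow, [1, 2, 3, 4, 0])    # hub eliminated last
```

Fill-reducing orderings (minimum degree, nested dissection) automate exactly this choice on large irregular graphs.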
Towards an Accurate Performance Modeling of Parallel Sparse LU Factorization
 In Applicable Algebra in Engineering, Communication, and Computing
, 2006
"... We present a simulationbased performance model to analyze a parallel sparse LU factorization algorithm on modern cachedbased, highend parallel architectures. We consider supernodal rightlooking parallel factorization on a bidimensional grid of processors, that uses static pivoting. Our model ch ..."
Abstract

Cited by 4 (1 self)
 Add to MetaCart
We present a simulation-based performance model to analyze a parallel sparse LU factorization algorithm on modern cache-based, high-end parallel architectures. We consider supernodal right-looking parallel factorization on a two-dimensional grid of processors that uses static pivoting. Our model characterizes the algorithmic behavior by taking into account the underlying processor speed, memory system performance, and interconnect speed. The model is validated using the implementation in the SuperLU_DIST linear system solver, sparse matrices from real applications, and an IBM POWER3 parallel machine. Our modeling methodology can be adapted to study the performance of other types of sparse factorizations, such as Cholesky or QR, and on different parallel machines.
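A performance model of the general shape described can be sketched as a three-term computation/bandwidth/latency cost. All machine parameters below are hypothetical placeholders, not the paper's measured values:

```python
# Three-term cost model: computation + data volume + message count.
#   T = flops/flop_rate + words/bandwidth + messages * latency
def predict_time(flops, words, msgs, flop_rate, bandwidth, latency):
    return flops / flop_rate + words / bandwidth + msgs * latency

t = predict_time(flops=2e9, words=1e8, msgs=1e4,
                 flop_rate=1e9,      # hypothetical: 1 GFlop/s per node
                 bandwidth=1e8,      # hypothetical: 1e8 words/s
                 latency=1e-5)       # hypothetical: 10 us per message
# 2.0 s compute + 1.0 s bandwidth + 0.1 s latency = 3.1 s predicted
```

The paper's simulation-based model refines each term with cache and memory-system effects; a closed-form model like this one only captures the first-order breakdown.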