Results 1 - 10
of
15
Highly scalable parallel algorithms for sparse matrix factorization
- IEEE Transactions on Parallel and Distributed Systems
, 1994
"... In this paper, we describe a scalable parallel algorithm for sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algo ..."
Abstract
-
Cited by 100 (29 self)
- Add to MetaCart
In this paper, we describe a scalable parallel algorithm for sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithm substantially improves the state of the art in parallel direct solution of sparse linear systems—both in terms of scalability and overall performance. It is a well known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithm to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that is asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithm incurs less communication overhead and is more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of our sparse Cholesky factorization algorithm delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge,
SPOOLES: An Object-Oriented Sparse Matrix Library
- In Proceedings of the 9th SIAM Conference on Parallel Processing for Scientific Computing
, 1999
"... ction and multisection. The latter two orderings depend on a domain/separator tree that is constructed using a graph partitioning method. Domain decomposition is used to find an initial separator, and a sequence of network flow problems are solved to smooth the separator. The qualities of our nested ..."
Abstract
-
Cited by 30 (0 self)
- Add to MetaCart
ction and multisection. The latter two orderings depend on a domain/separator tree that is constructed using a graph partitioning method. Domain decomposition is used to find an initial separator, and a sequence of network flow problems are solved to smooth the separator. The qualities of our nested dissection and multisection orderings are comparable to other state of the art packages. Factorizations of square matrices have the form A = PLDUQ and A = PLDL T P T , where P and Q are permutation matrices. Square systems of the form A + #B may also be factored and solved (as found in shift-and-invert eigensolvers), as well as full rank overdetermined linear systems, where a QR factorization is computed and the solution found by solving the semi-normal equations. # This research was supported in part by the
A Mapping and Scheduling Algorithm for Parallel Sparse Fan-In Numerical Factorization
- In EuroPar'99 Parallel Processing, Lecture Notes in Computer Science
, 2000
"... We present and analyze a general algorithm which computes ecient static schedulings of block computations for parallel sparse linear factorization. Our solver, based on a supernodal fan-in approach, is fully driven by this scheduling. We give an overview of the algorithms and present performance ..."
Abstract
-
Cited by 11 (5 self)
- Add to MetaCart
We present and analyze a general algorithm which computes ecient static schedulings of block computations for parallel sparse linear factorization. Our solver, based on a supernodal fan-in approach, is fully driven by this scheduling. We give an overview of the algorithms and present performance results on a 16-node IBM-SP2 with 66 MHz Power2 thin nodes for a collection of grid and irregular problems. This work is supported by the Commissariat a l' Energie Atomique CEA/CESTA under contract No. 7V1555AC, and by the GDR ARP (iHPerf group) of the CNRS. 1 1 Introduction Solving large sparse symmetric positive denite systems Ax = b of linear equations is a crucial and time-consuming step, arising in many scientic and engineering applications. Consequently, many parallel formulations for sparse matrix factorization have been studied and implemented; one can refer to [6] for a complete survey on high performance sparse factorization. In this paper, we focus on the block par...
permission. MINIMIZING COMMUNICATION IN NUMERICAL LINEAR ALGEBRA
"... personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires pri ..."
Abstract
-
Cited by 11 (8 self)
- Add to MetaCart
personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific
PaStiX: A Parallel Sparse Direct Solver Based on a Static Scheduling for Mixed 1D/2D Block Distributions
- In Proceedings of Irregular'2000, Cancun, Mexique, number 1800 in Lecture Notes in Computer Science
, 2000
"... We present and analyze a general algorithm which computes an ecient static scheduling of block computations for a parallel L:D:L t factorization of sparse symmetric positive denite systems based on a combination of 1D and 2D block distributions. Our solver uses a supernodal fan-in approach and ..."
Abstract
-
Cited by 10 (4 self)
- Add to MetaCart
We present and analyze a general algorithm which computes an ecient static scheduling of block computations for a parallel L:D:L t factorization of sparse symmetric positive denite systems based on a combination of 1D and 2D block distributions. Our solver uses a supernodal fan-in approach and is fully driven by this scheduling. We give an overview of the algorithm and present performance results and comparisons with PSPASES on an IBM-SP2 with 120 MHz Power2SC nodes for a collection of irregular problems. This work is supported by the Commissariat a l' Energie Atomique CEA/CESTA under contract No. 7V1555AC, and by the GDR ARP (iHPerf group) of the CNRS. 1 1 Introduction Solving large sparse symmetric positive denite systems Ax = b of linear equations is a crucial and time-consuming step, arising in many scientic and engineering applications. Consequently, many parallel formulations for sparse matrix factorization have been studied and implemented; one can refer t...
Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms
"... Abstract. One can use extra memory to parallelize matrix multiplication by storing p 1/3 redundant copies of the input matrices on p processors in order to do asymptotically less communication than Cannon’s algorithm [2], and be faster in practice [1]. We call this algorithm “3D ” because it arrange ..."
Abstract
-
Cited by 8 (7 self)
- Add to MetaCart
Abstract. One can use extra memory to parallelize matrix multiplication by storing p 1/3 redundant copies of the input matrices on p processors in order to do asymptotically less communication than Cannon’s algorithm [2], and be faster in practice [1]. We call this algorithm “3D ” because it arranges the p processors in a 3D array, and Cannon’s algorithm “2D ” because it stores a single copy of the matrices on a 2D array of processors. We generalize these 2D and 3D algorithms by introducing a new class of “2.5D algorithms”. For matrix multiplication, we can take advantage of any amount of extra memory to store c copies of the data, for any c ∈{1, 2,..., ⌊p 1/3 ⌋}, to reduce the bandwidth cost of Cannon’s algorithm by a factor of c 1/2 and the latency cost by a factor c 3/2. We also show that these costs reach the lower bounds [13, 3], modulo polylog(p) factors. We similarly generalize LU decomposition to 2.5D and 3D, including communication-avoiding pivoting, a stable alternative to partial-pivoting [7]. We prove a novel lower bound on the latency cost of 2.5D and 3D LU factorization, showing that while c copies of the data can also reduce the bandwidth by a factor of c 1/2, the latency must increase by a factor of c 1/2, so that the 2D LU algorithm (c = 1) in fact minimizes latency. Preliminary results of 2.5D matrix multiplication on a Cray XT4 machine also demonstrate a performance gain of up to 3X with respect to Cannon’s algorithm. Careful choice of c also yields up to a 2.4X speed-up over 3D matrix multiplication, due to a better balance between communication costs. 1
Analysis and Design of Scalable Parallel Algorithms for Scientific Computing
, 1995
"... This dissertation presents a methodology for understanding the performance and scalability of algorithms on parallel computers and the scalability analysis of a variety of numerical algorithms. We demonstrate the analytical power of this technique and show how it can guide the development of better ..."
Abstract
-
Cited by 7 (4 self)
- Add to MetaCart
This dissertation presents a methodology for understanding the performance and scalability of algorithms on parallel computers and the scalability analysis of a variety of numerical algorithms. We demonstrate the analytical power of this technique and show how it can guide the development of better parallel algorithms. We present some new highly scalable parallel algorithms for sparse matrix computations that were widely considered to be poorly suitable for large scale parallel computers. We present some laws governing the performance and scalability properties that apply to all parallel systems. We show that our results generalize or extend a range of earlier research results concerning the performance of parallel systems. Our scalability analysis of algorithms such as fast Fourier transform (FFT), dense matrix multiplication, sparse matrix-vector multiplication, and the preconditioned conjugate gradient (PCG) provides many interesting insights into their behavior on parallel computer...
Parallel Direct Methods For Sparse Linear Systems
, 1997
"... We present an overview of parallel direct methods for solving sparse systems of linear equations, focusing on symmetric positive definite systems. We examine the performance implications of the important differences between dense and sparse systems. Our main emphasis is on parallel implementation of ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
We present an overview of parallel direct methods for solving sparse systems of linear equations, focusing on symmetric positive definite systems. We examine the performance implications of the important differences between dense and sparse systems. Our main emphasis is on parallel implementation of the numerically intensive factorization process, but we also briefly consider the other major components of direct methods, such as parallel ordering. Introduction In this paper we present a brief overview of parallel direct methods for solving sparse linear systems. Paradoxically, sparse matrix factorization offers additional opportunities for exploiting parallelism beyond those available with dense matrices, yet it is often more difficult to attain good efficiency in the sparse case. We examine both sides of this paradox: the additional parallelism induced by sparsity, and the difficulty in achieving high efficiency in spite of it. We focus on Cholesky factorization, primarily because th...
Towards an Accurate Performance Modeling of Parallel Sparse
- LU Factorization, in "Applicable Algebra in Engineering, Communication, and Computing
, 2006
"... We present a simulation-based performance model to analyze a parallel sparse LU factorization algorithm on modern cached-based, high-end parallel architectures. We consider supernodal right-looking parallel factorization on a bi-dimensional grid of processors, that uses static pivoting. Our model ch ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
We present a simulation-based performance model to analyze a parallel sparse LU factorization algorithm on modern cached-based, high-end parallel architectures. We consider supernodal right-looking parallel factorization on a bi-dimensional grid of processors, that uses static pivoting. Our model characterizes the algorithmic behavior by taking into account the underlying processor speed, memory system performance, as well as the interconnect speed. The model is validated using the implementation in the SuperLU DIST linear system solver, the sparse matrices from real application, and an IBM POWER3 parallel machine. Our modeling methodology can be adapted to study performance of other types of sparse factorizations, such as Cholesky or QR, and on different parallel machines. 1

