Results 1 - 7 of 7
Elimination Forest Guided 2D Sparse LU Factorization
Abstract

Cited by 12 (7 self)
Sparse LU factorization with partial pivoting is important for many scientific applications and delivering high performance for this problem is difficult on distributed memory machines. Our previous work has developed an approach called S* that incorporates static symbolic factorization, supernode partitioning and graph scheduling. This paper studies the properties of elimination forests and uses them to guide supernode partitioning/amalgamation and execution scheduling. The new design with 2D mapping effectively identifies dense structures without introducing too many zeros in the BLAS computation and exploits asynchronous parallelism with low buffer space cost. The implementation of this code, called S+, uses supernodal matrix multiplication which retains the BLAS3 level efficiency and avoids unnecessary arithmetic operations. The experiments show that S+ improves our previous code substantially and can achieve up to 11.04 GFLOPS on 128 Cray T3E 450MHz nodes, which is the highest performance reported in the literature.
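The elimination forests that guide this partitioning are derived from the matrix's sparsity pattern. As a minimal illustrative sketch (not the paper's implementation), the classic path-compression algorithm computes the elimination tree of a symmetric pattern; the `cols` input layout and all names here are assumptions for illustration:

```python
def elimination_tree(cols):
    """Compute the elimination tree of a symmetric sparsity pattern.

    cols: dict mapping column j -> list of row indices i < j with A[i, j] != 0.
    Returns parent[], where parent[k] is k's parent in the tree (-1 for a root);
    disconnected components yield multiple roots, i.e. a forest.
    """
    n = len(cols)
    parent = [-1] * n
    ancestor = [-1] * n  # path-compressed "virtual ancestor" links
    for j in range(n):
        for i in cols[j]:
            # Climb from i toward the current root, compressing the path to j.
            r = i
            while ancestor[r] != -1 and ancestor[r] != j:
                nxt = ancestor[r]
                ancestor[r] = j
                r = nxt
            if ancestor[r] == -1:
                ancestor[r] = j
                parent[r] = j
    return parent
```

Supernodes are then formed by grouping a node with its parent when their column structures (nearly) coincide, which is what amalgamation heuristics decide.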
S+: Efficient 2D sparse LU factorization on parallel machines
 SIAM J. Matrix Anal. Appl.
, 2001
Abstract

Cited by 6 (2 self)
Abstract. Static symbolic factorization coupled with supernode partitioning and asynchronous computation scheduling can achieve high gigaflop rates for parallel sparse LU factorization with partial pivoting. This paper studies properties of elimination forests and uses them to optimize supernode partitioning/amalgamation and execution scheduling. It also proposes supernodal matrix multiplication to speed up kernel computation by retaining the BLAS3 level efficiency and avoiding unnecessary arithmetic operations. The experiments show that our new design with proper space optimization, called S+, improves our previous solution substantially and can achieve up to 10 GFLOPS on 128 Cray T3E 450MHz nodes.

Key words. Gaussian elimination with partial pivoting, LU factorization, sparse matrices, elimination forests, supernode amalgamation and partitioning, asynchronous computation scheduling

AMS subject classifications. 65F50, 65F05

PII. S0895479898337385
Efficient Sparse LU Factorization with Lazy Space Allocation
 IN PROCEEDINGS OF THE NINTH SIAM CONFERENCE ON PARALLEL PROCESSING FOR SCIENTIFIC COMPUTING
, 1999
Abstract

Cited by 5 (3 self)
Static symbolic factorization coupled with 2D supernode partitioning and asynchronous computation scheduling is a viable approach for sparse LU with dynamic partial pivoting. Our previous implementation, called S+, uses those techniques and achieves high gigaflop rates on distributed memory machines. This paper studies the space requirement of this approach and proposes an optimization strategy called lazy space allocation which acquires memory on the fly only when it is necessary. This strategy can effectively control memory usage, especially when static symbolic factorization overestimates fill-ins excessively. Our experiments show that the improved S+ code, which combines this strategy with elimination-forest guided partitioning and scheduling, has sequential time and space cost competitive with SuperLU, is space scalable for solving problems of large sizes on multiple processors, and can deliver up to 10 GFLOPS on 128 Cray T3E 450MHz nodes.
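The lazy space allocation idea can be pictured with a tiny sketch: a block container that acquires storage for a block only when that block first receives a nonzero update, so predicted fill-in blocks that never actually materialize cost no memory. All names here are hypothetical illustrations, not the S+ code:

```python
class LazyBlockStore:
    """Allocate 2D matrix blocks on the fly, only on first write."""

    def __init__(self, block_shape):
        self.block_shape = block_shape
        self.blocks = {}  # (I, J) -> dense block, created on first touch

    def touch(self, I, J):
        # Memory for block (I, J) is acquired only when it first receives
        # fill; blocks symbolically predicted but never updated cost nothing.
        if (I, J) not in self.blocks:
            rows, cols = self.block_shape
            self.blocks[(I, J)] = [[0.0] * cols for _ in range(rows)]
        return self.blocks[(I, J)]

    def allocated(self):
        return len(self.blocks)
```

A factorization loop would call `touch` right before each block update, so peak memory tracks the blocks actually used rather than the symbolic upper bound.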
Parallel Sparse Gaussian Elimination with Partial Pivoting and 2D Data Mapping
, 1997
Abstract

Cited by 4 (2 self)
Sparse Gaussian elimination with partial pivoting is a fundamental algorithm for many scientific and engineering applications, but it is still an open problem to develop a time and space efficient algorithm on distributed memory machines. In this thesis, we present an asynchronous algorithm which incorporates static symbolic factorization, nonsymmetric L/U supernode partitioning and supernode amalgamation strategies, a 2D block mapping scheme, and irregular parallelism scheduling. The distinguishing features of the algorithm are low memory requirements, simple runtime control, and a BLAS3 computation kernel. We demonstrate that the new algorithm can solve very large sparse linear systems and achieve high absolute performance. We have conducted experimental studies on the Cray T3D a...
Parallel Sparse LU Factorization with Partial Pivoting on Distributed Memory Architectures
, 1997
Abstract

Cited by 2 (1 self)
Gaussian elimination based sparse LU factorization with partial pivoting is important to many scientific applications, but it is still an open problem to develop a high performance sparse LU code on distributed memory machines. The main difficulty is that partial pivoting operations make structures of L and U factors unpredictable beforehand. This paper presents an approach called S* for parallelizing this problem on distributed memory machines. The S* approach adopts static symbolic factorization to avoid runtime control overhead, incorporates 2D L/U supernode partitioning and amalgamation strategies to improve caching performance, and exploits irregular task parallelism embedded in sparse LU using asynchronous computation scheduling. The paper discusses and compares the algorithms using 1D and 2D data mapping schemes, and presents experimental studies on Cray T3D and T3E. The performance results for a set of nonsymmetric benchmark matrices are very encouraging and S* has ac...
Parallel Sparse LU Factorization on Second-class Message Passing Platforms
, 2005
Abstract

Cited by 2 (0 self)
Several message passing-based parallel solvers have been developed for general (nonsymmetric) sparse LU factorization with partial pivoting. Due to the fine-grain synchronization and large communication volume between computing nodes for this application, existing solvers are mostly intended to run on tightly-coupled parallel computing platforms with high message passing performance (e.g., 1–10 µs in message latency and 100–1000 Mbytes/sec in message throughput). In order to utilize platforms with slower message passing, this paper investigates techniques that can significantly reduce the application's communication needs. In particular, we propose batch pivoting to make pivot selections in groups through speculative factorization, and thus substantially decrease the interprocessor synchronization granularity. We experimented with an MPI-based implementation on several message passing platforms. While the speculative batch pivoting provides no performance benefit and even slightly weakens the numerical stability on an IBM Regatta multiprocessor with fast message passing, it improves the performance of our test matrices by 28–292% on an Ethernet-connected 16-node PC cluster. We also evaluated several other communication reduction techniques and showed that they are not as effective as our proposed approach.
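The batch-pivoting idea is, roughly, to select pivot rows for a whole group of columns in one communication round using the panel's current (not yet fully updated) values, then verify each speculative choice against a threshold once the true updates arrive, falling back to column-by-column pivoting only for failed columns. A hedged sketch of the two phases; the function names and the threshold form are assumptions, not the paper's code:

```python
def speculative_batch_pivots(panel):
    # panel[j][i]: current value at row i of column j, before the updates
    # from earlier columns in the same batch (hence "speculative").
    # Pick the row of maximum magnitude in each column, all in one round.
    return [max(range(len(col)), key=lambda i: abs(col[i])) for col in panel]

def verify_pivots(updated_panel, pivots, tau=0.1):
    # After the real updates, accept a speculative pivot only if its
    # magnitude is still within a factor tau of the column's true maximum;
    # rejected columns would be re-pivoted one at a time.
    return [abs(col[p]) >= tau * max(abs(v) for v in col)
            for col, p in zip(updated_panel, pivots)]
```

The trade-off reported above follows from this structure: batching replaces per-column synchronization with one exchange per panel, at the cost of occasionally accepting a slightly weaker pivot.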
Efficient Sparse Gaussian Elimination with Lazy Space Allocation
, 1999
Abstract
A parallel algorithm is implemented for sparse Gaussian elimination on distributed memory machines. First, we use the minimum degree ordering algorithm and a transversal algorithm to reorder the columns and rows of the matrix. Next, we implement the LU factorization of the reordered matrix by combining various techniques, such as static symbolic factorization, 2D supernode partitioning, asynchronous computation scheduling and the new lazy space allocation strategy. This lazy space allocation strategy can effectively control memory usage, especially when static symbolic factorization overestimates fill-ins excessively. Our experiments show that the new LU code using this strategy has sequential time and space cost competitive with SuperLU, and can deliver up to 10 GFLOPS when running on 128 Cray ...