Results 1  10
of
24
Highly scalable parallel algorithms for sparse matrix factorization
 IEEE Transactions on Parallel and Distributed Systems
, 1994
"... In this paper, we describe a scalable parallel algorithm for sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algo ..."
Abstract

Cited by 116 (29 self)
 Add to MetaCart
In this paper, we describe a scalable parallel algorithm for sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithm substantially improves the state of the art in parallel direct solution of sparse linear systems—both in terms of scalability and overall performance. It is a well known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithm to factor a wide class of sparse matrices (including those arising from two and threedimensional finite element problems) that is asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithm incurs less communication overhead and is more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of our sparse Cholesky factorization algorithm delivers up to 20 GFlops on a Cray T3D for mediumsize structural engineering and linear programming problems. To the best of our knowledge,
Improved load distribution in parallel sparse Cholesky factorization
 In Proc. of Supercomputing'94
, 1994
"... Compared to the customary columnoriented approaches, blockoriented, distributedmemory sparse Cholesky factorization benefits from an asymptotic reduction in interprocessor communication volume and an asymptotic increase in the amount of concurrency that is exposed in the problem. Unfortunately, ..."
Abstract

Cited by 38 (1 self)
 Add to MetaCart
Compared to the customary columnoriented approaches, blockoriented, distributedmemory sparse Cholesky factorization benefits from an asymptotic reduction in interprocessor communication volume and an asymptotic increase in the amount of concurrency that is exposed in the problem. Unfortunately, blockoriented approaches (specifically, the block fanout method) have suffered from poor balance of the computational load. As a result, achieved performance can be quite low. This paper investigates the reasons for this load imbalance and proposes simple block mapping heuristics that dramatically improve it. The result is a roughly 20_o increase in realized parallel factorization performance, as demonstrated by performance results from an Intel Paragon TM system. We have achieved performance of nearly 3.2 billion floating point operations per second with this technique on a 196node Paragon system. 1
Sparse Gaussian Elimination on High Performance Computers
, 1996
"... This dissertation presents new techniques for solving large sparse unsymmetric linear systems on high performance computers, using Gaussian elimination with partial pivoting. The efficiencies of the new algorithms are demonstrated for matrices from various fields and for a variety of high performan ..."
Abstract

Cited by 36 (6 self)
 Add to MetaCart
This dissertation presents new techniques for solving large sparse unsymmetric linear systems on high performance computers, using Gaussian elimination with partial pivoting. The efficiencies of the new algorithms are demonstrated for matrices from various fields and for a variety of high performance machines. In the first part we discuss optimizations of a sequential algorithm to exploit the memory hierarchies that exist in most RISCbased superscalar computers. We begin with the leftlooking supernodecolumn algorithm by Eisenstat, Gilbert and Liu, which includes Eisenstat and Liu's symmetric structural reduction for fast symbolic factorization. Our key contribution is to develop both numeric and symbolic schemes to perform supernodepanel updates to achieve better data reuse in cache and floatingpoint register...
SPOOLES: An ObjectOriented Sparse Matrix Library
 In Proceedings of the 9th SIAM Conference on Parallel Processing for Scientific Computing
, 1999
"... ction and multisection. The latter two orderings depend on a domain/separator tree that is constructed using a graph partitioning method. Domain decomposition is used to find an initial separator, and a sequence of network flow problems are solved to smooth the separator. The qualities of our nested ..."
Abstract

Cited by 35 (0 self)
 Add to MetaCart
ction and multisection. The latter two orderings depend on a domain/separator tree that is constructed using a graph partitioning method. Domain decomposition is used to find an initial separator, and a sequence of network flow problems are solved to smooth the separator. The qualities of our nested dissection and multisection orderings are comparable to other state of the art packages. Factorizations of square matrices have the form A = PLDUQ and A = PLDL T P T , where P and Q are permutation matrices. Square systems of the form A + #B may also be factored and solved (as found in shiftandinvert eigensolvers), as well as full rank overdetermined linear systems, where a QR factorization is computed and the solution found by solving the seminormal equations. # This research was supported in part by the
Hybridizing Nested Dissection and Halo Approximate Minimum Degree for Efficient Sparse Matrix Ordering
 IN PROCEEDINGS OF IRREGULAR'99, LNCS 1586
, 1999
"... Minimum degree and nested dissection are the two most popular reordering schemes used to reduce llin and operation count when factoring and solving sparse matrices. Most of the stateoftheart ordering packages hybridize these methods by performing incomplete nested dissection and ordering by ..."
Abstract

Cited by 34 (15 self)
 Add to MetaCart
Minimum degree and nested dissection are the two most popular reordering schemes used to reduce llin and operation count when factoring and solving sparse matrices. Most of the stateoftheart ordering packages hybridize these methods by performing incomplete nested dissection and ordering by minimum degree the subgraphs associated with the leaves of the separation tree, but most often only loose couplings have been achieved, resulting in poorer performance than could have been expected. This paper presents a tight coupling of the nested dissection and halo approximate minimum degree algorithms, which allows the minimum degree algorithm to use exact degrees on the boundaries of the subgraphs passed to it, and to yield back not only the ordering of the nodes of the subgraph, but also the amalgamated assembly subtrees, for efficient block computations. Experimental results show the performance improvement of this hybridization, both in terms of fillin reduction and increa...
Performance of a Fully Parallel Sparse Solver
 Int. Journal of Supercomputer Applications
, 1996
"... The performance of a fully parallel direct solver for large sparse symmetric positive definite systems of linear equations is demonstrated. The solver is designed for distributedmemory, messagepassing parallel computer systems. All phases of the computation, including symbolic processing as well a ..."
Abstract

Cited by 17 (4 self)
 Add to MetaCart
The performance of a fully parallel direct solver for large sparse symmetric positive definite systems of linear equations is demonstrated. The solver is designed for distributedmemory, messagepassing parallel computer systems. All phases of the computation, including symbolic processing as well as numeric factorization and triangular solution, are performed in parallel. A parallel Cartesian nested dissection algorithm is used to compute a fillreducing ordering for the matrix and an appropriate partitioning of the problem across the processors. The separator This research was supported by the Advanced Research Projects Agency through the Army Research Office under contract number DAAL0391C0047. y Department of Computer Science and NCSA, University of Illinois, 1304 West Springfield Ave., Urbana, IL 61801, email: heath@cs.uiuc.edu. z Department of Computer Science, University of Tennessee, 107 Ayres Hall, Knoxville, TN 37996, email: padma@cs.utk.edu. Parallel Sparse Sol...
A Parallel Formulation of Interior Point Algorithms
 DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF MINNESOTA
, 1994
"... In recent years, interior point algorithms have been used successfully for solving medium to largesize linear programming (LP) problems. In this paper we describe a highly parallel formulation of the interior point algorithm. A key component of the interior point algorithm is the solution of a s ..."
Abstract

Cited by 17 (9 self)
 Add to MetaCart
In recent years, interior point algorithms have been used successfully for solving medium to largesize linear programming (LP) problems. In this paper we describe a highly parallel formulation of the interior point algorithm. A key component of the interior point algorithm is the solution of a sparse system of linear equations using Cholesky factorization. The performance of parallel Cholesky factorization is determined by (a) the communication overhead incurred by the algorithm, and (b) the load imbalance among the processors. In our parallel interior point algorithm, we use our recently developed parallel multifrontal algorithm that has the smallest communication overhead over all parallel algorithms for Cholesky factorization developed to date. The computation imbalance depends on the shape of the elimination tree associated with the sparse system reordered for factorization. To balance the computation, we implemented and evaluated four di#erent ordering algorithms. Among these algorithms, KernighanLin and spectral nested dissection yield the most balanced elimination trees and greatly increase the amount of parallelism that can be exploited. Our preliminary implementation achieves a speedup as high as 108 on 256processor nCUBE 2 on moderatesize problems.
A high performance sparse Cholesky factorization algorithm for scalable parallel computers
 Department of Computer Science, University of Minnesota
, 1994
"... Abstract This paper presents a new parallel algorithm for sparse matrix factorization. This algorithm uses subforesttosubcube mapping instead of the subtreetosubcube mapping of another recently introduced scheme by Gupta and Kumar [13]. Asymptotically, both formulations are equally scalable on a ..."
Abstract

Cited by 13 (1 self)
 Add to MetaCart
Abstract This paper presents a new parallel algorithm for sparse matrix factorization. This algorithm uses subforesttosubcube mapping instead of the subtreetosubcube mapping of another recently introduced scheme by Gupta and Kumar [13]. Asymptotically, both formulations are equally scalable on a wide range of architectures and a wide variety of problems. But the subtreetosubcube mapping of the earlier formulation causes significant load imbalance among processors, limiting overall efficiency and speedup. The new mapping largely eliminates the load imbalance among processors. Furthermore, the algorithm has a number of enhancements to improve the overall performance substantially. This new algorithm achieves up to 6GFlops on a 256processor Cray T3D for moderately large problems. To our knowledge, this is the highest performance ever obtained on an MPP for sparse Cholesky factorization.
Analysis and Design of Scalable Parallel Algorithms for Scientific Computing
, 1995
"... This dissertation presents a methodology for understanding the performance and scalability of algorithms on parallel computers and the scalability analysis of a variety of numerical algorithms. We demonstrate the analytical power of this technique and show how it can guide the development of better ..."
Abstract

Cited by 8 (5 self)
 Add to MetaCart
This dissertation presents a methodology for understanding the performance and scalability of algorithms on parallel computers and the scalability analysis of a variety of numerical algorithms. We demonstrate the analytical power of this technique and show how it can guide the development of better parallel algorithms. We present some new highly scalable parallel algorithms for sparse matrix computations that were widely considered to be poorly suitable for large scale parallel computers. We present some laws governing the performance and scalability properties that apply to all parallel systems. We show that our results generalize or extend a range of earlier research results concerning the performance of parallel systems. Our scalability analysis of algorithms such as fast Fourier transform (FFT), dense matrix multiplication, sparse matrixvector multiplication, and the preconditioned conjugate gradient (PCG) provides many interesting insights into their behavior on parallel computer...
RunTime Optimization of Sparse MatrixVector Multiplication on SIMD Machines
 PARLE 94 Parallel Architectures and Languages Europe
, 1994
"... Sparse matrixvector multiplication forms the heart of iterative linear solvers used widely in scientific computations (e.g., finite element methods). In such solvers, the matrixvector product is computed repeatedly, often thousands of times, with updated values of the vector until convergence is ..."
Abstract

Cited by 7 (3 self)
 Add to MetaCart
Sparse matrixvector multiplication forms the heart of iterative linear solvers used widely in scientific computations (e.g., finite element methods). In such solvers, the matrixvector product is computed repeatedly, often thousands of times, with updated values of the vector until convergence is achieved. In an SIMD architecture, each processor has to fetch the updated offprocessor vector elements while computing its share of the product. In this paper, we report on runtime optimization of array distribution and offprocessor data fetching to reduce both the communication and computation time. The optimization is applied to a sparse matrix stored in a compressed sparse rowwise format. Actual runs on test matrices produced up to a 35 percent relative improvement over a block distribution with a naive multiplication algorithm while simulations over a wider range of processors indicate that up to a 60 percent improvement may be possible in some cases.