Highly scalable parallel algorithms for sparse matrix factorization
- IEEE Transactions on Parallel and Distributed Systems
, 1994
Cited by 100 (29 self)
In this paper, we describe a scalable parallel algorithm for sparse matrix factorization, analyze its performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithm substantially improves the state of the art in parallel direct solution of sparse linear systems, both in terms of scalability and overall performance. It is well known that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithm to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that is asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithm incurs less communication overhead and is more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of our sparse Cholesky factorization algorithm delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge, ...
Improved load distribution in parallel sparse Cholesky factorization
- In Proc. of Supercomputing'94
, 1994
Cited by 38 (1 self)
Compared to the customary column-oriented approaches, block-oriented, distributed-memory sparse Cholesky factorization benefits from an asymptotic reduction in interprocessor communication volume and an asymptotic increase in the amount of concurrency that is exposed in the problem. Unfortunately, block-oriented approaches (specifically, the block fan-out method) have suffered from poor balance of the computational load. As a result, achieved performance can be quite low. This paper investigates the reasons for this load imbalance and proposes simple block mapping heuristics that dramatically improve it. The result is a roughly 20% increase in realized parallel factorization performance, as demonstrated by performance results from an Intel Paragon system. We have achieved performance of nearly 3.2 billion floating point operations per second with this technique on a 196-node Paragon system.
Sparse Gaussian Elimination on High Performance Computers
, 1996
Cited by 33 (5 self)
This dissertation presents new techniques for solving large sparse unsymmetric linear systems on high performance computers, using Gaussian elimination with partial pivoting. The efficiencies of the new algorithms are demonstrated for matrices from various fields and for a variety of high performance machines. In the first part we discuss optimizations of a sequential algorithm to exploit the memory hierarchies that exist in most RISC-based superscalar computers. We begin with the left-looking supernode-column algorithm by Eisenstat, Gilbert and Liu, which includes Eisenstat and Liu's symmetric structural reduction for fast symbolic factorization. Our key contribution is to develop both numeric and symbolic schemes to perform supernode-panel updates to achieve better data reuse in cache and floating-point register...
Efficient Parallel Solutions Of Large Sparse SPD Systems On Distributed-Memory Multiprocessors
- Advanced Computing Research Institute, Center for Theory and Simulation in Science and Engineering, Cornell
Cited by 16 (2 self)
We consider several issues involved in the solution of sparse symmetric positive definite systems by the multifrontal method on distributed-memory multiprocessors. First, we present a new algorithm for computing the partial factorization of a frontal matrix on a subset of processors which significantly improves the performance of a distributed multifrontal algorithm previously designed. Second, new parallel algorithms for computing sparse forward elimination and sparse backward substitution are described. The new algorithms solve the sparse triangular systems in a multifrontal fashion. Numerical experiments run on an Intel iPSC/860 and an Intel iPSC/2 for a set of problems with regular and irregular sparsity structure are reported. More than 180 million flops per second during the numerical factorization are achieved for a three-dimensional grid problem on an iPSC/860 machine with 32 processors. Key words. Cholesky factorization, clique tree, distributed-memory multiprocessors, multifro...
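For readers unfamiliar with the two triangular solves this abstract mentions, the following is a minimal sequential sketch of sparse forward elimination (L y = b) and backward substitution (L^T x = y) for a Cholesky factor L. It is not the paper's parallel multifrontal algorithm. The compressed sparse column layout, the function names, and the convention that each column stores its diagonal entry first are all assumptions made for illustration.

```python
def forward_elimination(n, col_ptr, row_idx, values, b):
    """Solve L y = b for a sparse lower-triangular L in CSC form.

    Assumption: within each column the diagonal entry is stored first.
    """
    y = list(b)
    for j in range(n):
        start, end = col_ptr[j], col_ptr[j + 1]
        y[j] /= values[start]                      # divide by L[j][j]
        for k in range(start + 1, end):            # update rows below the diagonal
            y[row_idx[k]] -= values[k] * y[j]
    return y


def backward_substitution(n, col_ptr, row_idx, values, y):
    """Solve L^T x = y using the same CSC storage of L."""
    x = list(y)
    for j in range(n - 1, -1, -1):
        start, end = col_ptr[j], col_ptr[j + 1]
        for k in range(start + 1, end):            # row j of L^T is column j of L
            x[j] -= values[k] * x[row_idx[k]]
        x[j] /= values[start]
    return x


# Hypothetical 3x3 example: L = [[2, 0, 0], [1, 3, 0], [0, 4, 5]]
COL_PTR = [0, 2, 4, 5]
ROW_IDX = [0, 1, 1, 2, 2]
VALUES = [2.0, 1.0, 3.0, 4.0, 5.0]

y = forward_elimination(3, COL_PTR, ROW_IDX, VALUES, [2.0, 7.0, 23.0])
x = backward_substitution(3, COL_PTR, ROW_IDX, VALUES, [4.0, 18.0, 15.0])
```

Both loops touch only the stored nonzeros of L, which is the property a parallel formulation would also need to preserve when distributing columns across processors.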
A high performance sparse Cholesky factorization algorithm for scalable parallel computers
- Department of Computer Science, University of Minnesota
, 1994
Cited by 12 (1 self)
This paper presents a new parallel algorithm for sparse matrix factorization. This algorithm uses subforest-to-subcube mapping instead of the subtree-to-subcube mapping of another recently introduced scheme by Gupta and Kumar [13]. Asymptotically, both formulations are equally scalable on a wide range of architectures and a wide variety of problems. But the subtree-to-subcube mapping of the earlier formulation causes significant load imbalance among processors, limiting overall efficiency and speedup. The new mapping largely eliminates the load imbalance among processors. Furthermore, the algorithm has a number of enhancements to improve the overall performance substantially. This new algorithm achieves up to 6 GFlops on a 256-processor Cray T3D for moderately large problems. To our knowledge, this is the highest performance ever obtained on an MPP for sparse Cholesky factorization.
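The load-imbalance point in this abstract can be illustrated with a small sketch of subtree-to-subcube mapping as it is commonly described: the root of the elimination tree is shared by all processors, and each of its two subtrees receives half of them, recursively. This is a simplified illustration, not the code of either paper. The tree, the per-node work figures, the function names, and the assumption that a node's work is split evenly among its processors are all hypothetical.

```python
def subtree_to_subcube(tree, root, procs):
    """Assign a processor group to every node of a binary elimination tree.

    tree:  dict mapping node -> list of children (at most two per node here).
    procs: list of processor ids available to the subtree rooted at `root`.
    """
    mapping = {}
    stack = [(root, procs)]
    while stack:
        node, group = stack.pop()
        mapping[node] = group
        children = tree.get(node, [])
        if len(children) == 2 and len(group) > 1:
            half = len(group) // 2                 # split the group between subtrees
            stack.append((children[0], group[:half]))
            stack.append((children[1], group[half:]))
        else:
            for child in children:                 # chain, or a single processor left
                stack.append((child, group))
    return mapping


def work_per_processor(mapping, work, num_procs):
    """Total work per processor, assuming each node's work is shared evenly."""
    load = [0.0] * num_procs
    for node, group in mapping.items():
        for p in group:
            load[p] += work[node] / len(group)
    return load


# Hypothetical tree: root 6 with subtrees rooted at 2 and 5; leaf 3 is heavy.
TREE = {6: [2, 5], 2: [0, 1], 5: [3, 4]}
WORK = {6: 4, 2: 2, 5: 2, 0: 1, 1: 1, 3: 5, 4: 1}

mapping = subtree_to_subcube(TREE, 6, [0, 1, 2, 3])
load = work_per_processor(mapping, WORK, 4)
```

Because processors are divided by tree shape rather than by work, the processor that owns the heavy leaf ends up with more than twice the load of the others in this example, which is the kind of imbalance the subforest-to-subcube mapping is said to reduce.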
Task Scheduling in an Asynchronous Distributed Memory Multifrontal Solver
, 2002
Cited by 9 (5 self)
We describe the improvements to the task scheduling for MUMPS, an asynchronous distributed memory direct solver for sparse linear systems. In the new approach, we determine, during the analysis of the matrix, candidate processes for the tasks that will be dynamically scheduled during the subsequent factorization. This approach significantly improves the scalability of the solver in terms of execution time and storage. By comparison with the previous version of MUMPS, we demonstrate the efficiency and the scalability of the new algorithm on up to 512 processors. Our test cases include matrices from regular 3D grids and irregular ones from real-life applications.
Analysis and Design of Scalable Parallel Algorithms for Scientific Computing
, 1995
Cited by 7 (4 self)
This dissertation presents a methodology for understanding the performance and scalability of algorithms on parallel computers and the scalability analysis of a variety of numerical algorithms. We demonstrate the analytical power of this technique and show how it can guide the development of better parallel algorithms. We present some new highly scalable parallel algorithms for sparse matrix computations that were widely considered to be poorly suited to large scale parallel computers. We present some laws governing the performance and scalability properties that apply to all parallel systems. We show that our results generalize or extend a range of earlier research results concerning the performance of parallel systems. Our scalability analysis of algorithms such as fast Fourier transform (FFT), dense matrix multiplication, sparse matrix-vector multiplication, and the preconditioned conjugate gradient (PCG) provides many interesting insights into their behavior on parallel computer...
Multifrontal Computation with the Orthogonal Factors of Sparse Matrices
- SIAM Journal on Matrix Analysis and Applications
, 1994
Cited by 7 (0 self)
This paper studies the solution of the linear least squares problem for a large and sparse m by n matrix A with $m \ge n$ by QR factorization of A and transformation of the right-hand side vector b to $Q^T b$. A multifrontal-based method for computing $Q^T b$ using Householder factorization is presented. A theoretical operation count for the K by K unbordered grid model problem and problems defined on graphs with $\sqrt{n}$-separators shows that the proposed method requires $O(N_R)$ storage and multiplications to compute $Q^T b$, where $N_R = O(n \log n)$ is the number of nonzeros of the upper triangular factor R of A. In order to introduce BLAS-2 operations, Schreiber and Van Loan's Storage-Efficient-WY Representation [SIAM J. Sci. Stat. Computing, 10 (1989), pp. 55-57] is applied for the orthogonal factor $Q_i$ of each frontal matrix $F_i$. If this technique is used, the bound on storage increases to $O(n (\log n)^2)$. Some numerical results for the grid model problems as well as Harwell-Boeing problems...
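As a small dense illustration of the idea in this abstract, the sketch below reduces A to upper triangular form with Householder reflectors and applies each reflector to b as it is generated, so that b becomes Q^T b without Q ever being formed. It is not the paper's multifrontal method and ignores sparsity entirely. The function names are hypothetical, and A is assumed to have full column rank.

```python
import math


def householder_qtb(A, b):
    """Return (R, qtb): A reduced to upper triangular form and b mapped to Q^T b.

    A is an m x n list of rows with m >= n; the inputs are not modified.
    """
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    qtb = list(b)
    for k in range(n):
        norm = math.sqrt(sum(R[i][k] ** 2 for i in range(k, m)))
        if norm == 0.0:
            continue                               # column already zero below row k
        # Reflector H = I - 2 v v^T / (v^T v) maps column k to alpha * e_k.
        alpha = -math.copysign(norm, R[k][k])
        v = [R[i][k] for i in range(k, m)]
        v[0] -= alpha
        vtv = sum(c * c for c in v)
        for j in range(k, n):                      # apply H to the remaining columns
            s = 2.0 * sum(v[i - k] * R[i][j] for i in range(k, m)) / vtv
            for i in range(k, m):
                R[i][j] -= s * v[i - k]
        s = 2.0 * sum(v[i - k] * qtb[i] for i in range(k, m)) / vtv
        for i in range(k, m):                      # apply H to the right-hand side
            qtb[i] -= s * v[i - k]
    return R, qtb


def solve_least_squares(A, b):
    """Minimize ||Ax - b||_2 by back substitution on R x = (Q^T b)[:n]."""
    R, qtb = householder_qtb(A, b)
    n = len(A[0])
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = qtb[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / R[i][i]                         # assumes full column rank
    return x


# Hypothetical line fit through (0, 1), (1, 2), (2, 4).
x = solve_least_squares([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]], [1.0, 2.0, 4.0])
```

Only the reflector vectors and R are needed to obtain Q^T b, which is why the storage for the orthogonal factor is the quantity the abstract bounds.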
Parallel Direct Methods For Sparse Linear Systems
, 1997
Cited by 5 (0 self)
We present an overview of parallel direct methods for solving sparse systems of linear equations, focusing on symmetric positive definite systems. We examine the performance implications of the important differences between dense and sparse systems. Our main emphasis is on parallel implementation of the numerically intensive factorization process, but we also briefly consider the other major components of direct methods, such as parallel ordering. Introduction In this paper we present a brief overview of parallel direct methods for solving sparse linear systems. Paradoxically, sparse matrix factorization offers additional opportunities for exploiting parallelism beyond those available with dense matrices, yet it is often more difficult to attain good efficiency in the sparse case. We examine both sides of this paradox: the additional parallelism induced by sparsity, and the difficulty in achieving high efficiency in spite of it. We focus on Cholesky factorization, primarily because th...
Parallel Multifrontal Solution Of Sparse Linear Least Squares Problems On Distributed-Memory Multiprocessors
- Advanced Computing Research Institute, Center for Theory and Simulation in Science and Engineering, Cornell
, 1994
Cited by 3 (0 self)
We describe the issues involved in the design and implementation of efficient parallel algorithms for solving sparse linear least squares problems on distributed-memory multiprocessors. We consider both the QR factorization method due to Golub and the method of corrected semi-normal equations due to Björck. The major tasks involved are sparse QR factorization, sparse triangular solution and sparse matrix-vector multiplication. The sparse QR factorization is accomplished by a parallel multifrontal scheme recently introduced. New parallel algorithms for solving the related sparse triangular systems and for performing sparse matrix-vector multiplications are proposed. The arithmetic and communication complexities of our algorithms on regular grid problems are presented. Experimental results on an Intel iPSC/860 machine are described. Key words. parallel algorithms, sparse matrix, orthogonal factorization, multifrontal method, least squares problems, triangular solution, distributed-me...

