Results 1  10
of
32
Highly scalable parallel algorithms for sparse matrix factorization
 IEEE Transactions on Parallel and Distributed Systems
, 1994
"... In this paper, we describe a scalable parallel algorithm for sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algo ..."
Abstract

Cited by 116 (29 self)
 Add to MetaCart
In this paper, we describe a scalable parallel algorithm for sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithm substantially improves the state of the art in parallel direct solution of sparse linear systems—both in terms of scalability and overall performance. It is a well known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithm to factor a wide class of sparse matrices (including those arising from two and threedimensional finite element problems) that is asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithm incurs less communication overhead and is more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of our sparse Cholesky factorization algorithm delivers up to 20 GFlops on a Cray T3D for mediumsize structural engineering and linear programming problems. To the best of our knowledge,
Improved load distribution in parallel sparse Cholesky factorization
 In Proc. of Supercomputing'94
, 1994
"... Compared to the customary columnoriented approaches, blockoriented, distributedmemory sparse Cholesky factorization benefits from an asymptotic reduction in interprocessor communication volume and an asymptotic increase in the amount of concurrency that is exposed in the problem. Unfortunately, ..."
Abstract

Cited by 38 (1 self)
 Add to MetaCart
Compared to the customary columnoriented approaches, blockoriented, distributedmemory sparse Cholesky factorization benefits from an asymptotic reduction in interprocessor communication volume and an asymptotic increase in the amount of concurrency that is exposed in the problem. Unfortunately, blockoriented approaches (specifically, the block fanout method) have suffered from poor balance of the computational load. As a result, achieved performance can be quite low. This paper investigates the reasons for this load imbalance and proposes simple block mapping heuristics that dramatically improve it. The result is a roughly 20_o increase in realized parallel factorization performance, as demonstrated by performance results from an Intel Paragon TM system. We have achieved performance of nearly 3.2 billion floating point operations per second with this technique on a 196node Paragon system. 1
Spectral Nested Dissection
, 1992
"... . We describe a spectral nested dissection algorithm for computing orderings appropriate for parallel factorization of sparse, symmetric matrices. The algorithm makes use of spectral properties of the Laplacian matrix associated with the given matrix to compute separators. We evaluate the quality of ..."
Abstract

Cited by 30 (5 self)
 Add to MetaCart
. We describe a spectral nested dissection algorithm for computing orderings appropriate for parallel factorization of sparse, symmetric matrices. The algorithm makes use of spectral properties of the Laplacian matrix associated with the given matrix to compute separators. We evaluate the quality of the spectral orderings with respect to several measures: fill, elimination tree height, height and weight balances of elimination trees, and clique tree heights. Spectral orderings compare quite favorably with commonly used orderings, outperforming them by a wide margin for some of these measures. These results are confirmed by computing a multifrontal numerical factorization with the different orderings on a Cray YMP with eight processors. Keywords. graph partitioning, graph spectra, Laplacian matrix, ordering algorithms, parallel orderings, parallel sparse Cholesky factorization, sparse matrix, vertex separator AMS(MOS) subject classifications. 65F50, 65F05, 65F15, 68R10 1. Introducti...
Sparse Numerical Linear Algebra: Direct Methods and Preconditioning
, 1996
"... Most of the current techniques for the direct solution of linear equations are based on supernodal or multifrontal approaches. An important feature of these methods is that arithmetic is performed on dense submatrices and Level 2 and Level 3 BLAS (matrixvector and matrixmatrix kernels) can be us ..."
Abstract

Cited by 17 (2 self)
 Add to MetaCart
Most of the current techniques for the direct solution of linear equations are based on supernodal or multifrontal approaches. An important feature of these methods is that arithmetic is performed on dense submatrices and Level 2 and Level 3 BLAS (matrixvector and matrixmatrix kernels) can be used. Both sparse LU and QR factorizations can be implemented within this framework. Partitioning and ordering techniques have seen major activity in recent years. We discuss bisection and multisection techniques, extensions to orderings to block triangular form, and recent improvements and modifications to standard orderings such as minimum degree. We also study advances in the solution of indefinite systems and sparse leastsquares problems. The desire to exploit parallelism has been responsible for many of the developments in direct methods for sparse matrices over the last ten years. We examine this aspect in some detail, illustrating how current techniques have been developed or ...
A Parallel Formulation of Interior Point Algorithms
 DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF MINNESOTA
, 1994
"... In recent years, interior point algorithms have been used successfully for solving medium to largesize linear programming (LP) problems. In this paper we describe a highly parallel formulation of the interior point algorithm. A key component of the interior point algorithm is the solution of a s ..."
Abstract

Cited by 17 (9 self)
 Add to MetaCart
In recent years, interior point algorithms have been used successfully for solving medium to largesize linear programming (LP) problems. In this paper we describe a highly parallel formulation of the interior point algorithm. A key component of the interior point algorithm is the solution of a sparse system of linear equations using Cholesky factorization. The performance of parallel Cholesky factorization is determined by (a) the communication overhead incurred by the algorithm, and (b) the load imbalance among the processors. In our parallel interior point algorithm, we use our recently developed parallel multifrontal algorithm that has the smallest communication overhead over all parallel algorithms for Cholesky factorization developed to date. The computation imbalance depends on the shape of the elimination tree associated with the sparse system reordered for factorization. To balance the computation, we implemented and evaluated four di#erent ordering algorithms. Among these algorithms, KernighanLin and spectral nested dissection yield the most balanced elimination trees and greatly increase the amount of parallelism that can be exploited. Our preliminary implementation achieves a speedup as high as 108 on 256processor nCUBE 2 on moderatesize problems.
Runtime Compilation for Parallel Sparse Matrix Computations
 In Proceedings of ACM International Conference on Supercomputing
, 1996
"... Runtime compilation techniques have been shown effective for automating the parallelization of loops with unstructured indirect data accessing patterns. However, it is still an open problem to efficiently parallelize sparse matrix factorizations commonly used in iterative numerical problems. The di ..."
Abstract

Cited by 16 (9 self)
 Add to MetaCart
Runtime compilation techniques have been shown effective for automating the parallelization of loops with unstructured indirect data accessing patterns. However, it is still an open problem to efficiently parallelize sparse matrix factorizations commonly used in iterative numerical problems. The difficulty is that a factorization process contains irregularlyinterleaved communication and computation with varying granularities and it is hard to obtain scalable performance on distributed memory machines. In this paper, we present an inspector/executor approach for parallelizing such applications by embodying automatic graph scheduling techniques to optimize interleaved communication and computation. We describe a runtime system called RAPID that provides a set of library functions for specifying irregular data objects and tasks that access these objects. The system extracts a task dependence graph from data access patterns, and executes tasks efficiently on a distributed memory machine....
The Tao of Parallelism in Algorithms
 In PLDI
, 2011
"... For more than thirty years, the parallel programming community has used the dependence graph as the main abstraction for reasoning about and exploiting parallelism in “regular ” algorithms that use dense arrays, such as finitedifferences and FFTs. In this paper, we argue that the dependence graph i ..."
Abstract

Cited by 14 (5 self)
 Add to MetaCart
For more than thirty years, the parallel programming community has used the dependence graph as the main abstraction for reasoning about and exploiting parallelism in “regular ” algorithms that use dense arrays, such as finitedifferences and FFTs. In this paper, we argue that the dependence graph is not a suitable abstraction for algorithms in new application areas like machine learning and network analysis in which the key data structures are “irregular ” data structures like graphs, trees, and sets. To address the need for better abstractions, we introduce a datacentric formulation of algorithms called the operator formulation in which an algorithm is expressed in terms of its action on data structures. This formulation is the basis for a structural analysis of algorithms that we call taoanalysis. Taoanalysis can be viewed as an abstraction of algorithms that distills out algorithmic properties
A high performance sparse Cholesky factorization algorithm for scalable parallel computers
 Department of Computer Science, University of Minnesota
, 1994
"... Abstract This paper presents a new parallel algorithm for sparse matrix factorization. This algorithm uses subforesttosubcube mapping instead of the subtreetosubcube mapping of another recently introduced scheme by Gupta and Kumar [13]. Asymptotically, both formulations are equally scalable on a ..."
Abstract

Cited by 13 (1 self)
 Add to MetaCart
Abstract This paper presents a new parallel algorithm for sparse matrix factorization. This algorithm uses subforesttosubcube mapping instead of the subtreetosubcube mapping of another recently introduced scheme by Gupta and Kumar [13]. Asymptotically, both formulations are equally scalable on a wide range of architectures and a wide variety of problems. But the subtreetosubcube mapping of the earlier formulation causes significant load imbalance among processors, limiting overall efficiency and speedup. The new mapping largely eliminates the load imbalance among processors. Furthermore, the algorithm has a number of enhancements to improve the overall performance substantially. This new algorithm achieves up to 6GFlops on a 256processor Cray T3D for moderately large problems. To our knowledge, this is the highest performance ever obtained on an MPP for sparse Cholesky factorization.
Parallel Direct Solution of Large Sparse Systems in Finite Element Computations
, 1993
"... An integrated approach for the parallel solution of large sparse systems arisen in finite element computations is presented. The approach includes a threephase preprocessor and a macro dataflow execution scheme. The three phases of the preprocessor are: (1) Extracting parallelism by means of an aut ..."
Abstract

Cited by 10 (4 self)
 Add to MetaCart
An integrated approach for the parallel solution of large sparse systems arisen in finite element computations is presented. The approach includes a threephase preprocessor and a macro dataflow execution scheme. The three phases of the preprocessor are: (1) Extracting parallelism by means of an automatic domain decomposer
Analysis and Design of Scalable Parallel Algorithms for Scientific Computing
, 1995
"... This dissertation presents a methodology for understanding the performance and scalability of algorithms on parallel computers and the scalability analysis of a variety of numerical algorithms. We demonstrate the analytical power of this technique and show how it can guide the development of better ..."
Abstract

Cited by 8 (5 self)
 Add to MetaCart
This dissertation presents a methodology for understanding the performance and scalability of algorithms on parallel computers and the scalability analysis of a variety of numerical algorithms. We demonstrate the analytical power of this technique and show how it can guide the development of better parallel algorithms. We present some new highly scalable parallel algorithms for sparse matrix computations that were widely considered to be poorly suitable for large scale parallel computers. We present some laws governing the performance and scalability properties that apply to all parallel systems. We show that our results generalize or extend a range of earlier research results concerning the performance of parallel systems. Our scalability analysis of algorithms such as fast Fourier transform (FFT), dense matrix multiplication, sparse matrixvector multiplication, and the preconditioned conjugate gradient (PCG) provides many interesting insights into their behavior on parallel computer...