Results 1 - 7 of 7
Highly scalable parallel algorithms for sparse matrix factorization
IEEE Transactions on Parallel and Distributed Systems, 1994
Cited by 116 (29 self)
In this paper, we describe a scalable parallel algorithm for sparse matrix factorization, analyze its performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithm substantially improves the state of the art in parallel direct solution of sparse linear systems, both in terms of scalability and overall performance. It is a well-known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithm to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that is asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithm incurs less communication overhead and is more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of our sparse Cholesky factorization algorithm delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge,
Highly Parallel Sparse Cholesky Factorization
SIAM Journal on Scientific and Statistical Computing, 1992
Cited by 42 (1 self)
We develop and compare several fine-grained parallel algorithms to compute the Cholesky factorization of a sparse matrix. Our experimental implementations are on the Connection Machine, a distributed-memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special-purpose algorithms in which the matrix structure conforms to the connection structure of the machine, our focus is on matrices with arbitrary sparsity structure.
Using Domain Decomposition to find Graph Bisectors
BIT, 1995
Cited by 17 (4 self)
In this paper we introduce a three-step approach to find a vertex bisector of a graph. The first step finds a domain decomposition of the graph: a set of connected subgraphs, the domains, and a multisector, the remaining vertices that separate the domains from each other. The second step uses a block variant of the Kernighan-Lin scheme to find a bisector that is a subset of the multisector. The third step improves the bisector by bipartite graph matching. Experimental results show this domain decomposition method finds graph partitions that compare favorably with a state-of-the-art multilevel partitioning scheme in both quality and execution time.
1 Introduction
Graph partitioning is a well-known practical problem that has many important applications, such as task allocation for parallel computations [13] and circuit partitioning for VLSI design [22]. Our driving interest is to find low-fill orderings for sparse matrix computation [4], [6], [15], [19]. An effective approach to find fi...
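The objects named in this abstract (domains, a multisector, and a bisector drawn from the multisector) can be illustrated with a minimal Python sketch. It is not the authors' block Kernighan-Lin scheme or their matching-based refinement. It assumes a domain decomposition and a two-way assignment of domains are already given, and the names `bisector_from_domains` and `is_vertex_separator` are hypothetical.

```python
from typing import Dict, Hashable, Optional, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def bisector_from_domains(
    graph: Graph,
    domain_of: Dict[Hashable, Optional[str]],
    side_of_domain: Dict[str, int],
) -> Tuple[Set[Hashable], Set[Hashable], Set[Hashable]]:
    """Derive (side0, side1, separator) from a domain decomposition.

    domain_of maps each vertex to its domain id, or to None when the
    vertex belongs to the multisector. side_of_domain assigns every
    domain to side 0 or side 1 (assumed to be chosen elsewhere).
    """
    side: Dict[Hashable, int] = {}
    separator: Set[Hashable] = set()
    for v, d in domain_of.items():
        if d is None:
            separator.add(v)          # multisector vertex, undecided so far
        else:
            side[v] = side_of_domain[d]

    # A multisector vertex whose already-placed neighbours all lie on one
    # side can join that side; the rest stay in the separator, so the
    # separator is always a subset of the multisector.
    for v in sorted(separator, key=repr):
        touched = {side[u] for u in graph[v] if u in side}
        if len(touched) <= 1:
            side[v] = touched.pop() if touched else 0
            separator.discard(v)

    side0 = {v for v, s in side.items() if s == 0}
    side1 = {v for v, s in side.items() if s == 1}
    return side0, side1, separator


def is_vertex_separator(graph: Graph, side0: Set[Hashable], side1: Set[Hashable]) -> bool:
    """True when no edge joins side0 directly to side1."""
    return all(u not in side1 for v in side0 for u in graph[v])


# Illustrative input: a path 0-1-2-3-4-5-6 with domains A={0,1}, B={3},
# C={5,6} and multisector {2,4}; A and B go to side 0, C to side 1.
path = {i: set() for i in range(7)}
for i in range(6):
    path[i].add(i + 1)
    path[i + 1].add(i)
domains = {0: "A", 1: "A", 2: None, 3: "B", 4: None, 5: "C", 6: "C"}
s0, s1, sep = bisector_from_domains(path, domains, {"A": 0, "B": 0, "C": 1})
print(sorted(s0), sorted(s1), sorted(sep))
```

In this example vertex 2 is absorbed into side 0 because both of its neighbours lie there, while vertex 4 touches both sides and remains the only separator vertex.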
Irregular Parallel Algorithms in Java
In Irregular'99: Sixth International Workshop on Solving Irregularly Structured Problems in Parallel, 1999
Cited by 9 (0 self)
The nested data-parallel programming model supports the design and implementation of irregular parallel algorithms. This paper describes work in progress to incorporate nested data parallelism into the object model of Java by developing a library of collection classes and adding a forall statement to the language. The collection classes provide parallel implementations of operations on the collections. The forall statement allows operations over the elements of a collection to be expressed in parallel. We distinguish between shape and data components in the collection classes, and use this distinction to simplify algorithm expression and to improve performance. We present initial performance data on two benchmarks with irregular algorithms, EM3d and Convex Hull, and on several microbenchmark programs.
Analysis and Design of Scalable Parallel Algorithms for Scientific Computing
1995
Cited by 8 (5 self)
This dissertation presents a methodology for understanding the performance and scalability of algorithms on parallel computers and the scalability analysis of a variety of numerical algorithms. We demonstrate the analytical power of this technique and show how it can guide the development of better parallel algorithms. We present some new highly scalable parallel algorithms for sparse matrix computations that were widely considered to be poorly suited to large-scale parallel computers. We present some laws governing the performance and scalability properties that apply to all parallel systems. We show that our results generalize or extend a range of earlier research results concerning the performance of parallel systems. Our scalability analysis of algorithms such as fast Fourier transform (FFT), dense matrix multiplication, sparse matrix-vector multiplication, and the preconditioned conjugate gradient (PCG) provides many interesting insights into their behavior on parallel computer...
WSMP: A High-Performance Shared- and Distributed-Memory Parallel Sparse Linear Equation Solver
2001
Cited by 3 (1 self)
The Watson Sparse Matrix Package, WSMP, is a high-performance, robust, and easy-to-use software package for solving large sparse systems of linear equations. It can be used as a serial package, or in a shared-memory multiprocessor environment, or as a scalable parallel solver in a message-passing environment, where each node can be either a uniprocessor or a shared-memory multiprocessor. A unique aspect of WSMP is that it exploits both SMP and MPP parallelism using Pthreads and MPI, respectively, while mostly shielding the user from the details of the architecture. Sparse symmetric factorization in WSMP has been clocked at up to 1.2 Gigaflops on RS6000 workstations with two 200 MHz Power3 CPUs and in excess of 90 Gigaflops on a 128-node (256-processor) SP with two-way SMP 200 MHz Power3 nodes. This paper gives an overview of the algorithms, implementation aspects, performance results, and the user interface of WSMP for solving symmetric sparse systems of linear equations.
Key words. Parallel software, Scientific computing, Sparse linear systems, Sparse matrix factorization, High-performance computing
Task Scheduling Using a Block Dependency DAG for Block-Oriented Sparse Cholesky Factorization
In: Proceedings of the 14th ACM Symposium on Applied Computing, 2000
Cited by 1 (0 self)
Block-oriented sparse Cholesky factorization decomposes a sparse matrix into rectangular subblocks; each block can then be handled as a computational unit in order to increase data reuse in a hierarchical memory system. The method also increases the degree of concurrency while reducing communication volume, so it performs more efficiently on a distributed-memory multiprocessor system than the customary column-oriented factorization method. Until now, however, the mapping of blocks to processors has been designed for load balance with restricted communication patterns. In this paper, we represent tasks using a block dependency DAG that shows the execution behavior of block sparse Cholesky factorization in a distributed-memory system. Since the characteristics of tasks for the block Cholesky factorization are different from those of the conventional parallel task model, we propose a new task scheduling algorithm using a block dependency DAG. The proposed algorithm consi...
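To make the idea of scheduling block tasks from a dependency DAG concrete, here is a minimal Python sketch of a generic greedy list scheduler. It is not the algorithm proposed in the paper, whose description is truncated above. It ignores communication costs, and the task names, unit costs, and the function `schedule_dag` are all illustrative assumptions.

```python
from collections import deque
from typing import Dict, List, Tuple


def schedule_dag(
    cost: Dict[str, float],
    deps: Dict[str, List[str]],
    num_procs: int,
) -> Dict[str, Tuple[int, float, float]]:
    """Greedy list scheduling of a task DAG (communication cost ignored).

    cost maps each task to its run time; deps maps a task to the tasks
    it must wait for. Returns task -> (processor, start, finish).
    """
    indegree = {t: len(deps.get(t, ())) for t in cost}
    children: Dict[str, List[str]] = {t: [] for t in cost}
    for t in cost:
        for p in deps.get(t, ()):
            children[p].append(t)

    ready = deque(t for t in cost if indegree[t] == 0)
    proc_free = [0.0] * num_procs
    plan: Dict[str, Tuple[int, float, float]] = {}

    while ready:
        t = ready.popleft()
        # A task may start once all of its prerequisites have finished.
        earliest = max((plan[p][2] for p in deps.get(t, ())), default=0.0)
        # Pick the processor on which the task can start soonest.
        proc = min(range(num_procs), key=lambda i: max(proc_free[i], earliest))
        start = max(proc_free[proc], earliest)
        finish = start + cost[t]
        plan[t] = (proc, start, finish)
        proc_free[proc] = finish
        for c in children[t]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)

    if len(plan) != len(cost):
        raise ValueError("dependency graph contains a cycle")
    return plan


# Hypothetical block tasks for a 3x3 block lower-triangular factor:
# F = factor a diagonal block, S = triangular solve, U = block update.
deps = {
    "F00": [], "S10": ["F00"], "S20": ["F00"],
    "U11": ["S10"], "U21": ["S10", "S20"], "U22": ["S20"],
    "F11": ["U11"], "S21": ["F11", "U21"],
    "U22b": ["S21", "U22"], "F22": ["U22b"],
}
cost = {t: 1.0 for t in deps}  # unit costs, purely for illustration
plan = schedule_dag(cost, deps, num_procs=2)
makespan = max(finish for _, _, finish in plan.values())
print(makespan)
```

With unit costs the longest dependency chain in this example has seven tasks, so no schedule can finish sooner than seven time units regardless of the processor count. A scheduler aimed at distributed memory would also have to weigh where each block lives, which this sketch leaves out.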