Results 1  10
of
10
Highly scalable parallel algorithms for sparse matrix factorization
 IEEE Transactions on Parallel and Distributed Systems
, 1994
"... In this paper, we describe a scalable parallel algorithm for sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algo ..."
Abstract

Cited by 116 (29 self)
 Add to MetaCart
In this paper, we describe a scalable parallel algorithm for sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithm substantially improves the state of the art in parallel direct solution of sparse linear systemsâ€”both in terms of scalability and overall performance. It is a well known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithm to factor a wide class of sparse matrices (including those arising from two and threedimensional finite element problems) that is asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithm incurs less communication overhead and is more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of our sparse Cholesky factorization algorithm delivers up to 20 GFlops on a Cray T3D for mediumsize structural engineering and linear programming problems. To the best of our knowledge,
Improved load distribution in parallel sparse Cholesky factorization
 In Proc. of Supercomputing'94
, 1994
"... Compared to the customary columnoriented approaches, blockoriented, distributedmemory sparse Cholesky factorization benefits from an asymptotic reduction in interprocessor communication volume and an asymptotic increase in the amount of concurrency that is exposed in the problem. Unfortunately, ..."
Abstract

Cited by 38 (1 self)
 Add to MetaCart
Compared to the customary columnoriented approaches, blockoriented, distributedmemory sparse Cholesky factorization benefits from an asymptotic reduction in interprocessor communication volume and an asymptotic increase in the amount of concurrency that is exposed in the problem. Unfortunately, blockoriented approaches (specifically, the block fanout method) have suffered from poor balance of the computational load. As a result, achieved performance can be quite low. This paper investigates the reasons for this load imbalance and proposes simple block mapping heuristics that dramatically improve it. The result is a roughly 20_o increase in realized parallel factorization performance, as demonstrated by performance results from an Intel Paragon TM system. We have achieved performance of nearly 3.2 billion floating point operations per second with this technique on a 196node Paragon system. 1
A Parallel Formulation of Interior Point Algorithms
 DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF MINNESOTA
, 1994
"... In recent years, interior point algorithms have been used successfully for solving medium to largesize linear programming (LP) problems. In this paper we describe a highly parallel formulation of the interior point algorithm. A key component of the interior point algorithm is the solution of a s ..."
Abstract

Cited by 17 (9 self)
 Add to MetaCart
In recent years, interior point algorithms have been used successfully for solving medium to largesize linear programming (LP) problems. In this paper we describe a highly parallel formulation of the interior point algorithm. A key component of the interior point algorithm is the solution of a sparse system of linear equations using Cholesky factorization. The performance of parallel Cholesky factorization is determined by (a) the communication overhead incurred by the algorithm, and (b) the load imbalance among the processors. In our parallel interior point algorithm, we use our recently developed parallel multifrontal algorithm that has the smallest communication overhead over all parallel algorithms for Cholesky factorization developed to date. The computation imbalance depends on the shape of the elimination tree associated with the sparse system reordered for factorization. To balance the computation, we implemented and evaluated four di#erent ordering algorithms. Among these algorithms, KernighanLin and spectral nested dissection yield the most balanced elimination trees and greatly increase the amount of parallelism that can be exploited. Our preliminary implementation achieves a speedup as high as 108 on 256processor nCUBE 2 on moderatesize problems.
A high performance sparse Cholesky factorization algorithm for scalable parallel computers
 Department of Computer Science, University of Minnesota
, 1994
"... Abstract This paper presents a new parallel algorithm for sparse matrix factorization. This algorithm uses subforesttosubcube mapping instead of the subtreetosubcube mapping of another recently introduced scheme by Gupta and Kumar [13]. Asymptotically, both formulations are equally scalable on a ..."
Abstract

Cited by 13 (1 self)
 Add to MetaCart
Abstract This paper presents a new parallel algorithm for sparse matrix factorization. This algorithm uses subforesttosubcube mapping instead of the subtreetosubcube mapping of another recently introduced scheme by Gupta and Kumar [13]. Asymptotically, both formulations are equally scalable on a wide range of architectures and a wide variety of problems. But the subtreetosubcube mapping of the earlier formulation causes significant load imbalance among processors, limiting overall efficiency and speedup. The new mapping largely eliminates the load imbalance among processors. Furthermore, the algorithm has a number of enhancements to improve the overall performance substantially. This new algorithm achieves up to 6GFlops on a 256processor Cray T3D for moderately large problems. To our knowledge, this is the highest performance ever obtained on an MPP for sparse Cholesky factorization.
Analysis and Design of Scalable Parallel Algorithms for Scientific Computing
, 1995
"... This dissertation presents a methodology for understanding the performance and scalability of algorithms on parallel computers and the scalability analysis of a variety of numerical algorithms. We demonstrate the analytical power of this technique and show how it can guide the development of better ..."
Abstract

Cited by 8 (5 self)
 Add to MetaCart
This dissertation presents a methodology for understanding the performance and scalability of algorithms on parallel computers and the scalability analysis of a variety of numerical algorithms. We demonstrate the analytical power of this technique and show how it can guide the development of better parallel algorithms. We present some new highly scalable parallel algorithms for sparse matrix computations that were widely considered to be poorly suitable for large scale parallel computers. We present some laws governing the performance and scalability properties that apply to all parallel systems. We show that our results generalize or extend a range of earlier research results concerning the performance of parallel systems. Our scalability analysis of algorithms such as fast Fourier transform (FFT), dense matrix multiplication, sparse matrixvector multiplication, and the preconditioned conjugate gradient (PCG) provides many interesting insights into their behavior on parallel computer...
Twodimensional Block Partitionings for the Parallel Sparse Cholesky Factorization: the Fanin Method
, 1997
"... This paper presents a discussion on 2D block mappings for the sparse Cholesky factorization on parallel MIMD architectures with distributed memory. It introduces the fanin algorithm in a general manner and proposes several mapping strategies. The grid mapping with row balancing, inspired from Rothb ..."
Abstract

Cited by 4 (1 self)
 Add to MetaCart
This paper presents a discussion on 2D block mappings for the sparse Cholesky factorization on parallel MIMD architectures with distributed memory. It introduces the fanin algorithm in a general manner and proposes several mapping strategies. The grid mapping with row balancing, inspired from Rothberg's work [21, 22] proved to be more robust than the original fanout algorithm. Even more efficient is the proportional mapping, as show the experiments on a 32 processors IBM SP1 and on a Cray T3D. Subforesttosubcube mappings are also considered and give good results on the T3D.
A Scalable Parallel Algorithm for Sparse Cholesky Factorization
 In SuperComputing '94
"... In this paper, we describe a scalable parallel algorithm for sparse Cholesky factorization, analyze its performance and scalability, and present experimental results of its implementation on a 1024processor nCUBE2 parallel computer. Through our analysis and experimental results, we demonstrate that ..."
Abstract

Cited by 4 (0 self)
 Add to MetaCart
In this paper, we describe a scalable parallel algorithm for sparse Cholesky factorization, analyze its performance and scalability, and present experimental results of its implementation on a 1024processor nCUBE2 parallel computer. Through our analysis and experimental results, we demonstrate that our algorithm improves the state of the art in parallel direct solution of sparse linear systems by an order of magnitudeboth in terms of speedups and the number of processors that can be utilized effectively for a given problem size. This algorithm incurs strictly less communication overhead and is more scalable than any known parallel formulation of sparse matrix factorization. We show that our algorithm is optimally scalable on hypercube and mesh architectures and that its asymptotic scalability is the same as that of dense matrix factorization for a wide class of sparse linear systems, including those arising in all two and three dimensional finite element problems. 1 Introduction ...
WSMP: A HighPerformance Shared and DistributedMemory Parallel Sparse Linear Equation Solver
, 2001
"... The Watson Sparse Matrix Package, WSMP, is a highperformance, robust, and easy to use software package for solving large sparse systems of linear equations. It can be used as a serial package, or in a sharedmemory multiprocessor environment, or as a scalable parallel solver in a messagepassing en ..."
Abstract

Cited by 3 (1 self)
 Add to MetaCart
The Watson Sparse Matrix Package, WSMP, is a highperformance, robust, and easy to use software package for solving large sparse systems of linear equations. It can be used as a serial package, or in a sharedmemory multiprocessor environment, or as a scalable parallel solver in a messagepassing environment, where each node can either be a uniprocessor or a sharedmemory multiprocessor. A unique aspect of WSMP is that it exploits both SMP and MPP parallelism using Pthreads and MPI, respectively, while mostly shielding the user from the details of the architecture. Sparse symmetric factorization in WSMP has been clocked at up to 1.2 Gigaflops on RS6000 workstations with two 200 MHz Power3 CPUs and in excess of 90 Gigaflops on 128node (256processor) SP with twoway SMP 200 MHz Power3 nodes. This paper gives an overview of the algorithms, implementation aspects, performance results, and the user interface of WSMP for solving symmetric sparse systems of linear equations. Key words. Parallel software, Scientific computing, Sparse linear systems, Sparse matrix factorization, Highperformance computing 1.
Task Scheduling Using a Block Dependency DAG for BlockOriented Sparse Cholesky Factorization
 in: Proceedings of 14th ACM Symposium on Applied Computing
, 2000
"... Blockoriented sparse Cholesky factorization decomposes a sparse matrix into rectangular subblocks; each block can then be handled as a computational unit in order to increase data reuse in a hierarchical memory system. Also, the factorization method increases the degree of concurrency with the red ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
Blockoriented sparse Cholesky factorization decomposes a sparse matrix into rectangular subblocks; each block can then be handled as a computational unit in order to increase data reuse in a hierarchical memory system. Also, the factorization method increases the degree of concurrency with the reduction of communication volumes so that it performs more efficiently on a distributedmemory multiprocessor system than the customary columnoriented factorization method. But until now, mapping of blocks to processors has been designed for load balance with restricted communication patterns. In this paper, we represent tasks using a block dependency DAG that shows the execution behavior of block sparse Cholesky factorization in a distributedmemory system. Since the characteristics of tasks for the block Cholesky factorization are different from those of the conventional parallel task model, we propose a new task scheduling algorithm using a block dependency DAG. The proposed algorithm consi...
SCALABILITY OF SPARSE DIRECT SOLVERS "
"... Abstract. We shall say that a scalable algorithm achieves efficiency that is bounded away from zero as the number of processors and the problem size increase in such a way that the size of the data structures increases linearly with the number of processors. In this paper we show that the columnori ..."
Abstract
 Add to MetaCart
Abstract. We shall say that a scalable algorithm achieves efficiency that is bounded away from zero as the number of processors and the problem size increase in such a way that the size of the data structures increases linearly with the number of processors. In this paper we show that the columnoriented approach to sparse Cholesky for distributedmemory machines is not scalable. By considering message volume, node contention, and bisection width, one may obtain lower bounds on the time required for communication in a distributed algorithm. Applying this technique to distributed, columnoriented, full Cholesky leads to the conclusion that N (the order of the matrix) must scale with P (the number of processors) so that storage grows like p2. So the algorithm is not scalable. Identical conclusions have previously been obtained by consideration of communication and computation latency on the critical path in the algorithm; these results complement and reinforce that conclusion. For the sparse case, we have experimental measurements that make the same point: for columnoriented distributed methods, the number of gridpoints (which is O(N)) must grow as P _ in order to maintain parallel efficiency bounded above zero. Our sparse matrix results employ the &quot;fanin &quot; distributed scheme, implemented on machines with either a grid or a fattree interconnect using a subtreetosubmachine mapping of the columns.