Results 1 - 9 of 9
Highly scalable parallel algorithms for sparse matrix factorization
 IEEE Transactions on Parallel and Distributed Systems, 1994
Abstract
Cited by 116 (29 self)
In this paper, we describe a scalable parallel algorithm for sparse matrix factorization, analyze its performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithm substantially improves the state of the art in parallel direct solution of sparse linear systems, both in terms of scalability and overall performance. It is a well-known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithm to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that is asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithm incurs less communication overhead and is more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of our sparse Cholesky factorization algorithm delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge, ...
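The abstract contrasts the sparse case with dense matrix factorization, whose good parallel scalability is the baseline. As a point of reference only, here is a minimal serial, dense Cholesky sketch in Python; it is not the paper's parallel sparse algorithm, just the kernel that algorithm generalizes.

```python
# Minimal serial dense Cholesky factorization A = L * L^T for a
# symmetric positive definite matrix, given as a list of lists.
# Illustrative sketch only, not the paper's parallel sparse method.
import math

def cholesky(A):
    """Return lower-triangular L with A = L * L^T."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract squares of already-computed row entries.
        s = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(s)
        for i in range(j + 1, n):
            # Remaining entries of column j.
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

A = [[4.0, 2.0], [2.0, 3.0]]
L = cholesky(A)
# L = [[2.0, 0.0], [1.0, sqrt(2)]]
```

In the sparse setting most entries of A are zero, and the whole difficulty the paper addresses is exploiting that sparsity without losing the scalability of this dense kernel.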
Parallel Block-Diagonal-Bordered Sparse Linear Solvers for Electrical Power System Applications
, 1995
Abstract
Cited by 11 (3 self)
This thesis presents research into parallel linear solvers for block-diagonal-bordered sparse matrices. The block-diagonal-bordered form identifies parallelism that can be exploited for both direct and iterative linear solvers. We have developed efficient parallel block-diagonal-bordered sparse direct methods based on both LU factorization and Choleski factorization algorithms, and we have also developed a parallel block-diagonal-bordered sparse iterative method based on the Gauss-Seidel method. Parallel factorization algorithms for block-diagonal-bordered form matrices require a specialized ordering step coupled to an explicit load balancing step in order to generate this matrix form and to distribute the computational workload uniformly for an irregular matrix throughout a distributed-memory multiprocessor. Matrix orderings are performed using a diakoptic technique based on node-tearing nodal analysis. Parallel Gauss-Seidel algorithms for block-diagonal-bordered form matrices require a two-part matrix ordering technique: first to partition the matrix into block-diagonal-bordered form, again using the node-tearing diakoptic techniques, and then to multicolor the data in the last diagonal block using graph coloring techniques. The ordered matrices have extensive parallelism, while maintaining the strict precedence relationships in the Gauss-Seidel algorithm. Empirical ...
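For orientation, the base method the thesis parallelizes can be sketched as follows. This is plain serial Gauss-Seidel, not the block-diagonal-bordered or multicolored variant; it only shows the in-place update whose precedence relationships (new values for earlier unknowns, old values for later ones) the orderings described above must preserve.

```python
# Plain serial Gauss-Seidel iteration for Ax = b on a small dense
# matrix. Sketch for illustration; the thesis's contribution is the
# ordering/coloring that makes this update parallel, not shown here.
def gauss_seidel(A, b, iters=50):
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        for i in range(n):
            # x[j] for j < i already holds this sweep's new value;
            # this is the precedence the multicoloring must respect.
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
    return x

# Diagonally dominant example; the exact solution is x = [1, 1].
A = [[4.0, 1.0], [1.0, 3.0]]
b = [5.0, 4.0]
x = gauss_seidel(A, b)
```

Multicoloring makes rows of the same color mutually independent, so each color class can be updated concurrently while the color order preserves the serial method's precedence.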
Analysis and Design of Scalable Parallel Algorithms for Scientific Computing
, 1995
Abstract
Cited by 8 (5 self)
This dissertation presents a methodology for understanding the performance and scalability of algorithms on parallel computers, together with the scalability analysis of a variety of numerical algorithms. We demonstrate the analytical power of this technique and show how it can guide the development of better parallel algorithms. We present some new, highly scalable parallel algorithms for sparse matrix computations, which were widely considered ill-suited to large-scale parallel computers. We present some laws governing the performance and scalability properties that apply to all parallel systems. We show that our results generalize or extend a range of earlier research results concerning the performance of parallel systems. Our scalability analysis of algorithms such as the fast Fourier transform (FFT), dense matrix multiplication, sparse matrix-vector multiplication, and the preconditioned conjugate gradient (PCG) method provides many interesting insights into their behavior on parallel computer...
On Estimating the Useful Work Distribution of Parallel Programs under the P³T: A Static Performance Estimator
 Concurrency: Practice and Experience (Ed. Geoffrey Fox), 1996
Abstract
Cited by 5 (2 self)
In order to improve a parallel program's performance, it is critical to evaluate how evenly the work contained in the program is distributed over all processors dedicated to the computation. Traditional work distribution analysis is commonly performed at the machine level. The disadvantage of this method is that it cannot identify whether the processors are performing useful or redundant (replicated) work. This paper describes a novel method for statically estimating the useful work distribution of distributed-memory parallel programs at the program level, which carefully distinguishes between useful and redundant work. The amount of work contained in a parallel program, which correlates with the number of loop iterations to be executed by each processor, is estimated by accurately modeling loop iteration spaces, array access patterns, and data distributions. A cost function defines the useful work distribution of loops, procedures, and the entire program. Lower and upper bounds of the describ...
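The shape of such a cost function can be illustrated with a generic imbalance metric over per-processor useful-work counts. This is not the paper's actual P³T cost function (its definition is in the paper itself); the mean/max ratio used here is just a common, hypothetical choice that is 1.0 for a perfectly even distribution and approaches 0 as one processor dominates.

```python
# Generic work-distribution quality metric: mean work divided by max
# work across processors. Hypothetical stand-in, NOT the P3T cost
# function from the paper.
def work_distribution(work):
    """work: list of useful-work counts (e.g. loop iterations), one per processor."""
    if not work or max(work) == 0:
        return 1.0  # no work at all counts as perfectly balanced
    return sum(work) / (len(work) * max(work))

# Four processors, one doing twice the work of the others.
print(work_distribution([10, 10, 10, 20]))  # 50 / 80 = 0.625
```

Feeding such a metric with *useful* iteration counts rather than raw machine-level activity is exactly what lets the analysis ignore replicated work.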
A Scalable Parallel Algorithm for Sparse Cholesky Factorization
 In SuperComputing '94
Abstract
Cited by 4 (0 self)
In this paper, we describe a scalable parallel algorithm for sparse Cholesky factorization, analyze its performance and scalability, and present experimental results of its implementation on a 1024-processor nCUBE-2 parallel computer. Through our analysis and experimental results, we demonstrate that our algorithm improves the state of the art in parallel direct solution of sparse linear systems by an order of magnitude, both in terms of speedups and the number of processors that can be utilized effectively for a given problem size. This algorithm incurs strictly less communication overhead and is more scalable than any known parallel formulation of sparse matrix factorization. We show that our algorithm is optimally scalable on hypercube and mesh architectures and that its asymptotic scalability is the same as that of dense matrix factorization for a wide class of sparse linear systems, including those arising in all two- and three-dimensional finite element problems. ...
Software Support For Parallel Processing Of Irregular And Dynamic Computations
, 1996
Abstract
Cited by 3 (0 self)
Many real-world scientific computations are irregular and dynamic, which poses a great challenge to parallelization. In this thesis we study the efficient mapping of a subclass of these problems, namely the "stepwise slowly changing" problems, onto distributed-memory multiprocessors using the task graph scheduling approach. There exists a large class of applications which belong to this category. Intuitively, the irregularity requires sophisticated mapping algorithms, while the "slowness" of the changes in the computational structures between steps allows the scheduling cost to be amortized, justifying the approach. We study three representative and widely used applications: the N-body simulation in astrophysics, and the Vortex-Sheet Roll-Up and Contour Dynamics computations from Computational Fluid Dynamics. We sta...
On Using Volume Computation to Estimate the Work Distribution for Parallel Programs
, 1995
Abstract
Cited by 1 (1 self)
In this paper we describe a performance parameter which models the work contained in a parallel program and the corresponding work distribution. The work distribution is modeled at the program level, which carefully distinguishes between useful and redundant work. We achieve high accuracy due to aggressive exploitation of compiler knowledge such as loop iteration spaces, array access patterns, and data distributions. The underlying algorithm is based on the intersection and volume computation of n-dimensional linear and convex polytopes. The performance parameter has been implemented under the P³T, which is a static, parameter-based performance prediction tool under the Vienna Fortran Compilation System (VFCS). In order to parallelize scientific applications for distributed-memory systems such as the iPSC/860 hypercube, Meiko CS-2, Intel Paragon, CM-5, and the Delta Touchstone, the programmer commonly decomposes the physical domain of the application - represented by a...
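The idea behind the volume computation can be shown in two dimensions (the paper's algorithm handles general n-dimensional linear and convex polytopes; this toy version is not it). A triangular loop nest `for i in range(n): for j in range(i + 1): ...` has the iteration space {(i, j) : 0 <= j <= i < n}, a triangle whose lattice-point count a static tool can obtain in closed form instead of executing the loop.

```python
# Counting points of a 2-D triangular iteration space: brute force vs.
# the closed form n*(n+1)/2 that a static estimator would use.
# Toy illustration of the volume-computation idea, not the paper's
# n-dimensional polytope algorithm.
def count_brute_force(n):
    return sum(1 for i in range(n) for j in range(i + 1))

def count_closed_form(n):
    return n * (n + 1) // 2

for n in (1, 5, 100):
    assert count_brute_force(n) == count_closed_form(n)
print(count_closed_form(100))  # 5050
```

Intersecting such polytopes with the index sets a data distribution assigns to each processor is what yields per-processor iteration counts at compile time.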
Parallel Direct Methods for Block-Diagonal-Bordered Sparse Matrices
, 1994
Abstract
Cited by 1 (1 self)
This paper presents research into parallel direct methods for block-diagonal-bordered sparse matrices: LU factorization and Choleski factorization algorithms developed with special consideration for irregular sparse matrices from the electrical power systems community. Direct block-diagonal-bordered sparse linear solvers exhibit distinct advantages when compared to general direct parallel sparse algorithms for irregular matrices. Task assignments for numerical factorization on distributed-memory multiprocessors depend only on the assignment of data to blocks, and data communications are significantly reduced, with uniform and structured communications. Factorization algorithms for block-diagonal-bordered form matrices require a specialized ordering step coupled to an explicit load balancing step in order to generate this matrix form and to uniformly distribute the computational workload for an irregular matrix throughout a distributed-memory multiprocessor. This ordering relates to m...
P³T+: A Performance Estimator for Distributed and Parallel Systems
, 2000
Abstract
Device Interface. The layered architecture has allowed research organizations and commercial vendors to port MPICH to a great variety of multiprocessor and multicomputer platforms and distributed environments. P³T+ calculates at compile time a set of performance parameters which reflect the quality of the chosen parallelization strategy, based on the following data obtained from a single profile run of SCALA [31], a post-execution performance analysis tool developed at the Institute of Software Technology and Parallel Systems in Vienna:
• Statement execution counts (how many times has a statement been executed during the run of the program?)
• Loop iteration counts (what is the average number of iterations of a specific loop throughout the execution of the program?)
• Branching probabilities for conditional statements (how many times was a specific condition evaluated to TRUE throughout the execution of the program?)
In the following we des...
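Each of the profile quantities listed above reduces to a simple ratio over counters; for instance, a branching probability is the number of times a condition held divided by the number of times it was evaluated. A minimal sketch, with a hypothetical function name of our own (the snippet does not give P³T+'s internal representation):

```python
# Branching probability from profile counters: fraction of evaluations
# of a condition in which it was TRUE. Hypothetical helper, for
# illustration only; not P3T+'s actual data model.
def branching_probability(evaluations, taken):
    """evaluations: times the condition was evaluated; taken: times it was TRUE."""
    return taken / evaluations if evaluations else 0.0

# A condition evaluated 200 times and true 50 times.
print(branching_probability(200, 50))  # 0.25
```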