## Improved load distribution in parallel sparse Cholesky factorization (1994)

Venue: | In Proc. of Supercomputing'94 |

Citations: | 40 - 1 self |

### BibTeX

@INPROCEEDINGS{Schreiber94improvedload,

author = {Robert Schreiber and Edward Rothberg},

title = {Improved load distribution in parallel sparse Cholesky factorization},

booktitle = {In Proc. of Supercomputing'94},

year = {1994},

pages = {783--792}

}

### Abstract

Compared to the customary column-oriented approaches, block-oriented, distributed-memory sparse Cholesky factorization benefits from an asymptotic reduction in interprocessor communication volume and an asymptotic increase in the amount of concurrency that is exposed in the problem. Unfortunately, block-oriented approaches (specifically, the block fan-out method) have suffered from poor balance of the computational load. As a result, achieved performance can be quite low. This paper investigates the reasons for this load imbalance and proposes simple block mapping heuristics that dramatically improve it. The result is a roughly 20% increase in realized parallel factorization performance, as demonstrated by performance results from an Intel Paragon™ system. We have achieved performance of nearly 3.2 billion floating point operations per second with this technique on a 196-node Paragon system.
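
To make the load-balance issue concrete, the sketch below measures the balance of a 2-D block mapping as total work divided by (number of processors times the largest per-processor work) and compares a plain cyclic mapping with a greedy remapping of block rows and columns. The abstract does not spell out the paper's heuristics, so `greedy_map`, the toy work model in `toy_work`, and every function name here are illustrative assumptions, not the authors' method.

```python
from collections import defaultdict


def toy_work(n):
    """Hypothetical work model: block (i, j) of a dense lower triangle costs j + 1."""
    return {(i, j): float(j + 1) for i in range(n) for j in range(i + 1)}


def cyclic_map(n, p):
    """Map index k to processor row/column k mod p."""
    return [k % p for k in range(n)]


def greedy_map(weights, p):
    """Assign indices, heaviest first, to the currently least-loaded of p bins.

    This is a generic greedy heuristic used for illustration only.
    """
    bins = [0.0] * p
    mapping = [0] * len(weights)
    for k in sorted(range(len(weights)), key=lambda k: -weights[k]):
        b = min(range(p), key=bins.__getitem__)
        mapping[k] = b
        bins[b] += weights[k]
    return mapping


def load_balance(work, row_map, col_map, pr, pc):
    """Return total work / (pr * pc * max per-processor work); 1.0 is perfect."""
    load = defaultdict(float)
    for (i, j), w in work.items():
        # Block (i, j) lives on the processor at grid position (row_map[i], col_map[j]).
        load[(row_map[i], col_map[j])] += w
    return sum(work.values()) / (pr * pc * max(load.values()))
```

With `toy_work(8)` on a 2 × 2 processor grid, the cyclic mapping yields a balance of 0.75: one processor carries 40 of the 120 work units while the average is 30. Passing per-row and per-column work totals to `greedy_map` gives an alternative mapping to compare against.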

### Citations

11244 | Computers and Intractability, a Guide to the Theory of NP-Completeness - Garey, Johnson - 1979 |

517 |
Computer Solution of Large Sparse Positive Definite Systems
- George, Liu
- 1981
(Show Context)
Citation Context ...hird step is the numerical factorization, in which the actual non-zero values in L are computed. This step is by far the most time-consuming, and it is the focus of this paper. We refer the reader to [11] for more information on these steps. The following pseudo-code performs the numerical factorization step: 1. $L := A$ 2. for $k = 1$ to $n$ do 3. $L_{kk} := \sqrt{L_{kk}}$ 4. for $i = k + 1$ to $n$ do 5. $L_{ik} := L_{ik} / L$ ... |

287 |
Sparse matrix test problems
- Duff, Grimes, et al.
- 1989
(Show Context)
Citation Context ...ices (DENSE1024 and DENSE2048), two 2-D grid problems (GRID150 and GRID300), two 3-D grid problems (CUBE30 and CUBE35), and 4 irregular sparse matrices from the Harwell-Boeing sparse matrix test set [4]. The 2-D and 3-D grid matrices are pre-ordered using nested dissection, which gives asymptotically optimal orderings for these problems. The Harwell-Boeing matrices are pre-ordered using multiple mi... |

168 | The role of elimination trees in sparse factorization - Liu - 1990 |

162 | ScaLAPACK: A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers - Choi, Dongarra, et al. - 1992 |

144 |
Modification of the minimum degree algorithm by multiple elimination
- Liu
- 1985
(Show Context)
Citation Context ... grid matrices are pre-ordered using nested dissection [11], which gives asymptotically optimal orderings for these problems. The Harwell-Boeing matrices are pre-ordered using multiple minimum degree [15], which is considered the best for most irregular sparse matrices with respect to sequential operation count and fill. Note that the floating-point operation counts listed in the table are from the be... |

55 | An efficient block-oriented approach to parallel sparse Cholesky factorization - Rothberg, Gupta - 1993 |

51 |
A new implementation of sparse Gaussian elimination
- Schreiber
- 1982
(Show Context)
Citation Context ...he following parent relationship: $\mathrm{parent}(j) = \min\{\, i \mid l_{ij} \neq 0,\ i > j \,\}$. A column of L is only modified by descendant columns in the elimination tree; equivalently, a column only modifies its ancestors [16, 20]. 2.3 Parallel Block-Oriented Sparse Cholesky Factorization We now discuss parallel block-oriented sparse factorization. As mentioned earlier, the approach we use is the block fan-out method. We give ... |

49 |
Task Scheduling for Parallel Sparse Cholesky Factorization
- Geist, Ng
- 1990
(Show Context)
Citation Context ...non-zeroes in the domain portion are still stored as a set of blocks). The root portion of the matrix is mapped to processors using a 2-D mapping. Details on the use of domains are provided elsewhere [1, 9, 14, 18]. The main advantage of using domains is that they significantly reduce interprocessor communication volumes. 2.4 Block Mappings A crucial issue in any block factorization is the mapping of blocks to ... |

45 |
Progress in sparse matrix methods for large linear systems on vector supercomputers
- Ashcraft, Grimes, et al.
- 1987
(Show Context)
Citation Context ...s only blocks in block row I or block column I. 2.2 Supernodes Before discussing parallel sparse factorization, we must first discuss an important concept in sparse factorization, that of a supernode [2]. A supernode is a set of adjacent columns in the factor L whose non-zero structure consists of a dense lower-triangular block on the diagonal, and an identical set of non-zeroes for each column below... |

43 | Highly parallel sparse Cholesky factorization - Gilbert, Schreiber - 1992 |

42 |
The influence of relaxed supernode partitions on the multifrontal method
- Ashcraft, Grimes
- 1989
(Show Context)
Citation Context .... This regular structure allows the block factorization primitives (BFAC, BDIV, and BMOD) to be implemented efficiently. This regularity can be further increased by performing supernode amalgamation [2] on the factor matrix. Amalgamation is a heuristic that merges supernodes with very similar non-zero structures into larger supernodes. We use amalgamation for all results presented in this paper. The... |

40 | Exploiting the memory hierarchy in sequential and parallel sparse Cholesky factorization - Rothberg - 1993 |

34 | Communication results for parallel sparse Cholesky factorization on a hypercube - George, Liu, et al. - 1989 |

31 | Scalability of sparse direct solvers - Schreiber - 1992 |

22 | Solution of sparse positive definite systems on a hypercube - George, Heath, et al. - 1988 |

14 |
A comparison of three column-based distributed sparse factorization schemes
- Ashcraft, Eisenstat, et al.
- 1990
(Show Context)
Citation Context ...non-zeroes in the domain portion are still stored as a set of blocks). The root portion of the matrix is mapped to processors using a 2-D mapping. Details on the use of domains are provided elsewhere [1, 9, 14, 18]. The main advantage of using domains is that they significantly reduce interprocessor communication volumes. 2.4 Block Mappings A crucial issue in any block factorization is the mapping of blocks to ... |

14 |
Limiting Communication in Parallel Sparse Cholesky Factorization
- Hulbert, Zmijewski
- 1991
(Show Context)
Citation Context ...non-zeroes in the domain portion are still stored as a set of blocks). The root portion of the matrix is mapped to processors using a 2-D mapping. Details on the use of domains are provided elsewhere [1, 9, 14, 18]. The main advantage of using domains is that they significantly reduce interprocessor communication volumes. 2.4 Block Mappings A crucial issue in any block factorization is the mapping of blocks to ... |

13 | An evaluation of left-looking, right-looking and multifrontal approaches to sparse Cholesky factorization on hierarchical-memory machines - Rothberg, Gupta - 1991 |

6 | Massively parallel LINPACK benchmark on the Intel Touchstone Delta and iPSC/860 systems
- Geijn
- 1991
(Show Context)
Citation Context ...tigation is whether the sparse factorization approach proposed here may actually provide higher performance for dense problems than is currently obtained by specialized dense factorization methods [22] that use cyclic mappings. Dense matrix methods avoid load imbalance by dividing the matrix into extremely small blocks. The blocks mapped to a processor are then grouped into long, narrow block-colum... |

5 |
parallel sparse LU factorization
- Conroy, Kratzer, et al.
- 1994
(Show Context)
Citation Context ...e interconnection network whose topology is not a grid. Several researchers have obtained excellent performance using a block-oriented approach, both on fine-grained, massively-parallel SIMD machines [3] and on coarse-grained, highly-parallel MIMD machines [12]. To define a 2-D block mapping, one must specify the mappings of matrix (block) rows to processor rows and of columns to processor columns. Wh... |

1 | The influence of relaxed supernode partitions on the multifrontal method - Ashcraft, Grimes - 1989 |

1 |
private communication
- Conroy, Kratzer, et al.
(Show Context)
Citation Context ... interconnection networks whose topology is not a grid. Several researchers have obtained excellent performance using a block-oriented approach, both on fine-grained, massively-parallel SIMD machines [5] and on coarse-grained, highly-parallel MIMD machines [18]. To define a 2-D block mapping, one must specify the mappings of matrix (block) rows to processor rows and of columns to processor columns. W... |
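
The column-oriented factorization loop quoted (in truncated form) in the George and Liu citation context, and the elimination-tree parent relation quoted in the Schreiber (1982) context, can be sketched for a small dense matrix as follows. The quoted pseudo-code breaks off after step 5, so the trailing update loop below is the standard textbook right-looking completion, which is an assumption about the elided text. The function names `cholesky` and `etree_parent` are hypothetical, and a real sparse code would operate on compressed column structures rather than dense lists.

```python
import math


def cholesky(a):
    """Right-looking Cholesky of a symmetric positive definite matrix (list of lists)."""
    n = len(a)
    # Step 1: L := A (only the lower triangle is kept).
    l = [[float(a[i][j]) if j <= i else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n):                      # step 2
        l[k][k] = math.sqrt(l[k][k])        # step 3
        for i in range(k + 1, n):           # step 4
            l[i][k] /= l[k][k]              # step 5
        # Assumed completion: column k updates the remaining lower triangle.
        for j in range(k + 1, n):
            for i in range(j, n):
                l[i][j] -= l[i][k] * l[j][k]
    return l


def etree_parent(l):
    """parent(j) = min{ i > j : l[i][j] != 0 }, or None for a root column."""
    n = len(l)
    parent = [None] * n
    for j in range(n):
        for i in range(j + 1, n):
            if l[i][j] != 0.0:
                parent[j] = i
                break
    return parent
```

For the tridiagonal matrix `[[4, 2, 0], [2, 5, 2], [0, 2, 5]]`, `cholesky` returns `[[2, 0, 0], [1, 2, 0], [0, 1, 2]]` and `etree_parent` returns `[1, 2, None]`, a chain in which each column modifies only its ancestors.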