A supernodal approach to sparse partial pivoting
- SIAM Journal on Matrix Analysis and Applications
, 1999
Cited by 158 (20 self)
We investigate several ways to improve the performance of sparse LU factorization with partial pivoting, as used to solve unsymmetric linear systems. To perform most of the numerical computation in dense matrix kernels, we introduce the notion of unsymmetric supernodes. To better exploit the memory hierarchy, we introduce unsymmetric supernode-panel updates and two-dimensional data partitioning. To speed up symbolic factorization, we use Gilbert and Peierls's depth-first search with Eisenstat and Liu's symmetric structural reductions. We have implemented a sparse LU code using all these ideas. We present experiments demonstrating that it is significantly faster than earlier partial pivoting codes. We also compare performance with Umfpack, which uses a multifrontal approach; our code is usually faster.
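The depth-first search mentioned in this abstract can be illustrated with a small sketch. It is not the paper's code: the column-list representation `L_cols` and the function name `reach` are assumptions made here for illustration. The sketch computes which entries of x can be nonzero when solving Lx = b with a sparse right-hand side, by finding every column reachable from the nonzeros of b in the graph of L, and returns them in an order in which the columns can be processed.

```python
def reach(L_cols, b_nonzeros):
    """Symbolic step of a sparse lower-triangular solve L x = b.

    L_cols[j] lists the row indices i > j with L[i][j] != 0
    (a hypothetical representation chosen for this sketch).
    Returns the indices of the possibly nonzero entries of x in
    topological order (reverse DFS postorder).
    """
    visited = set()
    postorder = []
    for start in b_nonzeros:
        if start in visited:
            continue
        visited.add(start)
        # Explicit stack of (node, iterator over its out-neighbours).
        stack = [(start, iter(L_cols[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(L_cols[nxt])))
                    break
            else:
                # All neighbours explored: node is finished.
                postorder.append(node)
                stack.pop()
    return postorder[::-1]


# Example pattern: L has off-diagonal nonzeros at (2,0), (3,1), (3,2).
L_cols = [[2], [3], [3], []]
pattern = reach(L_cols, [0])
```

The cost of the search is proportional to the number of edges visited, not to the matrix dimension, which is why a search of this kind is attractive for symbolic factorization.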
Highly scalable parallel algorithms for sparse matrix factorization
- IEEE Transactions on Parallel and Distributed Systems
, 1994
Cited by 100 (29 self)
In this paper, we describe a scalable parallel algorithm for sparse matrix factorization, analyze its performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithm substantially improves the state of the art in parallel direct solution of sparse linear systems, both in terms of scalability and overall performance. It is a well-known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithm to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that is asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithm incurs less communication overhead and is more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of our sparse Cholesky factorization algorithm delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge,
Tuning the Performance of I/O-Intensive Parallel Applications
, 1996
Cited by 70 (24 self)
Getting good I/O performance from parallel programs is a critical problem for many application domains. In this paper, we report our experience tuning the I/O performance of four application programs from the areas of satellite-data processing and linear algebra. After tuning, three of the four applications achieve application-level I/O rates of over 100 MB/s on 16 processors. The total volume of I/O required by the programs ranged from about 75 MB to over 200 GB. We report the lessons learned in achieving high I/O performance from these applications, including the need for code restructuring, local disks on every node and knowledge of future I/O requests. We also report our experience on achieving high performance on peer-to-peer configurations. Finally, we comment on the necessity of complex I/O interfaces like collective I/O and strided requests to achieve high performance. 1 Introduction I/O has been identified as one of the major obstacles to achieving high performance from para...
Data Locality and Load Balancing in COOL
, 1993
Cited by 43 (2 self)
Large-scale shared memory multiprocessors typically support a multilevel memory hierarchy consisting of per-processor caches, a local portion of shared memory, and remote shared memory. On such machines, the performance of parallel programs is often limited by the high latency of remote memory references. In this paper we explore how knowledge of the underlying memory hierarchy can be used to schedule computation and distribute data structures, and thereby improve data locality. Our study is done in the context of COOL, a concurrent object-oriented language developed at Stanford. We develop abstractions for the programmer to supply optional information about the data reference patterns of the program. This information is used by the runtime system to distribute tasks and objects so that the tasks execute close (in the memory hierarchy) to the objects they reference.
Improved load distribution in parallel sparse Cholesky factorization
- In Proc. of Supercomputing'94
, 1994
Cited by 38 (1 self)
Compared to the customary column-oriented approaches, block-oriented, distributed-memory sparse Cholesky factorization benefits from an asymptotic reduction in interprocessor communication volume and an asymptotic increase in the amount of concurrency that is exposed in the problem. Unfortunately, block-oriented approaches (specifically, the block fan-out method) have suffered from poor balance of the computational load. As a result, achieved performance can be quite low. This paper investigates the reasons for this load imbalance and proposes simple block mapping heuristics that dramatically improve it. The result is a roughly 20% increase in realized parallel factorization performance, as demonstrated by performance results from an Intel Paragon system. We have achieved performance of nearly 3.2 billion floating point operations per second with this technique on a 196-node Paragon system.
Making Sparse Gaussian Elimination Scalable by Static Pivoting
- In Proceedings of Supercomputing
, 1998
Cited by 27 (7 self)
We propose several techniques as alternatives to partial pivoting to stabilize sparse Gaussian elimination. From numerical experiments we demonstrate that for a wide range of problems the new method is as stable as partial pivoting. The main advantage of the new method over partial pivoting is that it permits a priori determination of data structures and communication pattern for Gaussian elimination, which makes it more scalable on distributed memory machines. Based on this a priori knowledge, we design highly parallel algorithms for both sparse Gaussian elimination and triangular solve and we show that they are suitable for large-scale distributed memory machines. Keywords: sparse unsymmetric linear systems, static pivoting, iterative refinement, MPI, 2-D matrix decomposition. 1 Introduction In our earlier work [8, 9, 22], we developed new algorithms to solve unsymmetric sparse linear systems using Gaussian elimination with partial pivoting (GEPP). The new algorithms are hi...
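The combination of static pivoting and iterative refinement named in the keywords can be sketched on a small dense example. This is an illustration under stated assumptions, not the authors' algorithm: the function names are hypothetical, the matrix is dense, and the rule of replacing any pivot smaller than sqrt(eps) times the matrix norm is an assumption of this sketch. The point it shows is that the factorization never swaps rows, so its data structures are fixed in advance, and a few refinement steps recover the accuracy lost by perturbing a pivot.

```python
import math


def lu_static_pivot(A, tiny):
    """LU factorization with no row swaps (L and U stored in one matrix).

    Any pivot with magnitude below `tiny` is replaced by +/- tiny
    (an assumed rule for this sketch) instead of searching for a
    larger pivot in the column.
    """
    n = len(A)
    LU = [row[:] for row in A]
    for k in range(n):
        if abs(LU[k][k]) < tiny:
            LU[k][k] = tiny if LU[k][k] >= 0 else -tiny
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]
    return LU


def lu_solve(LU, b):
    """Forward substitution with unit-diagonal L, then back substitution with U."""
    n = len(LU)
    x = b[:]
    for i in range(n):
        for j in range(i):
            x[i] -= LU[i][j] * x[j]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            x[i] -= LU[i][j] * x[j]
        x[i] /= LU[i][i]
    return x


def solve_refined(A, b, steps=5):
    """Solve A x = b using the perturbed factors plus iterative refinement."""
    n = len(A)
    norm = max(sum(abs(v) for v in row) for row in A)  # infinity norm
    tiny = math.sqrt(2.0 ** -52) * norm
    LU = lu_static_pivot(A, tiny)
    x = lu_solve(LU, b)
    for _ in range(steps):
        # Residual uses the original A, correction uses the perturbed factors.
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        d = lu_solve(LU, r)
        x = [x[i] + d[i] for i in range(n)]
    return x


# A zero leading pivot would break LU without pivoting; here it is perturbed.
x = solve_refined([[0.0, 1.0], [1.0, 1.0]], [1.0, 2.0])
```

The fixed number of refinement steps is a simplification; a practical code would stop when the residual stops decreasing.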
A Parallel Formulation of Interior Point Algorithms
- Department of Computer Science, University of Minnesota
, 1994
Cited by 16 (9 self)
In recent years, interior point algorithms have been used successfully for solving medium to large-size linear programming (LP) problems. In this paper we describe a highly parallel formulation of the interior point algorithm. A key component of the interior point algorithm is the solution of a sparse system of linear equations using Cholesky factorization. The performance of parallel Cholesky factorization is determined by (a) the communication overhead incurred by the algorithm, and (b) the load imbalance among the processors. In our parallel interior point algorithm, we use our recently developed parallel multifrontal algorithm that has the smallest communication overhead over all parallel algorithms for Cholesky factorization developed to date. The computation imbalance depends on the shape of the elimination tree associated with the sparse system reordered for factorization. To balance the computation, we implemented and evaluated four different ordering algorithms. Among these algorithms, Kernighan-Lin and spectral nested dissection yield the most balanced elimination trees and greatly increase the amount of parallelism that can be exploited. Our preliminary implementation achieves a speedup as high as 108 on a 256-processor nCUBE 2 on moderate-size problems.
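To make the remark about the shape of the elimination tree concrete, the sketch below computes the tree for a symmetric sparsity pattern and compares its height under two patterns. It uses a standard ancestor-compression construction rather than anything taken from the paper, and the input format `upper_cols` and both function names are assumptions made for this illustration. A tall, chain-like tree offers little parallelism, while a short, bushy one lets independent subtrees be factored concurrently.

```python
def elimination_tree(upper_cols):
    """Elimination tree of a symmetric sparsity pattern.

    upper_cols[j] lists the row indices i < j with A[i][j] != 0
    (strict upper triangle; a hypothetical input format).
    Returns parent[], with -1 marking a root.
    """
    n = len(upper_cols)
    parent = [-1] * n
    ancestor = [-1] * n  # path-compressed shortcut towards the current root
    for j in range(n):
        for i in upper_cols[j]:
            # Climb from i towards its root, redirecting shortcuts to j.
            while i != -1 and i < j:
                nxt = ancestor[i]
                ancestor[i] = j
                if nxt == -1:
                    parent[i] = j  # i was a root; j becomes its parent
                i = nxt
    return parent


def tree_height(parent):
    """Longest root-to-leaf path, using the fact that parent[i] > i."""
    depth = [0] * len(parent)
    for i in reversed(range(len(parent))):
        if parent[i] != -1:
            depth[i] = depth[parent[i]] + 1
    return max(depth, default=0)


# Tridiagonal pattern: the tree is a chain, so elimination is sequential.
chain = elimination_tree([[], [0], [1], [2]])
# Arrow pattern (dense last column): a star, so columns 0-2 are independent.
star = elimination_tree([[], [], [], [0, 1, 2]])
```

Reordering the matrix changes `upper_cols` and therefore the tree, which is the lever the ordering algorithms in the abstract act on.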
A high performance sparse Cholesky factorization algorithm for scalable parallel computers
- Department of Computer Science, University of Minnesota
, 1994
Cited by 12 (1 self)
This paper presents a new parallel algorithm for sparse matrix factorization. This algorithm uses subforest-to-subcube mapping instead of the subtree-to-subcube mapping of another recently introduced scheme by Gupta and Kumar [13]. Asymptotically, both formulations are equally scalable on a wide range of architectures and a wide variety of problems. But the subtree-to-subcube mapping of the earlier formulation causes significant load imbalance among processors, limiting overall efficiency and speedup. The new mapping largely eliminates the load imbalance among processors. Furthermore, the algorithm has a number of enhancements to improve the overall performance substantially. This new algorithm achieves up to 6 GFlops on a 256-processor Cray T3D for moderately large problems. To our knowledge, this is the highest performance ever obtained on an MPP for sparse Cholesky factorization.
A Mapping and Scheduling Algorithm for Parallel Sparse Fan-In Numerical Factorization
- In EuroPar'99 Parallel Processing, Lecture Notes in Computer Science
, 2000
Cited by 11 (5 self)
We present and analyze a general algorithm which computes efficient static schedulings of block computations for parallel sparse linear factorization. Our solver, based on a supernodal fan-in approach, is fully driven by this scheduling. We give an overview of the algorithms and present performance results on a 16-node IBM-SP2 with 66 MHz Power2 thin nodes for a collection of grid and irregular problems. This work is supported by the Commissariat à l'Énergie Atomique CEA/CESTA under contract No. 7V1555AC, and by the GDR ARP (iHPerf group) of the CNRS. 1 Introduction Solving large sparse symmetric positive definite systems Ax = b of linear equations is a crucial and time-consuming step, arising in many scientific and engineering applications. Consequently, many parallel formulations for sparse matrix factorization have been studied and implemented; one can refer to [6] for a complete survey on high performance sparse factorization. In this paper, we focus on the block par...
PaStiX: A Parallel Sparse Direct Solver Based on a Static Scheduling for Mixed 1D/2D Block Distributions
- In Proceedings of Irregular'2000, Cancun, Mexique, number 1800 in Lecture Notes in Computer Science
, 2000
Cited by 10 (4 self)
We present and analyze a general algorithm which computes an efficient static scheduling of block computations for a parallel $LDL^T$ factorization of sparse symmetric positive definite systems based on a combination of 1D and 2D block distributions. Our solver uses a supernodal fan-in approach and is fully driven by this scheduling. We give an overview of the algorithm and present performance results and comparisons with PSPASES on an IBM-SP2 with 120 MHz Power2SC nodes for a collection of irregular problems. This work is supported by the Commissariat à l'Énergie Atomique CEA/CESTA under contract No. 7V1555AC, and by the GDR ARP (iHPerf group) of the CNRS. 1 Introduction Solving large sparse symmetric positive definite systems Ax = b of linear equations is a crucial and time-consuming step, arising in many scientific and engineering applications. Consequently, many parallel formulations for sparse matrix factorization have been studied and implemented; one can refer t...

