## The Data-Distribution-Independent Approach to Scalable Parallel Libraries (1995)

Citations: 13 (1 self)

### BibTeX

@TECHREPORT{Bangalore95thedata-distribution-independent,
  author      = {Purushotham V. Bangalore},
  title       = {The Data-Distribution-Independent Approach to Scalable Parallel Libraries},
  institution = {},
  year        = {1995}
}

### Citations

1077 |
Using MPI: Portable Parallel Programming with the Message-Passing Interface; 2nd ed
- Gropp, Lusk, et al.
- 1999
(Show Context)
Citation Context ... the distributed data structure for the level-3 LU factorization is described. All the notations used are based on the notations and data structures used in (Skjellum 1990; Skjellum and Baldwin 1991; Gropp et al. 1994), which in turn follow van de Velde (van de Velde 1990). Virtual Grids The virtual grid topology is a canonical format for vectors and matrices in multicomputer algorithms (Fox et al. 1988). It descr... |

760 |
A set of level 3 basic linear algebra subprograms
- Dongarra, Croz, et al.
- 1990
(Show Context)
Citation Context ...ank-1 and rank-2 updates, and solution of triangular equations. These operations involve O(mn) floating-point operations and data movement, where m and n are the dimensions of the matrix. Level-3 BLAS (Dongarra et al. 1990) involves O(n^3) floating-point operations but requires only O(n^2) data movement, and is used to perform matrix-matrix operations like matrix-matrix multiply, rank-k updates of a symmetric matrix, and ... |
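The flop-versus-data counts quoted in these excerpts (O(n) for Level-1, O(n^2) against O(n^2) for Level-2, O(n^3) against O(n^2) for Level-3) can be tabulated in a short sketch. This is a hypothetical illustration, not code from the report; the function name `blas_ratios` and the representative operations chosen (dot product, matrix-vector multiply, matrix-matrix multiply) are assumptions.

```python
# Hypothetical illustration (not code from the report): flop count versus
# data movement for representative operations at each BLAS level, on
# n-vectors and n-by-n matrices. A higher flops-per-data ratio means the
# routine can better exploit a memory hierarchy.

def blas_ratios(n):
    """Return {level: (flops, data, flops / data)} for sample operations."""
    samples = {
        1: (2 * n, 2 * n),           # Level-1, dot product: O(n) flops, O(n) data
        2: (2 * n * n, n * n),       # Level-2, matrix-vector multiply: O(n^2) both
        3: (2 * n ** 3, 3 * n * n),  # Level-3, matrix-matrix multiply: O(n^3) vs O(n^2)
    }
    return {lv: (f, d, f / d) for lv, (f, d) in samples.items()}
```

For n = 100, this gives roughly 67 flops per element moved at Level-3, versus 2 at Level-2 and 1 at Level-1, which is why blocked, level-3 formulations dominate on machines with deep memory hierarchies.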

559 |
Basic linear algebra subprograms for Fortran usage
- Lawson, Hanson, et al.
- 1979
(Show Context)
Citation Context ... Hence they also provide portability, robustness, and readability. There are different levels of BLAS based on the amount of data used for an operation and its computational complexity. Level-1 BLAS (Lawson et al. 1979) perform vector-vector operations; these operations involve O(n) floating-point operations and O(n) data accesses. Level-2 BLAS (Dongarra et al. 1988) perform matrix-vector operations like matrix-vect... |

459 | An extended set of FORTRAN basic linear algebra subprograms
- Dongarra, Croz, et al.
- 1988
(Show Context)
Citation Context ...h a memory hierarchy. This ratio of the number of floating-point operations per memory reference is often called the surface-to-volume ratio (Fox et al. 1988). The Basic Linear Algebra Subprograms (BLAS) (Dongarra et al. 1988) provide a standardized and often optimized set of routines for vector and matrix operations for vector and low-concurrency shared-memory machines. The BLAS are used to obtain better performance on m... |

339 | Performance of various computers using standard linear equation software
- Dongarra
- 1993
(Show Context)
Citation Context ...ctorization algorithms on vector and parallel architectures has been extensively studied (Robert 1990) and its performance is measured by a number of vendors as part of their benchmarking activities (Dongarra 1994). LU factorization with partial row pivoting is the decomposition of a general matrix A into L and U where PA = LU and L is a lower triangular matrix, U is an upper triangular matrix, and P is the pe... |
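The decomposition PA = LU described in this excerpt can be sketched in a few lines of plain Python. This is a hypothetical, unblocked illustration for small matrices, not the thesis's parallel, blocked algorithm; the function names `lu_partial_pivot` and `matmul` are assumptions.

```python
# Hypothetical illustration (not the thesis's parallel algorithm): unblocked
# LU factorization with partial row pivoting on a small dense matrix, giving
# P A = L U with L unit lower triangular, U upper triangular, P a permutation.

def lu_partial_pivot(a):
    """Return (p, l, u) as lists of lists such that P A = L U."""
    n = len(a)
    u = [row[:] for row in a]      # working copy; overwritten with U
    l = [[0.0] * n for _ in range(n)]
    perm = list(range(n))          # row permutation defining P
    for k in range(n):
        # Partial pivoting: pick the largest |entry| in column k at or below row k.
        pivot = max(range(k, n), key=lambda i: abs(u[i][k]))
        u[k], u[pivot] = u[pivot], u[k]
        l[k], l[pivot] = l[pivot], l[k]              # carry computed multipliers along
        perm[k], perm[pivot] = perm[pivot], perm[k]
        l[k][k] = 1.0
        for i in range(k + 1, n):
            m = u[i][k] / u[k][k]                    # elimination multiplier, stored in L
            l[i][k] = m
            for j in range(k, n):
                u[i][j] -= m * u[k][j]
    # Permutation matrix P: row i of P A is row perm[i] of A.
    p = [[1.0 if j == perm[i] else 0.0 for j in range(n)] for i in range(n)]
    return p, l, u

def matmul(x, y):
    """Naive dense matrix product, for checking P A = L U."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

Pivoting on the largest-magnitude column entry keeps the multipliers bounded by 1 in magnitude, which is what makes the factorization numerically stable in practice.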

162 | ScaLAPACK: A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers - Choi, Dongarra, et al. - 1992 |

117 | LAPACK: A portable linear algebra library for high-performance computers - Anderson, Bai, et al. - 1990 |

72 | Matrix Computations, 2nd Edition - Golub, Loan - 1989 |

66 | High Performance Fortran Language Specification
- Forum
- 1992
(Show Context)
Citation Context ...ta distribution, we need to redistribute the data to the appropriate format if the data generated is not in the specified format. Several high-level languages like High Performance Fortran (HPF) (HPF Forum 1993) support primitives for data redistribution. DaReL, a portable data redistribution library designed specifically for HPF, performs multidimensional data redistribution for different HPF data distribu... |

44 |
Solving problems on concurrent processors
- Fox, Johnson, et al.
- 1988
(Show Context)
Citation Context ...he choice of the appropriate algorithm, the data structures, the data distribution, and the communication mechanism, in addition to features like portability, scalability, robustness, and efficiency (Fox et al. 1988). Parallel algorithms, encapsulated in scalable libraries, are inherently constructed bottom-up, that is, from the machine up to the user interface, with the hope that the final product will be of in... |

25 |
LAPACK block factorization algorithms on the Intel iPSC/860. Computer Science Dept
- Dongarra, Ostrouchov
- 1990
(Show Context)
Citation Context ...est suited for distributed memory MIMD architectures. The right-looking algorithm is scalable with the problem size and the number of processors, provided that a two-dimensional decomposition is used (Dongarra and Ostrouchov 1990; Dongarra and Walker 1993; von Laszewski et al. 1992; Skjellum and Leung 1990; van de Geijn 1991; Skjellum and Baldwin 1991; Fox et al. 1988). The block size is the key factor affecting performance w... |

22 |
The impact of vector and parallel architectures on the Gaussian elimination algorithm
- Robert
- 1991
(Show Context)
Citation Context ...ly used method for solving dense linear systems of equations (Golub and Loan 1991). The implementation of LU factorization algorithms on vector and parallel architectures has been extensively studied (Robert 1990) and its performance is measured by a number of vendors as part of their benchmarking activities (Dongarra 1994). LU factorization with partial row pivoting is the decomposition of a general matrix A... |

21 | The Design and Evolution of Zipcode - Skjellum, Smith, et al. - 1994 |

19 | The Multicomputer Toolbox: Scalable Parallel Libraries for Large-Scale Concurrent Applications
- Skjellum, Baldwin
- 1991
(Show Context)
Citation Context ...al for cost savings (of time and memory) by avoiding explicit redistribution (at both the entry and exit interfaces of a library) because of these available optimizations. The Multicomputer Toolbox (Skjellum and Baldwin 1991; Skjellum et al. 1994) is a collection of scalable parallel libraries that provide data-distribution-independent programming for a large number of numerical algorithms, while providing needed portabi... |

18 | Solving Linear Systems on Vector and Shared Memory Computers - Dongarra, Duff, et al. - 1991 |

15 | Two dimensional basic linear algebra communication subprograms - DONGARRA, GEIJN, et al. - 1993 |

14 | Document for a standard message-passing interface
- Forum
- 1993
(Show Context)
Citation Context ...The Toolbox supports distributed data structures encapsulating data layout of matrices and vectors. The distributed data structures combined with the message-passing primitives provided by MPI (MPI Forum 1994) provide a strong basis for building reusable scalable parallel libraries that will potentially satisfy both the application goals and the simultaneous desires for high performance and portability. T... |

13 |
Concurrent Dynamic Simulation: Multicomputer Algorithms Research Applied to Ordinary Differential-Algebraic
- Skjellum
- 1990
(Show Context)
Citation Context ...el-3 BLAS LAPACK version of the algorithm. This algorithm uses a fixed block size that is determined empirically for each machine (van de Geijn 1991). Multicomputer Toolbox The Multicomputer Toolbox (Skjellum 1990; Skjellum and Baldwin 1991; Skjellum et al. 1994) is a set of scalable parallel libraries that support data-distribution-independent programming for a large number of important algorithms. The Toolbox... |

11 | The Multicomputer Toolbox: First-Generation Scalable Libraries, HICSS-27
- Skjellum, Leung, et al.
- 1994
(Show Context)
Citation Context ...(of time and memory) by avoiding explicit redistribution (at both the entry and exit interfaces of a library) because of these available optimizations. The Multicomputer Toolbox (Skjellum and Baldwin 1991; Skjellum et al. 1994) is a collection of scalable parallel libraries that provide data-distribution-independent programming for a large number of numerical algorithms, while providing needed portability and performance. ... |

8 |
An adaptive blocking strategy for matrix factorization
- BISCHOF, LACROUTE
- 1990
(Show Context)
Citation Context ...ed updates (Dongarra et al. 1991). The most common approach is to determine an optimal fixed block size through experimentation for a particular machine and problem size (van de Geijn 1991). In (Bischof and Lacroute 1990) an adaptive blocking strategy for matrix factorizations is discussed. The results obtained indicate that the performance of the adaptive blocking strategy is as good as any fixed-size blocking strat... |

8 |
An initial implementation
- Doss, Gropp, et al.
- 1993
(Show Context)
Citation Context ... in dedicated mode using the US protocol. Thin nodes have 64MB memory with 180MB paging. The programs were executed with IBM's Parallel Operating Environment (POE) using the MPICH implementation (Doss et al. 1993) of MPI. Summary In this thesis, we present the blocking, level-3 BLAS, right-looking LU factorization algorithm developed within the Toolbox framework. While the earlier work was done using the Zipc... |

7 | LU factorization of sparse, unsymmetric, Jacobian matrices on multicomputers - Skjellum, Leung - 1990 |

6 | Application of massively parallel computation to integral equation models of electromagnetic scattering - Cwik, Geijn, et al. - 1994 |

6 |
DaReL: A portable data redistribution library for distributedmemory machines
- Kalns, Ni
- 1994
(Show Context)
Citation Context ...tives for data redistribution. DaReL, a portable data redistribution library designed specifically for HPF, performs multidimensional data redistribution for different HPF data distribution patterns (Kalns and Ni 1994). The DaReL library is specific to HPF and also was not available for public use at the time this work was developed. Hence we have developed a simple redistribution algorithm to measure the time ... |

6 | Massively parallel LINPACK benchmark on the Intel Touchstone Delta and iPSC/860 systems - Geijn - 1991 |

4 | Adaptive data distribution for concurrent continuation - Velde, Lorenz - 1989 |

2 |
Dense and iterative linear algebra in the multicomputer toolbox
- Bangalore, Skjellum, et al.
- 1993
(Show Context)
Citation Context ...yout of matrices and vectors to support data-distribution-independent programming. The initial work on dense linear algebra libraries was presented in (Skjellum and Baldwin 1991; Skjellum et al. 1994; Bangalore et al. 1993). The performance of a non-blocking, level-2 BLAS LU factorization algorithm was presented for various data distributions. The results obtained show that the DDI approach can produce almost the same leve... |

2 | Data redistribution and concurrency - Velde - 1990 |

1 |
Poly-algorithmic approach to the design of scalable parallel libraries. In preparation for submission to Parallel Computing
- Bangalore, Skjellum
- 1995
(Show Context)
Citation Context ...lication, problem size, grid shape, and amount of memory available, one of the above algorithms can be used. We have also considered a hybrid of the LINPACK and Toolbox approaches, which we detail in (Bangalore and Skjellum 1995). In this thesis, we have concentrated on the development of the factorization algorithm. We currently implement a simple backsolve strategy; more sophisticated algorithms are possible that take adva... |

1 |
LAPACK working note 58: The design of linear algebra libraries for high performance computers
- Dongarra, Walker
- 1993
(Show Context)
Citation Context ...ory MIMD architectures. The right-looking algorithm is scalable with the problem size and the number of processors, provided that a two-dimensional decomposition is used (Dongarra and Ostrouchov 1990; Dongarra and Walker 1993; von Laszewski et al. 1992; Skjellum and Leung 1990; van de Geijn 1991; Skjellum and Baldwin 1991; Fox et al. 1988). The block size is the key factor affecting performance when using the blocked algo... |

1 |
e-mail: henry@SSD.intel.com
- Henry
- 1994
(Show Context)
Citation Context ...140 Supercomputer, a peak performance of 143.4 double-precision GFLOPS on 1840 processors (configured as a 16 x 115 grid) for problem sizes of 55700 x 55700 with a block size of 16 has been recorded (Henry 1994). The LINPACK benchmark is based on the right-looking, level-3 BLAS LAPACK version of the algorithm. This algorithm uses a fixed block size that is determined empirically for each machine (van de Gei... |

1 |
IBM POWERparallel systems, http://ibm.tc.cornell.edu/index.html
- Corporation
- 1995
(Show Context)
Citation Context ... This was done to illustrate the data-distribution-independent approach for the development of scalable parallel libraries. Performance results obtained from executing the program on the IBM SP-2 (IBM Corporation 1995) for different logical grid shapes and sizes, matrix sizes, and panel sizes are presented. Furthermore, we compare the results obtained from the DDI algorithm we have developed with a fixed-data-dist... |