Results 1-10 of 16
The Torus-Wrap Mapping For Dense Matrix Calculations On Massively Parallel Computers
SIAM J. Sci. Stat. Comput., 1994
Cited by 66 (5 self)
Abstract
Dense linear systems of equations are quite common in science and engineering, arising in boundary element methods, least squares problems and other settings. Massively parallel computers will be necessary to solve the large systems required by scientists and engineers, and scalable parallel algorithms for the linear algebra applications must be devised for these machines. A critical step in these algorithms is the mapping of matrix elements to processors. In this paper, we study the use of the torus-wrap mapping in general dense matrix algorithms, from both theoretical and practical viewpoints. We prove that, under reasonable assumptions, this assignment scheme leads to dense matrix algorithms that achieve (to within a constant factor) the lower bound on interprocessor communication. We also show that the torus-wrap mapping allows algorithms to exhibit less idle time, better load balancing and less memory overhead than the more common row and column mappings. Finally, we discuss ...
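The cyclic assignment the abstract describes can be sketched in a few lines. This is a minimal illustration of the wrapping idea, not the paper's implementation; the grid shape and matrix size are illustrative.

```python
# Torus-wrap (cyclic) mapping: element (i, j) of a matrix is assigned to
# processor (i mod pr, j mod pc) on a pr x pc processor grid.

def torus_wrap_owner(i, j, pr, pc):
    """Return the (row, col) coordinates of the processor owning element (i, j)."""
    return (i % pr, j % pc)

def elements_per_processor(n, pr, pc):
    """Count how many of the n*n elements land on each processor."""
    counts = {}
    for i in range(n):
        for j in range(n):
            p = torus_wrap_owner(i, j, pr, pc)
            counts[p] = counts.get(p, 0) + 1
    return counts

counts = elements_per_processor(8, 2, 2)
# When pr and pc divide n, every processor holds the same number of
# elements -- the load-balance property the abstract credits to the
# torus-wrap mapping over plain row or column mappings.
```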
Design of a Parallel Nonsymmetric Eigenroutine Toolbox, Part I
, 1993
Cited by 63 (14 self)
Abstract
The dense nonsymmetric eigenproblem is one of the hardest linear algebra problems to solve effectively on massively parallel machines. Rather than trying to design a "black box" eigenroutine in the spirit of EISPACK or LAPACK, we propose building a toolbox for this problem. The tools are meant to be used in different combinations on different problems and architectures. In this paper, we will describe these tools which include basic block matrix computations, the matrix sign function, 2-dimensional bisection, and spectral divide and conquer using the matrix sign function to find selected eigenvalues. We also outline how we deal with ill-conditioning and potential instability. Numerical examples are included. A future paper will discuss error analysis in detail and extensions to the generalized eigenproblem.
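The matrix sign function mentioned in the abstract can be computed by the classical Newton iteration S ← (S + S⁻¹)/2, which converges when the matrix has no eigenvalues on the imaginary axis. The sketch below shows that iteration and the spectral projector (I + sign(A))/2 that underlies sign-function-based divide and conquer; it is a plain serial illustration, not the toolbox's parallel routine.

```python
import numpy as np

def matrix_sign(a, iters=50, tol=1e-12):
    """Newton iteration S <- (S + S^{-1}) / 2 for the matrix sign function.
    Valid when A has no eigenvalues on the imaginary axis."""
    s = a.astype(float).copy()
    for _ in range(iters):
        s_new = 0.5 * (s + np.linalg.inv(s))
        if np.linalg.norm(s_new - s, ord='fro') < tol:
            return s_new
        s = s_new
    return s

# Small test matrix with eigenvalues 2 and -1 (one in each half-plane).
a = np.array([[2.0, 1.0], [0.0, -1.0]])
s = matrix_sign(a)

# sign(A) squares to the identity, and (I + sign(A))/2 is the spectral
# projector onto the invariant subspace for the right-half-plane
# eigenvalues -- the splitting used in spectral divide and conquer.
p = 0.5 * (np.eye(2) + s)
```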
The Design of a Parallel Dense Linear Algebra Software Library: Reduction to Hessenberg, Tridiagonal, and Bidiagonal Form
, 1995
Cited by 34 (5 self)
Abstract
This paper discusses issues in the design of ScaLAPACK, a software library for performing dense linear algebra computations on distributed memory concurrent computers. These issues are illustrated using the ScaLAPACK routines for reducing matrices to Hessenberg, tridiagonal, and bidiagonal forms. These routines are important in the solution of eigenproblems. The paper focuses on how building blocks are used to create higher-level library routines. Results are presented that demonstrate the scalability of the reduction routines. The building blocks most commonly used in ScaLAPACK are the sequential BLAS, the Parallel BLAS (PBLAS) and the Basic Linear Algebra Communication Subprograms (BLACS). Each of the matrix reduction algorithms consists of a series of steps in each of which one block column (or panel), and/or block row, of the matrix is reduced, followed by an update of the portion of the matrix that has not been factorized so far. This latter phase is performed usin...
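ScaLAPACK's routines operate on matrices laid out in a two-dimensional block-cyclic distribution; the index arithmetic for one dimension (the same mapping is applied independently to rows and columns) can be sketched as follows. Block size and process count here are illustrative.

```python
# One-dimensional block-cyclic mapping: global indices are grouped into
# blocks of size nb, and blocks are dealt cyclically to p processes.

def block_cyclic_owner(g, nb, p):
    """Process owning global index g for block size nb over p processes."""
    return (g // nb) % p

def global_to_local(g, nb, p):
    """Local index of global index g on its owning process."""
    block = g // nb
    return (block // p) * nb + (g % nb)

owners = [block_cyclic_owner(g, nb=2, p=3) for g in range(12)]
# owners == [0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2]: blocks of 2 dealt
# round-robin to processes 0, 1, 2, which keeps the trailing-matrix
# updates balanced as the factorization proceeds.
```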
Local Basic Linear Algebra Subroutines (LBLAS) for distributed memory architectures and languages with array syntax
 The International Journal of Supercomputer Applications
, 1992
Cited by 14 (7 self)
Abstract
We describe a subset of the level-1, level-2, and level-3 BLAS implemented for each node of the Connection Machine system CM-200. The routines, collectively called LBLAS, have interfaces consistent with languages with an array syntax such as Fortran 90. One novel feature, important for distributed memory architectures, is the capability of performing computations on multiple instances of objects in a single call. The number of instances and their allocation across memory units, and the strides for the different axes within the local memories, are derived from an array descriptor that contains type, shape, and data distribution information. Another novel feature of the LBLAS is a selection of loop order for rank-1 updates and matrix-matrix multiplication based upon array shapes, strides, and DRAM page faults. The peak efficiencies for the routines are in excess of 75%. Matrix-vector multiplication achieves a peak efficiency of 92%. The optimization of loop ordering has a success ...
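The array descriptor the abstract describes bundles type, shape, stride, and multi-instance information into one object that drives each call. The fields below are hypothetical illustrations of that idea, not the actual CMSSL/LBLAS interface.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ArrayDescriptor:
    """Hypothetical sketch of an LBLAS-style descriptor (field names invented)."""
    dtype: str                 # element type, e.g. "float64"
    shape: Tuple[int, ...]     # local per-node shape of one instance
    strides: Tuple[int, ...]   # strides for each axis in local memory
    num_instances: int         # instances handled by a single call

    def elements_per_call(self) -> int:
        """Total elements one multi-instance call would touch."""
        n = 1
        for extent in self.shape:
            n *= extent
        return n * self.num_instances

# Four 64x64 instances processed by a single (hypothetical) call.
d = ArrayDescriptor("float64", (64, 64), (64, 1), 4)
```

Packing multiple instances into one call is what lets a routine amortize its setup cost and choose a loop order once for all instances, which is the point the abstract makes about distributed memory architectures.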
Load-Balanced LU and QR Factor and Solve Routines for Scalable Processors with Scalable I/O
, 1994
Cited by 8 (0 self)
Abstract
The concept of block-cyclic order elimination can be applied to out-of-core LU and QR matrix factorizations on distributed memory architectures equipped with a parallel I/O system. This elimination scheme provides load-balanced computation in both the factor and solve phases and further optimizes the use of the network bandwidth to perform I/O operations. Stability of LU factorization is enforced by full column pivoting. Performance results are presented for the Connection Machine system CM-5.
1 Introduction
Load balance for in-core matrix factorization on distributed memory architectures can be achieved using a cyclic ordering of the data. In fact, one need not allocate data explicitly in a cyclic fashion. Instead, the elimination can be performed in a cyclic order. That block-cyclic order elimination is an efficient alternative to block-cyclic data allocation for in-core dense matrix factorization was shown in [1]. The present note extends this concept to out-of-core ...
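The key observation above, that one can keep blocks allocated contiguously and instead permute the elimination order, can be sketched as follows. This is a hedged illustration of the ordering idea under the assumption that the block count divides evenly among processors; it is not the paper's out-of-core implementation.

```python
# Blocks stay contiguous: block b of n_blocks lives on slab b // (n_blocks // p).
# The elimination instead visits the slabs round-robin, giving each slab work
# at every stage -- the same balance a cyclic allocation would provide.

def cyclic_elimination_order(n_blocks, p):
    """Round-robin elimination order over p contiguous slabs of blocks."""
    per = n_blocks // p          # blocks per slab (assumed exact here)
    return [s * per + j for j in range(per) for s in range(p)]

order = cyclic_elimination_order(8, 2)
# order == [0, 4, 1, 5, 2, 6, 3, 7]: the pivot sequence alternates
# between the two contiguous slabs.
```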
CRPC Research into Linear Algebra Software for High Performance Computers
, 1994
Cited by 4 (2 self)
Abstract
In this paper we look at a number of approaches being investigated in the Center for Research on Parallel Computation (CRPC) to develop linear algebra software for high-performance computers. These approaches are exemplified by the LAPACK, templates, and ARPACK projects. LAPACK is a software library for performing dense and banded linear algebra computations, and was designed to run efficiently on high-performance computers. We focus on the design of the distributed memory version of LAPACK, and on an object-oriented interface to LAPACK. The templates project aims at making the task of developing sparse linear algebra software simpler and easier. Reusable software templates are provided that the user can then customize to modify and optimize a particular algorithm, and hence build more complex applications. ARPACK is a software package for solving large-scale eigenvalue problems, and is based on an implicitly restarted variant of the Arnoldi scheme. The paper focuses on issues impact...
Implementation of QR Up- and Downdating on a Massively Parallel Computer
, 1996
Cited by 4 (0 self)
Abstract
We describe an implementation of QR up- and downdating on a massively parallel computer (the Connection Machine CM-200) and show that the algorithm maps well onto the computer. In particular, we show how the use of corrected seminormal equations for downdating can be efficiently implemented. We also illustrate the use of our algorithms in a new LP algorithm.
Key words. up- and downdating of QR factorization, corrected seminormal equations, CM-200.
1 Introduction
In this paper we describe an efficient implementation of updating and downdating of a QR factorization on the Connection Machine CM-200, which is a massively parallel SIMD computer [11]. Many of our considerations are general for massively parallel computers. This project was sponsored by the Danish Center for Parallel Computer Research. M. Pinar was also sponsored by the Danish Natural Science Research Council, Grant No. 110505. UNI-C (Danish Computing Centre for Research and Education), Building 305, Technical...
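The updating half of the problem has a standard serial form: when a row is appended to the data matrix, Givens rotations zero that row against R, producing the R factor of the extended matrix. The sketch below shows that classical update; the paper's downdating via corrected seminormal equations is more delicate and is not shown here.

```python
import numpy as np

def qr_append_row(r, w):
    """Given upper-triangular R of A and a new row w, return R' of [A; w]
    by zeroing w against R with a sequence of Givens rotations."""
    n = r.shape[0]
    r = r.copy()
    w = w.astype(float).copy()
    for k in range(n):
        a, b = r[k, k], w[k]
        rad = np.hypot(a, b)         # length of the (a, b) pair
        if rad == 0.0:
            continue
        c, s = a / rad, b / rad      # rotation annihilating w[k]
        rk, wk = r[k, k:].copy(), w[k:].copy()
        r[k, k:] = c * rk + s * wk
        w[k:] = -s * rk + c * wk
    return r

rng = np.random.default_rng(0)
a = rng.standard_normal((5, 3))
r = np.linalg.qr(a, mode='r')        # R factor of the original data
w = np.ones(3)                       # new observation row
r_new = qr_append_row(r, w)
# R'^T R' equals A^T A + w w^T, the Gram matrix of the extended data.
```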
High Performance, Scalable Scientific Software Libraries
, 1994
Cited by 2 (0 self)
Abstract
Massively parallel processors introduce new demands on software systems with respect to performance, scalability, robustness and portability. The increased complexity of the memory systems and the increased range of problem sizes for which a given piece of software is used pose serious challenges to software developers. The Connection Machine Scientific Software Library, CMSSL, uses several novel techniques to meet these challenges. The CMSSL contains routines for managing the data distribution and provides data distribution independent functionality. High performance is achieved through careful scheduling of operations and data motion, and through the automatic selection of algorithms at runtime. We discuss some of the techniques used, and provide evidence that CMSSL has reached the goals of performance and scalability for an important set of applications.
1.1 INTRODUCTION
The main reason for large scale parallelism is performance. In order for massively parallel ar...
A 3D Parallel Communication-Efficient Dense Linear Solver
, 2000
Cited by 1 (1 self)
Abstract
We present new communication-efficient parallel dense linear solvers: a solver for triangular linear systems with multiple right-hand sides and an LU factorization algorithm. These solvers are asymptotically work-efficient and they perform a factor of P^(1/6) less communication than any existing algorithm, where P is the number of processors. In other words, these solvers are likely to run faster than any other solvers on parallel computers with a large number of processors. These provably-efficient algorithms are the main contribution of this thesis. The new solvers reduce communication at the expense of using more temporary storage. Previously, algorithms that reduce communication by using more memory were only known for matrix multiplication. These so-called three-dimensional matrix-multiplication algorithms use a three-dimensional grid of processors and they replicate matrices on each two-dimensional "layer" of the 3D processor grid. (The processor grid can be a virtual grid embedded in any ...
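The replicated-layer idea behind 3D matrix multiplication can be simulated serially: the inner (k) dimension is split across the layers of the 3D grid, each layer forms a partial product from its slice, and the partial products are sum-reduced across layers. This is a toy illustration of the data decomposition only, with an illustrative layer count; it performs no actual communication.

```python
import numpy as np

def matmul_3d(a, b, layers):
    """Simulate a 3D matrix multiplication: each 'layer' multiplies its
    slice of the k-dimension, then partial products are summed (the
    cross-layer reduction)."""
    n = a.shape[1]
    chunk = n // layers                      # assume layers divides n
    partial = [a[:, l*chunk:(l+1)*chunk] @ b[l*chunk:(l+1)*chunk, :]
               for l in range(layers)]       # one partial product per layer
    return sum(partial)

rng = np.random.default_rng(1)
a = rng.standard_normal((4, 6))
b = rng.standard_normal((6, 4))
c = matmul_3d(a, b, layers=3)                # equals a @ b
```

The extra temporary storage the abstract mentions shows up here as the list of per-layer partial products, each the full size of C; trading that memory for fewer transfers is the core communication/memory trade-off.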