SUMMA: Scalable Universal Matrix Multiplication Algorithm
, 1997
Cited by 58 (3 self)
In this paper, we give a straightforward, highly efficient, scalable implementation of common matrix multiplication operations. The algorithms are much simpler than previously published methods, yield better performance, and require less work space. MPI implementations are given, as are performance results on the Intel Paragon system.
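The abstract above does not spell out the algorithm, so the following is only a minimal sketch of the broadcast-and-rank-update formulation generally associated with SUMMA, simulated in a single process rather than with MPI. The function name `summa_simulated`, the `panel` parameter, and the block-splitting helper are hypothetical and are not taken from the paper. Each pass over `k0` stands in for one step in which a column panel of A is broadcast along process rows and a row panel of B along process columns, after which every process updates its own block of C.

```python
def split(length, parts):
    """Split range(length) into `parts` contiguous (start, stop) blocks."""
    base, extra = divmod(length, parts)
    bounds, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def summa_simulated(A, B, grid_rows, grid_cols, panel=1):
    """Compute C = A * B the way a grid_rows x grid_cols process grid would.

    Single-process simulation (an assumption of this sketch): the loops over
    row_ranges and col_ranges play the role of the processes, and slicing a
    panel plays the role of a broadcast.
    """
    m, k, n = len(A), len(B), len(B[0])
    row_ranges = split(m, grid_rows)   # rows of C owned by each process row
    col_ranges = split(n, grid_cols)   # columns of C owned by each process column
    C = [[0.0] * n for _ in range(m)]

    for k0 in range(0, k, panel):
        k1 = min(k0 + panel, k)
        width = k1 - k0
        for r0, r1 in row_ranges:
            # Column panel of A that would be broadcast within this process row.
            a_panel = [A[i][k0:k1] for i in range(r0, r1)]
            for c0, c1 in col_ranges:
                # Row panel of B that would be broadcast within this process column.
                b_panel = [B[t][c0:c1] for t in range(k0, k1)]
                # Local rank-`width` update of the block owned by process (r, c).
                for i in range(r1 - r0):
                    for j in range(c1 - c0):
                        C[r0 + i][c0 + j] += sum(
                            a_panel[i][t] * b_panel[t][j] for t in range(width)
                        )
    return C


if __name__ == "__main__":
    print(summa_simulated([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2, 2))
```

In a real distributed implementation the two panel slices would be replaced by collective broadcasts within row and column communicators; a larger `panel` trades more buffer space for fewer, larger messages.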
The Design of a Parallel Dense Linear Algebra Software Library: Reduction to Hessenberg, Tridiagonal, and Bidiagonal Form
, 1995
Cited by 30 (5 self)
This paper discusses issues in the design of ScaLAPACK, a software library for performing dense linear algebra computations on distributed memory concurrent computers. These issues are illustrated using the ScaLAPACK routines for reducing matrices to Hessenberg, tridiagonal, and bidiagonal forms. These routines are important in the solution of eigenproblems. The paper focuses on how building blocks are used to create higher-level library routines. Results are presented that demonstrate the scalability of the reduction routines. The building blocks most commonly used in ScaLAPACK are the sequential BLAS, the Parallel BLAS (PBLAS), and the Basic Linear Algebra Communication Subprograms (BLACS). Each of the matrix reduction algorithms consists of a series of steps, in each of which one block column (or panel), and/or block row, of the matrix is reduced, followed by an update of the portion of the matrix that has not been factorized so far. This latter phase is performed usin...
Parallelizing the QR Algorithm for the Unsymmetric Algebraic Eigenvalue Problem: Myths and Reality
- Accepted for publication in SIAM Journal on Scientific Computing, Date unknown
, 1996
Cited by 6 (2 self)
Over the last few years, it has been suggested that the popular QR algorithm for the unsymmetric eigenvalue problem does not parallelize. In this paper, we present both positive and negative results on this subject: in theory, asymptotically perfect speedup can be obtained. In practice, reasonable speedup can be obtained on a MIMD distributed memory computer, for a relatively small number of processors. However, we also show theoretically that it is impossible for the standard QR algorithm to be scalable. Performance of a parallel implementation of the LAPACK DLAHQR routine on the Intel Paragon™ system is reported.
Towards an Accurate Performance Modeling of Parallel Sparse LU Factorization
- in "Applicable Algebra in Engineering, Communication, and Computing"
, 2006
Cited by 4 (1 self)
We present a simulation-based performance model to analyze a parallel sparse LU factorization algorithm on modern cache-based, high-end parallel architectures. We consider supernodal right-looking parallel factorization on a bi-dimensional grid of processors that uses static pivoting. Our model characterizes the algorithmic behavior by taking into account the underlying processor speed, memory system performance, and interconnect speed. The model is validated using the implementation in the SuperLU_DIST linear system solver, sparse matrices from real applications, and an IBM POWER3 parallel machine. Our modeling methodology can be adapted to study the performance of other types of sparse factorizations, such as Cholesky or QR, and on different parallel machines.
CRPC Research into Linear Algebra Software for High Performance Computers
, 1994
Cited by 4 (2 self)
In this paper we look at a number of approaches being investigated in the Center for Research on Parallel Computation (CRPC) to develop linear algebra software for high-performance computers. These approaches are exemplified by the LAPACK, templates, and ARPACK projects. LAPACK is a software library for performing dense and banded linear algebra computations, and was designed to run efficiently on high-performance computers. We focus on the design of the distributed memory version of LAPACK, and on an object-oriented interface to LAPACK. The templates project aims at making the task of developing sparse linear algebra software simpler and easier. Reusable software templates are provided that the user can then customize to modify and optimize a particular algorithm, and hence build more complex applications. ARPACK is a software package for solving large-scale eigenvalue problems, and is based on an implicitly restarted variant of the Arnoldi scheme. The paper focuses on issues impact...
Parallel Algorithms for LQ Optimal Control of Discrete-Time Periodic Linear Systems
- J. Parallel Distrib. Comput
, 2001
Cited by 3 (0 self)
This paper analyzes the performance of two parallel algorithms for solving the
Toward scalable matrix multiply on multithreaded architectures
- In Euro-Par ’07: Proceedings of the Thirteenth International European Conference on Parallel and Distributed Computing
, 2007
Cited by 2 (2 self)
We show empirically that some of the issues that affected the design of linear algebra libraries for distributed memory architectures will also likely affect such libraries for shared memory architectures with many simultaneous threads of execution, including SMP architectures and future multicore processors. The always-important matrix-matrix multiplication is used to demonstrate that a simple one-dimensional data partitioning is suboptimal in the context of dense linear algebra operations and hinders scalability. In addition, we advocate the publishing of low-level interfaces to supporting operations, such as the copying of data to contiguous memory, so that library developers may further optimize parallel linear algebra implementations. Data collected on a 16-CPU Itanium2 server support these observations.
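One common way to see why a one-dimensional partitioning scales poorly is to count how much of A and B each worker must read to compute its share of C = A * B for n x n matrices. The sketch below is a back-of-the-envelope model only, not the measurement methodology of the paper; the function names are hypothetical, and it assumes n is divisible by the relevant block counts and that the worker count p is a perfect square in the two-dimensional case.

```python
import math


def words_read_1d(n, p):
    """Words of A and B read per worker when C is split into p row blocks.

    Each worker needs its n/p rows of A plus all of B.
    """
    return n * n / p + n * n


def words_read_2d(n, p):
    """Words of A and B read per worker when C is split on a sqrt(p) x sqrt(p) grid.

    Each worker needs n/sqrt(p) rows of A and n/sqrt(p) columns of B.
    """
    side = math.isqrt(p)
    if side * side != p:
        raise ValueError("this sketch assumes p is a perfect square")
    return 2 * n * n / side


if __name__ == "__main__":
    n = 1024
    for p in (4, 16, 64):
        print(p, words_read_1d(n, p), words_read_2d(n, p))
```

Under this model the per-worker volume in the 1D case never drops below n*n words however many workers are added, while in the 2D case it shrinks in proportion to 1/sqrt(p).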
PLAPACK: High performance through high level abstraction
- In Proceedings of ICPP98
, 1998
Analyzing Parallel Program Performance Using Normalized Performance Indices and Trace Transformation Techniques
, 1996
In this paper we describe how a performance tuning tool-set, AIMS, guides the user towards developing efficient and scalable production-level parallel programs by locating performance improvement opportunities and determining optimization benefits. AIMS's Xisk helps identify potential optimizations by computing various pre-defined normalized performance indices from program traces. Inspection of these indices points to specific optimizations that may benefit program performance. After identifying and characterizing performance problems, AIMS's MK can provide quantitative estimates of performance benefits to help the user avoid arduous optimizations that may not lead to the expected performance improvements. MK also helps identify potential pitfalls or benefits of changing any of various system parameters. Based on MK's performance projection, an informed decision can be made regarding the most beneficial program optimizations or upgrades in execution environments.
A Comprehensive Approach to Parallel Linear Algebra Libraries
, 1995
This document is in a constant state of flux. It is being distributed to generate discussions. We are aware of its incompleteness and the fact that there are many typos.

