An Extended Set of Fortran Basic Linear Algebra Subprograms
 ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE
, 1986
Cited by 523 (68 self)
This paper describes an extension to the set of Basic Linear Algebra Subprograms. The extensions are targeted at matrix-vector operations, which should provide for efficient and portable implementations of algorithms for high-performance computers.
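To make "matrix-vector operations" concrete, here is a minimal sketch of the general matrix-vector update y := alpha*A*x + beta*y, the kind of operation this level of BLAS covers. The function name `gemv` and the list-of-rows storage are assumptions chosen for illustration; this is not the Fortran interface described in the paper.

```python
def gemv(alpha, a, x, beta, y):
    """Return alpha * A * x + beta * y for a dense matrix A stored as a list of rows.

    Illustrative only: the name and argument order are hypothetical,
    not the Fortran calling sequence.
    """
    if len(a) != len(y) or any(len(row) != len(x) for row in a):
        raise ValueError("dimension mismatch")
    return [
        alpha * sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + beta * y_i
        for row, y_i in zip(a, y)
    ]


a = [[1.0, 2.0], [3.0, 4.0]]
print(gemv(2.0, a, [1.0, 1.0], 1.0, [10.0, 20.0]))  # prints [16.0, 34.0]
```

Each call performs O(n^2) arithmetic on O(n^2) data, which is why grouping work at this granularity gives an optimized library more to work with than vector-only kernels.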
A class of parallel tiled linear algebra algorithms for multicore architectures
Cited by 169 (58 self)
As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine-grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the Cholesky, LU and QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out-of-order execution of the tasks, which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithms, where parallelism can only be exploited at the level of the BLAS operations, and with vendor implementations.
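The idea of a factorization expressed as "a sequence of small tasks that operate on square blocks" can be sketched by enumerating the tile tasks of a lower-triangular Cholesky factorization. The function name `cholesky_tasks` and the tuple layout are hypothetical; the kernel labels POTRF, TRSM, SYRK, and GEMM follow the usual LAPACK/BLAS naming and are an assumption about how the tasks would be labeled. The sketch only lists tasks in one valid sequential order; it does not schedule or execute them.

```python
def cholesky_tasks(nt):
    """List tile tasks for a lower Cholesky factorization on an nt x nt tile grid.

    Each task is (kernel, i, j, k): the kernel updates tile (i, j) at step k.
    """
    tasks = []
    for k in range(nt):
        tasks.append(("POTRF", k, k, k))         # factor the diagonal tile
        for i in range(k + 1, nt):
            tasks.append(("TRSM", i, k, k))      # triangular solve for tiles below it
        for i in range(k + 1, nt):
            tasks.append(("SYRK", i, i, k))      # symmetric update of a diagonal tile
            for j in range(k + 1, i):
                tasks.append(("GEMM", i, j, k))  # update of an off-diagonal tile
    return tasks


for task in cholesky_tasks(3):
    print(task)
```

A dynamic scheduler would treat each tuple as a node in a dependency graph and run any task whose input tiles are ready, which is what allows the out-of-order execution the abstract describes.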
ALGORITHM 656  An Extended Set of Basic Linear Algebra . . .
, 1988
Cited by 45 (9 self)
... Subprograms (Level 2 BLAS). Level 2 BLAS are targeted at matrix-vector operations with the aim of providing more efficient, but portable, implementations of algorithms on high-performance computers. The model implementation provides a portable set of FORTRAN 77 Level 2 BLAS for machines where specialized implementations do not exist or are not required. The test software aims to verify that specialized implementations meet the specification of Level 2 BLAS and that implementations are correctly installed.
A Parallel Implementation of the Nonsymmetric QR Algorithm for Distributed Memory Architectures
 SIAM J. SCI. COMPUT
, 2002
Cited by 37 (3 self)
One approach to solving the nonsymmetric eigenvalue problem in parallel is to parallelize the QR algorithm. Not long ago, this was widely considered to be a hopeless task. Recent efforts have led to significant advances, although the methods proposed up to now have suffered from scalability problems. This paper discusses an approach to parallelizing the QR algorithm that greatly improves scalability. A theoretical analysis indicates that the algorithm is ultimately not scalable, but the nonscalability does not become evident until the matrix dimension is enormous. Experiments on the Intel Paragon system, the IBM SP2 supercomputer, the SGI Origin 2000, and the Intel ASCI Option Red supercomputer are reported.
New Serial and Parallel Recursive QR Factorization Algorithms for SMP Systems
, 1998
Cited by 35 (6 self)
We present a new recursive algorithm for the QR factorization of an m by n matrix A. The recursion leads to an automatic variable blocking that allows us to replace a level 2 part in a standard block algorithm by level 3 operations. However, there are some additional costs for performing the updates, which prohibit the efficient use of the recursion for large n. This obstacle is overcome by using a hybrid recursive algorithm that outperforms the LAPACK algorithm DGEQRF by 78% to 21% as m = n increases from 100 to 1000. A successful parallel implementation on a PowerPC 604 based IBM SMP node, based on dynamic load balancing, is presented. For 2, 3, and 4 processors and m = n = 2000, it shows speedups of 1.96, 2.99, and 3.92 compared to our uniprocessor algorithm.
1 Introduction
LAPACK algorithm DGEQRF requires more floating point operations than LAPACK algorithm DGEQR2, see [1]. Yet, DGEQRF outperforms DGEQR2 on a RS/6000 workstation by nearly a factor of 3 on large matrices. Dongarra, Kaufm...
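The column-halving recursion can be illustrated with a small sketch. To keep it short, it swaps in block Gram-Schmidt orthogonalization, which is numerically weaker than the Householder-based kernels LAPACK uses, so it should not be read as the paper's algorithm. The function names, the list-of-columns storage, and the full-column-rank assumption are all choices made for this example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def recursive_qr(cols):
    """QR of a full-column-rank matrix given as a list of columns.

    Returns (q_cols, r): orthonormal columns and an upper triangular
    matrix stored as a list of rows. Gram-Schmidt based, for illustration.
    """
    n = len(cols)
    if n == 1:
        norm = math.sqrt(dot(cols[0], cols[0]))
        return [[x / norm for x in cols[0]]], [[norm]]
    half = n // 2
    # Factor the left half of the columns recursively.
    q1, r11 = recursive_qr(cols[:half])
    # Block update of the right half: R12 = Q1^T A2, then A2 <- A2 - Q1 R12.
    r12 = [[dot(q, c) for c in cols[half:]] for q in q1]
    right = []
    for j, c in enumerate(cols[half:]):
        v = list(c)
        for i, q in enumerate(q1):
            v = [vi - r12[i][j] * qi for vi, qi in zip(v, q)]
        right.append(v)
    # Factor the updated right half recursively and assemble R.
    q2, r22 = recursive_qr(right)
    r = [r11[i] + r12[i] for i in range(half)]
    r += [[0.0] * half + r22[i] for i in range(n - half)]
    return q1 + q2, r


q, r = recursive_qr([[3.0, 4.0], [1.0, 2.0]])
print(r)
```

The point of the recursion is that the update of the right half is a matrix-matrix product whose block size is chosen automatically by the splitting, rather than a sequence of matrix-vector updates.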
Scheduling dense linear algebra operations on multicore processors
 CONCURRENCY COMPUTAT.:PRACT EXPER
, 2010
Solution of large, dense symmetric generalized eigenvalue problems using secondary storage
 ACM Transactions on Mathematical Software
, 1988
Cited by 12 (0 self)
This paper describes a new implementation of algorithms for solving large, dense symmetric eigenproblems AX = BXΛ, where the matrices A and B are too large to fit in the central memory of the computer. Here A is assumed to be symmetric, and B symmetric positive definite. A combination of block Cholesky and block Householder transformations is used to reduce the problem to a symmetric banded eigenproblem whose eigenvalues can be computed in central memory. Inverse iteration is applied to the banded matrix to compute selected eigenvectors, which are then transformed back to eigenvectors of the original problem. This method is especially suitable for the solution of large eigenproblems arising in quantum physics, using a vector supercomputer with a fast secondary storage device such as the Cray X-MP with SSD. Some numerical results demonstrate the efficiency of the new implementation.
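For orientation, the standard textbook way a Cholesky factorization turns a symmetric-definite generalized eigenproblem into a standard symmetric one is written out below, with Λ denoting the diagonal matrix of eigenvalues. The symbols L, C, and Y are introduced here for illustration; that the paper's block Cholesky step plays exactly this role is an assumption, not something the abstract states.

```latex
\begin{aligned}
B &= L L^{T} && \text{(Cholesky factorization, $B$ symmetric positive definite)} \\
A X &= L L^{T} X \Lambda \\
L^{-1} A L^{-T} \, (L^{T} X) &= (L^{T} X) \, \Lambda \\
C Y &= Y \Lambda, \qquad C = L^{-1} A L^{-T}, \quad Y = L^{T} X
\end{aligned}
```

Since C is symmetric whenever A is, it can then be reduced further by orthogonal (for example Householder) transformations, and eigenvectors of the original problem are recovered as X = L^{-T} Y.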
QR factorization for the Cell Broadband Engine
, 2009
Cited by 10 (7 self)
The QR factorization is one of the most important operations in dense linear algebra, offering a numerically stable method for solving linear systems of equations including overdetermined and underdetermined systems. Modern implementations of the QR factorization, such as the one in the LAPACK library, suffer from performance limitations due to the use of matrix–vector type operations in the phase of panel factorization. These limitations can be remedied by using the idea of updating of QR factorization, rendering an algorithm that is much more scalable and much more suitable for implementation on a multicore processor. It is demonstrated how the potential of the Cell Broadband Engine can be utilized to the fullest by employing the new algorithmic approach and successfully exploiting the capabilities of the chip in terms of single instruction multiple data parallelism, instruction-level parallelism and thread-level parallelism.
Numerical Linear Algebra for High-Performance Computers
, 1998
Cited by 2 (0 self)
This is a survey of some work recently done at Argonne National Laboratory in an attempt to discover ways to construct numerical software for high-performance computers. The numerical algorithms discussed are taken from several areas of numerical linear algebra. We discuss certain architectural features of advanced computer architectures that will affect the design of algorithms. The technique of restructuring algorithms in terms of certain modules is reviewed. This technique has proved very successful in obtaining a high level of transportability without severe loss of performance on a wide variety of both vector and parallel computers. The module technique is demonstrably effective for dense linear algebra problems. However, in the case of sparse and structured problems it may be difficult to identify general modules that will be as effective. New algorithms have been devised for certain problems in this category. We present examples in three important areas: banded systems, sparse QR factorization, and symmetric eigenvalue problems.