Results 1–10 of 55
Parallel tiled QR factorization for multicore architectures
, 2007
"... As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine grain parallelism becomes a major requ ..."
Abstract

Cited by 66 (33 self)
As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine-grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out-of-order execution of the tasks, which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithm for QR factorization, where parallelism can only be exploited at the level of the BLAS operations.
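The task structure this abstract describes can be sketched in a few lines. The kernel names (GEQRT, LARFB, TSQRT, SSRFB) follow the tiled-QR literature; the sketch below is ours, builds only the dependency graph (no numerics), and assigns each task the earliest "round" in which a dependency-driven scheduler could fire it — showing how far execution can proceed out of order compared with the fully sequential task list.

```python
def tiled_qr_tasks(p):
    """Tasks of a tiled QR on a p x p grid of tiles, as
    (name, tiles-read, tiles-written) triples in sequential order."""
    tasks = []
    for k in range(p):
        tasks.append((("GEQRT", k), [(k, k)], [(k, k)]))
        for j in range(k + 1, p):
            tasks.append((("LARFB", k, j), [(k, k), (k, j)], [(k, j)]))
        for i in range(k + 1, p):
            tasks.append((("TSQRT", i, k), [(k, k), (i, k)], [(k, k), (i, k)]))
            for j in range(k + 1, p):
                tasks.append((("SSRFB", i, k, j),
                              [(i, k), (k, j), (i, j)], [(k, j), (i, j)]))
    return tasks

def schedule_rounds(tasks):
    """Earliest round each task may run when tasks fire as soon as their
    flow, anti, and output dependencies on tiles are satisfied."""
    last_write = {}   # tile -> round of most recent write
    last_read = {}    # tile -> latest round the tile was read since that write
    rounds = {}
    for name, reads, writes in tasks:
        r = 0
        for t in reads:
            r = max(r, last_write.get(t, -1) + 1)
        for t in writes:
            r = max(r, last_write.get(t, -1) + 1, last_read.get(t, -1) + 1)
        rounds[name] = r
        for t in reads:
            last_read[t] = max(last_read.get(t, -1), r)
        for t in writes:
            last_write[t] = r
            last_read[t] = -1   # earlier reads are now dominated by this write
    return rounds

rounds = schedule_rounds(tiled_qr_tasks(4))
print(len(rounds), "tasks,", 1 + max(rounds.values()), "parallel rounds")
```

For a 4 x 4 tile grid the 30 tasks collapse into far fewer parallel rounds, which is exactly the parallelism a fork-join BLAS model cannot see.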
A Framework for Symmetric Band Reduction
, 1999
"... this paper, we generalize the ideas behind the RSalgorithms and the MHLalgorithm. We develop a band reduction algorithm that eliminates d subdiagonals of a symmetric banded matrix with semibandwidth b (d < b), in a fashion akin to the MHL tridiagonalization algorithm. Then, like the Rutishauser alg ..."
Abstract

Cited by 29 (6 self)
In this paper, we generalize the ideas behind the RS algorithms and the MHL algorithm. We develop a band reduction algorithm that eliminates d subdiagonals of a symmetric banded matrix with semibandwidth b (d < b), in a fashion akin to the MHL tridiagonalization algorithm. Then, like the Rutishauser algorithm, the band reduction algorithm is repeatedly used until the reduced matrix is tridiagonal. If d = b − 1, it is the MHL algorithm; and if d = 1 is used for each reduction step, it results in the Rutishauser algorithm. However, d need not be chosen this way; indeed, exploiting the freedom we have in choosing d leads to a class of algorithms for band reduction and tridiagonalization with favorable computational properties. In particular, we can derive algorithms with
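The family of schedules this abstract unifies can be illustrated with a tiny sketch (ours, not the paper's code): starting from semibandwidth b, each sweep removes some d subdiagonals (1 ≤ d < b), leaving semibandwidth b − d, until the matrix is tridiagonal.

```python
def bandwidths(b, choose_d):
    """Sequence of semibandwidths under a given strategy for choosing d."""
    seq = [b]
    while b > 1:
        d = choose_d(b)
        assert 1 <= d < b          # each sweep must remove 1..b-1 subdiagonals
        b -= d
        seq.append(b)
    return seq

print(bandwidths(8, lambda b: b - 1))  # MHL: one sweep straight to tridiagonal -> [8, 1]
print(bandwidths(8, lambda b: 1))      # Rutishauser: one subdiagonal per sweep -> [8, 7, 6, 5, 4, 3, 2, 1]
print(bandwidths(8, lambda b: max(1, b // 2)))  # an intermediate halving strategy -> [8, 4, 2, 1]
```

The two classical algorithms sit at the extremes of the same parameter; the intermediate choices are where the favorable computational properties come from.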
Implementing linear algebra routines on multicore processors with pipelining and a look ahead. LAPACK Working Note 178
 University of Tennessee
, 2006
"... Linear algebra algorithms commonly encapsulate parallelism in Basic Linear Algebra Subroutines (BLAS). This solution relies on the forkjoin model of parallel execution, which may result in suboptimal performance on current and future generations of multicore processors. To overcome the shortcoming ..."
Abstract

Cited by 24 (10 self)
Linear algebra algorithms commonly encapsulate parallelism in Basic Linear Algebra Subroutines (BLAS). This solution relies on the fork-join model of parallel execution, which may result in suboptimal performance on current and future generations of multicore processors. To overcome the shortcomings of this approach, a pipelined model of parallel execution is presented, and the idea of look-ahead is utilized to suppress the negative effects of the sequential formulation of the algorithms. Application to one-sided matrix factorizations (LU, Cholesky, and QR) is described. A shared-memory implementation using POSIX threads is presented.
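The benefit of look-ahead over fork-join can be seen in a toy cost model (ours, not the paper's): assume each of nb panels takes time tp on one core, and the trailing update behind panel k splits into nb − 1 − k independent block updates of cost tu each, spread over p cores.

```python
import math

def forkjoin_time(nb, p, tp, tu):
    """Fork-join: every panel serializes all p cores behind one of them."""
    t = 0.0
    for k in range(nb):
        t += tp                                   # panel: one core works, p-1 idle
        t += math.ceil((nb - 1 - k) / p) * tu     # update: p-way parallel
    return t

def lookahead_time(nb, p, tp, tu):
    """Pipelined: one core factors the next panel while the other p - 1
    cores finish the current trailing update, hiding the panel time."""
    t = tp                                        # only the first panel is exposed
    for k in range(nb - 1):
        update = math.ceil((nb - 1 - k) / (p - 1)) * tu
        t += max(update, tp)                      # next panel hides behind the update
    return t

nb, p, tp, tu = 16, 8, 2.0, 1.0
print(forkjoin_time(nb, p, tp, tu), lookahead_time(nb, p, tp, tu))  # prints 54.0 33.0
```

As long as there is enough update work to hide a panel, the sequential panel time drops out of the critical path — the effect the pipelined model is after.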
Communication and Matrix Computations on Large Message Passing Systems
, 1990
"... This paper is concerned with the consequences for matrix computations of having a rather large number of general purpose processors, say ten or twenty thousand, connected in a network in such a way that a processor can communicate only with its immediate neighbors. Certain communication tasks associ ..."
Abstract

Cited by 14 (0 self)
This paper is concerned with the consequences for matrix computations of having a rather large number of general purpose processors, say ten or twenty thousand, connected in a network in such a way that a processor can communicate only with its immediate neighbors. Certain communication tasks associated with most matrix algorithms are defined and formulas developed for the time required to perform them under several communication regimes. The results are compared with the times for a nominal n
Scheduling of QR factorization algorithms on SMP and multicore architectures
 IN PDP ’08: PROCEEDINGS OF THE SIXTEENTH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING
, 2008
"... This paper examines the scalable parallel implementation of QR factorization of a general matrix, targeting SMP and multicore architectures. Two implementations of algorithmsbyblocks are presented. Each implementation views a block of a matrix as the fundamental unit of data, and likewise, operat ..."
Abstract

Cited by 14 (9 self)
This paper examines the scalable parallel implementation of QR factorization of a general matrix, targeting SMP and multicore architectures. Two implementations of algorithms-by-blocks are presented. Each implementation views a block of a matrix as the fundamental unit of data and, likewise, operations over these blocks as the primary unit of computation. The first is a conventional blocked algorithm, similar to those included in libFLAME and LAPACK, but expressed in a way that allows operations in the so-called critical path of execution to be computed as soon as their dependencies are satisfied. The second algorithm captures a higher degree of parallelism with an approach based on Givens rotations while preserving the performance benefits of algorithms based on blocked Householder transformations. We show that the implementation effort is greatly simplified by expressing the algorithms in code with the FLAME/FLASH API, which allows matrices stored by blocks to be viewed and managed as matrices of matrix blocks. The SuperMatrix runtime system utilizes FLASH to assemble and represent matrices but also provides out-of-order scheduling of operations that is transparent to the programmer. Scalability of the solution is demonstrated on a ccNUMA platform with 16 processors.
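The "matrix of matrix blocks" view that FLASH provides can be sketched in miniature with NumPy (this sketch is ours, not libFLAME code): reshape an (m, n) array into a 2-D grid of nb x nb blocks, so that algorithms address whole blocks as their unit of data while writes still land in the underlying flat matrix.

```python
import numpy as np

def to_blocks(A, nb):
    """View an (m, n) array as an (m//nb, n//nb) grid of nb x nb blocks."""
    m, n = A.shape
    assert m % nb == 0 and n % nb == 0
    # reshape + swapaxes produce a view, so blocks alias the original storage
    return A.reshape(m // nb, nb, n // nb, nb).swapaxes(1, 2)

A = np.arange(36.0).reshape(6, 6)
B = to_blocks(A, 3)        # B[i, j] is the (i, j) block, a 3 x 3 view of A
B[1, 0][:] = 0.0           # operating on a block updates the flat matrix in place
print(B.shape[:2], float(A[3:, :3].sum()))  # prints (2, 2) 0.0
```

Storage-by-blocks in the real FLASH API goes further (blocks are contiguous in memory), but the addressing model is the same: an algorithm-by-blocks indexes B[i, j], not scalars.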
High performance algorithms for Toeplitz and block Toeplitz matrices
, 1996
"... In this paper, we present several high performance variants of the classical Schur algorithm to factor various Toeplitz matrices. For positive definite block Toeplitz matrices, we show how hyperbolic Householder transformations may be blocked to yield a block Schur algorithm. This algorithm uses BLA ..."
Abstract

Cited by 14 (5 self)
In this paper, we present several high performance variants of the classical Schur algorithm to factor various Toeplitz matrices. For positive definite block Toeplitz matrices, we show how hyperbolic Householder transformations may be blocked to yield a block Schur algorithm. This algorithm uses Level 3 BLAS primitives and makes efficient use of a memory hierarchy. We present three algorithms for indefinite Toeplitz matrices. Two of these are based on look-ahead strategies and produce an exact factorization of the Toeplitz matrix. The third produces an inexact factorization via perturbations of singular principal minors. We also present an analysis of the numerical behavior of the third algorithm and derive a bound for the number of iterations to improve the accuracy of the solution. For rank-deficient Toeplitz least-squares problems, we present a variant of the generalized Schur algorithm that avoids breakdown due to an exact rank deficiency. In the presence of a near rank deficiency, an approximate rank factorization of the Toeplitz matrix is produced. Finally, we suggest an algorithm to solve the normal equations resulting from a real Toeplitz least-squares problem based on transforming to Cauchy-like matrices. This algorithm exploits both realness and symmetry in the normal equations.
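The scalar classical Schur algorithm that this paper builds on can be sketched directly: for a symmetric positive definite Toeplitz matrix, O(n^2) flops produce the Cholesky factor from the first column alone, by applying hyperbolic rotations to a pair of generator vectors of the displacement T − ZTZ^T. (The paper's contribution — blocking the hyperbolic transformations for Level 3 BLAS — is not reproduced in this sketch.)

```python
import numpy as np

def toeplitz_cholesky(t):
    """Lower Cholesky factor L (T = L @ L.T) of the symmetric positive
    definite Toeplitz matrix with first column t, in O(n^2) flops."""
    n = len(t)
    u = np.asarray(t, dtype=float) / np.sqrt(t[0])   # generator pair of the
    v = u.copy(); v[0] = 0.0                         # displacement T - Z T Z^T
    L = np.zeros((n, n))
    L[:, 0] = u
    for k in range(1, n):
        u = np.concatenate(([0.0], u[:-1]))          # shift generator down one row
        rho = v[k] / u[k]                            # hyperbolic rotation that
        c = 1.0 / np.sqrt((1.0 - rho) * (1.0 + rho)) # re-zeroes v at position k
        u, v = c * (u - rho * v), c * (v - rho * u)
        L[:, k] = u                                  # rotated generator = column k
    return L

t = np.array([4.0, 1.0, 0.5, 0.25])
T = np.array([[t[abs(i - j)] for j in range(len(t))] for i in range(len(t))])
L = toeplitz_cholesky(t)
print(bool(np.allclose(L @ L.T, T)))   # prints True
```

Positive definiteness guarantees |rho| < 1 at every step; the indefinite case is where the breakdowns and look-ahead strategies of the abstract enter.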
Fortran 77 subroutines for computing the eigenvalues of Hamiltonian matrices II
, 2004
"... This article describes Fortran 77 subroutines for computing eigenvalues and invariant subspaces of Hamiltonian and skewHamiltonian matrices. The implemented algorithms are based on orthogonal symplectic decompositions, implying numerical backward stability as well as symmetry preservation for the c ..."
Abstract

Cited by 13 (4 self)
This article describes Fortran 77 subroutines for computing eigenvalues and invariant subspaces of Hamiltonian and skew-Hamiltonian matrices. The implemented algorithms are based on orthogonal symplectic decompositions, implying numerical backward stability as well as symmetry preservation for the computed eigenvalues. These algorithms are supplemented with balancing and block algorithms, which can lead to considerable accuracy and performance improvements. As a by-product, an efficient implementation for computing symplectic QR decompositions is provided. We demonstrate the usefulness of the subroutines for several practically relevant examples.
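The eigenvalue symmetry these routines preserve is easy to exhibit numerically (our illustration, not code from the article): a real Hamiltonian matrix H = [[A, G], [Q, −A^T]] with G, Q symmetric satisfies (JH)^T = JH for J = [[0, I], [−I, 0]], and its spectrum is symmetric under λ ↦ −λ. A structure-preserving algorithm keeps this pairing exact; a general eigensolver only matches it to rounding error, as checked below.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
G = rng.standard_normal((n, n)); G = G + G.T        # symmetric block
Q = rng.standard_normal((n, n)); Q = Q + Q.T        # symmetric block
H = np.block([[A, G], [Q, -A.T]])                   # real Hamiltonian matrix

ev = np.linalg.eigvals(H)
# every eigenvalue should have a partner at -lambda in the spectrum;
# an unstructured QR eigensolver matches the pair only up to rounding
pairing = max(min(abs(lam + mu) for mu in ev) for lam in ev)
print(pairing < 1e-6)
```

For real H the eigenvalues actually come in quadruples {λ, λ̄, −λ, −λ̄}; the ±λ check above is the part the symplectic structure adds beyond realness.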
Skew-Hamiltonian and Hamiltonian eigenvalue problems: Theory, algorithms and applications
 Proceedings of ApplMath03, Brijuni (Croatia)
"... SkewHamiltonian and Hamiltonian eigenvalue problems arise from a number of applications, particularly in systems and control theory. The preservation of the underlying matrix structures often plays an important role in these applications and may lead to more accurate and more efficient computation ..."
Abstract

Cited by 13 (6 self)
Skew-Hamiltonian and Hamiltonian eigenvalue problems arise from a number of applications, particularly in systems and control theory. The preservation of the underlying matrix structures often plays an important role in these applications and may lead to more accurate and more efficient computational methods. We will discuss the relation of structured and unstructured condition numbers for these problems as well as algorithms exploiting the given matrix structures. Applications of Hamiltonian and skew-Hamiltonian eigenproblems are briefly described.
Scaling LAPACK Panel Operations Using Parallel Cache Assignment
 PPOPP 2010
, 2010
"... In LAPACK many matrix operations are cast as block algorithms which iteratively process a panel using an unblocked algorithm and then update a remainder matrix using the high performance Level 3 BLAS. The Level 3 BLAS have excellent weak scaling, but panel processing tends to be bus bound, and thus ..."
Abstract

Cited by 12 (1 self)
In LAPACK, many matrix operations are cast as block algorithms which iteratively process a panel using an unblocked algorithm and then update a remainder matrix using the high performance Level 3 BLAS. The Level 3 BLAS have excellent weak scaling, but panel processing tends to be bus bound, and thus scales with bus speed rather than with the number of processors (p). Amdahl’s law therefore ensures that as p grows, the panel computation will become the dominant cost of these LAPACK routines. Our contribution is a novel parallel cache assignment approach which we show scales well with p. We apply this general approach to the QR and LU panel factorizations on two commodity 8-core platforms with very different cache structures, and demonstrate superlinear panel factorization speedups on both machines. Other approaches to this problem demand complicated reformulations of the computational approach, new kernels to be tuned, new mathematics, and an inflation of the high-order flop count, and do not perform as well. By demonstrating a straightforward alternative that avoids all of these contortions and scales with p, we address a critical stumbling block for dense linear algebra in the age of massive parallelism.
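The panel-plus-update structure the paper starts from can be sketched as an unpivoted right-looking blocked LU in NumPy (pivoting and the paper's parallel cache assignment are omitted; this is our illustration of the baseline, not the paper's code): each iteration factors a tall panel with an unblocked algorithm — the bus-bound part — and then updates the remainder with Level 3 BLAS-like operations.

```python
import numpy as np

def lu_unblocked(A):
    """Unpivoted LU of a panel, in place (the bus-bound 'panel' kernel)."""
    m, n = A.shape
    for j in range(min(m, n)):
        A[j+1:, j] /= A[j, j]                              # column of L
        A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:])  # rank-1 update
    return A

def lu_blocked(A, nb=2):
    """Right-looking blocked LU: unblocked panel, then Level 3 style update."""
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        lu_unblocked(A[k:, k:e])                           # panel factorization
        if e < n:
            L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
            A[k:e, e:] = np.linalg.solve(L11, A[k:e, e:])  # TRSM-like solve
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]           # GEMM trailing update
    return A

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6)) + 6.0 * np.eye(6)   # dominant diagonal: safe unpivoted
LU = lu_blocked(A.copy())
L = np.tril(LU, -1) + np.eye(6)
U = np.triu(LU)
print(bool(np.allclose(L @ U, A)))   # prints True
```

The GEMM update scales with p; `lu_unblocked` does not — which is exactly the Amdahl bottleneck that parallel cache assignment attacks.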
Parallel algorithms in linear algebra
 Computer Sciences Laboratory, ANU
, 1991
"... This paper provides an introduction to algorithms for fundamental linear algebra problems on various parallel computer architectures, with the emphasis on distributedmemory MIMD machines. To illustrate the basic concepts and key issues, we consider the problem of parallel solution of a nonsingular ..."
Abstract

Cited by 11 (3 self)
This paper provides an introduction to algorithms for fundamental linear algebra problems on various parallel computer architectures, with the emphasis on distributed-memory MIMD machines. To illustrate the basic concepts and key issues, we consider the problem of parallel solution of a nonsingular linear system by Gaussian elimination with partial pivoting. This problem has come to be regarded as a benchmark for the performance of parallel machines. We consider its appropriateness as a benchmark, its communication requirements, and schemes for data distribution to facilitate communication and load balancing.  In addition, we describe some parallel algorithms for orthogonal (QR) factorization and the singular value decomposition (SVD).
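The load-balancing issue behind the data-distribution schemes mentioned here can be shown with a small sketch (ours, for illustration): under a 1-D block distribution of columns, processors fall idle as Gaussian elimination retires their columns, while a cyclic (or block-cyclic) layout keeps the remaining work spread evenly.

```python
def owners(n, p, nb):
    """Owner of each of n columns under a block-cyclic layout with block
    size nb over p processors (nb = n // p: pure block; nb = 1: pure cyclic)."""
    return [(j // nb) % p for j in range(n)]

def remaining_load(layout, k, p):
    """How many still-active columns (j >= k) each processor owns once
    elimination has retired the first k columns."""
    load = [0] * p
    for j, owner in enumerate(layout):
        if j >= k:
            load[owner] += 1
    return load

n, p = 16, 4
print("block :", remaining_load(owners(n, p, n // p), n // 2, p))  # [0, 0, 4, 4]
print("cyclic:", remaining_load(owners(n, p, 1), n // 2, p))       # [2, 2, 2, 2]
```

Halfway through elimination, the block layout has already idled half the machine; the cyclic layout keeps all four processors loaded, at the cost of more fine-grained communication — the trade-off the paper examines.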