Results 1  10
of
74
Parallel tiled QR factorization for multicore architectures
, 2007
"... As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine grain parallelism becomes a major requ ..."
Abstract

Cited by 67 (35 self)
 Add to MetaCart
(Show Context)
As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out of order execution of the tasks which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithm for QR factorization where parallelism can only be exploited at the level of the BLAS operations.
A Framework for Symmetric Band Reduction
, 1999
"... this paper, we generalize the ideas behind the RSalgorithms and the MHLalgorithm. We develop a band reduction algorithm that eliminates d subdiagonals of a symmetric banded matrix with semibandwidth b (d < b), in a fashion akin to the MHL tridiagonalization algorithm. Then, like the Rutishauser ..."
Abstract

Cited by 28 (6 self)
 Add to MetaCart
this paper, we generalize the ideas behind the RSalgorithms and the MHLalgorithm. We develop a band reduction algorithm that eliminates d subdiagonals of a symmetric banded matrix with semibandwidth b (d < b), in a fashion akin to the MHL tridiagonalization algorithm. Then, like the Rutishauser algorithm, the band reduction algorithm is repeatedly used until the reduced matrix is tridiagonal. If d = b 1, it is the MHLalgorithm; and if d = 1 is used for each reduction step, it results in the Rutishauser algorithm. However, d need not be chosen this way; indeed, exploiting the freedom we have in choosing d leads to a class of algorithms for banded reduction and tridiagonalization with favorable computational properties. In particular, we can derive algorithms with
Implementing linear algebra routines on multicore processors with pipelining and a look ahead. LAPACK Working Note 178
 University of Tennessee
, 2006
"... Linear algebra algorithms commonly encapsulate parallelism in Basic Linear Algebra Subroutines (BLAS). This solution relies on the forkjoin model of parallel execution, which may result in suboptimal performance on current and future generations of multicore processors. To overcome the shortcoming ..."
Abstract

Cited by 25 (10 self)
 Add to MetaCart
(Show Context)
Linear algebra algorithms commonly encapsulate parallelism in Basic Linear Algebra Subroutines (BLAS). This solution relies on the forkjoin model of parallel execution, which may result in suboptimal performance on current and future generations of multicore processors. To overcome the shortcomings of this approach a pipelined model of parallel execution is presented, and the idea of the look ahead is utilized in order to suppress the negative effects of sequential formulation of the algorithms. Application to onesided matrix factorizations, LU, Cholesky and QR, is described. Shared memory implementation using POSIX threads is presented.
Scheduling dense linear algebra operations on multicore processors
 CONCURRENCY COMPUTAT.:PRACT EXPER
, 2010
"... ..."
High performance algorithms for Toeplitz and block Toeplitz matrices
, 1996
"... In this paper, we present several high performance variants of the classical Schur algorithm to factor various Toeplitz matrices. For positive definite block Toeplitz matrices, we show how hyperbolic Householder transformations may be blocked to yield a block Schur algorithm. This algorithm uses BLA ..."
Abstract

Cited by 16 (5 self)
 Add to MetaCart
(Show Context)
In this paper, we present several high performance variants of the classical Schur algorithm to factor various Toeplitz matrices. For positive definite block Toeplitz matrices, we show how hyperbolic Householder transformations may be blocked to yield a block Schur algorithm. This algorithm uses BLAS3 primitives and makes efficient use of a memory hierarchy. We present three algorithms for indefinite Toeplitz matrices. Two of these are based on lookahead strategies and produce an exact factorization of the Toeplitz matrix. The third produces an inexact faetorization via perturbations of singular principal minors. We also present an analysis of the numerical behavior of the third algorithm and derive a bound for the number of iterations to improve the accuracy of the solution. For rankdeficient Toeplitz leastsquares problems, we present a variant of the generalized Schur algorithm that avoids breakdown due to an exact rankdeficiency. In the presence of a near rankdeficiency, an approximate rank factorization of the Toeplitz matrix is produced. Finally, we suggest an algorithm to solve the normal equations resulting from a real Toeplitz leastsquares problem based on transforming to Cauehylike matrices. This algorithm exploits both realness and symmetry in the normal equations.
Scheduling of QR factorization algorithms on SMP and multicore architectures
 IN PDP ’08: PROCEEDINGS OF THE SIXTEENTH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORKBASED PROCESSING
, 2008
"... This paper examines the scalable parallel implementation of QR factorization of a general matrix, targeting SMP and multicore architectures. Two implementations of algorithmsbyblocks are presented. Each implementation views a block of a matrix as the fundamental unit of data, and likewise, operat ..."
Abstract

Cited by 15 (9 self)
 Add to MetaCart
This paper examines the scalable parallel implementation of QR factorization of a general matrix, targeting SMP and multicore architectures. Two implementations of algorithmsbyblocks are presented. Each implementation views a block of a matrix as the fundamental unit of data, and likewise, operations over these blocks as the primary unit of computation. The first is a conventional blocked algorithm similar to those included in libFLAME and LAPACK but expressed in a way that allows operations in the socalled critical path of execution to be computed as soon as their dependencies are satisfied. The second algorithm captures a higher degree of parallelism with an approach based on Givens rotations while preserving the performance benefits of algorithms based on blocked Householder transformations. We show that the implementation effort is greatly simplified by expressing the algorithms in code with the FLAME/FLASH API, which allows matrices stored by blocks to be viewed and managed as matrices of matrix blocks. The SuperMatrix runtime system utilizes FLASH to assemble and represent matrices but also provides outoforder scheduling of operations that is transparent to the programmer. Scalability of the solution is demonstrated on a ccNUMA platform with 16 processors.
Communication and Matrix Computations on Large Message Passing Systems
, 1990
"... This paper is concerned with the consequences for matrix computations of having a rather large number of general purpose processors, say ten or twenty thousand, connected in a network in such a way that a processor can communicate only with its immediate neighbors. Certain communication tasks associ ..."
Abstract

Cited by 14 (0 self)
 Add to MetaCart
This paper is concerned with the consequences for matrix computations of having a rather large number of general purpose processors, say ten or twenty thousand, connected in a network in such a way that a processor can communicate only with its immediate neighbors. Certain communication tasks associated with most matrix algorithms are defined and formulas developed for the time required to perform them under several communication regimes. The results are compared with the times for a nominal n³
SkewHamiltonian and Hamiltonian eigenvalue problems: Theory, algorithms and applications
 Proceedings of ApplMath03, Brijuni (Croatia
"... SkewHamiltonian and Hamiltonian eigenvalue problems arise from a number of applications, particularly in systems and control theory. The preservation of the underlying matrix structures often plays an important role in these applications and may lead to more accurate and more efficient computation ..."
Abstract

Cited by 13 (6 self)
 Add to MetaCart
SkewHamiltonian and Hamiltonian eigenvalue problems arise from a number of applications, particularly in systems and control theory. The preservation of the underlying matrix structures often plays an important role in these applications and may lead to more accurate and more efficient computational methods. We will discuss the relation of structured and unstructured condition numbers for these problems as well as algorithms exploiting the given matrix structures. Applications of Hamiltonian and skewHamiltonian eigenproblems are briefly described.
Fortran 77 subroutines for computing the eigenvalues of Hamiltonian matrices II
, 2004
"... This article describes Fortran 77 subroutines for computing eigenvalues and invariant subspaces of Hamiltonian and skewHamiltonian matrices. The implemented algorithms are based on orthogonal symplectic decompositions, implying numerical backward stability as well as symmetry preservation for the c ..."
Abstract

Cited by 12 (4 self)
 Add to MetaCart
This article describes Fortran 77 subroutines for computing eigenvalues and invariant subspaces of Hamiltonian and skewHamiltonian matrices. The implemented algorithms are based on orthogonal symplectic decompositions, implying numerical backward stability as well as symmetry preservation for the computed eigenvalues. These algorithms are supplemented with balancing and block algorithms, which can lead to considerable accuracy and performance improvements. As a byproduct, an efficient implementation for computing symplectic QR decompositions is provided. We demonstrate the usefulness of the subroutines for several, practically relevant examples.
Parallel algorithms in linear algebra
 Computer Sciences Laboratory, ANU
, 1991
"... This paper provides an introduction to algorithms for fundamental linear algebra problems on various parallel computer architectures, with the emphasis on distributedmemory MIMD machines. To illustrate the basic concepts and key issues, we consider the problem of parallel solution of a nonsingular ..."
Abstract

Cited by 12 (3 self)
 Add to MetaCart
(Show Context)
This paper provides an introduction to algorithms for fundamental linear algebra problems on various parallel computer architectures, with the emphasis on distributedmemory MIMD machines. To illustrate the basic concepts and key issues, we consider the problem of parallel solution of a nonsingular linear system by Gaussian elimination with partial pivoting. This problem has come to be regarded as a benchmark for the performance of parallel machines. We consider its appropriateness as a benchmark, its communication requirements, and schemes for data distribution to facilitate communication and load balancing. In addition, we describe some parallel algorithms for orthogonal (QR) factorization and the singular value decomposition (SVD). 1. Introduction – Gaussian