Parallel tiled QR factorization for multicore architectures
, 2007
Cited by 49 (26 self)
As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated, or new algorithms developed, to take advantage of the architectural features of these new processors. Fine-grained parallelism becomes a major requirement and introduces the need for loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization in which the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out-of-order execution of the tasks which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithm for QR factorization, where parallelism can only be exploited at the level of the BLAS operations.
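The task-based formulation described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the commonly described four-kernel decomposition of a tiled QR sweep, and the task labels (`GEQRT`, `APPLY`, `TSQRT`, `TSAPPLY`) and both function names are hypothetical. Dependencies are tracked conservatively at whole-tile granularity, and tasks sharing a level could run concurrently.

```python
from collections import defaultdict


def tiled_qr_tasks(p, q):
    """List the tasks of a tiled QR sweep over a p-by-q grid of tiles.

    Each task is (label, reads, writes); reads and writes are sets of
    tile coordinates. The labels are illustrative, not a library API.
    """
    tasks = []
    for k in range(min(p, q)):
        # Factor the diagonal tile.
        tasks.append((f"GEQRT({k})", set(), {(k, k)}))
        # Apply that factorization to the tiles to its right.
        for j in range(k + 1, q):
            tasks.append((f"APPLY({k},{j})", {(k, k)}, {(k, j)}))
        for i in range(k + 1, p):
            # Couple the diagonal tile with a tile below it.
            tasks.append((f"TSQRT({i},{k})", set(), {(k, k), (i, k)}))
            # Update the corresponding pair of tiles to the right.
            for j in range(k + 1, q):
                tasks.append(
                    (f"TSAPPLY({i},{k},{j})", {(i, k)}, {(k, j), (i, j)})
                )
    return tasks


def schedule_levels(tasks):
    """Assign each task the earliest level at which it may run.

    A task waits for the last writer of every tile it touches and for
    all readers of a tile it is about to overwrite.
    """
    last_writer = {}
    readers = defaultdict(list)
    level = []
    for idx, (_, reads, writes) in enumerate(tasks):
        deps = set()
        for tile in reads | writes:
            if tile in last_writer:
                deps.add(last_writer[tile])
        for tile in writes:
            deps.update(readers[tile])
        level.append(1 + max((level[d] for d in deps), default=-1))
        for tile in reads:
            readers[tile].append(idx)
        for tile in writes:
            last_writer[tile] = idx
            readers[tile] = []
    return level


if __name__ == "__main__":
    tasks = tiled_qr_tasks(3, 3)
    for (label, _, _), lvl in zip(tasks, schedule_levels(tasks)):
        print(lvl, label)
```

On a 3-by-3 grid the two `APPLY` tasks of the first step land on the same level, which is the kind of concurrency a dynamic scheduler can exploit; a real runtime would track dependencies at a finer granularity than whole tiles.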
A Framework for Symmetric Band Reduction
, 1999
Cited by 26 (6 self)
In this paper, we generalize the ideas behind the RS-algorithms and the MHL-algorithm. We develop a band reduction algorithm that eliminates d subdiagonals of a symmetric banded matrix with semibandwidth b (d < b), in a fashion akin to the MHL tridiagonalization algorithm. Then, like the Rutishauser algorithm, the band reduction algorithm is repeatedly used until the reduced matrix is tridiagonal. If d = b - 1, it is the MHL-algorithm; and if d = 1 is used for each reduction step, it results in the Rutishauser algorithm. However, d need not be chosen this way; indeed, exploiting the freedom we have in choosing d leads to a class of algorithms for banded reduction and tridiagonalization with favorable computational properties. In particular, we can derive algorithms with
Implementing linear algebra routines on multi-core processors with pipelining and a look ahead. LAPACK Working Note 178
- University of Tennessee
, 2006
Cited by 21 (9 self)
Linear algebra algorithms commonly encapsulate parallelism in Basic Linear Algebra Subroutines (BLAS). This solution relies on the fork-join model of parallel execution, which may result in suboptimal performance on current and future generations of multi-core processors. To overcome the shortcomings of this approach, a pipelined model of parallel execution is presented, and the idea of the look ahead is utilized in order to suppress the negative effects of the sequential formulation of the algorithms. Application to the one-sided matrix factorizations LU, Cholesky and QR is described. A shared memory implementation using POSIX threads is presented.
Communication and Matrix Computations on Large Message Passing Systems
, 1990
Cited by 14 (0 self)
This paper is concerned with the consequences for matrix computations of having a rather large number of general purpose processors, say ten or twenty thousand, connected in a network in such a way that a processor can communicate only with its immediate neighbors. Certain communication tasks associated with most matrix algorithms are defined and formulas developed for the time required to perform them under several communication regimes. The results are compared with the times for a nominal n
High performance algorithms for Toeplitz and block Toeplitz matrices
, 1996
Cited by 12 (5 self)
In this paper, we present several high performance variants of the classical Schur algorithm to factor various Toeplitz matrices. For positive definite block Toeplitz matrices, we show how hyperbolic Householder transformations may be blocked to yield a block Schur algorithm. This algorithm uses BLAS3 primitives and makes efficient use of a memory hierarchy. We present three algorithms for indefinite Toeplitz matrices. Two of these are based on look-ahead strategies and produce an exact factorization of the Toeplitz matrix. The third produces an inexact factorization via perturbations of singular principal minors. We also present an analysis of the numerical behavior of the third algorithm and derive a bound for the number of iterations to improve the accuracy of the solution. For rank-deficient Toeplitz least-squares problems, we present a variant of the generalized Schur algorithm that avoids breakdown due to an exact rank-deficiency. In the presence of a near rank-deficiency, an approximate rank factorization of the Toeplitz matrix is produced. Finally, we suggest an algorithm to solve the normal equations resulting from a real Toeplitz least-squares problem based on transforming to Cauchy-like matrices. This algorithm exploits both realness and symmetry in the normal equations.
Parallel algorithms in linear algebra
- Computer Sciences Laboratory, ANU
, 1991
Cited by 11 (3 self)
This paper provides an introduction to algorithms for fundamental linear algebra problems on various parallel computer architectures, with the emphasis on distributed-memory MIMD machines. To illustrate the basic concepts and key issues, we consider the problem of parallel solution of a nonsingular linear system by Gaussian elimination with partial pivoting. This problem has come to be regarded as a benchmark for the performance of parallel machines. We consider its appropriateness as a benchmark, its communication requirements, and schemes for data distribution to facilitate communication and load balancing. In addition, we describe some parallel algorithms for orthogonal (QR) factorization and the singular value decomposition (SVD).
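For readers unfamiliar with the benchmark problem named in this abstract, the following is a minimal sequential sketch of Gaussian elimination with partial pivoting. It is a textbook formulation rather than anything taken from the paper, it does not address the parallel data distribution the paper discusses, and the function name `solve_partial_pivoting` is hypothetical.

```python
def solve_partial_pivoting(a, b):
    """Solve a x = b for a nonsingular square matrix a (list of rows).

    Works on an augmented copy, so the inputs are left unchanged.
    """
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for k in range(n):
        # Partial pivoting: pick the row with the largest entry in column k.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        if m[p][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[p] = m[p], m[k]
        # Eliminate column k from the rows below the pivot.
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    # Back substitution on the resulting upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


if __name__ == "__main__":
    # The zero in the (0, 0) position forces a row interchange.
    print(solve_partial_pivoting([[0.0, 1.0], [1.0, 0.0]], [2.0, 3.0]))
```

The pivot search and the row interchange are the steps that require communication across processors in a distributed setting, which is why the choice of data distribution matters for this problem.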
Scheduling of QR factorization algorithms on SMP and multi-core architectures
- In PDP ’08: Proceedings of the Sixteenth Euromicro International Conference on Parallel, Distributed and Network-Based Processing
, 2008
Cited by 11 (8 self)
This paper examines the scalable parallel implementation of QR factorization of a general matrix, targeting SMP and multi-core architectures. Two implementations of algorithms-by-blocks are presented. Each implementation views a block of a matrix as the fundamental unit of data, and likewise, operations over these blocks as the primary unit of computation. The first is a conventional blocked algorithm similar to those included in libFLAME and LAPACK but expressed in a way that allows operations in the so-called critical path of execution to be computed as soon as their dependencies are satisfied. The second algorithm captures a higher degree of parallelism with an approach based on Givens rotations while preserving the performance benefits of algorithms based on blocked Householder transformations. We show that the implementation effort is greatly simplified by expressing the algorithms in code with the FLAME/FLASH API, which allows matrices stored by blocks to be viewed and managed as matrices of matrix blocks. The SuperMatrix run-time system utilizes FLASH to assemble and represent matrices but also provides out-of-order scheduling of operations that is transparent to the programmer. Scalability of the solution is demonstrated on a ccNUMA platform with 16 processors.
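As background for the Givens-based approach mentioned in this abstract, here is a minimal sketch of the scalar building block: reducing a small dense matrix to upper triangular form with Givens rotations. It is not the paper's algorithm-by-blocks, which combines this idea with blocked Householder transformations, and the function name `givens_qr` is hypothetical.

```python
import math


def givens_qr(a):
    """Return the upper triangular factor R of a (list of rows).

    Each rotation combines two adjacent rows to zero one subdiagonal
    entry. The input matrix is not modified and Q is not accumulated.
    """
    r = [row[:] for row in a]
    m, n = len(r), len(r[0])
    for j in range(min(m, n)):
        # Zero column j from the bottom row up to the diagonal.
        for i in range(m - 1, j, -1):
            x, y = r[i - 1][j], r[i][j]
            h = math.hypot(x, y)
            if h == 0.0:
                continue  # both entries are already zero
            c, s = x / h, y / h
            # Rotate rows i-1 and i; columns left of j are already zero.
            for k in range(j, n):
                top, bot = r[i - 1][k], r[i][k]
                r[i - 1][k] = c * top + s * bot
                r[i][k] = -s * top + c * bot
    return r


if __name__ == "__main__":
    for row in givens_qr([[3.0, 0.0], [4.0, 5.0]]):
        print(row)
```

Because each rotation touches only two rows, rotations acting on disjoint row pairs are independent of one another, which is the source of the extra parallelism the abstract refers to.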
Minimizing Communication in Numerical Linear Algebra
Cited by 11 (8 self)
Skew-Hamiltonian and Hamiltonian eigenvalue problems: Theory, algorithms and applications
- Proceedings of ApplMath03, Brijuni (Croatia
Cited by 10 (5 self)
Skew-Hamiltonian and Hamiltonian eigenvalue problems arise from a number of applications, particularly in systems and control theory. The preservation of the underlying matrix structures often plays an important role in these applications and may lead to more accurate and more efficient computational methods. We will discuss the relation of structured and unstructured condition numbers for these problems as well as algorithms exploiting the given matrix structures. Applications of Hamiltonian and skew-Hamiltonian eigenproblems are briefly described.
Block Algorithms for Orthogonal Symplectic Factorizations
- BIT
, 2002
Cited by 7 (5 self)
On the basis of a new WY-like representation, block algorithms for orthogonal symplectic matrix factorizations are presented. Special emphasis is placed on symplectic QR and URV factorizations. The block variants mainly use level 3 (matrix-matrix) operations that permit data reuse in the higher levels of a memory hierarchy. Timing results show that our new algorithms outperform standard algorithms by a factor of 3-4 for sufficiently large problems.

