Parallel tiled QR factorization for multicore architectures
, 2007
Cited by 66 (33 self)
As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine-grained parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out-of-order execution of the tasks, which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithm for QR factorization, where parallelism can only be exploited at the level of the BLAS operations.
What color is your Jacobian? Graph coloring for computing derivatives
 SIAM REV
, 2005
Cited by 41 (7 self)
Graph coloring has been employed since the 1980s to efficiently compute sparse Jacobian and Hessian matrices using either finite differences or automatic differentiation. Several coloring problems occur in this context, depending on whether the matrix is a Jacobian or a Hessian, and on the specifics of the computational techniques employed. We consider eight variant vertex-coloring problems here. This article begins with a gentle introduction to the problem of computing a sparse Jacobian, followed by an overview of the historical development of the research area. Then we present a unifying framework for the graph models of the variant matrix-estimation problems. The framework is based upon the viewpoint that a partition of a matrix into structurally orthogonal groups of columns corresponds to distance-2 coloring an appropriate graph representation. The unified framework helps integrate earlier work and leads to fresh insights; enables the design of more efficient algorithms for many problems; leads to new algorithms for others; and eases the task of building graph models for new problems. We report computational results on two of the coloring problems to support our claims. Most of the methods for these problems treat a column or a row of a matrix as an atomic entity, and partition the columns or rows (or both). A brief review of methods that do not fit these criteria is provided. We also discuss results in discrete mathematics and theoretical computer science that intersect with the topics considered here.
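To make "structurally orthogonal groups of columns" concrete, here is a minimal sketch of a greedy column grouping. It is an illustration only, not one of the algorithms from the article; the function name `group_columns` and the sparsity-pattern representation (one set of column indices per row) are assumptions made for this example. Two columns may share a group only if no row has a nonzero in both, so each group can be estimated with a single finite-difference or AD evaluation.

```python
def group_columns(pattern):
    """Greedily partition columns into structurally orthogonal groups.

    pattern: list of rows, each a set of column indices holding a nonzero.
    Returns group_of, where group_of[c] is the group index of column c.
    """
    n_cols = 1 + max((c for row in pattern for c in row), default=-1)

    # rows_of[c] = set of rows in which column c has a nonzero
    rows_of = [set() for _ in range(n_cols)]
    for r, row in enumerate(pattern):
        for c in row:
            rows_of[c].add(r)

    group_of = [None] * n_cols
    rows_used = []  # rows_used[g] = rows already touched by group g
    for c in range(n_cols):
        for g, used in enumerate(rows_used):
            # Column c fits in group g if it shares no row with the group.
            if used.isdisjoint(rows_of[c]):
                group_of[c] = g
                used |= rows_of[c]
                break
        else:
            # No existing group is compatible: open a new one.
            group_of[c] = len(rows_used)
            rows_used.append(set(rows_of[c]))
    return group_of


# Sparsity pattern of a 5x5 tridiagonal Jacobian.
tridiag = [{j for j in (i - 1, i, i + 1) if 0 <= j < 5} for i in range(5)]
groups = group_columns(tridiag)
print(groups)  # three groups suffice for a tridiagonal pattern
```

In this toy case five columns collapse into three groups, so three function evaluations recover every nonzero of the Jacobian instead of five.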
Programming matrix algorithms-by-blocks for thread-level parallelism
 ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE
Cited by 29 (18 self)
With the emergence of thread-level parallelism as the primary means for continued performance improvement, the programmability issue has reemerged as an obstacle to the use of architectural advances. We argue that evolving legacy libraries for dense and banded linear algebra is not a viable solution due to constraints imposed by early design decisions. We propose a philosophy of abstraction and separation of concerns that provides a promising solution in this problem domain. The first abstraction, FLASH, allows algorithms to express computation with matrices consisting of contiguous blocks, facilitating algorithms-by-blocks. Operand descriptions are registered for a particular operation a priori by the library implementor. A runtime system, SuperMatrix, uses this information to identify data dependencies between suboperations, allowing them to be scheduled to threads out-of-order and executed in parallel. But not all classical algorithms in linear algebra lend themselves to conversion to algorithms-by-blocks. We show how our recently proposed LU factorization with incremental pivoting and a closely related algorithm-by-blocks for the QR factorization, both originally designed for out-of-core computation, overcome this difficulty. Anecdotal evidence regarding the development of routines with a core functionality demonstrates how the methodology supports high productivity while experimental results suggest
Minimizing Communication in Linear Algebra
, 2009
Cited by 17 (8 self)
In 1981 Hong and Kung [HK81] proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and large, slow memory) needed to perform dense, n-by-n matrix multiplication using the conventional O(n^3) algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin [ITT04] gave a new proof of this result and extended it to the parallel case (where communication means the amount of data moved between processors). In both cases the lower bound may be expressed as Ω(#arithmetic operations / √M), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDL^T factorization, QR factorization, algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth) we get lower bounds on the number of messages required to move it (latency). We illustrate how to extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or if we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph theoretic problems. We point out recently designed algorithms for dense LU, Cholesky, QR, eigenvalue and the SVD problems that attain these lower bounds; implementations of LU and QR show large speedups over conventional linear algebra algorithms in standard libraries like LAPACK and ScaLAPACK. Many open problems remain.
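As a worked instance of the bandwidth bound quoted above, the following applies Ω(#arithmetic operations / √M) to conventional n-by-n matrix multiplication. The symbol W (words moved) and the sample values of n and M are assumptions chosen for illustration; the result is an order-of-magnitude estimate, since the bound hides a constant factor.

```latex
\[
W \;=\; \Omega\!\left(\frac{\#\text{arithmetic operations}}{\sqrt{M}}\right)
\quad\Longrightarrow\quad
W_{\text{matmul}} \;=\; \Omega\!\left(\frac{n^{3}}{\sqrt{M}}\right).
\]
\[
n = 10^{4},\quad M = 10^{6}\ \text{words}:
\qquad
\frac{n^{3}}{\sqrt{M}} \;=\; \frac{10^{12}}{10^{3}} \;=\; 10^{9}\ \text{words, up to a constant factor.}
\]
```

For comparison, the three matrices themselves hold only 3n^2 = 3 × 10^8 words in this example, so the bound indicates that data must be moved repeatedly rather than just once.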
Scheduling of QR factorization algorithms on SMP and multicore architectures
 IN PDP ’08: PROCEEDINGS OF THE SIXTEENTH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING
, 2008
Cited by 14 (9 self)
This paper examines the scalable parallel implementation of QR factorization of a general matrix, targeting SMP and multicore architectures. Two implementations of algorithms-by-blocks are presented. Each implementation views a block of a matrix as the fundamental unit of data, and likewise, operations over these blocks as the primary unit of computation. The first is a conventional blocked algorithm similar to those included in libFLAME and LAPACK but expressed in a way that allows operations in the so-called critical path of execution to be computed as soon as their dependencies are satisfied. The second algorithm captures a higher degree of parallelism with an approach based on Givens rotations while preserving the performance benefits of algorithms based on blocked Householder transformations. We show that the implementation effort is greatly simplified by expressing the algorithms in code with the FLAME/FLASH API, which allows matrices stored by blocks to be viewed and managed as matrices of matrix blocks. The SuperMatrix runtime system utilizes FLASH to assemble and represent matrices but also provides out-of-order scheduling of operations that is transparent to the programmer. Scalability of the solution is demonstrated on a ccNUMA platform with 16 processors.
Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks
 IN SPAA
, 2009
Cited by 13 (1 self)
This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and A^T x to be computed efficiently in parallel, where A is an n × n sparse matrix with nnz ≥ n nonzeros and x is a dense n-vector. Our algorithms use Θ(nnz) work (serial running time) and Θ(√n lg n) span (critical-path length), yielding a parallelism of Θ(nnz / (√n lg n)), which is amply high for virtually any large matrix. The storage requirement for CSB is essentially the same as that for the more-standard compressed-sparse-rows (CSR) format, for which computing Ax in parallel is easy but A^T x is difficult. Benchmark results indicate that on one processor, the CSB algorithms for Ax and A^T x run just as fast as the CSR algorithm for Ax, but the CSB algorithms also scale up linearly with processors until limited by off-chip memory bandwidth.
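The contrast the abstract draws for CSR can be seen in a small serial sketch. This shows the CSR baseline only, not the CSB format itself; the function names and the three-array layout (`row_ptr`, `col_idx`, `vals`) are the usual CSR convention, assumed here for illustration. In Ax each row writes a single output entry, so rows are independent; in A^T x each row scatters into many output entries, so rows processed concurrently could collide.

```python
def csr_matvec(row_ptr, col_idx, vals, x):
    """y = A x. Row i writes only y[i], so rows can run independently."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y


def csr_matvec_transpose(row_ptr, col_idx, vals, x, n_cols):
    """y = A^T x. Row i scatters into y[col_idx[k]], so different rows
    may update the same entry of y (a write conflict if run in parallel)."""
    y = [0.0] * n_cols
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[col_idx[k]] += vals[k] * x[i]
    return y


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]

ax = csr_matvec(row_ptr, col_idx, vals, x)
atx = csr_matvec_transpose(row_ptr, col_idx, vals, x, n_cols=3)
print(ax, atx)
```

Rows 0 and 2 of this matrix both touch columns 0 and 2, which is exactly the kind of shared update that makes the transpose product awkward to parallelize over rows.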
Introducing: The libflame Library for Dense Matrix Computations
Cited by 9 (6 self)
As part of the FLAME project, we have been diligently developing new methodologies for analyzing, designing, and implementing linear algebra libraries. While we did not know it when we started, these techniques appear to solve many of the programmability problems that now face us with the advent of multicore and many-core architectures. These efforts have culminated in a new library, libflame, which strives to replace similar libraries that date back to the late 20th century. With this paper, we introduce the scientific computing community to this library.
SuperMatrix out-of-order scheduling of matrix operations for SMP and multicore architectures
 SPAA ’07: Proceedings of the Nineteenth ACM Symposium on Parallelism in Algorithms and Architectures
Cited by 8 (4 self)
We discuss the high-performance parallel implementation and execution of dense linear algebra matrix operations on SMP architectures, with an eye towards multicore processors with many cores. We argue that traditional implementations, such as those incorporated in LAPACK, cannot be easily modified to render high performance as well as scalability on these architectures. The solution we propose is to arrange the data structures and algorithms so that matrix blocks become the fundamental units of data, and operations on these blocks become the fundamental units of computation, resulting in algorithms-by-blocks as opposed to the more traditional blocked algorithms. We show that this facilitates the adoption of techniques akin to dynamic scheduling and out-of-order execution usual in superscalar processors, which we name SuperMatrix Out-of-Order scheduling. Performance results on a 16-CPU Itanium2-based server are used to highlight opportunities and issues related to this new approach.
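The idea of treating block operations as tasks whose execution order is derived from data dependencies can be sketched with a toy model. This is not the SuperMatrix runtime or its API; the function names, the task tuples, and the choice of a 3 × 3 blocked Cholesky factorization as the example are all assumptions made for illustration. Each task lists the blocks it reads and writes, dependencies follow from read-after-write, write-after-write, and write-after-read conflicts, and tasks are then grouped into "waves" that could run concurrently.

```python
def build_dependencies(tasks):
    """tasks: list of (name, reads, writes) in program order, where reads
    and writes are sets of block ids. Returns deps, where deps[i] is the
    set of earlier task indices that task i must wait for."""
    deps = [set() for _ in tasks]
    for i, (_, reads_i, writes_i) in enumerate(tasks):
        for j in range(i):
            _, reads_j, writes_j = tasks[j]
            if (writes_j & reads_i           # read-after-write
                    or writes_j & writes_i   # write-after-write
                    or reads_j & writes_i):  # write-after-read
                deps[i].add(j)
    return deps


def schedule_in_waves(tasks):
    """Group tasks into waves; every task in a wave has all of its
    dependencies satisfied by earlier waves."""
    deps = build_dependencies(tasks)
    done, waves = set(), []
    while len(done) < len(tasks):
        ready = [i for i in range(len(tasks))
                 if i not in done and deps[i] <= done]
        waves.append([tasks[i][0] for i in ready])
        done.update(ready)
    return waves


# Block operations of a 3x3 blocked Cholesky factorization (lower triangle).
tasks = [
    ("k0:CHOL(0,0)", {"A00"}, {"A00"}),
    ("k0:TRSM(1,0)", {"A00", "A10"}, {"A10"}),
    ("k0:TRSM(2,0)", {"A00", "A20"}, {"A20"}),
    ("k0:SYRK(1,1)", {"A10", "A11"}, {"A11"}),
    ("k0:GEMM(2,1)", {"A10", "A20", "A21"}, {"A21"}),
    ("k0:SYRK(2,2)", {"A20", "A22"}, {"A22"}),
    ("k1:CHOL(1,1)", {"A11"}, {"A11"}),
    ("k1:TRSM(2,1)", {"A11", "A21"}, {"A21"}),
    ("k1:SYRK(2,2)", {"A21", "A22"}, {"A22"}),
    ("k2:CHOL(2,2)", {"A22"}, {"A22"}),
]
waves = schedule_in_waves(tasks)
for w in waves:
    print(w)
```

Waves are a simplification: a real out-of-order runtime would dispatch each task as soon as its own dependencies are met rather than waiting for a whole wave to finish.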
Seven at one stroke: Results from a cache-oblivious paradigm for scalable matrix algorithms
 In MSPC ’06: Proc. 2006 Wkshp. Memory System Performance and Correctness
, 2006
Cited by 7 (2 self)
A blossoming paradigm for block-recursive matrix algorithms is presented that, at once, attains excellent performance measured by time, TLB misses, L1 misses, L2 misses, paging to disk, scaling on distributed processors, and portability to multiple platforms. It provides a philosophy and tools that allow the programmer to deal with the memory hierarchy invisibly, from L1 and L2 to TLB, paging, and interprocessor communication. Used together, they provide a cache-oblivious style of programming. Plots are presented to support these claims on an implementation of Cholesky factorization crafted directly from the paradigm in C with a few intrinsic calls. The results in this paper focus on low-level performance, including the new Morton-hybrid representation to take advantage of hardware and compiler optimizations. In particular, this code beats Intel’s Matrix Kernel Library and matches AMD’s Core Math Library, losing a bit on L1 misses while winning decisively on TLB misses.
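For readers unfamiliar with Morton ordering, the following minimal sketch computes a plain Morton (Z-order) index by interleaving the bits of the row and column indices. It illustrates only the basic ordering; the Morton-hybrid representation mentioned in the abstract is not reproduced here, and the function name `morton_index` and the bit-assignment convention (row bits in odd positions) are assumptions for this example.

```python
def morton_index(row, col, bits=16):
    """Interleave the bits of (row, col) into a Z-order (Morton) index.

    Column bits land in even bit positions, row bits in odd positions,
    so elements of the same 2x2 (4x4, 8x8, ...) block stay contiguous.
    """
    z = 0
    for b in range(bits):
        z |= ((col >> b) & 1) << (2 * b)
        z |= ((row >> b) & 1) << (2 * b + 1)
    return z


# Visit a 4x4 matrix in Morton order.
order = sorted(((r, c) for r in range(4) for c in range(4)),
               key=lambda rc: morton_index(*rc))
print(order)
```

Because every aligned power-of-two sub-block occupies a contiguous index range, a recursive algorithm that descends into quadrants touches contiguous memory at every level of the hierarchy, which is the property a cache-oblivious layout relies on.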