Results 1 - 7 of 7
Programming matrix algorithms-by-blocks for thread-level parallelism
 ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE
Abstract

Cited by 46 (18 self)
With the emergence of thread-level parallelism as the primary means for continued performance improvement, the programmability issue has re-emerged as an obstacle to the use of architectural advances. We argue that evolving legacy libraries for dense and banded linear algebra is not a viable solution due to constraints imposed by early design decisions. We propose a philosophy of abstraction and separation of concerns that provides a promising solution in this problem domain. The first abstraction, FLASH, allows algorithms to express computation with matrices consisting of contiguous blocks, facilitating algorithms-by-blocks. Operand descriptions are registered for a particular operation a priori by the library implementor. A runtime system, SuperMatrix, uses this information to identify data dependencies between suboperations, allowing them to be scheduled to threads out-of-order and executed in parallel. But not all classical algorithms in linear algebra lend themselves to conversion to algorithms-by-blocks. We show how our recently proposed LU factorization with incremental pivoting and a closely related algorithm-by-blocks for the QR factorization, both originally designed for out-of-core computation, overcome this difficulty. Anecdotal evidence regarding the development of routines with a core functionality demonstrates how the methodology supports high productivity while experimental results suggest ...
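The abstract above describes a runtime that derives data dependencies between block suboperations from registered read/write operand descriptions. The following is a toy Python sketch of that dependency-analysis idea only; it is not the SuperMatrix API, and the function and block names are ours. Each task lists the blocks it reads and writes, and the runtime infers flow (read-after-write), output (write-after-write), and anti (write-after-read) dependencies, after which any ready task may run out of order on any thread.

```python
def build_dag(tasks):
    """tasks: list of (name, reads, writes) over block ids.
    Returns {task index: set of task indices it must wait for}."""
    deps = {i: set() for i in range(len(tasks))}
    last_writer, readers = {}, {}
    for i, (_, reads, writes) in enumerate(tasks):
        for b in reads:                          # RAW: wait for last writer
            if b in last_writer:
                deps[i].add(last_writer[b])
        for b in writes:
            if b in last_writer:                 # WAW: wait for last writer
                deps[i].add(last_writer[b])
            deps[i].update(readers.get(b, []))   # WAR: wait for pending readers
        for b in reads:
            readers.setdefault(b, []).append(i)
        for b in writes:
            last_writer[b] = i
            readers[b] = []
    return deps

# Task list for a 2x2 blocked Cholesky factorization (block names are ours):
tasks = [
    ("POTRF", ["A00"], ["A00"]),
    ("TRSM",  ["A00", "A10"], ["A10"]),
    ("SYRK",  ["A10", "A11"], ["A11"]),
    ("POTRF", ["A11"], ["A11"]),
]
deps = build_dag(tasks)  # SYRK must wait for TRSM, which must wait for POTRF
```

A scheduler holding this DAG can dispatch any task whose dependency set is satisfied, which is what allows out-of-order, parallel execution of the suboperations.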
Scaling LAPACK Panel Operations Using Parallel Cache Assignment
 PPOPP, 2010
Abstract

Cited by 12 (1 self)
In LAPACK many matrix operations are cast as block algorithms which iteratively process a panel using an unblocked algorithm and then update a remainder matrix using the high-performance Level 3 BLAS. The Level 3 BLAS have excellent weak scaling, but panel processing tends to be bus bound, and thus scales with bus speed rather than the number of processors (p). Amdahl’s law therefore ensures that as p grows, the panel computation will become the dominant cost of these LAPACK routines. Our contribution is a novel parallel cache assignment approach which we show scales well with p. We apply this general approach to the QR and LU panel factorizations on two commodity 8-core platforms with very different cache structures, and demonstrate superlinear panel factorization speedups on both machines. Other approaches to this problem demand complicated reformulations of the computational approach, new kernels to be tuned, new mathematics, an inflation of the high-order flop count, and do not perform as well. By demonstrating a straightforward alternative that avoids all of these contortions and scales with p, we address a critical stumbling block for dense linear algebra in the age of massive parallelism.
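The panel/update structure this abstract refers to can be seen in a small pure-Python sketch of a blocked right-looking LU (the function name is ours; pivoting is omitted for brevity, so this is illustrative only, not the paper's parallel cache assignment technique). Step 1 is the unblocked, bus-bound panel factorization the paper targets; step 3 is the trailing update that maps to the Level 3 BLAS.

```python
def lu_blocked(A, nb):
    """In-place LU (no pivoting) of square A, a list of row lists, block size nb."""
    n = len(A)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # 1) Unblocked factorization of the panel A[k:n, k:k+kb].
        for j in range(k, k + kb):
            for i in range(j + 1, n):
                A[i][j] /= A[j][j]
                for p in range(j + 1, k + kb):
                    A[i][p] -= A[i][j] * A[j][p]
        # 2) Triangular solve A12 := L11^{-1} A12 (L11 is unit lower triangular).
        for j in range(k + kb, n):
            for i in range(k, k + kb):
                for p in range(k, i):
                    A[i][j] -= A[i][p] * A[p][j]
        # 3) Trailing update A22 -= L21 @ U12 -- the GEMM-rich, scalable part.
        for i in range(k + kb, n):
            for j in range(k + kb, n):
                for p in range(k, k + kb):
                    A[i][j] -= A[i][p] * A[p][j]
    return A
```

As p grows, step 3 parallelizes well while step 1 does not, which is the Amdahl bottleneck the parallel cache assignment approach addresses.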
Code Generation and Optimization of Distributed-Memory Dense Linear Algebra Kernels
Abstract
Design by Transformation (DxT) is an approach to software development that encodes domain-specific programs as graphs and expert design knowledge as graph transformations. The goal of DxT is to mechanize the generation of highly optimized code. This paper demonstrates how DxT can be used to transform sequential specifications of an important set of Dense Linear Algebra (DLA) kernels, the level-3 Basic Linear Algebra Subprograms (BLAS3), into high-performing library routines targeting distributed-memory (cluster) architectures. Getting good BLAS3 performance for such platforms requires deep domain knowledge, so their implementations are manually coded by experts. Unfortunately, there are few such experts and developing the full variety of BLAS3 implementations takes a lot of repetitive effort. A prototype tool, DxTer, automates this tedious task. We explain how we build on previous work to represent loops and multiple loop-based algorithms in DxTer. Performance results on a BlueGene/P parallel supercomputer show that the generated code meets or beats implementations that are hand-coded by a human expert and outperforms the widely used ScaLAPACK library.
Automation in Dense Linear Algebra
Abstract
Abstract. In this article we look at the generation of libraries for dense linear algebra operations from a different perspective: instead of focusing on the (possibly automatic) optimization of a routine, we address the question “what would it take for a computer to mechanically (automatically) generate high-performance algorithms, much like a human expert?”. We will show that for a large class of operations, the mathematical description of the input and output variables represents the necessary and sufficient information for a symbolic system to generate loop-based algorithms as well as high-performance routines. Surprisingly, the generation process is entirely prescribed by a proof of correctness: taking Dijkstra’s advice, rather than starting from an algorithm and trying to prove its correctness, we systematically build one so that its correctness is guaranteed. This methodology is thus fundamentally different from a standard autotuning approach: while autotuning often performs a parameter optimization over a large search space, our methodology and system rely on symbolic and algebraic transformations and, in principle, can be used to produce cost and error analyses hand in hand with the generated algorithms.
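The "correctness by construction" idea the abstract invokes can be shown on a deliberately tiny example (ours, not the FLAME notation or any operation from the paper): pick the loop invariant first, then let the loop body be whatever re-establishes it, so the finished loop is correct by construction rather than proved after the fact.

```python
def matvec_from_invariant(A, x):
    """Compute y = A @ x, with the chosen loop invariant checked each iteration."""
    m, n = len(A), len(x)
    y = [0.0] * m
    for k in range(m + 1):
        # Invariant chosen up front: y[:k] already equals A[:k] @ x.
        assert all(abs(y[i] - sum(A[i][j] * x[j] for j in range(n))) < 1e-12
                   for i in range(k))
        if k < m:
            # The body is forced: it must extend the invariant from k to k + 1.
            y[k] = sum(A[k][j] * x[j] for j in range(n))
    return y
```

A symbolic system, given only the mathematical description of the inputs and outputs, can enumerate such invariants and derive one loop body per invariant, which is how multiple correct algorithmic variants fall out of the same specification.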
High-Performance Up-and-Downdating via Householder-like Transformations
Abstract
We present high-performance algorithms for up-and-downdating a Cholesky factor or QR factorization. The method uses Householder-like transformations, sometimes called hyperbolic Householder transformations, that are accumulated so that most computation can be cast in terms of high-performance matrix-matrix operations. The resulting algorithms can then be used as building blocks for an algorithm-by-blocks that allows computation to be conveniently scheduled to multithreaded architectures like multicore processors. Performance is shown to be similar to that achieved by a blocked QR factorization via Householder transformations.
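As a minimal unblocked sketch of the idea behind such hyperbolic transformations (ours, not the paper's accumulated blocked method), the standard downdate of an upper-triangular Cholesky factor R of A to a factor of A - z z^T applies 2x2 hyperbolic rotations H satisfying H^T J H = J with J = diag(1, -1); these are the scalar special case, and the paper's contribution is accumulating such transformations into matrix-matrix form.

```python
import math

def chol_downdate(R, z):
    """Given upper-triangular R with R^T R = A, return R' with R'^T R' = A - z z^T."""
    n = len(z)
    R = [row[:] for row in R]
    z = list(z)
    for k in range(n):
        t = z[k] / R[k][k]
        if abs(t) >= 1.0:
            raise ValueError("A - z z^T is not positive definite")
        c = 1.0 / math.sqrt(1.0 - t * t)   # cosh-like entry of the rotation
        s = t * c                          # sinh-like entry; c*c - s*s == 1
        for j in range(k, n):
            rkj = R[k][j]
            R[k][j] = c * rkj - s * z[j]   # new row k of R
            z[j] = -s * rkj + c * z[j]     # z[k] is driven to exactly zero
    return R
```

The rotation preserves R^T R - z z^T at every step, which is why the final triangular factor is exactly the downdated Cholesky factor; updating (adding z z^T) uses ordinary orthogonal rotations instead and never fails.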