Results 1  10
of
97
Minimizing Communication in Sparse Matrix Solvers
"... Data communication within the memory system of a single processor node and between multiple nodes in a system is the bottleneck in many iterative sparse matrix solvers like CG and GMRES. Here k iterations of a conventional implementation perform k sparsematrixvectormultiplications and Ω(k) vecto ..."
Abstract

Cited by 35 (10 self)
 Add to MetaCart
(Show Context)
Data communication within the memory system of a single processor node and between multiple nodes in a system is the bottleneck in many iterative sparse matrix solvers like CG and GMRES. Here k iterations of a conventional implementation perform k sparsematrixvectormultiplications and Ω(k) vector operations like dot products, resulting in communication that grows by a factor of Ω(k) in both the memory and network. By reorganizing the sparsematrix kernel to compute a set of matrixvector products at once and reorganizing the rest of the algorithm accordingly, we can perform k iterations by sending O(log P) messages instead of O(k · log P) messages on a parallel machine, and reading the matrix A from DRAM to cache just once, instead of k times on a sequential machine. This reduces communication to the minimum possible. We combine these techniques to form a new variant of GMRES. Our sharedmemory implementation on an 8core Intel Clovertown gets speedups of up to 4.3 × over standard GMRES, without sacrificing convergence rate or numerical stability. 1.
QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators
"... Abstract—One of the major trends in the design of exascale architectures is the use of multicore nodes enhanced with GPU accelerators. Exploiting all resources of a hybrid acceleratorsbased node at their maximum potential is thus a fundamental step towards exascale computing. In this article, we pre ..."
Abstract

Cited by 27 (13 self)
 Add to MetaCart
(Show Context)
Abstract—One of the major trends in the design of exascale architectures is the use of multicore nodes enhanced with GPU accelerators. Exploiting all resources of a hybrid acceleratorsbased node at their maximum potential is thus a fundamental step towards exascale computing. In this article, we present the design of a highly efficient QR factorization for such a node. Our method is in three steps. The first step consists of expressing the QR factorization as a sequence of tasks of well chosen granularity that will aim at being executed on a CPU core or a GPU. We show that we can efficiently adapt highlevel algorithms from the literature that were initially designed for homogeneous multicore architectures. The second step consists of designing the kernels that implement each individual task. We use CPU kernels from previous work and present new kernels for GPUs that complement kernels already
Minimizing Communication in Linear Algebra
, 2009
"... In 1981 Hong and Kung [HK81] proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and large, slow memory) needed to perform dense, nbyn matrixmultiplication using the conventional O(n 3) algorithm, where the input matrices were too large to fit in ..."
Abstract

Cited by 25 (9 self)
 Add to MetaCart
In 1981 Hong and Kung [HK81] proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and large, slow memory) needed to perform dense, nbyn matrixmultiplication using the conventional O(n 3) algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin [ITT04] gave a new proof of this result and extended it to the parallel case (where communication means the amount of data moved between processors). In both cases the lower bound may be expressed as Ω(#arithmetic operations / √ M), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDL T factorization, QR factorization, algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth) we get lower bounds on the number of messages required to move it (latency). We illustrate how to extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or if we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph theoretic problems. We point out recently designed algorithms for dense LU, Cholesky, QR, eigenvalue and the SVD problems that attain these lower bounds; implementations of LU and QR show large speedups over conventional linear algebra algorithms in standard libraries like LAPACK and ScaLAPACK. Many open problems remain. 1
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
"... Previous studies have reported that common dense linear algebra operations do not achieve speed up by using multiple geographical sites of a computational grid. Because such operations are the building blocks of most scientific applications, conventional supercomputers are still strongly predominant ..."
Abstract

Cited by 20 (7 self)
 Add to MetaCart
(Show Context)
Previous studies have reported that common dense linear algebra operations do not achieve speed up by using multiple geographical sites of a computational grid. Because such operations are the building blocks of most scientific applications, conventional supercomputers are still strongly predominant in highperformance computing and the use of grids for speeding up largescale scientific problems is limited to applications exhibiting parallelism at a higher level. We have identified two performance bottlenecks in the distributed memory algorithms implemented in ScaLAPACK, a stateoftheart dense linear algebra library. First, because ScaLAPACK assumes a homogeneous communication network, the implementations of ScaLAPACK algorithms lack locality in their communication pattern. Second, the number of messages sent in the ScaLAPACK algorithms is significantly greater than other algorithms that trade flops for communication. In this paper, we present a new approach for computing a QR factorization – one of the main dense linear algebra kernels – of tall and skinny matrices in a grid computing environment that overcomes these two bottlenecks. Our contribution is to articulate a recently proposed algorithm (Communication Avoiding QR) with a topologyaware middleware (QCGOMPI) in order to confine intensive communications (ScaLAPACK calls) within the different geographical sites. An experimental study conducted on the Grid’5000 platform shows that the resulting performance increases linearly with the number of geographical sites on largescale problems (and is in particular consistently higher than ScaLAPACK’s).
Communicationavoiding QR decomposition for
 GPU,” GPU Technology Conference, Research Poster A01
, 2010
"... Abstract—We describe an implementation of the CommunicationAvoiding QR (CAQR) factorization that runs entirely on a single graphics processor (GPU). We show that the reduction in memory traffic provided by CAQR allows us to outperform existing parallel GPU implementations of QR for a large class of ..."
Abstract

Cited by 19 (2 self)
 Add to MetaCart
(Show Context)
Abstract—We describe an implementation of the CommunicationAvoiding QR (CAQR) factorization that runs entirely on a single graphics processor (GPU). We show that the reduction in memory traffic provided by CAQR allows us to outperform existing parallel GPU implementations of QR for a large class of tallskinny matrices. Other GPU implementations of QR handle panel factorizations by either sending the work to a generalpurpose processor or using entirely bandwidthbound operations, incurring data transfer overheads. In contrast, our QR is done entirely on the GPU using computebound kernels, meaning performance is good regardless of the width of the matrix. As a result, we outperform CULA, a parallel linear algebra library for GPUs by up to 17x for tallskinny matrices and Intel’s Math Kernel Library (MKL) by up to 12x. We also discuss stationary video background subtraction as a motivating application. We apply a recent statistical approach, which requires many iterations of computing the singular value decomposition of a tallskinny matrix. Using CAQR as a first step to getting the singular value decomposition, we are able to get the answer 3x faster than if we use a traditional bandwidthbound GPU QR factorization tuned specifically for that matrix size, and 30x faster than if we use Intel’s Math Kernel Library (MKL) singular value decomposition routine on a multicore
Tile QR Factorization with Parallel Panel Processing for Multicore Architectures
, 2009
"... To exploit the potential of multicore architectures, recent dense linear algebra libraries have used tile algorithms, which consist in scheduling a Directed Acyclic Graph (DAG) of tasks of fine granularity where nodes represent tasks, either panel factorization or update of a blockcolumn, and edges ..."
Abstract

Cited by 16 (9 self)
 Add to MetaCart
(Show Context)
To exploit the potential of multicore architectures, recent dense linear algebra libraries have used tile algorithms, which consist in scheduling a Directed Acyclic Graph (DAG) of tasks of fine granularity where nodes represent tasks, either panel factorization or update of a blockcolumn, and edges represent dependencies among them. Although past approaches already achieve high performance on moderate and large square matrices, their way of processing a panel in sequence leads to limited performance when factorizing tall and skinny matrices or small square matrices. We present a new fully asynchronous method for computing a QR factorization on sharedmemory multicore architectures that overcomes this bottleneck. Our contribution is to adapt an existing algorithm that performs a panel factorization in parallel (named CommunicationAvoiding QR and initially designed for distributedmemory machines), to the context of tile algorithms using asynchronous computations. An experimental study shows significant improvement (up to almost 10 times faster) compared to stateoftheart approaches. We aim to eventually incorporate this work into the Parallel Linear Algebra for Scalable Multicore Architectures (PLASMA) library. 1