Results 1–10 of 17
Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems
, 2008
Cited by 30 (12 self)
Abstract. If multicore is a disruptive technology, try to imagine hybrid multicore systems enhanced with accelerators! This is happening today as accelerators, in particular Graphics Processing Units (GPUs), are steadily making their way into the high performance computing (HPC) world. We highlight the trends leading to the idea of hybrid manycore/GPU systems, and we present a set of techniques that can be used to efficiently program them. The presentation is in the context of Dense Linear Algebra (DLA), a major building block for many scientific computing applications. We motivate the need for new algorithms that would split the computation in a way that would fully exploit the power that each of the hybrid components offers. As the area of hybrid multicore/GPU computing is still in its infancy, we also argue for its importance in view of what future architectures may look like. We therefore envision the need for a DLA library similar to LAPACK but for hybrid manycore/GPU systems. We illustrate the main ideas with an LU factorization algorithm where particular techniques are used to reduce the amount of pivoting, resulting in an algorithm achieving up to 388 GFlop/s for single and up to 99.4 GFlop/s for double precision factorization on a hybrid Intel Xeon …
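The CPU/GPU split the abstract describes can be sketched in a few lines: the narrow, latency-bound panel factorization suits the multicore CPU, while the large trailing matrix-matrix update (GEMM) suits the GPU. The following is a minimal NumPy sketch of that blocked structure (not the paper's implementation; the block size `nb` and the no-pivoting simplification are assumptions here, and both parts run on the CPU):

```python
import numpy as np

def blocked_lu(A, nb=64):
    """Blocked right-looking LU without pivoting.

    The unblocked panel step plays the role of the CPU-side work in a
    hybrid scheme; the triangular solve plus GEMM trailing update is
    the throughput-bound part a hybrid code would run on the GPU.
    """
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # Panel: unblocked LU of the tall-skinny block A[k:, k:e]
        for j in range(k, e):
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
        # Trailing update: U12 = L11^{-1} A12, then a large GEMM
        L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
        A[k:e, e:] = np.linalg.solve(L11, A[k:e, e:])
        A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A  # L in the strict lower triangle, U in the upper
```

Hybrid libraries built on this idea (e.g. MAGMA) overlap the next panel on the CPU with the current trailing update on the GPU.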
Improving communication performance in dense linear algebra via topology aware collectives
, 2011
Accelerating the reduction to upper Hessenberg form through hybrid GPU-based computing
, 2009
Cited by 10 (8 self)
We present a Hessenberg reduction (HR) algorithm for hybrid multicore + GPU systems that gets more than a 16× performance improvement over the current LAPACK algorithm running on current multicores alone (in double precision arithmetic). This enormous acceleration is due to proper matching of algorithmic requirements to the architectural strengths of the hybrid components. The reduction itself is an important linear algebra problem, especially given its relevance to eigenvalue problems. The results described in this paper are significant because Hessenberg reduction has not yet been accelerated on multicore architectures, and it plays a significant role in solving nonsymmetric eigenvalue problems. The approach can be applied to the symmetric problem and, in general, to two-sided matrix transformations. The work further motivates and highlights the strengths of hybrid computing: harnessing the strengths of the components of a hybrid architecture to achieve computational acceleration that would otherwise be impossible.
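The two-sided transformation the abstract refers to applies each Householder reflector from both the left and the right, which is what makes the reduction harder to block and accelerate than one-sided factorizations. A minimal textbook sketch of the unblocked reduction (this is the baseline algorithm, not the paper's hybrid scheme; in the hybrid version the dominant matrix-vector products are offloaded to the GPU):

```python
import numpy as np

def hessenberg(A):
    """Unblocked Householder reduction to upper Hessenberg form H = Q^T A Q.

    Each step zeroes one column below the first subdiagonal with a
    reflector P = I - 2 v v^T, applied from the left AND the right,
    so the spectrum of A is preserved.
    """
    H = A.astype(float).copy()
    n = H.shape[0]
    for k in range(n - 2):
        x = H[k+1:, k]
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])
        nv = np.linalg.norm(v)
        if nv == 0:
            continue  # column already reduced
        v /= nv
        # Left update on rows k+1: (columns < k in those rows are already zero)
        H[k+1:, k:] -= 2.0 * np.outer(v, v @ H[k+1:, k:])
        # Right update on columns k+1: (does not disturb the new zeros)
        H[:, k+1:] -= 2.0 * np.outer(H[:, k+1:] @ v, v)
    return H
```

Because the right update touches full columns, roughly half the flops are matrix-vector products that cannot be cast as GEMM, which is precisely the part the hybrid algorithm maps to the GPU's memory bandwidth.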
Brief Announcement: Lower Bounds on Communication for Sparse Cholesky Factorization of a Model Problem
Cited by 4 (3 self)
Previous work has shown that a lower bound on the number of words moved between large, slow memory and small, fast memory of size M by any conventional (non-Strassen-like) direct linear algebra algorithm (matrix multiply, the LU, Cholesky, and QR factorizations, ...) is Ω(#flops / √M). This holds for dense or sparse matrices. There are analogous lower bounds for the number of messages, and for parallel algorithms instead of sequential algorithms. Our goal here is to find algorithms that attain these lower bounds on interesting classes of sparse matrices. We focus on matrices for which there is a lower bound on the number of flops of their Cholesky factorization. Our Cholesky lower bounds on communication hold for any possible ordering of the rows and columns of the matrix, and so are globally optimal in this sense. For matrices arising from discretization on two- and three-dimensional regular grids, we discuss sequential and parallel algorithms that are optimal in terms of communication. The algorithms turn out to require combining previously known sparse and dense Cholesky algorithms in simple ways.
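The bound quoted above is Ω(#flops / √M): words moved grow linearly with the flop count and shrink only as the square root of the fast-memory size. A small arithmetic sketch (constant factors are omitted, as the bound is asymptotic; the problem sizes below are illustrative assumptions):

```python
import math

def word_lower_bound(flops, M):
    """Leading term of the communication lower bound: any conventional
    direct linear algebra algorithm doing `flops` useful flops with a
    fast memory of M words must move on the order of flops / sqrt(M)
    words between fast and slow memory (constants omitted)."""
    return flops / math.sqrt(M)

# Dense n x n matrix multiply performs about 2 n^3 flops.  With a fast
# memory of M = 4096 double-precision words (32 KiB):
n, M = 4096, 4096
print(word_lower_bound(2 * n**3, M))  # -> 2147483648.0 words
```

Doubling M only reduces the bound by a factor of √2, which is why communication-optimal algorithms focus on reorganizing the computation rather than simply enlarging caches.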
LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version
Cited by 3 (2 self)
Optimizing Halley’s Iteration for Computing the Matrix Polar Decomposition
Cited by 1 (1 self)
Abstract. We introduce a dynamically weighted Halley (DWH) iteration for computing the polar decomposition of a matrix, and prove that the new method is globally and asymptotically cubically convergent. For matrices with condition number no greater than 10^16, the DWH method needs at most 6 iterations to converge to the tolerance 10^-16. The Halley iteration can be implemented via QR decompositions without explicit matrix inversions. It is therefore an inverse-free, communication-friendly algorithm for the emerging multicore and hybrid high performance computing systems.
Key words. Polar decomposition, Halley’s iteration, Newton’s iteration, inverse-free iterations, QR decomposition, numerical stability
AMS subject classifications. 15A15, 15A23, 65F30
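The fixed-weight Halley recurrence underlying DWH is short enough to sketch. The following NumPy version uses the plain, unweighted iteration with an explicit inverse for clarity; the paper's contribution, the dynamic weights and the QR-based, inversion-free formulation, is not reproduced here:

```python
import numpy as np

def polar_halley(A, tol=1e-12, maxit=50):
    """Unitary polar factor U of A = U H via the plain Halley iteration
    X <- X (3I + X^T X)(I + 3 X^T X)^{-1}, for square nonsingular A.

    Each singular value sigma of X is mapped to
    sigma (3 + sigma^2) / (1 + 3 sigma^2), which converges cubically
    to 1, so X converges to the orthogonal polar factor.
    """
    X = A / np.linalg.norm(A, 2)       # scale so singular values are <= 1
    I = np.eye(A.shape[1])
    for _ in range(maxit):
        G = X.T @ X
        Xn = X @ (3 * I + G) @ np.linalg.inv(I + 3 * G)
        if np.linalg.norm(Xn - X) <= tol * np.linalg.norm(Xn):
            return Xn
        X = Xn
    return X
```

Once U is known, the Hermitian factor is recovered as H = U^T A; the QR-based variant in the paper avoids `inv` entirely, which is what makes it communication-friendly.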
Communication Avoiding and Overlapping for Numerical Linear Algebra
Cited by 1 (0 self)
Abstract—To efficiently scale dense linear algebra problems to future exascale systems, communication cost must be avoided or overlapped. Communication-avoiding 2.5D algorithms improve scalability by reducing interprocessor data transfer volume at the cost of extra memory usage. Communication overlap attempts to hide messaging latency by pipelining messages and overlapping them with computational work. We study the interaction and compatibility of these two techniques for two matrix multiplication algorithms (Cannon and SUMMA), triangular solve, and Cholesky factorization. For each algorithm, we construct a detailed performance model that considers both critical path dependencies and idle time. We give novel implementations of 2.5D algorithms with overlap for each of these problems. Our software employs UPC, a partitioned global address space (PGAS) language that provides fast one-sided communication. We show that communication avoidance and overlap provide a cumulative benefit as core counts scale, including results using over 24K cores of a Cray XE6 system.
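The memory-for-communication trade-off of 2.5D algorithms can be seen from the standard leading-order model for matrix multiply: a 2D algorithm such as Cannon or SUMMA moves about 2n²/√p words per processor, and keeping c replicated copies of the data cuts this by a factor of √c. A small sketch of that model arithmetic (leading terms only; constants and the specific sizes below are illustrative assumptions):

```python
import math

def words_per_proc_2d(n, p):
    """Leading-order per-processor communication volume of a 2D
    matrix-multiply algorithm (Cannon/SUMMA): ~ 2 n^2 / sqrt(p)."""
    return 2 * n * n / math.sqrt(p)

def words_per_proc_25d(n, p, c):
    """2.5D variant with c replicas of the matrices: volume drops to
    ~ 2 n^2 / sqrt(c p), at the cost of c times the memory footprint."""
    return 2 * n * n / math.sqrt(c * p)

# With c = 4 replicas, communication volume is halved (sqrt(4) = 2):
n, p = 1 << 15, 1 << 12
print(words_per_proc_2d(n, p) / words_per_proc_25d(n, p, c=4))  # -> 2.0
```

Overlap is complementary: it does not reduce this volume, but hides part of its latency behind computation, which is why the paper finds the two techniques give a cumulative benefit.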
Leadership Computing Facility
"... eliminating load imbalance in massively parallel contractions ..."