Results 1–10 of 20
Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems
, 2008
Cited by 33 (12 self)
Abstract. If multicore is a disruptive technology, try to imagine hybrid multicore systems enhanced with accelerators! This is happening today as accelerators, in particular Graphics Processing Units (GPUs), are steadily making their way into the high performance computing (HPC) world. We highlight the trends leading to the idea of hybrid manycore/GPU systems, and we present a set of techniques that can be used to efficiently program them. The presentation is in the context of Dense Linear Algebra (DLA), a major building block for many scientific computing applications. We motivate the need for new algorithms that would split the computation in a way that would fully exploit the power that each of the hybrid components offers. As the area of hybrid multicore/GPU computing is still in its infancy, we also argue for its importance in view of what future architectures may look like. We therefore envision the need for a DLA library similar to LAPACK but for hybrid manycore/GPU systems. We illustrate the main ideas with an LU factorization algorithm where particular techniques are used to reduce the amount of pivoting, resulting in an algorithm achieving up to 388 GFlop/s for single and up to 99.4 GFlop/s for double precision factorization on a hybrid Intel Xeon …
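The CPU/GPU split described above follows the shape of a blocked LU factorization: the small, latency-bound panel factorizations suit the multicore host, while the large trailing-matrix update is a GEMM that suits the GPU. A minimal serial NumPy sketch of that structure (pivoting omitted for brevity, so it assumes a matrix whose leading minors are nonsingular; the paper's actual hybrid algorithm and reduced-pivoting strategy differ):

```python
import numpy as np

def blocked_lu(A, nb=2):
    """Right-looking blocked LU without pivoting (illustration only)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        end = min(k + nb, n)
        # "Panel": unblocked LU on the current block column (CPU-style work).
        for j in range(k, end):
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:end] -= np.outer(A[j+1:, j], A[j, j+1:end])
        # Block row of U: U12 = L11^{-1} A12.
        L11 = np.tril(A[k:end, k:end], -1) + np.eye(end - k)
        A[k:end, end:] = np.linalg.solve(L11, A[k:end, end:])
        # Trailing update: the flop-dominant GEMM (GPU-style work).
        A[end:, end:] -= A[end:, k:end] @ A[k:end, end:]
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return L, U
```

In a real hybrid code the panel loop would run on the CPU while the trailing-matrix GEMM is offloaded to the GPU and overlapped with the next panel.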
Improving communication performance in dense linear algebra via topology aware collectives
, 2011
Accelerating the reduction to upper Hessenberg form through hybrid GPU-based computing
, 2009
Cited by 10 (8 self)
We present a Hessenberg reduction (HR) algorithm for hybrid multicore + GPU systems that gets more than a 16× performance improvement over the current LAPACK algorithm running just on current multicores (in double precision arithmetic). This enormous acceleration is due to proper matching of algorithmic requirements to architectural strengths of the hybrid components. The reduction itself is an important linear algebra problem, especially through its relevance to eigenvalue problems. The results described in this paper are significant because Hessenberg reduction has not yet been accelerated on multicore architectures, and it plays a significant role in solving nonsymmetric eigenvalue problems. The approach can be applied to the symmetric problem and, in general, to two-sided matrix transformations. The work further motivates and highlights the strengths of hybrid computing: harnessing the strengths of the components of a hybrid architecture to achieve significant computational acceleration that would otherwise be impossible.
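For reference, Hessenberg reduction takes a general matrix to upper Hessenberg form (zero below the first subdiagonal) by an orthogonal similarity, preserving its eigenvalues. A small SciPy illustration of what the routine computes (not the paper's hybrid algorithm):

```python
import numpy as np
from scipy.linalg import hessenberg

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))

# H is upper Hessenberg and orthogonally similar to A (A = Q H Q^T),
# so it has the same eigenvalues as A.
H, Q = hessenberg(A, calc_q=True)

assert np.allclose(np.tril(H, -2), 0)   # Hessenberg structure
assert np.allclose(Q @ H @ Q.T, A)      # similarity transform
```

This reduction is the usual preprocessing step before the QR iteration for nonsymmetric eigenvalue problems, which is why accelerating it matters.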
LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version
Cited by 4 (2 self)
Brief Announcement: Lower Bounds on Communication for Sparse Cholesky Factorization of a Model Problem
Cited by 4 (3 self)
Previous work has shown that a lower bound on the number of words moved between large, slow memory and small, fast memory of size M by any conventional (non-Strassen-like) direct linear algebra algorithm (matrix multiply; the LU, Cholesky, and QR factorizations; ...) is Ω(#flops / √M). This holds for dense or sparse matrices. There are analogous lower bounds for the number of messages, and for parallel algorithms instead of sequential algorithms. Our goal here is to find algorithms that attain these lower bounds on interesting classes of sparse matrices. We focus on matrices for which there is a lower bound on the number of flops of their Cholesky factorization. Our Cholesky lower bounds on communication hold for any possible ordering of the rows and columns of the matrix, and so are globally optimal in this sense. For matrices arising from discretization on two-dimensional and three-dimensional regular grids, we discuss sequential and parallel algorithms that are optimal in terms of communication. The algorithms turn out to require combining previously known sparse and dense Cholesky algorithms in simple ways.
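Written out, the bounds referred to above are the standard communication lower bounds, with G the number of flops and M the fast-memory size:

```latex
W = \Omega\!\left(\frac{G}{\sqrt{M}}\right) \text{ words moved}, \qquad
S = \Omega\!\left(\frac{G}{M^{3/2}}\right) \text{ messages}.
```

The message bound follows from the word bound because each message carries at most M words.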
Communication Avoiding and Overlapping for Numerical Linear Algebra
Cited by 2 (0 self)
Abstract—To efficiently scale dense linear algebra problems to future exascale systems, communication cost must be avoided or overlapped. Communication-avoiding 2.5D algorithms improve scalability by reducing interprocessor data transfer volume at the cost of extra memory usage. Communication overlap attempts to hide messaging latency by pipelining messages and overlapping them with computational work. We study the interaction and compatibility of these two techniques for two matrix multiplication algorithms (Cannon and SUMMA), triangular solve, and Cholesky factorization. For each algorithm, we construct a detailed performance model that considers both critical path dependencies and idle time. We give novel implementations of 2.5D algorithms with overlap for each of these problems. Our software employs UPC, a partitioned global address space (PGAS) language that provides fast one-sided communication. We show that communication avoidance and overlap provide a cumulative benefit as core counts scale, including results using over 24K cores of a Cray XE6 system.
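Cannon's algorithm, one of the two multiplication algorithms studied, interleaves local block multiplies with cyclic shifts of A and B across the process grid; those shifts are exactly the messages an overlapped implementation pipelines. A serial NumPy simulation of the block-shift pattern (assuming a square p×p grid with p dividing n; the paper's UPC implementation is of course distributed):

```python
import numpy as np

def cannon_matmul(A, B, p):
    """Serial simulation of Cannon's algorithm on a p x p block grid."""
    n = A.shape[0]
    b = n // p
    # Partition into p x p grids of b x b blocks.
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(p)] for i in range(p)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(p)] for i in range(p)]
    Cb = [[np.zeros((b, b)) for _ in range(p)] for _ in range(p)]
    # Initial skew: shift row i of A left by i, column j of B up by j.
    Ab = [[Ab[i][(j + i) % p] for j in range(p)] for i in range(p)]
    Bb = [[Bb[(i + j) % p][j] for j in range(p)] for i in range(p)]
    for _ in range(p):
        for i in range(p):
            for j in range(p):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]
        # Per-step shifts: A left by one, B up by one -- the messages an
        # overlapped implementation would hide behind the local multiplies.
        Ab = [[Ab[i][(j + 1) % p] for j in range(p)] for i in range(p)]
        Bb = [[Bb[(i + 1) % p][j] for j in range(p)] for i in range(p)]
    return np.block(Cb)
```

After the initial skew, each process (i, j) sees block pairs A[i][k], B[k][j] for every k over the p steps, so the blocked product is complete.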
Optimizing Halley's Iteration for Computing the Matrix Polar Decomposition
Cited by 2 (2 self)
Abstract. We introduce a dynamically weighted Halley (DWH) iteration for computing the polar decomposition of a matrix, and prove that the new method is globally and asymptotically cubically convergent. For matrices with condition number no greater than 10^16, the DWH method needs at most 6 iterations for convergence with the tolerance 10^-16. The Halley iteration can be implemented via QR decompositions without explicit matrix inversions. Therefore, it is an inverse-free, communication-friendly algorithm for the emerging multicore and hybrid high performance computing systems.
Key words. Polar decomposition, Halley's iteration, Newton's iteration, inverse-free iterations, QR decomposition, numerical stability
AMS subject classifications. 15A15, 15A23, 65F30
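The fixed-point map underlying the method is the plain Halley iteration X_{k+1} = X_k (3I + X_k^T X_k)(I + 3 X_k^T X_k)^{-1}, which drives the singular values of X toward 1. A NumPy sketch of this unweighted variant (the paper's DWH method adds dynamic weights and an inverse-free QR-based formulation, neither shown here):

```python
import numpy as np

def polar_halley(A, tol=1e-12, maxiter=50):
    """Polar factor of A via the unweighted Halley iteration (sketch)."""
    X = A / np.linalg.norm(A, 2)      # scale so singular values are <= 1
    I = np.eye(A.shape[1])
    for _ in range(maxiter):
        G = X.T @ X
        # Halley map applied to every singular value of X.
        X_new = X @ (3 * I + G) @ np.linalg.inv(I + 3 * G)
        if np.linalg.norm(X_new - X, 'fro') < tol:
            X = X_new
            break
        X = X_new
    H = X.T @ A                       # symmetric positive (semi)definite factor
    return X, (H + H.T) / 2
```

Without the dynamic weighting, convergence can be slow for ill-conditioned matrices; the DWH weights are what bring the count down to at most 6 iterations.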
Minimizing Communication for Eigenproblems and the Singular Value Decomposition
, 2010
Cited by 1 (0 self)
Algorithms have two costs: arithmetic and communication. The latter represents the cost of moving data, either between levels of a memory hierarchy or between processors over a network. Communication often dominates arithmetic and represents a rapidly increasing proportion of the total cost, so we seek algorithms that minimize communication. In [4] lower bounds were presented on the amount of communication required for essentially all O(n^3)-like algorithms for linear algebra, including eigenvalue problems and the SVD. Conventional algorithms, including those currently implemented in (Sca)LAPACK, perform asymptotically more communication than these lower bounds require. In this paper we present parallel and sequential eigenvalue algorithms (for pencils, nonsymmetric matrices, and symmetric matrices) and SVD algorithms that do attain these lower bounds, and we analyze their convergence and communication costs.
Communication Avoiding Symmetric Band Reduction
The running time of an algorithm depends on both arithmetic and communication (i.e., data movement) costs, and the relative costs of communication are growing over time. In this work, we present both theoretical and practical results for tridiagonalizing a symmetric band matrix: we present an algorithm that asymptotically reduces communication, and we show that it indeed performs well in practice. The tridiagonalization of a symmetric band matrix is a key kernel in solving the symmetric eigenvalue problem for both full and band matrices. In order to preserve sparsity, tridiagonalization routines use annihilate-and-chase procedures that previously have suffered from poor data locality. We improve data locality by reorganizing the computation, asymptotically reducing communication costs compared to existing algorithms. Our sequential implementation demonstrates that avoiding communication improves runtime even at the expense of extra arithmetic: we observe a 2× speedup over Intel MKL while doing 43% more floating point operations. Our parallel implementation targets shared-memory multicore platforms. It uses pipelined parallelism and a static scheduler while retaining the locality properties of the sequential algorithm. Due to lightweight synchronization and effective data reuse, we see 9.5× scaling over our serial code and up to a 6× speedup over the PLASMA library, comparing parallel performance on a ten-core processor.
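For context on why specialized band algorithms matter: reducing a symmetric matrix to tridiagonal form preserves its eigenvalues, but a dense reduction routine does O(n^3) work and ignores the band structure, which is exactly the waste the annihilate-and-chase approach avoids. A small SciPy illustration using the dense routine (to check the invariant only; this is not the paper's algorithm):

```python
import numpy as np
from scipy.linalg import hessenberg

rng = np.random.default_rng(3)
n, bw = 8, 2
A = rng.standard_normal((n, n))
A = np.triu(np.tril(A, bw), -bw)   # keep a band of half-width 2
A = (A + A.T) / 2                  # symmetric band matrix

# For symmetric input, Hessenberg reduction is tridiagonalization:
# T is (numerically) tridiagonal and similar to A, so the spectrum
# is unchanged -- but this dense call never exploits the band.
T = hessenberg(A)

assert np.allclose(np.triu(T, 2), 0, atol=1e-10)
assert np.allclose(np.linalg.eigvalsh(A),
                   np.linalg.eigvalsh((T + T.T) / 2))
```

A band-aware routine achieves the same tridiagonal result in roughly O(n^2 b) work for bandwidth b, which is where the locality and communication questions studied here arise.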