Results 1–8 of 8
Improving performance and energy efficiency of matrix multiplication via pipeline broadcast
 in: Proc. CLUSTER
Abstract
Cited by 5 (3 self)
Boosting the performance and energy efficiency of scientific applications running on high performance computing systems is crucial nowadays. Software- and hardware-based solutions for improving communication performance have been recognized as significant means of achieving performance gains, and thus energy savings, for such applications. As a fundamental component of most numerical linear algebra algorithms, improving the performance and energy efficiency of distributed matrix multiplication is of major concern. To this end, we propose a high performance communication scheme that fully exploits network bandwidth via nonblocking pipeline broadcast with tuned chunk size. Empirically, performance gains of up to 8.4% and energy savings of up to 6.9% are achieved compared to blocking pipeline broadcast, and performance gains of up to 6.5% and energy savings of up to 6.1% are observed against binomial tree broadcast, on a 64-core cluster. Keywords: distributed matrix multiplication; performance; energy; pipeline broadcast; binomial tree broadcast; ScaLAPACK.
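The trade-off this abstract describes can be illustrated with a simple step-count model (our own toy sketch, not the paper's cost model): a pipelined broadcast streams chunks down a chain of ranks, while a binomial tree re-sends the full message at every level.

```python
import math

# Toy step-count model for two broadcast schemes (our illustration).
# One "step" = sending one chunk over one link; latency and overlap
# with computation are deliberately ignored.

def pipeline_broadcast_steps(p, chunks):
    """Message split into `chunks` pieces streams down a chain of p ranks:
    p - 1 steps to fill the pipeline, then chunks - 1 more to drain it."""
    return (p - 1) + (chunks - 1)

def binomial_broadcast_steps(p, chunks):
    """A binomial tree forwards the whole message (all chunks) at each of
    ceil(log2(p)) rounds."""
    return math.ceil(math.log2(p)) * chunks

# For a large message on many ranks, pipelining wins:
# pipeline_broadcast_steps(64, 100) -> 162
# binomial_broadcast_steps(64, 100) -> 600
```

The paper's tuned chunk size balances against what this model omits: more chunks deepen the pipeline, but each chunk also pays a fixed per-message overhead.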
Communication Costs of Strassen’s Matrix Multiplication
, 2014
Abstract
Cited by 2 (1 self)
Algorithms have historically been evaluated in terms of the number of arithmetic operations they perform. This analysis is no longer sufficient for predicting running times on today's machines. Moving data through memory hierarchies and among processors requires much more time (and energy) than performing computations. Hardware trends suggest that the relative costs of this communication will only increase. Proving lower bounds on the communication of algorithms and finding algorithms that attain these bounds are therefore fundamental goals. We show that the communication cost of an algorithm is closely related to the graph expansion properties of its corresponding computation graph. Matrix multiplication is one of the most fundamental problems in scientific computing and in parallel computing.
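As a concrete instance of the arithmetic-versus-communication gap the abstract describes, a short sketch (our simplification, constants dropped): Strassen's flop count follows the recurrence F(n) = 7F(n/2) + 18(n/2)², while classical matrix multiplication on a machine with fast memory of size M must move on the order of n³/√M words.

```python
def strassen_flops(n):
    """Flop count for Strassen's algorithm on n x n matrices (n a power of
    two): 7 recursive half-size products plus 18 half-size matrix additions."""
    if n == 1:
        return 1
    return 7 * strassen_flops(n // 2) + 18 * (n // 2) ** 2

def classical_words_moved(n, fast_mem):
    """Order-of-magnitude words moved by blocked classical matmul, matching
    the known Omega(n^3 / sqrt(M)) bandwidth lower bound (constants dropped)."""
    return n ** 3 / fast_mem ** 0.5

# strassen_flops(2) -> 25; strassen_flops(4) -> 247
```

The paper's point is the analogous question for Strassen: what lower bound governs its data movement, which the computation-graph expansion argument answers.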
EXPLOITING MULTIPLE LEVELS OF PARALLELISM IN SPARSE MATRIX-MATRIX MULTIPLICATION
Abstract
Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdős–Rényi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first implementation of the 3D SpGEMM formulation that also exploits multiple (intra-node and inter-node) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrency. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research. Keywords: parallel computing, numerical linear algebra, sparse matrix-matrix multiplication, 2.5D algorithms, 3D algorithms, multithreading, SpGEMM, 2D decomposition, graph algorithms.
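The local kernel underlying these parallel formulations is Gustavson's row-by-row SpGEMM; a minimal sequential sketch on a dict-of-dicts sparse format (our illustration — the paper's contribution is the parallel 3D/multithreaded organization, not this local computation):

```python
# Gustavson-style row-by-row SpGEMM on sparse matrices stored as
# {row: {col: val}}. Each result row is accumulated in a sparse
# accumulator (a plain dict here).

def spgemm(A, B):
    """Return C = A * B for dict-of-dicts sparse matrices."""
    C = {}
    for i, row in A.items():
        acc = {}
        for k, a_ik in row.items():            # each nonzero A[i, k] ...
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj  # ... scales row k of B
        if acc:
            C[i] = acc
    return C

A = {0: {0: 1, 1: 2}, 1: {1: 3}}
B = {0: {1: 4}, 1: {0: 5}}
# spgemm(A, B) -> {0: {1: 4, 0: 10}, 1: {0: 15}}
```

In a 2D or 3D distributed setting, this kernel runs on local submatrices while the surrounding algorithm decides which pieces of A and B each process must communicate.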
Bordered Heegaard Floer . . .
, 2008
Abstract
We construct Heegaard Floer theory for 3-manifolds with connected boundary. The theory associates to an oriented two-manifold a differential graded algebra. For a three-manifold with specified boundary, the invariant comes in two different versions, one of which (type D) is a module over the algebra and the other of which (type A) is an A∞-module. Both are well-defined up to chain homotopy equivalence. For a decomposition of a 3-manifold into two pieces, the A∞ tensor product of the type D module of one piece and the type A module of the other piece is ĤF of the glued manifold. As a special case of the construction, we specialize to the case of three-manifolds with torus boundary. This case can be used to give another proof of the surgery exact triangle for ĤF. We relate the bordered Floer homology of a three-manifold with torus boundary to the knot Floer homology of a filling.
Communication-Avoiding Linear-Algebraic Primitives for Graph Analytics
Abstract
Graph algorithms typically have very low computational intensities; hence their execution times are bounded by their communication requirements. In addition to drastically improving running time, reducing communication will also help improve the energy consumption of graph algorithms. Many of the positive results for communication-avoiding algorithms come from numerical linear algebra. This suggests an immediate path forward for developing communication-avoiding graph algorithms in the language of linear algebra. Unfortunately, the algorithms that achieve communication optimality with asymptotically more available memory are the so-called 3D algorithms, yet the existing software for graph analytics is either 1D or 2D. In this talk, I will describe two new communication-avoiding kernels for graph computations, discuss how they can be integrated into an existing library like the Combinatorial BLAS, and how they can be incorporated into the future Graph BLAS standard. Sparse matrix-matrix multiplication (SpGEMM) enables efficient parallelization of various graph algorithms. It is the workhorse of a scalable distributed-memory implementation of betweenness centrality, an algorithm that finds influential entities in networks. Existing parallel algorithms for SpGEMM spend the majority of their time in inter-node communication at large concurrencies.
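The "graph algorithms in the language of linear algebra" idea can be sketched with BFS: each level expansion is a sparse matrix-vector product over a boolean semiring. The encoding below is our plain-Python illustration, not the Combinatorial BLAS or GraphBLAS API.

```python
# BFS as repeated frontier expansion -- the set-union step plays the role
# of a boolean-semiring SpMV against the adjacency structure.

def bfs_levels(adj, source):
    """adj: {u: set of out-neighbors}; returns {vertex: BFS level}."""
    level = {source: 0}
    frontier = {source}
    step = 0
    while frontier:
        step += 1
        # "SpMV": push the frontier through the adjacency lists,
        # masking out already-visited vertices
        frontier = {v for u in frontier
                      for v in adj.get(u, ())
                      if v not in level}
        for v in frontier:
            level[v] = step
    return level

adj = {0: {1, 2}, 1: {3}, 2: {3}, 3: set()}
# bfs_levels(adj, 0) -> {0: 0, 1: 1, 2: 1, 3: 2}
```

Betweenness centrality batches many such searches, which is where SpGEMM (frontier matrix times adjacency matrix) replaces the single-vector product.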
An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data
Abstract
General sparse matrix-matrix multiplication (SpGEMM) is a fundamental building block for numerous applications such as the algebraic multigrid method, breadth-first search, and the shortest path problem. Compared to other sparse BLAS routines, an efficient parallel SpGEMM algorithm has to handle extra irregularity from three aspects: (1) the number of nonzero entries in the result sparse matrix is unknown in advance, (2) very expensive parallel insert operations at random positions in the result sparse matrix dominate the execution time, and (3) load balancing must account for sparse data in both input matrices. Recent work on GPU SpGEMM has demonstrated good time and space complexity, but works best for fairly regular matrices. In this work we present a GPU SpGEMM algorithm that particularly focuses on the above three problems. Memory pre-allocation for the result matrix is organized by a hybrid method that saves a large amount of global memory space and efficiently utilizes the very limited on-chip scratchpad memory. Parallel insert operations of the nonzero entries are implemented through the GPU merge path algorithm, which is experimentally found to be the fastest GPU merge approach. Load balancing is based on the number of necessary arithmetic operations on the nonzero entries and is guaranteed in all stages. Compared with the state-of-the-art GPU SpGEMM methods in the cuSPARSE and CUSP libraries and the latest CPU SpGEMM method in the Intel Math Kernel Library, our approach delivers excellent absolute performance and relative speedups on a benchmark suite composed of 23 matrices with diverse sparsity structures. Keywords: sparse matrices; matrix multiplication; linear algebra; GPU; merging; parallel algorithms.
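Problem (1) above — the unknown size of the result — is commonly attacked by first computing a per-row upper bound: row i of C can hold at most as many nonzeros as the combined nonzeros of the B rows that A's row i touches. A host-side sketch of that counting stage only (our illustration; the paper's GPU kernels, merge-path insertion, and hybrid allocation scheme are not reproduced here):

```python
# Upper-bound nnz estimate per result row of C = A * B, used to size
# memory pre-allocation before the actual multiply.

def nnz_upper_bounds(A_rows, B_row_nnz):
    """A_rows: list of column-index lists, one per row of A.
    B_row_nnz: nnz count for each row of B.
    Returns, per row i of C, the sum of nnz(B[k, :]) over nonzeros A[i, k]."""
    return [sum(B_row_nnz[k] for k in cols) for cols in A_rows]

A_rows = [[0, 2], [1]]      # A row 0 hits B rows 0 and 2; A row 1 hits B row 1
B_row_nnz = [3, 1, 2]       # nnz per row of B
# nnz_upper_bounds(A_rows, B_row_nnz) -> [5, 1]
```

The bound is loose when column indices collide and get merged, which is exactly why the paper pairs it with a hybrid allocation strategy rather than allocating the full bound in global memory.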