Results 1–10 of 12
Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication
In Proc. IPDPS, 2011
Cited by 7 (0 self)
Abstract—On multicore architectures, the ratio of peak memory bandwidth to peak floating-point performance (byte:flop ratio) is decreasing as core counts increase, further limiting the performance of bandwidth-limited applications. Multiplying a sparse matrix (as well as its transpose in the unsymmetric case) with a dense vector is the core of sparse iterative methods. In this paper, we present a new multithreaded algorithm for the symmetric case which potentially cuts the bandwidth requirements in half while exposing lots of parallelism in practice. We also give a new data structure transformation, called bitmasked register blocks, which promises significant reductions in bandwidth requirements by reducing the number of indexing elements without introducing additional fill-in zeros. Our work shows how to incorporate this transformation into existing parallel algorithms (both symmetric and unsymmetric) without limiting their parallel scalability. Experimental results indicate that the combined benefits of bitmasked register blocks and the new symmetric algorithm can be as high as a factor of 3.5x in multicore performance over an already scalable parallel approach. We also provide a model that accurately predicts the performance of the new methods, showing that even larger performance gains are expected in future multicore systems as current trends (decreasing byte:flop ratio and larger sparse matrices) continue.
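The symmetric-case bandwidth saving can be sketched directly: store only the upper triangle in CSR and let each stored entry contribute to two output positions, so roughly half of the matrix's nonzeros are streamed from memory. A serial Python sketch of the principle only (function and variable names are ours; the paper's multithreaded scheduling and bitmasked register blocks are not modeled):

```python
import numpy as np

def spmv_symmetric_upper(indptr, indices, data, x):
    """y = A @ x for a symmetric A stored as its upper triangle in CSR.

    Each stored entry a_ij (i <= j) updates both y[i] and y[j], so only
    about half of the matrix's nonzeros are read from memory.
    """
    n = len(indptr) - 1
    y = np.zeros(n)
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            j, v = indices[k], data[k]
            y[i] += v * x[j]
            if j != i:  # off-diagonal entry also updates the mirrored row
                y[j] += v * x[i]
    return y
```

The second update (`y[j] += ...`) is exactly what makes a parallel version nontrivial: two threads owning different rows may write the same `y[j]`, which is why an algorithm that still exposes parallelism here is the paper's contribution.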
On shared-memory parallelization of a sparse matrix scaling algorithm
Cited by 1 (1 self)
Abstract—We discuss efficient shared-memory parallelization of sparse matrix computations whose main traits resemble those of the sparse matrix-vector multiply operation. Such computations are difficult to parallelize because of their relatively small computational granularity, characterized by a small number of operations per data access. Our main application is a sparse matrix scaling algorithm which is even more memory-bound than the sparse matrix-vector multiplication operation. We take the application and parallelize it using standard OpenMP programming principles. Apart from the common race-condition-avoiding constructs, we do not reorganize the algorithm. Rather, we identify the associated performance metrics and describe models to optimize them. Using these models, we implement parallel matrix scaling algorithms for two well-known sparse matrix storage formats. Experimental results show that simple parallelization attempts which leave data/work partitioning to the runtime scheduler can suffer from the overhead of avoiding race conditions, especially as the number of threads increases. The proposed algorithms perform better by optimizing the identified performance metrics and reducing this overhead. Keywords: shared-memory parallelization, sparse matrices, hypergraphs, matrix scaling
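The shape of such a computation can be sketched with one sweep of an iterative row/column scaling on a CSR matrix (a serial Python sketch assuming a 1-norm, Ruiz-style update; the names and the exact update rule are ours, not necessarily the paper's):

```python
import numpy as np

def scaling_sweep(indptr, indices, data, dr, dc):
    """One sweep of iterative scaling on a CSR matrix: accumulate the
    scaled row and column sums, then update the diagonal scaling vectors
    so that diag(dr) @ A @ diag(dc) moves toward unit row/column 1-norms.
    """
    row_sums = np.zeros(len(dr))
    col_sums = np.zeros(len(dc))
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            v = dr[i] * abs(data[k]) * dc[indices[k]]
            row_sums[i] += v
            col_sums[indices[k]] += v  # indexed scatter: the contended write
    dr = dr / np.sqrt(np.where(row_sums > 0, row_sums, 1.0))
    dc = dc / np.sqrt(np.where(col_sums > 0, col_sums, 1.0))
    return dr, dc
```

With one thread per row, `row_sums` is private but the scatter into `col_sums` is shared, which is where the race-avoiding constructs (atomics, or per-thread replicas reduced afterwards) and their overhead come in.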
Utilizing Recursive Storage in Sparse Matrix-Vector Multiplication—Preliminary Considerations
Cited by 1 (0 self)
Computations with sparse matrices on multicore, cache-based computers are affected by the irregularity of the problem at hand, and performance degrades easily. In this note we propose a recursive storage format for sparse matrices and evaluate its usage for the Sparse Matrix-Vector (SpMV) operation on two multicore machines and one multiprocessor machine. We report benchmark results showing high performance and scalability comparable to current state-of-the-art implementations.
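A recursive storage format of this flavor can be sketched as a quadtree over the nonzeros: SpMV descends through the four quadrants of each block until the blocks are small, keeping each leaf's slice of the input and output vectors cache-resident. A toy Python sketch on COO input (our own names and leaf size; not the authors' actual format):

```python
import numpy as np

def spmv_quadtree(rows, cols, vals, x, y, r0, c0, size, leaf=2):
    """y += A @ x, where the nonzeros (rows, cols, vals) of the current
    size x size block anchored at (r0, c0) are split into four quadrants
    and handled recursively; small leaves fall back to a direct scatter.
    `size` is assumed to be a power of two covering the whole matrix.
    """
    if len(rows) == 0:
        return
    if size <= leaf:
        for r, c, v in zip(rows, cols, vals):
            y[r] += v * x[c]
        return
    half = size // 2
    for qr in (0, 1):
        for qc in (0, 1):
            rlo, clo = r0 + qr * half, c0 + qc * half
            m = (rows >= rlo) & (rows < rlo + half) & \
                (cols >= clo) & (cols < clo + half)
            spmv_quadtree(rows[m], cols[m], vals[m], x, y, rlo, clo, half, leaf)
```

Each leaf touches a bounded window of `x` and `y`, which is the locality argument behind recursive storage on cache-based multicores.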
Hierarchical Diagonal Blocking and Precision Reduction Applied to Combinatorial Multigrid
Abstract—Memory bandwidth is a major limiting factor in the scalability of parallel iterative algorithms that rely on sparse matrix-vector multiplication (SpMV). This paper introduces Hierarchical Diagonal Blocking (HDB), an approach which we believe captures many of the existing optimization techniques for SpMV in a common representation. Using this representation in conjunction with precision-reduction techniques, we develop and evaluate high-performance SpMV kernels. We also study the implications of using our SpMV kernels in a complete iterative solver. Our method of choice is a Combinatorial Multigrid solver that can fully utilize our fastest reduced-precision SpMV kernel without sacrificing the quality of the solution. We provide extensive empirical evaluation of the effectiveness of the approach on a variety of benchmark matrices, demonstrating substantial speedups on all matrices considered.
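The precision-reduction half of the idea is easy to illustrate in isolation: store the matrix values in float32, halving that stream's bandwidth, while accumulating products in float64. A minimal Python sketch (our names; HDB's blocking and reordering are not modeled):

```python
import numpy as np

def spmv_csr_f32(indptr, indices, data32, x):
    """CSR SpMV with values stored in float32 but accumulated in float64,
    trading a little accuracy per entry for half the value-stream traffic.
    """
    n = len(indptr) - 1
    y = np.zeros(n, dtype=np.float64)
    for i in range(n):
        acc = 0.0  # float64 accumulator limits rounding-error growth
        for k in range(indptr[i], indptr[i + 1]):
            acc += float(data32[k]) * x[indices[k]]
        y[i] = acc
    return y
```

Inside an iterative solver, the outer iteration can often recover the bits lost per multiply, which is the reason a reduced-precision kernel need not degrade the final solution.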
Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi
Intel Xeon Phi is a recently released high-performance coprocessor which features 61 cores, each supporting 4 hardware threads with 512-bit wide SIMD registers, achieving a peak theoretical performance of 1 Tflop/s in double precision. Many scientific applications involve operations on large sparse matrices, such as linear solvers, eigensolvers, and graph mining algorithms. The core of most of these applications involves the multiplication of a large sparse matrix with a dense vector (SpMV). In this paper, we investigate the performance of the Xeon Phi coprocessor for SpMV. We first provide a comprehensive introduction to this new architecture and analyze its peak performance with a number of microbenchmarks. Although the design of a Xeon Phi core is not much different from those of the cores in modern processors, its large number of cores and hyperthreading capability allow many applications to saturate the available memory bandwidth, which is not the case for many cutting-edge processors. Yet, our performance studies show that it is memory latency, not bandwidth, that creates a bottleneck for SpMV on this architecture. Finally, our experiments show that Xeon Phi's sparse kernel performance is very promising, and even better than that of cutting-edge general-purpose processors and GPUs.
Autotuning Sparse Matrix-Vector Multiplication for Multicore
Fluid Pinchoff
This 4608² image of a combustion simulation result was rendered by a hybrid-parallel (MPI+pthreads) ray-casting volume rendering implementation running on 216,000 cores of the JaguarPF supercomputer. Combustion simulation data courtesy of J. Bell and M. Day.
Algebraic Domain Decomposition Methods for Highly Heterogeneous Problems
2013
We consider solving linear systems arising from porous media flow simulations with high heterogeneities. Using a Newton algorithm to handle the nonlinearity leads to solving a sequence of linear systems with different but similar matrices and right-hand sides. The parallel solver is a Schwarz domain decomposition method. The unknowns are partitioned with a criterion based on the entries of the input matrix. This leads to substantial gains compared to a partition based only on the adjacency graph of the matrix. From the information generated during the solving of the first linear system, it is possible to build a coarse space for a two-level domain decomposition algorithm that leads to an acceleration of the convergence of the subsequent linear systems. We compare two coarse spaces: a classical approach and a new one adapted to parallel ...
Shell: A Spatial Decomposition Data Structure for 3D Curve Traversal on Manycore Architectures
Shared-memory manycore processors such as GPUs have been extensively used to accelerate computation-intensive algorithms and applications. When porting existing algorithms from sequential or other parallel architecture models to shared-memory manycore architectures, nontrivial modifications are often needed in order to match the execution patterns of the target algorithms with the characteristics of manycore architectures. 3D curve traversal is a fundamental process in many applications and is commonly accelerated by spatial decomposition schemes captured in hierarchical data structures (e.g., kd-trees). However, curve traversal using hierarchical data structures requires repeated hierarchical searches. Such a search process is time-consuming on shared-memory manycore architectures since it incurs a considerable number of expensive memory accesses and significant execution divergence. In this paper, we propose a novel spatial-decomposition-based data structure, called Shell, which completely avoids hierarchical search for 3D curve traversal. In Shell, a structure is built on the boundary of each region in the decomposed space, which allows any curve traversing a region to find the next neighboring region using table lookup schemes, without any hierarchical search. While our 3D curve traversal approach works for other spatial decomposition paradigms and manycore processors, we illustrate it using kd-tree decomposition on a GPU and compare with the fastest known kd-tree searching algorithms for ray traversal. Analysis and experimental results show that our approach improves ray traversal performance considerably over the kd-tree searching approaches.
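The search-free idea can be illustrated in miniature with a uniform grid, where the next region along a curve is always computable with a single comparison, with no search structure consulted. A toy 2D Python sketch (a DDA-style walk with our own names; not the Shell structure itself, which handles general decompositions via per-boundary lookup tables):

```python
import math

def traverse_grid(ox, oy, dx, dy, n):
    """Walk a 2D ray through an n x n unit grid by neighbor stepping:
    at each cell the exit face is chosen by comparing the two candidate
    crossing parameters, so the next region is found directly rather
    than by a hierarchical (tree) search. Returns visited cells in order.
    """
    ix, iy = int(math.floor(ox)), int(math.floor(oy))
    step_x = 1 if dx > 0 else -1
    step_y = 1 if dy > 0 else -1
    # ray parameter t at which the next x / y grid line is crossed
    t_max_x = ((ix + (dx > 0)) - ox) / dx if dx != 0 else math.inf
    t_max_y = ((iy + (dy > 0)) - oy) / dy if dy != 0 else math.inf
    t_dx = abs(1.0 / dx) if dx != 0 else math.inf
    t_dy = abs(1.0 / dy) if dy != 0 else math.inf
    cells = []
    while 0 <= ix < n and 0 <= iy < n:
        cells.append((ix, iy))
        if t_max_x < t_max_y:   # exit through a vertical face
            ix += step_x
            t_max_x += t_dx
        else:                   # exit through a horizontal face
            iy += step_y
            t_max_y += t_dy
    return cells
```

Shell's contribution, in these terms, is making this constant-time neighbor step available for nonuniform decompositions by precomputing a lookup structure on each region's boundary.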
Reducing Multicore Bandwidth Requirements for Combinatorial Multigrid
Abstract—Memory bandwidth is a major limiting factor in the scalability of parallel algorithms. In this paper, we introduce hierarchical diagonal blocking, a sparse matrix representation which we believe captures most of the existing optimization techniques in a common representation. It can take advantage of symmetry while still being easy to parallelize. It takes advantage of (in fact, requires) reordering. It also allows for simple compression of column indices. As applications, we show how to use this high-performance SpMV kernel, together with precision-reduction techniques, in a combinatorial multigrid solver to lower the bandwidth consumption without sacrificing the final solution's quality. We provide extensive empirical evaluation of the effectiveness of the approach on a variety of benchmark matrices, demonstrating substantial speedups on all matrices considered.
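The column-index compression enabled by reordering can be illustrated with a delta-coded CSR variant: each row stores its first column in full and every later column as a small unsigned delta, shrinking the index stream that often dominates SpMV memory traffic (a sketch with our own layout and names, not the paper's format):

```python
import numpy as np

def spmv_delta_csr(indptr, first_col, deltas, data, x):
    """CSR SpMV with delta-compressed column indices: row i's first
    column is first_col[i], and each later column is the previous one
    plus deltas[k] (a small unsigned integer, e.g. uint8).
    """
    n = len(indptr) - 1
    y = np.zeros(n)
    for i in range(n):
        col = 0
        for k in range(indptr[i], indptr[i + 1]):
            col = first_col[i] if k == indptr[i] else col + int(deltas[k])
            y[i] += data[k] * x[col]
    return y
```

For matrices reordered so that each row's nonzeros are tightly clustered, the deltas fit in one byte each, cutting the index stream roughly 4x versus 32-bit column indices.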