Results 1-10 of 26
Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication
In Proc. IPDPS, 2011
Cited by 22 (0 self)
Abstract—On multicore architectures, the ratio of peak memory bandwidth to peak floating-point performance (the byte:flop ratio) is decreasing as core counts increase, further limiting the performance of bandwidth-limited applications. Multiplying a sparse matrix (as well as its transpose in the unsymmetric case) with a dense vector is the core of sparse iterative methods. In this paper, we present a new multithreaded algorithm for the symmetric case which potentially cuts the bandwidth requirements in half while exposing substantial parallelism in practice. We also give a new data structure transformation, called bitmasked register blocks, which promises significant reductions in bandwidth requirements by reducing the number of indexing elements without introducing additional fill-in zeros. Our work shows how to incorporate this transformation into existing parallel algorithms (both symmetric and unsymmetric) without limiting their parallel scalability. Experimental results indicate that the combined benefit of bitmasked register blocks and the new symmetric algorithm can be as high as a factor of 3.5x in multicore performance over an already scalable parallel approach. We also provide a model that accurately predicts the performance of the new methods, showing that even larger performance gains are expected in future multicore systems as current trends (decreasing byte:flop ratios and larger sparse matrices) continue.
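The bandwidth-halving idea in this abstract can be sketched in a few lines: store only the lower triangle of a symmetric matrix in CSR form and apply each off-diagonal entry to both the row and the column result, so roughly half the nonzeros need to be read from memory. A minimal sequential sketch, not the paper's multithreaded implementation; all names are illustrative:

```python
def spmv_symmetric_lower(n, rowptr, colind, vals, x):
    """y = A*x for symmetric A, given only its lower triangle in CSR.

    Each stored off-diagonal entry a_ij (i > j) is applied twice:
    to y[i] as a_ij * x[j], and to y[j] as a_ji * x[i] = a_ij * x[i].
    """
    y = [0.0] * n
    for i in range(n):
        for k in range(rowptr[i], rowptr[i + 1]):
            j, a = colind[k], vals[k]
            y[i] += a * x[j]
            if j != i:              # mirror the off-diagonal entry
                y[j] += a * x[i]
    return y
```

The mirrored update to `y[j]` is what makes the parallel version hard: two threads owning different rows may write the same output entry, which is the race the paper's algorithm is designed to avoid.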
Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi
, 1302
Cited by 13 (1 self)
Intel Xeon Phi is a recently released high-performance coprocessor featuring 61 cores, each supporting 4 hardware threads, with 512-bit-wide SIMD registers, achieving a peak theoretical performance of 1 Tflop/s in double precision. Many scientific applications involve operations on large sparse matrices, such as linear solvers, eigensolvers, and graph mining algorithms. The core of most of these applications is the multiplication of a large, sparse matrix with a dense vector (SpMV). In this paper, we investigate the performance of the Xeon Phi coprocessor for SpMV. We first provide a comprehensive introduction to this new architecture and analyze its peak performance with a number of microbenchmarks. Although the design of a Xeon Phi core is not much different from that of the cores in modern processors, its large number of cores and its hyperthreading capability allow many applications to saturate the available memory bandwidth, which is not the case for many cutting-edge processors. Yet, our performance studies show that it is the memory latency, not the bandwidth, that creates a bottleneck for SpMV on this architecture. Finally, our experiments show that Xeon Phi's sparse kernel performance is very promising, and even better than that of cutting-edge general-purpose processors and GPUs.
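For reference, the SpMV kernel that sits at the core of the applications this abstract mentions, in its plain CSR form. This is a sequential sketch for illustration only, not the paper's vectorized, multithreaded Xeon Phi kernels:

```python
def spmv_csr(n, rowptr, colind, vals, x):
    """Reference y = A*x with A in Compressed Sparse Row (CSR) form.

    Per nonzero there is one value load, one index load, and one
    (irregular) load from x: only a couple of flops per memory access,
    which is why SpMV is bound by the memory system rather than by
    arithmetic throughput.
    """
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        for k in range(rowptr[i], rowptr[i + 1]):
            acc += vals[k] * x[colind[k]]
        y[i] = acc
    return y
```

The indirect access `x[colind[k]]` is the latency-sensitive part the abstract refers to: its address is not known until the index is loaded, so performance depends on how well the hardware hides memory latency.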
Utilizing Recursive Storage in Sparse Matrix-Vector Multiplication—Preliminary Considerations
Cited by 7 (3 self)
Computations with sparse matrices on multicore, cache-based computers are affected by the irregularity of the problem at hand, and performance degrades easily. In this note we propose a recursive storage format for sparse matrices and evaluate its usage for the Sparse Matrix-Vector (SpMV) operation on two multicore machines and one multiprocessor machine. We report benchmark results showing high performance and scalability comparable to current state-of-the-art implementations.
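The recursive partitioning behind such a storage format can be illustrated with a toy sketch that splits a matrix into quadrants until each leaf holds few enough nonzeros. This is an assumption-laden illustration of the idea, not the format proposed in the note; leaves would in practice be stored in a cache-friendly format such as CSR:

```python
def quadtree_leaves(coords, r0, r1, c0, c1, max_nnz=2):
    """Recursively split the index range [r0, r1) x [c0, c1) of a sparse
    matrix, given as a list of (row, col) nonzero coordinates, into
    quadrants until each leaf holds at most max_nnz nonzeros.
    Returns a list of ((r0, r1, c0, c1), nonzeros) leaf descriptors.
    """
    nnz = [(r, c) for (r, c) in coords if r0 <= r < r1 and c0 <= c < c1]
    if not nnz:
        return []                          # empty quadrants are pruned
    if len(nnz) <= max_nnz or (r1 - r0 <= 1 and c1 - c0 <= 1):
        return [((r0, r1, c0, c1), nnz)]
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
    return (quadtree_leaves(nnz, r0, rm, c0, cm, max_nnz) +
            quadtree_leaves(nnz, r0, rm, cm, c1, max_nnz) +
            quadtree_leaves(nnz, rm, r1, c0, cm, max_nnz) +
            quadtree_leaves(nnz, rm, r1, cm, c1, max_nnz))
```

Each leaf touches a bounded submatrix, and hence a bounded slice of the input and output vectors, which is what makes the recursive layout cache-friendly and gives a natural unit of parallel work.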
Exact sparse matrix-vector multiplication on GPUs and multicore architectures
In Proceedings of the 4th International Workshop on Parallel and Symbolic Computation, PASCO ’10, 2010
Cited by 5 (2 self)
We propose different implementations of sparse matrix–dense vector multiplication (SpMV) for finite fields and rings Z/mZ. We take advantage of graphics processing units (GPUs) and multicore architectures. Our aim is to improve the speed of SpMV in the LinBox library, and hence the speed of its black-box algorithms. In addition, we use this and a new parallelization of the sigma-basis algorithm in a parallel block Wiedemann rank implementation over finite fields.
Use of hybrid recursive CSR/COO data structures in sparse matrix-vector multiplication
In IMCSIT, 2010
Cited by 5 (3 self)
Abstract—Recently, we introduced an approach to basic sparse matrix computations on multicore, cache-based machines using recursive partitioning. Here, the memory representation of a sparse matrix consists of a set of submatrices, which are used as leaves of a quadtree structure. In this paper, we evaluate the performance impact on Sparse Matrix-Vector Multiplication (SpMV) of a modification to our Recursive CSR implementation that allows the use of multiple data structures in leaf matrices (CSR/COO, with either 16- or 32-bit indices).
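For contrast with CSR, a COO leaf kernel can be sketched as follows. COO stores an explicit (row, column, value) triple per nonzero, so it needs two index entries per nonzero versus roughly one for CSR, but it has no per-row pointer array, which can pay off for very sparse or irregular leaf blocks. An illustrative sketch only; the paper's hybrid format chooses the structure and index width per leaf:

```python
def spmv_coo(n, rows, cols, vals, x):
    """y = A*x with A in coordinate (COO) form: parallel arrays of
    row indices, column indices, and values, in any order.
    """
    y = [0.0] * n
    for r, c, a in zip(rows, cols, vals):
        y[r] += a * x[c]      # scattered update: rows may repeat
    return y
```

Because a quadtree leaf spans a small index range, its row and column indices fit in 16 bits even for large matrices, which is the index-width saving the abstract refers to.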
On shared-memory parallelization of a sparse matrix scaling algorithm
Cited by 5 (0 self)
Abstract—We discuss efficient shared-memory parallelization of sparse matrix computations whose main traits resemble those of the sparse matrix-vector multiply operation. Such computations are difficult to parallelize because of their relatively small computational granularity, characterized by a small number of operations per data access. Our main application is a sparse matrix scaling algorithm which is even more memory-bound than sparse matrix-vector multiplication. We take the application and parallelize it using standard OpenMP programming principles. Apart from the common constructs for avoiding race conditions, we do not reorganize the algorithm. Rather, we identify the associated performance metrics and describe models to optimize them. Using these models, we implement parallel matrix scaling algorithms for two well-known sparse matrix storage formats. Experimental results show that simple parallelization attempts which leave data/work partitioning to the runtime scheduler can suffer from the overhead of avoiding race conditions, especially as the number of threads increases. The proposed algorithms perform better by optimizing the identified performance metrics and reducing this overhead.
Keywords: shared-memory parallelization, sparse matrices, hypergraphs, matrix scaling
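To see why scaling has SpMV-like traits, consider one sweep of a generic row/column equilibration: a pass over the nonzeros gathering per-row and per-column maxima (scattered updates to two small arrays), then a pass applying the scaling. This is an illustrative sketch of that access pattern, not necessarily the specific scaling algorithm the paper parallelizes:

```python
from math import sqrt

def equilibrate_step(n, rowptr, colind, vals):
    """One sweep of row/column equilibration on a CSR matrix: divide
    each entry by the square roots of the largest absolute values in
    its row and its column. Modifies vals in place and returns it.
    """
    ncols = max(colind) + 1 if colind else 0
    rmax = [0.0] * n
    cmax = [0.0] * ncols
    for i in range(n):                      # gather row/column maxima
        for k in range(rowptr[i], rowptr[i + 1]):
            a = abs(vals[k])
            rmax[i] = max(rmax[i], a)
            cmax[colind[k]] = max(cmax[colind[k]], a)
    for i in range(n):                      # apply the scaling in place
        for k in range(rowptr[i], rowptr[i + 1]):
            vals[k] /= sqrt(rmax[i]) * sqrt(cmax[colind[k]])
    return vals
```

The column-indexed updates to `cmax` are the race-prone accesses in a row-partitioned OpenMP version, mirroring the difficulty the abstract describes.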
Algebraic Domain Decomposition Methods for Highly Heterogeneous Problems
In SIAM Journal on Scientific Computing
Cited by 4 (0 self)
HAL is a multidisciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Assembling Recursively Stored Sparse Matrices
Cited by 3 (1 self)
Abstract—Recently, we introduced an approach to multicore computations on sparse matrices using recursive partitioning, called Recursive Sparse Blocks (RSB). In this document, we discuss the issues involved in assembling matrices in the RSB format. Since the main expected application area is iterative methods, we compare the performance of matrix assembly to that of the matrix-vector multiply (SpMV), outlining both the scalability of the method and the ratio of execution times.
Hierarchical Diagonal Blocking and Precision Reduction Applied to Combinatorial Multigrid
Cited by 2 (0 self)
Abstract—Memory bandwidth is a major limiting factor in the scalability of parallel iterative algorithms that rely on sparse matrix-vector multiplication (SpMV). This paper introduces Hierarchical Diagonal Blocking (HDB), an approach which we believe captures many of the existing optimization techniques for SpMV in a common representation. Using this representation in conjunction with precision-reduction techniques, we develop and evaluate high-performance SpMV kernels. We also study the implications of using our SpMV kernels in a complete iterative solver. Our method of choice is a Combinatorial Multigrid solver that can fully utilize our fastest reduced-precision SpMV kernel without sacrificing the quality of the solution. We provide an extensive empirical evaluation of the effectiveness of the approach on a variety of benchmark matrices, demonstrating substantial speedups on all matrices considered.