Results 1 – 10 of 22
Efficient Sparse Matrix-Vector Multiplication on x86-based Many-Core Processors
 In 27th International Conference on Supercomputing (ICS)
"... Sparse matrixvector multiplication (SpMV) is an important kernel in many scientific applications and is known to be memory bandwidth limited. On modern processors with wide SIMD and large numbers of cores, we identify and address several bottlenecks which may limit performance even before memory b ..."
Abstract

Cited by 14 (0 self)
Sparse matrix-vector multiplication (SpMV) is an important kernel in many scientific applications and is known to be memory bandwidth limited. On modern processors with wide SIMD and large numbers of cores, we identify and address several bottlenecks which may limit performance even before memory bandwidth: (a) low SIMD efficiency due to sparsity, (b) overhead due to irregular memory accesses, and (c) load imbalance due to non-uniform matrix structures. We describe an efficient implementation of SpMV on the Intel® Xeon Phi™ Coprocessor, codenamed Knights Corner (KNC), that addresses the above challenges. Our implementation exploits the salient architectural features of KNC, such as large caches and hardware support for irregular memory accesses. By using a specialized data structure with careful load balancing, we attain performance on average close to 90% of KNC’s achievable memory bandwidth on a diverse set of sparse matrices. Furthermore, we demonstrate that our implementation is 3.52x and 1.32x faster, respectively, than the best available implementations on dual Intel® Xeon® Processor E5-2680 and the NVIDIA Tesla K20X architecture.
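The kernel all of these entries target is easy to state in reference form: CSR stores, per row, a slice of column indices and values. The sketch below is a plain-Python baseline for checking results, not any paper's optimized implementation (the function name is illustrative).

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """Reference CSR sparse matrix-vector product y = A @ x.

    row_ptr[i]..row_ptr[i+1] delimit row i's nonzeros; col_idx and vals
    hold their column positions and numeric values."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y
```

For the 3x3 matrix [[1,0,2],[0,3,0],[4,0,5]], the CSR arrays are row_ptr=[0,2,3,5], col_idx=[0,2,1,0,2], vals=[1,2,3,4,5]; the inner loop's indirect access through col_idx is exactly the irregular-memory-access problem the abstracts above discuss.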
Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi
, 1302
"... Intel Xeon Phi is a recently released highperformance coprocessor which features 61 cores each supporting 4 hardware threads with 512bit wide SIMD registers achieving a peak theoretical performance of 1Tflop/s in double precision. Many scientific applications involve operations on large sparse mat ..."
Abstract

Cited by 13 (1 self)
Intel Xeon Phi is a recently released high-performance coprocessor which features 61 cores, each supporting 4 hardware threads with 512-bit wide SIMD registers, achieving a peak theoretical performance of 1 Tflop/s in double precision. Many scientific applications involve operations on large sparse matrices, such as linear solvers, eigensolvers, and graph mining algorithms. The core of most of these applications involves the multiplication of a large, sparse matrix with a dense vector (SpMV). In this paper, we investigate the performance of the Xeon Phi coprocessor for SpMV. We first provide a comprehensive introduction to this new architecture and analyze its peak performance with a number of microbenchmarks. Although the design of a Xeon Phi core is not much different from those of the cores in modern processors, its large number of cores and hyperthreading capability allow many applications to saturate the available memory bandwidth, which is not the case for many cutting-edge processors. Yet, our performance studies show that it is the memory latency, not the bandwidth, which creates a bottleneck for SpMV on this architecture. Finally, our experiments show that Xeon Phi’s sparse kernel performance is very promising and even better than that of cutting-edge general purpose processors and GPUs.
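The "bandwidth-limited" framing used by both abstracts above follows from simple arithmetic: CSR SpMV does 2 flops per nonzero while streaming at least one value and one column index per nonzero. The roofline-style ceiling below uses illustrative byte counts (it ignores row-pointer and vector traffic) and is an assumption of this sketch, not a figure from either paper.

```python
def spmv_gflops_ceiling(bandwidth_gb_s, val_bytes=8, idx_bytes=4):
    """Upper bound on CSR SpMV GFLOP/s when performance is limited purely
    by streaming per-nonzero data: an 8-byte double value plus a 4-byte
    column index, against 2 flops (one multiply, one add) per nonzero."""
    flops_per_nnz = 2
    bytes_per_nnz = val_bytes + idx_bytes
    return bandwidth_gb_s * flops_per_nnz / bytes_per_nnz
```

For example, at a hypothetical 150 GB/s of sustained bandwidth this bound is 25 GFLOP/s, a small fraction of a 1 Tflop/s peak, which is why the papers measure success against achievable bandwidth rather than peak flops.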
Hardware/software vectorization for closeness centrality on multi/many-core architectures
 In 28th International Parallel and Distributed Processing Symposium Workshops, Workshop on Multithreaded Architectures and Applications (MTAAP)
, 2014
"... Abstract—Centrality metrics have shown to be highly correlated with the importance and loads of the nodes in a network. Given the scale of today’s social networks, it is essential to use efficient algorithms and high performance computing techniques for their fast computation. In this work, we expl ..."
Abstract

Cited by 2 (0 self)
Abstract—Centrality metrics have been shown to be highly correlated with the importance and loads of the nodes in a network. Given the scale of today’s social networks, it is essential to use efficient algorithms and high performance computing techniques for their fast computation. In this work, we exploit hardware and software vectorization in combination with fine-grained parallelization to compute the closeness centrality values. The proposed vectorization approach enables us to perform concurrent breadth-first search operations and significantly increases the performance. We provide a comparison of different vectorization schemes and experimentally evaluate our contributions with respect to the existing parallel CPU-based solutions on cutting-edge hardware. Our implementations are 11 times faster than the state-of-the-art implementation for a graph with 234 million edges. The proposed techniques show how vectorization can be efficiently utilized to execute other graph kernels that require multiple traversals over a large-scale network on cutting-edge architectures. Keywords: centrality, closeness centrality, vectorization, breadth-first search, Intel Xeon Phi.
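The concurrent breadth-first searches mentioned above can be emulated in scalar software with bitmasks: each vertex carries one bit per active source, so a single frontier expansion advances many BFSs at once. This is an illustrative sketch of the idea only, not the paper's SIMD implementation; all names are made up.

```python
def multi_source_bfs(adj, sources):
    """Run one BFS per source simultaneously over an adjacency list.

    Bit s of a vertex's mask is set once that vertex has been reached
    from sources[s]. Returns {(vertex, source_index): bfs_level}."""
    n = len(adj)
    visited = [0] * n
    frontier = [0] * n
    levels = {}
    for s, v in enumerate(sources):
        visited[v] |= 1 << s
        frontier[v] |= 1 << s
        levels[(v, s)] = 0
    level = 0
    while any(frontier):
        level += 1
        nxt = [0] * n
        for u in range(n):
            if frontier[u]:
                for w in adj[u]:
                    # only bits (sources) that have not yet seen w survive
                    nxt[w] |= frontier[u] & ~visited[w]
        for v in range(n):
            mask, s = nxt[v], 0
            visited[v] |= mask
            while mask:
                if mask & 1:
                    levels[(v, s)] = level
                mask >>= 1
                s += 1
        frontier = nxt
    return levels
```

Closeness centrality of a source then follows as (n - 1) divided by the sum of its distances; on real hardware the per-vertex masks map naturally onto machine words or SIMD registers.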
Efficient Multithreaded Untransposed, Transposed or Symmetric Sparse Matrix-Vector Multiplication with the Recursive Sparse Blocks Format
, 2014
"... In earlier work we have introduced the “Recursive Sparse Blocks ” (RSB) sparse matrix storage scheme oriented towards cache efficient matrixvector multiplication (SpMV) and triangular solution (SpSV) on cache based shared memory parallel computers. Both the transposed (SpMV T) and symmetric (SymSpM ..."
Abstract

Cited by 1 (0 self)
In earlier work we have introduced the “Recursive Sparse Blocks” (RSB) sparse matrix storage scheme, oriented towards cache-efficient matrix-vector multiplication (SpMV) and triangular solution (SpSV) on cache-based shared memory parallel computers. Both the transposed (SpMV_T) and symmetric (SymSpMV) matrix-vector multiply variants are supported. RSB stands for a meta-format: it recursively partitions a rectangular sparse matrix into quadrants; leaf submatrices are stored in an appropriate traditional format, either Compressed Sparse Rows (CSR) or Coordinate (COO). In this work, we compare the performance of our RSB implementation of SpMV, SpMV_T, and SymSpMV to that of the state-of-the-art Intel Math Kernel Library (MKL) CSR implementation on the recent Intel Sandy Bridge processor. Our results with a few dozen real-world large matrices suggest the efficiency of the approach: in all of the cases, RSB’s SymSpMV (and in most cases, SpMV_T as well) took less than half of MKL CSR’s time; SpMV’s advantage was smaller. Furthermore, RSB’s SpMV_T is more scalable than MKL’s CSR, in that it performs almost as well as SpMV. Additionally, we include comparisons to the state-of-the-art Compressed Sparse Blocks (CSB) format implementation. We observed RSB to be slightly superior to CSB in SpMV_T, slightly inferior in SpMV, and better (in most cases by a factor of two or more) in SymSpMV. Although RSB is a non-traditional storage format and thus needs a special constructor, it can be assembled from CSR or any other similar row-ordered representation arrays in the time of a few dozen matrix-vector multiply executions. Thanks to its significant advantage over MKL’s CSR routines for symmetric or transposed matrix-vector multiplication, in most of the observed cases the assembly cost amortizes within fewer than fifty iterations.
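The recursive quadrant partitioning at the heart of RSB can be sketched in a few lines: split the coordinate list at the row/column midpoints until a submatrix is small, then multiply by walking the tree. This toy version (function names and the leaf threshold are invented for illustration) captures only the partitioning idea, not RSB's actual leaf formats, layout, or parallelization.

```python
def build_rsb(triples, r0, r1, c0, c1, leaf_nnz=2):
    """Recursively split COO triples (i, j, v) of the index range
    [r0, r1) x [c0, c1) into four quadrants; small leaves stay in COO."""
    if len(triples) <= leaf_nnz or r1 - r0 <= 1 or c1 - c0 <= 1:
        return ("leaf", sorted(triples))
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
    quads = [[], [], [], []]
    for (i, j, v) in triples:
        quads[2 * (i >= rm) + (j >= cm)].append((i, j, v))
    bounds = [(r0, rm, c0, cm), (r0, rm, cm, c1),
              (rm, r1, c0, cm), (rm, r1, cm, c1)]
    return ("node", [build_rsb(q, *b, leaf_nnz=leaf_nnz)
                     for q, b in zip(quads, bounds)])

def rsb_spmv(tree, x, y):
    """Accumulate y += A @ x by visiting every leaf of the quadtree."""
    kind, payload = tree
    if kind == "leaf":
        for (i, j, v) in payload:
            y[i] += v * x[j]
    else:
        for child in payload:
            rsb_spmv(child, x, y)
    return y
```

Because each quadrant touches a bounded index range, a leaf's working set can be sized to fit in cache, and transposed or symmetric multiplication amounts to visiting the same leaves with the index roles swapped.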
An Extended Compression Format for the Optimization of Sparse Matrix-Vector Multiplication
"... Sparse matrixvector multiplication (SpM×V) has been characterized as one of the most significant computational scientific kernels. The key algorithmic characteristic of the SpM×V kernel, that inhibits it from achieving high performance, is its very low flop:byte ratio. In this paper, we present an ..."
Abstract

Cited by 1 (1 self)
Sparse matrix-vector multiplication (SpM×V) has been characterized as one of the most significant computational scientific kernels. The key algorithmic characteristic of the SpM×V kernel that inhibits it from achieving high performance is its very low flop:byte ratio. In this paper, we present an extended and integrated compressed storage format, called Compressed Sparse eXtended (CSX), that is able to detect and encode simultaneously multiple commonly encountered substructures inside a sparse matrix. Relying on aggressive compression techniques for the sparse matrix’s indexing structure, CSX is able to considerably reduce the memory footprint of a sparse matrix, therefore alleviating the pressure on the memory subsystem. On a diverse set of sparse matrices, CSX was able to provide more than 40% average performance improvement over the standard CSR format on symmetric shared memory architectures and surpassed 20% improvement on NUMA architectures, significantly outperforming other CSR alternatives. Additionally, it was able to adapt successfully to the nonzero element structure of the considered matrices, exhibiting very stable performance. Finally, in the context of ‘real-life’ multiphysics simulation software, CSX was able to accelerate the SpM×V component by nearly 40% and the total solver time by approximately 15% after 1000 linear system iterations.
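One substructure such an index-compressing format can exploit is simple to show: runs of consecutive column indices collapse into (start, length) pairs, so the per-nonzero index storage for a dense row segment disappears. This fragment illustrates only that one encoding step, as a hypothetical stand-in; it is not CSX's actual substructure detection or byte-level layout.

```python
def encode_runs(cols):
    """Collapse a sorted list of column indices into (start, length) runs
    of consecutive values, e.g. [3, 4, 5, 9] -> [(3, 3), (9, 1)]."""
    runs = []
    start = prev = cols[0]
    for c in cols[1:]:
        if c == prev + 1:
            prev = c  # extend the current run
        else:
            runs.append((start, prev - start + 1))
            start = prev = c
    runs.append((start, prev - start + 1))
    return runs
```

Every index byte saved this way directly improves the kernel's flop:byte ratio, which is the mechanism behind the speedups the abstract reports.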
An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data
"... Abstract—General sparse matrixmatrix multiplication (SpGEMM) is a fundamental building block for numerous applications such as algebraic multigrid method, breadth first search and shortest path problem. Compared to other sparse BLAS routines, an efficient parallel SpGEMM algorithm has to handle ext ..."
Abstract

Cited by 1 (0 self)
Abstract—General sparse matrix-matrix multiplication (SpGEMM) is a fundamental building block for numerous applications such as the algebraic multigrid method, breadth-first search, and the shortest path problem. Compared to other sparse BLAS routines, an efficient parallel SpGEMM algorithm has to handle extra irregularity from three aspects: (1) the number of nonzero entries in the result sparse matrix is unknown in advance, (2) very expensive parallel insert operations at random positions in the result sparse matrix dominate the execution time, and (3) load balancing must account for sparse data in both input matrices. Recent work on GPU SpGEMM has demonstrated good time and space complexity, but works best for fairly regular matrices. In this work we present a GPU SpGEMM algorithm that particularly focuses on the above three problems. Memory pre-allocation for the result matrix is organized by a hybrid method that saves a large amount of global memory space and efficiently utilizes the very limited on-chip scratchpad memory. Parallel insert operations of the nonzero entries are implemented through the GPU merge path algorithm, which is experimentally found to be the fastest GPU merge approach. Load balancing builds on the number of necessary arithmetic operations on the nonzero entries and is guaranteed in all stages. Compared with the state-of-the-art GPU SpGEMM methods in the CUSPARSE and CUSP libraries and the latest CPU SpGEMM method in the Intel Math Kernel Library, our approach delivers excellent absolute performance and relative speedups on a benchmark suite composed of 23 matrices with diverse sparsity structures. Keywords: sparse matrices; matrix multiplication; linear algebra; GPU; merging; parallel algorithms.
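The irregularity in point (1) — the unknown size of the output — is easiest to see in the classic row-by-row (Gustavson) formulation, where each output row accumulates scaled rows of B. The dict-of-dicts sketch below is a reference formulation for checking results, not the paper's GPU algorithm.

```python
def spgemm(A, B):
    """Row-wise SpGEMM C = A @ B on dict-of-dict sparse matrices:
    row C[i] accumulates A[i][k] * B[k] over the nonzeros k of row i of A."""
    C = {}
    for i, row in A.items():
        acc = {}  # nnz of this row is unknown until the loop finishes
        for k, a in row.items():
            for j, b in B.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + a * b
        C[i] = acc
    return C
```

Note that the flop count of row i, the sum of nnz(B[k]) over row i's column indices k, is computable before any multiplication happens; that is exactly the quantity the abstract says its load balancing builds on.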
Incremental closeness centrality in distributed memory
"... a b s t r a c t Networks are commonly used to model traffic patterns, social interactions, or web pages. The vertices in a network do not possess the same characteristics: some vertices are naturally more connected and some vertices can be more important. Closeness centrality (CC) is a global metri ..."
Abstract
Networks are commonly used to model traffic patterns, social interactions, or web pages. The vertices in a network do not possess the same characteristics: some vertices are naturally more connected and some vertices can be more important. Closeness centrality (CC) is a global metric that quantifies how important a given vertex is in the network. When the network is dynamic and keeps changing, the relative importance of the vertices also changes. The cost of the best known algorithm for computing the CC scores makes it impractical to recompute them from scratch after each modification. In this paper, we propose STREAMER, a distributed memory framework for incrementally maintaining the closeness centrality scores of a network upon changes. It leverages pipelined, replicated parallelism and SpMM-based BFSs, and it takes NUMA effects into account. It makes maintaining the closeness centrality values of real-life networks with millions of interactions significantly faster and obtains almost linear speedups on a 64-node cluster with 8 threads per node.
Optimization by Runtime Specialization for Sparse Matrix-Vector Multiplication
, 2014
"... Runtime specialization optimizes programs based on partial information available only at run time. It is applicable when some input data is used repeatedly while other input data varies. This technique has the potential of generating highly efficient codes. In this paper, we explore the potential f ..."
Abstract
Runtime specialization optimizes programs based on partial information available only at run time. It is applicable when some input data is used repeatedly while other input data varies. This technique has the potential of generating highly efficient code. In this paper, we explore the potential for obtaining speedups for sparse matrix-dense vector multiplication using runtime specialization, in the case where a single matrix is to be multiplied by many vectors. We experiment with five methods involving runtime specialization, comparing them to methods that do not (including Intel’s MKL library). For this work, our focus is the evaluation of the speedups that can be obtained with runtime specialization without considering the overheads of the code generation. Our experiments use 23 matrices from the Matrix Market and Florida collections, and run on five different machines. In 94 of those 115 cases, the specialized code runs faster than any version without specialization. If we only use specialization, the average speedup with respect to Intel’s MKL library ranges from 1.44x to 1.77x, depending on the machine. We have also found that the best method depends on the matrix and machine; no method is best for all matrices and machines.
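The idea of baking the fixed matrix into the code can be demonstrated with a toy source-level specializer: it unrolls the CSR loops of one particular matrix into straight-line Python, so the generated function only reads the changing vector x. This is an illustrative analogue of the general technique, not one of the paper's five methods.

```python
def specialize_spmv(row_ptr, col_idx, vals):
    """Generate an SpMV function with this matrix's structure and values
    hard-coded: loop bounds, indices, and coefficients become literals."""
    n_rows = len(row_ptr) - 1
    lines = ["def spmv(x):", "    y = [0.0] * %d" % n_rows]
    for i in range(n_rows):
        terms = ["%r * x[%d]" % (vals[k], col_idx[k])
                 for k in range(row_ptr[i], row_ptr[i + 1])]
        if terms:
            lines.append("    y[%d] = %s" % (i, " + ".join(terms)))
    lines.append("    return y")
    namespace = {}
    exec("\n".join(lines), namespace)  # compile the specialized source
    return namespace["spmv"]
```

The generation cost is paid once, which matches the paper's setting of one matrix multiplied by many vectors; the same trade-off governs when such specialization amortizes in practice.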
Fluid Pinch-off
"... This 4608 2 image of a combustion simulation result was rendered by a hybridparallel (MPI+pthreads) raycasting volume rendering implementation running on 216,000 cores of the JaguarPF supercomputer. Combustion simulation data courtesy of J. Bell and M. Day ..."
Abstract
This 4608² image of a combustion simulation result was rendered by a hybrid-parallel (MPI+pthreads) ray-casting volume rendering implementation running on 216,000 cores of the JaguarPF supercomputer. Combustion simulation data courtesy of J. Bell and M. Day.