Results 1–10 of 28
OSKI: A library of automatically tuned sparse matrix kernels
Institute of Physics Publishing, 2005
"... kernels ..."
(Show Context)
Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms
In Proc. SC2007: High Performance Computing, Networking, and Storage Conference, 2007
"... We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore spec ..."
Abstract

Cited by 153 (20 self)
 Add to MetaCart
(Show Context)
We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore-specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) – one of the most heavily used kernels in scientific computing – across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD dual-core and Intel quad-core designs, the heterogeneous STI Cell, as well as the first scientific study of the highly multithreaded Sun Niagara2. We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural trade-offs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.
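The kernel under study is plain sparse matrix-vector multiply. As a point of reference, a minimal CSR-based SpMV parallelized over rows with OpenMP might look like the sketch below; the csr_t layout and names are illustrative assumptions, not the paper's tuned implementation, and the multicore-specific optimizations the paper evaluates (blocking, NUMA-aware placement, prefetching) would be layered on top of this baseline.

/* Minimal sketch: y = A*x with A in compressed sparse row (CSR) form,
 * parallelized over rows with OpenMP. Illustrative only. */
typedef struct {
    int     nrows;   /* number of rows                               */
    int    *rowptr;  /* length nrows+1; start of each row's nonzeros */
    int    *colidx;  /* length nnz; column index of each nonzero     */
    double *val;     /* length nnz; value of each nonzero            */
} csr_t;

void spmv_csr(const csr_t *A, const double *x, double *y)
{
#pragma omp parallel for schedule(static)
    for (int i = 0; i < A->nrows; i++) {
        double sum = 0.0;
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
            sum += A->val[k] * x[A->colidx[k]];
        y[i] = sum;
    }
}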
Autotuning Performance on Multicore Computers
2008
"... personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires pri ..."
Abstract

Cited by 30 (8 self)
 Add to MetaCart
(Show Context)
personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific
Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks
In SPAA, 2009
"... This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and A T x to be computed efficiently in parallel, where A is an n × n sparse matrix with nnz ≥ n nonzeros and x is a dense nvector. Our algorithms use Θ(nnz) work (serial running ..."
Abstract

Cited by 26 (1 self)
 Add to MetaCart
(Show Context)
This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and A^T x to be computed efficiently in parallel, where A is an n × n sparse matrix with nnz ≥ n nonzeros and x is a dense n-vector. Our algorithms use Θ(nnz) work (serial running time) and Θ(√(n lg n)) span (critical-path length), yielding a parallelism of Θ(nnz / √(n lg n)), which is amply high for virtually any large matrix. The storage requirement for CSB is essentially the same as that for the more standard compressed sparse rows (CSR) format, for which computing Ax in parallel is easy but A^T x is difficult. Benchmark results indicate that on one processor, the CSB algorithms for Ax and A^T x run just as fast as the CSR algorithm for Ax, but the CSB algorithms also scale up linearly with processors until limited by off-chip memory bandwidth.
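The CSR-versus-CSB contrast the abstract draws can be seen in code: with CSR, y = Ax parallelizes cleanly over rows (as in the earlier sketch), but y = A^T x scatters into the output vector and needs synchronization. Below is a hedged illustration of that difficulty, reusing the illustrative csr_t layout from the earlier sketch; CSB itself, which stores the matrix in square blocks to avoid this, is not shown here.

/* Sketch of why A^T x is awkward in CSR: the same output entry y[j] can be
 * touched by many rows, so a row-parallel loop needs atomic updates (or a
 * per-thread copy of y). Illustrative only. */
void spmv_csr_transpose(const csr_t *A, int ncols, const double *x, double *y)
{
    for (int j = 0; j < ncols; j++)
        y[j] = 0.0;

#pragma omp parallel for schedule(static)
    for (int i = 0; i < A->nrows; i++) {
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++) {
            /* scatter: different rows i may update the same y entry */
#pragma omp atomic
            y[A->colidx[k]] += A->val[k] * x[i];
        }
    }
}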
Cache-oblivious sparse matrix–vector multiplication by using sparse matrix partitioning methods
SIAM Journal on Scientific Computing, 2009
"... Abstract The sparse matrixvector (SpMV) multiplication is an important kernel in many applications. When the sparse matrix used is unstructured, however, standard SpMV multiplication implementations typically are inefficient in terms of cache usage, sometimes working at only a fraction of peak per ..."
Abstract

Cited by 18 (4 self)
 Add to MetaCart
(Show Context)
The sparse matrix-vector (SpMV) multiplication is an important kernel in many applications. When the sparse matrix used is unstructured, however, standard SpMV multiplication implementations typically are inefficient in terms of cache usage, sometimes working at only a fraction of peak performance. Cache-aware algorithms take information on specifics of the cache architecture as a parameter to derive an efficient SpMV multiply. In contrast, cache-oblivious algorithms strive to obtain efficient algorithms regardless of cache specifics. In this area, earlier research by ...
Efficient Sparse Matrix-Vector Multiplication on x86-based Many-core Processors
In 27th International Conference on Supercomputing (ICS)
"... Sparse matrixvector multiplication (SpMV) is an important kernel in many scientific applications and is known to be memory bandwidth limited. On modern processors with wide SIMD and large numbers of cores, we identify and address several bottlenecks which may limit performance even before memory b ..."
Abstract

Cited by 14 (0 self)
 Add to MetaCart
(Show Context)
Sparse matrix-vector multiplication (SpMV) is an important kernel in many scientific applications and is known to be memory bandwidth limited. On modern processors with wide SIMD and large numbers of cores, we identify and address several bottlenecks which may limit performance even before memory bandwidth: (a) low SIMD efficiency due to sparsity, (b) overhead due to irregular memory accesses, and (c) load imbalance due to nonuniform matrix structures. We describe an efficient implementation of SpMV on the Intel® Xeon Phi™ coprocessor, codenamed Knights Corner (KNC), that addresses the above challenges. Our implementation exploits the salient architectural features of KNC, such as large caches and hardware support for irregular memory accesses. By using a specialized data structure with careful load balancing, we attain performance on average close to 90% of KNC’s achievable memory bandwidth on a diverse set of sparse matrices. Furthermore, we demonstrate that our implementation is 3.52x and 1.32x faster, respectively, than the best available implementations on the dual Intel® Xeon® Processor E5-2680 and the NVIDIA Tesla K20X architecture.
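One of the bottlenecks named above, load imbalance from nonuniform row lengths, is commonly handled by splitting the row range so that each thread owns roughly the same number of nonzeros rather than the same number of rows. The sketch below illustrates that general idea under the illustrative csr_t layout used earlier; the paper's actual KNC-specific data structure and SIMD-friendly layout are more involved.

/* Partition rows so each of nthreads threads owns roughly nnz/nthreads
 * nonzeros. row_start must hold nthreads+1 entries; thread t then
 * processes rows [row_start[t], row_start[t+1]). Illustrative only. */
void balance_rows_by_nnz(const csr_t *A, int nthreads, int *row_start)
{
    long nnz = A->rowptr[A->nrows];
    int  r   = 0;
    for (int t = 0; t < nthreads; t++) {
        long target = nnz * t / nthreads;  /* nonzeros that should precede thread t */
        while (r < A->nrows && A->rowptr[r] < target)
            r++;
        row_start[t] = r;
    }
    row_start[nthreads] = A->nrows;
}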
Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi
2013
"... Intel Xeon Phi is a recently released highperformance coprocessor which features 61 cores each supporting 4 hardware threads with 512bit wide SIMD registers achieving a peak theoretical performance of 1Tflop/s in double precision. Many scientific applications involve operations on large sparse mat ..."
Abstract

Cited by 13 (1 self)
 Add to MetaCart
(Show Context)
Intel Xeon Phi is a recently released high-performance coprocessor which features 61 cores, each supporting 4 hardware threads with 512-bit wide SIMD registers, achieving a peak theoretical performance of 1 Tflop/s in double precision. Many scientific applications involve operations on large sparse matrices, such as linear solvers, eigensolvers, and graph mining algorithms. The core of most of these applications involves the multiplication of a large, sparse matrix with a dense vector (SpMV). In this paper, we investigate the performance of the Xeon Phi coprocessor for SpMV. We first provide a comprehensive introduction to this new architecture and analyze its peak performance with a number of microbenchmarks. Although the design of a Xeon Phi core is not much different than those of the cores in modern processors, its large number of cores and hyperthreading capability allow many applications to saturate the available memory bandwidth, which is not the case for many cutting-edge processors. Yet, our performance studies show that it is the memory latency, not the bandwidth, which creates a bottleneck for SpMV on this architecture. Finally, our experiments show that Xeon Phi’s sparse kernel performance is very promising and even better than that of cutting-edge general purpose processors and GPUs.
Fast Iterative Graph Computation with Block Updates
"... Scaling iterative graph processing applications to large graphs is an important problem. Performance is critical, as data scientists need to execute graph programs many times with varying parameters. The need for a highlevel, highperformance programming model has inspired much research on graph pr ..."
Abstract

Cited by 8 (1 self)
 Add to MetaCart
(Show Context)
Scaling iterative graph processing applications to large graphs is an important problem. Performance is critical, as data scientists need to execute graph programs many times with varying parameters. The need for a high-level, high-performance programming model has inspired much research on graph programming frameworks. In this paper, we show that the important class of computationally light graph applications – applications that perform little computation per vertex – has severe scalability problems across multiple cores, as these applications hit an early “memory wall” that limits their speedup. We propose a novel block-oriented computation model, in which computation is iterated locally over blocks of highly connected nodes, significantly improving the amount of computation per cache miss. Following this model, we describe the design and implementation of a block-aware graph processing runtime that keeps the familiar vertex-centric programming paradigm while reaping the benefits of block-oriented execution. Our experiments show that block-oriented execution significantly improves the performance of our framework for several graph applications.
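As a rough illustration of the block-oriented model described above: instead of sweeping all vertices once per global iteration, vertices are grouped into cache-sized blocks and the per-vertex update is repeated locally within a block before moving on, so the block's data is reused while still resident. The names below (block_t, update_vertex, local_iters) are hypothetical and do not reflect the paper's runtime API.

/* Illustrative block-oriented driver: updates are iterated locally over each
 * block of highly connected vertices, improving work done per cache miss. */
typedef struct {
    int *vertices;    /* ids of the vertices grouped into this block */
    int  nvertices;
} block_t;

void run_block_iteration(const block_t *blocks, int nblocks,
                         int local_iters, void (*update_vertex)(int v))
{
    for (int b = 0; b < nblocks; b++)
        for (int it = 0; it < local_iters; it++)      /* local sweeps over one block */
            for (int i = 0; i < blocks[b].nvertices; i++)
                update_vertex(blocks[b].vertices[i]);
}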
Utilizing Recursive Storage in Sparse Matrix-Vector Multiplication—Preliminary Considerations
"... Computations with sparse matrices on “multicore cache based ” computers are affected by the irregularity of the problem at hand, and performance degrades easily. In this note we propose a recursive storage format for sparse matrices, and evaluate its usage for the Sparse MatrixVector (SpMV) operati ..."
Abstract

Cited by 7 (3 self)
 Add to MetaCart
(Show Context)
Computations with sparse matrices on “multicore, cache-based” computers are affected by the irregularity of the problem at hand, and performance degrades easily. In this note we propose a recursive storage format for sparse matrices, and evaluate its usage for the sparse matrix-vector (SpMV) operation on two multicore machines and one multiprocessor machine. We report benchmark results showing high performance and scalability comparable to current state-of-the-art implementations.
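A recursive storage format of this kind can be pictured as a quadtree over the matrix: a submatrix is either small enough to be stored directly, or is split into four quadrants that are recursed into, so SpMV touches x and y in cache-sized pieces. The sketch below is a hedged illustration of that general scheme, with leaves stored in the illustrative csr_t form used earlier; it is not the authors' actual format.

/* Hypothetical quadtree-style recursive storage: leaves hold a small CSR
 * block plus its (row, column) offset inside A; internal nodes hold four
 * quadrants. spmv_quad accumulates into y, so y must be zeroed beforehand. */
typedef struct quad quad_t;
struct quad {
    int     is_leaf;
    csr_t  *leaf;        /* valid only when is_leaf is nonzero */
    quad_t *child[4];    /* quadrants; entries may be NULL     */
    int     row_off;     /* offset of this submatrix within A  */
    int     col_off;
};

void spmv_quad(const quad_t *q, const double *x, double *y)
{
    if (q == NULL)
        return;
    if (q->is_leaf) {
        for (int i = 0; i < q->leaf->nrows; i++) {
            double sum = 0.0;
            for (int k = q->leaf->rowptr[i]; k < q->leaf->rowptr[i + 1]; k++)
                sum += q->leaf->val[k] * x[q->col_off + q->leaf->colidx[k]];
            y[q->row_off + i] += sum;
        }
    } else {
        for (int c = 0; c < 4; c++)
            spmv_quad(q->child[c], x, y);
    }
}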