Results 1–10 of 74
Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms
In Proc. SC2007: High Performance Computing, Networking, and Storage Conference, 2007
Abstract
Cited by 153 (22 self)
We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore-specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) – one of the most heavily used kernels in scientific computing – across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD dual-core and Intel quad-core designs, the heterogeneous STI Cell, as well as the first scientific study of the highly multithreaded Sun Niagara2. We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural tradeoffs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.
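The baseline kernel that work like the above tunes is SpMV over the standard CSR (compressed sparse row) format. As a point of reference, a minimal unoptimized sketch (all names illustrative; real implementations add the blocking, prefetching, and threading optimizations these papers study):

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A*x for a matrix A stored in CSR format.

    row_ptr[i]..row_ptr[i+1] delimit the nonzeros of row i;
    col_idx and vals hold their column indices and values.
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        # Streams vals/col_idx sequentially but gathers x irregularly:
        # this indirect access to x is what makes SpMV memory-bound.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y

# 2x2 example: A = [[1, 2], [0, 3]]
print(spmv_csr([0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0], [1.0, 1.0]))  # [3.0, 3.0]
```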
OSKI: A library of automatically tuned sparse matrix kernels
Institute of Physics Publishing, 2005
Implementing sparse matrix-vector multiplication on throughput-oriented processors
In SC '09: Proceedings of the 2009 ACM/IEEE Conference on Supercomputing, 2009
Abstract
Cited by 137 (6 self)
Sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra. In contrast to the uniform regularity of dense linear algebra, sparse operations encounter a broad spectrum of matrices ranging from the regular to the highly irregular. Harnessing the tremendous potential of throughput-oriented processors for sparse operations requires that we expose substantial fine-grained parallelism and impose sufficient regularity on execution paths and memory access patterns. We explore SpMV methods that are well-suited to throughput-oriented architectures like the GPU and which exploit several common sparsity classes. The techniques we propose are efficient, successfully utilizing large percentages of peak bandwidth. Furthermore, they deliver excellent total throughput, averaging 16 GFLOP/s and 10 GFLOP/s in double precision for structured grid and unstructured mesh matrices, respectively, on a GeForce GTX 285. This is roughly 2.8 times the throughput previously achieved on Cell BE and more than 10 times that of a quad-core Intel Clovertown system.
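One of the regularity-imposing formats used in GPU SpMV work of this kind is ELLPACK, where every row is padded to the same number of stored entries so that one thread per row follows an identical access pattern. A minimal sequential sketch of the idea (the data layout and padding convention here are illustrative):

```python
def spmv_ell(cols, vals, x):
    """y = A*x for ELLPACK storage: every row is padded to the same
    width K, giving the regular execution paths and memory accesses
    that map well to one-thread-per-row GPU kernels.

    cols[i][k] is the column of the k-th stored entry of row i,
    with -1 marking padding; vals holds the matching values.
    """
    n = len(cols)
    y = [0.0] * n
    for i in range(n):          # on a GPU, each i would be a thread
        acc = 0.0
        for k in range(len(cols[i])):
            if cols[i][k] >= 0:  # skip padding entries
                acc += vals[i][k] * x[cols[i][k]]
        y[i] = acc
    return y

# A = [[4, 0, 1], [0, 2, 0], [5, 0, 3]], padded to width K = 2
cols = [[0, 2], [1, -1], [0, 2]]
vals = [[4.0, 1.0], [2.0, 0.0], [5.0, 3.0]]
print(spmv_ell(cols, vals, [1.0, 1.0, 1.0]))  # [5.0, 2.0, 8.0]
```

ELL wastes storage when row lengths vary widely, which is why hybrid schemes split off the long rows into a separate format.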
The Potential of the Cell Processor for Scientific Computing
CF'06, 2006
Abstract
Cited by 95 (7 self)
The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. As a result, the high performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of using the forthcoming STI Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key scientific computing kernels: dense matrix multiply, sparse matrix-vector multiply, stencil computations, and 1D/2D FFTs. The difficulty of programming Cell, which requires assembly-level intrinsics for the best performance, makes this model useful as an initial step in algorithm design and evaluation. Next, we validate the accuracy of our model by comparing results against published hardware results, as well as our own implementations on the Cell full system simulator. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different mappings of the kernels and demonstrates a simple and effective programming model for Cell's unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.
Model-driven autotuning of sparse matrix-vector multiply on GPUs
In PPoPP, 2010
Abstract
Cited by 61 (4 self)
We present a performance model-driven framework for automated performance tuning (autotuning) of sparse matrix-vector multiply (SpMV) on systems accelerated by graphics processing units (GPUs). Our study consists of two parts. First, we describe several carefully hand-tuned SpMV implementations for GPUs, identifying key GPU-specific performance limitations, enhancements, and tuning opportunities. These implementations, which include variants on classical blocked compressed sparse row (BCSR) and blocked ELLPACK (BELLPACK) storage formats, match or exceed state-of-the-art implementations. For instance, our best BELLPACK implementation achieves up to 29.0 Gflop/s in single-precision and 15.7 Gflop/s in double-precision on the NVIDIA T10P multiprocessor (C1060), enhancing prior state-of-the-art unblocked implementations (Bell and Garland, 2009) by up to 1.8× and 1.5× for single- and double-precision respectively. However, achieving this level of performance requires input matrix-dependent parameter tuning. Thus, in the second part of this study, we develop a performance model that can guide tuning. Like prior autotuning models for CPUs (e.g., Im, Yelick, and Vuduc, 2004), this model requires offline measurements and runtime estimation, but more directly models the structure of multithreaded vector processors like GPUs. We show that our model can identify the implementations that achieve within 15% of those found through exhaustive search.
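The BCSR family of formats mentioned above groups nonzeros into small dense r×c blocks so the inner loops can be unrolled and entries of x reused from registers; the tuning problem is choosing r and c per matrix. A minimal sequential sketch of BCSR SpMV (names and layout illustrative):

```python
def spmv_bcsr(row_ptr, col_idx, blocks, x, r, c):
    """y = A*x with r-by-c register blocking (BCSR).

    row_ptr/col_idx index *block* rows and columns; blocks[k] is the
    k-th stored r-by-c dense block (explicit zeros included, which is
    the storage cost traded for the unrolled inner loops).
    """
    nb = len(row_ptr) - 1              # number of block rows
    y = [0.0] * (nb * r)
    for ib in range(nb):
        for k in range(row_ptr[ib], row_ptr[ib + 1]):
            jb = col_idx[k]            # block column index
            for di in range(r):        # tuned code unrolls these
                for dj in range(c):    # two loops completely
                    y[ib * r + di] += blocks[k][di][dj] * x[jb * c + dj]
    return y

# 4x4 matrix stored as two 2x2 blocks: block (0,0) and block (1,1)
row_ptr, col_idx = [0, 1, 2], [0, 1]
blocks = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 0.0], [0.0, 6.0]]]
print(spmv_bcsr(row_ptr, col_idx, blocks, [1.0] * 4, 2, 2))  # [3.0, 7.0, 5.0, 6.0]
```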
Minimizing Communication in Sparse Matrix Solvers
Abstract
Cited by 35 (10 self)
Data communication within the memory system of a single processor node and between multiple nodes in a system is the bottleneck in many iterative sparse matrix solvers like CG and GMRES. Here k iterations of a conventional implementation perform k sparse matrix-vector multiplications and Ω(k) vector operations like dot products, resulting in communication that grows by a factor of Ω(k) in both the memory and network. By reorganizing the sparse-matrix kernel to compute a set of matrix-vector products at once and reorganizing the rest of the algorithm accordingly, we can perform k iterations by sending O(log P) messages instead of O(k · log P) messages on a parallel machine, and reading the matrix A from DRAM to cache just once, instead of k times on a sequential machine. This reduces communication to the minimum possible. We combine these techniques to form a new variant of GMRES. Our shared-memory implementation on an 8-core Intel Clovertown gets speedups of up to 4.3× over standard GMRES, without sacrificing convergence rate or numerical stability.
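The reorganization described above hinges on producing the Krylov basis [x, Ax, A²x, …, Aᵏx] up front instead of interleaving each product with dot products. A toy sketch of that loop structure (communication-avoiding solvers additionally partition A with ghost zones so all k products happen in one pass over the matrix; that machinery is omitted here, and `spmv` stands for any A·x routine):

```python
def matrix_powers(spmv, x, k):
    """Return the basis [x, A*x, A^2*x, ..., A^k*x].

    A conventional Krylov iteration alternates one SpMV with Omega(1)
    dot products per step (k communication rounds); computing the
    whole basis first is what lets A be read from DRAM once and the
    remaining reductions be batched.
    """
    basis = [x]
    for _ in range(k):
        basis.append(spmv(basis[-1]))
    return basis

# Example with A = diag(2, 3)
diag_spmv = lambda v: [2.0 * v[0], 3.0 * v[1]]
print(matrix_powers(diag_spmv, [1.0, 1.0], 2))  # [[1.0, 1.0], [2.0, 3.0], [4.0, 9.0]]
```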
Autotuning Performance on Multicore Computers, 2008
Cited by 32 (10 self)
When cache blocking sparse matrix-vector multiply works and why
In Proceedings of the PARA'04 Workshop on the State-of-the-Art in Scientific Computing, 2004
Abstract
Cited by 28 (5 self)
We present new performance models and more compact data structures for cache blocking when applied to sparse matrix-vector multiply (SpM×V). We extend our prior models by relaxing the assumption that the vectors fit in cache and find that the new models are accurate enough to predict optimum block sizes. In addition, we determine criteria that predict when cache blocking improves performance. We conclude with architectural suggestions that would make memory systems execute SpM×V faster.
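Cache blocking for SpM×V splits the matrix into panels so that each panel touches only a cache-sized slice of the source vector x. A minimal sketch of the column-panel variant, with each panel stored in its own CSR structure (the layout and names are illustrative, not the paper's data structures):

```python
def spmv_cache_blocked(col_panels, x, y):
    """y += A*x where A is split into vertical (column) panels.

    Each panel is a (row_ptr, col_idx, vals) CSR triple whose col_idx
    values fall in one cache-sized range of x, so the irregular reads
    of x stay within cache while a panel is processed; partial row
    sums accumulate into y across panels.
    """
    for row_ptr, col_idx, vals in col_panels:
        n = len(row_ptr) - 1
        for i in range(n):
            for k in range(row_ptr[i], row_ptr[i + 1]):
                y[i] += vals[k] * x[col_idx[k]]
    return y

# A = [[1, 2], [3, 4]] split into two one-column panels
panels = [([0, 1, 2], [0, 0], [1.0, 3.0]),   # column 0
          ([0, 1, 2], [1, 1], [2.0, 4.0])]   # column 1
print(spmv_cache_blocked(panels, [1.0, 1.0], [0.0, 0.0]))  # [3.0, 7.0]
```

Whether this pays off depends on the matrix: if x already fits in cache the extra row traversals only add overhead, which is the "when and why" question the models above answer.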
A Case for Machine Learning to Optimize Multicore Performance
Abstract
Cited by 27 (1 self)
Multicore architectures have become so complex and diverse that there is no obvious path to achieving good performance. Hundreds of code transformations, compiler flags, architectural features and optimization parameters result in a search space that can take many machine-months to explore exhaustively. Inspired by successes in the systems community, we apply state-of-the-art machine learning techniques to explore this space more intelligently. On 7-point and 27-point stencil code, our technique takes about two hours to discover a configuration whose performance is within 1% of and up to 18% better than that achieved by a human expert. This factor of 2000 speedup over manual exploration of the autotuning parameter space enables us to explore optimizations that were previously off-limits. We believe the opportunity for using machine learning in multicore autotuning is even more promising than the successes to date in the systems literature.
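The search problem these autotuners face can be pictured as a loop that samples configurations, benchmarks each, and keeps the fastest. A toy sketch of the surrounding loop with blind random sampling (everything here is illustrative; the paper's contribution is replacing the blind sampler with a learned model that proposes promising points):

```python
import random

def autotune(benchmark, space, budget=32, seed=0):
    """Pick the fastest of `budget` randomly sampled configurations.

    space maps each tuning parameter to its candidate values;
    benchmark(cfg) returns a measured runtime for one configuration.
    A learned model would replace the uniform rng.choice sampling.
    """
    rng = random.Random(seed)
    best_cfg, best_time = None, float("inf")
    for _ in range(budget):
        cfg = {name: rng.choice(vals) for name, vals in space.items()}
        t = benchmark(cfg)        # one timed run of the generated code
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time

# Hypothetical space: pretend runtime is minimized at unroll = 4
space = {"unroll": [1, 2, 4, 8]}
cfg, t = autotune(lambda c: abs(c["unroll"] - 4), space, budget=64)
print(cfg, t)
```

Exhaustive search over realistic spaces is the "many machine-months" cost quoted above; model-guided sampling aims to reach a near-optimal configuration in a tiny fraction of that budget.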