Results 1–10 of 25
Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms
 In Proc. SC2007: High Performance Computing, Networking, and Storage Conference, 2007
Cited by 153 (20 self)
Abstract:
We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore-specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) – one of the most heavily used kernels in scientific computing – across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD dual-core and Intel quad-core designs, the heterogeneous STI Cell, as well as the first scientific study of the highly multithreaded Sun Niagara2. We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural tradeoffs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.
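For reference, the kernel all of these optimizations target is compact enough to sketch in a few lines. This is a plain CSR (compressed sparse row) SpMV in Python, not any of the paper's tuned variants; the variable names are illustrative:

```python
def spmv_csr(rowptr, colind, vals, x):
    # y = A @ x for a matrix stored in CSR form:
    #   rowptr[i]..rowptr[i+1] delimits the nonzeros of row i,
    #   colind[k] is the column of the k-th nonzero, vals[k] its value.
    n = len(rowptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(rowptr[i], rowptr[i + 1]):
            y[i] += vals[k] * x[colind[k]]
    return y
```

Tuned variants keep this loop structure but add techniques such as register blocking, software prefetching, and NUMA-aware data placement on top of it.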
Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication
 In Proc. IPDPS, 2011
Cited by 22 (0 self)
Abstract:
On multicore architectures, the ratio of peak memory bandwidth to peak floating-point performance (byte:flop ratio) is decreasing as core counts increase, further limiting the performance of bandwidth-limited applications. Multiplying a sparse matrix (as well as its transpose in the unsymmetric case) with a dense vector is the core of sparse iterative methods. In this paper, we present a new multithreaded algorithm for the symmetric case which potentially cuts the bandwidth requirements in half while exposing ample parallelism in practice. We also give a new data structure transformation, called bitmasked register blocks, which promises significant reductions in bandwidth requirements by reducing the number of indexing elements without introducing additional fill-in zeros. Our work shows how to incorporate this transformation into existing parallel algorithms (both symmetric and unsymmetric) without limiting their parallel scalability. Experimental results indicate that the combined benefits of bitmasked register blocks and the new symmetric algorithm can be as high as a factor of 3.5x in multicore performance over an already scalable parallel approach. We also provide a model that accurately predicts the performance of the new methods, showing that even larger performance gains are expected in future multicore systems as current trends (decreasing byte:flop ratio and larger sparse matrices) continue.
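The bandwidth-halving idea for the symmetric case can be sketched as follows: store only the lower triangle, and let each off-diagonal entry contribute to two output elements. This is a serial sketch assuming CSR storage of the lower triangle, not the paper's multithreaded algorithm (which must also coordinate the concurrent updates to y):

```python
def spmv_symmetric_lower(rowptr, colind, vals, x):
    # y = A @ x for symmetric A, storing only the lower triangle
    # (diagonal included). Each off-diagonal a_ij is read once but
    # updates both y[i] and y[j], roughly halving matrix traffic.
    n = len(rowptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(rowptr[i], rowptr[i + 1]):
            j = colind[k]
            y[i] += vals[k] * x[j]
            if j != i:            # mirror contribution of a_ji
                y[j] += vals[k] * x[i]
    return y
```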
Cache-oblivious sparse matrix–vector multiplication by using sparse matrix partitioning methods.
 SIAM Journal on Scientific Computing, 2009
Cited by 18 (4 self)
Abstract:
Sparse matrix-vector (SpMV) multiplication is an important kernel in many applications. When the sparse matrix used is unstructured, however, standard SpMV implementations are typically inefficient in terms of cache usage, sometimes working at only a fraction of peak performance. Cache-aware algorithms take information on the specifics of the cache architecture as a parameter to derive an efficient SpMV multiply. In contrast, cache-oblivious algorithms strive to obtain efficient algorithms regardless of cache specifics. In this area, earlier research by …
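The cache-oblivious principle — recursively bisect the matrix so that nearby nonzeros are processed together at every scale, with no cache-size parameter anywhere — can be illustrated on a coordinate-format matrix. This toy recursion stands in for, and is much simpler than, the paper's partitioning-based method:

```python
def spmv_oblivious(entries, x, y, split_rows=True, base=4):
    # entries: list of (row, col, val) triples; accumulates A @ x into y.
    # Recursively bisects the index space, alternating row and column
    # splits, so spatially close entries are handled together at every
    # level of the cache hierarchy -- without knowing any cache size.
    if len(entries) <= base:
        for i, j, v in entries:
            y[i] += v * x[j]
        return
    key = 0 if split_rows else 1
    pivot = sorted(e[key] for e in entries)[len(entries) // 2]
    lo = [e for e in entries if e[key] < pivot]
    hi = [e for e in entries if e[key] >= pivot]
    if not lo or not hi:  # degenerate split: just process directly
        for i, j, v in entries:
            y[i] += v * x[j]
        return
    spmv_oblivious(lo, x, y, not split_rows, base)
    spmv_oblivious(hi, x, y, not split_rows, base)
```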
Loop Chaining: A Programming Abstraction For Balancing Locality and Parallelism
Cited by 5 (3 self)
Abstract:
There is a significant, established code base in the scientific computing community. Some of these codes have been parallelized already but are now encountering scalability issues due to poor data locality, inefficient data distributions, or load imbalance. In this work, we introduce a new abstraction called loop chaining in which a sequence of parallel and/or reduction loops that explicitly share data are grouped together into a chain. Once specified, a chain of loops can be viewed as a set of iterations under a partial ordering. This partial ordering is dictated by data dependences that, as part of the abstraction, are exposed, thereby avoiding interprocedural program analysis. Thus a loop chain is a partially ordered set of iterations that makes scheduling and determining data distributions across loops possible for a compiler and/or runtime system. The flexibility of being able to schedule across loops enables better management of the tradeoff between data locality and parallelism. In this paper, we define the loop chaining concept and present three case studies using loop chains in scientific codes: the sparse matrix Jacobi benchmark; a domain-specific library, OP2, used in full applications with unstructured grids; and a domain-specific library, Chombo, used in full applications with structured grids. Preliminary results for the Jacobi benchmark show that a loop-chain-enabled optimization, full sparse tiling, results in a speedup of as much as 2.68x over a parallelized, blocked implementation on a multicore system with 40 cores.
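A minimal illustration of the chaining idea, using two hypothetical dependent loops (a smoothing sweep feeding a pointwise scaling): instead of two full passes over the arrays, both loops are executed tile by tile so the intermediate array stays in cache. Because the second loop here is pointwise in b, per-tile execution is legal as written; in general (as in full sparse tiling) tile boundaries must be adjusted to respect cross-loop dependences:

```python
def two_sweeps_chained(a, tile=4):
    # Chain of two loops over interior points:
    #   loop 1: b[i] = (a[i-1] + a[i] + a[i+1]) / 3   (smoothing)
    #   loop 2: c[i] = 2 * b[i]                        (pointwise scaling)
    # Executing both loops per tile keeps b[start:end] cache-resident
    # between the producing and consuming loop.
    n = len(a)
    b = [0.0] * n
    c = [0.0] * n
    for start in range(1, n - 1, tile):
        end = min(start + tile, n - 1)
        for i in range(start, end):          # loop 1, this tile only
            b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
        for i in range(start, end):          # loop 2, same tile
            c[i] = 2.0 * b[i]
    return c
```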
Toward Harnessing DOACROSS Parallelism for Multi-GPGPUs
Cited by 5 (1 self)
Abstract:
To exploit the full potential of GPGPUs for general-purpose computing, the DOACROSS parallelism abundant in scientific and engineering applications must be harnessed. However, the presence of cross-iteration data dependences in DOACROSS loops poses an obstacle to executing their computations concurrently using a massive number of fine-grained threads. This work focuses on iterative PDE solvers rich in DOACROSS parallelism to identify optimization principles and strategies that allow their efficient mapping to GPGPUs. Our main finding is that certain DOACROSS loops can be accelerated further on GPGPUs if they are algorithmically restructured (by a domain expert) to be more amenable to GPGPU parallelization, judiciously optimized (by the compiler), and carefully tuned by a performance-tuning tool. We substantiate this finding with a case study by presenting a new parallel SSOR method that admits more efficient data-parallel SIMD execution than red-black SOR on GPGPUs. Our solution is obtained nonconventionally, by starting from a K-layer SSOR method and then parallelizing it by applying a non-dependence-preserving scheme consisting of a new domain decomposition technique followed by a generalized loop tiling. Despite its relatively slower convergence, our new method outperforms red-black SOR by striking a better balance between data reuse and parallelism and by trading off convergence rate for SIMD parallelism. Our experimental results highlight the importance of synergy between domain experts, compiler optimizations, and performance tuning in maximizing the performance of applications, particularly PDE-based DOACROSS loops, on GPGPUs.
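For context, red-black SOR — the baseline the new SSOR method is compared against — colors the grid so that each point depends only on points of the other color, making each color sweep fully data-parallel (the property that makes it attractive on GPUs). A one-dimensional serial sketch, illustrative rather than the paper's GPU code:

```python
def red_black_sor_sweep(u, f, h, omega):
    # One red-black SOR sweep for the 1D Poisson problem -u'' = f with
    # fixed boundary values u[0] and u[-1]. Interior Gauss-Seidel update:
    #   u[i] = (u[i-1] + u[i+1] + h*h*f[i]) / 2, relaxed by omega.
    # Odd ("red") points are updated first, then even ("black") points;
    # within each color the updates are independent, hence SIMD-friendly.
    n = len(u)
    for start in (1, 2):
        for i in range(start, n - 1, 2):
            gauss_seidel = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
            u[i] += omega * (gauss_seidel - u[i])
    return u
```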
Set and Relation Manipulation for the Sparse Polyhedral Framework
Cited by 3 (2 self)
Abstract:
The Sparse Polyhedral Framework (SPF) extends the Polyhedral Model by using the uninterpreted function call abstraction for the compile-time specification of run-time reordering transformations, such as loop and data reordering and sparse tiling approaches that schedule irregular sets of iterations across loops. The Polyhedral Model represents sets of iteration points in imperfectly nested loops with unions of polyhedra and represents loop transformations with affine functions applied to such polyhedral sets. Existing tools such as ISL, CLooG, and Omega manipulate polyhedral sets and affine functions; however, the ability to represent sets and functions where some of the constraints include uninterpreted function calls, such as those needed in the SPF, is nonexistent or severely restricted. This paper presents algorithms for manipulating sets and relations with uninterpreted function symbols to enable the Sparse Polyhedral Framework. The algorithms have been implemented in an open-source C++ library called IEGenLib (the Inspector/Executor Generator Library).
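The uninterpreted-function idea can be illustrated with a toy in Python: constraints on an iteration set involve functions (typically index arrays) whose values are unknown at compile time, so an inspector evaluates them at run time to materialize the set. The names here are hypothetical and do not reflect IEGenLib's actual API:

```python
def enumerate_set(n, constraints):
    # Toy "set with uninterpreted function symbols":
    # the set {[i] : 0 <= i < n and c(i) for every constraint c},
    # where each c is a callable (the uninterpreted function) that the
    # compiler cannot evaluate -- an inspector enumerates the set at
    # run time once the underlying index data is available.
    return [i for i in range(n) if all(c(i) for c in constraints)]
```

For example, with an index array `col = [0, 2, 1, 2]`, the set of iterations whose indirect target is even is `enumerate_set(4, [lambda i: col[i] % 2 == 0])`.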
On the Scalability of Loop Tiling Techniques
Cited by 2 (0 self)
Abstract:
The polyhedral model has proven to be a valuable tool for improving memory locality and exploiting parallelism when optimizing dense array codes. This model is expressive enough to describe transformations of imperfectly nested loops, and to capture a variety of program transformations, including many approaches to loop tiling. Tools such as the highly successful PLuTo automatic parallelizer have provided empirical confirmation of the success of polyhedral-based optimization, through experiments in which a number of benchmarks have been executed on machines with small- to medium-scale parallelism. In anticipation of ever higher degrees of parallelism, we have explored the impact of various loop tiling strategies on the asymptotic degree of available parallelism. In our analysis, we consider “weak scaling” as described by Gustafson, i.e., in which the data set size grows linearly with the number of processors available. Some, but not all, of the approaches to tiling provide weak scaling. In particular, the tiling currently performed by PLuTo does not scale in this sense. In this article, we review approaches to loop tiling in the published literature, focusing on both scalability and implementation status. We find that fully scalable tilings are not available in general-purpose tools, and call upon the polyhedral compilation community to focus on questions of asymptotic scalability. Finally, we identify ongoing work that may resolve this issue.
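The scalability concern can be made concrete with a small model: under standard pipelined (wavefront) tiling, the number of tiles that can run in parallel ramps up and back down instead of staying proportional to the processor count. A toy count for a 1D time-tiled stencil, assuming dependences from tile (t, s) to tiles (t+1, s) and (t, s+1):

```python
def pipelined_tile_concurrency(T, S):
    # With T time tiles and S space tiles, all tiles on the same
    # anti-diagonal t + s = w are independent and can run concurrently.
    # Returns the concurrency at each wavefront step w: it grows from 1,
    # peaks at min(T, S), and shrinks back -- the pipelined startup and
    # drain that prevent this tiling from weak-scaling.
    return [sum(1 for t in range(T) for s in range(S) if t + s == w)
            for w in range(T + S - 1)]
```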
Increasing the Locality of Iterative Methods and its Application to the Simulation of Semiconductor Devices
 Intl. J. of High Performance Computing Applications
Cited by 2 (0 self)
Abstract:
Irregular codes are present in many scientific applications, such as finite element simulations. These simulations require the solution of large sparse linear equation systems, which are often solved using iterative methods. The main kernel of the iterative methods is the sparse matrix–vector multiplication, which frequently demands irregular data accesses. Therefore, techniques that increase the performance of this operation will have a great impact on the global performance of the iterative method and, as a consequence, on the simulations. In this paper a technique for improving the locality of sparse matrix codes is presented. The technique consists of reorganizing the data guided by a locality model instead of restructuring the code or changing the sparse matrix storage format. We have applied our proposal to different iterative methods provided by two standard numerical libraries. Results show an impact on the overall performance of the considered iterative methods due to the increase in the locality of the sparse matrix–vector product. Noticeable reductions in the execution time have been achieved both in sequential and in parallel executions. This positive behavior allows the reordering technique to be successfully applied to real problems. We have focused on the simulation of semiconductor devices and in particular on the BIPS3D simulator, into which the technique was integrated. Both sequential and parallel executions are analyzed extensively in this paper. Noticeable reductions in the execution time required by the simulations are observed when using our reordered matrices in comparison with the original simulator.
Key words: irregular codes, reordering techniques, data locality, iterative methods, semiconductor device simulation
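Data reordering of the kind described — permuting the matrix to improve locality while leaving the storage format untouched — reduces, for a CSR row permutation, to copying rows in the new order. A sketch in which the permutation itself is an input (in the paper it would come from the locality model, which is not reproduced here):

```python
def permute_csr_rows(rowptr, colind, vals, perm):
    # Builds a new CSR matrix whose row r is row perm[r] of the original.
    # Only the data layout changes; the CSR format itself is preserved,
    # matching the paper's reorganize-the-data (not the code) approach.
    new_rowptr = [0]
    new_colind, new_vals = [], []
    for i in perm:
        for k in range(rowptr[i], rowptr[i + 1]):
            new_colind.append(colind[k])
            new_vals.append(vals[k])
        new_rowptr.append(len(new_vals))
    return new_rowptr, new_colind, new_vals
```

A full reordering would permute columns (and the vectors) consistently as well; this sketch shows only the row side.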
Increasing data reuse of sparse algebra codes on simultaneous multithreading architectures
 Concurrency and Computation: Practice and Experience
Objective measurement of dental color for age estimation by spectroradiometry
, 2002