Results 1-10 of 113
Implementing sparse matrix-vector multiplication on throughput-oriented processors
 In SC ’09: Proceedings of the 2009 ACM/IEEE conference on Supercomputing
, 2009
"... Sparse matrixvector multiplication (SpMV) is of singular importance in sparse linear algebra. In contrast to the uniform regularity of dense linear algebra, sparse operations encounter a broad spectrum of matrices ranging from the regular to the highly irregular. Harnessing the tremendous potential ..."
Abstract

Cited by 142 (7 self)
Sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra. In contrast to the uniform regularity of dense linear algebra, sparse operations encounter a broad spectrum of matrices ranging from the regular to the highly irregular. Harnessing the tremendous potential of throughput-oriented processors for sparse operations requires that we expose substantial fine-grained parallelism and impose sufficient regularity on execution paths and memory access patterns. We explore SpMV methods that are well-suited to throughput-oriented architectures like the GPU and which exploit several common sparsity classes. The techniques we propose are efficient, successfully utilizing large percentages of peak bandwidth. Furthermore, they deliver excellent total throughput, averaging 16 GFLOP/s and 10 GFLOP/s in double precision for structured-grid and unstructured-mesh matrices, respectively, on a GeForce GTX 285. This is roughly 2.8 times the throughput previously achieved on Cell BE and more than 10 times that of a quad-core Intel Clovertown system.
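As an illustration of the operation this abstract is about, here is a minimal scalar sketch of SpMV over the standard compressed sparse row (CSR) format; the GPU methods in the paper parallelize and regularize exactly this loop nest. The function name and the small example matrix are illustrative, not the paper's:

```python
import numpy as np

def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x for a matrix A stored in Compressed Sparse Row form.

    row_ptr[i]:row_ptr[i+1] delimits the nonzeros of row i inside
    col_idx (column indices) and vals (values)."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# 3x3 example: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [4.0, 1.0, 2.0, 3.0, 5.0]
x = np.array([1.0, 2.0, 3.0])
y = spmv_csr(row_ptr, col_idx, vals, x)  # y = [7, 4, 18]
```

The irregularity the abstract refers to lives in the inner loop: its trip count varies per row, which is what makes naive one-thread-per-row GPU mappings imbalanced.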
One point isometric matching with the heat kernel
 Computer Graphics Forum
"... HAL is a multidisciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte p ..."
Abstract

Cited by 68 (4 self)
HAL is a multidisciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Model-driven autotuning of sparse matrix-vector multiply on GPUs
 In PPoPP
, 2010
"... We present a performance modeldriven framework for automated performance tuning (autotuning) of sparse matrixvector multiply (SpMV) on systems accelerated by graphics processing units (GPU). Our study consists of two parts. First, we describe several carefully handtuned SpMV implementations for G ..."
Abstract

Cited by 65 (4 self)
We present a performance model-driven framework for automated performance tuning (autotuning) of sparse matrix-vector multiply (SpMV) on systems accelerated by graphics processing units (GPUs). Our study consists of two parts. First, we describe several carefully hand-tuned SpMV implementations for GPUs, identifying key GPU-specific performance limitations, enhancements, and tuning opportunities. These implementations, which include variants on classical blocked compressed sparse row (BCSR) and blocked ELLPACK (BELLPACK) storage formats, match or exceed state-of-the-art implementations. For instance, our best BELLPACK implementation achieves up to 29.0 Gflop/s in single precision and 15.7 Gflop/s in double precision on the NVIDIA T10P multiprocessor (C1060), improving on prior state-of-the-art unblocked implementations (Bell and Garland, 2009) by up to 1.8× and 1.5× for single and double precision, respectively. However, achieving this level of performance requires input matrix-dependent parameter tuning. Thus, in the second part of this study, we develop a performance model that can guide tuning. Like prior autotuning models for CPUs (e.g., Im, Yelick, and Vuduc, 2004), this model requires offline measurements and runtime estimation, but more directly models the structure of multithreaded vector processors like GPUs. We show that our model can identify implementations that achieve within 15% of those found through exhaustive search.
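The blocked ELLPACK (BELLPACK) format discussed above builds on plain ELLPACK, which pads every row to a common length so that threads traverse rows in lockstep with regular, coalescable accesses. A minimal, unblocked CSR-to-ELLPACK conversion sketch (function name and example are illustrative, not the paper's code):

```python
import numpy as np

def csr_to_ellpack(row_ptr, col_idx, vals):
    """Convert CSR arrays to the ELLPACK layout: two dense (n_rows x K)
    arrays, where K is the maximum number of nonzeros in any row.
    Shorter rows are zero-padded, trading storage for access regularity."""
    n_rows = len(row_ptr) - 1
    K = max(row_ptr[i + 1] - row_ptr[i] for i in range(n_rows))
    ell_cols = np.zeros((n_rows, K), dtype=np.int64)
    ell_vals = np.zeros((n_rows, K))
    for i in range(n_rows):
        nnz = row_ptr[i + 1] - row_ptr[i]
        ell_cols[i, :nnz] = col_idx[row_ptr[i]:row_ptr[i + 1]]
        ell_vals[i, :nnz] = vals[row_ptr[i]:row_ptr[i + 1]]
    return ell_cols, ell_vals

# 3x3 example matrix: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
cols, vals_ell = csr_to_ellpack([0, 2, 3, 5], [0, 2, 1, 0, 2],
                                [4.0, 1.0, 2.0, 3.0, 5.0])
```

The padding cost explodes when row lengths vary widely, which is one motivation for the blocked and sliced ELLPACK variants these papers explore.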
Accelerating CUDA graph algorithms at maximum warp
 In PPoPP
, 2011
"... Graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems but their performance suffers heavily when the graph structure is highly irregular, as most realworld graphs t ..."
Abstract

Cited by 49 (3 self)
Graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems, but their performance suffers heavily when the graph structure is highly irregular, as most real-world graphs tend to be. In this study, we first observe that the poor performance is caused by work imbalance and is an artifact of a discrepancy between the GPU programming model and the underlying GPU architecture. We then propose a novel virtual warp-centric programming method that exposes the traits of underlying GPU architectures to users. Our method significantly improves the performance of applications with heavily imbalanced workloads, and enables trade-offs between workload imbalance and ALU underutilization for fine-tuning the performance. Our evaluation reveals that our method exhibits up to 9x speedup over previous GPU algorithms and 12x over single-thread CPU execution on irregular graphs. When properly configured, it also yields up to 30% improvement over previous GPU algorithms on regular graphs. In addition to performance gains on graph algorithms, our programming method achieves 1.3x to 15.1x speedup on a set of GPU benchmark applications. Our study also confirms that the performance gap between GPUs and other multithreaded CPU graph implementations is primarily due to the large difference in memory bandwidth.
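The virtual warp-centric idea can be sketched host-side: a physical warp is logically split into smaller virtual warps, and the lanes of one virtual warp cooperatively process one vertex's edge list. The strided assignment below is an illustrative model of that mapping, not the paper's actual CUDA code:

```python
def virtual_warp_lanes(neighbors, vw_size):
    """Assign one vertex's neighbor list to the vw_size lanes of a
    virtual warp: lane l handles neighbors[l::vw_size].

    A small vw_size wastes fewer lanes on low-degree vertices (less ALU
    underutilization); a large vw_size spreads high-degree vertices over
    more lanes (less work imbalance) -- the trade-off the paper tunes."""
    return [neighbors[l::vw_size] for l in range(vw_size)]

# A degree-5 vertex processed by a 4-lane virtual warp:
lanes = virtual_warp_lanes([5, 9, 2, 7, 1], vw_size=4)
# lanes == [[5, 1], [9], [2], [7]]
```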
Assembly of finite element methods on graphics processors
 International Journal for Numerical Methods in Engineering
"... Recently, graphics processing units (GPUs) have had great success in accelerating many numerical computations. We present their application to computations on unstructured meshes such as those in finite element methods. Multiple approaches in assembling and solving sparse linear systems with NVIDIA ..."
Abstract

Cited by 20 (0 self)
Recently, graphics processing units (GPUs) have had great success in accelerating many numerical computations. We present their application to computations on unstructured meshes such as those in finite element methods. Multiple approaches to assembling and solving sparse linear systems with NVIDIA GPUs and the Compute Unified Device Architecture (CUDA) are presented and discussed. Multiple strategies for efficient use of global, shared, and local memory, methods to achieve memory coalescing, and optimal choice of parameters are introduced. We find that with appropriate preprocessing and arrangement of support data, the GPU coprocessor achieves speedups of 30 or more in comparison to a well-optimized serial implementation. We also find that the optimal assembly strategy depends on the order of polynomials used in the finite-element discretization.
Parallel SimRank computation on large graphs with iterative aggregation
 KDD'10
, 2010
"... Recently there has been a lot of interest in graphbased analysis. One of the most important aspects of graphbased analysis is to measure similarity between nodes in a graph. SimRank is a simple and influential measure of this kind, based on a solid graph theoretical model. However, existing method ..."
Abstract

Cited by 18 (2 self)
Recently there has been a lot of interest in graph-based analysis. One of the most important aspects of graph-based analysis is measuring similarity between nodes in a graph. SimRank is a simple and influential measure of this kind, based on a solid graph-theoretical model. However, existing methods for SimRank computation suffer from two limitations: 1) the computing cost can be very high in practice; and 2) they can only be applied to static graphs. In this paper, we exploit the inherent parallelism and high memory bandwidth of graphics processing units (GPUs) to accelerate the computation of SimRank on large graphs. Furthermore, based on the observation that SimRank is essentially a first-order Markov chain, we propose to utilize iterative aggregation techniques for uncoupling Markov chains to compute SimRank scores in parallel for large graphs. The iterative aggregation method can be applied to dynamic graphs. Moreover, it can handle not only the link-updating problem but also the node-updating problem. Extensive experiments on synthetic and real data sets verify that the proposed methods are efficient and effective.
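For reference, the SimRank recurrence being accelerated is s(a,b) = C / (|I(a)| |I(b)|) · Σ_{i∈I(a)} Σ_{j∈I(b)} s(i,j), with s(a,a) = 1, where I(v) is the set of in-neighbors of v. A naive iterative sketch follows; the GPU and Markov-chain-aggregation machinery of the paper is omitted, and the names are illustrative:

```python
import numpy as np

def simrank(in_nbrs, C=0.8, iters=10):
    """Naive iterative SimRank. in_nbrs[v] lists the in-neighbors of v."""
    n = len(in_nbrs)
    S = np.eye(n)
    for _ in range(iters):
        S_new = np.eye(n)
        for a in range(n):
            for b in range(a + 1, n):       # scores are symmetric
                if not in_nbrs[a] or not in_nbrs[b]:
                    continue                # no in-links: score stays 0
                total = sum(S[i, j] for i in in_nbrs[a] for j in in_nbrs[b])
                S_new[a, b] = S_new[b, a] = \
                    C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        S = S_new
    return S

# Nodes 1 and 2 each have node 0 as their only in-neighbor,
# so s(1, 2) = C * s(0, 0) = 0.8
S = simrank([[], [0], [0]])
```

The quartic cost per iteration visible in the nested loops is the "very high computing cost" the abstract refers to, and the reason parallel and aggregation-based reformulations pay off.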
Exposing fine-grained parallelism in algebraic multigrid methods
, 2012
"... Algebraic multigrid methods for large, sparse linear systems are a necessity in many computational simulations, yet parallel algorithms for such solvers are generally decomposed into coarsegrained tasks suitable for distributed computers with traditional processing cores. However, accelerating mu ..."
Abstract

Cited by 17 (0 self)
Algebraic multigrid methods for large, sparse linear systems are a necessity in many computational simulations, yet parallel algorithms for such solvers are generally decomposed into coarse-grained tasks suitable for distributed computers with traditional processing cores. However, accelerating multigrid on massively parallel throughput-oriented processors, such as the GPU, demands algorithms with abundant fine-grained parallelism. In this paper, we develop a parallel algebraic multigrid method which exposes substantial fine-grained parallelism in both the construction of the multigrid hierarchy and the cycling or solve stage. Our algorithms are expressed in terms of scalable parallel primitives that are efficiently implemented on the GPU. The resulting solver achieves an average speedup of 1.8× in the setup phase and 5.7× in the cycling phase when compared to a representative CPU implementation.
A memory efficient and fast sparse matrix vector product on a GPU
 PROGRESS IN ELECTROMAGNETIC RESEARCH
, 2011
"... This paper proposes a new sparse matrix storage format which allows an efficient implementation of a sparse matrix vector product on a Fermi Graphics Processing Unit (GPU). Unlike previous formats it has both low memory footprint and good throughput. The new format, which we call Sliced ELLRT has ..."
Abstract

Cited by 13 (2 self)
This paper proposes a new sparse matrix storage format which allows an efficient implementation of a sparse matrix vector product on a Fermi Graphics Processing Unit (GPU). Unlike previous formats, it has both a low memory footprint and good throughput. The new format, which we call Sliced ELLR-T, has been designed specifically for accelerating the iterative solution of a large, sparse, complex-valued system of linear equations arising in computational electromagnetics. Numerical tests have shown that the performance of the new implementation reaches 69 GFLOPS in complex single-precision arithmetic. Compared to an optimized six-core Central Processing Unit (CPU) (Intel Xeon 5680), this performance implies a speedup by a factor of six. In terms of speed, the new format is as fast as the best format published so far, and at the same time it does not introduce redundant zero elements which have to be stored to ensure fast memory access. Compared to previously published solutions, significantly larger problems can be handled using low-cost commodity GPUs with a limited amount of on-board memory.
DL: A Data Layout Transformation System for Heterogeneous Computing
 Proc. IEEE Conf. Innovative Parallel Computing (InPar 12), IEEE
, 2012
"... For manycore architectures like the GPUs, efficient offchip memory access is crucial to high performance; the applications are often limited by offchip memory bandwidth. Transforming data layout is an effective way to reshape the access patterns to improve offchip memory access behavior, but sev ..."
Abstract

Cited by 13 (2 self)
For many-core architectures like GPUs, efficient off-chip memory access is crucial to high performance; applications are often limited by off-chip memory bandwidth. Transforming data layout is an effective way to reshape access patterns and improve off-chip memory access behavior, but several challenges have limited the use of automated data layout transformation systems on GPUs, namely how to efficiently handle arrays of aggregates, and how to transparently marshal data between the layouts required by different performance-sensitive kernels and legacy host code. While GPUs have higher memory bandwidth and are natural candidates for marshaling data between layouts, the relatively constrained GPU memory capacity, compared to that of the CPU, implies that not only the temporal cost of marshaling but also the spatial overhead must be considered for any practical layout transformation system. This paper presents DL, a practical GPU data layout transformation system that addresses these problems. First, a novel approach to laying out arrays of aggregate types across GPU and CPU architectures is proposed to further improve memory parallelism and kernel performance beyond what is achieved by human programmers using discrete arrays today. Our proposed new layout can be derived in situ from the traditional Array of Structures, Structure of Arrays, and adjacent Discrete Arrays layouts used by programmers. Second, DL has a runtime library implemented in OpenCL that transparently and efficiently converts, or marshals, data to accommodate application components that have different data layout requirements. We present insights that lead to the design of this highly efficient runtime marshaling library. In particular, the in situ transformation implemented in the library is comparable to or faster than optimized traditional out-of-place transformations while avoiding doubling the GPU DRAM usage. Third, we show experimental results that the new layout approach leads to substantial performance improvement at the application level even when all marshaling cost is taken into account.
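The layout problem DL targets can be seen in a few lines: an Array of Structures interleaves fields, so adjacent GPU threads reading the same field touch strided addresses, while a Structure of Arrays keeps each field contiguous. The NumPy sketch below shows a plain out-of-place AoS-to-SoA marshaling step for illustration only; DL's contribution is a hybrid layout derived in situ plus an in-place marshaling library, which this sketch does not reproduce:

```python
import numpy as np

# Array of Structures: fields x and y interleaved in memory
aos = np.array([(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)],
               dtype=[('x', 'f4'), ('y', 'f4')])

# Structure of Arrays: each field copied into its own contiguous array,
# so threads reading field 'x' for consecutive elements hit consecutive
# addresses (coalesced loads on a GPU)
soa = {name: np.ascontiguousarray(aos[name]) for name in aos.dtype.names}

print(soa['x'])   # the x components, now contiguous: 1.0, 3.0, 5.0
```

Note the spatial overhead the abstract warns about: this out-of-place copy briefly holds both layouts in memory, which is exactly what DL's in situ transformation avoids.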
A Parallel Algebraic Multigrid Solver on Graphics Processing Units
"... Abstract. The paper presents a multiGPU implementation of the preconditioned conjugate gradient algorithm with an algebraic multigrid preconditioner (PCGAMG) for an elliptic model problem on a 3D unstructured grid. An efficient parallel sparse matrixvector multiplication scheme underlying the PCG ..."
Abstract

Cited by 13 (1 self)
Abstract. The paper presents a multi-GPU implementation of the preconditioned conjugate gradient algorithm with an algebraic multigrid preconditioner (PCG-AMG) for an elliptic model problem on a 3D unstructured grid. An efficient parallel sparse matrix-vector multiplication scheme underlying the PCG-AMG algorithm is presented for the many-core GPU architecture. A performance comparison of the parallel solver shows that a single Nvidia Tesla C1060 GPU board delivers the performance of a sixteen-node Infiniband cluster, and a multi-GPU configuration with eight GPUs is about 100 times faster than a typical server CPU core.