Results 11–20 of 124
GP on SPMD parallel Graphics Hardware for mega Bioinformatics Data Mining
 Soft Computing
"... ..."
Accelerating GPU kernels for dense linear algebra
"... Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are major building block of dense linear algebra (DLA) libraries, and therefore have to be highly optimized. We present some techniques and implementations that significantly accelerate the corresponding routines from currentl ..."
Cited by 15 (12 self)
Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are a major building block of dense linear algebra (DLA) libraries, and therefore have to be highly optimized. We present techniques and implementations that significantly accelerate the corresponding routines from currently available libraries for GPUs. In particular, Pointer Redirecting, a set of GPU-specific optimization techniques, allows us to easily remove the performance oscillations associated with problem dimensions not divisible by fixed blocking sizes. For example, applied to the matrix-matrix multiplication routines, depending on the hardware configuration and routine parameters, this can make the algorithms up to two times faster. Similarly, matrix-vector multiplication can be accelerated more than two times in both single and double precision arithmetic. Additionally, GPU-specific acceleration techniques are applied to develop new kernels (e.g. syrk, symv) that are up to 20× faster than the currently available kernels. We present these kernels and also show their acceleration effect on higher-level dense linear algebra routines. The accelerated kernels are now freely available through the MAGMA BLAS library. Keywords: BLAS, GEMM, GPUs.
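The pointer-redirecting trick can be illustrated outside CUDA: in a boundary block, out-of-range indices are redirected to the last valid row or column, so the inner loops stay full-sized and branch-free, and only the final store is bounds-checked. A rough pure-Python sketch (the function name and blocking are invented for illustration; the real kernels are CUDA):

```python
def gemm_pointer_redirect(A, B, block=4):
    """Blocked C = A @ B (row-major lists of lists) where reads outside
    the matrix are redirected to the last valid row/column, so every
    block does full work; only the store is bounds-checked."""
    m, k = len(A), len(A[0])
    n = len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for bi in range(0, m, block):
        for bj in range(0, n, block):
            for i in range(bi, bi + block):
                ri = min(i, m - 1)          # redirected row index
                for j in range(bj, bj + block):
                    rj = min(j, n - 1)      # redirected column index
                    acc = 0.0
                    for p in range(k):
                        acc += A[ri][p] * B[p][rj]
                    if i < m and j < n:     # discard duplicated work
                        C[i][j] = acc
    return C
```

The redundant work on redirected elements is wasted, but it is far cheaper than the divergent branches a per-element bounds check would cost on a GPU.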
GPU Kernels as Data-Parallel Array Computations in Haskell
, 2009
"... We present a novel highlevel parallel programming model aimed at graphics processing units (GPUs). We embed GPU kernels as dataparallel array computations in the purely functional language Haskell. GPU and CPU computations can be freely interleaved with the type system tracking the two different m ..."
Cited by 15 (1 self)
We present a novel high-level parallel programming model aimed at graphics processing units (GPUs). We embed GPU kernels as data-parallel array computations in the purely functional language Haskell. GPU and CPU computations can be freely interleaved, with the type system tracking the two different modes of computation. The embedded language of array computations is sufficiently limited that our system can automatically extract these computations and compile them to efficient GPU code. In this paper, we outline our approach and present the results of a few preliminary benchmarks.
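The paper's embedding is in Haskell; as a language-neutral illustration of the underlying idea, an embedded array language can be sketched in a few lines of Python, where `map`/`zipWith`-style operations record an expression tree that is only evaluated (in a real system, compiled to a fused GPU kernel) when forced. All names here are hypothetical:

```python
class Exp:
    """Deferred array expression; building one performs no work."""
    def __init__(self, fn, *args):
        self.fn, self.args = fn, args

    def run(self):
        """'Compile and execute': evaluate the whole tree at once
        (a real backend would fuse it into one GPU kernel)."""
        vals = [a.run() if isinstance(a, Exp) else a for a in self.args]
        return self.fn(*vals)

def use(xs):
    """Lift a plain list into the embedded language."""
    return Exp(lambda v: list(v), xs)

def amap(f, e):
    """Element-wise map, recorded rather than executed."""
    return Exp(lambda v: [f(x) for x in v], e)

def zip_with(f, e1, e2):
    """Element-wise combination of two embedded arrays."""
    return Exp(lambda a, b: [f(x, y) for x, y in zip(a, b)], e1, e2)
```

Usage: `zip_with(lambda x, y: x + y, amap(lambda x: x * x, use([1, 2, 3])), use([10, 20, 30])).run()` builds the tree first and evaluates it only at `run()`, which is what lets a compiler see (and extract) the whole computation.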
Vuduc, “Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems”
 in ICS
, 2009
"... We describe heterogeneous multiCPU and multiGPU implementations of Jacobi’s iterative method for the 2D Poisson equation on a structured grid, in both single and doubleprecision. Properly tuned, our best implementation achieves 98 % of the empirical streaming GPU bandwidth (66 % of peak) on a NV ..."
Cited by 13 (0 self)
We describe heterogeneous multi-CPU and multi-GPU implementations of Jacobi’s iterative method for the 2-D Poisson equation on a structured grid, in both single and double precision. Properly tuned, our best implementation achieves 98% of the empirical streaming GPU bandwidth (66% of peak) on an NVIDIA C1060, and 78% on a C870. Motivated to find a still faster implementation, we further consider “wildly asynchronous” implementations that can reduce or even eliminate the synchronization bottleneck between iterations. In these versions, which are based on chaotic relaxation (Chazan and Miranker, 1969), we simply remove or delay synchronization between iterations. By doing so, we trade off more flops, via more iterations to converge, for a higher degree of asynchronous parallelism. Our wild implementations on a GPU can be 1.2–2.5× faster than our best synchronized GPU implementation while achieving the same accuracy. Looking forward, this result suggests research on similarly “fast-and-loose” algorithms in the coming era of increasingly massive concurrency and relatively high synchronization or communication costs.
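The contrast between synchronized Jacobi and chaotic relaxation can be sketched in plain Python on a 1-D Poisson problem: the chaotic version updates in place and in arbitrary order, modeling the absence of inter-iteration barriers, yet converges to the same fixed point. A hedged sketch (discretization and names are illustrative, not the paper's code):

```python
import random

def jacobi_sync(u, f, h, iters):
    """Synchronized Jacobi for 1-D Poisson u'' = f: every update reads
    the previous iterate, with a barrier between sweeps."""
    u = list(u)
    for _ in range(iters):
        new = list(u)
        for i in range(1, len(u) - 1):
            new[i] = 0.5 * (u[i - 1] + u[i + 1] - h * h * f[i])
        u = new
    return u

def jacobi_chaotic(u, f, h, iters, seed=0):
    """Chaotic relaxation (Chazan & Miranker): updates applied in place
    and in random order, i.e. without synchronization, so each update
    may read a mix of old and new neighbour values."""
    rng = random.Random(seed)
    u = list(u)
    interior = list(range(1, len(u) - 1))
    for _ in range(iters):
        rng.shuffle(interior)       # models unordered asynchronous updates
        for i in interior:
            u[i] = 0.5 * (u[i - 1] + u[i + 1] - h * h * f[i])
    return u
```

With f = 0 and boundary values 0 and 1, both versions converge to the same linear profile; the in-place version typically needs fewer sweeps per digit of accuracy, which is the trade the abstract describes.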
A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators LAPACK Working Note #223
"... Abstract. We present a Cholesky factorization for multicore with GPU accelerators. The challenges in developing scalable high performance algorithms for these emerging systems stem from their heterogeneity, massive parallelism, and the huge gap between the GPUs ’ compute power vs the CPUGPU communi ..."
Cited by 12 (4 self)
We present a Cholesky factorization for multicore with GPU accelerators. The challenges in developing scalable high performance algorithms for these emerging systems stem from their heterogeneity, massive parallelism, and the huge gap between the GPUs’ compute power and the CPU–GPU communication speed. We show an approach that is largely based on software infrastructures that have already been developed for homogeneous multicores and hybrid GPU-based computing. The algorithm features two levels of nested parallelism. Coarse-grained parallelism is provided by splitting the computation into tiles for concurrent execution between GPUs. Fine-grained parallelism is further provided by splitting the workload within a tile for high-efficiency computing on GPUs but also, in certain cases, to benefit from hybrid computations by using both GPUs and CPUs. Our resulting computational kernels are highly optimized. An efficient task scheduling mechanism ensures a load-balanced execution over the entire multicore with GPU …
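The coarse-grained level of parallelism can be sketched as a tiled right-looking Cholesky, where each step is a tile-sized task (POTRF on the diagonal tile, TRSM on the tiles below it, then trailing symmetric-rank-k updates) that a runtime could schedule across GPUs. A small pure-Python sketch, illustrative only:

```python
import math

def potrf(A):
    """Unblocked Cholesky of one tile: A = L L^T, L lower-triangular."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

def trsm(L, B):
    """Solve X L^T = B for X (B is a tile below the diagonal tile)."""
    X = [row[:] for row in B]
    for i in range(len(B)):
        for j in range(len(L)):
            s = X[i][j] - sum(X[i][k] * L[j][k] for k in range(j))
            X[i][j] = s / L[j][j]
    return X

def tile_chol(A, nb):
    """Tiled right-looking Cholesky; each inner step is one tile task."""
    n = len(A)
    A = [row[:] for row in A]
    for kk in range(0, n, nb):
        ke = min(kk + nb, n)
        Lkk = potrf([r[kk:ke] for r in A[kk:ke]])          # POTRF task
        for i in range(kk, ke):
            A[i][kk:ke] = Lkk[i - kk]
        for ii in range(ke, n, nb):
            ie = min(ii + nb, n)
            X = trsm(Lkk, [r[kk:ke] for r in A[ii:ie]])    # TRSM task
            for i in range(ii, ie):
                A[i][kk:ke] = X[i - ii]
        for ii in range(ke, n, nb):                        # trailing updates
            ie = min(ii + nb, n)
            for jj in range(ke, ii + 1, nb):
                je = min(jj + nb, n)
                for i in range(ii, ie):
                    for j in range(jj, min(je, i + 1)):
                        A[i][j] -= sum(A[i][k] * A[j][k] for k in range(kk, ke))
    for i in range(n):                                     # zero upper part
        for j in range(i + 1, n):
            A[i][j] = 0.0
    return A
```

The fine-grained level (splitting work within a tile across GPU threads, or hybrid CPU+GPU execution of a tile) would live inside `potrf`, `trsm`, and the update loop.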
A GPU-accelerated storage system
 In Proceedings of the 19th ACM International Symposium on High Performance Parallel and Distributed Computing (HPDC’10), 2010
"... Massively multicore processors, like, for example, Graphics Processing Units (GPUs), provide, at a comparable price, a one order of magnitude higher peak performance than traditional CPUs. This drop in the cost of computation, as any orderofmagnitude drop in the cost per unit of performance for a ..."
Cited by 12 (1 self)
Massively multicore processors, such as Graphics Processing Units (GPUs), provide, at a comparable price, one order of magnitude higher peak performance than traditional CPUs. This drop in the cost of computation, as with any order-of-magnitude drop in the cost per unit of performance for a class of system components, creates the opportunity to redesign systems and to explore new ways to engineer them to recalibrate the cost-to-performance relation. We focus on data storage: we explore the feasibility of harnessing the GPUs’ computational power to improve the performance, reliability, or security of distributed storage systems. We present the design of a storage system prototype that uses GPU offloading to accelerate a number of computationally intensive primitives based on hashing. We evaluate the performance of this prototype under two configurations: as a content-addressable storage system that facilitates online similarity detection between successive versions of the same file, and as a traditional system that uses hashing to preserve data integrity. Further, we evaluate the impact of offloading to the GPU on competing applications’ performance. Our results show that this technique can bring tangible performance gains without negatively impacting the performance of concurrently running applications. Further, this work sheds light on the use of heterogeneous multicore processors for enhancing low-level system primitives, and introduces techniques to efficiently leverage the processing power of GPUs.
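The content-addressable configuration can be sketched with standard-library hashing: chunk both versions of a file, hash each chunk, and count the new version's chunks whose digests are already stored. A minimal sketch (the chunking scheme and names are assumptions; in the prototype the hashing itself is what gets offloaded to the GPU):

```python
import hashlib

def chunk_hashes(data, chunk=4096):
    """Digest of each fixed-size chunk; these are the computationally
    intensive hashing primitives the prototype offloads."""
    return [hashlib.sha1(data[i:i + chunk]).hexdigest()
            for i in range(0, len(data), chunk)]

def similarity(old, new, chunk=4096):
    """Fraction of the new version's chunks whose digests are already
    stored for the old version; identical chunks need not be rewritten."""
    seen = set(chunk_hashes(old, chunk))
    hs = chunk_hashes(new, chunk)
    return sum(d in seen for d in hs) / len(hs) if hs else 1.0
```

The same digests double as integrity checks in the "traditional" configuration: recompute on read and compare against the stored value.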
Hardware-efficient belief propagation
 in Proc. CVPR
, 2009
"... Abstract—Loopy belief propagation (BP) is an effective solution for assigning labels to the nodes of a graphical model such as the Markov random field (MRF), but it requires high memory, bandwidth, and computational costs. Furthermore, the iterative, pixelwise, and sequential operations of BP make ..."
Cited by 11 (1 self)
Loopy belief propagation (BP) is an effective solution for assigning labels to the nodes of a graphical model such as the Markov random field (MRF), but it requires high memory, bandwidth, and computational costs. Furthermore, the iterative, pixel-wise, and sequential operations of BP make it difficult to parallelize the computation. In this paper, we propose two techniques to address these issues. The first technique is a new message passing scheme named tile-based belief propagation that reduces the memory and bandwidth to a fraction of that of ordinary BP algorithms without performance degradation, by splitting the MRF into many tiles and only storing the messages across the neighboring tiles. The tile-wise processing also enables data reuse and pipelining, resulting in efficient hardware implementation. The second technique is an O(L) parallel message construction algorithm that exploits the properties of robust functions for parallelization. We apply these two techniques to a VLSI circuit for stereo matching that generates high-resolution disparity maps in near real-time. We also implement the proposed schemes on a GPU, which is four times faster than standard BP on a GPU. Index Terms: belief propagation, Markov random field, energy minimization, embedded systems, VLSI circuit design, general-purpose computation on GPU (GPGPU).
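For the common truncated-linear smoothness cost, O(L) message construction can be written as a two-pass min sweep over labels followed by a truncation cap, replacing the naive O(L²) minimization over all label pairs. A minimal sketch of that standard technique (not necessarily the paper's exact algorithm):

```python
def message_linear(h, c, trunc):
    """O(L) message for a truncated-linear cost min(c*|p-q|, trunc):
    m[q] = min_p h[p] + min(c*|p-q|, trunc), where h[p] combines the
    data term and incoming messages. Two sweeps replace the naive
    O(L^2) min over all source labels p."""
    m = list(h)
    for q in range(1, len(m)):              # forward sweep
        m[q] = min(m[q], m[q - 1] + c)
    for q in range(len(m) - 2, -1, -1):     # backward sweep
        m[q] = min(m[q], m[q + 1] + c)
    cap = min(h) + trunc                    # apply the truncation
    return [min(v, cap) for v in m]
```

Each sweep only couples adjacent labels, which is also what makes the construction amenable to the parallel hardware implementations the paper targets.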
Auto-Tuning CUDA Parameters for Sparse Matrix-Vector Multiplication on GPUs
"... Abstract—Graphics Processing Unit (GPU) has become an attractive coprocessor for scientific computing due to its massive processing capability. The sparse matrixvector multiplication (SpMV) is a critical operation in a wide variety of scientific and engineering applications, such as sparse linear a ..."
Cited by 9 (3 self)
The Graphics Processing Unit (GPU) has become an attractive coprocessor for scientific computing due to its massive processing capability. Sparse matrix-vector multiplication (SpMV) is a critical operation in a wide variety of scientific and engineering applications, such as sparse linear algebra and image processing. This paper presents an auto-tuning framework that can automatically compute and select CUDA parameters for SpMV to obtain the optimal performance on specific GPUs. The framework is evaluated on two NVIDIA GPU platforms: GeForce 9500 GTX and GeForce GTX 295. Keywords: GPU; CUDA; sparse matrix-vector multiplication; performance.
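The auto-tuning loop itself is simple to sketch: run the kernel under each candidate parameter value and keep the fastest. A pure-Python stand-in (a CSR SpMV in place of the CUDA kernel; all names and the candidate set are illustrative):

```python
import time

def spmv_csr(vals, cols, ptr, x):
    """y = A x for a matrix stored in CSR format
    (vals: nonzeros, cols: their column indices, ptr: row offsets)."""
    return [sum(vals[k] * x[cols[k]] for k in range(ptr[i], ptr[i + 1]))
            for i in range(len(ptr) - 1)]

def autotune(candidates, bench):
    """Pick the candidate parameter value with the lowest measured
    time; a stand-in for selecting CUDA launch parameters (block
    size, threads per row, ...) per matrix and per GPU."""
    best, best_t = None, float("inf")
    for c in candidates:
        t0 = time.perf_counter()
        bench(c)
        dt = time.perf_counter() - t0
        if dt < best_t:
            best, best_t = c, dt
    return best
```

A real framework would also prune the candidate space using matrix statistics (nonzeros per row, row-length variance) rather than timing every configuration.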
Using graphics processors for high performance IR query processing
 In WWW
, 2009
"... Research Interests Web Search technology Indexing, data compression, query processing and pruning, caching Distributed System Algorithm under Hadoop framework and performance issues GPUbased computation GPUbased compression, GPUbased search, GPUbased algorithms Temporal Web Graph and ranking Web ..."
Cited by 9 (0 self)
Research interests: Web search technology (indexing, data compression, query processing and pruning, caching); distributed systems (algorithms under the Hadoop framework and performance issues); GPU-based computation (GPU-based compression, search, and algorithms); temporal web graphs and ranking (web graphs with temporal information, web graph compression, ranking using the temporal web graph); machine learning (document classification).
An Improved MAGMA GEMM for Fermi GPUs
, 2010
"... Abstract. We present an improved matrixmatrix multiplication routine (GEMM) in the MAGMA BLAS library that targets the Fermi GPUs. We show how to modify the previous MAGMA GEMM kernels in order to make a more efficient use of the Fermi’s new architectural features, most notably their extended memor ..."
Cited by 8 (2 self)
We present an improved matrix-matrix multiplication routine (GEMM) in the MAGMA BLAS library that targets the Fermi GPUs. We show how to modify the previous MAGMA GEMM kernels in order to make more efficient use of Fermi’s new architectural features, most notably its extended memory hierarchy and sizes. The improved kernels run at up to 300 GFlop/s in double precision and up to 600 GFlop/s in single precision arithmetic (on a C2050), which is 58% of the theoretical peak. We compare the improved kernels with those currently available in CUBLAS 3.1. Further, we show the effect of the new kernels on higher-level dense linear algebra (DLA) routines such as the one-sided matrix factorizations, and compare their performance with corresponding, currently available routines running on homogeneous multicore systems. A general conclusion is that DLA has become a better fit for the new GPU architectures, to the point where DLA can run more efficiently on GPUs than on current, high-end homogeneous multicore-based systems.
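The staging of tiles through fast on-chip memory that such kernels rely on can be mimicked in plain Python: copy a panel of A and B into small scratch buffers before running the inner products over them. Illustrative only; the real kernels are CUDA and exploit Fermi's shared memory and register file:

```python
def gemm_tiled(A, B, nb=2):
    """C = A @ B with the k-dimension processed in panels: each panel
    of A and B is copied into scratch buffers first, mimicking staging
    through a GPU's shared memory before the multiply-accumulate."""
    m, k, n = len(A), len(A[0]), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for kk in range(0, k, nb):
        ke = min(kk + nb, k)
        a_s = [row[kk:ke] for row in A]          # stage panel of A
        b_s = [B[p][:] for p in range(kk, ke)]   # stage panel of B
        for i in range(m):
            for j in range(n):
                C[i][j] += sum(a_s[i][p] * b_s[p][j] for p in range(ke - kk))
    return C
```

On real hardware the payoff comes from each staged element being reused by many threads; tuning the panel shape to the memory hierarchy is exactly the kind of modification the kernels above make for Fermi.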