Fast Multipole Methods on Graphical Processors
- Journal of Computational Physics
- Cited by 9 (3 self)
The Fast Multipole Method (FMM) allows the rapid evaluation of sums of radial basis functions centered at points distributed inside a computational domain at a large number of evaluation points to a specified accuracy ɛ. The method scales as O(N), compared to the direct method with complexity O(N^2), which allows one to solve larger-scale problems. Graphics processing units (GPUs) are now increasingly viewed as data-parallel compute coprocessors that can provide significant computational performance at low price. We describe acceleration of the FMM using the data-parallel GPU architecture. The FMM has a complex hierarchical (adaptive) structure, which is not easily implemented on data-parallel processors. We describe strategies for parallelization of all components of the FMM, develop a model to explain the performance of the algorithm on the GPU architecture, and determine optimal settings for the FMM on the GPU, which differ from those on usual CPUs. Some innovations in the FMM algorithm, including the use of modified stencils, real polynomial basis functions for the Laplace kernel, and decompositions of the translation operators, are also described. We obtained accelerations of the Laplace kernel FMM on a single NVIDIA GeForce 8800 GTX GPU by a factor of 30-60 compared to a serial CPU implementation for benchmark cases of up to a million in size. For a problem with a million sources, the summations involved are performed in approximately one second. This performance is equivalent to solving the same problem at a rate of 24-43 Teraflops if straightforward summation is used.
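For orientation, the direct method that the abstract uses as its O(N^2) baseline can be sketched as follows. This is a minimal illustration assuming the 3D Laplace kernel 1/r and plain Python tuples for points; the function names `direct_potentials` and `direct_pair_count` are hypothetical and do not come from the paper, and nothing here reproduces the FMM itself.

```python
import math

def direct_potentials(sources, charges, targets):
    """Direct evaluation of phi(y) = sum_i q_i / |y - x_i| (3D Laplace kernel).

    Cost is len(sources) * len(targets) kernel evaluations, which is the
    quadratic baseline the FMM is designed to avoid.
    """
    potentials = []
    for y in targets:
        total = 0.0
        for x, q in zip(sources, charges):
            r = math.dist(x, y)
            if r > 0.0:  # skip the singular self-interaction
                total += q / r
        potentials.append(total)
    return potentials

def direct_pair_count(num_sources, num_targets):
    """Number of kernel evaluations needed by the direct method."""
    return num_sources * num_targets

# Two sources, one target at unit height above the first source.
example = direct_potentials([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                            [1.0, 2.0],
                            [(0.0, 0.0, 1.0)])
```

With a million sources and a million evaluation points, `direct_pair_count` gives 10^12 kernel evaluations, which is why the direct sum becomes impractical at the problem sizes quoted above.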
GP on SPMD parallel Graphics Hardware for mega Bioinformatics Data Mining
- SOFT COMPUTING
Avoiding Cache Thrashing due to Private Data Placement in Last-level Cache For Manycore Scaling
- Cited by 7 (7 self)
Without high-bandwidth broadcast, large numbers of cores require a scalable point-to-point interconnect and a directory protocol. In such cases, a shared, inclusive last-level cache (LLC) can improve data sharing and avoid three-way communication for shared reads. However, if inclusion encompasses thread-private data, two problems arise with the shared LLC. First, current memory allocators align stack bases on page boundaries, which emerges as a source of severe conflict misses for large numbers of threads in data-parallel applications. Second, correctness does not require the private data to reside in the shared directory or the LLC. This paper advocates stack-base randomization, which eliminates the major source of conflict misses for large numbers of threads. However, when capacity becomes a limitation for the directory or last-level cache, this is not sufficient. We then
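The conflict-miss effect described in this abstract can be illustrated with a small model. The sketch below is only an assumption-laden toy: the line size, set count, stack size, base address, and the randomization scheme (shifting each base by a random number of cache lines within one page) are all hypothetical choices made for illustration, not the paper's actual parameters or method.

```python
import random

LINE_BYTES = 64                 # assumed cache line size
NUM_SETS = 4096                 # assumed number of LLC sets
PAGE_BYTES = 4096               # assumed page size
STACK_BYTES = 8 * 1024 * 1024   # assumed per-thread stack size (a page multiple)
FIRST_BASE = 0x7F0000000000     # assumed page-aligned address of the first stack

def set_index(addr):
    """Cache set that an address maps to under simple modulo indexing."""
    return (addr // LINE_BYTES) % NUM_SETS

def stack_bases(num_threads, randomize, seed=0):
    """Per-thread stack bases, optionally shifted by a random line offset."""
    rng = random.Random(seed)
    bases = []
    for t in range(num_threads):
        base = FIRST_BASE + t * STACK_BYTES
        if randomize:
            base += rng.randrange(PAGE_BYTES // LINE_BYTES) * LINE_BYTES
        bases.append(base)
    return bases

def distinct_sets(bases):
    """How many different cache sets the stack bases occupy."""
    return len({set_index(b) for b in bases})

aligned = distinct_sets(stack_bases(64, randomize=False))
randomized = distinct_sets(stack_bases(64, randomize=True))
```

Under these assumptions every aligned stack base lands in the same set, so 64 threads compete for one set's ways, while the randomized bases spread over many sets.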
A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators LAPACK Working Note #223
- Cited by 6 (2 self)
We present a Cholesky factorization for multicore with GPU accelerators. The challenges in developing scalable high-performance algorithms for these emerging systems stem from their heterogeneity, massive parallelism, and the huge gap between the GPUs' compute power and the CPU-GPU communication speed. We show an approach that is largely based on software infrastructures that have already been developed for homogeneous multicores and hybrid GPU-based computing. The algorithm features two levels of nested parallelism. Coarse-grained parallelism is provided by splitting the computation into tiles for concurrent execution between GPUs. Fine-grained parallelism is further provided by splitting the workload within a tile for high-efficiency computing on GPUs but also, in certain cases, to benefit from hybrid computations by using both GPUs and CPUs. Our resulting computational kernels are highly optimized. An efficient task scheduling mechanism ensures a load-balanced execution over the entire multicore with GPU
Speeding up Subset Seed Algorithm for Intensive Protein Sequence Comparison
- PROCEEDINGS OF THE 6TH IEEE INTERNATIONAL CONFERENCE ON RESEARCH, INNOVATION & VISION FOR THE FUTURE (RIVF)
, 2008
- Cited by 5 (1 self)
Sequence similarity search is a common and repeated task in molecular biology. The rapid growth of genomic databases creates a need to speed up this task. In this paper, we present a subset seed algorithm for intensive protein sequence comparison. We have accelerated this algorithm by using an indexing technique and the fine-grained parallelism of GPU and SIMD instructions. We have implemented two programs: iBLASTP and iTBLASTN. The GPU (SIMD) implementation of the two programs achieves a speedup ranging from 5.5 to 10 (4 to 5.6) compared to BLASTP and TBLASTN of the BLAST program family, with comparable sensitivity.
GPU Acceleration of Numerical Weather Prediction
- Cited by 4 (0 self)
Weather and climate prediction software has enjoyed the benefits of exponentially increasing processor power for almost 50 years. Even with the advent of large-scale parallelism in weather models, much of the performance increase has come from increasing processor speed rather than increased parallelism. This free ride is nearly over. Recent results also indicate that simply increasing the use of large-scale parallelism will prove ineffective for many scenarios. We present an alternative method of scaling model performance by exploiting emerging architectures using the fine-grain parallelism once used in vector machines. The paper shows the promise of this approach by demonstrating a 20× speedup for a computationally intensive portion of the Weather Research and Forecast (WRF) model on an NVIDIA 8800 GTX Graphics Processing Unit (GPU). We expect an overall 1.3× speedup from this change alone.
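The gap between the 20× kernel speedup and the 1.3× overall speedup is what Amdahl's law would predict when only part of the runtime is accelerated. The sketch below assumes Amdahl's law applies cleanly (no transfer or overhead terms); the abstract does not state the accelerated fraction, so the value computed here is an inference under that assumption, and the function names are hypothetical.

```python
def overall_speedup(fraction, kernel_speedup):
    """Amdahl's law: whole-program speedup when `fraction` of the original
    runtime is accelerated by `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)

def implied_fraction(overall, kernel_speedup):
    """Invert Amdahl's law to recover the accelerated runtime fraction."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / kernel_speedup)

# Fraction of WRF runtime implied by the two figures quoted in the abstract.
wrf_fraction = implied_fraction(1.3, 20.0)
```

Under this model the accelerated portion would account for roughly a quarter of the original runtime, which also bounds the achievable overall speedup no matter how fast the kernel becomes.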
On the Energy Efficiency of Graphics Processing Units for Scientific Computing
- Cited by 4 (0 self)
The graphics processing unit (GPU) has emerged as a computational accelerator that dramatically reduces the time to discovery in high-end computing (HEC). However, while today’s state-of-the-art GPU can easily reduce the execution time of a parallel code by many orders of magnitude, it arguably comes at the expense of significant power and energy consumption. For example, the NVIDIA GTX 280 video card is rated at 236 watts, which is as much as the rest of a compute node, thus requiring a 500-W power supply. As a consequence, the GPU has been viewed as a “non-green” computing solution. This paper seeks to characterize, and perhaps debunk, the notion of a “power-hungry GPU” via an empirical study of the performance, power, and energy characteristics of GPUs for scientific computing. Specifically, we take an important biological code that runs in a traditional CPU environment and transform and map it to a hybrid CPU+GPU environment. The end result is that our hybrid CPU+GPU environment, hereafter referred to simply as GPU environment, delivers an energy-delay product that is multiple orders of magnitude better than a traditional CPU environment, whether unicore or multicore.
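The energy-delay product mentioned at the end of this abstract is energy multiplied by runtime, so it penalizes slow runs quadratically. The sketch below shows why a higher-power system can still win on this metric; the power and runtime figures are invented purely for illustration and are not measurements from the paper, and the function names are hypothetical.

```python
def energy_delay_product(avg_power_watts, runtime_seconds):
    """EDP = energy * delay = (power * time) * time, in joule-seconds."""
    energy_joules = avg_power_watts * runtime_seconds
    return energy_joules * runtime_seconds

def edp_improvement(baseline, candidate):
    """Ratio of baseline EDP to candidate EDP; values above 1 favor the candidate.

    Each argument is an (avg_power_watts, runtime_seconds) pair.
    """
    return energy_delay_product(*baseline) / energy_delay_product(*candidate)

# Hypothetical numbers only: a 200 W CPU node running for 1000 s versus
# a 450 W CPU+GPU node finishing the same job in 10 s.
cpu_run = (200.0, 1000.0)
gpu_run = (450.0, 10.0)
ratio = edp_improvement(cpu_run, gpu_run)
```

With these made-up inputs the hybrid node draws more than twice the power but improves the energy-delay product by a factor of several thousand, because runtime enters the metric squared.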
An Improved MAGMA GEMM for Fermi GPUs
, 2010
- Cited by 4 (2 self)
We present an improved matrix-matrix multiplication routine (GEMM) in the MAGMA BLAS library that targets the Fermi GPUs. We show how to modify the previous MAGMA GEMM kernels in order to make more efficient use of Fermi’s new architectural features, most notably the extended memory hierarchy and sizes. The improved kernels run at up to 300 GFlop/s in double and up to 600 GFlop/s in single precision arithmetic (on a C2050), which is 58% of the theoretical peak. We compare the improved kernels with those currently available in CUBLAS 3.1. Further, we show the effect of the new kernels on higher-level dense linear algebra (DLA) routines such as the one-sided matrix factorizations, and compare their performance with corresponding, currently available routines running on homogeneous multicore systems. A general conclusion is that DLA has become a better fit for the new GPU architectures, to the point where DLA can run more efficiently on GPUs than on current, high-end homogeneous multicore-based systems.
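The 58% figure can be checked with simple arithmetic. The abstract does not state the theoretical peaks, so the sketch below assumes the commonly quoted Tesla C2050 values of about 515 GFlop/s in double precision and 1030 GFlop/s in single precision; treat those two numbers, and the names used here, as assumptions rather than values from the paper.

```python
# Assumed theoretical peaks for the C2050 in GFlop/s (not given in the abstract).
ASSUMED_PEAK_GFLOPS = {"double": 515.0, "single": 1030.0}

# Sustained GEMM rates reported in the abstract, in GFlop/s.
REPORTED_GFLOPS = {"double": 300.0, "single": 600.0}

def fraction_of_peak(precision):
    """Reported GEMM rate as a fraction of the assumed theoretical peak."""
    return REPORTED_GFLOPS[precision] / ASSUMED_PEAK_GFLOPS[precision]

double_fraction = fraction_of_peak("double")
single_fraction = fraction_of_peak("single")
```

Under these assumed peaks both precisions come out at roughly the same fraction, consistent with the single 58% figure given for both rates.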
GPU Kernels as Data-Parallel Array Computations in Haskell
, 2009
- Cited by 3 (1 self)
We present a novel high-level parallel programming model aimed at graphics processing units (GPUs). We embed GPU kernels as data-parallel array computations in the purely functional language Haskell. GPU and CPU computations can be freely interleaved with the type system tracking the two different modes of computation. The embedded language of array computations is sufficiently limited that our system can automatically extract these computations and compile them to efficient GPU code. In this paper, we outline our approach and present the results of a few preliminary benchmarks.
Wait-free Programming for General Purpose Computations on Graphics Processors
- Cited by 2 (1 self)
The fact that graphics processors (GPUs) are today’s most powerful computational hardware for the dollar has motivated researchers to utilize the ubiquitous and powerful GPUs for general-purpose computing. Recent GPUs feature the single-program multiple-data (SPMD) multicore architecture instead of the single-instruction multiple-data (SIMD) architecture. However, unlike CPUs, GPUs devote their transistors mainly to data processing rather than data caching and flow control, and consequently most of the powerful GPUs with many cores do not support any synchronization mechanisms between their cores. This prevents GPUs from being deployed more widely for general-purpose computing. This paper aims at bridging the gap between the lack of synchronization mechanisms in recent GPU architectures and the need for synchronization mechanisms in parallel applications. Based on the intrinsic features of recent GPU architectures, we construct strong synchronization objects like wait-free and t-resilient read-modify-write objects for a general model of recent GPU architectures without strong hardware synchronization primitives like test-and-set and compare-and-swap. Accesses to the wait-free objects have time complexity O(N), where N is the number of processes. Our result demonstrates that it is possible to construct wait-free synchronization mechanisms for GPUs without the need for strong synchronization primitives in hardware and that wait-free programming is possible for GPUs.

