Results 1–10 of 30
Fast Multipole Methods on Graphical Processors
 Journal of Computational Physics
"... The Fast Multipole Method allows the rapid evaluation of sums of radial basis functions centered at points distributed inside a computational domain at a large number of evaluation points to a specified accuracy ɛ. The method scales as O (N) compared to the direct method with complexity O(N 2), whic ..."
Abstract

Cited by 47 (6 self)
 Add to MetaCart
(Show Context)
The Fast Multipole Method (FMM) allows the rapid evaluation of sums of radial basis functions centered at points distributed inside a computational domain at a large number of evaluation points to a specified accuracy ɛ. The method scales as O(N), compared with the O(N²) complexity of the direct method, which allows one to solve larger-scale problems. Graphics processing units (GPUs) are now increasingly viewed as data-parallel compute coprocessors that can provide significant computational performance at low price. We describe acceleration of the FMM using the data-parallel GPU architecture. The FMM has a complex hierarchical (adaptive) structure, which is not easily implemented on data-parallel processors. We describe strategies for parallelization of all components of the FMM, develop a model to explain the performance of the algorithm on GPU architectures, and determine optimal settings for the FMM on the GPU, which differ from those on usual CPUs. Some innovations in the FMM algorithm, including the use of modified stencils, real polynomial basis functions for the Laplace kernel, and decompositions of the translation operators, are also described. We obtained accelerations of the Laplace-kernel FMM on a single NVIDIA GeForce 8800 GTX GPU in the range 30–60 compared to a serial CPU implementation for benchmark cases of up to a million points. For a problem with a million sources, the summations involved are performed in approximately one second. This performance is equivalent to solving the same problem at a 24–43 Teraflop rate using straightforward summation.
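The O(N²) direct evaluation that the FMM accelerates can be sketched in a few lines. The function below is purely our own illustrative reference for the sum being computed; the 1/r Laplace kernel and all names are assumptions, not code from the paper:

```python
import math

def direct_sum(sources, weights, targets):
    """Direct O(N*M) evaluation of a sum of radial basis functions
    phi(r) = 1/r (Laplace kernel) centered at `sources`, evaluated at
    `targets`.  This is the brute-force baseline the FMM reduces to
    O(N) work to a prescribed accuracy."""
    results = []
    for t in targets:
        acc = 0.0
        for s, w in zip(sources, weights):
            r = math.dist(s, t)
            if r > 0.0:  # skip a source coincident with the target
                acc += w / r
        results.append(acc)
    return results
```

For a million sources and targets this loop performs ~10¹² kernel evaluations, which is the Teraflop-scale work the abstract's comparison refers to.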
Concurrent number cruncher: a gpu implementation of a general sparse linear solver
 Int. J. Parallel Emerg. Distrib. Syst
"... A wide class of numerical methods needs to solve a linear system, where the matrix pattern of nonzero coefficients can be arbitrary. These problems can greatly benefit from highly multithreaded computational power and large memory bandwidth available on GPUs, especially since dedicated general purp ..."
Abstract

Cited by 40 (0 self)
 Add to MetaCart
(Show Context)
A wide class of numerical methods needs to solve a linear system in which the pattern of nonzero matrix coefficients can be arbitrary. These problems can greatly benefit from the highly multithreaded computational power and large memory bandwidth available on GPUs, especially since dedicated general-purpose APIs such as CTM (AMD/ATI) and CUDA (NVIDIA) have appeared. CUDA even provides a BLAS implementation, but only for dense matrices (CuBLAS). Other existing linear solvers for the GPU are also limited by their internal matrix representation. This paper describes how to combine recent GPU programming techniques and new GPU-dedicated APIs with high-performance computing strategies (namely block compressed row storage, register blocking, and vectorization) to implement a general-purpose sparse linear solver. Our implementation of the Jacobi-preconditioned Conjugate Gradient algorithm outperforms leading-edge CPU counterparts by up to a factor of 6.0x, making it attractive for applications that are content with single precision.
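The algorithmic core, Jacobi- (diagonal-) preconditioned Conjugate Gradient, can be sketched serially. The GPU-specific machinery of the paper (block compressed row storage, register blocking, vectorization) is abstracted behind a user-supplied `matvec`; all names here are our own, not the paper's API:

```python
def jacobi_pcg(matvec, diag, b, tol=1e-10, max_iter=200):
    """Jacobi-preconditioned CG for a symmetric positive definite
    system A x = b.  `matvec(v)` returns A @ v (the sparse product the
    paper implements on the GPU); `diag` holds the diagonal of A."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                       # residual b - A x, with x = 0
    z = [r[i] / diag[i] for i in range(n)]  # preconditioned residual
    p = list(z)
    rz = sum(r[i] * z[i] for i in range(n))
    for _ in range(max_iter):
        q = matvec(p)
        alpha = rz / sum(p[i] * q[i] for i in range(n))
        for i in range(n):
            x[i] += alpha * p[i]
            r[i] -= alpha * q[i]
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            break
        z = [r[i] / diag[i] for i in range(n)]
        rz_new = sum(r[i] * z[i] for i in range(n))
        beta = rz_new / rz
        p = [z[i] + beta * p[i] for i in range(n)]
        rz = rz_new
    return x
```

The only operations touching the matrix are `matvec` and the diagonal scaling, which is why the matrix storage format can be optimized independently of the solver skeleton.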
Using mixed precision for sparse matrix computations to enhance the performance while achieving 64bit accuracy
 ACM Trans. Math. Softw
"... By using a combination of 32bit and 64bit floating point arithmetic the performance of many sparse linear algebra algorithms can be significantly enhanced while maintaining the 64bit accuracy of the resulting solution. These ideas can be applied to sparse multifrontal and supernodal direct techni ..."
Abstract

Cited by 20 (1 self)
 Add to MetaCart
(Show Context)
By using a combination of 32-bit and 64-bit floating-point arithmetic, the performance of many sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. These ideas can be applied to sparse multifrontal and supernodal direct techniques and to sparse iterative techniques such as Krylov subspace methods. The approach presented here can apply not only to conventional processors but also to exotic technologies such as
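The idea behind such mixed precision schemes is iterative refinement: solve cheaply in 32-bit, compute the residual in 64-bit, then solve again for a correction. A minimal sketch on our own toy 2x2 system, with single precision simulated by rounding doubles (none of this is the paper's code):

```python
import struct

def to_single(x):
    """Round a double to the nearest IEEE-754 single, simulating
    arithmetic done on 32-bit hardware."""
    return struct.unpack("f", struct.pack("f", x))[0]

def solve2x2_single(A, b):
    """Cheap low-precision inner solver: Cramer's rule on a 2x2
    system with operands and results rounded to single precision."""
    a11, a12 = to_single(A[0][0]), to_single(A[0][1])
    a21, a22 = to_single(A[1][0]), to_single(A[1][1])
    b1, b2 = to_single(b[0]), to_single(b[1])
    det = to_single(a11 * a22 - a12 * a21)
    return [to_single((b1 * a22 - b2 * a12) / det),
            to_single((a11 * b2 - a21 * b1) / det)]

def refine(A, b, iters=3):
    """Mixed precision iterative refinement: single-precision solve,
    double-precision residual, single-precision correction solve."""
    x = solve2x2_single(A, b)
    for _ in range(iters):
        # residual computed in full double precision
        r = [b[i] - sum(A[i][j] * x[j] for j in range(2))
             for i in range(2)]
        d = solve2x2_single(A, r)
        x = [x[i] + d[i] for i in range(2)]
    return x
```

Each pass shrinks the error by roughly the single-precision unit roundoff times the condition number, so a few iterations recover full 64-bit accuracy while nearly all the arithmetic stays in 32-bit.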
Assembly of finite element methods on graphics processors
 International Journal for Numerical Methods in Engineering
"... Recently, graphics processing units (GPUs) have had great success in accelerating many numerical computations. We present their application to computations on unstructured meshes such as those in finite element methods. Multiple approaches in assembling and solving sparse linear systems with NVIDIA ..."
Abstract

Cited by 19 (0 self)
 Add to MetaCart
(Show Context)
Recently, graphics processing units (GPUs) have had great success in accelerating many numerical computations. We present their application to computations on unstructured meshes such as those arising in finite element methods. Multiple approaches to assembling and solving sparse linear systems with NVIDIA GPUs and the Compute Unified Device Architecture (CUDA) are presented and discussed. Multiple strategies for efficient use of global, shared, and local memory, methods to achieve memory coalescing, and optimal choices of parameters are introduced. We find that with appropriate preprocessing and arrangement of support data, the GPU coprocessor achieves speedups of 30 or more in comparison with a well-optimized serial implementation. We also find that the optimal assembly strategy depends on the order of the polynomials used in the finite-element discretization.
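Assembly itself is a scatter-add of small element matrices into a global sparse matrix, which is exactly the operation that needs care on a GPU (write conflicts, coalescing). A minimal serial sketch for 1-D linear elements; the toy mesh and all names are our own assumptions, not the paper's:

```python
def assemble_1d_stiffness(n_elems, h):
    """Assemble the global stiffness matrix for 1-D linear elements on
    a uniform mesh of spacing `h`.  Each element contributes a 2x2
    local matrix that is scatter-added into the global matrix, stored
    here as a dict keyed by (row, col)."""
    k_local = [[1.0 / h, -1.0 / h],
               [-1.0 / h, 1.0 / h]]  # local stiffness of one element
    K = {}
    for e in range(n_elems):
        nodes = (e, e + 1)           # global node numbers of element e
        for a in range(2):
            for b in range(2):
                ij = (nodes[a], nodes[b])
                K[ij] = K.get(ij, 0.0) + k_local[a][b]
    return K
```

The inner scatter-add is where neighboring elements write to shared nodes; the paper's strategies (coloring, per-nonzero gathering, shared-memory staging) are different ways of making those concurrent updates safe and fast on the GPU.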
Pipelined mixed precision algorithms on FPGAs for fast and accurate PDE solvers from low precision components
 In IEEE Proceedings on Field-Programmable Custom Computing Machines (FCCM), 2006
"... FPGAs are becoming more and more attractive for high precision scientific computations. One of the main problems in efficient resource utilization is the quadratically growing resource usage of multipliers depending on the operand size. Many research efforts have been devoted to the optimization of ..."
Abstract

Cited by 17 (3 self)
 Add to MetaCart
(Show Context)
FPGAs are becoming increasingly attractive for high-precision scientific computations. One of the main obstacles to efficient resource utilization is the resource usage of multipliers, which grows quadratically with operand size. Many research efforts have been devoted to the optimization of individual arithmetic and linear algebra operations. In this paper we take a higher-level approach and seek to reduce the intermediate computational precision at the algorithmic level by optimizing the accuracy toward the final result of an algorithm; in our case this is the accurate solution of partial differential equations (PDEs). Using the Poisson problem as a typical PDE example, we show that most intermediate operations can be computed with floats or even smaller formats, and only very few operations (e.g. 1%) must be performed in double precision to obtain the same accuracy as a full double precision solver. Thus the FPGA can be configured with many parallel float units rather than a few resource-hungry double units. To achieve this, we adapt the general concept of mixed precision iterative refinement methods to FPGAs and develop a fully pipelined version of the Conjugate Gradient solver. We combine this solver with different iterative refinement schemes and precision combinations to obtain resource-efficient mappings of the pipelined algorithm core onto the FPGA.
Implementation of floatfloat operators on graphics hardware
 In 7th conference on Real Numbers and Computers, RNC7
, 2006
"... The Graphic Processing Unit (GPU) has evolved into a powerful and flexible processor. The latest graphic processors provide fully programmable vertex and pixel processing units that support vector operations up to single floatingpoint precision. This computational power is now being used for genera ..."
Abstract

Cited by 13 (0 self)
 Add to MetaCart
(Show Context)
The graphics processing unit (GPU) has evolved into a powerful and flexible processor. The latest graphics processors provide fully programmable vertex and pixel processing units that support vector operations up to single floating-point precision. This computational power is now being used for general-purpose computations. However, some applications require higher precision than single precision. This paper describes the emulation of a 44-bit floating-point number format and its corresponding operations. An implementation is presented along with performance and accuracy results.
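The float-float idea represents a value as an unevaluated sum of two machine floats and uses error-free transformations to carry the rounding error of each operation. A sketch of the standard building blocks (Knuth's two-sum and a two-term addition); note that Python's native float is a double, so this code simulates double-double rather than the paper's single-single GPU format:

```python
def two_sum(a, b):
    """Error-free transformation: returns (s, e) with s = fl(a + b)
    and a + b = s + e exactly (Knuth's two-sum)."""
    s = a + b
    bv = s - a          # the part of b actually absorbed into s
    av = s - bv         # the part of a actually absorbed into s
    return s, (a - av) + (b - bv)

def ff_add(x, y):
    """Add two float-float numbers (hi, lo), keeping roughly twice
    the precision of a single machine float.  Illustrative sketch of
    the scheme, not the paper's GPU implementation."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]    # fold in the low-order parts
    return two_sum(s, e)  # renormalize so |lo| <= ulp(hi)/2
```

The same pattern (an exact product via FMA or splitting, plus renormalization) yields float-float multiplication and division; together these give the ~44-bit significand format the paper describes.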
Quantum Monte Carlo on Graphical Processing Units
, 2007
"... Quantum Monte Carlo (QMC) is among the most accurate methods for solving the time independent Schrödinger equation. Unfortunately, the method is very expensive and requires a vast array of computing resources in order to obtain results of a reasonable convergence level. On the other hand, the method ..."
Abstract

Cited by 12 (1 self)
 Add to MetaCart
Quantum Monte Carlo (QMC) is among the most accurate methods for solving the time-independent Schrödinger equation. Unfortunately, the method is very expensive and requires a vast array of computing resources to obtain results of a reasonable convergence level. On the other hand, the method is not only easily parallelizable across CPU clusters, but, as we report here, it also has a high degree of data parallelism. This facilitates the use of recent technological advances in graphics processing units (GPUs), a powerful type of processor well known to computer gamers. In this paper we report on an end-to-end QMC application with core elements of the algorithm running on a GPU. With individual kernels achieving as much as 30x speedup, the overall application performs at up to 6x relative to an optimized CPU implementation, yet requires only a modest increase in hardware cost. This demonstrates the speedup possible for QMC on advanced hardware, exploring a path toward providing QMC-level accuracy as a more standard tool. The major current challenge in running codes of this type on the GPU arises from the lack of fully IEEE-compliant floating-point implementations. To achieve better accuracy we propose the use of the Kahan summation formula in matrix multiplications. While this reduces overall performance, we demonstrate that the proposed algorithm can match CPU single precision.
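Kahan summation carries the rounding error of each addition in a separate compensation variable and folds it back into the next term. A minimal sketch of the standard algorithm (not the authors' GPU kernel, where it would sit inside the matrix-multiply inner loop):

```python
def kahan_sum(values):
    """Compensated summation: `c` accumulates the low-order bits lost
    by each addition so they are re-applied on the next step."""
    s = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for v in values:
        y = v - c        # corrected next term
        t = s + y        # low-order bits of y may be lost here...
        c = (t - s) - y  # ...and are recovered into c
        s = t
    return s
```

This bounds the summation error independently of the number of terms, which is what lets a single-precision accumulator match a plain double-style result in long dot products.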
Fast conjugate gradients with multiple GPUs
 In ICCS ’09: Proceedings of the 9th International Conference on Computational Science
, 2009
"... Abstract. The limiting factor for efficiency of sparse linear solvers is the memory bandwidth. In this work, we utilize GPU’s high memory bandwidth for implementation of a sparse iterative solver for unstructured problems. We describe a fast Conjugate Gradient solver, which runs on multiple GPUs in ..."
Abstract

Cited by 11 (0 self)
 Add to MetaCart
(Show Context)
The limiting factor for the efficiency of sparse linear solvers is memory bandwidth. In this work, we utilize the GPU's high memory bandwidth to implement a sparse iterative solver for unstructured problems. We describe a fast Conjugate Gradient solver that runs on multiple GPUs installed on a single mainboard. The solver achieves double precision accuracy with single precision GPUs, using a mixed precision iterative refinement algorithm. To achieve high computation speed, we propose a fast sparse matrix-vector multiplication algorithm, the core operation of iterative solvers. The proposed multiplication algorithm efficiently utilizes GPU resources via caching, coalesced memory accesses, and load balancing between running threads. Experiments on a wide range of matrices show that our matrix-vector multiplication algorithm achieves up to 9.9 Gflops on a single GeForce 8800 GTS card and our CG implementation achieves up to 22.6 Gflops with four GPUs.
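The core kernel, sparse matrix-vector multiplication in compressed sparse row (CSR) form, looks like this as a plain serial reference (the paper's GPU version layers caching, coalescing, and load balancing on top; the format choice and names here are our assumptions):

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix in CSR format.  `values` holds the
    nonzeros row by row, `col_idx` their column indices, and
    `row_ptr[i]:row_ptr[i+1]` delimits row i's entries."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y
```

Because each output element reads an unpredictable set of `x` entries, the kernel is bandwidth-bound rather than compute-bound, which is why rows of very different lengths cause the load-balance problem the paper addresses.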
Supporting extended precision on graphics processors
 Proceedings of the Sixth International Workshop on Data Management on New Hardware (DaMoN 2010), June 7, 2010
, 2010
"... Scientific computing applications often require support for nontraditional data types, for example, numbers with a precision higher than 64bit floats. As graphics processors, or GPUs, have emerged as a powerful accelerator for scientific computing, we design and implement a GPUbased extended pre ..."
Abstract

Cited by 10 (1 self)
 Add to MetaCart
(Show Context)
Scientific computing applications often require support for non-traditional data types, for example, numbers with a precision higher than 64-bit floats. As graphics processors, or GPUs, have emerged as a powerful accelerator for scientific computing, we design and implement a GPU-based extended precision library to enable applications with high precision requirements to run on the GPU. Our library contains arithmetic operators, mathematical functions, and data-parallel primitives, each of which can operate at either multi-term or multi-digit precision. The multi-term precision maintains an accuracy of up to 212 bits of significand, whereas the multi-digit precision allows an accuracy of an arbitrary number of bits. Additionally, we have integrated the extended precision algorithms into a GPU-based query processing engine to support efficient query processing with extended precision on GPUs. To demonstrate the usage of our library, we have implemented three applications: parallel summation in climate modeling, Newton's method used in nonlinear physics, and high-precision numerical integration in experimental mathematics. The GPU-based implementation is up to an order of magnitude faster and achieves the same accuracy as its optimized, quad-core CPU-based counterparts.
Mixed precision methods for convergent iterative schemes
 EDGE
, 2006
"... Most error estimates of numerical schemes are derived in the field of real or complex numbers. From a computational point of view this assumes infinite precision. For the implementation on a computer, the infinite number fields are quantized into a finite set of ..."
Abstract

Cited by 7 (1 self)
 Add to MetaCart
(Show Context)
Most error estimates of numerical schemes are derived in the field of real or complex numbers. From a computational point of view this assumes infinite precision. For the implementation on a computer, the infinite number fields are quantized into a finite set of