Results 1–10 of 13
Concurrent number cruncher: a GPU implementation of a general sparse linear solver
 Int. J. Parallel Emerg. Distrib. Syst.
Abstract

Cited by 18 (0 self)
A wide class of numerical methods needs to solve a linear system where the matrix pattern of nonzero coefficients can be arbitrary. These problems can greatly benefit from the highly multithreaded computational power and large memory bandwidth available on GPUs, especially since dedicated general-purpose APIs such as CTM (AMD/ATI) and CUDA (NVIDIA) have appeared. CUDA even provides a BLAS implementation, but only for dense matrices (CuBLAS). Other existing linear solvers for the GPU are also limited by their internal matrix representation. This paper describes how to combine recent GPU programming techniques and new GPU-dedicated APIs with high-performance computing strategies (namely block compressed row storage, register blocking and vectorization) to implement a sparse general-purpose linear solver. Our implementation of the Jacobi-preconditioned Conjugate Gradient algorithm outperforms leading-edge CPU counterparts by up to a factor of 6.0x, making it attractive for applications that are content with single precision.
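The solver at the heart of this abstract, Jacobi-preconditioned Conjugate Gradient, is compact enough to sketch. The following is a minimal CPU-side NumPy version for reference — not the paper's BCSR/GPU implementation — with the Jacobi preconditioner reduced to an element-wise multiply by the inverted diagonal:

```python
import numpy as np

def jacobi_pcg(A, b, tol=1e-6, max_iter=1000):
    """Solve A x = b for symmetric positive definite A,
    preconditioned by M = diag(A) (the Jacobi preconditioner)."""
    x = np.zeros_like(b)
    r = b - A @ x                      # initial residual
    Minv = 1.0 / np.diag(A)            # applying M^-1 is just an element-wise scale
    z = Minv * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:    # converged
            break
        z = Minv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```

On the GPU the dense `A @ p` product above is replaced by the block-compressed-row sparse matrix-vector product the paper optimizes; the rest of the iteration is unchanged.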
Using mixed precision for sparse matrix computations to enhance the performance while achieving 64-bit accuracy
 ACM Trans. Math. Softw
Abstract

Cited by 12 (1 self)
By using a combination of 32-bit and 64-bit floating-point arithmetic, the performance of many sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. These ideas can be applied to sparse multifrontal and supernodal direct techniques and sparse iterative techniques such as Krylov subspace methods. The approach presented here can apply not only to conventional processors but also to exotic technologies such as
Pipelined mixed precision algorithms on FPGAs for fast and accurate PDE solvers from low precision components
 In IEEE Proceedings on Field-Programmable Custom Computing Machines (FCCM)
, 2006
Abstract

Cited by 9 (2 self)
FPGAs are becoming more and more attractive for high-precision scientific computations. One of the main problems in efficient resource utilization is the quadratically growing resource usage of multipliers depending on the operand size. Many research efforts have been devoted to the optimization of individual arithmetic and linear algebra operations. In this paper we take a higher-level approach and seek to reduce the intermediate computational precision at the algorithmic level by optimizing the accuracy towards the final result of an algorithm. In our case this is the accurate solution of partial differential equations (PDEs). Using the Poisson problem as a typical PDE example, we show that most intermediate operations can be computed with floats or even smaller formats and only very few operations (e.g. 1%) must be performed in double precision to obtain the same accuracy as a full double-precision solver. Thus the FPGA can be configured with many parallel float rather than few resource-hungry double operations. To achieve this, we adapt the general concept of mixed precision iterative refinement methods to FPGAs and develop a fully pipelined version of the Conjugate Gradient solver. We combine this solver with different iterative refinement schemes and precision combinations to obtain resource-efficient mappings of the pipelined algorithm core onto the FPGA.
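The structure the abstract describes — a low-precision inner Conjugate Gradient wrapped in a higher-precision refinement loop, so that only the residual updates (a small fraction of the work) run in double — can be sketched in software. This is an illustrative NumPy analogue, not the FPGA pipeline:

```python
import numpy as np

def cg32(A32, b32, iters=50):
    """Plain Conjugate Gradient run entirely in float32 (the 'cheap' inner solver)."""
    x = np.zeros_like(b32)
    r = b32 - A32 @ x
    p = r.copy()
    rr = r @ r
    for _ in range(iters):
        if rr < 1e-30:                 # converged (guards against 0/0 below)
            break
        Ap = A32 @ p
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def refined_cg(A, b, outer=3):
    """Outer iterative refinement in float64 around the float32 inner CG."""
    A32 = A.astype(np.float32)
    x = np.zeros_like(b)
    for _ in range(outer):
        r = b - A @ x                  # the few double-precision operations
        x += cg32(A32, r.astype(np.float32)).astype(np.float64)
    return x
```

All matrix-vector products happen in float32; only the outer residual and update use doubles, mirroring the "e.g. 1%" claim in spirit.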
Implementation of float-float operators on graphics hardware
 In 7th conference on Real Numbers and Computers, RNC7
, 2006
Abstract

Cited by 6 (0 self)
The Graphics Processing Unit (GPU) has evolved into a powerful and flexible processor. The latest graphics processors provide fully programmable vertex and pixel processing units that support vector operations up to single floating-point precision. This computational power is now being used for general-purpose computations. However, some applications require higher precision than single precision. This paper describes the emulation of a 44-bit floating-point number format and its corresponding operations. An implementation is presented along with performance and accuracy results.
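Float-float emulation represents one value as an unevaluated sum of two machine floats. A minimal sketch of the addition operator, built from Knuth's error-free two-sum, using NumPy float32 scalars to mimic the GPU's single precision (the 44-bit format and the paper's full operator set are not reproduced here):

```python
import numpy as np

def two_sum(a, b):
    # Knuth's error-free transformation: s + e equals a + b exactly,
    # where s is the rounded float sum and e the rounding error.
    s = a + b
    v = s - a
    e = (a - (s - v)) + (b - v)
    return s, e

def ff_add(x, y):
    """Add two float-float values x = (hi, lo), y = (hi, lo);
    returns a renormalized (hi, lo) pair of float32."""
    s, e = two_sum(x[0], y[0])         # exact sum of the high parts
    e = e + x[1] + y[1]                # fold in both low parts
    hi, lo = two_sum(s, e)             # renormalize so |lo| <= ulp(hi)/2
    return hi, lo
```

Reconstructing `float(hi) + float(lo)` recovers roughly twice the significand bits of a single float32, which is the precision gain the paper exploits on hardware that only has singles.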
Mixed precision methods for convergent iterative schemes
 EDGE
, 2006
Abstract

Cited by 2 (0 self)
Most error estimates of numerical schemes are derived in the field of real or complex numbers. From a computational point of view this assumes infinite precision. For the implementation on a computer, the infinite number fields are quantized into a finite set of
Quantum Monte Carlo on Graphical Processing Units
, 2007
Abstract

Cited by 2 (1 self)
Quantum Monte Carlo (QMC) is among the most accurate methods for solving the time-independent Schrödinger equation. Unfortunately, the method is very expensive and requires a vast array of computing resources in order to obtain results of a reasonable convergence level. On the other hand, the method is not only easily parallelizable across CPU clusters, but as we report here, it also has a high degree of data parallelism. This facilitates the use of recent technological advances in Graphical Processing Units (GPUs), a powerful type of processor well known to computer gamers. In this paper we report on an end-to-end QMC application with core elements of the algorithm running on a GPU. With individual kernels achieving as much as 30x speedup, the overall application performs up to 6x faster relative to an optimized CPU implementation, yet requires only a modest increase in hardware cost. This demonstrates the speedup possible for QMC running on advanced hardware, thus exploring a path toward providing QMC-level accuracy as a more standard tool. The major current challenge in running codes of this type on the GPU arises from the lack of fully compliant IEEE floating-point implementations. To achieve better accuracy we propose the use of the Kahan summation formula in matrix multiplications. While this reduces overall performance, we demonstrate that the proposed new algorithm can match CPU single precision.
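Kahan (compensated) summation, which the authors propose for the accumulations inside matrix multiplication, carries a running correction term that recovers the low-order bits each addition would otherwise discard. A minimal single-precision sketch of a compensated dot product (illustrative, not the paper's GPU kernel):

```python
import numpy as np

def kahan_dot(x, y):
    """Dot product accumulated in float32 with Kahan compensation."""
    s = np.float32(0.0)
    c = np.float32(0.0)                # running compensation for lost low bits
    for xi, yi in zip(x, y):
        t = np.float32(xi) * np.float32(yi) - c
        total = s + t                  # low bits of t may be rounded away here...
        c = (total - s) - t            # ...but are recovered into c for next time
        s = total
    return s
```

Naively adding many terms far smaller than the running sum loses them entirely in float32; the compensated version keeps the accumulated error at a few units in the last place regardless of the number of terms.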
Real-Number Optimisation: A Speculative, Profile-Guided Approach
, 2007
Abstract
From supercomputers for computational science to embedded processors in mobile phones, most important computing applications manipulate the set of real numbers, R. How these numbers are represented varies, with embedded applications picking fixed-point formats compatible with integer operations and larger machines using IEEE 754 floating point or a close variant. A large body of work describes methods for optimising floating-point representations using static analysis techniques; however, these must always take a conservative approach if they intend to ensure correctness. Taking our inspiration from work on speculative execution and profile-guided compiler optimisations, we lay out a series of tools and techniques to produce optimised real-number representations. Our speculative approach aims for greater reductions in hardware area and execution time than more conservative approaches, while providing fallback options to ensure correctness in case of incorrect speculation. We describe a profiling tool for x86 binaries which reveals bucketised value ranges for floating-point operations within applications. A selection of profiling results for real-world scientific
Nodal Discontinuous Galerkin Methods on Graphics Processors
, 901
Abstract
Discontinuous Galerkin (DG) methods for the numerical solution of partial differential equations have enjoyed considerable success because they are both flexible and robust: They allow arbitrary unstructured geometries and easy control of accuracy without compromising simulation stability. Lately, another property of DG has been growing in importance: The majority of a DG operator is applied in an element-local way, with weak penalty-based element-to-element coupling. The resulting locality in memory access is one of the factors that enables DG to run on off-the-shelf, massively parallel graphics processors (GPUs). In addition, DG’s high-order nature lets it require fewer data points per represented wavelength and hence fewer memory accesses, in exchange for higher arithmetic intensity. Both of these factors work significantly in favor of a GPU implementation of DG. Using a single US$400 Nvidia GTX 280 GPU, we accelerate a solver for Maxwell’s equations on a general 3D unstructured grid by a factor of 40 to 60 relative to a serial computation on a current-generation CPU. In many cases, our algorithms exhibit full use of the device’s available memory bandwidth. Example computations achieve and surpass 200 gigaflops/s of net application-level floating-point work. In this article, we describe and derive the techniques used to reach this level of performance. In addition, we present comprehensive data on the accuracy and runtime behavior of the method.
Quantum Monte Carlo on graphical processing units
 Computer Physics Communications 177 (2007) 298–306
, 2007