Results 1 - 10 of 11
Concurrent number cruncher: a gpu implementation of a general sparse linear solver
- Int. J. Parallel Emerg. Distrib. Syst
Cited by 13 (0 self)
A wide class of numerical methods needs to solve a linear system in which the matrix pattern of non-zero coefficients can be arbitrary. These problems can greatly benefit from the highly multithreaded computational power and large memory bandwidth available on GPUs, especially since dedicated general-purpose APIs such as CTM (AMD-ATI) and CUDA (NVIDIA) have appeared. CUDA even provides a BLAS implementation, but only for dense matrices (CuBLAS). Other existing linear solvers for the GPU are also limited by their internal matrix representation. This paper describes how to combine recent GPU programming techniques and new GPU-dedicated APIs with high-performance computing strategies (namely block compressed row storage, register blocking and vectorization) to implement a sparse general-purpose linear solver. Our implementation of the Jacobi-preconditioned Conjugate Gradient algorithm outperforms leading-edge CPU counterparts by up to a factor of 6.0, making it attractive for applications that can be content with single precision.
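The block compressed row storage (BCRS) layout named in this abstract can be illustrated with a small CPU-side sketch. This is not the paper's GPU implementation: the function names `bcrs_from_dense` and `bcrs_matvec`, the square-matrix restriction, and the pure-Python loops are assumptions chosen for readability. The per-block-row accumulator list only stands in for the idea of register blocking.

```python
def bcrs_from_dense(dense, bs):
    """Convert a square dense matrix (list of rows) into BCRS with bs x bs blocks.

    Returns (row_ptr, col_ind, blocks): row_ptr indexes block rows,
    col_ind holds block-column indices, and each block is stored row-major.
    """
    n = len(dense)
    assert n % bs == 0, "matrix size must be a multiple of the block size"
    nb = n // bs
    row_ptr, col_ind, blocks = [0], [], []
    for bi in range(nb):
        for bj in range(nb):
            block = [dense[bi * bs + r][bj * bs + c]
                     for r in range(bs) for c in range(bs)]
            # Keep a block only if it has at least one non-zero entry.
            if any(v != 0.0 for v in block):
                col_ind.append(bj)
                blocks.append(block)
        row_ptr.append(len(col_ind))
    return row_ptr, col_ind, blocks


def bcrs_matvec(row_ptr, col_ind, blocks, x, bs):
    """Compute y = A x for a matrix stored in BCRS form."""
    nb = len(row_ptr) - 1
    y = [0.0] * (nb * bs)
    for bi in range(nb):
        acc = [0.0] * bs  # one accumulator per row of the block row
        for k in range(row_ptr[bi], row_ptr[bi + 1]):
            x0 = col_ind[k] * bs  # first x entry touched by this block
            blk = blocks[k]
            for r in range(bs):
                for c in range(bs):
                    acc[r] += blk[r * bs + c] * x[x0 + c]
        y[bi * bs:(bi + 1) * bs] = acc
    return y
```

Storing whole blocks means one column index is read per block rather than per coefficient, which is the usual motivation for the layout when non-zeros cluster.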
Pipelined mixed precision algorithms on FPGAs for fast and accurate PDE solvers from low precision components
- In IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM
, 2006
Cited by 9 (2 self)
FPGAs are becoming more and more attractive for high precision scientific computations. One of the main problems in efficient resource utilization is the quadratically growing resource usage of multipliers depending on the operand size. Many research efforts have been devoted to the optimization of individual arithmetic and linear algebra operations. In this paper we take a higher-level approach and seek to reduce the intermediate computational precision on the algorithmic level by optimizing the accuracy towards the final result of an algorithm. In our case this is the accurate solution of partial differential equations (PDEs). Using the Poisson problem as a typical PDE example, we show that most intermediate operations can be computed with floats or even smaller formats and only very few operations (e.g. 1%) must be performed in double precision to obtain the same accuracy as a full double precision solver. Thus the FPGA can be configured with many parallel float operations rather than a few resource-hungry double operations. To achieve this, we adapt the general concept of mixed precision iterative refinement methods to FPGAs and develop a fully pipelined version of the Conjugate Gradient solver. We combine this solver with different iterative refinement schemes and precision combinations to obtain resource-efficient mappings of the pipelined algorithm core onto the FPGA.
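The mixed precision iterative refinement idea in this abstract can be sketched in software. This is only an illustration under stated assumptions, not the paper's pipelined FPGA design: the 1D Poisson matrix tridiag(-1, 2, -1), the emulation of single precision through `struct`, the residual scaling, the inner tolerance, and all function names are choices made here. Only the residual and the solution update run in double precision. The inner Conjugate Gradient solver rounds every operation to single precision.

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def exact(x):
    """Identity rounding: keep full binary64 precision."""
    return x


def matvec(v, rnd):
    """y = A v for A = tridiag(-1, 2, -1), rounding each operation with rnd."""
    n = len(v)
    y = []
    for i in range(n):
        s = rnd(2.0 * v[i])
        if i > 0:
            s = rnd(s - v[i - 1])
        if i < n - 1:
            s = rnd(s - v[i + 1])
        y.append(s)
    return y


def dot(a, b, rnd):
    s = 0.0
    for x, y in zip(a, b):
        s = rnd(s + rnd(x * y))
    return s


def cg_single(rhs, max_iter):
    """Conjugate Gradient for A x = rhs with all arithmetic rounded to binary32."""
    x = [0.0] * len(rhs)
    r = [f32(v) for v in rhs]
    p = list(r)
    rr = dot(r, r, f32)
    rr0 = rr
    for _ in range(max_iter):
        if rr <= 1e-12 * rr0:  # assumed inner tolerance (relative 1e-6)
            break
        ap = matvec(p, f32)
        pap = dot(p, ap, f32)
        if pap <= 0.0:  # guard against breakdown in low precision
            break
        alpha = f32(rr / pap)
        x = [f32(xi + f32(alpha * pi)) for xi, pi in zip(x, p)]
        r = [f32(ri - f32(alpha * ai)) for ri, ai in zip(r, ap)]
        rr_new = dot(r, r, f32)
        beta = f32(rr_new / rr)
        p = [f32(ri + f32(beta * pi)) for ri, pi in zip(r, p)]
        rr = rr_new
    return x


def residual(b, x):
    """b - A x evaluated in binary64."""
    return [bi - ai for bi, ai in zip(b, matvec(x, exact))]


def mixed_precision_solve(b, outer=6):
    """Outer refinement loop in double, inner correction solves in single."""
    n = len(b)
    x = [0.0] * n
    for _ in range(outer):
        r = residual(b, x)                     # double precision
        scale = max(abs(v) for v in r)
        if scale == 0.0:
            break
        c = cg_single([v / scale for v in r], 3 * n)   # single precision
        x = [xi + scale * ci for xi, ci in zip(x, c)]  # double precision
    return x
```

Scaling the residual before each inner solve keeps the single precision solver working on values of order one, so later refinement steps do not drift toward underflow.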
Using mixed precision for sparse matrix computations to enhance the performance while achieving 64-bit accuracy
- ACM Trans. Math. Softw
Cited by 7 (1 self)
By using a combination of 32-bit and 64-bit floating point arithmetic the performance of many sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. These ideas can be applied to sparse multifrontal and supernodal direct techniques and sparse iterative techniques such as Krylov subspace methods. The approach presented here can apply not only to conventional processors but also to exotic technologies such as
Implementation of float-float operators on graphics hardware
- In 7th conference on Real Numbers and Computers, RNC7
, 2006
Cited by 4 (0 self)
The Graphic Processing Unit (GPU) has evolved into a powerful and flexible processor. The latest graphic processors provide fully programmable vertex and pixel processing units that support vector operations up to single floating-point precision. This computational power is now being used for general-purpose computations. However, some applications require higher precision than single precision. This paper describes the emulation of a 44-bit floating-point number format and its corresponding operations. An implementation is presented along with performance and accuracy results.
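A float-float number represents a value as an unevaluated sum of two single precision numbers, a leading part and a small correction. The sketch below shows the standard error-free addition building blocks (Knuth's two-sum and the faster variant for ordered operands) and a simple float-float addition built from them. It is an illustration only, not the paper's GPU code: single precision is emulated with `struct`, the names `two_sum`, `fast_two_sum` and `ff_add` are chosen here, and the addition is the simple variant without a final renormalization of the low parts.

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def two_sum(a, b):
    """Return (s, e) with s = fl32(a + b) and s + e equal to a + b exactly."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e


def fast_two_sum(a, b):
    """Same result as two_sum, valid when |a| >= |b| or a == 0."""
    s = f32(a + b)
    e = f32(b - f32(s - a))
    return s, e


def ff_add(x, y):
    """Add two float-float numbers, each a (hi, lo) pair of binary32 values."""
    s, e = two_sum(x[0], y[0])        # exact sum of the high parts
    e = f32(e + f32(x[1] + y[1]))     # fold in the low parts (rounded)
    return fast_two_sum(s, e)         # renormalize into (hi, lo)


def ff_to_float(x):
    """Evaluate a (hi, lo) pair in binary64 for inspection."""
    return x[0] + x[1]
```

Accumulating many small single precision terms with `ff_add` keeps the rounding error that plain single precision addition would discard at every step.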
Mixed precision methods for convergent iterative schemes
- In Proceedings of the Workshop on Edge Computing Using New Commodity Architectures
, 2006
Cited by 3 (0 self)
Most error estimates of numerical schemes are derived in the field of real or complex numbers. From a computational point of view this assumes infinite precision. For the implementation on a computer, the infinite number fields are quantized into a finite set of
Quantum Monte Carlo on Graphical Processing Units
, 2007
Cited by 1 (0 self)
Quantum Monte Carlo (QMC) is among the most accurate methods for solving the time-independent Schrödinger equation. Unfortunately, the method is very expensive and requires a vast array of computing resources in order to obtain results of a reasonable convergence level. On the other hand, the method is not only easily parallelizable across CPU clusters but, as we report here, also has a high degree of data parallelism. This facilitates the use of recent technological advances in Graphical Processing Units (GPUs), a powerful type of processor well known to computer gamers. In this paper we report on an end-to-end QMC application with core elements of the algorithm running on a GPU. With individual kernels achieving as much as a 30x speedup, the overall application runs up to 6x faster than an optimized CPU implementation, yet requires only a modest increase in hardware cost. This demonstrates the speedups possible for QMC running on advanced hardware, thus exploring a path toward providing QMC-level accuracy as a more standard tool. The major current challenge in running codes of this type on the GPU arises from the lack of fully compliant IEEE floating-point implementations. To achieve better accuracy we propose the use of the Kahan summation formula in matrix multiplications. While this reduces overall performance, we demonstrate that the proposed new algorithm can match CPU single precision.
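Kahan (compensated) summation, which this abstract proposes for matrix multiplications, carries a running correction term that recovers the low-order bits lost in each addition. The sketch below applies it to a single dot product, the inner loop of a matrix multiplication. It is an assumption-laden illustration rather than the paper's GPU kernel: single precision is emulated with `struct`, and the names `naive_dot32` and `kahan_dot32` are chosen here.

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_dot32(a, b):
    """Dot product with every operation rounded to binary32."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = f32(acc + f32(f32(x) * f32(y)))
    return acc


def kahan_dot32(a, b):
    """Dot product in binary32 with Kahan compensation of the running sum."""
    acc = 0.0
    comp = 0.0  # running estimate of the error lost so far
    for x, y in zip(a, b):
        term = f32(f32(f32(x) * f32(y)) - comp)  # apply the pending correction
        t = f32(acc + term)                      # low bits of term may be lost
        comp = f32(f32(t - acc) - term)          # recover what was lost
        acc = t
    return acc
```

The compensated loop costs roughly four additions per term instead of one, which is consistent with the performance drop the abstract mentions.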
Contents lists available at ScienceDirect Journal of Computational Physics
"... journal homepage: www.elsevier.com/locate/jcp ..."
Real-Number Optimisation: A Speculative, Profile-Guided Approach
, 2007
From supercomputers for computational science to embedded processors in mobile phones, most important computing applications manipulate the set of real numbers, R. How these numbers are represented varies, with embedded applications picking fixed-point formats compatible with integer operations and larger machines using IEEE-754 floating point or a close variant. A large body of work describes methods for optimising floating-point representations using static analysis techniques; however, these must always take a conservative approach if they intend to ensure correctness. Taking our inspiration from work on speculative execution and profile-guided compiler optimisations, we lay out a series of tools and techniques to produce optimised real-number representations. Our speculative approach aims for greater reductions in hardware area and execution time than more conservative approaches, while providing fall-back options to ensure correctness in case of incorrect speculation. We describe a profiling tool for x86 binaries which reveals bucketised value ranges for floating-point operations within applications. A selection of profiling results for real-world scientific
A HIGHLY RELIABLE GPU-BASED RAID SYSTEM
, 2010
In this work, I have shown that current parity-based RAID levels are nearing the end of their usefulness. Further, the widely used parity-based hierarchical RAID levels are not capable of significantly improving reliability over their component parity-based levels without requiring massively increased hardware investment. In response, I have proposed k + m RAID, a family of RAID levels that allow m, the number of parity blocks per stripe, to vary based on the desired reliability of the volume. I have compared its failure rates to those of RAIDs 5 and 6, and RAIDs 1+0, 5+0, and 6+0 with varying numbers of sets. I have described how GPUs are architecturally well-suited to RAID computations, and have demonstrated the Gibraltar RAID library, a prototype library that performs RAID computations on GPUs. I have provided analyses of the library that show how evolutionary changes to GPU architecture, including the merge of GPUs and CPUs, can change the efficiency of coding operations. I have introduced a new memory layout and dispersal matrix arrangement, improving the efficiency of decoding to match that
The MathWorks
By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. These ideas can be applied to sparse multifrontal and supernodal direct techniques and sparse iterative techniques such as Krylov subspace methods. The approach presented here can apply not only to conventional processors but also to exotic technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the Cell BE processor.

