Pipelined mixed precision algorithms on FPGAs for fast and accurate PDE solvers from low precision components
- In IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2006
Cited by 9 (2 self)
FPGAs are becoming more and more attractive for high precision scientific computations. One of the main problems in efficient resource utilization is the quadratically growing resource usage of multipliers depending on the operand size. Many research efforts have been devoted to the optimization of individual arithmetic and linear algebra operations. In this paper we take a higher level approach and seek to reduce the intermediate computational precision on the algorithmic level by optimizing the accuracy towards the final result of an algorithm. In our case this is the accurate solution of partial differential equations (PDEs). Using the Poisson Problem as a typical PDE example we show that most intermediate operations can be computed with floats or even smaller formats and only very few operations (e.g. 1%) must be performed in double precision to obtain the same accuracy as a full double precision solver. Thus the FPGA can be configured with many parallel float rather than few resource hungry double operations. To achieve this, we adapt the general concept of mixed precision iterative refinement methods to FPGAs and develop a fully pipelined version of the Conjugate Gradient solver. We combine this solver with different iterative refinement schemes and precision combinations to obtain resource efficient mappings of the pipelined algorithm core onto the FPGA.
Using GPUs to improve multigrid solver performance on a cluster
- J. OF COMPUTATIONAL SCIENCE AND ENGINEERING, 2008
Cited by 5 (1 self)
This article explores the coupling of coarse and fine-grained parallelism for Finite Element simulations based on efficient parallel multigrid solvers. The focus lies on both system performance and a minimally invasive integration of hardware acceleration into an existing software package, requiring no changes to application code. Because of their excellent price performance ratio, we demonstrate the viability of our approach by using commodity graphics processors (GPUs) as efficient multigrid preconditioners. We address the issue of limited precision on GPUs by applying a mixed precision, iterative refinement technique. Other restrictions are also handled by a close interplay between the GPU and CPU. From a software perspective, we integrate the GPU solvers into the existing MPI-based Finite Element package by implementing the same interfaces as the CPU solvers, so that for the application programmer they are easily interchangeable. Our results show that we do not compromise any software functionality and gain speedups of two and more for large problems. Equipped with this additional option of hardware acceleration we compare different choices in increasing the performance of a conventional, commodity based cluster by increasing the number
Extra-precise iterative refinement for overdetermined least squares problems
2007
Cited by 5 (0 self)
We present the algorithm, error bounds, and numerical results for extra-precise iterative refinement applied to overdetermined linear least squares (LLS) problems. We apply our linear system refinement algorithm to Björck’s augmented linear system formulation of an LLS problem. Our algorithm reduces the forward normwise and componentwise errors to O(ε) unless the system is too ill conditioned. In contrast to linear systems, we provide two separate error bounds for the solution x and the residual r. The refinement algorithm requires only limited use of extra precision and adds only O(mn) work to the O(mn^2) cost of QR factorization for problems of size m-by-n. The extra precision calculation is facilitated by the new extended-precision BLAS standard in a portable way, and the refinement algorithm will be included in a future release of LAPACK and can be extended to the other types of least squares problems.
MODIFIED GRAM–SCHMIDT (MGS), LEAST SQUARES, AND BACKWARD STABILITY OF MGS-GMRES
2006
Cited by 5 (1 self)
The generalized minimum residual method (GMRES) [Y. Saad and M. Schultz, SIAM J. Sci. Statist. Comput., 7 (1986), pp. 856–869] for solving linear systems Ax = b is implemented as a sequence of least squares problems involving Krylov subspaces of increasing dimensions. The most usual implementation is modified Gram–Schmidt GMRES (MGS-GMRES). Here we show that MGS-GMRES is backward stable. The result depends on a more general result on the backward stability of a variant of the MGS algorithm applied to solving a linear least squares problem, and uses other new results on MGS and its loss of orthogonality, together with an important but neglected condition number, and a relation between residual norms and certain singular values.
REDUCING FLOATING POINT ERROR IN DOT PRODUCT USING THE SUPERBLOCK FAMILY OF ALGORITHMS
2008
Cited by 1 (0 self)
This paper discusses both the theoretical and statistical errors obtained by various well-known dot products, from the canonical to pairwise algorithms, and introduces a new and more general framework that we have named superblock which subsumes them and permits a practitioner to make trade-offs between computational performance, memory usage, and error behavior. We show that algorithms with lower error bounds tend to behave noticeably better in practice. Unlike many such error-reducing algorithms, superblock requires no additional floating point operations and should be implementable with little to no performance loss, making it suitable for use as a performance-critical building block of a linear algebra kernel.
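The two endpoints the abstract names, the canonical and the pairwise dot product, can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and the superblock framework itself is not reproduced here because the abstract does not specify it.

```python
def dot_canonical(x, y):
    """Canonical dot product: accumulate products left to right in one running sum."""
    s = 0.0
    for a, b in zip(x, y):
        s += a * b
    return s


def dot_pairwise(x, y, lo=0, hi=None):
    """Pairwise dot product: split the index range in half and add the two partial sums."""
    if hi is None:
        hi = len(x)
    n = hi - lo
    if n == 0:
        return 0.0
    if n == 1:
        return x[lo] * y[lo]
    mid = lo + n // 2
    return dot_pairwise(x, y, lo, mid) + dot_pairwise(x, y, mid, hi)
```

Both variants perform the same number of multiplications and additions; only the order of the additions differs, which is what changes the rounding error growth from roughly linear in the vector length (canonical) to roughly logarithmic (pairwise).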
Prospectus for the Next LAPACK and ScaLAPACK Libraries
Cited by 1 (0 self)
Dense linear algebra (DLA) forms the core of many scientific computing applications. Consequently, there is continuous interest and demand for the development of increasingly better algorithms in the field. Here ‘better’ has a broad meaning, and includes improved reliability, accuracy, robustness, ease of use, and
Mixed-Precision Preconditioners in Parallel Domain Decomposition Solvers
Motivated by accuracy reasons, many large-scale scientific applications and industrial numerical simulation codes are fully implemented in 64-bit floating-point arithmetic. On the other hand, many recent processor architectures exhibit 32-bit computational power that is significantly higher than for 64-bit. One recent and significant
Accelerating Scientific Computations with Mixed Precision Algorithms
2008
On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented.
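The general idea of combining 32-bit and 64-bit arithmetic can be sketched as mixed precision iterative refinement for a small dense system. This is a hedged illustration, not the authors' implementation: the names `f32`, `solve_f32`, and `refine` are hypothetical, binary32 arithmetic is simulated by rounding through `struct`, and for brevity the low precision elimination is repeated in every step, whereas a practical solver would factor once and reuse the factors.

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_f32(A, b):
    """Gaussian elimination with partial pivoting; every result is rounded to binary32."""
    n = len(b)
    # Augmented matrix [A | b], stored in simulated single precision.
    M = [[f32(v) for v in row] + [f32(rhs)] for row, rhs in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            m = f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = f32(M[i][j] - f32(m * M[k][j]))
    # Back substitution, again rounded to binary32.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(M[i][j] * x[j]))
        x[i] = f32(s / M[i][i])
    return x


def refine(A, b, iterations=5):
    """Low precision solves corrected by residuals computed in binary64."""
    n = len(b)
    x = solve_f32(A, b)
    for _ in range(iterations):
        # Residual and solution update are the only binary64 operations.
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        d = solve_f32(A, r)
        x = [x[i] + d[i] for i in range(n)]
    return x
```

For a well-conditioned matrix, each refinement step shrinks the error by roughly the single precision unit roundoff times the condition number, so a handful of steps is typically enough to reach double precision accuracy.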
Objective: 1 Solve a linear system
2009
$b \in \mathbb{R}^n$. Calculate an approximation $\tilde{x} \in \mathbb{R}^n$ of the exact solution $x^*$ to a linear system $Ax = b$. 2. Simultaneously bound the error upon this approximation using interval arithmetic: with $\Delta x = x^* - \tilde{x}$, calculate a small interval $e$ containing $\Delta x$. N.H. Diep (ENS Lyon), Relaxed certifying method, June 22, 2009. Classical iterative refinements
Certification of a Numerical Result: Use of Interval Arithmetic and Multiple Precision
Using floating-point arithmetic to solve a numerical problem yields a computed result, which is an approximation of the exact solution because of roundoff errors. In this paper, we present an approach to certify the computed solution. Here, “certify” means computing a guaranteed enclosure of the error between the computed, approximate, result and the exact, unknown result. We discuss an iterative refinement method: classically, such methods aim at computing an approximation of the error and they add it to the previous result to improve its accuracy. We add two ingredients: interval arithmetic is used to get an enclosure of the error instead of an approximation, and multiple precision is used to reach higher accuracy. We exemplify this approach on the certification of the solution of a linear system.
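One building block of such a certification, a guaranteed enclosure of the residual b − Ax̃, can be sketched with a very simple interval arithmetic. This is an assumption-laden illustration rather than the paper's method: the helper names are hypothetical, outward rounding is emulated by stepping one floating-point neighbour with `math.nextafter` (Python 3.9 or later) instead of switching rounding modes, and no multiple precision is used.

```python
import math


def down(v):
    """Next representable float below v (emulates rounding toward -infinity)."""
    return math.nextafter(v, -math.inf)


def up(v):
    """Next representable float above v (emulates rounding toward +infinity)."""
    return math.nextafter(v, math.inf)


def iv_add(a, b):
    """Enclosure of the sum of two intervals given as (lo, hi) pairs."""
    return (down(a[0] + b[0]), up(a[1] + b[1]))


def iv_scale(c, a):
    """Enclosure of the scalar c times the interval a."""
    p, q = c * a[0], c * a[1]
    return (down(min(p, q)), up(max(p, q)))


def residual_enclosure(A, x, b):
    """Intervals guaranteed to contain each component of b - A x."""
    out = []
    for row, bi in zip(A, b):
        acc = (bi, bi)  # b_i as a point interval
        for aij, xj in zip(row, x):
            acc = iv_add(acc, iv_scale(-aij, (xj, xj)))
        out.append(acc)
    return out
```

Solving the system for this interval right-hand side would then yield an enclosure of the error rather than a mere approximation of it; that step is omitted here.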

