Using GPUs to improve multigrid solver performance on a cluster
 J. OF COMPUTATIONAL SCIENCE AND ENGINEERING
, 2008
Abstract
Cited by 13 (6 self)
This article explores the coupling of coarse- and fine-grained parallelism for Finite Element simulations based on efficient parallel multigrid solvers. The focus lies on both system performance and a minimally invasive integration of hardware acceleration into an existing software package, requiring no changes to application code. Because of their excellent price/performance ratio, we demonstrate the viability of our approach by using commodity graphics processors (GPUs) as efficient multigrid preconditioners. We address the issue of limited precision on GPUs by applying a mixed precision, iterative refinement technique. Other restrictions are also handled by a close interplay between the GPU and CPU. From a software perspective, we integrate the GPU solvers into the existing MPI-based Finite Element package by implementing the same interfaces as the CPU solvers, so that for the application programmer they are easily interchangeable. Our results show that we do not compromise any software functionality and gain speedups of two and more for large problems. Equipped with this additional option of hardware acceleration we compare different choices in increasing the performance of a conventional, commodity-based cluster by increasing the number
Pipelined mixed precision algorithms on FPGAs for fast and accurate PDE solvers from low precision components
In IEEE Proceedings on Field-Programmable Custom Computing Machines (FCCM)
, 2006
Abstract
Cited by 10 (2 self)
FPGAs are becoming more and more attractive for high precision scientific computations. One of the main problems in efficient resource utilization is the quadratically growing resource usage of multipliers depending on the operand size. Many research efforts have been devoted to the optimization of individual arithmetic and linear algebra operations. In this paper we take a higher level approach and seek to reduce the intermediate computational precision on the algorithmic level by optimizing the accuracy towards the final result of an algorithm. In our case this is the accurate solution of partial differential equations (PDEs). Using the Poisson problem as a typical PDE example we show that most intermediate operations can be computed with floats or even smaller formats and only very few operations (e.g. 1%) must be performed in double precision to obtain the same accuracy as a full double precision solver. Thus the FPGA can be configured with many parallel float rather than few resource-hungry double operations. To achieve this, we adapt the general concept of mixed precision iterative refinement methods to FPGAs and develop a fully pipelined version of the Conjugate Gradient solver. We combine this solver with different iterative refinement schemes and precision combinations to obtain resource efficient mappings of the pipelined algorithm core onto the FPGA.
Extra-precise iterative refinement for overdetermined least squares problems
, 2007
Abstract
Cited by 9 (1 self)
We present the algorithm, error bounds, and numerical results for extra-precise iterative refinement applied to overdetermined linear least squares (LLS) problems. We apply our linear system refinement algorithm to Björck’s augmented linear system formulation of an LLS problem. Our algorithm reduces the forward normwise and componentwise errors to O(ε) unless the system is too ill-conditioned. In contrast to linear systems, we provide two separate error bounds for the solution x and the residual r. The refinement algorithm requires only limited use of extra precision and adds only O(mn) work to the O(mn²) cost of QR factorization for problems of size m-by-n. The extra precision calculation is facilitated by the new extended-precision BLAS standard in a portable way, and the refinement algorithm will be included in a future release of LAPACK and can be extended to the other types of least squares problems.
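Björck's augmented-system refinement described above can be sketched in NumPy. This is a hypothetical illustration, not the paper's algorithm: `np.longdouble` stands in for the extended-precision BLAS accumulation (its extra width is platform-dependent), and a thin QR factorization is reused for the correction solves.

```python
import numpy as np

def lls_refine(A, b, steps=3):
    """Iterative refinement for min ||b - A x|| via the augmented system
    [I A; A^T 0][r; x] = [b; 0], with residuals accumulated in extra
    precision (np.longdouble here, as a stand-in for XBLAS)."""
    m, n = A.shape
    Q, R = np.linalg.qr(A)               # thin QR, reused for every correction
    x = np.linalg.solve(R, Q.T @ b)      # initial least squares solution
    r = b - A @ x                        # initial residual
    Al, bl = A.astype(np.longdouble), b.astype(np.longdouble)
    for _ in range(steps):
        # residuals of the augmented system, computed in extra precision
        f = (bl - r.astype(np.longdouble) - Al @ x).astype(np.float64)
        g = (-(Al.T @ r.astype(np.longdouble))).astype(np.float64)
        # corrections: eliminate dr to get R^T R dx = A^T f - g
        dx = np.linalg.solve(R, np.linalg.solve(R.T, A.T @ f - g))
        dr = f - A @ dx
        x, r = x + dx, r + dr
    return x, r
```

Note that, as in the paper, both the solution x and the residual r are refined, so separate accuracy statements can be made about each.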
MODIFIED GRAM–SCHMIDT (MGS), LEAST SQUARES, AND BACKWARD STABILITY OF MGS-GMRES
, 2006
Abstract
Cited by 9 (1 self)
The generalized minimum residual method (GMRES) [Y. Saad and M. Schultz, SIAM J. Sci. Statist. Comput., 7 (1986), pp. 856–869] for solving linear systems Ax = b is implemented as a sequence of least squares problems involving Krylov subspaces of increasing dimensions. The most usual implementation is modified Gram–Schmidt GMRES (MGS-GMRES). Here we show that MGS-GMRES is backward stable. The result depends on a more general result on the backward stability of a variant of the MGS algorithm applied to solving a linear least squares problem, and uses other new results on MGS and its loss of orthogonality, together with an important but neglected condition number, and a relation between residual norms and certain singular values.
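For reference, the MGS algorithm whose stability the paper analyzes can be sketched as a textbook QR factorization in NumPy (a generic sketch, not the paper's analyzed least-squares variant): each column is normalized and its projection is subtracted from all remaining columns immediately, which keeps the loss of orthogonality far smaller than in classical Gram–Schmidt.

```python
import numpy as np

def mgs(A):
    # Modified Gram-Schmidt QR: A = Q R with Q having orthonormal columns
    # and R upper triangular. Projections are removed column by column.
    A = A.astype(np.float64).copy()
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        R[k, k] = np.linalg.norm(A[:, k])
        Q[:, k] = A[:, k] / R[k, k]
        for j in range(k + 1, n):
            R[k, j] = Q[:, k] @ A[:, j]       # coefficient against column k
            A[:, j] -= R[k, j] * Q[:, k]      # deflate remaining columns now
    return Q, R
```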
Reducing the influence of tiny normwise relative errors on performance profiles
 Manchester Institute for Mathematical Sciences, The University of Manchester
, 2011
Accelerating Scientific Computations with Mixed Precision Algorithms
, 2008
Abstract
Cited by 6 (0 self)
On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented.
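The mixed precision idea running through these abstracts can be illustrated in a few lines of NumPy: solve in float32 (cheap), accumulate the residual and the solution in float64 (accurate). This is a sketch under simplifying assumptions — a real code would LU-factor the single-precision matrix once and reuse the factors, whereas `np.linalg.solve` refactors on every call.

```python
import numpy as np

def solve_mixed(A, b, tol=1e-12, max_iter=20):
    # Mixed precision iterative refinement for A x = b:
    # corrections come from float32 solves, residuals from float64.
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                       # residual in double precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break                           # converged to double accuracy
        d = np.linalg.solve(A32, r.astype(np.float32))  # cheap correction
        x += d.astype(np.float64)
    return x
```

For a well-conditioned system, each pass shrinks the error by roughly a factor of cond(A) times single-precision epsilon, so a handful of iterations recovers full double-precision accuracy from single-precision solves.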
REDUCING FLOATING POINT ERROR IN DOT PRODUCT USING THE SUPERBLOCK FAMILY OF ALGORITHMS
, 2008
Abstract
Cited by 5 (1 self)
This paper discusses both the theoretical and statistical errors obtained by various well-known dot products, from the canonical to pairwise algorithms, and introduces a new and more general framework that we have named superblock, which subsumes them and permits a practitioner to make tradeoffs between computational performance, memory usage, and error behavior. We show that algorithms with lower error bounds tend to behave noticeably better in practice. Unlike many such error-reducing algorithms, superblock requires no additional floating point operations and should be implementable with little to no performance loss, making it suitable for use as a performance-critical building block of a linear algebra kernel.
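The two extremes that the superblock framework interpolates between can be sketched as follows (the superblock scheme itself, which blends blocked sequential accumulation with pairwise combination, is not reproduced here):

```python
import numpy as np

def dot_canonical(x, y):
    # Left-to-right accumulation: worst-case error bound grows like O(n).
    s = 0.0
    for xi, yi in zip(x, y):
        s += xi * yi
    return s

def dot_pairwise(x, y):
    # Recursive halving: worst-case error bound grows like O(log n), using
    # no extra floating point operations over the canonical loop.
    n = len(x)
    if n <= 8:                        # small sequential base case
        return dot_canonical(x, y)
    h = n // 2
    return dot_pairwise(x[:h], y[:h]) + dot_pairwise(x[h:], y[h:])
```

Both perform exactly n multiplies and n - 1 adds; only the association order, and hence the error accumulation, differs.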
Prospectus for the Next LAPACK and ScaLAPACK Libraries
Abstract
Cited by 2 (0 self)
Dense linear algebra (DLA) forms the core of many scientific computing applications. Consequently, there is continuous interest and demand for the development of increasingly better algorithms in the field. Here ‘better’ has a broad meaning, and includes improved reliability, accuracy, robustness, ease of use, and
Error Analysis of Various Forms of Floating Point Dot Products
, 2007
Abstract
Cited by 1 (1 self)
Abstract. This paper discusses both the theoretical and statistical errors obtained by various dot product algorithms. A host of linear algebra methods derive their error behavior directly from dot product. In particular, most high performance dense systems derive their performance and error behavior overwhelmingly from matrix multiply, and matrix multiply’s error behavior is almost wholly attributable to the underlying dot product that it is built from (sparse problems usually have a similar relationship with matrix-vector multiply, which can also be built from the dot product). With the expansion of standard workstations to 64-bit memories and multicore processors, much larger calculations are possible on even simple desktop machines than ever before. Parallel machines built from these hugely expanded nodes can solve problems of almost unlimited size. Therefore, assumptions about limited problem size that used to bound the linear rise in worst-case error due to canonical dot products can no longer be assumed to be true today, and will certainly not be true in the near future. Therefore, this paper discusses several implementations of dot product, their theoretical and achieved error bounds, and their suitability for use as performance-critical building block linear algebra kernels. Key words. Dot product, inner product, error analysis, BLAS, ATLAS