Results 1–10 of 37
The Landscape of Parallel Computing Research: A View from Berkeley
Technical Report, UC Berkeley, 2006
SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems
ACM Trans. Mathematical Software, 2003
Cited by 87 (17 self)
Abstract:
We present the main algorithmic features in the software package SuperLU_DIST, a distributed-memory sparse direct solver for large sets of linear equations. We give in detail our parallelization strategies, with a focus on scalability issues, and demonstrate the software’s parallel performance and scalability on current machines. The solver is based on sparse Gaussian elimination, with an innovative static pivoting strategy proposed earlier by the authors. The main advantage of static pivoting over classical partial pivoting is that it permits a priori determination of data structures and communication patterns, which lets us exploit techniques used in parallel sparse Cholesky algorithms to better parallelize both LU decomposition and triangular solution on large-scale distributed machines.
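The factor-once, solve-many workflow described in this abstract can be tried from Python, since SciPy wraps the sequential, shared-memory SuperLU library. This is only a minimal sketch of the analogous interface, not the distributed-memory SuperLU_DIST itself:

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

# Sparse LU factorization and solve with SuperLU via SciPy.
# SciPy wraps the sequential SuperLU, not SuperLU_DIST, but the
# workflow (factor once, then reuse the factors for cheap
# triangular solves) is analogous.
A = csc_matrix(np.array([[4.0, 1.0, 0.0],
                         [2.0, 5.0, 1.0],
                         [0.0, 1.0, 3.0]]))
b = np.array([1.0, 2.0, 3.0])

lu = splu(A)        # one-time sparse LU factorization (with pivoting)
x = lu.solve(b)     # triangular solves reuse the computed factors
```

The factorization object can be reused for many right-hand sides, which is where direct solvers pay off over iterative methods.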
Accurate Sum and Dot Product
SIAM J. Sci. Comput., 2005
Cited by 64 (5 self)
Abstract:
Algorithms for summation and dot product of floating point numbers are presented which are fast in terms of measured computing time. We show that the computed results are as accurate as if computed in twice or K-fold working precision, K ≥ 3. For twice the working precision our algorithms for summation and dot product are some 40% faster than the corresponding XBLAS routines while sharing similar error estimates. Our algorithms are widely applicable because they require only addition, subtraction and multiplication of floating point numbers in the same working precision as the given data. No higher precision is necessary, the algorithms are straight-line loops without branches, and no access to mantissa or exponent is required.
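The building block behind such algorithms is the error-free transformation of a sum: two floats a and b are replaced by their rounded sum and the exact rounding error. A hedged sketch, using Knuth's TwoSum and a compensated loop in the spirit of the paper's Sum2 algorithm (function names are ours):

```python
def two_sum(a, b):
    # Knuth's error-free transformation: a + b == s + e exactly,
    # using only working-precision additions and subtractions.
    s = a + b
    bv = s - a
    e = (a - (s - bv)) + (b - bv)
    return s, e

def sum2(values):
    # Compensated summation: accumulate the rounding errors alongside
    # the running sum; the result is as accurate as if computed in
    # twice the working precision.
    s = 0.0
    sigma = 0.0
    for x in values:
        s, e = two_sum(s, x)
        sigma += e
    return s + sigma
```

Note that the loop is branch-free and touches no mantissa or exponent bits, which is exactly the property the abstract highlights.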
Error bounds from extra precise iterative refinement
ACM Transactions on Mathematical Software, 2006
Cited by 29 (6 self)
Abstract:
We present the design and testing of an algorithm for iterative refinement of the solution of linear equations, where the residual is computed with extra precision. This algorithm was originally proposed in the 1960s [6, 22] as a means to compute very accurate solutions to all but the most ill-conditioned linear systems of equations. However, two obstacles have until now prevented its adoption in standard subroutine libraries like LAPACK: (1) there was no standard way to access the higher precision arithmetic needed to compute residuals, and (2) it was unclear how to compute a reliable error bound for the computed solution. The completion of the new BLAS Technical Forum Standard [5] has recently removed the first obstacle. To overcome the second obstacle, we show how a single application of iterative refinement can be used to compute an error bound in any norm at small cost, and use this to compute both an error bound in the usual infinity norm, and a componentwise relative error bound. We report extensive test results on over 6.2 million matrices of dimension 5, 10, 100, and 1000. As long as a normwise (resp. componentwise) condition number computed by the algorithm is less than 1/(max{10, √n}·εw), the computed normwise (resp. componentwise) error bound is at most …
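The core loop of iterative refinement is short: solve once in working precision, then repeatedly compute the residual in higher precision and solve for a correction with the same factorization. A minimal sketch, with float32 as the working precision and float64 standing in for the extra precision (the names and iteration count are illustrative, not the LAPACK routine):

```python
import numpy as np

def iterative_refinement(A32, b32, iters=5):
    # Working precision: float32. Extra precision for residuals: float64.
    A64 = A32.astype(np.float64)
    b64 = b32.astype(np.float64)
    x = np.linalg.solve(A32, b32).astype(np.float64)  # initial solve, float32
    for _ in range(iters):
        r = b64 - A64 @ x                              # residual in float64
        d = np.linalg.solve(A32, r.astype(np.float32)) # correction in float32
        x = x + d.astype(np.float64)
    return x

# Small well-conditioned demo system (entries exactly representable
# in float32, so no data is lost in the cast).
A64 = np.array([[10.0, 1.0, 0.0, 0.0],
                [1.0, 10.0, 1.0, 0.0],
                [0.0, 1.0, 10.0, 1.0],
                [0.0, 0.0, 1.0, 10.0]])
b64 = np.ones(4)
x = iterative_refinement(A64.astype(np.float32), b64.astype(np.float32))
```

In a production setting the low-precision solve would reuse one LU factorization rather than refactor each iteration; that reuse is what makes refinement cheap relative to the initial solve.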
Analysis and comparison of two general sparse solvers for distributed memory computers
ACM Transactions on Mathematical Software, 2001
Cited by 20 (7 self)
Abstract:
This paper provides a comprehensive study and comparison of two state-of-the-art direct solvers for large sparse sets of linear equations on large-scale distributed-memory computers. One is a multifrontal solver called MUMPS, the other is a supernodal solver called SuperLU. We describe the main algorithmic features of the two solvers and compare their performance characteristics with respect to uniprocessor speed, interprocessor communication, and memory requirements. For both solvers, preorderings for numerical stability and sparsity play an important role in achieving high parallel efficiency. We analyse the results with various ordering algorithms. Our performance analysis is based on data obtained from runs on a 512-processor Cray T3E using a set of matrices from real applications. We also use regular 3D grid problems to study the scalability of the two solvers.
Using GPUs to improve multigrid solver performance on a cluster
J. of Computational Science and Engineering, 2008
Cited by 12 (6 self)
Abstract:
This article explores the coupling of coarse- and fine-grained parallelism for Finite Element simulations based on efficient parallel multigrid solvers. The focus lies on both system performance and a minimally invasive integration of hardware acceleration into an existing software package, requiring no changes to application code. We demonstrate the viability of our approach by using commodity graphics processors (GPUs), chosen for their excellent price/performance ratio, as efficient multigrid preconditioners. We address the issue of limited precision on GPUs by applying a mixed precision, iterative refinement technique. Other restrictions are also handled by a close interplay between the GPU and CPU. From a software perspective, we integrate the GPU solvers into the existing MPI-based Finite Element package by implementing the same interfaces as the CPU solvers, so that for the application programmer they are easily interchangeable. Our results show that we do not compromise any software functionality and gain speedups of two or more for large problems. Equipped with this additional option of hardware acceleration we compare different choices in increasing the performance of a conventional, commodity based cluster by increasing the number …
Compensated Horner Scheme
2005
Cited by 9 (3 self)
Abstract:
We present a compensated Horner scheme, that is, an accurate and fast algorithm to evaluate univariate polynomials in floating point arithmetic. The accuracy of the computed result is similar to the one given by the Horner scheme computed in twice the working precision. This compensated Horner scheme runs at least as fast as existing implementations producing the same output accuracy. We also propose to compute, in pure floating point arithmetic, a valid error estimate that bounds the actual accuracy of the compensated evaluation. Numerical experiments involving ill-conditioned polynomials illustrate these results. All algorithms are performed at a given working precision and are portable assuming the floating point arithmetic satisfies the IEEE 754 standard.
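The idea is to run the ordinary Horner recurrence while capturing the rounding error of each multiplication and addition with error-free transformations, then evaluate the accumulated error polynomial alongside. A self-contained sketch using Knuth's TwoSum and Dekker's TwoProduct (function names are ours, and real implementations would use an FMA-based TwoProduct where available):

```python
def two_sum(a, b):
    # Error-free transformation: a + b == s + e exactly.
    s = a + b
    bv = s - a
    e = (a - (s - bv)) + (b - bv)
    return s, e

def split(a):
    # Dekker's splitting of a double into two non-overlapping halves.
    c = 134217729.0 * a  # 2**27 + 1 for IEEE 754 double precision
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a, b):
    # Error-free transformation: a * b == p + e exactly.
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e

def comp_horner(coeffs, x):
    # Compensated Horner scheme; coeffs are ordered highest degree first.
    # The compensation term c carries the exact per-step rounding errors
    # through the same Horner recurrence.
    s = coeffs[0]
    c = 0.0
    for a in coeffs[1:]:
        p, pi = two_prod(s, x)
        s, sigma = two_sum(p, a)
        c = c * x + (pi + sigma)
    return s + c
```

The final correction `s + c` is what yields accuracy comparable to Horner evaluated in twice the working precision.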
Extra-precise iterative refinement for overdetermined least squares problems
2007
Cited by 9 (1 self)
Abstract:
We present the algorithm, error bounds, and numerical results for extra-precise iterative refinement applied to overdetermined linear least squares (LLS) problems. We apply our linear system refinement algorithm to Björck’s augmented linear system formulation of an LLS problem. Our algorithm reduces the forward normwise and componentwise errors to O(ε) unless the system is too ill-conditioned. In contrast to linear systems, we provide two separate error bounds for the solution x and the residual r. The refinement algorithm requires only limited use of extra precision and adds only O(mn) work to the O(mn²) cost of QR factorization for problems of size m-by-n. The extra precision calculation is facilitated by the new extended-precision BLAS standard in a portable way, and the refinement algorithm will be included in a future release of LAPACK and can be extended to other types of least squares problems.
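Björck's augmented system treats min ‖b − Ax‖₂ as the square linear system [I, A; Aᵀ, 0][r; x] = [b; 0], so both r and x are refined together. A simplified sketch, reusing the thin QR factors for each correction solve, with float64 residuals standing in for the paper's extra precision (this is an illustration of the formulation, not the exact LAPACK algorithm):

```python
import numpy as np

def lls_refine(A, b, iters=3):
    # Refinement on Björck's augmented system
    #   [ I   A ] [ r ]   [ b ]
    #   [ A^T 0 ] [ x ] = [ 0 ]
    Q, R = np.linalg.qr(A)               # factored once, O(m n^2)
    x = np.linalg.solve(R, Q.T @ b)      # initial LLS solution
    r = b - A @ x                        # initial residual
    for _ in range(iters):               # each sweep costs O(m n)
        f = b - r - A @ x                # top-block residual
        g = -A.T @ r                     # bottom-block residual
        # Correction system: dr + A dx = f,  A^T dr = g.
        # Eliminating dr and using A = QR gives R dx = Q^T f - R^{-T} g.
        t = np.linalg.solve(R.T, g)
        dx = np.linalg.solve(R, Q.T @ f - t)
        dr = f - A @ dx
        x = x + dx
        r = r + dr
    return x, r

# Small overdetermined demo: fit a line to four points.
A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
b = np.array([1.0, 2.0, 2.0, 5.0])
x, r = lls_refine(A, b)
```

Refining r explicitly is what lets the method report a separate, tight error bound for the residual, as the abstract emphasizes.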
Pipelined mixed precision algorithms on FPGAs for fast and accurate PDE solvers from low precision components
In Proc. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2006
Cited by 9 (2 self)
Abstract:
FPGAs are becoming more and more attractive for high precision scientific computations. One of the main problems in efficient resource utilization is the quadratically growing resource usage of multipliers depending on the operand size. Many research efforts have been devoted to the optimization of individual arithmetic and linear algebra operations. In this paper we take a higher level approach and seek to reduce the intermediate computational precision on the algorithmic level by optimizing the accuracy towards the final result of an algorithm. In our case this is the accurate solution of partial differential equations (PDEs). Using the Poisson problem as a typical PDE example, we show that most intermediate operations can be computed with floats or even smaller formats and only very few operations (e.g. 1%) must be performed in double precision to obtain the same accuracy as a full double precision solver. Thus the FPGA can be configured with many parallel float rather than few resource-hungry double operations. To achieve this, we adapt the general concept of mixed precision iterative refinement methods to FPGAs and develop a fully pipelined version of the Conjugate Gradient solver. We combine this solver with different iterative refinement schemes and precision combinations to obtain resource-efficient mappings of the pipelined algorithm core onto the FPGA.
Accurate floating-point summation
2005
Cited by 8 (0 self)
Abstract:
Given a vector of floating-point numbers with exact sum s, we present an algorithm for calculating a faithful rounding of s into the set of floating-point numbers, i.e. one of the immediate floating-point neighbors of s. If s is a floating-point number, we prove that it is the result of our algorithm. The algorithm adapts to the condition number of the sum, i.e. it is very fast for mildly conditioned sums, with computing time increasing slowly in proportion to the condition number. All statements are also true in the presence of underflow. Furthermore, algorithms with K-fold accuracy are derived, where the result is stored in a vector of K floating-point numbers. We also present an algorithm for rounding the sum s to the nearest floating-point number. Our algorithms are fast in terms of measured computing time because they require no special operations such as access to mantissa or exponent, contain no branches in the inner loop, and need no extra precision: the only operations used are standard floating-point addition, subtraction and multiplication in one working precision, for example double precision. Moreover, in contrast to other approaches, the algorithms are ideally suited for parallelization. We also sketch dot product algorithms with similar properties.
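A flavor of the underlying machinery is the error-free vector transformation: each pass rewrites the vector so that its exact sum is unchanged but the rounding errors are exposed as separate entries. A simplified fixed-K distillation sketch (not the paper's adaptive, faithfully rounding algorithm; function names are ours):

```python
def two_sum(a, b):
    # Error-free transformation (Knuth): a + b == s + e exactly.
    s = a + b
    bv = s - a
    e = (a - (s - bv)) + (b - bv)
    return s, e

def vec_sum(p):
    # One error-free pass: the returned vector has the same exact sum
    # as the input, with the dominant part accumulated in the last entry.
    q = list(p)
    for i in range(1, len(q)):
        q[i], q[i - 1] = two_sum(q[i], q[i - 1])
    return q

def sum_k(p, k=3):
    # Distillation: K-1 error-free passes, then an ordinary sum, gives a
    # result as if computed in roughly K-fold working precision.
    for _ in range(k - 1):
        p = vec_sum(p)
    return sum(p)
```

As in the paper, the inner loop is branch-free and uses only working-precision addition and subtraction, so the passes vectorize and parallelize well.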