Results 1–10 of 67
The Landscape of Parallel Computing Research: A View from Berkeley
Technical Report, UC Berkeley, 2006
SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems
ACM Trans. Mathematical Software, 2003
Cited by 144 (18 self)
Abstract
We present the main algorithmic features in the software package SuperLU_DIST, a distributed-memory sparse direct solver for large sets of linear equations. We give in detail our parallelization strategies, with a focus on scalability issues, and demonstrate the software's parallel performance and scalability on current machines. The solver is based on sparse Gaussian elimination, with an innovative static pivoting strategy proposed earlier by the authors. The main advantage of static pivoting over classical partial pivoting is that it permits a priori determination of data structures and communication patterns, which lets us exploit techniques used in parallel sparse Cholesky algorithms to better parallelize both LU decomposition and triangular solution on large-scale distributed machines.
Accurate Sum and Dot Product
SIAM J. Sci. Comput., 2005
Cited by 94 (15 self)
Abstract
Algorithms for summation and dot product of floating-point numbers are presented which are fast in terms of measured computing time. We show that the computed results are as accurate as if computed in twice or K-fold working precision, K ≥ 3. For twice the working precision, our algorithms for summation and dot product are some 40% faster than the corresponding XBLAS routines while sharing similar error estimates. Our algorithms are widely applicable because they require only addition, subtraction and multiplication of floating-point numbers in the same working precision as the given data. Higher precision is unnecessary, the algorithms are straight loops without branches, and no access to mantissa or exponent is necessary.
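The error-free transformation underlying this kind of compensated summation can be sketched in a few lines (a minimal Python illustration, not the authors' code; `two_sum` and `sum2` are my names for the classic TwoSum and Sum2 building blocks the abstract alludes to):

```python
def two_sum(a, b):
    """Error-free transformation (Knuth): returns (s, e) with a + b = s + e exactly."""
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e

def sum2(p):
    """Compensated summation: result as accurate as if computed in doubled precision."""
    s = 0.0
    sigma = 0.0  # accumulated rounding errors of the partial sums
    for x in p:
        s, e = two_sum(s, x)
        sigma += e
    return s + sigma

# An ill-conditioned sum where naive summation loses all accuracy:
data = [1e16, 1.0, -1e16]
naive = sum(data)   # 0.0: the 1.0 is absorbed and then cancelled
exact = sum2(data)  # 1.0: the rounding error is recovered
```

Note that the whole routine is a straight loop over standard additions, subtractions and one working precision, exactly the property the abstract emphasizes.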
Accurate floating-point summation part I: Faithful rounding
SIAM J. Sci. Comput.
Cited by 40 (9 self)
Abstract
Given a vector of floating-point numbers with exact sum s, we present an algorithm for calculating a faithful rounding of s, i.e. the result is one of the immediate floating-point neighbors of s. If the sum s is a floating-point number, we prove that this is the result of our algorithm. The algorithm adapts to the condition number of the sum, i.e. it is fast for mildly conditioned sums, with computing time increasing slowly in proportion to the logarithm of the condition number. All statements are also true in the presence of underflow. The algorithm does not depend on the exponent range. Our algorithm is fast in terms of measured computing time because it allows good instruction-level parallelism: it requires no special operations such as access to mantissa or exponent, contains no branch in the inner loop, and needs no extra precision. The only operations used are standard floating-point addition, subtraction and multiplication in one working precision, for example double precision. Certain constants used in the algorithm are proved to be optimal.
Error bounds from extra precise iterative refinement
ACM Transactions on Mathematical Software, 2006
Cited by 35 (5 self)
Abstract
We present the design and testing of an algorithm for iterative refinement of the solution of linear equations, where the residual is computed with extra precision. This algorithm was originally proposed in the 1960s [6, 22] as a means to compute very accurate solutions to all but the most ill-conditioned linear systems of equations. However, two obstacles have until now prevented its adoption in standard subroutine libraries like LAPACK: (1) there was no standard way to access the higher precision arithmetic needed to compute residuals, and (2) it was unclear how to compute a reliable error bound for the computed solution. The completion of the new BLAS Technical Forum Standard [5] has recently removed the first obstacle. To overcome the second obstacle, we show how a single application of iterative refinement can be used to compute an error bound in any norm at small cost, and use this to compute both an error bound in the usual infinity norm, and a componentwise relative error bound. We report extensive test results on over 6.2 million matrices of dimension 5, 10, 100, and 1000. As long as a normwise (resp. componentwise) condition number computed by the algorithm is less than 1/max{10, √n}εw, the computed normwise (resp. componentwise) error bound is at most ...
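The refinement loop this abstract describes can be sketched as follows (a hedged NumPy illustration with float32 as the working precision and float64 for the extra-precise residual; this is not the LAPACK implementation and omits the paper's error-bound computation, and the test matrix is invented for illustration):

```python
import numpy as np

def refine(A, b, iters=5):
    """Iterative refinement: factor and solve in float32, residual in float64."""
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                    # extra-precise residual
        d = np.linalg.solve(A32, r.astype(np.float32))   # correction in working precision
        x = x + d.astype(np.float64)
    return x

# Well-conditioned random test system (hypothetical data, not from the paper):
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50)) + 50.0 * np.eye(50)
x_true = rng.standard_normal(50)
b = A @ x_true
x = refine(A, b)
```

For a well-conditioned system like this, the refined solution reaches roughly double-precision accuracy even though every solve happens in single precision, which is the point of the scheme.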
Using GPUs to improve multigrid solver performance on a cluster
J. of Computational Science and Engineering, 2008
Cited by 31 (7 self)
Abstract
This article explores the coupling of coarse- and fine-grained parallelism for Finite Element simulations based on efficient parallel multigrid solvers. The focus lies on both system performance and a minimally invasive integration of hardware acceleration into an existing software package, requiring no changes to application code. We demonstrate the viability of our approach by using commodity graphics processors (GPUs), with their excellent price/performance ratio, as efficient multigrid preconditioners. We address the issue of limited precision on GPUs by applying a mixed precision, iterative refinement technique. Other restrictions are also handled by a close interplay between the GPU and CPU. From a software perspective, we integrate the GPU solvers into the existing MPI-based Finite Element package by implementing the same interfaces as the CPU solvers, so that for the application programmer they are easily interchangeable. Our results show that we do not compromise any software functionality and gain speedups of two and more for large problems. Equipped with this additional option of hardware acceleration, we compare different choices for increasing the performance of a conventional, commodity-based cluster by increasing the number ...
Analysis and comparison of two general sparse solvers for distributed memory computers
ACM Transactions on Mathematical Software, 2001
Cited by 23 (7 self)
Abstract
This paper provides a comprehensive study and comparison of two state-of-the-art direct solvers for large sparse sets of linear equations on large-scale distributed-memory computers. One is a multifrontal solver called MUMPS, the other is a supernodal solver called SuperLU. We describe the main algorithmic features of the two solvers and compare their performance characteristics with respect to uniprocessor speed, interprocessor communication, and memory requirements. For both solvers, preorderings for numerical stability and sparsity play an important role in achieving high parallel efficiency. We analyse the results with various ordering algorithms. Our performance analysis is based on data obtained from runs on a 512-processor Cray T3E using a set of matrices from real applications. We also use regular 3D grid problems to study the scalability of the two solvers.
Pipelined mixed precision algorithms on FPGAs for fast and accurate PDE solvers from low precision components
In IEEE Proceedings on Field-Programmable Custom Computing Machines (FCCM), 2006
Cited by 17 (3 self)
Abstract
FPGAs are becoming more and more attractive for high precision scientific computations. One of the main problems in efficient resource utilization is the resource usage of multipliers, which grows quadratically with operand size. Many research efforts have been devoted to the optimization of individual arithmetic and linear algebra operations. In this paper we take a higher-level approach and seek to reduce the intermediate computational precision at the algorithmic level by optimizing the accuracy towards the final result of an algorithm, in our case the accurate solution of partial differential equations (PDEs). Using the Poisson problem as a typical PDE example, we show that most intermediate operations can be computed with floats or even smaller formats, and only very few operations (e.g. 1%) must be performed in double precision to obtain the same accuracy as a full double precision solver. Thus the FPGA can be configured with many parallel float units rather than few resource-hungry double units. To achieve this, we adapt the general concept of mixed precision iterative refinement methods to FPGAs and develop a fully pipelined version of the Conjugate Gradient solver. We combine this solver with different iterative refinement schemes and precision combinations to obtain resource-efficient mappings of the pipelined algorithm core onto the FPGA.
Integrated Multiquadric Radial Basis Function Approximation Methods
Cited by 16 (6 self)
Abstract
Promising numerical results using once- and twice-integrated radial basis functions have recently been presented. In this work we investigate the integrated radial basis function (IRBF) concept in greater detail, connect it to existing RBF theory, and make conjectures about the properties of IRBF approximation methods. The IRBF methods are used to solve PDEs. © 2006
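For context, plain (non-integrated) multiquadric RBF interpolation in 1D looks like this (an illustrative NumPy sketch of the standard method the paper builds on, not the IRBF method itself; the shape parameter c, the centers, and the test function are arbitrary choices of mine):

```python
import numpy as np

def mq_interpolant(xc, f, c=0.1):
    """Multiquadric RBF interpolation in 1D with basis phi(r) = sqrt(r**2 + c**2)."""
    r = np.abs(xc[:, None] - xc[None, :])
    Phi = np.sqrt(r**2 + c**2)        # interpolation matrix (nonsingular for MQ)
    w = np.linalg.solve(Phi, f)       # expansion coefficients
    def s(x):
        return np.sqrt((x[:, None] - xc[None, :])**2 + c**2) @ w
    return s

xc = np.linspace(0.0, 1.0, 11)        # interpolation centers
f = np.sin(2.0 * np.pi * xc)          # sampled data
s = mq_interpolant(xc, f)
```

The integrated variants studied in the paper replace phi by its once- or twice-integrated form before fitting, which changes the conditioning and approximation properties of the system above.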
Accurate floating-point summation part II: Sign, K-fold faithful and rounding to nearest
SIAM J. Sci. Comput.
Cited by 15 (5 self)
Abstract
In this Part II we first refine the analysis of error-free vector transformations presented in Part I. Based on that, we present an algorithm for calculating the rounded-to-nearest result of s := Σ pi for a given vector of floating-point numbers pi, as well as algorithms for directed rounding. A special algorithm for computing the sign of s is given, also working for huge dimensions. Assume a floating-point working precision with relative rounding error unit eps. We define and investigate a K-fold faithful rounding of a real number r. Basically, the result is stored in a vector Res_ν of K non-overlapping floating-point numbers such that Res_ν approximates r with relative accuracy eps^K, and replacing Res_K by its floating-point neighbors in Res_ν forms a lower and upper bound for r. For a given vector of floating-point numbers with exact sum s, we present an algorithm for calculating a K-fold faithful rounding of s using solely the working precision. Furthermore, an algorithm for calculating a faithfully rounded result of the sum of a vector of huge dimension is presented. Our algorithms are fast in terms of measured computing time because they allow good instruction-level parallelism: they require no special operations such as access to mantissa or exponent, contain no branch in the inner loop, and need no extra precision. The only operations used are standard floating-point addition, subtraction and multiplication in one working precision, for example double precision. Certain constants used in the algorithms are proved to be optimal.
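The error-free vector transformation this abstract refines can be illustrated by a basic cascaded-TwoSum pass, often called VecSum (a minimal Python sketch with my own naming; the paper's algorithms build on repeated passes of this kind):

```python
import math

def vec_sum(p):
    """One error-free pass of cascaded TwoSum over a vector.
    The output has exactly the same sum as the input; its last entry equals the
    ordinary recursive floating-point sum, and the other entries hold the
    rounding errors of each addition."""
    q = list(p)
    for i in range(1, len(q)):
        s = q[i] + q[i - 1]
        t = s - q[i]
        q[i - 1] = (q[i] - (s - t)) + (q[i - 1] - t)  # exact error of this addition
        q[i] = s
    return q

p = [1e16, 1.0, -1e16, 0.1]
q = vec_sum(p)
```

Because the transformation preserves the exact sum while concentrating it in the last component, repeated passes let the sign, a faithful rounding, or a K-fold faithful rounding of the sum be extracted in working precision only.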