## Design, Implementation and Testing of Extended and Mixed Precision BLAS (2001)


Citations: 50 (11 self)

### BibTeX

@MISC{Li01design,

author = {X. S. Li and J. W. Demmel and D. H. Bailey and G. Henry and Y. Hida and J. Iskandar and W. Kahan and S. Y. Kang and A. Kapur and M. C. Martin and B. J. Thompson and T. Tung and D. J. Yoo},

title = {Design, Implementation and Testing of Extended and Mixed Precision BLAS},

year = {2001}

}


### Citations

2197 | The art of computer programming - Knuth - 1973 |

1323 |
GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems
- Saad, Schultz
- 1986
Citation Context ...f iterative refinement can be applied to other equation solving methods as well, both linear and nonlinear. The paper [49] replaces the solution of Ad = r via the LU factorization of A by GMRES(m) [46] to get the correction d, and solves a linear system arising from a discretized elliptic PDE as accurately as though the entire computation were done in double, but while doing nearly all the work in ... |
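The refinement loop described in this context can be sketched as follows (a minimal illustration, not the cited paper's code): the expensive solves run in single precision while residuals are accumulated in double. A dense direct solve stands in here for the GMRES(m) correction step of [49], and the function name `refine` is my own.

```python
import numpy as np

def refine(A, b, iters=5):
    """Mixed-precision iterative refinement (sketch).

    Factor/solve in single precision; compute the residual
    b - A*x in double and apply a single-precision correction.
    """
    A32 = A.astype(np.float32)
    # Initial solve entirely in single precision.
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                   # residual in double
        d = np.linalg.solve(A32, r.astype(np.float32))  # correction in single
        x = x + d.astype(np.float64)
    return x
```

A production code would factor `A32` once (e.g. an LU factorization) and reuse the factors for every correction; `np.linalg.solve` refactors on each call and is used here only for brevity.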

848 |
Accuracy and stability of numerical algorithms
- Higham
- 1996
Citation Context ...ACK currently implements this iterative refinement algorithm entirely in the input/output precision, such as in routine xGERFS [2]. What it accomplishes, according to a well understood error analysis [16, 27], is essentially to replace the condition number κ(A) in the error bound by another possibly smaller one, which can be as small as min_D κ(D · A), where the minimum is over all diagonal matrices D. In ... |

742 |
A set of level 3 basic linear algebra subprograms
- DONGARRA, DUCROZ, et al.
- 1990
Citation Context ...The Basic Linear Algebra Subprograms (BLAS) [36, 22, 21] have been widely used by both the linear algebra community and the application community to produce fast, portable and reliable software. Highly efficient machine-specific implementations of the BLAS... |

619 |
Mathematica: A System for Doing Mathematics by Computer. Addison-Wesley
- Wolfram
- 1988
Citation Context ... 3 discusses a variety of implementation techniques for precision beyond double. In particular, our design goal does not include a general extended precision capability as in systems like Mathematica [51], Maple [40] and others, where all variables, arithmetic operations, and intrinsic functions can have arbitrarily high precision. Instead, we are interested in enhancing conventional floating point al... |

532 |
Basic linear algebra subprograms for fortran usage
- Lawson, Hanson, et al.
- 1979
Citation Context ...The Basic Linear Algebra Subprograms (BLAS) [36, 22, 21] have been widely used by both the linear algebra community and the application community to produce fast, portable and reliable software. Highly efficient machine-specific implementations of the BLAS... |

526 | Applied numerical linear algebra - Demmel - 1997 |

446 | An extended set of Fortran basic linear algebra subroutines
- DONGARRA, DUCROZ, et al.
- 1988
Citation Context ...The Basic Linear Algebra Subprograms (BLAS) [36, 22, 21] have been widely used by both the linear algebra community and the application community to produce fast, portable and reliable software. Highly efficient machine-specific implementations of the BLAS... |

351 | ScaLAPACK Users’ Guide - Blackford, Choi, et al. - 1997 |

179 |
Variational iterative methods for nonsymmetric systems of linear equations
- Eisenstat, Elman, et al.
- 1983
Citation Context ...in single and so running much faster. The authors of [49] point out that their observation could have been applied to any other efficient, restarted iterative solver besides GMRES(m), such as GCR(m) [23], etc. The algorithm in [49] differs slightly from the one in Section 2.2 in that the right hand side b, the computed solution x, and the product A · x are kept in extended precision (double in their ... |

133 | Adaptive precision floating-point arithmetic and fast robust geometric predicates
- Shewchuk
- 1997
Citation Context ...ing-point hardware capabilities are exploited fully. These implementation techniques can be extended further to tripled- and quadrupled-double, but with diminishing efficiency except in special cases [11, 45, 47]. These techniques have also been used to compute dot products (α ∑_i x_i y_i + β r) with all correct leading digits, despite all roundoff [35, 9, 33]. We will not consider these very high precision imple... |

105 |
A Floating-point Technique for Extending the Available Precision
- Dekker
- 1971
Citation Context ...an order of magnitude slower than double but usually considerably faster than Quadruple. Double-double and its variations have an extensive history, with implementation ideas going back to the 1960s [44, 13, 29, 39, 45, 47, 11, 26, 6]. See especially [45] for a history. Our implementation is based on Bailey's [6], the main routines of which are shown in Figures 5 through 7 below. For the routines with other combinations of inp... |
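The building block behind the double-double schemes traced back to Dekker and others in this context is the error-free transformation: two doubles can be summed so that the rounding error is recovered exactly. A minimal sketch of Knuth's branch-free variant, in Python for illustration (the function name is mine, not from the cited codes):

```python
def two_sum(a, b):
    """Error-free transformation (Knuth): returns (s, err) with
    s = fl(a + b) and a + b = s + err exactly in IEEE double."""
    s = a + b
    bv = s - a                      # the part of b that made it into s
    err = (a - (s - bv)) + (b - bv)
    return s, err
```

For example, `two_sum(1.0, 2**-60)` returns `(1.0, 2**-60)`: the low word preserves exactly the bits that ordinary addition discards, which is how a double-double pair carries about 106 bits of significance.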

87 | SuperLU DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems - Li, Demmel |

71 |
Compatibility of approximate solution of linear equations with given error bounds for coefficients and right hand sides
- Oettli, Prager
- 1964
Citation Context ...hat there is no simple formula for the smallest E, e, f and U satisfying (20), even in the absence of underflow. If we did not round x on output, which would make e = 0, then Oettli-Prager's Theorem [41] would give a closed form formula for the smallest E and f satisfying (20), using the residual Tx − αb. This formula exists because E and f appear linearly in (20), and furthermore each row can be chos... |
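For reference, the Oettli–Prager result alluded to here, stated in the standard notation (A, b rather than the T, αb of the citation): given a computed solution x̂ with residual r = b − Ax̂, the smallest ω such that (A + δA)x̂ = b + δb for some |δA| ≤ ωE and |δb| ≤ ωf is

```latex
\omega \;=\; \max_i \frac{|r_i|}{\bigl(E\,|\hat{x}| + f\bigr)_i},
\qquad r = b - A\hat{x},
```

with the convention 0/0 = 0. The closed form exists precisely because the perturbations enter linearly, as the context notes.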

66 | A Fortran-90 Based Multiprecision System
- Bailey
- 1995
Citation Context ...g-point is built into automated algebra systems like Macsyma [28], Mathematica [51] and Maple [40], and provided to other programmers by packages like Richard Brent's MP [10] and David Bailey's MPFUN [5]. BigFloats are optimized for very high precision, and run so much slower than Double-double or Quadruple when delivering comparable precisions that BigFloats shall not be treated further in this docu... |

65 | Algorithms for arbitrary precision floating point arithmetic
- Priest
- 1991
Citation Context ...ing-point hardware capabilities are exploited fully. These implementation techniques can be extended further to tripled- and quadrupled-double, but with diminishing efficiency except in special cases [11, 45, 47]. These techniques have also been used to compute dot products (α ∑_i x_i y_i + β r) with all correct leading digits, despite all roundoff [35, 9, 33]. We will not consider these very high precision imple... |

48 | Fernando’s solution to Wilkinson’s problem: an application of double factorization. Linear Algebra and its Applications
- Parlett, Dhillon
- 1997
Citation Context ...mbinations of Feature 3: Wider internal range, Feature 4: Extra-wide variables, and Feature 5: Exception handling. We begin with the recently released LAPACK code xSYEVR [2], which uses a new algorithm [18, 17, 42, 43] that is significantly faster than any previous algorithm, running in O(n^2) time to find all the eigenvalues and all the eigenvectors of an n-by-n symmetric tridiagonal matrix. The core of the new a... |

47 |
A New O(N^2) Algorithm for the Symmetric Tridiagonal Eigenvalue/Eigenvector Problem
- Dhillon
- 2005
Citation Context ...mbinations of Feature 3: Wider internal range, Feature 4: Extra-wide variables, and Feature 5: Exception handling. We begin with the recently released LAPACK code xSYEVR [2], which uses a new algorithm [18, 17, 42, 43] that is significantly faster than any previous algorithm, running in O(n^2) time to find all the eigenvalues and all the eigenvectors of an n-by-n symmetric tridiagonal matrix. The core of the new a... |

45 | Computer Solution of Linear Algebraic Systems - Forsythe, Moler - 1967 |

44 | Faster numerical algorithms via exception handling. In
- Demmel, Li
- 1993
Citation Context ...m. This feature can let us use simpler and faster algorithms that rarely generate exceptions, and afterwards check for these exceptions and recompute the answer slowly and carefully only if necessary [15]. Unfortunately, languages and compilers do not support access to exception handling in a uniform way. The rest of this section lists a variety of compelling examples that exploit the arithmetic featu... |

35 | Underflow and the reliability of numerical software - Demmel - 1984 |

35 |
Accurate eigenvalues of a symmetric tri-diagonal matrix
- Kahan
- 1966
Citation Context ...an order of magnitude slower than double but usually considerably faster than Quadruple. Double-double and its variations have an extensive history, with implementation ideas going back to the 1960s [44, 13, 29, 39, 45, 47, 11, 26, 6]. See especially [45] for a history. Our implementation is based on Bailey's [6], the main routines of which are shown in Figures 5 through 7 below. For the routines with other combinations of inp... |

34 |
IEEE Standard for Binary Floating Point Arithmetic, Std 754-1985 edition
- ANSI/IEEE
- 1985
Citation Context ...ting BLAS. For the internal precision, we allow Single, Double, Indigenous, or Extra. In our reference implementation we assume that Single and Double are the corresponding IEEE floating point formats [3]. Indigenous means the widest hardware-supported format available. Its intent is to let the machine run close to top speed, while being as accurate as possible. On some machines this would be a 64-bit... |

34 |
Correction d’une somme en arithmétique à virgule flottante
- Pichat
- 1972
Citation Context ...an order of magnitude slower than double but usually considerably faster than Quadruple. Double-double and its variations have an extensive history, with implementation ideas going back to the 1960s [44, 13, 29, 39, 45, 47, 11, 26, 6]. See especially [45] for a history. Our implementation is based on Bailey's [6], the main routines of which are shown in Figures 5 through 7 below. For the routines with other combinations of inp... |

33 | Making sparse Gaussian elimination scalable by static pivoting
- Li, Demmel
- 1998
Citation Context ...e to have an algorithm as parallelizable as sparse Cholesky but applicable to completely general matrices. To do this we have designed a version of parallel sparse Gaussian elimination called SuperLU [37] that avoids dynamic pivoting, and so permits the same optimizations as parallel sparse Cholesky. To retain numerical stability, we use a variety of techniques including 1. prepivoting large matrix en... |

30 | How Java’s Floating-Point Hurts Everyone Everywhere - Kahan, Darcy - 1998 |

29 | A new O(n^2) algorithm for the symmetric tridiagonal eigenvalue/eigenvector problem - Dhillon - 1997 |


19 |
A Fortran Multiple Precision Arithmetic Package
- Brent
- 1978
Citation Context ...of software-simulated floating-point is built into automated algebra systems like Macsyma [28], Mathematica [51] and Maple [40], and provided to other programmers by packages like Richard Brent's MP [10] and David Bailey's MPFUN [5]. BigFloats are optimized for very high precision, and run so much slower than Double-double or Quadruple when delivering comparable precisions that BigFloats shall not be... |

16 | Efficient High Accuracy Solutions with GMRES(m) - Turner, Walker - 1992 |

15 |
Maple V Programming Guide for Release 5
- Monagan, Geddes, et al.
- 1997
Citation Context ...a variety of implementation techniques for precision beyond double. In particular, our design goal does not include a general extended precision capability as in systems like Mathematica [51], Maple [40] and others, where all variables, arithmetic operations, and intrinsic functions can have arbitrarily high precision. Instead, we are interested in enhancing conventional floating point algorithms wri... |

14 | Application of a new algorithm for the symmetric eigenproblem to computational quantum chemistry
- Dhillon, Fann, et al.
- 1997
Citation Context ...mbinations of Feature 3: Wider internal range, Feature 4: Extra-wide variables, and Feature 5: Exception handling. We begin with the recently released LAPACK code xSYEVR [2], which uses a new algorithm [18, 17, 42, 43] that is significantly faster than any previous algorithm, running in O(n^2) time to find all the eigenvalues and all the eigenvectors of an n-by-n symmetric tridiagonal matrix. The core of the new a... |

13 | Current inverse iteration software can fail - Dhillon - 1998 |

13 |
A new approach to scientific computation
- Kulisch, Miranker
- 1983
Citation Context ...with diminishing efficiency except in special cases [11, 45, 47]. These techniques have also been used to compute dot products (α ∑_i x_i y_i + β r) with all correct leading digits, despite all roundoff [35, 9, 33]. We will not consider these very high precision implementations further here. Quadruple precision. This is a conventional 128-bit = 16-byte wide floating-point format with one field each of sign bit,... |
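One software route to a dot product with (nearly) all correct leading digits, in the spirit of the techniques cited, is a compensated algorithm built from error-free sums and products (this follows the Ogita–Rump–Oishi scheme, not the long-accumulator approach of Kulisch and Miranker; the function names are mine):

```python
def two_sum(a, b):
    # Error-free sum: a + b = s + err exactly.
    s = a + b
    bv = s - a
    return s, (a - (s - bv)) + (b - bv)

def two_prod(a, b):
    # Error-free product via Dekker's splitting (no FMA needed).
    p = a * b
    c = 134217729.0 * a              # splitter 2**27 + 1
    ah = c - (c - a); al = a - ah
    c = 134217729.0 * b
    bh = c - (c - b); bl = b - bh
    err = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, err

def dot2(x, y):
    """Compensated dot product: result as if accumulated in
    roughly twice the working precision, then rounded once."""
    p, s = 0.0, 0.0
    for xi, yi in zip(x, y):
        h, r = two_prod(xi, yi)
        p, q = two_sum(p, h)
        s += q + r                   # accumulate the rounding errors
    return p + s
```

On the ill-conditioned example x = [1e16, 1.0, -1e16], y = [1.0, 1.0, 1.0], a naive accumulation loses the 1.0 entirely, while `dot2` returns it.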

12 | PSPASES: Scalable parallel direct solver library for sparse symmetric positive definite linear systems, User's Manual for Version 1.0.3
- Joshi, Gupta, et al.
- 1997
Citation Context ...alues of the nonzero matrix entries, or any intermediate results) to optimize load balance and fill-in, and the data structures and communication patterns can be determined statically and more simply [25]. We would like to have an algorithm as parallelizable as sparse Cholesky but applicable to completely general matrices. To do this we have designed a version of parallel sparse Gaussian elimination c... |

11 | Private communication - Kahan, Ivory - 1997 |

8 | IEEE Standard for Radix Independent Floating Point Arithmetic, Std 854-1987 edition - ANSI/IEEE - 1987 |

8 |
The PHiPAC WWW home
- Bilmes, Asanovic, et al.
Citation Context ...ended precision, but further improvements are possible, especially for mixed precision. We plan to expand our macros to automatically generate optimized code in a similar way as ATLAS [50] and PHiPAC [7]. And third, we will investigate the benefit of the new BLAS for more linear algebra algorithms and applications. Acknowledgments: Yulin Li, Weihua Shen, Berkat Tung and Ben Wanzo contributed c... |

6 |
A Fortran-90 double-double precision library. http://crd.lbl.gov/~dhbailey/mpdist
- Bailey
Citation Context ...upports any format wider than Double. So for our reference implementation, we use a technique called double-double in which extra precise numbers are represented by a pair of double precision numbers [6], providing about 106 bits of precision. Section 3 discusses a variety of implementation techniques for precision beyond double. In particular, our design goal does not include a general extended prec... |

6 |
Quasi double precision in floating-point arithmetic
- Møller
- 1965 |

5 |
An implementation of the dqds algorithm (positive
- Parlett, Marques
- 1999 |


5 |
Doubledouble floating point arithmetic. http://www-epidem.plantsci.cam.ac.uk/~kbriggs/doubledouble.html
- Briggs
- 1998
Citation Context ...ing-point hardware capabilities are exploited fully. These implementation techniques can be extended further to tripled- and quadrupled-double, but with diminishing efficiency except in special cases [11, 45, 47]. These techniques have also been used to compute dot products (α ∑_i x_i y_i + β r) with all correct leading digits, despite all roundoff [35, 9, 33]. We will not consider these very high precision imple... |


4 |
ACRITH: High-Accuracy Arithmetic – An advanced tool for numerical computation
- Bleher, Roeder, et al.
Citation Context ...with diminishing efficiency except in special cases [11, 45, 47]. These techniques have also been used to compute dot products (α ∑_i x_i y_i + β r) with all correct leading digits, despite all roundoff [35, 9, 33]. We will not consider these very high precision implementations further here. Quadruple precision. This is a conventional 128-bit = 16-byte wide floating-point format with one field each of sign bit,... |

4 |
Matlab’s Loss is Nobody’s Gain. http://www.cs.berkeley.edu/~wkahan/MxMulEps.pdf
- Kahan
- 2006
Citation Context ...uperior quality of results obtained by LAPACK and Matlab for large dimensional and ill-conditioned problems would distinguish hardware capable of extra-precise accumulation advantageously from others [30]. Versions of Matlab that ran on old 680x0-based Apple Macintoshes, and old versions of Matlab running on Wintel PCs, delivered matrix products more accurate thereon than Matlab used to deliver on ups... |



3 |
Unix extended precision library for the Pentium. http://www.cs.utk.edu/~ghenry/distrib/archive.htm
- Henry
- 1998
Citation Context ...id about a better mousetrap. Finally, we mention that G. Henry built a comprehensive suite of scalar extended precision operations for ASCI Red, which consists of Pentium Pro processors running Linux [26]. It includes object code in ELF format, and so is callable from C or Fortran on Linux-based Intel-inside workstations, and perhaps Solaris X86, though this remains to be determined. A typical routine... |

3 |
Anomalies in the IBM ACRITH package
- Kahan, LeBlanc
Citation Context ...with diminishing efficiency except in special cases [11, 45, 47]. These techniques have also been used to compute dot products (α ∑_i x_i y_i + β r) with all correct leading digits, despite all roundoff [35, 9, 33]. We will not consider these very high precision implementations further here. Quadruple precision. This is a conventional 128-bit = 16-byte wide floating-point format with one field each of sign bit,... |