## Pipelined mixed precision algorithms on FPGAs for fast and accurate PDE solvers from low precision components (2006)

### Download Links

- [www.mathematik.uni-dortmund.de]
- [numod.ins.uni-bonn.de]
- [www.mpi-inf.mpg.de]
- DBLP

### Other Repositories/Bibliography

Venue: IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM)

Citations: 10 (2 self)

### BibTeX

@INPROCEEDINGS{Strzodka06pipelinedmixed,
  author    = {Robert Strzodka},
  title     = {Pipelined mixed precision algorithms on FPGAs for fast and accurate PDE solvers from low precision components},
  booktitle = {IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM)},
  year      = {2006},
  pages     = {259--268}
}

### Abstract

FPGAs are becoming more and more attractive for high precision scientific computations. One of the main obstacles to efficient resource utilization is the quadratic growth of multiplier resource usage with operand size. Much research effort has been devoted to optimizing individual arithmetic and linear algebra operations. In this paper we take a higher level approach and seek to reduce the intermediate computational precision at the algorithmic level, optimizing accuracy only towards the final result of the algorithm; in our case this is the accurate solution of partial differential equations (PDEs). Using the Poisson Problem as a typical PDE example, we show that most intermediate operations can be computed with floats or even smaller formats, and only very few operations (e.g. 1%) must be performed in double precision to obtain the same accuracy as a full double precision solver. Thus the FPGA can be configured with many parallel float units rather than few resource-hungry double units. To achieve this, we adapt the general concept of mixed precision iterative refinement methods to FPGAs and develop a fully pipelined version of the Conjugate Gradient solver. We combine this solver with different iterative refinement schemes and precision combinations to obtain resource efficient mappings of the pipelined algorithm core onto the FPGA.

### Citations

132 | Efficient Solvers for Incompressible Flow Problems: An Algorithmic and
- Turek
- 1999

Citation Context: ...he Poisson Problem is a typical example of an elliptic PDE, also often encountered in time discretized parabolic and hyperbolic problems. For instance, projection schemes in Navier-Stokes simulations [26] require the solution of the Poisson Problem to obtain the pressure. Moreover, the solution of the Poisson Problem is often the most time consuming step in such simulations and therefore accelerations...

93 | A generalized conjugate gradient method for the numerical solution of elliptic PDE
- Concus, Golub, et al.
- 1976

Citation Context: ...me manner on the entire vector. Among the simpler methods the Conjugate Gradient (CG) algorithm, applicable to positive definite matrices, is very popular as it often delivers superlinear convergence [3]. In the setting of Finite Elements the positive definiteness is usually satisfied and we may exploit it by using the Conjugate Gradient method. The algorithm reads: \(q_k = A p_k\), \(\alpha_k = \rho_k / (p_k \cdot q_k)\), \(u_{k+1} = u\)...
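The quoted algorithm listing is truncated; for reference, one step of the standard CG iteration (a reconstruction from the standard method, consistent with the fragments above, with \(u\) the iterate and \(r\) the residual) reads:

$$
\begin{aligned}
q_k &= A p_k, & \alpha_k &= \frac{\rho_k}{p_k \cdot q_k}, \\
u_{k+1} &= u_k + \alpha_k p_k, & r_{k+1} &= r_k - \alpha_k q_k, \\
\rho_{k+1} &= r_{k+1} \cdot r_{k+1}, & p_{k+1} &= r_{k+1} + \frac{\rho_{k+1}}{\rho_k}\, p_k.
\end{aligned}
$$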

87 | FPGAs vs. CPUs: Trends in peak floating-point performance
- Underwood
- 2004

Citation Context: ...sources for arithmetic operations on large number formats. Keith Underwood analyzes the resulting floating point performance trends of FPGAs against CPUs [28] and concludes that the peak performance of FPGAs already surpasses that of CPUs and even where actual application performance does not, it shall soon due to the much higher performance increases per...

66 | Parallel numerical linear algebra, in: Acta Numerica
- Demmel, Heath, van der Vorst
- 1993

Citation Context: ... here is very similar to the one proposed by Meurant [24]. For an overview of different CG variants see Dongarra and Eijkhout [9], for a broader discussion of parallel iterative solvers Demmel et al. [5]. In Section 3.2 we combine the pipelined CG with the iterative refinement technique resulting in a mixed precision pipelined CG. After an analysis of this iteration scheme, we develop a different ite...

63 | Floating-point sparse matrix-vector multiply for FPGAs
- deLorimier, DeHon
- 2005

Citation Context: ...ting point numbers. In our context vector and matrix operations are of particular interest. Efficient implementations for the kernels of these operations (BLAS) [25] and both dense [10,31] and sparse [4] matrices have been studied. All these techniques offer valuable approaches to minimize the resource usage and maximize the throughput of scientific computations. These optimizations explore the struc...

60 | Design, Implementation and Testing of Extended and Mixed Precision BLAS
- Li, Demmel, et al.

Citation Context: ... number format in the inner loop. However, operating mainly on floats rather than doubles also benefits CPUs. Li et al. discuss the extension of the popular BLAS library to mixed precision algorithms [20]. Turner and Walker [27] accelerate the solution of an elliptic PDE. Geddes and Zheng [14] solve different problems with a mixed precision Newton iteration. This paper, in particular, demonstrates tha...

50 | 64-bit floating-point FPGA matrix multiplication
- Dou, Vassiliadis, et al.
- 2005

Citation Context: ...2] with custom floating point numbers. In our context vector and matrix operations are of particular interest. Efficient implementations for the kernels of these operations (BLAS) [25] and both dense [10,31] and sparse [4] matrices have been studied. All these techniques offer valuable approaches to minimize the resource usage and maximize the throughput of scientific computations. These optimizations ex...

50 | Sparse matrix-vector multiplication on FPGAs
- Zhuo, Prasanna
- 2005

Citation Context: ...2] with custom floating point numbers. In our context vector and matrix operations are of particular interest. Efficient implementations for the kernels of these operations (BLAS) [25] and both dense [10,31] and sparse [4] matrices have been studied. All these techniques offer valuable approaches to minimize the resource usage and maximize the throughput of scientific computations. These optimizations ex...

37 | Using floating-point arithmetic on FPGAs to accelerate scientific N-body simulations
- Lienhart, Kugel, et al.
- 2002

Citation Context: ...tual applications. Floating point FIR filters have been analyzed in detail [29], the Fast-Fourier-Transform has received particular attention [8, 19], and Lienhart et al. perform an N-body simulation [22] with custom floating point numbers. In our context vector and matrix operations are of particular interest. Efficient implementations for the kernels of these operations (BLAS) [25] and both dense [1...

35 | A library of parameterized floating-point modules and their use
- Belanovic, Leeser
- 2002

Citation Context: ...floating point formats [7, 17, 21]. There exist also parameterized IP cores which offer particularly efficient implementations for a given architecture. Fully parameterizable floating point libraries [1, 11] allow extensive explorations of different precisions and methods for automatic optimization of the operand sizes have been proposed [12, 13]. Another option for FPGAs is to use logarithmic number sys...

35 | Analysis of High-performance Floating-point Arithmetic on FPGAs
- Govindu, Zhuo, et al.
- 2004

Citation Context: ...esources by adapting the number format to the application. Moreover, multiple tradeoffs between latency and area can be exploited which has already been extensively studied for floating point formats [7, 17, 21]. There exist also parameterized IP cores which offer particularly efficient implementations for a given architecture. Fully parameterizable floating point libraries [1, 11] allow extensive exploratio...

28 | Error Bounds from Extra Precise Iterative Refinement
- Demmel, Hida, et al.
- 2005

Citation Context: ... Mixed Precision Iterative Refinement Iterative refinement techniques have already been introduced in 1966 by Wilkinson et al. [2]. For thorough information on these methods we refer to Demmel et al. [6] and Zielke and Drygalla [32]. The core idea of iterative refinement is to distinguish between different types of iterations in the iterative solver. Normally the steps are equal, i.e. in step k the s...

28 | Floating point unit generation and evaluation for FPGAs
- Liang, Tessier, et al.
- 2003

Citation Context: ...esources by adapting the number format to the application. Moreover, multiple tradeoffs between latency and area can be exploited which has already been extensively studied for floating point formats [7, 17, 21]. There exist also parameterized IP cores which offer particularly efficient implementations for a given architecture. Fully parameterizable floating point libraries [1, 11] allow extensive exploratio...

21 | A flexible floating-point format for optimizing data-paths and operators in FPGA-based DSPs
- Dido, Geraudie, et al.
- 2002

Citation Context: ...esources by adapting the number format to the application. Moreover, multiple tradeoffs between latency and area can be exploited which has already been extensively studied for floating point formats [7, 17, 21]. There exist also parameterized IP cores which offer particularly efficient implementations for a given architecture. Fully parameterizable floating point libraries [1, 11] allow extensive exploratio...

21 | Accelerating double precision FEM simulations with GPUs
- Goeddeke, Strzodka, et al.

Citation Context: ...ration. This paper, in particular, demonstrates that the algorithmic precision optimization can be successfully applied to various problems. A joint solver running on a graphics processor and the CPU [15] shows that diverse hardware combinations are suitable for mixed precision computations. As application performance is often limited by bandwidth rather than computational resources due to the socalle...

20 | Automating customisation of floating-point designs
- Gaffar, Luk, et al.
- 2002

Citation Context: ...rchitecture. Fully parameterizable floating point libraries [1, 11] allow extensive explorations of different precisions and methods for automatic optimization of the operand sizes have been proposed [12, 13]. Another option for FPGAs is to use logarithmic number systems which avoid the quadratic complexity of the multiplier but complicate the adder [18, 23]. Beside the optimizations on the arithmetic lev...

20 | Unifying bit-width optimisation for fixed-point and floating-point designs
- Gaffar, Mencer, et al.

Citation Context: ...rchitecture. Fully parameterizable floating point libraries [1, 11] allow extensive explorations of different precisions and methods for automatic optimization of the operand sizes have been proposed [12, 13]. Another option for FPGAs is to use logarithmic number systems which avoid the quadratic complexity of the multiplier but complicate the adder [18, 23]. Beside the optimizations on the arithmetic lev...

19 | Complex conjugate gradient methods
- Joly, Meurant
- 1993

Citation Context: ...gh performance computing this problem has also been addressed as the global communications incur high latencies for a parallel computer. Our scheme here is very similar to the one proposed by Meurant [24]. For an overview of different CG variants see Dongarra and Eijkhout [9], for a broader discussion of parallel iterative solvers Demmel et al. [5]. In Section 3.2 we combine the pipelined CG with the...

18 | Logarithmic number system and floating-point arithmetics on FPGA
- Matoušek, Tichy, et al.
- 2002

Citation Context: ...ization of the operand sizes have been proposed [12, 13]. Another option for FPGAs is to use logarithmic number systems which avoid the quadratic complexity of the multiplier but complicate the adder [18, 23]. Beside the optimizations on the arithmetic level the configurability of logic and data paths allows additional gains in actual applications. Floating point FIR filters have been analyzed in detail [...

18 | Efficient high accuracy solutions with GMRES(m)
- Turner, Walker
- 1992

Citation Context: ...ner loop. However, operating mainly on floats rather than doubles also benefits CPUs. Li et al. discuss the extension of the popular BLAS library to mixed precision algorithms [20]. Turner and Walker [27] accelerate the solution of an elliptic PDE. Geddes and Zheng [14] solve different problems with a mixed precision Newton iteration. This paper, in particular, demonstrates that the algorithmic precis...

15 | Exploiting fast hardware floating point in high precision computation
- Geddes, Zheng
- 2003

Citation Context: ...also benefits CPUs. Li et al. discuss the extension of the popular BLAS library to mixed precision algorithms [20]. Turner and Walker [27] accelerate the solution of an elliptic PDE. Geddes and Zheng [14] solve different problems with a mixed precision Newton iteration. This paper, in particular, demonstrates that the algorithmic precision optimization can be successfully applied to various problems. ...

14 | A comparison of floating point and logarithmic number systems for FPGAs
- Haselman, Beauchamp, et al.

Citation Context: ...ization of the operand sizes have been proposed [12, 13]. Another option for FPGAs is to use logarithmic number systems which avoid the quadratic complexity of the multiplier but complicate the adder [18, 23]. Beside the optimizations on the arithmetic level the configurability of logic and data paths allows additional gains in actual applications. Floating point FIR filters have been analyzed in detail [...

11 | Lightweight Floating-Point Arithmetic: Case Study of Inverse Discrete Cosine Transform
- Fang, Chen, et al.
- 2002

Citation Context: ...floating point formats [7, 17, 21]. There exist also parameterized IP cores which offer particularly efficient implementations for a given architecture. Fully parameterizable floating point libraries [1, 11] allow extensive explorations of different precisions and methods for automatic optimization of the operand sizes have been proposed [12, 13]. Another option for FPGAs is to use logarithmic number sys...

11 | Genaue Lösung Linearer Gleichungssysteme (English: Accurate Solution of Linear Systems of Equations)
- Zielke, Drygalla

Citation Context: ...efinement Iterative refinement techniques have already been introduced in 1966 by Wilkinson et al. [2]. For thorough information on these methods we refer to Demmel et al. [6] and Zielke and Drygalla [32]. The core idea of iterative refinement is to distinguish between different types of iterations in the iterative solver. Normally the steps are equal, i.e. in step k the solver would take an input vec...

9 | Scientific computing beyond CPUs: FPGA implementations of common scientific kernels
- Smith, Vetter, et al.
- 2005

Citation Context: ...N-body simulation [22] with custom floating point numbers. In our context vector and matrix operations are of particular interest. Efficient implementations for the kernels of these operations (BLAS) [25] and both dense [10,31] and sparse [4] matrices have been studied. All these techniques offer valuable approaches to minimize the resource usage and maximize the throughput of scientific computations...

8 | A scaleable FIR filter using 32-bit floating-point complex arithmetic on Reconfigurable computing machine
- Walters
- 1998

Citation Context: ...]. Beside the optimizations on the arithmetic level the configurability of logic and data paths allows additional gains in actual applications. Floating point FIR filters have been analyzed in detail [29], the Fast-Fourier-Transform has received particular attention [8, 19], and Lienhart et al. perform an N-body simulation [22] with custom floating point numbers. In our context vector and matrix opera...

7 | The memory gap (keynote)
- Wilkes

Citation Context: ...rdware combinations are suitable for mixed precision computations. As application performance is often limited by bandwidth rather than computational resources due to the so-called memory wall problem [30], we should point out that the reduction of the operand size in the inner loop also reduces the bandwidth requirements. In linear algebra computations which have a very low computational intensity (ra...
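The bandwidth point quoted here is easy to quantify for a memory-bound kernel. The following back-of-the-envelope sketch (the AXPY kernel is an illustrative choice, not taken from the paper) shows that halving the operand size halves the bytes moved per floating point operation:

```python
import numpy as np

# Bytes moved per flop for an AXPY kernel y <- y + a*x: 2 flops per element
# (one multiply, one add) against three operand streams (read x, read y, write y).
n = 1_000_000
for dt in (np.float64, np.float32):
    itemsize = np.dtype(dt).itemsize
    bytes_per_flop = 3 * n * itemsize / (2 * n)
    print(np.dtype(dt).name, bytes_per_flop, "bytes per flop")
# float64 moves 12.0 bytes per flop, float32 only 6.0
```

At such low computational intensity the kernel is bandwidth bound on essentially any architecture, so switching the inner loop operands from double to float directly doubles the achievable throughput.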

6 | A systolic FFT architecture for real time FPGA systems, High Performance Embedded Computing Conference
- Jackson, Chan, et al.
- 2004

Citation Context: ...lity of logic and data paths allows additional gains in actual applications. Floating point FIR filters have been analyzed in detail [29], the Fast-Fourier-Transform has received particular attention [8, 19], and Lienhart et al. perform an N-body simulation [22] with custom floating point numbers. In our context vector and matrix operations are of particular interest. Efficient implementations for the ke...

5 | Handbook series linear algebra: Solution of real and complex systems of linear equations
- Bowdler, Martin, et al.
- 1966

Citation Context: ...spired by a technique known as mixed precision iterative refinement. 1.2. Mixed Precision Iterative Refinement Iterative refinement techniques have already been introduced in 1966 by Wilkinson et al. [2]. For thorough information on these methods we refer to Demmel et al. [6] and Zielke and Drygalla [32]. The core idea of iterative refinement is to distinguish between different types of iterations in...
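The classical refinement scheme this context refers to can be sketched as follows: a minimal numpy illustration (not taken from the cited handbook) that runs the expensive solve in single precision and keeps only the residual computation and solution update in double precision; the test matrix is an illustrative well-conditioned example.

```python
import numpy as np

# Classical (Wilkinson-style) iterative refinement: solve in low precision,
# compute residuals and apply corrections in high precision.
rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonally dominant, well conditioned
b = rng.standard_normal(n)

A32 = A.astype(np.float32)                        # low precision system
x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
for _ in range(5):
    r = b - A @ x                                 # residual in double precision
    d = np.linalg.solve(A32, r.astype(np.float32))  # correction in single
    x = x + d.astype(np.float64)                  # update in double precision
print(np.linalg.norm(A @ x - b))  # residual near double precision rounding
```

Each refinement step shrinks the error by roughly the single precision solve's relative accuracy, so a few cheap double precision residual evaluations are enough to recover a solution of full double precision quality.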

5 | An efficient architecture for ultra long FFTs in FPGAs and ASICs, High Performance Embedded Computing Conference
- Dillon
- 2004

Citation Context: ...lity of logic and data paths allows additional gains in actual applications. Floating point FIR filters have been analyzed in detail [29], the Fast-Fourier-Transform has received particular attention [8, 19], and Lienhart et al. perform an N-body simulation [22] with custom floating point numbers. In our context vector and matrix operations are of particular interest. Efficient implementations for the ke...

3 | Finite-choice algorithm optimization in conjugate gradients. Computer Science Report UT-CS-03-502, University of Tennessee, 2003. LAPACK Working Note 159
- Dongarra, Eijkhout

Citation Context: ...al communications incur high latencies for a parallel computer. Our scheme here is very similar to the one proposed by Meurant [24]. For an overview of different CG variants see Dongarra and Eijkhout [9], for a broader discussion of parallel iterative solvers Demmel et al. [5]. In Section 3.2 we combine the pipelined CG with the iterative refinement technique resulting in a mixed precision pipelined...

1 | Performance and accuracy of mixed precision solvers
- Göddeke, Strzodka, et al.
- 2006

Citation Context: ...h between them. Moreover, different iterative solvers can be employed in the two loops. For numerical examination of multigrid solvers in this context and an implementation on graphics processors see [16]. Here we study the pipelined Conjugate Gradient solver with a focus on an efficient hardware implementation, as multigrid methods generate a complex data-flow which is much harder to map to FPGAs. FP...