Results 1-10 of 22
Performance and accuracy of hardware-oriented native, emulated and mixed-precision solvers in FEM simulations
 International Journal of Parallel, Emergent and Distributed Systems
Abstract

Cited by 54 (12 self)
In a previous publication, we have examined the fundamental difference between computational precision and result accuracy in the context of the iterative solution of linear systems as they typically arise in the Finite Element discretization of Partial Differential Equations (PDEs) [1]. In particular, we evaluated mixed- and emulated-precision schemes on commodity graphics processors (GPUs), which at that time only supported computations in single precision. With the advent of graphics cards that natively provide double precision, this report updates our previous results. We demonstrate that with new co-processor hardware supporting native double precision, such as NVIDIA’s G200 architecture, the situation does not change qualitatively for PDEs, and the previously introduced mixed-precision schemes are still preferable to double precision alone. But the schemes achieve significant quantitative performance improvements with the more powerful hardware. In particular, we demonstrate that a Multigrid scheme can accurately solve a common test problem in Finite Element settings with one million unknowns in less than 0.1 seconds, which is truly outstanding performance. We support these conclusions by exploring the algorithmic design space enlarged by the availability of double precision directly in the hardware.
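The mixed-precision scheme the abstract refers to can be sketched in a few lines: run the expensive inner solve in low precision, then correct it with defects (residuals) computed in high precision. A minimal NumPy sketch under stated assumptions — the direct single-precision solve stands in for the paper's GPU multigrid inner solver, and the function name and iteration count are illustrative:

```python
import numpy as np

def mixed_precision_solve(A, b, refinements=5):
    """Iterative refinement: cheap float32 inner solves, float64 residuals.

    Hypothetical sketch of the general scheme; the paper's inner solver
    is a GPU multigrid code, not a dense direct solve.
    """
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(refinements):
        r = b - A @ x                                    # residual in double
        d = np.linalg.solve(A32, r.astype(np.float32))   # correction in single
        x += d.astype(np.float64)
    return x

# Small well-conditioned SPD test system
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)
b = rng.standard_normal(50)
x = mixed_precision_solve(A, b)
res = np.linalg.norm(b - A @ x)   # residual near double-precision level
```

Each outer iteration gains roughly single-precision accuracy, so a handful of cheap low-precision solves reaches the accuracy of a full double-precision solve.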
Reconfigurable computing: architectures and design methods
 IEE Proceedings - Computers and Digital Techniques
, 2005
Accuracy-guaranteed bit-width optimization
 IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
, 2006
Abstract

Cited by 32 (13 self)
An automated static approach for optimizing bit widths of fixed-point feed-forward designs with guaranteed accuracy, called MiniBit, is presented. Methods to minimize both the integer and fraction parts of fixed-point signals with the aim of minimizing the circuit area are described. For range analysis, the technique in this paper identifies the number of integer bits necessary to meet range requirements. For precision analysis, a semi-analytical approach with analytical error models in conjunction with adaptive simulated annealing is employed to optimize the number of fraction bits. The analytical models make it possible to guarantee overflow/underflow protection and numerical accuracy for all inputs over the user-specified input intervals. Using a stream compiler for field-programmable gate arrays (FPGAs), the approach in this paper is demonstrated with polynomial approximation, RGB-to-YCbCr conversion, matrix multiplication, B-splines, and discrete cosine transform placed and routed on a Xilinx Virtex-4 FPGA. Improvements for a given design reduce the area and the latency by up to 26% and 12%, respectively, over a design using optimum uniform fraction bit widths. Studies show that MiniBit-optimized designs are within 1% of the area produced from the integer linear programming approach.
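The two analyses the abstract separates — range analysis for integer bits and precision analysis for fraction bits — can be illustrated with the closed-form bounds alone. This is a hypothetical sketch: MiniBit's actual precision analysis couples analytical error models with adaptive simulated annealing across all signals, which this omits.

```python
import math

def integer_bits(lo, hi):
    """Range analysis: signed (two's complement) integer bits to cover [lo, hi]."""
    m = max(abs(lo), abs(hi))
    return math.ceil(math.log2(m + 1)) + 1   # +1 for the sign bit

def fraction_bits(tol):
    """Precision analysis: fraction bits f so that the round-to-nearest
    quantization error bound 2^-(f+1) stays within tol."""
    return max(0, math.ceil(-math.log2(2.0 * tol)))

def quantize(x, f):
    """Round x to f fractional bits (models the fixed-point signal)."""
    return round(x * 2.0**f) / 2.0**f

# A signal known to lie in [-3, 5] that must be accurate to within 0.01:
ib = integer_bits(-3, 5)     # 4 bits cover [-8, 7]
fb = fraction_bits(0.01)     # 6 fraction bits give error <= 2^-7
```

Per-signal bounds like these are what make the guarantee *static*: they hold for all inputs in the user-specified interval, not just for simulated test vectors.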
Pipelined mixed precision algorithms on FPGAs for fast and accurate PDE solvers from low precision components
 In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM)
, 2006
Abstract

Cited by 16 (2 self)
FPGAs are becoming more and more attractive for high-precision scientific computations. One of the main problems in efficient resource utilization is the quadratically growing resource usage of multipliers depending on the operand size. Many research efforts have been devoted to the optimization of individual arithmetic and linear algebra operations. In this paper we take a higher-level approach and seek to reduce the intermediate computational precision on the algorithmic level by optimizing the accuracy towards the final result of an algorithm. In our case this is the accurate solution of partial differential equations (PDEs). Using the Poisson problem as a typical PDE example, we show that most intermediate operations can be computed with floats or even smaller formats, and only very few operations (e.g. 1%) must be performed in double precision to obtain the same accuracy as a full double-precision solver. Thus the FPGA can be configured with many parallel float operations rather than a few resource-hungry double operations. To achieve this, we adapt the general concept of mixed-precision iterative refinement methods to FPGAs and develop a fully pipelined version of the Conjugate Gradient solver. We combine this solver with different iterative refinement schemes and precision combinations to obtain resource-efficient mappings of the pipelined algorithm core onto the FPGA.
Trustworthy Numerical Computation in Scala
Abstract

Cited by 13 (4 self)
Modern computing has adopted the floating-point type as a default way to describe computations with real numbers. Thanks to dedicated hardware support, such computations are efficient on modern architectures, even in double precision. However, rigorous reasoning about the resulting programs remains difficult. This is in part due to a large gap between the finite floating-point representation and the infinite-precision real-number semantics that serves as the developers’ mental model. Because programming languages do not provide support for estimating errors, some computations in practice are performed more precisely, and some less precisely, than needed. We present a library solution for rigorous arithmetic computation. Our numerical data type tracks a (double) floating-point value, but also a guaranteed upper bound on the error between this value and the ideal value that would be computed in the real-value semantics. Our implementation involves a set of linear approximations based on an extension of affine arithmetic. The derived approximations cover most of the standard mathematical operations, including trigonometric functions, and are more comprehensive than any publicly available ones. Moreover, while interval arithmetic rapidly yields overly pessimistic estimates, our approach remains precise for several computational tasks of interest. We evaluate the library on a number of examples from numerical analysis and physical simulations. We found it to be a useful tool for gaining confidence in the correctness of the computation.
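The core idea — pairing each double value with a sound error bound that is propagated through every operation — can be sketched with a much simpler interval-style data type. This is a hypothetical illustration: the paper's Scala library uses affine arithmetic, which additionally cancels correlated error terms and so stays far tighter than this naive propagation.

```python
from dataclasses import dataclass

EPS = 2.0**-52  # double-precision unit roundoff

@dataclass
class ErrVal:
    """A double value paired with a guaranteed absolute error bound
    relative to the ideal real-number result."""
    v: float
    err: float = 0.0

    def __add__(self, o):
        v = self.v + o.v
        # propagated input errors plus the rounding error of this addition
        return ErrVal(v, self.err + o.err + abs(v) * EPS)

    def __mul__(self, o):
        v = self.v * o.v
        err = (abs(self.v) * o.err + abs(o.v) * self.err
               + self.err * o.err + abs(v) * EPS)
        return ErrVal(v, err)

x = ErrVal(0.1, 0.05)   # a measurement known only to within +/- 0.05
y = x * x               # bound covers both input uncertainty and rounding
```

Even this crude version already demonstrates the library's interface idea: the program computes with `ErrVal` exactly as it would with `Double`, and the error bound rides along for free.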
Wordlength optimization for differentiable nonlinear systems
 ACM Trans. Des. Autom. Electron. Syst
Abstract

Cited by 9 (0 self)
This article introduces an automatic design procedure for determining the sensitivity of outputs in a digital signal processing design to small errors introduced by rounding or truncation of internal variables. The proposed approach can be applied to both linear and nonlinear designs. By analyzing the resulting sensitivity values, the proposed procedure is able to determine an appropriate distinct wordlength for each internal variable in a fixed-point hardware implementation. In addition, the power-optimizing capabilities of wordlength optimization are studied. Application of the proposed procedure to adaptive filters and polynomial evaluation circuits realized in a Xilinx Virtex FPGA has resulted in area reductions of up to 80% (mean 66%), combined with power reductions of up to 98% (mean 87%) and speedups of up to 36% (mean 20%), over common alternative design strategies.
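The per-variable sensitivity the article computes can be illustrated by injecting a small perturbation at each internal variable and measuring the output change. A toy finite-difference sketch — the names `output_sensitivity` and `horner2` and the noise-injection interface are hypothetical, standing in for the article's automated differentiation-based procedure:

```python
def output_sensitivity(f, x, n_vars, delta=2.0**-15):
    """Estimate |d(output)/d(noise_i)| for each internal variable i
    by perturbing one variable at a time."""
    base = f(x, [0.0] * n_vars)
    sens = []
    for i in range(n_vars):
        noise = [0.0] * n_vars
        noise[i] = delta                 # quantization noise at variable i only
        sens.append(abs(f(x, noise) - base) / delta)
    return sens

# Toy design: one Horner step of a polynomial, y = (3*x + n0)*x + n1,
# with additive noise n0, n1 modeling rounding at the two internal signals.
def horner2(x, noise):
    t = 3.0 * x + noise[0]    # internal variable 1
    return t * x + noise[1]   # internal variable 2

s = output_sensitivity(horner2, 0.5, 2)
# noise at t is scaled by x = 0.5 before reaching the output, while noise
# at the final adder passes through unattenuated
```

Variables with low sensitivity tolerate shorter wordlengths, which is precisely where the reported area and power savings come from.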
Accuracy-Guaranteed Bit-Width Optimization
 IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
, 2005
Abstract

Cited by 8 (1 self)
We present MiniBit, an automated static approach for optimizing bit-widths of fixed-point feed-forward designs with guaranteed accuracy. Methods to minimize both the integer and fraction parts of fixed-point signals with the aim of minimizing circuit area are described. For range analysis, our technique identifies the number of integer bits necessary to meet range requirements. For precision analysis, we employ a semi-analytical approach with analytical error models in conjunction with adaptive simulated annealing to optimize the number of fraction bits. The analytical models enable us to guarantee overflow/underflow protection and numerical accuracy for all inputs over the user-specified input intervals. Using ASC, A Stream Compiler for field-programmable gate arrays (FPGAs), we demonstrate our approach with polynomial approximation, RGB-to-YCbCr conversion, matrix multiplication, B-splines and discrete cosine transform placed-and-routed on a Xilinx Virtex-4 FPGA. Improvements for a given design reduce area and latency by up to 26% and 12% respectively, over a design using optimum uniform fraction bit-widths. Studies show that MiniBit-optimized designs are within 1% of the area produced from the integer linear programming approach.
Customisable hardware compilation
 In Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA '04)
, 2004
Abstract

Cited by 7 (4 self)
Hardware compilers for high-level languages are increasingly recognised to be the key to reducing the productivity gap for advanced circuit development in general, and for reconfigurable designs in particular. This paper explains how customisable frameworks for hardware compilation can enable rapid design exploration, and reusable and extensible hardware optimisation. It describes such a framework, based on a parallel imperative language, which supports multiple levels of design abstraction, transformational development, optimisation by compiler passes, and meta-language facilities. Our approach has been used in producing designs for applications such as signal and image processing, with different trade-offs in performance and resource usage.
Optimising Performance of Quadrature Methods with Reduced Precision
Abstract

Cited by 5 (2 self)
This paper presents a generic precision optimisation methodology for quadrature computation targeting reconfigurable hardware to maximise performance at a given error tolerance level. The proposed methodology optimises performance by considering integration grid density versus mantissa size of floating-point operators. The optimisation provides the number of integration points and the mantissa size that maximise throughput while meeting the given error tolerance requirement. Three case studies show that the proposed reduced-precision designs on a Virtex-6 SX475T FPGA are up to 6 times faster than comparable FPGA designs with double-precision arithmetic. They are up to 15.1 times faster and 234.9 times more energy efficient than an i7-870 quad-core CPU, and are 1.2 times faster and 42.2 times more energy efficient than a Tesla C2070 GPU.
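The trade-off between grid density and mantissa size can be explored in software by emulating reduced-precision operators with explicit rounding after every operation. A sketch under stated assumptions — `round_mantissa` and the midpoint-rule kernel are illustrative stand-ins, not the authors' FPGA operator designs:

```python
import math

def round_mantissa(x, m):
    """Round x to m mantissa (fraction) bits: a software stand-in
    for a reduced-precision floating-point operator."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    scale = 2.0 ** (m - e)
    return round(x * scale) / scale

def quad_midpoint(f, a, b, n, m):
    """Midpoint-rule quadrature with every operation rounded to m bits."""
    h = round_mantissa((b - a) / n, m)
    total = 0.0
    for i in range(n):
        xi = round_mantissa(a + (i + 0.5) * h, m)
        total = round_mantissa(total + round_mantissa(f(xi), m), m)
    return round_mantissa(total * h, m)

# Denser grids shrink discretization error; wider mantissas shrink rounding
# error. A design point (n, m) is feasible when both fit the tolerance.
approx = quad_midpoint(math.sin, 0.0, math.pi, 256, 30)  # exact value: 2
```

Sweeping `(n, m)` with a model like this is a software analogue of the paper's optimisation: pick the cheapest feasible pair, since narrower operators pack more pipelines onto the FPGA.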
Adaptive range reduction for hardware function evaluation
 In Proc. IEEE Int’l Conf. on Field-Programmable Technology
, 2004
Abstract

Cited by 3 (2 self)
Function evaluation f(x) typically consists of range reduction and the actual function evaluation on a small interval. In this paper, we investigate optimization of range reduction given the range and precision of x and f(x). For every function evaluation there exists a convenient interval, such as [0, π/2) for sin(x). The adaptive range reduction method, which we propose in this work, involves deciding whether range reduction can be used effectively for a particular design. The decision depends on the function being evaluated, the precision, and optimization metrics such as area, latency and throughput. In addition, the input and output ranges have an impact on the preferable function evaluation method, such as polynomial, table-based, or combinations of the two. We explore this vast design space of adaptive range reduction for fixed-point sin(x), log(x) and √x accurate to one unit in the last place using MATLAB and ASC, A Stream Compiler. These tools enable us to study over 1000 designs, resulting in over 40 million Xilinx equivalent circuit gates, in a few hours’ time. The final objective is to progress towards a fully automated library that provides optimal function evaluation hardware units given input/output range and precision.
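The reduce-then-evaluate structure can be sketched for sin(x): fold the argument into [0, π/2) with quadrant bookkeeping, then apply a short polynomial on the small interval. A hypothetical double-precision sketch using truncated Taylor polynomials — the paper's actual designs are fixed-point hardware units accurate to one ulp, with the polynomial/table choice itself under optimization:

```python
import math

HALF_PI = math.pi / 2

def poly_sin(r):
    """Degree-7 Taylor polynomial for sin on the reduced interval."""
    return r - r**3 / 6 + r**5 / 120 - r**7 / 5040

def poly_cos(r):
    """Degree-6 Taylor polynomial for cos on the reduced interval."""
    return 1 - r**2 / 2 + r**4 / 24 - r**6 / 720

def sin_reduced(x):
    """sin(x) via range reduction to [0, pi/2) plus quadrant selection."""
    k = math.floor(x / HALF_PI)    # quadrant index
    r = x - k * HALF_PI            # reduced argument in [0, pi/2)
    q = k % 4
    if q == 0:
        return poly_sin(r)         # sin(r)
    if q == 1:
        return poly_cos(r)         # sin(pi/2 + r) = cos(r)
    if q == 2:
        return -poly_sin(r)        # sin(pi + r) = -sin(r)
    return -poly_cos(r)            # sin(3pi/2 + r) = -cos(r)
```

The adaptive question the paper poses is whether this reduction step pays for itself: for narrow input ranges the raw evaluator may already be small enough that the reduction hardware is pure overhead.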