Results 1 - 9 of 9
A library of parameterizable floating-point cores for FPGAs and their application to scientific computing
In Proc. of International Conference on Engineering Reconfigurable Systems and Algorithms, 2005
Cited by 21 (9 self)
Advances in field programmable gate arrays (FPGAs), which are the platform of choice for reconfigurable computing, have made it possible to use FPGAs in increasingly many areas of computing, including complex scientific applications. These applications demand high performance and high-precision, floating-point arithmetic. Until now, most of the research has not focussed on compliance with IEEE standard 754, focusing instead upon custom formats and bitwidths. In this paper, we present double-precision floating-point cores that are parameterized by their degree of pipelining and the features of IEEE standard 754 that they implement. We then analyze the effects of supporting the standard when these cores are used in an FPGA-based accelerator for Lennard-Jones force and potential calculations that are part of molecular dynamics (MD) simulations.
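For readers unfamiliar with the kernel this abstract mentions, the following is a minimal software sketch of the standard 12-6 Lennard-Jones pair potential and force. It is not the paper's FPGA datapath; the function names and the default parameters epsilon = sigma = 1 (reduced units) are assumptions chosen for illustration.

```python
def lj_potential(r, epsilon=1.0, sigma=1.0):
    """12-6 Lennard-Jones potential: V(r) = 4*eps*((sigma/r)**12 - (sigma/r)**6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def lj_force(r, epsilon=1.0, sigma=1.0):
    """Radial force magnitude F(r) = -dV/dr = 24*eps*(2*(sigma/r)**12 - (sigma/r)**6) / r."""
    sr6 = (sigma / r) ** 6
    return 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r


if __name__ == "__main__":
    r_min = 2.0 ** (1.0 / 6.0)  # location of the potential minimum for sigma = 1
    print(lj_potential(r_min), lj_force(r_min))
```

Each pair evaluation needs a division, several multiplications, and a subtraction, which is why the accuracy and pipelining of the floating-point cores matter for this workload.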
Exponential: Implementation Trade-Offs for Hundred Bit Precision
2000
Cited by 5 (0 self)
The development of processors has given rise to problems that need more than double precision arithmetic. Some of them are known to require very long multiple precision numbers, but for some others, doubling the available precision to reach about 100 bits is sufficient. We propose an insight on the development of a library for the exponential function. Since the hardware is able to perform all the arithmetic operations on 53 bits, our exponential has to be based on a polynomial or a rational approximation. Our routines
Automating Custom-Precision Function Evaluation for Embedded Processors
Cited by 4 (2 self)
Due to resource and power constraints, embedded processors often cannot afford dedicated floating-point units. For instance, the IBM PowerPC processor embedded in Xilinx Virtex-II Pro FPGAs only supports emulated floating-point arithmetic, which leads to slow operation when floating-point arithmetic is desired. This paper presents a customizable mathematical library using fixed-point arithmetic for elementary function evaluation. We approximate functions via polynomial or rational approximations depending on the user-defined accuracy requirements. The data representations for the inputs and outputs are compatible with IEEE single-precision and double-precision floating-point formats. Results show that our 32-bit polynomial method achieves over 80 times speedup over the single-precision mathematical library from Xilinx, while our 64-bit polynomial method achieves over 30 times speedup.
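The idea of evaluating an elementary function with a polynomial in fixed-point arithmetic can be sketched as follows. This is an illustration only, not the library described in the abstract: the Q-format with 24 fractional bits, the helper names, and the use of Taylor coefficients for exp(x) on [0, 0.5] (rather than coefficients tuned to an accuracy target) are all assumptions.

```python
import math

FRAC_BITS = 24            # assumed Q-format: 24 fractional bits
ONE = 1 << FRAC_BITS      # fixed-point representation of 1.0


def to_fixed(x):
    """Convert a float to the nearest fixed-point integer."""
    return int(round(x * ONE))


def to_float(q):
    """Convert a fixed-point integer back to a float."""
    return q / ONE


def fixed_mul(a, b):
    """Multiply two fixed-point values, truncating the extra fractional bits."""
    return (a * b) >> FRAC_BITS


# Degree-6 Taylor coefficients 1/k! of exp(x), stored in fixed point.
COEFFS = [to_fixed(1.0 / math.factorial(k)) for k in range(7)]


def exp_fixed(x_q):
    """Evaluate the polynomial with Horner's rule using integer operations only."""
    acc = COEFFS[-1]
    for c in reversed(COEFFS[:-1]):
        acc = fixed_mul(acc, x_q) + c
    return acc


if __name__ == "__main__":
    x = 0.5
    print(to_float(exp_fixed(to_fixed(x))), math.exp(x))
```

Horner's rule keeps the evaluation to one multiply and one add per coefficient, which suits a processor without a floating-point unit.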
High-Performance Floating Point Divide
In Proceedings of the Euromicro Symposium on Digital System Design, 2001
Cited by 3 (1 self)
In modern processors floating point divide operations often take 20 to 25 clock cycles, five times that of multiplication. Typically multiplicative algorithms with quadratic convergence are used for high-performance divide. A divide unit based on the multiplicative Newton-Raphson iteration is proposed. This divide unit utilizes the higher-order Newton-Raphson reciprocal approximation to compute the quotient fast, efficiently and with high throughput. The divide unit achieves fast execution by computing the square, cube and higher powers of the approximation directly and much faster than the traditional approach with serial multiplications. Additionally, the second-, third-, and higher-order terms are computed simultaneously, further reducing the divide latency. Significant hardware reductions have been identified that reduce the overall computation and therefore reduce the area required for implementation and the power consumed by the computation. The proposed hardware unit is designed to achieve the desired quotient precision in a single iteration, allowing the unit to be fully pipelined for maximum throughput.
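A software sketch may help make the higher-order reciprocal approximation concrete. With a seed x0 close to 1/d and error term e = 1 - d*x0, the reciprocal is x0 * (1 + e + e^2 + ...); keeping terms up to e^k leaves a relative error of e^(k+1), and k = 1 is the classical Newton-Raphson step. The linear seed 48/17 - (32/17)*m, the function names, and the serial Horner loop are assumptions for illustration; the abstract describes hardware that forms the powers directly and in parallel.

```python
import math


def reciprocal_high_order(d, order=3):
    """Approximate 1/d (d > 0) with one higher-order Newton-Raphson style step."""
    if d <= 0.0:
        raise ValueError("d must be positive")
    m, exp = math.frexp(d)                  # d = m * 2**exp with 0.5 <= m < 1
    x0 = 48.0 / 17.0 - (32.0 / 17.0) * m    # assumed linear seed, |1 - m*x0| <= 1/17
    e = 1.0 - m * x0                        # error term of the seed
    series = 1.0
    for _ in range(order):                  # builds 1 + e + e**2 + ... + e**order
        series = 1.0 + e * series
    return math.ldexp(x0 * series, -exp)    # undo the scaling: 1/d = (1/m) * 2**-exp


def divide(n, d, order=3):
    """Quotient n/d formed as n times the approximate reciprocal of d."""
    return n * reciprocal_high_order(d, order)


if __name__ == "__main__":
    for d in (0.75, 3.0, 7.5):
        print(d, reciprocal_high_order(d), 1.0 / d)
```

Raising `order` trades more multiplications for accuracy in a single pass, which mirrors the abstract's goal of reaching the target precision in one iteration so the unit can be pipelined.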
Small FPGA polynomial approximations with 3-bit coefficients and low-precision estimations of the powers of x
2005
Variable Precision Floating Point Division and Square Root
Cited by 1 (0 self)
Division and square root are important operations in many high performance signal processing applications including matrix inversion, vector normalization, least squares lattice filters and Cholesky decomposition. We have implemented floating point division and square root designs for our VHDL variable precision floating point library. These designs are implemented in VHDL and are designed to make efficient use of FPGA hardware. Both the division [1] and square root [2] algorithms are based on table lookup and Taylor series expansion. These algorithms are particularly well-suited for implementation on an FPGA with embedded RAM and embedded multipliers such as the Altera Stratix and Xilinx Virtex2 devices. The division and square root components have been incorporated into the framework of our variable precision floating-point library.
1 Variable Precision Floating-Point Library
Our parameterized floating-point library is composed of three parts: format control, arithmetic operations, and format conversion. Format control includes modules denorm and rnd_norm. The first is used for denormalizing (introduction of the implied one bit) and the second is used for rounding and normalizing. Format conversion includes modules fix2float and float2fix. The first is used
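As a rough illustration of division by table lookup plus a Taylor expansion, the sketch below splits the divisor Y into its leading bits Yh and the remainder Yl, then uses the first-order expansion 1/Y ≈ (Yh - Yl) / Yh^2 with 1/Yh^2 read from a table. This is one common scheme of this kind and is not claimed to be the algorithm of reference [1]; the table width M_BITS, the restriction of Y to [1, 2), and all names are assumptions.

```python
M_BITS = 8  # assumed number of leading fraction bits used to index the table

# Table of 1 / Yh**2 for every possible leading part Yh = 1 + i / 2**M_BITS.
TABLE = [1.0 / (1.0 + i / 2 ** M_BITS) ** 2 for i in range(2 ** M_BITS)]


def divide_table_taylor(x, y):
    """Approximate x / y for a mantissa-range divisor 1 <= y < 2."""
    if not 1.0 <= y < 2.0:
        raise ValueError("y must lie in [1, 2)")
    idx = int((y - 1.0) * 2 ** M_BITS)   # leading M_BITS fraction bits of y
    y_h = 1.0 + idx / 2 ** M_BITS        # high part, exactly representable
    y_l = y - y_h                        # low part, 0 <= y_l < 2**-M_BITS
    # 1/(y_h + y_l) = (y_h - y_l) / (y_h**2 - y_l**2) ~ (y_h - y_l) / y_h**2
    return x * (y_h - y_l) * TABLE[idx]


if __name__ == "__main__":
    print(divide_table_taylor(1.5, 1.3), 1.5 / 1.3)
```

The relative error is about (Yl/Yh)^2, so doubling the table index width roughly squares the accuracy; the two multiplications and one table read map naturally onto embedded multipliers and embedded RAM.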
Powering by Table Look-Up using a second-degree minimax approximation with fused accumulation tree
2000
A new algorithm for the calculation of single-precision floating-point powering (X^p) is proposed in this report. This algorithm employs table look-up and polynomial approximation, a second-degree minimax approximation. The use of this polynomial approximation allows the employment of small tables to store the coefficients. Both unfolded and pipelined architectures are presented, and the results of a pre-layout synthesis performed using CMOS 0.35 µm technology are shown, achieving a 50% area reduction from linear approximation methods, and with improved speed over other second-degree approximation based algorithms. The unfolded architecture presented has a cycle time of about 11.2 ns. For the pipelined architecture, an operation frequency above 200 MHz has been achieved, with a latency of three cycles and a throughput of one result per cycle.
1 INTRODUCTION
Powering function (X^p) is a very interesting function for applications such as computer 3D graphics and digital i...
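The table-plus-quadratic structure described in this abstract can be sketched in software. The leading bits of the mantissa select three coefficients, and the remaining low part x2 is fed to a second-degree polynomial C0 + C1*x2 + C2*x2^2. For simplicity the coefficients below are Taylor coefficients about the left edge of each interval, not the minimax coefficients the report uses, and the table width M_BITS and all names are assumptions.

```python
M_BITS = 6  # assumed number of leading fraction bits used as the table index


def build_table(p):
    """Per-interval coefficients (C0, C1, C2) of a quadratic approximating x**p."""
    table = []
    for i in range(2 ** M_BITS):
        x1 = 1.0 + i / 2 ** M_BITS              # left edge of the interval
        c0 = x1 ** p
        c1 = p * x1 ** (p - 1.0)
        c2 = 0.5 * p * (p - 1.0) * x1 ** (p - 2.0)
        table.append((c0, c1, c2))
    return table


def power_lookup(x, table):
    """Approximate x**p for a mantissa-range input 1 <= x < 2."""
    if not 1.0 <= x < 2.0:
        raise ValueError("x must lie in [1, 2)")
    idx = int((x - 1.0) * 2 ** M_BITS)          # leading bits select the interval
    x2 = x - (1.0 + idx / 2 ** M_BITS)          # low part, 0 <= x2 < 2**-M_BITS
    c0, c1, c2 = table[idx]
    return c0 + x2 * (c1 + x2 * c2)             # quadratic in Horner form


if __name__ == "__main__":
    sqrt_table = build_table(0.5)
    print(power_lookup(1.2345, sqrt_table), 1.2345 ** 0.5)
```

A minimax fit would spread the error evenly over each interval and so reach the same accuracy with a smaller table than this Taylor version, which is the trade-off the report exploits.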
... like SIMD (Single Instruction Multiple Data) or SIMT (Single Instruction Multiple Thread) have been adopted in many recent CPU and GPU architectures. Although some SIMD and SIMT instruction sets include double-precision arithmetic and bitwise operations, there are no instructions dedicated to evaluating elementary functions like trigonometric functions in double precision. Thus, these functions have to be evaluated one by one using an FPU or using a software library. However, traditional algorithms for evaluating these elementary functions involve heavy use of conditional branches and/or table lookups, which are not suitable for SIMD computation. In this paper, efficient methods are proposed for evaluating the sine, cosine, arc tangent, exponential and logarithmic functions in double precision without table lookups, scattering from, or gathering into SIMD registers, or conditional branches. We implemented these methods using the Intel SSE2 instruction set to evaluate their accuracy and speed. The results showed that the average error was less than 0.67 ulp, and the maximum error was 6 ulps. The computation speed was faster than the FPUs on Intel Core 2 and Core i7 processors.
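To show what evaluation without table lookups or data-dependent branches can look like, here is a scalar sketch of a sine routine in that style. Every input follows the same instruction sequence: reduce by the nearest multiple of pi, evaluate a fixed-length odd polynomial, and fix the sign arithmetically from the parity of the quotient. This is not the paper's SSE2 implementation; the Taylor coefficients, the single-constant range reduction (which loses accuracy for large arguments), and the function name are assumptions.

```python
import math

# Taylor coefficients (-1)**k / (2k+1)! for the terms x**3 ... x**15 of sin(x).
_SIN_COEFFS = [(-1.0) ** k / math.factorial(2 * k + 1) for k in range(1, 8)]


def sin_branch_free(x):
    """Sine with no table lookup and no data-dependent branch on x."""
    q = round(x / math.pi)            # nearest integer multiple of pi
    r = x - q * math.pi               # reduced argument, |r| <= pi/2
    r2 = r * r
    poly = 0.0
    for c in reversed(_SIN_COEFFS):   # fixed trip count, so it can be fully unrolled
        poly = poly * r2 + c
    s = r + r * r2 * poly             # sin(r) = r + r**3 * (c1 + c2*r**2 + ...)
    sign = 1.0 - 2.0 * (q & 1)        # +1.0 for even q, -1.0 for odd q, no branch
    return sign * s                   # sin(x) = (-1)**q * sin(r)


if __name__ == "__main__":
    for x in (-3.0, 1.0, 10.0):
        print(x, sin_branch_free(x), math.sin(x))
```

Because no step depends on which range the input falls in, the same sequence can be applied to every lane of a SIMD register at once, with the sign correction done by arithmetic or bitwise operations.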
presented and publicly defended on 25/06/2008 by
2008
"... No. assigned by the library: 07ENSL0 465 ..."