Results 1–10 of 21
64-bit floating-point FPGA matrix multiplication
In ACM/SIGDA Field-Programmable Gate Arrays, 2005
Cited by 48 (6 self)
Abstract: We introduce a 64-bit ANSI/IEEE Std 754-1985 floating-point design of a hardware matrix multiplier optimized for FPGA implementations. A general block matrix multiplication algorithm, applicable to an arbitrary matrix size, is proposed. The algorithm potentially enables optimum performance by exploiting the data locality and reusability of the general matrix multiplication scheme while respecting the limitations of the I/O bandwidth and the local storage volume. We implement a scalable linear array of processing elements (PEs) supporting the proposed algorithm in Xilinx Virtex-II Pro technology. Synthesis results confirm a superior performance-area ratio compared to related recent work. Assuming the same FPGA chip, the same amount of local memory, and the same I/O bandwidth, our design outperforms related proposals by at least 1.7x and up to 18x while consuming the fewest reconfigurable resources. A total of 39 PEs can be integrated into the xc2vp125-7 FPGA, reaching a performance of, e.g., 15.6 GFLOPS with 1600 KB of local memory and 400 MB/s of external memory bandwidth.
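The blocking idea behind this abstract can be illustrated in software. The sketch below is not the paper's hardware PE-array algorithm; it is a minimal, generic blocked (tiled) matrix multiply in plain Python, showing how tiling over a block size `b` (an illustrative parameter) confines the working set to a few b x b tiles at a time, which is exactly what bounds the local storage an FPGA implementation needs.

```python
# Illustrative software sketch of blocked matrix multiplication.
# The block size `b` and the function name are assumptions for this
# example, not parameters from the paper's hardware design.

def block_matmul(A, B, n, b):
    """Multiply two n x n matrices (lists of lists) using b x b blocks."""
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, b):          # row band of C
        for j0 in range(0, n, b):      # column band of C
            for k0 in range(0, n, b):  # accumulate over the k dimension
                for i in range(i0, min(i0 + b, n)):
                    for j in range(j0, min(j0 + b, n)):
                        s = C[i][j]
                        for k in range(k0, min(k0 + b, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

Any block size yields the same result; the choice of `b` only trades local storage against how often tiles are re-fetched, which is the bandwidth/storage trade-off the abstract refers to.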
Closing the gap: CPU and FPGA trends in sustainable floating-point BLAS performance
Cited by 46 (4 self)
Abstract: Field-programmable gate arrays (FPGAs) have long been an attractive alternative to microprocessors for computing tasks, as long as floating-point arithmetic is not required. Fueled by the advance of Moore's Law, FPGAs are rapidly reaching sufficient densities to enhance peak floating-point performance as well. The question, however, is how much of this peak performance can be sustained. This paper examines three of the basic linear algebra subroutine (BLAS) functions: vector dot product, matrix-vector multiply, and matrix multiply. A comparison of microprocessors, FPGAs, and reconfigurable computing platforms is performed for each operation. The analysis highlights the amount of memory bandwidth and internal storage needed to sustain peak performance with FPGAs. This analysis considers the historical context of the last six years and is extrapolated for the next six years.
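The bandwidth argument in this abstract can be made concrete with simple arithmetic-intensity (FLOPs per byte) estimates for the three kernels it compares. The model below is an assumption for illustration: 8-byte doubles, each operand moved across the memory interface exactly once, and the standard 2n, 2n², 2n³ flop counts. It shows why dot product and matrix-vector multiply are bandwidth-bound while matrix multiply can sustain near-peak rates.

```python
# Back-of-envelope FLOPs-per-byte for the three BLAS kernels compared in
# the paper. Assumed model (not from the paper): 8-byte words, each
# operand read/written from external memory exactly once.

def flops_per_byte(kernel, n, bytes_per_word=8):
    if kernel == "dot":      # 2n flops; 2n words in, 1 word out
        flops, words = 2 * n, 2 * n + 1
    elif kernel == "gemv":   # 2n^2 flops; n^2 matrix + 2n vector words
        flops, words = 2 * n * n, n * n + 2 * n
    elif kernel == "gemm":   # 2n^3 flops; 3n^2 matrix words
        flops, words = 2 * n ** 3, 3 * n * n
    else:
        raise ValueError(kernel)
    return flops / (words * bytes_per_word)
```

Under this model gemm's intensity grows as n/12 flops per byte, while dot product is capped below 0.125 regardless of n, so no amount of local storage lets it escape the memory bandwidth ceiling.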
A tool for unbiased comparison between logarithmic and floating-point arithmetic
LIP, École Normale Supérieure de, 2004
Unifying Bit-Width Optimisation for Fixed-Point and Floating-Point Designs
In 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'04), 2004
Cited by 22 (9 self)
Abstract: This paper presents a method that offers a uniform treatment for bit-width optimisation of both fixed-point and floating-point designs. Our work utilises automatic differentiation to compute the sensitivities of outputs to the bit-widths of the various operands in the design. This sensitivity analysis enables us to explore and compare fixed-point and floating-point implementations for a particular design. As a result, we can automate the selection of the optimal number representation for each variable in a design to optimise area and performance. We implement our method in the BitSize tool targeting reconfigurable architectures, which takes user-defined constraints to direct the optimisation procedure. We illustrate our approach using applications such as ray tracing and function approximation.
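The sensitivity computation this abstract describes rests on forward-mode automatic differentiation. The sketch below is a minimal dual-number implementation, not the BitSize tool's actual machinery: the derivative of the output with respect to each operand serves as a proxy for how strongly that operand's rounding error (and hence its bit-width) affects the result. The `Dual` class, `sensitivity` helper, and example function are all illustrative assumptions.

```python
# Minimal forward-mode automatic differentiation with dual numbers,
# illustrating the kind of sensitivity analysis the paper describes.
# All names here are assumptions for this sketch, not the tool's API.

class Dual:
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.der + o.der)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val,
                    self.der * o.val + self.val * o.der)  # product rule
    __rmul__ = __mul__

def sensitivity(f, args, i):
    """d f / d args[i] at the given point, via one forward-mode pass."""
    duals = [Dual(a, 1.0 if j == i else 0.0) for j, a in enumerate(args)]
    return f(*duals).der
```

For f(x, y) = x²y + y at (3, 2), one pass per operand gives df/dx = 12 and df/dy = 10; an operand with a larger sensitivity would be allotted more bits.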
Pipelined mixed precision algorithms on FPGAs for fast and accurate PDE solvers from low precision components
In IEEE Proceedings on Field-Programmable Custom Computing Machines (FCCM), 2006
Cited by 16 (2 self)
Abstract: FPGAs are becoming more and more attractive for high precision scientific computations. One of the main problems in efficient resource utilization is the quadratically growing resource usage of multipliers as a function of operand size. Many research efforts have been devoted to the optimization of individual arithmetic and linear algebra operations. In this paper we take a higher-level approach and seek to reduce the intermediate computational precision at the algorithmic level by optimizing the accuracy towards the final result of an algorithm. In our case this is the accurate solution of partial differential equations (PDEs). Using the Poisson problem as a typical PDE example, we show that most intermediate operations can be computed with floats or even smaller formats, and only very few operations (e.g. 1%) must be performed in double precision to obtain the same accuracy as a full double precision solver. Thus the FPGA can be configured with many parallel float units rather than few resource-hungry double units. To achieve this, we adapt the general concept of mixed precision iterative refinement methods to FPGAs and develop a fully pipelined version of the Conjugate Gradient solver. We combine this solver with different iterative refinement schemes and precision combinations to obtain resource-efficient mappings of the pipelined algorithm core onto the FPGA.
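Mixed precision iterative refinement, the general concept this abstract adapts to FPGAs, can be sketched on a toy 2x2 system. This is not the paper's pipelined Conjugate Gradient core: the low precision "inner solve" is emulated by rounding every operation to IEEE single precision with the `struct` module, while the residual and the solution update (the "few expensive" operations) stay in double. The 2x2 direct solver and all names are illustrative assumptions.

```python
# Sketch of mixed precision iterative refinement: cheap single-precision
# inner solves, double-precision residuals and updates. Illustrative only;
# the paper's solver is a pipelined Conjugate Gradient, not Cramer's rule.

import struct

def to_f32(x):
    """Round a Python float (double) to the nearest IEEE-754 single."""
    return struct.unpack("f", struct.pack("f", x))[0]

def solve2x2_low(A, b):
    """Direct 2x2 solve with every operation rounded to single precision."""
    a, c = to_f32(A[0][0]), to_f32(A[0][1])
    d, e = to_f32(A[1][0]), to_f32(A[1][1])
    p, q = to_f32(b[0]), to_f32(b[1])
    det = to_f32(to_f32(a * e) - to_f32(c * d))
    return [to_f32(to_f32(e * p - c * q) / det),
            to_f32(to_f32(a * q - d * p) / det)]

def refine(A, b, iters=5):
    x = solve2x2_low(A, b)                      # low precision first guess
    for _ in range(iters):
        # residual in double precision: the small fraction of double ops
        r = [b[i] - A[i][0] * x[0] - A[i][1] * x[1] for i in range(2)]
        d = solve2x2_low(A, r)                  # correction, low precision
        x = [x[0] + d[0], x[1] + d[1]]          # update in double
    return x
```

For a well-conditioned system, each refinement step multiplies the error by roughly the single-precision unit roundoff, so a handful of cheap iterations recovers double-precision accuracy.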
Dual Fixed-Point: An Efficient Alternative to Floating-Point Computation
In Proceedings of the International Conference on Field Programmable Logic, 2004
Cited by 11 (1 self)
Abstract: This paper presents a new data representation known as Dual FiXed-point (DFX), which employs a single-bit exponent to select between two different fixed-point scalings. DFX provides a compromise between conventional fixed-point and floating-point representations: it has implementation complexity similar to that of a fixed-point system together with the improved dynamic range offered by a floating-point system. The benefit of using DFX over both fixed-point and floating-point is demonstrated with an IIR filter implementation on a Xilinx Virtex-II FPGA.
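The representation can be modeled in a few lines: a one-bit selector chooses between a fine scaling (more fractional bits, for small magnitudes) and a coarse scaling (fewer fractional bits, for large magnitudes). The word and fraction widths below are arbitrary illustrative choices, not the paper's parameters, and the encode/decode helpers are assumptions for this sketch.

```python
# Toy model of the Dual FiXed-point (DFX) idea: a 1-bit "exponent" selects
# one of two fixed-point scalings. Widths are illustrative, not the paper's.

FRAC_FINE, FRAC_COARSE = 12, 4    # fractional bits for the two scalings
WIDTH = 16                        # magnitude bits available for the value

def dfx_encode(x):
    # Use the fine scaling while the value still fits in WIDTH bits.
    sel = 0 if abs(x) < 2 ** (WIDTH - FRAC_FINE) else 1
    frac = FRAC_FINE if sel == 0 else FRAC_COARSE
    mant = round(x * (1 << frac))
    return sel, mant

def dfx_decode(sel, mant):
    frac = FRAC_FINE if sel == 0 else FRAC_COARSE
    return mant / (1 << frac)
```

Small values round-trip with 12-bit fractional resolution while values too large for that scaling fall back to the coarse format, which is the dynamic-range compromise the abstract describes.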
Group-alignment based accurate floating-point summation on FPGAs
In ERSA'06, 2006
Cited by 6 (0 self)
Abstract: Floating-point summation is one of the most important operations in scientific/numerical computing applications and also a basic subroutine (SUM) in the BLAS (Basic Linear Algebra Subprograms) library. However, summation algorithms based on standard floating-point arithmetic may not always produce accurate results because of possible catastrophic cancellations. To make the situation worse, the order of the consecutive additions affects the final result, which makes it impossible to produce a unique solution for the same input dataset on different computer platforms with different software compilers. The emergence of high-density reconfigurable hardware devices gives us the option to customize high-performance arithmetic units for specific computing problems. In this paper, we design an FPGA-based hardware algorithm for accurate floating-point summation using a group-alignment technique. The corresponding fully pipelined summation unit is shown to have numerical errors similar to or even smaller than standard floating-point arithmetic. Moreover, it consumes far fewer reconfigurable resources and pipeline stages than existing designs, while achieving a throughput of one summation per clock cycle with only moderate startup latency. This technique can also be used to accelerate other linear algebra subroutines on FPGAs, yielding more efficient and compact implementations without negative impact on computational performance or numerical accuracy.
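The order-dependence problem the abstract describes is easy to demonstrate. In the sketch below, `math.fsum` stands in for an error-free accumulator (playing the role the wide group-aligned accumulator plays in the hardware design; it is not the paper's algorithm): naive left-to-right summation gives different answers for different orderings of the same data, while the error-free sum gives one well-defined result.

```python
# Why order matters in floating-point summation: 1e16 absorbs a nearby 1.0
# (the ulp at 1e16 is 2.0), so naive sums of the same data disagree by
# ordering, while math.fsum (an error-free software accumulator) does not.

import math

def naive_sum(xs):
    s = 0.0
    for x in xs:
        s += x
    return s

data = [1e16, 1.0, -1e16, 1.0]
forward = naive_sum(data)        # the first 1.0 is lost in 1e16
backward = naive_sum(data[::-1])
exact = math.fsum(data)          # 2.0, independent of summation order
```

The two naive sums disagree (1.0 vs 0.0) while the exact answer is 2.0, which is precisely the reproducibility problem a hardware group-alignment accumulator removes.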
A Scalable Precision Analysis Framework
Cited by 2 (1 self)
Abstract: In embedded computing, some form of silicon area or power budget typically restricts the achievable performance. For algorithms with limited dynamic range, custom hardware accelerators extract significant additional performance within such a budget by mapping operations in the algorithm to fixed-point. However, for complex applications requiring floating-point computation, the potential performance improvement over software is reduced. Nonetheless, custom hardware can still customise the precision of floating-point operators, unlike software, which is restricted to IEEE standard single or double precision, increasing overall performance at the cost of increasing the error observed in the final computational result. Unfortunately, because it is difficult to determine whether this error increase is tolerable, this task is rarely performed. We present a new analytical technique to calculate bounds on the range or relative error of output variables, enabling custom hardware accelerators to be tolerant of floating-point errors by design. In contrast to existing tools that perform this task, our approach scales to larger examples and obtains tighter bounds within a smaller execution time. Furthermore, it allows a user to trade the quality of bounds against the execution time of the procedure, making it suitable for both small- and large-scale algorithms.
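The flavor of analysis such a framework automates can be shown with the textbook error model, which is an assumption here rather than the paper's (tighter) technique: each rounded operation contributes relative error at most the unit roundoff u = 2^-p for a p-bit significand, so a chain of n operations has relative error at most (1 + u)^n - 1 ≈ nu. Computing such a bound statically lets a tool pick the smallest significand width whose worst-case error meets a target.

```python
# Textbook worst-case relative error bound for a chain of n rounded
# operations with a p-bit significand (u = 2**-p), and the smallest p
# meeting a target bound. Illustrative model, not the paper's analysis.

def relerr_bound(n_ops, mantissa_bits):
    u = 2.0 ** -mantissa_bits
    return (1.0 + u) ** n_ops - 1.0

def min_mantissa(n_ops, target, max_bits=64):
    """Smallest significand width whose worst-case bound meets target."""
    for p in range(1, max_bits + 1):
        if relerr_bound(n_ops, p) <= target:
            return p
    return None
```

For 100 chained operations and a 10^-6 error target, this model demands a 27-bit significand, i.e. slightly more than IEEE single precision, which is the kind of per-variable conclusion a precision analysis framework draws automatically.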
Wordlength Optimization Beyond Straight Line Code
Cited by 1 (1 self)
Abstract: The silicon area benefits that result from wordlength optimization have been widely reported by the FPGA community. However, to date, most approaches are restricted to straight line code, or code that can be converted into straight line code using techniques such as loop unrolling. In this paper, we take the first steps towards creating analytical techniques to optimize the precision used throughout custom FPGA accelerators for algorithms that contain loops with data-dependent exit conditions. To achieve this, we build on ideas from the software verification community for proving program termination. Our idea is to apply wordlength optimization techniques to find the minimum precision required to guarantee that a loop with data-dependent exit conditions will terminate. Without techniques to analyze algorithms containing these types of loops, a hardware designer may elect to implement every arithmetic operator throughout a custom FPGA-based accelerator using IEEE-754 standard single or double precision arithmetic. With this approach, the FPGA accelerator would have accuracy comparable to a software implementation. However, we show that by using our new technique to create custom fixed- and floating-point designs, we can obtain silicon area savings of up to 50% over IEEE standard single precision arithmetic, or 80% over IEEE standard double precision arithmetic, while guaranteeing that the created hardware designs will work in practice.
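The core question of this abstract, the least precision at which a loop with a data-dependent exit still terminates, can be explored empirically. The sketch below is an illustration, not the paper's analytical technique: it emulates p fractional bits of fixed-point by rounding after every operation, runs a Newton-style iteration toward sqrt(2) whose exit fires when the update stalls to within one quantum, and searches for the smallest p at which the loop exits.

```python
# Empirical version of the paper's question: smallest precision for which
# a loop with a data-dependent exit terminates. The iteration, the stall
# test, and all names below are illustrative assumptions.

def quantize(x, p):
    """Round x to p fractional bits (a crude fixed-point model)."""
    return round(x * (1 << p)) / (1 << p)

def converges(p, max_iters=1000):
    """Run x <- (x + 2/x)/2 toward sqrt(2) in p-bit fixed point; the loop
    exits when successive iterates differ by at most one quantum."""
    x = quantize(1.0, p)
    for _ in range(max_iters):
        nxt = quantize((x + quantize(2.0 / x, p)) / 2.0, p)
        if abs(nxt - x) * (1 << p) <= 1:   # data-dependent exit condition
            return True
        x = nxt
    return False

def min_precision(max_p=32):
    """Smallest p (if any) at which the loop terminates."""
    for p in range(1, max_p + 1):
        if converges(p):
            return p
    return None
```

The paper's contribution is to obtain such a minimum precision by static analysis with a termination guarantee, rather than by the finite trial runs this sketch performs.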
Model-Based Precision Analysis and Optimization for Digital Signal Processors
Cited by 1 (1 self)
Abstract: Embedded signal processing has witnessed explosive growth in recent years in both scientific and consumer applications, driving the need for complex, high-performance signal processing systems that are largely application driven. In order to implement these systems efficiently on programmable platforms such as digital signal processors (DSPs), it is important to analyze and optimize the application design from the early stages of the design process. A key performance concern for designers is the choice of data format. In this work, we propose a systematic and efficient design flow involving model-based design to analyze application data sets and precision requirements. We demonstrate this design flow with an exploration study into the required precision for eigenvalue decomposition (EVD) using the Jacobi algorithm. We demonstrate that, with a high degree of structured analysis and automation, we are able to analyze the data set to derive an efficient data format and to optimize important parts of the algorithm with respect to precision.