Results 1 - 10 of 17
A Fused Hybrid Floating-Point and Fixed-Point Dot-product for FPGAs
"... Abstract. Dot-products are one of the essential and recurrent building blocks in scientific computing, and often take-up a large proportion of the scientific acceleration circuitry. The acceleration of dot-products is very well suited for Field Programmable Gate Arrays (FPGAs) since these devices ca ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
(Show Context)
Abstract. Dot-products are one of the essential and recurrent building blocks in scientific computing, and often take up a large proportion of the scientific acceleration circuitry. The acceleration of dot-products is very well suited for Field Programmable Gate Arrays (FPGAs) since these devices can be configured to employ wide parallelism, deep pipelining and exploit highly efficient datapaths. In this paper we present a dot-product implementation which operates using a hybrid floating-point and fixed-point number system. This design receives floating-point inputs and generates a floating-point output. Internally it makes use of a configurable word-length fixed-point number system. The internal representation can be tuned to match the desired accuracy. Results using a high-end Xilinx FPGA and an order-150 dot-product demonstrate that, for equivalent accuracy metrics, it is possible to utilize 3.8 times fewer resources, operate at a 1.62 times faster clock frequency, and achieve a significant reduction in latency when compared to a direct floating-point core based dot-product. Combining these results and utilizing the spare resources to instantiate more units in parallel, it is possible to achieve an overall speed-up of at least 5 times.
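As a rough illustration of the hybrid scheme this abstract describes, the Python sketch below receives floating-point inputs, accumulates the products in a configurable fixed-point format, and converts the sum back to floating point. The word-length parameter frac_bits is illustrative and not a value taken from the paper.

FRAC_BITS = 40  # fractional bits of the internal fixed-point format (illustrative)

def hybrid_dot(xs, ys, frac_bits=FRAC_BITS):
    scale = 1 << frac_bits
    acc = 0  # a Python int stands in for the wide fixed-point accumulator
    for x, y in zip(xs, ys):
        acc += int(round(x * y * scale))  # quantise each product onto the fixed-point grid
    return acc / scale                    # convert the fixed-point sum back to floating point

print(hybrid_dot([0.5, 1.25, -2.0], [4.0, 0.1, 0.3]))  # ~1.525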
FPGA vs. GPU for Sparse Matrix Vector Multiply
"... Abstract—Sparse matrix-vector multiplication (SpMV) is a common operation in numerical linear algebra and is the computational kernel of many scientific applications. It is one of the original and perhaps most studied targets for FPGA acceleration. Despite this, GPUs, which have only recently gained ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
Abstract—Sparse matrix-vector multiplication (SpMV) is a common operation in numerical linear algebra and is the computational kernel of many scientific applications. It is one of the original and perhaps most studied targets for FPGA acceleration. Despite this, GPUs, which have only recently gained both general-purpose programmability and native support for double precision floating-point arithmetic, are viewed by some as a more effective platform for SpMV and similar linear algebra computations. In this paper, we present an analysis comparing an existing GPU SpMV implementation to our own, novel FPGA implementation. In this analysis, we describe the challenges faced by any SpMV implementation, the unique approaches to these challenges taken by both FPGA and GPU implementations, and their relative performance for SpMV.
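For reference, the computational kernel both platforms accelerate is the standard compressed-sparse-row (CSR) matrix-vector product; the Python sketch below uses the usual CSR array names, which are not specific to either implementation.

def spmv_csr(row_ptr, col_idx, values, x):
    # y[i] = sum of values[k] * x[col_idx[k]] over the nonzeros of row i
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# 2x3 matrix [[10, 0, 2], [0, 3, 0]] times x = [1, 1, 1] -> [12, 3]
print(spmv_csr([0, 2, 3], [0, 2, 1], [10.0, 2.0, 3.0], [1.0, 1.0, 1.0]))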
BLAS Comparison on FPGA, CPU and GPU
"... Abstract—High Performance Computing (HPC) or scientific codes are being executed across a wide variety of computing platforms from embedded processors to massively parallel GPUs. We present a comparison of the Basic Linear Algebra Subroutines (BLAS) using double-precision floating point on an FPGA, ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
(Show Context)
Abstract—High Performance Computing (HPC) or scientific codes are being executed across a wide variety of computing platforms, from embedded processors to massively parallel GPUs. We present a comparison of the Basic Linear Algebra Subroutines (BLAS) using double-precision floating point on an FPGA, CPU and GPU. On the CPU and GPU, we utilize standard libraries on state-of-the-art devices. On the FPGA, we have developed parameterized modular implementations for the dot-product and for Gaxpy (matrix-vector multiplication). In order to obtain optimal performance for any aspect ratio of the matrices, we have designed a high-throughput accumulator to perform an efficient reduction of floating point values. To support scalability to large data sets, we target the BEE3 FPGA platform. We use performance and energy efficiency as metrics to compare the different platforms. Results show that FPGAs offer comparable performance as well as 2.7 to 293 times better energy efficiency for the test cases that we implemented on all three platforms.
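As a point of reference for the BLAS kernels mentioned, a plain double-precision Gaxpy (matrix-vector multiply-accumulate) is sketched below in Python/NumPy; this is only the mathematical operation, not the parameterized FPGA module.

import numpy as np

def gaxpy(A, x, y):
    # y <- A @ x + y, all in double precision
    return A @ x + y

A = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([1.0, -1.0])
y = np.array([0.5, 0.5])
print(gaxpy(A, x, y))  # [-0.5 -0.5]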
A High-Performance Double Precision Accumulator
"... Abstract—The accumulation operation A new = A old + X is required for many numerical methods. However, when using a floating-point adder with pipeline latency �, the data hazard that exists between Anew and Aold creates design challenges for situations where inputs must be delivered to the accumulat ..."
Abstract
-
Cited by 3 (2 self)
- Add to MetaCart
(Show Context)
Abstract—The accumulation operation A_new = A_old + X is required for many numerical methods. However, when using a floating-point adder with a multi-cycle pipeline latency, the data hazard that exists between A_new and A_old creates design challenges for situations where inputs must be delivered to the accumulator at a rate exceeding one value per adder latency. Each of the techniques proposed to address this problem requires either static data scheduling or overly complex micro-architectures having multiple adders, a large amount of memory, or control overheads that force the accumulator to operate at a diminished speed relative to the adder on which it is based. In this paper we present a design for a double precision accumulator that achieves high performance without the need for data scheduling or an overly complex implementation. We achieve this by integrating a coalescing reduction circuit within the low-level design of a base-converting floating-point adder. When implemented on our Virtex-2 Pro 100 FPGA, our design achieves a speed of 170 MHz.
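The data hazard can be made concrete with a small behavioural model: with an adder pipeline of depth L, a single running sum accepts one input every L cycles, whereas keeping L independent partial sums accepts one input per cycle at the cost of a final reduction. This is only the textbook interleaving workaround that the abstract contrasts itself with, not the paper's coalescing base-converting design; the latency value is illustrative.

LATENCY = 14  # illustrative pipeline depth of a double-precision adder

def accumulate_with_partial_sums(stream, latency=LATENCY):
    partials = [0.0] * latency           # one independent partial sum per pipeline slot
    for cycle, x in enumerate(stream):
        partials[cycle % latency] += x   # each slot is only revisited every `latency` cycles
    return sum(partials)                 # final reduction of the partial sums

print(accumulate_with_partial_sums([1.0] * 100))  # 100.0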
Higher-Order Abstraction in Hardware Descriptions with CλaSH
In: Proceedings of the 14th Conference on Digital System Design, USA, IEEE Computer Society, 2011
"... Abstract—Synchronous hardware can be straightforwardly modelled as a function from input and (current) state to an updated state and output. The CλaSH compiler can translate such a transition function, described in a functional language, to synthesisable VHDL. Taking a hardware-oriented viewpoint, c ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
(Show Context)
Abstract—Synchronous hardware can be straightforwardly modelled as a function from input and (current) state to an updated state and output. The CλaSH compiler can translate such a transition function, described in a functional language, to synthesisable VHDL. Taking a hardware-oriented viewpoint, components can then be seen as an instantiation of such a transition function. An abstraction called Arrows is used to directly model components by combining a transition function and its state. The abstraction also provides a uniform interface for composition, without losing the referential transparency offered by a functional description. Furthermore, readability of hardware designs is increased by the use of the γ-syntax, which automatically composes components according to the Arrow interface. The advantages of the Arrow abstraction and the γ-syntax are demonstrated by means of a realistic example circuit consisting of multiple components. This is a significant extension to CλaSH and enables many high-level abstractions. Keywords: Functional Programming, Hardware description languages, Pipeline processing
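A small Python model conveys the transition-function view: a component maps (state, input) to (new state, output), and composition threads one component's output into the next while pairing their states. The helper names below are illustrative; they are not CλaSH or Arrow API.

def counter(state, _inp):
    return state + 1, state              # output the current count, then increment

def accumulator(state, inp):
    s = state + inp
    return s, s                          # running sum is both new state and output

def compose(f, g):
    # feed f's output into g; the combined state is the pair of both states
    def h(state, inp):
        s_f, s_g = state
        s_f2, out_f = f(s_f, inp)
        s_g2, out_g = g(s_g, out_f)
        return (s_f2, s_g2), out_g
    return h

def simulate(component, init_state, inputs):
    state, outputs = init_state, []
    for inp in inputs:
        state, out = component(state, inp)
        outputs.append(out)
    return outputs

print(simulate(compose(counter, accumulator), (0, 0), [None] * 5))  # [0, 1, 3, 6, 10]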
A SCALABLE PRECISION ANALYSIS FRAMEWORK
"... Abstract—In embedded computing, typically some form of silicon area or power budget restricts the potential performance achievable. For algorithms with limited dynamic range, custom hardware accelerators manage to extract significant additional performance for such a budget via mapping operations in ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
Abstract—In embedded computing, typically some form of silicon area or power budget restricts the potential performance achievable. For algorithms with limited dynamic range, custom hardware accelerators manage to extract significant additional performance for such a budget by mapping operations in the algorithm to fixed-point arithmetic. However, for complex applications requiring floating-point computation, the potential performance improvement over software is reduced. Nonetheless, custom hardware can still customise the precision of floating-point operators, unlike software, which is restricted to IEEE standard single or double precision, to increase the overall performance at the cost of increasing the error observed in the final computational result. Unfortunately, because it is difficult to determine whether this error increase is tolerable, this task is rarely performed. We present a new analytical technique to calculate bounds on the range or relative error of output variables, enabling custom hardware accelerators to be tolerant of floating-point errors by design. In contrast to existing tools that perform this task, our approach scales to larger examples and obtains tighter bounds in a shorter execution time. Furthermore, it allows a user to trade the quality of the bounds against the execution time of the procedure, making it suitable for both small and large-scale algorithms.
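To give a flavour of the kind of quantity such an analysis bounds, the snippet below evaluates the classical a-priori relative-error bound for an n-term dot product computed with p significand bits (unit roundoff u = 2^-p). This textbook bound is only an illustration; the paper's technique produces tighter, per-variable bounds.

def unit_roundoff(mantissa_bits):
    return 2.0 ** (-mantissa_bits)

def dot_error_bound(n, mantissa_bits):
    # classical bound: |fl(x.y) - x.y| <= gamma_n * sum(|x_i * y_i|),
    # with gamma_n = n*u / (1 - n*u)
    u = unit_roundoff(mantissa_bits)
    return n * u / (1.0 - n * u)

print(dot_error_bound(150, 24))  # single-precision significand
print(dot_error_bound(150, 53))  # double-precision significand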
BRINGING HIGH-PERFORMANCE RECONFIGURABLE COMPUTING TO EXACT COMPUTATIONS
"... Numerical non-robustness is a recurring phenomenon in scientific computing. It is primarily caused by numerical errors arising because of fixed-precision arithmetic in integer and/or floating-point computations. Exact computation, based on arbitrary-precision arithmetic, has been developed over the ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Numerical non-robustness is a recurring phenomenon in scientific computing. It is primarily caused by numerical errors arising from fixed-precision arithmetic in integer and/or floating-point computations. Exact computation, based on arbitrary-precision arithmetic, has been developed over the last decade as an emerging numerical computation paradigm in response to this problem of numerical non-robustness. Exact arithmetic, specifically arbitrary-precision arithmetic, has traditionally been implemented using efficient software libraries such as GNU Multi-Precision (GMP). However, this results in slower arithmetic performance when compared to fixed-precision arithmetic. In this paper we present a first effort, to the best of our knowledge, at reconfigurable hardware support for arbitrary-precision arithmetic. The proposed hardware architectures are based on virtual convolution scheduling, which is derived from a formal representation of the problem. Targeting high performance and efficiency, dynamic (non-linear) pipeline techniques were exploited to eliminate the effects of deeply-pipelined operators. Compared to GMP, our experiments showed promising results.
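The convolution formulation the abstract refers to can be seen in a plain schoolbook multi-precision multiply: digit k of the product is the sum of all a[i]*b[j] with i + j = k, followed by a carry-propagation pass. The Python sketch below shows only this underlying arithmetic, with an illustrative limb width; it is not a model of the proposed hardware.

BASE = 1 << 16  # illustrative limb width

def bignum_mul(a_limbs, b_limbs, base=BASE):
    # operands are little-endian limb lists in the given base
    prod = [0] * (len(a_limbs) + len(b_limbs))
    for i, a in enumerate(a_limbs):
        for j, b in enumerate(b_limbs):
            prod[i + j] += a * b                       # convolution term
    carry = 0
    for k in range(len(prod)):                         # carry-propagation pass
        total = prod[k] + carry
        prod[k], carry = total % base, total // base
    return prod

# (1 + 2*2^16) * (3 + 4*2^16) = 3 + 10*2^16 + 8*2^32
print(bignum_mul([1, 2], [3, 4]))  # [3, 10, 8, 0]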
Modular design of fully pipelined accumulators
In: Field-Programmable Technology, 2010
"... Abstract—Fast and efficient accumulation arithmetic circuits are critical for a broad range of scientific and embedded system applications. High throughput accumulation circuits are typically hand designed for specific vector lengths requiring the circuit to be modified when the lengths are changed. ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Abstract—Fast and efficient accumulation arithmetic circuits are critical for a broad range of scientific and embedded system applications. High-throughput accumulation circuits are typically hand-designed for specific vector lengths, requiring the circuit to be modified when the lengths change. In this work we present a new design approach that can achieve low latency and near-optimal throughput for input data vectors of arbitrary length. The flexibility of the design allows it to be used for both integer and floating-point operations. By providing a simple and efficient interface to the user and a modular architecture for the designer, the proposed technique has broad impact across a wide range of custom hardware designs.
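The interface described, accumulation over input vectors of arbitrary length, can be modelled in software as a stream of (value, last) pairs, with a result emitted whenever a vector's last flag is seen. This sketch captures only the interface behaviour, not the pipelined modular architecture.

def streaming_accumulate(stream):
    results, acc = [], 0.0
    for value, last in stream:
        acc += value
        if last:                  # end of the current vector
            results.append(acc)
            acc = 0.0             # immediately ready for the next, back-to-back vector
    return results

# two vectors of lengths 3 and 2, delivered back to back
stream = [(1.0, False), (2.0, False), (3.0, True), (10.0, False), (5.0, True)]
print(streaming_accumulate(stream))  # [6.0, 15.0]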
A Scalable Approach for Automated Precision Analysis
"... The freedom over the choice of numerical precision is one of the key factors that can only be exploited throughout the datapath of an FPGA accelerator, providing the ability to trade the accuracy of the final computational result with the silicon area, power, operating frequency, and latency. Howeve ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
The freedom over the choice of numerical precision is one of the key factors that can be exploited only throughout the datapath of an FPGA accelerator, providing the ability to trade the accuracy of the final computational result against silicon area, power, operating frequency, and latency. However, in order to tune the precision used throughout hardware accelerators automatically, a tool is required to verify that the hardware will meet an error or range specification for a given precision. Existing tools to perform this task typically either suffer from a lack of tightness in the bounds or require a long execution time when applied to large-scale algorithms; in this work, we propose an approach that can both scale to larger examples and obtain tighter bounds, in a shorter execution time, than the existing methods. The approach we describe also provides a user with the ability to trade the quality of the bounds against the execution time of the procedure, making it suitable within a word-length optimization framework for both small and large-scale algorithms. We demonstrate the use of our approach on instances of iterative algorithms to solve a system of linear equations. We show that because our approach can track how the relative error decreases with increasing precision, unlike the existing methods, we can use it to create smaller hardware with guaranteed numerical properties. This results in a saving of 25% of the area in comparison to optimizing the precision using competing analytical techniques, whilst requiring a shorter execution time than these methods, and a saving of almost 80% of the area in comparison to adopting IEEE double precision arithmetic.
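In the same spirit as the evaluation described, rounding the intermediate results of a Jacobi iteration to p significand bits shows the relative error shrinking as the precision grows. The rounding helper and the 2x2 test system below are purely illustrative, and this empirical sampling is not a substitute for the analytical bounds the paper computes.

import math

def round_to_bits(x, p):
    # round x to p significand bits
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                   # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 2**p) / 2**p, e)

def jacobi_2x2(A, b, p, iters=50):
    x = [0.0, 0.0]
    for _ in range(iters):
        x0 = round_to_bits((b[0] - round_to_bits(A[0][1] * x[1], p)) / A[0][0], p)
        x1 = round_to_bits((b[1] - round_to_bits(A[1][0] * x[0], p)) / A[1][1], p)
        x = [x0, x1]
    return x

A, b = [[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]
exact = [1.0 / 11.0, 7.0 / 11.0]           # exact solution of A x = b
for p in (8, 16, 24, 53):
    x = jacobi_2x2(A, b, p)
    rel_err = max(abs(xi - ei) / abs(ei) for xi, ei in zip(x, exact))
    print(p, rel_err)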
An Integrated Reduction Technique for a Double Precision Accumulator
"... The accumulation operation, An+1 = A n + X, is perhaps one of the most fundamental and widely-used operations in numerical mathematics and digital signal processing. However, designing double-precision floating-point accumulators presents a unique set of challenges: double-precision addition is usua ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
The accumulation operation, A_{n+1} = A_n + X, is perhaps one of the most fundamental and widely-used operations in numerical mathematics and digital signal processing. However, designing double-precision floating-point accumulators presents a unique set of challenges: double-precision addition is usually deeply pipelined, and without special micro-architectural or data scheduling techniques, the data hazard that exists between A_{n+1} and A_n requires that each new value of X delivered to the accumulator wait for the latency of the adder. There have been several techniques proposed for alleviating this problem, but each carries significant overheads and/or restrictions on input characteristics. In this paper we present a design for a double precision accumulator that requires no timing overhead relative to the underlying add operation. We achieve this by integrating a coalescing reduction circuit within the low-level design of a base-converting floating-point adder. To demonstrate our accumulator design, we use it in a sparse matrix vector multiplication architecture, achieving a throughput of up to 3.7 GFLOPS.