Results 1–10 of 15
High Performance Linear Algebra Operations on Reconfigurable Systems, 2005
Abstract
Cited by 29 (4 self)
Field-Programmable Gate Arrays (FPGAs) have become an attractive option for scientific computing. Several vendors have developed high-performance reconfigurable systems which employ FPGAs for application acceleration. In this paper, we propose a BLAS (Basic Linear Algebra Subprograms) library for state-of-the-art reconfigurable systems. We study three data-intensive operations: dot product, matrix-vector multiply, and dense matrix multiply. The first two operations are I/O bound, and our designs efficiently utilize the available memory bandwidth in the systems. As these operations require accumulation of sequentially delivered floating-point values, we develop a high-performance reduction circuit. This circuit uses only one floating-point adder and buffers of moderate size. For the matrix multiply operation, we propose a design which employs a linear array of FPGAs. This design exploits the memory hierarchy in the reconfigurable systems and has very low memory bandwidth requirements. To illustrate our ideas, we have implemented our designs for Level 2 and Level 3 BLAS on the Cray XD1.
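The accumulation problem this abstract describes can be sketched in software: a deeply pipelined floating-point adder with latency L cannot consume its own output for L cycles, so a streaming design keeps L independent partial sums and merges them once the stream ends. A minimal Python model of that idea (illustrative only; the function name and default latency are assumptions, not the paper's circuit):

```python
def pipelined_reduction(values, latency=8):
    """Reduce a stream using a single adder with the given pipeline latency.

    Each pipeline slot accumulates its own partial sum, so consecutive
    additions into the same slot are `latency` cycles apart and never
    collide in the adder pipeline.
    """
    partials = [0.0] * latency          # one partial sum per pipeline slot
    for i, v in enumerate(values):
        slot = i % latency              # round-robin over pipeline slots
        partials[slot] += v
    # Final merge of the surviving partial sums (log-depth in hardware).
    total = 0.0
    for p in partials:
        total += p
    return total

print(pipelined_reduction([1.0] * 100))  # -> 100.0
```

In hardware the final merge reuses the same adder with a small buffer, which is why the abstract can claim a single floating-point adder and buffers of moderate size.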
High-performance reduction circuits using deeply pipelined operators on FPGAs
IEEE Trans. Parallel Distrib. Syst., 2007
Design Tradeoffs for BLAS Operations on Reconfigurable Hardware
In ICPP ’05: Proceedings of the 2005 International Conference on Parallel Processing, 2005
Abstract
Cited by 13 (1 self)
Numerical linear algebra operations are key primitives in scientific computing. Performance optimizations of such operations have been extensively investigated, and some basic operations have been implemented as software libraries. With the rapid advances in technology, hardware acceleration of linear algebra applications using FPGAs (Field-Programmable Gate Arrays) has become feasible. In this paper, we propose FPGA-based designs for several BLAS operations, including vector product, matrix-vector multiply, and matrix multiply. By identifying the design parameters for each BLAS operation, we analyze the design tradeoffs. In the implementations of the designs, the values of the design parameters are determined according to the hardware constraints, such as the available area, the size of on-chip memory, the external memory bandwidth, and the number of I/O pins. The proposed designs are implemented on a Xilinx Virtex-II Pro FPGA.
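The bandwidth constraint this abstract lists can be made concrete with a back-of-the-envelope model (an illustration, not the paper's analysis): for y = Ax, each 8-byte matrix element streamed from external memory supports one multiply-add, so sustained performance is capped by memory bandwidth no matter how much parallelism the fabric offers.

```python
def mv_peak_gflops(bandwidth_gb_per_s, bytes_per_elem=8):
    """Bandwidth-bound ceiling for matrix-vector multiply:
    2 flops (multiply + add) per matrix element read from memory."""
    return 2.0 * bandwidth_gb_per_s / bytes_per_elem

# e.g. a hypothetical 3.2 GB/s memory channel caps double-precision
# matrix-vector multiply at 0.8 GFLOPS.
print(mv_peak_gflops(3.2))  # -> 0.8
```

This is why the vector-product and matrix-vector designs are tuned to the external memory bandwidth, while matrix multiply is instead tuned to on-chip area and memory size.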
Group-alignment based accurate floating-point summation on FPGAs
In ERSA’06, 2006
Abstract
Cited by 6 (0 self)
Floating-point summation is one of the most important operations in scientific/numerical computing applications and also a basic subroutine (SUM) in the BLAS (Basic Linear Algebra Subprograms) library. However, summation algorithms based on standard floating-point arithmetic may not always produce accurate results because of possible catastrophic cancellations. To make the situation worse, the order of consecutive additions affects the final result, which makes it impossible to produce a unique solution for the same input dataset on different computer platforms with different software compilers. The emergence of high-density reconfigurable hardware devices gives us an option to customize high-performance arithmetic units for specific computing problems. In this paper, we design an FPGA-based hardware algorithm for accurate floating-point summation using a group-alignment technique. The corresponding fully pipelined summation unit is shown to yield numerical errors similar to or smaller than standard floating-point arithmetic. Moreover, it consumes far fewer reconfigurable-computing resources and pipeline stages than existing designs, while achieving the optimal rate of one summation per clock cycle with only moderate startup latency. This technique can also be used to accelerate other linear algebra subroutines on FPGAs, yielding more efficient and compact implementations without a negative impact on computational performance or numerical accuracy.
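The group-alignment idea can be sketched in software (a hedged reconstruction of the general technique, not the paper's RTL; the accumulator width is an assumed parameter): align every addend to the largest exponent in the group, accumulate the shifted significands in one wide fixed-point register, and round only once at the end. Because the fixed-point sum is exact up to the accumulator width, the result no longer depends on the order of additions.

```python
import math

def group_align_sum(xs, acc_bits=80):
    """Sum floats by aligning all of them to the group's largest exponent,
    accumulating in a wide (acc_bits) fixed-point integer register."""
    if not xs:
        return 0.0
    # Largest binary exponent in the group (0 if all inputs are zero).
    e_max = max(math.frexp(x)[1] for x in xs if x != 0.0) if any(xs) else 0
    scale = 1 << acc_bits
    # Each addend becomes an integer multiple of 2**(e_max - acc_bits);
    # values far smaller than the group maximum are rounded away, which
    # models the finite accumulator width of the hardware unit.
    acc = sum(int(round(x * scale / 2.0**e_max)) for x in xs)
    return acc * 2.0**e_max / scale
```

Note the contrast with naive summation: for inputs like `[1e16, 1.0, -1e16]`, naive left-to-right addition loses the `1.0` to cancellation, while the aligned fixed-point accumulation recovers it.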
BLAS Comparison on FPGA, CPU and GPU
Abstract
Cited by 3 (0 self)
High Performance Computing (HPC) and scientific codes are being executed across a wide variety of computing platforms, from embedded processors to massively parallel GPUs. We present a comparison of the Basic Linear Algebra Subroutines (BLAS) using double-precision floating point on an FPGA, a CPU, and a GPU. On the CPU and GPU, we utilize standard libraries on state-of-the-art devices. On the FPGA, we have developed parameterized modular implementations for the dot-product and Gaxpy, or matrix-vector multiplication. In order to obtain optimal performance for any aspect ratio of the matrices, we have designed a high-throughput accumulator to perform an efficient reduction of floating-point values. To support scalability to large data sets, we target the BEE3 FPGA platform. We use performance and energy efficiency as metrics to compare the different platforms. Results show that FPGAs offer comparable performance as well as 2.7 to 293 times better energy efficiency for the test cases that we implemented on all three platforms.
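For orientation, the Gaxpy kernel compared across the three platforms is simply y ← y + Ax; the FPGA design streams the matrix row by row and reduces each row's products into one output element. A plain-software reference (a sketch of the operation itself, not the authors' FPGA module):

```python
def gaxpy(A, x, y):
    """y <- y + A x, with A given as a list of rows (updates y in place)."""
    for i, row in enumerate(A):
        acc = 0.0
        for a, b in zip(row, x):   # stream one row's elements
            acc += a * b           # multiply-accumulate per element
        y[i] += acc
    return y

print(gaxpy([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], [0.0, 0.0]))  # -> [3.0, 7.0]
```

The per-row accumulation is exactly where a high-throughput reduction circuit matters: for wide, short matrices each row's reduction is short, so a naive pipelined adder would stall between rows.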
High-Precision BLAS on FPGA-enhanced Computers
Abstract
Cited by 1 (0 self)
The emergence of high-density reconfigurable hardware devices gives scientists and engineers the option of accelerating their numerical computing applications on low-cost but powerful “FPGA-enhanced computers”. In this paper, we present our efforts toward improving the computational performance of Basic Linear Algebra Subprograms (BLAS) with FPGA-specific algorithms and methods. Our study focuses on three BLAS subroutines: floating-point summation, matrix-vector multiplication, and matrix-matrix multiplication. They represent all three levels of BLAS functionality, and their sustained computational performance is either memory-bandwidth bound or computation bound. By proposing the group-alignment based floating-point summation method and applying this technique to the other subroutines, we significantly improve their sustained computational performance and reduce numerical errors with moderate FPGA resource consumption. Compared with existing FPGA-based implementations, our designs are efficient and compact, with improved numerical accuracy and stability.
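Of the three subroutines, only matrix-matrix multiply (Level 3) is computation bound, because blocks held in on-chip memory can be reused many times before being evicted. A generic blocked multiply illustrates the reuse pattern (the block size `bs` stands in for on-chip memory capacity; this is a textbook sketch, not the paper's design):

```python
def blocked_matmul(A, B, n, bs):
    """C = A * B for n x n matrices, computed block by block so that each
    bs x bs tile, once loaded on-chip, is reused across an entire block row."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]            # loaded once, reused bs times
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C
```

Each element is read O(n/bs) times from external memory instead of O(n) times, which is what moves the kernel from bandwidth bound to computation bound.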
Jacobi load flow accelerator using FPGA
Proceedings of the 37th Annual North American Power Symposium, 2005
Abstract
Cited by 1 (0 self)
Full-AC load flow is a crucial task in power system analysis. Solving full-AC load flow relies on iterative numerical methods such as Jacobi, Gauss-Seidel, or Newton-Raphson. Newton-Raphson is currently the preferred solver in industrial applications such as Power World and PSS/E due to its faster convergence than either Jacobi or Gauss-Seidel. In this paper, we re-examine the Jacobi method for use in a fully pipelined hardware implementation using a Field Programmable Gate Array (FPGA) as an alternative to Newton-Raphson. Using benchmark data from representative power systems, we compare the operation counts of Newton-Raphson software to the proposed Jacobi FPGA hardware. Our studies show that the Jacobi method, implemented in an FPGA for a sufficiently large power system, has the potential to be a state-of-the-art full-AC load flow engine.
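The appeal of Jacobi for a fully pipelined design is that every unknown's update reads only the previous iterate, so all updates are mutually independent. A textbook sketch for a linear system Ax = b shows the structure (the actual load-flow problem is nonlinear; this illustrates only the parallel update pattern, not the paper's formulation):

```python
def jacobi(A, b, iters=50):
    """Jacobi iteration for Ax = b (A diagonally dominant, as lists of rows)."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        # All n updates read the previous iterate x, so they can proceed
        # in parallel -- unlike Gauss-Seidel, where each update depends on
        # the ones computed before it in the same sweep.
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x
```

That independence is exactly what a deep hardware pipeline exploits: one bus update can enter the pipeline every clock cycle, trading Jacobi's slower convergence for far higher per-iteration throughput.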
FPGA-Based, Floating-Point Reduction Operations
Abstract
Cited by 1 (0 self)
Floating-point reduction operations are a vital part of scientific computational kernels such as vector dot-products, discrete cosine transforms (DCT), and matrix-matrix multiplications. As FPGAs continue to gain popularity in custom and embedded computing platforms, implementations of these applications on such platforms are desirable. Due to the inherently deep pipelines of high-performance floating-point cores on FPGAs, reduction circuits require special feedback and buffering schemes in order to realize full throughput. In this paper, we present our floating-point reduction architecture, clocked at more than 150 MHz targeting a Xilinx Virtex-2 8000-4 FPGA.
An Improved Reduction Algorithm With Deeply Pipelined Operators
Abstract
Many scientific applications involve reduction or accumulation operations on sequential data streams. Examples such as matrix-vector multiplication include multiple inner-product operations on different data sets. If the core operator of the reduction is deeply pipelined, which is usually the case, dependencies between the input data cause data hazards in the pipeline and call for a careful design. In this paper, we propose a modified design of the reduction operation based on Sips and Lin's method. We analyze the performance of the proposed design to prove the correctness of the timing, and we demonstrate its performance against previous methods.
Accelerating DTI Tractography using FPGAs
Abstract
Diffusion Tensor Imaging (DTI) tractography in Magnetic Resonance Imaging (MRI) is a computationally intensive procedure, requiring on the order of tens of minutes to complete tractography of the entire brain. Tractography computations can be accelerated significantly by the use of reconfigurable hardware such as Field Programmable Gate Arrays (FPGAs). Such acceleration has the potential to lead to real-time tractography, which would greatly facilitate on-site diagnosis and the acquisition of additional scans while the patient is still inside the scanner. In this paper we report the development of an FPGA-based architecture to accelerate DTI tractography. We identify computationally intensive kernels and design pipelined implementations. Our performance analysis based on the developed architecture shows on the order of a 100x speedup over an optimized C implementation of tractography on a state-of-the-art processor.
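One of the repeated kernels in generic streamline tractography is extracting the principal diffusion direction at each voxel and stepping along it. The sketch below shows that kernel in plain Python (an illustration of standard DTI streamline tracking under assumed names and step size, not the paper's architecture):

```python
import math

def principal_direction(tensor):
    """Dominant eigenvector of a symmetric 3x3 diffusion tensor,
    found by simple power iteration (assumes a dominant eigenvalue)."""
    v = [1.0, 1.0, 1.0]
    for _ in range(50):
        w = [sum(tensor[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v

def euler_step(pos, tensor, step=0.5):
    """One Euler integration step of a streamline along the local
    principal diffusion direction."""
    d = principal_direction(tensor)
    return [p + step * c for p, c in zip(pos, d)]
```

Each streamline repeats this eigenvector-and-step loop thousands of times per fiber, which is why pipelining the per-voxel kernel in hardware pays off so directly.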