Results 1–10 of 60
High Performance Linear Algebra Operations on Reconfigurable Systems
, 2005
Abstract

Cited by 29 (4 self)
Field-Programmable Gate Arrays (FPGAs) have become an attractive option for scientific computing. Several vendors have developed high-performance reconfigurable systems which employ FPGAs for application acceleration. In this paper, we propose a BLAS (Basic Linear Algebra Subprograms) library for state-of-the-art reconfigurable systems. We study three data-intensive operations: dot product, matrix-vector multiply, and dense matrix multiply. The first two operations are I/O bound, and our designs efficiently utilize the available memory bandwidth in the systems. As these operations require accumulation of sequentially delivered floating-point values, we develop a high-performance reduction circuit. This circuit uses only one floating-point adder and buffers of moderate size. For the matrix multiply operation, we propose a design which employs a linear array of FPGAs. This design exploits the memory hierarchy in the reconfigurable systems, and has very low memory bandwidth requirements. To illustrate our ideas, we have implemented our designs for Level 2 and Level 3 BLAS on the Cray XD1.
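The accumulation problem this abstract alludes to can be modelled in software. With a deeply pipelined adder, adding each new value into a single running sum would stall the pipeline; one common scheme (a sketch of the general idea, not necessarily the authors' exact circuit, with a hypothetical pipeline depth `ALPHA`) interleaves several partial sums so the adder stays busy:

```python
# Software model of accumulating a stream through a pipelined adder.
# Assumption (not from the paper): the adder has ALPHA pipeline stages,
# so an addition into the same partial sum can issue only every ALPHA
# cycles. Keeping ALPHA interleaved partial sums keeps the pipeline
# full; a short combine pass merges them at the end.

ALPHA = 8  # hypothetical adder pipeline depth

def pipelined_sum(stream):
    partials = [0.0] * ALPHA
    for i, x in enumerate(stream):
        partials[i % ALPHA] += x   # each lane is touched every ALPHA cycles
    # final combine: ALPHA - 1 extra additions
    total = 0.0
    for p in partials:
        total += p
    return total

print(pipelined_sum(range(1, 101)))  # → 5050.0
```

Note that this reorders the additions, which is why floating-point reduction circuits need care about reproducibility as well as hazards.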
High-performance reduction circuits using deeply pipelined operators on FPGAs
 IEEE Trans. Parallel Distrib. Syst.,
, 2007
Pipelined mixed precision algorithms on FPGAs for fast and accurate PDE solvers from low precision components
 In IEEE Proceedings on Field-Programmable Custom Computing Machines (FCCM)
, 2006
Abstract

Cited by 16 (2 self)
FPGAs are becoming more and more attractive for high-precision scientific computations. One of the main problems in efficient resource utilization is the quadratically growing resource usage of multipliers with operand size. Many research efforts have been devoted to the optimization of individual arithmetic and linear algebra operations. In this paper we take a higher-level approach and seek to reduce the intermediate computational precision at the algorithmic level by optimizing the accuracy towards the final result of an algorithm. In our case this is the accurate solution of partial differential equations (PDEs). Using the Poisson problem as a typical PDE example, we show that most intermediate operations can be computed with floats or even smaller formats, and only very few operations (e.g. 1%) must be performed in double precision to obtain the same accuracy as a full double-precision solver. Thus the FPGA can be configured with many parallel float rather than few resource-hungry double operations. To achieve this, we adapt the general concept of mixed precision iterative refinement methods to FPGAs and develop a fully pipelined version of the Conjugate Gradient solver. We combine this solver with different iterative refinement schemes and precision combinations to obtain resource-efficient mappings of the pipelined algorithm core onto the FPGA.
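The general mixed precision iterative refinement scheme the paper adapts can be summarized in a few lines: solve in low precision, compute the residual in high precision, and repeat. This sketch uses a dense float32 inner solve and an illustrative 1-D Poisson matrix (not the authors' FPGA pipeline or benchmark):

```python
import numpy as np

# Mixed precision iterative refinement: the inner solve runs in float32,
# the residual is computed in float64, and the corrections recover
# double-precision accuracy. The 1-D Poisson matrix is an illustrative
# choice, not taken from the paper.

def poisson_1d(n):
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def mixed_precision_solve(A, b, iters=5):
    A32 = A.astype(np.float32)
    x = np.zeros_like(b)
    for _ in range(iters):
        r = b - A @ x                                    # residual in float64
        d = np.linalg.solve(A32, r.astype(np.float32))   # correction in float32
        x = x + d.astype(np.float64)
    return x

n = 50
A = poisson_1d(n)
b = np.ones(n)
x = mixed_precision_solve(A, b)
print(np.max(np.abs(A @ x - b)))  # residual near double-precision level
```

Most of the arithmetic happens inside the float32 solve, which is exactly the property that lets an FPGA trade a few double units for many float units.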
A hybrid approach for mapping conjugate gradient onto an FPGA-augmented reconfigurable supercomputer
 in Proceedings of the 14th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’06)
, 2006
Abstract

Cited by 15 (4 self)
Supercomputer companies such as Cray, Silicon Graphics, and SRC Computers now offer reconfigurable computer (RC) systems that combine general-purpose processors (GPPs) with field-programmable gate arrays (FPGAs). The FPGAs can be programmed to become, in effect, application-specific processors. These exciting supercomputers allow end-users to create custom computing architectures aimed at the computationally intensive parts of each problem. This report describes a parameterized, parallelized, deeply pipelined, dual-FPGA, IEEE-754 64-bit floating-point design for accelerating the conjugate gradient (CG) iterative method on an FPGA-augmented RC. The FPGA-based elements are developed via a hybrid approach that uses a high-level language (HLL)-to-hardware description language (HDL) compiler in conjunction with custom-built, VHDL-based, floating-point components. A reference version of the design is implemented on a contemporary RC. Actual run-time performance data compare the FPGA-augmented CG to the software-only version and show that the FPGA-based version runs 1.3 times faster than the software version. Estimates show that the design can achieve a 4-fold speedup on a next-generation RC.
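For reference, the conjugate gradient method these designs accelerate is the textbook iteration for symmetric positive definite systems; this sketch is the standard algorithm, not the paper's dual-FPGA pipeline:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Textbook CG for a symmetric positive definite matrix A."""
    x = np.zeros_like(b)
    r = b - A @ x          # initial residual
    p = r.copy()           # initial search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)      # step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p  # conjugate direction update
        rs = rs_new
    return x

# Small SPD test system
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
print(np.allclose(A @ x, b))  # → True
```

The dominant cost per iteration is the matrix-vector product `A @ p` plus two dot products, which is why CG maps naturally onto pipelined floating-point hardware.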
Designing Scalable FPGA-Based Reduction Circuits Using Pipelined Floating-Point Cores
, 2005
Abstract

Cited by 15 (3 self)
The use of pipelined floating-point arithmetic cores to create high-performance FPGA-based computational kernels has introduced a new class of problems that do not exist when using single-cycle arithmetic cores. In particular, the data hazards associated with pipelined floating-point reduction circuits can limit the scalability or severely reduce the performance of an otherwise high-performance computational kernel. The inability to efficiently execute the reduction in hardware, coupled with memory bandwidth issues, may even negate the performance gains derived from hardware acceleration of the kernel. In this paper we introduce a method for developing scalable floating-point reduction circuits that run in optimal time while requiring only Θ(lg(n)) space and a single pipelined floating-point unit. Using a Xilinx Virtex-II Pro as the target device, we implement reference instances of our reduction method and present the FPGA design statistics supporting our scalability claims.
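One plausible realization of the Θ(lg(n))-space idea (a sketch of the general technique, not necessarily this paper's exact circuit) buffers one partial sum per "level" and merges two partials of equal size whenever they meet, like carries propagating in a binary counter, so at most ⌈lg n⌉ partials are ever live:

```python
def log_space_reduce(stream):
    # levels[k] holds a partial sum covering 2**k inputs, or None.
    # Merging equal-sized partials bounds len(levels) by ceil(lg n) + 1.
    levels = []
    for x in stream:
        carry, k = x, 0
        while k < len(levels) and levels[k] is not None:
            carry += levels[k]          # merge two equal-sized partials
            levels[k] = None
            k += 1
        if k == len(levels):
            levels.append(None)
        levels[k] = carry
    # combine the at most ceil(lg n) + 1 surviving partials
    return sum(p for p in levels if p is not None)

print(log_space_reduce(range(1, 101)))  # → 5050
```

In hardware, each `levels[k]` slot becomes a small buffer feeding the single pipelined adder, which is how the method gets by with one floating-point unit.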
Architectural Modifications to Enhance the Floating-Point Performance of FPGAs
Abstract

Cited by 13 (0 self)
Abstract—With the density of FPGAs steadily increasing, FPGAs have reached the point where they are capable of implementing complex floating-point applications. However, their general-purpose nature has limited the use of FPGAs in scientific applications that require floating-point arithmetic, due to the large amount of FPGA resources that floating-point operations still require. This paper considers three architectural modifications that make floating-point operations more efficient on FPGAs. The first modification embeds floating-point multiply-add units in an island-style FPGA. While offering a dramatic reduction in area and improvement in clock rate, these embedded units have the potential to waste significant silicon for non-floating-point applications. The next two modifications target a major component of IEEE-compliant floating-point computations: variable-length shifters. The first alternative to LUTs for implementing the variable-length shifters is a coarse-grained approach: embedded variable-length shifters in the FPGA fabric. These shifters offer a significant reduction in area with a modest increase in clock rate and a relatively small potential for wasted silicon. The next alternative is a fine-grained approach: adding a 4:1 multiplexer unit inside the slices, in parallel to the 4-LUTs. While this offers the smallest overall area improvement, it does offer a significant improvement in clock rate with only a trivial increase in the size of the CLB. Index Terms—Reconfigurable architecture, floating-point arithmetic, FPGA.
High-Performance Designs for Linear Algebra Operations on Reconfigurable Hardware
Abstract

Cited by 11 (0 self)
Abstract—Numerical linear algebra operations are key primitives in scientific computing. Performance optimizations of such operations have been extensively investigated. With the rapid advances in technology, hardware acceleration of linear algebra applications using field-programmable gate arrays (FPGAs) has become feasible. In this paper, we propose FPGA-based designs for several basic linear algebra operations, including dot product, matrix-vector multiplication, matrix multiplication, and matrix factorization. By identifying the parameters for each operation, we analyze the trade-offs and propose a high-performance design. In the implementations of the designs, the values of the parameters are determined according to the hardware constraints, such as the available chip area, the size of available memory, the memory bandwidth, and the number of I/O pins. The proposed designs are implemented on Xilinx Virtex-II Pro FPGAs. Experimental results show that our designs scale with the available hardware resources. Also, the performance of our designs compares favorably with that of general-purpose processor-based designs. We also show that, with faster floating-point units and larger devices, the performance of our designs increases accordingly. Index Terms—Reconfigurable hardware, computations on matrices, parallel algorithms.
Sampling from the Multivariate Gaussian Distribution using Reconfigurable Hardware
 In Field-Programmable Custom Computing Machines (FCCM)
, 2007
Abstract

Cited by 10 (2 self)
The multivariate Gaussian distribution models random processes as vectors of Gaussian samples with a fixed correlation matrix. Such distributions are useful for modelling real-world multivariate time-series such as equity returns, where the returns for businesses in the same sector are likely to be correlated. Generating random samples from such a distribution presents a computational challenge due to the dense matrix-vector multiplication needed to introduce the required correlations. This paper proposes a hardware architecture for generating random vectors, utilising the embedded block RAMs and multipliers found in contemporary FPGAs. The approach generates a new n-dimensional random vector every n clock cycles, and has a raw generation rate over 200 times that of a single 2.2 GHz Opteron using an optimised BLAS package for linear algebra computation. The generation architecture is an ideal source both for software simulations connected via a high-bandwidth connection and for completely FPGA-based simulations. Practical performance is explored in a case study in Delta-Gamma Value-at-Risk, where a standalone Virtex-4 xc4vsx55 solution at 400 MHz is 33 times faster than a quad Opteron 2.2 GHz SMP. The FPGA solution also scales well for larger problem sizes, allowing larger portfolios to be simulated.
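The matrix-vector step the abstract refers to is the standard Cholesky construction: draw a vector z of independent standard normals and output x = μ + Lz, where LLᵀ = Σ. A minimal software version (the example covariance is illustrative, not from the paper; the hardware pipelines the Lz product):

```python
import numpy as np

# Cholesky construction for correlated Gaussian vectors: if L @ L.T
# equals Sigma and z ~ N(0, I), then mu + L @ z ~ N(mu, Sigma).

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])   # illustrative correlated pair
mu = np.zeros(2)
L = np.linalg.cholesky(Sigma)                # lower-triangular factor

def sample(n):
    z = rng.standard_normal((n, 2))          # independent N(0, 1) draws
    return mu + z @ L.T                      # dense matrix-vector step per row

xs = sample(100_000)
print(np.corrcoef(xs.T)[0, 1])  # empirical correlation, near 0.8
```

The dense Lz multiply is O(n²) per vector, which is why the FPGA's embedded multipliers pay off as the dimension grows.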
Scalable Hybrid Designs for Linear Algebra on Reconfigurable Computing Systems
Abstract

Cited by 8 (3 self)
Recently, reconfigurable computing systems have been built which employ Field-Programmable Gate Arrays (FPGAs) as hardware accelerators for general-purpose processors. These systems provide new opportunities for high-performance computing. In this paper, we investigate hybrid designs that effectively utilize both the FPGAs and the processors in reconfigurable computing systems. Based on a high-level computational model, we propose designs for floating-point matrix multiplication and block LU decomposition. In our designs, the workload of an application is partitioned between the FPGAs and processors in a balanced way; the FPGAs and processors work cooperatively without data hazards or memory access conflicts. Experimental results on the Cray XD1 show that with one Xilinx XC2VP50 FPGA (a relatively small device available in the XD1) and a 2.2 GHz AMD processor, our designs achieve up to 1.4X/2X speedup over designs that employ AMD processors/FPGAs only. The performance of our designs scales with the number of nodes. Moreover, our designs achieve higher performance when improved floating-point units or larger devices are used.
Portable and Scalable FPGA-Based Acceleration of a Direct Linear System Solver
Abstract

Cited by 8 (1 self)
FPGAs are becoming an attractive platform for accelerating many computations, including scientific applications. However, their adoption has been limited by the large development cost and short life span of FPGA designs. We believe that FPGA-based scientific computation would become far more practical if there were hardware libraries that were portable to any FPGA, with performance that could scale with the resources of the FPGA. To illustrate this idea we have implemented one common supercomputing function: the LU factorization method for solving linear systems. This dissertation discusses issues in making the design both portable and scalable. The design is automatically generated to match the FPGA's capabilities and external memory through the use of parameters. We compared the performance of the design on the FPGA to a single processor core and found that it performs 2.2 times faster, and that the energy dissipated per computation is 5 times less.
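The LU factorization underlying such solvers is the standard right-looking elimination; a plain unpivoted sketch (the thesis design is blocked and hardware-generated, this is just the textbook update, valid here because the test matrix is diagonally dominant):

```python
import numpy as np

# Right-looking LU without pivoting: at step k, scale the k-th column
# below the diagonal to form L, then apply a rank-1 update to the
# trailing submatrix. Blocked and pivoted variants build on the same
# update. Generic sketch, not the thesis design.

def lu(A):
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(n - 1):
        A[k+1:, k] /= A[k, k]                              # column of L
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])  # trailing update
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return L, U

def solve(A, b):
    L, U = lu(A)
    y = np.linalg.solve(L, b)     # forward substitution (delegated for brevity)
    return np.linalg.solve(U, y)  # back substitution

A = np.array([[4.0, 1.0, 0.0], [1.0, 5.0, 2.0], [0.0, 2.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
x = solve(A, b)
print(np.allclose(A @ x, b))  # → True
```

The trailing-submatrix update dominates the O(n³) work and is the part that scales naturally with the parallel resources of a larger FPGA.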