High-performance reduction circuits using deeply pipelined operators on FPGAs
IEEE Trans. Parallel Distrib. Syst., 2007
Scalable Hybrid Designs for Linear Algebra on Reconfigurable Computing Systems
Cited by 8 (3 self)
Abstract: Recently, reconfigurable computing systems have been built which employ Field-Programmable Gate Arrays (FPGAs) as hardware accelerators for general-purpose processors. These systems provide new opportunities for high-performance computing. In this paper, we investigate hybrid designs that effectively utilize both the FPGAs and processors in the reconfigurable computing systems. Based on a high-level computational model, we propose designs for floating-point matrix multiplication and block LU decomposition. In our designs, the workload of an application is partitioned between the FPGAs and processors in a balanced way; the FPGAs and processors work cooperatively without data hazards or memory access conflicts. Experimental results on Cray XD1 show that with one Xilinx XC2VP50 FPGA (a relatively small device available in XD1) and an AMD 2.2 GHz processor, our designs achieve up to 1.4X/2X speedup over the design that employs AMD processors/FPGAs only. The performance of our designs scales with the number of nodes. Moreover, our designs achieve higher performance when improved floating-point units or larger devices are used.
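The block LU decomposition named in this abstract builds on the standard LU factorization. As a point of reference only, a minimal pure-software sketch (not the paper's partitioned FPGA/processor design; assumes nonzero pivots, so no pivoting) is:

```python
# Doolittle LU factorization A = L*U, no pivoting (assumes nonzero pivots).
# A software reference point; the paper partitions a blocked variant of
# this computation between processors and FPGAs.
def lu(A):
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [row[:] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]       # multiplier stored in L
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]  # eliminate row i below pivot k
    return L, U

L, U = lu([[4.0, 3.0], [6.0, 3.0]])  # L is unit lower, U upper triangular
```

The hybrid designs in the paper distribute the trailing-submatrix updates (the inner loops here), which dominate the arithmetic, between the FPGA and the processor.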
Sparse matrix computations on reconfigurable hardware
Computer, 2007
Cited by 7 (1 self)
Abstract: Using a high-level-language to hardware-description-language compiler and some novel architectures and algorithms to map two well-known double-precision floating-point sparse matrix iterative linear-equation solvers—the Jacobi and conjugate gradient methods—onto a reconfigurable computer achieves more than a twofold speedup over software. Researchers at the US Army Engineer Research and Development Center and the University of Southern California are focusing on algorithms and architectures to facilitate high-performance, reconfigurable computer-based scientific computing.
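Of the two solvers this abstract names, the Jacobi method is the simpler. A minimal pure-Python sketch (illustrative names, not the paper's hardware mapping; assumes a diagonally dominant matrix so the iteration converges):

```python
# Jacobi iteration for A x = b: each new x[i] is solved from row i using
# the previous iterate's values for the other unknowns.
def jacobi(A, b, iters=100):
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        x_new = [0.0] * n
        for i in range(n):
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x_new[i] = (b[i] - s) / A[i][i]
        x = x_new
    return x

x = jacobi([[4.0, 1.0], [2.0, 5.0]], [9.0, 16.0])  # converges toward the solution of A x = b
```

Every x_new[i] depends only on the previous iterate, which is what makes Jacobi attractive for the parallel hardware described in the paper.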
Evaluation of a high-level-language methodology for high-performance reconfigurable computers
Application-specific Systems, Architectures and Processors (ASAP), IEEE International Conf. on, 2007
Implementation of simulation algorithms in FPGA for real-time simulation of electrical networks with power electronics devices
IEEE Int. Conf. on Reconfigurable Computing and FPGAs, 2006
Hardware/Software Co-Design for Matrix Computations on Reconfigurable Computing Systems
2007
Cited by 1 (0 self)
Abstract: Recently, reconfigurable computing systems have been built which employ Field-Programmable Gate Arrays (FPGAs) as hardware accelerators for general-purpose processors. These systems provide new opportunities for scientific computations. However, the coexistence of the processors and the FPGAs in such systems also poses new challenges to application developers. In this paper, we investigate a design model for hybrid designs, that is, designs that utilize both the processors and the FPGAs. The model characterizes a reconfigurable computing system using various system parameters, including the floating-point computing power of the processor and the FPGA, the number of nodes, the memory bandwidth and the network bandwidth. Using the model, we investigate hardware/software co-design for two computationally intensive applications: matrix factorization and the all-pairs shortest-paths problem. Our designs balance the load between the processor and the FPGA, as well as overlap the computation time with memory transfer time and network communication time. The proposed designs are implemented on 6 nodes in a Cray XD1 chassis. Our implementations achieve 20 GFLOPS and 6.6 GFLOPS for these two applications, respectively.
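The all-pairs shortest-paths problem mentioned in this abstract is commonly solved with the Floyd–Warshall algorithm. A minimal software baseline (a sketch with illustrative names, not the paper's co-design):

```python
# Floyd-Warshall all-pairs shortest paths on an adjacency matrix of edge
# weights; dist[i][j] = float('inf') means no direct edge i -> j.
def floyd_warshall(dist):
    n = len(dist)
    d = [row[:] for row in dist]          # do not mutate the input
    for k in range(n):                    # allow k as an intermediate node
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d
```

The triply nested loop over a dense matrix gives the regular access pattern that makes this kernel a natural fit for FPGA acceleration.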
An Efficient Sparse Conjugate Gradient Solver Using a Beneš Permutation Network
Abstract: The conjugate gradient (CG) is one of the most widely used iterative methods for solving systems of linear equations. However, parallelizing CG for large sparse systems is difficult due to the inherent irregularity in the memory access pattern. We propose a novel processor architecture for the sparse conjugate gradient method. The architecture consists of multiple processing elements and memory banks, and is able to efficiently compute both sparse matrix-vector multiplication and other dense vector operations. A Beneš permutation network with an optimised control scheme is introduced to reduce memory bank conflicts without expensive logic. We describe a heuristic for offline scheduling, the effect of which is captured in a parametric model for estimating the performance of designs generated from our approach.
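For reference, the CG iteration such architectures accelerate can be sketched in a few lines; the irregular accesses the abstract refers to arise in the matrix-vector product when A is sparse. A minimal dense sketch (illustrative names; assumes A is symmetric positive-definite):

```python
# Conjugate gradient for A x = b with A symmetric positive-definite.
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def cg(A, b, iters=100, tol=1e-12):
    n = len(b)
    x = [0.0] * n
    r = b[:]                               # residual b - A*x for x = 0
    p = r[:]                               # initial search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(iters):
        Ap = matvec(A, p)
        alpha = rs / sum(pi * ai for pi, ai in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:                   # residual small enough: done
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

Each iteration is one matrix-vector product plus a handful of vector operations, which matches the split between the sparse and dense units the abstract describes.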
Hardware Acceleration for Sparse Fourier Image Reconstruction
2010
Abstract: Several supercomputer vendors now offer reconfigurable computing (RC) systems, combining general-purpose processors with field-programmable gate arrays (FPGAs). The FPGAs can be configured as custom computing architectures for the computationally intensive parts of each application. In this paper we present an RC-based hardware accelerator for an important medical imaging algorithm: iterative sparse Fourier image reconstruction. We transform the algorithm to exploit massive parallelism available in the FPGA fabric. Our design allows different ways of chaining custom pipelined vector engines, so that different computations can be carried out without reconfiguration overhead. Actual runtime performance data show that we achieve up to 10 times speedup compared to the software-only version. The design is estimated to provide even more speedup on a next-generation RC platform.
Optimising Memory Bandwidth Use and Performance for Matrix-Vector Multiplication in Iterative Methods
Abstract: Computing the solution to a system of linear equations is a fundamental problem in scientific computing, and its acceleration has drawn wide interest in the FPGA community [Morris et al. 2006; Zhang et al. 2008; Zhuo and Prasanna 2006]. One class of algorithms to solve these systems, iterative methods, has drawn particular interest, with recent literature showing large performance improvements over general-purpose processors (GPPs) [Lopes and Constantinides 2008]. In several iterative methods, this performance gain is largely a result of parallelisation of the matrix-vector multiplication, an operation that occurs in many applications and hence has also been widely studied on FPGAs [Zhuo and Prasanna 2005; El-Kurdi et al. 2006]. However, whilst the performance of matrix-vector multiplication on FPGAs is generally I/O bound [Zhuo and Prasanna 2005], the nature of iterative methods allows the use of on-chip memory buffers to increase the bandwidth, providing the potential for significantly more parallelism [deLorimier and DeHon 2005]. Unfortunately, existing approaches have generally either been capable of solving large matrices with only limited improvement over GPPs [Zhuo and Prasanna 2005; El-Kurdi et al. 2006; deLorimier and DeHon 2005], or achieve high performance only for relatively small matrices [Lopes and Constantinides 2008; Boland and Constantinides 2008]. This paper proposes hardware designs to take advantage of symmetric and banded matrix structure, as well as methods to optimise RAM use, in order both to increase performance and to retain that performance for larger-order matrices.
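As an illustration of the banded-structure optimisation the abstract proposes, a matrix-vector multiply need only visit entries inside the band. A minimal software sketch (illustrative names, not the paper's hardware design; `bw` is the half-bandwidth, assumed known):

```python
# Matrix-vector multiply exploiting banded structure: entries with
# |i - j| > bw are known to be zero and are never read, cutting both
# arithmetic and memory traffic from O(n^2) to O(n * bw).
def banded_matvec(A, x, bw):
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        for j in range(max(0, i - bw), min(n, i + bw + 1)):
            y[i] += A[i][j] * x[j]
    return y
```

In hardware, the same idea bounds the on-chip buffering a row needs, which is how banded structure helps the designs scale to larger matrix orders.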
An FPGA-Based Implementation of the MINRES Algorithm
Abstract: Due to continuous improvements in the resources available on FPGAs, it is becoming increasingly possible to accelerate floating-point algorithms. The solution of a system of linear equations forms the basis of many problems in engineering and science, but its calculation is highly time consuming. The minimum residual algorithm (MINRES) is one method to solve this problem, and is highly effective provided the matrix exhibits certain characteristics. This paper examines an IEEE 754 single-precision floating-point implementation of the MINRES algorithm on an FPGA. It demonstrates that through parallelisation and heavy pipelining of all floating-point components it is possible to achieve a sustained performance of up to 53 GFLOPS on the Virtex-5 330T. This compares favourably to other hardware implementations of floating-point matrix inversion algorithms, and corresponds to an improvement of nearly an order of magnitude compared to a software implementation.