Results 1–10 of 29
An FPGA-Specific Approach to Floating-Point Accumulation and Sum-of-Products
 FIELD-PROGRAMMABLE TECHNOLOGY, IEEE
, 2008
Abstract
Cited by 16 (7 self)
This article studies two common situations where the flexibility of FPGAs allows one to design application-specific floating-point operators which are more efficient and more accurate than those offered by processors and GPUs. First, for applications involving the addition of a large number of floating-point values, an ad-hoc accumulator is proposed. By tailoring its parameters to the numerical requirements of the application, it can be made arbitrarily accurate, at an area cost comparable to that of a standard floating-point adder, and at a higher frequency. The second example is the sum-of-product operation, which is the building block of matrix computations. A novel architecture is proposed that feeds the previous accumulator out of a floating-point multiplier whose rounding logic has been removed, again improving the area/accuracy tradeoff. These architectures are implemented within the FloPoCo generator, freely available under the LGPL.
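The accumulator described above gains accuracy by avoiding per-addition rounding: the running sum lives in a register wider than a float mantissa. A minimal software analogue of that idea can be sketched as follows (this is an illustration of the principle, not the FloPoCo hardware; the `frac_bits` width is a hypothetical parameter chosen to cover the inputs' fractional range):

```python
from fractions import Fraction

def wide_accumulator_sum(values, frac_bits=60):
    """Sum floats in a wide fixed-point accumulator (software sketch).

    Each input is converted exactly once to a scaled integer; the
    additions themselves never round, because Python ints have
    arbitrary precision, mimicking a hardware accumulator wider
    than the floating-point mantissa.
    """
    scale = 1 << frac_bits
    acc = 0
    for v in values:
        acc += int(Fraction(v) * scale)  # exact conversion, exact add
    return acc / scale

# Naive float summation loses the small term; the wide accumulator keeps it.
naive = sum([1e16, 1.0, -1e16])                    # 0.0: the 1.0 is rounded away
exact = wide_accumulator_sum([1e16, 1.0, -1e16])   # 1.0
```

The tailoring the abstract mentions corresponds to choosing the accumulator width (`frac_bits` and the integer range) from the application's known exponent range, so that no intermediate add can overflow or round.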
A High-Throughput FPGA-Based Floating-Point Conjugate Gradient Implementation
 ARC 2008. LNCS
, 2008
Abstract
Cited by 16 (5 self)
Abstract. As Field Programmable Gate Arrays (FPGAs) have reached capacities beyond millions of equivalent gates, it becomes possible to accelerate floating-point scientific computing applications. One type of calculation that is commonplace in scientific computation is the solution of systems of linear equations. A method that has proven in software to be very efficient and robust for finding such solutions is the Conjugate Gradient algorithm. In this paper we present a parallel hardware Conjugate Gradient implementation. The implementation is particularly suited to accelerating multiple small to medium sized dense systems of linear equations. Through parallelization it is possible to reduce the computation time per iteration for an order-n matrix from Θ(n²) cycles for a software implementation to Θ(n). I/O requirements are scalable and converge to a constant value as the matrix order increases. Results on a Virtex-II 6000 demonstrate sustained performance of 5 GFLOPS, and projected results on a Virtex-5 330 indicate sustained performance of 35 GFLOPS. The former result is comparable to high-end CPUs, whereas the latter represents a significant speedup.
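The Θ(n²)-cycles-per-iteration software figure comes from the dense matrix-vector product inside each Conjugate Gradient step, which the hardware parallelizes down to Θ(n). As a point of reference, the standard CG iteration being accelerated can be sketched in plain Python (this is the textbook algorithm, not the paper's hardware design):

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite matrix A
    (given as a list of rows). The A @ p product dominates each
    iteration: Theta(n^2) operations in software."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                    # residual b - A x, with x = 0
    p = list(r)                    # initial search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old < tol * tol:     # residual small enough: converged
            break
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x

# Small SPD example: 4x + y = 1, x + 3y = 2  →  x = 1/11, y = 7/11.
x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

In exact arithmetic CG converges in at most n iterations, so for well-conditioned systems the per-iteration cost is the quantity worth accelerating.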
Tutorial paper: Parallel architectures for model predictive control
 In: Proc. of the European Control Conference 2009
, 2009
Abstract
Cited by 16 (4 self)
Abstract — This tutorial paper surveys recent developments in parallel computer architecture, focusing on the field-programmable gate array and the graphics processor. We aim to illustrate the potential of these architectures for the type of high-speed numerical computation required in online optimization for model predictive control. While significant performance advantages can be gained by migrating existing control algorithms to these processor architectures, realising their full potential requires further research at the boundary of control theory, digital electronics, and computer architecture. We survey some of the open questions in this area.
Scalable Hybrid Designs for Linear Algebra on Reconfigurable Computing Systems
Abstract
Cited by 8 (3 self)
Recently, reconfigurable computing systems have been built which employ Field-Programmable Gate Arrays (FPGAs) as hardware accelerators for general-purpose processors. These systems provide new opportunities for high-performance computing. In this paper, we investigate hybrid designs that effectively utilize both the FPGAs and the processors in reconfigurable computing systems. Based on a high-level computational model, we propose designs for floating-point matrix multiplication and block LU decomposition. In our designs, the workload of an application is partitioned between the FPGAs and processors in a balanced way; the FPGAs and processors work cooperatively without data hazards or memory access conflicts. Experimental results on the Cray XD1 show that with one Xilinx XC2VP50 FPGA (a relatively small device available in the XD1) and a 2.2 GHz AMD processor, our designs achieve up to 1.4×/2× speedup over designs that employ only AMD processors or only FPGAs. The performance of our designs scales with the number of nodes. Moreover, our designs achieve higher performance when improved floating-point units or larger devices are used.
Sparse matrix computations on reconfigurable hardware
 Computer
, 2007
Abstract
Cited by 7 (1 self)
Using a high-level-language to hardware-description-language compiler and some novel architectures and algorithms to map two well-known double-precision floating-point sparse matrix iterative linear-equation solvers—the Jacobi and conjugate gradient methods—onto a reconfigurable computer achieves more than a twofold speedup over software. Researchers at the US Army Engineer Research and Development Center and the University of Southern California are focusing on algorithms and architectures to facilitate high-performance, reconfigurable-computer-based scientific computing.
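Of the two solvers named above, the Jacobi method is the simpler, and it illustrates why such kernels map well to hardware: each sweep reads only the previous iterate, so all n component updates are independent and can proceed in parallel. A plain-Python reference sketch (the textbook iteration, not the reconfigurable-computer implementation):

```python
def jacobi(A, b, sweeps=100):
    """Jacobi iteration for A x = b. Every update in a sweep depends
    only on the previous iterate, so the n updates are independent
    (parallelizable); convergence is guaranteed when A is strictly
    diagonally dominant."""
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x

# Diagonally dominant example: 4x + y = 1, x + 3y = 2  →  x = 1/11, y = 7/11.
x = jacobi([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

Conjugate gradient converges faster on symmetric positive-definite systems, but its dot-product reductions create the floating-point accumulation dependencies that the novel architectures mentioned above have to address.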
Promises and Pitfalls of Reconfigurable Supercomputing
Abstract
Cited by 4 (1 self)
Reconfigurable supercomputing (RSC) combines programmable logic chips with high-performance microprocessors, all communicating over a high-bandwidth, low-latency interconnection network. Reconfigurable hardware has demonstrated an order-of-magnitude speedup on compute-intensive kernels in science and engineering. However, translating high-level algorithms to programmable hardware is a formidable barrier to the use of these resources by scientific programmers. A library-based approach has been suggested, so that the software application can call standard library functions that have been optimized for hardware. The potential benefits of this approach are evaluated on several large scientific supercomputing applications. It is found that hardware linear algebra libraries would be of little benefit to the applications analyzed. To maximize the performance of supercomputing applications on RSC, it is necessary to identify kernels of high computational density that can be mapped to hardware, carefully partition software and hardware to reduce communication overhead, and optimize memory bandwidth on the FPGAs. Two case studies that follow this approach are summarized, and, based on experience with these applications, directions for future reconfigurable supercomputing architectures are outlined.
High-performance and parameterized matrix factorization on FPGAs
 In Proceedings of the 2006 International Conference on Field Programmable Logic and Applications
, 2007
Abstract
Cited by 4 (0 self)
FPGAs have become an attractive choice for scientific computing. In this paper, we propose a high-performance design for LU decomposition, a key kernel in many scientific and engineering applications. Our design achieves optimal performance for LU decomposition using the available hardware resources. The design is parameterized; thus, it can be easily adapted to various hardware constraints. Experimental results show that our design achieves high performance and offers good scalability. Our implementation on a Xilinx Virtex-II Pro XC2VP100 achieves superior sustained floating-point performance over existing FPGA-based implementations and over optimized libraries on state-of-the-art processors.
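LU decomposition factors A into a unit-lower-triangular L and an upper-triangular U, after which solving A x = b reduces to two cheap triangular solves. The kernel being accelerated can be stated compactly in software (Doolittle form without pivoting, which assumes nonzero pivots; the hardware design's scheduling is not shown here):

```python
def lu_decompose(A):
    """Doolittle LU factorization without pivoting: returns one matrix
    holding U on and above the diagonal and the multipliers of the
    unit-diagonal L below it. Assumes every pivot LU[k][k] is nonzero."""
    n = len(A)
    LU = [row[:] for row in A]        # work on a copy of A
    for k in range(n):                # eliminate column k
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]      # multiplier l_ik
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]
    return LU

# For A = [[4, 3], [6, 3]]: l21 = 6/4 = 1.5 and u22 = 3 - 1.5*3 = -1.5,
# so L·U reproduces A.
LU = lu_decompose([[4.0, 3.0], [6.0, 3.0]])
```

The triply nested loop is what makes the kernel attractive for a parameterized hardware design: the inner updates for a fixed k are independent across rows and columns.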
Performance-Energy Tradeoffs for Matrix Multiplication on FPGA-Based Mixed-Mode Chip Multiprocessors
 in Proceedings of the 8th International Symposium on Quality Electronic Design, 2007
Abstract
Cited by 3 (0 self)
Abstract—Chip multiprocessing has been demonstrated to be a promising approach in microprocessor design. With ever-increasing concerns about energy consumption, performance-energy tradeoffs are often necessary, especially in the design of real-time embedded systems. This paper presents our performance and energy study on an in-house-developed FPGA-based mixed-mode chip multiprocessor, in which the SIMD (Single-Instruction, Multiple-Data), MIMD (Multiple-Instruction, Multiple-Data) and MSIMD (Multiple-SIMD) computing modes can exist simultaneously in one system. We propose performance-energy tradeoff techniques based on the observation that SIMD and MIMD task executions involve substantially different amounts of computation and communication, which result in different time and energy behavior and provide opportunities to realize various performance-energy objectives. Generalized matrix-matrix multiplication (MMM) is employed as an example to illustrate our analysis. Experimental results on a Xilinx Virtex-II XC2V6000-5 FPGA demonstrate the effectiveness of the proposed approach.
BLAS Comparison on FPGA, CPU and GPU
Abstract
Cited by 3 (0 self)
Abstract—High Performance Computing (HPC) and scientific codes are being executed across a wide variety of computing platforms, from embedded processors to massively parallel GPUs. We present a comparison of the Basic Linear Algebra Subroutines (BLAS) using double-precision floating point on an FPGA, a CPU and a GPU. On the CPU and GPU, we utilize standard libraries on state-of-the-art devices. On the FPGA, we have developed parameterized modular implementations for the dot product and gaxpy, or matrix-vector multiplication. In order to obtain optimal performance for any aspect ratio of the matrices, we have designed a high-throughput accumulator to perform an efficient reduction of floating-point values. To support scalability to large data sets, we target the BEE3 FPGA platform. We use performance and energy efficiency as metrics to compare the different platforms. Results show that FPGAs offer comparable performance as well as 2.7 to 293 times better energy efficiency for the test cases that we implemented on all three platforms.
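The two BLAS kernels implemented on the FPGA are simple to state in software; the hardware challenge the abstract points to lies in the floating-point reduction (the summation) inside them. Plain-Python reference definitions of the kernels being compared:

```python
def dot(x, y):
    """BLAS Level-1 dot product: sum_i x_i * y_i. The summation is the
    reduction that the high-throughput accumulator performs in hardware."""
    return sum(xi * yi for xi, yi in zip(x, y))

def gaxpy(A, x, y):
    """BLAS Level-2 'gaxpy' (generalized A x plus y): returns y + A @ x,
    computed as one dot product per matrix row."""
    return [yi + dot(row, x) for row, yi in zip(A, y)]
```

Because a floating-point adder is pipelined over several cycles, feeding one running sum back into it creates a loop-carried dependency; the accumulator design mentioned above exists to keep that reduction at full throughput for any matrix aspect ratio.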
A model for matrix multiplication performance on FPGAs
 in Proc. International Conference on Field Programmable Logic and Applications (FPL)
Abstract
Cited by 2 (0 self)
Computations involving matrices form the kernel of a large spectrum of computationally demanding applications for which FPGAs have been utilized as accelerators. Their performance is related to underlying architectural and system parameters such as computational resources, memory and I/O bandwidth. A simple analytic model is presented that estimates the performance of FPGA-based sparse matrix-vector and matrix-matrix multiplication, with dense matrix multiplication as a special case. The efficiency of existing implementations is compared to the model, and performance trends for future technologies are examined.
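The abstract does not give the model's equations, but analytic models of this kind typically bound attainable throughput by the lower of a compute limit and a memory-traffic limit. A generic roofline-style sketch of that reasoning (hypothetical parameters; this is not the paper's actual model):

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Attainable performance is capped by whichever is lower: the raw
    compute rate of the processing elements, or the rate at which
    memory/I-O bandwidth can feed them (bandwidth times the kernel's
    arithmetic intensity in flops per byte)."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)

# Sparse matrix-vector multiply has low arithmetic intensity, so it is
# typically bandwidth-bound; blocked dense matrix multiply reuses data
# and becomes compute-bound. Parameter values below are illustrative only.
spmv  = attainable_gflops(peak_gflops=10.0, bandwidth_gb_s=2.0, flops_per_byte=0.25)
dgemm = attainable_gflops(peak_gflops=10.0, bandwidth_gb_s=2.0, flops_per_byte=8.0)
```

This is why the dense case appears as a special case in the model above: raising data reuse moves a kernel from the bandwidth-limited regime into the compute-limited one, where adding FPGA resources pays off directly.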