Results 1-10 of 11
Portable and Scalable FPGA-Based Acceleration of a Direct Linear System Solver
Abstract

Cited by 8 (1 self)
FPGAs are becoming an attractive platform for accelerating many computations, including scientific applications. However, their adoption has been limited by the large development cost and short life span of FPGA designs. We believe that FPGA-based scientific computation would become far more practical if there were hardware libraries that were portable to any FPGA, with performance that could scale with the resources of the FPGA. To illustrate this idea we have implemented one common supercomputing function: the LU factorization method for solving linear systems. This dissertation discusses issues in making the design both portable and scalable. The design is automatically generated to match the FPGA's capabilities and external memory through the use of parameters. We compared the performance of the design on the FPGA to a single processor core and found that it performs 2.2 times faster, and that the energy dissipated per computation is 5 times less.
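The abstract does not reproduce the algorithm itself. As a reference point, the dense kernel such accelerators parallelize can be sketched as a minimal unpivoted LU factorization; real solvers (and presumably the dissertation's design) add partial pivoting for numerical stability, which this sketch omits.

```python
import numpy as np

def lu_factor(A):
    """Unpivoted (Doolittle) LU factorization, illustrative only.
    Returns unit lower-triangular L and upper-triangular U with A = L @ U."""
    n = A.shape[0]
    L = np.eye(n)
    U = A.astype(float).copy()
    for k in range(n - 1):
        for i in range(k + 1, n):
            # multiplier that eliminates entry (i, k)
            L[i, k] = U[i, k] / U[k, k]
            U[i, k:] -= L[i, k] * U[k, k:]
    return L, U
```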
State-of-the-art in heterogeneous computing
, 2010
Abstract

Cited by 7 (0 self)
Node-level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy- and/or cost-efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state-of-the-art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field-programmable gate arrays (FPGAs). We present a review of hardware, available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.
Floating-Point Exponentiation Units for Reconfigurable Computing
Abstract

Cited by 3 (0 self)
The high performance and capacity of current FPGAs make them suitable as acceleration co-processors. This article studies the implementation, for such accelerators, of the floating-point power function x^y as defined by the C99 and IEEE 754-2008 standards, generalized here to arbitrary exponent and mantissa sizes. Last-bit accuracy at the smallest possible cost is obtained thanks to a careful study of the various sub-components: a floating-point logarithm, a modified floating-point exponential, and a truncated floating-point multiplier. A parameterized architecture generator in the open-source FloPoCo project is presented in detail and evaluated.
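The decomposition the abstract names (logarithm, exponential, multiplier) follows from the identity x^y = exp(y * ln x). A software sketch of that identity, using plain double precision rather than the paper's extended internal precision, looks like this; the hardware design fuses the steps with extra guard bits to reach last-bit accuracy, which this naive version does not guarantee.

```python
import math

def powr(x, y):
    """Compute x**y for x > 0 via the log/exp decomposition used by
    hardware pow units: x**y = exp(y * ln(x)).  Naive double-precision
    sketch; not correctly rounded."""
    return math.exp(y * math.log(x))
```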
FPGA Accelerator for Floating-Point Matrix Multiplication
 IET COMPUTERS & DIGITAL TECHNIQUES
, 2012
Abstract

Cited by 3 (0 self)
This study treats the architecture and implementation of an FPGA accelerator for double-precision floating-point matrix multiplication. The architecture is oriented towards minimising resource utilisation and maximising clock frequency. It employs the block matrix multiplication algorithm, which returns the result blocks to the host processor as soon as they are computed. This avoids output buffering, and simplifies placement and routing on the chip. The authors show that such an architecture is especially well suited for full-duplex communication links between the accelerator and the host processor. The architecture requires the result blocks to be accumulated by the host processor; however, the authors show that typically more than 99% of all arithmetic operations are performed by the accelerator. The implementation focuses on efficient use of embedded FPGA resources, in order to allow for a large number of processing elements (PEs). Each PE uses 8 Virtex-6 DSP blocks. Both adders and multipliers are deeply pipelined and use several FPGA-specific techniques to achieve small area and high clock frequency. Finally, the authors quantify the performance of the accelerator implemented in a Xilinx Virtex-6 FPGA, with 252 PEs running at 403 MHz (achieving 203.1 GFLOPS), by comparing it to the DGEMM function from the MKL, ACML, GotoBLAS and ATLAS libraries executing on Intel Core 2 Quad and AMD Phenom X4 microprocessors running at 2.8 GHz. The accelerator performs 4.5 times faster than the fastest processor/library pair.
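A minimal sketch of the block matrix multiplication the abstract describes, assuming square matrices with dimension divisible by the block size. In the paper's scheme the partial products along k would be streamed back and accumulated on the host; here the accumulation is done locally for simplicity.

```python
import numpy as np

def block_matmul(A, B, bs):
    """Block matrix multiplication: C is produced one bs x bs block at a
    time, mirroring an accelerator that emits result blocks as soon as
    they are complete.  Assumes square n x n inputs with bs dividing n."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            # accumulate the (i, j) result block over the k dimension
            for k in range(0, n, bs):
                C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
    return C
```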
Automatic Tailoring of Configurable Vector Processors for Scientific Computations
Abstract
Rehosting legacy codes optimized for a platform such as a vector supercomputer often requires a complete rewrite of the original code. This work provides a framework and approach for using configurable computing resources in place of a vector supercomputer to implement a legacy code without a long and expensive rehosting effort. The approach automatically tailors a parameterized, configurable vector processor design to an input problem, and produces an instance of the processor. Experimental data show that the processors perform competitively when compared with a diverse set of contemporary high-performance computing alternatives.
A Universal FPGA-based Floating-point Matrix Processor for Mobile Systems
Abstract
FPGA-based acceleration of matrix operations is a promising solution in mobile systems. However, most related work focuses on a single operation instead of a complete system. In this paper, we explore the possibility of integrating multiple matrix accelerators with a master processor and propose a universal floating-point matrix processor. The processor supports multiple matrix-matrix operations (Level 3 BLAS) and the matrix size is unlimited. The key component of the processor is a shared matrix cache which enables on-chip communication between different accelerators. This structure reduces the external memory bandwidth requirement and improves the overall performance. Considering the performance of the whole system, an asynchronous instruction execution mechanism is further proposed in the hardware-software interface so as to reduce the workload of the master processor. We demonstrate the system using a DE3 development board and achieve a computing performance of about 19 GFLOPS. Experiments show the proposed processor achieves higher performance and energy efficiency than some state-of-the-art embedded processors, including the ARM Cortex-A9 and the Nios II/f soft-core processor. The performance of the processor is even comparable to some desktop processors.
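The asynchronous instruction mechanism can be pictured as a command queue: the master processor enqueues matrix instructions and continues with other work, synchronizing only when results are needed. The sketch below is purely illustrative; the instruction format, the `"gemm"` opcode, and the thread standing in for the accelerator are assumptions, not details from the paper.

```python
from queue import Queue
from threading import Thread
import numpy as np

def accelerator(q, results):
    """Worker standing in for the FPGA accelerator: drains queued
    matrix instructions until it sees the shutdown sentinel."""
    while True:
        instr = q.get()
        if instr is None:            # shutdown sentinel
            break
        op, a, b, tag = instr
        if op == "gemm":             # a Level 3 BLAS operation
            results[tag] = a @ b
        q.task_done()

q, results = Queue(), {}
worker = Thread(target=accelerator, args=(q, results))
worker.start()
q.put(("gemm", np.eye(2), np.ones((2, 2)), "c0"))  # issue and return at once
q.join()                                           # host synchronizes later
q.put(None)
worker.join()
```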
Custom Optimization Algorithms for Efficient Hardware Implementation
, 2013
Abstract
The focus is on real-time optimal decision making with application in advanced control systems. These computationally intensive schemes, which involve the repeated solution of (convex) optimization problems within a sampling interval, require more efficient computational methods than currently available for extending their application to highly dynamical systems and setups with resource-constrained embedded computing platforms. A range of techniques are proposed to exploit synergies between digital hardware, numerical analysis and algorithm design. These techniques build on top of parameterisable hardware code generation tools that generate VHDL code describing custom computing architectures for interior-point methods and a range of first-order constrained optimization methods. Since memory limitations are often important in embedded implementations, we develop a custom storage scheme for KKT matrices arising in interior-point methods for control, which reduces memory requirements significantly and prevents I/O bandwidth limitations from affecting the performance of our implementations. To take advantage of the trend towards parallel computing architectures and to exploit the special character ...
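For concreteness, the simplest member of the first-order constrained family such generators target is projected gradient descent on a box-constrained quadratic program. This is a generic sketch of that method, not the paper's architecture; the step-size rule and problem shape are standard textbook choices.

```python
import numpy as np

def projected_gradient_qp(H, g, lo, hi, steps=500):
    """Minimize 0.5 x'Hx + g'x subject to lo <= x <= hi by projected
    gradient descent.  H must be symmetric positive definite; the step
    size 1/L uses the spectral norm of H as the Lipschitz constant."""
    x = np.zeros_like(g)
    step = 1.0 / np.linalg.norm(H, 2)
    for _ in range(steps):
        # gradient step on the quadratic, then projection onto the box
        x = np.clip(x - step * (H @ x + g), lo, hi)
    return x
```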
A Low Complexity Scaling Method for the Lanczos Kernel in Fixed-Point Arithmetic
Abstract
We consider the problem of enabling fixed-point implementation of linear algebra kernels on low-cost embedded systems, as well as motivating more efficient computational architectures for scientific applications. Fixed-point arithmetic presents additional design challenges compared to floating-point arithmetic, such as having to bound peak values of variables and control their dynamic ranges. Algorithms for solving linear equations or finding eigenvalues are typically nonlinear and iterative, making solving these design challenges a non-trivial task. For these types of algorithms the bounding problem cannot be automated by current tools. We focus on the Lanczos iteration, the heart of well-known methods such as conjugate gradient and minimum residual. We show how one can modify the algorithm with a low-complexity scaling procedure that allows us to apply standard linear algebra to derive tight analytical bounds on all variables of the process, regardless of the properties of the original matrix. It is shown that the numerical behaviour of fixed-point implementations of the modified problem can be chosen to be at least as good as a floating-point implementation, if necessary. The approach is evaluated on field-programmable gate array (FPGA) platforms, highlighting orders-of-magnitude potential performance and efficiency improvements from moving from floating-point to fixed-point computation.
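The kernel in question is the plain Lanczos iteration, shown below as a floating-point baseline: it builds an orthonormal basis V and a symmetric tridiagonal T with V'AV = T. The paper's contribution, a scaling scheme that bounds every variable for fixed-point arithmetic, is not reproduced here.

```python
import numpy as np

def lanczos(A, v, m):
    """m steps of the Lanczos iteration on symmetric A with start
    vector v.  Returns the basis V (n x m) and the diagonals
    (alpha, beta) of the tridiagonal matrix T."""
    n = len(v)
    V = np.zeros((n, m + 1))
    alpha, beta = np.zeros(m), np.zeros(m)
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(m):
        w = A @ V[:, j]
        if j > 0:
            w -= beta[j - 1] * V[:, j - 1]   # three-term recurrence
        alpha[j] = V[:, j] @ w
        w -= alpha[j] * V[:, j]
        beta[j] = np.linalg.norm(w)
        if beta[j] > 0:
            V[:, j + 1] = w / beta[j]
    return V[:, :m], alpha, beta
```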
Reviewed by
Abstract
In 1983, as a young psychologist and well before I had testified in court, I listened to a set of audiotapes describing guidelines for being an expert witness. I was greatly impressed by the clarity and concrete, practical focus of those audiotapes, and those tapes served as an early catalyst for me, sparking my interest in forensic psychology. Those tapes were by Charles Patrick Ewing, the author of Trials of a Forensic Psychologist: A Casebook. Ewing, a law school professor at the State University of New York, University at Buffalo, is one of the most renowned forensic psychologists in the United States, the author of a number of books, countless articles, and editor of a respected forensic psychology journal. As is evident from Trials, his expertise is widely sought around the country in highprofile criminal trials. And the man can write! The accounts of the 10 trials in this book are compelling. I can think of few if any forensic psychology books that I have read for pleasure or that I cannot put down, but Trials is indeed one. Instead of reading it out of a sense of duty or responsibility to keep up with developments in the field, I found myself reading it for sheer enjoyment. The accounts of the 10 trials that Ewing presents are gripping narratives that easily kept my interest. Perhaps the book's closest analogue is not another forensic
A POWER EFFICIENT LINEAR EQUATION SOLVER ON A MULTI-FPGA ACCELERATOR
Abstract
This paper presents an approach to exploiting a commercial multi-FPGA system as a high-performance accelerator, addressing the problem of solving an LU-decomposed linear system of equations using forward and back substitution. A block-based right-hand-side solver algorithm is described, and novel data-flow and memory architectures that can support arbitrary data types, block sizes, and matrix sizes are proposed. These architectures have been implemented on a multi-FPGA system. The capabilities of the accelerator system are pushed to their limits by implementing the problem for double-precision complex floating-point data. Detailed timing data is presented and augmented with data from a performance model proposed in this paper. Performance of the accelerator system is evaluated against that of a state-of-the-art low-power Beowulf cluster node running an optimized LAPACK implementation. Both systems are compared using the power-efficiency (performance/Watt) metric. The FPGA system is about eleven times more power efficient than the compute node of the cluster. In recent years, Field Programmable Gate Array (FPGA) based high-performance computing systems [1][2] have gained attention due to their unique ability to support customized circuits for accelerating compute-intensive applications. Research groups worldwide have explored the use of single and multi ...
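The forward- and back-substitution kernel named above, shown in its unblocked scalar form as a point of reference; the paper's contribution is a blocked, multi-FPGA realization of the same triangular solves, which this sketch does not attempt to capture.

```python
import numpy as np

def solve_lu(L, U, b):
    """Solve (L U) x = b: forward substitution for L y = b, then
    back substitution for U x = y.  L lower- and U upper-triangular."""
    n = len(b)
    y = np.zeros(n, dtype=float)
    for i in range(n):                         # forward sweep
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    x = np.zeros(n, dtype=float)
    for i in range(n - 1, -1, -1):             # backward sweep
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x
```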