Results 1-10 of 11
Portable and Scalable FPGA-Based Acceleration of a Direct Linear System Solver
Abstract

Cited by 8 (1 self)
FPGAs are becoming an attractive platform for accelerating many computations, including scientific applications. However, their adoption has been limited by the large development cost and short life span of FPGA designs. We believe that FPGA-based scientific computation would become far more practical if there were hardware libraries that were portable to any FPGA, with performance that could scale with the resources of the FPGA. To illustrate this idea we have implemented one common supercomputing function: the LU factorization method for solving linear systems. This dissertation discusses issues in making the design both portable and scalable. The design is automatically generated to match the FPGA's capabilities and external memory through the use of parameters. We compared the performance of the design on the FPGA to a single processor core and found that it performs 2.2 times faster, and that the energy dissipated per computation is 5 times less.
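The abstract does not reproduce the algorithm itself. As a reference point, the dense kernel such accelerators parallelize can be sketched as a minimal unpivoted LU factorization; real solvers (and presumably the dissertation's design) add partial pivoting for numerical stability, which this sketch omits.

```python
import numpy as np

def lu_factor(A):
    """Unpivoted (Doolittle) LU factorization, illustrative only.
    Returns unit lower-triangular L and upper-triangular U with A = L @ U."""
    n = A.shape[0]
    L = np.eye(n)
    U = A.astype(float).copy()
    for k in range(n - 1):
        for i in range(k + 1, n):
            # multiplier that eliminates entry (i, k)
            L[i, k] = U[i, k] / U[k, k]
            U[i, k:] -= L[i, k] * U[k, k:]
    return L, U
```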
State-of-the-art in heterogeneous computing
, 2010
Abstract

Cited by 7 (0 self)
Node-level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy- and/or cost-efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state-of-the-art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field-programmable gate arrays (FPGAs). We present a review of hardware, available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.
Floating-Point Exponentiation Units for Reconfigurable Computing
Abstract

Cited by 3 (0 self)
The high performance and capacity of current FPGAs make them suitable as acceleration co-processors. This article studies the implementation, for such accelerators, of the floating-point power function x^y as defined by the C99 and IEEE 754-2008 standards, generalized here to arbitrary exponent and mantissa sizes. Last-bit accuracy at the smallest possible cost is obtained thanks to a careful study of the various sub-components: a floating-point logarithm, a modified floating-point exponential, and a truncated floating-point multiplier. A parameterized architecture generator in the open-source FloPoCo project is presented in detail and evaluated.
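The decomposition the abstract names (logarithm, exponential, multiplier) follows from the identity x^y = exp(y * ln x). A software sketch of that identity, using plain double precision rather than the paper's extended internal precision, looks like this; the hardware design fuses the steps with extra guard bits to reach last-bit accuracy, which this naive version does not guarantee.

```python
import math

def powr(x, y):
    """Compute x**y for x > 0 via the log/exp decomposition used by
    hardware pow units: x**y = exp(y * ln(x)).  Naive double-precision
    sketch; not correctly rounded."""
    return math.exp(y * math.log(x))
```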
FPGA Accelerator for Floating-Point Matrix Multiplication
 IET COMPUTERS & DIGITAL TECHNIQUES
, 2012
Abstract

Cited by 3 (0 self)
This study treats the architecture and implementation of an FPGA accelerator for double-precision floating-point matrix multiplication. The architecture is oriented towards minimising resource utilisation and maximising clock frequency. It employs the block matrix multiplication algorithm, which returns the result blocks to the host processor as soon as they are computed. This avoids output buffering, and simplifies placement and routing on the chip. The authors show that such an architecture is especially well suited for full-duplex communication links between the accelerator and the host processor. The architecture requires the result blocks to be accumulated by the host processor; however, the authors show that typically more than 99% of all arithmetic operations are performed by the accelerator. The implementation focuses on efficient use of embedded FPGA resources, in order to allow for a large number of processing elements (PEs). Each PE uses 8 Virtex-6 DSP blocks. Both adders and multipliers are deeply pipelined and use several FPGA-specific techniques to achieve small area and high clock frequency. Finally, the authors quantify the performance of the accelerator implemented in a Xilinx Virtex-6 FPGA, with 252 PEs running at 403 MHz (achieving 203.1 GFLOPS), by comparing it to the DGEMM function from the MKL, ACML, GotoBLAS and ATLAS libraries executing on Intel Core 2 Quad and AMD Phenom X4 microprocessors running at 2.8 GHz. The accelerator performs 4.5 times faster than the fastest processor/library pair.
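A minimal sketch of the block matrix multiplication the abstract describes, assuming square matrices with dimension divisible by the block size. In the paper's scheme the partial products along k would be streamed back and accumulated on the host; here the accumulation is done locally for simplicity.

```python
import numpy as np

def block_matmul(A, B, bs):
    """Block matrix multiplication: C is produced one bs x bs block at a
    time, mirroring an accelerator that emits result blocks as soon as
    they are complete.  Assumes square n x n inputs with bs dividing n."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            # accumulate the (i, j) result block over the k dimension
            for k in range(0, n, bs):
                C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
    return C
```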
Automatic Tailoring of Configurable Vector Processors for Scientific Computations
Abstract
Rehosting legacy codes optimized for a platform such as a vector supercomputer often requires a complete rewrite of the original code. This work provides a framework and approach for using configurable computing resources in place of a vector supercomputer to implement a legacy code without a long and expensive rehosting effort. The approach automatically tailors a parameterized, configurable vector processor design to an input problem, and produces an instance of the processor. Experimental data show that the processors perform competitively when compared with a diverse set of contemporary high-performance computing alternatives.
A Universal FPGA-based Floating-point Matrix Processor for Mobile Systems
Abstract
FPGA-based acceleration of matrix operations is a promising solution in mobile systems. However, most related work focuses on a single operation instead of a complete system. In this paper, we explore the possibility of integrating multiple matrix accelerators with a master processor and propose a universal floating-point matrix processor. The processor supports multiple matrix-matrix operations (Level 3 BLAS) and the matrix size is unlimited. The key component of the processor is a shared matrix cache which enables on-chip communication between different accelerators. This structure reduces the external memory bandwidth requirement and improves the overall performance. Considering the performance of the whole system, an asynchronous instruction execution mechanism is further proposed in the hardware-software interface so as to reduce the workload of the master processor. We demonstrate the system using a DE3 development board and achieve a computing performance of about 19 GFLOPS. Experiments show the proposed processor achieves higher performance and energy efficiency than some state-of-the-art embedded processors, including the ARM Cortex-A9 and the Nios II/f soft-core processor. The performance of the processor is even comparable to some desktop processors.
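The asynchronous instruction mechanism can be pictured as a command queue: the master processor enqueues matrix instructions and continues with other work, synchronizing only when results are needed. The sketch below is purely illustrative; the instruction format, the `"gemm"` opcode, and the thread standing in for the accelerator are assumptions, not details from the paper.

```python
from queue import Queue
from threading import Thread
import numpy as np

def accelerator(q, results):
    """Worker standing in for the FPGA accelerator: drains queued
    matrix instructions until it sees the shutdown sentinel."""
    while True:
        instr = q.get()
        if instr is None:            # shutdown sentinel
            break
        op, a, b, tag = instr
        if op == "gemm":             # a Level 3 BLAS operation
            results[tag] = a @ b
        q.task_done()

q, results = Queue(), {}
worker = Thread(target=accelerator, args=(q, results))
worker.start()
q.put(("gemm", np.eye(2), np.ones((2, 2)), "c0"))  # issue and return at once
q.join()                                           # host synchronizes later
q.put(None)
worker.join()
```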
Custom Optimization Algorithms for Efficient Hardware Implementation
, 2013
Abstract
The focus is on real-time optimal decision making with application in advanced control systems. These computationally intensive schemes, which involve the repeated solution of (convex) optimization problems within a sampling interval, require more efficient computational methods than currently available for extending their application to highly dynamical systems and setups with resource-constrained embedded computing platforms. A range of techniques are proposed to exploit synergies between digital hardware, numerical analysis and algorithm design. These techniques build on top of parameterisable hardware code generation tools that generate VHDL code describing custom computing architectures for interior-point methods and a range of first-order constrained optimization methods. Since memory limitations are often important in embedded implementations, we develop a custom storage scheme for KKT matrices arising in interior-point methods for control, which reduces memory requirements significantly and prevents I/O bandwidth limitations from affecting the performance of our implementations. To take advantage of the trend towards parallel computing architectures and to exploit the special character ...
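For concreteness, the simplest member of the first-order constrained family such generators target is projected gradient descent on a box-constrained quadratic program. This is a generic sketch of that method, not the paper's architecture; the step-size rule and problem shape are standard textbook choices.

```python
import numpy as np

def projected_gradient_qp(H, g, lo, hi, steps=500):
    """Minimize 0.5 x'Hx + g'x subject to lo <= x <= hi by projected
    gradient descent.  H must be symmetric positive definite; the step
    size 1/L uses the spectral norm of H as the Lipschitz constant."""
    x = np.zeros_like(g)
    step = 1.0 / np.linalg.norm(H, 2)
    for _ in range(steps):
        # gradient step on the quadratic, then projection onto the box
        x = np.clip(x - step * (H @ x + g), lo, hi)
    return x
```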
A Low Complexity Scaling Method for the Lanczos Kernel in Fixed-Point Arithmetic
Abstract
We consider the problem of enabling fixed-point implementation of linear algebra kernels on low-cost embedded systems, as well as motivating more efficient computational architectures for scientific applications. Fixed-point arithmetic presents additional design challenges compared to floating-point arithmetic, such as having to bound peak values of variables and control their dynamic ranges. Algorithms for solving linear equations or finding eigenvalues are typically nonlinear and iterative, making solving these design challenges a non-trivial task. For these types of algorithms the bounding problem cannot be automated by current tools. We focus on the Lanczos iteration, the heart of well-known methods such as conjugate gradient and minimum residual. We show how one can modify the algorithm with a low-complexity scaling procedure that allows us to apply standard linear algebra to derive tight analytical bounds on all variables of the process, regardless of the properties of the original matrix. It is shown that the numerical behaviour of fixed-point implementations of the modified problem can be chosen to be at least as good as a floating-point implementation, if necessary. The approach is evaluated on field-programmable gate array (FPGA) platforms, highlighting orders-of-magnitude potential performance and efficiency improvements from moving from floating-point to fixed-point computation.
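The kernel in question is the plain Lanczos iteration, shown below as a floating-point baseline: it builds an orthonormal basis V and a symmetric tridiagonal T with V'AV = T. The paper's contribution, a scaling scheme that bounds every variable for fixed-point arithmetic, is not reproduced here.

```python
import numpy as np

def lanczos(A, v, m):
    """m steps of the Lanczos iteration on symmetric A with start
    vector v.  Returns the basis V (n x m) and the diagonals
    (alpha, beta) of the tridiagonal matrix T."""
    n = len(v)
    V = np.zeros((n, m + 1))
    alpha, beta = np.zeros(m), np.zeros(m)
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(m):
        w = A @ V[:, j]
        if j > 0:
            w -= beta[j - 1] * V[:, j - 1]   # three-term recurrence
        alpha[j] = V[:, j] @ w
        w -= alpha[j] * V[:, j]
        beta[j] = np.linalg.norm(w)
        if beta[j] > 0:
            V[:, j + 1] = w / beta[j]
    return V[:, :m], alpha, beta
```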
Reviewed by
Abstract
In 1983, as a young psychologist and well before I had testified in court, I listened to a set of audiotapes describing guidelines for being an expert witness. I was greatly impressed by the clarity and concrete, practical focus of those audiotapes, and those tapes served as an early catalyst for me, sparking my interest in forensic psychology. Those tapes were by Charles Patrick Ewing, the author of Trials of a Forensic Psychologist: A Casebook. Ewing, a law school professor at the State University of New York, University at Buffalo, is one of the most renowned forensic psychologists in the United States, the author of a number of books, countless articles, and editor of a respected forensic psychology journal. As is evident from Trials, his expertise is widely sought around the country in highprofile criminal trials. And the man can write! The accounts of the 10 trials in this book are compelling. I can think of few if any forensic psychology books that I have read for pleasure or that I cannot put down, but Trials is indeed one. Instead of reading it out of a sense of duty or responsibility to keep up with developments in the field, I found myself reading it for sheer enjoyment. The accounts of the 10 trials that Ewing presents are gripping narratives that easily kept my interest. Perhaps the book's closest analogue is not another forensic
A POWER EFFICIENT LINEAR EQUATION SOLVER ON A MULTI-FPGA ACCELERATOR
Abstract
This paper presents an approach to exploiting a commercial multi-FPGA system as a high-performance accelerator, addressing the problem of solving an LU-decomposed linear system of equations using forward and back substitution. A block-based right-hand-side solver algorithm is described, and novel data-flow and memory architectures that can support arbitrary data types, block sizes, and matrix sizes are proposed. These architectures have been implemented on a multi-FPGA system. The capabilities of the accelerator system are pushed to their limits by implementing the problem for double-precision complex floating-point data. Detailed timing data is presented and augmented with data from a performance model proposed in this paper. Performance of the accelerator system is evaluated against that of a state-of-the-art low-power Beowulf cluster node running an optimized LAPACK implementation. Both systems are compared using the power-efficiency (performance/Watt) metric. The FPGA system is about eleven times more power efficient than the compute node of the cluster. In recent years, Field Programmable Gate Array (FPGA) based high-performance computing systems [1][2] have gained attention due to their unique ability to support customized circuits for accelerating compute-intensive applications. Research groups worldwide have explored the use of single and multi ...
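The forward- and back-substitution kernel named above, shown in its unblocked scalar form as a point of reference; the paper's contribution is a blocked, multi-FPGA realization of the same triangular solves, which this sketch does not attempt to capture.

```python
import numpy as np

def solve_lu(L, U, b):
    """Solve (L U) x = b: forward substitution for L y = b, then
    back substitution for U x = y.  L lower- and U upper-triangular."""
    n = len(b)
    y = np.zeros(n, dtype=float)
    for i in range(n):                         # forward sweep
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    x = np.zeros(n, dtype=float)
    for i in range(n - 1, -1, -1):             # backward sweep
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x
```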