
High-performance designs for linear algebra operations on reconfigurable hardware

by L Zhuo, V Prasanna
Venue: IEEE Transactions on Computers

Results 1 - 10 of 11

Portable and Scalable FPGA-Based Acceleration of a Direct Linear System Solver

by Wei Zhang
Abstract - Cited by 8 (1 self)
FPGAs are becoming an attractive platform for accelerating many computations including scientific applications. However, their adoption has been limited by the large development cost and short life span of FPGA designs. We believe that FPGA-based scientific computation would become far more practical if there were hardware libraries that were portable to any FPGA with performance that could scale with the resources of the FPGA. To illustrate this idea we have implemented one common supercomputing function: the LU factorization method for solving linear systems. This dissertation discusses issues in making the design both portable and scalable. The design is automatically generated to match the FPGA’s capabilities and external memory through the use of parameters. We compared the performance of the design on the FPGA to a single processor core and found that it performs 2.2 times faster, and that the energy dissipated per computation is 5 times less.
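The kernel this dissertation accelerates, LU factorization, can be sketched as a plain software reference (a minimal Doolittle factorization without pivoting, shown only to illustrate the arithmetic; the dissertation's contribution is the portable, parameterized hardware version, and the function name here is ours):

```python
def lu_decompose(A):
    """Doolittle LU factorization without pivoting: A = L * U.

    A is a square matrix as a list of lists. Returns (L, U), with L
    unit lower-triangular and U upper-triangular. Assumes no zero
    pivot is encountered (no pivoting, for clarity only).
    """
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # row i of U
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for j in range(i + 1, n):  # column i of L
            L[j][i] = (A[j][i] - sum(L[j][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U
```

A scalable hardware design blocks these loops so that each processing element sees only a tile of A at a time; the recurrence itself is unchanged.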

State-of-the-art in heterogeneous computing

by Andre R. Brodtkorb , Christopher Dyken , Trond R. Hagen , Jon M. Hjelmervik , Olaf O. Storaasli , 2010
Abstract - Cited by 7 (0 self)
Node level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy and/or cost efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state-of-the-art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field programmable gate arrays (FPGAs). We present a review of hardware, available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.

Citation Context

...ode for heterogeneous architectures (CBEA) [39]. Molecular Dynamics (FPGA) [105]; Dense Linear Algebra Achieving peak performance (CBEA) [41, 106, 107]; Matrix multiplication, LU decomposition (FPGA) [108]; Use of registers instead of shared memory (GPU) [109]; Linpack (CBEA) [110]; Mixed precision (FPGA) [52]. Sparse Linear Algebra Blocking (CBEA) [111]; Data structures (GPU) [112]; Fast Fourier Trans...

Floating-Point Exponentiation Units for Reconfigurable Computing

by Florent de Dinechin, Pedro Echeverría, Marisa López-vallejo, Bogdan Pasca
Abstract - Cited by 3 (0 self)
The high performance and capacity of current FPGAs make them suitable as acceleration co-processors. This article studies the implementation, for such accelerators, of the floating-point power function x^y as defined by the C99 and IEEE 754-2008 standards, generalized here to arbitrary exponent and mantissa sizes. Last-bit accuracy at the smallest possible cost is obtained thanks to a careful study of the various subcomponents: a floating-point logarithm, a modified floating-point exponential, and a truncated floating-point multiplier. A parameterized architecture generator in the open-source FloPoCo project is presented in detail and evaluated.
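The mathematical identity underlying such a power unit, x^y = exp(y · ln x), can be stated as a one-line software sketch (this shows only the decomposition, not the paper's hardware architecture; the function name is ours):

```python
import math

def pow_via_exp_log(x, y):
    """Compute x**y for x > 0 as exp(y * ln x).

    A hardware unit built this way needs extra internal precision:
    a small error in ln(x) is multiplied by y before exponentiation,
    which is why the article pairs a high-precision logarithm with a
    modified exponential and a truncated multiplier.
    """
    return math.exp(y * math.log(x))
```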

FPGA Accelerator for Floating-Point Matrix Multiplication

by Ž. Jovanović, V. Milutinović - IET COMPUTERS & DIGITAL TECHNIQUES , 2012
Abstract - Cited by 3 (0 self)
This study treats the architecture and implementation of an FPGA accelerator for double-precision floating-point matrix multiplication. The architecture is oriented towards minimising resource utilisation and maximising clock frequency. It employs the block matrix multiplication algorithm, which returns the result blocks to the host processor as soon as they are computed. This avoids output buffering and simplifies placement and routing on the chip. The authors show that such an architecture is especially well suited for full-duplex communication links between the accelerator and the host processor. The architecture requires the result blocks to be accumulated by the host processor; however, the authors show that typically more than 99% of all arithmetic operations are performed by the accelerator. The implementation focuses on efficient use of embedded FPGA resources, in order to allow for a large number of processing elements (PEs). Each PE uses 8 Virtex-6 DSP blocks. Both adders and multipliers are deeply pipelined and use several FPGA-specific techniques to achieve small area and high clock frequency. Finally, the authors quantify the performance of the accelerator implemented in a Xilinx Virtex-6 FPGA, with 252 PEs running at 403 MHz (achieving 203.1 GFLOPS), by comparing it to the DGEMM function from the MKL, ACML, GotoBLAS and ATLAS libraries executing on Intel Core2Quad and AMD Phenom X4 microprocessors running at 2.8 GHz. The accelerator performs 4.5 times faster than the fastest processor/library pair.
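The block algorithm described above can be sketched as a plain software reference (square matrices whose order is a multiple of the block size are assumed, and the function name is ours; in the accelerator, partial blocks are streamed back to the host for accumulation rather than summed locally):

```python
def block_matmul(A, B, bs):
    """Multiply square matrices A and B tile by tile.

    A, B are lists of lists of order n, with n a multiple of the
    block size bs. In this software version the (bi, bj) result
    block is finished after the bk loop, so a hardware design could
    ship it out as soon as it is ready, avoiding output buffering.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, bs):
        for bj in range(0, n, bs):
            for bk in range(0, n, bs):
                # accumulate the contribution of tile (bi, bk) x (bk, bj)
                for i in range(bi, bi + bs):
                    for j in range(bj, bj + bs):
                        C[i][j] += sum(A[i][k] * B[k][j]
                                       for k in range(bk, bk + bs))
    return C
```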

Citation Context

...rocessing elements (PE), with memories for buffering input (and possibly output) data spread equally across all PEs (Fig. 1). Although the PE connection pattern in the form of a tree is also possible [13], the linear list has the advantage of a much more regular structure, which allows simpler routing between PEs and consequently a higher clock frequency. After the initial latency, a list of n PEs m...

Automatic Tailoring of Configurable Vector Processors for Scientific Computations

by D. Rutishauser, M. Jones
Abstract
Abstract — Re-hosting legacy codes optimized for a platform such as a vector supercomputer often requires a complete re-write of the original code. This work provides a framework and approach to use configurable computing resources in place of a vector supercomputer towards the implementation of a legacy code without a long and expensive re-hosting effort. The approach automatically tailors a parameterized, configurable vector processor design to an input problem, and produces an instance of the processor. Experimental data shows the processors perform competitively when compared with a diverse set of contemporary high performance computing alternatives.

Citation Context

...oad Size Clock (MHz) FPGA Device GFLOPS per proc. System Price Price per GFLOPS FP GFLOPS Limit Hybrid HPC Convey HC-1 [6] Matrix Mult. order 16K 150 Virtex™5 LX 330 19.0 $13K $171 14.4 132 Cray XD-1 [10] Matrix Mult. order 16K 110 Virtex™2 Pro vp 50 2.0 $100K [5] $8.3K 8.8 23 Direct Hardware Implementation Systolic Array [11] 2D Cavity Flow 48 X 48 grid 106 Stratix®2 EP 2S180 18.0 $7.7K [12] $425 20....

A Universal FPGA-based Floating-point Matrix Processor for Mobile Systems ∗

by Wenqiang Wang, Kaiyuan Guo, Mengyuan Gu, Yuchun Ma, Yu Wang
Abstract
Abstract—FPGA-based acceleration of matrix operations is a promising solution in mobile systems. However, most related work focuses on a certain operation instead of a complete system. In this paper, we explore the possibility of integrating multiple matrix accelerators with a master processor and propose a universal floating-point matrix processor. The processor supports multiple matrix-matrix operations (Level 3 BLAS) and the matrix size is unlimited. The key component of the processor is a shared matrix cache which enables on-chip communication between different accelerators. This structure reduces the external memory bandwidth requirement and improves the overall performance. Considering the performance of the whole system, an asynchronous instruction execution mechanism is further proposed in the hardware-software interface so as to reduce the workload of the master processor. We demonstrate the system using a DE3 development board and achieve a computing performance of about 19 GFLOPS. Experiments show the proposed processor achieves higher performance and energy efficiency than some state-of-the-art embedded processors, including the ARM Cortex-A9 and the NIOS II/f soft-core processor. The performance of the processor is even comparable to some desktop processors.

Citation Context

...ect accelerators as the basic computing units. Although the accelerators for a certain matrix operation have been widely researched, the integration of different accelerators is seldom considered. In [18], several different linear algebra operations are discussed, but each of them is still designed separately. This motivates us to explore how to build an efficient hardware computing platform which can...

Custom Optimization Algorithms for Efficient Hardware Implementation

by Juan Luis Jerez , 2013
Abstract
The focus is on real-time optimal decision making with application in advanced control systems. These computationally intensive schemes, which involve the repeated solution of (convex) optimization problems within a sampling interval, require more efficient computational methods than currently available for extending their application to highly dynamical systems and setups with resource-constrained embedded computing platforms. A range of techniques are proposed to exploit synergies between digital hardware, numerical analysis and algorithm design. These techniques build on top of parameterisable hardware code generation tools that generate VHDL code describing custom computing architectures for interior-point methods and a range of first-order constrained optimization methods. Since memory limitations are often important in embedded implementations we develop a custom storage scheme for KKT matrices arising in interior-point methods for control, which reduces memory requirements significantly and prevents I/O bandwidth limitations from affecting the performance in our implementations. To take advantage of the trend towards parallel computing architectures and to exploit the special character-

Citation Context

...loops with very tight real-time requirements [2]. Beyond streaming applications, FPGAs have also been recently proposed for efficient floating-point implementations of basic linear algebra operations [212, 245]. FPGAs are traditionally programmed using hardware description languages such as VHDL or Verilog [210]. Hardware design flows rely on slow error-prone tools that require low-level hardware expertise....

A Low Complexity Scaling Method for the Lanczos Kernel in Fixed-Point Arithmetic

by Juan L. Jerez, Student Member, George A. Constantinides, Senior Member, Eric C. Kerrigan
Abstract
Abstract—We consider the problem of enabling fixed-point implementation of linear algebra kernels on low-cost embedded systems, as well as motivating more efficient computational architectures for scientific applications. Fixed-point arithmetic presents additional design challenges compared to floating-point arithmetic, such as having to bound peak values of variables and control their dynamic ranges. Algorithms for solving linear equations or finding eigenvalues are typically nonlinear and iterative, making solving these design challenges a non-trivial task. For these types of algorithms the bounding problem cannot be automated by current tools. We focus on the Lanczos iteration, the heart of well-known methods such as conjugate gradient and minimum residual. We show how one can modify the algorithm with a low-complexity scaling procedure to allow us to apply standard linear algebra to derive tight analytical bounds on all variables of the process, regardless of the properties of the original matrix. It is shown that the numerical behaviour of fixed-point implementations of the modified problem can be chosen to be at least as good as a floating-point implementation, if necessary. The approach is evaluated on field-programmable gate array (FPGA) platforms, highlighting orders-of-magnitude potential performance and efficiency improvements by moving from floating-point to fixed-point computation.
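For reference, the unscaled Lanczos iteration at the heart of CG and MINRES looks as follows (a textbook floating-point version, not the paper's scaled algorithm; the function name is ours, and the paper's contribution is a scaling step that bounds the variables below so the same recurrence is viable in fixed point):

```python
import math

def lanczos(A, v0, m):
    """Run m steps of the symmetric Lanczos iteration.

    A: symmetric matrix (list of lists), v0: nonzero start vector.
    Returns the diagonal (alphas) and off-diagonal (betas) entries of
    the tridiagonal matrix T. In fixed-point arithmetic, the dynamic
    ranges of w, alpha and beta are what a scaling scheme must bound.
    """
    n = len(v0)
    norm = math.sqrt(sum(x * x for x in v0))
    v = [x / norm for x in v0]
    v_prev = [0.0] * n
    alphas, betas = [], []
    beta = 0.0
    for _ in range(m):
        # three-term recurrence: w = A v - beta * v_prev
        w = [sum(A[i][j] * v[j] for j in range(n)) - beta * v_prev[i]
             for i in range(n)]
        alpha = sum(w[i] * v[i] for i in range(n))
        w = [w[i] - alpha * v[i] for i in range(n)]
        beta = math.sqrt(sum(x * x for x in w))
        alphas.append(alpha)
        betas.append(beta)
        if beta == 0.0:
            break  # invariant subspace found
        v_prev, v = v, [x / beta for x in w]
    return alphas, betas
```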

Citation Context

...re’s law has continued to promote FPGAs to a level where it has become possible to provide substantial acceleration over microprocessors by directly implementing floating-point linear algebra kernels [14]–[17], floating-point operations remain expensive to implement, mainly because there is no hard support in the FPGA fabric to facilitate the normalisation and denormalisation operations required befor...

Reviewed by

by Charles Patrick Ewing, Philip H. Witt
Abstract
In 1983, as a young psychologist and well before I had testified in court, I listened to a set of audiotapes describing guidelines for being an expert witness. I was greatly impressed by the clarity and concrete, practical focus of those audiotapes, and those tapes served as an early catalyst for me, sparking my interest in forensic psychology. Those tapes were by Charles Patrick Ewing, the author of Trials of a Forensic Psychologist: A Casebook. Ewing, a law school professor at the State University of New York, University at Buffalo, is one of the most renowned forensic psychologists in the United States, the author of a number of books, countless articles, and editor of a respected forensic psychology journal. As is evident from Trials, his expertise is widely sought around the country in high-profile criminal trials. And the man can write! The accounts of the 10 trials in this book are compelling. I can think of few if any forensic psychology books that I have read for pleasure or that I cannot put down, but Trials is indeed one. Instead of reading it out of a sense of duty or responsibility to keep up with developments in the field, I found myself reading it for sheer enjoyment. The accounts of the 10 trials that Ewing presents are gripping narratives that easily kept my interest. Perhaps the book's closest analogue is not another forensic

Citation Context

...that we propose (Figure 6) is based uniquely on a small number of linear computations, which are particularly suited for ultrafast (millisecond scale) implementation on reconfigurable hardware chips (Zhuo and Prasanna, 2008; Sadrozinski and Wu, 2011) or on GPU architectures (Owens et al., 2008; Volkov and Demmel, 2008) on which FFT algorithms can be efficiently implemented (Bhattacharyya et al., 2010). As a matter of fa...

A POWER EFFICIENT LINEAR EQUATION SOLVER ON A MULTI-FPGA ACCELERATOR

by Arvind Sudarsanam, Thomas Hauser, Aravind Dasu, Seth Young
Abstract
This paper presents an approach to explore a commercial multi-FPGA system as a high-performance accelerator, addressing the problem of solving an LU-decomposed linear system of equations using forward and back substitution. A block-based right-hand-side solver algorithm is described, and novel data flow and memory architectures that can support arbitrary data types, block sizes, and matrix sizes are proposed. These architectures have been implemented on a multi-FPGA system. The capabilities of the accelerator system are pushed to their limits by implementing the problem for double-precision complex floating-point data. Detailed timing data is presented and augmented with data from a performance model proposed in this paper. The performance of the accelerator system is evaluated against that of a state-of-the-art low-power Beowulf cluster node running an optimized LAPACK implementation. Both systems are compared using the power efficiency (Performance/Watt) metric. The FPGA system is about eleven times more power efficient than the compute node of a cluster. In recent years, Field Programmable Gate Array (FPGA) based high-performance computing systems [1][2] have gained attention due to their unique ability to support customized circuits for accelerating compute-intensive applications. Research groups worldwide have explored the use of single and multi-
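The forward and back substitution steps named above can be sketched as a plain software reference (unblocked, real-valued, and with function names of our choosing; the paper's version partitions these loops into blocks spread across multiple FPGAs and works on complex data):

```python
def forward_substitution(L, b):
    """Solve L y = b for lower-triangular L, computing y row by row.

    Each y[i] depends only on y[0..i-1], which is what makes the
    recurrence amenable to a streaming hardware pipeline.
    """
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    return y

def back_substitution(U, y):
    """Solve U x = y for upper-triangular U, bottom row first."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x
```

Given an LU factorization A = LU, solving A x = b is forward substitution on L followed by back substitution on U.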

Citation Context

... to a very specific sub-set of matrices and not applicable for a generic matrix and hence has limited practical utility. Edman and Owall [11] also targeted only triangular matrices. Zhuo and Prasanna [12] propose a methodology to implement various linear algebra algorithms on a particular FPGA device. This paper also proposes a performance prediction model that incorporates (a) implementation paramete...


Developed at and hosted by The College of Information Sciences and Technology

© 2007-2019 The Pennsylvania State University