Results 11–20 of 60
An FPGA-Based Floating-Point Jacobi Iterative Solver
Cited by 6 (1 self)
Abstract:
Within the parallel computing domain, field-programmable gate arrays (FPGAs) are no longer restricted to their traditional role as substitutes for application-specific integrated circuits, as hardware “hidden” from the end user. Several high-performance computing vendors offer parallel reconfigurable computers employing user-programmable FPGAs. These exciting new architectures allow end users to, in effect, create reconfigurable coprocessors targeting the computationally intensive parts of each problem. The increased capability of contemporary FPGAs, coupled with the embarrassingly parallel nature of the Jacobi iterative method, makes the Jacobi method an ideal candidate for hardware acceleration. This paper introduces a parameterized design for a deeply pipelined, highly parallelized IEEE 64-bit floating-point version of the Jacobi method. A Jacobi circuit is implemented using a Xilinx Virtex-II Pro as the target FPGA device. Implementation statistics and performance estimates are presented.
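The Jacobi method's suitability for hardware comes from its per-component independence; a minimal software sketch (a plain reference version, not the paper's pipelined floating-point circuit) makes that structure explicit:

```python
def jacobi(A, b, iters=100):
    """Jacobi iteration for A x = b (A given as a dense list of rows).

    Each component of the new iterate depends only on the *previous*
    iterate, so all n updates are independent -- the embarrassingly
    parallel structure the abstract exploits for hardware acceleration.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        # Every new x[i] reads only the old x, so all n updates
        # could run concurrently in hardware.
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x
```

Convergence is guaranteed when A is strictly diagonally dominant; a hardware version parallelizes the inner dot products rather than iterating in software.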
Sparse matrix-vector multiplication for finite element method matrices on FPGAs
Field-Programmable Custom Computing Machines, Annual IEEE Symposium on, 2006
Cited by 6 (1 self)
Abstract:
We present an architecture and an implementation of an FPGA-based sparse matrix-vector multiplier (SMVM) for use in the iterative solution of large, sparse systems of equations arising from Finite Element Method (FEM) applications. The architecture is based on a pipelined linear array of processing elements (PEs). A hardware-oriented matrix “striping” scheme is developed which reduces the number of required processing elements. Our current 8-PE prototype achieves a peak performance of 1.76 GFLOPS and a sustained performance of 1.5 GFLOPS with 8 GB/s of memory bandwidth. The SMVM pipeline uses 30% of the logic resources and 40% of the memory resources of a Stratix S80 FPGA. By virtue of the local interconnect between the PEs, the SMVM pipeline achieves scalability that is limited only by FPGA resources rather than by communication overhead.
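For readers unfamiliar with the SMVM kernel itself, a compact reference version over the common compressed sparse row (CSR) layout is sketched below; the paper's hardware “striping” format differs, and CSR is used here only as a stand-in:

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a sparse matrix A stored in CSR form.

    values  -- nonzero entries, row by row
    col_idx -- column index of each nonzero
    row_ptr -- row r's nonzeros occupy values[row_ptr[r]:row_ptr[r+1]]
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y
```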
Architectural modifications to improve floating-point unit efficiency in FPGAs
In IEEE Symposium on Field-Programmable Custom Computing Machines, 2006
Cited by 5 (0 self)
Abstract:
FPGAs have reached densities that can implement floating-point applications, but floating-point operations still require a large amount of FPGA resources. One major component of IEEE-compliant floating-point computations is the variable-length shifter, which accounts for over 30% of a double-precision floating-point adder and 25% of a double-precision multiplier. This paper introduces two alternatives for implementing these shifters. One alternative is a coarse-grained approach: embedding variable-length shifters in the FPGA fabric. These units provide significant area savings with a modest clock-rate improvement over existing architectures. The other alternative is a fine-grained approach: adding a 4:1 multiplexer inside the slices, in parallel with the LUTs. While providing a more modest area savings, these multiplexers provide a significant boost in clock rate with a small impact on the FPGA fabric.
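The area cost the abstract attributes to variable-length shifters is easiest to see from the standard logarithmic barrel-shifter construction: a 64-bit shifter needs six stages, each a full word-width layer of 2:1 muxes. A small behavioral model (illustrative only, not from the paper):

```python
def barrel_shift_right(x, amount, width=64):
    """Logarithmic barrel shifter: bit s of `amount` controls one mux
    stage that either passes the word through or shifts it by 2**s.
    Each stage is a width-wide layer of 2:1 muxes, which is why these
    shifters consume so much FPGA fabric."""
    mask = (1 << width) - 1
    x &= mask
    stages = (width - 1).bit_length()   # 6 stages for a 64-bit datapath
    for s in range(stages):
        if (amount >> s) & 1:
            x = (x >> (1 << s)) & mask
    return x
```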
FPGA vs. GPU for Sparse Matrix Vector Multiply
Cited by 5 (0 self)
Abstract:
Sparse matrix-vector multiplication (SpMV) is a common operation in numerical linear algebra and is the computational kernel of many scientific applications. It is one of the original and perhaps most studied targets for FPGA acceleration. Despite this, GPUs, which have only recently gained both general-purpose programmability and native support for double-precision floating-point arithmetic, are viewed by some as a more effective platform for SpMV and similar linear algebra computations. In this paper, we present an analysis comparing an existing GPU SpMV implementation to our own, novel FPGA implementation. In this analysis, we describe the challenges faced by any SpMV implementation, the unique approaches to these challenges taken by both FPGA and GPU implementations, and their relative performance for SpMV.
Reconfigurable Fixed-Point Dense and Sparse Matrix-Vector Multiply/Add Unit
In Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors (ASAP’06), 2006
Cited by 4 (1 self)
Abstract:
In this paper, we propose a reconfigurable hardware accelerator for fixed-point matrix-vector multiply/add operations, capable of operating on both dense and sparse matrix formats. The prototyped hardware unit accommodates 4 dense or sparse matrix inputs and performs computations in a space-parallel design, achieving 4 multiplications and up to 12 additions at 120 MHz on an XC2VP100-6 FPGA device, reaching a throughput of 1.9 GOPS. A total of 11 units can be integrated in the same FPGA chip, achieving a performance of 21 GOPS.
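The quoted throughput figures are consistent with the operation counts: 4 multiplications plus up to 12 additions is 16 operations per cycle, and at 120 MHz that yields the reported numbers.

```python
ops_per_cycle = 4 + 12       # 4 multiplications + up to 12 additions per cycle
clock_hz = 120e6             # 120 MHz
unit_gops = ops_per_cycle * clock_hz / 1e9
total_gops = 11 * unit_gops  # 11 units integrated on the same chip
# unit_gops is 1.92, matching the reported 1.9 GOPS;
# total_gops is about 21.1, matching the reported 21 GOPS.
```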
Evaluation of a high-level-language methodology for high-performance reconfigurable computers
In Application-Specific Systems, Architectures and Processors (ASAP), IEEE International Conference on, 2007
From silicon to science: The long road to production reconfigurable supercomputing
ACM Transactions on Reconfigurable Technology and Systems
Cited by 3 (0 self)
Abstract:
The field of high-performance computing (HPC) currently abounds with excitement about the potential of a broad class of things called accelerators. And yet, few accelerator-based systems are being deployed in general-purpose HPC environments. Why is that? This article explores the challenges that accelerators face in the HPC world, with a specific focus on FPGA-based systems. We begin with an overview of the characteristics and challenges of typical HPC systems and applications and discuss why FPGAs have the potential to have a significant impact. The bulk of the article is focused on twelve specific areas where FPGA researchers can make contributions to hasten the adoption of FPGAs in HPC environments.
Linear Extractors for Extracting Randomness from Noisy Sources
In IEEE International Symposium on Information Theory, 2011
Cited by 3 (2 self)
Abstract:
Linear transformations have many applications in information theory, such as data compression and the design of error-correcting codes. In this paper, we study the power of linear transformations in randomness extraction, namely linear extractors, as another important application. Compared to most existing methods for randomness extraction, linear extractors (especially those constructed with sparse matrices) are computationally fast and can be implemented simply with hardware such as FPGAs, which makes them very attractive in practical use. We mainly focus on simple, efficient, and sparse constructions of linear extractors. Specifically, we demonstrate that random matrices can generate random bits very efficiently from a variety of noisy sources, including noisy coin sources, bit-fixing sources, noisy (hidden) Markov sources, as well as their mixtures. We show that low-density random matrices have almost the same efficiency as high-density random matrices when the input sequence is long, which provides a way to simplify hardware/software implementation. Note that although the matrices are constructed with randomness, once constructed they are deterministic (seedless) extractors; the same construction can be used any number of times without any seeds. Another way to construct linear extractors is based on generator matrices of primitive BCH codes. This method is more explicit, but less practical due to its computational complexity and dimensional constraints.
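A linear extractor of this kind is just a matrix-vector product over GF(2): each output bit is the XOR of the input bits selected by one matrix row. The sketch below (an illustrative construction, not the paper's specific one) shows why sparse rows are cheap in hardware, since each output bit reduces to a small XOR tree:

```python
import random

def random_sparse_matrix(m, n, density=0.1, seed=0):
    """m x n random 0/1 matrix; low density means short XOR chains."""
    rng = random.Random(seed)
    return [[1 if rng.random() < density else 0 for _ in range(n)]
            for _ in range(m)]

def extract(M, bits):
    """Linear extractor: output = M * bits over GF(2).
    Each output bit is the parity (XOR) of the selected input bits."""
    return [sum(m_ij & b for m_ij, b in zip(row, bits)) & 1 for row in M]
```

Once the matrix is fixed, the extractor is deterministic (seedless), matching the abstract's remark.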
A HighPerformance Double Precision Accumulator
Cited by 3 (2 self)
Abstract:
The accumulation operation A_new = A_old + X is required by many numerical methods. However, when using a floating-point adder with pipeline latency α, the data hazard between A_new and A_old creates design challenges for situations where inputs must be delivered to the accumulator at a rate exceeding 1/α. Each of the techniques proposed to address this problem requires either static data scheduling or an overly complex microarchitecture having multiple adders, a large amount of memory, or control overheads that force the accumulator to operate at a diminished speed relative to the adder on which it is based. In this paper we present a design for a double-precision accumulator that achieves high performance without the need for data scheduling or an overly complex implementation. We achieve this by integrating a coalescing reduction circuit within the low-level design of a base-converting floating-point adder. When implemented on a Virtex-2 Pro 100 FPGA, our design achieves a speed of 170 MHz.
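The hazard has a simple software analogue: with adder latency L, a naive feedback accumulator can accept only one input every L cycles, while keeping L independent partial sums keeps the pipeline full and defers a final reduction — the step the paper folds into the adder itself. A sketch (the latency value and lane scheme are illustrative assumptions, not the paper's circuit):

```python
def pipelined_accumulate(values, latency=8):
    """Model of latency-tolerant accumulation: each of `latency` partial
    sums reuses its own result only once every `latency` inputs, so a
    pipelined adder of that depth never stalls on the A_new/A_old hazard.
    sum() stands in for the final reduction of the partial sums."""
    partial = [0.0] * latency
    for i, v in enumerate(values):
        partial[i % latency] += v   # each lane is an independent accumulator
    return sum(partial)             # final reduction step
```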
Compiled Multithreaded Data Paths on FPGAs for Dynamic Workloads
Cited by 3 (0 self)
Abstract:
Hardware-supported multithreading can mask memory latency by switching execution to ready threads, which is particularly effective for irregular applications. FPGAs provide an opportunity to have multithreaded data paths customized to each individual application. In this paper we describe the compiler generation of these hardware structures from a C subset, targeting a Convey HC-2ex machine. We describe how this compilation approach differs from other C-to-HDL compilers. We use the compiler to generate a multithreaded sparse matrix-vector multiplication kernel and compare its performance to existing FPGA and highly optimized software implementations.
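The latency-masking effect is easy to demonstrate with a toy cycle-level model (the 10-cycle memory latency, the 'c'/'m' op encoding, and free thread switching are all illustrative assumptions, not properties of the compiled hardware):

```python
from collections import deque

MEM_LAT = 10  # assumed memory latency in cycles (illustrative)

def simulate(threads):
    """Switch-on-memory multithreading model.  Each thread is a list of
    ops: 'c' is a 1-cycle compute step, 'm' issues a memory request that
    parks the thread for MEM_LAT cycles.  One op issues per cycle from a
    ready thread; cycles with no ready thread are idle.  Returns total
    cycles until every thread has issued its last op."""
    ready = deque((tid, 0) for tid in range(len(threads)))  # (thread, pc)
    waking = {}   # wake cycle -> threads returning from memory
    cycle, done = 0, 0
    while done < len(threads):
        ready.extend(waking.pop(cycle, []))
        if ready:
            tid, pc = ready.popleft()
            op = threads[tid][pc]
            pc += 1
            if pc == len(threads[tid]):
                done += 1                    # thread retires at its last issue
            elif op == 'm':
                waking.setdefault(cycle + MEM_LAT, []).append((tid, pc))
            else:
                ready.append((tid, pc))
        cycle += 1
    return cycle
```

Two threads of `['m', 'c']` finish in 12 cycles instead of the 22 a single thread would take to run them back to back, because the second thread's memory stall overlaps the first's.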