Results 1–8 of 8
Titan: Enabling Large and Complex Benchmarks in Academic CAD
Abstract

Cited by 3 (2 self)
Benchmarks play a key role in FPGA architecture and CAD research, enabling the quantitative comparison of tools and architectures. It is important that these benchmarks reflect modern designs, which are large-scale systems that make use of heterogeneous resources; however, most current FPGA benchmarks are both small and simple. In this paper we present Titan, a hybrid CAD flow that addresses these issues. The flow uses Altera’s Quartus II FPGA CAD software to perform HDL synthesis and a conversion tool to translate the result into the academic BLIF format. Using this flow we created the Titan23 benchmark set, which consists of 23 large (90K–1.8M block) benchmark circuits covering a wide range of application domains. Using the Titan23 benchmarks and a detailed model of Altera’s Stratix IV architecture, we compared the performance and quality of VPR and Quartus II targeting the same architecture. We found that VPR is at least 2.7× slower, uses 5.1× more memory, and uses 2.6× more wire compared to Quartus II. Finally, we identified that VPR’s focus on achieving a dense packing is responsible for a large portion of the wire-length gap.
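The academic BLIF format the abstract refers to is a plain-text netlist format. A minimal example (a single 2-input AND gate, for illustration only; real Titan23 circuits are far larger and also use `.subckt` instances for heterogeneous blocks such as RAMs and DSPs):

```
.model and2
.inputs a b
.outputs y
.names a b y
11 1
.end
```

Each `.names` block lists input/output signals followed by the truth-table cubes that drive the output high; here the cube `11 1` means y = 1 when a = b = 1.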
Maximizing Speed and Density of Tiled FPGA Overlays via Partitioning
Abstract

Cited by 1 (0 self)
Abstract—Common practice for large FPGA design projects is to divide subprojects into separate synthesis partitions to allow incremental recompilation as each subproject evolves. In contrast, smaller design projects avoid partitioning to give the CAD tool the freedom to perform as many global optimizations as possible, knowing that the optimizations normally improve performance and possibly area. In this paper, we show that for high-speed tiled designs composed of duplicated components and hence having multi-localities (multiple instances of equivalent logic), a designer can use partitioning to preserve multi-locality and improve performance. In particular, we focus on the lanes of SIMD soft processors and multicore meshes composed of them, as compiled by Quartus 12.1 targeting a Stratix IV EP4SE230F29C2 device. We demonstrate that, with negligible impact on compile time (less than ±10%): (i) we can use partitioning to provide high-level information to the CAD tool about preserving multi-localities in a design, without low-level micromanaging of the design description or CAD tool settings; (ii) by preserving multi-localities within SIMD soft processors, we can increase both frequency (by up to 31%) and compute density (by up to 15%); (iii) partitioning improves the density and speed (by up to 51% and 54%, respectively) of a mesh of soft processors, across many building-block configurations and mesh geometries; (iv) the improvements from partitioning increase as the number of tiled computing elements (SIMD lanes or mesh nodes) increases. As an example of the benefits of partitioning, a mesh of 102 scalar soft processors improves its operating frequency from 284 to 437 MHz and its peak performance from 28,968 to 44,574 MIPS, while increasing its logic area by only 0.85%.
Optimising Memory Bandwidth Use and Performance for Matrix-Vector Multiplication in Iterative Methods
Abstract
Computing the solution to a system of linear equations is a fundamental problem in scientific computing, and its acceleration has drawn wide interest in the FPGA community [Morris et al. 2006; Zhang et al. 2008; Zhuo and Prasanna 2006]. One class of algorithms to solve these systems, iterative methods, has drawn particular interest, with recent literature showing large performance improvements over general-purpose processors (GPPs) [Lopes and Constantinides 2008]. In several iterative methods, this performance gain is largely a result of parallelisation of the matrix-vector multiplication, an operation that occurs in many applications and hence has also been widely studied on FPGAs [Zhuo and Prasanna 2005; El-Kurdi et al. 2006]. However, whilst the performance of matrix-vector multiplication on FPGAs is generally I/O bound [Zhuo and Prasanna 2005], the nature of iterative methods allows the use of on-chip memory buffers to increase the bandwidth, providing the potential for significantly more parallelism [deLorimier and DeHon 2005]. Unfortunately, existing approaches have generally either been capable of solving large matrices with only limited improvement over GPPs [Zhuo and Prasanna 2005; El-Kurdi et al. 2006; deLorimier and DeHon 2005], or achieve high performance only for relatively small matrices [Lopes and Constantinides 2008; Boland and Constantinides 2008]. This paper proposes hardware designs to take advantage of symmetrical and banded matrix structure, as well as methods to optimise the RAM use, in order to both increase the performance and retain this performance for larger-order matrices.
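The storage saving behind exploiting symmetric banded structure can be sketched in software: keeping only the main diagonal and the superdiagonals, each stored entry serves two positions of the matrix. A minimal sketch (illustrative only, not the paper's hardware design):

```python
import numpy as np

def banded_symmetric_matvec(diags, x):
    """Compute y = A @ x for a symmetric banded matrix A.

    diags[k] holds the k-th superdiagonal of A (diags[0] is the main
    diagonal), so only bandwidth+1 short vectors are stored instead of
    the full n*n matrix -- the kind of structure exploitation that lets
    the operands fit in on-chip RAM buffers.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    y = np.asarray(diags[0], dtype=float) * x   # main-diagonal contribution
    for k in range(1, len(diags)):
        d = np.asarray(diags[k], dtype=float)   # length n - k
        y[:n - k] += d * x[k:]                  # superdiagonal entries A[i, i+k]
        y[k:] += d * x[:n - k]                  # mirrored subdiagonal entries A[i+k, i]
    return y
```

Each stored value is used for both A[i, i+k] and A[i+k, i], halving the memory traffic relative to a dense symmetric matrix.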
Fixed-Point Lanczos: Sustaining TFLOP-Equivalent Performance in FPGAs for Scientific Computing (2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines)
Abstract
Abstract—We consider the problem of enabling fixed-point implementations of linear algebra kernels to match the strengths of the field-programmable gate array (FPGA). Algorithms for solving linear equations, finding eigenvalues or finding singular values are typically nonlinear and recursive, making the problem of establishing analytical bounds on variable dynamic range nontrivial. Current approaches fail to provide tight bounds for algorithms of this type. We use as a case study one of the most important kernels in scientific computing, the Lanczos iteration, which lies at the heart of well-known methods such as conjugate gradient and minimum residual, and we show how we can modify the algorithm to allow us to apply standard linear algebra analysis to prove tight analytical bounds on all variables of the process, regardless of the properties of the original matrix. It is shown that the numerical behaviour of fixed-point implementations of the modified problem can be chosen to be at least as good as a double-precision floating-point implementation. Using this approach it is possible to get sustained FPGA performance very close to the peak general-purpose graphics processing unit (GPGPU) performance in FPGAs of comparable size when solving a single problem. If there are several independent problems to solve simultaneously, it is possible to exceed the peak floating-point performance of a GPGPU, obtaining approximately 1, 2 or 4 TFLOPs for error tolerances of 10⁻⁷, 10⁻⁵ and 10⁻³, respectively, in a large Virtex 7 FPGA. Keywords: Field-programmable gate arrays; Scientific computing; High-performance computing; Iterative algorithms; Fixed-point arithmetic
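The core idea of bounding all Lanczos variables can be illustrated in a few lines: for symmetric A, ||A||₂ ≤ ||A||∞, so dividing A by its (cheaply computed) infinity norm guarantees every α, β and every entry of the Lanczos vectors stays in [-1, 1]. A minimal sketch of this idea; the paper's modified recurrence differs in detail:

```python
import numpy as np

def scaled_lanczos(A, q0, m):
    """Run m Lanczos steps on A / s, where s >= ||A||_2 for symmetric A.

    With the matrix pre-scaled, |alpha| <= 1, beta <= 1 and |q_i| <= 1
    hold at every step -- the bounded dynamic range a fixed-point
    implementation needs.  Breakdown (beta == 0) is ignored here.
    """
    s = np.abs(A).sum(axis=1).max()     # ||A||_inf, bounds ||A||_2 for symmetric A
    As = A / s
    q_prev = np.zeros_like(q0, dtype=float)
    q = q0 / np.linalg.norm(q0)
    beta = 0.0
    alphas, betas = [], []
    for _ in range(m):
        v = As @ q - beta * q_prev      # three-term recurrence
        alpha = q @ v
        v -= alpha * q                  # orthogonalise against current vector
        beta = np.linalg.norm(v)
        alphas.append(alpha)
        betas.append(beta)
        q_prev, q = q, v / beta
    return np.array(alphas), np.array(betas)
```

Since α = qᵀ(A/s)q and β is the norm of an orthogonal residual of (A/s)q, both are bounded by ||A/s||₂ ≤ 1 regardless of the original matrix, which is the property the bounding analysis relies on.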
Fine-Grained Interconnect Synthesis
Abstract
One of the key challenges for the FPGA industry going forward is to make the task of designing hardware easier. A significant portion of that design task is the creation of the interconnect pathways between functional structures. We present a synthesis tool that automates this process and focuses on the interconnect needs in the fine-grained (sub-IP-block) design space. Here there are several issues that prior research and tools do not address well: the need to have fixed, deterministic latency between communicating units (to enable high-performance local communication without the area overheads of latency-insensitivity), and the ability to avoid generating unnecessary arbitration hardware when the application design can avoid it. Using a design example, our tool generates interconnect that requires 72% fewer lines of specification code than a handwritten Verilog implementation, which is a 33% overall reduction for the entire application. The resulting system, while requiring 4% more total functional and interconnect area, achieves the same performance. We also show quantitative and qualitative advantages over an existing commercial interconnect synthesis tool, achieving a 25% performance advantage and 17%/57% logic/memory area savings.
Custom Optimization Algorithms for Efficient Hardware Implementation
2013
Abstract
The focus is on real-time optimal decision making with application in advanced control systems. These computationally intensive schemes, which involve the repeated solution of (convex) optimization problems within a sampling interval, require more efficient computational methods than currently available for extending their application to highly dynamical systems and setups with resource-constrained embedded computing platforms. A range of techniques are proposed to exploit synergies between digital hardware, numerical analysis and algorithm design. These techniques build on top of parameterisable hardware code generation tools that generate VHDL code describing custom computing architectures for interior-point methods and a range of first-order constrained optimization methods. Since memory limitations are often important in embedded implementations, we develop a custom storage scheme for KKT matrices arising in interior-point methods for control, which reduces memory requirements significantly and prevents I/O bandwidth limitations from affecting the performance in our implementations. To take advantage of the trend towards parallel computing architectures and to exploit the special character …
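KKT matrices from predictive control are block-tridiagonal (one block row per horizon stage), which is what makes a compact custom storage scheme possible: only the diagonal and subdiagonal blocks need to be kept. A hypothetical sketch of such a scheme, not the thesis's actual memory layout:

```python
import numpy as np

def block_tridiag_matvec(D, L, x):
    """Compute y = K @ x for a symmetric block-tridiagonal K.

    Only the diagonal blocks D[i] and subdiagonal blocks L[i] are
    stored, cutting memory from O((N*n)^2) for the dense matrix to
    O(N*n^2) -- the kind of saving a custom KKT storage scheme gives
    when the horizon length N is large.
    """
    N = len(D)
    n = D[0].shape[0]
    xb = np.asarray(x, dtype=float).reshape(N, n)   # one slice per stage
    y = np.stack([D[i] @ xb[i] for i in range(N)])  # diagonal-block products
    for i in range(N - 1):
        y[i + 1] += L[i] @ xb[i]        # subdiagonal block K[i+1, i]
        y[i] += L[i].T @ xb[i + 1]      # its symmetric counterpart K[i, i+1]
    return y.reshape(-1)
```

The matrix-vector product is the dominant operation inside iterative linear solvers used by interior-point methods, so keeping it streaming-friendly is what prevents I/O bandwidth from becoming the bottleneck.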
A Low-Complexity Scaling Method for the Lanczos Kernel in Fixed-Point Arithmetic
Abstract
Abstract—We consider the problem of enabling fixed-point implementation of linear algebra kernels on low-cost embedded systems, as well as motivating more efficient computational architectures for scientific applications. Fixed-point arithmetic presents additional design challenges compared to floating-point arithmetic, such as having to bound peak values of variables and control their dynamic ranges. Algorithms for solving linear equations or finding eigenvalues are typically nonlinear and iterative, making solving these design challenges a nontrivial task. For these types of algorithms the bounding problem cannot be automated by current tools. We focus on the Lanczos iteration, the heart of well-known methods such as conjugate gradient and minimum residual. We show how one can modify the algorithm with a low-complexity scaling procedure to allow us to apply standard linear algebra to derive tight analytical bounds on all variables of the process, regardless of the properties of the original matrix. It is shown that the numerical behaviour of fixed-point implementations of the modified problem can be chosen to be at least as good as a floating-point implementation, if necessary. The approach is evaluated on field-programmable gate array (FPGA) platforms, highlighting orders-of-magnitude potential performance and efficiency improvements from moving from floating-point to fixed-point computation.
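The payoff of bounding every variable to a known range is that a short fixed-point word then represents each value with a known worst-case error. A minimal software model of that rounding step (illustrative only, not the paper's hardware arithmetic):

```python
import numpy as np

def to_fixed(x, frac_bits=15):
    """Round x onto a signed fixed-point grid with frac_bits fractional bits.

    Once a scaling procedure has bounded every variable to roughly
    [-1, 1), a Q1.15-style format like this one represents each value
    with absolute error at most 2**-(frac_bits + 1), so the word length
    can be chosen directly from the target error tolerance.
    """
    scale = float(1 << frac_bits)
    return np.round(np.asarray(x, dtype=float) * scale) / scale
```

Without the bounding analysis, an unscaled variable could overflow the integer part of such a format, which is exactly the failure mode the scaling procedure rules out.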
Study of Scatter/Gather & Inter-Neighbour Traffics Under FPGA Constraints
2012
Abstract
We evaluate the scatter-gather and inter-neighbour communication traffic for torus and tree topologies under FPGA constraints. To do this, we created a cycle-accurate simulator and traffic generator, and implemented the Blocked LU Decomposition and Sudoku Constraint Propagation traffic algorithms. We have found the torus to be the better candidate for inter-neighbour traffic. Provided we connect the memCntrl intelligently to minimize the distance to it, scatter-gather traffic between compute nodes and the memCntrl does not favour either topology.