Results 1–8 of 8
Titan: Enabling Large and Complex Benchmarks in Academic CAD
Abstract

Cited by 3 (2 self)
Benchmarks play a key role in FPGA architecture and CAD research, enabling the quantitative comparison of tools and architectures. It is important that these benchmarks reflect modern designs, which are large-scale systems that make use of heterogeneous resources; however, most current FPGA benchmarks are both small and simple. In this paper we present Titan, a hybrid CAD flow that addresses these issues. The flow uses Altera’s Quartus II FPGA CAD software to perform HDL synthesis and a conversion tool to translate the result into the academic BLIF format. Using this flow we created the Titan23 benchmark set, which consists of 23 large (90K–1.8M block) benchmark circuits covering a wide range of application domains. Using the Titan23 benchmarks and a detailed model of Altera’s Stratix IV architecture, we compared the performance and quality of VPR and Quartus II targeting the same architecture. We found that VPR is at least 2.7× slower, uses 5.1× more memory, and uses 2.6× more wire compared to Quartus II. Finally, we identified that VPR’s focus on achieving a dense packing is responsible for a large portion of the wire-length gap.
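The academic BLIF format the abstract refers to is a plain-text netlist format. A minimal example (a single 2-input AND gate, for illustration only; real Titan23 circuits are far larger and also use `.subckt` instances for heterogeneous blocks such as RAMs and DSPs):

```
.model and2
.inputs a b
.outputs y
.names a b y
11 1
.end
```

Each `.names` block lists input/output signals followed by the truth-table cubes that drive the output high; here the cube `11 1` means y = 1 when a = b = 1.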
Maximizing Speed and Density of Tiled FPGA Overlays via Partitioning
Abstract

Cited by 1 (0 self)
Abstract—Common practice for large FPGA design projects is to divide subprojects into separate synthesis partitions to allow incremental recompilation as each subproject evolves. In contrast, smaller design projects avoid partitioning to give the CAD tool the freedom to perform as many global optimizations as possible, knowing that the optimizations normally improve performance and possibly area. In this paper, we show that for high-speed tiled designs composed of duplicated components and hence having multi-localities (multiple instances of equivalent logic), a designer can use partitioning to preserve multi-locality and improve performance. In particular, we focus on the lanes of SIMD soft processors and multicore meshes composed of them, as compiled by Quartus 12.1 targeting a Stratix IV EP4SE230F29C2 device. We demonstrate that, with negligible impact on compile time (less than ±10%): (i) we can use partitioning to provide high-level information to the CAD tool about preserving multi-localities in a design, without low-level micromanaging of the design description or CAD tool settings; (ii) by preserving multi-localities within SIMD soft processors, we can increase both frequency (by up to 31%) and compute density (by up to 15%); (iii) partitioning improves the density and speed (by up to 51% and 54%, respectively) of a mesh of soft processors, across many building-block configurations and mesh geometries; (iv) the improvements from partitioning increase as the number of tiled computing elements (SIMD lanes or mesh nodes) increases. As an example of the benefits of partitioning, a mesh of 102 scalar soft processors improves its operating frequency from 284 to 437 MHz and its peak performance from 28,968 to 44,574 MIPS, while increasing its logic area by only 0.85%.
Optimising Memory Bandwidth Use and Performance for Matrix-Vector Multiplication in Iterative Methods
Abstract
Computing the solution to a system of linear equations is a fundamental problem in scientific computing, and its acceleration has drawn wide interest in the FPGA community [Morris et al. 2006; Zhang et al. 2008; Zhuo and Prasanna 2006]. One class of algorithms to solve these systems, iterative methods, has drawn particular interest, with recent literature showing large performance improvements over general-purpose processors (GPPs) [Lopes and Constantinides 2008]. In several iterative methods, this performance gain is largely a result of parallelisation of the matrix-vector multiplication, an operation that occurs in many applications and hence has also been widely studied on FPGAs [Zhuo and Prasanna 2005; El-Kurdi et al. 2006]. However, whilst the performance of matrix-vector multiplication on FPGAs is generally I/O bound [Zhuo and Prasanna 2005], the nature of iterative methods allows the use of on-chip memory buffers to increase the bandwidth, providing the potential for significantly more parallelism [deLorimier and DeHon 2005]. Unfortunately, existing approaches have generally either been capable of solving large matrices with only limited improvement over GPPs [Zhuo and Prasanna 2005; El-Kurdi et al. 2006; deLorimier and DeHon 2005], or achieve high performance only for relatively small matrices [Lopes and Constantinides 2008; Boland and Constantinides 2008]. This paper proposes hardware designs to take advantage of symmetrical and banded matrix structure, as well as methods to optimise the RAM use, in order to both increase the performance and retain this performance for larger-order matrices.
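The storage saving behind exploiting symmetric banded structure can be sketched in software: keeping only the main diagonal and the superdiagonals, each stored entry serves two positions of the matrix. A minimal sketch (illustrative only, not the paper's hardware design):

```python
import numpy as np

def banded_symmetric_matvec(diags, x):
    """Compute y = A @ x for a symmetric banded matrix A.

    diags[k] holds the k-th superdiagonal of A (diags[0] is the main
    diagonal), so only bandwidth+1 short vectors are stored instead of
    the full n*n matrix -- the kind of structure exploitation that lets
    the operands fit in on-chip RAM buffers.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    y = np.asarray(diags[0], dtype=float) * x   # main-diagonal contribution
    for k in range(1, len(diags)):
        d = np.asarray(diags[k], dtype=float)   # length n - k
        y[:n - k] += d * x[k:]                  # superdiagonal entries A[i, i+k]
        y[k:] += d * x[:n - k]                  # mirrored subdiagonal entries A[i+k, i]
    return y
```

Each stored value is used for both A[i, i+k] and A[i+k, i], halving the memory traffic relative to a dense symmetric matrix.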
Fixed-Point Lanczos: Sustaining TFLOP-Equivalent Performance in FPGAs for Scientific Computing (2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines)
Abstract
Abstract—We consider the problem of enabling fixed-point implementations of linear algebra kernels to match the strengths of the field-programmable gate array (FPGA). Algorithms for solving linear equations, finding eigenvalues or finding singular values are typically nonlinear and recursive, making the problem of establishing analytical bounds on variable dynamic range nontrivial. Current approaches fail to provide tight bounds for algorithms of this type. We use as a case study one of the most important kernels in scientific computing, the Lanczos iteration, which lies at the heart of well-known methods such as conjugate gradient and minimum residual, and we show how we can modify the algorithm to allow us to apply standard linear algebra analysis to prove tight analytical bounds on all variables of the process, regardless of the properties of the original matrix. It is shown that the numerical behaviour of fixed-point implementations of the modified problem can be chosen to be at least as good as a double-precision floating-point implementation. Using this approach it is possible to get sustained FPGA performance very close to the peak general-purpose graphics processing unit (GPGPU) performance in FPGAs of comparable size when solving a single problem. If there are several independent problems to solve simultaneously, it is possible to exceed the peak floating-point performance of a GPGPU, obtaining approximately 1, 2 or 4 TFLOPs for error tolerances of 10⁻⁷, 10⁻⁵ and 10⁻³, respectively, in a large Virtex 7 FPGA. Keywords: Field-programmable gate arrays; Scientific computing; High-performance computing; Iterative algorithms; Fixed-point arithmetic
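The core idea of bounding all Lanczos variables can be illustrated in a few lines: for symmetric A, ||A||₂ ≤ ||A||∞, so dividing A by its (cheaply computed) infinity norm guarantees every α, β and every entry of the Lanczos vectors stays in [-1, 1]. A minimal sketch of this idea; the paper's modified recurrence differs in detail:

```python
import numpy as np

def scaled_lanczos(A, q0, m):
    """Run m Lanczos steps on A / s, where s >= ||A||_2 for symmetric A.

    With the matrix pre-scaled, |alpha| <= 1, beta <= 1 and |q_i| <= 1
    hold at every step -- the bounded dynamic range a fixed-point
    implementation needs.  Breakdown (beta == 0) is ignored here.
    """
    s = np.abs(A).sum(axis=1).max()     # ||A||_inf, bounds ||A||_2 for symmetric A
    As = A / s
    q_prev = np.zeros_like(q0, dtype=float)
    q = q0 / np.linalg.norm(q0)
    beta = 0.0
    alphas, betas = [], []
    for _ in range(m):
        v = As @ q - beta * q_prev      # three-term recurrence
        alpha = q @ v
        v -= alpha * q                  # orthogonalise against current vector
        beta = np.linalg.norm(v)
        alphas.append(alpha)
        betas.append(beta)
        q_prev, q = q, v / beta
    return np.array(alphas), np.array(betas)
```

Since α = qᵀ(A/s)q and β is the norm of an orthogonal residual of (A/s)q, both are bounded by ||A/s||₂ ≤ 1 regardless of the original matrix, which is the property the bounding analysis relies on.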
Fine-Grained Interconnect Synthesis
Abstract
One of the key challenges for the FPGA industry going forward is to make the task of designing hardware easier. A significant portion of that design task is the creation of the interconnect pathways between functional structures. We present a synthesis tool that automates this process and focuses on the interconnect needs in the fine-grained (sub-IP-block) design space. Here there are several issues that prior research and tools do not address well: the need to have fixed, deterministic latency between communicating units (to enable high-performance local communication without the area overheads of latency-insensitivity), and the ability to avoid generating unnecessary arbitration hardware when the application design can avoid it. Using a design example, our tool generates interconnect that requires 72% fewer lines of specification code than a handwritten Verilog implementation, which is a 33% overall reduction for the entire application. The resulting system, while requiring 4% more total functional and interconnect area, achieves the same performance. We also show quantitative and qualitative advantages over an existing commercial interconnect synthesis tool, achieving a 25% performance advantage and 17%/57% logic/memory area savings.
Custom Optimization Algorithms for Efficient Hardware Implementation
2013
Abstract
The focus is on real-time optimal decision making with application in advanced control systems. These computationally intensive schemes, which involve the repeated solution of (convex) optimization problems within a sampling interval, require more efficient computational methods than currently available for extending their application to highly dynamical systems and setups with resource-constrained embedded computing platforms. A range of techniques are proposed to exploit synergies between digital hardware, numerical analysis and algorithm design. These techniques build on top of parameterisable hardware code generation tools that generate VHDL code describing custom computing architectures for interior-point methods and a range of first-order constrained optimization methods. Since memory limitations are often important in embedded implementations, we develop a custom storage scheme for KKT matrices arising in interior-point methods for control, which reduces memory requirements significantly and prevents I/O bandwidth limitations from affecting the performance in our implementations. To take advantage of the trend towards parallel computing architectures and to exploit the special character …
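KKT matrices from predictive control are block-tridiagonal (one block row per horizon stage), which is what makes a compact custom storage scheme possible: only the diagonal and subdiagonal blocks need to be kept. A hypothetical sketch of such a scheme, not the thesis's actual memory layout:

```python
import numpy as np

def block_tridiag_matvec(D, L, x):
    """Compute y = K @ x for a symmetric block-tridiagonal K.

    Only the diagonal blocks D[i] and subdiagonal blocks L[i] are
    stored, cutting memory from O((N*n)^2) for the dense matrix to
    O(N*n^2) -- the kind of saving a custom KKT storage scheme gives
    when the horizon length N is large.
    """
    N = len(D)
    n = D[0].shape[0]
    xb = np.asarray(x, dtype=float).reshape(N, n)   # one slice per stage
    y = np.stack([D[i] @ xb[i] for i in range(N)])  # diagonal-block products
    for i in range(N - 1):
        y[i + 1] += L[i] @ xb[i]        # subdiagonal block K[i+1, i]
        y[i] += L[i].T @ xb[i + 1]      # its symmetric counterpart K[i, i+1]
    return y.reshape(-1)
```

The matrix-vector product is the dominant operation inside iterative linear solvers used by interior-point methods, so keeping it streaming-friendly is what prevents I/O bandwidth from becoming the bottleneck.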
A Low-Complexity Scaling Method for the Lanczos Kernel in Fixed-Point Arithmetic
Abstract
Abstract—We consider the problem of enabling fixed-point implementation of linear algebra kernels on low-cost embedded systems, as well as motivating more efficient computational architectures for scientific applications. Fixed-point arithmetic presents additional design challenges compared to floating-point arithmetic, such as having to bound peak values of variables and control their dynamic ranges. Algorithms for solving linear equations or finding eigenvalues are typically nonlinear and iterative, making solving these design challenges a nontrivial task. For these types of algorithms the bounding problem cannot be automated by current tools. We focus on the Lanczos iteration, the heart of well-known methods such as conjugate gradient and minimum residual. We show how one can modify the algorithm with a low-complexity scaling procedure to allow us to apply standard linear algebra to derive tight analytical bounds on all variables of the process, regardless of the properties of the original matrix. It is shown that the numerical behaviour of fixed-point implementations of the modified problem can be chosen to be at least as good as a floating-point implementation, if necessary. The approach is evaluated on field-programmable gate array (FPGA) platforms, highlighting orders-of-magnitude potential performance and efficiency improvements from moving from floating-point to fixed-point computation.
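The payoff of bounding every variable to a known range is that a short fixed-point word then represents each value with a known worst-case error. A minimal software model of that rounding step (illustrative only, not the paper's hardware arithmetic):

```python
import numpy as np

def to_fixed(x, frac_bits=15):
    """Round x onto a signed fixed-point grid with frac_bits fractional bits.

    Once a scaling procedure has bounded every variable to roughly
    [-1, 1), a Q1.15-style format like this one represents each value
    with absolute error at most 2**-(frac_bits + 1), so the word length
    can be chosen directly from the target error tolerance.
    """
    scale = float(1 << frac_bits)
    return np.round(np.asarray(x, dtype=float) * scale) / scale
```

Without the bounding analysis, an unscaled variable could overflow the integer part of such a format, which is exactly the failure mode the scaling procedure rules out.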
Study of Scatter/Gather & Inter-Neighbour Traffics Under FPGA Constraints
2012
Abstract
We evaluate the scatter-gather and inter-neighbour communication traffic for torus and tree topologies under FPGA constraints. To do this, we created a cycle-accurate simulator and traffic generator, and implemented the Blocked LU Decomposition and Sudoku Constraint Propagation traffic algorithms. We have found the torus to be the better candidate for inter-neighbour traffic. Provided we connect the memCntrl intelligently to minimize the distance to it, scatter-gather traffic between compute nodes and the memCntrl does not favour either topology.