Results 1 - 10
of
21
Instruction Generation for Hybrid Reconfigurable Systems
- ACM Transactions on Design Automation of Electronic Systems
, 2001
"... Building Blocks (ABBs), or instructions available from a given hardware library. The customized data path generated from many ABBs was referred to as an application specific unit (ASU). Cathedral's synthesis targeted ASUs, which could be executed in very few clock cycles. This goal was achieved via ..."
Abstract
-
Cited by 53 (5 self)
- Add to MetaCart
Building Blocks (ABBs), or instructions available from a given hardware library. The customized data path generated from many ABBs was referred to as an application specific unit (ASU). Cathedral's synthesis targeted ASUs, which could be executed in very few clock cycles. This goal was achieved via manual clustering of necessary operations into more compact operations, essentially a form of template construction. Whereas our template generation and matching algorithms are automated, the definition of clusters in Cathedral was a manual operation, mainly clustering loop and function bodies. Their results demonstrated an expected reduction of critical path length as well as interconnect as a result of clustering.
Instruction Generation and Regularity Extraction for Reconfigurable Processors
- In CASES
, 2002
"... The increasing demand for complex and specialized embedded hardware must be met by processors which are optimized for performance, yet are also extremely flexible. In our work, we explore the tradeoff between flexibility and performance in the domain of reconfigurable processor design. Specifically, ..."
Abstract
-
Cited by 32 (3 self)
- Add to MetaCart
The increasing demand for complex and specialized embedded hardware must be met by processors which are optimized for performance, yet are also extremely flexible. In our work, we explore the tradeoff between flexibility and performance in the domain of reconfigurable processor design. Specifically, we seek to identify regularly occurring, computation-heavy patterns in an application or set of applications. These patterns become candidates for hard-logic implementation, potentially embedded in the flexible reconflgurable fabric as special optimized instructions. In this work we present an extension to previous work in instruction generation: an algorithm that identifies parallel templates. We discuss the advantages of parallel templates, and prove the correctness of our algorithm. We introduce an All-Pairs Common Slack Graph (APCSG) as an effective tool for parallel template generation. Finally, we demonstrate the effectiveness of our algorithm on several applications' dataflow graphs, reducing latency on average by 51.98%, without unreasonably increasing chip area.
Bitwidth Cognizant Architecture Synthesis of Custom Hardware Accelerators
- IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
, 2001
"... applicationspecific design, architecture synthesis, bitwidth, clustering, embedded system, hardware accelerator, operation scheduling, resource allocation PICO is a system for automatically synthesizing embedded hardware accelerators from loop nests specified in the C programming language. A key iss ..."
Abstract
-
Cited by 23 (3 self)
- Add to MetaCart
applicationspecific design, architecture synthesis, bitwidth, clustering, embedded system, hardware accelerator, operation scheduling, resource allocation PICO is a system for automatically synthesizing embedded hardware accelerators from loop nests specified in the C programming language. A key issue confronted when designing such accelerators is the optimization of hardware by exploiting information that is known about the varying number of bits required to represent and process operands. In this paper, we describe the handling and exploitation of integer bitwidth in PICO. A bitwidth analysis procedure is used to determine bitwidth requirements for all integer variables and operations in a C application. Given known bitwidths for all variables, complex problems arise when determining a program schedule that specifies on which function unit and at what time each operation executes. If operations are assigned to function units with no knowledge of bitwidth, bitwidth-related cost benefit is lost when each unit is built to accommodate the widest operation assigned. By carefully placing operations of similar width on the same unit, hardware costs are decreased. This problem is addressed using a preliminary clustering of operations that is based jointly on width and implementation cost. These clusters are then honored during resource allocation and operation scheduling to create an efficient widthconscious design. Experimental results show that exploiting integer bitwidth substantially reduces the gate count of PICO-synthesized hardware accelerators across a range of applications.
Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System
- In Proc. of the 38th Annual International Symposium on Microarchitecture
, 2005
"... Scheduling algorithms used in compilers traditionally focus on goals such as reducing schedule length and register pressure or producing compact code. In the context of a hardware synthesis system where the schedule is used to determine various components of the hardware, including datapath, storag ..."
Abstract
-
Cited by 9 (7 self)
- Add to MetaCart
Scheduling algorithms used in compilers traditionally focus on goals such as reducing schedule length and register pressure or producing compact code. In the context of a hardware synthesis system where the schedule is used to determine various components of the hardware, including datapath, storage, and interconnect, the goals of a scheduler change drastically. In addition to achieving the traditional goals, the scheduler must proactively make decisions to ensure efficient hardware is produced. This paper proposes two exact solutions for cost sensitive modulo scheduling, one based on an integer linear programming formulation and another based on branch-and-bound search. To achieve reasonable compilation times, decomposition techniques to break down the complex scheduling problem into phase ordered sub-problems are proposed. The decomposition techniques work either by partitioning the dataflow graph into smaller subgraphs and optimally scheduling the subgraphs, or by splitting the scheduling problem into two phases, time slot and resource assignment. The effectiveness of cost sensitive modulo scheduling in minimizing the costs of function units, register structures, and interconnection wires are evaluated within a fully automatic synthesis system for loop accelerators. The cost sensitive modulo scheduler increases the efficiency of the resulting hardware significantly compared to both traditional cost unaware and greedy cost aware modulo schedulers.
ShiftQ: A bufferred interconnect for custom loop accelerators
, 2001
"... ShiftQs are hardware structures consisting of registers and switches which buffer and transport operands among function units within custom hardware loop accelerators. ShiftQs help minimize buffering and interconnect costs by customizing the hardware to the given schedule and by intelligent sharing ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
ShiftQs are hardware structures consisting of registers and switches which buffer and transport operands among function units within custom hardware loop accelerators. ShiftQs help minimize buffering and interconnect costs by customizing the hardware to the given schedule and by intelligent sharing of register and interconnect resources. This paper describes the ShiftQ schema and a method to automatically synthesize them from modulo-scheduled loops. Wealsoevaluate the cost savings by comparing them against traditional storage and interconnect mechanisms.
Local watermarks: methodology and application to behavioral synthesis
- IEEE International Conference on Computer-Aided Design
, 1999
"... Recently, a number of techniques for IP protection have been introduced that rely on a selection of a global solution to an optimization problem according to a unique user-specific digital signature. Although such techniques may provide convincing proof of authorship with low hardware overhead, they ..."
Abstract
-
Cited by 8 (3 self)
- Add to MetaCart
Recently, a number of techniques for IP protection have been introduced that rely on a selection of a global solution to an optimization problem according to a unique user-specific digital signature. Although such techniques may provide convincing proof of authorship with low hardware overhead, they fail to protect parts of design, do not provide an easy procedure for watermark detection, and are not capable of detecting the watermark when the design or its part is aug-mented in another larger design. Since these demands are of the highest interest for the IP business, we introduce lo-calized watermarking as an IP protection technique that enables these features while satisfying the demand for low-cost and transparency. We propose a set of protocols that implement the new watermarking methodology at the oper-ation scheduling design level. We have demonstrated that the difficulty of erasing or finding another signature in the synthesized design can be made arbitrarily computationally difficult. The watermarking method has been tested on a set of real-life benchmarks where high likelihood of author-ship has been achieved with negligible overhead in solution quality. 1
Elimination of Redundant Memory Traffic in High-Level Synthesis
- IEEE Trans. on Comp-aided Design
, 1996
"... This paper presents a new transformation for the scheduling of memory-access operations in High-Level Synthesis. This transformation is suited to memory-intensive applications with synthesized designs containing a secondary store accessed by explicit instructions. Such memory-intensive behaviors are ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
This paper presents a new transformation for the scheduling of memory-access operations in High-Level Synthesis. This transformation is suited to memory-intensive applications with synthesized designs containing a secondary store accessed by explicit instructions. Such memory-intensive behaviors are commonly observed in video compression, image convolution, hydrodynamics and mechatronics. Our transformation removes load and store instructions which become redundant or unnecessary during the transformation of loops. The advantage of this reduction is the decrease of secondary memory bandwidth demands. This technique is implemented in our Percolation-Based Scheduler which we used to conduct experiments on a suite of memory-intensive benchmarks. Our results demonstrate a significant reduction in the number of memory operations and an increase in performance on these benchmarks. 1 Introduction Traditionally, one of the goals in High-Level Synthesis (HLS) is the optimization of storage req...
A Floorplan Based Methodology for Data-Path Synthesis of Sub-micron ASICs
- IEICE Trans. Inf. & Syst
, 1996
"... This paper proposes a novel methodology for automated data-path synthesis of such circuits and outlines algorithms to support it. In contrast to other approaches, we formulate interconnect area/delay optimizations as high-level synthesis transformations and use them during the synthesis to minimize ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
This paper proposes a novel methodology for automated data-path synthesis of such circuits and outlines algorithms to support it. In contrast to other approaches, we formulate interconnect area/delay optimizations as high-level synthesis transformations and use them during the synthesis to minimize the impact of wiring on circuit characteristics. Experiments with FIR filter implementations show that such formulation jointly with "on the fly" module generation and performance-driven floorplanning provides more than a 30% reduction in wiring delay for deep sub-micron designs.
Behavioral optimization using the manipulation of timing constraints
, 1995
"... Abstract — We introduce a transformation, named rephasing, that manipulates the timing parameters in control-data-flow graphs (CDFG’s) during the high-level synthesis of data-pathintensive applications. Timing parameters in such CDFG’s include the sample period, the latencies between input–output pa ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
Abstract — We introduce a transformation, named rephasing, that manipulates the timing parameters in control-data-flow graphs (CDFG’s) during the high-level synthesis of data-pathintensive applications. Timing parameters in such CDFG’s include the sample period, the latencies between input–output pairs, the relative times at which corresponding samples become available on different inputs, and the relative times at which the corresponding samples become available at the delay nodes. While some of the timing parameters may be constrained by performance requirements, or by the interface to the external world, others remain free to be chosen during the process of high-level synthesis. Traditionally high-level synthesis systems for data-pathintensive applications either have assumed that all the relative times, called phases, when corresponding samples are available at input and delay nodes are zero (i.e., all input and delay node samples enter at the initial cycle of the schedule) or have automatically assigned values to these phases as part of the data-path allocation/scheduling step in the case of newer schedulers that use techniques like overlapped scheduling to generate complex time shapes. Rephasing, however, manipulates the values of these phases as an algorithm transformation before the scheduling/allocation stage. The advantage of this approach is that phase values can be chosen to transform and optimize the algorithm for explicit metrics such as area, throughput, latency, and power. Moreover, the rephasing transformation can be combined with other transformations such as algebraic transformations. We have developed techniques for using rephasing to optimize a variety of design metrics, and our results show significant improvements in several design metrics. We have also investigated the relationship and interaction of rephasing with other high-level synthesis tasks. Index Terms—Behavioral synthesis, transformations. I.
Streamroller: Automatic Synthesis of Prescribed Throughput Accelerator Pipelines
, 2006
"... In this paper, we present a methodology for designing a pipeline of accelerators for an application. The application is modeled using sequential C language with simple stylizations. The synthesis of the accelerator pipeline involves designing loop accelerators for individual kernels, instantiating b ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
In this paper, we present a methodology for designing a pipeline of accelerators for an application. The application is modeled using sequential C language with simple stylizations. The synthesis of the accelerator pipeline involves designing loop accelerators for individual kernels, instantiating buffers for arrays used in the application, and hooking up these building blocks to form a pipeline. A compiler-based system automatically synthesizes loop accelerators for individual kernels at varying performance levels. An integer linear program formulation which simultaneously optimizes the cost of loop accelerators and the cost of memory buffers is proposed to compose the loop accelerators to form an accelerator pipeline for the whole application. Cases studies for some applications, including FMRadio and Beamformer, are presented to illustrate our design methodology. Experiments show significant cost savings are achieved through hardware sharing, while achieving the prescribed throughput requirements.

