High-Level Synthesis of Nonprogrammable Hardware Accelerators
Journal of VLSI Signal Processing, 2000
Cited by 51 (5 self)
The PICO-N system automatically synthesizes embedded nonprogrammable accelerators to be used as co-processors for functions expressed as loop nests in C. The output is synthesizable VHDL that defines the accelerator at the register transfer level (RTL). The system generates a synchronous array of customized VLIW (very-long instruction word) processors, their controller, local memory, and interfaces. The system also modifies the user's application software to make use of the generated accelerator. The user indicates the throughput to be achieved by specifying the number of processors and their initiation interval. In experimental comparisons, PICO-N designs are slightly more costly than hand-designed accelerators with the same performance.
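As a rough illustration of how a processor count and an initiation interval translate into throughput, the sketch below uses the usual software-pipelining estimate. It is not PICO-N's own model: the function names, the even split of iterations across processors, and the `schedule_length` parameter (latency of one iteration) are assumptions made for this example.

```python
import math

def steady_state_throughput(num_processors, initiation_interval):
    """Iterations completed per cycle once every pipeline is full."""
    return num_processors / initiation_interval

def estimated_cycles(iterations, num_processors, initiation_interval, schedule_length):
    """Back-of-envelope cycle count, assuming iterations are split evenly.

    Each processor starts a new iteration every `initiation_interval` cycles,
    and the last iteration needs `schedule_length` cycles to drain.
    """
    per_processor = math.ceil(iterations / num_processors)
    if per_processor == 0:
        return 0
    return (per_processor - 1) * initiation_interval + schedule_length

print(steady_state_throughput(4, 2))     # 2.0 iterations per cycle
print(estimated_cycles(1000, 4, 2, 10))  # 508 cycles
```

Halving the initiation interval or doubling the processor count doubles the steady-state rate in this model, which is why the two parameters together are enough to state a throughput target.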
EPIC: An architecture for instruction-level parallel processors
2000
Cited by 27 (1 self)
Keywords: VLIW architecture, instruction-level parallelism, MultiOp, nonunit assumed latencies, NUAL, rotating register files, unbundled branches, control speculation, speculative opcodes, exception tag, predicated execution, fully-resolved predicates, wired-OR and wired-AND compare opcodes, prioritized loads and stores, data speculation, cache specifiers, precise interrupts, NUAL-freeze and NUAL-drain semantics, delay buffers, replay buffers, EQ and LEQ semantics, latency stalling, MultiOp-P and MultiOp-S semantics, dynamic translation, ...
Conservation Cores: Reducing the Energy of Mature Computations
Cited by 27 (6 self)
Growing transistor counts, limited power budgets, and the breakdown of voltage scaling are currently conspiring to create a utilization wall that limits the fraction of a chip that can run at full speed at one time. In this regime, specialized, energy-efficient processors can increase parallelism by reducing the per-computation power requirements and allowing more computations to execute under the same power budget. To pursue this goal, this paper introduces conservation cores. Conservation cores, or c-cores, are specialized processors that focus on reducing energy and energy-delay instead of increasing performance. This focus on energy makes c-cores an excellent match for many applications that would be poor candidates for hardware acceleration (e.g., irregular integer codes). We present a toolchain for automatically synthesizing c-cores from application source code and demonstrate that they can significantly reduce energy and energy-delay for a wide range of applications. The c-cores support patching, a form of targeted reconfigurability, that allows them to adapt to new versions of the software they target. Our results show that conservation cores can reduce energy consumption by up to 16.0× for functions and by up to 2.1× for whole applications, while patching can extend the useful lifetime of individual c-cores to match that of conventional processors.
Synthesis of Custom Processors based on Extensible Platforms
In ICCAD, 2002
Cited by 25 (2 self)
Efficiency and flexibility are critical, but often conflicting, design goals in embedded system design. The recent emergence of extensible processors promises a favorable tradeoff between efficiency and flexibility, while keeping design turnaround times short. Current extensible processor design flows automate several tedious tasks, but typically require designers to manually select the parts of the program that are to be implemented as custom instructions. In this work, we describe an automatic methodology to select custom instructions to augment an extensible processor, in order to maximize its efficiency for a given application program. We demonstrate that the number of custom instruction candidates grows rapidly with program size, leading to a large design space, and that the quality (speedup) of custom instructions varies significantly across this space, motivating the need for the proposed flow. Our methodology features cost functions to guide the custom instruction selection process, as well as static and dynamic pruning techniques to eliminate inferior parts of the design space from consideration. Further, we employ a two-stage process, wherein a limited number of promising instruction candidates are first selected, and then evaluated in more detail through cycle-accurate instruction set simulation and synthesis of the corresponding hardware, to identify the custom instruction combinations that result in the highest program speedup or maximize speedup under a given area constraint.
Data remapping for design space optimization of embedded memory systems
ACM Transactions on Embedded Computing Systems, 2003
Cited by 25 (8 self)
In this article, we present a novel linear time algorithm for data remapping, that is, (i) lightweight; (ii) fully automated; and (iii) applicable in the context of pointer-centric programming languages with dynamic memory allocation support. All previous work in this area lacks one or more of these features. We proceed to demonstrate a novel application of this algorithm as a key step in optimizing the design of an embedded memory system. Specifically, we show that by virtue of locality enhancements via data remapping, we may reduce the memory subsystem needs of an application by 50%, and hence concomitantly reduce the associated costs in terms of size, power, and dollar-investment (61%). Such a reduction overcomes key hurdles in designing high-performance embedded computing solutions. Namely, memory subsystems are very desirable from a performance standpoint, but their costs have often limited their use in embedded systems. Thus, our innovative approach offers the intriguing possibility of compilers playing a significant role in exploring and optimizing the design space of a memory subsystem for an embedded design. To this end and in order to properly leverage the improvements afforded by a compiler optimization, we identify a range of measures for quantifying the cost-impact of popular notions of locality, prefetching, regularity of memory access and others. The proposed methodology will ...
Automatic and Efficient Evaluation of Memory Hierarchies for Embedded Systems
1999
Cited by 23 (2 self)
Automation is the key to the design of future embedded systems as it permits application-specific customization while keeping design costs low. A key problem faced by automatic design systems is evaluating the performance of the vast number of alternative designs in a timely manner. For this paper, we focus on an embedded system consisting of the following components: a VLIW processor, instruction cache, data cache, and second-level unified cache. A hierarchical approach of partitioning the system into its constituent components and evaluating each component individually is utilized. The performance of each processor is evaluated independent of its memory hierarchy, and each of the caches is simulated using the traces from a single reference processor. Since the changes in the processor architecture do indeed affect the address traces and thus the performance of the memory hierarchy, the overall performance is inaccurate. To overcome this error, the changes in the processor architecture are modeled as a dilation of the reference processor's address trace, where each instruction block in the trace is conceptually stretched out by the dilation coefficient. This approach provides a projected cache performance that more accurately accounts for changes in the processor architecture. In order to understand the accuracy of the dilation model, we separate the possible errors that the model introduces and quantify these errors on a set of benchmarks. The results show the dilation model is effective for most of the design space and facilitates efficient automatic design.
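The dilation idea can be pictured with a small sketch. The abstract does not give the exact model, so everything here is an assumption for illustration: the trace is a list of hypothetical `(start_address, length_in_bytes)` instruction blocks, dilation scales both fields by the coefficient, and the cache is a simple direct-mapped instruction cache.

```python
def dilate_trace(blocks, coeff):
    """Stretch every instruction block of a reference trace by `coeff`.

    Scaling both the start address and the length is an assumption of this
    sketch; it mimics code that grows uniformly on a different processor.
    """
    return [(int(start * coeff), max(1, round(length * coeff)))
            for start, length in blocks]

def count_misses(blocks, num_lines, line_size):
    """Count misses of a direct-mapped cache over a block trace."""
    tags = [None] * num_lines
    misses = 0
    for start, length in blocks:
        first = start // line_size
        last = (start + length - 1) // line_size
        for line in range(first, last + 1):
            index = line % num_lines
            if tags[index] != line:
                tags[index] = line
                misses += 1
    return misses

# Two 32-byte blocks executed twice; the cache holds two 32-byte lines.
reference = [(0, 32), (32, 32), (0, 32), (32, 32)]
print(count_misses(reference, num_lines=2, line_size=32))                   # 2
print(count_misses(dilate_trace(reference, 2), num_lines=2, line_size=32))  # 8
```

In this toy case the reference code fits in the cache, while the dilated code no longer does, so the projected miss count rises without re-running the application on the new processor.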
Bitwidth Cognizant Architecture Synthesis of Custom Hardware Accelerators
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2001
Cited by 23 (3 self)
Keywords: application-specific design, architecture synthesis, bitwidth, clustering, embedded system, hardware accelerator, operation scheduling, resource allocation.
PICO is a system for automatically synthesizing embedded hardware accelerators from loop nests specified in the C programming language. A key issue confronted when designing such accelerators is the optimization of hardware by exploiting information that is known about the varying number of bits required to represent and process operands. In this paper, we describe the handling and exploitation of integer bitwidth in PICO. A bitwidth analysis procedure is used to determine bitwidth requirements for all integer variables and operations in a C application. Given known bitwidths for all variables, complex problems arise when determining a program schedule that specifies on which function unit and at what time each operation executes. If operations are assigned to function units with no knowledge of bitwidth, bitwidth-related cost benefit is lost when each unit is built to accommodate the widest operation assigned. By carefully placing operations of similar width on the same unit, hardware costs are decreased. This problem is addressed using a preliminary clustering of operations that is based jointly on width and implementation cost. These clusters are then honored during resource allocation and operation scheduling to create an efficient width-conscious design. Experimental results show that exploiting integer bitwidth substantially reduces the gate count of PICO-synthesized hardware accelerators across a range of applications.
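The effect of width-blind assignment can be seen in a toy model. This is only a sketch under stated assumptions, not PICO's clustering procedure: a unit's cost is taken to be the width of its widest operation, scheduling constraints are ignored, and the helper names are hypothetical.

```python
def unit_cost(widths):
    """A unit must be as wide as the widest operation placed on it."""
    return max(widths) if widths else 0

def total_cost(assignment):
    """Sum of unit costs over all function units."""
    return sum(unit_cost(unit) for unit in assignment)

def round_robin(widths, num_units):
    """Width-blind assignment: deal operations out in program order."""
    units = [[] for _ in range(num_units)]
    for i, width in enumerate(widths):
        units[i % num_units].append(width)
    return units

def cluster_by_width(widths, num_units):
    """Width-aware assignment: sort by width, then cut into equal groups."""
    if not widths:
        return [[] for _ in range(num_units)]
    ordered = sorted(widths)
    size = -(-len(ordered) // num_units)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

op_widths = [8, 32, 32, 8]  # operand widths in bits
print(total_cost(round_robin(op_widths, 2)))       # 64: both units are 32 bits wide
print(total_cost(cluster_by_width(op_widths, 2)))  # 40: one 8-bit and one 32-bit unit
```

The abstract notes that the real clustering is based jointly on width and implementation cost, so sorting by width alone should be read as the simplest possible stand-in.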
Custom-Instruction Synthesis for Extensible-Processor Platforms
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2004
Cited by 13 (0 self)
Efficiency and flexibility are critical, but often conflicting, design goals in embedded system design. The recent emergence of extensible processors promises a favorable tradeoff between efficiency and flexibility, while keeping design turnaround times short. Current extensible processor design flows automate several tedious tasks, but typically require designers to manually select the parts of the program that are to be implemented as custom instructions. In this work, we describe an automatic methodology to select custom instructions to augment an extensible processor, in order to maximize its efficiency for a given application program. We demonstrate that the number of custom instruction candidates grows rapidly with program size, leading to a large design space, and that the quality (speedup) of custom instructions varies significantly across this space, motivating the need for the proposed flow. Our methodology features cost functions to guide the custom instruction selection process, as well as static and dynamic pruning techniques to eliminate inferior parts of the design space from consideration. Furthermore, we employ a two-stage process, wherein a limited number of promising instruction candidates are first short-listed using efficient selection criteria, and then evaluated in more detail through cycle-accurate instruction set simulation and synthesis of the corresponding hardware, to identify the custom instruction combinations that result in the highest program speedup or maximize speedup under a given area constraint. We have evaluated the proposed techniques using a state-of-the-art extensible processor platform, in the context of a commercial design flow. Experiments with several benchmark programs indicate that custom processors synthesized using automa...
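The two-stage selection described in this abstract can be sketched as follows. This is a simplified illustration, not the authors' flow: candidates are hypothetical dictionaries with `name`, `area`, and a cheap `est_gain`; the shortlist criterion (estimated gain per unit area) and the additive gains are assumptions; and `detailed_gain` stands in for the cycle-accurate simulation and synthesis step.

```python
from itertools import combinations

def shortlist(candidates, k):
    """Stage 1: keep the k candidates with the best estimated gain per area."""
    ranked = sorted(candidates, key=lambda c: c["est_gain"] / c["area"], reverse=True)
    return ranked[:k]

def best_combination(candidates, area_budget, detailed_gain):
    """Stage 2: exhaustively score the shortlisted combinations that fit the budget."""
    best_names, best_gain = [], 0.0
    for r in range(1, len(candidates) + 1):
        for combo in combinations(candidates, r):
            if sum(c["area"] for c in combo) > area_budget:
                continue
            gain = sum(detailed_gain(c) for c in combo)  # assumes gains add up
            if gain > best_gain:
                best_names, best_gain = [c["name"] for c in combo], gain
    return best_names, best_gain

candidates = [
    {"name": "A", "area": 5, "est_gain": 10.0},
    {"name": "B", "area": 2, "est_gain": 6.0},
    {"name": "C", "area": 3, "est_gain": 3.0},
    {"name": "D", "area": 4, "est_gain": 8.0},
]
measured = {"A": 9.0, "B": 5.0, "D": 8.5}  # hypothetical detailed results
picked = shortlist(candidates, k=3)
print(best_combination(picked, area_budget=7, detailed_gain=lambda c: measured[c["name"]]))
```

Exhaustive enumeration is only practical because stage 1 keeps the candidate set small, which is the point of short-listing before the expensive evaluation.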
Embedded Computing: New Directions in Architecture and Automation
In 7th International Conference on High-Performance Computing (HiPC 2000), 2000
Cited by 10 (0 self)
In this report, we elaborate on these claims and provide, as an example, an overview of PICO, the architecture synthesis system that the authors and their colleagues have been developing over the past five years.
ShiftQ: A Buffered Interconnect for Custom Loop Accelerators
2001
Cited by 8 (0 self)
ShiftQs are hardware structures consisting of registers and switches which buffer and transport operands among function units within custom hardware loop accelerators. ShiftQs help minimize buffering and interconnect costs by customizing the hardware to the given schedule and by intelligent sharing of register and interconnect resources. This paper describes the ShiftQ schema and a method to automatically synthesize ShiftQs from modulo-scheduled loops. We also evaluate the cost savings by comparing ShiftQs against traditional storage and interconnect mechanisms.

