Results 1 - 10
of
29
A High-Performance Microarchitecture with Hardware-Programmable Functional Units
- in Proceedings of the 27th Annual International Symposium on Microarchitecture
, 1994
"... This paper explores a novel way to incorporate hardware-programmable resources into a processor microarchitecture to improve the performance of general-purpose applications. Through a coupling of compile-time analysis routines and hardware synthesis tools, we automatically configure a given set of t ..."
Abstract
-
Cited by 171 (1 self)
- Add to MetaCart
This paper explores a novel way to incorporate hardware-programmable resources into a processor microarchitecture to improve the performance of general-purpose applications. Through a coupling of compile-time analysis routines and hardware synthesis tools, we automatically configure a given set of the hardware-programmable functional units (PFUs) and thus augment the base instruction set architecture so that it better meets the instruction set needs of each application. We refer to this new class of general-purpose computers as PRogrammable Instruction Set Computers (PRISC). Although similar in concept, the PRISC approach differs from dynamically programmable microcode because in PRISC we define entirely-new primitive datapath operations. In this paper, we concentrate on the microarchitectural design of the simplest form of PRISC---a RISC microprocessor with a single PFU that only evaluates combinational functions. We briefly discuss the operating system and the programming language co...
The Chimera Reconfigurable Functional Unit
, 2004
"... By strictly separating reconfigurable logic from the host processor, current custom computing systems suffer from a significant communication bottleneck. In this paper, we describe Chimaera, a system that overcomes the communication bottleneck by integrating reconfigurable logic into the host proce ..."
Abstract
-
Cited by 149 (16 self)
- Add to MetaCart
By strictly separating reconfigurable logic from the host processor, current custom computing systems suffer from a significant communication bottleneck. In this paper, we describe Chimaera, a system that overcomes the communication bottleneck by integrating reconfigurable logic into the host processor itself. With direct access to the host processor’s register file, the system enables the creation of multi-operand instructions and a speculative execution model key to high-performance, general-purpose reconfigurable computing. Chimaera also supports multi-output functions and utilizes partial run-time reconfiguration to reduce reconfiguration time. Combined, the system can provide speedups of a factor of two or more for general-purpose computing, and speedups of 160 or more are possible for hand-mapped applications.
Bitwidth Analysis with Application to Silicon Compilation
, 2000
"... This paper introduces Bitwise, a compiler that minimizes the bitwidth --- the number of bits used to representeach operand --- for both integers and pointers in a program. By propagating static information both forward and backward in the program dataflowgraph,Bitwise frees the programmer from decla ..."
Abstract
-
Cited by 80 (0 self)
- Add to MetaCart
This paper introduces Bitwise, a compiler that minimizes the bitwidth --- the number of bits used to representeach operand --- for both integers and pointers in a program. By propagating static information both forward and backward in the program dataflowgraph,Bitwise frees the programmer from declaring bitwidth invariants in cases where the compiler can determine bitwidths automatically. We find a rich opportunity for bitwidth reduction in modern multimedia and streaming application workloads. For new architectures that support sub-word quantities, we expect that our bitwidth reductions will savepower and increase processor performance. This paper
CHIMAERA: a high-performance architecture with a tightly-coupled reconfigurable functional unit
- IN PROCEEDINGS OF THE 27TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE
, 2000
"... Reconfigurable hardware has the potential for significant performance improvements by providing support for application−specific operations. We report our experience with Chimaera, a prototype system that integrates a small and fast reconfigurable functional unit (RFU) into the pipeline of an aggres ..."
Abstract
-
Cited by 76 (1 self)
- Add to MetaCart
Reconfigurable hardware has the potential for significant performance improvements by providing support for application−specific operations. We report our experience with Chimaera, a prototype system that integrates a small and fast reconfigurable functional unit (RFU) into the pipeline of an aggressive, dynamically−scheduled superscalar processor. Chimaera is capable of performing 9−input/1−output operations on integer data. We discuss the Chimaera C compiler that automatically maps computations for execution in the RFU. Chimaera is capable of: (1) collapsing a set of instructions into RFU operations, (2) converting control−flow into RFU operations, and (3) supporting a more powerful fine−grain data−parallel model than that supported by current multimedia extension instruction sets (for integer operations). Using a set of multimedia and communication applications we show that even with simple optimizations, the Chimaera C compiler is able to map 22 % of all instructions to the RFU on the average. A variety of computations are mapped into RFU operations ranging from as simple as add/sub−shift pairs to operations of more than 10 instructions including several branches. Timing experiments demonstrate that for a 4−way out−of−order superscalar processor Chimaera results in average performance improvements of 21%, assuming a very aggressive core processor design (most pessimistic RFU latency model) and communication overheads from and to the RFU.
Configuration Prefetch for Single Context Reconfigurable Coprocessors
, 1998
"... Current reconfigurable systems suffer from a significant overhead due to the time it takes to reconfigure their hardware. In order to deal with this overhead, and increase the power of reconfigurable systems, it is important to develop hardware and software systems to reduce or eliminate this delay. ..."
Abstract
-
Cited by 48 (8 self)
- Add to MetaCart
Current reconfigurable systems suffer from a significant overhead due to the time it takes to reconfigure their hardware. In order to deal with this overhead, and increase the power of reconfigurable systems, it is important to develop hardware and software systems to reduce or eliminate this delay. In this paper we propose one technique for significantly reducing the reconfiguration latency: the prefetching of configurations. By loading a configuration into the reconfigurable logic in advance of when it is needed, we can overlap the reconfiguration with useful computation. We demonstrate the power of this technique, and propose an algorithm for automatically adding prefetch operations into reconfigurable applications. This results in a significant decrease in the reconfiguration overhead for these applications. 1 Introduction When FPGAs were first introduced in the mid 1980s they were viewed as a technology for replacing standard gate arrays for some applications. In these first gener...
BitValue Inference: Detecting and Exploiting Narrow Bitwidth Computations
- IN PROCEEDINGS OF THE EUROPAR 2000 EUROPEAN CONFERENCE ON PARALLEL COMPUTING
, 2000
"... We present a compiler algorithm called BitValue, which can discover unused and constant bits in dusty-deck C programs. BitValue uses forward and backward dataflow analyses, generalizing constant-folding and dead-code detection at the bit-level. This algorithm enables compiler optimizations targeting ..."
Abstract
-
Cited by 41 (7 self)
- Add to MetaCart
We present a compiler algorithm called BitValue, which can discover unused and constant bits in dusty-deck C programs. BitValue uses forward and backward dataflow analyses, generalizing constant-folding and dead-code detection at the bit-level. This algorithm enables compiler optimizations targeting special processor architectures for computing on non-standard bitwidths. Using this algorithm we show that up to 36% of the computed bytes are thrown away; also, we show that on average 26.8% of the values computed require 16 bits or less (for programs from SpecINT95 and Mediabench). A compiler for reconfigurable hardware uses this algorithm to achieve substantial reductions (up to 20-fold) in the size of the synthesized circuits.
DPGA Utilization and Application
, 1996
"... Dynamically Programmable Gate Arrays (DPGAs) are programmable arrays which allow the strategic reuse of limited resources. In so doing, DPGAs promise greater capacity, and in some cases higher performance, than conventional programmable device architectures where all array resources are dedicate ..."
Abstract
-
Cited by 39 (5 self)
- Add to MetaCart
Dynamically Programmable Gate Arrays (DPGAs) are programmable arrays which allow the strategic reuse of limited resources. In so doing, DPGAs promise greater capacity, and in some cases higher performance, than conventional programmable device architectures where all array resources are dedicated to a single function for an entire operational epoch. This paper examines several usage patterns for DPGAs including temporal pipelining, utility functions, multiple function accommodation, and state-dependent logic. In the process, it offers insight into the application and technology space where DPGA-style reuse techniques are most beneficial. Acknowledgments: This research is supported by the Advanced Research Projects Agency of the Department of Defense under Rome Labs contract number F30602-94-C-0252. An abbreviated version of this paper appears in Proceedings of the 1996 International Symposium on Field Programmable Gate Arrays. Copyright c fl1996 by the Association for Comput...
HW/SW partitioning and code generation of embedded control applications on a reconfigurable architecture platform
- In Proceedings of the 10th International Workshop on Hardware/Software Codesign
, 2002
"... This paper studies the usage of a reconfigurable architecture platform for embedded control applications aimed at improving real time performance. The hw/sw codesign methodology from POLIS is used. It starts from high-level specifications, optimizes an intermediate model of computation (Extended Fin ..."
Abstract
-
Cited by 30 (2 self)
- Add to MetaCart
This paper studies the usage of a reconfigurable architecture platform for embedded control applications aimed at improving real time performance. The hw/sw codesign methodology from POLIS is used. It starts from high-level specifications, optimizes an intermediate model of computation (Extended Finite State Machines) and derives both hardware and software, based on performance constraints. We study a particular architecture platform, which consists of a general purpose processor core, augmented with a reconfigurable function unit and data-path to improve run time performance. A new mapping flow and algorithms to partition hardware and software are proposed to generate implementation that best utilizes this architecture. Encouraging preliminary results are shown for automotive electronic control examples.
Compiling application-specific hardware
- Montpellier (La Grande-Motte
, 2002
"... Abstract. In this paper we describe ASH, an architectural framework for implementing Application-Specific Hardware. ASH is based on automatic hardware synthesis from high-level languages. The generated circuits use only localized computation structures; in consequence, we expect these circuits to be ..."
Abstract
-
Cited by 27 (6 self)
- Add to MetaCart
Abstract. In this paper we describe ASH, an architectural framework for implementing Application-Specific Hardware. ASH is based on automatic hardware synthesis from high-level languages. The generated circuits use only localized computation structures; in consequence, we expect these circuits to be fast, to use little power and to scale well with program complexity. We present in detail CASH, a scalable compiler framework for ASH, which generates hardware from programs written in C. Our compiler exploits instruction level parallelism by using aggressive speculation and dynamic scheduling. Based on this compilation scheme, we evaluate the computational resources necessary for implementing complex integer-based programs, and we suggest architectural features that would support the ASH framework. 1
The MOLEN rm-Coded Processor
- In Proc. 11th Intl. Conference FPL-2001
, 2001
"... In this paper, we introduce the MOLEN -coded processor which comprises hardwired and microcoded reconfigurable units. At the expense of three new instructions, the proposed mechanisms allow instructions, entire pieces of code, or their combination to execute in a reconfigurable manner. The reconf ..."
Abstract
-
Cited by 15 (10 self)
- Add to MetaCart
In this paper, we introduce the MOLEN -coded processor which comprises hardwired and microcoded reconfigurable units. At the expense of three new instructions, the proposed mechanisms allow instructions, entire pieces of code, or their combination to execute in a reconfigurable manner. The reconfiguration of the hardware and the execution on the reconfigured hardware are performed by -microcode (an extension of the classical microcode to allow reconfiguration capabilities). We include fixed and pageable microcode hardware features to extend the flexibility and improve the performance. The scheme allows partial reconfiguration and includes caching mechanisms for non-frequently used reconfiguration and execution microcode. Using simulations, we establish the performance potential of the proposed processor assuming the JPEG and MPEG-2 benchmarks, the ALTERA APEX20K boards for the implementation, and a hardwired superscalar processor. After implementation, cycle time estimations and normalization, our simulations indicate that the execution cycles of the superscalar machine can be reduced by 30% for the JPEG benchmark and by 32% for the MPEG-2 benchmark using the proposed processor organization.

