The Chimaera Reconfigurable Functional Unit
, 2004
Cited by 149 (16 self)
By strictly separating reconfigurable logic from the host processor, current custom computing systems suffer from a significant communication bottleneck. In this paper, we describe Chimaera, a system that overcomes the communication bottleneck by integrating reconfigurable logic into the host processor itself. With direct access to the host processor’s register file, the system enables the creation of multi-operand instructions and a speculative execution model key to high-performance, general-purpose reconfigurable computing. Chimaera also supports multi-output functions and utilizes partial run-time reconfiguration to reduce reconfiguration time. Combined, the system can provide speedups of a factor of two or more for general-purpose computing, and speedups of 160 or more are possible for hand-mapped applications.
A Chip-Multiprocessor Architecture with Speculative Multithreading
- IEEE Transactions on Computers
, 1999
Cited by 112 (13 self)
Keywords: Chip-multiprocessor, speculative multithreading, data-dependence speculation, control speculation
1 INTRODUCTION
The superscalar approach [12], which allows more than one instruction to be issued in a single cycle, has become the norm for today's high-performance microprocessors. The issue rate of these microprocessors has continued to increase over the past few years, with today's high-performance superscalar processors such as the Compaq Alpha 21264 [4], IBM PowerPC [16], Intel Pentium-Pro [3] or MIPS R10000 [19] able to issue up to four instructions per cycle.
NanoFabrics: Spatial Computing Using Molecular Electronics
Cited by 110 (9 self)
The continuation of the remarkable exponential increases in processing power over the recent past faces imminent challenges due in part to the physics of deep-submicron CMOS devices and the costs of both chip masks and future fabrication plants. A promising solution to these problems is offered by an alternative to CMOS-based computing, chemically assembled electronic nanotechnology (CAEN). In this paper we outline how CAEN-based computing can become a reality. We briefly describe recent work in CAEN and how CAEN will affect computer architecture. We show how the inherently reconfigurable nature of CAEN devices can be exploited to provide high-density chips with defect tolerance at significantly reduced manufacturing costs. We develop a layered abstract architecture for CAEN-based computing devices and we present preliminary results which indicate that such devices will be competitive with CMOS circuits.
CommBench - A Telecommunications Benchmark for Network Processors
- IN PROC. OF IEEE INTERNATIONAL SYMPOSIUM ON PERFORMANCE ANALYSIS OF SYSTEMS AND SOFTWARE (ISPASS
, 2000
Cited by 102 (17 self)
This paper presents a benchmark, CommBench, for use in evaluating and designing telecommunications network processors. The benchmark applications focus on small, computationally intense program kernels typical of the network processor environment. The benchmark
A Design Space Evaluation of Grid Processor Architectures
, 2001
Cited by 100 (31 self)
In this paper, we survey the design space of a new class of architectures called Grid Processor Architectures (GPAs). These architectures are designed to scale with technology, allowing faster clock rates than conventional architectures while providing superior instruction-level parallelism on traditional workloads and high performance across a range of application classes. A GPA consists of an array of ALUs, each with limited control, connected by a thin operand network. Programs are executed by mapping blocks of statically scheduled instructions to the ALU array and executing them dynamically in dataflow order. This organization enables the critical paths of instruction blocks to be executed on chains of ALUs without transmitting temporary values back to the register file, avoiding most of the large, unscalable structures that limit the scalability of conventional architectures. Finally, we present simulation results of a preliminary design, the GPA-1. With a half-cycle routing delay, we obtain performance roughly equal to an ideal 8-way, 512-entry window superscalar core. With no inter-ALU delay, perfect memory, and perfect branch prediction, the IPC of the GPA-1 is more than twice that of the ideal superscalar core, achieving an average of 11 IPC across nine SPEC CPU2000 and Mediabench benchmarks.
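The dataflow-order execution described in this abstract can be illustrated with a minimal Python sketch. It is not the GPA-1 simulator: the block format, the `run_block` name, and the three-opcode table are assumptions made for illustration, and routing delays are ignored. Each instruction fires once both of its operands have arrived, and results are forwarded directly to consumers rather than written back to a register file.

```python
import operator

# Hypothetical opcode table; a real ALU array would support far more.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def run_block(block, inputs):
    """Execute a block of instructions in dataflow order.

    block:  {name: (opcode, src_a, src_b)}, where each source is either a
            block-input name or the name of another instruction.
    inputs: {name: value} for the block inputs.
    Returns (values, waves): all computed values, plus the list of
    instruction groups that fired together in each step.
    """
    values = dict(inputs)
    pending = dict(block)
    waves = []
    while pending:
        # An instruction is ready when both operands are available.
        ready = [name for name, (_, a, b) in pending.items()
                 if a in values and b in values]
        if not ready:
            raise ValueError("block has a cycle or a missing input")
        for name in ready:
            opcode, a, b = pending.pop(name)
            values[name] = OPS[opcode](values[a], values[b])
        waves.append(ready)
    return values, waves


# Illustrative block: t3 = (x + y) - (x * y)
BLOCK = {
    "t1": ("add", "x", "y"),
    "t2": ("mul", "x", "y"),
    "t3": ("sub", "t1", "t2"),
}
values, waves = run_block(BLOCK, {"x": 3, "y": 4})
```

In this illustrative block, `t1` and `t2` depend only on block inputs and fire in the same wave, while `t3` fires in the next wave once both temporaries are available.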
On-line Scheduling of Hard Real-Time Tasks on Variable Voltage Processor
, 1998
Cited by 98 (6 self)
We consider the problem of scheduling a mixed workload of sporadic (on-line) and periodic (off-line) tasks on a variable voltage processor so as to optimize power consumption while ensuring that all periodic tasks meet their deadlines and accepting as many sporadic tasks as can be guaranteed to meet their deadlines. The proposed efficient algorithms produce scheduling solutions that are very close to the minimum bound achievable with the dynamically variable voltage approach. The effectiveness of the proposed algorithms is shown in extensive experiments with real-life design examples.
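The trade-off behind this abstract can be sketched with a simplified model. The code below is not the paper's algorithm. It assumes earliest-deadline-first (EDF) scheduling, deadlines equal to periods, execution time inversely proportional to processor speed, and supply voltage proportional to clock frequency, so that the energy for a fixed amount of work scales with the square of the normalized speed. All function names are hypothetical.

```python
def utilization(tasks):
    """tasks: list of (wcet_at_full_speed, period) pairs."""
    return sum(wcet / period for wcet, period in tasks)


def min_constant_speed(tasks):
    """Lowest normalized speed (0 < s <= 1) that keeps the task set
    EDF-schedulable under the assumptions stated above.

    Running at speed s stretches every execution time by 1/s, so the
    utilization becomes U / s; EDF needs U / s <= 1, hence s >= U.
    """
    u = utilization(tasks)
    if u > 1.0:
        raise ValueError("task set is infeasible even at full speed")
    return u


def relative_energy(speed):
    """Energy for a fixed cycle count, relative to full speed.

    Assumes energy per cycle is proportional to V**2 and V is
    proportional to the normalized speed.
    """
    return speed ** 2


def can_accept_sporadic(tasks, wcet, min_interarrival):
    """Simplified acceptance test: treat the sporadic task as a periodic
    task with its minimum inter-arrival time as the period, and accept it
    only if the processor can still meet all deadlines at full speed."""
    return utilization(tasks) + wcet / min_interarrival <= 1.0


periodic = [(1, 4), (2, 8)]            # illustrative task set, U = 0.5
speed = min_constant_speed(periodic)   # 0.5
energy = relative_energy(speed)        # 0.25
```

Under these assumptions, halving the speed for a half-utilized processor cuts the energy for the same work to a quarter, which is the slack a voltage-scaling scheduler tries to exploit while still leaving room to admit sporadic tasks.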
Automatic Application-Specific Instruction-Set Extensions Under Microarchitectural Constraints
, 2003
Cited by 95 (23 self)
Many commercial processors now offer the possibility of extending their instruction set for a specific application, that is, of introducing customised functional units. There is a need to develop algorithms that decide automatically, from high-level application code, which operations are to be carried out in the customised extensions. A few algorithms exist, but they are severely limited in the type of operation clusters they can choose, which significantly reduces the effectiveness of specialisation. In this paper we introduce a more general algorithm which selects maximal-speedup convex subgraphs of the application dataflow graph under fundamental microarchitectural constraints, and which improves significantly on the state of the art.
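The notion of a convex subgraph can be made concrete with a short Python sketch. This is only a feasibility check for one candidate cut, not the selection algorithm from the paper. A cut is convex when no path leaves it and later re-enters it, so the whole cut can be issued as one atomic instruction. The helper names are hypothetical, and the sketch assumes that values used after the block are modeled as edges to explicit sink nodes, so that counting edges across the cut boundary gives the register-file inputs and outputs.

```python
from collections import defaultdict


def is_convex(edges, cut):
    """Return True if no path goes cut -> outside node -> cut.

    edges: iterable of (producer, consumer) pairs of a dataflow DAG.
    cut:   set of node names proposed for one custom instruction.
    """
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
    cut = set(cut)
    # Start from outside nodes that consume a value produced in the cut.
    frontier = [v for u in cut for v in succ[u] if v not in cut]
    seen = set(frontier)
    while frontier:
        node = frontier.pop()
        for nxt in succ[node]:
            if nxt in cut:
                return False  # an outside node feeds back into the cut
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return True


def io_counts(edges, cut):
    """Count distinct outside producers feeding the cut (inputs) and
    distinct cut nodes whose value is consumed outside (outputs)."""
    cut = set(cut)
    inputs = {u for u, v in edges if v in cut and u not in cut}
    outputs = {u for u, v in edges if u in cut and v not in cut}
    return len(inputs), len(outputs)


# Illustrative DAG: a -> b -> c -> d, plus a -> d
EDGES = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
```

With this graph, `{"b", "c"}` is convex with one input and one output, whereas `{"a", "c"}` is not convex because the path from `a` to `c` passes through `b`, which lies outside the cut. A microarchitectural constraint such as a limit on register-file read and write ports would then be a simple bound on the two numbers returned by `io_counts`.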
NetBench: A Benchmarking Suite for Network Processors
- In ICCAD
, 2001
Cited by 79 (9 self)
In this study we introduce NetBench, a benchmarking suite for network processors. NetBench contains a total of 9 applications that are representative of commercial applications for network processors. These applications span all levels of packet processing: small, low-level code fragments as well as large application-level programs are included in the suite. Using the SimpleScalar simulator, we study the NetBench programs in detail and characterize the network processor workloads. We also compare key characteristics such as instructions per cycle, instruction distribution, branch prediction accuracy, and cache behavior with the programs from MediaBench. Although the target architectures are similar for the MediaBench and NetBench suites, we show that these workloads have significantly different characteristics. Hence a separate benchmarking suite for network processors is a necessity. Finally, we present performance measurements from the Intel IXP1200 network processor to show how NetBench can be utilized.
CHIMAERA: a high-performance architecture with a tightly-coupled reconfigurable functional unit
- IN PROCEEDINGS OF THE 27TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE
, 2000
Cited by 76 (1 self)
Reconfigurable hardware has the potential for significant performance improvements by providing support for application-specific operations. We report our experience with Chimaera, a prototype system that integrates a small and fast reconfigurable functional unit (RFU) into the pipeline of an aggressive, dynamically-scheduled superscalar processor. Chimaera is capable of performing 9-input/1-output operations on integer data. We discuss the Chimaera C compiler that automatically maps computations for execution in the RFU. Chimaera is capable of: (1) collapsing a set of instructions into RFU operations, (2) converting control-flow into RFU operations, and (3) supporting a more powerful fine-grain data-parallel model than that supported by current multimedia extension instruction sets (for integer operations). Using a set of multimedia and communication applications we show that even with simple optimizations, the Chimaera C compiler is able to map 22% of all instructions to the RFU on the average. A variety of computations are mapped into RFU operations ranging from as simple as add/sub-shift pairs to operations of more than 10 instructions including several branches. Timing experiments demonstrate that for a 4-way out-of-order superscalar processor Chimaera results in average performance improvements of 21%, assuming a very aggressive core processor design (most pessimistic RFU latency model) and communication overheads from and to the RFU.
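To make the first two capabilities concrete, the following Python sketch contrasts a short instruction sequence (an add, a shift, and a branch) with a single fused operation that computes the same result. It is an illustration only, not output of the Chimaera C compiler: the function names, the particular computation, and the 32-bit register width are assumptions, and the conditional expression stands in for a hardware multiplexer.

```python
MASK = 0xFFFFFFFF  # assumption: 32-bit integer registers


def sequence_version(a, b, c):
    """The computation as separate instructions with a branch."""
    t1 = (a + b) & MASK       # add
    t2 = (t1 << 2) & MASK     # shift
    if t2 > c:                # branch
        r = (t2 - c) & MASK   # sub on the taken path
    else:
        r = (c - t2) & MASK   # sub on the fall-through path
    return r


def fused_rfu_op(a, b, c):
    """The same computation as one 3-input/1-output operation.

    The add/shift pair is collapsed into one expression, and the branch
    becomes a select between the difference and its negation.
    """
    t = ((a + b) << 2) & MASK
    diff = (t - c) & MASK
    return diff if t > c else (-diff) & MASK
```

Both functions read three register values and produce one result, which fits well within the 9-input/1-output limit stated in the abstract; the fused form replaces five instructions, including a branch, with one operation.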
Reducing the Complexity of the Register File in Dynamic Superscalar Processors
, 2001
Cited by 75 (1 self)
Dynamic superscalar processors execute multiple instructions out-of-order by looking for independent operations within a large window. The number of physical registers within the processor has a direct impact on the size of this window as most in-flight instructions require a new physical register at dispatch. A large multiported register file helps improve the instruction-level parallelism (ILP), but may have a detrimental effect on clock speed, especially in future wire-limited technologies. In this paper, we propose a register file organization that reduces register file size and port requirements for a given amount of ILP. We use a two-level register file organization to reduce register file size requirements, and a banked organization to reduce port requirements. We demonstrate empirically that the resulting register file organizations have reduced latency and (in the case of the banked organization) energy requirements for similar instructions per cycle (IPC) performance and improved instructions per second (IPS) performance in comparison to a conventional monolithic register file. The choice of organization is dependent on design goals.

