Results 1 - 10
of
80
NanoFabrics: Spatial Computing Using Molecular Electronics
"... The continuation of the remarkable exponential increases in processing power over the recent past faces imminent challenges due in part to the physics of deep-submicron CMOS devices and the costs of both chip masks and future fabrication plants. A promising solution to these problems is offered by a ..."
Abstract
-
Cited by 110 (9 self)
- Add to MetaCart
The continuation of the remarkable exponential increases in processing power over the recent past faces imminent challenges due in part to the physics of deep-submicron CMOS devices and the costs of both chip masks and future fabrication plants. A promising solution to these problems is offered by an alternative to CMOS-based computing, chemically assembled electronic nanotechnology (CAEN). In this paper we outline how CAEN-based computing can become a reality. We briefly describe recent work in CAEN and how CAEN will affect computer architecture. We show how the inherently reconfigurable nature of CAEN devices can be exploited to provide high-density chips with defect tolerance at significantly reduced manufacturing costs. We develop a layered abstract architecture for CAEN-based computing devices and we present preliminary results which indicate that such devices will be competitive with CMOS circuits.
PipeRench: A Reconfigurable Architecture and Compiler
- Computer
, 2000
"... efficiency because of forced serialization of intrinsically parallel operations; wasted space (small data elements do not use the processor's wide data path); and excessive instruction bandwidth for regular, dataflow-dominated computations on large data sets. A system using a reconfigurable f ..."
Abstract
-
Cited by 88 (0 self)
- Add to MetaCart
efficiency because of forced serialization of intrinsically parallel operations; wasted space (small data elements do not use the processor's wide data path); and excessive instruction bandwidth for regular, dataflow-dominated computations on large data sets. A system using a reconfigurable fabric such as PipeRench can . increase flexibility, . decrease design time, . extend system life, . reduce part count, . reduce development and maintenance costs, and . reduce chip fabrication costs through defect tolerance. PROBLEMS WITH COMMERCIAL FPGAS Initial performance results with FPGAs were impressive. However, commercial FPGAs have inherent shortcomings, which heretofore made reconfigurable computing impractical for mainstream computing: . Logic granularity. FPGAs are designed for logic replacement. The functional units' granularity is optimized to replace random logic, not to perform multimedia computations. Reconfigurable com
CHIMAERA: a high-performance architecture with a tightly-coupled reconfigurable functional unit
- IN PROCEEDINGS OF THE 27TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE
, 2000
"... Reconfigurable hardware has the potential for significant performance improvements by providing support for application−specific operations. We report our experience with Chimaera, a prototype system that integrates a small and fast reconfigurable functional unit (RFU) into the pipeline of an aggres ..."
Abstract
-
Cited by 76 (1 self)
- Add to MetaCart
Reconfigurable hardware has the potential for significant performance improvements by providing support for application−specific operations. We report our experience with Chimaera, a prototype system that integrates a small and fast reconfigurable functional unit (RFU) into the pipeline of an aggressive, dynamically−scheduled superscalar processor. Chimaera is capable of performing 9−input/1−output operations on integer data. We discuss the Chimaera C compiler that automatically maps computations for execution in the RFU. Chimaera is capable of: (1) collapsing a set of instructions into RFU operations, (2) converting control−flow into RFU operations, and (3) supporting a more powerful fine−grain data−parallel model than that supported by current multimedia extension instruction sets (for integer operations). Using a set of multimedia and communication applications we show that even with simple optimizations, the Chimaera C compiler is able to map 22 % of all instructions to the RFU on the average. A variety of computations are mapped into RFU operations ranging from as simple as add/sub−shift pairs to operations of more than 10 instructions including several branches. Timing experiments demonstrate that for a 4−way out−of−order superscalar processor Chimaera results in average performance improvements of 21%, assuming a very aggressive core processor design (most pessimistic RFU latency model) and communication overheads from and to the RFU.
Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams
- In ISCA
, 2004
"... This paper evaluates the Raw microprocessor. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reason ..."
Abstract
-
Cited by 49 (11 self)
- Add to MetaCart
This paper evaluates the Raw microprocessor. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance in the face of increasing wire delays. Raw approaches this challenge by implementing plenty of on-chip resources -- including logic, wires, and pins -- in a tiled arrangement, and exposing them through a new ISA, so that the software can take advantage of these resources for parallel applications. Raw supports both ILP and streams by routing operands between architecturally-exposed functional units over a point-to-point scalar operand network. This network offers low latency for scalar data transport. Raw manages the effect of wire delays by exposing the interconnect and using software to orchestrate both scalar and stream data transport.
Reconfigurable Computing for Digital Signal Processing: A Survey
- Journal of VLSI Signal Processing
, 2000
"... Steady advances in VLSI technology and design tools have extensively expanded the application domain of digital signal processing over the past decade. While application-specific integrated circuits (ASICs) and programmable digital signal processors (PDSPs) remain the implementation mechanisms of ch ..."
Abstract
-
Cited by 45 (2 self)
- Add to MetaCart
Steady advances in VLSI technology and design tools have extensively expanded the application domain of digital signal processing over the past decade. While application-specific integrated circuits (ASICs) and programmable digital signal processors (PDSPs) remain the implementation mechanisms of choice for many DSP applications, increasingly new system implementations based on reconfigurable computing are being considered. These flexible platforms, which offer the functional efficiency of hardware and the programmability of software, are quickly maturing as the logic capacity of programmable devices follow Moore's Law and advanced automated design techniques become available. As initial reconfigurable technologies have emerged, new academic and commercial efforts have been initiated to support power optimization, cost reduction, and enhanced run-time performance. This paper presents a survey of academic research and commercial development in reconfigurable computing for DSP systems o...
BitValue Inference: Detecting and Exploiting Narrow Bitwidth Computations
- IN PROCEEDINGS OF THE EUROPAR 2000 EUROPEAN CONFERENCE ON PARALLEL COMPUTING
, 2000
"... We present a compiler algorithm called BitValue, which can discover unused and constant bits in dusty-deck C programs. BitValue uses forward and backward dataflow analyses, generalizing constant-folding and dead-code detection at the bit-level. This algorithm enables compiler optimizations targeting ..."
Abstract
-
Cited by 41 (7 self)
- Add to MetaCart
We present a compiler algorithm called BitValue, which can discover unused and constant bits in dusty-deck C programs. BitValue uses forward and backward dataflow analyses, generalizing constant-folding and dead-code detection at the bit-level. This algorithm enables compiler optimizations targeting special processor architectures for computing on non-standard bitwidths. Using this algorithm we show that up to 36% of the computed bytes are thrown away; also, we show that on average 26.8% of the values computed require 16 bits or less (for programs from SpecINT95 and Mediabench). A compiler for reconfigurable hardware uses this algorithm to achieve substantial reductions (up to 20-fold) in the size of the synthesized circuits.
Spatial Computation
- in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS
, 2004
"... This paper describes a computer architecture, Spatial Computation (SC), which is based on the translation of high-level language programs directly into hardware structures. SC program implementations are completely distributed, with no centralized control. SC circuits are optimized for wires at the ..."
Abstract
-
Cited by 37 (10 self)
- Add to MetaCart
This paper describes a computer architecture, Spatial Computation (SC), which is based on the translation of high-level language programs directly into hardware structures. SC program implementations are completely distributed, with no centralized control. SC circuits are optimized for wires at the expense of computation units. In this paper we investigate a particular implementation of SC: ASH (Application-Specific Hardware). Under the assumption that computation is cheaper than communication, ASH replicates computation units to simplify interconnect, building a system which uses very simple, completely dedicated communication channels. As a consequence, communication on the datapath never requires arbitration; the only arbitration required is for accessing memory. ASH relies on very simple hardware primitives, using no associative structures, no multiported register files, no scheduling logic, no broadcast, and no clocks. As a consequence, ASH hardware is fast and extremely power efficient.
Stream Computations Organized for Reconfigurable Execution (SCORE): Introduction and Tutorial
- in Proceedings of the International Conference on Field-Programmable Logic and Applications
, 2000
"... A primary impediment to wide-spread exploitation of reconfigurable computing is the lack of a unifying computational model which allows application portability and longevity without sacrificing a substantial fraction of the raw capabilities. We introduce SCORE (Stream Computation Organized for Recon ..."
Abstract
-
Cited by 30 (8 self)
- Add to MetaCart
A primary impediment to wide-spread exploitation of reconfigurable computing is the lack of a unifying computational model which allows application portability and longevity without sacrificing a substantial fraction of the raw capabilities. We introduce SCORE (Stream Computation Organized for Reconfigurable Execution), a streambased compute model which virtualizes reconfigurable computing resources (compute, storage, and communication) by dividing a computation up into fixed-size "pages" and time-multiplexing the virtual pages on available physical hardware. Consequently, SCORE applications can scale up or down automatically to exploit a wide range of hardware sizes. We hypothesize that the SCORE model will ease development and deployment of reconfigurable applications and expand the range of applications which can benefit from reconfigurable execution. Further, we believe that a well engineered SCORE implementation can be efficient, wasting little of the capabilities of the raw hardw...
A Media-Enhanced Vector Architecture for Embedded Memory Systems
, 1999
"... Next generation portable devices will require processors with both low energy consumption and high performance for media functions. At the same time, modern CMOS technology creates the need for highly scalable VLSI architectures. Conventional processor architectures fail to meet these requirements. ..."
Abstract
-
Cited by 27 (2 self)
- Add to MetaCart
Next generation portable devices will require processors with both low energy consumption and high performance for media functions. At the same time, modern CMOS technology creates the need for highly scalable VLSI architectures. Conventional processor architectures fail to meet these requirements. This paper presents the architecture of Vector IRAM (VIRAM), a processor that combines vector processing with embedded DRAM technology. Vector processing achieves high multimedia performance with simple hardware, while embedded DRAM provides high memory bandwidth at low energy consumption. VIRAM provides flexible support for media data types, short vectors, and DSP features. The vector pipeline is enhanced to hide DRAM latency without using caches. The peak performance is 3.2 GFLOPS (single precision) and maximum memory bandwidth is 25.6 GBytes/s. With a target power consumption of 2 Watts for the vector pipeline and the memory system, VIRAM supports 1.6 GFLOPS/Watt. For a set of representat...
Conservation Cores: Reducing the Energy of Mature Computations
"... Growing transistor counts, limited power budgets, and the breakdown of voltage scaling are currently conspiring to create a utilization wall that limits the fraction of a chip that can run at full speed at one time. In this regime, specialized, energy-efficient processors can increase parallelism by ..."
Abstract
-
Cited by 27 (6 self)
- Add to MetaCart
Growing transistor counts, limited power budgets, and the breakdown of voltage scaling are currently conspiring to create a utilization wall that limits the fraction of a chip that can run at full speed at one time. In this regime, specialized, energy-efficient processors can increase parallelism by reducing the per-computation power requirements and allowing more computations to execute under the same power budget. To pursue this goal, this paper introduces conservation cores. Conservation cores, or c-cores, are specialized processors that focus on reducing energy and energy-delay instead of increasing performance. This focus on energy makes c-cores an excellent match for many applications that would be poor candidates for hardware acceleration (e.g., irregular integer codes). We present a toolchain for automatically synthesizing c-cores from application source code and demonstrate that they can significantly reduce energy and energy-delay for a wide range of applications. The c-cores support patching, a form of targeted reconfigurability, that allows them to adapt to new versions of the software they target. Our results show that conservation cores can reduce energy consumption by up to 16.0 × for functions and by up to 2.1 × for whole applications, while patching can extend the useful lifetime of individual c-cores to match that of conventional processors.

