Results 1 - 10
of
33
NanoFabrics: Spatial Computing Using Molecular Electronics
"... The continuation of the remarkable exponential increases in processing power over the recent past faces imminent challenges due in part to the physics of deep-submicron CMOS devices and the costs of both chip masks and future fabrication plants. A promising solution to these problems is offered by a ..."
Abstract
-
Cited by 110 (9 self)
- Add to MetaCart
The continuation of the remarkable exponential increases in processing power over the recent past faces imminent challenges due in part to the physics of deep-submicron CMOS devices and the costs of both chip masks and future fabrication plants. A promising solution to these problems is offered by an alternative to CMOS-based computing, chemically assembled electronic nanotechnology (CAEN). In this paper we outline how CAEN-based computing can become a reality. We briefly describe recent work in CAEN and how CAEN will affect computer architecture. We show how the inherently reconfigurable nature of CAEN devices can be exploited to provide high-density chips with defect tolerance at significantly reduced manufacturing costs. We develop a layered abstract architecture for CAEN-based computing devices and we present preliminary results which indicate that such devices will be competitive with CMOS circuits.
Reducing power density through activity migration
- In Proceedings of the International Symposium on Low Power Electronics and Design
, 2003
"... Power dissipation is unevenly distributed in modern microprocessors leading to localized hot spots with significantly greater die temperature than surrounding cooler regions. Excessive junction temperature reduces reliability and can lead to catastrophic failure. We examine the use of activity migra ..."
Abstract
-
Cited by 99 (1 self)
- Add to MetaCart
Power dissipation is unevenly distributed in modern microprocessors leading to localized hot spots with significantly greater die temperature than surrounding cooler regions. Excessive junction temperature reduces reliability and can lead to catastrophic failure. We examine the use of activity migration which reduces peak junction temperature by moving computation between multiple replicated units. Using a thermal model that includes the temperature dependence of leakage power, we show that sustainable power dissipation can be increased by nearly a factor of two for a given junction temperature limit. Alternatively, peak die temperature can be reduced by 12.4 o C at the same clock frequency. The model predicts that migration intervals of around 20–200 µs are required to achieve the maximum sustainable power increase. We evaluate several different forms of replication and migration policy control.
Performance Issues of Enterprise Level Web Proxies
- IN PROCEEDINGS OF THE SIGMETRICS CONFERENCE ON MEASUREMENT AND MODELING OF COMPUTER SYSTEMS
, 1997
"... Enterprise level web proxies relay world-wide web traffic between private networks and the Internet. They improve security, save network bandwidth, and reduce network latency. While the performance of web proxies has been analyzed based on synthetic workloads, little is known about their performance ..."
Abstract
-
Cited by 68 (1 self)
- Add to MetaCart
Enterprise level web proxies relay world-wide web traffic between private networks and the Internet. They improve security, save network bandwidth, and reduce network latency. While the performance of web proxies has been analyzed based on synthetic workloads, little is known about their performance on real workloads. In this paper we present a study of two web proxies (CERN and Squid) executing real workloads on Digital's Palo Alto Gateway. We demonstrate that the simple CERN proxy architecture outperforms all but the latest version of Squid and continues to outperform cacheless configurations. For the measured load levels the Squid proxy used at least as many CPU, memory, and disk resources as CERN, in some configurations significantly more resources. At higher load levels the resource utilization requirements will cross and Squid will be the one using fewer resources. Lastly we found that cache hit rates of around 30% had very little effect on the requests service time.
Spatial Computation
- in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS
, 2004
"... This paper describes a computer architecture, Spatial Computation (SC), which is based on the translation of high-level language programs directly into hardware structures. SC program implementations are completely distributed, with no centralized control. SC circuits are optimized for wires at the ..."
Abstract
-
Cited by 37 (10 self)
- Add to MetaCart
This paper describes a computer architecture, Spatial Computation (SC), which is based on the translation of high-level language programs directly into hardware structures. SC program implementations are completely distributed, with no centralized control. SC circuits are optimized for wires at the expense of computation units. In this paper we investigate a particular implementation of SC: ASH (Application-Specific Hardware). Under the assumption that computation is cheaper than communication, ASH replicates computation units to simplify interconnect, building a system which uses very simple, completely dedicated communication channels. As a consequence, communication on the datapath never requires arbitration; the only arbitration required is for accessing memory. ASH relies on very simple hardware primitives, using no associative structures, no multiported register files, no scheduling logic, no broadcast, and no clocks. As a consequence, ASH hardware is fast and extremely power efficient.
Compiling application-specific hardware
- Montpellier (La Grande-Motte
, 2002
"... Abstract. In this paper we describe ASH, an architectural framework for implementing Application-Specific Hardware. ASH is based on automatic hardware synthesis from high-level languages. The generated circuits use only localized computation structures; in consequence, we expect these circuits to be ..."
Abstract
-
Cited by 27 (6 self)
- Add to MetaCart
Abstract. In this paper we describe ASH, an architectural framework for implementing Application-Specific Hardware. ASH is based on automatic hardware synthesis from high-level languages. The generated circuits use only localized computation structures; in consequence, we expect these circuits to be fast, to use little power and to scale well with program complexity. We present in detail CASH, a scalable compiler framework for ASH, which generates hardware from programs written in C. Our compiler exploits instruction level parallelism by using aggressive speculation and dynamic scheduling. Based on this compilation scheme, we evaluate the computational resources necessary for implementing complex integer-based programs, and we suggest architectural features that would support the ASH framework. 1
Design, Implementation, and Evaluation of Optimizations in a Just-In-Time Compiler
- In Proceedings of the ACM SIGPLAN JavaGrande Conference
, 1999
"... The Java language incurs a runtime overhead for exception checks and object accesses without an interior pointer in order to ensure safety. It also requires type inclusion test, dynamic class loading, and dynamic method calls in order to ensure flexibility. A "JustIn -Time" (JIT) compiler generates ..."
Abstract
-
Cited by 19 (4 self)
- Add to MetaCart
The Java language incurs a runtime overhead for exception checks and object accesses without an interior pointer in order to ensure safety. It also requires type inclusion test, dynamic class loading, and dynamic method calls in order to ensure flexibility. A "JustIn -Time" (JIT) compiler generates native code from Java byte code at runtime. It must improve the runtime performance without compromising the safety and flexibility of the Java language. We designed and implemented effective optimizations for the JIT compiler, such as exception check elimination, common subexpression elimination, simple type inclusion test, method inlining, and resolution of dynamic method call. We evaluate the performance benefits of these optimizations based on various statistics collected using SPECjvm98 and two JavaSoft applications with byte code sizes ranging from 20000 to 280000 bytes. Each optimization contributes to an improvement in the performance of the programs. 1. Introduction Java [1] is a ...
Optimizing Memory Accesses For Spatial Computation
, 2003
"... In this paper we present the internal representation and optimizations used by the CASH compiler for improving the memory parallelism of pointer-based programs. CASH uses an SSA-based representation for memory, which compactly summarizes both control-flow- and dependence information. In CASH, memory ..."
Abstract
-
Cited by 14 (8 self)
- Add to MetaCart
In this paper we present the internal representation and optimizations used by the CASH compiler for improving the memory parallelism of pointer-based programs. CASH uses an SSA-based representation for memory, which compactly summarizes both control-flow- and dependence information. In CASH, memory optimization is a four-step process: (1) first an initial, relatively coarse, representation of memory dependences is built; (2) next, unnecessary memory dependences are removed using dependence tests; (3) third, redundant memory operations are removed (4) finally, parallelism is increased by pipelining memory accesses in loops. While the first three steps above are very general, the loop pipelining transformations are particularly applicable for spatial computation, which is the primary target of CASH. The redundant
A Taxonomy of Branch Mispredictions, and Alloyed Prediction as a Robust Solution to Wrong-History Mispredictions
- In Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
, 2000
"... The need for accurate conditional-branch prediction is well known: mispredictions waste large numbers of cycles, inhibit outof -order execution, and waste power on mis-speculated computation. Prior work on branch-predictor organization has focused mainly on how to reduce conflicts in the branch-pred ..."
Abstract
-
Cited by 12 (4 self)
- Add to MetaCart
The need for accurate conditional-branch prediction is well known: mispredictions waste large numbers of cycles, inhibit outof -order execution, and waste power on mis-speculated computation. Prior work on branch-predictor organization has focused mainly on how to reduce conflicts in the branch-predictor structures, while relatively little work has explored other causes of mispredictions. Some prior work has identified other categories of mispredictions, but this paper organizes these categories into a broad taxonomy of misprediction types. Using the taxonomy, this paper goes on to show that other categories---especially wronghistory mispredictions---are often more important than conflicts. This is true even if just a very simple conflict-reduction technique is used. Based on these observations, this paper proposes alloying local and global history together in a two-level branch predictor structure. This simple technique, a generalization of the bi-mode predictor, attacks wrong-histo...
Tartan: Evaluating spatial computation for whole program execution
- In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems
, 2006
"... Spatial Computing (SC) has been shown to be an energy-efficient model for implementing program kernels. In this paper we explore the feasibility of using SC for more than small kernels. To this end, we evaluate the performance and energy efficiency of entire applications on Tartan, a general-purpose ..."
Abstract
-
Cited by 11 (1 self)
- Add to MetaCart
Spatial Computing (SC) has been shown to be an energy-efficient model for implementing program kernels. In this paper we explore the feasibility of using SC for more than small kernels. To this end, we evaluate the performance and energy efficiency of entire applications on Tartan, a general-purpose architecture which integrates a reconfigurable fabric (RF) with a superscalar core. Our compiler automatically partitions and compiles an application into an instruction stream for the core and a configuration for the RF. We use a detailed simulator to capture both timing and energy numbers for all parts of the system. Our results indicate that a hierarchical RF architecture, designed around a scalable interconnect, is instrumental in harnessing the benefits of spatial computation. The interconnect uses static configuration and routing at the lower levels and a packet-switched, dynamically-routed network at the top level. Tartan is most energyefficient when almost all of the application is mapped to the RF, indicating the need for the RF to support most general-purpose programming constructs. Our initial investigation reveals that such a system can provide, on average, an order of magnitude improvement in energy-delay compared to an aggressive superscalar core on single-threaded workloads.
Using on-chip event counters for high-resolution, real-time temperature measurements
- In Thermal and Thermomechanical Phenomena in Electronics Systems, 2006
, 2006
"... This paper proposes a technique to use on-chip event- or performance-counters to augment, or even replace, traditional analog CMOS temperature sensors. Using activity data from the performance counters, energy consumption that consequently causes heat dissipation can be tracked. Simple regression an ..."
Abstract
-
Cited by 10 (2 self)
- Add to MetaCart
This paper proposes a technique to use on-chip event- or performance-counters to augment, or even replace, traditional analog CMOS temperature sensors. Using activity data from the performance counters, energy consumption that consequently causes heat dissipation can be tracked. Simple regression analysis permits us to find a relation between activity data and temperature. Performance counters already exist in many processors for debugging and performance characterization, require only minimal computation to interpret for temperature monitoring, and these calculations only need to operate at low frequency, so the marginal cost of this additional temperature-sensing capability is negligible. Performance counters monitor activity data (access count) of most on-chip functional units and therefore allow highresolution, localized temperature sensing across a microprocessor. This in turn allows tracking of localized hotspots. Fine-grained, localized sensing is needed because different units can become hotspots depending on benchmarks. This is especially true if a malicious program intentionally induces high activity in a selected functional unit. This paper presents measurements from a commercial system to illustrate the accuracy of performance counters as additional temperature sensors. KEY WORDS: thermal management, localized hotspot, temperature measurement, performance counter

