Results 1 - 10
of
39
Characterizing the Effects of Transient Faults on a High-Performance Processor Pipeline
, 2004
"... The progression of implementation technologies into the sub-100 nanometer lithographies renew the importance of understanding and protecting against single-event upsets in digital systems. In this work, the effects of transient faults on high performance microprocessors is explored. To perform a tho ..."
Abstract
-
Cited by 77 (4 self)
- Add to MetaCart
The progression of implementation technologies into the sub-100 nanometer lithographies renew the importance of understanding and protecting against single-event upsets in digital systems. In this work, the effects of transient faults on high performance microprocessors is explored. To perform a thorough exploration, a highly detailed register transfer level model of a deeply pipelined, out-of-order microprocessor was created. Using fault injection, we determined that fewer than 15% of single bit corruptions in processor state result in software visible errors. These failures were analyzed to identify the most vulnerable portions of the processor, which were then protected using simple low-overhead techniques. This resulted in a 75% reduction in failures. Building upon the failure modes seen in the microarchitecture, fault injections into software were performed to investigate the level of masking that the software layer provides. Together, the baseline microarchitectural substrate and software mask more than 9 out of 10 transient faults from affecting correct program execution. 1.
Efficient Resource Sharing in Concurrent Error Detecting Superscalar Microarchitectures
, 2004
"... Previous proposals for soft-error tolerance have called for redundantly executing a program as two concurrent threads on a superscalar microarchitecture. In a balanced superscalar design, the extra workload from redundant execution induces a severe performance penalty due to increased contention for ..."
Abstract
-
Cited by 36 (3 self)
- Add to MetaCart
Previous proposals for soft-error tolerance have called for redundantly executing a program as two concurrent threads on a superscalar microarchitecture. In a balanced superscalar design, the extra workload from redundant execution induces a severe performance penalty due to increased contention for resources throughout the datapath. This paper identifies and analyzes four key factors that affect the performance of redundant execution, namely 1) issue bandwidth and functional unit contention, 2) issue queue and reorder buffer capacity contention, 3) decode and retirement bandwidth contention, and 4) coupling between redundant threads' dynamic resource requirements. Based on this analysis, we propose the SHREC microarchitecture for asymmetric and staggered redundant execution. This microarchitecture addresses the four factors in an integrated design without requiring prohibitive additional hardware resources. In comparison to conventional single-threaded execution on a state-ofthe -art superscalar microarchitecture with comparable cost, SHREC reduces the average performance penalty to within 4% on integer and 15% on floating-point SPEC2K benchmarks by sharing resources more efficiently between the redundant threads.
A mechanism for online diagnosis of hard faults in microprocessors
- In Proc. of the 38th Annual International Symposium on Microarchitecture
, 2005
"... We develop a microprocessor design that tolerates hard faults, including fabrication defects and in-field faults, by leveraging existing microprocessor redundancy. To do this, we must: detect and correct errors, diagnose hard faults at the field deconfigurable unit (FDU) granularity, and deconfigure ..."
Abstract
-
Cited by 26 (2 self)
- Add to MetaCart
We develop a microprocessor design that tolerates hard faults, including fabrication defects and in-field faults, by leveraging existing microprocessor redundancy. To do this, we must: detect and correct errors, diagnose hard faults at the field deconfigurable unit (FDU) granularity, and deconfigure FDUs with hard faults. In our reliable microprocessor design, we use DIVA dynamic verification to detect and correct errors. Our new scheme for diagnosing hard faults tracks instructions’ core structure occupancy from decode until commit. If a DIVA checker detects an error in an instruction, it increments a small saturating error counter for every FDU used by that instruction, including that DIVA checker. A hard fault in an FDU quickly leads to an above-threshold error counter for that FDU and thus diagnoses the fault. For deconfiguration, we use previously developed schemes for functional units and buffers, and we present a scheme for deconfiguring DIVA checkers. Experimental results show that our reliable microprocessor quickly and accurately diagnoses each hard fault that is injected and continues to function, albeit with somewhat degraded performance. 1
Argus: Low-Cost, Comprehensive Error Detection in Simple Cores
- In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
, 2007
"... We have developed Argus, a novel approach for providing low-cost, comprehensive error detection for simple cores. The key to Argus is that the operation of a von Neumann core consists of four fundamental tasks—control flow, dataflow, computation, and memory access—that can be checked separately. We ..."
Abstract
-
Cited by 24 (5 self)
- Add to MetaCart
We have developed Argus, a novel approach for providing low-cost, comprehensive error detection for simple cores. The key to Argus is that the operation of a von Neumann core consists of four fundamental tasks—control flow, dataflow, computation, and memory access—that can be checked separately. We prove that Argus can detect any error by observing whether any of these tasks are performed incorrectly. We describe a prototype implementation, Argus-1, based on a single-issue, 4-stage, in-order processor to illustrate the potential of our approach. Experiments show that Argus-1 detects transient and permanent errors in simple cores with much lower impact on performance (<4 % average overhead) and chip area (<17 % overhead)
BulletProof: A Defect-Tolerant CMP Switch Architecture
- In Proceedings of the 12th International Symposium on High Performance Computer Architecture
, 2006
"... As silicon technologies move into the nanometer regime, transistor reliability is expected to wane as devices become subject to extreme process variation, particle-induced transient errors, and transistor wear-out. Unless these challenges are addressed, computer vendors can expect low yields and sho ..."
Abstract
-
Cited by 23 (12 self)
- Add to MetaCart
As silicon technologies move into the nanometer regime, transistor reliability is expected to wane as devices become subject to extreme process variation, particle-induced transient errors, and transistor wear-out. Unless these challenges are addressed, computer vendors can expect low yields and short mean-timesto -failure. In this paper, we examine the challenges of designing complex computing systems in the presence of transient and permanent faults. We select one small aspect of a typical chip multiprocessor (CMP) system to study in detail, a single CMP router switch. To start, we develop a unified model of faults, based on the time-tested bathtub curve. Using this convenient abstraction, we analyze the reliability versus area tradeo# across a wide spectrum of CMP switch designs, ranging from unprotected designs to fully protected designs with online repair and recovery capabilities. Protection is considered at multiple levels from the entire system down through arbitrary partitions of the design. To better understand the impact of these faults, we evaluate our CMP switch designs using circuit-level timing on detailed physical layouts. Our experimental results are quite illuminating. We find that designs are attainable that can tolerate a larger number of defects with less overhead than nave triple-modular redundancy, using domain-specific techniques such as end-to-end error detection, resource sparing, automatic circuit decomposition, and iterative diagnosis and reconfiguration.
ReStore: Symptom-Based Soft Error Detection in Microprocessors
- IEEE Trans. Dependable Secur. Comput
"... Abstract—Device scaling and large-scale integration have led to growing concerns about soft errors in microprocessors. To date, in all but the most demanding applications, implementing parity and ECC for caches and other large, regular SRAM structures have been sufficient to stem the growing soft er ..."
Abstract
-
Cited by 22 (0 self)
- Add to MetaCart
Abstract—Device scaling and large-scale integration have led to growing concerns about soft errors in microprocessors. To date, in all but the most demanding applications, implementing parity and ECC for caches and other large, regular SRAM structures have been sufficient to stem the growing soft error tide. This will not be the case for long and questions remain as to the best way to detect and recover from soft errors in the remainder of the processor—in particular, the less structured execution core. In this work, we propose the ReStore architecture, which leverages existing performance enhancing checkpointing hardware to recover from soft error events in a low cost fashion. Error detection in the ReStore architecture is novel: symptoms that hint at the presence of soft errors trigger restoration of a previous checkpoint. Example symptoms include exceptions, control flow misspeculations, and cache or translation look-aside buffer misses. Compared to conventional soft error detection via full replication, the ReStore framework incurs little overhead, but sacrifices some amount of error coverage. These attributes make it an ideal means to provide very cost effective error coverage for processor applications that can tolerate a nonzero, but small, soft error failure rate. Our evaluation of an example ReStore implementation exhibits a 2x increase in MTBF (mean time between failures) over a standard pipeline with minimal hardware and performance overheads. The MTBF increases by 20x if ReStore is coupled with protection for certain particularly vulnerable pipeline structures. Index Terms—Simulation, fault tolerance, fault injection, redundant design. 1
Opportunities and challenges for better than worst-case design
- In Asia and South Pacific Design Automation Conference
, 2005
"... The progressive trend of fabrication technologies towards the nanometer regime has created a number of new physical design challenges for computer architects. Design complexity, uncertainty in environmental and fabrication conditions, and single-event upsets all conspire to compromise system correct ..."
Abstract
-
Cited by 19 (4 self)
- Add to MetaCart
The progressive trend of fabrication technologies towards the nanometer regime has created a number of new physical design challenges for computer architects. Design complexity, uncertainty in environmental and fabrication conditions, and single-event upsets all conspire to compromise system correctness and reliability. Recently, researchers have begun to advocate a new design strategy called Better Than Worst-Case design that couples a complex core component with a simple reliable checker mechanism. By delegating the responsibility for correctness and reliability of the design to the checker, it becomes possible to build provably correct designs that effectively address the challenges of deep submicron design. In this paper, we present the concepts of Better Than Worst-Case design and highlight two exemplary designs: the DIVA checker and Razor logic. We show how this approach to system implementation relaxes design constraints on core components, which reduces the effects of physical design challenges and creates opportunities to optimize performance and power characteristics. We demonstrate the advantages of relaxed design constraints for the core components by applying typical-case optimization (TCO) techniques to an adder circuit. Finally, we discuss the challenges and opportunities posed to CAD tools in the context of Better Than Worst-Case design. In particular, we describe the additional support required for analyzing run-time characteristics of designs and the many opportunities which are created to incorporate typical-case optimizations into synthesis and verification. I.
Ultra low-cost defect protection for microprocessor pipelines
- In Proceedings of the 12th International conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS
, 2006
"... The sustained push toward smaller and smaller technology sizes has reached a point where device reliability has moved to the forefront of concerns for next-generation designs. Silicon failure mechanisms, such as transistor wearout and manufacturing defects, are a growing challenge that threatens the ..."
Abstract
-
Cited by 18 (5 self)
- Add to MetaCart
The sustained push toward smaller and smaller technology sizes has reached a point where device reliability has moved to the forefront of concerns for next-generation designs. Silicon failure mechanisms, such as transistor wearout and manufacturing defects, are a growing challenge that threatens the yield and product lifetime of future systems. In this paper we introduce the BulletProof pipeline, the first ultra low-cost mechanism to protect a microprocessor pipeline and on-chip memory system from silicon defects. To achieve this goal we combine area-frugal on-line testing techniques and system-level checkpointing to provide the same guarantees of reliability found in traditional solutions, but at much lower cost. Our approach utilizes a microarchitectural checkpointing mechanism which creates coarse-grained epochs of execution, during which distributed on-line built in self-test (BIST) mechanisms validate the integrity of the underlying hardware. In case a failure is detected, we rely on the natural redundancy of instructionlevel parallel processors to repair the system so that it can still operate in a degraded performance mode. Using detailed circuit-level and architectural simulation, we find that our approach provides very high coverage of silicon defects (89%) with little area cost (5.8%). In addition, when a defect occurs, the subsequent degraded mode of operation was found to have only moderate performance impacts, (from 4 % to 18 % slowdown).
EVAL: Utilizing processors with variation-induced timing errors
- In Proceedings of the 41st IEEE/ACM International Symposium on Microarchitecture (MICRO-41
, 2008
"... Parameter variation in integrated circuits causes sections of a chip to be slower than others. If, to prevent any resulting timing errors, we design processors for worst-case parameter values, we may lose substantial performance. An alternate approach explored in this paper is to design for closer t ..."
Abstract
-
Cited by 14 (2 self)
- Add to MetaCart
Parameter variation in integrated circuits causes sections of a chip to be slower than others. If, to prevent any resulting timing errors, we design processors for worst-case parameter values, we may lose substantial performance. An alternate approach explored in this paper is to design for closer to nominal values, and provide some transistor budget to tolerate unavoidable variationinduced errors. To assess this approach, this paper first presents a novel framework that shows how microarchitecture techniques can trade off variation-induced errors for power and processor frequency. Then, the paper introduces an effective technique to maximize performance and minimize power in the presence of variationinduced errors, namely High-Dimensional dynamic adaptation. For efficiency, the technique is implemented using a machinelearning algorithm. The results show that our best configuration increases processor frequency by 56 % on average, allowing the processor to cycle 21 % faster than without variation. Processor performance increases by 40 % on average, resulting in a performance that is 14 % higher than without variation — at only a 10.6 % area cost. 1
An Architectural Framework for Providing Reliability and Security Support
- In DSN
, 2004
"... This paper explores hardware-implemented error-detection and security mechanisms embedded as modules in a hardware-level framework called the Reliability and Security Engine (RSE), which is implemented as an integral part of a modern microprocessor. The RSE interacts with the processor through an in ..."
Abstract
-
Cited by 13 (3 self)
- Add to MetaCart
This paper explores hardware-implemented error-detection and security mechanisms embedded as modules in a hardware-level framework called the Reliability and Security Engine (RSE), which is implemented as an integral part of a modern microprocessor. The RSE interacts with the processor through an input/output interface. The CHECK instruction, a special extension of the instruction set architecture of the processor is the interface of the application with the RSE. The detection mechanisms described here in detail are: (1) the Memory Layout Randomization (MLR) Module, which randomizes the memory layout of a process in order to foil attackers who assume a fixed system layout, thus protecting against many security threats, (2) the Data Dependency Tracking (DDT) Module, which tracks the dependencies among threads of a process and maintains checkpoints of shared memory pages in order to rollback the threads when an offending (potentially malicious) thread is terminated, (3) the Instruction Checker Module (ICM), which checks an instruction for its validity or the control-flow of the program just as the instruction enters the pipeline for execution, and (4) Adaptive Heartbeat Monitor (AHBM), which enables heart-beating for checking the liveness of operating systems and/or application processes/threads. Performance simulations for the studied modules indicate low overhead of the proposed solutions. 1

