Reunion: Complexity-effective multicore redundancy
- In Proc. of 39th MICRO, 2006
- Cited by 24 (4 self)
To protect processor logic from soft errors, multicore redundant architectures execute two copies of a program on separate cores of a chip multiprocessor (CMP). Maintaining identical instruction streams is challenging because redundant cores operate independently, yet must still receive the same inputs (e.g., load values and shared-memory invalidations). Past proposals strictly replicate load values across two cores, requiring significant changes to the highly-optimized core. We make the key observation that, in the common case, both cores load identical values without special hardware. When the cores do receive different load values (e.g., due to a data race), the same mechanisms employed for soft error detection and recovery can correct the difference. This observation permits designs that relax input replication, while still providing correct redundant execution. In this paper, we present Reunion, an execution model that provides relaxed input replication and preserves the existing memory interface, coherence protocols, and consistency models. We evaluate a CMP-based implementation of the Reunion execution model with full-system, cycle-accurate simulation. We show that the performance overhead of relaxed input replication is only 5% and 6% for commercial and scientific workloads, respectively.
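The relaxed input replication idea in this abstract can be illustrated with a toy model. The sketch below is only an illustration under stated assumptions: the function names, the use of a SHA-256 digest as the comparison value, and the recovery step (re-executing both copies from a single copy of the load values) are hypothetical and are not taken from the paper. It shows two copies of an interval that load independently, compare results, and fall back to recovery when the loaded values differ.

```python
import hashlib

def fingerprint(values):
    """Hash a sequence of results so two copies can be compared cheaply (assumed scheme)."""
    digest = hashlib.sha256()
    for value in values:
        digest.update(value.to_bytes(8, "little", signed=True))
    return digest.digest()

def execute(load_values):
    """Toy 'interval' of execution: a running sum over the loaded values."""
    total = 0
    results = []
    for value in load_values:
        total += value
        results.append(total)
    return results

def run_interval(loads_core_a, loads_core_b):
    """Run both copies with independently loaded inputs.

    Returns (results, recovered). If the copies disagree, whether from a
    soft error or from an input mismatch such as a data race, both copies
    are re-executed from one copy of the inputs (hypothetical recovery).
    """
    results_a = execute(loads_core_a)
    results_b = execute(loads_core_b)
    if fingerprint(results_a) == fingerprint(results_b):
        return results_a, False  # common case: no special handling needed
    # Mismatch: roll back and re-execute with a single copy of the inputs.
    results_a = execute(loads_core_a)
    results_b = execute(loads_core_a)
    if fingerprint(results_a) != fingerprint(results_b):
        raise RuntimeError("copies still disagree after recovery")
    return results_a, True
```

In this model the common case costs only one comparison per interval, and the recovery path runs only when the two copies observed different inputs.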
Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design
- In Proc. Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2008
- Cited by 18 (2 self)
With continued CMOS scaling, future shipped hardware will be increasingly vulnerable to in-the-field faults. To be broadly deployable, the hardware reliability solution must incur low overheads, precluding use of expensive redundancy. We explore a cooperative hardware-software solution that watches for anomalous software behavior to indicate the presence of hardware faults. Fundamental to such a solution is a characterization of how hardware faults in different microarchitectural structures of a modern processor propagate through the application and OS. This paper aims to provide such a characterization, resulting in identifying low-cost detection methods and providing guidelines for implementation of the recovery and diagnosis components of such a reliability solution. We focus on hard faults because they are increasingly important and have different system implications than
Facelift: Hiding and Slowing Down Aging in Multicores
- In Proceedings of the International Symposium on Microarchitecture (MICRO), 2008
- Cited by 13 (1 self)
Processors progressively age during their service life due to normal workload activity. Such aging results in gradually slower circuits. Anticipating this fact, designers add timing guardbands to processors, so that processors last for a number of years. As a result, aging has important design and cost implications. To address this problem, this paper shows how to hide the effects of aging and how to slow it down. Our framework is called Facelift. It hides aging through aging-driven application scheduling. It slows down aging by applying voltage changes at key times: it uses a non-linear optimization algorithm to carefully balance the impact of voltage changes on the aging rate and on the critical path delays. Moreover, Facelift can gainfully configure the chip for a short service life. Simulation results indicate that Facelift leads to more cost-effective multicores. We can take a multicore designed for a 7-year service life and, by hiding and slowing down aging, enable it to run, on average, at a 14–15% higher frequency during its whole service life. Alternatively, we can design the multicore for a 5 to 7-month service life and still use it for 7 years.
Self-calibrating online wearout detection
- In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 40), 2007
- Cited by 11 (6 self)
Technology scaling, characterized by decreasing feature size, thinning gate oxide, and non-ideal voltage scaling, will become a major hindrance to microprocessor reliability in future technology generations. Physical analysis of device failure mechanisms has shown that most wearout mechanisms projected to plague future technology generations are progressive, meaning that the circuit-level effects of wearout develop and intensify with age over the lifetime of the chip. This work leverages the progression of wearout over time in order to present a low-cost hardware structure that identifies increasing propagation delay, which is symptomatic of many forms of wearout, to accurately forecast the failure of microarchitectural structures. To motivate the use of this predictive technique, an HSPICE analysis of the effects of one particular failure mechanism, gate oxide breakdown, on gates from a standard cell library characterized for a 90 nm process is presented. This gate-level analysis is then used to demonstrate the aggregate change in output delay of high-level structures within a synthesized Verilog model of an embedded microprocessor core. Leveraging this analysis, a self-calibrating hardware structure for conducting statistical analysis of output delay is presented and its efficacy in predicting the failure of a variety of structures within the microprocessor core is evaluated.
Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding
- Cited by 10 (0 self)
In deep sub-micron ICs, growing amounts of on-die memory and scaling effects make embedded memories increasingly vulnerable to reliability and yield problems. As scaling progresses, soft and hard errors in the memory system will increase and single error events are more likely to cause large-scale multi-bit errors. However, conventional memory protection techniques can neither detect nor correct large-scale multi-bit errors without incurring large performance, area, and power overheads. We propose two-dimensional (2D) error coding in embedded memories, a scalable multi-bit error protection technique to improve memory reliability and yield. The key innovation is the use of vertical error coding across words that is used only for error correction in combination with conventional per-word horizontal error coding. We evaluate this scheme in the cache hierarchies of two representative chip multiprocessor designs and show that 2D error coding can correct clustered errors up to 32x32 bits with significantly smaller performance, area, and power overheads than conventional techniques.
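The split between per-word horizontal codes and a vertical code across words can be sketched in a few lines. The example below is a simplified illustration, not the paper's design: the 16-bit word size, the 4-way interleaved parity used as the horizontal code, the single XOR row used as the vertical code, and all names are assumptions. Under these assumptions it can repair a multi-bit burst confined to one word; the clustered multi-row errors described in the abstract would need a stronger vertical code than this sketch uses.

```python
from functools import reduce

WORD_BITS = 16   # assumed word width
INTERLEAVE = 4   # assumed number of interleaved horizontal parity bits

def horizontal_code(word):
    """Per-word detection code: parity bit i covers word bits i, i+4, i+8, ..."""
    code = 0
    for bit in range(WORD_BITS):
        if (word >> bit) & 1:
            code ^= 1 << (bit % INTERLEAVE)
    return code

def vertical_code(words):
    """Column-wise parity across words: the XOR of all words."""
    return reduce(lambda a, b: a ^ b, words, 0)

class Protected2DArray:
    def __init__(self, words):
        self.words = list(words)
        self.h_codes = [horizontal_code(w) for w in self.words]
        self.v_code = vertical_code(self.words)

    def faulty_rows(self):
        """Rows whose stored horizontal code no longer matches the data."""
        return [i for i, w in enumerate(self.words)
                if horizontal_code(w) != self.h_codes[i]]

    def correct(self):
        """Rebuild a single faulty row from the vertical code.

        Returns True if the array is clean or was repaired, False if more
        than one row is faulty (beyond what this simplified sketch handles).
        """
        bad = self.faulty_rows()
        if not bad:
            return True
        if len(bad) > 1:
            return False
        row = bad[0]
        others = vertical_code(w for i, w in enumerate(self.words) if i != row)
        self.words[row] = self.v_code ^ others
        return True
```

The horizontal code is checked on every access and only locates the damaged word; the vertical code is consulted only on the rare correction path, which mirrors the division of labor the abstract describes.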
The future of microprocessors
- Communications of the ACM, 2011
- Cited by 9 (0 self)
Energy efficiency is the new fundamental limiter of processor performance, way beyond numbers of processors.
Fault-tolerant typed assembly language
- SIGPLAN Not., 2007
- Cited by 8 (2 self)
A transient hardware fault occurs when an energetic particle strikes a transistor, causing it to change state. Although transient faults do not permanently damage the hardware, they may corrupt computations by altering stored values and signal transfers. In this paper, we propose a new scheme for provably safe and reliable computing in the presence of transient hardware faults. In our scheme, software computations are replicated to provide redundancy while special instructions compare the independently computed results to detect errors before writing critical data. In stark contrast to any previous efforts in this area, we have analyzed our fault tolerance scheme from a formal, theoretical perspective. To be specific, first, we provide an operational semantics for our assembly language, which includes a precise formal definition of our fault model. Second, we develop an assembly-level type system designed to detect reliability problems in compiled code. Third, we provide a formal specification for program fault tolerance under the given fault model and prove that all well-typed programs are indeed fault tolerant. In addition to the formal analysis, we evaluate our detection scheme and show that it only takes 34% longer to execute than the unreliable version.
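The replicate-then-compare pattern described in this abstract can be mimicked at a high level. The following is a minimal sketch assuming a dictionary as memory and a hypothetical `checked_store` helper standing in for the special compare instructions; it is not the paper's assembly language or type system. It shows that a store only takes effect when both independently computed copies of the address and value agree.

```python
class FaultDetected(Exception):
    """Raised when the two redundant copies disagree before a store."""

def checked_store(memory, addr_a, value_a, addr_b, value_b):
    """Write to memory only if both copies of the address and value match."""
    if addr_a != addr_b or value_a != value_b:
        raise FaultDetected(
            f"copy A wants {addr_a:#x}={value_a}, copy B wants {addr_b:#x}={value_b}"
        )
    memory[addr_a] = value_a

def run_replicated(compute, x, memory, address, inject_fault=None):
    """Compute the result twice, then commit it through the checked store.

    inject_fault is a hypothetical hook that corrupts the second copy,
    modelling a transient fault striking one of the two computations.
    """
    value_a = compute(x)
    value_b = compute(x)
    if inject_fault is not None:
        value_b = inject_fault(value_b)
    checked_store(memory, address, value_a, address, value_b)
```

Because the comparison happens before the write, a corrupted copy raises `FaultDetected` and leaves memory untouched, which is the property the abstract's scheme is meant to guarantee for critical data.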
Detecting emerging wearout faults
- In Proceedings of the IEEE Workshop on Silicon Errors in Logic - System Effects, 2007
- Cited by 8 (0 self)
Aggressive CMOS scaling accelerates transistor and interconnect wearout, resulting in shorter and less predictable lifetimes for microprocessors. Studies show that wearout faults have a gradual onset, manifesting initially as timing faults before eventually leading to hard breakdown. Prior work suggests detecting wearout faults as they begin to affect normal operation, but these techniques require complex circuit and/or microarchitectural changes. Our proposal, FIRST (Fingerprints In Reliability and Self Test), uses existing design-for-test hardware (scanout chains) and infrequent periodic tests under reduced frequency guardbands to observe marginal behavior that is an indication of wearout. FIRST is a low-overhead, complexity-effective methodology for detecting emerging wearout faults before they affect normal operation. We discuss the operation of FIRST error detection, present a model for wearout fault simulation, and demonstrate FIRST’s effectiveness on a portion of a commercial microprocessor design.
Online timing analysis for wearout detection
- In Proc. of the 2nd Workshop on Architectural Reliability (WAR), 2006
- Cited by 7 (2 self)
CMOS feature size scaling has long been the source of dramatic performance gains. However, because voltage levels have not scaled in step, feature size scaling has come at the cost of increased operating temperatures and current densities. Further, since most common wearout mechanisms are highly dependent upon both temperature and current density, reliability issues, and in particular microprocessor lifetime, have come into question. In this work, we explore the effects of wearout upon a fully synthesized, placed and routed implementation of an embedded microprocessor core and present a generic wearout detection unit. Since most common failure mechanisms may be characterized by a period of increased latency through ailing transistors and interconnects before breakdown, this wearout detection unit serves to predict imminent failure by conducting online timing analysis. In addition to measuring signal propagation latency, it also includes a unique two-level sampling unit which is used to smooth out timing anomalies that may be caused by phenomena such as temperature spikes, electrical noise, and clock jitter.
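One way to picture how two levels of sampling could smooth out short-lived timing anomalies is shown below. This is a hypothetical software model, not the hardware unit from the paper: the class name, the window sizes, the use of arithmetic means, and the 10% threshold over a baseline latency are all assumptions chosen for illustration. The first level averages small groups of raw latency samples, and the second level averages the most recent group averages before comparing against the baseline.

```python
from collections import deque
from statistics import mean

class TwoLevelSampler:
    """Flags sustained latency growth while ignoring isolated spikes (assumed design)."""

    def __init__(self, baseline_ps, short_window=4, long_window=3, threshold=1.10):
        self.baseline_ps = baseline_ps
        self.short_window = short_window
        self.threshold = threshold
        self._short = []                        # raw samples for the first level
        self._long = deque(maxlen=long_window)  # first-level averages

    def add_sample(self, latency_ps):
        """Record one raw latency measurement and return the current verdict."""
        self._short.append(latency_ps)
        if len(self._short) == self.short_window:
            self._long.append(mean(self._short))
            self._short.clear()
        return self.wearout_suspected()

    def wearout_suspected(self):
        """True once the second-level average exceeds the baseline by the threshold."""
        if len(self._long) < self._long.maxlen:
            return False
        return mean(self._long) > self.baseline_ps * self.threshold
```

With these assumed parameters, a single 200 ps spike among 100 ps samples is averaged away, while a sustained shift to 120 ps eventually trips the detector.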
ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems
- Peter Kogge, Editor & Study Lead, 2008
- Cited by 6 (0 self)
exchange and its publication does not constitute the Government’s approval or disapproval of its ideas or findings NOTICE Using Government drawings, specifications, or other data included in this document for any purpose other than Government procurement does not in any way obligate the U.S. Government. The fact that the Government formulated or supplied the drawings, specifications, or other data does not license the holder or any other person or corporation; or convey any rights or permission to manufacture, use, or sell any patented invention that may relate to them. APPROVED FOR PUBLIC RELEASE, DISTRIBUTION UNLIMITED. This page intentionally left blank. DISCLAIMER The following disclaimer was signed by all members of the Exascale Study Group (listed below): I agree that the material in this document reflects the collective views, ideas, opinions and findings of the study participants only, and not those of any of the universities, corporations, or other institutions with which they are affiliated. Furthermore, the material in this document does not reflect the official views, ideas, opinions and/or findings of DARPA, the Department of Defense, or of the United States government.

