Results 1 - 10
of
45
Transient-fault recovery for chip multiprocessors
- In Proceedings of the 30th Annual International Symposium on Computer Architecture
, 2003
"... To address the increasing susceptibility of commodity chip multiprocessors (CMPs) to transient faults, we propose Chiplevel Redundantly Threaded multiprocessor with Recovery (CRTR). CRTR extends the previously-proposed CRT for transient-fault detection in CMPs, and the previously-proposed SRTR for t ..."
Abstract
-
Cited by 91 (3 self)
- Add to MetaCart
To address the increasing susceptibility of commodity chip multiprocessors (CMPs) to transient faults, we propose Chiplevel Redundantly Threaded multiprocessor with Recovery (CRTR). CRTR extends the previously-proposed CRT for transient-fault detection in CMPs, and the previously-proposed SRTR for transient-fault recovery in SMT. All these schemes achieve fault tolerance by executing and comparing two copies, called leading and trailing threads, of a given application. Previous recovery schemes for SMT do not perform well on CMPs. In a CMP, the leading and trailing threads execute on different processors to achieve load balancing and reduce the probability of a fault corrupting both threads; whereas in an SMT, both threads execute on the same processor. The inter-processor communication required to compare the threads introduces
Transient-Fault Recovery Using Simultaneous Multithreading
- IN PROCEEDINGS OF THE 29TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE
, 2002
"... We propose a scheme for transient-fault recovery called Simultaneously and Redundantly Threaded processors with Recovery (SRTR) that enhances a previously proposed scheme for transient-fault detection, called Simultaneously and Redundantly Threaded (SRT) processors. SRT replicates an application int ..."
Abstract
-
Cited by 74 (3 self)
- Add to MetaCart
We propose a scheme for transient-fault recovery called Simultaneously and Redundantly Threaded processors with Recovery (SRTR) that enhances a previously proposed scheme for transient-fault detection, called Simultaneously and Redundantly Threaded (SRT) processors. SRT replicates an application into two communicating threads, one executing ahead of the other. The trailing thread repeats the computation performed by the leading thread, and the values produced by the two threads are compared. In SRT, a leading instruction may commit before the check for faults occurs, relying on the trailing thread to trigger detection. In contrast, SRTR must not occurs, since a faulty instruction cannot be undone once the instruction commits. To avoid
Techniques to reduce the soft error rate of a high-performance microprocessor
- In Proceedings of the 31st annual International Symposium on Computer Architecture
, 2004
"... Transient faults due to neutron and alpha particle strikes pose a significant obstacle to increasing processor transistor counts in future technologies. Although fault rates of individual transistors may not rise significantly, incorporating more transistors into a device makes that device more like ..."
Abstract
-
Cited by 67 (5 self)
- Add to MetaCart
Transient faults due to neutron and alpha particle strikes pose a significant obstacle to increasing processor transistor counts in future technologies. Although fault rates of individual transistors may not rise significantly, incorporating more transistors into a device makes that device more likely to encounter a fault. Hence, maintaining processor error rates at acceptable levels will require increasing design effort. This paper proposes two simple approaches to reduce error rates and evaluates their application to a microprocessor instruction queue. The first technique reduces the time instructions sit in vulnerable storage structures by selectively squashing instructions when long delays are encountered. A fault is less likely to cause an error if the structure it affects does not contain valid instructions. We introduce a new metric, MITF (Mean Instructions To Failure),
SWIFT: Software Implemented Fault Tolerance
- In Proceedings of the 3rd International Symposium on Code Generation and Optimization
, 2005
"... To improve performance and reduce power, processor designers employ advances that shrink feature sizes, lower voltage levels, reduce noise margins, and increase clock rates. However, these advances make processors more susceptible to transient faults that can affect correctness. While reliable syste ..."
Abstract
-
Cited by 43 (7 self)
- Add to MetaCart
To improve performance and reduce power, processor designers employ advances that shrink feature sizes, lower voltage levels, reduce noise margins, and increase clock rates. However, these advances make processors more susceptible to transient faults that can affect correctness. While reliable systems typically employ hardware techniques to address soft-errors, software techniques can provide a lower-cost and more flexible alternative. This paper presents a novel, software-only, transient-fault-detection technique, called SWIFT. SWIFT efficiently manages redundancy by reclaiming unused instruction-level resources present during the execution of most programs. SWIFT also provides a high level of protection and performance with an enhanced control-flow checking mechanism. We evaluate an implementation of SWIFT on an Itanium 2 which demonstrates exceptional fault coverage with a reasonable performance cost. Compared to the best known single-threaded approach utilizing an ECC memory system, SWIFT demonstrates a 51% average speedup.
Opportunistic Transient-Fault Detection
- In Proceedings of the 32nd Annual International Symposium on Computer Architecture
, 2005
"... CMOS scaling increases susceptibility of microprocessors to transient faults. Most current proposals for transient-fault detection use full redundancy to achieve perfect coverage while incurring significant performance degradation. However, most commodity systems do not need or provide perfect cover ..."
Abstract
-
Cited by 36 (0 self)
- Add to MetaCart
CMOS scaling increases susceptibility of microprocessors to transient faults. Most current proposals for transient-fault detection use full redundancy to achieve perfect coverage while incurring significant performance degradation. However, most commodity systems do not need or provide perfect coverage. A recent paper explores this leniency to reduce the soft-error rate of the issue queue during L2 misses while incurring minimal performance degradation. Whereas the previous paper reduces soft-error rate without using any redundancy, we target better coverage while incurring similarly-minimal performance degradation by opportunistically using redundancy. We propose two semi-complementary techniques, called partial explicit redundancy (PER) and implicit redundancy through reuse (IRTR), to explore the trade-off between soft-error rate and performance. PER opportunistically exploits low-ILP phases and L2 misses to introduce explicit redundancy with minimal performance degradation. Because PER covers the entire pipeline and exploits not only L2 misses but all low-ILP phases, PER achieves better coverage than the previous work. To achieve coverage in high-ILP phases as well, we propose implicit redundancy through reuse (IRTR). Previous work exploits the phenomenon of instruction reuse to avoid redundant execution while falling back on redundant execution when there is no reuse. IRTR takes reuse to the extreme of performance-coverage trade-off and completely avoids explicit redundancy by exploiting reuse’s implicit redundancy within the main thread for fault detection with virtually no performance degradation. Using simulations with SPEC2000, we show that PER and IRTR achieve better tradeoff between soft-error rate and performance degradation than the previous schemes. 1
Design and evaluation of hybrid fault-detection systems
- In Proceedings of the 32th Annual International Symposium on Computer Architecture
, 2005
"... As chip densities and clock rates increase, processors are becoming more susceptible to transient faults that can affect program correctness. Up to now, system designers have primarily considered hardware-only and softwareonly fault-detection mechanisms to identify and mitigate the deleterious effec ..."
Abstract
-
Cited by 31 (5 self)
- Add to MetaCart
As chip densities and clock rates increase, processors are becoming more susceptible to transient faults that can affect program correctness. Up to now, system designers have primarily considered hardware-only and softwareonly fault-detection mechanisms to identify and mitigate the deleterious effects of transient faults. These two faultdetection systems, however, are extremes in the design space, representing sharp trade-offs between hardware cost, reliability, and performance. In this paper, we identify hybrid hardware/software fault-detection mechanisms as promising alternatives to hardware-only and software-only systems. These hybrid systems offer designers more options to fit their reliability needs within their hardware and performance budgets. We propose and evaluate CRAFT, a suite of three such hybrid techniques, to illustrate the potential of the hybrid approach. For fair, quantitative comparisons among hardware, software, and hybrid systems, we introduce a new metric, Mean Work To Failure, which is able to compare systems for which machine instructions do not represent a constant unit of work. Additionally, we present a new simulation framework which rapidly assesses reliability and does not depend on manual identification of failure modes. Our evaluation illustrates that CRAFT, and hybrid techniques in general, offer attractive options in the faultdetection design space. 1
A complexity-effective approach to alu bandwidth enhancement for instruction-level temporal redundancy
- In Proceedings of the International Symposium on Computer Architecture (ISCA
, 2004
"... Previous proposals for implementing instruction-level temporal redundancy in out-of-order cores have reported a performance degradation of upto 45 % in certain applications compared to an execution which does not have any temporal redundancy. An important contributor to this problem is the insuffici ..."
Abstract
-
Cited by 22 (4 self)
- Add to MetaCart
Previous proposals for implementing instruction-level temporal redundancy in out-of-order cores have reported a performance degradation of upto 45 % in certain applications compared to an execution which does not have any temporal redundancy. An important contributor to this problem is the insufficient number of ALUs for handling the amplified load injected into the core. At the same time, increasing the number of ALUs can increase the complexity of the issue logic, which has been pointed out to be one of the most timing critical components of the processor. This paper proposes a novel extension of a prior idea on instruction reuse to ease ALU bandwidth requirements in a complexity-effective way by exploiting certain interesting properties of a dual (temporally redundant) instruction stream. We present microarchitectural extensions necessary for implementing an instruction reuse buffer (IRB) and integrating this with the issue logic of a dual instruction stream superscalar core, and conduct extensive evaluations to demonstrate how well it can alleviate the ALU bandwidth problem. We show that on the average we can gain back nearly 50% of the IPC loss that occurred due to ALU bandwidth limitations for an instruction-level temporally redundant superscalar execution, and 23 % of the overall IPC loss. Keywords: Complexity-effective design, Instruction Reuse, Temporal Redundancy.
Configurable isolation: building high availability systems with commodity multi-core processors
- In Proceedings of the 34th Annual International Symposium on Computer Architecture
, 2007
"... High availability is an increasingly important requirement for enterprise systems, often valued more than performance. Systems designed for high availability typically use redundant hardware for error detection and continued uptime in the event of a failure. Chip multiprocessors with an abundance of ..."
Abstract
-
Cited by 21 (0 self)
- Add to MetaCart
High availability is an increasingly important requirement for enterprise systems, often valued more than performance. Systems designed for high availability typically use redundant hardware for error detection and continued uptime in the event of a failure. Chip multiprocessors with an abundance of identical resources like cores, cache and interconnection networks would appear to be ideal building blocks for implementing high availability solutions on chip. However, doing so poses significant challenges with respect to error containment and faulty component replacement. Increasing silicon and transient fault rates with future technology scaling exacerbate the problem. This paper proposes a novel, costeffective, architecture for high availability systems built from future multi-core processors. We propose a new chip multiprocessor architecture that provides configurable isolation for fault containment and component retirement, based upon cost-effective modifications to commodity designs. The design is evaluated for a state-of-the-art industrial fault model and the proposed architecture is shown to provide effective fault isolation and graceful degradation even when the failure rate is high.
Software-controlled fault tolerance
- ACM Transactions on Architecture and Code Optimization
, 2005
"... Traditional fault tolerance techniques typically utilize resources ineffectively because they cannot adapt to the changing reliability and performance demands of a system. This paper proposes software-controlled fault tolerance, a concept allowing designers and users to tailor their performance and ..."
Abstract
-
Cited by 17 (1 self)
- Add to MetaCart
Traditional fault tolerance techniques typically utilize resources ineffectively because they cannot adapt to the changing reliability and performance demands of a system. This paper proposes software-controlled fault tolerance, a concept allowing designers and users to tailor their performance and reliability for each situation. Several software-controllable fault detection techniques are then presented: SWIFT, a software-only technique, and CRAFT, a suite of hybrid hardware/ software techniques. Finally, the paper introduces PROFiT, a technique which adjusts the level of protection and performance at fine granularities through software control. When coupled with software-controllable techniques like SWIFT and CRAFT, PROFiT offers attractive and novel reliability options.
Engineering over-clocking: Reliability-performance trade-offs for high-performance register files
- In International Conference on Dependable Systems and Networks
, 2005
"... Register files are in the critical path of most high-performance processors and their latency is one of the most important factors that limit their size. Our goal is to develop error correction mechanisms at the architecture level. Utilizing this increased robustness, the clock frequencies of the ci ..."
Abstract
-
Cited by 14 (0 self)
- Add to MetaCart
Register files are in the critical path of most high-performance processors and their latency is one of the most important factors that limit their size. Our goal is to develop error correction mechanisms at the architecture level. Utilizing this increased robustness, the clock frequencies of the circuits are pushed beyond the point of allowing full voltage swing. This increases the errors observed due to noise and other external factors. The resulting errors are then corrected through the error correction mechanisms. We first develop a realistic model for error probability in register files for a given clock frequency. Then, we present the overall architecture, which allows the error detection computation to be overlapped with other computation in the pipeline. We develop novel techniques that utilize the fact that at a given instance many physical registers are not used in superscalar processors. These underutilized registers are used to store the values of active registers. Our simulation results show that for a fixed architecture the access times to the registers can be reduced by as much as 80 % while increasing the number of execution cycles by 0.12%. On the other hand, by reducing the register file access pipeline stages by 75%, the average number of execution cycles of SPEC applications can be reduced by

