Results 1 - 10 of 24
Verifying Quantitative Reliability for Programs That Execute on Unreliable Hardware
"... Emerging high-performance architectures are anticipated to contain unreliable components that may exhibit soft errors, which silently corrupt the results of computations. Full detection and masking of soft errors is challenging, expensive, and, for some applications, unnecessary. For example, approx ..."
Abstract
-
Cited by 27 (3 self)
- Add to MetaCart
(Show Context)
Emerging high-performance architectures are anticipated to contain unreliable components that may exhibit soft errors, which silently corrupt the results of computations. Full detection and masking of soft errors is challenging, expensive, and, for some applications, unnecessary. For example, approximate computing applications (such as multimedia processing, machine learning, and big data analytics) can often naturally tolerate soft errors. We present Rely, a programming language that enables developers to reason about the quantitative reliability of an application – namely, the probability that it produces the correct result when executed on unreliable hardware. Rely allows developers to specify the reliability requirements for each value that a function produces. We present a static quantitative reliability analysis that verifies quantitative requirements on the reliability of an application, enabling a developer to perform sound and verified reliability engineering. The analysis takes a Rely program with a reliability specification and a hardware specification that characterizes the reliability of the underlying hardware components and verifies that the program satisfies its reliability specification when executed on the underlying unreliable hardware platform. We demonstrate the application of quantitative reliability analysis on six computations implemented in Rely.
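As an illustration of the underlying idea (not code from the paper), the quantitative reliability of a straight-line computation can be bounded by the product of per-operation reliabilities taken from a hardware specification. The sketch below assumes hypothetical per-operation reliability values and a made-up operation trace.

```python
# Illustrative sketch only: bounds the reliability of a straight-line
# computation by the product of per-operation reliabilities, in the spirit
# of Rely's quantitative reliability reasoning. The hardware numbers are
# hypothetical, not taken from the paper.

HARDWARE_SPEC = {
    "add": 1 - 1e-7,   # assumed probability an addition executes correctly
    "mul": 1 - 1e-7,
    "load": 1 - 1e-7,
    "store": 1 - 1e-7,
}

def reliability(op_trace):
    """Lower bound on the probability that all operations execute correctly."""
    r = 1.0
    for op in op_trace:
        r *= HARDWARE_SPEC[op]
    return r

# Check a specification such as "result correct with probability >= 0.99999".
trace = ["load", "load", "mul", "add", "store"]
assert reliability(trace) >= 0.99999
```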
ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection
"... As machines increase in scale, many researchers have predicted that failure rates will correspondingly increase. Soft errors do not inhibit execution, but may silently generate incorrect results. Recent trends have shown that soft error rates are increasing, and hence they must be detected and handl ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
(Show Context)
As machines increase in scale, many researchers have predicted that failure rates will correspondingly increase. Soft errors do not inhibit execution, but may silently generate incorrect results. Recent trends have shown that soft error rates are increasing, and hence they must be detected and handled to maintain correctness. We present a holistic methodology for automatically detecting and recovering from soft or hard faults with minimal application intervention. This is demonstrated by ACR: an automatic checkpoint/restart framework that performs application replication and automatically adapts the checkpoint period using online information about the current failure rate. ACR performs an application- and user-oblivious recovery. We empirically test ACR by injecting failures that follow different distributions for five applications and show low overhead when scaled to 131,072 cores. We also analyze the interaction between soft and hard errors and propose three recovery schemes that explore the trade-off between performance and reliability requirements.
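ACR adapts the checkpoint period to the failure rate it observes online. Purely as an illustration of that kind of policy (not necessarily ACR's exact formula), Young's first-order approximation picks the interval from the checkpoint cost and the measured MTBF:

```python
import math

def checkpoint_period(checkpoint_cost_s, observed_mtbf_s):
    """Young's approximation of the optimal checkpoint interval.

    Illustrative only: ACR adapts its period online from the measured
    failure rate; its exact policy may differ from this classic formula.
    """
    return math.sqrt(2 * checkpoint_cost_s * observed_mtbf_s)

# Example: 60 s checkpoints, failures observed roughly every 4 hours.
print(checkpoint_period(60, 4 * 3600))  # ~1315 s between checkpoints
```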
Runtime Asynchronous Fault Tolerance via Speculation
"... Transient faults are emerging as a critical reliability concern in modern microprocessors. Redundant hardware solutions are commonly deployed to detect transient faults, but they are less flexible and cost-effective than software solutions. However, software solutions are rendered impractical becaus ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
(Show Context)
Transient faults are emerging as a critical reliability concern in modern microprocessors. Redundant hardware solutions are commonly deployed to detect transient faults, but they are less flexible and cost-effective than software solutions. However, software solutions are rendered impractical because of high performance overheads. To address this problem, this paper presents Runtime Asynchronous Fault Tolerance via Speculation (RAFT), the fastest transient fault detection technique known to date. Serving as a layer between the application and the underlying platform, RAFT automatically generates two symmetric program instances from a program binary. It detects transient faults in a non-invasive way and exploits high-confidence value speculation to achieve low runtime overhead. Evaluation on a commodity multicore system demonstrates that RAFT delivers a geomean performance overhead of 2.83% on a set of 30 SPEC CPU and STAMP benchmarks. Compared with existing transient fault detection techniques, RAFT exhibits the best performance and fault coverage, without requiring any change to the hardware or the software applications.
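For intuition only (RAFT itself duplicates a program binary into two symmetric instances; this is not its implementation), redundant execution with output comparison can be sketched as running the same computation twice and flagging any divergence:

```python
# Toy sketch of software redundant execution with output comparison.
# Conveys the detect-by-divergence idea only; RAFT compares the two program
# instances asynchronously and uses value speculation to hide the overhead.

def run_redundant(compute, *args):
    leading = compute(*args)   # "leading" instance
    trailing = compute(*args)  # "trailing" instance re-executes the same work
    if leading != trailing:
        raise RuntimeError("transient fault suspected: instances diverged")
    return leading

result = run_redundant(sum, range(1000))
```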
Addressing Failures in Exascale Computing
"... The current approach to resilience for large high-performance computing (HPC) machines is based on global application checkpoint/restart. The state of each application is checkpointed periodically; if the application fails, then it is restarted from the last checkpoint. Preserving this approach is h ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
(Show Context)
The current approach to resilience for large high-performance computing (HPC) machines is based on global application checkpoint/restart. The state of each application is checkpointed periodically; if the application fails, then it is restarted from the last checkpoint. Preserving this approach is highly desirable because it requires no change in application software. The success of this method depends crucially on the following assumptions:
1. The time to checkpoint is much shorter than the mean time before failure (MTBF).
2. The time to restart (which includes the time to restore the system to a consistent state) is much shorter than the MTBF.
3. The checkpoint is correct—errors that could corrupt the checkpointed state are detected before the checkpoint is committed.
4. Committed output data is correct (output is committed when it is read).
It is not clear that these assumptions are currently satisfied. In particular, can one ignore silent data corruptions (SDCs)? It is clear that satisfying these assumptions will be harder in the future for the following reasons:
• MTBF is decreasing faster than disk checkpoint time.
• MTBF is decreasing faster than recovery time—especially recovery from global system failures.
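To make the MTBF-versus-checkpoint-time argument concrete, here is a back-of-the-envelope sketch (numbers and the simple first-order model are assumptions for illustration, not figures from the report): the fraction of machine time lost is roughly the checkpoint overhead plus the average rework and restart per failure, and it explodes as the MTBF approaches the checkpoint time.

```python
def wasted_fraction(checkpoint_s, restart_s, interval_s, mtbf_s):
    """Rough first-order estimate of time lost to checkpoint/restart.

    Illustration of the scaling argument only; careful models (e.g. Daly's)
    are more accurate when the MTBF nears the checkpoint time.
    """
    checkpoint_overhead = checkpoint_s / interval_s   # time spent writing checkpoints
    rework = interval_s / (2 * mtbf_s)                # average lost work per failure
    recovery = restart_s / mtbf_s                     # time spent restarting
    return checkpoint_overhead + rework + recovery

# Assumed today-ish case: 10-min checkpoints, 1-h interval, 1-day MTBF -> ~20% lost.
print(wasted_fraction(600, 600, 3600, 86400))
# Assumed exascale-ish case: same checkpoint cost, 1-h MTBF -> most time is lost.
print(wasted_fraction(600, 600, 3600, 3600))
```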
Relyzer: Exploiting Application-Level Fault Equivalence to Analyze Application Resiliency to Transient Faults
"... Future microprocessors need low-cost solutions for reliable operation in the presence of failure-prone devices. A promising approach is to detect hardware faults by deploying low-cost monitors of software-level symptoms of such faults. Recently, researchers have shown these mechanisms work well, but ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
(Show Context)
Future microprocessors need low-cost solutions for reliable operation in the presence of failure-prone devices. A promising approach is to detect hardware faults by deploying low-cost monitors of software-level symptoms of such faults. Recently, researchers have shown these mechanisms work well, but there remains a non-negligible risk that some faults may escape the symptom detectors and result in silent data corruptions (SDCs). Most prior evaluations of symptom-based detectors perform fault injection campaigns on application benchmarks, where each run simulates the impact of a fault injected at a hardware site at a certain point in the application’s execution (an application fault site). Since the total number of application fault sites is very large (trillions for standard benchmark suites), it is not feasible to study all possible faults. Previous work therefore typically studies a randomly selected sample of fault sites.
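As a hedged illustration of the fault-equivalence idea, fault sites can be bucketed into equivalence classes and only one "pilot" injection run per class. The grouping key below (instruction PC and destination register) is a placeholder invented for the example; Relyzer's actual equivalence heuristics are based on control- and data-flow behavior.

```python
from collections import defaultdict

def prune_fault_sites(fault_sites):
    # Group application fault sites into equivalence classes and keep one
    # representative ("pilot") per class. The key is a hypothetical stand-in
    # for Relyzer's equivalence heuristics.
    classes = defaultdict(list)
    for site in fault_sites:
        key = (site["pc"], site["dest_reg"])
        classes[key].append(site)
    pilots = [members[0] for members in classes.values()]
    return pilots, classes

sites = [
    {"pc": 0x400123, "dest_reg": "r1", "dynamic_instance": i} for i in range(1000)
] + [
    {"pc": 0x400188, "dest_reg": "r2", "dynamic_instance": i} for i in range(500)
]
pilots, classes = prune_fault_sites(sites)
print(len(sites), "sites reduced to", len(pilots), "pilot injections")
```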
Efficient soft error protection for commodity embedded microprocessors using profile information
In Proceedings of the 13th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory for Embedded Systems (LCTES ’12), 2012
"... Successive generations of processors use smaller transistors in the quest to make more powerful computing systems. It has been previ-ously studied that smaller transistors make processors more suscep-tible to soft errors (transient faults caused by high energy particle strikes). Such errors can resu ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
Successive generations of processors use smaller transistors in the quest to make more powerful computing systems. Prior work has shown that smaller transistors make processors more susceptible to soft errors (transient faults caused by high-energy particle strikes). Such errors can result in unexpected behavior and incorrect results. With smaller and cheaper transistors becoming pervasive in mainstream computing, it is necessary to protect these devices against soft errors; an increasing rate of faults necessitates the protection of applications running on commodity processors against soft errors. The existing methods of protecting against such faults generally have high area or performance overheads and thus are not directly applicable in the embedded design space. In order to protect against soft errors, the detection of these errors is a necessary first step so that a recovery can be triggered. To solve the problem of detecting soft errors cheaply, we propose a profiling-based, software-only application analysis and transformation solution. The goal is to develop a low-cost solution which can be deployed for off-the-shelf embedded processors. The solution works by intelligently duplicating instructions that are likely to affect the program output, and comparing results between original and duplicated instructions. The intelligence of our solution is garnered through the use of control flow, memory dependence, and value profiling to understand and exploit the common-case behavior of applications. Our solution is able to achieve 92% fault coverage with a 20% instruction overhead. This represents a 41% lower performance overhead than the best prior approaches with approximately the same fault coverage.
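As a rough sketch of the selective-duplication idea (the impact scores below are invented for illustration; the paper derives its profitability estimates from control-flow, memory-dependence, and value profiles), instructions can be chosen for duplicate-and-compare protection greedily until an instruction-overhead budget is exhausted:

```python
# Hedged sketch: pick instructions to duplicate under an overhead budget,
# preferring those most likely to affect program output. The "impact" values
# are hypothetical stand-ins for profile-derived likelihoods.

def select_for_duplication(instructions, overhead_budget):
    ranked = sorted(instructions, key=lambda i: i["impact"], reverse=True)
    selected, cost = [], 0.0
    for instr in ranked:
        if cost + instr["exec_fraction"] > overhead_budget:
            continue
        selected.append(instr["name"])
        cost += instr["exec_fraction"]  # duplicating adds this much dynamic overhead
    return selected, cost

instrs = [
    {"name": "load_a",  "impact": 0.90, "exec_fraction": 0.08},
    {"name": "mul_b",   "impact": 0.70, "exec_fraction": 0.05},
    {"name": "cmp_c",   "impact": 0.40, "exec_fraction": 0.12},
    {"name": "store_d", "impact": 0.95, "exec_fraction": 0.06},
]
print(select_for_duplication(instrs, overhead_budget=0.20))
```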
Cost-Effective Soft-Error Protection for SRAM-Based Structures in GPGPUs
"... The general-purpose computing on graphics processing units (GPGPUs) are increasingly used to accelerate parallel applications. This makes reliability a growing concern in GPUs as they are originally designed for graphics processing with relaxed requirements for execution correctness. With CMOS proce ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
General-purpose computing on graphics processing units (GPGPU) is increasingly used to accelerate parallel applications. This makes reliability a growing concern in GPUs, since they were originally designed for graphics processing with relaxed requirements for execution correctness. With CMOS processing technologies continuously scaling down to the nano-scale, the on-chip soft error rate (SER) has been predicted to increase exponentially. GPGPUs with hundreds of cores integrated into a single chip are prone to a high SER. This paper aims to enhance GPGPU reliability in light of soft errors. We leverage the GPGPU microarchitecture characteristics and propose energy-efficient protection mechanisms for two typical SRAM-based structures (i.e., the instruction buffer and registers) which suffer high susceptibility. We develop the Similarity-AWare Protection (SAWP) scheme, which leverages instruction similarity to provide near-full ECC protection to the instruction buffer with very little area and power overhead. Based on the observation that shared memory usually exhibits low utilization, we propose the SHAred memory to Register Protection (SHARP) scheme, which intelligently leverages shared memory to hold the ECCs of registers. Experimental results show that our techniques substantially reduce structure vulnerability and significantly reduce power consumption compared to a full ECC protection mechanism.
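A back-of-the-envelope illustration of why underutilized shared memory can hold register ECCs (the sizes, utilization, and SECDED-style code below are assumptions for the example, not figures from the paper): each 32-bit register word needs roughly 7 check bits, and the spare shared-memory capacity per streaming multiprocessor can be compared against that requirement.

```python
# Illustrative capacity check only: can spare shared memory hold ECC bits for
# the register file? All sizes and the utilization figure are assumed.

REGISTER_FILE_BYTES = 128 * 1024      # assumed per-SM register file size
SHARED_MEM_BYTES = 48 * 1024          # assumed per-SM shared memory size
SHARED_MEM_UTILIZATION = 0.4          # abstract observes utilization is often low

check_bits_per_word = 7               # SECDED check bits for a 32-bit word
ecc_bytes_needed = (REGISTER_FILE_BYTES // 4) * check_bits_per_word / 8
spare_bytes = SHARED_MEM_BYTES * (1 - SHARED_MEM_UTILIZATION)

print(f"ECC storage needed: {ecc_bytes_needed / 1024:.1f} KB")
print(f"Spare shared memory: {spare_bytes / 1024:.1f} KB")
print("fits" if ecc_bytes_needed <= spare_bytes else "does not fit")
```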
Runtime Speculative Software-Only Fault Tolerance, 2012
"... Report Documentation Page Form ApprovedOMB No. 0704-0188 Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing an ..."
Abstract
- Add to MetaCart
(Show Context)
Report Documentation Page Form ApprovedOMB No. 0704-0188 Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if it does not display a currently valid OMB control number.
Survey of Error and Fault Detection Mechanisms (Locality, Parallelism and Hierarchy Group)
"... Abstract This report describes diverse error detection mechanisms that can be utilized within a resilient system to protect applications against various types of errors and faults, both hard and soft. These detection mechanisms have different overhead costs in terms of energy, performance, and area ..."
Abstract
- Add to MetaCart
(Show Context)
This report describes diverse error detection mechanisms that can be utilized within a resilient system to protect applications against various types of errors and faults, both hard and soft. These detection mechanisms have different overhead costs in terms of energy, performance, and area, and also differ in their error coverage, complexity, and programmer effort. In order to achieve the highest efficiency in designing and running a resilient computer system, one must understand the trade-offs among the aforementioned metrics for each detection mechanism and choose the most efficient option for a given running environment. To accomplish such a goal, we first enumerate many error detection techniques previously suggested in the literature.
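To illustrate the kind of trade-off-driven selection the survey describes (the mechanisms, coverage figures, and overheads below are invented placeholders, not data from the report), one might pick the cheapest detector that meets a coverage target for a given environment:

```python
# Placeholder data: candidate detection mechanisms with invented coverage and
# overhead figures, used only to illustrate trade-off-driven selection.
MECHANISMS = [
    {"name": "full dual modular redundancy", "coverage": 0.999, "perf_overhead": 1.00},
    {"name": "instruction duplication",      "coverage": 0.92,  "perf_overhead": 0.30},
    {"name": "symptom-based detection",      "coverage": 0.80,  "perf_overhead": 0.05},
    {"name": "software assertions",          "coverage": 0.60,  "perf_overhead": 0.02},
]

def cheapest_meeting(coverage_target):
    candidates = [m for m in MECHANISMS if m["coverage"] >= coverage_target]
    return min(candidates, key=lambda m: m["perf_overhead"], default=None)

print(cheapest_meeting(0.90))  # -> instruction duplication
print(cheapest_meeting(0.75))  # -> symptom-based detection
```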
Understanding Reliability Implication of Hardware Error in Virtualization Infrastructure
"... Abstract Hardware errors are no longer the exceptions in modern cloud data centers. Although virtualization provides software failure isolation across different virtual machines (VM), the virtualization infrastructure including the hypervisor and privileged VMs remains vulnerable to hardware errors ..."
Abstract
- Add to MetaCart
(Show Context)
Hardware errors are no longer exceptional in modern cloud data centers. Although virtualization provides software failure isolation across different virtual machines (VMs), the virtualization infrastructure, including the hypervisor and privileged VMs, remains vulnerable to hardware errors. Making matters worse, such errors are unlikely to be contained by the virtualization boundary and may lead to loss of work in multiple guest VMs due to unexpected and/or mishandled failures. To understand the reliability implications of hardware errors in virtualized systems, in this paper we develop a simulation-based framework that enables a comprehensive fault injection study on the hypervisor with a wide range of configurations. Our analysis shows that, in current systems, many hardware errors can propagate through various paths for an extended time before an observable failure (e.g., a whole-system crash). We further discuss the challenges of designing error tolerance techniques for the hypervisor.
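For flavor only (the paper's framework is simulation-based and targets hypervisor state with many configuration options; the injector below is a generic single-bit-flip illustration over an assumed, stand-in structure):

```python
import random

# Generic single-bit-flip fault injector over a byte buffer standing in for a
# simulated hypervisor structure. Purely illustrative of the injection step;
# the subsequent propagation and failure observation are not modeled here.

def inject_bit_flip(state: bytearray, rng: random.Random) -> tuple[int, int]:
    byte_index = rng.randrange(len(state))
    bit_index = rng.randrange(8)
    state[byte_index] ^= 1 << bit_index
    return byte_index, bit_index

rng = random.Random(42)
hypervisor_page_table = bytearray(4096)   # hypothetical target structure
site = inject_bit_flip(hypervisor_page_table, rng)
print("flipped bit at", site, "- now observe whether and when a failure manifests")
```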