Results 1 - 6 of 6
Quantifying the Impact of Single Bit Flips on Floating Point Arithmetic
"... Abstract—In high-end computing, the collective surface area, smaller fabrication sizes, and increasing density of components have led to an increase in the number of observed bit flips. Such flips result in silent errors, i.e., a potentially incorrect result, if mechanisms are not in place to detect ..."
Cited by 2 (1 self)
Abstract—In high-end computing, the collective surface area, smaller fabrication sizes, and increasing density of components have led to an increase in the number of observed bit flips. Such flips result in silent errors, i.e., a potentially incorrect result, if mechanisms are not in place to detect them. These phenomena are believed to occur more frequently in DRAM, but logic gates, arithmetic units, and other circuits are candidates for bit flips as well. Previous work has focused on algorithmic techniques for detecting and correcting bit flips in specific data structures. This work takes a novel approach to this problem. We focus on quantifying the impact of a single bit flip on specific floating-point operations. We analyze the error induced by flipping specific bits in the IEEE floating-point representation in an architecture-agnostic manner, i.e., without requiring proprietary information such as bit flip rates and the vendor-specific circuit designs. We initially study dot products of vectors and demonstrate that not all bit flips create a large error and, more importantly, the relative magnitude of the vectors and vector length can be exploited to minimize the error caused by a bit flip. We also construct an analytic model for the expected relative error caused by a bit flip in the dot product and compare this model against empirical data generated by Monte Carlo sampling. We then extend our analysis to stationary iterative methods and prove that these methods converge to the correct solution in the presence of faulty arithmetic. In general, this effort presents the first step towards rigorously quantifying the impact of bit flips on numerical methods. Our eventual goal is to utilize these results to provide insight into the vulnerability of leadership-class computing systems to silent faults and, ultimately, to provide a theoretical basis for future silent data corruption research.
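To make the flavor of this analysis concrete, here is a minimal Monte Carlo sketch in the spirit of the experiments described above; the fault model (a single flip in one stored operand of the dot product), the bit positions sampled, and all parameters are illustrative assumptions rather than the authors' exact setup. It shows that flips in low-order mantissa bits barely perturb the result, while flips in high exponent bits or the sign bit can dominate it.

```python
import random
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of the IEEE-754 double encoding of x (bit 0 = mantissa LSB, bit 63 = sign)."""
    (u,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", u ^ (1 << bit)))
    return y

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

random.seed(0)
n = 1000
x = [random.uniform(-1.0, 1.0) for _ in range(n)]
y = [random.uniform(-1.0, 1.0) for _ in range(n)]
exact = dot(x, y)

for bit in (0, 20, 40, 51, 52, 58, 62, 63):   # mantissa, exponent, and sign positions
    i = random.randrange(n)
    faulty = x[:]
    faulty[i] = flip_bit(x[i], bit)            # single flip in one stored operand
    rel_err = abs(dot(faulty, y) - exact) / abs(exact)
    print(f"flip bit {bit:2d}: relative error {rel_err:.3e}")
```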
Toward Exascale Resilience: 2014 Update
, 2014
"... Resilience is a major roadblock for HPC executions on future exascale systems. These systems will typically gather millions of CPU cores running up to a billion threads. Projections from current large systems and technology evolution predict errors will happen in exascale systems many times per day. ..."
Resilience is a major roadblock for HPC executions on future exascale systems. These systems will typically gather millions of CPU cores running up to a billion threads. Projections from current large systems and technology evolution predict errors will happen in exascale systems many times per day. These errors will propagate and generate various kinds of malfunctions, from simple process crashes to result corruptions. The past five years have seen extraordinary technical progress in many domains related to exascale resilience. Several technical options, initially considered inapplicable or unrealistic in the HPC context, have demonstrated surprising successes. Despite this progress, the exascale resilience problem is not solved, and the community is still facing the difficult challenge of ensuring that exascale applications complete and generate correct results while running on unstable systems. Since 2009, many workshops, studies, and reports have improved the definition of the resilience problem and provided refined recommendations. Some projections made during the previous decades and some priorities established from these projections need to be revised. This paper surveys what the community has learned in the past five years and summarizes the research problems still considered critical by the HPC community.
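A back-of-the-envelope calculation illustrates why such projections arrive at several errors per day; the node reliability figure below is an assumed round number, not taken from the paper. Under independent node failures, the system-level mean time between failures (MTBF) shrinks roughly linearly with the node count.

```python
# Assumed figure: each node fails on average once every 5 years.
node_mtbf_hours = 5 * 365 * 24

for nodes in (1_000, 100_000, 1_000_000):
    system_mtbf_hours = node_mtbf_hours / nodes   # independent-failure approximation
    print(f"{nodes:>9} nodes: system MTBF ~ {system_mtbf_hours:.2f} h, "
          f"~ {24 / system_mtbf_hours:.1f} failures/day")
```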
Exploring Void Search for Fault Detection on Extreme Scale Systems
"... culated in days or hours, is expected to drop to minutes on exascale machines. The advancement of resilience technologies greatly depends on a deeper understanding of faults arising from hardware and software components. This understanding has the potential to help us build better fault tolerance te ..."
The mean time between failures, calculated in days or hours, is expected to drop to minutes on exascale machines. The advancement of resilience technologies greatly depends on a deeper understanding of faults arising from hardware and software components. This understanding has the potential to help us build better fault tolerance technologies. For instance, it has been shown that combining checkpointing and failure prediction leads to longer checkpoint intervals, which in turn leads to fewer total checkpoints. In this paper we present a new approach for fault detection based on the Void Search (VS) algorithm. VS is used primarily in astrophysics for finding areas of space that have a very low density of galaxies. We evaluate our algorithm using real environmental logs from the Mira Blue Gene/Q supercomputer at Argonne National Laboratory. Our experiments show that our approach can detect almost all faults (i.e., sensitivity close to 1) with a low false positive rate (i.e., specificity values above 0.7). We also compare our algorithm with a number of existing detection algorithms and find that ours outperforms all of them.
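As rough intuition for how a void-finding algorithm can double as a fault detector, consider a one-dimensional sketch; this is an assumed simplification (the paper works on multi-dimensional environmental log data, and find_voids and its threshold are invented here for illustration). Healthy readings form dense regions, so a reading separated from its neighbors by a wide empty gap, i.e., a void, is suspicious.

```python
def find_voids(points, min_width):
    """Return (start, end) gaps between consecutive sorted points wider than min_width."""
    pts = sorted(points)
    return [(a, b) for a, b in zip(pts, pts[1:]) if b - a >= min_width]

# A dense band of healthy sensor readings plus one isolated excursion.
normal = [20.0 + 0.01 * i for i in range(200)]   # readings from 20.0 up to ~22.0
faulty = [35.0]
print(find_voids(normal + faulty, min_width=5.0))   # one void: the gap before the outlier
```

The sensitivity and specificity reported above are the usual TP / (TP + FN) and TN / (TN + FP) rates computed against labeled faults in the logs.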
A System Software Approach to Proactive Memory-Error Avoidance
"... Abstract—Today’s HPC systems use two mechanisms to ad-dress main-memory errors. Error-correcting codes make cor-rectable errors transparent to software, while checkpoint/restart (CR) enables recovery from uncorrectable errors. Unfortunately, CR overhead will be enormous at exascale due to the high f ..."
Abstract—Today’s HPC systems use two mechanisms to address main-memory errors. Error-correcting codes make correctable errors transparent to software, while checkpoint/restart (CR) enables recovery from uncorrectable errors. Unfortunately, CR overhead will be enormous at exascale due to the high failure rate of memory. We propose a new OS-based approach that proactively avoids memory errors using prediction. This scheme exposes correctable error information to the OS, which migrates pages and offlines unhealthy memory to avoid application crashes. We analyze memory error patterns in extensive logs from a BG/P system and show how correctable error patterns can be used to identify memory likely to fail. We implement a proactive memory management system on BG/Q by extending the firmware and Linux. We evaluate our approach with a realistic workload and compare our overhead against CR. We show improved resilience with negligible performance overhead for applications.
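A minimal sketch of the predict-and-avoid loop follows; the window, budget, and function names are assumptions for illustration, while the paper's actual predictor is derived from its BG/P log analysis. The idea: count correctable errors (CEs) per physical page and retire any page whose recent CE history suggests an uncorrectable error is likely.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 24 * 3600   # look-back window (assumed)
CE_BUDGET = 3                # CEs tolerated per page per window (assumed)

ce_history = defaultdict(deque)   # page frame number -> timestamps of CEs

def on_correctable_error(pfn: int, now: float) -> bool:
    """Record a CE on page `pfn`; return True if the page should be retired."""
    events = ce_history[pfn]
    events.append(now)
    while events and now - events[0] > WINDOW_SECONDS:
        events.popleft()                  # forget CEs outside the window
    return len(events) >= CE_BUDGET

# A burst of CEs on the same page within the window triggers retirement;
# the OS would then migrate the page's contents and offline the frame.
retire = False
for t in (0.0, 3600.0, 7200.0):
    retire = on_correctable_error(pfn=0x2A3F, now=t)
print(retire)   # True: three CEs on one page within 24 h
```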
Improving I/O Throughput of Scientific Applications using Transparent Parallel Compression
"... Abstract—Increasing number of cores in parallel computer systems are allowing scientific simulations to be executed with increasing spatial and temporal granularity. However, this also implies that increasing larger-sized datasets need to be output, stored, managed, and then visualized and/or analyz ..."
Abstract—Increasing numbers of cores in parallel computer systems are allowing scientific simulations to be executed with increasing spatial and temporal granularity. However, this also implies that increasingly large datasets need to be output, stored, managed, and then visualized and/or analyzed using a variety of methods. In examining the possibility of using compression to accelerate all of these steps, we focus on two important questions: “Can compression help save time when data is output from, or input into, a parallel program?”, and “How can a scientist’s effort in using compression with a parallel program be minimized?”. We focus on PnetCDF and show how transparent compression can be supported, thus allowing an existing simulation program to start outputting and storing data in a compressed fashion, and similarly allowing a data analysis application to read compressed data. We address challenges in supporting compression when parallel writes are being performed. In our experiments, we first analyze the effects of using compression with microbenchmarks, and then continue our evaluation using a scientific simulation application and two data analysis applications. While we obtain up to a factor of 2 improvement in performance for microbenchmarks, the execution time of the simulation application is improved by up to 22%, and the maximum speedup of the data analysis applications is 1.83 (with an average speedup of 1.36).
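The write path can be pictured with a small sketch; this is illustrative only, since PnetCDF's transparent compression is implemented inside the library and must coordinate parallel writes, and both zlib and the on-disk framing here are assumptions. Each writer compresses its buffer before it reaches the file system, trading cheap CPU cycles for reduced I/O volume.

```python
import array
import struct
import zlib

def write_compressed(path: str, values) -> tuple:
    """Compress a buffer of doubles before writing; return (raw_len, stored_len)."""
    raw = array.array("d", values).tobytes()
    comp = zlib.compress(raw, level=1)           # fast level: I/O, not CPU, is the bottleneck
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(raw)))     # record the original size for the read path
        f.write(comp)
    return len(raw), len(comp)

raw_len, stored_len = write_compressed("chunk.bin", [float(i % 7) for i in range(100_000)])
print(f"stored {stored_len} bytes instead of {raw_len} ({raw_len / stored_len:.1f}x reduction)")
```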
Virtual Chunks: On Supporting Random Accesses to Scientific Data in Compressible Storage Systems
"... Abstract—Data compression could ameliorate the I/O pressure of scientific applications on high-performance computing systems. Unfortunately, the conventional wisdom of naively applying data compression to the file or block brings the dilemma between efficient random accesses and high compression rat ..."
Abstract—Data compression could ameliorate the I/O pressure of scientific applications on high-performance computing systems. Unfortunately, naively applying data compression at the file or block level forces a dilemma between efficient random accesses and high compression ratios. File-level compression can barely support efficient random accesses to the compressed data: any retrieval request must trigger decompression from the beginning of the compressed file. Block-level compression provides flexible random accesses to the compressed data, but introduces extra overhead when applying the compressor to every block, which results in a degraded overall compression ratio. This paper introduces a concept called virtual chunks that aims to support efficient random accesses to compressed scientific data without sacrificing the compression ratio. In essence, virtual chunks are logical blocks identified by appended references that do not break the physical continuity of the file content. These additional references allow decompression to start from an arbitrary position (efficient random access) while retaining the file’s physical entirety to achieve a high compression ratio on par with file-level compression. One potential concern with virtual chunks is the space overhead of the additional references, which could degrade the compression ratio, but our analytic study and experimental results demonstrate that such overhead is negligible. We have implemented virtual chunks in two forms: a middleware layer for the GPFS parallel file system, and a module in the FusionFS distributed file system. Large-scale evaluations on up to 1,024 cores showed that virtual chunks could improve I/O throughput by up to 2X.
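The core mechanism can be sketched with zlib full-flush points; this realization is a guess at one way to implement the idea, and the paper's on-disk format and reference encoding may differ. The compressed file remains one physically continuous deflate stream, but the byte offsets of the flush points are kept as appended references, so decompression can begin at any chunk instead of at byte zero.

```python
import zlib

CHUNK = 64 * 1024   # logical (virtual) chunk size, an assumed parameter

def compress_with_refs(data: bytes):
    """One continuous deflate stream with a restartable full-flush point per chunk."""
    comp = zlib.compressobj(wbits=-15)           # raw deflate: mid-stream restart works
    out, refs = bytearray(), []
    for i in range(0, len(data), CHUNK):
        refs.append(len(out))                    # reference: offset of this chunk's start
        out += comp.compress(data[i:i + CHUNK])
        out += comp.flush(zlib.Z_FULL_FLUSH)     # byte-align and reset compressor state
    out += comp.flush()
    return bytes(out), refs                      # refs would be appended to the file

def read_chunk(stream: bytes, refs, idx: int) -> bytes:
    """Random access: inflate starting at the idx-th reference, not at byte zero."""
    d = zlib.decompressobj(wbits=-15)
    return d.decompress(stream[refs[idx]:], CHUNK)

data = bytes(range(256)) * 4096                  # 1 MiB of sample data
stream, refs = compress_with_refs(data)
assert read_chunk(stream, refs, 3) == data[3 * CHUNK:4 * CHUNK]
```

Because the flush points only add a few bytes each, the stream stays close to file-level compression in size, which matches the abstract's claim that the reference overhead is negligible.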