Results 1 - 6 of 6
Fault Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing
, 1997
Abstract

Cited by 23 (10 self)
Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault tolerance is based on diskless checkpointing, a paradigm that uses processor redundancy rather than stable storage as the fault-tolerant medium. These algorithms are able to run on clusters of workstations that change over time due to failure, load, or availability. As long as there are at least n processors in the cluster, and failures occur singly, the computation will complete in an efficient manner. We discuss the details of how the algorithms are tuned for fault tolerance and present performance results on a PVM network of Sun workstations connected by a fast, switched Ethernet.
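The single-failure recovery this abstract describes can be sketched with a floating-point checksum: one checkpoint processor holds the element-wise sum of all local states, and a lost state is reconstructed by subtracting the survivors from that sum. A minimal illustration, not the paper's tuned implementation; all names are hypothetical:

```python
import numpy as np

def encode_checkpoint(local_states):
    # The checkpoint processor stores the element-wise sum of all local states.
    return np.sum(local_states, axis=0)

def recover_single_failure(checkpoint, surviving_states):
    # With one failure, the lost state is the checksum minus the survivors.
    return checkpoint - np.sum(surviving_states, axis=0)

# Three "processors", each holding a small local state vector.
states = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
cp = encode_checkpoint(states)

# Processor 1 fails; recover its state from the checksum and the survivors.
recovered = recover_single_failure(cp, [states[0], states[2]])
assert np.allclose(recovered, states[1])
```

This is why the abstract requires that "failures occur singly": a single sum checkpoint determines exactly one unknown.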
Algorithm-Based Fault Tolerance for Fail-Stop Failures
, 2008
Abstract

Cited by 4 (1 self)
Fail-stop failures in distributed environments are often tolerated by checkpointing or message logging. In this paper, we show that fail-stop process failures in the ScaLAPACK matrix-matrix multiplication kernel can be tolerated without checkpointing or message logging. It has been proved in previous algorithm-based fault tolerance research that, for matrix-matrix multiplication, the checksum relationship in the input checksum matrices is preserved at the end of the computation no matter which algorithm is chosen. From this checksum relationship in the final computation results, processor miscalculations can be detected, located, and corrected at the end of the computation. However, whether this checksum relationship in the input checksum matrices can be maintained in the middle of the computation remained open. In this paper, we first demonstrate that, for many matrix-matrix multiplication algorithms, the checksum relationship in the input checksum matrices is not maintained in the middle of the computation. We then prove that, for the outer-product version of the matrix-matrix multiplication algorithm, the checksum relationship in the input checksum matrices can be maintained in the middle of the computation. Based on this checksum relationship maintained in the middle of the computation, we demonstrate that fail-stop process failures in ScaLAPACK matrix-matrix multiplication can be tolerated without checkpointing or message logging. Because no periodic checkpointing is involved, the fault tolerance overhead of this approach is surprisingly low.
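The checksum relationship the abstract refers to can be illustrated with the outer-product algorithm: append a row of column sums to A and a column of row sums to B, then accumulate rank-1 updates. A small NumPy sketch (illustrative only, not the ScaLAPACK kernel) shows the relationship holding after every intermediate step, which is what makes mid-computation recovery possible:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.random((n, n))
B = rng.random((n, n))

# Column-checksum A: append a row of column sums.
Ac = np.vstack([A, A.sum(axis=0)])
# Row-checksum B: append a column of row sums.
Br = np.hstack([B, B.sum(axis=1, keepdims=True)])

# Outer-product algorithm: C accumulates rank-1 updates Ac[:, k] * Br[k, :].
C = np.zeros((n + 1, n + 1))
for k in range(n):
    C += np.outer(Ac[:, k], Br[k, :])
    # The checksum relationship holds after every step, not just at the end:
    assert np.allclose(C[-1, :], C[:-1, :].sum(axis=0))
    assert np.allclose(C[:, -1], C[:, :-1].sum(axis=1))

# The leading n x n block is the ordinary product.
assert np.allclose(C[:-1, :-1], A @ B)
```

Each rank-1 update of the encoded matrices is itself checksum-consistent, so the invariant survives every partial sum; algorithms that update C in a different order break this intermediate invariant even though the final result is the same.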
Self-Adaptive Application-Level Fault Tolerance for Parallel and Distributed Computing, IEEE International Parallel and Distributed Processing Symposium
, 2007
Abstract

Cited by 3 (0 self)
Most application-level fault tolerance schemes in the literature are non-adaptive in the sense that the fault tolerance schemes incorporated in applications are usually designed without incorporating information from the system environment, such as the amount of available memory and the local or network I/O bandwidth. However, from an application point of view, it is often desirable for fault-tolerant high-performance applications to be able to achieve high performance in whatever system environment they execute, with as low a fault tolerance overhead as possible. In this paper, we demonstrate that, in order to achieve high reliability with as low a performance penalty as possible, fault tolerance schemes in applications need to be able to adapt themselves to different system environments. We propose a framework under which different fault tolerance schemes can be incorporated in applications using an adaptive method. Under this framework, applications are able to choose near-optimal fault tolerance schemes at run time according to the specific characteristics of the platform on which the application is executing.
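As a rough illustration of the adaptive idea, a run-time selector might compare the checkpoint size against available memory and I/O bandwidths before picking a scheme. The heuristic, thresholds, and names below are hypothetical and are not the paper's framework:

```python
def choose_scheme(checkpoint_mb, free_memory_mb, disk_bw_mbps, net_bw_mbps):
    # Hypothetical heuristic: prefer diskless checkpointing when the encoded
    # checkpoint (roughly 2x the local state, for data plus encoding buffers)
    # fits comfortably in memory; otherwise fall back to stable storage,
    # picking local disk or a remote server by available bandwidth.
    if 2 * checkpoint_mb < free_memory_mb:
        return "diskless"
    return "local-disk" if disk_bw_mbps >= net_bw_mbps else "remote-disk"

# Small checkpoint, plenty of memory: keep everything in memory.
assert choose_scheme(100, 1000, 50, 100) == "diskless"
# Large checkpoint, fast network: ship it to a remote checkpoint server.
assert choose_scheme(800, 1000, 50, 100) == "remote-disk"
```

The point of the paper's framework is precisely that such a decision is made at run time from measured platform characteristics rather than fixed at design time.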
Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing
Abstract

Cited by 2 (1 self)
Abstract—As the number of processors in today's high-performance computers continues to grow, the mean time to failure of these computers is becoming significantly shorter than the execution time of many current high-performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete system failure, most of today's high-performance computing applications cannot survive node failures. Therefore, whenever a node fails, all surviving processes on surviving nodes usually have to be aborted and the whole application has to be restarted. In this paper, we present a framework for building self-healing high-performance numerical computing applications so that they can adapt to node or link failures without aborting themselves. The framework is based on FT-MPI and diskless checkpointing. Our diskless checkpointing uses weighted checksum schemes, a variation of Reed-Solomon erasure codes over floating-point numbers. We introduce several scalable encoding strategies into the existing diskless checkpointing and reduce the overhead to survive k failures in p processes from 2⌈log p⌉ · k((β + 2γ)m + α) to (1 + O(√p/√m)) · 2k(β + 2γ)m, where α is the communication latency, 1/β is the network bandwidth between processes, 1/γ is the rate to perform calculations, and m is the size of the local checkpoint per process. When additional checkpoint processors are used, the overhead can be reduced to (1 + O(1/√p)) · k(β + 2γ)m, which is independent of the total number of computation processors.
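The weighted checksum scheme can be sketched as follows: k checkpoint units each hold a differently weighted sum of the p local states, and recovering up to k failures reduces to solving a k x k linear system in the lost states. The weights below are illustrative, not the paper's encoding:

```python
import numpy as np

# k = 2 weighted-checksum units guard p = 4 processes, each with an
# m-element local state (a real-number Reed-Solomon-style code).
rng = np.random.default_rng(1)
p, k, m = 4, 2, 3
states = rng.random((p, m))

# Weight matrix W (k x p): Vandermonde rows over nodes 1..p (illustrative).
W = np.vander(np.arange(1, p + 1), k, increasing=True).T.astype(float)
checkpoints = W @ states  # each row is one weighted-sum checkpoint

# Suppose processes 0 and 2 fail: subtract the survivors' contribution,
# then solve a k x k system for the two lost states.
failed, alive = [0, 2], [1, 3]
rhs = checkpoints - W[:, alive] @ states[alive]
recovered = np.linalg.solve(W[:, failed], rhs)
assert np.allclose(recovered, states[failed])
```

The encoding cost per checkpoint unit is one weighted reduction over the p local states, which is where the scalable encoding strategies in the paper make their savings.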
Optimal Real Number Codes for Fault Tolerant Matrix Operations
Abstract

Cited by 1 (0 self)
Today's long-running high-performance computing applications typically tolerate fail-stop failures by checkpointing. However, applications such as dense linear algebra computations often modify a large amount of memory between checkpoints, and checkpointing usually introduces considerable overhead when the number of processors used for computation is large. It has been demonstrated in [13] that a single fail-stop failure in ScaLAPACK matrix multiplication can be tolerated without checkpointing at a decreasing overhead rate of 1 / √p, where p is the number of processors used for computation. Multiple simultaneous processor failures can be tolerated without checkpointing by encoding matrices using a real-number erasure correcting code. However, the floating-point representation of a real number in today's high-performance computers introduces round-off errors, which can be amplified during recovery and cause the loss of precision of possibly all digits when the number of processors in the system is large. In this paper, we present a class of Reed-Solomon-style real-number erasure correcting codes that are numerically optimal during recovery. We analytically construct the numerically best erasure correcting codes for two erasures and develop an approximation method to computationally construct numerically good codes for three or more erasures. We prove that it is impossible, even for the numerically best minimum-redundancy erasure correcting codes, to correct all erasure patterns when the total number of processors is large. We give the conditions that guarantee correction of all two-erasure patterns. Experimental results demonstrate that the proposed codes are numerically much more stable than existing codes.
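The loss-of-precision issue can be made concrete: recovering an erasure pattern F with generator matrix W means solving the linear system given by the columns W[:, F], so the attainable accuracy is governed by cond(W[:, F]). A small sketch with a classic Vandermonde generator (illustrative only; not the proposed optimal codes) shows how some erasure patterns are far worse conditioned than others:

```python
import numpy as np

# Generator W (k x p) of a real-number erasure code: Vandermonde rows
# over the nodes 1..p.  Recovery of erasure pattern F solves
# W[:, F] y = rhs, so round-off is amplified by cond(W[:, F]).
p, k = 30, 2
W = np.vander(np.arange(1.0, p + 1), k, increasing=True).T  # rows: 1, x

# Well-separated erasures give a benign 2 x 2 system ...
best = np.linalg.cond(W[:, [0, p - 1]])
# ... while adjacent erasures at large nodes give a nearly singular one.
worst = np.linalg.cond(W[:, [p - 2, p - 1]])
assert worst > best
```

As p grows, the worst-case submatrix conditioning of such classical constructions degrades, which is exactly the effect that motivates constructing numerically optimal real-number codes.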