A Survey of Rollback-Recovery Protocols in Message-Passing Systems
, 1996
Cited by 474 (24 self)
this paper, we use the terms event logging and message logging interchangeably
A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems
- Software – Practice & Experience
, 1997
Cited by 148 (26 self)
It is well-known that Reed-Solomon codes may be used to provide error correction for multiple failures in RAID-like systems. The coding technique itself, however, is not as well-known. To the coding theorist, this technique is a straightforward extension to a basic coding paradigm and needs no special mention. However, to the systems programmer with no training in coding theory, the technique may be a mystery. Currently, there are no references that describe how to perform this coding that do not assume that the reader is already well-versed in algebra and coding theory. This paper is intended for the systems programmer. It presents a complete specification of the coding algorithm plus details on how it may be implemented. This specification assumes no prior knowledge of algebra or coding theory. The goal of this paper is for a systems programmer to be able to implement Reed-Solomon coding for reliability in RAID-like systems without needing to consult any external references. Problem Specification: Let there be $n$ storage devices, $D_1, D_2, \ldots, D_n$, each of which holds $k$ bytes. These are called the “Data Devices.” Let there be $m$ more storage devices
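The problem setup in this abstract (data devices of equal size plus extra checksum devices) can be illustrated with the simplest special case. The sketch below is not the Reed-Solomon algorithm from the tutorial: it assumes a single checksum device holding the byte-wise XOR of the data devices, which tolerates only one failure. The function names `xor_parity` and `recover` are hypothetical and chosen for illustration.

```python
from functools import reduce


def xor_parity(devices):
    """Return the byte-wise XOR of equally sized byte strings."""
    if len({len(d) for d in devices}) != 1:
        raise ValueError("all devices must hold the same number of bytes")
    # zip(*devices) walks the devices column by column, one byte position at a time.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*devices))


def recover(surviving, parity):
    """Rebuild the one missing data device from the survivors and the parity."""
    # XOR is its own inverse, so XOR-ing everything that survived
    # yields the contents of the single failed device.
    return xor_parity(list(surviving) + [parity])


# Illustrative values only: three data devices of four bytes each.
data = [b"abcd", b"efgh", b"ijkl"]
checksum = xor_parity(data)                      # the single checksum device
rebuilt = recover([data[0], data[2]], checksum)  # pretend the second device failed
```

Tolerating more than one simultaneous failure requires more than one checksum device and a stronger code, which is the case the tutorial addresses.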
Diskless Checkpointing
, 1997
Cited by 91 (3 self)
Diskless Checkpointing is a technique for checkpointing the state of a long-running computation on a distributed system without relying on stable storage. As such, it eliminates the performance bottleneck of traditional checkpointing on distributed systems. In this paper, we motivate diskless checkpointing and present the basic diskless checkpointing scheme along with several variants for improved performance. The performance of the basic scheme and its variants is evaluated on a high-performance network of workstations and compared to traditional disk-based checkpointing. We conclude that diskless checkpointing is a desirable alternative to disk-based checkpointing that can improve the performance of distributed applications in the face of failures.
A Practical Analysis of Low-Density Parity-Check Erasure Codes for Wide-Area Storage Applications
- In DSN-2004: The International Conference on Dependable Systems and Networks
, 2004
Cited by 37 (6 self)
As peer-to-peer and widely distributed storage systems proliferate, the need to perform efficient erasure coding, instead of replication, is crucial to performance and efficiency. Low-Density Parity-Check (LDPC) codes have arisen as alternatives to standard erasure codes, such as Reed-Solomon codes, trading off vastly improved decoding performance for inefficiencies in the amount of data that must be acquired to perform decoding. The scores of papers written on LDPC codes typically analyze their collective and asymptotic behavior. Unfortunately, their practical application requires the generation and analysis of individual codes for finite systems. This paper attempts to illuminate the practical considerations of LDPC codes for peer-to-peer and distributed storage systems. The three main types of LDPC codes are detailed, and a huge variety of codes are generated, then analyzed using simulation. This analysis focuses on the performance of individual codes for finite systems, and addresses several important heretofore unanswered questions about employing LDPC codes in real-world systems.
Improving goodput by co-scheduling CPU and network capacity
- International Journal of High Performance Computing Applications
, 1999
Cited by 33 (5 self)
In a cluster computing environment, executable, checkpoint, and data files must be transferred between application submission and execution sites. As the memory footprint of cluster applications increases, saving and restoring the state of a computation in such an environment may require substantial network resources at both the start and the end of a CPU allocation. During the allocation, the application may also consume network bandwidth to periodically transfer a checkpoint back to the submission site or checkpoint server and to access remote data files. Under most circumstances, the application cannot use the allocated CPU while these transfers are in progress. Furthermore, if the application is unable to transfer a checkpoint or successfully migrate at preemption time, work already accomplished by the application is lost. The authors define
Note: Correction to the 1997 tutorial on Reed-Solomon coding
- Software – Practice & Experience
, 2005
Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems
- Journal of Parallel and Distributed Computing
, 2001
Cited by 22 (0 self)
Performance prediction of checkpointing systems in the presence of failures is a well-studied research area. While the literature abounds with performance models of checkpointing systems, none address the issue of selecting runtime parameters other than the optimal checkpointing interval. In particular, the issue of processor allocation is typically ignored. In this paper, we present a performance model for long-running parallel computations that execute with checkpointing enabled. We then discuss how it is relevant to today's parallel computing environments and software, and present case studies of using the model to select runtime parameters. Keywords: Checkpointing, performance prediction, parameter selection, parallel computation, Markov chain, exponential failure and repair distributions.
RENEW: A tool for fast and efficient implementation of checkpoint protocols
- In Proceedings of the 28th IEEE Fault-Tolerant Computing Symposium (FTCS
, 1998
Cited by 17 (7 self)
This paper describes the design, implementation, and evaluation of a run-time system for clusters of workstations that allows the rapid testing of checkpoint protocols with standard benchmarks. To achieve this goal, RENEW provides a flexible set of operations that facilitates the integration of a protocol in the system with reduced programming effort. To support a broad range of applications, RENEW exports, as its external interface, the industry-endorsed Message Passing Interface (MPI). Three distinct classes of protocols were evaluated using the RENEW environment with SPEC and NAS benchmarks on a network of workstations connected by ATM. It was observed that the communication-induced protocol emulated the behavior of the coordinated protocol, with comparable performance. The message logging protocol degraded the performance. Even though the message logging protocol was slower due to log replay, all three protocols required a similar amount of time to restore the application to the same state as before failure occurred and recovery was initiated.
The average availability of parallel checkpointing systems and its importance in selecting runtime parameters
- In 29th International Symposium on Fault-Tolerant Computing
, 1999
Cited by 7 (1 self)
Performance prediction of checkpointing systems in the presence of failures is a well-studied research area. While the literature abounds with performance models of checkpointing systems, none address the issue of selecting runtime parameters other than the optimal checkpointing interval. In particular, the issue of processor allocation is typically ignored. In this paper, we briefly present a performance model for long-running parallel computations that execute with checkpointing enabled. We then discuss how it is relevant to today’s parallel computing environments and software, and present case studies of using the model to select runtime parameters.
Assessing the performance of erasure codes in the wide-area
- In DSN-05: International Conference on Dependable Systems and Networks
, 2005
Cited by 5 (1 self)
The problem of efficiently retrieving a file that has been broken into blocks and distributed across the wide-area pervades applications that utilize Grid, peer-to-peer, and distributed file systems. While the use of erasure codes to improve the fault-tolerance and performance of wide-area file systems has been explored, there has been little work that assesses the performance and quantifies the impact of modifying various parameters. This paper performs such an assessment. We modify our previously defined framework for studying replication in the wide-area [6] to include both Reed-Solomon and Low-Density Parity-Check (LDPC) erasure codes. We then use this framework to compare Reed-Solomon and LDPC erasure codes in three wide-area, distributed settings. We conclude that although LDPC codes have an advantage over Reed-Solomon codes in terms of decoding cost, this advantage does not always translate to the best overall performance in wide-area storage situations.

