## A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems (1997)

### Download Links

- [www.cs.utk.edu]
- [web.eecs.utk.edu]
- [www.ssrc.ucsc.edu]
- [www.cs.utk.edu]
- [www.ie.u-ryukyu.ac.jp]
- [www.cs.uml.edu]
### Other Repositories/Bibliography

- DBLP

Venue: Software – Practice & Experience

Citations: 175 (33 self)

### BibTeX

    @ARTICLE{Plank97atutorial,
        author = {James S. Plank},
        title = {A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems},
        journal = {Software -- Practice \& Experience},
        year = {1997},
        volume = {27},
        pages = {995--1012}
    }

### Abstract

It is well-known that Reed-Solomon codes may be used to provide error correction for multiple failures in RAID-like systems. The coding technique itself, however, is not as well-known. To the coding theorist, this technique is a straightforward extension to a basic coding paradigm and needs no special mention. However, to the systems programmer with no training in coding theory, the technique may be a mystery. Currently, there are no references that describe how to perform this coding that do not assume that the reader is already well-versed in algebra and coding theory. This paper is intended for the systems programmer. It presents a complete specification of the coding algorithm plus details on how it may be implemented. This specification assumes no prior knowledge of algebra or coding theory. The goal of this paper is for a systems programmer to be able to implement Reed-Solomon coding for reliability in RAID-like systems without needing to consult any external references.

Problem Specification: Let there be $n$ storage devices, $D_1, D_2, \ldots, D_n$, each of which holds $k$ bytes. These are called the “Data Devices.” Let there be $m$ more storage devices ...

### Citations

725 | A Case for Redundant Arrays of Inexpensive Disks (RAID)
- Patterson, Gibson, et al.
- 1988

Citation Context ...ly new. It came to the fore with “Redundant Arrays of Inexpensive Disks” (RAID) where batteries of small, inexpensive disks combine high storage capacity, bandwidth, and reliability all at a low cost [4, 5, 6]. Since then, the technique has been used to design multicomputer and network file systems with high reliability and bandwidth [7, 8], and to design fast distributed checkpointing systems [9, 10, 11, ...

453 | Efficient dispersal of information for security, load balancing and fault tolerance
- Rabin
- 1989

Citation Context ...on RAID-like systems. However, the technique itself is harder to come by. The technique has an interesting history. It was first presented in terms of secret sharing by Karnin [17], and then by Rabin [18] in terms of information dispersal. Preparata [19] then showed the relationship between Rabin’s method and Reed-Solomon codes, hence the labeling of the technique as Reed-Solomon coding. The technique...

453 | An introduction to disk drive modeling
- Ruemmler, Wilkes
- 1994

Citation Context ...: the Gaussian Elimination and the recalculation. (Footnote 1: We do not include any equations for the time to perform disk reads/writes because the complexity of disk operation precludes a simple encapsulation [25].) [...] Since at least $n - y$ rows of [...] are identity rows, the Gaussian Elimination takes [...] steps. As $y$ is likely to be small this should be very fast (i.e. milliseconds). The subsequent recalcu...

439 | Algebraic Coding Theory
- Berlekamp
- 1968

Citation Context ...h that if any $m$ of $D_1, D_2, \ldots, D_n, C_1, C_2, \ldots, C_m$ fail, then the contents of the failed devices can be reconstructed from the non-failed devices. Introduction: Error-correcting codes have been around for decades [1, 2, 3]. However, the technique of distributing data among multiple storage devices to achieve high-bandwidth input and output, and using one or more error-correcting devices for failure recovery is relative...

295 | RAID: High-performance, reliable secondary storage
- Chen, Lee, et al.
- 1994

270 | The Zebra striped network file system
- Hartman, Ousterhout
- 1995

Citation Context ... storage capacity, bandwidth, and reliability all at a low cost [4, 5, 6]. Since then, the technique has been used to design multicomputer and network file systems with high reliability and bandwidth [7, 8], and to design fast distributed checkpointing systems [9, 10, 11, 12]. We call all such systems “RAID-like” systems. The above problem is central to all RAID-like systems. When storage is distributed...

266 | A case for redundant arrays of inexpensive disks
- Patterson, Gibson, et al.
- 1988

Citation Context ...GF($2^4$): 3 × 7 = gfilog[gflog[3]+gflog[7]] = gfilog[4+10] = gfilog[14] = 9; 13 × 10 = gfilog[gflog[13]+gflog[10]] = gfilog[13+9] = gfilog[7] = 11; 13 ÷ 10 = gfilog[gflog[13]-gflog[10]] = gfilog[13-9] = gfilog[4] = 3; 3 ÷ 7 = gfilog[gflog[3]-gflog[7]] = gfilog[4-10] = gfilog[9] = 10. Therefore, a multiplication or division requires one conditional, three table lookups (two ...
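
The table-driven arithmetic quoted in this context can be sketched as follows. This is a minimal illustration, not the paper's own listing: it assumes the primitive polynomial $x^4 + x + 1$ (which reproduces the gflog and gfilog values quoted above), and the names `build_tables`, `gf_mult`, `gf_div` and `PRIM_POLY` are chosen for this sketch only.

```python
# Sketch of table-driven GF(2^w) arithmetic for w = 4.
# Assumption: primitive polynomial x^4 + x + 1 (bit pattern 0b10011).

W = 4
NW = 1 << W            # field size, 2 to the w-th power = 16
PRIM_POLY = 0b10011    # x^4 + x + 1 (assumed, not stated in the excerpt)


def build_tables():
    """Return (gflog, gfilog) with gfilog[i] = 2^i in the field."""
    gflog = [None] * NW          # gflog[0] is undefined
    gfilog = [None] * (NW - 1)   # valid exponents are 0 .. NW-2
    b = 1
    for log in range(NW - 1):
        gflog[b] = log
        gfilog[log] = b
        b <<= 1                  # multiply by 2 (the polynomial x)
        if b & NW:               # result no longer fits in w bits
            b ^= PRIM_POLY       # reduce modulo the primitive polynomial
    return gflog, gfilog


GFLOG, GFILOG = build_tables()


def gf_mult(a, b):
    """Multiply by adding logarithms modulo NW - 1."""
    if a == 0 or b == 0:
        return 0
    return GFILOG[(GFLOG[a] + GFLOG[b]) % (NW - 1)]


def gf_div(a, b):
    """Divide by subtracting logarithms modulo NW - 1."""
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(2^w)")
    if a == 0:
        return 0
    return GFILOG[(GFLOG[a] - GFLOG[b]) % (NW - 1)]
```

With these tables, `gf_mult(3, 7)` gives 9, `gf_mult(13, 10)` gives 11, and `gf_div(13, 10)` gives 3. For 3 ÷ 7 the exponent 4 - 10 wraps to 9 and gfilog[9] is 10; multiplying back, `gf_mult(7, 10)` returns 3, which confirms the quotient.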

256 | Introduction to Coding Theory
- Lint
- 1998

Citation Context ...as Reed-Solomon coding. The technique has recently been discussed in varying levels of detail by Gibson [5], Schwarz [20] and Burkhard [13], with citations of standard texts on error correcting codes [1, 2, 3, 21, 22] for completeness. There is one problem with all the above discussions of this technique: they require the reader to have a thorough knowledge of algebra and coding theory. Any programmer with a bach...

139 | Redundant Disk Arrays: Reliable, Parallel Secondary Storage
- Gibson
- 1992

Citation Context ...ly new. It came to the fore with “Redundant Arrays of Inexpensive Disks” (RAID) where batteries of small, inexpensive disks combine high storage capacity, bandwidth, and reliability all at a low cost [4, 5, 6]. Since then, the technique has been used to design multicomputer and network file systems with high reliability and bandwidth [7, 8], and to design fast distributed checkpointing systems [9, 10, 11, ...

111 | On secret sharing systems
- Karnin, Greene, et al.
- 1983

Citation Context ...ed in almost all papers on RAID-like systems. However, the technique itself is harder to come by. The technique has an interesting history. It was first presented in terms of secret sharing by Karnin [17], and then by Rabin [18] in terms of information dispersal. Preparata [19] then showed the relationship between Rabin’s method and Reed-Solomon codes, hence the labeling of the technique as Reed-Solom...

90 | Some applications of Rabin’s fingerprinting method
- Broder
- 1993

Citation Context ...ficient software solution that is easy to implement and does not consume much physical memory. For larger values of [...], other approaches (hardware or software) may be necessary. See References [2], [27] and [28] for examples of other approaches. Acknowledgements: The author thanks Joel Friedman, Kai Li, Michael Puening, Norman Ramsey, Brad Vander Zanden and Michael Vose for their valuable comments...

86 | Reed-Solomon codes and their applications
- Wicker, Bhargava
- 1999

Citation Context ...as Reed-Solomon coding. The technique has recently been discussed in varying levels of detail by Gibson [5], Schwarz [20] and Burkhard [13], with citations of standard texts on error correcting codes [1, 2, 3, 21, 22] for completeness. There is one problem with all the above discussions of this technique: they require the reader to have a thorough knowledge of algebra and coding theory. Any programmer with a bach...

85 | The TickerTAIP parallel RAID architecture
- Cao, Lin, et al.
- 1994

Citation Context ... storage capacity, bandwidth, and reliability all at a low cost [4, 5, 6]. Since then, the technique has been used to design multicomputer and network file systems with high reliability and bandwidth [7, 8], and to design fast distributed checkpointing systems [9, 10, 11, 12]. We call all such systems “RAID-like” systems. The above problem is central to all RAID-like systems. When storage is distributed...

77 | Disk array storage system reliability
- Burkhard, Menon
- 1993

Citation Context ...ration per write to any single device. Its main disadvantage is that it cannot recover from more than one simultaneous failure. As $n$ grows, the ability to tolerate multiple failures becomes important [13]. Several techniques have been developed for this [13, 14, 15, 16], the concentration being small values of $m$. The most general technique for tolerating $m$ simultaneous failures with exactly $m$ checksu...

60 | EVENODD: An optimal scheme for tolerating double disk failures in RAID architectures
- Blaum, Brady, et al.
- 1994

Citation Context ... minimal device overhead. In other words, there are some combinations of [...] device failures that the system cannot tolerate. An important coding technique for two device failures is EVENODD coding [15]. This technique tolerates all two device failures with just two checksum devices, and all coding operations are XOR’s. Thus, it too is faster than RS-Raid coding. To the author’s knowledge, there is ...

38 | Error-Correcting Codes, Second Edition
- Peterson, Weldon
- 1972

Citation Context ...h that if any $m$ of $D_1, D_2, \ldots, D_n, C_1, C_2, \ldots, C_m$ fail, then the contents of the failed devices can be reconstructed from the non-failed devices. Introduction: Error-correcting codes have been around for decades [1, 2, 3]. However, the technique of distributing data among multiple storage devices to achieve high-bandwidth input and output, and using one or more error-correcting devices for failure recovery is relative...

29 | Failure correction techniques for large disk arrays
- Gibson, Hellerstein, et al.
- 1989

Citation Context ...antage is that it cannot recover from more than one simultaneous failure. As $n$ grows, the ability to tolerate multiple failures becomes important [13]. Several techniques have been developed for this [13, 14, 15, 16], the concentration being small values of $m$. The most general technique for tolerating $m$ simultaneous failures with exactly $m$ checksum devices is a technique based on Reed-Solomon coding. This fact i...

25 | Improving the Performance of Coordinated Checkpointers on Networks of Workstations using RAID Techniques
- Plank
- 1996

Citation Context ...st [4, 5, 6]. Since then, the technique has been used to design multicomputer and network file systems with high reliability and bandwidth [7, 8], and to design fast distributed checkpointing systems [9, 10, 11, 12]. We call all such systems “RAID-like” systems. The above problem is central to all RAID-like systems. When storage is distributed among $n$ devices, the chances of one of these devices failing becomes si...

22 | Algorithm-based diskless checkpointing for fault-tolerant matrix computations
- Plank, Kim, et al.
- 1995

Citation Context ...st [4, 5, 6]. Since then, the technique has been used to design multicomputer and network file systems with high reliability and bandwidth [7, 8], and to design fast distributed checkpointing systems [9, 10, 11, 12]. We call all such systems “RAID-like” systems. The above problem is central to all RAID-like systems. When storage is distributed among $n$ devices, the chances of one of these devices failing becomes si...

22 | RAID Organization and Performance
- Schwarz, Burkhard
- 1992

Citation Context ...between Rabin’s method and Reed-Solomon codes, hence the labeling of the technique as Reed-Solomon coding. The technique has recently been discussed in varying levels of detail by Gibson [5], Schwarz [20] and Burkhard [13], with citations of standard texts on error correcting codes [1, 2, 3, 21, 22] for completeness. There is one problem with all the above discussions of this technique: they require ...

20 | Evaluation of checkpoint mechanisms for massively parallel machines
- Chiueh, Deng
- 1996

Citation Context ...st [4, 5, 6]. Since then, the technique has been used to design multicomputer and network file systems with high reliability and bandwidth [7, 8], and to design fast distributed checkpointing systems [9, 10, 11, 12]. We call all such systems “RAID-like” systems. The above problem is central to all RAID-like systems. When storage is distributed among $n$ devices, the chances of one of these devices failing becomes si...

19 | The Theory of Error-Correcting Codes, Part I
- MacWilliams, Sloane
- 1977

Citation Context ... device using $FD = C$. For example, suppose the first word of $D_1$ is 3, the first word of $D_2$ is 13, and the first word of $D_3$ is 9. Then we use $F$ to calculate the first words of $C_1$, $C_2$, and $C_3$: $C_1 = (1)(3) \oplus (1)(13) \oplus (1)(9) = 3 \oplus 13 \oplus 9 = 0011 \oplus 1101 \oplus 1001 = 0111 = 7$; $C_2 = (1)(3) \oplus (2)(13) \oplus (3)(9) = 3 \oplus 9 \oplus 8 = 0011 \oplus 1001 \oplus 1000 = 0010 = 2$; $C_3 = (1)(3) \oplus (4)(13) \oplus (5)(9) = 3 \oplus 1 \oplus 11 = 0011 \oplus 0001 \oplus 1011 = 1001 = 9$. Suppose we change $D_2$...
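
The checksum calculation in this context can be reproduced with a short sketch. It is an illustration under stated assumptions, not the paper's implementation: the field is taken to be GF($2^4$) with primitive polynomial $x^4 + x + 1$, the coefficient rows (1, 1, 1), (1, 2, 3), (1, 4, 5) are generated as a Vandermonde-style matrix with entries $j^{i-1}$ (an assumption consistent with the coefficients in the worked example), and the function names are hypothetical.

```python
# Sketch: checksum words C = F * D over GF(2^4), with XOR as addition.
# Assumptions: primitive polynomial x^4 + x + 1; F[i][j] = (j+1)^i
# for 0-based i, j, which matches the rows used in the worked example.


def gf16_mult(a, b, prim_poly=0b10011):
    """Multiply two GF(2^4) elements by shift-and-XOR (no tables)."""
    product = 0
    while b:
        if b & 1:
            product ^= a         # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1                  # multiply a by 2
        if a & 0x10:             # overflowed 4 bits: reduce
            a ^= prim_poly
    return product


def vandermonde(m, n):
    """Build the m x n coefficient matrix with entries j^(i-1), 1-based."""
    rows = []
    for i in range(m):
        row = []
        for j in range(1, n + 1):
            entry = 1
            for _ in range(i):   # raise j to the i-th power in the field
                entry = gf16_mult(entry, j)
            row.append(entry)
        rows.append(row)
    return rows


def checksums(F, data_words):
    """Return one checksum word per row of F for the given data words."""
    out = []
    for row in F:
        c = 0
        for f, d in zip(row, data_words):
            c ^= gf16_mult(f, d)
        out.append(c)
    return out
```

Here `vandermonde(3, 3)` yields the rows `[1, 1, 1]`, `[1, 2, 3]`, `[1, 4, 5]`, and `checksums(vandermonde(3, 3), [3, 13, 9])` yields `[7, 2, 9]`, matching the first words of $C_1$, $C_2$ and $C_3$ above. The bitwise multiply is used only to keep the sketch self-contained; a table-driven multiply would give the same results.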

18 | Efficient placement of parity and data to tolerate two disk failures in disk array systems
- Park
- 1995

Citation Context ...antage is that it cannot recover from more than one simultaneous failure. As $n$ grows, the ability to tolerate multiple failures becomes important [13]. Several techniques have been developed for this [13, 14, 15, 16], the concentration being small values of $m$. The most general technique for tolerating $m$ simultaneous failures with exactly $m$ checksum devices is a technique based on Reed-Solomon coding. This fact i...

18 | Faster Checkpointing with N + 1 Parity
- Plank, Li
- 1994

14 | Fast, on-line failure recovery in redundant disk arrays
- Holland, Gibson, et al.
- 1993

Citation Context ...isks, which is the minimum value for tolerating $m$ failures. As in all RAID systems, the encoding information may be distributed among the $n + m$ disks to avoid having the checksum disks become hot spots [5, 26]. The final operation of concern is recovery. Here, we assume that $y \le m$ failures have occurred and the system must recover the contents of the $y$ disks. In the RS-Raid algorithm, recovery consists of ...

13 | Holographic Dispersal and Recovery of Information
- Preparata
- 1989

Citation Context ...f is harder to come by. The technique has an interesting history. It was first presented in terms of secret sharing by Karnin [17], and then by Rabin [18] in terms of information dispersal. Preparata [19] then showed the relationship between Rabin’s method and Reed-Solomon codes, hence the labeling of the technique as Reed-Solomon coding. The technique has recently been discussed in varying levels of ...

8 | Faster Checkpointing with
- Plank, Li
- 1994

Citation Context ...g[14] = 9; 13 × 10 = gfilog[gflog[13]+gflog[10]] = gfilog[13+9] = gfilog[7] = 11; 13 ÷ 10 = gfilog[gflog[13]-gflog[10]] = gfilog[13-9] = gfilog[4] = 3; 3 ÷ 7 = gfilog[gflog[3]-gflog[7]] = gfilog[4-10] = gfilog[9] = 10. Therefore, a multiplication or division requires one conditional, three table lookups (two [...] #define NW (1 << w) /* In other words, NW equals 2 to the w-th p...

8 | Fast, on-line failure recovery in redundant disk arrays
- Holland, Gibson, et al.
- 1993

6 | Maximal and near-maximal shift register sequences: efficient event counters and easy discrete logarithms
- Clark, Weng
- 1994

Citation Context ...oftware solution that is easy to implement and does not consume much physical memory. For larger values of [...], other approaches (hardware or software) may be necessary. See References [2], [27] and [28] for examples of other approaches. Acknowledgements: The author thanks Joel Friedman, Kai Li, Michael Puening, Norman Ramsey, Brad Vander Zanden and Michael Vose for their valuable comments and disc...

3 | Codes for Error Control and Synchronization
- Wiggert
- 1988

Citation Context ...ecognizes this shutting down. This is as opposed to an error, in which a device failure is manifested by storing and retrieving incorrect values that can only be recognized by some sort of embedded coding [2, 23]. The calculation of the contents of each checksum device $C_i$ requires a function $F_i$ applied to all the data devices. Figure 1 shows an example configuration using this technique (which we henceforth...

2 | Applied Parallel Research
- Plank, Li
- 1994