Algorithms for Parallel Memory I: Two-Level Memories
, 1992
Cited by 236 (31 self)
We provide the first optimal algorithms in terms of the number of input/outputs (I/Os) required between internal memory and multiple secondary storage devices for the problems of sorting, FFT, matrix transposition, standard matrix multiplication, and related problems. Our two-level memory model is new and gives a realistic treatment of parallel block transfer, in which during a single I/O each of the P secondary storage devices can simultaneously transfer a contiguous block of B records. The model pertains to a large-scale uniprocessor system or parallel multiprocessor system with P disks. In addition, the sorting, FFT, permutation network, and standard matrix multiplication algorithms are typically optimal in terms of the amount of internal processing time. The difficulty in developing optimal algorithms is to cope with the partitioning of memory into P separate physical devices. Our algorithms' performance can be significantly better than those obtained by the well-known but nonopti...
External-Memory Computational Geometry
, 1993
Cited by 121 (20 self)
In this paper, we give new techniques for designing efficient algorithms for computational geometry problems that are too large to be solved in internal memory, and we use these techniques to develop optimal and practical algorithms for a number of important large-scale problems. We discuss our algorithms primarily in the context of single-processor/single-disk machines, a domain in which they are not only the first known optimal results but also of tremendous practical value. Our methods also produce the first known optimal algorithms for a wide range of two-level and hierarchical multilevel memory models, including parallel models. The algorithms are optimal both in terms of I/O cost and internal computation.
LH*RS - a high-availability scalable distributed data structure
Cited by 56 (11 self)
LH*RS is a high-availability scalable distributed data structure (SDDS). An LH*RS file is hash partitioned over the distributed RAM of a multicomputer, e.g., a network of PCs, and supports the unavailability of any of its k ≥ 1 server nodes. The value of k transparently grows with the file to offset the reliability decline. Only the number of the storage nodes potentially limits the file growth. The high-availability management uses a novel parity calculus that we have developed, based on Reed-Solomon erasure-correcting coding. The resulting parity storage overhead is close to the minimum possible. The parity encoding and decoding are faster than for any other candidate coding we are aware of. We present our scheme and its performance analysis, including experiments with a prototype implementation on Wintel PCs. The capabilities of LH*RS offer new perspectives to data-intensive applications, including the emerging ones of grids and of P2P computing.
Disk scrubbing in large archival storage systems
 In Proceedings of the 12th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS ’04)
, 2004
Cited by 54 (15 self)
Large archival storage systems experience long periods of idleness broken up by rare data accesses. In such systems, disks may remain powered off for long periods of time. These systems can lose data for a variety of reasons, including failures at both the device level and the block level. To deal with these failures, we must detect them early enough to be able to use the redundancy built into the storage system. We propose a process called “disk scrubbing” in a system in which drives are periodically accessed to detect drive failure. By scrubbing all of the data stored on all of the disks, we can detect block failures and compensate for them by rebuilding the affected blocks. Our research shows how the scheduling of disk scrubbing affects overall system reliability, and that “opportunistic” scrubbing, in which the system scrubs disks only when they are powered on for other reasons, performs very well without the need to power on disks solely to check them.
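As a rough illustration of the scrub-and-rebuild cycle this abstract describes, the Python sketch below checks each block of one stripe against a stored checksum and rebuilds a single damaged block from the others. The CRC32 checksums, the single XOR parity block, and the names `xor_blocks` and `scrub` are assumptions made for the example; the abstract does not say which detection method or redundancy scheme the authors use.

```python
import zlib
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def scrub(stripe, checksums):
    """Verify every block in a stripe and rebuild a damaged one in place.

    stripe    -- list of bytes objects: data blocks plus one XOR parity block
                 (hypothetical layout chosen for this sketch)
    checksums -- CRC32 value recorded for each block when it was written
    Returns the indices of the blocks that were rebuilt.
    """
    bad = [i for i, block in enumerate(stripe)
           if zlib.crc32(block) != checksums[i]]
    if len(bad) > 1:
        # A single parity block can only recover one lost block per stripe.
        raise ValueError("more failures than the redundancy can repair")
    for i in bad:
        survivors = [b for j, b in enumerate(stripe) if j != i]
        stripe[i] = xor_blocks(survivors)
    return bad


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]
    stripe = data + [xor_blocks(data)]
    checksums = [zlib.crc32(b) for b in stripe]
    stripe[1] = b"XXXX"            # simulate a silent block failure
    print(scrub(stripe, checksums), stripe[1])
```

The sketch also shows why early detection matters: once a second block in the same stripe fails before the first is repaired, this level of redundancy can no longer recover the data.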
MDS Array Codes with Independent Parity Symbols
 IEEE TRANS. ON INFORMATION THEORY
, 1996
Cited by 51 (14 self)
A new family of MDS array codes is presented. The code arrays contain p information columns and r independent parity columns, each column consisting of p − 1 bits, where p is a prime. We extend a previously known construction for the case r = 2 to three and more parity columns. It is shown that when r = 3 such extension is possible for any prime p. For larger values of r, we give necessary and sufficient conditions for our codes to be MDS, and then prove that if p belongs to a certain class of primes these conditions are satisfied up to r ≤ 8. One of the advantages of the new codes is that encoding and decoding may be accomplished using simple cyclic shifts and XOR operations on the columns of the code array. We develop efficient decoding procedures for the case of two- and three-column errors. This again extends the previously known results for the case of a single-column error. Another primary advantage of our codes is related to the problem of efficient information updates. We present upper and lower bounds on the average number of parity bits which have to be updated in an MDS code over GF(2^m), following an update in a single information bit. This average number is of importance in many storage applications which require frequent updates of information. We show that the upper bound obtained from our codes is close to the lower bound and, most importantly, does not depend on the size of the code symbols.
Fast Concurrent Access to Parallel Disks
Cited by 50 (11 self)
High-performance applications involving large data sets require the efficient and flexible use of multiple disks. In an external memory machine with D parallel, independent disks, only one block can be accessed on each disk in one I/O step. This restriction leads to a load balancing problem that is perhaps the main inhibitor for the efficient adaptation of single-disk external memory algorithms to multiple disks. We solve this problem for arbitrary access patterns by randomly mapping blocks of a logical address space to the disks. We show that a shared buffer of O(D) blocks suffices to support efficient writing. The analysis uses the properties of negative association to handle dependencies between the random variables involved. This approach might be of independent interest for probabilistic analysis in general. If two randomly allocated copies of each block exist, N arbitrary blocks can be read within ⌈N/D⌉ + 1 I/O steps with high probability. The redundancy can be further reduced from 2 to 1 + 1/r for any integer r without a big impact on reading efficiency. From the point of view of external memory models, these results rehabilitate Aggarwal and Vitter's "single-disk multihead" model [1] that allows access to D arbitrary blocks in each I/O step. This powerful model can be emulated on the physically more realistic independent disk model [2] with small constant overhead factors. Parallel disk external memory algorithms can therefore be developed in the multihead model first. The emulation result can then be applied directly or further refinements can be added.
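The load-balancing effect of random placement can be explored with a small simulation. The Python sketch below is only an illustration under stated assumptions: it counts I/O steps as the largest number of requested blocks that land on any one disk, and for the two-copy case it uses a simple greedy rule (read from the currently less loaded copy) rather than the scheduling analysed in the paper. The function names are hypothetical.

```python
import math
import random
from collections import Counter


def lower_bound(n_blocks, n_disks):
    """No schedule can beat ceil(N / D) steps: each disk serves one block per step."""
    return math.ceil(n_blocks / n_disks)


def steps_single_copy(n_blocks, n_disks, rng):
    """Each block lives on one random disk; steps needed = heaviest disk load."""
    load = Counter(rng.randrange(n_disks) for _ in range(n_blocks))
    return max(load.values())


def steps_two_copies_greedy(n_blocks, n_disks, rng):
    """Each block has two random copies; greedily read the less loaded one.

    Assumption for this sketch: the two copies are placed independently,
    so they may occasionally share a disk.
    """
    load = [0] * n_disks
    for _ in range(n_blocks):
        a, b = rng.randrange(n_disks), rng.randrange(n_disks)
        load[a if load[a] <= load[b] else b] += 1
    return max(load)


if __name__ == "__main__":
    rng = random.Random(0)
    n, d = 1000, 10
    print("lower bound:", lower_bound(n, d))
    print("one copy   :", steps_single_copy(n, d, rng))
    print("two copies :", steps_two_copies_greedy(n, d, rng))
```

Comparing the two printed step counts against `lower_bound` for various N and D gives a feel for how much a second copy narrows the gap to ⌈N/D⌉.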
Tolerating Multiple Failures in RAID Architectures with Optimal Storage and Uniform Declustering
 In Proceedings of the 24th International Symposium on Computer Architecture
, 1996
Cited by 48 (5 self)
We present Datum, a novel method for tolerating multiple disk failures in disk arrays. Datum is the first known method that can mask any given number of failures, requires an optimal amount of redundant storage space, and spreads reconstruction accesses uniformly over disks in the presence of failures without needing large layout tables in controller memory. Our approach is based on information dispersal, a coding technique that admits an efficient hardware implementation. As the method does not restrict the configuration parameters of the disk array, many existing RAID organizations are particular cases of Datum. A detailed performance comparison with two other approaches shows that Datum's response times are similar to those of the best competitor when two or fewer disks fail, and that the performance degrades gracefully when more than two disks fail. 1 Introduction Disk arrays [15] offer significant advantages over conventional disks. Fragmentation of the total storage space into ...
AIDA-based Real-Time Fault-Tolerant Broadcast Disks
 In Proceedings of RTAS'96: The 1996 IEEE Real-Time Technology and Applications Symposium
, 1996
Cited by 41 (13 self)
The proliferation of mobile computers and wireless networks requires the design of future distributed real-time applications to recognize and deal with the significant asymmetry between downstream and upstream communication capacities, and the significant disparity between server and client storage capacities.
Evaluation of Distributed Recovery in Large-Scale Storage Systems
 IN PROCEEDINGS OF THE 13TH IEEE INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE DISTRIBUTED COMPUTING (HPDC)
, 2004
Cited by 34 (9 self)
Storage clusters consisting of thousands of disk drives are now being used both for their large capacity and high throughput. However, their reliability is far worse than that of smaller storage systems due to the increased number of storage nodes. RAID technology is no longer sufficient to guarantee the necessary high data reliability for such systems, because disk rebuild time lengthens as disk capacity grows. In this paper, we present FAst Recovery Mechanism (FARM), a distributed recovery approach that exploits excess disk capacity and reduces data recovery time. FARM works in concert with replication and erasure-coding redundancy schemes to dramatically lower the probability of data loss in large-scale storage systems. We have examined essential factors that influence system reliability, performance, and costs, such as failure detection, disk bandwidth usage for recovery, disk space utilization, disk drive replacement, and system scale, by simulating system behavior under disk failures. Our results show the reliability improvement from FARM and demonstrate the impacts of various factors on system reliability. Using our techniques, system designers will be better able to build multi-petabyte storage systems with much higher reliability at lower cost than previously possible.
Pinwheel scheduling for fault-tolerant broadcast disks in real-time database systems
, 1997
Cited by 31 (2 self)
The design of programs for broadcast disks which incorporate real-time and fault-tolerance requirements is considered. A generalized model for real-time fault-tolerant broadcast disks is defined. It is shown that designing programs for broadcast disks specified in this model is closely related to the scheduling of pinwheel task systems. Some new results in pinwheel scheduling theory are derived, which facilitate the efficient generation of real-time fault-tolerant broadcast disk programs.