## Algorithms for Scalable Storage Servers (2004)

Venue: SOFSEM 2004: Theory and Practice of Computer Science

Citations: 5 (1 self)

### BibTeX

@INPROCEEDINGS{Sanders04algorithmsfor,

author = {Peter Sanders},

title = {Algorithms for Scalable Storage Servers},

booktitle = {SOFSEM 2004: Theory and Practice of Computer Science},

year = {2004},

pages = {82--101},

publisher = {Springer}

}

### Abstract

We survey a set of algorithmic techniques that make it possible to build a high-performance storage server from a network of cheap components. Such a storage server offers a very simple programming model. To the clients it looks like a single very large disk that can handle many requests in parallel with minimal interference between the requests.

### Citations

2114 | The Theory of Error-Correcting Codes
- MacWilliams, Sloane
- 1977
Citation Context ...lerated (see Sect. 8) or if we additionally want to reduce output latencies (see Sect. 9). A disadvantage of codes with w > r+1 is that they are computationally more expensive than parity-encoding [20, 15, 29, 10, 4, 8]. Most of the scheduling algorithms for RDA we have discussed are easy to generalize for more general coding schemes. Only optimal scheduling needs some additional consideration. A formulation that is...

679 | LEDA: a platform for combinatorial and geometric computing. volume 38
- Mehlhorn, Näher
- 1995
Citation Context ...me using fairly standard data structures: Disks are viewed as nodes of a graph. Uncommitted requests are edges. Using an appropriate graph representation, edges can be removed in constant time (e.g., [21]). When a disk becomes a candidate for Rule 1, we remember it on a stack. The remaining nodes are kept in a priority queue ordered by their load. Insert, decrement-priority and delete-minimum can be i...

613 | Network Flows
- Ahuja, Magnanti, Orlin
- 1993
Citation Context ...ur choices before we have seen all the requests. Optimal schedules can be found in polynomial time [12]: Suppose we want to find out whether k steps suffice to retrieve all requests. Consider a flow network [2] that consists of four layers: A source node in the first layer is connected to each of n request nodes. Each request node is connected to two out of D disk nodes, one edge for each disk that holds a c...

491 | Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance
- Rabin
- 1997
Citation Context ...lerated (see Sect. 8) or if we additionally want to reduce output latencies (see Sect. 9). A disadvantage of codes with w > r+1 is that they are computationally more expensive than parity-encoding [20, 15, 29, 10, 4, 8]. Most of the scheduling algorithms for RDA we have discussed are easy to generalize for more general coding schemes. Only optimal scheduling needs some additional consideration. A formulation that is...

479 | An introduction to disk drive modeling
- Ruemmler, Wilkes
- 1994
Citation Context ...ime needed to retrieve request i. This generalization can be used to model several aspects of storage servers: We might want to retrieve just parts of a logical block; disks are divided into zones [30] of different data density and correspondingly different data rate: blocks on the outer zones are faster to retrieve than blocks on the inner zones. We assume here that both copies of a block are store...

335 | External memory algorithms and data structures: Dealing with MASSIVE data
- Vitter
Citation Context ...e many algorithms explicitly designed to work efficiently with coarse-grained block-wise access. Most use the model by Vitter and Shriver that allows identical parallel disks and a fixed block size. Vitter [42] has written a good overview article. More overviews and several introductory articles are collected in an LNCS Tutorial [22]. 3 Write Buffering 3.1 Greedy Writing ...

297 | A Case for Redundant Arrays of Inexpensive Disks
- Patterson, Gibson, et al.
Citation Context ...e system consists of D identical disks but Sect. 10 generalizes to the case of different capacity disks that can be added incrementally. 2 Related Work A widely used approach to storage server is RAID [27] (Redundant Arrays of Independent Disks). Different RAID levels (0-5) offer different combinations of two basic techniques: In mirroring (RAID Level 1), each disk has a mirror disk storing the same data....

263 | Balanced Allocations
- Azar, Broder, et al.
Citation Context ... placement of data is a well known technique (e.g., [7, 23]). Combining random placement and redundancy has first been considered in parallel computing for PRAM emulation [18] and online load balancing [6]. For scheduling disk accesses, these techniques have been used for multimedia applications [40, 41, 19, 24, 8, 36]. The methods described here are mostly a summary of four papers [35, 32, 33, 16]. Se...

249 | Algorithms for parallel memory I: Two-level memories
- Vitter, Shriver
- 1994
Citation Context ...opied or output. 4 from this queue is submitted to the disk. Fig. 2 illustrates this strategy. In some sense, greedy writing is optimal: Theorem 1 ([16]). Consider the I/O model of Vitter and Shriver [43] (fixed block size, fixed output cost). Assume some sequence of block writes is to be performed in that logical order and at most m blocks can be buffered by the storage server. Then greedy writing minimi...

179 | Disk Striping
- Salem, Garcia-Molina
- 1986
Citation Context ...tion (RDA) introduced in Sect. 4 only that the latter stores each block independently on different disks. We will see that this leads to better performance in several respects. Striping (RAID Level 0) [31] is a simple and elegant way to exploit disk parallelism: Logical blocks are split into D equal sized pieces and each piece is stored on a different disk. This way, accesses to logical blocks are alway...

145 | Massive arrays of idle disks for storage archives
- Colarelli, Grunwald
- 2002
Citation Context ... bases in petabytes (10^15 bytes). Currently, the largest of these applications use huge tape libraries, but hard disks can now store the same data for a similar price offering much higher performance [13]. To store such amounts of data one would need about 10 000 disks. Systems with thousands of disks have already been built and there are projects for "mid-range" systems that would scale to 12 000 dis...

108 | Sudden emergence of a giant k-core in a random graph
- Pittel, Spencer, et al.
- 1996
Citation Context ...imal schedule. This will be proven in an upcoming paper [11] using differential equation methods that have previously been used for the mathematically closely related problem of cores of random graphs [28]. 5 Variable Size Requests We now drop the assumption that we are dealing with fixed size jobs that take unit time to retrieve. Instead, let ` i 1 denote the time needed to retrieve request i. This ge...

101 | On the construction of pseudo-random permutations: Luby-Rackoff revisited
- Naor, Reingold
- 1999
Citation Context ...tions on the disk. In order to find out which blocks need to be moved or reconstructed when a disk is added or replaced, we would like to have permutations that are easy to invert. Feistel permutations [25] are one way to achieve that: Assume for now that $\sqrt{D'}$ is an integer and represent j as $j = j_a + j_b \sqrt{D'}$. Now consider the mapping $\pi_{i,1}((j_a, j_b)) = (j_b, j_a + f_{i,1}(j_b) \bmod \sqrt{D'})$ whe...

93 | Algorithm 360: Shortest path forest with topological ordering
- Dial
- 1969
Citation Context ...re kept in a priority queue ordered by their load. Insert, decrement-priority and delete-minimum can be implemented to run in amortized constant time using a slight variant of a bucket priority queue [14]. If we would plot the performance of the selfless algorithm in the same way as in Figure 5 it would be absolutely impossible to see a difference, i.e., with very high probability the selfless algorithms...

86 | Coding techniques for handling failures in large disk arrays
- Hellerstein, Gibson, et al.
- 1994
Citation Context ...lerated (see Sect. 8) or if we additionally want to reduce output latencies (see Sect. 9). A disadvantage of codes with w > r+1 is that they are computationally more expensive than parity-encoding [20, 15, 29, 10, 4, 8]. Most of the scheduling algorithms for RDA we have discussed are easy to generalize for more general coding schemes. Only optimal scheduling needs some additional consideration. A formulation that is...

72 | Data Partitioning and Load Balancing in Parallel Disk Systems
- Scheuermann, Weikum, Zabback
Citation Context ...e, we have treated all data equal. But in reality, some data is accessed more frequently than other data. Besides the short term measure of caching, this leads to the question of data migration (e.g. [37]). Important data should be spread evenly over the disks, it should be allocated to the fastest zones of the disks, and it could be stored with higher redundancy. The bulk of the data that is accessed...

66 | EVENODD: An optimal scheme for tolerating double disk failures in RAID architectures
- Blaum, Brady, et al.
- 1994

65 | Comparing random data allocation and data striping in multimedia servers
- Santos, Muntz
- 2000
Citation Context ...dundancy has first been considered in parallel computing for PRAM emulation [18] and online load balancing [6]. For scheduling disk accesses, these techniques have been used for multimedia applications [40, 41, 19, 24, 8, 36]. The methods described here are mostly a summary of four papers [35, 32, 33, 16]. Sect. 10 describes new results. There are many algorithms explicitly designed to work efficiently with coarse-grained blo...

64 | Simple randomized mergesort on parallel disks
- Barve, Grove, et al.
- 1997
Citation Context ... requirement are highly dynamic, automatic methods may even outperform the most careful manual assignment of data to disks. Load balancing by random placement of data is a well known technique (e.g., [7, 23]). Combining random placement and redundancy has first been considered in parallel computing for PRAM emulation [18] and online load balancing [6]. For scheduling disk accesses, these techniques have be...

59 | Balanced Allocations: The Heavily Loaded Case
- Berenbrink, Czumaj, et al.
- 2000
Citation Context ...est queue algorithm plans e for the disk with smaller load. Ties are broken arbitrarily. It can be shown that this algorithm produces a schedule that needs k = n/D + log ln D + Θ(1) expected I/O steps [9]. This is very good for large n but has an additive term that grows with the system size. [Fig. 4. A flow network showing how five requests are allocated to three d...

49 | Tolerating multiple failures in RAID architectures with optimal storage and uniform declustering
- Alvarez, Burkhard, et al.
- 1997

43 | When does a dynamic programming formulation guarantee the existence of a fully polynomial time approximation scheme (FPTAS)?
- Woeginger
Citation Context ...od approximation algorithms [3]. In particular, since we are dealing with a small constant number of partitions, fully polynomial time approximation schemes can be developed using standard techniques [44]. Maintaining reasonably balanced partitions while components enter (new hardware) or leave (failures) the system in an online fashion is a more complicated problem. In general, we will have to move c...

40 | Approximation schemes for scheduling
- Alon, Azar, et al.
- 1997
Citation Context ...t be assigned to the partitions one by one but in coarse grained units like controllers or even entire machines. Although this partitioning problem is NP-hard, there are good approximation algorithms [3]. In particular, since we are dealing with a small constant number of partitions, fully polynomial time approximation schemes can be developed using standard techniques [44]. Maintaining reasonably ba...

31 | New algorithms for the disk scheduling problem
- Andrews, Bender, et al.
- 1996
Citation Context ...just sort by track) or rotational delays only [39]. For both types of delays together we have an NP-hard variant of the traveling salesman problem with polynomial time solutions in some special cases [5]. 5 It can also be shown that longer execution times only happen with very small probability. Proof. (Outline) The optimal greedy writing algorithm dominates a "throttled" algorithm where in each I/O ...

30 | Random duplicate assignment: An alternative to striping in video servers
- Korst
- 1997
Citation Context ...dundancy has first been considered in parallel computing for PRAM emulation [18] and online load balancing [6]. For scheduling disk accesses, these techniques have been used for multimedia applications [40, 41, 19, 24, 8, 36]. The methods described here are mostly a summary of four papers [35, 32, 33, 16]. Sect. 10 describes new results. There are many algorithms explicitly designed to work efficiently with coarse-grained blo...

28 | RAMA: An easy-to-use, high-performance parallel file system
- Miller, Katz
- 1997
Citation Context ... requirement are highly dynamic, automatic methods may even outperform the most careful manual assignment of data to disks. Load balancing by random placement of data is a well known technique (e.g., [7, 23]). Combining random placement and redundancy has first been considered in parallel computing for PRAM emulation [18] and online load balancing [6]. For scheduling disk accesses, these techniques have be...

28 | An Optimality Proof of the LRU-K Page Replacement Algorithm
- O’Neil, O’Neil, et al.
- 1999
Citation Context ...ts, good routing strategies can be challenging in inhomogeneous dynamically changing networks. Caching can make actual disk accesses superfluous. This is a well understood topic for centralized memory [17, 26] but distributed caching faces interesting tradeoffs between communication overhead and cache hit rate. There are many more important issues with a different flavor such as locking mechanisms to coordinat...

25 | A parallel disk storage system for real-time multimedia applications
- Muntz, Santos, et al.
- 1998
Citation Context ...dundancy has first been considered in parallel computing for PRAM emulation [18] and online load balancing [6]. For scheduling disk accesses, these techniques have been used for multimedia applications [40, 41, 19, 24, 8, 36]. The methods described here are mostly a summary of four papers [35, 32, 33, 16]. Sect. 10 describes new results. There are many algorithms explicitly designed to work efficiently with coarse-grained blo...

20 | Duality between prefetching and queued writing with parallel disks
- Hutchinson, Sanders, et al.
- 2001
Citation Context ...e load balancing [6]. For scheduling disk accesses, these techniques have been used for multimedia applications [40, 41, 19, 24, 8, 36]. The methods described here are mostly a summary of four papers [35, 32, 33, 16]. Sect. 10 describes new results. There are many algorithms explicitly designed to work efficiently with coarse-grained block-wise access. Most use the model by Vitter and Shriver that allows identical pa...

16 | Reconciling simplicity and realism in parallel disk models
- Sanders
Citation Context ...e load balancing [6]. For scheduling disk accesses, these techniques have been used for multimedia applications [40, 41, 19, 24, 8, 36]. The methods described here are mostly a summary of four papers [35, 32, 33, 16]. Sect. 10 describes new results. There are many algorithms explicitly designed to work efficiently with coarse-grained block-wise access. Most use the model by Vitter and Shriver that allows identical pa...

13 | Asynchronous scheduling of redundant disk arrays
- Sanders
- 2000
Citation Context ...e load balancing [6]. For scheduling disk accesses, these techniques have been used for multimedia applications [40, 41, 19, 24, 8, 36]. The methods described here are mostly a summary of four papers [35, 32, 33, 16]. Sect. 10 describes new results. There are many algorithms explicitly designed to work efficiently with coarse-grained block-wise access. Most use the model by Vitter and Shriver that allows identical pa...

12 | Optimal Response Time Retrieval of Replicated Data
- Chen, Rotem
- 1994
Citation Context ...s. We will see that optimal schedules do not have this problem: we can do better by not committing our choices before we have seen all the requests. Optimal schedules can be found in polynomial time [12]: Suppose we want to find out whether k steps suffice to retrieve all requests. Consider a flow network [2] that consists of four layers: A source node in the first layer is connected to each of n request node...

11 | Block allocation in video servers for availability and throughput
- Tetzlaff, Flynn
- 1996

9 | Efficient PRAM simulation on a distributed memory machine
- Karp, Luby, Meyer auf der Heide
- 1992
Citation Context ...disks. Load balancing by random placement of data is a well known technique (e.g., [7, 23]). Combining random placement and redundancy has first been considered in parallel computing for PRAM emulation [18] and online load balancing [6]. For scheduling disk accesses, these techniques have been used for multimedia applications [40, 41, 19, 24, 8, 36]. The methods described here are mostly a summary of fo...

9 | Algorithms for Memory Hierarchies. Volume 2625 of LNCS Tutorial
- Meyer, Sanders, et al. (eds.)
- 2003
Citation Context ... Shriver that allows identical parallel disks and a fixed block size. Vitter [42] has written a good overview article. More overviews and several introductory articles are collected in an LNCS Tutorial [22]. 3 Write Buffering 3.1 Greedy Writing [Fig. 2. Optimal Writing.] Consider the implementation o...

7 | A New Algorithm for the Recognition of Series Parallel Graphs
- Schoenmakers
- 1995
Citation Context ...m, it can be shown that the requests can be retrieved in k steps if and only if there is no subset of disks such that more than jjk requested blocks have both their copies allocated to a disk in [38]. Hence, it suffices to show that it is unlikely that such an overloaded subset exists. This is a tractable problem mostly because the number of blocks allocated to is binomially distributed. ...

7 | Design and performance tradeoffs in clustered video servers
- Tewari, Mukherjee, et al.
- 1996

6 | Competitive analysis of paging
- Irani
- 1998
Citation Context ...ts, good routing strategies can be challenging in inhomogeneous dynamically changing networks. Caching can make actual disk accesses superfluous. This is a well understood topic for centralized memory [17, 26] but distributed caching faces interesting tradeoffs between communication overhead and cache hit rate. There are many more important issues with a different flavor such as locking mechanisms to coordinat...

5 | Fast concurrent access to parallel disks. Algorithmica, 35(1):21-55, 2003. A preliminary version appeared in SODA 2000
- Sanders, Egner, et al.

2 | Design of the PRESTO multimedia storage network
- Berenbrink, Brinkmann, et al.

1 | Complexity of retrieval problems
- Aerts, Korst, et al.
- 2000
Citation Context ...n the inner zones. We assume here that both copies of a block are stored on the same zone. The bad news is that it is strongly NP-hard to assign requests to disks so that the I/O time is minimized [1]. The good news is that optimal scheduling is still possible if we allow request to be split, i.e., we are allowed to combine a request from pieces read from both copies. We make the simplifying assum...

1 | A random multigraph process for linear time almost optimal RDA disk scheduling. Manuscript in preparation
- Cain, Sanders, et al.
- 2003
Citation Context ...ing large sets of blocks efficiently. Therefore it makes sense to look for fast algorithms that are close to optimal. Here we describe a linear time algorithm that produces very close to optimal solutions [11]. The selfless algorithm distinguishes between committed and uncommitted requests. Uncommitted requests still have a choice between two disks. Committed requests have decided for one of the two choices....

1 | On the near-optimality of the shortest-access-time drum scheduling discipline
- Stone, Fuller
Citation Context ...mall amount of additional memory removes most complications in that respect. 4 For infinite buffer size, the problem is easy if we look at seek times only (just sort by track) or rotational delays only [39]. For both types of delays together we have an NP-hard variant of the traveling salesman problem with polynomial time solutions in some special cases [5]. 5 It can also be shown that longer execution ...