FAB: Building distributed enterprise disk arrays from commodity components (2004)

by Y. Saito et al. - In Proc. of ASPLOS ’04, 2004
Results 1 - 10 of 123 citing documents

Dynamo: Amazon’s highly available key-value store

by Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, Werner Vogels - In Proc. SOSP, 2007
"... Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust. The Amazon.com platform, which provides services for many web sites ..."
Abstract - Cited by 684 (0 self) - Add to MetaCart
Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust. The Amazon.com platform, which provides services for many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously and the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems. This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
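
The object versioning the abstract mentions is realized in Dynamo with vector clocks, which detect conflicting writes. A minimal sketch of the comparison logic, with an assumed dict-of-counters clock representation (names are illustrative, not Dynamo’s API):

```python
def descends(a, b):
    """True if clock a is a causal successor of (or equal to) clock b,
    i.e. a has seen every update b has. Clocks map node id -> counter."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def reconcile(v1, v2, app_merge):
    """Keep the newer version when one dominates; otherwise the writes
    were concurrent, and the application must merge them -- the
    'application-assisted conflict resolution' the abstract mentions."""
    if descends(v1["clock"], v2["clock"]):
        return v1
    if descends(v2["clock"], v1["clock"]):
        return v2
    return app_merge(v1, v2)

# e.g. reconcile({"clock": {"A": 2}}, {"clock": {"B": 1}}, merge_carts)
# invokes merge_carts, because neither write saw the other.
```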

Citation Context

...o allows read and write operations to continue even during network partitions and resolves update conflicts using different conflict resolution mechanisms. Distributed block storage systems like FAB [18] split large objects into smaller blocks and store each block in a highly available manner. In comparison to these systems, a key-value store is more suitable in this case because: (a) it is int...

Ceph: A scalable, high-performance distributed file system

by Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, Carlos Maltzahn - In OSDI, 2006
"... Abstract We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for hetero ..."
Abstract - Cited by 275 (32 self) - Add to MetaCart
We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We leverage device intelligence by distributing data replication, failure detection and recovery to semi-autonomous OSDs running a specialized local object file system. A dynamic distributed metadata cluster provides extremely efficient metadata management and seamlessly adapts to a wide range of general purpose and scientific computing file system workloads. Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.

Citation Context

...of reads and file appends. Like Sorrento [26], it targets a narrow class of applications with non-POSIX semantics. Recently, many file systems and platforms, including Federated Array of Bricks (FAB) [23] and pNFS [9], have been designed around network attached storage [8]. Lustre [4], the Panasas file system [32], zFS [21], Sorrento, and Kybos [35] are based on the object-based storage paradigm [3] an...

Pip: Detecting the unexpected in distributed systems

by Patrick Reynolds, Charles Killian, Janet L. Wiener, Jeffrey C. Mogul, Mehul A. Shah, Amin Vahdat - In NSDI ’06: Proceedings of the 3rd Symposium on Networked Systems Design & Implementation, 2006
"... Bugs in distributed systems are often hard to find. Many bugs reflect discrepancies between a system’s behavior and the programmer’s assumptions about that behavior. We present Pip 1, an infrastructure for comparing actual behavior and expected behavior to expose structural errors and performance pr ..."
Abstract - Cited by 141 (7 self) - Add to MetaCart
Bugs in distributed systems are often hard to find. Many bugs reflect discrepancies between a system’s behavior and the programmer’s assumptions about that behavior. We present Pip, an infrastructure for comparing actual behavior and expected behavior to expose structural errors and performance problems in distributed systems. Pip allows programmers to express, in a declarative language, expectations about the system’s communications structure, timing, and resource consumption. Pip includes system instrumentation and annotation tools to log actual system behavior, and visualization and query tools for exploring expected and unexpected behavior. Pip allows a developer to quickly understand and debug both familiar and unfamiliar systems. We applied Pip to several applications, including FAB, SplitStream, Bullet, and RanSub. We generated most of the instrumentation for all four applications automatically. We found the needed expectations easy to write, starting in each case with automatically generated expectations. Pip found unexpected behavior in each application, and helped to isolate the causes of poor performance and incorrect behavior.
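
As a toy illustration of diffing expected behavior against actual behavior (Pip’s real declarative expectation language is much richer), a checker might validate the structure and timing of a logged event trace like this; all names here are illustrative:

```python
import re

def check(events, pattern, max_seconds):
    """'events' is a list of (name, timestamp) pairs logged by
    instrumentation; 'pattern' is a regex the ordered event names must
    match (structure); 'max_seconds' bounds end-to-end latency (timing).
    Returns a list of violations, empty when behavior met expectations."""
    trace = ",".join(name for name, _ in events)
    elapsed = events[-1][1] - events[0][1]
    violations = []
    if not re.fullmatch(pattern, trace):
        violations.append(f"structure: {trace!r} does not match {pattern!r}")
    if elapsed > max_seconds:
        violations.append(f"timing: {elapsed:.3f}s exceeds {max_seconds}s")
    return violations

# e.g. check([("request", 0.0), ("lookup", 0.4), ("reply", 0.9)],
#            r"request,(lookup,)+reply", max_seconds=1.0)  ->  []
```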

Citation Context

...xpectations are often more concise and readable than any other summary of system behavior, and bugs can be obvious just from reading them. We applied Pip to several distributed systems, including FAB [25], SplitStream [4], Bullet [13, 15], and RanSub [14]. Pip automatically generated most of the instrumentation for all four applications. We wrote expectations to uncover unexpected behavior, starting i...

Efficient replica maintenance for distributed storage systems

by Byung-Gon Chun, Frank Dabek, Andreas Haeberlen, Emil Sit, Hakim Weatherspoon, M. Frans Kaashoek, John Kubiatowicz, Robert Morris - In Proc. of NSDI, 2006
"... This paper considers replication strategies for storage systems that aggregate the disks of many nodes spread over the Internet. Maintaining replication in such systems can be prohibitively expensive, since every transient network or host failure could potentially lead to copying a server’s worth of ..."
Abstract - Cited by 122 (17 self) - Add to MetaCart
This paper considers replication strategies for storage systems that aggregate the disks of many nodes spread over the Internet. Maintaining replication in such systems can be prohibitively expensive, since every transient network or host failure could potentially lead to copying a server’s worth of data over the Internet to maintain replication levels. The following insights in designing an efficient replication algorithm emerge from the paper’s analysis. First, durability can be provided separately from availability; the former is less expensive to ensure and a more useful goal for many wide-area applications. Second, the focus of a durability algorithm must be to create new copies of data objects faster than permanent disk failures destroy the objects; careful choice of policies for what nodes should hold what data can decrease repair time. Third, increasing the number of replicas of each data object does not help a system tolerate a higher disk failure probability, but does help tolerate bursts of failures. Finally, ensuring that the system makes use of replicas that recover after temporary failure is critical to efficiency. Based on these insights, the paper proposes the Carbonite replication algorithm for keeping data durable at a low cost. A simulation of Carbonite storing 1 TB of data over a 365 day trace of PlanetLab activity shows that Carbonite is able to keep all data durable and uses 44% more network traffic than a hypothetical system that only responds to permanent failures. In comparison, Total Recall and DHash require almost a factor of two more network traffic than this hypothetical system.
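
The second insight, that copies must be created faster than permanent failures destroy them, can be sanity-checked with a small Monte Carlo model. The sketch and its parameters are purely illustrative, not Carbonite:

```python
import random

def fraction_lost(replicas=3, p_fail=0.01, repairs_per_step=1,
                  steps=10_000, trials=2_000):
    """Each time step, every live replica fails independently with
    probability p_fail (a permanent disk failure); the repair process
    then re-creates up to repairs_per_step lost replicas. Data is lost
    if the live count ever reaches zero. Durability holds only while
    the repair rate outpaces the aggregate failure rate."""
    lost = 0
    for _ in range(trials):
        live = replicas
        for _ in range(steps):
            live -= sum(random.random() < p_fail for _ in range(live))
            if live <= 0:
                lost += 1
                break
            live = min(replicas, live + repairs_per_step)
    return lost / trials
```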

Citation Context

...aim to continue operating despite some fixed number of failures and choose the number of replicas so that a voting algorithm can ensure correct updates in the presence of partitions or Byzantine failures [5, 17, 23, 24, 33]. FAB [33] and Chain Replication [38] both consider how the number of possible replica sets affects data durability. The two come to opposite conclusions: FAB recommends a small number of replica set...

Pyramid codes: Flexible schemes to trade space for access efficiency in reliable data storage systems

by Cheng Huang - In Proceedings of the IEEE International Symposium on Network Computing and Applications (NCA), 2007
"... We design flexible schemes to explore the tradeoffs between storage space and access efficiency in reliable data storage systems. Aiming at this goal, two new classes of erasure-resilient codes are introduced – Basic Pyramid Codes (BPC) and Generalized Pyramid Codes (GPC). Both schemes require sligh ..."
Abstract - Cited by 81 (9 self) - Add to MetaCart
We design flexible schemes to explore the tradeoffs between storage space and access efficiency in reliable data storage systems. Aiming at this goal, two new classes of erasure-resilient codes are introduced – Basic Pyramid Codes (BPC) and Generalized Pyramid Codes (GPC). Both schemes require slightly more storage space than conventional schemes, but significantly improve the critical performance of read during failures and unavailability. As a by-product, we establish a necessary matching condition to characterize the limit of failure recovery, that is, unless the matching condition is satisfied, a failure case is impossible to recover. In addition, we define a maximally recoverable (MR) property. For all ERC schemes holding the MR property, the matching condition becomes sufficient, that is, all failure cases satisfying the matching condition are indeed recoverable. We show that GPC is the first class of non-MDS schemes holding the MR property.

Citation Context

...ence between old and new data (called delta), 1 write to update the block, 3 reads of the 3 redundant blocks to compute deltas, and then 3 writes to update the redundant blocks [Aguilera et al. 2005; Saito et al. 2004]). In contrast, updating a data block in 3-replication systems requires three writes. Second, the performance of data access may suffer as well. In systematic ERC schemes, the original data is preserve...
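
The I/O count in this passage (1 read of the old block and 1 write of the new one, plus a read-modify-write of each of the 3 redundant blocks) can be sketched directly. Plain XOR parity stands in for the real code, which would scale the delta by a per-block coefficient:

```python
def xor(a, b):
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(data, parities, idx, new_block):
    """In-place update of one data block in an erasure-coded stripe:
    8 disk I/Os when there are 3 redundant blocks, versus 3 plain
    writes under 3-way replication."""
    delta = xor(data[idx], new_block)          # 1 read of the old block
    data[idx] = new_block                      # 1 write of the new block
    for i in range(len(parities)):             # for each of 3 parities:
        parities[i] = xor(parities[i], delta)  # 3 reads + 3 writes
```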

Jockey: A user-space library for record-replay debugging

by Yasushi Saito - In AADEBUG ’05: Proceedings of the Sixth International Symposium on Automated Analysis-Driven Debugging, 2005
"... Jockey is an execution record/replay tool for debugging Linux programs. It records invocations of system calls and CPU instructions with timing-dependent effects and later replays them deterministically. It supports process checkpointing to diagnose long-running programs efficiently. Jockey is imple ..."
Abstract - Cited by 77 (0 self) - Add to MetaCart
Jockey is an execution record/replay tool for debugging Linux programs. It records invocations of system calls and CPU instructions with timing-dependent effects and later replays them deterministically. It supports process checkpointing to diagnose long-running programs efficiently. Jockey is implemented as a shared-object file that runs as a part of the target process. While this design is the key for achieving Jockey’s goal of safety and ease of use, it also poses challenges. This paper discusses some of the practical issues we needed to overcome in such environments, including low-overhead system-call interception, techniques for segregating resource usage between Jockey and the target process, and an interface for fine-grain control of Jockey’s behavior.
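
At its core, record/replay means logging every nondeterministic input on the first run and feeding the log back on later runs. A toy user-space sketch of that idea follows; Jockey itself intercepts Linux system calls and timing-dependent instructions inside the target process, and the Tape class here is purely illustrative:

```python
import json

class Tape:
    """Record results of nondeterministic calls, then replay them so a
    later run of the program sees identical inputs and behaves
    deterministically."""
    def __init__(self, mode, path="trace.json"):
        self.mode, self.path = mode, path
        self.log = json.load(open(path)) if mode == "replay" else []

    def call(self, fn, *args):
        if self.mode == "replay":
            return self.log.pop(0)   # hand back the recorded result
        result = fn(*args)           # execute for real...
        self.log.append(result)      # ...and remember the outcome
        return result

    def save(self):
        if self.mode == "record":
            json.dump(self.log, open(self.path, "w"))

# import time; tape = Tape("record"); t = tape.call(time.time); tape.save()
# A later Tape("replay") run returns the very same timestamp.
```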

Citation Context

...Jockey was originally developed as a debugging aid for FAB (Federated Array of Bricks) [23]. FAB is a high-availability disk array built on a cluster of commodity servers. It provides access to logical volumes to iSCSI clients using complex peer-to-peer-style replication and erasure-codin...

Ursa Minor: versatile cluster-based storage

by Michael Abd-El-Malek, William V. Courtright II, Chuck Cranor, Gregory R. Ganger, James Hendricks, Andrew J. Klosterman, Michael Mesnier, Manish Prasad, Brandon Salmon, Raja R. Sambasivan, Shafeeq Sinnamohideen, John D. Strunk, Eno Thereska, Matthew Wachs, Jay J. Wylie - In Proc. of FAST, 2005
"... No single encoding scheme or fault model is optimal for all data. A versatile storage system allows them to be matched to access patterns, reliability requirements, and cost goals on a per-data item basis. Ursa Minor is a cluster-based storage system that allows data-specific selection of, and on-li ..."
Abstract - Cited by 76 (37 self) - Add to MetaCart
No single encoding scheme or fault model is optimal for all data. A versatile storage system allows them to be matched to access patterns, reliability requirements, and cost goals on a per-data item basis. Ursa Minor is a cluster-based storage system that allows data-specific selection of, and on-line changes to, encoding schemes and fault models. Thus, different data types can share a scalable storage infrastructure and still enjoy specialized choices, rather than suffering from “one size fits all.” Experiments with Ursa Minor show performance benefits of 2–3× when using specialized choices as opposed to a single, more general, configuration. Experiments also show that a single cluster supporting multiple workloads simultaneously is much more efficient when the choices are specialized for each distribution rather than forced to use a “one size fits all” configuration. When using the specialized distributions, aggregate cluster throughput nearly doubled.

Citation Context

...e years. Petal [23], xFS [2], and NASD [13] are early systems that laid the groundwork for today’s cluster-based storage designs, including Ursa Minor’s. More recent examples include FARSITE [1], FAB [34], EMC’s Centera [8], EqualLogic’s PS series product [9], Lustre [24], Panasas’ ActiveScale Storage Cluster [26], and the Google file system [12]. All of these systems provide the incremental scalabili...

Pergamum: Replacing tape with energy efficient, reliable, disk-based archival storage

by Mark W. Storer, Kevin M. Greenan, Ethan L. Miller, Kaladhar Voruganti - In FAST ’08: 6th USENIX Conference on File and Storage Technologies, 2008
"... As the world moves to digital storage for archival purposes, there is an increasing demand for reliable, lowpower, cost-effective, easy-to-maintain storage that can still provide adequate performance for information retrieval and auditing purposes. Unfortunately, no current archival system adequatel ..."
Abstract - Cited by 64 (14 self) - Add to MetaCart
As the world moves to digital storage for archival purposes, there is an increasing demand for reliable, low-power, cost-effective, easy-to-maintain storage that can still provide adequate performance for information retrieval and auditing purposes. Unfortunately, no current archival system adequately fulfills all of these requirements. Tape-based archival systems suffer from poor random access performance, which prevents the use of inter-media redundancy techniques and auditing, and requires the preservation of legacy hardware. Many disk-based systems are ill-suited for long-term storage because their high energy demands and management requirements make them cost-ineffective for archival purposes. Our solution, Pergamum, is a distributed network of intelligent, disk-based, storage appliances that stores data reliably and energy-efficiently. While existing MAID systems keep disks idle to save energy, Pergamum adds NVRAM at each node to store data signatures, metadata, and other small items, allowing deferred writes, metadata requests and inter-disk data verification to be performed while the disk is powered off. Pergamum uses both intra-disk and inter-disk redundancy to guard against data loss, relying on hash tree-like structures of algebraic signatures to efficiently verify the correctness of stored data. If failures occur, Pergamum uses staggered rebuild to reduce peak energy usage while rebuilding large redundancy stripes. We show that our approach is comparable in both startup and ongoing costs to other archival technologies and provides very high reliability. An evaluation of our implementation of Pergamum shows that it provides adequate performance.
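
The hash tree-like verification structure can be pictured as a standard Merkle tree; Pergamum actually builds it from algebraic signatures, so SHA-256 below is only a stand-in for illustration:

```python
import hashlib

def tree_root(blocks):
    """Root digest over a disk's blocks. Two nodes compare roots, then
    recurse into whichever subtree differs, locating a corrupted block
    while exchanging only O(log n) digests -- and only the small
    digests need to live in NVRAM while the disk is spun down."""
    level = [hashlib.sha256(b).digest() for b in blocks]
    while len(level) > 1:
        if len(level) % 2:            # duplicate the last node if odd
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]
```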

Citation Context

...iffers from that of existing MAID systems, which still have centralized controllers. Instead, our system, Pergamum, takes an approach similar to that used in high-performance scalable storage systems [36, 46, 48], and is built from thousands of intelligent storage appliances connected by high-speed networks that cooperatively provide reliable, efficient, long-term storage. Each appliance, called a Pergamum to...

CRUSH: Controlled, scalable, decentralized placement of replicated data

by Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Carlos Maltzahn - In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC ’06), 2006
"... Emerging large-scale distributed storage systems are faced with the task of distributing petabytes of data among tens or hundreds of thousands of storage devices. Such systems must evenly distribute data and workload to efficiently utilize available resources and maximize system performance, while f ..."
Abstract - Cited by 53 (14 self) - Add to MetaCart
Emerging large-scale distributed storage systems are faced with the task of distributing petabytes of data among tens or hundreds of thousands of storage devices. Such systems must evenly distribute data and workload to efficiently utilize available resources and maximize system performance, while facilitating system growth and managing hardware failures. We have developed CRUSH, a scalable pseudorandom data distribution function designed for distributed object-based storage systems that efficiently maps data objects to storage devices without relying on a central directory. Because large systems are inherently dynamic, CRUSH is designed to facilitate the addition and removal of storage while minimizing unnecessary data movement. The algorithm accommodates a wide variety of data replication and reliability mechanisms and distributes data in terms of user-defined policies that enforce separation of replicas across failure domains.
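
A directory-free placement function with these properties can be sketched with rendezvous (highest-random-weight) hashing. This shows the idea only, not the actual CRUSH algorithm, which additionally handles device weights and failure-domain hierarchies:

```python
import hashlib

def place(obj_id: str, devices: list[str], replicas: int = 3) -> list[str]:
    """Any client that knows the device list computes the same replica
    set with no central directory; adding or removing a device moves
    only the objects that now hash highest to it."""
    def score(dev: str) -> bytes:
        return hashlib.sha256(f"{obj_id}/{dev}".encode()).digest()
    return sorted(devices, key=score, reverse=True)[:replicas]

# place("pg.17", [f"osd{i}" for i in range(10)]) -> three device names,
# identical on every node that evaluates it.
```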

Low-overhead Byzantine fault-tolerant storage

by James Hendricks - In SOSP, 2007
"... This paper presents an erasure-coded Byzantine fault-tolerant block storage protocol that is nearly as efficient as protocols that tolerate only crashes. Previous Byzantine fault-tolerant block storage protocols have either relied upon replication, which is inefficient for large blocks of data when ..."
Abstract - Cited by 51 (1 self) - Add to MetaCart
This paper presents an erasure-coded Byzantine fault-tolerant block storage protocol that is nearly as efficient as protocols that tolerate only crashes. Previous Byzantine fault-tolerant block storage protocols have either relied upon replication, which is inefficient for large blocks of data when tolerating multiple faults, or a combination of additional servers, extra computation, and versioned storage. To avoid these expensive techniques, our protocol employs novel mechanisms to optimize for the common case when faults and concurrency are rare. In the common case, a write operation completes in two rounds of communication and a read completes in one round. The protocol requires a short checksum comprised of cryptographic hashes and homomorphic fingerprints. It achieves throughput within 10% of the crash-tolerant protocol for writes and reads in failure-free runs when configured to tolerate up to 6 faulty servers and any number of faulty clients.
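
The short checksum of cryptographic hashes can be pictured as a cross-checksum over the erasure-coded fragments; a rough sketch under that assumption (the homomorphic fingerprints, which additionally let readers verify that fragments are consistent with a single codeword, are omitted here):

```python
import hashlib

def cross_checksum(fragments):
    """One hash per fragment, stored alongside every fragment, so a
    reader can reject a fragment that a Byzantine server altered
    without extra communication rounds."""
    return [hashlib.sha256(f).hexdigest() for f in fragments]

def fragment_ok(fragment, index, checksum):
    """Validate one fragment returned by server 'index' against the
    cross-checksum retrieved with the read."""
    return hashlib.sha256(fragment).hexdigest() == checksum[index]
```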

Citation Context

...width [37, 38]. A few distributed storage systems support erasure coding. For example, Zebra [19], xFS [4], and PanFS [29] support parity-based protection of data striped across multiple servers. FAB [34], Ursa Minor [3], and RepStore [40] support more general m-of-n erasure coding. A common assumption is that tolerance of crash faults is sufficient for distributed storage syst...
