Results 1 - 10
of
11
Dynamo: amazon’s highly available key-value store
- IN PROC. SOSP
, 2007
"... Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust. The Amazon.com platform, which provides services for many web sites ..."
Abstract
-
Cited by 194 (0 self)
- Add to MetaCart
Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust. The Amazon.com platform, which provides services for many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously and the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems.
This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on ” experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
Nysiad: Practical protocol transformation to tolerate byzantine failures
- In NSDI
, 2008
"... The paper presents and evaluates Nysiad, 1 a system that implements a new technique for transforming a scalable distributed system or network protocol tolerant only of crash failures into one that tolerates arbitrary failures, including such failures as freeloading and malicious attacks. The techniq ..."
Abstract
-
Cited by 9 (1 self)
- Add to MetaCart
The paper presents and evaluates Nysiad, 1 a system that implements a new technique for transforming a scalable distributed system or network protocol tolerant only of crash failures into one that tolerates arbitrary failures, including such failures as freeloading and malicious attacks. The technique assigns to each host a certain number of guard hosts, optionally chosen from the available collection of hosts, and assumes that no more than a configurable number of guards of a host are faulty. Nysiad then enforces that a host either follows the system’s protocol and handles all its inputs fairly, or ceases to produce output messages altogether—a behavior that the system tolerates. We have applied Nysiad to a link-based routing protocol and an overlay multicast protocol, and present measurements of running the resulting protocols on a simulated network. 1
Cumulus: Filesystem Backup to the Cloud
"... In this paper we describe Cumulus, a system for efficiently implementing filesystem backups over the Internet. Cumulus is specifically designed under a thin cloud assumption—that the remote datacenter storing the backups does not provide any special backup services, but only provides a least-common- ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
In this paper we describe Cumulus, a system for efficiently implementing filesystem backups over the Internet. Cumulus is specifically designed under a thin cloud assumption—that the remote datacenter storing the backups does not provide any special backup services, but only provides a least-common-denominator storage interface (i.e., get and put of complete files). Cumulus aggregates data from small files for remote storage, and uses LFS-inspired segment cleaning to maintain storage efficiency. Cumulus also efficiently represents incremental changes, including edits to large files. While Cumulus can use virtually any storage service, we show that its efficiency is comparable to integrated approaches. 1
Enabling Security in Cloud Storage SLAs with CloudProof
- In Proc. USENIX ATC
, 2011
"... Several cloud storage systems exist today, but none of them provide security guarantees in their Service Level Agreements (SLAs). This lack of security support has been a major hurdle for the adoption of cloud services, especially for enterprises and cautious consumers. To fix this issue, we present ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
Several cloud storage systems exist today, but none of them provide security guarantees in their Service Level Agreements (SLAs). This lack of security support has been a major hurdle for the adoption of cloud services, especially for enterprises and cautious consumers. To fix this issue, we present CloudProof, a secure storage system specifically designed for the cloud. In CloudProof, customers can not only detect violations of integrity, write-serializability, and freshness, they can also prove the occurrence of these violations to a third party. This proof-based system is critical to enabling security guarantees in SLAs, wherein clients pay for a desired level of security and are assured they will receive a certain compensation in the event of cloud misbehavior. Furthermore, since CloudProof aims to scale to the size of large enterprises, we delegate as much work as possible to the cloud and use cryptographic tools to allow customers to detect and prove cloud misbehavior. Our evaluation of CloudProof indicates that its security mechanisms have a reasonable cost: they incur a latency overhead of only ∼15 % on reads and writes, and reduce throughput by around 10%. We also achieve highly scalable access control, with membership management (addition and removal of members ’ permissions) for a large proprietary software with more than 5000 developers taking only a few seconds per month. 1
Tiered Fault Tolerance for Long-Term Integrity
"... Fault-tolerant services typically make assumptions about the type and maximum number of faults that they can tolerate while providing their correctness guarantees; when such a fault threshold is violated, correctness is lost. We revisit the notion of fault thresholds in the context of long-term arch ..."
Abstract
-
Cited by 4 (2 self)
- Add to MetaCart
Fault-tolerant services typically make assumptions about the type and maximum number of faults that they can tolerate while providing their correctness guarantees; when such a fault threshold is violated, correctness is lost. We revisit the notion of fault thresholds in the context of long-term archival storage. We observe that fault thresholds are inevitably violated in longterm services, making traditional fault tolerance inapplicable to the long-term. In this work, we undertake a “reallocation of the fault-tolerance budget ” of a long-term service. We split the service into service pieces, each of which can tolerate a different number of faults without failing (and without causing the whole service to fail): each piece can be either in a critical trusted fault tier, which must never fail, or an untrusted fault tier, which can fail massively and often, or other fault tiers in between. By carefully engineering the split of a long-term service into pieces that must obey distinct fault thresholds, we can prolong its inevitable demise. We demonstrate this approach with Bonafide, a long-term key-value store that, unlike all similar systems proposed in the literature, maintains integrity in the face of Byzantine faults without requiring self-certified data. We describe the notion of tiered fault tolerance, the design, implementation, and experimental evaluation of Bonafide, and argue that our approach is a practical yet significant improvement over the state of the art for long-term services. 1
Smoke and Mirrors: Reflecting Files at a Geographically Remote Location Without Loss of Performance
"... The Smoke and Mirrors File System (SMFS) mirrors files at geographically remote datacenter locations with negligible impact on file system performance at the primary site, and minimal degradation as a function of link latency. It accomplishes this goal using wide-area links that run at extremely hig ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
The Smoke and Mirrors File System (SMFS) mirrors files at geographically remote datacenter locations with negligible impact on file system performance at the primary site, and minimal degradation as a function of link latency. It accomplishes this goal using wide-area links that run at extremely high speeds, but have long round-trip-time latencies—a combination of properties that poses problems for traditional mirroring solutions. In addition to its raw speed, SMFS maintains good synchronization: should the primary site become completely unavailable, the system minimizes loss of work, even for applications that simultaneously update groups of files. We present the SMFS design, then evaluate the system on Emulab and the Cornell National Lambda Rail (NLR) Ring testbed. Intended applications include wide-area file sharing and remote backup for disaster recovery. 1
Lithium: Virtual Machine Storage for the Cloud
"... To address the limitations of centralized shared storage for cloud computing, we are building Lithium, a distributed storage system designed specifically for virtualization workloads running in large-scale data centers and clouds. Lithium aims to be scalable, highly available, and compatible with co ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
To address the limitations of centralized shared storage for cloud computing, we are building Lithium, a distributed storage system designed specifically for virtualization workloads running in large-scale data centers and clouds. Lithium aims to be scalable, highly available, and compatible with commodity hardware and existing application software. The design of Lithium borrows ideas and techniques originating from research into Byzantine Fault Tolerance systems and popularized by distributed version control software, and demonstrates their practical applicability to the performancesensitive problem of VM hosting. To our initial surprise, we have found that seemingly expensive techniques such as versioned storage and incremental hashing can lead to a system that is not only more robust to data corruption and host failures, but also often faster than naïve approaches and, for a relatively small cluster of just eight hosts, performs well compared with an enterprise-class Fibre Channel disk array.
DeltaFS
, 2009
"... DeltaFS combines a read-only network filesystem with a mechanism for storing local changes. It is intended for use on limited capacity devices with good net connectivity, such as netbooks, mobile devices, and virtual machines. The local footprint of the filesystem is proportionate to the volume of c ..."
Abstract
- Add to MetaCart
DeltaFS combines a read-only network filesystem with a mechanism for storing local changes. It is intended for use on limited capacity devices with good net connectivity, such as netbooks, mobile devices, and virtual machines. The local footprint of the filesystem is proportionate to the volume of changes made; so long as most files are unmodified, the local user can be given the illusion of having access to a very large drive. 1
Storing and Managing Data in a Distributed Hash Table
, 2008
"... Distributed hash tables (DHTs) have been proposed as a generic, robust storage infrastructure for simplifying the construction of large-scale, wide-area applications. For example, UsenetDHT is a new design for Usenet News developed in this thesis that uses a DHT to cooperatively deliver Usenet artic ..."
Abstract
- Add to MetaCart
Distributed hash tables (DHTs) have been proposed as a generic, robust storage infrastructure for simplifying the construction of large-scale, wide-area applications. For example, UsenetDHT is a new design for Usenet News developed in this thesis that uses a DHT to cooperatively deliver Usenet articles: the DHT allows a set of N hosts to share storage of Usenet articles, reducing their combined storage requirements by a factor of O(N). Usenet generates a continuous stream of writes that exceeds 1 Tbyte/day in volume, comprising over ten million writes. Supporting this and the associated read workload requires a DHT engineered for durability and efficiency. Recovering from network and machine failures efficiently poses a challenge for DHT replication maintenance algorithms that provide durability. To avoid losing the last replica, replica maintenance must create additional replicas when failures are detected. However,

