A Fresh Look at the Reliability of Long-term Digital Storage
, 2006
Cited by 34 (4 self)
Emerging Web services, such as email, photo sharing, and web site archives, must preserve large volumes of quickly accessible data indefinitely into the future. The costs of doing so often determine whether the service is economically viable. We make the case that these applications' demands on large scale storage systems over long time horizons require us to reevaluate traditional system designs. We examine threats to long-lived data from an end-to-end perspective, taking into account not just hardware and software faults but also faults due to humans and organizations. We present a simple model of long-term storage failures that helps us reason about various strategies for addressing some of these threats. Using this model we show that the most important strategies for increasing the reliability of long-term storage are detecting latent faults quickly, automating fault repair to make it cheaper and faster, and increasing the independence of data replicas.
Pergamum: Replacing tape with energy efficient, reliable, disk-based archival storage
- In FAST-2008: 6th Usenix Conference on File and Storage Technologies
, 2008
Cited by 31 (11 self)
As the world moves to digital storage for archival purposes, there is an increasing demand for reliable, low-power, cost-effective, easy-to-maintain storage that can still provide adequate performance for information retrieval and auditing purposes. Unfortunately, no current archival system adequately fulfills all of these requirements. Tape-based archival systems suffer from poor random access performance, which prevents the use of inter-media redundancy techniques and auditing, and requires the preservation of legacy hardware. Many disk-based systems are ill-suited for long-term storage because their high energy demands and management requirements make them cost-ineffective for archival purposes. Our solution, Pergamum, is a distributed network of intelligent, disk-based, storage appliances that stores data reliably and energy-efficiently. While existing MAID systems keep disks idle to save energy, Pergamum adds NVRAM at each node to store data signatures, metadata, and other small items, allowing deferred writes, metadata requests and inter-disk data verification to be performed while the disk is powered off. Pergamum uses both intra-disk and inter-disk redundancy to guard against data loss, relying on hash tree-like structures of algebraic signatures to efficiently verify the correctness of stored data. If failures occur, Pergamum uses staggered rebuild to reduce peak energy usage while rebuilding large redundancy stripes. We show that our approach is comparable in both startup and ongoing costs to other archival technologies and provides very high reliability. An evaluation of our implementation of Pergamum shows that it provides adequate performance.
Taper: Tiered approach for eliminating redundancy in replica synchronization
- In USENIX Conference on File and Storage Technologies
, 2005
Cited by 19 (0 self)
We present TAPER, a scalable data replication protocol that synchronizes a large collection of data across multiple geographically distributed replica locations. TAPER can be applied to a broad range of systems, such as software distribution mirrors, content distribution networks, backup and recovery, and federated file systems. TAPER is designed to be bandwidth efficient, scalable and content-based, and it does not require prior knowledge of the replica state. To achieve these properties, TAPER provides: i) four pluggable redundancy elimination phases that balance the trade-off between bandwidth savings and computation overheads, ii) a hierarchical hash tree based directory pruning phase that quickly matches identical data from the granularity of directory trees to individual files, iii) a content-based similarity detection technique using Bloom filters to identify similar files, and iv) a combination of coarse-grained chunk matching with finer-grained block matches to achieve bandwidth efficiency. Through extensive experiments on various datasets, we observe that in comparison with rsync, a widely-used directory synchronization tool, TAPER reduces bandwidth by 15% to 71%, performs faster matching, and scales to a larger number of replicas.
POTSHARDS: secure long-term storage without encryption
- In Proceedings of the 2007 USENIX Annual Technical Conference
, 2007
Cited by 14 (5 self)
Users are storing ever-increasing amounts of information digitally, driven by many factors including government regulations and the public’s desire to digitally record their personal histories. Unfortunately, many of the security mechanisms that modern systems rely upon, such as encryption, are poorly suited for storing data for indefinitely long periods of time: it is very difficult to manage keys and update cryptosystems to provide secrecy through encryption over periods of decades. Worse, an adversary who can compromise an archive need only wait for cryptanalysis techniques to catch up to the encryption algorithm used at the time of the compromise in order to obtain “secure” data. To address these concerns, we have developed POTSHARDS, an archival storage system that provides long-term security for data with very long lifetimes without using encryption. Secrecy is achieved by using provably secure secret splitting and spreading the resulting shares across separately-managed archives. Providing availability and data recovery in such a system can be difficult; thus, we use a new technique, approximate pointers, in conjunction with secure distributed RAID techniques to provide availability and reliability across independent archives. To validate our design, we developed a prototype POTSHARDS implementation, which has demonstrated “normal” storage and retrieval of user data using indexes, the recovery of user data using only the pieces a user has stored across the archives and the reconstruction of an entire failed archive.
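The secret-splitting idea in the POTSHARDS abstract can be illustrated with a minimal sketch. It assumes the simplest scheme, XOR splitting, in which all n shares are needed to rebuild the data; the abstract does not say which splitting scheme POTSHARDS uses, and the function names below are hypothetical.

```python
import secrets


def split_secret(secret: bytes, n: int) -> list[bytes]:
    """Split `secret` into n XOR shares; all n are required to rebuild it.

    Assumption: plain XOR splitting, chosen for illustration only.
    """
    if n < 2:
        raise ValueError("need at least two shares")
    # The first n-1 shares are uniformly random bytes.
    shares = [secrets.token_bytes(len(secret)) for _ in range(n - 1)]
    # The last share is the secret XORed with every random share.
    last = bytearray(secret)
    for share in shares:
        for i, b in enumerate(share):
            last[i] ^= b
    shares.append(bytes(last))
    return shares


def join_shares(shares: list[bytes]) -> bytes:
    """XOR all shares together to recover the original secret."""
    out = bytearray(len(shares[0]))
    for share in shares:
        for i, b in enumerate(share):
            out[i] ^= b
    return bytes(out)
```

In this sketch each share would go to a separately managed archive, so an attacker holding fewer than n shares sees only random-looking bytes.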
Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality
Cited by 12 (1 self)
We present sparse indexing, a technique that uses sampling and exploits the inherent locality within backup streams to solve for large-scale backup (e.g., hundreds of terabytes) the chunk-lookup disk bottleneck problem that inline, chunk-based deduplication schemes face. The problem is that these schemes traditionally require a full chunk index, which indexes every chunk, in order to determine which chunks have already been stored; unfortunately, at scale it is impractical to keep such an index in RAM and a disk-based index with one seek per incoming chunk is far too slow. We perform stream deduplication by breaking up an incoming stream into relatively large segments and deduplicating each segment against only a few of the most similar previous segments. To identify similar segments, we use sampling and a sparse index. We choose a small portion of the chunks in the stream as samples; our sparse index maps these samples to the existing segments in which they occur. Thus, we avoid the need for a full chunk index. Since only the sampled chunks’ hashes are kept in RAM and the sampling rate is low, we dramatically reduce the RAM to disk ratio for effective deduplication. At the same time, only a few seeks are required per segment so the chunk-lookup disk bottleneck is avoided. Sparse indexing has recently been incorporated into a number of Hewlett-Packard backup products.
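The sampling step described in the sparse indexing abstract can be sketched roughly as follows. The sampling rule (low hash bits equal to zero), the vote-counting choice of similar segments, and all class and function names are assumptions made for illustration; the abstract does not specify them.

```python
import hashlib
from collections import Counter, defaultdict

SAMPLE_BITS = 2  # assumed rate: keep hashes whose low 2 bits are zero (about 1 in 4)


def chunk_hash(chunk: bytes) -> bytes:
    return hashlib.sha1(chunk).digest()


def is_sample(h: bytes) -> bool:
    return h[-1] & ((1 << SAMPLE_BITS) - 1) == 0


class SparseIndex:
    def __init__(self):
        # Sampled hash -> ids of segments containing it (the part kept in RAM).
        self.sample_to_segments = defaultdict(list)
        # Segment id -> all chunk hashes (stands in for on-disk segment data).
        self.segments = {}

    def similar_segments(self, hashes, max_segments=2):
        """Pick the stored segments sharing the most sampled hashes."""
        votes = Counter()
        for h in hashes:
            if is_sample(h):
                votes.update(self.sample_to_segments.get(h, ()))
        return [seg for seg, _ in votes.most_common(max_segments)]

    def store_segment(self, seg_id, chunks):
        """Deduplicate against a few similar segments; return new chunk count."""
        hashes = [chunk_hash(c) for c in chunks]
        known = set()
        for seg in self.similar_segments(hashes):
            known |= self.segments[seg]
        new = {h for h in hashes if h not in known}
        self.segments[seg_id] = set(hashes)
        for h in set(hashes):
            if is_sample(h):
                self.sample_to_segments[h].append(seg_id)
        return len(new)
```

Only the sampled hashes live in the in-memory map, so a duplicate chunk can be missed when its segment is not among those selected; that is the trade-off the abstract describes in exchange for a much smaller index.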
BlobSeer: How to enable efficient versioning for large object storage under heavy access concurrency
- In Data Management in Peer-to-Peer Systems, St-Petersburg
, 2009
Cited by 7 (5 self)
To accommodate the needs of large-scale distributed P2P systems, scalable data management strategies are required, allowing applications to efficiently cope with continuously growing, highly distributed data. This paper addresses the problem of efficiently storing and accessing very large binary data objects (blobs). It proposes an efficient versioning scheme allowing a large number of clients to concurrently read, write and append data to huge blobs that are fragmented and distributed at a very large scale. Scalability under heavy concurrency is achieved thanks to an original metadata scheme, based on a distributed segment tree built on top of a Distributed Hash Table (DHT). Our approach has been implemented and experimented within our BlobSeer prototype on the Grid’5000 testbed, using up to 175 nodes.
Extreme binning: Scalable, parallel deduplication for chunk-based file backup
- In Proceedings of the 17th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)
, 2009
Cited by 6 (0 self)
Data deduplication is an essential and critical component of backup systems. Essential, because it reduces storage space requirements, and critical, because the performance of the entire backup operation depends on its throughput. Traditional backup workloads consist of large data streams with high locality, which existing deduplication techniques require to provide reasonable throughput. We present Extreme Binning, a scalable deduplication technique for non-traditional backup workloads that are made up of individual files with no locality among consecutive files in a given window of time. Due to lack of locality, existing techniques perform poorly on these workloads. Extreme Binning exploits file similarity instead of locality, and makes only one disk access for chunk lookup per file, which gives reasonable throughput. Multi-node backup systems built with Extreme Binning scale gracefully with the amount of input data; more backup nodes can be added to boost throughput. Each file is allocated using a stateless routing algorithm to only one node, allowing for maximum parallelization, and each backup node is autonomous with no dependency across nodes, making data management tasks robust with low overhead.
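One way to picture the stateless routing mentioned in the Extreme Binning abstract is the sketch below. It assumes each file is represented by its smallest chunk hash and that the node is chosen by taking that hash modulo the node count; neither detail is stated in the abstract, and the function names are hypothetical.

```python
import hashlib


def representative_id(chunks: list[bytes]) -> bytes:
    """Assumed representative: the minimum chunk hash of the file.

    Files sharing many chunks are likely to share this minimum,
    so similar files tend to get the same representative.
    """
    return min(hashlib.sha1(c).digest() for c in chunks)


def route_file(chunks: list[bytes], num_nodes: int) -> int:
    """Map a file to one backup node using only the file's own content."""
    return int.from_bytes(representative_id(chunks), "big") % num_nodes
```

Because the result depends only on the file's chunks, no node needs to consult shared state before accepting a file, which matches the autonomy property the abstract claims.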
Providing high reliability in a minimum redundancy archival storage system
- Proc. 14th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems
, 2006
Cited by 5 (2 self)
Inter-file compression techniques store files as sets of references to data objects or chunks that can be shared among many files. While these techniques can achieve much better compression ratios than conventional intra-file compression methods such as Lempel-Ziv compression, they also reduce the reliability of the storage system because the loss of a few critical chunks can lead to the loss of many files. We show how to eliminate this problem by choosing for each chunk a replication level that is a function of the amount of data that would be lost if that chunk were lost. Experiments using actual archival data show that our technique can achieve significantly higher robustness than a conventional approach combining data mirroring and intra-file compression while requiring about half the storage space.
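The per-chunk replication rule in this abstract can be sketched as below. The abstract only says the level is a function of the data that would be lost; the logarithmic function, the cap, and the parameter names here are assumptions picked for illustration.

```python
import math
from collections import defaultdict


def replication_levels(files, base_copies=1, max_copies=4, unit=1024):
    """Return chunk id -> number of copies to keep.

    files: dict mapping file name -> (size_bytes, iterable of chunk ids).
    Assumption: one extra copy per doubling of the bytes at risk,
    measured in multiples of `unit` and capped at `max_copies`.
    """
    # Bytes of file data that become unreadable if a given chunk is lost.
    at_risk = defaultdict(int)
    for size, chunk_ids in files.values():
        for cid in set(chunk_ids):
            at_risk[cid] += size

    levels = {}
    for cid, lost_bytes in at_risk.items():
        extra = int(math.log2(max(lost_bytes / unit, 1)))
        levels[cid] = min(base_copies + extra, max_copies)
    return levels
```

With three files where one chunk is shared by all of them, that shared chunk receives the most copies while chunks used by a single small file stay at the base level.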
Secure Data Deduplication
- STORAGESS'08
, 2008
Cited by 5 (3 self)
As the world moves to digital storage for archival purposes, there is an increasing demand for systems that can provide secure data storage in a cost-effective manner. By identifying common chunks of data both within and between files and storing them only once, deduplication can yield cost savings by increasing the utility of a given amount of storage. Unfortunately, deduplication exploits identical content, while encryption attempts to make all content appear random; the same content encrypted with two different keys results in very different ciphertext. Thus, combining the space efficiency of deduplication with the secrecy aspects of encryption is problematic. We have developed a solution that provides both data security and space efficiency in single-server storage and distributed storage systems. Encryption keys are generated in a consistent manner from the chunk data; thus, identical chunks will always encrypt to the same ciphertext. Furthermore, the keys cannot be deduced from the encrypted chunk data. Since the information each user needs to access and decrypt the chunks that make up a file is encrypted using a key known only to the user, even a full compromise of the system cannot reveal which chunks are used by which users.
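The idea of deriving the encryption key from the chunk itself can be sketched as follows. Two things here are assumptions: the key is taken to be a SHA-256 hash of the chunk, and the cipher is a toy hash-based keystream used only to keep the example self-contained. The toy cipher is not secure; a real system would use a vetted cipher.

```python
import hashlib


def derive_key(chunk: bytes) -> bytes:
    """Assumed key derivation: hash of the chunk contents."""
    return hashlib.sha256(chunk).digest()


def _keystream(key: bytes, length: int) -> bytes:
    """Toy keystream from SHA-256 in counter mode (illustration only, not secure)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def encrypt_chunk(chunk: bytes) -> tuple[bytes, bytes]:
    """Return (key, ciphertext); identical chunks yield identical ciphertext."""
    key = derive_key(chunk)
    ciphertext = bytes(a ^ b for a, b in zip(chunk, _keystream(key, len(chunk))))
    return key, ciphertext


def decrypt_chunk(key: bytes, ciphertext: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(ciphertext, _keystream(key, len(ciphertext))))
```

Because the key depends only on the plaintext, two users who store the same chunk produce the same ciphertext, so the store can keep a single copy. Each user would then protect their own list of chunk keys with a key known only to them, as the abstract describes.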
The Effectiveness of Deduplication on Virtual Machine Disk Images
Cited by 3 (1 self)
Virtualization is becoming widely deployed in servers to efficiently provide many logically separate execution environments while reducing the need for physical servers. While this approach saves physical CPU resources, it still consumes large amounts of storage because each virtual machine (VM) instance requires its own multi-gigabyte disk image. Moreover, existing systems do not support ad hoc block sharing between disk images, instead relying on techniques such as overlays to build multiple VMs from a single “base” image. Instead, we propose the use of deduplication to both reduce the total storage required for VM disk images and increase the ability of VMs to share disk blocks. To test the effectiveness of deduplication, we conducted extensive evaluations on different sets of virtual machine disk images with different chunking strategies. Our experiments found that the amount of stored data grows very slowly after the first few virtual disk images if only the locale or software configuration is changed, with the rate of compression suffering when different versions of an operating system or different operating systems are included. We also show that fixed-length chunks work well, achieving nearly the same compression rate as variable-length chunks. Finally, we show that simply identifying zero-filled blocks, even in ready-to-use virtual machine disk images available online, can provide significant savings in storage.
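Fixed-length chunking with zero-block detection, as described in this abstract, can be sketched in a few lines. The 4096-byte chunk size, the use of SHA-1 as the chunk identifier, and the function name are assumptions for illustration; the abstract does not give the parameters used in the experiments.

```python
import hashlib


def dedup_stats(images: list[bytes], chunk_size: int = 4096) -> dict:
    """Count total, zero-filled, and unique non-zero chunks across disk images."""
    zero = bytes(chunk_size)
    stored = set()  # hashes of distinct non-zero chunks that must be kept
    total = 0
    zero_chunks = 0
    for image in images:
        for off in range(0, len(image), chunk_size):
            chunk = image[off:off + chunk_size]
            total += 1
            # Zero-filled chunks need no storage at all, only a marker.
            if chunk == zero[:len(chunk)]:
                zero_chunks += 1
                continue
            stored.add(hashlib.sha1(chunk).digest())
    return {"total": total, "zero": zero_chunks, "stored": len(stored)}
```

Running this over two images that differ in only one block would show most chunks shared between them, which is the slow growth in stored data that the abstract reports for images differing only in configuration.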

