Results 1 - 10
of
28
Redundancy Elimination Within Large Collections of Files
, 2004
"... Ongoing advancements in technology lead to everincreasing storage capacities. In spite of this, optimizing storage usage can still provide rich dividends. Several techniques based on delta-encoding and duplicate block suppression have been shown to reduce storage overheads, with varying requirements ..."
Abstract
-
Cited by 45 (2 self)
- Add to MetaCart
Ongoing advancements in technology lead to everincreasing storage capacities. In spite of this, optimizing storage usage can still provide rich dividends. Several techniques based on delta-encoding and duplicate block suppression have been shown to reduce storage overheads, with varying requirements for resources such as computation and memory. We propose a new scheme for storage reduction that reduces data sizes with an effectiveness comparable to the more expensive techniques, but at a cost comparable to the faster but less effective ones. The scheme, called Redundancy Elimination at the Block Level (REBL), leverages the benefits of compression, duplicate block suppression, and delta-encoding to eliminate a broad spectrum of redundant data in a scalable and efficient manner. REBL generally encodes more compactly than compression (up to a factor of 14) and a combination of compression and duplicate suppression (up to a factor of 6.7). REBL also encodes similarly to a technique based on delta-encoding, reducing overall space significantly in one case. Furthermore, REBL uses super-fingerprints, a technique that reduces the data needed to identify similar blocks while dramatically reducing the computational requirements of matching the blocks: it turns comparisons into hash table lookups. As a result, using super-fingerprints to avoid enumerating matching data objects decreases computation in the resemblance detection phase of REBL by up to a couple orders of magnitude.
Practical, transparent operating system support for superpages
- SIGOPS Oper. Syst. Rev
, 2002
"... Most general-purpose processors provide support for memory pages of large sizes, called superpages. Superpages enable each entry in the translation lookaside buffer (TLB) to map a large physical memory region into a virtual address space. This dramatically increases TLB coverage, reduces TLB misses, ..."
Abstract
-
Cited by 32 (3 self)
- Add to MetaCart
Most general-purpose processors provide support for memory pages of large sizes, called superpages. Superpages enable each entry in the translation lookaside buffer (TLB) to map a large physical memory region into a virtual address space. This dramatically increases TLB coverage, reduces TLB misses, and promises performance improvements for many applications. However, supporting superpages poses several challenges to the operating system, in terms of superpage allocation and promotion tradeoffs, fragmentation control, etc. We analyze these issues, and propose the design of an effective superpage management system. We implement it in FreeBSD on the Alpha CPU, and evaluate it on real workloads and benchmarks. We obtain substantial performance benefits, often exceeding 30%; these benefits are sustained even under stressful workload scenarios. 1
Design, Implementation, and Evaluation of Duplicate Transfer Detection in HTTP
- In Proceedings of the First Symposium on Networked Systems Design and Implementation
, 2004
"... Organizations use Web caches to avoid transferring the same data twice over the same path. Numerous studies have shown that forward proxy caches, in practice, incur miss rates of at least 50%. Traditional Web caches rely on the reuse of responses for given URLs. Previous analyses of real-world trace ..."
Abstract
-
Cited by 29 (0 self)
- Add to MetaCart
Organizations use Web caches to avoid transferring the same data twice over the same path. Numerous studies have shown that forward proxy caches, in practice, incur miss rates of at least 50%. Traditional Web caches rely on the reuse of responses for given URLs. Previous analyses of real-world traces have revealed a complex relationship between URLs and reply payloads, and have shown that this complexity frequently causes redundant transfers to caches. For example, redundant transfers may result if a payload is aliased (accessed via different URLs), or if a resource rotates (alternates between different values) , or if HTTP's cache revalidation mechanisms are not fully exploited. We implement and evaluate a technique known in the literature as Duplicate Transfer Detection (DTD), with which a Web cache can use digests to detect and potentially eliminate all redundant payload transfers. We show how HTTP can support DTD with few or no protocol changes, and how a DTD-enabled proxy cache can interoperate with unmodified existing origin servers and browsers, thereby permitting incremental deployment. We present both simulated and experimental results that quantify the benefits of DTD.
Taper: Tiered approach for eliminating redundancy in replica synchronization
- In USENIX Conference on File and Storage Technologies
, 2005
"... We present TAPER, a scalable data replication protocol that synchronizes a large collection of data across multiple geographically distributed replica locations. TAPER can be applied to a broad range of systems, such as software distribution mirrors, content distribution networks, backup and recover ..."
Abstract
-
Cited by 19 (0 self)
- Add to MetaCart
We present TAPER, a scalable data replication protocol that synchronizes a large collection of data across multiple geographically distributed replica locations. TAPER can be applied to a broad range of systems, such as software distribution mirrors, content distribution networks, backup and recovery, and federated file systems. TA-PER is designed to be bandwidth efficient, scalable and content-based, and it does not require prior knowledge of the replica state. To achieve these properties, TA-PER provides: i) four pluggable redundancy elimination phases that balance the trade-off between bandwidth savings and computation overheads, ii) a hierarchical hash tree based directory pruning phase that quickly matches identical data from the granularity of directory trees to individual files, iii) a content-based similarity detection technique using Bloom filters to identify similar files, and iv) a combination of coarse-grained chunk matching with finer-grained block matches to achieve bandwidth efficiency. Through extensive experiments on various datasets, we observe that in comparison with rsync, a widely-used directory synchronization tool, TAPER reduces bandwidth by 15 % to 71%, performs faster matching, and scales to a larger number of replicas. 1
Cumulus: Filesystem Backup to the Cloud
"... In this paper we describe Cumulus, a system for efficiently implementing filesystem backups over the Internet. Cumulus is specifically designed under a thin cloud assumption—that the remote datacenter storing the backups does not provide any special backup services, but only provides a least-common- ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
In this paper we describe Cumulus, a system for efficiently implementing filesystem backups over the Internet. Cumulus is specifically designed under a thin cloud assumption—that the remote datacenter storing the backups does not provide any special backup services, but only provides a least-common-denominator storage interface (i.e., get and put of complete files). Cumulus aggregates data from small files for remote storage, and uses LFS-inspired segment cleaning to maintain storage efficiency. Cumulus also efficiently represents incremental changes, including edits to large files. While Cumulus can use virtually any storage service, we show that its efficiency is comparable to integrated approaches. 1
Improving Mobile Database Access over Wide-area Networks without Degrading Consistency
- In Proceedings of the 5th International Conference on Mobile Systems, Applications and Services
, 2007
"... We report on the design, implementation, and evaluation of a system called Cedar that enables mobile database access with good performance over low-bandwidth networks. This is accomplished without degrading consistency. Cedar exploits the disk storage and processing power of a mobile client to compe ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
We report on the design, implementation, and evaluation of a system called Cedar that enables mobile database access with good performance over low-bandwidth networks. This is accomplished without degrading consistency. Cedar exploits the disk storage and processing power of a mobile client to compensate for weak connectivity. Its central organizing principle is that even a stale client replica can be used to reduce data transmission volume from a database server. The reduction is achieved by using content addressable storage to discover and elide commonality between client and server results. This organizing principle allows Cedar to use an optimistic approach to solving the difficult problem of database replica control. For laptop-class clients, our experiments show that Cedar improves the throughput of read-write workloads by 39 % to as much as 224 % while reducing response time by 28 % to as much as 79%.
Extreme binning: Scalable, parallel deduplication for chunk-based file backup
- In Proceedings of the 17th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS
, 2009
"... Abstract—Data deduplication is an essential and critical component of backup systems. Essential, because it reduces storage space requirements, and critical, because the performance of the entire backup operation depends on its throughput. Traditional backup workloads consist of large data streams w ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
Abstract—Data deduplication is an essential and critical component of backup systems. Essential, because it reduces storage space requirements, and critical, because the performance of the entire backup operation depends on its throughput. Traditional backup workloads consist of large data streams with high locality, which existing deduplication techniques require to provide reasonable throughput. We present Extreme Binning, a scalable deduplication technique for non-traditional backup workloads that are made up of individual files with no locality among consecutive files in a given window of time. Due to lack of locality, existing techniques perform poorly on these workloads. Extreme Binning exploits file similarity instead of locality, and makes only one disk access for chunk lookup per file, which gives reasonable throughput. Multi-node backup systems built with Extreme Binning scale gracefully with the amount of input data; more backup nodes can be added to boost throughput. Each file is allocated using a stateless routing algorithm to only one node, allowing for maximum parallelization, and each backup node is autonomous with no dependency across nodes, making data management tasks robust with low overhead. I.
Storage Tradeoffs in a Collaborative Backup Service for Mobile Devices
- In: Proceedings of the Sixth European Dependable Computing Conference
, 2006
"... Mobile devices are increasingly relied on but are used in contexts that put them at risk of physical damage, loss or theft. We consider a fault-tolerance approach that exploits spontaneous interactions to implement a collaborative backup service. We define the constraints implied by the mobile envir ..."
Abstract
-
Cited by 5 (3 self)
- Add to MetaCart
Mobile devices are increasingly relied on but are used in contexts that put them at risk of physical damage, loss or theft. We consider a fault-tolerance approach that exploits spontaneous interactions to implement a collaborative backup service. We define the constraints implied by the mobile environment,analyze how they translate into the storage layer of such a backup system and examine various design options. The paper concludes with a presentation of our prototype implementation of the storage layer, an evaluation of the impact of several compression methods,and directions for future work. 1.
Slinky: Static linking reloaded
- In USENIX 2005 Annual Technical Conference
, 2005
"... Static linking has many advantages over dynamic linking. It is simple to understand, implement, and use. It ensures that an executable is self-contained and does not depend on a particular set of libraries during execution. As a consequence, the user executes exactly the same executable image as was ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
Static linking has many advantages over dynamic linking. It is simple to understand, implement, and use. It ensures that an executable is self-contained and does not depend on a particular set of libraries during execution. As a consequence, the user executes exactly the same executable image as was tested by the developer, diminishing the risk that the user’s environment will affect correct behavior. The major disadvantages of static linking are increases in the memory required to run an executable, network bandwidth to transfer it, and disk space to store it. In this paper we describe the Slinky system that uses digest-based sharing to combine the simplicity of static linking with the space savings of dynamic linking: although Slinky executables are completely self-contained, minimal performance and disk-space penalties are incurred if two executables use the same library. We have developed a Slinky prototype that consists of tools for adding digests to executables, a slight modification of the Linux kernel to use those digests to share code pages, and tools for transferring files between machines based on digests of their contents. Results show that our prototype has no measurable performance decrease relative to dynamic linking, a comparable memory footprint, a 20 % storage space increase, and a 34 % increase in the network bandwidth required to transfer the packages. We believe that Slinky obviates many of the justifications for dynamic linking, making static linking a superior technology for software organization and distribution. 1
Jumbo Store: Providing Efficient Incremental Upload and Versioning for a Utility Rendering Service
"... We have developed a new storage system called the Jumbo Store (JS) based on encoding directory tree snapshots as graphs called HDAGs whose nodes are small variable-length chunks of data and whose edges are hash pointers. We store or transmit each node only once and encode using landmark-based chunki ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
We have developed a new storage system called the Jumbo Store (JS) based on encoding directory tree snapshots as graphs called HDAGs whose nodes are small variable-length chunks of data and whose edges are hash pointers. We store or transmit each node only once and encode using landmark-based chunking plus some new tricks. This leads to very efficient incremental upload and storage of successive snapshots: we report compression factors over 16x for real data; a comparison shows that our incremental upload sends only 1/5 as much data as Rsync. To demonstrate the utility of the Jumbo Store, we have integrated it into HP Labs ’ prototype Utility Rendering Service (URS), which accepts rendering data in the form of directory tree snapshots from small teams of animators, renders one or more requested frames using a processor farm, and then makes the rendered frames available for download. Efficient incremental upload is crucial to the URS’s usability and responsiveness because of the teams ’ slow Internet connections. We report on the JS’s performance during a major field test of the URS where the URS was offered to 11 groups of animators for 10 months during an animation showcase to create high-quality short animations. 1

