Results 1 - 4 of 4
Estimation of Deduplication Ratios in Large Data Sets
Cited by 4 (1 self)
Abstract—We study the problem of accurately estimating the data reduction ratio achieved by deduplication and compression on a specific data set. This turns out to be a challenging task: it has been shown both empirically and analytically that essentially all of the data at hand needs to be inspected in order to come up with an accurate estimate when deduplication is involved. Moreover, even when permitted to inspect all the data, there are challenges in devising an efficient, yet accurate, method. Efficiency in this case refers to the demanding CPU, memory and disk usage associated with deduplication and compression. Our study focuses on what can be done when scanning the entire data set. We present a novel two-phase framework for such estimations. Our techniques are provably accurate, yet run with very low memory requirements and avoid the overheads associated with maintaining large deduplication tables. We give formal proofs of the correctness of our algorithm, compare it to existing techniques from the database and streaming literature, and evaluate our technique on a number of real-world workloads. For example, we estimate the data reduction ratio of a 7 TB data set with accuracy guarantees of at most a 1% relative error while using as little as 1 MB of RAM (and no additional disk access). In the interesting case of full-file deduplication, our framework readily accepts optimizations that allow estimation on a large data set without reading most of the actual data. For one of the workloads used in this work we achieved an accuracy guarantee of 2% relative error while reading only 27% of the data from disk. Our technique is practical, simple to implement, and useful for multiple scenarios, including estimating the number of disks to buy, choosing a deduplication technique, deciding whether or not to dedupe, and conducting large-scale academic studies related to deduplication ratios.
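The abstract states the guarantees but not the algorithm. As a minimal illustrative sketch only (the function names, chunk handling, and the occurrence-sampling estimator are assumptions for illustration, not the paper's actual method), a two-phase, low-memory scan can estimate the dedup ratio like this: phase one samples chunk occurrences, phase two rescans to count duplicates of only the sampled fingerprints.

```python
import hashlib
import random

def estimate_dedup_ratio(chunks, sample_size, seed=0):
    """Two-phase estimate of the dedup ratio (#distinct chunks / #chunks).

    Phase 1: draw a uniform sample of chunk *occurrences* via reservoir
    sampling, keeping only fingerprints (low memory, no dedup table).
    Phase 2: rescan and count how often each sampled fingerprint occurs.
    A chunk with c copies is sampled with probability proportional to c,
    so E[1/c] over the sample equals D/N, making the estimator unbiased.
    """
    fp = lambda c: hashlib.sha256(c).hexdigest()
    rng = random.Random(seed)

    # Phase 1: reservoir sample of occurrence fingerprints.
    reservoir, n = [], 0
    for c in chunks:
        n += 1
        if len(reservoir) < sample_size:
            reservoir.append(fp(c))
        else:
            j = rng.randrange(n)
            if j < sample_size:
                reservoir[j] = fp(c)

    # Phase 2: count occurrences of the sampled fingerprints only.
    counts = dict.fromkeys(reservoir, 0)
    for c in chunks:
        f = fp(c)
        if f in counts:
            counts[f] += 1

    return sum(1.0 / counts[f] for f in reservoir) / len(reservoir)
```

When the sample covers the whole stream the estimate is exact: nine copies of one chunk plus one unique chunk give a ratio of 2/10. Memory is bounded by the sample size regardless of data-set size, which is the spirit of the low-RAM guarantee described above.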
Distinct-value synopses for multiset operations
 Communications of the ACM
Cited by 4 (0 self)
doi:10.1145/1562764.1562787 The task of estimating the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science and elsewhere. We provide DV estimation techniques for the case in which the dataset of interest is split into partitions. We create for each partition a synopsis that can be used to estimate the number of DVs in the partition. By combining and extending a number of results in the literature, we obtain both suitable synopses and DV estimators. The synopses can be created in parallel, and can be easily combined to yield synopses and DV estimates for “compound” partitions that are created from the base partitions via arbitrary multiset union, intersection, or difference operations. Our synopses can also handle deletions of individual partition elements. We prove that our DV estimators are unbiased, provide error bounds, and show how to select synopsis sizes in order to achieve a desired estimation accuracy. Experiments and theory indicate that our synopses and estimators lead to lower computational costs and more accurate DV estimates than previous approaches.
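One well-known synopsis in this family is the k-minimum-values (KMV) sketch. The sketch below is an illustrative version, not the paper's code: each partition keeps its k smallest distinct hash values, estimation uses the unbiased form (k-1)/U_(k), and per-partition synopses combine into a union synopsis by merging and re-truncating.

```python
import hashlib

def h(x):
    """Hash a value to a (pseudo-)uniform point in [0, 1)."""
    d = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def kmv_synopsis(items, k):
    """Synopsis of one partition: the k smallest distinct hash values."""
    return sorted({h(x) for x in items})[:k]

def dv_estimate(synopsis, k):
    """DV estimate: exact when fewer than k distinct values were seen,
    otherwise the unbiased estimator (k - 1) / U_(k), where U_(k) is
    the k-th smallest hash value."""
    if len(synopsis) < k:
        return len(synopsis)
    return (k - 1) / synopsis[k - 1]

def union_synopsis(s1, s2, k):
    """Combine per-partition synopses into a synopsis of their multiset
    union: the k smallest values of the merged, deduplicated hashes."""
    return sorted(set(s1) | set(s2))[:k]
```

Because the synopses are just sets of hash values, they can be built per partition in parallel and merged later, matching the compound-partition workflow the abstract describes; intersection and difference estimators need additional bookkeeping not shown here.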
Privacy-Preserving Mobility Monitoring using Sketches of Stationary Sensor Readings
Abstract. Two fundamental tasks of mobility modeling are (1) to track the number of distinct persons that are present at a location of interest and (2) to reconstruct flows of persons between two or more different locations. Stationary sensors, such as Bluetooth scanners, have been applied to both tasks with remarkable success. However, this approach has privacy problems. For instance, Bluetooth scanners store the MAC address of a device that can in principle be linked to a single person. Unique hashing of the address only partially solves the problem because such a pseudonym is still vulnerable to various linking attacks. In this paper we propose a solution to both tasks using an extension of linear counting sketches. The idea is to map several individuals to the same position in a sketch, while at the same time the inaccuracies introduced by this overloading are compensated by using several independent sketches. This idea provides, for the first time, a general set of primitives for privacy-preserving mobility modeling from Bluetooth and similar address-based devices.
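The classic linear counting sketch the abstract builds on can be sketched in a few lines (names and parameters are assumptions for illustration). Each observed identifier sets one bit of an m-bit bitmap, so several individuals can share a position; different salts yield the independent sketches mentioned above, and only bit positions, never addresses, need to be stored.

```python
import hashlib
import math

def lc_sketch(ids, m, salt=""):
    """Linear counting sketch: an m-bit bitmap where each observed id
    sets one bit.  Several ids can map to the same bit (overloading);
    distinct salts produce independent sketches."""
    bits = [0] * m
    for x in ids:
        d = hashlib.sha256((salt + str(x)).encode()).digest()
        bits[int.from_bytes(d[:8], "big") % m] = 1
    return bits

def lc_estimate(bits):
    """Distinct-count estimate: n ~ -m * ln(z / m), z = # zero bits."""
    m = len(bits)
    z = bits.count(0)
    if z == 0:
        return float("inf")  # sketch saturated; use a larger m
    return -m * math.log(z / m)

def lc_merge(b1, b2):
    """Bitwise OR merges sketches built with the same m and salt,
    e.g. readings from two scanners or two time windows."""
    return [x | y for x, y in zip(b1, b2)]
```

Merging by OR counts a person seen at both locations only once, which is what makes flow reconstruction between locations possible from the sketches alone.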
Synopsis Diffusion of . . .
, 2008
Previous approaches for computing duplicate-sensitive aggregates in wireless sensor networks have used a tree topology, in order to conserve energy and to avoid double-counting sensor readings. However, a tree topology is not robust against node and communication failures, which are common in sensor networks. In this article, we present synopsis diffusion, a general framework for achieving significantly more accurate and reliable answers by combining energy-efficient multipath routing schemes with techniques that avoid double-counting. Synopsis diffusion avoids double-counting through the use of order- and duplicate-insensitive (ODI) synopses that compactly summarize intermediate results during in-network aggregation. We provide a surprisingly simple test that makes it easy to check the correctness of an ODI synopsis. We show that the properties of ODI synopses and synopsis diffusion create implicit acknowledgments of packet delivery. Such acknowledgments enable energy-efficient adaptation of message routes to dynamic message loss conditions, even in the presence of asymmetric links. Finally, we illustrate using extensive simulations the ...
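A tiny illustration of an ODI synopsis (a Flajolet-Martin-style COUNT-DISTINCT bitmap; the names are assumptions, and this is not the article's code): generating the synopsis is duplicate-insensitive because the same reading always sets the same bit, and fusion is bitwise OR, which is commutative, associative, and idempotent, so a packet duplicated along multiple paths cannot change the result.

```python
import hashlib

def fm_synopsis(readings, bits=32):
    """Flajolet-Martin bitmap: each reading sets the bit whose index is
    the position of the lowest set bit of its hash.  Order- and
    duplicate-insensitive: re-inserting a reading changes nothing."""
    s = 0
    for r in readings:
        h = int.from_bytes(hashlib.sha256(str(r).encode()).digest()[:8], "big")
        tz = (h & -h).bit_length() - 1 if h else bits - 1
        s |= 1 << min(tz, bits - 1)
    return s

def fuse(s1, s2):
    """Synopsis fusion is bitwise OR: commutative, associative, and
    idempotent, the properties the ODI correctness test checks."""
    return s1 | s2

def count_estimate(s):
    """Rough distinct-count estimate from the lowest zero bit (FM)."""
    r = 0
    while s >> r & 1:
        r += 1
    return (2 ** r) / 0.77351
```

The idempotence of `fuse` (fusing a synopsis with itself is a no-op) is exactly what lets multipath routing trade duplicate deliveries for robustness without inflating the aggregate; in practice many such bitmaps are averaged to tighten the estimate.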