Graph sketches: sparsification, spanners, and subgraphs
In PODS, 2012
Cited by 14 (4 self)
When processing massive data sets, a core task is to construct synopses of the data. To be useful, a synopsis data structure should be easy to construct while also yielding good approximations of the relevant properties of the data set. A particularly useful class of synopses are sketches, i.e., those based on linear projections of the data. These are applicable in many models including various parallel, stream, and compressed sensing settings. A rich body of analytic and empirical work exists for sketching numerical data such as the frequencies of a set of entities. Our work investigates graph sketching where the graphs of interest encode the relationships between these entities. The main challenge is to capture this richer structure and build the necessary synopses with only linear measurements. In this paper we consider properties of graphs including the size of the cuts, the distances between nodes, and the prevalence of
Zero-One Frequency Laws
Cited by 5 (0 self)
Data streams emerged as a critical model for multiple applications that handle vast amounts of data. One of the most influential and celebrated papers in streaming is the “AMS” paper on computing frequency moments by Alon, Matias and Szegedy. The main question left open (and explicitly asked) by AMS in 1996 is to give the precise characterization for which functions G on frequency vectors m_i (1 ≤ i ≤ n) can ∑_{i∈[n]} G(m_i) be approximated efficiently, where “efficiently” means by a single pass over data stream and polylogarithmic memory. No such characterization was known despite a tremendous amount of research on frequency-based functions in streaming literature. In this paper we finally resolve the AMS main question and give a precise characterization (in fact, a zero-one law) for all monotonically increasing functions on frequencies that are zero at the origin. That is, we consider all monotonic functions G: R → R such that G(0) = 0 and G can be computed in polylogarithmic time and space and ask, for which G in this class is there an (1±ɛ)-approximation algorithm for computing ∑_{i∈[n]} G(m_i) for any polylogarithmic ɛ? We give an algebraic characterization for all such G so that:
• For all functions G in our class that satisfy our algebraic condition, we provide a very general and constructive way to derive an efficient (1±ɛ)-approximation algorithm for computing ∑_{i∈[n]} G(m_i) with polylogarithmic memory and a single pass over data stream; while
• For all functions G in our class that do not satisfy our algebraic characterization, we show a lower bound
Mergeable Summaries
Cited by 4 (0 self)
We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a single summary on the union of the two data sets, while preserving the error and size guarantees. This property means that the summaries can be merged in a way like other algebraic operators such as sum and max, which is especially useful for computing summaries on massive distributed data. Several data summaries are trivially mergeable by construction, most notably all the sketches that are linear functions of the data sets. But some other fundamental ones, like those for heavy hitters and quantiles, are not (known to be) mergeable. In this paper, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for ε-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/ε); for ε-approximate quantiles, there is a deterministic summary of size O((1/ε) log(εn)) that has a restricted form of mergeability, and a randomized one of size O((1/ε) log^{3/2}(1/ε)) with full mergeability. We also extend our results to geometric summaries such as ε-approximations and ε-kernels. We also achieve two results of independent interest: (1) we provide the best known randomized streaming bound for ε-approximate quantiles that depends only on ε, of size O((1/ε) log^{3/2}(1/ε)), and (2) we demonstrate that the MG and the SpaceSaving summaries for heavy hitters are isomorphic. Supported by NSF under grants CNS0540347, IIS07
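To illustrate what mergeability means for heavy hitters, here is a minimal Python sketch of an MG (Misra-Gries) summary with at most k counters, together with a merge step that adds the counters and then subtracts the (k+1)-th largest count. The function names and the parameter k are illustrative assumptions rather than names from the paper, and the code is a sketch under those assumptions, not necessarily the authors' exact procedure.

```python
from collections import Counter

def mg_update(summary, item, k):
    """Process one stream item, keeping at most k counters."""
    if item in summary:
        summary[item] += 1
    elif len(summary) < k:
        summary[item] = 1
    else:
        # No free counter: decrement every counter and drop those reaching zero.
        for key in list(summary):
            summary[key] -= 1
            if summary[key] == 0:
                del summary[key]

def mg_summarize(stream, k):
    """Build an MG summary of a whole stream."""
    summary = {}
    for item in stream:
        mg_update(summary, item, k)
    return summary

def mg_merge(s1, s2, k):
    """Merge two MG summaries into one with at most k counters."""
    merged = Counter(s1)
    merged.update(s2)  # add counts item by item
    if len(merged) > k:
        # Subtract the (k+1)-th largest count and keep only positive counters.
        cutoff = sorted(merged.values(), reverse=True)[k]
        merged = {x: c - cutoff for x, c in merged.items() if c > cutoff}
    return dict(merged)
```

For example, `mg_merge(mg_summarize(a, k), mg_summarize(b, k), k)` yields a summary of the concatenated stream whose counters never overestimate the true frequencies; in this sketch the undercount should be at most n/(k+1) for a combined stream of length n.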
Pan-private algorithms via statistics on sketches
In Proceedings of the 30th Symposium on Principles of Database Systems, 2011
Cited by 3 (0 self)
Consider fully dynamic data, where we track data as it gets inserted and deleted. There are well-developed notions of private data analyses with dynamic data, for example, using differential privacy. We want to go beyond privacy, and consider privacy together with security, formulated recently as pan-privacy by Dwork et al. (ICS 2010). Informally, pan-privacy preserves differential privacy while computing desired statistics on the data, even if the internal memory of the algorithm is compromised (say, by a malicious break-in or insider curiosity or by fiat by the government or law). We study pan-private algorithms for basic analyses, like estimating distinct count, moments, and heavy hitter count, with fully dynamic data. We present the first known pan-private algorithms for these problems in the fully dynamic model. Our algorithms rely on sketching techniques popular in streaming: in some cases, we add suitable noise to a previously known sketch, using a novel approach of calibrating noise to the underlying problem structure and the projection matrix of the sketch; in other cases, we maintain certain statistics on sketches; in yet others, we define novel sketches. We also present the first known lower bounds explicitly for pan-privacy, showing our results to be nearly optimal for these problems. Our lower bounds are stronger than those implied by differential privacy or dynamic data streaming alone and hold even if unbounded memory and/or unbounded processing time are allowed. The lower bounds use a noisy decoding argument and exploit a connection between pan-private algorithms and data sanitization.
Periodicity and cyclic shifts via linear sketches
In APPROX-RANDOM, 2011
Cited by 3 (1 self)
We consider the problem of identifying periodic trends in data streams. We say a signal a ∈ R^n is p-periodic if a_i = a_{i+p} for all i ∈ [n − p]. Recently, Ergün et al. [4] presented a one-pass, O(polylog n)-space algorithm for identifying the smallest period of a signal. Their algorithm required a to be presented in the time-series model, i.e., a_i is the i-th element in the stream. We present a more general linear sketch algorithm that has the advantages of being applicable to a) the turnstile stream model, where coordinates can be incremented/decremented in an arbitrary fashion and b) the parallel or distributed setting where the signal is distributed over multiple locations/machines. We also present sketches for (1+ɛ)-approximating the ℓ_2 distance between a and the nearest p-periodic signal for a given p. Our algorithm uses O(ɛ^{-2} polylog n) space, comparing favorably to an earlier time-series result that used O(ɛ^{-5.5} √p polylog n) space for estimating the Hamming distance to the nearest p-periodic signal. Our last periodicity result is an algorithm for estimating the periodicity of a sequence in the presence of noise. We conclude with a small-space algorithm for identifying when two signals are exactly (or nearly) cyclic shifts of one another. Our algorithms are based on bilinear sketches [10] and combining Fourier transforms with stream processing techniques such as ℓ_p sampling and sketching [11, 13].
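To make the quantity being approximated concrete, the following Python sketch computes the exact ℓ_2 distance from a signal to the nearest p-periodic signal. It is an offline baseline that stores the whole signal, not the linear sketch described in the abstract. It relies on the fact that a p-periodic signal is constant on each residue class modulo p, so the closest one in ℓ_2 replaces each class by its mean. The function name is a hypothetical choice, and the code assumes 1 ≤ p ≤ len(a).

```python
def l2_distance_to_p_periodic(a, p):
    """Exact l2 distance from the list a to the nearest p-periodic signal.

    Assumes 1 <= p <= len(a), so every residue class mod p is non-empty.
    """
    total = 0.0
    for r in range(p):
        residue_class = a[r::p]  # entries a[r], a[r+p], a[r+2p], ...
        mean = sum(residue_class) / len(residue_class)
        # The best constant value for this class in l2 is its mean.
        total += sum((x - mean) ** 2 for x in residue_class)
    return total ** 0.5
```

A streaming or distributed algorithm would aim to estimate this value within a (1+ɛ) factor without storing a.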
Tight Bounds for Distributed Functional Monitoring
Cited by 3 (2 self)
We resolve several fundamental questions in the area of distributed functional monitoring, initiated by Cormode, Muthukrishnan, and Yi (SODA, 2008), and receiving recent attention. In this model there are k sites each tracking their input streams and communicating with a central coordinator. The coordinator’s task is to continuously maintain an approximate output to a function computed over the union of the k streams. The goal is to minimize the number of bits communicated. Let the pth frequency moment be defined as Fp f
Arthur-Merlin Streaming Complexity
2013
Cited by 2 (1 self)
We study the power of Arthur-Merlin probabilistic proof systems in the data stream model. We show a canonical AM streaming algorithm for a wide class of data stream problems. The algorithm offers a tradeoff between the length of the proof and the space complexity that is needed to verify it. As an application, we give an AM streaming algorithm for the Distinct Elements problem. Given a data stream of length m over an alphabet of size n, the algorithm uses Õ(s) space and a proof of size Õ(w), for every s, w such that s · w ≥ n (where Õ hides a polylog(m, n) factor). We also prove a lower bound, showing that every MA streaming algorithm for the Distinct Elements problem that uses s bits of space and a proof of size w satisfies s · w = Ω(n). As a part of the proof of the lower bound for the Distinct Elements problem, we show a new lower bound of Ω(√n) on the MA communication complexity of the Gap Hamming Distance problem, and prove its tightness.
Periodicity in Streams
In RANDOM, 2010
Cited by 1 (0 self)
In this work we study sublinear space algorithms for detecting periodicity over data streams. A sequence of length n is said to be periodic if it consists of repetitions of a block of length p for some p ≤ n/2. In the first part of this paper, we give a 1-pass randomized streaming algorithm that uses O(log^2 n) space and reports the shortest period if the given stream is periodic. At the heart of this result is a 1-pass, O(log n log m)-space streaming pattern matching algorithm. This algorithm uses similar ideas to Porat and Porat’s algorithm in FOCS 2009 but it does not need an offline preprocessing stage and is considerably simpler.
In the second part, we study distance to p-periodicity under the Hamming metric, where we estimate the minimum number of character substitutions needed to make a given sequence p-periodic. In streaming terminology, this problem can be described as computing the cascaded aggregate L_1 ∘ F_1^{res(1)} over a matrix A_{p×d} given in column ordering. For this problem, we present a randomized streaming algorithm with approximation factor 2 + ɛ that takes Õ(ɛ^{-2}) space. We also show a (1 + ɛ)-approximation randomized streaming algorithm which uses Õ(ɛ^{-5.5} p^{1/2}) space.
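As a concrete reading of the cascaded aggregate, the following Python sketch computes the Hamming distance to p-periodicity exactly and offline. It is a baseline for intuition, not the streaming algorithm from the paper. Positions that are congruent modulo p must agree in a p-periodic sequence, so each residue class costs its size minus the count of its most frequent symbol (the residual first moment of that row), and the costs are summed over the p rows. The function name is an illustrative assumption, and the code assumes 1 ≤ p ≤ len(s).

```python
from collections import Counter

def hamming_distance_to_p_periodic(s, p):
    """Minimum number of substitutions that make the sequence s p-periodic.

    Assumes 1 <= p <= len(s), so every residue class mod p is non-empty.
    """
    cost = 0
    for r in range(p):
        residue_class = s[r::p]  # one row of the p x d matrix
        most_common_count = max(Counter(residue_class).values())
        # Keep the most frequent symbol and substitute all the others.
        cost += len(residue_class) - most_common_count
    return cost
```

For instance, `hamming_distance_to_p_periodic("abcabdabc", 3)` counts the single mismatched `d` in the third residue class.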
Homomorphic Fingerprints under Misalignments: Sketching Edit and Shift Distances
Cited by 1 (1 self)
Fingerprinting is a widely used technique for efficiently verifying that two files are identical. More generally, linear sketching is a form of lossy compression (based on random projections) that also enables the “dissimilarity” of non-identical files to be estimated. Many sketches have been proposed for dissimilarity measures that decompose coordinate-wise such as the Hamming distance between alphanumeric strings, or the Euclidean distance between vectors. However, virtually nothing is known on sketches that would accommodate alignment errors. With such errors, Hamming or Euclidean distances are rendered useless: a small misalignment may result in a file that looks very dissimilar to the original file according to such measures. In this paper, we present the first linear sketch that is robust to a small number of alignment errors. Specifically, the sketch can be used to determine whether two files are within a small Hamming distance of being a cyclic shift of each other. Furthermore, the sketch is homomorphic with respect to rotations: it is possible to construct the sketch of a cyclic shift of a file given only the sketch of the original file. The relevant dissimilarity measure, known as the shift distance, arises in the context of embedding edit distance, and our result addresses an open problem [26, Question 13] with a rather surprising outcome. Our sketch projects a length n file into D(n) · polylog n dimensions where D(n) ≪ n is the number of divisors of n. The striking fact is that this is near-optimal, i.e., the D(n) dependence is inherent to a problem that is ostensibly about lossy compression. In contrast, we then show that any sketch for estimating the edit distance between two files, even when small, requires sketches whose size is nearly linear in n. This lower bound addresses a longstanding open problem on the low distor
Sketching and Streaming High-Dimensional Vectors
2011
Cited by 1 (0 self)
A sketch of a dataset is a small-space data structure supporting some pre-specified set of queries (and possibly updates) while consuming space substantially sublinear in the space required to actually store all the data. Furthermore, it is often desirable, or required by the application, that the sketch itself be computable by a small-space algorithm given just one pass over the data, a so-called streaming algorithm. Sketching and streaming have found numerous applications in network traffic monitoring, data mining, trend detection, sensor networks, and databases. In this thesis, I describe several new contributions in the area of sketching and streaming algorithms.
• The first space-optimal streaming algorithm for the distinct elements problem. Our algorithm also achieves O(1) update and reporting times.
• A streaming algorithm for Hamming norm estimation in the turnstile model which achieves the best known space complexity.