Results 1 -
9 of
9
Zero-One Frequency Laws
"... Data streams emerged as a critical model for multiple applications that handle vast amounts of data. One of the most influential and celebrated papers in streaming is the “AMS ” paper on computing frequency moments by Alon, Matias and Szegedy. The main question left open (and explicitly asked) by AM ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
Data streams emerged as a critical model for multiple applications that handle vast amounts of data. One of the most influential and celebrated papers in streaming is the “AMS ” paper on computing frequency moments by Alon, Matias and Szegedy. The main question left open (and explicitly asked) by AMS in 1996 is to give the precise characterization for which functions G on frequency vectors mi (1 ≤ i ≤ n) can ∑ i∈[n] G(mi) be approximated efficiently, where “efficiently ” means by a single pass over data stream and poly-logarithmic memory. No such characterization was known despite a tremendous amount of research on frequency-based functions in streaming literature. In this paper we finally resolve the AMS main question and give a precise characterization (in fact, a zero-one law) for all monotonically increasing functions on frequencies that are zero at the origin. That is, we consider all monotonic functions G: R ↦ → R such that G(0) = 0 and G can be computed in poly-logarithmic time and space and ask, for which G in this class is there an (1±ɛ)-approximation algorithm for computing ∑ i∈[n] G(mi) for any polylogarithmic ɛ? We give an algebraic characterization for all such G so that: • For all functions G in our class that satisfy our algebraic condition, we provide a very general and constructive way to derive an efficient (1±ɛ)-approximation algorithm for computing ∑ i∈[n] G(mi) with polylogarithmic memory and a single pass over data stream; while • For all functions G in our class that do not satisfy our algebraic characterization, we show a lower bound
Mergeable Summaries
"... We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a single summary on the union of the two data sets, while preserving the error and size guarantees. This property means t ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a single summary on the union of the two data sets, while preserving the error and size guarantees. This property means that the summaries can be merged in a way like other algebraic operators such as sum and max, which is especially useful for computing summaries on massive distributed data. Several data summaries are trivially mergeable by construction, most notably all the sketches that are linear functions of the data sets. But some other fundamental ones like those for heavy hitters and quantiles, are not (known to be) mergeable. In this paper, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for ε-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/ε); for ε-approximate quantiles, there is a deterministic summary of size O ( 1 log(εn)) that has a restricted form of mergeability, ε and a randomized one of size O ( 1 1 log3/2) with full merge-ε ε ability. We also extend our results to geometric summaries such as ε-approximations and ε-kernels. We also achieve two results of independent interest: (1) we provide the best known randomized streaming bound for ε-approximate quantiles that depends only on ε, of size O ( 1 1 log3/2), and (2) we demonstrate that the MG and the ε ε SpaceSaving summaries for heavy hitters are isomorphic. Supported by NSF under grants CNS-05-40347, IIS-07-
Periodicity in Streams
, 2010
"... In this work we study sublinear space algorithms for detecting periodicity over data streams. A sequence of length n is said to be periodic if it consists of repetitions of a block of length p for some p ≤ n 2. In the first part of this paper, we give a 1-pass randomized streaming algorithm that use ..."
Abstract
- Add to MetaCart
In this work we study sublinear space algorithms for detecting periodicity over data streams. A sequence of length n is said to be periodic if it consists of repetitions of a block of length p for some p ≤ n 2. In the first part of this paper, we give a 1-pass randomized streaming algorithm that uses O(log 2 n) space and reports the shortest period if the given stream is periodic. At the heart of this result is a 1-pass O(log n log m) space streaming pattern matching algorithm. This algorithm uses similar ideas to Porat and Porat’s algorithm in FOCS 2009 but it does not need an offline pre-processing stage and is considerably simpler. In the second part, we study distance to p-periodicity under the Hamming metric, where we estimate the minimum number of character substitutions needed to make a given sequence p-periodic. In streaming terminology, this problem can be described as computing the cascaded aggregate L1 ◦ F res(1) 1 over a matrix Ap×d given in column ordering. For this problem, we present a randomized streaming algorithm with approximation factor 2 + ɛ that takes 1
Datacubes and Query Optimizations Query optimization can be decomposed
, 2011
"... Abstract. The view size estimation plays an important role in query optimization. It has been observed that many data follow a power law distribution. In this paper, we consider the balls in bins problem where we place balls into N bins when the bin selection probabilities follow a power law distrib ..."
Abstract
- Add to MetaCart
Abstract. The view size estimation plays an important role in query optimization. It has been observed that many data follow a power law distribution. In this paper, we consider the balls in bins problem where we place balls into N bins when the bin selection probabilities follow a power law distribution. As a generalization to the coupon collector’s problem, we address the problem of determining the expected number of balls that need to be thrown in order to have at least one ball in each of the N bins. We prove that Θ ( Nα lnN c α N) balls are needed to achieve this where αistheparameterofthepowerlawdistributionandc α N = α−1 α−Nα−1 for α ̸ = 1 and c α N = 1 for α = 1. Next, when fixing the number of balls lnN that are thrown to T, we provide closed form upper and lower bounds on the expected number of bins that have at least one occupant. For n
Nonnumerical Algorithms and Problems—computations
"... A simple indicator for an anomaly in a network is a rapid increase in the total number of distinct network connections. While it is fairly easy to maintain an accurate estimate of the current total number of distinct connections using streaming algorithms that exhibit both a low space and computatio ..."
Abstract
- Add to MetaCart
A simple indicator for an anomaly in a network is a rapid increase in the total number of distinct network connections. While it is fairly easy to maintain an accurate estimate of the current total number of distinct connections using streaming algorithms that exhibit both a low space and computational complexity, identifying the network entities that are involved in the largest number of distinct connections efficiently is considerably harder. In this paper, we study the problem of finding all entities whose number of distinct (outgoing or incoming) network connections is at least a specific fraction of the total number of distinct connections. These entities are referred to as heavy distinct hitters. Since this problem is hard in general, we focus on randomized approximation techniques and propose a sampling-based and a sketchbased streaming algorithm. Both algorithms output a list of the potential heavy distinct hitters including the estimated counts of the corresponding number of distinct connections. We prove that, depending on the required level of accuracy of the output list, the space complexities of the presented algorithms are asymptotically optimal up to small logarithmic factors. Additionally, the algorithms are evaluated and compared using real network data in order to determine their usefulness in practice.
Sketching and Streaming High-Dimensional Vectors
, 2011
"... A sketch of a dataset is a small-space data structure supporting some prespecified set of queries (and possibly updates) while consuming space substantially sublinear in the space required to actually store all the data. Furthermore, it is often desirable, or required by the application, that the sk ..."
Abstract
- Add to MetaCart
A sketch of a dataset is a small-space data structure supporting some prespecified set of queries (and possibly updates) while consuming space substantially sublinear in the space required to actually store all the data. Furthermore, it is often desirable, or required by the application, that the sketch itself be computable by a small-space algorithm given just one pass over the data, a so-called streaming algorithm. Sketching and streaming have found numerous applications in network traffic monitoring, data mining, trend detection, sensor networks, and databases. In this thesis, I describe several new contributions in the area of sketching and streaming algorithms. • The first space-optimal streaming algorithm for the distinct elements problem. Our algorithm also achieves O(1) update and reporting times. • A streaming algorithm for Hamming norm estimation in the turnstile model which achieves the best known space complexity.
Tight Bounds for Distributed Functional Monitoring
"... We resolve several fundamental questions in the area of distributed functional monitoring, initiated by Cormode, Muthukrishnan, and Yi (SODA, 2008), and receiving recent attention. In this model there are k sites each tracking their input streams and communicating with a central coordinator. The coo ..."
Abstract
- Add to MetaCart
We resolve several fundamental questions in the area of distributed functional monitoring, initiated by Cormode, Muthukrishnan, and Yi (SODA, 2008), and receiving recent attention. In this model there are k sites each tracking their input streams and communicating with a central coordinator. The coordinator’s task is to continuously maintain an approximate output to a function computed over the union of the k streams. The goal is to minimize the number of bits communicated. Let the p-th frequency moment be defined as Fp f
Mergeable Coresets
, 2011
"... We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a summary on the two data sets combined together, while preserving the error and size guarantees. This property means tha ..."
Abstract
- Add to MetaCart
We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a summary on the two data sets combined together, while preserving the error and size guarantees. This property means that the summary can be treated like other algebraic objects such as sum and max, which is especially useful for computing summaries on massive distributed data. Many data summaries are trivially mergeable by construction, most notably those based on linear transformations. But some other fundamental ones like those for heavy hitters and quantiles, are not (known to be) mergeable. In this paper, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for ε-approximate heavy hitters, there is a deterministic mergeable summary log(εn)) that has a restricted form of mergeability, and a randomized one of size O ( 1 ε ε) with full mergeability. We also extend our results to geometric summaries such as ε-approximations and ε-kernels. of size O(1/ε); for ε-approximate quantiles, there is a deterministic summary of size O ( 1 ε 1
Consistent Subset Sampling ∗
"... Consistent sampling is a technique for specifying, in small space, a subset S of a potentially large universe U such that the elements in S satisfy a suitably chosen sampling condition. Given a subset I ⊆ U it should be possible to quickly compute I ∩ S, i.e., the elements in I satisfying the sampli ..."
Abstract
- Add to MetaCart
Consistent sampling is a technique for specifying, in small space, a subset S of a potentially large universe U such that the elements in S satisfy a suitably chosen sampling condition. Given a subset I ⊆ U it should be possible to quickly compute I ∩ S, i.e., the elements in I satisfying the sampling condition. Consistent sampling has important applications in similarity estimation, and estimation of the number of distinct items in a data stream. In this paper we generalize consistent sampling to the setting where we are interested in sampling size-k subsets occurring in some set in a collection of sets of bounded size b, where k is a small integer. This can be done by applying standard consistent sampling to the k-subsets of each set, but that approach requires time Θ(b k). Using a carefully designed hash function, for a given sampling probability p ∈ (0, 1], we show how to improve the time complexity to Θ(b ⌈k/2 ⌉ log log b + pb k) in expectation, while maintaining strong concentration bounds for the sample. The space usage of our method is Θ(b ⌈k/4 ⌉). We demonstrate the utility of our technique by applying it to several well-studied data mining problems. We show how to efficiently estimate the number of frequent k-itemsets in a stream of transactions and the number of bipartite cliques in a graph given as incidence stream. Further, building upon a recent work by Campagna et al., we show that our approach can be applied to frequent itemset mining in a parallel or distributed setting. 1

